I’ve been talking for the better part of the last decade about one of the more serious problems looming for HPC, and frankly for all computing. Call it a data deluge or exponential data growth, whatever you would like. At the end of the day it means that you have more data than before, and it is growing faster than you think. Usually much faster than Moore’s law which gives you an order of magnitude about every 6 years.
This issue was highlighted for us in some recent conversations, and the acquisition of a JackRabbit by a world renown genome sequencing center in the midwest US, specifically to handle data ingress from sequencing units. Getting 1 TB of new data per week places some very strict lower limits on performance. Not just disk performance.
The issue is in part storage of this data. This is what storage vendors want you to think. But it is more complex than this. You don’t store things you can’t use. And if you use it, you need to read it, and likely, move it. This is where the problem is. More about this in a moment.
Lets do some simple math at this point.
To move 1 GB over a 100 MB/s network (gigabit) takes 10 seconds, assuming you get 100 % of this network to your data transfer. You rarely get the full speed of the net. Lets see some additional time-to-move issues.
To move 10 GB over this same net, you need 100 seconds. 100 GB would take 1,000 seconds (about 28% of an hour). 1 TB would take 10,000 seconds or about 2.8 hours. 10 TB would take 100,000 seconds, or about 1.2 days. 100 TB would take 11.6 days, and 1 PB would take 116 days.
This is to read the data set.
Ok, lets bump up the network speed by an order of magnitude. Lets use a nice Infiniband or 10 GbE net. Drop all those times above by a factor of 10. To read 1 PB of data would still take 11.6 days.
Now remember that each drive can read/write about 70 MB/s at todays rates. 1TB can be had with ~2 drives, providing at maximum 140 MB/s. 10 TB would take 14 drives, providing at maximum 980 MB/s read speed. Well, this assumes RAID0. No protection. Lets put it in a RAID6. Would need 16 drives (2 parity). Assuming you get get your reads into a streaming mode, you can still hit 980 MB/s. The problem will be efficiency of the RAID calculation. The problem is that most machines you can construct will have limits around 0.8 GB/s for PCI-x (ala thumper), or 0.5*86%*Number_of_PCIe_lanes (ala JackRabbit) . So you are going to run out of IO on a per-node basis as soon as you run out of PCI-x busses, unless you oversubscribe, which is a bad thing to do with PCI-x busses. Or you are going to run out of IO on a per node basis once you consume all your PCI-e lanes. Which means that you will hit a limit per node on moving data onto or off of disk.
Never mind getting it out of the node. You will be limited again by your per-node IO. And you are sharing this with your disk. Assume a 50-50 split, so half your IO bandwidth can get stuff off disk, and half can push it out of the box. Current systems give you about 44 lanes of PCI-e total. This is 22 GB/s or 11 GB/s in each direction. You get about 86% of that due to protocol overhead. Thats 9.5 GB/s.
Of course this is highly optimistic.
Per socket, you get about 7-9 GB/s of bandwidth on Opteron machines. You get a fixed 7 GB/s on Intel machines. Which means, that your 50-50 split is really sharing the bandwidth to and from the memory for DMA or memory mapped IO. For file or disk block system interactions, the data has to transit to memory before going out over the net.
Now lest you think that this paints Opteron machines in a great light relative to Intel machines, please note that most of the designs of Opteron motherboards push all the IO operations to a single CPU socket. This is unfortunate, as you really want to have PCI-e lanes on both sockets. These were design choices and cost choices for the motherboards. The only folks who will really notice are the ones who want and need really high performance IO.
You will get about 4.5 GB/s to disk and about that maximum to network. This suggests that in current processor designs, you will top out with a 2 port DDR Infiniband card, for IO in and out of the unit, and your raid cards interior to the unit. This is also assuming perfect functioning at maximum possible data rates. It is naive at best to assume this perfect efficiency. Won’t happen, not even close.
Ok, I have drilled in deep to a node right now. Going to pull back out to the bigger picture now, in steps.
In aggregate, the fastest IO is always local to the node via a block device. The issue is sharing metadata, insuring atomic metadata changes, …. Metadata winds up being the performance limiting factor for distributed filesystems.
So 1000 nodes, each with a 70 MB/s drive, can read in aggregate, 70 GB/s. Cool.
Uh… but how do we a) tell all these 1000 nodes which thing to read, b) recover the data we need, and then c) transport it to the node requesting it
Point “a” is hard. There are some cool technologies being developed to make this saner, but it is still a hard thing (meta-data shuttling). Point “b” is easy, exists today. Point “c” is the killer.
You have to move the data at some point.
You have to store it. And you have to read it and move it.
That 1 TB read from disks. If your disks can provide the 1+ GB/s read of JackRabbit, then you can read that 1 TB in 1000 seconds or so. About 17 minutes. You can write it about that quickly too. If you have a good Infiniband network in place, you should be able to move that data in about that period of time as well.
1TB/week is not hard. 1TB/day isn’t hard.
For a single sequencer, with one server.
If you have the right system, 1 TB/hour isn’t hard, though inefficiencies will cut it close.
10 sequencers will give you 10 TB/week. 1.4 TB/day. 100 sequencers will give you 100TB/week, 14.3 TB/day, or 0.6 TB/hour.
This is still easily within the capacity and performance of JackRabbit. 0.6 TB/hour is 33% of what it can handle, thought this assumes a 1 GB/s network connection, Infiniband or 10GbE.
The problem is if you want to store this long term for processing. And retrieve it. 1 TB/week is 52 TB/year. 10 TB/week is 520 TB/year, or 0.5 PB/year. 100 TB/week is 5200 TB/year, or 5 PB/year.
So how are you going to manage this data? How are you going to retrieve it? Analyze it? Storage is but one element of this. As the storage requirements get larger and larger, I argue that the interface should get simpler and simpler.
Can we represent 1 PB of storage as just a simple single namespace file system? Yes.
Can you do this reliably? Yes.
With high performance? Yes.
Complexity should be eschewed in these scenarios. The more moving parts you have the more likely that something will break. The more interactions, the more likely that complex emergent behaviors will rear their heads. Just as important, you need a good design to be able to sink and source this data. If you don’t it doesn’t matter how many spindles you have, if you can’t pull the data off and push it out over the net quickly enough. As data set and storage requirements grow, these problems are only going to be exacerbated. Which means you have to make the management and interaction simpler if possible.
The data is coming. And it is going to ingress fast and hard. We have had multiple conversations with customers in the last month where performance density is as important as overall storage density, and price performance. 1PB is great, but if you are stuck touching it at 100 MB/s, you are going to be hurting.
1PB is only about 1333 drives, or 28 JackRabbits. This is less than 4 racks full.
Note: with an AFR of 0.73 or so, 1333 drives should see about 10 drives fail per year. With RAID6 and a few hot spares per unit, you shouldn’t have to worry about file systems being lost. The idea is keep it simple. RAIN is effectively layered RAID. RAID6 with hot spares in a box. RAID across boxes atop of the RAID in the box. Not only no single points of failure, but no single node should be able to bring down the entire system. You can’t think of large storage systems without considering what happens if one element goes away. If the answer is “you are in trouble” then we need to rethink using that storage system. Multiple layers of redundancy are needed.