HPC in the first decade of a new millennium: a perspective, part 7

Storage changes
In the beginning of the millennium, Fibre Channel ruled the roost. Nothing could touch it. SATA and SAS were a ways away. SCSI was used in smaller storage systems. Networked storage meant a large central server with ports. SANs were on the rise.
In HPC you have to move lots of data. Huge amounts of data. Performance bottlenecks are no fun.
FC is a slow technology. It is designed to connect as many disks as you can together for SAN architecture. It is not designed specifically for HPC, to move data as fast as possible. Yeah, I know, there are a few spit-takes from folks who think FC is fast.
Put it this way. FC4 is a 4 Gb/s network protocol. That’s about 500 MB/s. FC8 is an 8 Gb/s network protocol. That’s about 1000 MB/s.
10GbE is a 10 Gb/s network protocol, or about 1200 MB/s. 40 Gb/s QDR weighs in at a hefty 3200 MB/s.
As I said, FC4 and FC8 aren’t fast.

Especially for large amounts of data.
Look at it this way. 1TB is 1000GB. At 1 GB/s (FC8) this is 1000 s (about 17 minutes) to walk through this data. Once. At 3.2 GB/s (QDR), this is ~300 seconds (5 minutes) to walk through the data. Further, the way SANs are constructed with FC, they aggregate as much storage as possible behind a single loop. So you keep growing your capacity, without growing your bandwidth.
On the other hand, with something like our siCluster, each additional node can add up to 4 QDR links. So take 10 nodes with a local 3 GB/s data rate to disk, and hand them a network that they can push data out or pull data in from at 3.2 GB/s. You can read that 1TB in 30-ish seconds.
This of course assumes that the QDR actually operates at QDR speed (we are dealing with things like this at a cluster site now), but the gist is that as long as the technology is working, or can easily be made to work, you can grow your bandwidth with your capacity. Which you must do in HPC storage, lest you make your data effectively inaccessible due to insufficient bandwidth to reach it.
In the decade, non-optimal (this is being kind) storage architectures were used in all manner of clusters. IO performance is a very important aspect of HPC systems, one often (sadly) overlooked until it causes significant pain. Specifically, we have seen large multi-terabyte storage systems served off of a single gigabit port. This is wrong on many levels.
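The arithmetic behind these numbers is just size divided by sustained bandwidth. Here is a minimal sketch of it, using the approximate link rates quoted above (round figures from the text, not measurements) and assuming ideal scaling when nodes are added. The function names are made up for illustration.

```python
# Time to stream a data set once at a given sustained bandwidth.
# Rates are the approximate figures quoted in the text, not measurements.
LINK_RATES_GB_PER_S = {
    "FC4": 0.5,
    "FC8": 1.0,
    "10GbE": 1.2,
    "QDR": 3.2,
}


def read_time_seconds(size_gb: float, rate_gb_per_s: float) -> float:
    """Seconds needed to read size_gb once at rate_gb_per_s."""
    return size_gb / rate_gb_per_s


def aggregate_rate(nodes: int, per_node_gb_per_s: float) -> float:
    """Aggregate bandwidth if every node adds its own links (ideal scaling assumed)."""
    return nodes * per_node_gb_per_s


if __name__ == "__main__":
    one_tb_in_gb = 1000.0
    for name, rate in LINK_RATES_GB_PER_S.items():
        t = read_time_seconds(one_tb_in_gb, rate)
        print(f"{name:6s} {rate:4.1f} GB/s -> {t:7.1f} s ({t / 60:.1f} min)")
    # 10 storage nodes, each able to move about 3 GB/s, as in the example above.
    cluster_rate = aggregate_rate(10, 3.0)
    t = read_time_seconds(one_tb_in_gb, cluster_rate)
    print(f"10 nodes at 3 GB/s each -> {t:.1f} s")
```

Real systems will fall short of ideal scaling, so treat the 10 node figure as an upper bound rather than a prediction.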
During the decade, several technological shifts occurred. Sun designed a moderately good file system in ZFS, and then, as with their failed compute cloud offering, proceeded to lock it up into Solaris. They went as far as to speciously claim it was open source, as it was covered by the CDDL or some such license, rendering it incompatible with the other open source OS products of note, specifically one that Sun had a very schizophrenic relationship with. It’s sort of like claiming white chocolate is chocolate. Yeah, they both have chocolate in the name. That’s about as far as it goes. Same with the open source claims here.
This effectively relegated ZFS to be constrained to the rapidly declining Solaris platform. ZFS has some nice features, but the baggage?
Nexenta looked to solve that problem by wrapping the Debian userspace tools around the Solaris kernel. It’s a partial step, but not what is fully needed. Nexenta now offers an appliance OS so that you can use their tools to create large centralized storage. It is competitive with OpenFiler, Open-E, FalconStor and many others.
But the lack of this sort of file system for Linux led Chris Mason to start work on BTRFS. It’s a long story, and again, Sun ideologues do attack this as not being on par with ZFS. They are sadly mistaken. It is a significant design improvement over ZFS … everything is a btree+ … well, look online for documents, and specifically Val Henson’s discussion of the two. I’ll link to this from LWN shortly.
But this isn’t the only interesting HPC filesystem … not that BTRFS is HPC specific, but it has great utility in HPC. NILFS2 is a remarkable technology … a log structured file system, designed to make snapshots and other operations trivial and fast. BTRFS also makes snapshots trivial and fast. NILFS2 is potentially one of the best SSD file systems out there …
… which provides a segue into talking about the physical storage devices. Physical storage has been defined for decades by magnetic disk. Small electromechanical heads flying above polished, spinning rust. Writing bits meant magnetic operations … put a current through a loop to write a bit one way or the other. Reading bits meant flying the head above the surface and watching the corresponding signal pattern. Some newer technology in here, Magneto-Resistance. Giant Magneto-Resistance (I attended some colloquia on that in 1991 or 1992 as I remember). But the basic technology involves moving parts, which induces latency and reduces bandwidth.
What if we could take the moving parts out of the equation? Seriously, what if we could make all this electronic? So that seeks … moving heads to a specific location, requiring on average 1/2 of a disk spin in addition to the head motion time, were effectively eliminated? Moreover, what if the rate limitation for reading data had less to do with the rate of spin, and more to do with the flow of current/signal?
Such is flash memory. And SSDs (Solid State Disk) as an instance of flash memory. No moving parts. Head crashes are things of the past.
Here we invoke Landman’s rule of problem solving: You never solve a problem. You only exchange one problem for another. You try to exchange for problems you can deal with addressing.
Disks dying are a thing of the past … well … sort of. Now we have to deal with write caches, wear leveling, yadda yadda yadda.
SSDs don’t work so well with their write cache off. Which makes many in the database world ask why they should use them. I think this is a technological issue that will be solved.
Bigger picture, will SSDs eventually displace spinning rust? Or will some new technology displace spinning rust?
I do think spinning rust has an expiration date on it. This said, I’ve been notoriously wrong about silicon technology (hey, I researched GaAs, arguably a competitive material, for my thesis) and its longevity. We will talk more about that in prognostication posts.
I have a friend who, right before he left grad school in 1990, wrote all his directories onto a Vax tape. His reasoning was that Vax tape would be around forever.
Funny, I don’t see so many Vax tape readers/writers these days … or Vax machines to connect them to. This is not to poke him about this decision, but to point out that permanent storage resources are, sadly, ephemeral. They are ironically impermanent. “Write to tapes,” the tape vendors cry, “best cost per GB.” While this may be true in a narrow analysis, I’ve collected enough anecdotes from people whose drives failed several years into their 20+ year tape life. They were effectively unable to read their tapes, and they had no recourse for getting more drives: the vendor either no longer sold/serviced them, or went out of business, or …
Permanent storage means mobile storage. You have to be able to move your data. That Vax tape is pretty impermanent. Yet if he took the Vax tape, and spun it into a new thing, say a compressed tar archive, he could have added another 10-15 years onto his data storage life.
I have a bunch of 1.4MB floppy disks, from my research days. Even some 360k and 1.2MB floppies.
That I can’t read. Well, not entirely true, I have the reading hardware, but the medium was never designed to be permanent. Nor was the data.
So a fair amount of this is lost to me.
SSDs can be permanent, though no one has been able to perform real aging studies on them … as mechanical parts wear out in a different way than electrical parts. So we don’t really know how long SSDs and flash in particular, could last. Spinning rust, we have an idea.
Once we figure this out, this could be an amazing technology. All we have to do is increase the density a bit.
SSDs are game changing to a number of companies. They offer some interesting and amazing capabilities. They allow us to rethink some designs for IO … not having to choose between IOPS and bandwidth.
But they come with a set of their own problems. They aren’t a panacea … a solution to all problems. Spinning rust has an expiration date. But it is a ways out.
HPC storage has moved from NAS type NFS devices to SANs and is now rapidly transitioning to storage clusters. The argument is similar to computing clusters: lower the cost per GB stored, without negatively impacting the access time. That is, don’t increase the height of the storage wall … the size of the storage medium divided by the bandwidth to access it.
Also, as density of spinning rust has increased, the statistical error rate, the rate at which an “UnCorrectable Error” or UCE occurs, has not decreased as rapidly as the density has increased. Which has led to a finite probability of observing a UCE during a RAID5 rebuild on common sized SATA platforms. Work the numbers with me here, along with some estimates to keep it simple.
1 UCE in 10^14 bits read. Roughly 1 UCE in 10^13 bytes read. 1 TB is 10^12 bytes. Build an array of 10x 1TB drives. That’s 10^13 bytes. A RAID5 rebuild now has a fairly good chance of hitting a UCE.
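The storage wall is simple enough to put in a few lines. The sketch below compares a SAN-style design, where capacity grows behind one fixed pipe, with a cluster-style design, where each node brings its own bandwidth. The 1 GB/s loop, the 10 TB per node and the 3 GB/s per node are illustrative assumptions, not vendor figures.

```python
import math


def storage_wall_seconds(capacity_tb: float, bandwidth_gb_per_s: float) -> float:
    """Height of the storage wall: time to read the whole store once."""
    return capacity_tb * 1000.0 / bandwidth_gb_per_s


def san_wall(capacity_tb: float, loop_gb_per_s: float = 1.0) -> float:
    """Capacity grows behind a single loop, so bandwidth stays fixed (assumed 1 GB/s)."""
    return storage_wall_seconds(capacity_tb, loop_gb_per_s)


def cluster_wall(capacity_tb: float,
                 tb_per_node: float = 10.0,
                 gb_per_s_per_node: float = 3.0) -> float:
    """Each added node brings capacity and bandwidth together (ideal scaling assumed)."""
    nodes = max(1, math.ceil(capacity_tb / tb_per_node))
    return storage_wall_seconds(capacity_tb, nodes * gb_per_s_per_node)


if __name__ == "__main__":
    for tb in (10, 50, 100, 500):
        print(f"{tb:4d} TB  SAN wall {san_wall(tb) / 3600:7.1f} h   "
              f"cluster wall {cluster_wall(tb) / 3600:5.2f} h")
```

Under these assumptions the SAN wall grows linearly with capacity while the cluster wall stays flat, which is the whole argument for storage clusters.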
Whoops. A UCE during a rebuild, i.e. the second correlated failure after the first failure, would take out a RAID5. With all your data.
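To put a rough number on "fairly good chance", here is a small sketch. It assumes UCEs are independent and occur at a uniform rate of 1 per 10^14 bits read, and that a RAID5 rebuild has to read every surviving drive in full. Real drives and controllers will deviate from this, so treat it as an estimate only.

```python
import math


def p_uce_during_rebuild(drives: int,
                         drive_bytes: float,
                         bits_per_uce: float = 1e14) -> float:
    """Probability of at least one UCE while reading all surviving drives.

    Assumes independent errors at a uniform rate (Poisson approximation).
    """
    surviving = drives - 1                  # one drive has already failed
    bits_read = surviving * drive_bytes * 8
    expected_uces = bits_read / bits_per_uce
    # 1 - exp(-x), computed via expm1 for accuracy at small x
    return -math.expm1(-expected_uces)


if __name__ == "__main__":
    p = p_uce_during_rebuild(drives=10, drive_bytes=1e12)
    print(f"10 x 1TB RAID5 rebuild: P(at least one UCE) ~ {p:.2f}")
```

With these assumptions the 10x 1TB example comes out at roughly even odds of hitting a UCE during the rebuild.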
Which gets to another problem. RAID.
RAID has a particular state model. Failed, rebuilding, normal. During rebuilds, it has to recompute parity on blocks of data. Most older implementations of RAID will compute this on all blocks on a device. Newer RAID technologies will only do these computations when they are necessary. While this is nice, this also is a band-aid over this particular wound.
Why? Because large RAID arrays have a huge bandwidth wall you have to overcome to read your data. Even if you use 10% of a huge storage system, this is usually still a very large amount of data. The time spent rebuilding the RAID could still be enormous. For example, suppose I put 1TB of data on my 10TB array. Further suppose I can read this data at 100MB/s or so (standard low end RAID card). 1GB takes 10 seconds, 1TB takes 10,000 seconds. About an eighth of a day.
So to rebuild my RAID, I could see rebuild times on the time scale of a day.
Sure enough, we do. Which means, during this time, your data is at enhanced risk.
And this means, with the rapidly growing data sets in HPC, that we have to worry about second correlated failures.
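The rebuild-time estimate above can be sketched as follows, comparing an older implementation that walks every block with a newer one that only touches blocks in use. The function name and the 100 MB/s rebuild rate are illustrative assumptions taken from the example, not a model of any particular RAID card.

```python
def rebuild_seconds(array_tb: float,
                    used_fraction: float,
                    rate_mb_per_s: float,
                    rebuild_only_used: bool) -> float:
    """Estimated rebuild time in seconds at a constant sustained rate."""
    tb_to_process = array_tb * used_fraction if rebuild_only_used else array_tb
    return tb_to_process * 1e6 / rate_mb_per_s   # 1 TB = 1e6 MB


if __name__ == "__main__":
    for only_used in (True, False):
        t = rebuild_seconds(array_tb=10.0, used_fraction=0.1,
                            rate_mb_per_s=100.0, rebuild_only_used=only_used)
        label = "used blocks only" if only_used else "every block"
        print(f"{label:17s}: {t:9.0f} s = {t / 86400:.2f} days")
```

Even in the friendlier case the array spends hours in a degraded state, and the window grows with how full the array is.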
If you have 1PB of data … or more, what is the best approach to backup and recovery? Remember that bandwidth wall? It’s in place here as well. Tape is a serial medium. It takes time to read 1TB, never mind 1PB. Replication is how Google and others handle this. Take the massive data set, distribute it widely. If you lose one set, or multiple sets, you can repopulate from other copies.
This is of course, quite expensive. Throw in SSDs and it gets even more so.
Of course, in the storage industry, there is a strong push toward “dedup” (deduplication) and tiering. The idea being use the fast (and expensive) disks for the frequently accessed data, move the less frequently accessed data off to “slower” and more cost effective tiers. Deduplication on the other hand seeks to create an effective run-length compression, replacing blocks of some size with a key that points to this block. This works great for presentations, desktop files, etc. Doesn’t work so well for (nearly random) HPC data files.
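A toy illustration of why dedup helps desktop-style data and does little for HPC data: the sketch below splits a buffer into fixed-size blocks, keys each block by its SHA-256 hash, and reports logical blocks divided by unique blocks. The fixed 4096 byte block size and the choice of hash are assumptions for the example; real products may chunk and index quite differently.

```python
import hashlib
import os


def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    """Logical block count divided by unique block count (1.0 means no savings)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0
    # Each unique block is stored once; duplicates become a key pointing to it.
    unique_keys = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique_keys)


if __name__ == "__main__":
    # The same 4096 byte "slide template" repeated many times.
    office_like = b"quarterly report boilerplate ".ljust(4096, b".") * 256
    # Stand-in for nearly random simulation output.
    random_like = os.urandom(4096 * 256)
    print(f"repetitive data: {dedup_ratio(office_like):.1f}x")
    print(f"random data:     {dedup_ratio(random_like):.1f}x")
```

The repetitive buffer collapses to a single stored block, while the random buffer gets essentially nothing back for the hashing effort.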
But on the tiering side, one has to ask, if the lower “tiers” are as fast as or faster than the upper “tiers” then why are you spending money on the upper “tiers”? This is the situation in HPC today.
In the preceding decade, there has been a growing realization that RAID is running out of steam, that spinning rust may have an expiration date, that data at rest needs to be constantly moved to new media, that techniques that work fine for mail servers may not work so well for HPC file servers, that scalability requires that we not bottleneck IO, that new file systems are coming out to help with some of the problems we’ve been dealing with … and that new vendors are emerging with a better/cheaper/faster product focus that has upset a few apple carts.
It has been an interesting storage decade. It is likely to be an even more interesting next decade as we answer some of these problems, and introduce new ones.

4 thoughts on “HPC in the first decade of a new millennium: a perspective, part 7”

  1. In 2011 the Sequoia-System, developed and manufactured by IBM, will be installed at LLNL and will start running in 2012.

  2. Thanks for the stream of consciousness on HPC storage. I’m currently trying to solve a large storage challenge for one of my clusters. Do you have a suggestion for 8000 IOPS and 50TB real? I’m looking to grow to over 100TB. In terms of filesystem, GPFS and Lustre are top contenders. Cheers.
