Cargo cult HPC

This is a short thread of thought, triggered by a casual browse through Wikipedia on another topic (for an article I swear I am writing, right now, as I er … uh … write this). Way back in graduate school, we had all read Feynman’s book. Call it required reading at the academy. Good things came out of this, as we (a few friends and I) reverse engineered his discussions of differentiation under the integral sign and suddenly had a really powerful tool available to us (which seems to have pissed off a few profs in classes with homework, but that’s a story for another beer).

Feynman in his book discussed Cargo Cult science. He gave the analogy of what various Pacific Islanders did to bring the all-powerful planes back to their island, making headphones with shells, building faux runways, and then not grasping the reasons why it didn’t work. Their world view was constricted and constrained by their understanding. And changing that understanding would be strenuous and frankly shocking to them (questions about the ethics or morality of ripping people’s illusions, beliefs, and dogma from them are worth thinking about here).
So this got me thinking about HPC as well. There are a few practitioners of HPC out there, a few wannabes, and some Cargo Culters. No, I won’t name who I think goes into each group. Look at it this way. Some companies put up the facade that they know or understand the market, and then do things that clearly demonstrate a lack of understanding of the fundamental forces in the market. We have seen this in software and hardware companies. Some folks don’t get it, or just don’t care, and like to try to label practitioners as “old school”. Yeah, ok. Whatever.
Those are Cargo Culters. They may be able to change, to get to the wannabe levels. The wannabes are companies that want to have a serious HPC footprint. This is their goal. They may not be quite there yet, but with the right investments and over time, they may get there. Or they may not. HPC isn’t easy and it is easy to get distracted from high performance. Some of the wannabes have hands that they have to play which for various reasons won’t likely be successful. The successful ones will be able to influence their future product mix to make it more successful in the market.
The practitioners have been doing this stuff successfully for a while. The difference between the Cargo culters and the wannabes is that the Cargo culters will never quite grasp why what they have won’t work. Moreover, they will have little to no say to be able to change the course of the company to offer a meaningful product in the space.
The one example that comes to mind right now is (without naming names) a storage vendor with a “scalable” product for large storage. I won’t get into specifics. I will say that hiding terabytes and more behind a gigabit bandwidth wall is not a wise use of resources. Yet we see exactly this being used. It’s even sadder when we see others ignore this issue, until someone complains that their new cluster is slow and they don’t understand why.
Not all codes are I/O bound. But for the ones that are, you need serious I/O firepower at the ready to bring the cluster up to its best performance levels.
1 GB takes about 10 seconds to move at wire speed over a gigabit link.
1 TB takes about 10,000 seconds (about 1/8th of a day) to move at wire speed over a gigabit link.
1 PB takes 10,000,000 seconds (about 1/3rd of a year) to move at wire speed over a gigabit link.
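The arithmetic behind these figures can be sketched in a few lines of Python. This is a rough estimate, assuming decimal units (1 GB = 10^9 bytes) and an effective payload rate of about 100 MB/s on a gigabit link, which is the round number the estimates above imply (the raw line rate is 125 MB/s). The function name and constants are illustrative only.

```python
# Rough transfer-time estimates over a single gigabit link.
# Assumption: ~100 MB/s effective payload rate (raw line rate is 125 MB/s).
EFFECTIVE_BYTES_PER_SEC = 100e6

# Decimal units: 1 GB = 1e9 bytes, 1 TB = 1e12 bytes, 1 PB = 1e15 bytes.
SIZES = {"1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}

SECONDS_PER_DAY = 86400


def transfer_seconds(size_bytes, rate=EFFECTIVE_BYTES_PER_SEC):
    """Seconds needed to move size_bytes at a sustained rate in bytes/s."""
    return size_bytes / rate


for label, size in SIZES.items():
    secs = transfer_seconds(size)
    days = secs / SECONDS_PER_DAY
    print(f"{label}: {secs:,.0f} s (about {days:.3g} days)")
```

Swapping in the rate of a 10 gigabit link, or of several bonded links, shows how quickly the picture changes once the pipe matches the size of the storage behind it.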
Yet we see groups propose and often request storage for their HPC with single or dual gigabit links and tens to hundreds of TB in size.
We see large clusters being architected with stacked switches which will be used for NFS and MPI traffic. The last one we dealt with like this had a stacked switch, and they were running a 256 node job that almost always failed. They didn’t get why it failed. Their HPC resource was a stack of 128 desktop machines with 100 Mb cards. Their storage was a head node with a popular RAID card and 1 disk platter devoted to storage (raid1 mirrored at least).
HPC practitioners will understand the issues in the above, as will some of the wannabes. The cargo culters won’t, and will probably make remarks that demonstrate their lack of understanding.
This is the market we are in today.

4 thoughts on “Cargo cult HPC”

1. One disk platter and RAID 1? Either that’s a typo, or it’s a really good way to test drive longevity by slamming the head from one end to the other on every I/O.

2. @Jeff
Effectively 1 disk. RAID1, on a small drive pair. It was a 20 GB disk pair as I remember, IDE. Might have been 2 platters for all I remember. My bad.

3. Ahem, Rev. Landman! This blog should be required reading for people who manage or direct companies with HPC parts in their business. I am truly surprised how clueless many people are.
Now, back to your complaints about the storage company with GigE lines… 🙂 They do have a 10GigE version but I don’t know the price differential. This particular storage vendor has a really cool offering in that it’s remarkably scalable and easy to use (better than the others). I agree that the GigE lines can be limiting on the storage side and on the client side. I think it’s particularly limiting on the client side (more on that in a second). On the storage side, the more of the units you add, the faster you can get (with some limitations that are mostly a function of the file size). You can actually get good performance if there are a number of units since the file can be spread across several units.
Back to the client. I think that for any application that does any reasonable level of IO, GigE is not enough. This is true for clusters larger than perhaps 8 or 16 nodes. The reason is that with a simple dual-socket node you have 8 cores in a node. For an 8-node cluster that’s 64 cores and for a 16-node cluster that’s 128 cores. If all of the cores perform IO at the same time, you’ve got 64 or 128 cores trying to write to the file system at the same time. However, more importantly, you could have 8 cores trying to push data down a single GigE pipe. If we assume 100 MB/s for a single GigE line, each core is only getting 12.5 MB/s. That’s going to kill performance.
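To make that arithmetic concrete, here is a minimal Python sketch of the per-core share. It assumes the figures used above (100 MB/s usable on one GigE link, 8 cores per node) and the worst case where every core does IO at once. The names are made up for illustration.

```python
# Per-core share of one GigE link when every core on a node does IO at once.
# Assumptions (figures from the discussion above): 100 MB/s usable per link,
# 8 cores per dual-socket node, one GigE link per node.
LINK_MB_PER_SEC = 100.0
CORES_PER_NODE = 8


def per_core_mb_per_sec(link_mb_per_sec=LINK_MB_PER_SEC, cores=CORES_PER_NODE):
    """Bandwidth each core gets if the node's link is shared evenly."""
    return link_mb_per_sec / cores


def total_cores(nodes, cores=CORES_PER_NODE):
    """Number of cores that could hit the file system simultaneously."""
    return nodes * cores


for nodes in (8, 16):
    print(f"{nodes} nodes: {total_cores(nodes)} cores, "
          f"{per_core_mb_per_sec():.1f} MB/s per core")
```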
The, sort of, good news, is that there aren’t very many apps that do MPI-IO or something like it (perhaps using POSIX – yuck!). The DOD and DOE folks have some apps and there are a few ISV’s moving in this direction (Fluent and CD-Adapco), but in general there aren’t very many. That’s the good news because we won’t run into the situation of having every core doing IO at the same time. But for some apps, such as bio apps, that are embarrassingly parallel, we could easily have every socket or even every core doing IO. Then we’re back in the same situation – yuck!
I just wish people in HPC would start taking storage seriously. I’ve seen way too much garbage going on, such as the examples you cited. In addition, I’m now starting to see the “Top500” effect in storage. There are several people who are striving to have at least 1PB of very high speed storage in place by next summer (if not earlier). The whole goal is to have more storage than their competitors (kind of like being higher on the Top500 list). They are running down this path even though they can’t fill up that much space in 5 years. In addition, they aren’t thinking about the long term aspects of huge amounts of storage. They don’t think about such things as cost (hardware and admin), how they will migrate their data when the hardware starts failing, backups (if they want to do them or, if not, redundancy), tiering (i.e. saving ), archiving, snapshot and replication (these can be very useful), etc. It’s just a mad rush to put as much spinning media on-line as they can. What a waste of money.
I will stop since I could go on and on for a long time. But I’m thinking about putting together a storage talk entitled something like, “People are really, really stupid about HPC storage – how to fix this problem”. I doubt many people will listen, but I can try 🙂
Thanks for the blog!
Jeff

4. As noted offline to Jeff, I am not talking about Panasas who have (IMO) great kit.