Once again, John West reads more than I do: on the InsideHPC.com blog he notes an article by Doug Eadline in Linux Magazine, all about really big clusters.
These are subjects I have explored a number of times. Doug points to nature and how nature scales and isolates failure.
This also reminds me of the multiple types of networks that can be formed for computation/processing. One type I deal with every now and then is the spammers' bot-nets. From experimenting on them (disrupting them), I think I see support for their being, by design, scale-free networks. In a scale-free network, the sensitivity to a particular set of nodes going down is quite high. Hub nodes are critical: they are single, or a small number of, points of failure.
Sadly, the way MPI is designed, it is, by definition, a scale-free network. If the master process dies, it can take down the rest of the computation.
John and Doug point out that something like a RAID is needed for computation. Doug suggests a new acronym for this as well.
Doug is onto something big here, so allow me to express something I have been saying for a while now to our storage/computing customers.
As you increase the number of disks (or nodes, if you prefer), the mean time between failures of the collection goes to zero: for independent units it falls roughly as the single-unit MTBF divided by the number of units.
Doug notes this as well in his article, in his discussion of MTBF.
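To make the "goes to zero" claim concrete, here is a minimal sketch of the arithmetic. It assumes independent units with a constant failure rate (the usual exponential model), which real disks only approximate. The function names and the 1,000,000 hour unit MTBF are hypothetical, chosen purely for illustration.

```python
import math


def collection_mtbf(unit_mtbf_hours, n_units):
    """Mean time until *any* unit in the collection fails.

    Assumes independent units with constant failure rates, so the
    failure rates simply add and the MTBF divides by the unit count.
    """
    return unit_mtbf_hours / n_units


def prob_any_failure(unit_mtbf_hours, n_units, hours):
    """Probability that at least one unit fails within `hours`."""
    return 1.0 - math.exp(-n_units * hours / unit_mtbf_hours)


if __name__ == "__main__":
    unit_mtbf = 1_000_000  # hypothetical per-unit MTBF, in hours
    for n in (1, 48, 1_000, 100_000):
        mtbf = collection_mtbf(unit_mtbf, n)
        p_day = prob_any_failure(unit_mtbf, n, 24)
        print(f"{n:>7} units: collection MTBF {mtbf:>12,.1f} h, "
              f"P(failure within a day) {p_day:.4f}")
```

Under these assumptions a collection of 100,000 units sees a failure about every 10 hours, which is why a long-running job on a really big cluster has to expect to lose pieces along the way.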
Basically you need resilient systems (no single/few points of failure). This suggests that we need to rethink (in a somewhat radical manner) MPI, as MPI has, as a consequence of its design, a single point of failure. This was not intentional; it was a side effect.
Things like Cilk, extended to a cluster, could be quite nice. No single point of failure. No centralized model. It just scales. Cilk is actually quite nice, and could help with lots of multi-core programming bits. There is no Fortran version, and their initial platform focus appears to be Windows PCs, but they still have an interesting product.
The question is whether or not you can extend the model to a distributed processing system. Whether their model can be extended that way isn't really known.
FWIW: we did something roughly like this using LSF as a centralized scheduler in 2000-2001 for SGI GenomeCluster. Each node pulled a job when it was ready.
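The "each node pulled a job when it was ready" idea can be sketched in a few lines. This is not the LSF setup we used, just a minimal single-machine illustration with threads standing in for nodes; `run_pull_workers` and `work_fn` are hypothetical names. Note that the shared queue is still a central point, exactly as the scheduler was.

```python
import queue
import threading


def run_pull_workers(jobs, n_workers, work_fn):
    """Run `jobs` on `n_workers` threads that each pull work when idle.

    No job is assigned up front: a fast worker simply comes back for
    more, and a slow one takes fewer jobs.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()  # pull the next job, if any
            except queue.Empty:
                return  # nothing left to do, so this worker exits
            out = work_fn(job)
            with lock:
                results[job] = out

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    squares = run_pull_workers(range(20), n_workers=4, work_fn=lambda x: x * x)
    print(sorted(squares.items()))
```

The appeal of the pull model is that no worker holds a pre-assigned share of the work, so losing one costs at most the job it was running at the time. Getting rid of the central queue itself is the hard part.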
This said, if Cilk could be extended to clusters, you could have its small “joblets”, for lack of a better term, replicate across nodes. You could get your RAID.
Still, it is a hard problem Doug addresses: resiliency in both the hardware and software layers.
[update] I should point out that we see the impact of disk MTBF in our JackRabbit 48 drive bay units. We see something within a constant multiplicative factor of the failure rates from the vendors. Call it a 4-5% failure rate, not the 0.73% failure rate the vendor reports. Others we have spoken to have seen the same. Basically, treat MTBF as order-of-magnitude guidance: right to within a constant multiplicative factor.
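For a feel of what those numbers mean in a 48 bay unit, here is a rough sketch that converts between a failure rate and an MTBF. It assumes the percentages are annualized failure rates and that failures follow a constant-rate (exponential) model; both are my assumptions for the arithmetic, and 4.5% is just the midpoint of the 4-5% range. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760.0


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF (constant-rate model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr):
    """MTBF in hours implied by an annualized failure rate (inverse of the above)."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


def expected_failures_per_year(n_drives, afr):
    """Expected number of drive failures per year in a chassis of n_drives."""
    return n_drives * afr


if __name__ == "__main__":
    n_drives = 48          # bays per unit, from the text
    vendor_afr = 0.0073    # vendor figure from the text, assumed annual
    observed_afr = 0.045   # midpoint of the observed 4-5%, assumed annual
    for label, afr in (("vendor", vendor_afr), ("observed", observed_afr)):
        print(f"{label:>8}: implied MTBF ~{mtbf_from_afr(afr):,.0f} h, "
              f"~{expected_failures_per_year(n_drives, afr):.2f} "
              f"failures/year across {n_drives} drives")
    print(f"observed/vendor ratio: ~{observed_afr / vendor_afr:.1f}x")
```

Under these assumptions the vendor figure corresponds to an MTBF of roughly 1.2 million hours, while the observed rate works out to roughly two drive failures per year per fully populated unit, about six times what the vendor number would suggest.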