humongous computing systems

Again, John West reads more than I do, and notes on his blog an article from Doug Eadline at Linux Magazine, all about really big clusters.
These are subjects I have explored a number of times. Doug points to nature and how nature scales and isolates failure.

This also reminds me of the multiple types of networks that can be formed for computation/processing. One type that I deal with every now and then is the spammers' bot-nets. From experimenting on them (disrupting them), I think I see support for them being, by design, scale-free networks. In a scale-free network, the sensitivity to a particular set of nodes going down is quite high. Hub nodes are critical: they are a single point, or a small number of points, of failure.
Sadly, the way MPI is designed, it effectively behaves like a scale-free network. Have the master process die and it can take down the rest of the computation.
John and Doug point out that something like a RAID is needed for computation. Doug suggests a new acronym for this as well.
Doug is onto something big here, so allow me to express something I have been saying for a while now to our storage/computing customers.
As you increase the number of disks (or nodes, if you prefer), the mean time between failures of the collection goes to zero.
Doug notes this as well in his article, in his discussion of MTBF.
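To put rough numbers on that, here is a minimal sketch. It assumes units fail independently at a constant rate, in which case the MTBF of the collection is the single-unit MTBF divided by the number of units. The 1,000,000 hour unit MTBF and the function name are my illustrative assumptions, not figures from Doug's article.

```python
def collection_mtbf_hours(unit_mtbf_hours: float, n_units: int) -> float:
    """MTBF of a collection of identical units.

    Assumes independent failures at a constant rate, so the failure
    rates add and the collection MTBF is unit MTBF / N.
    """
    if n_units < 1:
        raise ValueError("need at least one unit")
    return unit_mtbf_hours / n_units


if __name__ == "__main__":
    unit_mtbf = 1_000_000.0  # hours; illustrative assumption only
    for n in (1, 48, 1_000, 100_000):
        hours = collection_mtbf_hours(unit_mtbf, n)
        print(f"{n:>7} units: some unit fails every {hours:,.1f} hours on average")
```

The point is the shape of the curve, not the exact values: the time between failures somewhere in the system shrinks in proportion to the unit count.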
Basically you need resilient systems (no single/few points of failure). This suggests that we need to rethink (in a somewhat radical manner) MPI, as MPI has, as a consequence of its design, a single point of failure. This was not intentional; it was a side effect.
Things like Cilk, extended to a cluster, could be quite nice. No single point of failure. No centralized model. It just scales. Cilk is actually quite nice, and could help with lots of multi-core programming bits. There is no Fortran version, and their initial platform focus appears to be Windows PCs, but they still have an interesting product.
The question is whether or not you can extend the model to a distributed processing system. Whether their model extends that far isn't really known.
FWIW: we did something roughly like this using LSF as a centralized scheduler in 2000-2001 for SGI GenomeCluster. Each node pulled a job when it was ready.
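The pull model is easy to sketch in a few lines. What follows is not LSF and not what we ran on GenomeCluster; it is a toy, with threads standing in for nodes and a shared in-memory queue standing in for the scheduler. All names here (`run_pull_workers`, `node0`, and so on) are made up for illustration.

```python
import queue
import threading
from typing import Callable, Dict, Iterable, Tuple


def run_pull_workers(
    jobs: Iterable[Tuple[int, Callable[[], int]]], n_workers: int
) -> Dict[int, Tuple[str, int]]:
    """Run (job_id, function) pairs; each worker pulls a job when it is ready."""
    work: "queue.Queue[Tuple[int, Callable[[], int]]]" = queue.Queue()
    for job in jobs:
        work.put(job)

    results: Dict[int, Tuple[str, int]] = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            try:
                job_id, fn = work.get_nowait()  # pull the next job, if any
            except queue.Empty:
                return  # nothing left to do
            value = fn()
            with lock:
                results[job_id] = (name, value)

    threads = [
        threading.Thread(target=worker, args=(f"node{i}",))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    jobs = [(i, (lambda i=i: i * i)) for i in range(10)]
    for job_id, (node, value) in sorted(run_pull_workers(jobs, 3).items()):
        print(f"job {job_id} ran on {node}: {value}")
```

Note that the queue itself is still a central point of failure here, which is exactly the limitation of a centralized scheduler.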
This said, if Cilk could be extended to clusters, you could have its small “joblets” for lack of a better term, replicate across nodes. You could get your RAID.
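A minimal sketch of what "RAID for joblets" might look like, in the spirit of mirroring: hand the same joblet to several nodes and take the first answer that comes back from a live one. This is not Cilk's API, and nothing here is a real cluster; a node is just a boolean (up or down), and `NodeDown`, `run_on_node`, and `run_replicated` are hypothetical names.

```python
from typing import Callable, Sequence


class NodeDown(Exception):
    """Raised when a joblet is sent to a node that has failed."""


def run_on_node(node_up: bool, joblet: Callable[[], int]) -> int:
    # Stand-in for remote execution: a dead node cannot run anything.
    if not node_up:
        raise NodeDown("node is down")
    return joblet()


def run_replicated(joblet: Callable[[], int], replica_nodes: Sequence[bool]) -> int:
    """Try the joblet on each replica node in turn; return the first result."""
    for node_up in replica_nodes:
        try:
            return run_on_node(node_up, joblet)
        except NodeDown:
            continue  # this replica is gone, fall through to the next one
    raise NodeDown("all replicas failed")


if __name__ == "__main__":
    # First replica is down, second is up: the computation still completes.
    print(run_replicated(lambda: 6 * 7, [False, True]))
```

The computation survives as long as at least one replica node is up, at the cost of reserving (or burning) extra nodes per joblet. A real version would also need the joblets to be safe to run more than once.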
That said, Doug is addressing a hard problem: resiliency in both the hardware and software layers.
[update] I should point out that we see the impact of disk MTBF in our JackRabbit 48 drive bay units. We see something within a constant multiplicative factor of the failure rates from the vendors. Call it a 4-5% failure rate, not the 0.73% failure rate the vendor reports. Others we have spoken to have seen the same. Basically, use MTBF as order-of-magnitude guidance, and expect it to be right only to within a constant multiplicative factor.
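For a sense of what those rates mean for a single 48 drive unit, here is a quick sketch. It treats the quoted percentages as annual failure rates and assumes drives fail independently; both are my simplifying assumptions, and the 4.5% figure is just the midpoint of the 4-5% range above.

```python
def expected_failures_per_year(n_drives: int, annual_failure_rate: float) -> float:
    """Expected number of drive failures per year in one unit."""
    return n_drives * annual_failure_rate


def prob_at_least_one_failure(n_drives: int, annual_failure_rate: float) -> float:
    """Chance that at least one drive fails in a year, assuming independence."""
    return 1.0 - (1.0 - annual_failure_rate) ** n_drives


if __name__ == "__main__":
    drives = 48
    for label, rate in (("vendor", 0.0073), ("observed", 0.045)):
        expected = expected_failures_per_year(drives, rate)
        p_any = prob_at_least_one_failure(drives, rate)
        print(
            f"{label:>8}: {expected:.2f} expected failures/year, "
            f"{p_any:.0%} chance of at least one"
        )
```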

2 thoughts on “humongous computing systems”

  1. When I read posts like yours and Doug’s, I feel my code writing elves stir in their sleep. Then I realize I have a powerpoint presentation and 900 management things due and they roll over and go back to sleep. If I was going to start doing real work again in HPC, though, this is where I’d pitch my tent.

  2. Yeah, it is hard to get time to do this stuff much anymore. Last bit of playing I did was some Cuda stuff last month. Got to do more (Cuda 2 is out).
    My problem is that there are many things I would like to do, but I have a company to run, marketing to develop, sales to make … Time for the fun stuff just isn’t there (for me).
