A teraflop here, a teraflop there, and pretty soon you are talking about real computing power

It seems IBM will be building another new NNSA machine. So what's interesting about this, other than IBM getting good press? Well, this appears to be part of a growing wave of heterogeneous high performance computing systems. Roadrunner appears to be a mix of COTS Opteron hardware and Cell-based blades acting as Accelerator Processing Units (APUs).

Why is that interesting? Programming parallel systems is hard. Programming heterogeneous parallel systems is … interesting.
It's always about the apps. But off-the-shelf apps won't run on this. Building apps for this will not necessarily be easy, but it will likely be worth it.
If you can offload the work that the Cell does well to the Cell units, run everything else on the Opterons, and have them all communicate over a high-bandwidth, low-latency fabric, you might just have something there.
We have been working on things like this in my day job. Some of our efforts have resulted in better utilization of tightly coupled processor resources, as in Scalable HMMer. This is a good direction to go. The question is how to make it easier to program against.
This also ties into the MPI/OpenMP debate, or more accurately, the parallel API debate, and many other similar debates. Generally speaking, if you can make something easy to use, people will use it if it adds value to their efforts. Using OpenMP is very easy in most simple cases, and it will add value, modulo its restrictions. MPI is harder to use; you have to design with it in mind. Then again, it has far fewer restrictions. How would you program APUs and heterogeneous processors with OpenMP? With MPI?
At minimum, with OpenMP you would need to extend the basic directives to tag the sections of code that should be able to reside on the APU and communicate via a shared memory region. So (with a request for a pardon from my Fortranesque readership) you might write something like

#pragma omp parallel
#pragma omp for target=apu    /* hypothetical clause, not standard OpenMP */
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

and have the compiler generate code for both the CPU and the APU, as well as APU presence detection (which should be done at the OS level, so we don't get a million APU drivers all using different detection techniques) that switches between the two, or even load balances.
Make it simple; let the compiler do the hard work. This forces the APU to have a fixed API, or at least to implement an API with a consistent subset across multiple APUs. It would also require very tight coupling between the APU and the processor, say over a nice HyperTransport link.
Of course lots of MPI folks out there are shaking their heads now, as MPI is, according to most of them, the "one true way". Fine. How do we do this with MPI then? Basically you would have to monkey about with the ranks. Say you assume that you have 1 APU per rank. Then you might be able to do something like negative ranks to represent the APUs. This won't work once you have more than one APU, though. You might need to do something like ranks 0 .. N-1 for the main threads, then N .. 2N-1 for the first APU threads, 2N .. 3N-1 for the second APU threads, and so on. The only problem I potentially see down the road is that if N gets large enough, 2N or larger might overflow the ints …
Yeah, sure, 2**31 (that's pow(2,31) for the C speakers; isn't that Fortran notation nice? 🙂 ) processors … Like we will ever see that. Like we will ever need more than 640k of RAM … 😀
Ok, back to programming Roadrunner. I am interested in seeing how it will be done. We have some ideas we have been working on related to this. MPI could work, though it would require a little effort. OpenMP would need more: as these processors sit in different address spaces, standard shared-memory OpenMP is moot for this.
The extending MPI idea might be quite interesting. Need to think on it some more.
And back to the machine in toto. People will use the machine if they can write programs for it and expect them to perform well. That means it has to be "easy" to some degree to use. Writing in assembly language is no fun; it lost its fun for me about 20 minutes after I started playing with it in 1985. But if this is how people will need to program it, then the app porting will be slow and painful.
And hopefully the next big machine there will be either Wile E. Coyote or Acme Supercomputing.
And again, kudos to IBM. This shows what happens when a company invests in R&D for HPC masquerading as a mass-market processor architecture. Remember that the Cell is going into some huge-volume mass-market boxes for games. Sure, it's a game processor, if you really want to call it that. It can still do a very nice number of single precision and double precision floating point calculations per second. And it doesn't cost $5000 per unit.