Blue waters are a-movin…

NSF is funding a 2×10^5 processor monster machine at NCSA. At $208M, each dollar will buy you 4.8 MFLOPS (4.8×10^6 FLOPS).
Assuming a quad core CPU would be able to provide (in theory) 32 GFLOPS (4 cores x 8 GFLOPS/core), you would need 31,250 units to provide this … (125,000 cores).
There are some interesting things about this machine. Very interesting … not just the price tag or the estimated sustainable performance.
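A quick sanity check on those figures, as a minimal sketch. The 1 PFLOPS target is my assumption, inferred from 4.8 MFLOPS per dollar times $208M; the 8 GFLOPS per core is the theoretical figure used above, and the variable names are just for illustration.

```python
# Back-of-the-envelope check of the numbers in the post.
# Assumption: the machine targets roughly 1 PFLOPS (inferred, not quoted).
TARGET_FLOPS = 10**15            # 1 PFLOPS, assumed target
BUDGET_DOLLARS = 208 * 10**6     # $208M, from the post
FLOPS_PER_CORE = 8 * 10**9       # 8 GFLOPS per core, theoretical peak
CORES_PER_CPU = 4                # quad core

flops_per_dollar = TARGET_FLOPS / BUDGET_DOLLARS
flops_per_cpu = CORES_PER_CPU * FLOPS_PER_CORE
cpus_needed = TARGET_FLOPS // flops_per_cpu
cores_needed = cpus_needed * CORES_PER_CPU

print(f"{flops_per_dollar / 1e6:.1f} MFLOPS per dollar")
print(f"{flops_per_cpu / 1e9:.0f} GFLOPS per quad core CPU")
print(f"{cpus_needed:,} CPUs, {cores_needed:,} cores")
```

Note that 125,000 cores at theoretical peak comes in below the 2×10^5 processors quoted for the machine, which is roughly what you would expect if the sustained rate per core is well under peak.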

It is a shared memory machine. Quoting the article:

All of that memory and storage will be globally addressable, meaning that processors will be able to share data from a single pool exceptionally quickly, researchers said.

Ok.. they did say globally addressable, not necessarily “shared”. These have slightly different contextual meanings.
I want to know how they are going to program it. If it is a shared memory machine, then, technically, we could use OpenMP. Which means I could write simple loops (in theory), and have them spread far and wide (in theory). In practice this doesn’t work well without some significant help and hints from the compiler/user.
Or maybe, this is just a 447×447 cell spreadsheet for Amir with each cell being a processor, local ram, and some local code.
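Where the 447 comes from, as a small sketch: it is the side of the largest square grid that fits in 2×10^5 processors. The rank numbering and the `cell_of` helper are hypothetical, purely to illustrate the spreadsheet picture.

```python
import math

PROCESSORS = 200_000  # the 2x10^5 figure from the post

side = math.isqrt(PROCESSORS)    # largest square grid that fits
cells = side * side              # processors used as spreadsheet cells
leftover = PROCESSORS - cells    # processors without a cell


def cell_of(rank):
    """Map a 0-based processor rank (hypothetical numbering) to (row, col)."""
    if not 0 <= rank < cells:
        raise ValueError(f"rank {rank} is outside the {side}x{side} grid")
    return divmod(rank, side)


print(f"{side}x{side} grid = {cells:,} cells, {leftover} processors left over")
print(cell_of(0), cell_of(side), cell_of(cells - 1))
```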
We are getting to the point, rapidly, where we may need to think about processors as being part of a continuum, a hive, and not as discrete entities unto themselves. This harkens back to Doug Eadline’s articles on how self-organization in large colonies tends to evolve successful models of behavior for programs … er … ants and insects.

3 thoughts on “Blue waters are a-movin…”

  1. I was kind of wondering some of the same things. There’s no OS that will run a single instance on that many processors *at all*, let alone well, so clearly not all of the memory will be shared. If there’s coherency logic in there, then it’s hard to see how they could meet their cost goals. What seems more likely is that each node will have private memory plus access to a global non-coherent pool composed of memory exported from each node. That actually sounds a lot like what we were doing at Dolphin ten years ago, where the remote memory was out in PCI space and could be accessed either via plain old memory references or by programming a dedicated DMA engine for larger transfers. Of course, it would be a shame to run naive code that races with itself or thrashes the hell out of the shared-memory interconnect(s) on such an expensive system, so I’d expect them to put some kind of PGAS layer on top of that.
    As for your other point, about how to conceptualize such systems, some of us are already there. The code I work on usually has to be aware of location and topology because it’s part of the infrastructure, but I often see other people at work treating a system as just a big bag of completely interchangeable nodes. This system has more, that system has fewer, but all the nodes in a system boot together and have jobs scheduled on them together etc.

  2. Maybe it’s a similar thing, but I was thinking of the reflective memory bits from the folks who became Quadrics. It’s not message passing per se, but could be used for it.
    Coherency at this level (e.g. the shared memory view of globally addressable memory with strong coherency) seems like it wouldn’t be possible without some serious cost. The Stanford Dash project which became the SGI Origin series had some of these features, but it was hard to build machines of more than 1000 cores.
    I am going to have to learn more about SiCortex soon.

  3. the hardware can’t physically be a spreadsheet without some NUMA architecture constraining the number of cycles to synchronize remote cells. with shared memory among groups of physical nodes you can use a series of MOV operations to get data from one point in space to another.
    there’s a decided lack of elegance in the crap-ton of COTS processors approach, especially when it comes to managing communication between nodes. with that budget i’d wafer-bond 10,000 wafers with ~100M logic blocks per wafer (about 1K-4K transistors per logic block in a 45nm process on 30mm wafers). we’d get an absurdly more efficient petaflop out of a 1 trillion block array (lower clock speed to about 10 MHz and with no off-chip I/O drivers and no fans). the most important supercomputer algorithms designed on it can be made into an ASIC for about the same cost. – for 3-D FPGA fun…
