Did distributed memory really win?

A decade or more ago, there was a “fight”, if you will, over the future of application-level programming interfaces for high performance computing systems. The fight was between proponents of SMP, and shared memory systems in general, and proponents of DMP shared-nothing approaches.
In the ensuing years, several important items influenced the trajectory of application development. Shared memory models are generally easier to program. That is, it’s not hard to create something that operates reasonably well in parallel. But it is still hard to get great (near theoretical maximum) performance out of these systems. And, back in the day, shared memory buses for single core CPUs became more expensive as you added more CPUs to them. That is, going from 4 processors to 8 processors involved a great deal more wire, motherboard lands, chipset support, and other things like this.
DMP (distributed memory parallel) shared-nothing approaches were and are harder to program. This hasn’t changed. MPI exists and it works. But it is quite easy to get yourself into trouble with it. MPI isn’t terribly complex, but it allows complex interactions to be created, and behaviors to emerge. These behaviors can hurt performance, which is not usually what you want.
In the early 2000s, people realized that they could write code for DMP, and it would run just as nicely on SMP. So … to a degree, the game is over. Just write MPI and be done with it.
Sort of.

Today, we have NUMA architectures, with 12+ processor cores per socket in some cases. These are shared memory machines.
In fact, as we’ve been pointing out for a while, clusters are becoming more SMP-like. With vSMP and other tools, you can turn the dual- and quad-core systems into a single large SMP. And as Intel and AMD add more cores to their processors, I’d expect that you can build an ever larger SMP. Why waste time building a cluster, with all the associated difficulties in managing it, when you can manage a single box (or very few boxes) of a large SMP?
And I suspect that the people writing code are cognizant of these changes. You can keep writing MPI, which isn’t a bad thing. What I expect to see happening, though, is more of a hybrid model. Something along the lines of OpenMP + MPI.
For years I’ve seen people disparage this model, stating that they haven’t seen a case where it is better than straight MPI. Well, I can’t say where, but I saw exactly such a case, definitively, on a large code this morning. I suspect that this isn’t an anomaly, and that we will in fact start seeing more codes like this.
Basically this model allows you to avoid excessive contention among MPI processes for resources. You can have a small number of threads handle communication between nodes, and calculate using shared memory (NUMA at that) within a node.
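To make the shape of that model concrete, here is a rough structural sketch using only Python’s standard library as a stand-in. Nothing in it is MPI or OpenMP: the `outbox` queue, `node_step`, and `communicator` names are hypothetical, and the two “nodes” are just lists in one process. In a real hybrid code the compute threads would be OpenMP threads and the single communicating thread would own the MPI calls, in the spirit of the `MPI_THREAD_FUNNELED` thread support level.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def partial_sum(data, start, stop):
    # Compute step: every worker reads the same shared list, no copies or messages.
    return sum(data[start:stop])

def node_step(data, n_workers, outbox):
    """One pretend node: n_workers threads compute, one result leaves the node."""
    chunk = (len(data) + n_workers - 1) // n_workers
    bounds = [(i, min(i + chunk, len(data))) for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda b: partial_sum(data, *b), bounds))
    # Only one message is sent, however many threads did the work.
    outbox.put(sum(partials))

def communicator(outbox, n_nodes, results):
    # Stand-in for the single thread that would make the MPI calls.
    for _ in range(n_nodes):
        results.append(outbox.get())

outbox = queue.Queue()
results = []
nodes = [list(range(1000)), list(range(1000, 2000))]  # two pretend nodes

comm = threading.Thread(target=communicator, args=(outbox, len(nodes), results))
comm.start()
for data in nodes:
    node_step(data, n_workers=4, outbox=outbox)
comm.join()

total = sum(results)
messages_sent = len(results)
print(total, messages_sent)
```

This shows the structure only. Python threads will not speed up the arithmetic here, so it says nothing about performance; the point is that four workers per node produce one outgoing message per node rather than four.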
Understand, when you get 4, then 8, then 24, then 48 processes all trying to grab a resource, say an InfiniBand tx/rx queue for a QP (queue pair), this contention will be observable in many cases. Using the hybrid model would allow you to manage that contention more intelligently. Moreover, it could reduce the impact of messages upon the fabrics, reduce the time spent managing messages, etc. That is, it could have significant positive benefits.
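A back-of-the-envelope way to see the contention described above is to count how many MPI ranks share one node’s adapter, and how many rank pairs span two nodes. The sketch below assumes a worst case in which every rank talks to every other rank, which most real codes do not do. The 16-node cluster size is made up for illustration, and the function names are hypothetical.

```python
def ranks_per_node(cores_per_node, threads_per_rank):
    """Number of MPI ranks on one node, all sharing that node's adapter."""
    if cores_per_node % threads_per_rank != 0:
        raise ValueError("threads_per_rank must divide cores_per_node")
    return cores_per_node // threads_per_rank

def inter_node_pairs(n_nodes, ranks_on_node):
    """Distinct pairs of ranks that sit on two different nodes."""
    total = n_nodes * ranks_on_node
    all_pairs = total * (total - 1) // 2
    same_node_pairs = n_nodes * (ranks_on_node * (ranks_on_node - 1) // 2)
    return all_pairs - same_node_pairs

N_NODES = 16   # assumed cluster size, for illustration only
CORES = 48     # cores per node, the largest count mentioned above

for threads in (1, 12, 48):  # flat MPI, a middle ground, one rank per node
    r = ranks_per_node(CORES, threads)
    print(f"{threads:2d} threads/rank: {r:2d} ranks per adapter, "
          f"{inter_node_pairs(N_NODES, r)} inter-node rank pairs")
```

Under these assumptions, flat MPI has 48 ranks contending for each adapter and 276,480 inter-node rank pairs, while one rank per node has a single rank per adapter and 120 pairs. Where the right balance falls between those extremes depends on the code and the fabric.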
Moreover, coding in the hybrid style allows you to easily add more large SMP nodes later on, or even smaller SMP nodes if that is what you have. That is, you don’t need to assume one thread or core per machine. That assumption is neither explicit nor quite implicit within MPI, but it does seem to appear as echoes in things like machine files, rank files, etc.
Basically what I am saying is that in the programming model wars, many of us thought MPI won. But SMP systems have been building back for a while. More of the systems you are running on will be SMPs. I bet that more of the code you will be running will be SMP code. So in this sense, SMP may have simply taken a detour.
MPI execution isn’t completely transparent yet. It needs a helper application (mpirun/mpiexec) to launch. This could be folded into the OS. Or it could spell trouble for future codes.
The future is very multi-threaded, and quite inhomogeneous: SMP and asymmetric SMP (e.g. APUs and accelerators in general), using both shared and unshared techniques. MPI may have won some of the initial battles, but I think it may have been a mistake to declare the war over. The problems that plagued SMP in the past, well, they aren’t significant problems anymore.
The next few years should be quite interesting.