How do you program N cores (as N -> infinity)

I have been touching on some of the aspects of this in various posts here recently. Basically you have 2 roughly related technologies to work with today. Shared memory (OpenMP) and distributed memory (MPI). Sure there are others, but these dominate.
But there is a problem with these.

In the case of OpenMP, you annotate your source, and the compiler does the work. But it only does the work assuming you have a shared memory system. So OpenMP is kind of a superset of the underlying language, extending it a little here and there, without changing it. Unfortunately the large machines of the future are pretty much guaranteed to be distributed memory machines, with shared memory NUMA based multiprocessing. So OpenMP doesn’t work for the large jobs across multiple machines. But the compiler can help you optimize jobs within a machine.
In the case of MPI, you have explicit, exact control over many aspects of the programming task. The compiler gives you no help, as MPI is bolted on, outside the compiler. The compiler won’t help you optimize. Worse, as everything is explicit, you are free to choose particular methods of operation that might not be very good (MPI calls in inner loops, disaggregation of messages/data).
I have argued the point before that we need to rethink languages a bit. Not so that we have more/less explicit control, but so we have a better way to express parallelism.
This is not to say OpenMP is bad. I like it, it just can’t be used on distributed memory machines (the future of supercomputing). This is not to say MPI is bad, it is simply too low a level to be working with. You have to explicitly do what a compiler does normally in terms of mapping your algorithm into code.
I liken MPI to an assembly language. There is nothing wrong with writing in assembly, just don’t expect it to be easy to work with, debug, or even get good performance out of. Many MPI programmers can get 10-15% efficiency out of their code as it scales up in number of CPUs. Few can get 50+%. This is not because they are bad programmers. It is mostly because it is very hard to enclose as much work as possible in parallel regions, which is what you need to do to get the best parallel performance.
The other issue I had started touching on with file systems and other bits is the concept of locality. Right now, most MPI programs operate with the implicit view that every rank can communicate easily with every other rank with about the same latency. That is, no matter how large the number of nodes are, you have to assume that your nodes are in some sense of the word, local, to each other such that your latency is about the same node to node. We have high bandwidth low latency fabrics to enable this (Infiniband et al).
But this misses an important point. Nature is non-local by default. Our algorithms, which, to a degree, express in approximately mathematical language, models of this nature, ought to reflect this. This nature could be in part encoded in the network linking disparate machines, and in part in the algorithms themselves. But again, this forces us to adapt the code to the architecture or the architecture to the code. Shouldn’t the compiler be doing that? The compiler has a model of the underlying CPU architecture, and it tries to apply this model against the source code. From which we get a binary executable which represents its effort at modeling our code and our machine. If we could teach the compiler about the non-local nature of the extended machine, this might be helpful, though the compilation process may get much more complex and take longer.
Hence, we need to start thinking about how to program non-local machines. Machines where you effectively have a light cone of reachability in a fixed period of time. We need to think about the same thing for file systems as well. Here it is harder, as you would like file systems to have a coherent universal view across an entire machine, no matter how hard/complex this is. Or at least this is the paradigm now.
Programming large processor count machines is only going to grow more complex over time as we will be using asymmetric multiprocessing (aSMP), accelerator processing units (APUs), and all sorts of neat ways of moving data around. My hope is that we will break some of the bonds we have been saddled with and figure out sensible ways to do it.
I even have a name for this hypothetical new generation language that works across all types of machines.