When you have a great deal of power, but you can't use it, because it is too hard to use it …

For decades, I have been debating high performance computing, specifically parallel computing, with friends and colleagues. They doubt that parallel computing techniques will ever go “mainstream”; that is, that there will ever be a large upswing in the number of users of parallel programming techniques and methods, or for that matter in codes that use parallel programming effectively. I argue that this will occur when such usage gets to be “easy”.

Of course, the trick is to define what “easy” is. Having N+1 different API variants for parallel programming does not help.
Look at the LAPACK and BLAS codes. There is a reason why they have been successful, and why so many people working on programs with a numerical linear algebra component have used them. They make many of the elements and calculations of linear algebra easy to use. You string together calls to the routines rather than designing them from the ground up.
Similar to these are the other “PACK”s: SPARSEPACK, ScaLAPACK, … They make access to this power easy. You don’t have to think very hard about what you are going to do, or how, relative to the existing implementation. The complexity of the linear algebra calculation is, to a degree, masked by the abstraction of the interface to the calculation. You can use the power because you don’t need to think about it. You can just use it.
Keep this in mind. This is important.
Also remember that the vast majority of computer programming languages assume a simple von Neumann model of the underlying machine. You have a processor, memory, and other stuff. Things proceed in an orderly, logical manner. This virtual machine representation allows you to extend it by adding other things. Like more processors. And now more memory.
Before we go down that road, it should be noted that programming the existing simple virtual machines, and doing a good job of it, is somewhat hard for most folks. It is always an iterative process, and it never completes. Programs keep getting better. Corner cases emerge, with some inputs generating odd output or behavior. Your mission is to get the right results, quickly, across all input cases.
As it turns out, we don’t even get that right most of the time.
Now add in complications. Have these little virtual machines communicate. Have them exchange data. Have them work on the same problems.
Now you have to deal with allocation issues, data motion issues, datum scheduling issues (is this data where it needs to be, when it needs to be there?), not to mention efficiency issues, such as: how can I effectively partition this problem on this machine? How can I map the problem to the underlying architecture?
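To make the partitioning question concrete, here is a minimal sketch of the simplest answer: cut the work into contiguous, near-equal blocks, one per worker. The function name `block_partition` and its parameters are made up for illustration. Real codes also have to weigh data locality and uneven per-item cost, which this ignores.

```python
def block_partition(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal blocks.

    Hypothetical helper: assumes every item costs about the same to process.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_items, n_workers)
    blocks = []
    start = 0
    for rank in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# Ten items over four workers gives block sizes 3, 3, 2, 2.
blocks = block_partition(10, 4)
for rank, block in enumerate(blocks):
    print(rank, list(block))
```

Even this trivial scheme forces a decision (who gets the leftover items), which hints at why the general mapping problem is hard.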
As if this weren’t bad enough, now add a range of decidedly non-von Neumann architectures to the mix. MMCPs (massive multicore processors), GPUs and stream processors, Cell-BE with heterogeneous processing models, array processors. And then soft “processors” where you create a dedicated computing circuit attached to a processor.
Setting aside the MMCPs, I call the rest of this collection APUs, for acceleration processing units. It is an abstraction that lets me talk about the whole collection without diving into the details. It is a simplification.
The MMCPs break a uniform view of memory and processing. You no longer have a “mirror” symmetry in your programming system: the processors are no longer equidistant from each other (in terms of time to communicate/interact), or from their respective memories (remote stores cost more than local stores).
The problem is that expressing this architecture in terms of a programming language/methodology is hard, as most programming languages are designed to be sequential in nature. One thing happens after another. Very few, if any, indicate that things should happen at the same time.
This means that if you simply pretend that all those extra processors are just more CPUs in your machine, and gloss over the differences in access time to reach them and the memory attached to them, you can be reasonably successful when the “massive” part is around 2 cores. At 4, unless you have the memory bandwidth to handle it, you start to see impacts from memory contention, “bus” contention, and related. You have fixed amounts of shared resources. You have to share them. If your model assumes that you have all of the shared resource to yourself, this will be a problem. And it is.
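A toy model shows why the trouble starts somewhere between 2 and 4 cores. Assume every core needs a fixed amount of memory bandwidth to run at full speed, and all cores draw from one shared pool. The function name and the numbers (4 units per core, 10 units shared) are invented for illustration, not measurements of any real machine.

```python
def bandwidth_limited_speedup(n_cores, per_core_demand, shared_bandwidth):
    """Ideal speedup, capped by a shared memory-bandwidth budget.

    Assumed toy model: each core needs `per_core_demand` to run flat out,
    and all cores share a single `shared_bandwidth` pool.
    """
    if n_cores < 1 or per_core_demand <= 0 or shared_bandwidth <= 0:
        raise ValueError("arguments must be positive")
    return min(float(n_cores), shared_bandwidth / per_core_demand)


# Illustrative numbers only: the shared pool feeds 2.5 cores' worth of demand.
for n in (1, 2, 4, 8):
    print(n, bandwidth_limited_speedup(n, per_core_demand=4.0, shared_bandwidth=10.0))
```

With these made-up numbers, two cores scale perfectly, while four and eight cores both stall at a speedup of 2.5. A model that assumes each core owns the whole pool would predict 4 and 8.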
The approaches you can take in programming these are either to a) let the compiler worry about it, or b) do it yourself. Most computer scientists I know are loath to consider “a”, and prefer to do all their work on “b”. This is not to say “a” is bad. Actually, giving the compiler a more precise model of the underlying machines is often a good way to abstract these machines. Let the compiler deal with the hard bits while you focus upon the problems you want to focus upon.
This is why I like OpenMP. It is easy to use. But it still feels like it is “bolted” onto the language. That said, it is not hard to get good performance with it. This is also why I don’t like MPI as much. You have to think about everything you do. And you have many possible MPI implementations, all non-interoperable, and ever so slightly different. I cannot write for MPICH, compile, and then run the result on a LAM system.
Using OpenMP is moderately easy, but it has limits. Using MPI is hard. It gives you more freedom. And you pay a price for that freedom.
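The difference in feel can be sketched without either library. The Python below is only an analogy built from standard-library threads and queues. It is not OpenMP or MPI, all function names are made up, and it shows programming style rather than performance. The first function hands a loop to a runtime and lets it decide who does what. The second makes the programmer scatter the data, receive it, send results back, and gather them.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def directive_style_sum(values, n_workers=4):
    """The runtime decides who does what; the loop body stays untouched."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(square, values))


def message_passing_sum(values, n_workers=4):
    """The programmer slices the data, ships it, and collects the replies."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()

    def worker(rank):
        chunk = inboxes[rank].get()                  # explicit receive
        results.put(sum(square(x) for x in chunk))   # explicit send

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        inboxes[rank].put(values[rank::n_workers])   # explicit scatter (cyclic)
    total = sum(results.get() for _ in range(n_workers))  # explicit gather
    for t in threads:
        t.join()
    return total


data = list(range(100))
print(directive_style_sum(data), message_passing_sum(data))
```

Both functions return the same answer. The second one is several times longer, and every extra line is a place to get the data movement wrong. That is the price of the freedom.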
I am ignoring most of the other parallel methods as being (relatively speaking) in the noise compared to these two.
Well, now we have PGAS languages, and they are very interesting. They let you express the parallelism, naturally, in the language. They know about local versus remote memory (it’s built into their models).
That is, PGAS languages make parallel programming much easier. They hide some of the painful bits. This is good.
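The core PGAS idea, one global index space in which every element has a known owner, can be mimicked in a few lines. This is a toy sketch in plain Python, not how UPC or Co-Array Fortran implement it. The class `PartitionedArray`, its methods, and the block distribution are all assumptions made for illustration.

```python
class PartitionedArray:
    """Toy global array, block-distributed over `n_places` owners."""

    def __init__(self, n_items, n_places):
        if n_items < 1 or n_places < 1:
            raise ValueError("n_items and n_places must be positive")
        self.n_items = n_items
        self.n_places = n_places
        self.block = -(-n_items // n_places)  # ceiling division
        # One local list per place; trailing places may hold fewer items.
        self.parts = [
            [0] * max(0, min(self.block, n_items - p * self.block))
            for p in range(n_places)
        ]
        self.remote_reads = 0

    def owner(self, i):
        """Place that holds global index i."""
        return i // self.block

    def write(self, i, value):
        self.parts[self.owner(i)][i % self.block] = value

    def read(self, i, from_place):
        """Read global index i as seen from `from_place`, counting remote hits."""
        place = self.owner(i)
        if place != from_place:
            self.remote_reads += 1  # in a real system this costs extra time
        return self.parts[place][i % self.block]


a = PartitionedArray(8, 2)
for i in range(8):
    a.write(i, i * 10)
local_value = a.read(1, from_place=0)   # index 1 lives on place 0: local
remote_value = a.read(6, from_place=0)  # index 6 lives on place 1: remote
print(local_value, remote_value, a.remote_reads)
```

The caller writes `a.read(6, ...)` the same way whether the element is local or remote. The model still knows the difference and can account for it.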
If we could incorporate APUs into this …
I am envisioning that the way we will be able to program most APUs is via a *PACK-like model. We have a fixed API to express something, and we issue function calls. This could work nicely. The problem is that we have three different APU models: one might be programmable in something close to what we know, one requires that we learn this thing called “stream programming”, and the final one is scary in that you can build your own API for it, and you have to design your own processor while you are at it.
That is, PGAS may make parallel programming better, but unless we agree on a sensible APU programming model (excluding FPGAs for the moment), programming them will be hard. PGAS could, with the right models in place, help us figure out how to program the MMCPs without going through nasty gyrations.
What works against this is that the PGAS languages require writing in a new language. You aren’t using C or Fortran anymore. You are using something almost but not completely like them. Which represents a learning curve, and an adoption curve. The former needs to be made very simple in order for the latter to occur at all. Rewriting huge chunks of code is simply not likely to occur. Code has a momentum of its own. Many years of C and Fortran have left large code bases that would need to be ported.
If the PGAS languages look exactly like C and Fortran with some extensions, things could be good. UPC and Co-Array Fortran try to do this. Chapel breaks from the past and tries something new. I am not precisely sure which problem Fortress was attempting to solve.
But the important point is that they all abstract the pain of parallel development whenever possible. HPF tried this some years ago.
The question is what we can do with the APUs now. They are a fact of life, and they are not going to go away.
I would, as indicated, suggest treating them as hardware versions of function calls. Tie the interface logic into an API, handle all the hard bits for the programmer, so they can do what they need to, and not have to think too hard about mapping their problem.
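Here is a rough sketch of what “hardware versions of function calls” could look like from the caller’s side. The operation is the BLAS-style `axpy` (compute alpha*x + y). Everything else is hypothetical: the registry, `register_backend`, the backend names, and the fallback order are invented for illustration and belong to no real accelerator library.

```python
# Hypothetical registry mapping a backend name to its axpy implementation.
_BACKENDS = {}


def register_backend(name, axpy_impl):
    """Make an implementation available under `name` (e.g. 'gpu', 'cpu')."""
    _BACKENDS[name] = axpy_impl


def _cpu_axpy(alpha, x, y):
    # Plain-Python reference path; always available.
    return [alpha * xi + yi for xi, yi in zip(x, y)]


register_backend("cpu", _cpu_axpy)


def axpy(alpha, x, y, prefer=("gpu", "cpu")):
    """Compute alpha*x + y on the first registered backend in `prefer`."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for name in prefer:
        impl = _BACKENDS.get(name)
        if impl is not None:
            return impl(alpha, x, y)
    raise RuntimeError("no backend available")


# No 'gpu' backend is registered here, so the call falls back to the CPU path.
result = axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

The caller never sees which device ran the computation. A vendor could register an accelerated implementation under `"gpu"` and existing calls would pick it up unchanged, which is the same bargain LAPACK and BLAS offer.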
Basically what I am saying is, make it simpler for the user, and you will get more users. Sort of like the cluster and HPC market. Make acquisition costs drop, make complexity drop, and more users will buy and use them. This has been happening for years, predating some recent entrants’ HPC efforts. The market has been growing because it is getting easier and less expensive. Programming these complex systems will be easier when we use programming languages and tools that understand this architecture. And programming accelerators, non-FPGA ones, will get easier if we can hide some of the complexity.
Make it easier and it will be used. Make it harder, erect barriers, and few will use it.
MMCPs pretend there isn’t a barrier (there is, and you do trip on it every now and then). APUs have a barrier writ large. Parallel programming has barriers. OpenMP makes them less obvious, but they are still there. MPI highlights and glorifies the barriers.
But we still gotta program all this stuff. Which means that things that look like PGAS probably are going to be needed, globally.