… Whether ‘tis nobler in the developer’s mind to suffer
The slings and arrows of outrageous application performance,
Or to take arms against a sea of development troubles
And by abstraction end them?
– “Bill S” on whether or not to use higher level abstractions when programming for performance.
Ok, “Bill” didn’t really write that; his text was paraphrased and adapted. I am also pretty sure he wasn’t writing parallel code (parallel prose, maybe).
Abstraction, to a computer science person, is creating an artifact: a “virtual” structure or method that is easier to think about and work with than the original, and that gives you significant advantages, such as productivity. To a scientist, abstraction helps you build mental models of the thing you are working on. This reminds me of the spherical horse joke from years ago, but the idea is the same: you replace a hard problem with one you know how to solve, and then try to express the hard problem in terms of the solvable problem.
Put in terms of HPC codes, is it easier to write:
C = A * B;
as compared to
double **A, **B, **C;
A = calloc(N, sizeof(double *));
B = calloc(N, sizeof(double *));
C = calloc(N, sizeof(double *));
for (i = 0; i < N; i++) {
    A[i] = calloc(N, sizeof(double));
    B[i] = calloc(N, sizeof(double));
    C[i] = calloc(N, sizeof(double));
}
/* matrix product: C = A * B */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
It is easy to see which of the two is more abstract. The application developer will likely be far more productive in the first model than in the second. They can think at a higher level, work at a level they are comfortable with, and not worry about the details of how the program maps onto the hardware.
Sounds great, doesn’t it?
Only one problem.
The mapping between the more abstract representation and the physical hardware is less direct in the first case than in the second. Which means either the compiler has to know a great deal more about how to implement matrix multiplication on the specific hardware in question, or it has to generate relatively general code, not taking advantage of specific hardware functionality and missing better ways to do it.
The mapping between the lower level representation and the physical hardware is more direct. You have greater control. As the expression goes, you are given enough rope to blow your foot off. Yes, this is mixing two metaphors about the danger of having lots of power at your disposal. You can cause more damage to your system the lower the level you work at. This mixing is on purpose. You get great power the lower you go, at significant cost. The question is whether or not the cost is worth it.
The higher the level of the language, the more productive the user is in developing their application. This is good. The problem is that applications will often run much slower at higher levels of abstraction, which means you lose performance. This isn’t a 5-10% performance loss; it is typically a 5x or greater performance delta on programs we have seen.
When I teach the HPC course at my alma mater, I often say (jokingly referring to it as “Landman’s first law of HPC programming”): “Portable programs are not fast; fast programs are not portable.” Portable usually means very high levels of abstraction. It will run anywhere, at a cost. If you invoke a virtual machine, you are, by definition, accepting slower performance, as your virtual machine is modeling physical hardware, and it will not do so as fast as the actual hardware. Just remember that when looking at Java and other VM languages for HPC, for the time- or performance-critical computing sections.
Now apply this to parallel programming. You have several overlapping and partially competing standards: OpenMP on shared memory machines, and MPI on distributed memory machines. You have a great deal more power available in MPI. You also have to work quite a bit harder. OpenMP operates at a higher level, and you are subject to its constraints.
MPI is, despite claims to the contrary, a low level library and set of tools. It abstracts lots of good stuff away for us. It hides the pain of writing socket-based applications, or of writing to the underlying hardware layers directly. All MPI asks in return is that we check our error codes and follow its requirements, and it will run nicely. But you still need to think hard about how to map your algorithms across multiple processors, and how to move data about. This is a large area of CS research, and it is not a completely solved problem. I have seen papers on distributed memory parallel CG solvers in the 2003 time frame.
C itself I liken to a somewhat abstracted assembly language. Fortran was a great language for numerical programming, if you could deal with its abstractions and limitations, and it optimizes very well. You have to work hard in C to generate code as optimal as what many Fortran compilers could generate for matrix problems. But C is lower level, so you have more control.
MPI is a lower level library that has support for some nice abstracted things. Not only can we move data about with it, we can create collective communication patterns to mirror our algorithmic design needs. It is not a thin abstraction layer, it is a rich library and mechanism that does indeed make life easier.
But it is still hard for the average scientist/engineer to use.
Yes, you can learn it. But in order to implement what you need, you often have to read the CS research on mapping particular algorithms to distributed processing elements. Gene Golub’s book on matrix computations is a great resource. It gets hard, though, for a scientist who cares a great deal more about their research, or an engineer who cares a great deal more about their simulation, to stay current with the CS literature on this. I don’t know many scientists and engineers actively following developments in theoretical methods for ODEs/PDEs in order to improve their research. A few do, but not many. Most leverage what they have already used and picked up.
I would like to see scientists and engineers be able to regularly use large collections of processors and APUs in an efficient and meaningful manner, without requiring a deep knowledge of the underlying hardware. This knowledge should be in the tools themselves. Think of the ATLAS library: a self-tuning BLAS library for a given machine.
This is what we need for MPI: a higher level of abstraction, so that end users don’t have to worry about the myriad details of the implementation and can focus on the application. The GSL library is like this to a degree, and it is very good, though it is not parallel, or easily parallelizable, in its innards. This is a shame.
Doug Eadline notes this in his article on ClusterMonkey.
I don’t think about the SSE registers on my chips. I expect my compiler to do that thinking for me. I would like the same thing for MPI.