We are working on some benchmarks for a customer. The code is commercial, closed source, and MPI-based.
The cluster in question is an InfiniPath-based system. I cannot say enough good things about the HTX-based InfiniPath systems; they are very fast and have very low latency.
They also come with their own MPI stack.
OK, let me give you a hint about where this is going. The benchmark could not run, because the code could not run on the nice, super-fancy InfiniPath system.
Why not is the point of this post. It is something I have been saying we need for a long time, yet few vendors appear to be interested, and even fewer developers. That is sad, because without it we are practically handing the market to a competitor, lock, stock, and barrel.
First off, I won’t name the vendor of the commercial code. I have nothing bad to say about them; on the contrary, they have been very helpful, as these folks always are. It isn’t their fault that their code doesn’t work on this MPI stack.
The fault lies in the fact that there is no ABI (application binary interface) for MPI on Linux. There is a single API to write to, and that is true. But you cannot take code compiled against LAM 7.1.2 and run it on MPICH 1.2.7p1. Like it or not, this is a huge problem.
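Here is a concrete version of the LAM-versus-MPICH point. The MPI standard pins down names and C prototypes. It leaves the binary representation of handles such as MPI_Comm, and of constants such as MPI_COMM_WORLD, to each implementation. MPICH-family stacks, for instance, make MPI_Comm an int, while Open MPI makes it a pointer to an internal structure. The sketch below is a self-contained C illustration of that split, not code taken from any real mpi.h. The stackA_*/stackB_* names are hypothetical, and the 0x44000000 constant is only illustrative. Source that says MPI_COMM_WORLD compiles to a 4-byte integer against one header and to an address against the other (8 bytes on a typical 64-bit system). An object file built for one stack therefore carries the wrong bits for the other.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "stack A": communicator handles are plain integers,
   and the world communicator is a compile-time integer constant. */
typedef int stackA_comm;
#define STACKA_COMM_WORLD ((stackA_comm)0x44000000) /* illustrative value */

/* Hypothetical "stack B": communicator handles are pointers to an
   internal struct, and the world communicator is the address of a
   global object that lives inside stack B's library. */
struct stackB_communicator { int context_id; int size; };
typedef struct stackB_communicator *stackB_comm;
static struct stackB_communicator stackB_world_object = { 0, 1 };
#define STACKB_COMM_WORLD (&stackB_world_object)

/* Bytes that compiled application code reserves for one handle of each kind. */
size_t stackA_handle_size(void) { return sizeof(stackA_comm); }
size_t stackB_handle_size(void) { return sizeof(stackB_comm); }

/* The raw bits baked into the caller's object code for the "same"
   source-level world communicator under each stack. */
uintptr_t stackA_world_bits(void) { return (uintptr_t)(unsigned)STACKA_COMM_WORLD; }
uintptr_t stackB_world_bits(void) { return (uintptr_t)STACKB_COMM_WORLD; }
```

An application binary built against stack A passes the integer 0x44000000 wherever it means "world". Stack B's library expects a pointer there, so swapping the library underneath the binary cannot work without recompiling.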
Here is why. The application vendor has to commit resources to every stack it supports. It has to add each one to its testing cycle. It has to add tests and systems and … well, you get the picture: each additional stack costs the application vendor money. Of course, a few open source folks are calling from the periphery that this is why the open source model is better, as it removes the need for ABIs. I disagree, but I will get back to that later.
This means that for this application, and many others, you have to test with LAM, Open MPI, MPICH-GM, MPICH ch_p4, MVAPICH, MVAPICH2, MPICH2 1.0.3, …
Yup, that’s right: an explosion of MPI stacks. And not a single one of them supports an ABI that would let us build once and link to a backend that talks to the hardware.
This is wrong on so many levels that it is not worth arguing the individual points. The competition will use one ABI. Write to the standard API. Compile. Along comes Joe Schmoe’s newfangled, faster-than-light interconnect with imaginary latency and a group velocity greater than c, and the application will work as soon as the drivers are installed. All in one nice DLL/.so.
This means that application vendors like the one I mentioned will prefer the platform that is simpler and cheaper for them. That platform lets them support fewer stacks, reduce their costs and testing burden, and focus on the value they add.
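The “build once, link to a backend” arrangement might look roughly like the following. This is a hypothetical sketch, not any existing MPI ABI. The application is compiled only against a small set of fixed entry points (abi_comm_size, abi_send, with integer handles), all of them invented for illustration. Those entry points forward through a table of function pointers to whichever interconnect backend is selected at run time. A real shim would more likely dlopen() a backend .so named by an environment variable. Here a static table of two toy backends keeps the example self-contained.

```c
#include <stddef.h>
#include <string.h>

/* The stable ABI (hypothetical): fixed-width integer handles that every
   backend agrees on. Applications compile against only these symbols. */
typedef int abi_comm;
#define ABI_COMM_WORLD ((abi_comm)0)

/* Operations each interconnect backend must provide. */
struct abi_backend_ops {
    const char *name;
    int (*comm_size)(abi_comm comm, int *size);
    int (*send_bytes)(const void *buf, size_t len, int dest, abi_comm comm);
};

/* Toy backend 1: pretends to be a 4-rank job over Ethernet. */
static int eth_size(abi_comm comm, int *size) { (void)comm; *size = 4; return 0; }
static int eth_send(const void *buf, size_t len, int dest, abi_comm comm) {
    (void)buf; (void)len; (void)comm;
    return (dest >= 0 && dest < 4) ? 0 : -1;
}

/* Toy backend 2: pretends to be a 16-rank job on a new interconnect. */
static int fast_size(abi_comm comm, int *size) { (void)comm; *size = 16; return 0; }
static int fast_send(const void *buf, size_t len, int dest, abi_comm comm) {
    (void)buf; (void)len; (void)comm;
    return (dest >= 0 && dest < 16) ? 0 : -1;
}

static const struct abi_backend_ops backends[] = {
    { "ethernet", eth_size, eth_send },
    { "fastnet", fast_size, fast_send },
};
static const struct abi_backend_ops *active = NULL;

/* Select a backend by name at run time; returns 0 on success, -1 if unknown.
   An unknown name leaves the current selection unchanged. */
int abi_select_backend(const char *name) {
    for (size_t i = 0; i < sizeof backends / sizeof backends[0]; i++) {
        if (strcmp(backends[i].name, name) == 0) {
            active = &backends[i];
            return 0;
        }
    }
    return -1;
}

/* Thin ABI entry points: the only symbols the application links to. */
int abi_comm_size(abi_comm comm, int *size) {
    return active ? active->comm_size(comm, size) : -1;
}
int abi_send(const void *buf, size_t len, int dest, abi_comm comm) {
    return active ? active->send_bytes(buf, len, dest, comm) : -1;
}

/* "Application" code, compiled once against the ABI above. Sends a short
   message to every rank; returns the number of successful sends, or -1
   if no backend is selected. */
int app_send_to_all(void) {
    const char msg[] = "hello";
    int size = 0, ok = 0;
    if (abi_comm_size(ABI_COMM_WORLD, &size) != 0) return -1;
    for (int r = 0; r < size; r++)
        if (abi_send(msg, sizeof msg, r, ABI_COMM_WORLD) == 0) ok++;
    return ok;
}
```

With this design, app_send_to_all() is compiled once. Calling abi_select_backend("ethernet") and later abi_select_backend("fastnet") changes where its messages go without rebuilding it. Supporting Joe Schmoe’s new interconnect means adding one more backend, not one more application build to test.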
Alright, by now the OSS folks are screaming at me. “It’s the source, stupid,” they are saying. Well, yes. That is a solution. Now show me any of these commercial vendors, who get $5k–30k per compute node in license fees, who will willingly abandon that model for the “consulting” model that OSS effectively forces you into. This won’t happen.
As a side note, we have released quite a bit of software as OSS in the hope of getting end-user installations and getting people to pay us for support. What we see is that not only do people not want to pay for their software, they also don’t want to pay for support for something that is “free”. So, if you are an OSS advocate (which, believe it or not, I am for critical OS and middleware), could you explain to me precisely which business model will work in OSS? Most OSS folks point to Red Hat or MySQL. Well, sadly, they are wrong. Red Hat is an aggregator, and it relies on the value of its packaging (which is part of the reason that dependency radii are so huge and annoying in RHEL). MySQL relies on the fact that, out of several tens of millions of installations, there are tens of thousands of users who will willingly pay for support. How does this model work when your total audience numbers in the hundreds? (Hint: it doesn’t.)
OSS is good, and we get great software out of it; it just isn’t a business model. If you have infinitely deep pockets à la IBM and others, sure, it can be a great strategic direction. If you depend on license revenue to pay your staff, it can be … a challenge … to say the least. This is why this company, and many others like it, will not open source their products. It is therefore why this company will support fewer MPI stacks. It is also why it may stop working on Linux altogether: with too many MPI stacks, its costs rise and it loses focus on developing the value it provides.
With all that said, let me state what is now very obvious. The lack of a standardized MPI ABI on Linux will be the major force driving application vendors to competing platforms, and their support for Linux will decline as a result.
Kyril Faenov claimed in our conversation that Linux has too many stacks and too many variants. On that he is wrong; Linux itself is very standardized. If he is talking about MPI on Linux, however, he is, IMO, correct.