Convergence and diversification

The market has been consolidating behind various OSes for a while. Reducing the number of ports reduces ISV costs. It reduces end user management headache. Curiously enough it also reduces the engineering costs of the relevant hardware vendors, but don’t tell a few of them that, as they still perceive value where they feel they can be different. Unfortunately I have a sense of mayhem in two of the converged OSes, Linux and Windows. Sure, some might try to lump Solaris in here as an alternative, but most of us know it isn’t. The market has told us.

From IDC and other reports, as well as articles in a number of magazines, we know with pretty good certainty how various OSes are growing/shrinking over time. From a Linux Magazine article in January 2007 we see:

In November of 2006 the IDC Worldwide Quarterly Server Tracker put Sun’s Solaris worldwide unit market share on x86 servers at just 0.25%. Worldwide shipments of Linux on x86 outpaced Solaris on x86 by 87.5 units to 1.

and while Sun downplays Linux in favor of Solaris

You might assume that spinning Linux and existing commercial Linux support as inferior to the Solaris offering is in Sun’s best interest. But it’s robbing Peter to pay Paul; IDC also stated that 69.9% of Sun’s x86 server shipments were Linux-based.

But back to the issue at hand.
Assume I am a developer (true in part, we do lots more than just code development). I have an application I want to run on these “two” platforms, Linux and Windows.
I have a problem.
Which version of Linux, and which version of Windows?
The folks in Microsoft like to pretend this is a Linux problem, that there are lots of fractured Linux implementations, and that this is how they will position their “product” against it. Well, which product?
For Linux: At its core it is one system. If we have to write a driver, we have to keep a wary eye out for an ever-changing ABI. This problem is ameliorated if you work with the folks in the kernel development team to get a driver into the mainline kernel. Actually it is better than that, because if you can get it in, you will get help developing it and expanding it, and you won’t have to chase that ABI yourself. Not bad, eh? That lowers your development costs and speeds your development time, especially as getting drivers into the kernel means they get tested on multiple ABIs for you.
But other elements may bug you. The great fun I have had with building OFED, the InfiniBand stack for Linux and others, is an example. It only works with specific kernels from specific distribution vendors. So if you want to run this open source software on, say, a Debian kernel, you are largely out of luck unless you do the porting. The flip side of this is that as OSS, you can do the porting. This is how we are approaching it; we can handle this porting. By porting, BTW, I mean simply fixing up minor differences in packaging and in include file and library file specification. This is not, as Microsoft appears to have implied in past conversations with me and others, a “port” as in moving Amber9 to windows. That is a “port” (e.g. a rewrite of the application).
To be very frank about this, the pain you have in porting tends to be minor annoying things like this. They do add up, though. Specifically, if you want to deploy MPI applications, you have lots of different MPI stacks you can target, and lots of different fabric technologies. Discussing this with a few groups, we found we were not the only ones to have multiple MPI stacks for each interface technology. Call it an embarrassment of riches.
Now onto windows.
Which windows should we target? Oh, don’t be silly, we should target windows.
Yeah, but which one? We have 32 bit XP pro, 64 bit XP pro, 32 and 64 bit Vista. 32 and 64 bit Windows server.
Oh, but they all have the same ABI. Right?
If this were the case, then you wouldn’t need 64 bit drivers now. The 32 bit drivers would work just fine.
The issue is that the windows side is fragmented into not just ABIs but into which targeted version you need to deploy on. This latter issue is problematic on Linux as well, due to the distribution situation.
But the multiple ABIs exist on both platforms, which effectively yields multiple versions of each platform. Luckily they should differ by a recompilation, at most. And this is one of the nicer features of Linux, in that you can recompile. I am not sure that you can simply change compilation options on the windows compilers. Since the Portland Group compilers and the Intel compilers are available for Windows, you should be able to do this, though I am not sure of the implications.
So while we see a convergence in terms of “platforms” we see a diversification within the sub-species, if you will, of the platforms.
This was supposed to save money somehow. Time reduction, and platform reduction. If I have to qualify my application for windows xp, windows xp x64, windows vista, windows vista x64, windows 200x x64 server, windows 200x server, how precisely does this new set of diverse platforms save me testing time/effort/money?
Linux’s issue is the ABI, MPI+compiler issue. As indicated, you have 2 ABIs you are likely targeting now: legacy x86 and modern x86_64. IA64 isn’t likely on your targeted ABI list, and that is for good reason. You have to have a different MPI for each architecture. Sure you can use MPI between architectures. I am not going to speak to this here.
Each MPI may be built with a separate compiler. Under Linux, there are 3 compilers to give serious attention to outside of the default gcc/gfortran. The Portland Group compilers are pretty decent, the PathScale compilers are incredible (when they decide to be), and then there is the Intel compiler, which at least in the 9.1 version time frame still had trouble generating reasonable code for anything but an Intel chip. The 10.0 compilers are out, so I will see if I can try those out.
But the point is that if you build an application using the PGI compiler, don’t expect it to link cleanly to an MPI built using one of the other compilers.
Before I go on, this is also a problem under windows, as OFED under windows comes with OpenMPI, mvapich2, etc. So if you target one of these high quality stacks in order to converge your development onto fewer stacks, this is a good thing … but you still have these build issues to worry about. Regardless of the platform.
So you get an MPI built with a compiler. Let’s see. 4 compilers (3 + gcc). 3 MPIs. That’s 12 different builds. Your testing space (if you want to target all compilers and all MPI stacks) just went up by an order of magnitude. Well, let’s do convergence there as well. Assume 1 MPI stack, and one compiler. Since I presume you care about performance on the x86_64 ABI, and you don’t assume everyone will be running an Intel chip, this pretty much knocks the Intel compilers out of the mix. Gcc is not a performance oriented compiler. So you are left with PathScale and PGI. Both are good. For the moment, let’s pick PGI. The added advantage is that it will work in Windows as well. Choose say OpenMPI along with it, atop the OFED stack, and, at least in theory, you have very little porting to do between platforms.
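To make that arithmetic concrete, here is a rough Python sketch of the build matrix before and after converging. The labels are just strings I made up for illustration; in particular the third MPI entry is a placeholder, since I only name two stacks here, and `build_matrix` is a hypothetical helper, not part of any tool.

```python
from itertools import product

# Compilers discussed in the text: the 3 commercial ones plus gcc.
compilers = ["gcc", "pgi", "pathscale", "intel"]

# Three MPI stacks. "third-mpi" is a placeholder label (assumption),
# standing in for whichever other stack you have to support.
mpi_stacks = ["openmpi", "mvapich2", "third-mpi"]


def build_matrix(compilers, mpi_stacks):
    """Return every (compiler, MPI) pair that would need its own build."""
    return list(product(compilers, mpi_stacks))


# Everything against everything: 4 compilers x 3 MPIs.
full_matrix = build_matrix(compilers, mpi_stacks)

# Converged choice from the text: one compiler, one MPI stack.
converged_matrix = build_matrix(["pgi"], ["openmpi"])

print(len(full_matrix), "builds before convergence")
print(len(converged_matrix), "build after convergence")
```

Every pair in the full matrix is a separate build to produce and test, and that is per ABI, so the real number is larger still.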
You need to converge the compiler choices as well as the platform choices. Similarly on the MPI side, you need to decide upon a particular implementation. This will keep your testing space smaller.
The final step would be platform certification. From your converged platform model, you have Linux and Windows. But you know each is diverse. Assuming your MPI is locked down, and the other bits are locked down, the diversity in platform distributions (windows versions, or linux distros) could be handled via automated smoke testing. Set up your cluster to diskless boot, load the image you need, and run your test case. This could be completely automated. No additional platforms to support (with new compilers and new test cases). Same compilers, same ABI, same MPI. What’s left are the “minor” differences.
The danger is that you would certify the app on one version of windows. Or one distribution of linux. And then declare yourself done. If you certify on windows xp pro, you have covered most of your user cases for desktops. If you certify on Redhat you have hit about 50% of clusters. It is better to have a coverage matrix of windows xp pro x64, windows 200x server x64, Redhat linux el5, Suse SLES, and Debian (or Ubuntu).
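The automated smoke testing over a coverage matrix could look something like the following Python sketch. Everything in it is an assumption for illustration: the image names are labels I invented for the platforms listed above, and `boot_image` and `run_test_case` are hypothetical hooks you would have to supply for your own diskless boot setup and your own test case. The stand-in versions here do nothing real.

```python
# Coverage matrix from the text; the image names are invented labels.
COVERAGE_MATRIX = [
    "windows-xp-pro-x64",
    "windows-200x-server-x64",
    "redhat-el5",
    "suse-sles",
    "debian",
]


def smoke_test(images, boot_image, run_test_case):
    """Boot each image in turn, run the test case, and record pass/fail.

    boot_image and run_test_case are caller-supplied callables
    (hypothetical hooks). Any exception counts as a failure for that image.
    """
    results = {}
    for image in images:
        try:
            boot_image(image)
            results[image] = bool(run_test_case(image))
        except Exception:
            results[image] = False
    return results


# Stand-ins so the sketch runs on its own; replace with real hooks.
def fake_boot(image):
    pass  # a real version would load the image onto the cluster nodes


def fake_test(image):
    return image != "debian"  # pretend one platform has a build issue


results = smoke_test(COVERAGE_MATRIX, fake_boot, fake_test)
failed = [image for image, ok in results.items() if not ok]
print("failed images:", failed)
```

The point of the design is that the compiler, ABI, and MPI stay fixed, and only the image list grows when you add a platform.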
If you write your code well enough, you could do even more convergence on the Linux side; just avoid little things like the HZ macro I mentioned in a previous post.
Our code tends to run nicely across all x86_64 and x86 Linux. Did some Ubuntu smoke testing and we have a build issue, but it looks to be more related to our locale settings than a real issue. But it still takes time to resolve.
The platforms have converged. But the sub-versions are diverse.
You never solve problems. You simply replace one with another. The question, at the end of the day, is which set of problems do you want?