The future of kernel-specific version subsystems

By joe

May 23, 2010 - 8 minutes read - 1510 words

One of the issues we ran into with Lustre on our siCluster was the inability to use the kernel of our choice. Lustre is quite invasive in its patch sets. So modern kernels, the ones with subsystem fixes, driver updates, and other things we need … can’t necessarily host Lustre without some serious forward porting of the code base.

And this got me thinking. Lustre isn’t the only project tied to specific kernel versions and effectively unable to use an arbitrary kernel. Xen is another example, and I’d argue OFED is another. There are many more of them out there, and I can’t name them all. It’s worth asking the question … what does the future hold for these projects?

In the case of OFED, it’s obvious that there is a strong commitment to getting it working correctly, though they have an unfortunate tendency to focus upon particular kernel releases. This means that end users get surprised when they try to build OFED 1.4.2 for RHEL 5.4 and discover … rapidly … that it doesn’t build. This is because of some patch overlap … OFED attempts to apply patches to specific kernels to enable their build, so if the patches are a) already in (as in the case of RHEL), or b) incompatible, the build will fail. It’s fine if the patches are in there, but this means you have to bump your version of OFED to take advantage of this. Which suggests that OFED shouldn’t be where the kernel patches are applied.

Generally, I have trouble with the OFED build process. It should be source RPMs or packages, but it’s a bit more than this. They have a Perl wrapper around the packages (this is fine) feeding build options to rpmbuild (this is not fine) based upon build environments (again, not fine; this needs to be handled within the RPM itself). What results is a build system that often breaks. It’s not enough to fix broken RPMs; we often have to fix the broken build options as well. This is troubling.

RPM is about building and packaging. The building part should be completely self-contained. You should be able to type rpmbuild --rebuild source.code.src.rpm and have it generate a working binary. Without build options. If you are exposing build options at the RPM build level, I argue that your packaging effort has failed. While I have also argued that RPM is a moving target (sadly it is), you can build a reasonable RPM to handle 99% of what you need, and package a few scripts into the RPM to handle the 1% of additional programmatic bits. It’s not hard, and it is a reasonable approach. I don’t advocate going the rPath route of turning the packaging system into a Python script … this is bad for many reasons in the general case.
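
To make the "no build options" point concrete, here is a minimal sketch of what a wrapper around a self-contained source RPM would be reduced to. The function names `rebuild_command` and `rebuild` are hypothetical, made up for illustration; the only real command involved is the rpmbuild --rebuild invocation quoted above. Nothing here is OFED's actual wrapper.

```python
import shutil
import subprocess
from typing import List


def rebuild_command(srpm_path: str) -> List[str]:
    """Return the argument list for a plain, option-free rebuild.

    Hypothetical helper: it deliberately accepts no build options,
    so everything environment-specific has to live inside the
    source RPM itself.
    """
    if not srpm_path.endswith(".src.rpm"):
        raise ValueError(f"not a source RPM: {srpm_path}")
    return ["rpmbuild", "--rebuild", srpm_path]


def rebuild(srpm_path: str) -> int:
    """Run the rebuild and return rpmbuild's exit status."""
    if shutil.which("rpmbuild") is None:
        raise RuntimeError("rpmbuild not found on PATH")
    return subprocess.run(rebuild_command(srpm_path)).returncode
```

If a wrapper needs to do more than this (pass defines, select options per distro), that is a sign the logic belongs in the spec file rather than in the wrapper.
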
I do advocate a cleaner build, one that doesn’t try an end-run around the existing process.

Note: Our kernel installation process is done by script. We don’t package that into the RPMs right now, as we want a single RPM to install on any RPM-based distro. The scripts handle the distro-specific methods of installing, preparing modules and source for compilation, preparing grub, building initial ram disks, etc. I’m OK with this, as we can build our kernel with rpmbuild as indicated above.

But there is enough value in wide adoption of OFED, and enough of the subsystems are included in modern kernels, that the user space side, though occasionally out of sync with the kernel side, will likely continue for a while. It wouldn’t surprise me to see most of the offload network cards go this route (Chelsio 10GbE iWARP is in there, as well as RDMA over EE). So I think OFED has a future, though some parts of the process are currently painful.

But what about Xen and Lustre? We’ve had customers request help building Xen. Building Xen is, IMO, a nightmare. It is extraordinarily kernel specific, and you have to know the right bits to grab … and it’s not obvious; you have to check multiple repositories for them. You have to set things up in a specific manner.

Compare this to KVM, which is part of the kernel. You don’t need to do anything like this. It’s part of the kernel. Did I mention that it’s part of the kernel? If I haven’t, please refer to the half-open drivers post from a few days ago; it is relevant. KVM is part of the kernel, so as bugs arise due to kernel changes, they are fixed. Which means problems inherently have limited lifetimes, and KVM does not lag the kernel. Did I mention KVM is part of the kernel? This is very important.

Red Hat has taken notice of this, and while they continue to support Xen, they have thrown in with KVM, going as far as buying Moshe Bar’s company, Qumranet, and pushing KVM as their virtualization solution going forward.
Customers who have built up a Xen dependency will be supported for a while, but the writing is on the wall. Xen has lost favor at Red Hat and now, I believe, at SuSE as well. The last statement in that article is setting a stage IMO … telling people that a decision on direction hasn’t been made, when in the past it had been made in favor of Xen, should telegraph the intentions going forward. Did I mention KVM is part of the kernel? Xen isn’t, and frankly, I don’t expect it ever will be. It’s pretty obvious at this juncture where things are headed. You build dependencies upon Xen-based systems at your own (future support) peril.

And this leads us to Lustre. Lustre has similar problems to Xen. Building a brand shiny new Lustre system on a 2.6.32 kernel is a porting exercise. One most people won’t undertake. You can pay Clusterstor to do it for you. But the bits might not be accepted into the mainline code. Lustre, though it is open source, is a corporate-controlled project. The corporate bosses of the project may decide (and have apparently already decided) upon project directions that do not jibe with general kernel support. I expect that, over time, Lustre 2.x and 3.x will be focused upon OEL. They have dropped SuSE, and I do believe that RHEL support will be slowly phased out. I know there will be strong pushback against the latter from existing Lustre partners, but I think it’s likely inevitable. OEL competes directly with RHEL, and it’s derived from RHEL.

So are there alternatives to Lustre? Yes. Ceph, which is now in the kernel, has many nice features, including ones Lustre doesn’t even have on its roadmap. And Ceph is in the kernel. I don’t think Lustre will ever be in the kernel.

But Ceph isn’t the only file system out there. GlusterFS has a very interesting design, and it scales well. V3.0 works nicely (fewer showstoppers than in v2.x). It eschews shared designs, opting for a completely distributed infrastructure.
It uses one kernel module, FUSE, which … is in the kernel. Gluster has a number of specific advantages over Lustre, including no centralized metadata server. It also allows local mounting of the file system with no loss of performance. They have a translator called NUFA, which provides local file system scheduler preference, that exploits this. You don’t need to use it, but you can.

Gluster’s striping translators don’t work well (as we have found out), and it tends to drive InfiniBand quite hard, exposing some significant issues in the process. We’ve seen some corruption and other issues from the IB layer. OTOH, running Gluster over TCP has its own issues, and requires a very carefully tuned TCP stack so as to avoid resource exhaustion. We are currently investigating latency issues on these systems to see if there are problems with IOP-bound loads … see the earlier reported results on IOPs which the Gluster team ran. We are looking at this to see where the IOP bottlenecks are.

But neither Gluster nor Ceph is problematic to get running on new kernels. Ceph uses BTRFS, and Gluster runs on any backing store.

So … I think the writing is on the wall for these projects with hard kernel dependencies. Some, like OFED, are adapting. They have to. Some, like Xen and Lustre, are not. Long term, OFED will survive and thrive. As will KVM, Ceph, and Gluster. Long term, I am not sure how long Xen and Lustre can keep up. As with the half-open driver problem, market forces will ultimately favor the more adaptable solutions.

The message should be clear: if you are not in the kernel, and you don’t have a light footprint relative to the kernel, and you depend strongly upon a specific kernel version, your long-term survival as a project is not looking good. Conversely, if you are in the kernel, or have a relatively light footprint relative to the kernel, and don’t have a strong kernel version dependency, you are likely in good shape going forward.