Observations on kernel stability

There is just no nice way to say this. We have a real (serious) concern over the stability of the baseline Redhat/SuSE kernels on newer hardware. Not just on our JackRabbit systems (and our forthcoming ΔV systems), but on clusters of newer gear, newer servers, etc. We install baseline systems, using nothing but the baseline components, and perform the recommended upgrades. We place these systems under moderate load, and whamo … kernel panic.

Replace their kernel with our own 2.6.23.14 build (with a number of important patches), and the same system, with the same RHEL/CentOS load, is rock solid under our extraordinary testing loads. We have seen similar problems with the SuSE kernels we have played with, though less so.

I am not quite sure why, other than the age of the kernels. The 7.x Ubuntu series was pretty good. The 8.04 LTS is in desperate need of kernel QC. Similar issues there.

We are working on our next-gen kernel release, which I had thought was going to be based upon 2.6.26.x. But given the number of fixes going into 2.6.27, and the fact that some of the issues we ran into in 2.6.26.1 and 2.6.26.2 are taken care of in 2.6.27, we may simply wait for that.

Sadly, the kernel development process itself looks like it keeps changing internal data structures, which makes including out-of-tree drivers much harder. We wind up (often) patching the drivers ourselves, and then testing, testing, testing … I know the kernel folks don’t care; they want all drivers in-tree. But as noted recently on the excellent lwn.net, the kernel maintainers are starting to worry about old/obsolete drivers. Eventually I hope that the kernel folks start treating 2.x as a fixed set of interfaces, with 2.x+1 being the new set, and a clear delineation of what you have to do to change your code to get there. I don’t think they want to do that, but it would help (cough cough) folks to not be chasing such a rapidly moving target.

I have seen/heard of other kernel driver writers getting tremendously frustrated with the long process to get in-tree, and the rapid changes that they have to chase/deal with during that process. More often than not, the in-tree kernel driver is out of date relative to functionality/features (we run into this all the time with, for example, Intel NIC drivers … the e1000-e1000e-igb split actually caught us somewhat by surprise … and we track development on the e1000, and now the rest, every few months).
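To make the moving-target problem a bit more concrete, here is a rough sketch of the bookkeeping an out-of-tree driver maintainer ends up doing: decide, per target kernel, which compatibility patches a driver needs. The packing in `kernel_version` mirrors the kernel's traditional `KERNEL_VERSION(a, b, c)` macro. Everything else is an assumption for illustration only: the `DRIVER_PATCHES` table, the patch file names, and the version thresholds are hypothetical and do not describe our actual patch set.

```python
def kernel_version(a, b, c):
    """Pack a version the way the kernel's KERNEL_VERSION(a, b, c) macro traditionally does."""
    return (a << 16) + (b << 8) + c


def parse_release(release):
    """Convert a release string such as '2.6.26.2' into a packed version code.

    Only the first three components are used; a fourth (stable) component
    and any '-suffix' are ignored, since internal interfaces are assumed
    here to change only between 2.6.x releases.
    """
    parts = release.split("-")[0].split(".")
    major, minor, patch = (int(p) for p in parts[:3])
    return kernel_version(major, minor, patch)


# Hypothetical table: (first kernel release that needs the patch, patch name).
# Neither the thresholds nor the file names come from a real driver.
DRIVER_PATCHES = [
    ("2.6.24", "example-driver-api-change-a.patch"),
    ("2.6.27", "example-driver-api-change-b.patch"),
]


def patches_for(target_release):
    """Return the hypothetical patches required to build against target_release."""
    target = parse_release(target_release)
    return [name for first, name in DRIVER_PATCHES
            if target >= parse_release(first)]


if __name__ == "__main__":
    for release in ("2.6.23.14", "2.6.26.2", "2.6.27"):
        print(release, patches_for(release))
```

Every new kernel release potentially adds a row to a table like this for every driver you carry, and each row has to be retested. That is the cost a fixed interface set per series would cut down.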

Fundamentally the issue with the distros is that they have to pick a kernel and support it. Sadly, a number of them have decided to pursue (aggressively) the backporting approach, which brings features (often without the resulting changed kernel internals and bug-fixes) back to their current kernel. Which IMO doesn’t make much sense. Yeah, I have heard Redhat’s argument about it, and no, I don’t buy it.

The problem is that I think these backports may compromise stability, not enhance features. It's a cost-benefit analysis.

So when we do ship Redhat (or SuSE, or Ubuntu), we will continue to ship our kernel by default, unless our customer insists otherwise. If they get lots of crashes with the default updated distro kernels, as we do in testing, we will ask them to shift to ours. We even install ours on other people's hardware; it has helped stabilize their systems. More of a support load for us, but in the end, it is worth it for our customers.
