Learning limits of Linux distribution infrastructure

It's only when you stress a distribution's infrastructure that you truly see its limits. And as often as not, the failure winds up being widespread.
Our new 60 bay JackRabbit unit with CentOS 6.3 on it … and this is not a bash at CentOS, they do a great job rebuilding the Red Hat distribution without the copyrighted bits … has a number of software RAID arrays on it: nine in the current test. There are a number of reasons for this right now, but specifically, we are testing this configuration for a customer.
The kernel (our updated one, patched/tuned/etc.) has no trouble seeing the disks, the full system. That's not the problem.
The problem is the udev software RAID assembly routines that, as often as not, mess up the RAID bring-up. I’ve tried tracing this through the system, and it looks like udev and some of the other elements that Red Hat uses as infrastructure are messing up … either in the scanning, or in other related operations.
This could be a timeout, or a race condition that is exacerbated by so many RAIDs and drives.
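One way to spot the half-finished work is to scan /proc/mdstat for suspicious arrays. A minimal sketch: the function below flags inactive arrays, or arrays with the high md12x numbers udev assigns to late-arriving disks. The function name and the sample input are illustrative assumptions, not taken from the real machine.

```shell
#!/bin/sh
# find_fragments: read /proc/mdstat-style text on stdin and print the
# names of arrays that look like udev's stray fragments -- either
# "inactive" arrays, or the auto-invented /dev/md12x device names.
find_fragments() {
    # mdstat status lines look like: "md0 : active raid6 sdb1[0] ..."
    # so $1 is the array name and $3 is the active/inactive state.
    awk '/^md/ { if ($3 == "inactive" || $1 ~ /^md12[0-9]$/) print $1 }'
}
```

In use, something like `find_fragments < /proc/mdstat` lists the candidates to stop and reassemble.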
What I am finding we need to do is inject a RAID “shutdown” and “reassembly” operation, under control, after the rest of the system has come fully up. This has proven to be a successful workaround. Unfortunately, the “smart” udev RAID bring-up, as often as not, will see disks show up “late” and put them in their own /dev/md12x for x ∈ {0 .. 7}. These “smarts” don’t quite help anyone, and show me that the event-based plumbing is likely missing the holistic goal of bringing up fully, correctly assembled, working RAIDs.
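That controlled shutdown/reassembly step might look something like the sketch below. The device names, the md12x range, and the DRY_RUN switch are illustrative assumptions, not details from the original system; here DRY_RUN defaults to on so the script only prints what it would do.

```shell
#!/bin/sh
# Post-boot RAID restart sketch: stop everything udev built (including
# stray fragments), then reassemble in one pass once all disks are up.
: "${DRY_RUN:=1}"    # default: print commands instead of running mdadm
CMDS=""
run() {
    if [ "$DRY_RUN" = 1 ]; then CMDS="$CMDS$* ; "; echo "$@"; else "$@"; fi
}

# 1. Stop any stray /dev/md12x fragments udev built from "late" disks,
#    plus the intended arrays, so the reassembly starts from a clean slate.
for md in /dev/md12[0-7] /dev/md[0-8]; do
    [ -e "$md" ] && run mdadm --stop "$md"
done

# 2. With all member disks now visible, assemble everything in one pass
#    from the on-disk metadata (or an explicit /etc/mdadm.conf).
run mdadm --assemble --scan
```

Run from rc.local or an equivalent late-boot hook, after the disks have all settled, with DRY_RUN cleared.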
I will look to see if we can patch this to fix it, but past experience (and scars) from dealing with the eyeball-melting mess that is udev suggests, strongly, that it might be better to simply disable its RAID assembly for non-root systems, and have an external, smarter bit of code handle this.
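For reference, two knobs exist for reining in the auto-assembly; a sketch, with the caveat that exact rule-file names vary by distro and version:

```
# /etc/mdadm.conf fragment (mdadm 3.x AUTO syntax): allow auto-assembly
# only for arrays belonging to this host (e.g. the root array), and
# refuse to incrementally assemble anything else.
AUTO +homehost -all

# A blunter alternative: neuter the udev rule that calls "mdadm -I" on
# block-device add events. On RHEL/CentOS 6 this typically lives at
# /lib/udev/rules.d/65-md-incremental.rules; an empty file of the same
# name dropped into /etc/udev/rules.d/ overrides it.
```

Either way, the external reassembly code then has full, deterministic control of bring-up.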
Note: While I am leveling some fire at the Red Hat infrastructure, don’t take this to mean that other distributions are any better. Actually, we’ve had a number of very hard-to-diagnose/fix race conditions show up with things like this in Ubuntu, among others. They manifest strongly in diskless configurations. So what we do there is, as with other things*, spend the absolute minimum time in the dangerous routines, by turning off everything not essential to coming up to a “working” state (for various definitions of “working”). Then we handle ALL post-boot configuration in a different infrastructure, with a minimal footprint in the boot chain of the system.
Every time I naively think “oh well, let’s try to use the plumbing inside the distro for this”, I am reminded how incredibly borked their plumbing is. Debian, and for a while Ubuntu (based upon Debian), had a reasonable bit of plumbing that didn’t break nearly as much as the Red Hat-based distros did. But with the latest Ubuntu, I can categorically state that this is no longer the case: Ubuntu has as much breakage as Red Hat, and sometimes it’s more tragic (the resolvconf debacle, anyone? Let’s take something that works and … break … it … for … no … reason … /sigh)
*anaconda, the installer for Red Hat, CentOS, and others, is a terrible monstrosity that is best not messed with. Its failure modes are … well … epic. It is best to spend as little time as possible in this installer, to do as little configuration as possible within it, and to handle everything outside of it that you possibly can. This is the philosophy of Chef, Puppet, our Tiburon, etc.: encode the business logic and configuration in a repeatable infrastructure, and start from the baseline install. Make that baseline as vanilla as possible, and do all the heavy lifting outside of their infrastructure.
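As a concrete illustration of that minimal-installer philosophy, a kickstart file can be cut down to almost nothing, with a %post hook that merely arranges for the real configuration to run on first boot. This is only a sketch of the idea, not a complete kickstart (it omits language, partitioning, passwords, etc.), and the URL and script name are hypothetical:

```
# Minimal kickstart sketch (RHEL/CentOS 6 style); incomplete on purpose.
install
text
reboot
%packages --nobase
@core
%end
%post
# Fetch the real configuration logic and schedule it for first boot;
# all the business logic lives outside anaconda.
wget -O /root/firstboot.sh http://config.example.com/firstboot.sh
chmod +x /root/firstboot.sh
echo "/root/firstboot.sh" >> /etc/rc.d/rc.local
%end
```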
