stateless booting

A problem I’ve been working on for a while is the sad … well … no … terrible state of programmatically configured Linux systems: systems whose state is determined from a central (set of) source(s) via configuration databases, and NOT by local stateful configuration files. Madness lies in wait for those choosing the latter strategy, especially if you need to make changes.

All sorts of variations on this theme have been tried over the last decade or so. Often programmatic tools like Chef or Puppet are there to push configuration to a system. This of course breaks terribly with new systems, and the corner cases they bring up.

Other approaches have been to mandate one particular OS and OS version, combined with a standard hardware configuration. Given how hardware is built by large vendors, that word “standard” is … interesting … to say the least.

You have a myriad of possible network, disk, etc. combinations. And you have to handle startup of these configurations in a very restricted environment. Getting new drivers into there is an exercise in frustration and pain.

This makes life painful.

The way the distributions handle stateless booting (NFS, iSCSI, etc.) is a disgrace. It’s poorly documented in most cases, and it also doesn’t work as written. We’ve done quite a bit of work to get it working smoothly and correctly.

One of the, oddly enough, harder problems we’ve encountered has been network autoconfiguration upon boot. Basically there has been a hardwired assumption (no pun intended) that what you want to do is to use the ports as named in the configuration file. The configuration file is assumed to reside on the unit itself in some persistent storage.

Or, [sarcasm] even better [/sarcasm], you hand a boot parameter, like ip=::::ethX:dhcp: or something to that effect.

Experience with this is that when you don’t want to use persistent storage on the unit for configuration (hey, software defined appliances are a thing!), you have to use the boot line version. Which gets to the slight problem that, well, you don’t quite know your port names/numbering in advance.

How you want it to work:

  1. Plug an RJ45 in somewhere, on a working wire
  2. Have the system sense which port has a link, and configure that
  3. Post bringing the OS up, do any additional configuration, again, backed by a (distributed) database

What you don’t want is to hardwire ip=… lines, or even mess with those if you can avoid it. Scalable OS (formerly known as Tiburon) had handled these, albeit we kept running into interesting issues when we switched to our lab machines with different generations of MB. Ethernet ordering will vary depending upon how the driver handles scanning the busses. This has been an annoying aspect of our life for the past 8 years or so, as we build/install our own kernels. Install a new kernel and now the persistent-net.rules file is no longer matching what exists on your machine. Which means, not only is your numbering wrong, but sometimes the driver may refuse to load, or worse, the numbering will be happily changed for you.
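
To see what a kernel or driver change did to the ordering, it helps to dump the current name to MAC to driver mapping before and after the update and compare. A minimal sketch, assuming the usual sysfs layout (an address file per interface, and a device/driver symlink for physical NICs). The function name list_nics and its optional directory argument are made up for illustration; they are not part of Scalable OS.

```shell
#!/bin/sh
# Print "name mac driver" for every network interface found under a sysfs tree.
# $1 is the directory to scan (default /sys/class/net); it is a parameter only
# so the function can be pointed at a test tree. Hypothetical helper.
list_nics() {
    netdir="${1:-/sys/class/net}"
    for path in "$netdir"/*; do
        # skip anything that is not an interface directory
        [ -e "$path/address" ] || continue
        name="${path##*/}"
        mac="$(cat "$path/address")"
        # device/driver is a symlink to the bound driver; virtual
        # interfaces such as lo do not have one
        if [ -L "$path/device/driver" ]; then
            driver="$(basename "$(readlink "$path/device/driver")")"
        else
            driver="-"
        fi
        printf '%s %s %s\n' "$name" "$mac" "$driver"
    done
}
```

Saving the output of list_nics under the old kernel and diffing it against the output under the new one shows at a glance which names moved to which MAC address.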

Think about this for a moment. You have N machines, you make a kernel update of some sort, or a driver update of some sort, and they rapidly go offline and don’t come back after a reboot. For N=2 or N=3, you can handle this by walking to your data center space and manually dealing with this. For N in the hundreds, you can’t.

Puppet and Chef won’t help here, thanks to potentially non-deterministic reordering.

What you need is a sane way to handle this.

Rethink the problem. You want to try to dhcp on ports which have link. Ignore everything else. Don’t worry about numbering/naming.

So we just rolled this into our Scalable OS load, and we did this as part of a mechanism to use our distribution to handle the NFS diskless config in a far saner way.

Really, you want to get into the full environment as QUICKLY as possible, as you can do so much more stuff with it.

We do this in a simple way, and actually encourage people to use it so we can get away from the idiocy of stateful network configs that we really don’t need for a stateless world.

First we set all networks to manual start. Second we include any driver modules we need in the initramfs, or hard wire them into the kernel. The drivers register themselves under /sys/class/net. Since we are dealing only with ethernet devices and don’t care about wireless, etc., it’s easy to focus our efforts.

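# bring all ethernet interfaces up
for path in /sys/class/net/eth*
do
    [ -e "$path" ] || continue
    ifconfig "${path##*/}" up
done
# sleep 10 seconds to make sure they establish link, then
# probe for link.  dhclient on all interfaces with a link
sleep 10
dhclient -x
ports=""
for path in /sys/class/net/eth*
do
    [ -e "$path" ] || continue
    # carrier reads 1 when the port has link
    carrier="$(cat "$path/carrier" 2>/dev/null)"
    if [ "$carrier" = "1" ]
    then
        ports="$ports ${path##*/}"
    fi
done
echo "ports= $ports"
# with no interface arguments dhclient tries every interface,
# so only start it when at least one port has link
if [ -n "$ports" ]
then
    dhclient -v $ports
fi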

That’s it. It will dhclient on all ports that have carrier. Now it’s not as sophisticated as it could be (detecting loss of carrier and moving the network to another port, or bridging, etc.). Yet. Stay tuned.
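
One small refinement that fits the same idea: the fixed 10 second sleep could be replaced by a polling loop that returns as soon as a port reports link. This is only a sketch under stated assumptions, not what ships in Scalable OS. The names has_carrier and wait_for_link are invented here, and the directory argument exists only so the loop can be exercised against a test tree. Note that it returns as soon as any one port has link, so a port that negotiates more slowly would be missed.

```shell
#!/bin/sh
# Succeed if the interface directory $1 reports carrier (link) = 1.
has_carrier() {
    # Reading carrier fails on an interface that is administratively down,
    # so a read error is treated as "no link".
    [ "$(cat "$1/carrier" 2>/dev/null)" = "1" ]
}

# Poll once per second, for at most $2 seconds (default 10), until at least
# one eth* interface under $1 (default /sys/class/net) has link.
# Prints the linked interface names, space separated; returns 1 on timeout.
wait_for_link() {
    netdir="${1:-/sys/class/net}"
    timeout="${2:-10}"
    elapsed=0
    while :; do
        ports=""
        for path in "$netdir"/eth*; do
            if has_carrier "$path"; then
                ports="${ports:+$ports }${path##*/}"
            fi
        done
        if [ -n "$ports" ]; then
            echo "$ports"
            return 0
        fi
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

In the boot script this would be used after the interfaces are brought up, along the lines of: ports="$(wait_for_link /sys/class/net 10)" && dhclient -v $ports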

After we got this working, I wanted to get the NFS root working. NFS root is very helpful for large deployments where you want/need a very simple paradigm for management. Distro-based NFS root is an evil joke. It mostly doesn’t work, though we’ve been able to coax it to for years.

It pissed me off no end. So I went back and rethought some things, and came up with a mechanism that just works.

  • First, since we do ramdisk based OS load from PXE boot so well, let’s start with that.
  • Second, once we are at the full OS load, we can load/run anything, any way we want. We have a better environment, far better control, so let’s make that work.

In the same script that starts up the network, I added a detector for a new command line argument. Right now, I’ve added

root=ram rootfstype=ramdisk

What if I extend this, by adding a modifier to this, say,

nfsserver=...

Like this:

root=ram rootfstype=ramdisk nfsserver=10.100.100.250:/data/tiburon/diskless/SciApp1.0_x64/rootfs

Since we are at the full OS level by then, with all my capabilities running and full programming languages available, my script can catch this. So the script has this:

if grep -q nfsserver= /proc/cmdline; then
        /etc/init.d/nfs-common stop
        mkdir -p /var/lib/nfs/
        cd /tmp
        /etc/init.d/nfs-common start
        # grab the nfsserver option from the command line
        nfsserver=$( /opt/scalable/bin/get_nfsserver.pl )

The /opt/scalable/bin/get_nfsserver.pl code is brain dead simple

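#!/usr/bin/perl

use strict;
use warnings;

# read the kernel command line and print the value of nfsserver=, if present
open(my $fh, '<', '/proc/cmdline') or die "cannot read /proc/cmdline: $!";
chomp(my $cmdline = <$fh> // '');
close($fh);
# \S+ also matches when nfsserver= is the last argument on the line
if ($cmdline =~ /(?:^|\s)nfsserver=(\S+)/) {
   printf "%s\n",$1;
}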

And this is what it gives based upon the /proc/cmdline I showed.

root@usn-ramboot:~# /opt/scalable/bin/get_nfsserver.pl
10.100.100.250:/data/tiburon/diskless/SciApp1.0_x64/rootfs
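
The same lookup can also be done without Perl, which is handy if the helper has to run somewhere Perl is not installed. A minimal sketch in plain POSIX shell follows. The function name get_cmdline_arg and its optional file argument are hypothetical, added here for illustration and testing. It assumes arguments are separated by single spaces and does not handle quoted values containing spaces.

```shell
#!/bin/sh
# Print the value of a key=value argument from the kernel command line.
# $1 = key to look for, $2 = file to read (default /proc/cmdline).
# Returns 1 if the key is not present. Hypothetical helper.
get_cmdline_arg() {
    key="$1"
    file="${2:-/proc/cmdline}"
    cmdline=""
    # read returns non-zero on a final line without newline, but still
    # fills the variable, so only give up if nothing was read
    read -r cmdline < "$file" || [ -n "$cmdline" ] || return 1
    # pad with a leading space so the key only matches at a word start
    padded=" $cmdline"
    needle=" $key="
    case "$padded" in
        *"$needle"*)
            rest="${padded#*"$needle"}"   # everything after "key="
            printf '%s\n' "${rest%% *}"   # cut at the next space, if any
            return 0
            ;;
    esac
    return 1
}
```

With the command line shown earlier, nfsserver="$(get_cmdline_arg nfsserver)" would take the place of the call to the Perl helper.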

From there, I create a new NFS root mount point, pivot into it, move my mounts, unmount old stuff, and fire up init 3.

        echo "nfsserver=${nfsserver}"
        mkdir -p /new_root
        mount -t nfs -o soft,rsize=65536,wsize=65536,retry=1,tcp,nolock,intr "${nfsserver}" /new_root
        cd /new_root
        mkdir -p old_root
        pivot_root . old_root
        # after the pivot, use the binaries from the NFS root
        ./bin/mount -n --move ./old_root/sys ./sys
        ./bin/mount -n --move ./old_root/proc ./proc
        ./bin/mount -n --move ./old_root/dev ./dev
        ./bin/mount -n --move ./old_root/run ./run
        ./bin/mount -n --move ./old_root/var ./var
        ./bin/mount -n --move ./old_root/var/lib/nfs ./var/lib/nfs
        ./bin/mount -n --move ./old_root/tmp ./tmp
        ./bin/umount ./old_root/var/tmp
        ./bin/umount ./old_root/var/lib/nfs
        ./bin/umount ./old_root/data
        exec chroot . /bin/bash -c 'umount -l /old_root ; /sbin/init 3 ' <dev/console >dev/console 2>&1
fi

And darn it, it works, really well:

root@usn-ramboot:~# df -h /
Filesystem                                                  Size  Used Avail Use% Mounted on
10.100.100.250:/data/tiburon/diskless/SciApp1.0_x64/rootfs  931G  833G   99G  90% /
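
Because pivot_root is hard to back out of once it has run, it may be worth checking the NFS export before pivoting into it. The following is a sketch of such a check, not something the script above does. The function name check_new_root is invented, and the list of paths is simply the set of binaries and mount points the pivot sequence above refers to.

```shell
#!/bin/sh
# Return 0 only if the candidate root $1 contains the binaries and mount
# points used by the pivot sequence. Hypothetical helper.
check_new_root() {
    root="$1"
    missing=0
    for f in bin/mount bin/umount bin/bash sbin/init; do
        # an absolute symlink cannot be resolved from outside the new root,
        # so any symlink is accepted as present
        if [ ! -x "$root/$f" ] && [ ! -L "$root/$f" ]; then
            echo "missing or not executable: $root/$f" >&2
            missing=1
        fi
    done
    for d in sys proc dev run var tmp; do
        if [ ! -d "$root/$d" ]; then
            echo "missing mount point: $root/$d" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Called right after the NFS mount and before the cd into /new_root, a failed check would let the machine stay on the working ramdisk root instead of pivoting into an incomplete tree.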

I’ll have to fix up a few things with the /var directories, but otherwise it’s going quite nicely. I’ll probably add in an option for overlay directories, and a few other things.

But the same code and the same concept will work very well for iSCSI. And FCoE. And *oIB. And …

Basically, my argument for a long time has been that the initrd/initramfs and the whole startup process in Linux is terribly fragile. The systemd argument intrudes on this somewhat as well. So I’ve been looking for a way around it: spend as little time in, and as little effort on, broken-by-design systems as possible (as I used to do for Red Hat based distributions), and have them transition to the full-on system as rapidly as I can get them there.

Another very nice aspect of this is that it is distribution agnostic. So I can support CentOS/RedHat, Ubuntu, Debian etc. absolutely trivially, with updated kernels and drivers. No more boot disks (sounds of loud clapping).

Since our ramdisk based boot has kvm integrated within it, we can, hey imagine that, use it to boot smaller VMs. And once we integrate dockerfiles and other bits into Scalable OS management layer (aka Tiburon), we can do the same for running containers atop these systems.

Took me too long to get this done, but I was distracted by many things.

More soon.
