This is an itch we’ve been trying to scratch in a few different ways, and it’s very much related to forgoing distro-based installers.
Ok, first the back story.
One of the things that has always annoyed me about installing systems is the fundamental fragility of the OS drive. It doesn’t matter if it’s RAIDed in hardware or software. It’s a pathway that can fail. And when it fails, all hell breaks loose.
This has troubled me for many years, and it is why tiburon, now SIOS, is the technology we’ve developed to solve this problem.
It turns out when you solve this problem you solve many others sort of automatically. But you also create a few.
The question of balance, which set of problems you want, and how you solve them, is what matters.
For a long time, we’ve been using NFS based OS management in the Unison storage system, as well as our FastPath big data appliances. This makes creation of new appliances as simple as installation and booting the hardware or VM. In fact, we’ve done quite a bit of debugging of odd software stacks for customers in VMs like this.
But the NFS model presupposes a fully operational NFS server available at all times. This is doable with a fail-over model, though the server remains a potential single point of failure if it is not implemented as an HA NFS.
The model we’ve been working towards for a long time is a correctly functioning, complete appliance OS that runs entirely out of RAM: it PXE boots the kernel/initrd, and then possibly grabs a full OS image.
We want to specify the OS on the PXE command line, as SIOS (aka tiburon) provides a database-backed mechanism for configuration, presented as a trivial web-based API. We want all the parts of this served by PXE and HTTP.
Well, we’ve made a major step towards the full version of this last week.
    root@unison:~# cat /proc/cmdline
    root=ram BOOT_DEBUG=2 rw debug verbose console=tty0 console=ttyS1,115200n8 ip=::::diskless:eth0:dhcp ipv6.disable=1 debug rootfstype=ramdisk verbose
    root@unison:~# df -h
    Filesystem      Size  Used Avail Use% Mounted on
    rootfs          8.0G  2.5G  5.6G  31% /
    udev             10M     0   10M   0% /dev
    tmpfs            16M  360K   16M   3% /run
    tmpfs           8.0G  2.5G  5.6G  31% /
    tmpfs           5.0M     0  5.0M   0% /run/lock
    tmpfs            19G     0   19G   0% /run/shm
    tmpfs           4.0G     0  4.0G   0% /tmp
    tmpfs           4.0G     0  4.0G   0% /var/tmp
    tmpfs            16K  4.0K   12K  25% /var/lib/nfs
    tmpfs           1.0M     0  1.0M   0% /data
Notice what is completely missing from the kernel boot command line. Hint: it’s the root=… stuff. Hell, I could even get rid of the ip=:::: bit.
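To make it easier to see what is and isn't on that command line, here is a minimal Python sketch that splits it into key/value pairs. The parser is purely illustrative and is not part of SIOS/tiburon. It makes one simplifying assumption: a repeated key (such as console=) keeps only its last value.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    key=value tokens map to their value; bare flags (e.g. "rw") map to True.
    Simplification: a repeated key keeps only its last value.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


# The command line shown in the session above.
CMDLINE = (
    "root=ram BOOT_DEBUG=2 rw debug verbose console=tty0 "
    "console=ttyS1,115200n8 ip=::::diskless:eth0:dhcp "
    "ipv6.disable=1 debug rootfstype=ramdisk verbose"
)

params = parse_cmdline(CMDLINE)
for key in ("root", "rootfstype", "ip"):
    print(key, "=", params[key])
```

Nothing in the parsed result points at a disk, partition, or NFS export: the root file system is described only as a ramdisk.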
The rootfstype=ramdisk currently uses a hardwired snapshot of a pre-installed file system. But the way we have this written, we can fetch a specific image by adding in something akin to
for appropriate values of $URL. The $URL can be over the high performance network, so, say, grabbing a 1GB image over a 10GbE or IB network should be pretty quick.
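As a rough sanity check on "pretty quick", here is a back-of-the-envelope estimate in Python. The efficiency factor is an assumption standing in for protocol and server overhead, not a measured number.

```python
def transfer_seconds(size_bytes, link_bits_per_sec, efficiency=1.0):
    """Idealized time to move size_bytes over a link.

    efficiency is an assumed fraction of line rate actually achieved
    (1.0 means perfect line rate, which real transfers will not reach).
    """
    return size_bytes * 8 / (link_bits_per_sec * efficiency)


GB = 10**9            # decimal gigabyte
TEN_GBE = 10 * 10**9  # 10GbE line rate in bits per second

# 1GB image at full line rate, and at an assumed 70% efficiency.
print(f"line rate: {transfer_seconds(1 * GB, TEN_GBE):.2f} s")
print(f"70% rate:  {transfer_seconds(1 * GB, TEN_GBE, efficiency=0.7):.2f} s")
```

Even with a pessimistic efficiency, the image fetch is on the order of a second or two, small compared with a typical server's firmware and POST time.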
We could do iSCSI, or FCoE, or SRP, or iSER, or … whatever we want if we want to attach an external block device, though given our concern with the extended failure domain and failure surface, we’d prefer the ramdisk boot.
We can have the system fall back to the pre-defined OS load if the rootimage fails. The booting itself can be HA.
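The fallback idea can be sketched in a few lines of Python. This is an illustration of the logic only, not the actual initrd code: the function names, the list of candidate URLs, and the "builtin" label are all hypothetical.

```python
import urllib.request


def fetch_image(url, timeout=10.0):
    """Fetch an image over HTTP and return its bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def load_root_image(urls, builtin, fetch=fetch_image):
    """Try each image URL in order; fall back to the built-in snapshot.

    Returns (source, data), where source is the URL that worked,
    or the hypothetical label "builtin" if every fetch failed.
    """
    for url in urls:
        try:
            data = fetch(url)
        except (OSError, ValueError):
            # Network errors, timeouts, and HTTP errors are OSError
            # subclasses; a malformed URL raises ValueError.
            continue
        if data:
            return url, data
    return "builtin", builtin
```

Listing more than one image server in `urls` gives the redundancy described above: a node only drops to the pre-defined OS load when none of them answer.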
So we can have a much simpler-to-set-up HA HTTP server handing images and config to nodes, as well as a redundant set of PXE servers … all far easier to configure, at far lower cost, and with far greater scalability. This will work beautifully at web scale, as all aspects of the system are fully distributable.
Couple this to our config server and automated post-boot config system, and this is becoming quite exciting as a product in and of itself.
More soon, but this is quite exciting!