Years ago … ok … decades ago, when I was building my first large clusters, I worried about configuration and drift. OS installers are notoriously finicky, and one of the hard lessons is that you should spend as little time inside them as absolutely possible. Do the bare minimum needed to get a functional system, and handle everything else after the first boot.
I actually learned this lesson at SGI, while writing Autoinst, a tool for large-scale OS deployment. There we were installing Irix, but as noted, the installer was quite fragile. All installers are. And sometimes it would leave you with a non-bootable system.
This is simply unacceptable in any context. Your installer should not fail. Ever.
You should be able to mass update/upgrade fleets. And configure them.
I worked around the installer by running the OS from an NFS-mounted root. I automated the creation of these NFS-bootable directories, and the requisite infrastructure. Customers were able to update hundreds of units at a time. The OS installer was little more than a perl script running through a recipe, applying rules we defined and stored in various files. These rules enabled customization per system, with differing disks, networks, etc.
This was a big deal back in the mid 90s. Cattle vs. pets, though it wasn't spoken of in those terms, with that analogy, back then.
Fast forward a few years to Scalable Informatics. I took the same learnings and applied them to Linux. Previously I had used the Rocks distribution, but it suffered from numerous issues with systems that didn't match its opinion on how things should be built, and from overall installation brittleness. The latter was somewhat a consequence of the former. I tried creating tooling within the Rocks environment to help with some of these things (finishing scripts), but was actively rebuffed.
What this suggested to me was that highly opinionated tools will often be broken by design for anything other than a very narrow use case. As I was building clusters of many different types and capabilities, it was obvious that Rocks was simply not the right way to go.
That said, Rocks had an idea that was fundamentally sound: re-installation on failure/reboot. The OS should be treated as a detail of startup. Their implementation rested upon Red Hat's anaconda, which was, in numerous ways, quite broken at the time. They took over the anaconda process, added their functionality in, and applied configuration at install time.
This was similar to what I had originally tried to do with Irix, before deciding to push the installer out of the way, due to its brittleness.
Configuration of the system occurred, in all cases, at install time. Indeed, many packages in the RPM/DEB world install default configuration options which need to be changed for reasonable operation. That is, the packaging maintainers like putting their spin on things, and often this reflects the distribution's "way" of thinking.
Not the user's way.
This is important, as users and groups have specific workflows, processes, and configurations that simply do not correspond to the way the distribution chooses to do things.
This is a problem, as it violates one of the core principles of configuration management: that there is one source of truth, and one way to do it, defined by the configuration management system. In this case, that could be the OS installer, or the CM controller.
When you update your installed image, these systems can (and often will) fight each other, breaking things.
And they still won't do things the way the users want. They force users into their model and their opinions, not into what the users actually want.
Put another way, if CM (either installers or tooling) were so important to people, docker containers would have never taken off. As docker allows users to do exactly what they want, without any CM system getting in the way.
One of the complaints against docker containers from the CM-focused crowd usually goes something like: "what happens if you have a container with an insecure library/binary? How do you upgrade the container?" I've heard this question numerous times, in numerous contexts.
At its core, the question makes a fundamental assumption which is simply incorrect. The assumption is that you manage containers the way you “manage” operating system installations.
You don't. You build a new container, push that artefact to your repository, and it takes the place of the old one after a small change to your deployment settings.
You have converted a problem of configuration management and drift into a problem of artefact management and service restarts. I am not advocating everything as a microservice here … you can do this with a huge monolithic app in a container. And since you do the "upgrade" in this manner, you can also, trivially, do a rollback, without invoking OS/FS-level controls.
That is, you have far more control over your application than the CM system, be it an external app or a distribution installer and software manager, ever allowed. There is far less likelihood of destroying an important service, as you can always run the previous artefact versus the newer one.
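The build, push, and rollback loop above can be sketched in shell. The registry, image name, and tags here are hypothetical, purely for illustration; the only real logic is a small helper that picks the next-to-last tag from a version-sorted list, which is all a rollback needs to know.

```shell
#!/bin/sh
# Sketch of the build -> push -> rollback loop.
# The registry and image names below are hypothetical:
#
#   docker build -t registry.example.com/myapp:v3 .
#   docker push  registry.example.com/myapp:v3
#
# Rolling back means re-deploying the previous tag, not mutating anything
# in place. A helper to pick that tag from a version-sorted tag list:
previous_tag() {
    sort -V | tail -n 2 | head -n 1
}

# e.g. given the tags v1, v2, v3, a rollback target is:
printf 'v1\nv2\nv3\n' | previous_tag    # -> v2
```

The point of the design is that the registry keeps every artefact you have ever pushed, so "rollback" is never a reinstall; it is just pointing the deployment at an older tag.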
Now, let's apply that same thinking to OSes. The OS installation itself can be an artefact, decided upon at boot time, fetched from a repository, without putting any permanent state on a system. That is, you can manage OS fleets on physical or virtual systems via artefacts you build as needed. OS rollbacks in the event of issues are trivial: literally a reboot away. OS testing is trivial; it can be done in a VM, and then in a staged way on physical hardware. This allows you to run canaries at the OS level, in addition to the applications you are using.
To do this, you need an intelligent, database-backed booting system. You'd need a sane object store to hold artefacts, and the ability to boot from it with this system. And you'd need an OS artefact builder that lets you take your distributions and turn them into ramdisk-booted systems. With these pieces in place, you can do any post-boot configuration trivially. You can even use the CM apps if you wish, though simple scripts work as well, without the extensive overhead.
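One common way to implement the "OS decided at boot time" piece is an iPXE script that the boot service hands to each node, pointing at a kernel and ramdisk image held in the object store. A minimal sketch, assuming hypothetical URLs and kernel arguments (nothing here is a specific product's interface):

```
#!ipxe
# Illustrative script a database-backed boot service might serve per node.
# The host, paths, and kernel parameters are assumptions for the sketch.
kernel http://boot.example.com/artefacts/os-2024.1/vmlinuz root=/dev/ram0 console=ttyS0
initrd http://boot.example.com/artefacts/os-2024.1/initramfs.img
boot
```

Swapping the `os-2024.1` path for a different artefact version is the whole upgrade (or rollback) mechanism: change the record in the database, reboot the node.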
My argument is fundamentally simple. Your machines can (and largely should) run stateless OSes, so their OS is determined at boot time. This allows you to treat your physical machines (and VMs that boot the same way) as cattle. It pushes CM out of the way (OS drift becomes literally a thing of the past) with immutable images. And if you decide you want to build a new OS image, you should be able to generate one and push it to the repository in far less than 30 minutes. Typical build time for this on my laptop is about 5-10 minutes at worst.
I believe that image artefact management … of OSes, containers, etc. … is a simpler problem to deal with than configuration management.