Long ago, back when we banged desktop/deskside computers together for HPC clusters, I was a firm believer in stateful systems. That is, systems with an OS installed directly on the unit, where you had to manage configuration, and configuration drift.
Many different tools were developed for this: CFEngine, Puppet, Chef, Salt, some of the HashiCorp tools, other proprietary tools that have since gone to weed/bitrot, and the latest favorite, Ansible. All of these tools shared a deep and fundamental problem: you could not really get identical images on each unit. It was literally impossible, due to the nature of the tools, which are effectively drivers for installation tools.
So if, for some reason, the installation hiccuped on some nodes (nah, wouldn't ever happen, impossible, simply cannot occur), you would have differently configured nodes.
The issue is, in part, these potential hiccups. Anyone ever install a 100 or 1000 node cluster and see hiccups mess up a few nodes? Remember the whole cattle vs pets discussion around these tools? The argument was to treat the machines as cattle, no special snowflakes, and not as pets, whom you might envision as deserving of special attention. I am not a fan of that analogy, as my "pets" are my (non-human) children.
Ok, so we have these tools to shepherd clusters into shape, and they've taken off in the HPC/AI world. Everyone and their brother is using these tools to deploy "identical" systems.
Except that when X% of your nodes fail their configuration for some reason, the commits to disk, the stateful install, aren't atomic. You can't roll back to a functional state. You generally have to re-deploy that node or those nodes. Apply the shepherd dogs to the misbehaving cattle.
Sounds good, right?
So how do you tell which nodes had deployment failures? What test can you run that will show you a vector of things working or not working? Or do you grade this as pass/fail simply on the output of the tool, and if so, how do you discern pass from fail?
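One way to turn this into an actual vector, rather than trusting the tool's exit codes, is to fingerprint what actually landed on each node and compare the results across the cluster. A minimal sketch, assuming Debian-style nodes; the commands and config files here are illustrative, not from any particular tool:

```python
#!/usr/bin/env python3
# Hypothetical drift check: reduce "did the deployment converge?" to a single
# comparable value per node, instead of trusting the config tool's exit code.
import hashlib
import subprocess

def node_fingerprint() -> str:
    """Hash the sorted package manifest plus a few key config files."""
    pkgs = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package} ${Version}\\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    h = hashlib.sha256()
    h.update("\n".join(sorted(pkgs.splitlines())).encode())
    for path in ("/etc/fstab", "/etc/ssh/sshd_config"):  # illustrative picks
        try:
            with open(path, "rb") as f:
                h.update(f.read())
        except FileNotFoundError:
            h.update(f"missing:{path}".encode())
    return h.hexdigest()

if __name__ == "__main__":
    # Run this on every node (pdsh, clush, whatever) and diff the output.
    print(node_fingerprint())
```

Any node whose hash differs from the majority did not converge, whatever the tool claimed.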
The config management systems aren't atomic. They can't be, for a number of reasons, unless you use something akin to the Nix package manager (which I am in awe of, BTW). If you are using them to build clusters, you are literally solving the wrong problem.
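To make the atomicity point concrete: the trick that Nix-style systems lean on is never mutating the live tree in place. You build the new state off to the side and flip a single pointer at the very end, so a failed build leaves the old state untouched. A rough sketch of the idea, not of Nix itself, with made-up paths:

```python
import os

def activate(new_build_dir: str, current_link: str = "/opt/system/current") -> None:
    """Atomically point the 'current' symlink at a fully built tree."""
    tmp_link = current_link + ".new"
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)                # clear any stale staging link
    os.symlink(new_build_dir, tmp_link)    # stage the new pointer
    os.replace(tmp_link, current_link)     # rename(2): old or new, never half-applied
```

If the build blows up halfway through, "current" still points at the last good generation. Contrast that with a package install that died mid-transaction.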
The problem statement should be "how do I get identical software running on every (cattle) machine", not "how do I run an installation on every (cattle) machine such that my shepherd tool du jour can re-install if I deem it necessary."
Basically, it took me a long time to appreciate image management and the programmatic booting of these images. This enables exact reproduction of the runtime environment, that is, a direct solution to the real problem statement for HPC/AI systems, as compared to wrappers around installs/scripts/etc. that run on every machine ... and that can take different code paths depending upon things you cannot control. Until this is internalized, we are going to keep replacing config management systems with more config management systems, all failing to solve the same basic problem.
Once I came around to this view (around 2009 or so), I built tools to help build and serve these images. Curiously, this lets people escape the "one cluster, one OS" scenario, where maybe Ubuntu won't let you run some version of some commercial software that depends upon Red Hat. Now you can make the OS a detail of the job to be run (see the sketch below). And even better, you can reproduce these runs without all manner of hoop jumping, partitioning of the cluster as needed, and so on. The images the nodes boot can include all the software you traditionally need/want, or as little as you wish, to adjust the size of your attack surface.
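"The OS as a detail of the job" just means the image reference travels with the job description, the same way the binary and its arguments do. A hypothetical sketch; the field names and URLs are mine, not from any particular scheduler:

```python
from dataclasses import dataclass, field

@dataclass
class JobSpec:
    name: str
    nodes: int
    # The runtime environment is job data, not cluster-wide state: point the
    # job at the exact image it was validated against.
    image_url: str
    command: list[str] = field(default_factory=list)

job = JobSpec(
    name="md-sim-rerun",
    nodes=64,
    image_url="http://imageserver/images/ubuntu-22.04-openmpi.sha256-abcdef.squashfs",
    command=["mpirun", "./sim", "--restart", "checkpoint.42"],
)
```

Reproducing a run a year later means booting the nodes into that exact image again, not hoping a year of package updates hasn't shifted the ground underneath the binary.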
From a security point of view, this is where it gets really good. Have the nodes run out of a ramdisk. PXE boot with an appropriate hardware root of trust and authentication (just another image!), then chain to the job image on success. A compromise no longer means the nodes are burned. Just reboot them.
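The serving side of "just reboot them" can be almost embarrassingly small. A hypothetical sketch of a boot-config endpoint that hands every PXE/iPXE request the current image; the names, paths, and dracut-style kernel arguments are purely illustrative:

```python
# Toy boot-config service: every reboot lands the node back in a known-good,
# in-RAM image served from here, not in whatever state was left on local disk.
from http.server import BaseHTTPRequestHandler, HTTPServer

CURRENT_IMAGE = "http://imageserver/images/compute-node.sha256-abcdef.squashfs"

class BootConfig(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real service would check the node's measured-boot/attestation
        # evidence here before releasing the job image URL.
        script = (
            "#!ipxe\n"
            "kernel http://imageserver/vmlinuz "
            f"root=live:{CURRENT_IMAGE} rd.live.ram=1\n"
            "initrd http://imageserver/initrd.img\n"
            "boot\n"
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(script.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BootConfig).serve_forever()
```

The node's state lives on the image server, not on the node, which is what makes "just reboot it" a real remediation rather than wishful thinking.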
I've seen what happens on traditional cluster systems when there is a compromise. Many moons ago, in the early part of the new millennium, a cluster we built for a university was compromised thanks to Windows on a laptop, a keylogger, and a graduate student who didn't listen to my guidance to use certificates rather than passwords. This cluster compromise knocked the entire university off the internet for a few hours. Had they been running an image-based system, the login nodes (not the management node, which they lovingly telnetted into, not ssh /sigh) would have been reset, and we could have helped them recover. Sadly, the hackers built themselves some sort of bot agent farm on the cluster and started working on DoS attacks. They succeeded.
I don't really like that. Stateful means you can modify state. Stateless means there is limited/no state to modify, and if you do modify it, it gets cleared upon restart.
This is deeply frustrating to me in a number of cases, as I see far too much time spent on "automation tooling" that fails to deliver the service it needs to deliver in a meaningful/workable time frame. If you are wasting time arguing about how you need to configure your configuration management system, you aren't doing work.