Systemd has taken the linux world by storm. Replacing 20-ish year old init style processing for a more legitimate control plane, and replacing it with a centralized resource to handle this control. There are many things to like within it, such as the granularity of control. But there are any number of things that are badly broken by default. Actually some of these things are specifically geared towards desktop users (which isn’t a bad thing if you are a desktop linux user, as I am). But if you are building servers and images, you get a serious case of Whiskey Tango Foxtrot dealing with all the … er … features. Especially the ones you didn’t need, know you don’t need, and really want to avoid seeing live and in the field. Ever again.
My biggest beefs with systemd have been some of the lack of observability into specific inner workings during startup and shut down, a seeming inability to control systemd’s more insane leanings (a start event/a shutdown event …), and its journaling infrastructure. The last item is still a sore point for us, as we find it very hard to correctly/properly control logging for a system that should be running disklessly, when we see the logger daemon ignoring the limits we imposed on it in the relevant config files, and filling up the ramdisk. Yeah, not so much fun.
The startup/shutdown hang timeouts are also very annoying, and despite the fact that systemd provides a good control plane for some of this, these delays (which are strictly, and in the absolute sense, completely) unneeded, cause a very poor UX for folks like me whom value efficiency. I do not want systemd trying to automatically start all my network services, and hanging the whole system while it is waiting for network to autoconfigure. I really … truly do not want that. I’ve been looking at how to change this so that it does the startup sanely, but the control plane seems … incomplete … at best here.
The shutdown hangs are due in large part to complexities around the sequencing of things that systemd does, and thinks it knows, and ignores. The things it ignores are what cause the problems, as the dependency graph of shutdowns seems to not know how to deal correctly with things like parallel file systems, object stores, etc. We’ve been working on improving this, and with judicious use of the watchdogs, and some recrafting of various unit files, we have it saner. But its still not perfect.
And don’t get me started on the Intel MPSS bit with systemd.
My point is simple. Systemd tries to do too much, and it messes up IMO, because of this. I’d like it to be a simple control plane. Thats it. Handle start/stop of daemons. Handle that level of its own logging.
I don’t want it to be DNS, and logger, and login, and …
Because when it is all that, things break. Badly.
Our systems are not vulnerable to this bug. And yes, he should have followed responsible disclosure protocol rather than posting a blog entry.
The net of why this bug exists is an assert function. Assert is never, and I repeat, NEVER, something you should use in critical system software. Nor is BUG_ON.
When the revolution comes, the coders who write using BUG_ON and assert will be the first against the wall.
Crashing a core service because you get input you don’t like IS NOT A VALID MECHANISM OF ERROR HANDLING. I’ll argue its somewhat worse than throwing an exception for every “problem” as compared to handling the problem gracefully locally. Exceptions should only ever be thrown for serious things. So add the folks who built exception throwing as the “right” pattern to handle what amounts to a branch control point in a code, to the folks first against the wall.
Divide by zero? Yeah, throw an exception (the processor will). Access memory outside of your allocated segments? Throw an error (OS will). Input value for n to some routine is not 1 or greater? If you answer “I must throw an exception, by using an assert here if this is not true”, then you need to rethink your design. Very … very … carefully.
This bug that the writer alluded to is a simple case of passing a function call a value of zero where it expects a 1 or greater. And rather than gracefully returning a no-op (which would make perfect sense in this context) ….
THERE IS A )_(&)(&^*&^%&%
assert(n > 0); in the code.
Seriously? WTF? In a core control plane? AYFKM?
The level of fail for others is profound. A simple user, with no elevated privileges whatsoever, can trivially run a DOS attack against a system.
For us, because we are using a slightly dated version of systemd, we just see some daemons stop and restart, but no significant loss in functionality, like others see.
I’ve been non-committal about the whole systemd bit for a while, I have/had hope for it. I am not now against it, but will be now looking for more ways to actively constrain what it can do. We already automatically defang some of its more annoying “features”. Now I am going to spend more time looking at how to turn off some of the functionality that I do not want it to handle itself.
I have no problem with it as a control plane. I have lots of problems with it as the Borg.