So we’ve been using Debian 8 as the basis of our SIOS v2 system. Debian has a number of very strong features that make it a fantastic basis for developing a platform … for one, it doesn’t have significant negative baggage/technical debt associated with poor design decisions early on in the development of the system as others do.
But it has systemd.
I’ve been generally non-committal about systemd, as it seemed like it should improve some things, at a fairly minor cost in additional complexity. It provides a number of things in a very nice and straightforward manner.
That is … until … you run into the default config scenarios. These will leave you, as the server guy, asking “seriously … whiskey tango foxtrot???!?”
Well, ok, some of these are built atop Debian, so there is blame to share.
The first is the size of tmpfs (ramdisks). By default, this is controlled in early boot (and not controllable via a kernel boot parameter) by the contents of /etc/default/tmpfs. In it, you see this:
as the default. That is, each tmpfs you allocate will get 20% of your virtual memory total as its size by default, unless you specify a size. And as it turns out, this is actually a bad thing. The /run directory is allocated early on in the boot, is not governed by /etc/fstab (not necessarily a bad thing, as the fstab is a control point), and has no other control points …
    root@unison:~# df -h /run
    Filesystem      Size  Used Avail Use% Mounted on
    tmpfs            13G  2.5M   13G   1% /run
    root@unison:~# grep run /etc/fstab
    root@unison:~#
Hey, look at that. It's 13GB for a /run directory that would struggle to ever be 1GB.
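If you want to see every tmpfs mount and its size cap at once, rather than running df one mount at a time, something like the following rough sketch works. It assumes the usual /proc/mounts field layout (device, mount point, filesystem type, options); the function name `tmpfs_mounts` is made up for illustration.

```python
def tmpfs_mounts(mounts_text):
    """Return (mount_point, size_option) pairs for tmpfs entries.

    mounts_text is the contents of /proc/mounts (or anything in the same
    format). size_option is the raw value of the 'size=' mount option, or
    None when the entry carries no explicit size.
    """
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "tmpfs":
            continue
        size = None
        for opt in fields[3].split(","):
            if opt.startswith("size="):
                size = opt[len("size="):]
        results.append((fields[1], size))
    return results
```

On a live box you would feed it `open("/proc/mounts").read()` and print the pairs. Anything showing the same large size on many mount points is worth a second look.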
Ok, it's tmpfs, so the allocation isn't locked. But it is backed by swap.
UNLESS YOU TURN SWAP OFF IN WHICH CASE AAAARRRRRRGGGGGHHHHH
So … to recap … Whiskey Tango Foxtrot?
But, before you get all “hey, relax dude, it's just one mount … chillax” … you have to ask about the interaction with other systemd technology (/run is mounted by systemd … oh yes … it is).
Like, I dunno. Logind mebbe?
So there you are. Logging into your machine. And you notice, curiously, you have this whole /run/user/$uid thing going on, one directory per logged-in user ID. And if you look closely enough, you have these as tmpfs mounts. And they are each getting 20% of VM.
Starting to see the problem yet? No?
Ok. So you have these defaults … And a bunch of users. Who log in. And use up these resources.
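To put a number on it, here is a back-of-the-envelope sketch of how the per-mount caps add up. The function name and the choice to count /run as a single system mount are assumptions for illustration. Keep in mind these are size caps, not locked allocations, so this is the worst case the defaults permit rather than what is in use at any moment.

```python
def nominal_tmpfs_percent(logged_in_users, percent_per_mount, system_mounts=1):
    """Sum of tmpfs size caps, as a percentage of total virtual memory.

    Each logged-in user gets one tmpfs mount, plus system_mounts mounts
    such as /run, and every mount is capped at percent_per_mount.
    """
    return (system_mounts + logged_in_users) * percent_per_mount


# 7 users at 20% each, plus /run at 20%
total = nominal_tmpfs_percent(logged_in_users=7, percent_per_mount=20)
print(f"tmpfs caps add up to {total}% of virtual memory")
```

With seven users at 20% apiece plus /run, the caps come to 160% of virtual memory. Nothing stops the sum of those caps from exceeding what the machine actually has.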
Now, to add complexity, let's say you have a swapfile rather than a swap partition. I am not a huge believer in swap … rather the opposite. If you are swapping, this is a strong signal you need more memory. If it is very rare swapping, once a month, on a non-critical system, sure, swapping is fine. If it is a daily occurrence under load on a production box, you need to buy more memory. Or tune your processes so they don’t need so much memory.
This swapfile is sitting atop a file system. This is a non-optimal scenario, but the user insisted upon swap, so we provided it. This is a failure waiting to happen, as filesystem IO requires memory allocations, which, if you think about what swap is/does, will be highly problematic in the context of actually swapping. That is, if you need to allocate memory, in order to page out to a disk, because you are trying to allocate memory … let's just say that this is the thing livelocks are made of.
And, of course, to make things worse, we have a caching layer between the physical device and the file system. One we can’t turn off completely. The caching layer also does allocations. With the same net effect.
Now that I’ve set out the chess pieces for you, let me explain what we’ve seen.
6 or 7 users log in. These tmpfs allocations are made. No swap. vm.overcommit_memory=0. Failure. Ok, add swap. Change vm.overcommit_memory=1. Make the allocatable percentage 85% rather than 50%. Rinse. Repeat.
Customer seriously questioning my sanity.
All the logs are showing allocation problems, but no swap. Change to vm.overcommit_memory=2. Stamp a big old FAIL across any process that wants to overallocate. Yeah, it will catch others, not unlike the wild west of OOM killer, but at least we’ll get a real signal now.
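For reference, in strict overcommit mode (2) the kernel refuses allocations once committed memory would pass a limit it reports as CommitLimit in /proc/meminfo. The kernel's overcommit documentation gives the formula as (RAM minus huge pages) times vm.overcommit_ratio / 100, plus swap. A rough sketch of that arithmetic follows. The 64 GiB figure is purely an illustrative assumption, as is my reading that the 50% and 85% above refer to vm.overcommit_ratio (its default is 50).

```python
def commit_limit_gib(ram_gib, swap_gib, overcommit_ratio=50, hugetlb_gib=0.0):
    """Approximate CommitLimit (in GiB) under strict overcommit mode 2.

    Mirrors the documented kernel formula:
    (RAM - huge pages) * overcommit_ratio / 100 + swap
    """
    return (ram_gib - hugetlb_gib) * overcommit_ratio / 100.0 + swap_gib


# Hypothetical 64 GiB box with no swap
default_limit = commit_limit_gib(ram_gib=64, swap_gib=0)
raised_limit = commit_limit_gib(ram_gib=64, swap_gib=0, overcommit_ratio=85)
print(f"ratio 50: {default_limit:.1f} GiB, ratio 85: {raised_limit:.1f} GiB")
```

On a swapless machine the default ratio means only half of RAM is committable in this mode, which is why raising the ratio or adding swap moves the failure point.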
… and …
Who authorized 20% RAM for these logins? The failures seem correlated with them.
That's the /etc/default/tmpfs defaults (which are insane). Ok, can fix those. But … still a problem, as logind thinks we should give this out.
Deep in the heart of darkness … er … /etc/systemd/ we find logind.conf. Which has this little gem.
as its default.
Whiskey. Tango. Foxtrot.
This is where you put user temp files for the session.
Yeah … for Gnome, and other desktop use cases, sure, 20% may be reasonable for the vast majority of people.
Not so much for heavily used servers. For the same reasons as above.
Do yourself a favor, and if you have a server, change this to
which may be overkill itself.
We really don’t need these insane (for server) defaults in place … which is why I am wondering what else in systemd defaults I am going to have to fix to not cause surprises …
I’ll document them as I run into them. We are building the fixes directly into SIOS, so our users will have our updated firmware on reboot.