new SIOS feature: compressed ram image for OS

Most people use squashfs, which creates a read-only (immutable) boot environment. Nothing wrong with this, but it forces you to add an overlay file system if you want to write. Which complicates things … not to mention that when you write too much and run out of available inodes on the overlayfs, your file system becomes “invalid” and Bad-Things-Happen(™).
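For reference, the squashfs-plus-overlay pattern being contrasted here usually looks something like the following sketch. The image path and mount points are made up for illustration, and the mounts need root, so the script skips them otherwise.

```shell
# Sketch of the read-only squashfs + writable overlay pattern.
# /boot/os.squashfs and the mount points are illustrative only.
demo_ran=no
if [ "$(id -u)" -eq 0 ] && [ -e /boot/os.squashfs ]; then
    mkdir -p /ro /rw /merged
    mount -t squashfs -o loop /boot/os.squashfs /ro   # immutable lower layer
    mount -t tmpfs tmpfs /rw                          # writable upper layer
    mkdir -p /rw/upper /rw/work
    mount -t overlay overlay \
        -o lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work /merged
    demo_ran=yes
else
    echo "skipping overlay demo (needs root and a squashfs image)"
fi
```

Writes land in the tmpfs upper layer, which is where the inode exhaustion problem above comes from.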

At the day job, we try to run as many of our systems out of ram disks as we can. Yeah, it uses up a little ram. And no, it's not enough to cause a problem for our hyperconverged appliance users.

I am currently working on the RHEL 7/CentOS 7 base for SIOS (our Debian 7 and 8 bases already work perfectly, and our Ubuntu 16.04 base is coming along as well). Our default platform is the Debian 8 base, for many reasons (engineering, ease of support, etc.).

SIOS, for those who are not sure, is our appliance OS layer. For the most part, it's a base Linux distribution, with our kernel and our tools layered atop. It enables us to provide the type of performance and management that our customers demand. The underlying OS distro is generally a detail, and not a terribly relevant one, unless there is some major badness engineered into the distro from the outset.

SIOS provides an immutable booting environment, in that all writes to the OS file system are ephemeral. They last only for the OS's uptime. Upon reboot, a pristine and correct configuration is restored.

This is tremendously powerful, in that it eliminates the roll-back process if you mess something up. Even more so, it completely eliminates boot drives from the non-administrative nodes in your system. And with the other parts of SIOS Tiburon tech, we have a completely decentralized and distributed booting and post-boot configuration system. All you need are working switches.

More on that in a moment.

First, the cut-n-paste from a machine.

ssh root@n01-1g 
Last login: Wed Apr 27 18:22:18 2016 from unison-poc

[root@usn-ramboot ~]# uptime
 18:46:17 up 25 min, 2 users, load average: 0.46, 0.34, 0.38

[root@usn-ramboot ~]# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/zram0 7.8G 6.2G 1.6G 80% /

[root@usn-ramboot ~]# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core)

Yes, that is a compressed ram device. With an ext4 file system atop it. It looks like it is using 6.2GB of space …

[root@usn-ramboot ~]# cat /sys/block/zram0/orig_data_size
6772142080

… but the actual compressed size is 3.3GB

[root@usn-ramboot ~]# cat /sys/block/zram0/compr_data_size
3333034374

I know zram was intended more for swap than for OS drives. But it's a pretty nice use of it. FWIW, the squashfs version is a little larger, and the mount is more complex.
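If you want to play with the same idea by hand, the zram side is only a few commands. This is a hedged sketch, not our actual boot path: the device name, size, compressor, and mount point are illustrative, and the setup needs root plus the zram module, so the script skips it otherwise. The arithmetic at the end just re-derives the compression ratio from the sysfs counters shown above.

```shell
# Sketch: build a compressed ram disk with an ext4 file system atop it.
# Needs root and the zram module; skipped otherwise.
if [ "$(id -u)" -eq 0 ] && modprobe zram num_devices=1 2>/dev/null; then
    echo lz4 > /sys/block/zram0/comp_algorithm   # pick a compressor
    echo 8G  > /sys/block/zram0/disksize         # uncompressed capacity
    mkfs.ext4 -q /dev/zram0
    mkdir -p /mnt/ramroot && mount /dev/zram0 /mnt/ramroot
fi

# Compression ratio from the sysfs counters shown above:
orig=6772142080; compr=3333034374
awk -v o="$orig" -v c="$compr" 'BEGIN { printf "%.2fx\n", o / c }'   # ~2.03x
```

So the ~6.2GB of file system data costs a bit over 3GB of actual ram on this node.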

Now for the people clamoring about how we preserve useful state, like, I dunno, config files, logs, etc.

First, logs. These should always be forwarded, if possible, to logging databases for analysis/event triggering. You can use something as simple as rsyslog for this, or something more robust like SIOS-metrics (to be renamed soon). Not everything goes through the syslog interface, and some subsystems like to write to strange places, or try to be your logging stack (systemd, anyone?). SIOS-metrics will be including mechanisms to help vampire out the data from the tools that are currently hoarding/hiding it. This includes, BTW, reluctant hardware like RAID cards, HBAs, IB HCAs, etc.
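As a concrete example of the simple end of that spectrum, forwarding everything via rsyslog is a one-line drop-in. A hedged sketch: "loghost" is a placeholder for your logging database host, and the temp file stands in for a real file under /etc/rsyslog.d/.

```shell
# Sketch: generate a minimal rsyslog forwarding drop-in.
# "loghost" is a placeholder; on a real node, install this as
# /etc/rsyslog.d/90-forward.conf and restart rsyslog.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# single @ = UDP, double @@ = TCP
*.* @@loghost:514
EOF
cat "$conf"
```

Since the OS file system is ephemeral, everything you care about keeping leaves the node this way before a reboot wipes it.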

Second, configs. There are many ways to attack this problem, and SIOS allows you to use any/all/some of them. That is, we aren’t opinionated about which tool you want to use (yet). This will change, as we want the config to come from the distributed database, so we’ll have more of a framework in place soon for this, with a DB client handling things.

Right now, we create a package (script and/or tarball usually, but we are looking at doing this with containers) which has things pre-configured. Then we copy the script/tarball/container to the right location after boot and network config, and proceed from there.

I should note that our initial network configuration is generally considered to be ephemeral. We configure networking on most of our units this way via programmatic machinations. This allows us to have very complex and well-tuned networking environments dynamically altered, with a single script/tarball/container effecting this. It enables us to trivially configure/distribute optimized Ceph/GlusterFS/BeeGFS/Lustre configs (and maybe GPFS some day).
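The script/tarball flow is easier to see than to describe. Here is a self-contained toy version of it; the paths, names, and trivial setup script are made-up stand-ins for the real pre-configured payload, and the local copy stands in for the fetch over the network.

```shell
# Sketch: build a config payload, then do what a node does after boot:
# fetch it (here: just a local copy), unpack it, and run its setup script.
stage=$(mktemp -d)                     # stand-in for the admin/build side
mkdir -p "$stage/payload"
printf '#!/bin/sh\necho node configured\n' > "$stage/payload/setup.sh"
chmod +x "$stage/payload/setup.sh"
tar -czf "$stage/config.tar.gz" -C "$stage" payload

node=$(mktemp -d)                      # stand-in for the booted node
cp "$stage/config.tar.gz" "$node/"     # real nodes would scp/http this
tar -xzf "$node/config.tar.gz" -C "$node"
"$node/payload/setup.sh"
```

The same shape works whether the payload is a tarball of config files, a script, or (eventually) a container image.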

As I noted, the base distro is generally a detail. One we try to ignore, but sometimes, when we have to put in lots of time working around engineered-in breakage and technical debt in the distro … it's less fun.

More soon.
