Every now and then you get an eye opener

This one is while we are conditioning a Forte NVMe unit, and I am running our OS install scripts. Running dstat in a window to watch the overall system …

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  2   5  94   0   0   0|   0    22G| 218B  484B|   0     0 | 363k  368k
  1   4  94   0   0   0|   0    22G| 486B  632B|   0     0 | 362k  367k
  1   4  94   0   0   0|   0    22G| 628B  698B|   0     0 | 363k  368k
  2   5  92   1   0   0| 536k  110G| 802B 2024B|   0     0 | 421k  375k
  1   4  93   2   0   0|   0    22G| 360B  876B|   0     0 | 447k  377k

Wait … is that 110GB/s (2nd line from bottom, in the writ column)? Wow …

Likely a measurement oddity. But it made me do a double take.
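
A quick way to sanity-check a sample like that, as a minimal sketch: convert the dstat column values to bytes and flag anything far above the median. The function names and the 3x threshold are my own illustrative choices, and the 1024-based unit suffixes are an assumption about how dstat displays values.

```python
import statistics

# Unit multipliers; assuming 1024-based suffixes (an assumption about dstat's display).
UNITS = {"B": 1, "k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(field):
    """Convert a dstat-style value such as '22G', '536k' or '0' to bytes."""
    field = field.strip()
    if field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)

def flag_outliers(samples, factor=3.0):
    """Return indices of samples more than `factor` times the median."""
    values = [to_bytes(s) for s in samples]
    median = statistics.median(values)
    return [i for i, v in enumerate(values) if median > 0 and v > factor * median]

# The writ column from the five samples above.
writ = ["22G", "22G", "22G", "110G", "22G"]
print(flag_outliers(writ))
```

With the five samples above, only the 110G sample sits more than 3x over the 22G median, so it is the one that gets flagged.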

new SIOS feature: compressed ram image for OS

Most people use squashfs which creates a read-only (immutable) boot environment. Nothing wrong with this, but this forces you to have an overlay file system if you want to write. Which complicates things … not to mention when you overwrite too much, and run out of available inodes on the overlayfs. Then your file system becomes “invalid” and Bad-Things-Happen(™).

At the day job, we try to run as many of our systems out of ram disks as we can. Yeah, it uses up a little ram. And no, it’s not enough to cause a problem for our hyperconverged appliance users.

I am currently working on the RHEL 7/CentOS 7 base for SIOS (our Debian 7 and 8 bases already work perfectly, and our Ubuntu 16.04 base is coming along as well). Our default platform is the Debian 8 base, for many reasons (engineering, ease of support, etc.).

SIOS, for those who are not sure, is our appliance OS layer. For the most part, it’s a base Linux distribution, with our kernel and our tools layered atop. It enables us to provide the type of performance and management that our customers demand. The underlying OS distro is generally a detail, and not a terribly relevant one, unless there is some major badness engineered into their distro from the outset.

SIOS provides an immutable booting environment, in that all writes to the OS file system are ephemeral. They last only for the lifetime of the OS up time. Upon reboot, a pristine and correct configuration is restored.

This is tremendously powerful, in that it eliminates the roll-back process if you mess something up. Even more so, it completely eliminates boot drives from the non-administrative nodes in your system. And with the other parts of SIOS Tiburon tech, we have a completely decentralized and distributed booting and post-boot configuration system. All you need are working switches.

More on that in a moment.

First the cut-n-paste from a machine.

ssh root@n01-1g 
Last login: Wed Apr 27 18:22:18 2016 from unison-poc

[root@usn-ramboot ~]# uptime
 18:46:17 up 25 min, 2 users, load average: 0.46, 0.34, 0.38

[root@usn-ramboot ~]# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/zram0 7.8G 6.2G 1.6G 80% /

[root@usn-ramboot ~]# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core)


Yes, that is a compressed ram device. With an ext4 file system atop it. It looks like it is using 6.2GB of space …

[root@usn-ramboot ~]# cat /sys/block/zram0/orig_data_size

… but the actual compressed size is 3.3GB

[root@usn-ramboot ~]# cat /sys/block/zram0/compr_data_size

I know zram was more intended for swap operation than OS drives. But it’s a pretty nice use of it. FWIW, the squashfs version is a little larger, and the mount is more complex.
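
For anyone wanting to turn those two sysfs counters into a compression ratio, here is a minimal sketch. It assumes the orig_data_size and compr_data_size files shown above are present (newer kernels, as far as I know, report the same counters through a single mm_stat file instead), and the function name is just illustrative.

```python
from pathlib import Path

def zram_ratio(dev_dir):
    """Return (original_bytes, compressed_bytes, ratio) for a zram sysfs directory."""
    base = Path(dev_dir)
    orig = int((base / "orig_data_size").read_text().strip())
    compr = int((base / "compr_data_size").read_text().strip())
    ratio = orig / compr if compr else float("inf")
    return orig, compr, ratio

if __name__ == "__main__":
    sysfs = Path("/sys/block/zram0")
    if (sysfs / "orig_data_size").exists():
        orig, compr, ratio = zram_ratio(sysfs)
        print(f"{orig / 2**30:.1f} GiB stored in {compr / 2**30:.1f} GiB ({ratio:.2f}x)")
    else:
        print("no zram0 with orig_data_size/compr_data_size on this machine")
```

Going by the rough 6.2GB and 3.3GB figures above, that works out to a ratio of about 1.9x.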

Now for the people clamoring on how we preserve useful state, like, I dunno, config files, logs, etc.

First, logs. Should always be forwarded if possible to logging databases for analysis/event triggering. You can use something as simple as rsyslog for this, or more robust like SIOS-metrics (to be renamed soon). Not everything goes through the syslog interface, and some subsystems like to write to strange places, or try to be your logging stack (systemd anyone?). SIOS-metrics will be including mechanisms to help vampire out the data from the tools that are currently hoarding/hiding it. This includes BTW, reluctant hardware like RAID cards, HBAs, IB HCA, etc.

Second, configs. There are many ways to attack this problem, and SIOS allows you to use any/all/some of them. That is, we aren’t opinionated about which tool you want to use (yet). This will change, as we want the config to come from the distributed database, so we’ll have more of a framework in place soon for this, with a DB client handling things. Right now, we create a package (script and/or tarball usually, but we are looking at doing this with containers) which has things pre-configured. Then we copy the script/tarball/container to the right location after boot and network config, and then proceed from there. I should note that our initial network configuration is generally considered to be ephemeral. We configure networking on most of our units this way via programmatic machinations. This allows us to have very complex and well tuned networking environments dynamically altered, and a single script/tarball/container effecting this. It enables us to trivially configure/distribute optimized Ceph/GlusterFS/BeeGFS/Lustre configs (and maybe GPFS some day).

As I noted, the base distro is generally a detail. One we try to ignore, but sometimes, when we have to put in lots of time to work around engineered-in breakage and technical debt in the distro … it’s less fun.

More soon.

there are times

that try my patience. Usually with poorly implemented filtering tools of one form or another.

The SPF mechanism is to provide an anti-spoofing system, which identifies which machines are allowed to send email in your domain name.
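
To make that concrete, here is a minimal sketch of how an SPF TXT record breaks down into qualifiers and mechanisms. The example record is made up for illustration (it is not our actual record), the function name is hypothetical, and modifiers such as redirect= are not handled.

```python
# SPF qualifier characters and the result each one yields on a match.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def parse_spf(record):
    """Split an SPF TXT record into (result, mechanism, value) tuples."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        qualifier = "+"  # a mechanism with no qualifier defaults to pass
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        parsed.append((QUALIFIERS[qualifier], name, value))
    return parsed

# Illustrative record only: one allowed network, one include, softfail the rest.
for entry in parse_spf("v=spf1 ip4:192.0.2.0/24 include:_spf.google.com ~all"):
    print(entry)
```

A receiving server walks these left to right and applies the result of the first mechanism that matches the sending machine.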

The tools that purport to test it? Not so good. I get conflicting answers from various tools for a simple SPF record. The online tester (interactive) seems to work and show me my config is working nicely.

The email tester, shows it is working nicely.

The spf policy framework for postfix goes ::shrug::

Some corporate SPF framework with minimal visibility, and no support for non-customers (the ones whose email it is misclassifying) claims there is a problem.

The DKIM bits seem to work. I’ve not set up DMARC (yet) though I might.

Curiously, all of this is for the $dayjob using the google mail system. For this system, no such issues. Everything seems to work.

Honestly, I think it is time for people to set up an emailtest@domainname so that it becomes easy to diagnose problems with legitimate email. I just wrestled with an earlier header problem (which wasn’t our problem per se, but I am trying to be helpful). Now I have other folks simply rejecting mail for no apparent reason.

Stuff like this wastes my time/effort, and makes technology far less fun.

I have more important things to do than to waste on this.

Of course, this means more work ahead

Our client code that pulls configuration bits from a boot server works great. But the config it pulls is distribution specific. Where we need to be is distribution/OS agnostic, and set things in a document database. Let the client convert the configuration into something OS specific.

This is, to a degree, a solved problem. Indeed, etcd is just a modern reworking of what we did with the client code … using a fixed client (i.e. no code) and just doing database queries for information. Then they take the key value pairs and they do something with them. There’s nothing terribly special about that. We’ve been doing a more flexible version of this for years (not simply a kv store).

What I am envisioning now is more interesting … think a replicated/distributed document database, client code that can read the document database, as well as a transformation database that maps OS specific things and the configuration document database into a consistent and repeatable control mechanism.
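
As a rough illustration of that idea (a sketch only, not our actual client code): a distribution-agnostic config document, plus a small transformation table that maps it onto OS specific files. The schema, the names, and the in-memory dicts standing in for the databases are all hypothetical.

```python
# Distribution-agnostic description of one network interface (hypothetical schema).
CONFIG_DOC = {"iface": "eth0", "address": "10.0.0.5", "prefix": 24, "gateway": "10.0.0.1"}

def render_debian(doc):
    """Render an /etc/network/interfaces style stanza."""
    return (
        f"auto {doc['iface']}\n"
        f"iface {doc['iface']} inet static\n"
        f"    address {doc['address']}/{doc['prefix']}\n"
        f"    gateway {doc['gateway']}\n"
    )

def render_rhel(doc):
    """Render an ifcfg style key=value file."""
    return (
        f"DEVICE={doc['iface']}\n"
        "BOOTPROTO=none\n"
        "ONBOOT=yes\n"
        f"IPADDR={doc['address']}\n"
        f"PREFIX={doc['prefix']}\n"
        f"GATEWAY={doc['gateway']}\n"
    )

# Stand-in for the "transformation database": OS family -> (path template, renderer).
TRANSFORMS = {
    "debian": ("/etc/network/interfaces.d/{iface}", render_debian),
    "rhel": ("/etc/sysconfig/network-scripts/ifcfg-{iface}", render_rhel),
}

def transform(doc, os_family):
    """Return (target path, file contents) for one config document on one OS family."""
    path_template, renderer = TRANSFORMS[os_family]
    return path_template.format(**doc), renderer(doc)

for family in ("debian", "rhel"):
    path, body = transform(CONFIG_DOC, family)
    print(f"# {path}\n{body}")
```

The point of splitting it this way is that the document never changes when the base distro does; only the transformation entries do.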

I’ll likely have to disable chunks of systemd, or work on having it talk to this system and grab units/services from it. Or just work around it, by giving it the bare minimum work to do, and then taking over after it’s done.

Past experience has been that in-distro control planes are often geared towards very different use cases than what we want, and attempting to build in the functionality we need to their control planes is an exercise in futility.

A good example is the whole booting process. Each distro has some concept of a “livecd” type boot which is conceptually similar to what we do, though we realized that the overlay mounts could be problematic. So we built real ramdisks, and unpacked a file system into them, rather than using a squashfs and an overlay. It is possible with the squashfs and overlay to wind up in a situation where your system cannot boot due to an overlay inode full scenario or similar. This is unacceptable.

Our ramdisk method will boot given sufficient RAM (currently 8GB or so, but this is our kitchen sink build … as in “everything, including the kitchen sink”). The OS in the ramdisk is mutable, but the mutations are non-durable. So you can inflict tremendous damage to the running OS image. And completely fix it with a reboot. With the overlay, by contrast, the durable nature of the writes makes it possible to have a persistent broken state. Which is unacceptable.

Happily our on-disk OS is installed using the same tools as our ramdisk booted version. No config differences apart from durable boot drives.

In the short term, I can rewrite the OS specific setup (networking, etc.). In the longer term, I’ll get to work on the config doc architecture and clients.

Very preliminary RHEL7/CentOS7 SIOS base support

This is rebasing our SIOS tech atop RHEL7/CentOS7. Very early stage, pre-alpha, lots of debugger windows open … but …

[root@usn-ramboot ~]# cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 

[root@usn-ramboot ~]# uname -r

[root@usn-ramboot ~]# df -h /
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           8.0G  4.7G  3.4G  59% /

Dracut is giving me a few fits, but I’ve finished that side for the most part, and am now into debugging the post-pivot environment.

Good times, as we can use the rebase for our RHEL7 appliances for Ceph and GlusterFS as well as others (RHEV/OpenShift).

Best practice or random rule … diagnosing problems and running into annoyances

As often as not, I’ll hear someone talk about a “best practice” that they are implementing or have implemented. Things that run counter to these “best practices” are obviously, by definition, “not best”.

What I find sometimes amusing, often alarming, is that the “best practices” are often disconnected from reality in specific ways. This is not a bash on all best practices; some of them are sane, and real. Like not allowing plain text passwords for logins. Turning on firewalls, restricting access to services/people/data to whatever/whoever needs it.

Best practices, the real ones, should be, generally, a small, focused set of rules that will address specific or general issues. The pain of implementing them, either within the organization, or external to it, is worth the investment of time, as they provide a definable, and measurable benefit.

Most security best practices are like this. Most.

Not all.

There is another class of “best practices” which is better called security theatre. These practices do little to nothing to improve security, yet they are often implemented.

This isn’t simply a security issue … there are all manner of thespian-focused as opposed to pragmatic practices. I’ve seen people claim mantles of providing “best practices” (of the thespian kind) as a way to differentiate their non-differentiated services offerings, simply because they wanted to do something their own way. Rather than make a hard core intellectual argument as to why their method is better, they try to stop the discussion by anointing it as “best”.

Thespian to the core. These practices often aren’t best, and as often as not, aren’t even good. Sometimes downright frighteningly bad.

I ran into one of the thespian security “best practices” recently. I won’t go into specifics, but the net result was that the group had problems with a standard header produced by our ticketing system. The one we’ve been using for years without problem. They objected to a specific format. Claiming, of course, that it was “best practice” not to do this.

Which is absurd at best.

But, I like to try to keep customers happy. Even if we aren’t wrong, I’ll look into accommodating their request.

So I spent some time working on automagic rewriting of the header. I had to do this in such a way that it didn’t break everything else. Got it working, but somehow I think we’ll find it does little to solve the problem we were observing.

Yeah … about those thespian “best practices”. Everyone likes the mantle of using them. Like all rules, you should use the smallest, most consistent set.

Complexity and insecurity arise from complicated rule sets of dubious value. It’s better to understand why a rule is in place, than to accept that it should be in place … lest you break something that is working … specifically to use a “best practice”. Which isn’t best.

Attempting, and to some degree, failing, to prevent a user from accruing technical debt

We strive to do right by our customers. Sometimes this involves telling them unpleasant truths about choices they are going to make in the future, or have made in the past. I try not to overly sugar coat things … I won’t be judgemental … but I will be frank, and sometimes, this doesn’t go over well.

During these discussions, I often see people insisting that their goal is X, but the steps Y to get there, will lead them to Z, which is not coincident with X. That is, they are optimizing for a different thing.

I ask about this. As often is the case, the optimization is constrained, so Z may be the best they can actually achieve. I understand that. I point out the differences, and ask them to help me understand what are the most important features, so we can look at lower level optimizations that might be able to move closer to the business goal.

X is usually a business goal. Z is usually an endpoint which isn’t at X, and there are ramifications in terms of being there at Z versus being at X. If those ramifications will have little to no impact on the business goals and objectives, then Z is fine. If they do, then what are the most important factors to consider? What will have the largest overall impact upon business? Basically I am asking what do you need versus what do you want, and then how can we deliver on what you need, or a very close approximation to it?

So where does technical debt accrual come in? Technical debt is first and foremost an opportunity cost. It is the cost of choosing an alternative path (the steps Y’ when you should choose Y). The debt is back loaded, in that you don’t start paying interest and premium until it really starts to matter. And then the technical debt has both a large ticking clock, and a real actual expenditure associated with it.

Technical debt arises in picking other optimizations over the one you actually need for your business. It arises in making alternative choices to the choice you should make for reasons that do not wind up returning much on the difference in up-front cost of those choices. It arises in using other non-tangible, non-measurable, and to an extent, largely irrelevant parameters than suitability to task as the basis for a decision (or step Y).

We see technical debt accrual happen when the optimization changes from “let’s build the system we need” to “let’s build the cheapest possible system.” Or “let’s build the system we need” to “let’s build it in the cloud.”

I am not saying the cloud is bad here. I am saying that each new endpoint changes what you should expect as the outcome, and your performance, workflow processes, and costing model will also change, sometimes drastically so (more often than not, not for the better). I could write a few posts on people who have made such decisions, and then run away screaming later. But that’s not the focus of this.

As you make decisions along your paths to these endpoints, you accumulate technical debt that you are going to have to repay at some point (you can’t escape this), and oftentimes the objective gets lost in the process.

I won’t go into specifics, but I’ve seen (recently) some terrible payback of technical debt, and corresponding negative expectations as a result of this. All because of specific front end optimizations that sought to maximize or minimize one particular feature, which, over the lifetime of the project, was basically irrelevant. But it scored political points for the people.

The backloaded payback … not so much fun. Getting calls at 2am to help fix (self inflicted) problems that the technical debt payback resulted in? Also not so much fun.

Basically I am saying, don’t optimize for the wrong thing. Keep the business objectives in mind when you chart out the path you need. Avoid hyperoptimization of one aspect, as it will have ramifications further down the line.

In grad school, my advisor would have me hyperoptimize our parts ordering (I built many of our machines by hand in the early 90s). I would spend a few hours to shave a few dollars. One thing I learned during this time was that not all parts are created equal. That price optimization can have surprising (and very negative) impacts when you want warranty coverage, or support, or … . I learned that this focus on one aspect (price) when we cared about another aspect (performance and reliability) was misguided. I also learned that brand names were irrelevant. What mattered was the commitment of the vendor to provide a good product, stand behind it, and help you when you needed it. You don’t get that when you hyperoptimize.

There is a cost to every decision. I spent many many hours making the initial decisions, and many weeks/months regretting some of them. All in all a bad use of my time. Yet I was “rewarded” for this.

I see many of the same things happening now. With many of the same outcomes. And I try to warn people of the risks of their choices. As long as they go in with open eyes, hey, they can make their choices. I don’t like the blame game when the technical debt comes due. So I like the paper trail my emails provide. And I try my best to provide some mechanism so they can escape the ravages of the worst of the debt.

But I don’t always win those arguments. And sometimes they have to pay the debt, in full, and quite quickly. “I told you so” isn’t the right course, but “hey, here’s a plan to mitigate these issues” is.

When spam bots attack

I’ve been fixing up a few mail servers to be more discriminating over their connections. And I’ve noted that I didn’t have any automated tooling to block the spammers. I have lots of tooling to filter and control things.

So I wrote a quick log -> ban list generator. Not perfect, but it seems to work nicely.
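
The general shape of such a generator, as a minimal sketch and not the script I actually wrote: count rejected connections per client IP in the mail log, and emit the addresses that cross a threshold. The Postfix-style reject line format, the threshold of 3, and the sample addresses (documentation ranges) are all assumptions for illustration.

```python
import re
from collections import Counter

# Matches the bracketed client IPv4 address in Postfix-style reject lines (assumed format).
REJECT_RE = re.compile(r"reject: \w+ from \S*\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def ban_list(lines, threshold=3):
    """Return the sorted IPs that have at least `threshold` rejected attempts."""
    hits = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return sorted(ip for ip, count in hits.items() if count >= threshold)

# Made-up sample log lines, three rejects from one client and one from another.
SAMPLE = [
    "Apr 27 18:22:18 mx postfix/smtpd[1234]: NOQUEUE: reject: RCPT from unknown[203.0.113.7]: 554 5.7.1 Relay access denied",
    "Apr 27 18:22:19 mx postfix/smtpd[1234]: NOQUEUE: reject: RCPT from unknown[203.0.113.7]: 554 5.7.1 Relay access denied",
    "Apr 27 18:22:20 mx postfix/smtpd[1235]: NOQUEUE: reject: RCPT from mail.example.net[198.51.100.2]: 450 4.7.1 Try again later",
    "Apr 27 18:22:21 mx postfix/smtpd[1234]: NOQUEUE: reject: RCPT from unknown[203.0.113.7]: 554 5.7.1 Relay access denied",
]
print(ban_list(SAMPLE))
```

The output is just a list of addresses; feeding it to a firewall or an access map is a separate step.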

Like I don’t have enough to do this week. /sigh

Meetings tomorrow starting at 8am.

Why sticking with distro packages can be (very) bad for your security

I’ve been keeping a variety of systems up to date, updating security and other bits with zealous fervor. Security is never far from my mind, as I’ve watched bad practices being used at customers resulting in any number of things … from minor probes through, in one case (a grad student impacted by a Windows key logger), taking down a Linux cluster, but not before knocking the university temporarily off the internet.

Yeah, that last one was a doozy. We had warned the user what not to do, and they ignored us. Multiple times. They wound up messing up many other people in the process. And then they got angry with us when we refused to fix their problems (again) for free (again).

So with this backdrop, I’ve been working on keeping all of our sites up to current best practices. I am not a security expert by any stretch … though I have the requisite paranoia … driven in many cases by the scary logs I’ve collected over the years. I pay a great deal of attention to what people who focus upon this say. And I leverage a number of tools to test our sites. I am not doing hard core pen testing of our systems (yet), but this is probably going to be on the radar.


So here is the problem.

Using a distro’s most up-to-date packages, the nginx web server didn’t support TLSv1.2. Which, if you read the Qualys SSL Labs pages, is the only really secure protocol. The testing tools all seemed to indicate that there was a problem.

As Google is your friend, I did some checking, and sure enough, people in 2012 (!!!) had noticed that the nginx version was linked against an older ssl. 0.9.8 or something like this. In 2016, it’s not like, I dunno … possible? I mean, with all the SSL attacks, wouldn’t a stable distro actually have links against modern kit?

This is one of the reasons, but not the primary reason, why we provide our own toolchain. Long experiences with distros shipping badly outdated tools (some programming language revisions being End-of-Lifed years before the product shipped!!!) led us to start doing this. As the complexity of our needs increased, and our strong dependency upon working modern and patched languages and toolsets grew, it became harder and harder to use the distro tools. We couldn’t just replace them, as the distros often build effective dependencies upon the older broken tools.

So tonight, I am looking at a low grade on a web server, after I updated the certificate. I was stumped. I had checked the ssl bits, and they were correct. The same as the servers getting the high grades.

What could be the issue?

A quick

ldd `which nginx`

showed me this:

	libssl.so.0.9.8 => /lib/x86_64-linux-gnu/libssl.so.0.9.8 (0x00007f5d93b02000)
	libcrypto.so.0.9.8 => /lib/x86_64-linux-gnu/libcrypto.so.0.9.8 (0x00007f5d93774000)

Oh … my.
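
If you want to automate that check across a few servers, here is a minimal sketch that pulls the libssl/libcrypto soname versions out of ldd output. The function name is illustrative. For reference, TLSv1.2 support arrived in OpenSSL 1.0.1, so anything reporting 0.9.8 cannot offer it.

```python
import re
import shutil
import subprocess

# Matches lines such as "libssl.so.0.9.8 => /lib/.../libssl.so.0.9.8 (0x...)".
LIB_RE = re.compile(r"^\s*(libssl|libcrypto)\.so\.(\S+)\s+=>")

def linked_openssl(ldd_output):
    """Map 'libssl'/'libcrypto' to the soname version found in ldd output."""
    found = {}
    for line in ldd_output.splitlines():
        match = LIB_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found

if __name__ == "__main__":
    nginx, ldd = shutil.which("nginx"), shutil.which("ldd")
    if nginx and ldd:
        out = subprocess.run([ldd, nginx], capture_output=True, text=True).stdout
        print(linked_openssl(out))
    else:
        print("nginx or ldd not found on this machine")
```

Fed the two lines above, it reports 0.9.8 for both libraries.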

Ok. So my choices are to either use the package provider’s package, or build from source. Let’s try the former first.

A quick install, and restart later, and my server gets an excellent score.

I wonder, for all the nice “stable” packages out there in the “stable” distros, what level of insecurity they are willing to tolerate?

I don’t want to take ownership for lots of the distro stack … I like distros with well thought out and implemented tools (Debian and the like), well thought out overall theses (Alpine). I want them to provide modern packages with up to date security. I do not want … absolutely positively … do not want … backports of “security patches” which provide an illusion of security, while not actually providing the security.

This is what I was running into. The incessant updating? Check. Security? Possibly, but only because I am paranoid about it. Not because the packages have it built in. They don’t.

I am starting to think that most of the packages that open ports to listen to network traffic might need to be paid very close attention to. Far more than simply updating.

Personally I think the illusion of security may be as bad as, if not worse than no security. As if you believe you are secure, and you really aren’t, you may discredit reports that indicate you have issues, real issues, when you shouldn’t.

Not-so-modern file system errors in modern file systems

On a system in heavy production use, using an underlying file system for metadata service, we see this:

kernel:  EXT4-fs warning: ext4_dx_add_entry:1992: Directory index full!

Ok, where does this come from?

Ext3 had a limit of roughly 32000 subdirectories per directory, a consequence of its link count limit.

Ext4 theoretically has no limit. Well, it’s about 65000 subdirectories if you don’t use dir_nlink. The dir_index feature, which we do use, is about fast lookups in large directories. Really the feature you want is dir_nlink.

  -O [^]feature[,...]
              Set or clear the indicated filesystem features (options) in  the
              filesystem.   More than one filesystem feature can be cleared or
              set by separating features  with  commas.   Filesystem  features
              prefixed  with  a  caret  character ('^') will be cleared in the
              filesystem's superblock; filesystem features  without  a  prefix
              character  or prefixed with a plus character ('+') will be added
              to the filesystem.

        The following filesystem features can be set  or  cleared  using
        tune2fs:

                   dir_index
                          Use  hashed  b-trees  to  speed  up lookups in large
                          directories.

                   dir_nlink
                          Allow more than 65000 subdirectories per directory.

So, obviously we have to turn this on, right? Before we do that, a quick tune2fs -l /dev/$dev to see what is currently in place:

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype
                          needs_recovery flex_bg sparse_super large_file huge_file 
                          uninit_bg dir_nlink extra_isize

So … it’s already on? And not working?
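
For checking this across a pile of file systems, a minimal sketch that extracts the feature list from tune2fs -l output and tests for the ones of interest. The function name is illustrative, and it accepts both a single long features line and the wrapped layout shown above.

```python
def filesystem_features(tune2fs_output):
    """Return the set of feature names from `tune2fs -l` output."""
    features = set()
    in_features = False
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            in_features = True
            features.update(line.split(":", 1)[1].split())
        elif in_features and line[:1].isspace():
            # Indented continuation line (as wrapped in the listing above).
            features.update(line.split())
        else:
            in_features = False
    return features

LISTING = """Filesystem features:      has_journal ext_attr resize_inode dir_index filetype
                          needs_recovery flex_bg sparse_super large_file huge_file
                          uninit_bg dir_nlink extra_isize
"""
features = filesystem_features(LISTING)
for wanted in ("dir_index", "dir_nlink"):
    print(wanted, "on" if wanted in features else "off")
```

On the listing above, both dir_index and dir_nlink come back as on, which matches what I saw by eye.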

Sometimes you gotta say whiskey tango foxtrot.

Yet another reason to use xfs and ditch ext*.

(n.b. our new SIOS v2 images will also let you build/use zfs file systems, by building and installing the kernel module needed for this upon demand … so yes, we could use zfs as well)
