Seagate and ClusterStor: a lesson in not jumping to conclusions based on what was not said

I saw this analysis this morning on the Register’s channel site. It follows the announcement of other layoffs and the shuttering of facilities.

A few things. First, a disclosure: arguably, the day job, and more specifically our Unison product, is in “direct” competition with ClusterStor, though we never see them in deals. This may or may not be a bad thing, and is likely more due to market focus (we do big data, analytics, insanely fast storage in hyperconverged packages) than anything else. SGI, HPE, and Cray all resell/rebrand ClusterStor under their own branding.

That out of the way, this is speculation on the part of the article. Granted, they are reading into what is, and is not, being said … spokespeople tend to choose words carefully, and work to “correct” (aka spin) what they perceive as an incorrect read on the matter. Indeed, Ken Claffey of Seagate strove to correct this in the first comment.

Even more to the point, the article itself wasn’t updated, but there is a new article indicating precisely this.

Short version: They are fine, just moving production elsewhere.

This actually highlights a danger in our very high frequency world. “Information” gets out into the wild, and it takes someone’s time/effort and a number of resources to bring this “information” to the point of being correct. I have no reason to disbelieve them … large companies move people/processes about all the time, specifically to leverage economies of scale and better cost structures elsewhere.

In the 1980s or so, IBM used to be (internally) nicknamed “I’ve Been Moved”.

I think the issue was assuming that the woes in the PC drive space extend to the enterprise/high performance space. I don’t think they do. Seagate may or may not choose to break out revenues/costs associated with each business unit; likely they provide some of this in their investor relations material.

I think it unlikely that they would have gone on the spending spree they have in this space, and then just shutter it when the PC space contracts.

All this said, in the bigger picture, the storage market is changing dramatically and quickly. Spinning disk is not necessarily toast, but in many designs we’ve worked on it is being relegated to the role tape has traditionally played. This is a fairly fundamental change. But remember, tape is still with us now. Think very long tail: very large volumes of data that cannot be effectively moved from tape to disk. Disk to SSD/NVM is possible, though I think disk still has a longer shelf life than NVM.


Systemd and non-desktop scenarios

So we’ve been using Debian 8 as the basis of our SIOS v2 system. Debian has a number of very strong features that make it a fantastic basis for developing a platform … for one, it doesn’t have the significant negative baggage/technical debt, from poor design decisions made early in a system’s development, that others do.

But it has systemd.

I’ve been generally non-committal about systemd, as it seemed like it should improve some things, at a fairly minor cost in additional complexity. It provides a number of things in a very nice and straightforward manner.

That is … until … you run into the default config scenarios. These will leave you, as the server guy, asking “seriously … whiskey tango foxtrot???!?”

Well, ok, some of these are built atop Debian, so there is blame to share.

The first is the size of tmpfs (ramdisks). By default, this is controlled in early boot (and not controllable via a kernel boot parameter) by the contents of /etc/default/tmpfs. In it, you see this:

TMPFS_SIZE=20%VM

as the default. That is, each tmpfs you allocate will get 20% of your virtual memory total as its size, unless you specify a size. And as it turns out, this is actually a bad thing. The /run directory is allocated early in the boot, is not governed by /etc/fstab (not necessarily a bad thing, as the fstab is a control point), and has no other control points …

root@unison:~# df -h /run
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            13G  2.5M   13G   1% /run

root@unison:~# grep run /etc/fstab
root@unison:~# 

Hey, look at that. It’s 13GB for a /run directory that would struggle to ever be 1GB.

Ok, it’s tmpfs, so the allocation isn’t locked. But it is backed by swap.

UNLESS YOU TURN SWAP OFF IN WHICH CASE AAAARRRRRRGGGGGHHHHH

So … to recap … Whiskey Tango Foxtrot?

But, before you get all “hey, relax dude, it’s just one mount … chillax” … you have to ask about the interaction with other systemd technology (/run is mounted by systemd … oh yes … it is).

Like, I dunno. Logind mebbe?

So there you are. Logging into your machine. And you notice, curiously, you have this whole /run/user/$UID thing going on. And if you look closely enough, you see these are tmpfs mounts. And they are each getting 20% of VM.

Starting to see the problem yet? No?

Ok. So you have these defaults … And a bunch of users. Who log in. And use up these resources.
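You can watch this happen on a live box with nothing fancier than df:

df -h -t tmpfs
# look for the /run/user/ entries piling up, one 20%-of-VM tmpfs per logged-in user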

Now, to add complexity, let’s say you have a swapfile rather than a swap partition. I am not a huge believer in swap … rather the opposite. If you are swapping, this is a strong signal you need more memory. If swapping is very rare (once a month, on a non-critical system), sure, swapping is fine. If it is a daily occurrence under load on a production box, you need to buy more memory. Or tune your processes so they don’t need so much memory.

This swapfile is sitting atop a file system. This is a non-optimal scenario, but the user insisted upon swap, so we provided it. It is a failure waiting to happen, as filesystem IO requires memory allocations, which, if you think about what swap is/does, will be highly problematic in the context of actually swapping. That is, if you need to allocate memory in order to page out to a disk, because you are trying to allocate memory … let’s just say that this is the thing livelocks are made of.

And, of course, to make things worse, we have a caching layer between the physical device and the file system. One we can’t turn off completely. The caching layer also does allocations. With the same net effect.

Now that I’ve set out the chess pieces for you, let me explain what we’ve seen.

6 or 7 users log in. These tmpfs allocations are made. No swap. vm.overcommit=0. Failure. Ok, add swap. Change vm.overcommit=1. Make the allocatable percentage 85% rather than 50%. Rinse. Repeat.

Eventual failure.

Customer seriously questioning my sanity.

All the logs are showing allocation problems, but no swap. Change to vm.overcommit=2. Stamp a big old FAIL across any process that wants to overallocate. Yeah, it will catch others, not unlike the wild west of OOM killer, but at least we’ll get a real signal now.
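For reference, assuming my shorthand above maps to the stock Linux sysctls (vm.overcommit_memory and vm.overcommit_ratio), the settings in play are:

# /etc/sysctl.d/99-overcommit.conf (or apply live with sysctl -w)
vm.overcommit_memory = 2    # 0 = heuristic, 1 = always grant, 2 = strict accounting
vm.overcommit_ratio = 85    # in mode 2, the commit limit is swap + this percent of RAM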

… and …

who authorized 20% of RAM for these logins? The failures seem correlated with them.

That’s the /etc/default/tmpfs defaults (which are insane). Ok, we can fix those (sketch below). But … still a problem, as logind thinks we should give this out.
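A sketch of the tmpfs-defaults side of the fix (check which of these variables your /etc/default/tmpfs actually documents; RUN_SIZE in particular is an assumption about your install):

# /etc/default/tmpfs -- cap sizes explicitly rather than taking 20% of VM
TMPFS_SIZE=256M
RUN_SIZE=256M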

Deep in the heart of darkness … er … /etc/systemd/ we find logind.conf. Which has this little gem.

RuntimeDirectorySize=20%

as its default.

Um.

Whiskey. Tango. Foxtrot.

This is where you put user temp files for the session.

Yeah … for Gnome, and other desktop use cases, sure, 20% may be reasonable for the vast majority of people.

Not so much for heavily used servers. For the same reasons as above.

Do yourself a favor, and if you have a server, change this to

RuntimeDirectorySize=256M

which may be overkill itself.
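In context, the change looks like this (restarting logind only affects sessions created afterward):

# /etc/systemd/logind.conf
[Login]
RuntimeDirectorySize=256M

systemctl restart systemd-logind   # new sessions pick up the new limit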

We really don’t need these insane (for server) defaults in place … which is why I am wondering what else in the systemd defaults I am going to have to fix to avoid surprises …

I’ll document them as I run into them. We are building the fixes directly into SIOS, so our users will have our updated firmware on reboot.


You can’t win

Like that old joke about the patient going to the Doctor for a pain …

Patient: Doctor, it hurts when I do this (does some action which hurts)
Doctor: Don’t do it then

Imagine, if you will, a patient who, after being told what is wrong, why it hurts, and what to do about it, continues to do it. And is more intensive about doing it. And then complains when it hurts.

This is a rough metaphor for some recent support experiences.

We do our best to convince them not to do the things that cause them pain, as in this case, they are self-inflicted.

I dunno. I try. I just don’t see any way to win here (i.e. for the patient in this case to come out ahead) until they make the changes that we recommended.


That was fun … no wait … the other thing … not fun

Long overdue update of the server this blog runs on. It is no longer running a Ubuntu flavor, but instead running SIOSv2 which is the same appliance operating system that powers our products.

This isn’t specifically a case of eating our own dog food, but more a case that Ubuntu, even the LTS versions, has a specific sell-by date, and it is often very hard to update to the newer revs. I know, I know … they have this nice, friendly, upgrade-me button on their updater. So it’s “easy”. I could quote Inigo Montoya here …

Ok, so roll in SIOSv2. Based upon Debian 8.x (there is a RHEL/CentOS version, but I am moving away from deploying those by default unless there is a customer request behind it, due to the extra effort in making everything work right; I might post on that sometime soon). Flip the OS disks. Reboot. Configure the network. Start up the VM.

The VM required that I import the disk and create a new config for it. In this regard, I really wish virsh behaved the same way as the VM system on SmartOS. For a number of reasons this unit couldn’t be a SmartOS box.

Ok. Had to fix the VM. Took about 10 minutes and done. Now name services and other things work. Yay.

Ok. Now install nginx and other bits for the blog. See, this is where containers would come in handy … and this unit is prepped and ready to go with two different container stacks (depending upon how I want to configure it later). But for the moment, we are building this as a monolith, with the idea of making it a microbox server later.

Install mysql and some php oddity, because WordPress.

Find my daily DB dump, import it, light up the blog and …

Everything is gone. Database connection error.

Ok.

Look at the DB dump. Looks a little small. Look for the blog stuff in it.

AND IT IS MISSING …. OMFG ….

Ok … what happened?

Didn’t I see some mysql error on a table a while ago? One I don’t use anymore in the blog? One that was corrupt?

Could that have blorked the dump?

Swap back to the old boot drives. Bring it up. Run mysqlcheck.

Sure enough, 1 broken table.

Ok, let’s fix it.
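The repair itself is simple enough (database and table names here are placeholders, not the actual blog schema):

mysqlcheck -u root -p --all-databases           # confirm which table is broken
mysqlcheck -u root -p --repair blogdb broken_table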

#insert "sounds_of_birds_and_crickets_chirping.h"

A while later, I redo the dump.

The 75MB file is now a 3.9GB file.

Yeah, was missing some data.

Grrrr… Bad mysql … Bad ….

Swap boot drives. Restart. Reimport. Rinse.

No repeat.

And it works.

Yay.


And this was a good idea … why?

The Debian/Ubuntu update tool is named “apt” with various utilities built around it. For the most part, it works very well, and software upgrades nicely. Sort of like yum and its ilk, but it pre-dates them.

This tool is meant for automated (e.g. lights out) updates. No keyboard interaction should be required.

Ever.

For any reason.

However … a recent update to one particular package, in Debian and in Ubuntu, has resulted in installations/updates pausing. Because the person who built the update decided that it would be really … really good … if there were a text pager in the update process. So the update pauses unless you quit the text pager, or page to the end of it.

That this is moronic is an understatement.

That this is wrong, minimizes how broken it is.

That this ever escaped QA boggles the mind.

Don’t make me interact with my fleet of machines for updates. Just … don’t.
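If you are driving updates across a fleet anyway, this is the flavor of workaround we lean on (a sketch; it forces non-interactive behavior generally, and may or may not defeat the particular pager this package decided to launch):

export DEBIAN_FRONTEND=noninteractive   # no debconf prompts
export PAGER=cat                        # if something insists on a pager, make it a no-op
apt-get -y \
  -o Dpkg::Options::="--force-confdef" \
  -o Dpkg::Options::="--force-confold" \
  dist-upgrade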

If you feel you must, well … hand over maintenance of your code base to someone who understands how completely wrong this is.

It is 2016. We’ve got automated tooling going across all of our systems. Our systems will break with a forced manual interaction. Which means someone either wasn’t thinking clearly, or was unaware that this is 2016.

/sigh


M&A: Vertical integration plays

Two items of note here. First, Cavium acquires QLogic. This is interesting at some levels, as QLogic has been a long time player in storage (and networking). There are many QLogic FC switches out there, as well as some older Infiniband gear (pre-Intel sale). Cavium is more of a processor shop, having built a number of interesting SoCs and general purpose CPUs. I am not sure the combo is going to be a serious contender to Intel or others in the data center space, but I think they will be working on carving out a specific niche. More in a moment.

Second, Samsung grabbed Joyent. This is Bryan Cantrill’s take on it, which is denser with the meat of the why, and less filled with (though there is some) marketing blather on synergies, culture, yadda yadda yadda. This is a move by Samsung mobile, one of the Samsung companies. Joyent is famous for starting the node.js project, as well as for its cloud, with Triton (data center as a container system), Manta (object storage that moves processing to the data for in-place computing … very similar in concept to what we’ve been pushing for the last decade), and of course SmartOS.

First off, I don’t see any of the dependency stack going away. Triton lives atop SmartOS. If anything, I see SmartOS benefiting from this massively, as Samsung may add weight to getting drivers operational on SmartOS. This is, IMO, an important weakness in SmartOS, and one that, I hope, will now be rectified. We were successful in convincing Chelsio to port to SmartOS/Illumos a few years ago, so we had a decent 10GbE driver. But I want 100GbE, and a few other things (NVMe, etc.) that I’d have to hire Illumos kernel devs for. Given Samsung’s focus on NVMe (not mobile, but the other folks), I’ll ping them about helping out with this … as NVMe on SmartOS + 100GbE would be AWESOME … (and for what it’s worth, the major siCloud installation we built a few years ago started out with SmartDC, and moved to Fifo for a number of reasons … but our systems/code all support SmartOS/SDC/Fifo, as long as we have working drivers).

Ok, bigger picture.

This is vertical integration in both cases. Bring more of the stack in-house, and focus on the value that these things can bring. Joyent + Samsung gives you DC-wide container engines. Great for mobile. But wildly awesome for other things (think of what OpenStack would like to do, which is already available with Triton). Then QLogic + Cavium gives a vertical integration play for a set of DC niches, in storage, in NPUs (possibly), in hyperscale systems …

Both of these are very interesting.


About that cloud “security”

Wow … might want to rethink what you do and how you do it. See here.

Put in simple terms, why bother to encrypt if your key is (trivially) recoverable?

I did not realize that side channel attacks were so effective. Will read the paper. If this isn’t just a highly over-specialized case, and is actually applicable to real-world scenarios, we’ll need to make sure we understand methods to mitigate it.


Ah Gmail … losing more emails

So … my wife and I have private gmail addresses. Not related to the day job. She sends me an email from there. It never arrives.

Gmail to gmail.

Not in the spam folder.

But to gmail.

So I have her send it to this machine.

Gets here right away.

We moved the day job’s support email address off Gmail (it’s just a reflector now) onto the same tech running inside our FW. Because it was losing mail, pissing off customers.

Though in one of those cases, the customer had a “best practice” rule (read as: a random rule implemented without a compelling real problem that it “solved”, or risk it “reduced” … i.e. it was a fad, and a bad one at that, one that likely caught MANY vendors up in it) that also messed with email.

It’s not that this is getting old. It’s that I am now actively looking at Gmail-based mail as a risk to be mitigated. As mail gets lost. With no logs to trace what happened.

So … do I want to spend the time to manage our own mail, or do I want to continue to lose mail? That is the business question. What is the value of the lost mail, or lost good-will due to the lost mail?


That moment in time where you realize that you must constrain the support people from doing anything other than what you direct them to do

This is Comcast. And my internet connection in my home office. The cable modem spontaneously started rebooting on me over the last few months. Looks like this started after they replaced my older cable modem, which was working nicely, with the new one … which isn’t.

First call-in this week: after it kicked out a whole bunch of times while I was working on customer machines with hard deadlines, they scheduled a tech, after I requested a replacement cable modem. They promised/swore he would have one with him, and would replace it.

Instead, he blamed filters outside the house (that Comcast had installed previously), which he removed.

This morning while working on a machine in the UK, and this afternoon while working on a machine in Ohio, it kicked out on me. Again, with hard timing deadlines (one was a bank, another a genomics medical site) on me to get it done.

Fed up, I called them back. On the phone now. Will insist they simply replace the box. They seem to get that this is an issue. Will see if they actually do this correctly.

Grrr…


Real scalability is hard, aka there are no silver bullets

I talked about hypothetical silver bullets in the recent past at a conference and to customers and VCs. Basically, there is no such thing as a silver bullet … no magic pixie dust, or magical card, or superfantastic software you can add to a system to make it incredibly faster.

Faster, better-performing systems require better architecture (physical, algorithmic, etc.). You really cannot throw a metric ton of machines at a problem and hope that scaling is simple and linear. Because it really never works like that. You can’t expect a pile of inefficient, cheap-and-deep machines to have any hope whatsoever of beating a very well architected, massively parallel IO engine at moving/analyzing data. It’s almost embarrassing how badly these piles of machines run IO/compute-intensive code, when their architecture effectively precludes performance.

Software matters. So does hardware.

What prompted this post (been very busy, but I felt I had to get this out) was this article on HN. I know it’s an older article, but the points made about implementation mattering in software for a distributed/scalable system matter just as much (if not more) for high performance hardware systems.
