That was fun … no wait … the other thing … not fun

Long overdue update of the server this blog runs on. It is no longer running an Ubuntu flavor, but instead runs SIOSv2, the same appliance operating system that powers our products.

This isn’t specifically a case of eating our own dog-food, but more a case that Ubuntu, even in its LTS versions, has a specific sell-by date, and it is often very hard to update to the newer revs. I know, I know … they have this nice, friendly, upgrade me button on their updater. So it’s “easy”. I could quote Inigo Montoya here.

Ok, so roll in SIOSv2. Based upon Debian 8.x (there is a RHEL/CentOS version, but I am moving away from deploying those by default unless there is a customer request behind it, due to the extra effort in making everything work right. I might post on that sometime soon). Flip the OS disks. Reboot. Configure the network. Start up the VM.

The VM required I import the disk and create a new config for it. In this way, I really wish virsh behaved the same as the VM system on SmartOS. For a number of reasons this unit couldn’t be a SmartOS box.

Ok. Had to fix the VM. Took about 10 minutes and done. Now name services and other things work. Yay.

Ok. Now install nginx and other bits for the blog. See, this is where containers would come in handy … and this unit is prepped and ready to go with two different container stacks (depending upon how I want to configure it later). But for the moment, we are building this as a monolith, with the idea of making it a microbox server later.

Install mysql and some php oddity, because WordPress.

Find my daily DB dump, import it, light up the blog and …

Everything is gone. Database connection error.


Look at the DB dump. Looks a little small. Look for the blog stuff in it.


Ok … what happened?

Didn’t I see some mysql error on a table a while ago? One I don’t use anymore in the blog? One that was corrupt?

Could that have blorked the dump?

Swap back to the old boot drives. Bring it up. Run mysqlcheck.

Sure enough, 1 broken table.

Ok, let’s fix it.

#insert "sounds_of_birds_and_crickets_chirping.h"

A while later, I redo the dump.

The 75MB file is now a 3.9GB file.

Yeah, was missing some data.

Grrrr… Bad mysql … Bad ….

Swap boot drives. Restart. Reimport. Rinse.

No repeat.

And it works.
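
A minimal sketch of the kind of check that would have flagged the short dump before it was needed. It assumes the dump was written by mysqldump with its default comment settings, in which case a dump that ran to completion ends with a "-- Dump completed" comment line. The function names, the 4096-byte tail window and the 0.5 ratio are illustrative choices, not anything from the setup described above.

```python
import os


def dump_looks_complete(path, tail_bytes=4096):
    """Return True if the file ends with mysqldump's completion comment.

    Assumes mysqldump ran with comments enabled (its default), so a dump
    that finished normally has a "-- Dump completed" line near the end.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Only read the tail; dumps can be many GB.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return b"-- Dump completed" in tail


def dump_size_ok(new_path, previous_path, min_ratio=0.5):
    """Return True if the new dump is at least min_ratio times the previous one.

    min_ratio is an arbitrary threshold for illustration; a sudden drop in
    size is a hint that tables went missing from the dump.
    """
    new_size = os.path.getsize(new_path)
    old_size = os.path.getsize(previous_path)
    return new_size >= min_ratio * old_size
```

Run from the nightly dump job, either check failing is a reason to complain loudly. Neither proves the dump is restorable; only a test import does that.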



And this was a good idea … why ?

The Debian/Ubuntu update tool is named “apt” with various utilities built around it. For the most part, it works very well, and software upgrades nicely. Sort of like yum and its ilk, but it pre-dates them.

This tool is meant for automated (e.g. lights out) updates. No keyboard interaction should be required.


For any reason.

However … a recent update to one particular package, in Debian, and in Ubuntu, has resulted in installation/updates pausing. Because the person who built the update decided that it would be really … really good … if there were a text pager in the update process. So the update pauses, unless you quit the text pager, or go to the end of it.

That this is moronic is an understatement.

That this is wrong, minimizes how broken it is.

That this ever escaped QA boggles the mind.

Don’t make me interact with my fleet of machines for updates. Just … don’t.

If you feel you must, well … hand over maint of your code base to someone who understands how completely wrong this is.

It is 2016. We’ve got automated tooling going across all of our systems. Our systems will break with a forced manual interaction. Which means someone either wasn’t thinking clearly, or was unaware that this is 2016.
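
For what it is worth, a sketch of how an automated updater can try to defend itself against prompts and pagers. The package at fault is not named above, so this is a generic set of guards rather than a fix for that specific update: DEBIAN_FRONTEND=noninteractive silences debconf questions, APT_LISTCHANGES_FRONTEND=none disables the apt-listchanges changelog display (one common source of a pager during upgrades), PAGER=cat makes anything that honors PAGER non-blocking, and the two Dpkg options keep existing config files without asking. The function and constant names are made up for this example.

```python
import os
import subprocess


def noninteractive_env(base=None):
    """Return a copy of the environment with prompts and pagers disabled."""
    env = dict(os.environ if base is None else base)
    env["DEBIAN_FRONTEND"] = "noninteractive"   # no debconf questions
    env["APT_LISTCHANGES_FRONTEND"] = "none"    # no changelog pager from apt-listchanges
    env["PAGER"] = "cat"                        # anything honoring PAGER will not block
    return env


# Keep locally modified config files rather than prompting about them.
UPGRADE_CMD = [
    "apt-get", "-y",
    "-o", "Dpkg::Options::=--force-confdef",
    "-o", "Dpkg::Options::=--force-confold",
    "dist-upgrade",
]


def run_upgrade():
    """Run the upgrade with stdin closed so a stray prompt fails instead of hanging."""
    return subprocess.run(
        UPGRADE_CMD,
        env=noninteractive_env(),
        stdin=subprocess.DEVNULL,
        check=True,
    )
```

A maintainer script that opens a pager directly on the terminal can still defeat this, which is rather the point of the complaint.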



M&A: Vertical integration plays

Two items of note here. First, Cavium acquires qlogic. This is interesting at some levels, as qlogic has been a long time player in storage (and networking). There are many qlogic FC switches out there, as well as some older Infiniband gear (pre-Intel sale). Cavium is more of a processor shop, having built a number of interesting SoC and general purpose CPUs. I am not sure the combo is going to be a serious contender to Intel or others in the data center space, but I think they will be working on carving out a specific niche. More in a moment.

Second, Samsung grabbed Joyent. This is Bryan Cantrill’s take on it, but his is denser with the meat of the why, and less filled with (though there is some) marketing blather on synergies, culture, yadda yadda yadda. This is a move by Samsung mobile, one of the Samsung companies. Joyent is famous for starting the node.js project, as well as its cloud with its Triton (data center as a container system), manta (object storage, and move processing to data for in-place computing … very similar in concept to what we’ve been pushing for the last decade), and of course SmartOS.

First off, I don’t see any of the dependency stack going away. Triton lives atop SmartOS. If anything, I see SmartOS benefiting from this massively, as Samsung may add weight to getting drivers operational on SmartOS. This is, IMO, an important weakness in SmartOS, and one I hope, will now be rectified. We were successful in convincing Chelsio to port to SmartOS/Illumos a few years ago, so we had a decent 10GbE driver. But I want 100GbE, and a few other things (NVMe, etc.) that I’d have to hire Illumos kernel devs for. Given Samsung’s focus on NVMe (not mobile, but the other folks), I’ll ping them about helping out with this … as NVMe on SmartOS + 100GbE would be AWESOME … (and for what it’s worth, the major siCloud installation we built a few years ago, started out with SmartDC, and moved to Fifo for a number of reasons … but our systems/code are all SmartOS/SDC/Fifo supporting, as long as we have working drivers).

Ok, bigger picture.

This is vertical integration in both cases. Bring more of the stack in-house, focus on the value that these things can bring. Joyent + Samsung gives you DC wide container engines. Great for mobile. But wildly awesome for other things (think of what OpenStack would like to do, and it is already available with Triton). Then qlogic + Cavium gives a vertical integration play for a set of DC niches, in storage, in NPUs (possibly), in hyperscale systems …

Both of these are very interesting.


About that cloud “security”

Wow … might want to rethink what you do and how you do it. See here.

Put in simple terms, why bother to encrypt if your key is (trivially) recoverable?

I did not realize that side channel attacks were so effective. Will read the paper. If this isn’t just a highly over specialized case, and is actually applicable to real world scenarios, we’ll need to make sure we understand methods to mitigate.


Ah Gmail … losing more emails

So … my wife and I have private gmail addresses. Not related to the day job. She sends me an email from there. It never arrives.

Gmail to gmail.

Not in the spam folder.

But to gmail.

So I have her send it to this machine.

Gets here right away.

We moved the day job’s support email address off gmail (it’s just a reflector now) into the same tech running inside our FW. Because it was losing mail, pissing off customers.

Though in one of those cases, the customer had a “best practice” rule (read as: a random rule implemented without a compelling real problem that it “solved”, or risk it “reduced” … e.g. it was a fad, and a bad one at that, that likely caught MANY vendors up in it) that also messed with email.

It’s not that this is getting old. It’s that I am now actively looking at Gmail based mail as a risk to be mitigated. As mail gets lost. With no logs to trace what happened.

So … do I want to spend the time to manage our own mail, or do I want to continue to lose mail? That is the business question. What is the value of the lost mail, or lost good-will due to the lost mail?


That moment in time where you realize that you must constrain the support people from doing anything other than what you direct them to do

This is Comcast. And my internet connection in my home office. Cable modem spontaneously started rebooting on me over the last few months. Looks like it happened after they replaced my older cable modem which was working nicely, with the new one … which isn’t.

First call in this week, after it kicked out a whole bunch of times while I was working on customer machines with hard deadlines to get things done in … they scheduled a tech, after I requested a replacement cable modem. They promised/swore he would have one with him, and would replace it.

Instead, he blamed filters outside the house (that Comcast had installed previously), that he removed.

This morning while working on a machine in the UK, and this afternoon while working on a machine in Ohio, it kicked out on me. Again, with hard timing deadlines (one was a bank, another a genomics medical site) on me to get it done.

Fed up, I called them back. On the phone now. Will insist they simply replace the box. They seem to get that this is an issue. Will see if they actually do this correctly.



Real scalability is hard, aka there are no silver bullets

I talked about hypothetical silver bullets in the recent past at a conference and to customers and VCs. Basically, there is no such thing as a silver bullet … no magic pixie dust, or magical card, or superfantastic software you can add to a system to make it incredibly faster.

Faster, better performing systems require better architecture (physical, algorithmic, etc.). You really cannot throw a metric-ton of machines at a problem and hope that scaling is simple and linear. Because it really never works like that. You can’t expect a pile of inefficient cheap and deep machines to have any hope whatsoever of beating a very well architected massively parallel IO engine at moving/analyzing data. It’s almost embarrassing how bad these piles of machines are at running IO/compute intensive code, when their architecture effectively precludes performance.
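
One textbook way to put numbers on "scaling is not linear" is Amdahl's law, which is brought in here purely as an illustration and is not something the talk or the linked article is claimed to use. If a fraction p of the work parallelizes perfectly and the rest is serial (coordination, a shared IO path, and so on), the speedup on n machines is 1 / ((1 - p) + p / n). The function name and the 95% figure below are hypothetical.

```python
def amdahl_speedup(parallel_fraction, machines):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / machines)


# Hypothetical workload: 95% of the work parallelizes, 5% does not.
for n in (10, 100, 1000):
    print(n, "machines ->", round(amdahl_speedup(0.95, n), 1), "x")
```

With 5% serial work the ceiling is 20x no matter how many machines are in the pile, and real systems add communication and IO contention on top of that.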

Software matters. So does hardware.

What prompted this post (been very busy, but I felt I had to get this out) was this article on HN. I know it’s an older article, but the points made about implementation mattering in software for a distributed/scalable system, matter just as much (if not more) for high performance hardware systems.


Having to do this in a kernel build is simply annoying

So there are some macros, __DATE__ and __TIME__ that the gcc compiler knows about. And some people inject these into their kernel module builds, because, well, why not. The issue is that they can make “reproducible builds” harder. Well, no, they really don’t. That’s a side issue.

And of course, modern kernel builds turn on -Wall plus -Werror=date-time, which converts warnings like

macro "__TIME__" might prevent reproducible builds [-Werror=date-time]

into real honest-to-goodness errors. Ok, they aren’t real errors. It’s just a compiler being somewhat pissy with me. And I had to work around it. I could disable the -Werror=date-time, but that is not what I wanted to do.

So I hand-preprocessed the code. In the makefile include. Before starting the compile.


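# ":=" evaluates the shell call once, so date and time stay fixed for the whole run
__D__ := $(shell date +%x)
__T__ := $(shell date +%R)

# Recipe lines must begin with a tab. Note that sed -i rewrites source.c in place.
target_prep: source.c
	sed -i 's|__DATE__|"${__D__}"|g' source.c
	sed -i 's|__TIME__|"${__T__}"|g' source.c
	touch target_prep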

Which, I dunno … sorta … kinda … blows chunks … mebbe ? Working around an issue by not fixing what was broke, but instead introducing a new path so I don’t subvert the intentions of the kernel build system?
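
A slightly less destructive variant of the same workaround, sketched in Python rather than make: expand the two macros into string literals in the same layout the C preprocessor uses ("Mmm dd yyyy" and "hh:mm:ss") and hand back the new text, so the result can be written to a generated copy instead of editing the source in place. The function names are invented for this sketch, and the month abbreviation assumes a C/English locale.

```python
import re
from datetime import datetime


def c_date_time_literals(now):
    """Return (date, time) as quoted C string literals in __DATE__/__TIME__ layout."""
    # __DATE__ pads single-digit days with a space, e.g. "Jun  5 2016".
    date_str = f"{now:%b} {now.day:2d} {now.year}"
    time_str = f"{now:%H:%M:%S}"
    return f'"{date_str}"', f'"{time_str}"'


def expand_date_time(source_text, now):
    """Replace standalone __DATE__ and __TIME__ tokens with fixed literals.

    This is a plain text substitution: it does not parse C, so an occurrence
    inside a comment or string literal is replaced as well.
    """
    date_lit, time_lit = c_date_time_literals(now)
    text = re.sub(r"\b__DATE__\b", lambda m: date_lit, source_text)
    return re.sub(r"\b__TIME__\b", lambda m: time_lit, text)
```

Calling expand_date_time(open("source.c").read(), datetime.now()) and writing the result to a separate file (say, a hypothetical source.gen.c that the module build compiles) leaves the original untouched and still keeps the warning flags as they are.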


Talk from #Kxcon2016 on #HPC #Storage for #BigData analytics is up

See here, which was largely about how to architect high performance analytics platforms, and a specific shout out to our Forte NVMe flash unit, which is currently available in volume starting at $1 USD/GB.

Some of the more interesting results from our testing:

  • 24GB/s bandwidth largely insensitive to block size.
  • 5+ Million IOPs random IO (5+MIOPs) sensitive to block size.
  • 4k random read (100%) were well north of 5M IOPs.
  • 8k random read were well north of 2M IOPs.

Over a single 100Gb IB connection with our standard PFS BeeGFS running, we sustained 11.6 GB/s and 11.8 GB/s write and read bandwidth respectively.
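
A quick back-of-the-envelope check on how those figures relate to each other. The assumptions are mine: 4k and 8k mean 4096 and 8192 bytes, GB means 10^9 bytes, the IOPs values are the "well north of" floors, and the 100Gb link is treated as 100 Gb/s of usable data rate with no protocol overhead.

```python
def iops_to_gb_per_s(iops, block_bytes):
    """Bandwidth implied by an IOPs figure at a given block size, in GB/s (10^9 bytes)."""
    return iops * block_bytes / 1e9


def link_utilization(observed_gb_per_s, link_gbit_per_s):
    """Fraction of the nominal link rate that the observed bandwidth represents."""
    return observed_gb_per_s / (link_gbit_per_s / 8)


print(iops_to_gb_per_s(5e6, 4096))   # 4k random read floor, about 20.5 GB/s
print(iops_to_gb_per_s(2e6, 8192))   # 8k random read floor, about 16.4 GB/s
print(link_utilization(11.6, 100))   # write over one 100Gb link, about 0.93
print(link_utilization(11.8, 100))   # read over one 100Gb link, about 0.94
```

So the small-block random numbers sit within sight of the 24GB/s streaming figure, and a single 100Gb connection was running at better than 90% of its nominal 12.5 GB/s.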


Going to #KXcon2016 this weekend to talk #NVMe #HPC #Storage for #kdb #iot and #BigData

This should be fun! This is being organized and run by my friend Lara of Xand Marketing. Excellent talks scheduled, fun bits (raspberry pi based kdb+!!!).

Some similarities with the talk I gave this morning, but more of a focus on specific analytics issues relevant for people with massive time series data sets and a need to analyze them.

Looking forward to getting out to Montauk … haven’t been there since I did my undergrad at Stony Brook. Should be fun (the group always is). Sneaking a day off on Friday to visit with my family, then driving out Saturday morning.
