She’s dead, Jim

It looks (if the rumor is true) like Solaris will be pushing up the daisies soon.

Note: Solaris != SmartOS

This has been a long time coming. Combine this with Fujitsu dumping SPARC for headline projects … yeah … it’s likely over.

FWIW: I like SmartOS. The issue for it is drivers. We tried helping, and were able to get one group to update their driver set. But getting others to update (specifically Mellanox) will be even harder now (and it was impossible beforehand, for reasons that were not Mellanox’s fault). I’d like to use more SmartOS, but I keep running into things I can’t fix or work around. I can’t use my Mellanox 40+ Gb cards, or any InfiniBand stack. I can’t use 100Gb cards. I can’t use Intel OPA. CUDA is right out. I had hoped that Samsung would be throwing beaucoup money at Joyent to really solidify the platform after the purchase. Still hoping.

So our OS choices seem to be Linux based and BSD based going forward. We use BSD for specific functions, and Linux for many things.

The closest thing I’ve found to SmartOS on Linux is RancherOS. It is not identical, but darn it, it is close to what we need, and I can replace the kernel and add in a few things we need. Ubuntu is making a strong play for this as well, adding ZFS to its mix.

But again, SmartOS != Solaris. I played with recent Solaris a few months ago to see how it had progressed. Still not that impressive (especially compared to SmartOS and others).

So while Solaris is going away, I don’t think it will be missed greatly. If the licensing could be made to work to cross-pollinate between Linux and SmartOS, I’d bet we could solve the driver problem too.

/sigh


On closure

I work with many people, have regular email and phone contact with them, as well as occasional face-to-face meetings. We talk ideas back and forth and develop plans. I work on designs, coordinating everything that goes into those designs (usually built upon our kit). I work hard on my proposals, thinking many things through, developing very detailed plans. I share these with the people … our customers.

And then the pinging begins.

I need to get feedback on any/all aspects so I can adjust what needs to be adjusted.

And sometimes I get it. Sometimes, I hear back … “change X” or “wow, the pricing is not going to work.” So I ask for guidance, as we have some design flexibility relative to the goal sets, and we can make engineering design tradeoffs to hit specific targets. Not always, but many times we can. You have to be ready to compromise on the need/feature side to adjust other elements (price, size, space, power). We’ve got better configurability on our side … our architecture doesn’t change, but we can adjust which components are used (disk sizes, SSDs with varying drive-writes-per-day ratings, etc.).

Without this feedback, it’s just a proposal answering the initial request: not necessarily what is needed, merely what was originally wanted. My job is to leverage that feedback and try to converge the needs and wants together, subject to the various constraints.

It’s an interactive and involved process.

Interactive.

Which means, we need to hear back.

In far too many cases, we provide the initial legwork for something we are told is an immediate need, and yes, they are willing to work through the interactive process with us to converge on the specifics. And then we don’t hear back.

I check my keyboard for malodorous letters, and make sure I don’t have an outbound email filter in place automatically transforming my notes to gibberish. I check to make sure emails I send actually get through, often by BCCing them to another of my accounts on an external server. Google’s Gmail has been … somewhat cantankerous … it will sometimes just lose outbound and inbound email. And we have no way to trace it. And yes, we’ve notified them of this and filed bug reports. And no, their response was along the lines of “check your email client”. smh.

I try not to ping too frequently. I don’t like this done to me, and I don’t do it to others. I respect the people I communicate with. I expect that they are busy and have little time for things, which is why I take so much of the burden upon myself to be helpful.

But at some point, I have to question my sanity on this … should I continue to ping after the 8th or 9th email, spaced out over a respectful number of days?

What do I do when I am busy but working on a project with someone else?

I provide at least a little closure … I send an “I’m buried, will respond later/next week/next month/in another lifetime”. Something … to provide the feedback and ACK to the person kind enough to remind me of the project.

I wish I got this. Maybe 5-10% of the people I work with do this.

I am persistent, but I don’t want to be a pain. Likewise, sometimes pings can’t be answered right away. I get that.

But complete radio silence, after working hard on trying to solve the problems we’ve been asked to solve? This causes me to wonder if we have become a 2×4.

I know many people like to have an adversarial relationship with their vendors to keep them on their toes. We like a consultative approach, to enable us to show our value.

This is one of the more frustrating parts of business, the lack of closure. I’d be fine with a “thanks, but not interested”. It’s the ACK aspect I am after.

Seems to have gotten worse in the last several months.


Inventory reduction event at the day job

We’ve got 3x Unison (https://scalableinformatics.com/unison) and 1x Cadence (https://scalableinformatics.com/cadence) systems that we need to clear out.

The Unison machines sustain 5-7GB/s each, and the Cadence sustains 10-20GB/s and 200-600k IOPS (depending upon storage configuration).

More info by emailing me.

Everything is on a first-come, first-served basis; feel free to reach out if you’d like to hear more.

Specs:

ucp-01: Unison1
12 cores, 128GB RAM
2x40GbE or 4x10GbE ports
60x 2TB drives
4x 800GB SSD

ucp-04: Unison2
12 cores, 128GB RAM
2x40GbE or 4x10GbE ports
60x 2TB drives
4x 800GB SSD

usn-03: Cadence1
12 cores, 128GB RAM
2x40GbE or 4x10GbE ports
48x 400GB SATA SSD

One more unlisted Unison unit with the same specs as the others, though with 3TB drives.


It’s 2016, almost 2017 … fix your application installer so it doesn’t need to reboot my machine!

There I was, running my windows in a window on my desktop. Running a nice little word processor from a company in Redmond, WA. Working on a document. About 15 minutes in, and I usually save at 30-minute boundaries … because … hey … they haven’t quite figured out that the word processor should do this for you … AUTOMATICALLY

Ok, I am shouting. Calm down.

Anyway, for some reason, some little Cupertino company’s code pops up and says “hey, you wanna update me?”

Sure. While I am typing, it is fine.

BAAAAAAADDDDDDD MMMMISSSSTTTTAKKKKEEE.

You see, this aforementioned … “froot”y company has its own little installer. It doesn’t use the Redmond company’s installer. So they can do more stuff. Or something.

Anyway … long story short. Go away to heat up my lunch in the nuke-o-wave.

Come back.

I am logged out.

How did that happen?

No problem. Log in.

Fire up the aforementioned word processor and …

WHISKEY TANGO FOXTROT

I know … I know … it’s my fault for not saving this.

But the free LibreOffice seems to have no trouble doing this.

And why in Cthulhu’s name does this installer, for a device I rarely ever use with windows in a window, think it is ok to reboot a machine? Sure, it might have a device driver or two in it.

BUT IT SHOULD NOT ASSUME THAT IT HAS THE RIGHT TO REBOOT THE MACHINE.

Yeah, lost a little bit of work. Definitely pissed.

Folks: it’s 2016, almost 2017. Fix your installer. No rebooting needed. Fix your word processor so it is at least at the functional capability of the completely free one, and doesn’t spontaneously lose work.

On the latter part, I had to console a distraught family member a few years ago over some presentation tool and its propensity to crash, taking all of her hard work with it. I gave her a simple algorithm to execute on a timer: every 30 minutes, save. Every hour, change the name to have (iteration++) at the end.

This was for self protection reasons … if it died and took stuff down, you could limit the damage.

The timer appears to be too generous. Maybe it should be every 5 minutes. Or after each sentence. And definitely don’t run installers in the background.
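If I were scripting that advice, it might look like this minimal sketch (the filename and interval are made up, and this only snapshots whatever was last saved to disk … you still have to hit save in the app):

#!/bin/bash
# toy autosave-by-copy loop: every N seconds, copy the document
# to an incrementing backup name, limiting damage if the app crashes
DOC="presentation.odp"
INTERVAL=300                # 5 minutes, per the revised advice above
i=0
while sleep "$INTERVAL"; do
    i=$(( i + 1 ))
    cp -p "$DOC" "${DOC%.odp}.bak${i}.odp"   # e.g. presentation.bak3.odp
done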


strace -p is your friend

So there I was, trying to use a serial port on a node which was connected to a serial port on a switch … which I needed in order to configure the switch properly.

So I light up minicom and get garbage.

Great, a baud rate mismatch, easily fixed.
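(Fixing it is minicom’s own setup menu, or, from a shell, something like the following … the device name and rate here are illustrative.)

# match the switch's console rate, then read the settings back
stty -F /dev/ttyS0 9600
stty -F /dev/ttyS0 -a       # confirms speed, parity, stop bits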

Fix it. Connect again. I get the first 10-12 characters … and then garbage.

Hmmm.

I’d like to pause our story for a moment and say that this was where I had the key insight … but that would not be true. Like a true bonehead, I had a hammer, and it looked like a nail. Back to the story.

Ok, do this a few more times.

It looks like the baud rate kept changing. But this is silly, as I had locked the port, and minicom is one of those things that just works.

Another pause. Minicom was giving me a hint when it lit up saying “hey, this port is locked with a stale lock, let me fix that for you.” Back to the story.

A few more times. Then pull down the latest minicom (I was on an older rev … obviously it was a software bug … or worse … a hardware bug … ugh).

[editorial]
Obviously not pilot error.

No. Couldn’t be that.

Not at all.
[/editorial]

Finally, after an hour of serious WTF, I am starting to question my sanity. It’s time to pull out the heavy machinery and watch the system calls as they go by.

Start up minicom. Attach strace to it in another window.
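Roughly like this (the PID below is illustrative … use whatever pidof actually reports):

# find minicom's PID, then watch its system calls live;
# baud rate changes show up as termios ioctl()s on the port
pidof minicom                               # say it prints 12345
strace -p 12345 -e trace=ioctl,read,write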

Press enter … and ….

1st: I see the port set right.

2nd: I see characters starting to flow … and … then …

3rd: Whiskey Tango Foxtrot … the serial port changed speeds midstream.

Warning: Incoming reality 2×4 headed our way!

Something else is messing with the port … what could it be?

[brain engages]
What on earth could be engaging a serial port? At 11pm? And going into polling mode ….

oh crap.

[sigh /]

I’ve got a getty running on the port. Must have. And it recycles, and polls.

Look in the requisite location and … yuppers, there it is.
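On this box, “the requisite location” check looks something like this (the port name is illustrative, and which command applies depends on the init system):

# systemd: is a getty claiming the port?
systemctl status serial-getty@ttyS0.service
# sysvinit: look for a line spawning a getty on the port
grep ttyS0 /etc/inittab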

[sounds of mad vim typing, and some maniacal laughter later]
getty is off. Try again.

And there is the switch. So I log in, set up what I need to.

For all its warts (it imposes a heavy speed penalty on whatever it traces), strace is a really good diagnostic tool.

And

BBBBRRRRRRAAAAAIIIIINNNNSSSS

…. must engage them before wasting an hour on things I shouldn’t have wasted an hour on. Ok, it was 10pm, and I was tired, but still.

Today’s lesson: strace -p is your friend. And the handy, helpful warning message popping up on your terminal …

You are about to engage auto-destruct sequence … are you sure?

… yeah … don’t just ignore it. Pay attention.


Finding unpatched “features” in distro packages

I generally expect baseline distro packages to be “old” by some measure. Even the more forward-thinking distros generally (mis)equate age with stability. I’ve heard the expression “bug for bug compatible” when dealing with newer code on older systems.

Something about the devil you know vs the devil you don’t.

Ok. In this case, CMake. A good development tool, gaining popularity over autotools and other things.

Base SIOS image is on Debian 8.x (x=6 at last viewing). CMake version is 3.0.2 + some patches.

Remember, age == stability, über alles.

So I encountered a bug in CMake, with the FindOpenSSL module. This was in building Julia. Doing some quick sleuthing, I found this patch (for a later version) of CMake. Looking at the source, it would apply correctly without edits, so I gave it a try (dev machine with our ephemeral SIOS boot, no issue if I nuke it by accident … a reboot fixes everything).
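Applying it amounted to roughly this (the module path is from memory, illustrative of where Debian 8 keeps CMake’s modules, and the patch filename is made up):

# back up (-b) and patch the distro's FindOpenSSL module in place
cd /usr/share/cmake-3.0/Modules
sudo patch -b FindOpenSSL.cmake < /tmp/findopenssl-fix.patch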

Restarted the make and it ran correctly to completion.

So I started looking at CMake. The distro has 3.0.2 + patches. The patch was for 3.1.2. Out of curiosity … how old is this rev, and are we badly out of date? Looking at the git repo, with something like this in a clone of upstream CMake:
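# when did each tag land, and where is upstream now?
git log -1 --format='%ai' v3.1.2     # the tag carrying the FindOpenSSL fix
git log -1 --format='%ai' v3.0.2     # what Debian 8 ships (plus patches)
git describe --tags origin/master    # latest upstream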

The 3.1.2 version that fixes this was released 20 months ago. 3.0.2 + patches is more than 2 years old. 3.6.2 is the latest stable.

Ugh. Will live with the patch for now, but we might need to update CMake on our units to avoid this in the future.


On expectations

This has happened multiple times over the last few months. Just variations on the theme as it were, so I’ll talk about the theme.

The day job builds some of the fastest systems for storage and analytics in the market. We pride ourselves on being able to make things go very … very fast. If it’s slow, IMO, it’s a bug.

So we often get people contacting us with their requirements. These requirements are often very hard for our competitors, and fairly simple for us to address.

We’ll get inquiries like this:

We'd like 250TB of storage, replicated, and we need to sustain 10GB/s writes and 10GB/s reads. Can you do this?

I made up those numbers, but they are around the same order of magnitude in many cases, and the first digits are also quite similar.

We know what is possible. We know how homebrew/self-built systems behave. We know the ins and outs of making this work.

So we start with a spec, work up a few config/design variants to address this, and offer a spectrum to the person who contacted us.

A quick segue here. Very high performance, very high efficiency is hard. You can’t simply slap components together and hope it will work. As you quickly discover, it doesn’t. Moreover, it is worth noting that most people read spec sheets and presume … really … presume … that they are going to get the maximum performance of the device … all the time, under all conditions. Many people don’t quite have a mental model of the connection between the IO/computing/network load patterns and the perceived performance.

And also, as part of this segue, they don’t really … have a clue as to how much a performant implementation will cost. They look at the consumer-grade SSD rated for 80GB of writes per day, do a quick bit of math in their heads, and come out with a number they think will work.

And then they come to us.

Back to the expectations discussion.

So the people arrive with this number in their heads … what they think their 10GB/s read/write system should cost. And they tell you.

Since we design and build these things, we actually have a pretty good idea of the actual costs involved. The costs … our cost for materials that can actually meet the requirements when assembled into a system … are often significantly larger than their perceived cost.
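A toy version of the arithmetic that drives this, with illustrative numbers only (not a quote, and not any particular device):

# floor on drive count for a 10GB/s sustained-write target;
# assume ~0.5 GB/s sustained per SATA SSD once caches are exhausted
target_mbps=10000                        # 10 GB/s, in MB/s
per_drive_mbps=500
raw=$(( target_mbps / per_drive_mbps ))
echo "drives for raw bandwidth alone: $raw"            # 20
echo "with replication (writes x2):   $(( raw * 2 ))"  # 40, before RAID,
# metadata, and network overhead push the count higher still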

It’s … almost … depressing.

Not that we are going to lose their business. We run a fairly tight ship, and are very aggressive on our pricing. We like repeat customers … this is how we live and grow. But …

But …

in these instances, we would have to subsidize 3/4 or more of the unit for them.

What makes this sadder is that these are often very well funded startups or large companies doing this. In the past it has been universities and research labs.

I do canvass the market fairly regularly, to see if I am missing something, and to see if someone magically came out with a 10 drive-writes-per-day SSD at under $0.05/GB that sustains 500MB/s to 1GB/s and 100k IOPS on 12Gb SAS.

There seems to be a disconnect between what people believe they’d like to pay, and what it actually costs (even in raw materials). I know, prices are not really firm. A market is made when a buyer and a seller agree upon a price, and the price may not necessarily reflect portions of the cost of the item.

Performance is a valuable feature. More so than ever in the past. Being able to design and build high sustained-performance systems, and deliver appliances that provide this high performance, is a valuable service.

I am not quite sure what to think about this disconnect between reality and people’s expectations. I’m respectful and open with the people doing the inquiry. I help them understand where their budget should be. But we can’t afford to pay people to take our solutions.

A few years ago, a potential partner had come to us with an opportunity at a national lab for a sizable system. We looked at the specs, and then the budget.

The lab wanted the highest end kit, of course. You know that. Their requirements specifically called out what you could or could not do.

Then came the budget. When we looked at it … the pricing was below the lowest-end raw disks in the market (dense consumer-grade drives) that we could get in bulk. Speaking on the side with some of the OEMs, they completely blanched at providing low-margin consumer-grade units at these prices, never mind the high-margin, highest-end units.

Someone did eventually “win” this business. But these wins are Pyrrhic. Enough of them and they will go out of business. They had a layoff sometime after this was delivered; who knows if there was a connection. The end user is happy because they got a fresh new system at high spec, for a price well under the market rate … well under the actual part costs in the system. The vendor isn’t happy, as they not only lost money on the deal, but thanks to the language around these deals, they can’t do any real marketing, so the win is … well … of low value/quality.

We read the spec, and did a no-bid. We can’t afford “wins” like that.

I dunno. This stuff bugs me.

Real performance will cost some money, and you likely need to have a range of performance options in mind to compare to your budget.


Excellent article on infrastructure mistakes … “cloud jail” is about right

The article is here at First Round Capital. This goes to a point I’ve made many, many times to customers going the cloud route exclusively, rather than the internal infrastructure route or a hybrid route. Basically, it is that the economics simply don’t work.

We’ve used a set of models based upon observed customer use cases, and demonstrated this to many folks (customers, VCs, etc.). Many are unimpressed until they actually live the life themselves, have the bills to pay, and then really … really grok what is going on.

A good quote:

As an example, five years ago, a company doing video encoding and streaming came to Freedman with a $300,000/mo. and rising bill in their hand, which was driving negative margin: the faster they grew, the faster they’d lose money. He helped them move 500TB and 10 gigabits/sec of streaming from their public cloud provider to their own infrastructure, and in the process brought their bill to under $100,000/mo., including staff that knew how to handle their physical infrastructure and routers. Today, they spend $250,000/mo. for infrastructure and bandwidth and estimate that their Amazon bill would be well over $1,000,000/mo.

“You want to go into infrastructure with your eyes open, knowing that cloud isn’t always cheaper or more performant,” says Freedman. “Just like you have (or should have) a disaster recovery plan or a security contingency plan, know what you’ll do if and when you get to a scale where you can’t run everything in the cloud for cost or performance reasons. Know how you might run at least some of your own infrastructure, and hire early team members who have some familiarity and experience with the options for doing so.”

By this, he doesn’t mean buying a building and installing chillers and racks. He means leasing colocation space in existing facilities run by someone else, and buying or leasing servers and routers. That’s still going to be more cost effective at scale for the non-bursting, and especially monotonically increasing, workloads that are found in many startup infrastructures.

In-house infrastructure tends to have a very different scale-up/out costing model than cloud, especially if you start out with very efficient, performant, and dense appliances. Colos are everywhere, so the physical plant portion is (relatively) easy. The “hard” part is getting the right bits in there, and the team to manage them. Happily, providers (like the day job) can handle all of this as a managed service engagement. A toy version of that costing model follows below.
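Every number here is made up; the shape is the point … cloud scales roughly linearly with capacity, while colo has a fixed floor but a much shallower slope:

# monthly cost vs capacity, cloud vs colo, illustrative numbers only
for tb in 100 250 500 1000; do
    cloud=$(( tb * 50 ))          # ~$50/TB-month, storage + egress
    colo=$(( 20000 + tb * 10 ))   # colo/staff floor + amortized hardware
    echo "${tb}TB: cloud \$${cloud}/mo vs colo \$${colo}/mo"
done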

Again, a fantastic read. The author also notes you shouldn’t adopt “hipster” tools. I used to call these things fads. The advice is “keep it simple”. And understand the failure modes. Some new setups have very strange failure modes (I am looking at you, systemd), with side effects often far from the root cause, and impacts often far from the specific problem.

All software … ALL software … has bugs. It’s how you work around them that matters. If you adhere to the mantra of “software is eating the world”, then you are also saying, maybe not quite so loudly, that “bugs are eating my data, services, networks, …”. The better you understand these bugs (keep ’em simple), the more likely it is you will be able to manage them.

You can’t eliminate all bugs. You can manage their impacts. However, if you don’t have control over your infrastructure, or your software stack (black box, closed source, remote as-a-service), then when bugs attack, you are at the complete mercy of others to solve this problem. You have tied your business into theirs.

Here’s a simple version of this that impacts us at the day job. Gmail, the pay-per-seat “supported” version (note the scare quotes around the word supported), loses mail sent to us. We have had customers yell at us over their inability to get a response back, when we never saw their email. There is obviously something wrong in the mail chain, and for some customers, it took a while to figure out where the problem was. But first, we had to route around Gmail, and have them send to/from our servers in house. The same servers I wanted to decommission, as I now had “Mail-as-a-Service”.

So the only way to address the completely opaque bugs was … to pull the opaque (e.g. *-as-a-service) systems OUT of the loop.

We have not (yet) pulled our mail operation back in house. We will though. It is on the agenda for the next year. I spent maybe an hour/month previously diagnosing mail problems. Now I have no idea if emails are reaching us. If customers sending us requests are getting fed up with our lack of visible response, and going to competitors.

That is the risk of a hipster tool, an opaque tool. A tool you can’t debug/diagnose on your own.

Again, a very good read.


The joy of IE and URLs, or how to fix ridiculous parsing errors on the part of some “helpers”

Short version: the day job sent some marketing out. The URLs were pretty clear cut. Tested well. But some clients seem to have mis-parsed the URL … like with a trailing “)”. For some reason. That I don’t quite grok.

I tried a few ways of fixing it. Yes, I know, because I fixed it, I baked it into the spec. /sigh

First was a regex rewrite rule. Turns out the rewrite didn’t quite work the way it was intended, and it killed the requests. The regex works fine (we tested). The web server just did strange things.

Ok, let’s try a location block. Craft the same basic thing as the rewrite, but before the main server.

# fix the trailing ")" ... yes ... really ... IE I am looking at you
location ~ /(.*)\)$ {
    return 301 $scheme://blahblahblah.io/$1;
}

restart the webserver, test …

and it works.
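(The test was along these lines; the domain is the placeholder from the config above, and the path is made up.)

# request a URL with the stray trailing ")" and confirm the 301 fires
curl -sI 'https://blahblahblah.io/some/page)' | grep -i '^location'
# expect: Location: https://blahblahblah.io/some/page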

Not fun, and now the trailing characters are encapsulated in the web spec. But at least those who are fundamentally challenged in their choice of browser can now not have said browser muck up the situation … unless it doesn’t process redirects/moved …
