Nominate your favorite HPC product and company for a readers’ choice award

Please go here and nominate!

Last year our customer, Lucera, won best in Financial Services. We built the vast majority of their infrastructure, so we like to think we contributed in some manner to their success.

This year, please don’t hesitate to nominate us (or second/third/etc.) for Best HPC Storage Product or Technology for the Scalable Informatics Unison product, or whatever you’d like.

In addition to the nomination for Unison in storage, I put in nominations for Cadence in Financial Services, and in Data Intensive computing.

M&A: Seagate snarfs up DotHill

The Register reports this morning that Seagate has acquired DotHill. DotHill makes arrays, and their kit is resold and rebadged by many.

In general the high end array market is in decline, and doesn’t show signs of turning around (ever). The low and mid market, including some of the cloud bits, is growing. I am not sure about the OCP stuff, but the low end is where we are seeing 4, 8, and 12 drive arrays show up as completely commoditized gear.

This play makes sense from a vertical integration view, as Seagate had already acquired Xyratex, and invested more into selling the kit. It also places significant pressure on all the other storage makers to do similar things.

I do know of at least one building some interesting kit. The others, not so sure of.

I did have one OEM tell us that they didn’t want to compete with their (high end) customers in a different context. One of the rules of business is, if you don’t cannibalize your own products and offerings, your competitors will happily do this for you. I don’t expect that point of view (that they don’t want to compete with the higher end customers) to last very long.

This also suggests that these folks will likely be getting into the same space as us, hardware/software stacks for hyperconvergence.

This isn’t the last of it. I expect significantly more M&A soon.

IPO: Pure Storage files

Not really an HPC/Big Data play (yet). But they have filed.

The traditional array market is in a decline, and depending upon how you view it, it’s either merely a steep decline, or an out-and-out death spiral. The tier1 vendors are defending a shrinking turf against aggressive smaller and more focused players.

Moreover, flash is set to become cheaper to deploy than disk in very short order. This plays well for folks like Pure and a few others, though the market they are playing in is in decline. I expect Pure to look to branch out to the growing markets at some point, with the proceeds from their IPO.

It’s the hyperconverged or serverSAN market that is in explosive growth. This is where our company sits.

rebuilding our kernel build system for fun and profit

No, really mostly to clean up an accumulation of technical debt that was really bugging the heck out of me.

I like Makefiles and I cannot lie. So I like encoding lots of things in them. But it wound up hardwiring a number of things that shouldn’t have been hardwired. And made the builds brittle.

When you have 2 released/supported kernels, and a handful of experimental kernels, it gets hard making changes that will be properly reflected across the set.

Basically we want our changes/patches built into the kernel, not post-installed. Especially since our appliances boot diskless, post-installation is sort of a no-op.

So I broke out the things I needed to break out. And made saner “objectified” Makefiles. With some includes and other bits. Very modular, very easy to update/manage. Very easy to test.

But I made some cut and paste errors, and spent the last hour+ fixing the errors. Ugh.

It does look like I’ve fixed them all though. This gets back to another issue: driver development is largely outside of the kernel source. You can often find 2+ year old drivers without specific features/capabilities, or support for newer hardware. Which makes for exciting times.

No … not exciting … painful.

Our build integrates all these drivers in. This is, as it turns out, non-trivial because, as noted, a number of these drivers are developed out of kernel. So we have to patch them to build correctly in the kernel. And then … and then … we have to patch them for kernel API changes (yes the )(**&%&$^& API does change). We find these changes through errors in our builds. And then we have to track down where the change was, and then patch. Or write adapter code.

And then we beat the hell out of the kernels. Torture tests. Let’s write a petabyte or two, while computing pi, e, and other things to a billion digits or so while our packet howitzers are going full steam. The reason we build our own kernels comes down to stuff like this: your friendly neighborhood distro kernel isn’t stable under what we and our customers would consider medium load, never mind what we would consider heavy load. Our kernels need to be rock solid. They need to be fast and mostly tuned out of the box.

The build system is part of this. Making it work in a way I am happy with is important.

You can’t build awesome performance systems on top of crappy infrastructure … hardware OR software. You shouldn’t even try.

Drama at Violin Memory

Violin has had a rather tumultuous time in the market. Post IPO, they’ve not had a great time selling. They have an interesting product, but with SanDisk coming out with their kit, and many others in the competitive flash array space, this can’t be a fun time for them. They don’t have a large installed base to protect, and their competitors are numerous and fairly well funded. Add to the mix that, as a post-IPO public company, they no longer have the luxury of not hitting targets … they will get slaughtered in the market.

They’ve underwhelmed a number of investors as of late with rumors/speculation around what they need to do to shake out the doldrums of revenue. So much so, that an activist investor is looking to make changes.

Activist investors are looking to get better returns in a number of higher profile scenarios, and will usually want the companies to cut costs, sell off or shutter underperforming units.

What’s interesting in the letter is that the investor notes the stock is down far more than the market as a whole and has dropped significantly since the new CEO took over.

The confidence reflected in your letter does not seem to be shared by the market – as the stock price is down 24% since the last quarterly earnings report. Furthermore, the stock price today is down 48% since the beginning of the year and down 33% since Mr. Denuccio was named CEO.

As someone who used to work at a company battered by Wall Street, I can tell you that a stock underperforming, and not seeming to reflect what you believe to be the company’s fundamental value, is demoralizing. There was a time when I was at SGI when our value was lower than the cash we had on hand. This has interesting and negative impacts in customer discussions, as customers note this, and ask you about it.

In 1999 I had a conversation with a large university about Origin systems. What it came down to was when they asked us “will you be in business next year”, and I could not, in all honesty, answer that in the affirmative. Think about what that does to sales.

This said, these investors sometimes get it right, and SGI was … well … the management was … quite clueless at the time.

Violin has a very competitive environment, fighting to take heavily defended hills, coupled with many other armies, often better equipped, trying to take the same hills, while the entire landscape they are all fighting for is being eaten away by hyperconverged systems. Like what the day job makes.

Their best bet would be to stop fighting this, and do something hyperconverged … or buy someone hyperconverged. The former may be hard. The latter, somewhat less hard.

Scalable Informatics 13th anniversary on Saturday

We started the company on 1-August-2002. I remember arguing with a senior VP at SGI over his decision to abandon Linux clusters in Feb 2001. That was the catalyst for me leaving SGI, but I was too chicken to start Scalable then. I thought I could do better than them.

I went to another place for 15 months or so. Tried jumpstarting an HPC group there … hired lots of folks, pursued lots of business. Then it went bang. My team was laid off, and I was left with a serious case of Whiskey Tango Foxtrot.

Scalable Informatics was started in my basement. Our foundational thesis was and is that performance should be end user accessible without Herculean effort. You should have a fighting chance to be able to extract performance and leverage it. And it should be easy if at all possible.

This is what guided the company from the outset. Our path began with clusters, with a detour in 2002-2006 for accelerators. I called them APUs, Accelerated Processing Units. I wrote a bunch of white papers for AMD, and used APU within them, that AMD distributed widely. Now APU is a common term. Go figure.

We tried raising capital to build accelerators, believing (in hindsight, correctly) that they would be one of the most important aspects to high performance computing going forward. Couldn’t get any VC to bite. Even did due diligence for a few, only to see a few of our slides (actual graphics we built) show up in others’ slide decks. That convinced me that VCs weren’t worth the time/effort to deal with.

Transitioned out of clusters once Dell decided it wanted that market. Very hard to compete with a massively parallel shipping machine that gets better pricing on parts than you can, and is willing to suck all the oxygen out of the room (or market) to suffocate others. We focused where we could add a great deal more value.

Hyperconverged systems. In 2006 onward.

Made small efforts to interest VCs … local groups … but not a whisper of interest.

Continued to set performance records with our units. Had competitors looking us over thinking they could build the same thing, discovering rapidly that there was indeed significant special sauce powering our kit.

Had a few acquisition offers in the mix. Ranging from “give us the company and we’ll decide if you are worth anything” on downward. It was actually quite humorous in some cases.

Kept getting better and bigger customers, building bigger and faster systems. Building lower cost systems that, while not the top of our line, easily bested our competition’s top-of-their-line, at a fraction of the cost.

Ran a number of standard tests, never reported results. Had customers run tests “in anger” and report that jobs that normally took 6 hours on other gear took 5 minutes on ours. Another customer reported years ago that their 5GB/s system was being looked at by a flash vendor, curious as to whose flash they were using. Customer responded “no flash, just spinning disk”. Left the vendor speechless.

We’ve always said architecture matters. It’s nice to be proven correct, again and again. Our competitors always seem to underestimate us. Please, by all means, continue to do this.

We kept adding people, growing out of my basement in 2007 to a real facility. Now on our second and we are bulging out of it. Actually blew out our AC, and have to get a new one.

Took our first external investment in Feb this year, and it looks like we are going to do some more pretty soon. Had another discussion that took 9 months and went down in flames over terms that were impossible for us to agree to. Exactly the sort of terms you bring up and insist cannot be compromised on, if you want to kill a deal. Kill it, they did.

Along the way, we set records on in-box firepower, and between-box firepower. Records that are only recently coming under threat.

We’ve got some absolutely wild bits brewing in lab, things we can’t talk much about now, and it’s killing me … I really want to.

This said, our 13th year and beyond should be quite awesome. More soon. I promise.

Been there, done that, even have a patent on it

I just saw this about doing a divide and conquer approach to massive scale genomics calculation. While not specific to the code in question, it looked familiar. Yeah, I think I’ve seen something like this before … and wrote the code to do it.

It was called SGI GenomeCluster.

It was original and innovative at the time, hiding the massively parallel nature of the computation behind a comfortable interface that end users already knew. It divided the work up, queued up many runs, and reassembled the output, in as close to the same order as possible. One of my test metrics was taking the md5sum of the output of my code and of the original. If they differed, it failed.
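
The divide, run, reassemble, and compare-checksums loop described above can be sketched roughly as follows. This is a minimal illustration in Python, not the GenomeCluster code: `process_record` is a hypothetical stand-in for the real per-sequence computation, and a local thread pool stands in for runs queued on a cluster.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def process_record(record):
    # Hypothetical stand-in for the real per-sequence computation.
    return f"{record}\t{len(record)}\n"


def run_serial(records):
    # Reference run: every record processed in input order by one worker.
    return "".join(process_record(r) for r in records)


def split_work(records, n_chunks):
    # Contiguous chunks, so concatenating chunk outputs restores input order.
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_divided(records, n_chunks=4):
    chunks = split_work(records, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Executor.map returns results in submission order, not completion order.
        parts = list(pool.map(run_serial, chunks))
    return "".join(parts)


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Made-up input records, purely for the demonstration.
    records = [f"SEQ{i:04d}" + "ACGT" * (i % 7) for i in range(1000)]
    same = md5_hex(run_divided(records)) == md5_hex(run_serial(records))
    print("PASS" if same else "FAIL")
```

Keeping the chunks contiguous and collecting results in submission order is what makes a byte-for-byte checksum comparison against the serial run meaningful.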

There were many aspects of this that were (at the time, 1999-2000) quite novel. So we filed a patent on it. Which was granted. It is Patent number 7,249,357 if you care to look.

Next gen version avoiding all of the patented elements was developed at my next employer, which subsequently had a financial meltdown due to a failed acquisition (or more correctly, failed due diligence during acquisition, so they didn’t uncover the slightly well done books in time). MSC.Life was lost to the ages.

I left there and started Scalable Informatics. 13 years ago this Saturday.

While the folks at Broad and Google seem to have done wonderful things, they may not have been the first to do this. I myself was inspired by the previous work of HT-BLAST from my colleagues at the time. Some of whom insisted that there was no way a distributed version of this could ever scale … there were simply too many issues. I have great respect for them, but I set out to prove that it could scale. And scale it did.

Later on, a number of very smart folks at a number of places built mpiblast. I worked on helping to package it and automate builds of it.

Paraphrasing Newton, we’ve seen further because we stood on our predecessors’ shoulders, as they built the platforms that we could stand on.

This isn’t to minimize what was done. Sort of like the history of the “discovery” of the FFT. Seems to have been “discovered” a number of times. I find that amusing to some degree, but the history of scientific advancement is often composed of half forgotten and half remembered things. Quaternions anyone? Maxwell’s equations in Quaternion representation are a single equation. Not to mention their applicability to special relativity Lorentz transformations …

Build debugging thoughts

The toolchain that we use for providing up to date and bug-reduced versions of various tools for our appliances has a number of internal testing suites. These suites do a pretty good job of exercising code. When you build Perl, and the internal modules and tools, tests are done right then and there, as part of the module installation.

Sadly not many languages do this yet. I think Julia, R, and a few others might. I’d like to see this as part of Python and other tools.

There is also a strange interaction between gcc 4.7.2 and Perl 5.20.2. If we use any optimization level higher than none, one of the test cases fails.

This isn’t a Perl issue per se. It works well with the 4.9.x compilers, and some others. I’ve not yet tried it with clang/LLVM, but should (if I ever get the time).

What I am thankful for are these code builds with the testing. I can see the failure, and have a good concept of what it is, not where it is. Had I more time, I’d see if I can work around the specific code that gcc 4.7.2 is mis-generating. But it’s easy to use -O0 for now, and not worry about it. I have bigger fish to fry.

I’ve had to work around some pretty insane compiler-language bugs in the past with all manner of interesting parsing errors that only showed up in specific compilation cases.

Since I drive my builds with a makefile, and capture all the output, it’s pretty easy to see what failed. I’ve been meaning to set up a Jenkins CI system in-house, and make the process even more asynchronous, but I find that being able to see the builds in real time sometimes helps me.

So I let them crank off on the side in a window, while I work on other stuff. That way I can get my iterative work done, while remaining quite productive.
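
As a rough illustration of the capture-and-scan approach described above (watch the build live, keep a log, then look for what failed), here is a minimal Python sketch. It is not the actual build tooling: the function names, the log path, and the failure markers (`FAILED`, and make-style `Error N` / `***` lines) are assumptions chosen for the example.

```python
import re
import subprocess
import sys

# Assumed failure markers: test-harness "FAILED" lines and make's
# "*** ... Error N" lines. Adjust for whatever the real build prints.
FAILURE_PATTERN = re.compile(r"FAILED|Error \d+|\*\*\*")


def run_and_capture(cmd, log_path):
    """Run cmd, echo its output live, and save it to log_path.

    Returns (exit code, list of output lines).
    """
    lines = []
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # real-time view of the build
            log.write(line)         # full capture for later digging
            lines.append(line.rstrip("\n"))
        returncode = proc.wait()
    return returncode, lines


def find_failures(lines):
    """Return (line number, text) pairs for lines that look like failures."""
    return [(n, text) for n, text in enumerate(lines, 1)
            if FAILURE_PATTERN.search(text)]
```

For a make-driven build, something like `run_and_capture(["make"], "build.log")` followed by `find_failures(lines)` would give a short list of places to start looking, while the log keeps the full context.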

[Update] I tried clang/LLVM and it worked (and was very fast). But the issue for the moment is the size of the ramdisk for the appliance, and adding another compiler runtime toolchain is going to make this larger. So this will take more time to correctly study.

I did note that 5.22.0 was released, so I grabbed that. Seems to build/test properly, and I am not getting the errors I was getting with 5.20.2. Sort of a meta-debugging … I am not digging into why I was getting the errors, bumping to 5.22.0 looks like it solves the build problem.

Insanely awesome project and product

This is one of Scalable Informatics’ FastPath Unison systems, well, the bottom part. The top are clients we are using to test with.

Each of the servers at the bottom is a 4U with 54 physical 2.5 inch 6g/12g SAS or SATA SSDs. We have 5 of these units in the picture. And a number of SSDs on the way to fill them up. Think 0.2PB usable of flash. Distributed in a very nice parallel file system we work quite a bit with.
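
For a sense of scale, the figures above can be turned into a back-of-the-envelope number. This sketch assumes every slot is populated and that PB and TB are decimal units. The per-slot figure is only what the quoted totals imply, not a drive specification.

```python
UNITS = 5              # 4U servers in the picture
SLOTS_PER_UNIT = 54    # 2.5 inch SSD slots per server
USABLE_PB = 0.2        # quoted usable flash capacity
TB_PER_PB = 1000       # assumption: decimal units

total_slots = UNITS * SLOTS_PER_UNIT
usable_tb = USABLE_PB * TB_PER_PB
usable_tb_per_slot = usable_tb / total_slots

print(f"{total_slots} SSD slots, {usable_tb:.0f} TB usable, "
      f"about {usable_tb_per_slot:.2f} TB usable per slot")
```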

The network (not shown, ignore the cat6 spaghetti on the sides … need to talk with the team about this) should be some bloody fast stuff that lets us drive the servers at or near their theoretical max bandwidth … that is, it’s very well matched to these units.

More soon. This is just insanely exciting stuff. Capability class to an insane degree.

Playing “guess which wire I just pulled” isn’t fun

Even less fun when the boxes are half a world away.

Yeah, this was my weekend and a large chunk of today.

This will segue into another post on design and (unintended) changes in design, and end user expectations at some point. It’s hard to maintain a concept of an SLO if some of the underlying technology you are relying upon to deliver these objectives (like, I dunno, a wire?) suddenly disappears on you. Or even more interestingly, when someone needs something (also like a wire), sees it connected to your box, and decides to take it.

There is a reason we do what we do, and a reason we do it the way we do it. I am (continually) blown away by the “but you don’t need X here, we’ll provide it for you”, as when we get there, we discover that no, they really can’t provide it for us, and yes, the system design requires that to function.

This is when, to steal what a customer once opined here, we resort to cowboy engineering. Or to put it another way, when you are up to your ass in alligators, it’s sometimes difficult to remember that the objective is to drain the swamp. But success is defined only in terms of draining the swamp, not the number of alligators you have to overcome. Sometimes (ok, often) the alligators are self-imposed … and that’s even more exasperating.

There is a reason we do what we do, and why we do it the way we do it. It’s not to sell more kit. It’s to deliver functional extreme performance, and manageable systems.

Off to class now … need a break.
