More than a year in, and where are they now?

It’s 2-January-2012, and assuming the Mayans were wrong (ok, technically I’ve not heard of any suggestion they did anything more than stop their calendar on a convenient-for-them boundary), an interesting question is: what has happened to the company-formerly-known-as-Sun’s HPC assets?
Lustre is one of the most well known, and it now has some type of future ahead of it. I’ll talk about that in a later post. This future was most definitely not assured 1 year ago, and there was considerable uncertainty in its longevity as Oracle had, about a year ago, let go most of the developers.

Obviously, WhamCloud and a few others picked up whom they could, and Lustre does now appear to have a roadmap going forward, and someone to back it. But … and this is a large “but”, are we going to see a repeat of the Oracle scenario? I am not doubting Brent Gorda’s skills in company management, nor WhamCloud’s revenue generation. My question is, is this technology going to fall prey to a specific company’s whims? This is something I was worried about last year at this time, eventually worrying that there would be 2-3 different Lustre trees, and that business imperatives would drive decisions which could cause conflict. For expressing these concerns, I was hammered pretty hard in public and in private. But I was still right in expressing the concerns.
Is this still a concern? Somewhat, as we don’t know how WhamCloud will evolve as a business. It needs to be viable, and growing and strong for this to be not a concern. And dare I say, Brent and the rest of the WhamCloud team would very likely indicate exactly this in public. Which is their job.
But there is consolidation going on in the industry. Gluster was just acquired in Sept 2011 by Red Hat and it is the new Red Hat Storage product. Something tells me they will be less interested in either helping to integrate Lustre in (as it would overlap a little with their product, and Red Hat has a long history of not supporting, and often opposing or ignoring, competitive technologies) or helping people run Lustre.
Cray came out with a product in the Lustre array space. Depending upon how this market grows for them (disclosure: this is directly competitive with a portion of our siCluster offerings, so take what I say with an appropriate grain of salt), Cray may wish to own more of the process/IP around the Lustre file system. Though I doubt it for a number of reasons.
Similarly, other vendors have either boned up or outsourced their Lustre offerings. HP stopped its own Lustre bits in favor of working with DDN a few years ago. IBM has been pushing GPFS. Dell has been pushing Lustre via the older GUI from Terascala.
I should point out that a brand new GUI has come out from WhamCloud (and one of the reasons why I think they are potentially viable long term is that they are innovating around the stack). The effect upon Terascala and others has yet to be determined, but I believe that Dell et al. will probably be switching soon.
DDN is under attack from Netapp/LSI as IBM kicked them out of their portfolio pre-SC11. Fujitsu has built a platform atop Lustre, and it’s pretty interesting. Netapp/LSI is building up its Lustre portfolio.
I think what we are seeing is a realignment in the market. It’s not a consolidation yet; there are no current bloodbaths going on with respect to pricing competition. It’s more a gradual tectonic shift. Unlike with Gluster, I don’t perceive Lustre as being fundamental or crucial to any of these vendors (save WhamCloud and Terascala, who are now actively competing).
Longer term, I don’t expect to see Lustre “in the kernel” as their changes are fairly intrusive, and they require quite a bit of user space support. It might be possible for them to get the “patchless” kernel integrated into distros (this would be a win). But they really … REALLY … need to fix the build system, and reduce the amount of the kernel they modify, as it increases the likelihood of problems going forward.
So, in summary on Lustre: it has a future, and the uncertainty has been taken down a few notches. The ecosystem around it has been shifting and changing, and I am not sure how it will evolve over time. There are slight concerns as to how to deal with the project in the case of a “business event”, which I am hoping are addressed (similar issues with many projects, which do not have a long term plan to address the “what happens if the sponsor goes away, decides to become a pretzel company, …”). Call it a long term contingency plan. If the need for this is as high as some claim, then we will see this emerge. OpenSFS was a step in this direction, and I’d argue that it should probably take the leadership role, and make sure it has copies of bug reports, testing, code, … to make any such changes later on if needed. But this is a minor nit.
Lustre has survived. I am cautiously optimistic that it will thrive, but I don’t see it being absorbed by any of the big boys right now. I could be wrong, and mebbe they already have a signed purchase agreement (which would be nice). But barring that, Lustre has dodged some very nasty bullets for the moment. There is still the IP ownership issue (rests with Oracle as I remember). But it’s open source, there is a leader, and it will go on without more than the single fork it needed to go on.
GridEngine? Maybe not as much as we and many others had hoped.
Last year at this time, GridEngine was in an uncertain state. We (the community) knew something was afoot, and we (my business) understood that Oracle as a business needs to make money from it. So we were expecting changes.
We didn’t quite expect what happened.
Oracle closed the source. Well, updates stopped being free (which is fine), and the cvs/svn access was disabled. And the community site shuttered.
Ok, that last part was unexpected.
I wouldn’t have minded paying a bit per node for the software.
A few years ago, some vendor or the other was trying to convince me that $1k USD per socket or per node was acceptable for users to pay for its software, whilst most of the users we’d spoken to had said, no more than $50 USD/node.
For those marketeers who don’t know, a cluster is not a license to print money, or to force high fees. It is an opportunity to leverage “bulk” pricing models, and woo customers to your technology if you price it appropriately. We see many … MANY
All this said, I don’t think Oracle’s pricing model for GridEngine is terrible (compared with some I’ve heard). But its approach to things pissed off its community of users, some of whom had been paying customers, and were pissed off at the changes.
I won’t describe the goings on in great depth. Let’s just say that there are currently 3 “open source” Grid Engine variants out there, not completely compatible, with varying amounts of documentation. The grid engine email list sometimes has these folks butting heads. No one wants to give up their fork and work with the other projects.
And there are at least 2 commercial variants as well, one from Univa and the original one from Oracle. The Univa folks appear to be marketing against the open source projects, and in the process, alienating them.
So Grid Engine is currently a mess. Best case scenario is that the 3 open source projects merge, and Univa stops working against and starts working with them.
I put the probability of that happening as somewhere between zero and epsilon (for very small positive values of epsilon).
I am not sure if an open source grid engine project will survive, to my dismay. We’ve rejiggered our internal projects to be open to using Torque, and I am looking at Lava. I’d like to avoid building my own, but most of these projects have (often significant) failings, were designed/architected for an era long gone, and are really not cloud aware. Meanwhile, with a little work, I can turn DragonFly Engine into a fairly robust fault tolerant and massively scalable engine for clouds. This has not been lost on me. Especially as I can integrate it directly into Tiburon. Integrated job scheduling and OS load as a mere detail in the job, with no OS/machine config needed, working in public and/or private clouds.
So call Grid Engine a failed transition at this time. I am not happy calling it that, as we’ve been users of the technology for a while, and have sold our share of systems using it. But its future is anything but certain.
This is what I had feared Lustre would devolve into BTW. I am glad it didn’t, as unless there is a single emergent community build, Grid Engine is likely to barely survive at best, and go out with an ignominious whimper at worst.
Then there is OpenOffice.
This is what one might call an abject failure. Apache now has this project and is promising releases this year. LibreOffice has pretty much replaced it everywhere that matters though, so I doubt that this is going to matter.
There are many other technologies …
MySQL: Losing out badly now to the other forks
Java: seems to be going but growing less relevant over time
VirtualBox: We had started using it a year before the acquisition. Looking to transfer to kvm over the next year though.
Sparc: Great DB CPUs, lousy FP CPUs … unless you get the Fujitsu ones, which have always been at least a little bit interesting (considering they are the basis for the K machine ranked #1 on the top 500, yeah, they are interesting).
The Oracle takeover of Sun has not been generally good for HPC, but that wasn’t Oracle’s goal. Understood in that context, the takeover was probably quite successful, and lower performing properties have been kicked to the curb or shuttered/wound down, or sold off.
What will this year bring? Likely more shifting in the market for Lustre. More Java releases. A gradual shift away from MySQL (could imagine it being kicked out as an Open Source project).

4 thoughts on “More than a year in, and where are they now?”

  1. Joe,
    Of the two schedulers, Torque and Lava, can you summarize the pluses and minuses from your perspective? (Disclosure – I “grew up” with PBS so I know it much better than anything else, which tends to cloud my opinion since Torque has worked with everything I’ve thrown at it, but I’m sure I’m missing something. But I’ve also used LSF for a short time so I know a little about it. Come to think of it, I’ve used GridEngine for a short time as well.)

  2. @Jeff
    Yeah, that would be interesting to compare. Pretty much all job schedulers out there have single centralized engines for pushing jobs out to the nodes.
    I am also thinking more about the massive distributed (at scale) scheduler concept. We already have the backend execution and notification engines working well (needed them for DragonFly years ago). The distributed scheduler part, I think we have a good handle on too.
    If only I had infinite time/resources to work on these things … 🙁

  3. Joe,
    You are obviously very interested in Lustre. Are you planning on joining OpenSFS, Whamcloud, and the rest of the Lustre community at this year’s Lustre User Group (LUG) in Austin, TX the week of April 23rd? We’d love to have your passion help drive the product forward. See for the Call For Participation.

  4. @Ian
    Last I saw, the costs associated with joining OpenSFS were a bit above what we would be willing to commit given the revenue it drives for us at the moment. I’d like this to change, and we can evaluate joining either when Lustre becomes a greater fraction of our revenue or the costs to join OpenSFS aren’t as large. This said, I am not averse to this, and would like to join. I just can’t come up with enough of a business justification for it at this time (cost benefit analysis).
    The LUG attendance is something I’d seriously consider. I am aware of it, and haven’t committed one way or the other yet. Last year, it was at the same time as the HPC on Wall Street conference (or within a week of it), and that takes precedence for us.
