Supercomputing as a Service: meet Eka

In this article, the author covers some of what CRL is doing with Eka (pronounced eh-kah).

There are some interesting points:

CRL's initiative is a pioneering effort, as it is the first time that a corporate institution is taking the lead in extending the domain of High Performance Computing (HPC) from the academic field to the enterprise. This is partly because large-scale supercomputing infrastructure has typically been owned by government institutions, and is largely not used to its full potential. Being research focused, these institutions have little inclination or capability to deliver it as a utility service.

Not sure I agree that it is the first time a corporate institution is doing this … others have been there before, and some are continuing, such as Tsunamic Technologies.

This said, the other point is very much on target.

Most of the government-backed or government-based HPC providers are doing so specifically to further their research. And there is nothing wrong with this.

All this means is that some of the decisions being made aren't being made with a bottom line or customer focus in mind. That is, they are not trying to minimize the cost per FLOP or per GB/s. But CRL has to, as it is a business.

This leads to very different HPC system designs, and very different charge-back models. To break even, you need the money coming in to completely offset the money going out. So, while buying capital equipment and hosting it may work well for some groups, for others it is harder.

CRL has succeeded because few organizations today have the financial bandwidth to afford a supercomputer, or because they are unwilling to invest in and pay the administration costs of maintaining one. In this scenario, the concept of offering computing power as a service has struck the right chord, as clients are happy to do this on a need basis at costs they can afford.

Basically, to break even, CRL simply has to make sure that utilization is high and that the cost per unit of time used is comparable to what people might pay to host the hardware locally. If they manage that, they win.

The trick in this business model is simply the level of utilization.
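To make that concrete, here is a back-of-the-envelope sketch of where the break-even point sits. Every figure in it is an assumption I am making up for illustration, not anything CRL has published:

```python
# Back-of-the-envelope break-even sketch for renting out an HPC system.
# Every figure here is an illustrative assumption, not CRL's actual numbers.

capex = 25_000_000.0          # machine purchase price (USD), assumed
lifetime_years = 4            # depreciation period, assumed
opex_per_year = 5_000_000.0   # power, cooling, staff, support (USD/yr), assumed
nodes = 1800                  # rentable compute nodes, assumed
price_per_node_hour = 1.20    # what a customer pays (USD), assumed

hours_per_year = 24 * 365
yearly_cost = capex / lifetime_years + opex_per_year
revenue_at_full_utilization = nodes * hours_per_year * price_per_node_hour

# Utilization at which the money coming in exactly offsets the money going out.
break_even_utilization = yearly_cost / revenue_at_full_utilization
print(f"break-even utilization: {break_even_utilization:.1%}")
```

With these made-up numbers the break-even point lands around 60% utilization: run above it and the service pays for itself, run below it and it loses money. The real numbers will differ, but the shape of the problem is the same.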

But they are approaching it quite sensibly, and have multiple business models for selling the resource:

To encourage more enterprises to start using the supercomputers on rent, CRL is offering the services through three options: a pay-per-use model, a fixed capacity model, and turnkey-based customized solutions. Krishna envisages percolating the concept of supercomputers on rent to small-scale enterprises, or even professionals who might want to use the processing power of a supercomputer for a specific period.

This is quite interesting. I wonder if this model could be successful in the US.
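Which of those options makes sense for a given customer mostly comes down to how steady their demand is. Here is a toy comparison of pay-per-use against a fixed-capacity contract; all rates and the usage pattern below are hypothetical, not CRL's actual pricing:

```python
# Toy comparison of a pay-per-use option against a fixed-capacity contract.
# All rates and the usage pattern are hypothetical, purely for illustration.

payg_rate = 1.50              # USD per node-hour on pay-per-use, assumed
fixed_monthly_fee = 40_000.0  # USD per month for a reserved block of nodes, assumed

# Hypothetical "bursty" demand: node-hours consumed in each month of a year.
monthly_node_hours = [5_000, 8_000, 45_000, 46_000, 44_000, 6_000,
                      5_500, 7_000, 43_000, 47_000, 45_500, 6_500]

payg_cost = sum(hours * payg_rate for hours in monthly_node_hours)
fixed_cost = 12 * fixed_monthly_fee

print(f"pay-per-use:    ${payg_cost:,.0f} per year")
print(f"fixed capacity: ${fixed_cost:,.0f} per year")
```

With a spiky profile like this the two come out close; a steadier, heavier load would favor the reservation, while a lighter or more sporadic one would favor pay-per-use. Either way, the customer avoids the capital and administration costs entirely, which is the whole point.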



2 thoughts on “Supercomputing as a Service: meet Eka”

  1. Interesting comments. I do like the comment that academics are unprepared to do this type of thing for commercial companies. I agree with this statement, with the caveat that there are a few universities that think like companies (but just a few).

    For a while I was seeing universities focusing on getting the systems up and working quickly and ensuring the systems stayed stable and that there was adequate security, etc. I was very impressed with the initiative they were taking. They started asking “enterprise” level questions.

    But when the economy tanked, all bets were off. I’m now seeing universities reverting to their old ways of wanting the cheapest thing they can find and then asking for it cheaper. I’ve seen a few cases recently where a university, in its drive to keep things cheap, has really screwed itself royally and ended up yelling at the vendor even though the vendor warned them (BTW, this isn’t just at the company of my day job).

    I’m seeing lots more of “Lustre can’t be hard, so I’ll just throw a bunch of grad students at it and they can get it to work.” The problem is, they can’t. A prime example of this was a professor who had a student set everything up; the student eventually graduated, and the professor couldn’t figure out how to make the system work! (I won’t discuss the ethics of using grad students as cheap labor for grunt work as opposed to actually having them do something useful.)

    Jeff

  2. @Jeff

    Yeah, there is a sense of “blood in the water,” and people are trying to drive down what they pay under the assumption that companies have infinitely elastic pricing (not true, not even remotely correct), and that if they push hard, they can get more for less.

    SGI (and LNXI to an extent) is what you get when this happens. Hopefully this will raise cautionary flags. Not likely, as few people consider the long-term implications of their actions in the grand scheme of things, just how they affect them locally at this moment.

    I’ve seen the vendor held accountable for the user’s/purchaser’s mistakes. We try to warn customers off really bad decisions, but it doesn’t always work. We’ve had people yell at us after they did something we warned them against.

    On the Lustre side … yeah, well … we are getting lots of things like that these days. Few seem to comprehend that Lustre really does require some (large) fraction of an FTE, and serious design considerations, up-front. There are many moving parts, and there is complexity that tends to bite them hard if they aren’t ready for it. On the other side, when I point this out in public fora, I am accused of Lustre/Sun bashing (with the inevitable DoS attempts at our mail system later on). It would be funny if it weren’t so sad.

    Clusters are not rack-em-stack-em, though we get lots of folks who want you to believe precisely this. A well designed cluster and HPC system, including storage, management, etc., will run well for years, with minimal downtime and administrator interaction. A poorly designed system will require constant interaction to shore up its weaknesses. Designing and implementing a good system is a non-trivial task. Designing and implementing a good system that can grow and scale well, in IO rates, computation rates, etc., is a very hard task, and from what we have seen, precious few know how to do this.

Comments are closed.