Several years ago, before clouds were all the rage, we were working with a large customer discussing an “on-demand” HPC service. This service predated Amazon’s setup, and was more in line with what Sabalcore, CRL and others are doing.
I remember distinctly from my conversations with the customer that they had particular desires. Specifically, they wanted to always run on the latest/greatest/fastest possible hardware, and not pay any more for this. A new CPU from Intel or AMD available? Sure, we had to have it fast, in our systems, for them to run on. And it couldn’t cost more than the existing systems.
Fast forward to today. A customer leveraging our small internal cluster isn’t happy with some of the CPU performance. Again, I have a sense that their expectation, and frankly, most customers’ expectations, are that they will get the latest and greatest for real cheap, as soon as it is available.
Moreover, they (current customer) were having issues with the queuing system. Smaller runs ran fine, but the bigger run had issues. It’s obviously an environment interaction between their code and the scheduler. One that is resolvable.
But … and this is critical, they don’t want to “waste time” debugging it. They just want to run. So they pushed the job scheduler out of the way, turning a shared resource into a dedicated resource.
Looking at our partner’s machine offerings, their systems are actually below our internal cluster’s specs, often significantly. Our system is meant to be the smaller-run system that people can use to start to understand their jobs and do smaller runs before graduating onto the bigger systems at Sabalcore, CRL, or Amazon. We are working on (extending) some (of our older) tools to make this very simple.
For the economics of HPC in the cloud to work, you need to be at or below the price performance knee for hardware, where you can minimize the system cost while maximizing the cycles used. You may be able to charge more for faster cycles, but in general, most people don’t want to pay for much faster. They want faster and cheaper (which is why JackRabbit is such a popular storage unit, and why siCluster is rapidly gaining in popularity for cluster storage).
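As a rough illustration of the price/performance knee, here is a minimal sketch that picks the hardware option with the lowest cost per unit of performance. The option names, prices, and relative performance figures are all made up for the example; they are assumptions, not real quotes or benchmarks, and a real decision would also weigh power, density, and utilization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuOption:
    name: str
    price: float        # hypothetical price per node, in dollars
    performance: float  # hypothetical relative throughput (baseline = 1.0)


def cost_per_performance(option: CpuOption) -> float:
    """Dollars paid per unit of relative performance."""
    return option.price / option.performance


def pick_knee(options: list[CpuOption]) -> CpuOption:
    """Return the option with the lowest cost per unit of performance."""
    return min(options, key=cost_per_performance)


# Illustrative numbers only: the fastest part costs disproportionately more.
OPTIONS = [
    CpuOption("budget", 3000.0, 1.0),
    CpuOption("midrange", 4000.0, 1.6),
    CpuOption("top-bin", 7000.0, 2.0),
]

if __name__ == "__main__":
    for opt in OPTIONS:
        print(f"{opt.name}: {cost_per_performance(opt):.0f} dollars per unit")
    print("knee:", pick_knee(OPTIONS).name)
```

With these assumed numbers the top-bin part is the fastest, but the midrange part is where the cost per unit of work is lowest, which is where a provider selling cycles would want to sit.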
There is an “I need it now” mantra for on-demand customers. Unfortunately, we’ve seen multiple instances where they aren’t willing to spend time to make it simple … either their application software has problems and complexity that must be dealt with (true of the current situation) or something else conspires to make it very hard to run their code at or near optimal performance.
We were explicit about performance guidance. There are limited things we can do to help, and help is what we are trying to do.
The sense I get, having dealt with multiple customers on this, is that HPC in the cloud isn’t nearly as easy as anyone would like. Code is implicated in some of this, as are workflow processes.
But expectations need to be set, and managed, in a realistic manner. The day Intel releases Westmere X5690s, they will not magically appear in all/most/some HPC cloud infrastructure. We see HPC infrastructure with 2- and 3-year-old chips, RAM, and InfiniBand.
Cycles are cycles; some are more efficient and faster than others. There is a market for this, but customers are rarely willing to pay more for the faster cycles. This is fallout from Moore’s Law: they expect hardware to drop in price over time, and to get faster over time. Unfortunately, the people who implement these services need to amortize the costs across many runs on hardware that will age 3+ years before being replaced.
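To make the amortization point concrete, here is a back-of-the-envelope sketch of cost per core-hour as a function of how long the hardware stays in service. The formula is a simplification, and every input (hardware cost, yearly operating cost, core count, utilization) is a hypothetical number chosen for illustration, not data from any real provider.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_core_hour(
    hardware_cost: float,
    yearly_operating_cost: float,
    cores: int,
    service_years: float,
    utilization: float,
) -> float:
    """Break-even price per core-hour over the hardware's service life.

    utilization is the fraction of available core-hours actually sold.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if cores <= 0 or service_years <= 0:
        raise ValueError("cores and service_years must be positive")
    total_cost = hardware_cost + yearly_operating_cost * service_years
    billable_hours = cores * HOURS_PER_YEAR * service_years * utilization
    return total_cost / billable_hours


if __name__ == "__main__":
    # Hypothetical cluster: $100k of hardware, $20k/year to run,
    # 256 cores, half of the available core-hours sold.
    for years in (1, 2, 3):
        price = cost_per_core_hour(100_000, 20_000, 256, years, 0.5)
        print(f"{years} year(s) in service: ${price:.4f} per core-hour")
```

Under these assumptions, replacing the hardware every year more than doubles the break-even price compared with keeping it for three years, which is why a provider cannot swap in new chips the day they ship without charging more.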
I see these issues as being at odds with each other, and certainly causing conflict for users.
If you can’t spend the time to make it painless, it won’t be painless. If you expect to get the fastest CPUs and largest RAM, you will be disappointed.
I get the sense that people are looking at this as a silver bullet. It’s not. There are no silver bullets.
You never solve a problem by making a decision. You simply exchange one set of problems for another, in the hope that the new set are easier to deal with and manage.
What I mean by unrealistic expectations is that I think people might assume that with HPC in the cloud, they will have no problems. Acquisition, installation, etc. are all handled. Their code will just run, without issue. And it will be much faster than their internal systems. Running the latest processors, the fastest InfiniBand, the most memory.
You still have to fight the code working with the system and generating good results, quickly. You have to commit the time to make this happen. Or you will have pain.
We try to set expectations correctly. We really do. But there are no silver bullets, and systems will have issues with code. Or bugs. Or not be optimally configured for something, or have some incompatibility. Or …