Cloudy expectations for HPC

I’ve mentioned in the past cases where user expectations deviated, often wildly, from the reality of a system. The reasons for these deviations could be internal (convincing yourself that “instant” means, literally, “instant”), external (believing marketing blurbs), or some mix of the two.
At HPCinthecloud, there is an article on a user running head first into the reality of cloud computing, one that avoids the hype.
Ok, a number of critical take-aways. One is that end user expectations can be wildly … badly … out of sync with reality. I am not criticizing the user here. Just noting that a 2ms window to start a new node, and have it provisioned and operational, is … well … wishful thinking at best.
Boot time scales (ignoring provisioning) are on the order of 5-60 seconds depending upon the nature of the hardware (physical/virtual) to get to an OS load prompt. OS load itself could take from 5s to a few minutes.
Notice that I haven’t talked about provisioning here, as it is ancillary to the process. Really, provisioning should be done very infrequently, and should not be part of an “inner loop” in a load-balancing cycle. You want to boot pre-configured images or bring nodes out of hibernation.
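To put rough numbers on that gap, here is a minimal back-of-the-envelope sketch. It uses only the ranges quoted above, with one loud assumption: "a few minutes" of OS load is taken as 180 seconds for the upper bound. The function names are made up for illustration.

```python
# Back-of-the-envelope: how far is a 2 ms expectation from real node start times?
EXPECTED_S = 0.002          # the 2 ms window the user hoped for

BOOT_S = (5.0, 60.0)        # time to reach an OS load prompt (range from the text)
OS_LOAD_S = (5.0, 180.0)    # OS load; 180 s is an ASSUMED stand-in for "a few minutes"


def startup_range(boot=BOOT_S, os_load=OS_LOAD_S):
    """Return (best, worst) seconds to a running OS, ignoring provisioning."""
    return boot[0] + os_load[0], boot[1] + os_load[1]


def shortfall(expected_s=EXPECTED_S):
    """Return (best, worst) ratio of real start time to the expected start time."""
    best, worst = startup_range()
    return best / expected_s, worst / expected_s


if __name__ == "__main__":
    best, worst = startup_range()
    lo, hi = shortfall()
    print(f"start time: {best:.0f} s to {worst:.0f} s")
    print(f"that is roughly {lo:,.0f}x to {hi:,.0f}x the 2 ms expectation")
```

Even the best case is thousands of times slower than the expectation, and that is before any provisioning is counted.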

Note also some of the other issues with the specific instance of the cloud (Microsoft’s Azure).

Debugging and profiling – Although Windows Azure programs can be developed and debugged locally, Azure’s architecture does not support remote debugging. This might be a problem to develop and deploy complex applications on Azure.

That one is a show stopper IMO.
We can do remote debugging on Linux clusters very easily with a (wide) range of tools. Such capability is a mandatory part of any HPC system. If you can’t debug on it, it really isn’t worth your time (apart from exceptional circumstances where it could be game changing) to expend any more resources on it.

Like the traditional HPC platforms, light-weight profiling tools will be very useful for analyzing and tuning performance, which are still missed for the most current cloud computing platforms.

This isn’t true in a general case. I am thinking at this moment that the user is, in fact, projecting Azure issues out as a general case.
A remote Linux cluster is a Linux cluster regardless of whether it’s physical or virtual. Profiling applications on Linux is pretty easy. Profiling the underlying platform hardware in the virtualized case is a problem for the virtualization platform, and comes down to exactly how much information its operators wish to expose. That’s not an Azure problem; that’s a problem in general with any virtualized system.
HPC on virtualized systems, with hard latency bound requirements for good functionality, may not be the wisest choice for a platform, Azure or otherwise.
Note that I’ve heard from quite a few customers in the last few weeks, who have been using any number of the virtualized HPC infrastructures around, that performance, in general, sucks. Their words, not mine. Actually, in a number of cases, their words were … ahem … a little more … colorful.
I do expect over time that this will get better, or that virtualized controls on HPC will allow for better scheduling so as to remove some layers of the pain. But for now, for a pretty wide range of workloads, virtualized HPC still ain’t quite there. Whatever the article author expected, end users expect HPC in the cloud to be just like HPC in the data center. Except cheaper per run for fewer runs, and about as fast.
But we ain’t there yet. We need to go in with realistic expectations on what cloud will get us. The mad rush in will likely lead to a trough of disillusionment … as I am seeing the onset of something like this now.