HPC in the cloud and cluster distributions

Many things are moving to cloud hosting, and HPC is one of them. I won’t comment on whether that move is right or wrong. It does mean that cluster distributions are going to follow … or could follow, to some degree.

Some cluster distributions focus upon packaging, some focus upon flexibility, some focus upon GUIs.

All try to integrate some subset of the needed tools. But all were effectively designed for a cluster computing model, and some of the critical assumptions at the base of these distributions simply do not hold in the cloud. Because of the way the distributions work, those assumptions can’t easily be worked around.

The distros most affected are those that require control over the install process. Remember, the cloud provider already handles system provisioning, so tools that conflict with it or duplicate it waste effort. A better model would be to integrate the cloud provider’s API into the installer, so that firing up a 10 node cluster on demand is as easy as firing up a 1000 node cluster … and about as fast.
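
To make the idea of integrating the provider API into the installer a little more concrete, here is a minimal sketch in Python. Everything named in it is an assumption for illustration: `CloudProvider`, `launch_instance`, `FakeProvider`, and the image name are hypothetical and do not correspond to any real provider's API or to any of the tools mentioned in this post. The only point is that the node count becomes a parameter, and the launch requests go out in parallel from a prebuilt image instead of running an installer on each node.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Node:
    node_id: str
    image: str


class CloudProvider(Protocol):
    """Hypothetical provider interface; real provider APIs differ."""

    def launch_instance(self, image: str, index: int) -> Node: ...


class FakeProvider:
    """Stand-in provider used only for illustration; it launches nothing."""

    def launch_instance(self, image: str, index: int) -> Node:
        return Node(node_id=f"node-{index:04d}", image=image)


def provision_cluster(provider: CloudProvider, image: str, count: int,
                      max_parallel: int = 64) -> List[Node]:
    """Request `count` nodes from a prebuilt image, issuing calls in parallel."""
    if count < 1:
        raise ValueError("count must be at least 1")
    workers = min(count, max_parallel)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input indices.
        return list(pool.map(lambda i: provider.launch_instance(image, i),
                             range(count)))


if __name__ == "__main__":
    small = provision_cluster(FakeProvider(), "hpc-base-image", 10)
    large = provision_cluster(FakeProvider(), "hpc-base-image", 1000)
    print(len(small), len(large))
```

The caller writes the same line for 10 nodes as for 1000. Whether the two requests actually complete in similar time is up to the provider, not the installer.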

That last part depends more on how the cloud provider handles large allocations, and the two cases may not be close in speed … but it’s reasonable to expect that most customers don’t really want to pay for installation time and data transfer. Remember, you pay for everything in the cloud. So those distributions that focus upon loading an OS might have an issue going forward.

The sense I get from talking to users and customers is that they want their clouds to be, effectively, “instant on” devices, for some appropriate value of “instant on”. The EC2 cluster bits are close to instant on. But we have customers asking for Ubuntu and other OSes.

And, more interesting to us, they are asking for “instant on” for their clusters. We work with a number of tools for these things, including Bright Cluster Manager, our own Tiburon tool, Perceus, and others. Right now, only our own Tiburon tool handles this. We’ve used it for a number of clusters, and it works quite well. I’ve been thinking about how to adapt it to the needs of the HPC cloud APIs.

Bright Cluster Manager is a good solution, and we like lots of elements of its vision. It has some issues when used outside its set of supported distributions, but that’s understandable given its focus. Perceus is very good for its use case.

But we’ve been working on Tiburon, and using it as our base OS loading system in the office, as well as for our diagnostics, our support, etc. We have all the cluster support elements in it, and we are updating them continuously, since we now use Tiburon as an integrated component in our siCluster. Customers never see it, and generally speaking, they should never see this sort of component.

Our monitoring system is coming along nicely, and will be a separate component. We already have a job launcher (DragonFly) that works nicely in a cloud, cluster, or mixed context … if we combine it with an allocation/scheduling system … this could be really interesting in the combined cloud bits.

But it’s the instant on thing that’s bugging me. I think customers and users are starting to want stuff to just work, right away, with no extra headache. So which cluster distros can do this? Tiburon can, after a few updates.

And then there are job scheduler issues. Most require some sort of annealing time to settle down with new nodes. Time that the end user pays for. Dead computational time. Most job schedulers were not designed for the cloud. No … none of them were. Maybe it’s time to rethink this as well.
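
A back-of-the-envelope way to look at that dead time is sketched below in Python. The function names, the settling time, and the job length are all made up for illustration; none of them are measurements from any scheduler or prices from any provider. It only shows that billed-but-idle time scales with node count, and that it hurts most on short jobs.

```python
def settling_cost(nodes: int, settle_seconds: float,
                  rate_per_node_hour: float) -> float:
    """Cost of node time billed while the scheduler is still settling."""
    if nodes < 0 or settle_seconds < 0 or rate_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    idle_node_hours = nodes * settle_seconds / 3600.0
    return idle_node_hours * rate_per_node_hour


def settling_overhead(settle_seconds: float, job_seconds: float) -> float:
    """Fraction of the billed time that does no useful computation."""
    total = settle_seconds + job_seconds
    if total <= 0:
        raise ValueError("total billed time must be positive")
    return settle_seconds / total


if __name__ == "__main__":
    # Hypothetical inputs: 1000 nodes, 360 s to settle, 1.0 per node-hour.
    print(settling_cost(1000, 360, 1.0))
    # Same settling time in front of a hypothetical 3240 s job.
    print(settling_overhead(360, 3240))
```

With those made-up inputs, 1000 nodes that take 360 seconds to settle burn 100 node-hours before the first job runs, whatever the hourly rate happens to be.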
