Good article, with tangential relevance to HPC

This was linked from Drudge or one of the other sites. Some of the article’s writing is a bit on the biased side, and there are some things I don’t quite agree with.
However, the thrust of the article (ignoring the title and other elements) is summarized in the last few paragraphs.

“Enterprises and individuals must recognise and adapt to these fundamental economic changes. We believe that those with a fossilised frame of mind risk being marginalised.”
In a world in which we are no longer masters, it is a warning that we ignore at our peril.

Yes. Absolutely.
You sink, or you swim.
In HPC, the markets are growing rapidly. And they are shifting rapidly. India and China are consuming more processing power. I don’t expect this to abate, but to accelerate.
Markets are creatively destroyed. Something better-faster-cheaper comes along, and it starts to dominate. Domination is not forever, and unless you also creatively destroy your own products, and expand your own markets, you are going to be eaten by your competitors that do.
We see this across many markets. It happens again and again and again. Anyone betting otherwise is quite likely a sucker.
Look at how HPC has changed over time. Supers in the MFLOP range, in the US. Supermicros in the GFLOP range globally, which destroyed the market for the MFLOP-level supers, while growing the market size 10x or more. Clusters in the 10 GFLOP to 500 TFLOP range, which destroyed the market for the 10 GFLOP supermicros, while growing the market size 10x or more.
See the trend?
There were maybe 10 supers in the world in the 80s. There were about 100 in the early 90s, and this blossomed through several hundred to low thousands in the late 90s. There are tens of thousands of clusters in use now. Rocks, one of the most popular cluster systems, has well over 1000 registered clusters, and more unregistered.
Creative destruction happens again and again and again.
So what is going to replace or augment clusters? The marketeers tried doing it with “grids” in the early 2000s. Grids are effectively clusters of clusters. Clouds are rebranded grids. Hopefully with less marketing drivel. Current marketing/journo speak talks of 4-5 large clouds with everyone using these.
Honestly I think that this may be wishful thinking. Clouds are useful and will be useful, but having power at your desktop and laptop is very important. I don’t expect to see clouds replace clusters, I expect them to be more of an offload engine. That is, unless the costs start working very much in the favor of clouds. I am not sure we are there yet. Cloud owners need to make money from them, and they need a model that works. Cloud users will likely use it if the economic cost of setting up their own or borrowing their own is larger than the cost of using the remote system. And all of this also depends upon cheap/fat pipes. I am not convinced we are there yet, but I think we may be getting closer.
This is where accelerators fit in, as they can provide significant power at the desktop/laptop. For a reasonable cost. And can be incorporated into clouds.

6 thoughts on “Good article, with tangential relevance to HPC”

  1. I am not so sure. I know of too many examples now where people are making money from using the “cloud”. If I had to start a company today, I would use a dedicated cluster or a set of accelerated hardware for certain specialized tasks, but a large chunk could easily be farmed out. It’s cheaper, and most importantly efficient.

  2. I agree lots of things could be farmed out … not all tightly coupled HPC jobs, but a good number of non-tightly coupled could be.
    The business model on the cloud side is the hard part. They need nearly full capacity/utilization to make money. Empty cycles are lost money … because hardware/infrastructure isn’t free.
    It was almost impossible to do this in the past. Looking at the economics of it, it is possible now.
    Take a $3000 server, amortize its cost over 3 years, in a simple linear manner. You need to make $1000/year on it to break even. For the server, this is about $0.114/hour cost neglecting power/cooling costs (assume admin cost == 0). You need to be able to charge more than that per server to be able to make money. Assume your power costs are $0.10/kW-h; a 500W server running for an hour would consume 0.5 kW-h, or $0.05/hour to run. Assume your cooling costs are about the same. So now you need to make ~ $0.214/hour with power + cooling per server.
    Your yearly costs (amortized cost + power/cooling) are about $1876 per server.
    Now with 25% utilization (guesstimate), this is 2190 hours per year. For break even, you would need (in this model) about $0.86/hour price to your customer.
    If you go to you see my pricing model isn’t that far off (look at the extra large instance, which is basically a dedicated machine).
    OK, where am I going with this?
    Compare the yearly costs of running a cluster with the yearly cost of using this.
    The remote system would have that $0.86 + margin, while your costs are independent of utilization (they are sunk costs). Your ROI is a function of your utilization, not your costs, which are fixed at your amortized rates ($0.214/hour).
    I am not saying that the model doesn’t work, I am saying that the cloud computing vendors need to drive utilization, and drive volume. Amazon does look like they are doing it right. The small instance is liable to be their best seller, and would likely have a significant volume relative to their large instance (a guess).
    This said, I don’t think anyone is claiming it is cheaper than setting up a local resource right now. The question is how to get this resource less expensive than the internal resources. I don’t have an answer to this … Google may be one of the few that can do this by virtue of building their own systems. Maybe Amazon does something similar.
    Where the value is, IMO, is in the ease of use of running what you need. This is where Amazon does it completely right. You set up what you want and need. No one is forcing a particular OS/config on you.
    If it sounds like I am impressed with the Amazon service, this is correct. I am.
    But more than that, as the desktops grow more and more powerful, more localized computing will likely move to them. The incremental price jump to get to a localized cluster may be higher than some folks are willing to pay (always that cost-benefit analysis), and they would love to work on the cloud.
    The problem is commercial software licensing. Well, this appears to be a significant issue for a number of vendors. Their license terms preclude running on such things.
    My point is that I believe the cloud is a natural progression of where we are now, but we still need those fat data pipes, and lower cost remote computing for it to get very common.
    First, before the cloud, I want the fast data pipes. I want 1 Gb/s to my office. I need it. I just don’t want to spend 20x DS3 pricing to get it.
    High bandwidth data pipes will really enable this. We need them.

  3. Something else to consider at the mid- to high-end of HPC in particular is the cost of infrastructure.
    I think the discussion so far assumes that you have the power and cooling available to run a system of the size you need/can afford. If you don’t, you might be stuck trying to find funding to build out your infrastructure. There are very long lead times on large infrastructure builds, they can be very expensive and frequently must (in the government, anyway) be funded with different sources of money, and are usually under the control of a different business unit from those responsible for compute. In the federal government, this last point manifests as the need to get an actual act of Congress (!) for permission to build out the infrastructure to host even moderately large systems.
    Medium to large HPC procurements in the government are funded with money that must be obligated in a fixed timeframe (the same is probably true with a different vocabulary for businesses). When infrastructures cannot be built in that timeframe, or cannot be funded from sources fundamentally separated from the hardware funding lines, I think customers may start looking to hosted HPC solutions for reasons aside from a pure cost of cycles perspective.
    Joe, I think your utilization argument is right in terms of making the economics of “cloud” provision work, and mid- to high-end HPC is ideally suited to make that model work. Not from the point of view of individual users buying cycles for their codes, which probably wouldn’t work at all in science, but from the point of view of large programs (DOD, NSF) buying blocks of cycles from providers for their users. These programs would pay for X processor hours per year at Y availability, and assume responsibility for allocating those hours to users and establishing program incentives for timely use.

  4. Joe/John
    Great comments. Obviously, I haven’t looked into the question at that level of detail, but those are the kinds of decision points that one will need to evaluate to come to a discussion, and it all comes down to the level of utilization at the user end.
    What I really like about what Amazon is doing (and that’s where trust comes in), is that their value goes beyond just hardware and computing costs. As John points out, there is the cooling, the room, the “muck” as the Amazon folk like to say. Already companies like mine are increasingly looking at co-location facilities, or vendors like Rackspace or Joyent rather than build out capacity. The “Cloud” just takes that one step further in how you access resources.
    Government grants and software licensing paradigms also need to change. Back in a past life we were taking a strong look at utilization-based licensing. E.g. if you were running MD code for x # of CPU hours, how much should the value of each cycle be, questions like that. In the academic setting, that needs to be incorporated into the grant process, especially as IBM and the likes start thinking about providing supercomputing resources using on-demand models.
    Personal note: When I was a grad student, I often had to wait weeks to get some time at SDSC or ANL, but if the funding model changed and we had access to cloud resources, a huge barrier would go away. It also supports my idea that you shouldn’t be using a Blue Gene for just larger versions of what you do normally. Farm that out somewhere else and use specialized resources for specialized tasks.

  5. @Deepak
    Speaking of grad school … I remember submitting runs into the PSC machines (early 90s) and having them sit there a week. The folks at Ford Motor and others had queue depths typically measured in time periods of days/weeks.
    I agree that the models need to change. See the new posting on Clouds for more comment/content.
    I suspect we are all likely in agreement (with minor detail differences). The issue is how to get the powers that be to make the changes that need to be made. Some of them have a vested interest in showing long queue depths, as it shows their value and the demand for their product (cycles) being high (artificially, due to constraints on the supply of cycles).
    We are still trying to work with ISVs on the pricing model; most are not grasping it. We have one engineering vendor that did ask us 2 weeks ago to provide this sort of resource though, so I have hope.
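
The break-even arithmetic in comment 2 is easy to rerun with different assumptions. The following is a minimal sketch in Python, not anyone's actual pricing tool: it sums the three stated hourly components (linear amortization, power, and cooling equal to power), then divides by utilization to get a provider's break-even price. The function names, the `cooling_factor` parameter, and the $1.00/hour cloud price in the usage lines are hypothetical, introduced only for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def hourly_cost(server_price=3000.0, years=3, power_kw=0.5,
                price_per_kwh=0.10, cooling_factor=1.0):
    """Cost per wall-clock hour of keeping one server running.

    Linear amortization plus power plus cooling, with admin cost
    assumed to be zero. cooling_factor=1.0 encodes the assumption
    that cooling costs about the same as power.
    """
    amortized = server_price / years / HOURS_PER_YEAR
    power = power_kw * price_per_kwh
    cooling = power * cooling_factor
    return amortized + power + cooling


def break_even_price(cost_per_hour, utilization):
    """Price per *sold* hour needed to cover costs at a given utilization.

    The server costs money every hour of the year, but only the
    utilized fraction of those hours brings in revenue.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_hour / utilization


def break_even_utilization(cost_per_hour, cloud_price_per_hour):
    """Utilization above which owning beats renting at the given price."""
    return cost_per_hour / cloud_price_per_hour


if __name__ == "__main__":
    cost = hourly_cost()
    print(f"cost per hour:          ${cost:.3f}")
    print(f"cost per year:          ${cost * HOURS_PER_YEAR:.2f}")
    print(f"break-even price @25%:  ${break_even_price(cost, 0.25):.2f}/hour")
    # Hypothetical cloud price of $1.00/hour, for illustration only.
    print(f"own-vs-rent crossover:  {break_even_utilization(cost, 1.00):.1%}")
```

The last function restates the point that ROI on owned hardware depends on utilization: the hourly cost is fixed, so owning only wins once your own utilization exceeds the ratio of that cost to the remote price.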
