Cloudy issues

I need to get this out first and foremost. I do believe that cloud computing or similar is inevitable. It is coming.
I am also a realist. I know perfectly well that there are some fairly significant impediments to it.
The impediments are a mixture of technological deployment and business models. It's not impossible to do this given sufficient money. But some of the dependencies are simply too pricey to enable rapid cloud adoption, and I don't see this changing in the near term (next 3 years).
Ok … this is the short version of things. I can go into it in much greater depth, and I may. But not now.
HPCwire reports on some work going on to use the cloud. There are some very, very important messages in there for potential cloud users, something the hype has largely obscured (and something we worry about all the time).
Data motion. Or more precisely, the time and monetary cost of data motion.

I have been saying for the better part of a decade that data motion will be the major problem going forward. It is easy to move data quickly on a campus. It is hard to move data quickly between campuses, at a reasonable cost or a reasonable rate.

“The first question is how to best split up the process of DNA sequence analysis to fit these computer clusters,” Pop said. “The second is whether or not the benefits of cloud computing outweigh the costs of data transfer and storage.”

We don’t see enough of these cost-benefit analyses when people talk about cloud computing. Sure the remote resources are there and usable. But if you spend so much time or cost to move your data … is the low cost of the computing cycle still worth it?

The massive amounts of data generated by just one genome may take a significant amount of time to transfer over the internet. This, in addition to the data storage needed before analysis, might add costs that outweigh the benefits of using a remote computer cluster.

Yes, precisely.
The time to complete the calculation with the data is

T(total) = T(ingress) + T(compute) + T(egress)

T(compute) includes local data motion from local storage to RAM and back, network data motion within the cluster, and of course the computation itself.
If T(ingress) + T(egress) >> T(compute), then most of your cost is likely to be in the data motion. This time cost is easy to set bounds on: take the data volume and divide it by the best-case bandwidth. That gives you a lower bound on T(ingress) or T(egress).
Basically if you are moving gigabytes and terabytes, you are going to be bound by the site to site bandwidth.
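The back-of-envelope model above is easy to sketch in code. A minimal version, with hypothetical numbers (the 500 GB input, 50 GB output, 6 MB/s link, and 4-hour compute time are illustrative assumptions, not measurements):

```python
# Sketch of T(total) = T(ingress) + T(compute) + T(egress).
# All data volumes, bandwidths, and compute times below are hypothetical.

def transfer_time_s(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on transfer time: data volume / best-case bandwidth."""
    return bytes_moved / bandwidth_bytes_per_s

GB = 1e9  # bytes

# Hypothetical job: ship 500 GB in, get 50 GB back, over a ~6 MB/s link.
t_ingress = transfer_time_s(500 * GB, 6e6)
t_egress = transfer_time_s(50 * GB, 6e6)
t_compute = 4 * 3600.0  # assume 4 hours of actual computation

t_total = t_ingress + t_compute + t_egress
motion_fraction = (t_ingress + t_egress) / t_total
print(f"ingress {t_ingress / 3600:.1f} h, egress {t_egress / 3600:.1f} h, "
      f"compute {t_compute / 3600:.1f} h")
print(f"data motion is {motion_fraction:.0%} of total wall time")
```

With those assumed numbers, data motion dominates the wall time by a wide margin, which is exactly the T(ingress) + T(egress) >> T(compute) regime.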
And this costs money. 1 MB/s of bandwidth costs about a thousand dollars per month. At 1 MB/s, 1 GB takes 1,000 seconds, and 1 TB takes 1,000,000 seconds (roughly 11.6 days). A T3 still runs ~$5,000/month and gives you 45 Mb/s, or about 6 MB/s.
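The arithmetic above is worth checking explicitly, since the bits-vs-bytes conversion on the T3 line trips people up (45 Mb/s is megabits per second; divide by 8 for megabytes):

```python
# Transfer times at 1 MB/s, and a T3 line's effective byte rate.
MB, GB, TB = 1e6, 1e9, 1e12  # bytes

rate = 1 * MB  # 1 MB/s in bytes per second
print(f"1 GB at 1 MB/s: {GB / rate:,.0f} s (~{GB / rate / 60:.0f} min)")
print(f"1 TB at 1 MB/s: {TB / rate:,.0f} s (~{TB / rate / 86400:.1f} days)")

# T3: 45 Mb/s -> bytes/s (8 bits per byte)
t3_bytes_per_s = 45e6 / 8
print(f"T3 effective rate: {t3_bytes_per_s / MB:.1f} MB/s")
```

So even a dedicated T3 moves a terabyte in about two days, before you pay a dime for compute.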
So external clouds are being marketed at small as well as large companies. These only make sense if you can move the data once: pay the data motion cost, store the data at Tsunamic's site, or Amazon's, or CRL's, and then do all your operations there as well.
But that model … move the data there and let it rest there … isn't what is being pushed.
Cloud computing can work. It is effectively ASP v2.0 (if you don't know what ASP v1.0 was, don't worry, you aren't missing much). It's mostly there. The one thing that is missing to make it really work, to uncork the bottle and really let the djinni out … is low-cost bandwidth.
Which, curiously enough, would likely help create huge amounts of value, as you can have specialized clouds, and create markets for these specialized clouds.
But you need that cheap internet.
Which also shows why some things are really not meant for the cloud.
Data motion is the rate limiting factor. It always will be.
Solve the first order problem, and the second order becomes the problem. We are in that second order problem set now.

4 thoughts on “Cloudy issues”

  1. We had a talk at VPAC by Martin Sevior and Tom Ffield last Friday on their experiences doing Belle Monte-Carlo production test runs on Amazon EC2 and S3 – they estimated that for a full production run they’d need to return data from EC2 to KEK in Japan at 600MB/s (at least) constantly to be useful. I doubt if S3 could get anywhere near that!
    It was an interesting talk, they made convincing arguments that it would be far cheaper to do the calculations on EC2 than buying and maintaining their own cluster for it.
    Oh, and they built Scientific Linux AMIs for EC2 too, some of which are public.

  2. Chris,
    Any slides from the talk? I’d love to see some real-world testing results and economic numbers that fall out of the testing.
    BTW – I think Monte Carlo may be something of an ideal situation for clouds. Embarrassingly parallel, huge number of runs, varying amounts of input and/or output data. But, most importantly, if clouds can’t do a good job on MC simulations, I don’t expect them to do a good job with anything more difficult. So it’s good to hear that one data point says clouds passed the first test 🙂

  3. Hiya Jeff,
    There were slides, though I don’t have a copy of them, I’ll see if I can get Martin’s and/or Tom’s contact details and ask them.

Comments are closed.