"Sustaining" strategies and startups

By joe

February 23, 2010

I read an article on InsideHPC.com that I can’t say I agree with. The discussion of creative destruction is correct: you create a new market by destroying an old one. That has happened many times in HPC, by enabling better, cheaper, faster execution. If our SGI boxen of days of old were 1/100th the cost of the Cray YMP at the time, and 20% of the performance, who won that battle? In all but a vanishingly small number of cases, SGI won. That same system, with an easy migration path (created by removing barriers to migration), eventually won the market from the more expensive platform. There was nothing particularly new about the approach: it was a motherboard, RAM, and an IO channel. It was better/cheaper than the vector machines of the time, and it decimated the vector market.

Today, we see exactly this with GPUs. It took a while for NVidia to comprehend that end users will not port a large Fortran code to C/C++ just to use the accelerator technology. So they worked with PGI to get a Fortran compiler out. That massively increased the size of their addressable market by removing a barrier to adoption. (I could go into a long series of posts on why it’s a really bad idea(TM) to insist that all a user has to do is port their code to your nice shiny new system … this is a guaranteed path to failure … you have to make it easy for them to move … drop in replacement … at low cost.)

The problem I have with the article is its definition of a “sustaining” technology. The author seems, despite giving the same example I did above, to have missed its lesson. HPC has seen quite a few better/cheaper/faster cycles, and each has resulted in the destruction of the existing market in favor of a replacement market that was larger, more competitive, and more diverse. This is known to have happened, and in pretty much all the cases there was no single massive proprietary innovation that did the destruction. From vectors, to SMPs, to clusters, and starting now on GPU/APU/accelerators, these changes have been incremental, and every one of them has been better/cheaper/faster, with somewhat different technology … nothing astoundingly new and innovative (people have been doing GPU computing for more than a decade).

A correct definition of a “sustaining” technology is what I am arguing for. Better, cheaper, and faster is not sustaining. It is destructive. GPUs provide far lower cost per cycle than CPUs. The Intel/AMD CPUs provided far lower cost per cycle than the RISC CPUs. RISC CPUs provided far lower cost per cycle than vector CPUs. There is nothing magical about this. There were no great innovations that enabled people to see that this was inevitable. It became inevitable for economic reasons. The issue is, at the end of the day, purely an economic one, and the underlying article misses that. (It isn’t InsideHPC I am taking issue with, but the underlying article; InsideHPC is providing text of the underlying article.)

With this in mind, the correct definition of a sustaining technology is one that does not upset the status quo … which by definition is one that is not necessarily better/cheaper/faster. For disruption to occur, you need the economic argument, coupled with a cost/pain of adoption that is as low as possible. That’s it. This is why, for example, ASPs failed badly in their first go-around.
They had a capex vs opex play … it sold well to CFOs, but not so well to technologists, who saw increased costs for the same thing. And now, in ASP v2.0 (or v3.0 … not sure), re-incarnated as “the cloud”(TM), there are nascent cycle markets forming, which show promise in creating an efficient market for cycles. Unfortunately, cost per cycle in this market is still fairly high relative to a capex scenario.

During the 3 year lifetime of a machine with two 2.0GHz CPUs and 8 cores in total, we can use a maximum of 1.5 x 10^18 cycles. At roughly $5k USD/machine, we get 3 x 10^14 cycles per USD. Using a $1 USD/hour per core metric (makes scaling easier later) for this same system with 8 cores, we get about 7 x 10^12 cycles per USD. Ignoring the up front factor of nearly 2, there are two orders of magnitude difference in these prices. So even if you can get $0.10 USD/hour per core, the local machine still wins. If you can get $0.01 USD/hour per core, it is nearly a tie. That is where we need to get to in order to really see adoption over local machines. That is, there is a fundamental barrier in place that prevents this from being a real game changer (as many are hyping it to be, not unlike the way the grid was hyped). This doesn’t mean that specialist services can’t arise and make use of this; they can. But the costs need to change by orders of magnitude before parity hits.

And now introduce GPUs. Add one $500 USD GPU card into our mix. Our number of cycles over 3 years for 200 cores operating at 2GHz is now 3.8 x 10^19. Cycles per USD of cost is now 6.9 x 10^15.

Which of these is disruptive? Clouds aren’t cheaper, or faster. Better is possible. APUs (accelerator processor units) in general, and GPUs in particular, are cheaper and faster. Better will be getting there over time. I’d argue that APUs, and GPUs in particular, are disruptive, and are in fact disrupting the HPC market.
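For anyone who wants to check the cycles-per-dollar arithmetic above, here is a minimal Python sketch. The function names, the `utilization` knob, and the 365-day year are assumptions made for illustration only. It also ignores power, cooling, and admin costs, and counts a cycle on a GPU core the same as a cycle on a CPU core, as the numbers above do.

```python
# Back-of-the-envelope cycles-per-dollar comparison.
# Assumptions (illustrative only): 365-day years, purchase price is the
# only cost of owning, and every core's cycle counts the same.

SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24


def lifetime_cycles(cores, clock_hz, years=3.0, utilization=1.0):
    """Cycles actually used over the machine's lifetime."""
    return cores * clock_hz * years * SECONDS_PER_YEAR * utilization


def owned_cycles_per_usd(cores, clock_hz, price_usd, years=3.0, utilization=1.0):
    """Cycles obtained per dollar of up-front (capex) spend."""
    return lifetime_cycles(cores, clock_hz, years, utilization) / price_usd


def rented_cycles_per_usd(clock_hz, usd_per_core_hour):
    """Cycles obtained per dollar when paying by the core-hour (opex)."""
    return clock_hz * 3600 / usd_per_core_hour


def parity_price_per_core_hour(cores, price_usd, years=3.0, utilization=1.0):
    """Hourly per-core rental price at which renting matches owning."""
    used_core_hours = cores * years * HOURS_PER_YEAR * utilization
    return price_usd / used_core_hours


clock = 2.0e9  # 2.0 GHz

# 8-core machine bought for $5k
local = owned_cycles_per_usd(cores=8, clock_hz=clock, price_usd=5000)
# Same clock rate rented at $1 per core-hour
cloud = rented_cycles_per_usd(clock, usd_per_core_hour=1.00)
# 200 GPU cores, with the $500 card added to the $5k machine price
gpu = owned_cycles_per_usd(cores=200, clock_hz=clock, price_usd=5000 + 500)

print(f"owned, 8 cores:      {local:.1e} cycles/USD")
print(f"rented, $1/core-hr:  {cloud:.1e} cycles/USD")
print(f"owned, with GPU:     {gpu:.1e} cycles/USD")
print(f"parity rental price: ${parity_price_per_core_hour(8, 5000):.4f} per core-hour")
```

Run as is, this lands on roughly 3.0 x 10^14 (owned), 7.2 x 10^12 (rented at $1/hour per core), and 6.9 x 10^15 (with the GPU) cycles per USD. It puts rent-versus-own parity at a little over two cents per core-hour for the 8 core box, in the same range as the $0.01 figure above. Dropping `utilization` below 1.0 shows how quickly idle hardware gives that advantage back.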
I’d argue that clouds are a sustaining technology, simply re-adjusting the same resources without making them cheaper/faster. This isn’t a dig at clouds. Clouds are great when you need instant-on capability quickly. But are they viable as a long term utilization strategy versus purchasing? You have to look at how many of those cycles you will use over the 3 years. If you are only using 10% of your computer resources over that lifetime, yeah, clouds become viable. I don’t know many HPC shops at that 10% utilization; most are running at or near capacity.

Hence I take issue with the underlying article that InsideHPC printed. Specifically the phrasing

We do see that for the makers of non-differentiated systems. The rack-em-stack-em cluster builders have taken a beating from the likes of Dell, HP, and others. Quite a few have gone away. But so have real innovators, people with better, cheaper, faster all over them.

Part of the reason why this touched a nerve with me is that we are most definitely not doing a sustaining technology in JackRabbit or in siCluster. We are coming in hard and fast on better/cheaper/faster. I look at a lot of what our competitors are doing and it isn’t focused upon this. They simply want to protect an existing market and continue to farm it and manage it. We want to grow our market. We don’t want to sustain an existing market.

We look at dedup and related technologies as not being terrifically innovative … they are there as a sustaining technology: enable tiering to work better, and hopefully lessen the argument for the lower cost solutions. But what if, in your tiering model, our lower cost units are faster than the fast units? Why tier then? Sure, you can look at tiering as caching (which is really what it is), reserving the fastest spinning disk and SSD for the most frequently accessed data. But rather than solve that hard problem with an expensive modality, why not just make everything fast? So instead of a 4k byte cache, you now have a 4MB cache? That is disruptive.

Instead of using FC4 and FC8 with expensive interconnects and other bits around this, with large loops per drive, redundant controllers with cache mirroring, and other technologies of old … why not replace this with fast Infiniband connected servers that replicate in an HA pair? And then build large storage clusters (like siCluster) atop these sorts of units? That is disruptive. Nothing sustaining about this. But by the article quoted on InsideHPC, it looks to be a sustaining technology. It’s not.
When we can deliver 1.5GB/s per 4U unit, scale up network bandwidth with the number of units, as well as the capacity, redundancy, … and compare that to existing FC modalities … no … it is not sustaining. It is better, cheaper, and faster.

These have been the waves of change for a long time in HPC. I’ve watched companies (including those I have worked for) completely miss this. To miss this is to fail in HPC. Don’t try to out-Dell Dell. This is in part what killed SGI and LNXI. Dell can always build the same non-differentiated gear cheaper than you can. That is, Dell is sustaining its market. It can suck the oxygen out of a room with other competitors in there. We’ve watched it happen.