As the market changes ...

By joe

July 13, 2010

As noted in the previous post, the EC2 CC1 bit is likely to be game changing for commercial users. The market is undergoing one of its transformations, but I am seeing two different, and actually complementary, trends occurring at the same time. When these changes have happened in the past, a process of creative destruction has occurred. That is, something old was destroyed, and in the process, something new flourished. The changes driving this market in the past have been the cost per computing cycle and the up-front purchase/lease costs. Way back in the beginning of the market, we were very high on the price curve, with very few installations. Along the way, technology improved, and costs per cycle dropped very rapidly. It's these large changes, when people realize that they can get useful work done on the lower cost per cycle gear, that have been the leading edges of the shock wave driving through the industry. What happens during and after this wave passes is what's interesting … this is where the forces of creative destruction go to work in earnest. But there is something different this time, and this is why it gets so interesting.

Computing cycle cost is dropping very fast thanks to accelerators. My business partner and I saw this ~7+ years ago, and began trying to raise money for a startup to build accelerators. It's gratifying, and perhaps a little frightening, to see how closely this is playing out to what we predicted in our business plan. Of course, none of the VCs wanted to fund this concept … after all, how large could this market possibly be … (yeah they messed up, and left an entire market on the table in the process) and most were still smitten with the “call it web 2.0 and we will throw money at it” phase. To this day I have to admit cluelessness as to the decision process that would throw money at something that had no hope of making money, versus something that obviously would have. Maybe there was a sexiness factor we missed. I dunno.

But accelerators are quite important to the future of HPC, in that they reduce the cost per cycle, the complexity of infrastructure, and the related infrastructure costs on a per cycle basis. So even if your code isn’t terribly efficient on a Fermi or similar unit, if you get a 10x performance delta over the baseline platform, you have won. Especially if you can make the baseline platform as low cost as reasonable. Accelerators were always a performance and cost-of-performance play. To make them successful, you had to lower the barriers to accessing that performance. Human time to port code is not cheap. We realized this, NVIDIA realized this. Not too many others did.

This is why FPGAs are sort of the permanent bridesmaids of HPC, and never the bride. They are, despite often herculean efforts to change this, nearly impossible for mere mortals to code for. There are stopgap mechanisms that get you some of the way there, by creating virtual processors out of the FPGAs, and “solving” the problem associated with IO by putting the units into sockets (à la DRC and Xtreme).
For some problems, the cost of moving to these sorts of solutions isn’t large compared to the benefits, but I argue that this is a vanishingly small subset of HPC problems. Couple this with the high cost of the hardware, and these approaches are relegated to niche status in most cases. You need economies of scale to drive down part pricing, so that you can sell units at low cost and in high volumes. FPGA isn’t this. Cell could have been, but the pricing model was wrong, and IBM/Sony/etc chose to go a different route. Cell is effectively a dead end in HPC, though it held a huge amount of promise.

The big issue is the cost in time/effort to port code to these platforms. PGI has an interesting take on this, as does PathScale. The CUDA approach is good, but it is vendor specific, which does to a degree represent a risk. The PGI/HMPP approach is vendor neutral, and allows the accelerator folks to work with the compiler folks to make it easy to get performance out of their units. But this is a solvable issue, and the market will make its determination. It already has in the coarse sense, and now it is polishing the details. From what we have seen, customers are already noting that they can run locally at a much lower opex (never mind capex) than on cluster systems. I expect more codes to be ported and targeted to these platforms, and this trend to accelerate, pardon the pun.

The second major change is what I indicated in the previous post. To build a cluster, there is a great deal of time/effort/resource that has to be poured into it. Space has to be created. Infrastructure allocated. Power/cooling/… people hired. Like it or not, a cluster is not a pile-o-pc’s (which Microsoft mistakenly had originally thought it was), and the expertise to design and build this correctly isn’t cheap. So what if you could skip all that, and simply buy time on a pre-existing system that can be customized for you on the fly, or on timescales on the order of hours?
That is, no capex to set it up and get it going, and something like a small opex, and possibly a consulting expense to help instantiate it. Then your cost is just the cost of these things. So if you need a system capable of 40 TF for one hour, it’s not going to cost you, up front, the expense of buying 40 TF and amortizing it over the useful life of the gear. This is what is so profound about this shift. It’s something that you could have seen several years ago if you thought long/hard enough about it.

So I mentioned the upside. This is the “creative” portion of creative destruction. Let’s talk about the “destruction” side. This market is not a zero sum game. It is growing, indeed Intersect360 indicates an 8% growth rate in HPC is quite likely over the next several years. So what is being destroyed here?

Vendors. Specifically, the zero value-add vendors … typically rack-em-stack-em resellers, who slap the lowest price parts together, and don’t really do much beyond that. We could rattle off names, but the point is that they own the small/medium cluster market for the most part. This is the market that is going to be split between deskside supers and fewer, larger clusters. With nothing to differentiate them from each other, no real value to add, where are they going to go? Most will likely quietly begin to fold. We saw that process start in earnest in the downturn, but I think it is going to accelerate, again, no pun intended. These folks are being squeezed by the Dells of the world and others who want the bigger machines (with more revenue), and are willing to discount enough to suck the oxygen out of the room. Dell’s sole source agreements with many university purchasing groups … not much different from the HP, IBM, etc versions … have effectively negated the competitive bid processes in many arenas, and have deprived these smaller companies of valuable revenue.
I won’t discuss the legality of these things. If they are ever challenged, it won’t matter much, as most of the competitors will be dead by then. Oops.

But that market, for the small to mid sized cluster, is likely going to fracture. It’s going to split between these super desksides and the large remote systems. Dell et al will lose revenue. But the companies that depend upon this market for their livelihood are likely done. The rack-em-stack-em business model has a shelf life, and expiration is likely quite soon. This will also impact cluster integrators, consultants, etc.

This market will change, and has already started changing. Some of these changes are inexorable … you can’t stop them from happening. You can adapt as a company and overcome the issues, or you can fail to adapt. Some change you may be able to influence and impact. This is a market level shift … customers have spoken. They want more cycles, and they want them cheaper. And they don’t want to pay a great deal to stand up new cycles.
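
For anyone who wants to put rough numbers on the rent-versus-buy argument above (the 40 TF for one hour case), here is a minimal back-of-the-envelope sketch. Every figure in it is a made-up placeholder, not a real price from Amazon or any vendor, and the function names are purely illustrative. It assumes a flat hourly rental rate and a simple straight-line view of capex plus opex over the life of the gear.

```python
HOURS_PER_YEAR = 24 * 365


def total_ownership_cost(capex, annual_opex, lifetime_years):
    """Up-front purchase plus running costs over the useful life of the gear."""
    return capex + annual_opex * lifetime_years


def owned_cost_per_useful_hour(capex, annual_opex, lifetime_years, utilization):
    """Cost of each hour the owned machine actually does useful work.

    utilization is the fraction (0 < utilization <= 1) of wall-clock hours
    the machine is busy; idle hours still have to be paid for.
    """
    useful_hours = lifetime_years * HOURS_PER_YEAR * utilization
    return total_ownership_cost(capex, annual_opex, lifetime_years) / useful_hours


def breakeven_hours(capex, annual_opex, lifetime_years, rental_rate_per_hour):
    """Hours of use over the lifetime at which renting costs the same as owning."""
    return total_ownership_cost(capex, annual_opex, lifetime_years) / rental_rate_per_hour


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy to follow.
    capex = 1_000_000        # purchase price of the cluster
    annual_opex = 200_000    # power, cooling, people, per year
    lifetime_years = 3
    rental_rate = 500        # price per hour for an equivalent rented system

    hours = breakeven_hours(capex, annual_opex, lifetime_years, rental_rate)
    fraction = hours / (lifetime_years * HOURS_PER_YEAR)
    print(f"Break-even at {hours:.0f} hours of use "
          f"({fraction:.1%} utilization over {lifetime_years} years)")

    for utilization in (0.05, 0.25, 0.75):
        cost = owned_cost_per_useful_hour(
            capex, annual_opex, lifetime_years, utilization)
        print(f"Owned at {utilization:.0%} utilization: {cost:,.2f} per useful hour")
```

The point of the sketch is the shape of the result rather than the numbers: below the break-even utilization, buying time wins, and someone who needs the big machine for an hour here and there is far below it.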