Accelerators in HPC …
In 2002, my business partner (he wasn’t my partner yet) showed me these Cradle SOC chips: 40 cores or something like that, on a single chip, in the 2002 time frame. My comment to him was that we should figure out a way to put a whole helluva lotta them (e.g. many chips with RAM etc.) onto PCI cards, with programming environments.
Make them easy to use. Easy to program.
We spent the next 2-3 years looking at a bunch of architectures, a bunch of chips. Wrote a business plan, tried to get funding, had term sheets drafted and yanked back, went to compete in the state technology competitive funding program, and lost.
We had this idea in 2003, that accelerators would be important in HPC. In 2005-2006 we had a good rough guess as to how and why, and even when.
Accelerators reduce the cost per flop, even over standard cluster nodes, and dramatically magnify the number (and possibly the efficiency) of the flops. This lets you perform calculations with less power, and faster than on your existing system, provided you can port your code. If the economics could be made to work out, you could reduce the cost of your calculation as well.
We were doing this around the time “The Grid”™ was in vogue, and accelerators were not.
The grid became a marketing term, more or less eclipsing the real definition that Ian Foster and others had intended for the system. The concepts and the work remained; TeraGrid embodied some of them and showed that the model could be made to work, albeit not in a commercial sense, which was harder. TeraGrid was effectively a group of large systems with a shared authentication/access model.
The ideas are still valid. Possibly more so than in the past.
The grid is also an accelerator of sorts. As are clusters.
These are cycle-multiplicative accelerators. Power scales as a function of the number of cycles available per unit time. Cost scales the same way, a strong linear scaling: at 1 core per node, 1000 cores cost roughly 1000x what 1 core does.
The number of cores per node has changed … this is how chip designers have been able to keep Moore’s law going.
If you could accelerate with a lesser scaling of cost and power, this would provide a strong incentive to use these systems.
The issue, at a fundamental level, is that a flop is a flop is a flop. It doesn’t matter where or how it is computed. What matters is your cost per flop, how many flops you can easily deploy, and how easy they are to use.
For a PCI based accelerator to make sense, it had to provide some significant performance advantage over the base (substrate) platform. Imagine 10x more performance. For a cluster, you can get this with 10x the number of nodes on a scalable code. So your pricing could be no more than 10x a node price for 10x performance. That was the hard upper bound on your accelerator price. More to the point, you needed to do better than this. There is no value in doing a port if you can just as easily scale your existing infrastructure and swamp whatever performance advantage you may have had.
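The pricing bound above can be sketched as a back-of-envelope calculation. This is a minimal illustration of the argument, not anything we actually shipped; the node price used here is a made-up figure for illustration only.

```python
# Back-of-envelope bound from the argument above: an accelerator delivering
# S times the performance of one node can cost at most S times the node
# price, because for a scalable code you could simply buy S nodes instead.
# The $3000 node price below is a hypothetical illustrative number.

def max_accelerator_price(node_price, speedup):
    """Hard upper bound on what an accelerator may cost and still make sense."""
    return node_price * speedup

node_price = 3000.0   # hypothetical cluster node price, USD (assumption)
speedup = 10.0        # accelerator performance relative to one node

bound = max_accelerator_price(node_price, speedup)
print(f"Upper bound on accelerator price: ${bound:,.0f}")
# To be worth the porting effort, the real price must come in well below
# this bound, not merely match it.
```

The point of the calculation is the asymmetry: the bound is a ceiling, and porting cost pushes the economically sensible price well below it.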
So in 2005 we established a baseline estimate: performance of less than 5x the underlying platform (not per core, but the entire box) made no sense, and we had to have at least that level of performance. In 2005 this was a reasonable bar. Dual-core chips had come out and were showing ~4 GFlops/core, so you could get 16 GFlops machines fairly easily. At the end of my graduate school career, I ran on Cray YMPs that hit 1 GFlop/CPU.
So you needed to be at 50+ GFlops/board to be considered serious as an accelerator. Hard, but doable. Clearspeed was showing off their card at about 16 GFlops/board. They came out with a new chip and board and hit 32 GFlops/board. Before they died, they were showing around 50 GFlops/board. But by then the computing substrates themselves were hitting 20-30 GFlops/system. A 2x performance delta for about 2x the platform cost. Why bother, then?
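Running the Clearspeed numbers from the text against the 5x whole-box threshold makes the problem concrete. The host figure below takes the midpoint of the quoted 20-30 GFlops/system range; that midpoint choice is my assumption.

```python
# Check an accelerator board against the 5x whole-box threshold argued
# above, using the Clearspeed figures quoted in the text.

def perf_ratio(board_gflops, host_gflops):
    """Performance delta of an accelerator board over the host system."""
    return board_gflops / host_gflops

board = 50.0      # Clearspeed's best board, GFlops (from the text)
host = 25.0       # midpoint of the quoted 20-30 GFlops/system (assumption)
threshold = 5.0   # minimum whole-box speedup worth porting for

ratio = perf_ratio(board, host)
print(f"Performance delta: {ratio:.1f}x (threshold: {threshold:.0f}x)")
# 2.0x: roughly 2x the performance for roughly 2x the platform cost,
# which is exactly the "why bother?" situation described above.
```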
Similarly, FPGAs held out even more promise. Stop with this generic CPU nonsense and design a special purpose computing circuit.
This is a siren’s song. Like VLIW, only harder. We looked at the tools available, and at what we could hit with them. If we could solve the compilation (source code to circuit) problem, we might achieve some pretty astounding performance. But first we had to solve an effectively unsolvable problem … one much harder than the impossible-to-solve VLIW compilation problem.
Yes, I know. Convey is doing this now. I’ll hedge my view by saying the jury is still out, but I would be reluctant to say these will occupy anything more than a small niche. I could be wrong (and a small part of me is hoping that I am). I do know some folks at Convey, and I do wish them success.
The reason (apart from Convey) that FPGAs are doomed to failure in HPC is the complete inability to port code between boards. You can’t move bitfiles from Xilinx to Altera, or even between different FPGA versions. Different I/O environments on the boards require adaptation of the bitfiles.
In short, it requires solving problems that you shouldn’t have to solve.
Oh … and the cost … yeah, that’s an issue. Big FPGAs, the ones you really need for HPC, run thousands of USD per unit. That somewhat kills the argument for FPGA-based accelerators.
So there we were, with an idea, an architecture/design, a business/marketing plan, some VCs who would sign on if we could get a lead. And we couldn’t find that lead investor.
We made predictions on when accelerators would take on a significant role in HPC. Of course, for VCs, there has to be a hockey stick. Ours was in the 2012 region (hey, the end of time or something like that).
Turns out we were a little off. Call it 2009.
Accelerators in the form of GPUs are rapidly taking mind-share in the lower-end HPC space. The next-gen IBM super will be using some sort of APU (Accelerator Processing Unit). In fact, to reach the post-petaflop regime, you either need to drop your per-core power consumption by an order of magnitude so you can increase your core count by an order of magnitude, or you need to get accelerated. It’s inevitable.