The elements of success in accelerator technology

By joe

November 24, 2008 - 4 minutes read - 774 words

Some time ago, I had posited that the right approach to building a viable business in accelerators was to target ubiquity. It is worth revisiting some of this and delving into how to make accelerator use painless.

Basically, for people to get real value out of accelerators, the accelerators have to provide enough benefit over the life of the host platform that the investment can be recouped. This is the fundamental raison d’être for accelerators: it is a cost-lowering operation. The costs being lowered in this case are the costs of cycles. By massively increasing the number of cores available, the average cost per core, and therefore per cycle, drops tremendously.

But you don’t buy a tool if you can’t use it. This is the other aspect of the business model for accelerators. To target ubiquity, you have to lower the barriers to use.

SSE is ubiquitous. In most cases we have worked on, it gives (at best) a marginal advantage over non-SSE based computing (as measured by wall clock time differences). If you can shove enough work into the SSE stack, yes, you can get more work (e.g. more cycles) done per unit time, and see some speedups. Though rarely more than 2-4x per algorithm, and very rarely if ever 2x on wall clock.

Many compilers do generate SSE2 code. Most do a rather poor job of using the resource though. I have raised this point in other fora. Compilers have two (ok, more than that) orthogonal tasks. First: generate correct code. Second: generate fast code. They solve the second after solving the first, an ordering one would be hard pressed to argue against. But in most cases they use SSE2 as little more than an extra set of math registers. Rarely, apart from various pattern-matched cases, do compilers generate good SSE2 code. You still need to design and write the good code by hand. I have some ideas on other ways to try to automatically generate good code, but lack the time to work on this.

But you can still write for SSE2 at a high level. Just push the computing kernel into a routine that you call from your other code. Curiously, this is the model that Cuda uses.
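To make the "push the kernel into a routine" idea concrete, here is a minimal sketch in C. The routine name `daxpy_kernel` and the choice of a y = a*x + y loop are my own illustrative assumptions, not code from any of the projects mentioned here. The hand-written SSE2 path processes two doubles per instruction, and a plain scalar loop handles the leftover element (and the whole array on machines without SSE2).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/*
 * Hypothetical computing kernel: y[i] = a * x[i] + y[i] for i in [0, n).
 * The rest of the application only sees this ordinary C function; the
 * SSE2 details stay hidden inside it.
 */
void daxpy_kernel(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    /* Precondition: non-empty arrays must not be NULL. */
    assert(n == 0 || (x != NULL && y != NULL));

#if defined(__SSE2__)
    /* Broadcast the scalar a into both lanes of a 128-bit register. */
    const __m128d va = _mm_set1_pd(a);

    /* Main loop: two doubles per iteration, unaligned loads and stores. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
        _mm_storeu_pd(y + i, vy);
    }
#endif

    /* Scalar tail: the odd leftover element, or everything without SSE2. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The calling code just does `daxpy_kernel(n, a, x, y);` and never touches an intrinsic. Whether this beats what your compiler generates on its own is something you have to measure on the wall clock, per the caveats above.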
Put your computing kernel in a callable routine, let Cuda’s threading manager handle it, and provide an interface back to your code. Write most of your code in C and you are in good shape. The downside to this model is that Cuda is currently nVidia specific, though there should be little stopping ATI from supporting it as well. So ATI and AMD have been left by the wayside as GPU computing has focused on the nVidia toolchain.

And at SC08, the folks from PGI changed the game a bit. John West of InsideHPC links to a press release. Basically AMD and PGI will be teaming up on developing compilers for the ATI FireStream cards. This looks like it will provide a nice C/Fortran/C++ front end for generating GPU code. Now lest you think this is AMD specific, these same compilers will generate Cuda code as well.

Think about that a moment. PGI is enhancing the toolchain for the lingua franca of HPC (Fortran) and C/C++ to enable GPU usage. Across GPU vendors. Ok, the binaries won’t be compatible, but I would imagine the code would be compatible at the source level.

This is good news. Even if you can’t get 90% efficiency the first day out, this is good news. It is the right direction. I had been assessing whether or not to re-up our PGI license. Well, this made the decision for me. We will.

I expect to (eventually) see other compilers do this. Probably not gcc for a while though. Not sure if Intel will either with its compiler.

In order to target ubiquity, you have to lower the platform costs such that it is a “no brainer” for people, while making the toolchain to develop applications as low cost and easy to obtain as possible. Cuda was a good step in that direction. The PGI compilers are comparatively inexpensive.

Then you have to provide examples and sample apps. The mpihmmer app was greeted with tremendous interest at SC08.
Basically it enabled not just MPI cluster-based acceleration (low cost of tools, scaling cost of hardware), but also GPU acceleration (low cost of tools and hardware) … and it enabled these simultaneously. This is a good thing IMO.

Write for your accelerator in C/Fortran. This paves the way for an easy transition for ISVs. Cuda/ATI enabling apps just got a whole lot easier.