I am a strong proponent of APUs, and accelerators in general. It is fairly obvious that the explosion in cores on single sockets results in a bandwidth wall that we have to work around. The reason for many more cores, and for SSE and other techniques, is fundamentally to increase the number of processor cycles available per unit time. SSE attempts to increase the efficiency of these cycles by allowing them to do more work per unit time. Similarly, an accelerator is about efficiency … if you can provide more efficient cycles, many more efficient cycles, then you may be able to get a multiplicative effect.
The problem is, how do you package and sell this to users?
In short, you have to make your APU ubiquitous. In the case of GPUs, consider desktop/laptop ubiquity a foregone conclusion. It’s coming, just like many cores, whether you like it or not.
But what about the more “exotic” architectures, such as ClearSpeed, Cell, FPGA?
This is something we have thought long and hard about. Assume for the moment that the performance of these technologies is all about the same (it isn’t, but the difference is not an order of magnitude in most cases). Then the main differentiators between them will be ubiquity and ease of use. Ubiquity is very much a function of price and functionality. Single shot $5-10k units will be anything but ubiquitous. Meanwhile, volume parts that have functionality elsewhere will, by definition, be ubiquitous. Developers are going to target ubiquitous platforms.
This is why I pointed out the (bloody obvious) issue of APU and board and development kit pricing.
If a platform is ubiquitous, and the developer kit is reasonably priced, you will get lots of developers building apps, which will increase the demand for your platform. The inverse also holds: an obscure platform with an expensive developer kit will not attract developers.
Look at the expected volumes and pricing (and therefore ubiquity) of the following platforms:
|Technology||Estimated Pricing (USD)||Guesstimated volume (in thousands of units)|
Yes, this is a guess at volume. Most HPC consumers will have a machine with a GPU. Most will have access (at home?) to a PS3 or similar unit. Buy one for work? Sure. It’s not that much.
So an ISV looking at what platforms to target is going to look at these trends.
Yes, some vendors need FPGA. They will require it.
And this gets to another issue.
What is the realistic speedup you will achieve using a particular technology? Will you really get 100x or 1000x, as many marketeers may claim? Rarely, if ever. The 100x or 1000x they are talking about usually applies to just the core algorithm, neglecting the rest of the app. If you only get 2x overall as a result of your 100x accelerator purchase, was it really worth the money?
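The gap between a 100x kernel and a 2x application can be made concrete with Amdahl's law. The post does not name that model, but it is the standard one and it matches the argument. What follows is a minimal sketch, assuming a hypothetical `accelerated_fraction`: the share of the original wallclock time spent in the kernel that the accelerator speeds up. The fractions used are illustrative, not measurements.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only part of the run is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # Time that the accelerator cannot touch (I/O, setup, the rest of the app).
    untouched_part = 1.0 - accelerated_fraction
    return 1.0 / (untouched_part + accelerated_fraction / kernel_speedup)


# Illustrative fractions only (assumptions, not data from any real code).
for fraction in (0.5, 0.9, 0.99):
    speedup = overall_speedup(fraction, 100.0)
    print(f"{fraction:.0%} of runtime in a 100x kernel -> {speedup:.1f}x overall")
```

Under these assumptions, a 100x kernel that accounts for half the runtime yields just under 2x for the whole application, which is the scenario the paragraph above describes.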
Which really elucidates the choices.
For some amount, call it $2500, you can get another compute node. You know that going from N to N+4 CPUs will get you some delta in performance. So you know the cost of that performance. You know how well your runs will scale.
Now introduce an accelerator. This is BTW where all the marketeers muck up the works on pricing, so ignore them for the moment.
The accelerator may get you an 8-10x performance improvement (in wallclock time, the only measurement that is even remotely meaningful). This is what we observed in our tests on the Progeniq BioBoost platform. So what is the cost to get this improvement?
In the end, accelerators make sense where they drive the cost of the improvement low enough to be an obvious purchase.
Progeniq’s BioBoost did this in the past when we tested it (last year). You get an order of magnitude better performance on particular codes compared to a single core for about $5000. Today, with 8 cores in a node, running 8-way parallel, I can get almost the same performance. That is, the cost of this performance has become similar to the cost per core of a new machine.
This is the critical metric.
If the cost per core is 10x the base machine, and the performance “boost” is 10x, it makes more sense to stick with the original machine architecture. High priced accelerators simply will fail any reasonable economic test.
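The economic test above can be sketched as a comparison of dollars paid per unit of speedup. The function names are hypothetical. The $5000 / 10x figures come from the BioBoost paragraph; the node price reuses the "call it $2500" placeholder from earlier, and the 8x node speedup assumes perfect 8-way scaling, which is an optimistic assumption rather than a measurement.

```python
def cost_per_unit_speedup(price_usd: float, speedup: float) -> float:
    """Dollars paid for each 1x of speedup relative to a single core."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return price_usd / speedup


def accelerator_makes_sense(accel_price_usd: float, accel_speedup: float,
                            node_price_usd: float, node_speedup: float) -> bool:
    """True when the accelerator buys speedup more cheaply than another node does."""
    return (cost_per_unit_speedup(accel_price_usd, accel_speedup)
            < cost_per_unit_speedup(node_price_usd, node_speedup))


# Accelerator: about $5000 for roughly 10x over a single core (from the text).
# Node: $2500 placeholder price, 8x assumed from perfect 8-way scaling.
accel = cost_per_unit_speedup(5000.0, 10.0)
node = cost_per_unit_speedup(2500.0, 8.0)
print(f"accelerator: ${accel:.2f} per 1x, node: ${node:.2f} per 1x")
print("accelerator wins:", accelerator_makes_sense(5000.0, 10.0, 2500.0, 8.0))
```

The same function covers the 10x-price-for-10x-boost case: when the two ratios are equal the accelerator does not win, and the familiar architecture is the safer purchase.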
ClearSpeed (again, I like their tech, and I am not bashing them, rather I am rooting for them, but they need a serious adjustment of their pricing models) has 96 cores and can give me 50 GFLOP double precision. My Clovertown CPU can give me 7 GFLOP/core, or 28 GFLOP per CPU. Two in a motherboard with some RAM can give me 56 GFLOP.
The ClearSpeed costs ~2x the Clovertown node.
Yes, the ClearSpeed uses far less power. But that isn’t the primary consideration for most purchasers of computing systems. Programming it requires an expensive SDK.
An nVidia GPU has something like 200+ stream processors. It can hit something like 500 GFLOP single precision. Eventually I expect them to get double precision. Even if you can use only 10% of its power, that’s 50 GFLOP single precision. The same calculation for the Clovertown system gives 56 GFLOP.
The nVidia costs ~1/5 the Clovertown node.
Yes, the nVidia uses far more power, and generates far more heat. But that isn’t the primary consideration for most purchasers of computing systems. Programming it requires a freely available SDK.
Right now there aren’t many Cell-BE PCI-e cards. Hopefully this will change, as PS3s are somewhat underpowered on RAM and I/O capability.
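The ClearSpeed, Clovertown, and nVidia figures above can be put side by side as GFLOP per unit of cost, where one unit is the price of a Clovertown node. This is a rough sketch using only the numbers quoted in the text; the `Platform` class and its field names are hypothetical, and the GPU figure is single precision, so the comparison is not like for like.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    usable_gflop: float    # GFLOP figure quoted in the text
    relative_cost: float   # price in units of one Clovertown node
    precision: str         # as stated in the text

    @property
    def gflop_per_node_cost(self) -> float:
        """GFLOP obtained per Clovertown-node's worth of money."""
        return self.usable_gflop / self.relative_cost


platforms = [
    Platform("ClearSpeed", 50.0, 2.0, "double"),
    Platform("Clovertown node (2 CPUs x 4 cores x 7 GFLOP)", 2 * 4 * 7.0, 1.0, "not stated"),
    # Only 10% of the quoted 500 GFLOP peak is assumed usable, as in the text.
    Platform("nVidia GPU", 0.10 * 500.0, 0.2, "single"),
]

for p in sorted(platforms, key=lambda p: p.gflop_per_node_cost, reverse=True):
    print(f"{p.name}: {p.gflop_per_node_cost:.0f} GFLOP per node-cost ({p.precision})")
```

Even with the GPU discounted to a tenth of its peak, its price puts it roughly an order of magnitude ahead of the ClearSpeed card on this measure, which is the pricing point being made.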
I am not covering FPGAs with this analysis. As Amir points out, you need a good paradigm for programming all of these, and it needs to be … ubiquitous.
The mistake that marketeers in this field make is that they believe that their product has value as a unit product and not as a platform product. A unit product, not unlike, say, a vector processor. Charge for that value, they say. In doing so, they miss the (blindingly obvious) situation. Developers will target ubiquitous platforms that everyone has, in order to maximize their own target market. Which means that the real value of a disruptive (accelerator) technology is in how many people you can reach.
10M units at $500/unit is not a bad bit of business.
0.002M units at $8000/unit is not something you can build a growing business around.
Especially when you are competitive with the 10M unit scenario, and they are going to ship … anyway …
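For scale, the two scenarios above work out as follows. This is simple arithmetic on the quoted unit counts and prices; the function name is hypothetical, and revenue here means units times price with nothing else modeled.

```python
def revenue_usd(units: int, price_usd: int) -> int:
    """Gross revenue: units shipped times price per unit."""
    return units * price_usd


volume_play = revenue_usd(10_000_000, 500)   # 10M units at $500/unit
niche_play = revenue_usd(2_000, 8_000)       # 0.002M units at $8000/unit

print(f"volume scenario: ${volume_play / 1e9:.1f}B")
print(f"niche scenario:  ${niche_play / 1e6:.0f}M")
print(f"ratio: {volume_play / niche_play:.1f}x")
```

The volume scenario is about 300 times larger in revenue, before counting the developer base that comes with ten million installed units.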
Again, not bashing the good folks at ClearSpeed or FPGA makers. Simply stating some (rather obvious) thoughts.
What is interesting to us is the sheer interest in leveraging multi-core, GPUs, and Cell-BE as computing platforms. This is hard work … existing paradigms really don’t do that well (ignoring one or the other feature of the system). Emulating a processor gives up performance, but could make the programming easier. The issue is the cost of the development environment. CUDA, for better or worse, set the cost bar very low. Successful vendors are going to need to keep that in mind.
Update: tin-foil hat time: ‘Exascale’ computing envisioned by Sandia and Oak Ridge researchers … either I was reading their mind or they were reading mine. The future of HPC is very much tied up in acceleration architectures. Getting 1 TFLOP out of commodity CPUs will require on the order of 200 cores (at 8 cores per node, this is ~25 nodes). So a million TFLOP is roughly 0.2B cores. What they want to do is re-architect the supers so that you realize a higher fraction of the available power, more easily, and re-architect the chips so that the operations are more efficient. If you could get 1 TFLOP/chip, you could hit 1 PFLOP in 1000 chips or ~500 nodes. You would still need 0.5M nodes for 1 EFLOP.
That’s roughly 12200 racks. Assuming 400W/node, that’s 200 MW of power.
Familiar numbers (we submitted a white paper on this to see if some group wanted to fund our ideas, sadly they didn’t, though we think we could have dropped the 12200 racks down to about 1000 for 1 EFLOP). Would have been fun to try.
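The exascale arithmetic in the update can be reproduced in a few lines. This is a sketch using the figures quoted above; the variable names are hypothetical, and the two chips per node is an assumption inferred from "1000 chips or ~500 nodes".

```python
TFLOP_PER_EFLOP = 1_000_000   # 1 EFLOP = a million TFLOP

# Commodity CPU route: about 200 cores per TFLOP, 8 cores per node.
cores_per_tflop = 200
cores_per_node = 8
commodity_cores = cores_per_tflop * TFLOP_PER_EFLOP
commodity_nodes = commodity_cores // cores_per_node

# Accelerated route: 1 TFLOP per chip, 2 chips per node (assumed, see above).
chips_per_node = 2
accel_chips = TFLOP_PER_EFLOP
accel_nodes = accel_chips // chips_per_node

# Power and packaging for the accelerated route.
watts_per_node = 400
racks = 12_200
power_mw = accel_nodes * watts_per_node / 1e6
nodes_per_rack = accel_nodes / racks

print(f"commodity: {commodity_cores / 1e9:.1f}B cores in {commodity_nodes / 1e6:.0f}M nodes")
print(f"accelerated: {accel_nodes / 1e6:.1f}M nodes, {power_mw:.0f} MW")
print(f"implied density: about {nodes_per_rack:.0f} nodes per rack")
```

The rack count of 12200 implies about 41 nodes per rack, so dropping to around 1000 racks would need either far denser packaging or far more TFLOP per node.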