Agglomeration of news

First, by now you have heard Tesla-10 is out. This is a significant performance step up, and I believe it has double precision capability. This is a hardware acceleration platform.
Roadrunner hit the PetaFLOP regime. What is important about this is that it did it at a lower power than many had predicted a PetaFLOP would require, and did it somewhat sooner than others had been predicting. This is an accelerated supercomputer, using Cell technology. The current fastest computer in the world uses accelerators.

HPCwire and other “mainstream” publications are now talking significantly about accelerators. They acknowledge that accelerators are the future of a large part of HPC.
The hard part is programming them. No one now denies that accelerators are going to be a significant part of the future. Three or four years ago, it was a different story.
But the problem is, today, how are you going to program them? I think it is less a question of who will win on the hardware side. I know Amir and a few others do like FPGAs. They are great for, well, what they are great for. A tautology. Dangerous before coffee.
For accelerators to become commercially adopted, we need to see an open (more likely open source) set of tools become widely used and entrenched among the rest of the developer base. The problem is twofold. Programming multi-cores is like programming SMPs of old with additional hierarchies (and people didn’t do a good job of it back then, nor did compilers do a good job of it … remember this for later). And programming aSMPs (asymmetric multi processors) requires several different tools.
CUDA for Nvidia. CTM or variants for ATI (though IMO if they are smart they will go CUDA). Verilog/VHDL/Mitrionics/ImpulseC/… for FPGA. Then you have the “multi core” tools, such as Cilk (which looks a great deal to me like OpenMP, though it operates differently under the hood), Aspeed (which operates a great deal like Cilk under the hood), RapidMind, and others.
CUDA looks to be a nascent standard due to adoption. Programming GPUs will likely become synonymous with programming in CUDA.
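For readers who have not seen it, here is roughly what "programming in CUDA" looks like. This is only a minimal sketch of a SAXPY (y = a*x + y), not anything from a real application: the names `saxpy` and `run_saxpy`, the block size of 256, and the test values are all my own illustrative assumptions. It uses only the basic runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) and the `<<<blocks, threads>>>` launch syntax.

```cuda
// Minimal SAXPY sketch (y = a*x + y). All names are illustrative assumptions.
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Device kernel: each thread updates one element of y.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, partially filled block
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs saxpy on n elements and returns the largest
// absolute error against the expected value, or -1.0f if any step fails.
float run_saxpy(int n)
{
    const float a = 2.0f;
    size_t bytes = (size_t)n * sizeof(float);

    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    if (!hx || !hy) { free(hx); free(hy); return -1.0f; }
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx = NULL, *dy = NULL;
    float max_err = -1.0f;

    // Allocate device memory and copy the inputs over.
    if (cudaMalloc((void **)&dx, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dy, bytes) == cudaSuccess &&
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {

        int threads = 256;                          // assumed block size
        int blocks  = (n + threads - 1) / threads;  // round up to cover n
        saxpy<<<blocks, threads>>>(n, a, dx, dy);

        // The blocking copy back also waits for the kernel to finish.
        if (cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            max_err = 0.0f;
            for (int i = 0; i < n; ++i)             // expect 2*1 + 2 = 4
                max_err = fmaxf(max_err, fabsf(hy[i] - 4.0f));
        }
    }

    cudaFree(dx);
    cudaFree(dy);
    free(hx);
    free(hy);
    return max_err;
}

int main(void)
{
    float err = run_saxpy(1 << 20);
    if (err < 0.0f) {
        printf("CUDA setup failed (no device, or out of memory?)\n");
        return 1;
    }
    printf("max abs error: %f\n", err);
    return 0;
}
```

Build it with something like `nvcc saxpy.cu -o saxpy`. Note how much of the code is data movement between host and device rather than computation; that bookkeeping is a good part of what makes accelerator programming hard.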
I simply don’t see widescale adoption of Mitrionics and other FPGA systems. The issue, at the end of the day, is that I cannot take my bit-file compiled “code” and move it anywhere. Yeah, there are XtremeData and DRC in-socket systems, and Nallatech, and others … We want one code that just works on all these platforms. No ifs, ands, or buts. As it stands, commercial vendors would need to treat each platform as a new product and certify against it. This is not IMO a viable approach.
Yeah, I know people will disagree. I am a firm believer in letting the market decide. It seems to have.
I would not be surprised to start seeing commercial apps with CUDA capabilities within the next 6 months. I am talking serious computational apps.
Once this happens, I expect to see some interesting effects in the market. Not just for accelerators. Think of it this way … if your desktop + a CUDA-enabled accelerator can provide the same performance as a 10-node cluster, why buy the 10-node cluster? The former will consume less power, and be easier to manage.
This is the impact I expect: accelerators will not decimate the low end HPC market, but they will do a good job of turning off the low end cluster market (those Tyan deskside units), while simultaneously growing the size of the market as HPC suddenly becomes more accessible.
I do expect to see some OS impacts of this as well. We have been seeing Linux usage on desktops and laptops on an upswing. This blog’s hit/visitor data suggests that the commonly accepted desktop numbers for Linux and MacOSX may in fact be grossly underestimated. Other blogs don’t report things too out of line with this. Considering that these readers are the people in the market served by HPC, it suggests that we are seeing widespread growth of Linux and MacOSX on the desktops of HPC people. I don’t see any reason why this would not continue apace with the scenarios I postulate above.

1 thought on “Agglomeration of news”

  1. FPGAs win for interconnect. Also power efficiency for HPC apps (GPU/multicore designs are still too focused on “faster” instead of “more”, so they are power hogs).
    Fine granularity means higher ratio of dataflow optimization is possible from efficient place-and-route. Fine granularity also decreases defect susceptibility (devices with 1 defective cell among 1 million versus 1 bad CPU-core among 8).
    Also: if you build a worthwhile digital masterpiece on an FPGA, it’s somewhat well-established that you can make it go 10-20x faster with higher power efficiency and with 10-20x lower scaling cost.
