Acceleration meme

Amir at Reconfigurable Computing Blog (welcome back Amir!) notes:

There’s probably a few hundred of us who are watching the accelerated computing meme spread. It is catching on, there’s probably a few dozen of us ready to put our mouth where the money is when wall street catches on too–email me by the way (my name at where i work).


We have been making the arguments for (and pitching) accelerated computing for years. Almost half a decade. Scary. We see in other people’s marketing materials, pitches, and so on, things we said years ago. In one particularly egregious example, a potential competitor had some of our slides in their online presentation. Makes you really love dealing with “no-NDA” VCs.

I agree with Amir that the meme has been catching on for a while. Unlike the massively overhyped grid of several years ago (today’s “grid” bears very little resemblance to the “grid” hyped back then, as companies caught unawares by the cluster wave struggled to differentiate themselves), this meme appears to be grass-roots. Some companies are lining up with products. We had been trying to do this as well, but most VCs we have spoken with are uninterested in accelerated computing (and HPC in general), preferring to focus on yet-another-(myspace|linkedin|facebook|…generic flavor of the day…).

I don’t think Amir’s numbers are right; it isn’t a few hundred. It is probably an order or two of magnitude more than that.

At the end of the day, we can live with die shrinks and get our meager exponential growth. But our data volumes are growing faster than Moore’s law. We need cost-effective, high-performance, low-power, and pervasive technologies to address these issues. This widespread and accelerating need for high-performance technologies has been driving the cluster market hard and fast. Insanely hard and fast. The need is not diminishing; it is accelerating.

This has created a perfect storm of need. And opportunity. With the right mix of ideas and capital, some group of companies can do wonders here.

We have been pitching this stuff for years, and the demand and interest on the customer/user side is accelerating. VCs remain focused on yet-another-(myspace|linkedin|facebook|…generic flavor of the day…). This is a shame. Clusters exploded the HPC market from $2-3B/year to north of $10B/year (note to UK readers: B == 10**9, or 1E+9, in this context). Clusters are pushing this market at a ferocious pace. There is no real sign of letup here. The growth market is at the lower end.

Talk about a perfect storm. Supercomputers have gone from single-digit shipments per year to thousands of units shipped per year. With accelerator technologies, at the right price and performance point, we are looking at millions of units shipped, or more. HPC and its follow-ons have been moving downstream for decades. They always will.

Fifteen years ago, my local “supercomputer” could run molecular dynamics simulations at about 100 time steps per week. Today my laptop can run one time step of the same calculation in about 4 seconds. Moore’s law is nice, but if I were doing simulations today, I wouldn’t want to use 64-atom supercells; I would want million-atom supercells, with many k-points rather than just the central gamma point.
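A quick back-of-the-envelope check on those numbers (the timings are the ones quoted above; the 18-month Moore’s-law doubling period is my assumption):

    /* Back-of-the-envelope check: speedup of the MD calculation above,
     * compared against a Moore's-law expectation (18-month doubling assumed). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double then_sec_per_step = 7.0 * 86400.0 / 100.0; /* 100 steps/week ~= 6048 s/step */
        double now_sec_per_step  = 4.0;                   /* ~4 s/step on the laptop */
        double speedup = then_sec_per_step / now_sec_per_step;  /* ~1500x */

        double years = 15.0;
        double moore = pow(2.0, years * 12.0 / 18.0);     /* ~2^10 ~= 1000x */

        printf("measured speedup: ~%.0fx\n", speedup);
        printf("Moore's-law expectation over %.0f years: ~%.0fx\n", years, moore);
        return 0;
    }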

My results were more accurate the more computing power I had. This wasn’t lost on me. It isn’t being lost on anyone else needing to simulate or calculate.

The end users know this, and this is why they are driving the market for HPC at the ferocious pace it is at.

Some companies know this, and this is why they try to market this stuff.

The money folks don’t know this. And this is why there is little/no capital in this market.

This technology won’t make Excel calculate spreadsheets faster. It will allow the models that feed your Excel spreadsheets, the ones that today have to run on a supercomputer, to run on your desktop. This is a highly disruptive enabling technology with little need for evangelism.

But it is not facebook-v2.

Some group of companies is going to make money here. It doesn’t seem, at this moment, that many (if any) VCs are going to be in that crowd.


9 thoughts on “Acceleration meme”

  1. I apologize for this very long post. Your blog entry struck a chord.

    I was in the reconfigurable computing business for about four years, and trekked through the wilderness of money hunting. We had more than just bright technology ideas: we worked out and vetted good plans for engineering, marketing, sales and administration, and researched and lined up a design manufacturing partner. We even had a well-respected retired IT VP advising us and opening doors.

    Of the over 60 investors we contacted, we were invited in by eight for pitches and discussions. Of those eight presentations, one quickly decided to decline, three were non-responsive, and four were genuinely intrigued and followed up. Of those four, one had us back a second time. That one said it was willing to invest if we could convince another particular fund to coinvest, but that fund had decided to back another systems venture targeting the virtualization market.

    We learned that most technology investors are not terribly adept at divining technology market trajectories, conduct lazy due diligence of people and technologies, and do deals almost exclusively with technologies and people they know well. As a class, they’re much more prosaic than we ever expected.

    My partner sourly surmised “these guys aren’t venture capitalists – they’re bankers.” That’s no slam against bankers: they’re usually adept and thorough in due diligence, and they’re charged with, firstly, not losing principal, and secondly, providing guaranteed returns on investment. We expected professional technology investors to have more sparks of curiosity, more probing natures, and better abilities to zero in upon and discuss the good and bad of the plans presented than we encountered. A few people – perhaps five – were exactly that, and we appreciated the time they took to learn about us and our proposed business, and to critique our efforts and educate us. Most were polite but indifferent or inept, and we wondered why a few were ever allowed to decide the fate of more than $500.

    Although most of the venture funders arrived at their decision by the path of least resistance, they all made the right decision to not back a reconfigurable computing venture. All the incarnations of RC to date have been the domain of serious computationalists but not regular-Joe programmers. The problem is two-fold. Firstly, the lack of silver-bullet parallelizing compilers has kept HPC off the schedule of most programmers working today. Most programmers don’t know how to design and program a parallel algorithm, and most who do aren’t adepts. The market is growing, but nowhere near the rate it could if current (abysmal) parallel applications development approached the current (poor) state-of-the-art of sequential applications development. Secondly, RC depends upon a computing fabric to reconfigure. Currently that fabric is an FPGA or variation on that theme. There are no compilers that can go all the way from HLL to bitfile, and reliably produce a realizable machine (let alone an efficient or correct machine) on reconfigurable fabrics. A company like Mitrion might argue that it’s solved that problem, but it’s a cheat: such compilers write to a virtual machine that’s implemented on the fabric, reducing the problem largely to that of writing a compiler for a particular microprocessor. Also, the current crop of FPGAs present many hurdles to developing those reliable HLL-to-bitstream development technologies.

    Cray, SRC, Mercury, Starbridge, SGI, and to a lesser extent the accelerator board companies like Annapolis and Nallatech, trumpet their marketing claims of all the wonderful hard computing solutions their kit is delivering to their customers. The reality is the machines are hard to work with, and many used for non-embedded applications wind up severely underutilized or even abandoned because their practical employment couldn’t live up to the hype that sold them.

    We now have the growing crop of multicore machines. They have fixed-architecture processors for which writing compilers is trivial, and creating working programs is at least possible (and fairly deterministic) if not trivial. They have familiar OSs, which customers clearly deem a touchstone. In all these ways they don’t, in the words of that VC, “bet against Intel,” and so they are more familiar than weird to the fund partners that ante up venture cash. And importantly, the entrepreneurs behind them are known in the traditional computing world, and the architectures are vetted and promoted (and even designed) by top computing companies. There is more investment money going behind these companies, enough so that I think the Age of the Multicore is ascendant. Any RC advances at this point must practically be made in that multicore context until both the development tools and the computing fabrics advance significantly. The tools, common at many levels to multicore and pure RC, will no doubt advance whatever the order of regime dominance.

    So I believe accelerated computing is a disruptive technology now headed for general acceptance. It took an evolutionary technology (multicore) and its acceleration of a data-center concept (virtualization) for the marketplace to take interest. The next challenges are to help average programmers write fast applications, and to help businesses integrate accelerated applications into their computational workstream.

  2. (Please don’t apologize over the length of the post. It’s good, and worth a read.)

    Heh… we are in fairly good agreement. There are things I think RC can do well, but enabling users to program it well is not one of them. It sounds like your experiences mirrored ours. We just do not (and did not) equate RC with accelerated computing. The universe of accelerated computing is quite a bit more varied.

    MC solves some of the problems (increasing the processor cycle count and efficiency), at the cost of creating others (shared consumable resources). It isn’t perfect, but it just might be good enough for most people.

    The argument I make on acceleration is that it has been here for a while in one form or another, and now it is being pushed hard, and in full force. By the graphics companies. Their stuff was initially very hard to program. And it got easier over time. I suspect that the winners of acceleration-based systems are going to be GPUs plus probably Cell. The RC stuff is still too hard to spin applications quickly, and too expensive to deploy them economically. Look at the SGI RASC bit. 1200 “lines” of code to get 10-20x performance on a box that costs 10x a compute node. The cost-benefit analysis for that is messed up. Sure, the lifetime power costs are much lower, with the concomitant cooling costs, but at the same time, very few people are running BLAST-N 24×7 for 3 years. Most want to do other things. And this is where RC falls over.
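    To put rough numbers on that (the 10-20x speedup and 10x cost multiple are the figures above; the utilization fraction is a made-up assumption), a quick sketch:

        /* Rough cost-benefit sketch for an accelerator vs. plain compute nodes.
         * The speedup and cost multiple are the figures quoted above; the
         * utilization (fraction of wall time spent in the accelerated kernel)
         * is an assumption for illustration. */
        #include <stdio.h>

        int main(void)
        {
            double speedup       = 15.0;  /* midpoint of the 10-20x quoted */
            double cost_multiple = 10.0;  /* accelerator box vs. one compute node */
            double utilization   = 0.25;  /* fraction of time the kernel actually runs */

            /* Amdahl-style: only the accelerated fraction gets faster. */
            double effective = 1.0 / ((1.0 - utilization) + utilization / speedup);
            double perf_per_dollar = effective / cost_multiple;

            printf("effective speedup: %.2fx\n", effective);
            printf("performance per dollar vs. a plain node: %.2fx\n", perf_per_dollar);
            return 0;
        }

    Unless the box is kept saturated (the BLAST-N 24×7-for-3-years case), the performance per dollar in this sketch comes out well under 1.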

    The solutions that win the accelerator race will likely have several common features:

    • They will be easy to program at a high level
    • They will be inexpensive to acquire
    • They will be fairly well standardized … code from one will work on a similar one

    As I see it, RC fails pretty much at all of these. As you pointed out, Mitrionics is somewhat of a “cheat” in that you are running a virtual processor that you build a compiler to. There are some C->bitfile compilers out there, but all assume a library of elements at some point, and this assumption (the larger it is) precludes real optimization (the more abstractly you operate, the slower you are going to operate).

    GPUs, on the other hand, are produced in the tens and hundreds of millions. They are cheap, they are ubiquitous, and at least with nVidia, you can pull the tools down (hint to AMD/ATI).

    Cell is similar. Again, they are cheap. Tools are a little harder to come by. You can use commercial bits, or open source. Expect them in the millions. If someone else built an accelerator board with Cell, and priced it reasonably, I would expect to see many sold if good tools were made available.

    All in all, building any new fixed processor means convincing large numbers of others to port to your platform. And this is a losing battle. Hell, Intel couldn’t do it with IA-64. Why spend the time/money/effort to support a tiny market? This is why ISVs have largely abandoned certain platform segments (hardware and OS-wise), and why things like Linux and the x86_64 architecture are on hard and fast growth curves. This is also why as technologically good as they may be, it is quite likely that SiCortex won’t last. Nor any other non-compatible architecture. We have a tyranny of an architecture. So strongly ensconced that not even Intel could unseat its own. Replacement philosophies for large and rapidly growing installed bases are generally wishful thinking at best.

    At the end of the day, the important aspects are how much does it cost, how hard is it to program, and will I be able to use it for more than X. End users grok it. VCs don’t.

  3. Since the AMD/ATI merger, it has been fairly apparent that Accelerated Computing is what’s on their mind.

    Compiler and operating system infrastructure is non-trivial for heterogeneous environments. However, assuming we can dispatch and profile a thread across a variety of hardware, then we can manage a single model of parallelism across a plurality of hardware and construct an optimal configuration.

    Application-level developers shouldn’t care which three- or four-letter acronym is used to define the hunk of metal that’s gonna say “hello world” to them once from each processing unit. This needs to be abstracted away from the user: the only thing he needs to know is that some hunks of metal are better at “hello world”ing and others are better at “A*B”ing, and that the profiler will help him partition his application among the various pieces of metal available, automatically perhaps, if you please.
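    As a toy illustration of what I mean (the device names, costs, and function names here are completely made up): the dispatcher picks the metal from profiled costs, and the application only ever names the kernel.

        /* Hypothetical sketch: choose the "hunk of metal" for a kernel from
         * profiled costs so the application never names a device directly.
         * Devices, costs, and kernel names are invented for illustration. */
        #include <stdio.h>

        typedef struct {
            const char *name;     /* e.g. CPU core, GPU, FPGA fabric */
            double cost_seconds;  /* profiled run time of this kernel on this device */
        } device_profile;

        static const device_profile *pick_device(const device_profile *devs, int n)
        {
            const device_profile *best = &devs[0];
            for (int i = 1; i < n; i++)
                if (devs[i].cost_seconds < best->cost_seconds)
                    best = &devs[i];
            return best;
        }

        int main(void)
        {
            /* Profiled costs for one kernel ("A*B") on three kinds of hardware. */
            device_profile matmul[] = {
                { "cpu",  4.0 },
                { "gpu",  0.4 },
                { "fpga", 0.9 },
            };
            const device_profile *d = pick_device(matmul, 3);
            printf("dispatch A*B to: %s (%.1f s)\n", d->name, d->cost_seconds);
            return 0;
        }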

    That said, programming parallel systems is not *that* hard to learn once someone fixes the toolset. Electrical Engineers know the tricks of this trade already: pipelining, synchronized non-blocking assignments, speculative execution. Can hardware design be made into a software development model? I think we’re doing this already.

    The compiler group at MIT has been developing Stream Programming for expressing software-pipelined parallelism. Tilera just came out of the dark yesterday to commercialize the RAW architecture and all these multicore compiler developments from our group. The future of computer architecture is tiled arrays of reconfigurable elements in massive datacenters connected in a fiber optic network backbone. This won’t be realized until a software stack can manage such a reconfigurable array.

    I have a $10 FPGA on my desk running 10 pipelined floating-point adders at 150 MHz (PDP-11 floating point, no less). No one cares that a $10 FPGA can produce 1.5 gigaflops. $200M for a petaflop? For $200M they should design a fabrication process for petaflops in a wafer stack and manufacture them by the thousands. A 1-trillion-gate 3-D FPGA at 500 MHz should do the trick.
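    For what it’s worth, the arithmetic behind those figures (assuming one result per adder per cycle once the pipelines are full):

        /* Sanity check on the figures above: 10 pipelined adders at 150 MHz,
         * assuming one result per adder per cycle with the pipelines full. */
        #include <stdio.h>

        int main(void)
        {
            double adders   = 10.0;
            double clock_hz = 150e6;
            double gflops   = adders * clock_hz / 1e9;      /* = 1.5 GFLOPS */
            double adders_for_petaflop = 1e15 / clock_hz;   /* ~6.7 million such adders */

            printf("sustained rate: %.1f GFLOPS\n", gflops);
            printf("150 MHz adders needed for 1 PFLOPS: ~%.1f million\n",
                   adders_for_petaflop / 1e6);
            return 0;
        }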

    If reconfigurable computing scales the way I expect it to, we will be eating Petaflops with Hummus and Falafel in a few years. But who else will be able to program it if we don’t start by making the tools better?

  4. Regarding VCs chasing social networking phenomena: there are about 5-10 emails a day from Harvard Business School students trying to recruit MIT programmers to make a social network site for Argentinian Bull Herders or tennis players with a boot fetish or whatever else someone thought up at that moment. I almost created a social networking site for failed social networking entrepreneurs to meet up and discuss their failures.

  5. Amir wrote:

    Compiler and operating system infrastructure is non-trivial for heterogeneous environments. However, assuming we can dispatch and profile a thread across a variety of hardware, then we can manage a single model of parallelism across a plurality of hardware and construct an optimal configuration.

    I agree that it is non-trivial. It is what I consider a “Hard Problem”(TM). A single model of parallelism would be wonderful. Which one? Shared memory? Shared nothing (MPI)?

    I occasionally get into arguments with people over this. Shared memory is IMO the easiest programming model. I like the idea of OpenMP. I wish we could use that or some other higher-level abstraction for parallel programming. The problem is that shared memory leads to all sorts of problems.
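    To make that concrete, here is a minimal OpenMP sketch (illustrative only; compile with something like gcc -fopenmp): one pragma parallelizes the loop, and forgetting the reduction clause gives you exactly the kind of shared-memory problem I mean, with every thread racing on the same variable.

        /* Minimal OpenMP sketch: the shared-memory model at its friendliest. */
        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            enum { N = 1000000 };
            static double x[N];
            for (int i = 0; i < N; i++) x[i] = 1.0;

            double sum = 0.0;
            /* One pragma parallelizes the loop.  Without reduction(+:sum),
             * every thread would race on "sum". */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += x[i];

            printf("sum = %.0f (max threads: %d)\n", sum, omp_get_max_threads());
            return 0;
        }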

    Also

    The future of computer architecture is tiled arrays of reconfigurable elements in massive datacenters connected in a fiber optic network backbone. This won’t be realized until a software stack can manage such a reconfigurable array.

    Hmmm…. maybe my crystal ball is a little hazy … I don’t see precisely this.

    I do agree that RC toolsets, for lack of a better term, suck. Way, way back I wrote assembly code to interface instrumentation (22 years ago? My gosh…) for a fairly notable physics experiment. There my toolset was an editor, a debugger, and a BRS (big red switch). I still managed to do some cool things, though it took a while. It was hard. VHDL et al. are that way.

    Hoping someone from Mitrionics pipes up and says great technical things (not marketing things) at this point. Another poster called their model a “cheat” and, to a degree, I agree with that. It is a pre-designed processor. But if the “cheat” works, well, … (look at Windows 3.1 …)

    One big concern is that you can’t take RC code from machine to machine. You have to use the same board. The same FPGA.

    Another big concern is the cost of the toolsets. We had a quote in the last 6 months for $77k for an FPGA development station.

    All in all, I remain somewhat unconvinced that FPGA/RC is the real wave of the future … there are simply too many things (economic/technical/ease of use) working against it. Solving the ease-of-use problem is a good thing (better toolsets). RapidMind is working on that for Cell/GPU/Multicore. PeakStream had been. But it doesn’t solve the $5k/Xilinx issue. Nor does it solve the portability issue. Maybe this is where Mitrionics comes in.

    I’ll bug Mike Calise, who just joined Mitrionics, about its toolkit.

    As for eating our computers, well, self-assembled organic (and digestible) semiconductors may be a ways off …

    … though anything is liable to taste good with hummus and falafel

  6. Joe, a graphics processor is nice, but its architecture is optimized for pipelining of data and operations in the graphics domain – clipping, rasterization, texture mapping, blending, etc. GPUs can be employed for non-graphics computing, but the workflow and paradigms characteristic of the device must be used. In a business where the algorithm is usually the most important contributor to computational efficiency, the constraints imposed by GPU architecture can be significant.

    As for the Cell BE, that thing’s Snow White and the Eight (Seven, if you’re Sony) Dwarves connected together by a token ring, of all things. I’ve read an architecture paper on it, and noted how its architecture reduced the level of abstraction available to developers (who, we know, will have to write to the machine, just like in every other serious HPC application). A pal in the aerospace industry tells me, from his conversations with an engineer using the Cell, that it’s not an easy beast to employ. So at the very least there’s a big learning curve. I personally think your observation on the adoption of IA-64, and my comments on using GPUs, will apply to Cell. It’s just too weird.

    A computer based upon reconfigurable fabrics, each having a monster amount of local memory and connected scalably to each other, still seems ideal to me. What better canvas than a blank one? Sure, the “object code” is entirely non-portable, but my ideal is a machine whose architecture comports to the algorithm it’s running at the moment. I want skilled algorithmists to be able to spell out at an abstract level how to do something, let an AI compiler reliably provide an efficient implementation, and have a fire-and-forget deployment and execution management system provide the run-time support and integration with other computing systems. But, alas, compilers are not all that smart yet, and getting from an HLL description to a running FPGA takes knowledge of electronics design and of the particular device. It’s a matter of there being no capable tools, let alone inexpensive or easy-to-use tools.

    Amir, you’re right that no one cares how a $10 device can be used to create surprisingly powerful computers. They care about the cost of deploying applications on those computers. I think compilers are far more non-trivial to construct than OSs.

    I’d not commit to threads as the one model of parallelism. Also, isn’t the stream programming model more event-based than thread-based? When I think “streams processing” I think “CSP.” No, not the hijackers of the CSP acronym, Configurable Stream Processor and Collaborative Stream Processing; I mean the original Communicating Sequential Processes.

  7. Oops. Looks like WordPress reinterpreted the HTML anchor in my text. That hyperlink to Ousterhout’s paper should have been anchored to a sentence reading “Some smart cranks eschew threads almost entirely.”
