The coming bi(tri?)furcation in HPC, part 1

This will be short. Mostly a hat tip to Doug Eadline, who in a very recent article talks about something we have been talking about privately for a while.
Read the article, and afterwards, ponder a point he was discussing:

If one assumes that cores counts will continue to increase, the 64 core workstation may not be that far off. Back in the day, a 64 processor cluster was something to behold. Many problems still do not scale beyond this limit. Could we see split in HPC?

I believe so. Doug cautions people not to read too much into his words. That said, we are building very muscular desktops sporting 24 cores, 256 GB RAM, 1+ GB/s IO channels, and accelerators of several flavors. Each of these machines may be sufficient for work that once required a cluster.
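One way to make "many problems still do not scale beyond this limit" concrete is Amdahl's law. This is my own illustration, not something taken from Doug's article, and the parallel fractions below are assumed values chosen only to show the shape of the curve. The function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when only `parallel_fraction`
    of the single-core run time can be parallelized (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed parallel fractions; real codes need to be measured.
    for fraction in (0.90, 0.99, 0.999):
        cells = ", ".join(
            f"{cores} cores: {amdahl_speedup(fraction, cores):.1f}x"
            for cores in (24, 64, 256, 1024)
        )
        print(f"parallel fraction {fraction}: {cells}")
```

Under this simple model, a code that is 99% parallel gains little from going past a few dozen cores, which is the regime where a single many-core desktop can stand in for a small cluster. It ignores communication and memory effects, so treat it as an upper bound rather than a prediction.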

What I like to point out is that the script that HPC has followed in the past in terms of market change and growth is being followed again.
First, the idea of “good enough” and “inexpensive” rules. Why buy a Mercedes when a Honda will do the same job, just as well, with similar levels of comfort? Maybe not as much cachet on the nameplate, but just as good (if not better) on the inside. Where it counts.
Second, HPC as a market, always … always … goes down market. Many companies that have not understood this have been destroyed. A fair number of others are likely to be destroyed, because they don’t grasp this. As many of us said when at SGI, you can’t paint a box purple and charge 3x what your competitors do for it.
Twenty years ago, vector supers began to see the glimmering of a challenge from the killer supermicros (no not, I mean machines that were single core, shared memory buses with ‘large’ memory systems … several gigabytes in size). I ran on those (vectors and the supermicros).
Fifteen years ago, the battle was over, and supermicros had won. There were these new Pentium II systems that most in the supermicro world looked down on. I ran some tests on those, and found that the cost-benefit analysis was going to favor them in the longer term. 1/3 the performance for 1/10th the price. I guesstimated in 1995 that SGI had 5 years to make a technology shift or get left behind.
Ten years ago, clusters started emerging with a vengeance. I still remember (and recently found in an old sent-mail archive on a machine I am discarding) a benchmark I ran in the 1999/2000-ish time frame for informatics codes on fast R10k/R12k processors. The Pentiums were faster. And much less expensive. A bunch of us pushed SGI internally to get into the Linux cluster market, because we believed it would be big. Some of us also wanted to make Irix cheap so that our fans could buy a used O2 on eBay, and get the Irix OS and compilers cheap. This is a really … really good way to jumpstart application porting/development. But also by then, I was playing with Linux side by side with Irix. I could see the writing on the wall.
Five years ago, the last major supermicros finished their retreat to the very high end (shrinking portion) of the market.
From the data, have a look at where the green abruptly terminates.

The supermicros were the SMPs.
One year ago, accelerators began their emergence in earnest.
In every case, the impact on the market, the vendors, was severe. Cray almost went under 15 years ago. They are doing well now. SGI went under. Twice. Many exited the market, or were bought up by rivals. Convex was bought by HP, as was Compaq. Who had bought DEC.
But the impact on consumers was profound.
Price for performance dropped, usually order(s) of magnitude. While you might not be able to sustain something near peak performance, what you were able to get was “good enough”. Or, as often happened, the new stuff on the block was better, cheaper, faster, and the older companies pretty much had to buy every piece of business they got. Which drove them under, or to be sold off.
Not only that, the size of the market was driven much larger. About an order of magnitude larger over 1990-2000, and about another order of magnitude from 2000-2009. What was once a $200M market became a $2B market and now a $15B market.
Technology that is going to alter the face of the industry and cause disruption is what VCs want entrepreneurs to develop, and in theory anyway, they will help build companies to cause this disruption. Unfortunately many VCs are now busily distracted by failing revenueless and profitless web 2.0 social media companies (aka black holes for capital), as well as by LPs who are unhappy with their returns. Couple that with a decidedly un-sexy market … and you have a recipe for very little capital. Which makes it harder unless your company is self-bootstrapping.
And the technologies have emerged. In a little self-aggrandizement, I picked accelerators years ago, and was dead on right. Just like with clusters. So we know one of the emergent technologies. What about the others?
A big issue with clusters is the up-front capital cost. What if the cost to stand up the Nth node (N=1 … some large number) were a marginal/incremental fee? What if you didn’t need to bear the capital cost? This is where clouds sort of fit in. This is what they promise. The one missing piece for them to really take off in HPC is the data motion piece. As I have pointed out, this is non-trivial … over a network. It is not cheap. But the costs on the compute side scale well, and if you leverage Linux as the OS, your TCO approaches zero. You don’t need to own/maintain it. The service provider will. You just need to install your own app, or pay them to. And off you go.
Also a big issue is control of the resources … IT organizations with draconian support/deployment policies often keep research/engineering/HPC systems from operating. They make it too expensive to run. So we are seeing more users elect to buy a special desktop. Which has many processors, lots of memory. They can have control over it. IT can be excluded. They run Linux on it. Run Windows on their laptop. Or in a VM on the machine. We have customers who have built clusters of these to run their CFD rather than have IT control the machine. More to the point, end users can run their HPC apps on these machines, and as the core counts, processor and system speeds increase, there will be less incentive to spend for the HPC infrastructure around clusters. The startup capital costs are far lower.
So what I see as the up and coming generation are these personal supers. They currently offer compute power once available on small to moderate sized clusters. Back these up with a remote cluster in your machine room, or at Newservers, Amazon, Tsunamic Technologies, and you have local and remote power for your computing. The only remaining issue with the remote power is the data motion, and this is solvable if need be, with FedEx/UPS. That is, it is an eminently solvable problem, even if it is not elegant to solve.
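To put rough numbers on the data motion point, here is a minimal sketch comparing a network transfer against shipping disks. Everything in it is an assumption for illustration: the dataset sizes, the link speeds, the 70% link efficiency, and the 24 hour courier time are made-up inputs, and the function names are hypothetical.

```python
def network_hours(dataset_tb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to push `dataset_tb` terabytes (decimal, 1e12 bytes)
    over a link rated at `link_mbps` megabits per second.
    `efficiency` is an assumed fraction of the rated speed actually achieved."""
    bits = dataset_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600.0


def faster_option(dataset_tb: float, link_mbps: float,
                  courier_hours: float = 24.0, efficiency: float = 0.7) -> str:
    """Return "network" or "courier", whichever moves the data sooner.
    `courier_hours` is an assumed door-to-door shipping time."""
    if network_hours(dataset_tb, link_mbps, efficiency) < courier_hours:
        return "network"
    return "courier"


if __name__ == "__main__":
    for size_tb, mbps in ((0.01, 1000), (1, 100), (10, 100), (10, 1000)):
        hours = network_hours(size_tb, mbps)
        print(f"{size_tb} TB over {mbps} Mb/s: {hours:.1f} h -> {faster_option(size_tb, mbps)}")
```

With these assumed inputs, small data sets go over the wire in minutes, while tens of terabytes on a 100 Mb/s link take longer than an overnight box of disks. The crossover point depends entirely on your own link and shipping times.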
So when Doug postulates,

If one assumes that cores counts will continue to increase, the 64 core workstation may not be that far off. Back in the day, a 64 processor cluster was something to behold. Many problems still do not scale beyond this limit. Could we see split in HPC?

I think the answer is a resounding … yes. We will see a bifurcation, with purchased clusters occupying the higher end, and muscular desktops with ample computing, graphics, and IO power occupying the lower end, especially when coupled with a cloud HPC provider.
And as with the previous sea changes, I expect the addressable market to grow much larger. Interestingly, several months ago, a commenter on derided the coming open source nature of storage software, suggesting it would take a $30B market and turn it into a $3B market. Odd comment, as this flies in the face of what we have seen in HPC and in other markets where open source has been leveraged to great effect. Open source has been a boon to HPC, lowering the cost of scaling up, which has enabled more people to scale up. It won’t be different in storage either. It will disrupt the old order. In order for new markets to be created, some must be destroyed. And that destruction is stressful, especially if you resist change.
Just my thoughts.

2 thoughts on “The coming bi(tri?)furcation in HPC, part 1”

  1. I think GPUs (the most likely accelerators that people will look at) are still hampered by memory bandwidth – but I don’t know how much longer it’s going to be like that for. Talking to an nVidia guy the other week he didn’t think there was much on the way to help with that for the foreseeable future.
    Of course (a) if there was he might not have been at liberty to talk about it and (b) there’s plenty of people for whom GPUs may be good enough (yes, NAMD, I’m looking at you).. 😉

  2. @Chris:
    GPU accelerators should be treated more like vector processors … like vectors they are quite sensitive to memory access patterns. When you hit the right pattern, you get some good performance (assuming your code is integer/single precision based). It still has issues in double precision.
    I played with the Fixstars Cell board (GA-180) we are selling in the Pegasus GPU+Cell. It is basically a PC on a card, with 4 GB RAM and 2x PowerXCell 8i (think Roadrunner Cell units). Writing code for it is relatively easy, though I still have to learn how to make effective use of the SPUs with the compilers. This is what I find for a slightly modified code:
    AMD: 2.3 GHz Shanghai
    landman@pegasus-a3g:~/rzftest$ time ./rzf-amd.exe
    pi = 3.141592644040497
    error in pi = 0.000000009549296
    relative error in pi = 0.000000003039635
    real 0m0.740s
    user 0m0.736s
    sys 0m0.004s
    PowerXCell 8i PPU, 2.8 GHz:
    [landman@pxcab rzftest]$ time ./rzf-cell.exe
    pi = 3.141592644040497
    error in pi = 0.000000009549296
    relative error in pi = 0.000000003039635
    real 0m2.794s
    user 0m2.784s
    sys 0m0.007s
    SPU on PowerXCell 8i:
    [landman@pxcab rzftest]$ time ./rzf-spu.exe
    pi = 3.141592644040497
    error in pi = 0.000000009549296
    relative error in pi = 0.000000003039635
    real 0m6.087s
    user 0m0.001s
    sys 0m0.006s
    So as far as acceleration goes, there is a learning curve there as well. I think it is likely a general rule of thumb that in the vast majority of cases, acceleration will require some effort to effect. Our experiments with CUDA yielded similar initial results … only after we understood how to approach the architecture were we able to make effective use of it.
    In the case of the SPUs, in aggregate, we should be able to approach 100 GFLOPS double precision for 8 of them, so roughly on the order of 12 GFLOPS/SPU for double precision. Which is not that far off an AMD or Intel processor core. SPUs have very little local memory, and the PPU manages the memory access for them, so this usually winds up being a bottleneck for codes that haven’t been re-architected for it. My experiment above can’t be construed as the speed of a PPU or an SPU, but it can help set expectations: the speed increases that are possible are not automatic without a code re-architecture.
    And this is true on CUDA, with SSE, with …
    Basically it’s going to take some effort to get it there.
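For anyone who wants the ratios spelled out, here is a small sketch of the arithmetic behind the numbers above. The wall-clock times are copied from the three runs in this comment and the 100 GFLOPS aggregate is the approximate figure quoted in the text. The variable and function names are mine, used only for illustration.

```python
# Wall-clock ("real") seconds copied from the three runs shown above.
REAL_SECONDS = {
    "AMD Shanghai 2.3 GHz": 0.740,
    "Cell PPU 2.8 GHz": 2.794,
    "Cell SPU": 6.087,
}

# Approximate aggregate double precision figure quoted in the text.
AGGREGATE_DP_GFLOPS = 100.0
SPU_COUNT = 8


def slowdown_vs(baseline: str, timings: dict) -> dict:
    """Ratio of each run's wall-clock time to the baseline run's time."""
    base = timings[baseline]
    return {name: seconds / base for name, seconds in timings.items()}


per_spu_gflops = AGGREGATE_DP_GFLOPS / SPU_COUNT
ratios = slowdown_vs("AMD Shanghai 2.3 GHz", REAL_SECONDS)

if __name__ == "__main__":
    print(f"per-SPU double precision share: {per_spu_gflops:.1f} GFLOPS")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}x the AMD wall-clock time")
```

So the straight port runs several times slower than the AMD core even though the per-SPU share of the quoted aggregate is about 12.5 GFLOPS. Note also that the SPU run reports almost no user time against about 6 seconds of wall-clock time, presumably because the host-side process mostly waits while the work runs on the SPU.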
