HPC, HTC, AI, and markets

I've not written much in any format outside of Twitter for a while.  Much has happened while I was away.

  • Nvidia (NVDA) is looking to get government approval to buy ARM. Old news, but worth noting that the deal is big enough to attract governments' attention. Sadly, that opens the door for political meddling, which rarely benefits anyone but politicians, as the companies get swatted around by those in power.

  • AMD is looking to close its acquisition of Xilinx. Also old news, and I suspect it's on a much easier glide path than Nvidia's deal.

  • HPE bought Cray. I was at Cray for that. My honest opinion? RIP Cray. HPE had also acquired SGI some years earlier. I was at SGI when it first acquired Cray, in 1996 or so, at the (rumored) behest of the US Government, and I was at SGI during the "divorce". Some of the best colleagues I had were originally Crayons. My SGI colleagues left for greener pastures. Like Nvidia.

  • ARM became relevant in HPC. Curious, given its multiple abortive efforts previously. Relevant in the sense that there are now ARM CPUs that are genuinely HPC capable: Fujitsu's A64FX, Ampere's Altra, and possibly AWS's Graviton series, though I don't think (correct me if I am wrong) that we can buy a motherboard with a Graviton processor to stick in our own rack.

  • Intel started its implosion. Well, ok, it started in the mid-2010s, call it around 2015 or so; it has just picked up steam since. It (finally) fired its incompetent CEO and put a real engineer in charge. Intel has a long road to travel. It won't be easy. And it's been shedding market share to AMD in a big way.

  • AMD is back. With a vengeance. The Zen series is simply fantastic (said as a user of the chips). Especially in comparison to Intel's.

Ok, so this is something of the high-level view of things. In HPC there is the ultra high end, now exascale, and then there are the systems the rest of us use. In those systems there is tremendous demand for machine learning (ML) tooling to build and deploy models. One part of the model-building process is training a model.

So here's the thing. ML, and most of the rest of that alphabet soup, can be looked at as function minimization, similar in concept to a regression calculation. Remember, in school, when they had you take data with x's and y's (labels), and then draw the "best fit" line? And report the slope, intercept, and other parameters? Yeah. Your deployed model would be the slope/intercept pair, and you could infer the y value (label) from the x value, given the model.
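To make that concrete, here's a minimal sketch of the best-fit-line idea in Python (the data points are invented purely for illustration):

```python
import numpy as np

# Toy data: x values and their labels (y values). Numbers are invented
# purely for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares fit of a degree-1 polynomial: the classic best-fit line.
slope, intercept = np.polyfit(x, y, deg=1)

# The "deployed model" is just these two numbers.
def predict(x_new):
    return slope * x_new + intercept

print(f"model: y = {slope:.2f} * x + {intercept:.2f}")
print(f"inferred label at x = 5: {predict(5.0):.2f}")
```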

Believe it or not, that is ML. And AI. Ok, the models aren't necessarily minimizing the mean squared error; there are more complex expressions being fit. But it is a fit. The machine really isn't "learning", in the sense that you and I learn. We aren't really sure we understand what learning is; we have guesses, and models. But those models fail, spectacularly, in a number of general cases.
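If you want to see that "it is a fit" holds even when the loss isn't mean squared error, here's a hedged sketch: the same line fit, but minimizing mean absolute error by (sub)gradient descent. The learning rate and iteration count are arbitrary choices for this toy problem:

```python
import numpy as np

# Same toy data; this time minimize mean *absolute* error rather than
# mean squared error. Different loss, same "fit by minimizing" machinery.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

slope, intercept = 0.0, 0.0
lr = 0.01  # step size, chosen arbitrarily for illustration
for _ in range(5000):
    resid = slope * x + intercept - y
    g = np.sign(resid)  # (sub)gradient of mean(|resid|)
    slope -= lr * np.mean(g * x)
    intercept -= lr * np.mean(g)

print(f"fit: y = {slope:.2f} * x + {intercept:.2f}")
```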

Those failures, not so coincidentally, are why self-driving is a recipe for mayhem. But that's a post for another day.

What has this to do with HPC?

Good question. To train these models (i.e. fit them and minimize the error of the fit), one has to do many ... many calculations. These are usually expressed in terms of "tensors", which, if you squint and wave your hands, you can imagine as matrix operations. Tensors are more general than matrices; the matrices we are used to are a subset of what tensors are.
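A quick sketch of the squint-and-wave-hands view, using numpy's einsum: the same index notation covers an ordinary matrix multiply and a higher-rank tensor contraction (shapes are arbitrary, just for illustration):

```python
import numpy as np

# An ordinary matrix multiply is a contraction over one shared index...
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.einsum("ij,jk->ik", A, B)        # equivalent to A @ B

# ...and the same notation extends to higher-rank tensors, e.g. a
# batch of matrices contracted against B in one shot.
T = np.random.rand(10, 3, 4)            # rank-3 tensor: 10 matrices
D = np.einsum("bij,jk->bik", T, B)      # 10 matrix products at once

print(C.shape, D.shape)                 # (3, 5) (10, 3, 5)
```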

It should be noted that physicists make extensive use of tensors in numerous sub-fields. Actually, we use lots of interesting math throughout physics. That doesn't mean I remember it all well; I haven't played with stress-energy tensors in like 30 years (since my general relativity course), though I did use some tensors (and functionals!) in my thesis.

Anyway ... (see, ADHD ... squirrel!) tensors ... or matrices in this case, require computation environments. This is the connection to HPC.

With large data sets to crunch, one must calculate with many matrices (ok, tensors). So, what do we have that does matrix operations very quickly?

Accelerators, specifically GPUs. They are designed to perform many matrix operations. Very quickly.

Oh, CPUs can do this too. But they aren't nearly as fast. The GPU's speed comes from its massive parallelism and the massive memory bandwidth available with HBM-type memory. You can set thousands of cores on these problems, compared to hundreds of cores with (AMD) CPUs. From a raw performance perspective, GPUs give more bang per buck.
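For a feel of the difference, here's a rough sketch (not a careful benchmark; it assumes a CUDA-capable GPU with the CuPy package installed, and skips warm-up runs):

```python
import time
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU and CuPy installed

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
np.matmul(a, b)                        # CPU matrix multiply
cpu_s = time.perf_counter() - t0

a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)
t0 = time.perf_counter()
cp.matmul(a_gpu, b_gpu)                # GPU matrix multiply
cp.cuda.Stream.null.synchronize()      # GPU calls are async; wait for the result
gpu_s = time.perf_counter() - t0

print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
```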

This is not to say you cannot construct your algorithms such that CPUs do a comparable or better job; some people have published such papers recently. But this is not the norm: GPUs are cheap (relatively) and plentiful (relatively).

I say relatively, as GPUs are also usable for various forms of crypto-mining.

What is interesting to me about that is that it demonstrates, beautifully, the concepts of supply, demand, and price elasticity. Crypto "value" skyrocketed over the last year, and the net result was a huge demand for GPUs to power the mining. Which depleted the market of these GPUs. Given that the market was probably efficient, prices rose to match the demand.
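As a back-of-the-envelope illustration (the numbers below are invented, not market data), price elasticity of demand is just the percentage change in quantity over the percentage change in price:

```python
def price_elasticity(q0, q1, p0, p1):
    """Arc elasticity of demand: % change in quantity / % change in price."""
    dq = (q1 - q0) / ((q1 + q0) / 2)
    dp = (p1 - p0) / ((p1 + p0) / 2)
    return dq / dp

# Invented numbers: price roughly doubles while the quantity sold barely
# moves (supply constrained), so demand looks highly inelastic.
print(price_elasticity(q0=100, q1=95, p0=500, p1=1000))  # ~ -0.08
```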

That was remarkable to see, but it kinda sucks if you are in the market for a GPU. For ML. Or scientific computing.

High Throughput Computing (HTC) is all about maximizing the job flux through a pipeline of computation, not necessarily maximizing the performance of a single run (that's HPC). GPUs provide localized HPC. Aggregations of GPUs can be used for single large training jobs (HPC), for multiple discrete jobs (HTC), or for combinations of the two.
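A toy contrast in plain Python (real HTC setups use batch schedulers such as HTCondor or Slurm arrays; the stand-in job and pool here are purely illustrative): HTC cares about pushing many independent jobs through, not about speeding up any one of them:

```python
from concurrent.futures import ProcessPoolExecutor
import math

def one_job(seed: int) -> float:
    """Stand-in for a single independent analysis job."""
    return sum(math.sqrt(i + seed) for i in range(1_000_000))

if __name__ == "__main__":
    # HTC flavor: maximize job flux by streaming many independent
    # jobs through a pool of workers.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(one_job, range(64)))
    print(f"completed {len(results)} jobs")
```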

Some sciences implicitly use HTC capabilities: think large clusters/clouds with batch submission, or large analysis pipelines. Some use HPC capabilities. Some use both. It's fascinating to watch this mix evolve.

So here we are. Some vendors have fallen out of favor for various reasons; others are in favor, for various reasons. All in the midst of a pandemic, where the initial responses provided supply chain shocks (across all industries), and the government responses (at least in the US) have ranged from moderately bad to dangerously incompetent. The supply chain shocks have resulted in reduced availability and longer lead times for parts, which has increased prices. That's on top of everything else going on.

So here I am, 8 months into my new position, enjoying working on tech with really smart and nice people (seriously, why didn't I do this before?), watching in awe as things change rapidly and slowly.

I'm somewhat at a loss to imagine what we are going to see next year. Though I can probably extrapolate using (my hopefully real) biological intelligence ... meat-space, if you prefer a Strossian term (Charles Stross, great author, fantastic stories). I'll do that going forward in other posts.

One thing I'd like to mention before finishing: both exascale and regular HPC/HTC are making significant use of accelerators. I had predicted something like this about 18 years ago. I'm still in awe of how accurate those predictions were. Must be biological intelligence at work ... :D