Reflections on where we've been in HPC, and thoughts on where we are going

By joe

January 20, 2019 - 9 minutes read - 1787 words

Looking back on past reviews from 2013 and a few other posts, and at what has changed since then up to 2019 (it’s early, I know), I am struck by a particular thought I’ve expressed for decades now.

In 2009 I wrote

Down market, in this case, means wider use … explicit or implicit … integrated in more business processes. All the while, becoming orders of magnitude less expensive per computational operation, easier to use and interface with.

A terrific example of this may be found in the AI and machine learning (ML) efforts underway. ML is, at its core, an equilibrium optimization problem. Basically fitting data to minimize error, without overfitting data. Some people have brushed it off as “curve fitting”, but I think it is more subtle than that. The equilibrium part is often (explicitly) missed in such arguments. One isn’t merely minimizing the maximum error.

Curiously, this equilibrium, and the methods used to implement it (basically various forms of gradient descent with constraints), are computationally very expensive, for a number of reasons mostly related to compact mathematical expressions of the underlying statistical theory.
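To make the "gradient descent with constraints" idea concrete, here is a minimal sketch in plain Python. It is not taken from any particular ML toolkit. The function names (`fit_line`, `make_data`), the synthetic data, and the choice of an L2 penalty as the "constraint" are all assumptions made purely for illustration. The point is only the balance between fitting the data and holding the parameters back.

```python
import random

def fit_line(xs, ys, lam=0.1, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on MSE + lam * w**2.

    lam is a hypothetical penalty strength: lam=0 is a plain
    least-squares fit, larger lam pulls w toward zero.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error, plus the L2 penalty term on w
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs)) + 2.0 * lam * w
        grad_b = 2.0 / n * sum(errs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def make_data(n=50, seed=0):
    """Synthetic noisy samples of y = 3x + 1 (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

xs, ys = make_data()
w_free, b_free = fit_line(xs, ys, lam=0.0)  # error minimization only
w_reg, b_reg = fit_line(xs, ys, lam=0.5)    # error traded off against the penalty
print("unpenalized:", w_free, b_free)
print("penalized:  ", w_reg, b_reg)
```

The unpenalized run should land near the slope and intercept used to generate the data, while the penalized run settles at a smaller slope: the point where the pull of the data and the pull of the penalty balance. Even this toy version sweeps over all of the data on every step, which hints at where the cost comes from once the models and data sets get large.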

Not so curiously, there is a significant crossover to statistical mechanics. This crossover allows powerful mathematical tools to be used in the process, though all come with computational costs.

What I find compelling about this is the combination of low cost to acquire accelerated processing and (relatively) easy to use software tools for building computational experiments. The low cost to acquire could be the GPU for your desktop unit, or the GPU instances in various public clouds, such as AWS, GCE, Azure, and Paperspace.

That you can focus your development on things you can build on your laptop or desktop, and then move them into larger resources as (say Singularity) containers, solves a huge problem we’ve had in HPC systems for decades. Well, mostly solves it, as Kubernetes may not be fully stable right now, which is to be expected of newer technologies and their growing pains.

While these changes have helped many end users infuse their applications with HPC capabilities, many public clouds haven’t quite kept pace on the networking side of HPC. It’s pretty well established that RDMA capabilities are important for a number of HPC computing and storage elements. And it’s also fairly well established that many container/VM systems use an overlay network, atop an underlay network, that adds latency and overhead. The question is whether or not it makes sense to treat HPC networking like GPU/APU passthrough.

You don’t need RDMA locally on a machine, so quite a few cloud vendors have put lots of fast NVMe and SSD on their cloud machines, with PCIe passthrough or a similar mechanism. This provides the storage performance you’ll need, though it doesn’t generally allow you to build very large computing infrastructure out of a single box. You can build a scale-up system with ScaleMP, but that would require bare metal systems with RDMA connectivity and an altered boot process, which is difficult in many clouds for now. Still, this is an intriguing direction for software defined systems.

This said, it looks like the collective downmarket moves of HPC hardware in the form of accelerators, along with tool sets that don’t necessarily require you to learn the lower level programming (CUDA, OpenMP/OpenCL/OpenACC, MPI, …), enable many (more) people to be productive, faster. Add NVMe to the mix, and you have very high performance local storage, which you can use on your local machines. Which means that you can do effective work on your desktop.

This was the concept I had more than 10 years ago for what I called the muscular desktop. Make HPC hardware easy to deploy, use, afford. The toolchains would follow. The toolchains did follow, but, as I noted years later, I was wrong about the market interest in these units. I do find it interesting that Nvidia has created the DGX line, specifically to address a niche of this market I didn’t think was there in 2013. The DGX-2 is the follow-on product, and looks very interesting. While not likely affordable for home users, or researchers without a significant capex grant, your code should be able to transfer from your home/desktop machine to one of these, very easily.

And this gets us to more of a discussion on what cloud really offers. For years, the price tag has been the main feature. That is, there’s the capability to spin up and down systems when you need them. I’ll ignore all the other arguments for now, pro or con. Just focus on that.

What cloud has wrought is far lower friction to starting up work. This reduced friction enables people to move faster. They may be willing to pay for this faster motion in terms of giving up some things that they’ve come to expect from existing supercomputing resources.

Think of it this way. Any decision point you have often means you have to compromise in specific ways, or if you prefer, engineer your outcome, knowing not only the technological costs of the decisions, but the impact upon your business/research/processes. In a fairly large number of cases, there is a strong argument for HPC in the cloud, via accelerators and local storage. In other cases, there isn’t that strong of a pull for speed of startup, and other factors are more important. Either way, this is an optimization problem to be solved.

What you need are the tools to enable you to use these systems effectively. This comes down to engineering your network, your storage, and so forth, to meet your criteria. Part of this work is to iterate quickly, which does imply common toolsets across environments. And it generally suggests local resources for development, that at least mirror some aspects of the large remote systems. Again, to minimize friction.

Watching the toolchains grow up has been fairly uplifting. You can write your own CUDA code to handle GPU work. Or, in ML, you can leverage higher level toolkits such as TensorFlow, MXNet, PyTorch, etc. For a number of reasons, ML folks seem to have converged around Python as their lingua franca, though given its many issues (GIL, format as structure, etc.), the door is open to other systems.

One of my favorites is the Julia language. It has a wonderful interface to GPUs via the JuliaGPU project. Unlike Python, it doesn’t have a GIL, as it is compiled and built for parallelism, nor does it have structure by indentation. It has a native GPU compiler built in to its LLVM stack, meaning that it can optimize not merely for the CPUs, but also for the GPUs. It is rapidly maturing, so there are a few rough spots, but I expect tools like this to become more standard for HPC applications.

There are other languages as well that are interesting for this. Most try to help you eschew the boilerplate of earlier languages (which I believe was a source of friction for many). Basically, much like what cloud and related technologies have done for us, I believe that toolchains that remove boilerplate, allow you to be more productive faster, and work everywhere will be driving us forward in HPC.

This also has implications on non-open resources. Think about proprietary accelerator chips that you can’t install in your desktop/local cluster. Like TPU-x, or various APUs that Amazon is talking about. These are, implicitly, barrier (or infinite friction if you prefer) creating technologies. I wrote about this in the past when discussing business models for APUs, or accelerated processing units, a term which AMD has now been using for more than 10 years, after I wrote it in a number of white papers for them.

Basically, CUDA was a winner because GPUs, even cheap desktop ones, were available, with a toolchain that could be made to work without too much pain. Other tools with artificial scarcity are going to have a problem with adoption, if for no other reason than that people will not be able to get access locally for low cost.

Target ubiquity was what I had said in the past was the required business model. Make it so useful, and so widespread, that its use becomes an obvious extension of normal workflows.

This is in part why AMD needs to adopt the CUDA API in its toolchain. CUDA is everywhere. So, like it or not, if you are not CUDA compatible (drop in level, with maybe a re-link at most), you’ll not likely get many customers. Basically, the more friction you add, and the harder you make people work to adopt what you sell, the fewer sales you are going to get.

Put another way, this harkens back to the wars of Unix workstations, where we had to argue with application makers to port to SGI’s platform. Many vendors didn’t want the added cost in time, resources, and support, for a small additional amount of revenue. The same effect is here in the accelerator space. For better or worse, Nvidia won the toolchain battle at the lower level.

At the intermediate level, where most engineers work, the toolchains support GPU (Nvidia mostly) and TPU fairly well for TensorFlow, and GPU (Nvidia mostly) for the other systems. More adventurous folks are working on FPGA for inferencing, and if they could leverage the power of something like Julia to “compile” for their “platform”, this could be a very interesting growth time for FPGA.

This said, on FPGA, we’ve heard the siren’s song before, with Tensilica, Mitrion-C, and other tools. This is a non-trivial example of a “simple matter of programming”.

Basically, I see a number of interesting unanswered questions in the field, as it evolves and grows. I see some utility in the legacy systems, and significant growth in the low friction systems.

Time will tell, though I expect to see some more interesting things this year. Quantum computing is coming soon, with Quantum Dominance/Superiority following closely. This is tongue in cheek, of course. I did see some interesting stuff at SC18, but I suspect we still have, at minimum, half a decade before a functional quantum machine becomes available. Yes, this means I am discounting D-Wave, as it is more of an adiabatic computer.

Neuromorphic computing is also coming, though likely on similar timescales. I expect that to be more useful in the near term. Basically, take something that provides (multiple) orders of magnitude better performance for the time critical tasks that dominate the time used by the processes performing our work. This sort of performance delta is exactly what we should be investing in. Sort of the round 2 of accelerators. Same basic concept, and if they get the toolchains right to get people to use them easily, they will likely be tremendously successful.