Much ado in the [HPC] world
I've been relatively quiet here on this platform, apart from my wonderful canine son. I plan to rectify this.
At the end of 2022, we have a world where Intel is on the ropes in HPC; it can't deliver on a reliable schedule. We have AMD ascendant, and rightly so, with the EPYC chips and architecture. We have NVIDIA and AMD pushing new GPU platforms out that are tremendously performant for various workloads. We have ARM based systems also. They aren't quite winning in the on-premises market, though Amazon AWS is providing interesting systems in Graviton.
We have storage systems making bold claims about performance, showing wonderful benchmarketing numbers. Real end user IO workloads in HPC are indistinguishable from a distributed denial of service attack on the storage system. How that system behaves under these conditions matters, as this is what users experience.
Composable systems are on the rise. I remember, and used, early versions of these things: Panta Systems, ScaleMP. I think this is great, though ... though ... the underlying OS and systems can rarely survive a failure of the underlying systems. You can't migrate computational state off a non-responsive system. I am not condemning such systems, rather I believe we are at the beginning of their utility. I think, somehow, the underlying problems are solvable. This will take more time, but imagine a world in which you can specify the shape of the system you need as part of your run.
To a degree, this is done now with batch systems, but they are subdividing larger machines into smaller ones, and aggregating many distributed smaller systems into a larger entity. It's the boundaries, and data motion(!), that get us. That is part of the problem.
With ScaleMP, a large allocation could cause a (believe it or not!) blue screen crash. Application level code, doing a very large (multi TB) allocation would freeze all machines in the system. This should be fixable. In the composable world, we could, for example, start looking at storage in terms of large blocks of erasure coding. So if a machine dies running part of a calculation, and its memory goes away, it could simply be viewed in erasure channel terms. Similarly, state synchronization may (may) be possible this way.
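To make the erasure channel idea concrete, here is a minimal sketch, assuming the simplest possible code: a single XOR parity block, which can rebuild exactly one lost block. The function names (`split_with_parity`, `recover`) are hypothetical, and real systems would use something stronger such as Reed-Solomon. This is an illustration of the principle, not a design for a composable system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def split_with_parity(state: bytes, k: int):
    """Split state into k equal blocks (zero padded) plus one XOR parity block."""
    size = -(-len(state) // k)  # ceiling division
    padded = state.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return blocks, xor_blocks(blocks)


def recover(blocks, parity, lost):
    """Rebuild blocks[lost] from the surviving blocks and the parity block."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return xor_blocks(survivors + [parity])


# Pretend each block lives in the memory of a different machine.
state = b"node-local simulation state"
blocks, parity = split_with_parity(state, 4)

# Machine 2 dies and its memory goes away: an erasure at a known position.
after_failure = list(blocks)
after_failure[2] = None
rebuilt = recover(after_failure, parity, lost=2)
```

The point is that the failure location is known, so the lost block can be recomputed from what survives rather than reloaded from disk. The cost is keeping the parity current as state changes, which is where the hard work would be.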
However, HPC codes often use lots of RAM. My own analysis efforts are regularly hitting 0.1 TB of resident memory. Looking at how to distribute this among more machines will help. This gets to the crux of one of the more interesting conversations from the 1990s ... large memory SMP vs MPI distributed memory. In my opinion this conversation was decided in favor of distributed memory, MPI-like things.
One of the more curious phenomena around this was that MPI really didn't have any sort of error management. So the state of the art was, and still is, to spill state to disk/storage (burst buffers) at regular intervals, and write the code such that it could recover from these data dumps.
This isn't a great solution. But, it did work.
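For readers who haven't seen the pattern, a minimal single-process sketch of checkpoint/restart follows. It assumes the state fits in one picklable Python object and uses a hypothetical `run` loop as a stand-in for real work. Actual MPI codes coordinate checkpoints across ranks and write to parallel file systems or burst buffers, which this does not attempt.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash mid-write never leaves a torn checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every=100):
    """Resume from the last checkpoint (if any) and run up to total_steps."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["acc"]
```

Calling `run("ckpt.pkl", 300)` after a job died partway through picks up from the last saved step instead of step zero. At most `checkpoint_every` steps of work are lost, which is the trade-off you tune against the cost of writing the dump.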
Composable sales/marketing organizations could learn a great deal from those older conversations. As with everything, it's not a panacea. There's lots of hard work needed to make these things work better.
However, as most new HPC users are using it via Jupyter notebooks, or using Python code to marshal data for the real computational engines, this may be less of an issue. If your code is self healing, that is, you break it up into smaller chunks, and preserve state on permanent media, this could work quite nicely.
That is, composable machines may not be what people doing traditional leadership class HPC want to use, but it would be quite interesting creating a set of job-purpose built machines on the fly. From a Jupyter notebook.
Another thing I note with pleasure is the downstream motion of HPC systems. Today, I have various GPUs at home with 1-5 TFLOPS performance on double precision. To put this in perspective, when I was actively working on my own research (gawd ... is that really 30-ish years ago?), the fastest machines I could get access to were of the order of ... oh ... 1-10 GFLOPS. Those were Cray units (C90's, J90's), and workstations (0.1 - 1 GFLOPS). And my code was in Fortran (77 dialect, and yes, it still compiles and runs today, with modern compilers).
That's 3-4 orders of magnitude of performance. The cost of that performance? Well, those GPUs were well under $1000 USD, compared to millions of USD for the Cray machines, and tens to hundreds of thousands for the workstation systems.
Again, 3-4 orders of magnitude reduction in cost. Adding in the 3-4 orders of magnitude in performance, we are talking 6-8 orders of magnitude better price performance in 30-ish years. This is what I've meant, all these years, about HPC moving down market. It has become very widespread. Anyone with a semi-decent GPU, and some higher end CPUs, consumer class at that, can do reasonably good computing tasks.
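The arithmetic is easy to check. Here is a small sketch, assuming representative points picked from the low end of the ranges quoted above (the specific pairings of FLOPS and dollars are my illustrative choices, not measurements):

```python
import math


def flops_per_dollar(flops, cost_usd):
    """Price performance as sustained FLOPS bought per US dollar."""
    return flops / cost_usd


def orders_gained(new, old):
    """Base-10 orders of magnitude between two FLOPS-per-dollar figures."""
    return math.log10(new / old)


# Illustrative points from the quoted ranges (assumptions, not measurements).
gpu = flops_per_dollar(1e12, 1e3)         # 1 TFLOPS for about $1000
cray = flops_per_dollar(1e9, 1e6)         # 1 GFLOPS for about $1M
workstation = flops_per_dollar(1e8, 1e5)  # 0.1 GFLOPS for about $100k

print(f"GPU vs Cray:        {orders_gained(gpu, cray):.1f} orders of magnitude")
print(f"GPU vs workstation: {orders_gained(gpu, workstation):.1f} orders of magnitude")
```

With these conservative points, both comparisons land at about 6 orders of magnitude. More favorable points within the same ranges push the figure higher.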
In my opinion, this enables more people to engage in computational science. Combine excellent libraries, and simple to use tools, and people will use HPC technologies to do wonderful things. HPC will keep moving down market. And that is a good thing.
Geopolitics is having an impact on HPC systems, specifically in terms of chips. China wants to M&A Taiwan. And Taiwan (and much of the rest of the world) do not want that. Sabres are rattling. While territorial disputes, and who controls what, are rarely a problem for most things computational, as it turns out, TSMC, that is Taiwan Semiconductor Manufacturing Company, is potentially at risk in a hostile takeover.
TSMC makes most of the advanced chips on the market. Samsung is slowly catching up, and Intel is making big noises about wanting to be a contract manufacturer. GlobalFoundries has been lagging the others, but may be working on jump starting to gain position.
HPC drives some of this. There is strong demand for computing, at the (massively) overhyped edge. AI models are trained in HPC systems, and are often deployed to portable (edge) computing platforms.
So chips matter. And this bubbling cauldron of a conflict matters. I know it sounds callous stripping out the human costs of these conflicts. I despair of the lives to be lost over political control. The impact upon the people in Taiwan will be terrible, regardless of the outcome.
In martial arts, you learn to be humble, well, if you have good teachers. You learn to eschew physical conflict, knowing that physical conflict never ends well. You learn that physical conflict is a last resort. Any fight you can walk away from is preferable to one you need to be carried away from.
That is, you are taught that belligerence comes at a price.
And the people who will pay that price ... they will be the cost of the conflict.
This brewing conflict has disrupted supply chains. Finally ... finally ... people are recognizing that crossing national borders for important/material goods is a risk you need to quantify in terms of political stability and reliability of "allies".
Remember of course, that countries have "interests". Sometimes those interests align, and sometimes they don't. So if you have, let's say, a hard dependency upon a supply line that crosses a large ocean, and has some politically belligerent neighbors ... That is a risk. A material risk. The cost savings you have by setting up shop overseas and shipping to the US or Europe are now seen to be (in many cases) massively dwarfed by the risk associated with a long supply line.
So now we have most chips being manufactured overseas. Our supply of chips, and frankly everything, should be considered unreliable at this time. That is, we have gutted our manufacturing capability in order to save a buck. And we are reaping what we have sown.
We will feel the conflict over here. While politicians posture, and claim irrelevant victories as if they were game changing, our ability to function will be stressed.
It's not just about HPC chips. It's about everything that is manufactured in China. And it's about a government that wishes to control another political entity it claims as its own.
This impacts far more than just our little bit of society and the market for chips.
Please keep this bigger picture in mind. We live in a world. It is what it is for better or worse. And while I celebrate all the positive things that HPCing all the things brings to our society, we live in a world where conflict brews. And people have been, and will be hurt by these conflicts.
There are costs to every decision. Prices to be paid for actions.