I postulated for a while that this was the case. HPC technologies tend to evolve to a point of bandwidth (or latency) limitation. The broader IT market tends to follow.
This is basically stating that as you build out resources, common designs will tend to oversubscribe critical information pathways.
I had a conversation with a potential partner today where we were talking about HPC across multiple different subfields, and we kept coming back to this, whether unintentionally or subconsciously. Bandwidth limitation, basically data motion, is a major, and often the major, rate-limiting factor (for productive work) in many systems.
There are some folks who need huge numbers of cycles, or more efficient cycles. They will buy accelerators. Some folks need huge data streams, and will buy big, fast disks and networks. But as we increase the number of cores tied to single buses, as we tie more processor cores into each node, it's the pipes between and within nodes that start bottlenecking.
I used to tell people that you can imagine problems as terms in power series. If you can knock off the first order problems, the second order will be the things that bite you. Or put less mathematically, you never (completely) solve problems, you just change which problems you want to work on.
Add many more cores to a node, and memory and IO access become bottlenecks. Eight cores, each reading and writing files over local disk and a network, will constrain performance. Sure, you can turn up the speed of the disk, or the network, and gain some period of time to be “free” of actually solving the problem. Intel did this with their FSB up-tick. It doesn’t solve the design issue associated with the single pipe for many cores, it just moves it out a bit. They changed the problem a little, but they still eventually have to solve it.
And NUMA, the solution to the above problem, brings in its own problems. Memory isn’t uniform, so you now have to tell code, schedulers, etc. about memory hierarchies as well as other things.
Again, NUMA simply changes the problem. You still have one or two memory pipes per socket, and yes, you can fill them up … ask the people running weather and specific CFD and chemistry codes.
The problem in all these cases is running out of a particular shared resource. With N requestors for a resource, each would get, on average, about 1/Nth of it. It scales. But in the wrong way.
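To put rough numbers on that 1/N behavior, here is a minimal sketch. It assumes the pipe is shared evenly and has no contention overhead, which real buses will not give you. The 10 GB/s total and the function name are made up purely for illustration.

```python
def per_requestor_share(total_bandwidth_gbs: float, n_requestors: int) -> float:
    """Average bandwidth each requestor sees on an evenly shared pipe.

    Assumption: perfectly fair sharing and zero arbitration overhead,
    so this is an optimistic upper bound, not a measurement.
    """
    if n_requestors < 1:
        raise ValueError("need at least one requestor")
    return total_bandwidth_gbs / n_requestors


if __name__ == "__main__":
    TOTAL_GBS = 10.0  # hypothetical fixed pipe, set at system design time
    for n in (1, 2, 4, 8, 16):
        share = per_requestor_share(TOTAL_GBS, n)
        print(f"{n:2d} requestors: {share:6.3f} GB/s each")
```

Doubling the requestors halves what each one gets, while the total stays pinned at whatever the design provides.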
As processors get infinitely fast, you still have finite time to access and move the data. Bandwidth is not getting much faster. I can’t plug a new disk or CPU into my box and double my memory/IO bandwidth. These things are set in system design. Which means they are a fixed resource. A fixed, contended-for resource.
Which suggests that there are points of diminishing returns in sharing resources. The resource is bandwidth in this case, and sharing it means that you need to use it effectively.
As noted in my compiler bit this past weekend, compilers don’t do a great job of using the processor resources. Bandwidth is a bit more esoteric. There are no compiler switches to optimize for bandwidth limitation. Careful design and planning are needed.
Just some thoughts … I think there is something fundamental lurking there somewhere. Sort of like a conservation of processor cycle law for SMPs, from which you can derive something that looks an awful lot like Amdahl’s law.
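One way such a result might fall out is sketched below. This is a toy model, not a worked-out law. It assumes the compute part of a job divides perfectly across N cores, while the data-movement part goes through a single fixed pipe and so takes the same time no matter how many cores are waiting on it. Under those assumptions the data-movement time plays the role of the serial fraction in Amdahl's law. All names and numbers are hypothetical.

```python
def run_time(compute_s: float, transfer_s: float, n_cores: int) -> float:
    """Toy model: compute splits across cores, the shared pipe does not."""
    return compute_s / n_cores + transfer_s


def speedup(compute_s: float, transfer_s: float, n_cores: int) -> float:
    """Speedup over one core under the toy model above."""
    return run_time(compute_s, transfer_s, 1) / run_time(compute_s, transfer_s, n_cores)


def amdahl(serial_fraction: float, n_cores: int) -> float:
    """Classic Amdahl's law, for comparison."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


if __name__ == "__main__":
    compute_s, transfer_s = 90.0, 10.0  # hypothetical single-core timings
    # Fraction of the single-core run spent waiting on the pipe.
    bound_fraction = transfer_s / (compute_s + transfer_s)
    for n in (1, 2, 4, 8, 16, 64):
        s_model = speedup(compute_s, transfer_s, n)
        s_amdahl = amdahl(bound_fraction, n)
        print(f"{n:3d} cores: model {s_model:6.3f}  Amdahl {s_amdahl:6.3f}")
```

The two columns agree because, in this model, the time spent on the fixed pipe behaves exactly like serial work: no number of cores shrinks it. The speedup can never exceed the total single-core time divided by the transfer time.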
Worth more time thinking about it.