So way back in the good old days, programming a single core CPU in a high performance manner was a challenge. Compilers promised much and delivered small fractions of maximum theoretical performance. To get nearly optimal performance, you had to hand code assembly language routines. You would never be able to achieve 100% utilization of the processor capabilities, but you might be able to sufficiently balance memory operations with floating point and integer operations so that you were utilizing a sizeable fraction of the chip’s subsystem capabilities.
This is not to say that the compilers are bad, or they generate rotten code. They don’t.
It’s that the job of mapping a high level language into low level assembly code, efficiently, is very hard even for very simple architectures. Current processors (single and multi core) have all sorts of features that try to ameliorate these problems. Out of order execution to hide pipeline stalls. Local cache to hide memory access latency. And so on.
The problem is that each of these features makes the job of the high performance programmer harder. You can get blistering memory performance on a PC, as long as it is to and from cache. So a common technique on cache based architectures exploits this by attempting to create temporal and spatial locality. Reuse in a nutshell. If you have to pay the price to bring a cache line in from memory, why not use it as often as possible?
Great idea. There are lots of others that are used as well. Well, they are attempted … it is hard to get spatial and temporal locality if you are building object factories and other OO methods. Pointer chasing is a cache killer. Walking dereference trees is a cache killer. Cache killers are performance killers on cache based architectures. You don’t want to program like that for high performance.
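To make the reuse argument concrete, here is a minimal sketch in Python. It is a toy model, not a measurement: `count_misses`, the 8-element cache line, and the 4-line fully associative LRU cache are all assumptions made up for illustration, and real caches are larger and organized differently. It walks the same 64x64 array three ways and counts how often a cache line has to be fetched.

```python
import random
from collections import OrderedDict

def count_misses(indices, line_size=8, num_lines=4):
    """Count misses in a toy fully associative LRU cache.

    indices are element positions in one flat array; line_size is the
    number of elements per cache line. Both sizes are made-up values.
    """
    cache = OrderedDict()  # keys are cache line numbers, oldest first
    misses = 0
    for index in indices:
        line = index // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

ROWS, COLS = 64, 64  # a 64x64 matrix stored row by row in a flat array

# Unit stride: every element of a cache line is used before moving on.
sequential = [r * COLS + c for r in range(ROWS) for c in range(COLS)]

# Stride of COLS: each access lands on a different cache line.
strided = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

# Stand-in for pointer chasing: visit the same elements in scattered order.
chased = list(sequential)
random.Random(0).shuffle(chased)

sequential_misses = count_misses(sequential)
strided_misses = count_misses(strided)
chased_misses = count_misses(chased)

print("sequential misses:", sequential_misses)
print("strided misses:   ", strided_misses)
print("chased misses:    ", chased_misses)
```

In this toy model the unit stride walk misses once per cache line (512 misses for 4096 accesses), while the strided walk misses on every single access. The shuffled walk, a rough stand-in for chasing pointers through scattered objects, also does far worse than the sequential one. Same data, same amount of work, very different traffic to memory.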
Now we complicate the picture. We now have a second core, and we are going to contend for resources that we had complete and unfettered access to before. The pins out of the processor are rapidly becoming shared resources.
Now take 2 cores, each core able to completely fill the memory bus, and tie them together with a fast internal fabric that connects to the pins. You have to take the resulting contention for the pins into consideration to a degree. With two cores, we have found that for some codes this is not a problem. With the advent of next generation HT, this problem may be mitigated up to about 8 cores (this is a rough estimate based upon public knowledge available from a variety of sources on the net). But what happens when you start going “core-crazy” and toss 24, 48, or 96 cores, or more, per socket?
You have to worry about resource contention, resource scheduling, and all the issues associated with efficiently utilizing the multiple cores, as well as utilizing the resources of each core.
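A back-of-the-envelope sketch of the contention for the pins, in Python. The function name `bandwidth_bound_speedup` and the bandwidth figures are hypothetical, chosen only to show the shape of the problem. The model assumes a purely memory bound code whose cores scale perfectly until the shared bus saturates, which is cruder than any real machine.

```python
def bandwidth_bound_speedup(cores, core_demand_gb_per_s, bus_gb_per_s):
    """Best case speedup for a memory bound code sharing one bus.

    Cores scale linearly until their combined demand reaches what the
    pins can deliver; past that point extra cores just wait on memory.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if core_demand_gb_per_s <= 0 or bus_gb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    return min(cores, bus_gb_per_s / core_demand_gb_per_s)

BUS_GB_PER_S = 10.0  # hypothetical bandwidth through the socket pins

for cores in (1, 2, 8, 24, 48, 96):
    # Case 1: each core can fill the bus on its own.
    saturating = bandwidth_bound_speedup(cores, 10.0, BUS_GB_PER_S)
    # Case 2: each core needs only a quarter of the bus.
    modest = bandwidth_bound_speedup(cores, 2.5, BUS_GB_PER_S)
    print(f"{cores:3d} cores: saturating code {saturating:4.1f}x, "
          f"modest code {modest:4.1f}x")
```

Under these assumptions, a code where one core can already fill the bus gets nothing from the second core, let alone the 96th, and a code needing a quarter of the bus per core stops scaling at 4 cores. Codes that live in cache are the ones that escape this limit.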
Remember, the current compilers aren’t really up to the task of efficient utilization of current fairly complex chips. Add additional complexity. Anyone think this situation is going to get better? High performance software is hard. Very hard.
Adding N cores will not make your application N times faster. Adding N threads won’t make your application N times faster. As Amdahl showed, it’s those little things, such as serialization (think of that as a resource contention issue that is resolved by providing serial access to the resource), that wind up dominating the performance considerations. Similar forces are likely to be at work here as well.
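Amdahl’s law is easy to put into numbers. The sketch below is Python; the function name `amdahl_speedup` and the 5% serial fraction are assumptions picked for illustration, not figures from any particular application.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup from Amdahl's law.

    serial_fraction is the share of single core runtime that cannot be
    parallelized; the remainder is assumed to scale perfectly.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

SERIAL_FRACTION = 0.05  # hypothetical: 5% of the runtime is serialized

for cores in (1, 2, 8, 24, 48, 96):
    speedup = amdahl_speedup(SERIAL_FRACTION, cores)
    print(f"{cores:3d} cores: at most {speedup:5.2f}x")
```

With 5% of the runtime serialized, the speedup can never exceed 20x no matter how many cores you throw at it, and this bound ignores the memory contention discussed above, so real numbers will be lower.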
A shared resource is a contended-for resource, and this contention has to be managed in a manner that enables high performance applications. At first glance, you can always simply use the multi core units as SMPs. This is how most of the OSes handle it today. But how do you feed a single chip with 24 or 48 cores well enough to get high performance out of it? It’s not going to be easy unless the rest of the system is well balanced. And even then, it is going to be hard to find compilers that can handle yet another level of complexity while generating efficient code.
New (or in reality, older) paradigms will need to be used to make efficient use of the multiple levels of resources. We might be able to pretend that some of the resources represent simply additional processors and not pay any heed to whether or not they are on the same socket. But that model is likely to break down at some point. I suspect at a point somewhat below where marketing sheets suggest.