Locality and centrality in massive computing and storage systems

Here we are in the age of the cluster and grid, with distributed, shared-nothing approaches to processing cycles, and yet we collectively have this rather ironic fixation on shared file systems. This is amusing, as one of the critical arguments for distributed computing is that, in aggregate, N processors provide N times the processing cycles that 1 processor can provide, while shared resources are contended-for resources (i.e. bottlenecks).

The whole point of enforcing locality is that, in aggregate, local is always faster as you scale N up. Suppose for a moment that I take 1000 CPUs, each with 1 GB/s to its local memory, and, say, 1 GB/s to remote memory, shared with all other CPUs on the net. This means that if I can write my algorithms to use mostly local bandwidth, I can exploit 1000 GB/s, or 1 TB/s, in aggregate across 1000 processors; if they lean on the shared remote path instead, everyone contends for the same 1 GB/s.
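The arithmetic above can be sketched as a back-of-the-envelope model. The numbers are the hypothetical ones from the text (1000 CPUs, 1 GB/s local each, one shared 1 GB/s remote path); the function name and the `local_fraction` parameter are mine, for illustration.

```python
def aggregate_bandwidth_gbps(n_cpus, local_gbps, shared_remote_gbps, local_fraction):
    """Aggregate bandwidth when each CPU does `local_fraction` of its
    accesses locally and the rest over the shared, contended remote path."""
    local_total = n_cpus * local_gbps * local_fraction
    # The remote path is shared: no matter how many CPUs hit it, it
    # delivers at most shared_remote_gbps in aggregate.
    remote_total = min(n_cpus * local_gbps * (1 - local_fraction), shared_remote_gbps)
    return local_total + remote_total

# All-local algorithm: 1000 * 1 GB/s = 1 TB/s aggregate.
print(aggregate_bandwidth_gbps(1000, 1.0, 1.0, 1.0))  # 1000.0
# All-remote algorithm: everyone contends for the same shared 1 GB/s.
print(aggregate_bandwidth_gbps(1000, 1.0, 1.0, 0.0))  # 1.0
```

A three-orders-of-magnitude gap, purely from where the data sits.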
It's a simple matter of programming.
It turns out that writing parallel code is hard, in large part because our languages were not designed with parallelism in mind. In order to achieve scalable parallel execution, we need to pretend that we have N independent processes executing serial code. We bolt on artifices such as MPI, OpenMP, … to provide explicit control of how parallelism is expressed in the code. We do this because we understand that locality is important to performance. Data motion is a necessary evil; we want to minimize this evil whenever possible. As CPUs get faster, each remote operation takes a large number of clock cycles to complete. We can introduce threading, non-blocking/asynchronous operations, and so forth.
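The "N independent processes executing serial code" model can be sketched in a few lines, here with Python's multiprocessing standing in for MPI purely for illustration. The kernel itself is ordinary serial code; the decomposition and the final reduction are bolted on around it, which is exactly the artifice the text describes.

```python
# Toy illustration of the "N independent serial processes" model,
# using Python's multiprocessing in place of MPI ranks.
from multiprocessing import Pool

def serial_kernel(chunk):
    # Ordinary serial code; knows nothing about parallelism or locality.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_procs):
    # Explicit domain decomposition: carve the data into per-process
    # local chunks, much as one would distribute work across MPI ranks.
    chunks = [data[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        partials = pool.map(serial_kernel, chunks)
    # The reduction step: this is where the data motion happens.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1000))
    assert parallel_sum_of_squares(data, 4) == serial_kernel(data)
```

Note that nothing in `serial_kernel` expresses locality; the programmer, not the language, decided where each chunk lives.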
All this does is make our core algorithms harder to express efficiently in terms of the underlying architecture. Remember, it is the compiler's job to map a high-level language into something architecture-specific. Unfortunately, most languages do not include a concept of locality or remoteness as core to them, so the optimizer can assume very little about the locality of resources. Well, sort of. Compilers are at least NUMA-aware these days; well, the compilers from some companies are, anyway. But the compilers know nothing about remote address spaces, or remote non-shared memory/processors/IO. This is all an artifice, bolted on via MPI. Which means that the optimizer is not considering any of it when performing its transformations.
I like noting something I have called a “law” before: you get your best performance by writing “close to the metal”. And since current supercomputers tend to be large, heavy clusters of many units, one might even call it “heavy metal programming”.
Put another way, the higher up you are in the abstraction chain, the less performance you are likely to be able to achieve.
Now flip this around. What could we achieve if we could transparently describe locality to our algorithms, such that, at the core language level, we could write our code without worrying about how many CPUs we would use, where we would use them, or how we would move memory around? Sort of like the way HPF or OpenMP allow you to specify things. Your code works nicely in serial, or in parallel; it works the same way either way. The compiler is given hints about where things are, and how to use them.
I know Intel has come out with Cluster OpenMP. Its price is prohibitive for common use, though, and I understand its performance isn't all that great either. This isn't to say anything is wrong with it; it is just a hard problem to tackle.
What I wonder is whether or not we are tackling it the right way. PGAS languages may be the right approach, or they may be a stop gap. I think we need smarter compiler technology that can deal with locality, homogeneity/heterogeneity (e.g. NUMA, APUs, aSMP, …).
Likewise in IO, enforcing locality and distributing operations should wind up providing maximal performance. The issue is that data motion is always expensive: in processor cycles, in time, in resources. Moreover, synchronization is just like data motion; worse in some respects, as synchronization is effectively a blocking operation. If you have a network which lets you push 1 byte per nanosecond, you can push something on the order of 1 GB/s. Of course, an initial latency of 1000+ nanoseconds, coupled with 2-4 GHz processor clocks (meaning 0.25-0.5 ns clock periods), gives you 2000-4000 clock cycles you have to pay for each latency-bound operation (IO over a fast network). This is not accounted for in many people's optimization work. Many small messages (e.g. synchronization, lock management) mean being swamped in latency, and very little can be done to hide that latency. Which means that if you are doing large-scale computing across N (a large number of) nodes, avoid synchronization and data motion as much as possible.
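The latency arithmetic above can be captured in a tiny cost model. The constants are the hypothetical numbers from the text (1 byte/ns link, 1000 ns per-message latency); the function names are mine.

```python
# Toy cost model: a link that moves 1 byte/ns (~1 GB/s) but charges
# 1000 ns of latency per message, as in the numbers above.
LATENCY_NS = 1000.0   # per-message latency
NS_PER_BYTE = 1.0     # 1 byte per nanosecond, i.e. ~1 GB/s wire rate

def effective_gbps(message_bytes):
    """Delivered bandwidth once per-message latency is charged."""
    total_ns = LATENCY_NS + message_bytes * NS_PER_BYTE
    return message_bytes / total_ns  # bytes/ns is numerically GB/s

def latency_in_clocks(clock_ghz):
    """How many CPU clock cycles one message latency costs."""
    return LATENCY_NS * clock_ghz

# Small, lock-sized messages are swamped by latency...
print(round(effective_gbps(8), 4))          # 0.0079
# ...while large messages amortize it away.
print(round(effective_gbps(1_000_000), 4))  # 0.999
# 1000 ns at 2-4 GHz is 2000-4000 clocks, as in the text.
print(latency_in_clocks(2.0), latency_in_clocks(4.0))  # 2000.0 4000.0
```

An 8-byte synchronization message delivers well under 1% of the wire rate, which is why many small messages swamp you in latency.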
And this brings us to cluster file systems. If your metadata is distributed all over the place, you have to move it, and likely lock it while changing it atomically. Which means you have to shuttle metadata around. Bad move; this inhibits scalability. OK, let's centralize metadata onto a single large, fast metadata server. Again, this hits limits as you run out of headroom on the MDSes.
Maybe the trick is not to move the metadata.
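One sketch of "not moving the metadata": place each path's metadata on a server chosen by a deterministic hash, so every client computes the owner locally and no metadata lookup traffic or lock shuttling is needed just to find it. The server names, their count, and the hashing scheme here are all hypothetical, purely for illustration.

```python
# Deterministic metadata placement: hash the path to pick its one owner.
# Server list and names are hypothetical.
import hashlib

MDS_SERVERS = ["mds0", "mds1", "mds2", "mds3"]

def metadata_owner(path):
    """Deterministically map a path to its (single) metadata server."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(MDS_SERVERS)
    return MDS_SERVERS[index]

# Every node computes the same answer with no communication at all.
print(metadata_owner("/scratch/job42/output.dat"))
```

A real design would want something like consistent hashing so that adding an MDS does not remap every path, but the point stands: placement by computation beats placement by communication.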
But this flies in the face of what the cluster FS folks want to do: aggregate bandwidth (good) by distributing load (problematic).
As the clusters get ever larger, the synchronization and delay issues, for code, for IO, etc., are only going to get worse. Pretty soon, we may need to talk about program “light-cones” and establish near-field (local) and far-field (remote) resources. Items outside the light cone would be invisible to the program. Building wide light cones would require specific types of coding and code design. Think of a single program running across multiple clusters at once as running with multiple light cones. Here parallelism is layered, and centrality is reduced: no single program running across N resources, but running instead across N/M light cones, each with M resources per light cone.
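The N/M decomposition above can be sketched directly: given a cone size M, a global rank maps to a cone and a rank within it, and its near field is just the other ranks in the same cone. The function names are mine, for illustration.

```python
# Sketch of the layered "light cone" decomposition: N nodes split into
# N/M light cones of M nodes each. Within a cone, peers are near-field;
# everything outside is far-field and effectively invisible to
# tightly-coupled communication.

def light_cone(rank, cone_size):
    """Map a global rank to (cone id, rank within the cone)."""
    return rank // cone_size, rank % cone_size

def near_field(rank, cone_size):
    """The global ranks inside this rank's light cone."""
    cone, _ = light_cone(rank, cone_size)
    return list(range(cone * cone_size, (cone + 1) * cone_size))

# 1000 nodes in cones of 100: rank 250 lives in cone 2, local rank 50.
print(light_cone(250, 100))      # (2, 50)
print(near_field(250, 100)[:3])  # [200, 201, 202]
```

Synchronization and data motion then happen mostly within a cone, with a separate, sparser layer of communication between cones.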
Will need to think more about this.