Non-locality in computing

I read an article a few weeks ago that mirrors some of the things I’ve said in the past about computing on huge systems. Basically, when you have a system of sufficiently large size, the communications fabric between the nodes is such that for any pair of nodes i and j, the latencies and transit times may not be uniform; worse, there may be a significant time cost to communicate between particular pairs of nodes.

In the text, I use the concept of a light cone. This is a construct from relativity theory, with time representing the vertical axis. As a message or some data moves, it is constrained by something approximating the speed of information flow across the distributed computer (let’s call this c_dist). You can exchange information between two nodes when their light cones intersect.

But in the case of a very large machine with potentially millions of cores, it may take far longer than anticipated for those light cones to intersect, as not all nodes are one hop away anymore.
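To make this concrete, here is a toy model (my numbers, purely illustrative): treat c_dist as a fixed per-hop latency on a 2D torus, so the earliest moment two nodes can exchange data is just their wraparound hop distance times the per-hop cost.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy light-cone model: on a PxP 2D torus, the earliest time two
 * nodes can exchange data is their hop distance times a per-hop
 * latency. All numbers are illustrative assumptions. */

static int torus_dist(int a, int b, int p) {
    int d = abs(a - b);
    return d < p - d ? d : p - d;   /* wraparound: shorter way around */
}

int main(void) {
    const int    P           = 1000;  /* 1000x1000 torus: 1M nodes */
    const double hop_latency = 1.0;   /* assumed 1 us per hop      */

    /* Nodes at "opposite" ends of the torus. */
    int ax = 0, ay = 0, bx = P / 2, by = P / 2;

    int hops = torus_dist(ax, bx, P) + torus_dist(ay, by, P);
    printf("hops = %d, earliest exchange = %.0f us\n",
           hops, hops * hop_latency);   /* 1000 hops -> 1000 us    */
    return 0;
}
```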

What I am gradually realizing is that within a light cone, data motion is not that hard, though spreading out to a larger cluster increases the difficulty of moving data. This isn’t just a bandwidth issue. Nor a latency issue.

If I want to transfer a file from A to B, I have to move the blocks between the two. Even if I could move them at infinite speed, I still have read and write times to contend with. Of course, I could take the alternative approach and not move them. In which case I need to worry about that 1/N problem: leave the blocks where they are, and only a small fraction of the machine has fast, local access to them.
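Some rough arithmetic (rates assumed purely for illustration) on why infinite wire speed does not save you:

```c
#include <stdio.h>

/* Back-of-the-envelope file transfer model. Rates are assumptions:
 * read at A, move over the network, write at B. */
int main(void) {
    const double size_gb   = 1000.0;  /* 1 TB file               */
    const double read_gbs  = 0.5;     /* assumed disk read, GB/s */
    const double net_gbs   = 1e9;     /* "infinite" network      */
    const double write_gbs = 0.5;     /* assumed disk write, GB/s */

    /* Naive store-and-forward: stages run back to back. */
    double serial = size_gb/read_gbs + size_gb/net_gbs + size_gb/write_gbs;

    /* Pipelined by block: total time approaches size / slowest stage. */
    double slowest   = read_gbs < write_gbs ? read_gbs : write_gbs;
    double pipelined = size_gb / slowest;

    printf("serial: %.0f s, pipelined: %.0f s\n", serial, pipelined);
    return 0;
}
```

Even with the network cost at effectively zero, the read and write stages bound the transfer; pipelining only hides the faster stages behind the slowest one.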

And this is somewhat problematic and painful at the mesoscale, as we are trying to force everything into a single light cone. What I am wondering now is whether that is worth the effort.

So how do you program in the face of non-uniform latency/bandwidth? Are your algorithms making an implicit assumption of uniform latency and bandwidth? Does MPI work well on a mesh, torus, or other topology, without significant (possibly drastic) changes to your code?
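MPI does at least let you express the topology. Here is a minimal sketch using the standard MPI_Cart_create / MPI_Cart_shift calls to build a periodic 2D (torus) communicator and talk only to nearest neighbors; whether the implementation actually maps ranks onto the physical mesh is another question entirely.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: build a 2D periodic (torus) communicator and exchange a
 * value with the +x neighbor. MPI is allowed to reorder ranks to
 * match the physical topology, but nothing forces it to. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2]    = {0, 0};           /* let MPI factor nprocs   */
    int periods[2] = {1, 1};           /* wraparound in both dims */
    MPI_Dims_create(nprocs, 2, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /*reorder*/, &torus);

    int rank, left, right;
    MPI_Comm_rank(torus, &rank);
    MPI_Cart_shift(torus, 0, 1, &left, &right);  /* +x neighbors  */

    double out = (double)rank, in;
    MPI_Sendrecv(&out, 1, MPI_DOUBLE, right, 0,
                 &in,  1, MPI_DOUBLE, left,  0,
                 torus, MPI_STATUS_IGNORE);

    printf("rank %d got %g from its -x neighbor\n", rank, in);
    MPI_Finalize();
    return 0;
}
```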

When we program for highest performance we seek to minimize the latency of various operations, or hide it behind other operations so we amortize its cost. Our algorithms need to be sensitive to this.
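The standard trick, sketched below with nonblocking MPI (the compute kernels are hypothetical stand-ins): post the transfer first, compute on data you already have, and pay only whatever latency is left at the wait.

```c
#include <mpi.h>

/* Hypothetical local kernels, stubbed out so the sketch compiles. */
static void compute_interior(double *u, int n) { (void)u; (void)n; }
static void compute_boundary(double *u, double *halo, int n) {
    (void)u; (void)halo; (void)n;
}

/* Latency hiding: post the halo exchange, work on interior points
 * that do not need the incoming data, and wait only at the end. */
void step(double *u, double *halo, int n, int nbr, MPI_Comm comm) {
    MPI_Request reqs[2];

    MPI_Irecv(halo, n, MPI_DOUBLE, nbr, 0, comm, &reqs[0]);
    MPI_Isend(u,    n, MPI_DOUBLE, nbr, 0, comm, &reqs[1]);

    compute_interior(u, n);          /* overlaps with the transfer */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, n);    /* needs the received halo    */
}
```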

What I wonder about is the future of large centralized switch infrastructures. They seem incompatible with the scaling now being pushed. A single point of information flow. A bottleneck.

The implications for storage are also interesting. A friend told me in an email that he thinks RAID is dying. I agree if we are looking at large multi-box single arrays. But not storage clusters. Not RAID within a box. Remember, the purpose of RAID is to give you a fighting chance to survive a failure long enough to replace a failed part.

But RAID semantics are breaking down as storage capacity rises. Rebuilds are now a significant problem in terms of time, for even moderately sized file systems. Hence the risk of a second failure rises. RAID6 was designed in part to survive this. But looking at the bigger picture of huge storage, this makes less sense. You need to be able to survive various hierarchies of failure. Not just a single disk. Or a single box.
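Rough numbers, assumed rather than measured, show the squeeze: rebuild time is bounded below by capacity over rebuild rate, and that window is exactly when a second failure hurts.

```c
#include <stdio.h>

/* Back-of-the-envelope RAID rebuild window. Capacity and rate are
 * illustrative assumptions, not measurements. */
int main(void) {
    const double disk_tb     = 2.0;   /* capacity of the failed disk */
    const double rebuild_mbs = 50.0;  /* rebuild rate, contended
                                         against production I/O      */
    double hours = disk_tb * 1e6 / rebuild_mbs / 3600.0;
    printf("minimum rebuild window: %.1f hours\n", hours);
    /* ~11 hours during which a second failure kills a RAID5 set;
       RAID6 buys one more failure, not a different scaling law.     */
    return 0;
}
```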

Worse, the concept of backup is changing, or needs to. As storage capacities and requirements rise on their fast exponential curve, backups become harder to perform in a fixed period of time. Which also means restoration becomes ever more difficult as a function of size.
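Again with assumed numbers: if capacity doubles every two years while the backup pipe improves 20% a year, a fixed nightly window stops being sufficient surprisingly quickly.

```c
#include <stdio.h>
#include <math.h>

/* Backup window squeeze, with assumed numbers: capacity doubles
 * every 2 years, the backup pipe improves 20%/year, the window
 * stays fixed at 8 hours. When does the backup stop fitting? */
int main(void) {
    double cap_tb   = 10.0;    /* today's capacity, TB             */
    double rate_mbs = 1000.0;  /* today's aggregate backup rate    */
    const double window_h = 8.0;

    for (int year = 0; year <= 10; year++) {
        double hours = cap_tb * 1e6 / rate_mbs / 3600.0;
        printf("year %2d: %7.1f TB -> %5.1f h %s\n", year, cap_tb,
               hours, hours > window_h ? "(misses the window)" : "");
        cap_tb   *= sqrt(2.0);  /* doubling every 2 years          */
        rate_mbs *= 1.2;
    }
    return 0;
}
```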

Of course, if you have an infinite budget, this is less of a problem. Most people don’t. So how can you provide resiliency in the face of failure in storage (which is what RAID was designed to do), as well as continuous operations in the face of failure in the hierarchy?

And in the case of large clusters and groups of distributed machines, how are you going to move data to where it is needed in a reasonable time period? Or are we looking at the problem wrong … and we need to let the data remain “frozen” while moving the code?

Huge numbers of processors with non-uniform fabrics pose some very interesting problems going forward. For computing, for storage, for backup.


3 thoughts on “Non-locality in computing”

  1. In the case of distributed machines, you move the compute. It’s the approach that many in the large-scale distributed computing world have been using for a while, and of course, it is the core of the MapReduce, etc., systems in vogue today.

  2. @Deepak

    I probably didn’t make it clear what I meant. This isn’t the scheduling problem we are dealing with. The question is, when you have a run with 100k cores in it, how would you (a) write your algorithm to deal with what will very likely be topologically distant sets of computing nodes (say, on “opposite” ends of a 2D toroidal surface), and (b) handle data motion and storage for the same?

    MapReduce and variants are nice for some aspects of computing (specifically for search and things that can be reformulated into a reduction of a mapping operation). Not everything can, and I am not sure that even a majority of applications can be reformulated in this manner.

    The question boils down to this: at some point, computers will have to deal with non-local effects for their singular simulation tasks. How will we program the algorithms to be either aware of this, or tolerant of this?

  3. I’ve always been somewhat surprised that Charm++ and similar techniques haven’t gone further. Yes, there’s overhead, but tracking any irregularity induces similar overhead. Current schedulers appear mostly to reinvent this old wheel. sigh.
