Non-locality in computing

I read an article a few weeks ago that mirrors some of the things I’ve said in the past about computing on huge systems. Basically, when you have a system of sufficiently large size, the communications fabric between the nodes is such that for any pair of nodes i and j, the latencies and transit times may not be uniform; worse, there may be a significant time cost to communicate between particular pairs of nodes.

In the text, I use the concept of a light cone. This is a construct from relativity theory, with time representing the vertical axis. As a message or some data moves, it is constrained by something approximating the speed of information flow across the distributed computer (let’s call this c_dist). You can exchange information between two nodes when their light cones intersect.

But in the case of a very large machine with potentially millions of cores, it may take far longer than anticipated for those light cones to intersect, as not all nodes are one hop away anymore.
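To make this concrete, here is a toy model (my numbers, purely illustrative): treat c_dist as a fixed per-hop latency on a 2D torus, so the earliest moment two nodes can exchange data is just their wraparound hop distance times the per-hop cost.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy light-cone model: on a PxP 2D torus, the earliest time two
 * nodes can exchange data is their hop distance times a per-hop
 * latency. All numbers are illustrative assumptions. */

static int torus_dist(int a, int b, int p) {
    int d = abs(a - b);
    return d < p - d ? d : p - d;   /* wraparound: shorter way around */
}

int main(void) {
    const int    P           = 1000;  /* 1000x1000 torus: 1M nodes */
    const double hop_latency = 1.0;   /* assumed 1 us per hop      */

    /* Nodes at "opposite" ends of the torus. */
    int ax = 0, ay = 0, bx = P / 2, by = P / 2;

    int hops = torus_dist(ax, bx, P) + torus_dist(ay, by, P);
    printf("hops = %d, earliest exchange = %.0f us\n",
           hops, hops * hop_latency);   /* 1000 hops -> 1000 us    */
    return 0;
}
```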

What I am gradually realizing is that within a light cone, data motion is not that hard, though spreading out to a larger cluster increases the difficulty of moving data. This isn’t just a bandwidth issue. Nor a latency issue.

If I want to transfer a file from A to B, I have to move the blocks between the two. Even if I could move them at infinite speed, I still have read and write times to contend with. Of course, I could take the alternative approach and not move them. In which case I need to worry about that 1/N problem: leave the blocks where they are, and only a small fraction of the machine has fast, local access to them.
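Some rough arithmetic (rates assumed purely for illustration) on why infinite wire speed does not save you:

```c
#include <stdio.h>

/* Back-of-the-envelope file transfer model. Rates are assumptions:
 * read at A, move over the network, write at B. */
int main(void) {
    const double size_gb   = 1000.0;  /* 1 TB file               */
    const double read_gbs  = 0.5;     /* assumed disk read, GB/s */
    const double net_gbs   = 1e9;     /* "infinite" network      */
    const double write_gbs = 0.5;     /* assumed disk write, GB/s */

    /* Naive store-and-forward: stages run back to back. */
    double serial = size_gb/read_gbs + size_gb/net_gbs + size_gb/write_gbs;

    /* Pipelined by block: total time approaches size / slowest stage. */
    double slowest   = read_gbs < write_gbs ? read_gbs : write_gbs;
    double pipelined = size_gb / slowest;

    printf("serial: %.0f s, pipelined: %.0f s\n", serial, pipelined);
    return 0;
}
```

Even with the network cost at effectively zero, the read and write stages bound the transfer; pipelining only hides the faster stages behind the slowest one.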

And this is somewhat problematic and painful at the mesoscale, as we are trying to force everything into a single light cone. What I am wondering now is whether that is worth the effort.

So how do you program in the face of non-uniform latency/bandwidth? Are your algorithms making an implicit assumption of uniform latency and bandwidth? Does MPI work well on a mesh, torus, or other topology, without significant (possibly drastic) changes to your code?
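MPI does at least let you express the topology. Here is a minimal sketch using the standard MPI_Cart_create / MPI_Cart_shift calls to build a periodic 2D (torus) communicator and talk only to nearest neighbors; whether the implementation actually maps ranks onto the physical mesh is another question entirely.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: build a 2D periodic (torus) communicator and exchange a
 * value with the +x neighbor. MPI is allowed to reorder ranks to
 * match the physical topology, but nothing forces it to. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[2]    = {0, 0};           /* let MPI factor nprocs   */
    int periods[2] = {1, 1};           /* wraparound in both dims */
    MPI_Dims_create(nprocs, 2, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /*reorder*/, &torus);

    int rank, left, right;
    MPI_Comm_rank(torus, &rank);
    MPI_Cart_shift(torus, 0, 1, &left, &right);  /* +x neighbors  */

    double out = (double)rank, in;
    MPI_Sendrecv(&out, 1, MPI_DOUBLE, right, 0,
                 &in,  1, MPI_DOUBLE, left,  0,
                 torus, MPI_STATUS_IGNORE);

    printf("rank %d got %g from its -x neighbor\n", rank, in);
    MPI_Finalize();
    return 0;
}
```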

When we program for highest performance we seek to minimize the latency of various operations, or hide it behind other operations so we amortize its cost. Our algorithms need to be sensitive to this.
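The standard trick, sketched below with nonblocking MPI (the compute kernels are hypothetical stand-ins): post the transfer first, compute on data you already have, and pay only whatever latency is left at the wait.

```c
#include <mpi.h>

/* Hypothetical local kernels, stubbed out so the sketch compiles. */
static void compute_interior(double *u, int n) { (void)u; (void)n; }
static void compute_boundary(double *u, double *halo, int n) {
    (void)u; (void)halo; (void)n;
}

/* Latency hiding: post the halo exchange, work on interior points
 * that do not need the incoming data, and wait only at the end. */
void step(double *u, double *halo, int n, int nbr, MPI_Comm comm) {
    MPI_Request reqs[2];

    MPI_Irecv(halo, n, MPI_DOUBLE, nbr, 0, comm, &reqs[0]);
    MPI_Isend(u,    n, MPI_DOUBLE, nbr, 0, comm, &reqs[1]);

    compute_interior(u, n);          /* overlaps with the transfer */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, n);    /* needs the received halo    */
}
```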

What I wonder about is the future of large centralized switch infrastructures. They seem incompatible with the scaling now being pushed. A single point of information flow. A bottleneck.

The implications for storage are also interesting. A friend told me in an email that he thinks RAID is dying. I agree if we are looking at large multi-box single arrays. But not storage clusters. Not RAID within a box. Remember, the purpose of RAID is to give you a fighting chance to survive a failure long enough to replace a failed part.

But RAID semantics are breaking down as storage capacity rises. Rebuilds are now a significant problem in terms of time, for even moderately sized file systems. Hence the risk of a second failure rises. RAID6 was designed in part to survive this. But looking at the bigger picture of huge storage, this makes less sense. You need to be able to survive various hierarchies of failure. Not just a single disk. Or a single box.
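Rough numbers, assumed rather than measured, show the squeeze: rebuild time is bounded below by capacity over rebuild rate, and that window is exactly when a second failure hurts.

```c
#include <stdio.h>

/* Back-of-the-envelope RAID rebuild window. Capacity and rate are
 * illustrative assumptions, not measurements. */
int main(void) {
    const double disk_tb     = 2.0;   /* capacity of the failed disk */
    const double rebuild_mbs = 50.0;  /* rebuild rate, contended
                                         against production I/O      */
    double hours = disk_tb * 1e6 / rebuild_mbs / 3600.0;
    printf("minimum rebuild window: %.1f hours\n", hours);
    /* ~11 hours during which a second failure kills a RAID5 set;
       RAID6 buys one more failure, not a different scaling law.     */
    return 0;
}
```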

Worse, the concept of backup is changing, or needs to. As storage capacities and requirements rise on their fast exponential curve, backups become harder to perform in a fixed period of time. Which also means restoration becomes ever more difficult as a function of size.
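Again with assumed numbers: if capacity doubles every two years while the backup pipe improves 20% a year, a fixed nightly window stops being sufficient surprisingly quickly.

```c
#include <stdio.h>
#include <math.h>

/* Backup window squeeze, with assumed numbers: capacity doubles
 * every 2 years, the backup pipe improves 20%/year, the window
 * stays fixed at 8 hours. When does the backup stop fitting? */
int main(void) {
    double cap_tb   = 10.0;    /* today's capacity, TB             */
    double rate_mbs = 1000.0;  /* today's aggregate backup rate    */
    const double window_h = 8.0;

    for (int year = 0; year <= 10; year++) {
        double hours = cap_tb * 1e6 / rate_mbs / 3600.0;
        printf("year %2d: %7.1f TB -> %5.1f h %s\n", year, cap_tb,
               hours, hours > window_h ? "(misses the window)" : "");
        cap_tb   *= sqrt(2.0);  /* doubling every 2 years          */
        rate_mbs *= 1.2;
    }
    return 0;
}
```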

Of course, if you have an infinite budget, this is less of a problem. Most people don’t. So how can you provide resiliency in the face of failure in storage (which is what RAID was designed to do), as well as continuous operations in the face of failure in the hierarchy?

And in the case of large clusters and groups of distributed machines, how are you going to move data to where it is needed in a reasonable time period? Or are we looking at the problem wrong … and we need to let the data remain “frozen” while moving the code?

Huge numbers of processors with non-uniform fabrics pose some very interesting problems going forward. For computing, for storage, for backup.


3 thoughts on “Non-locality in computing”

  1. In the case of distributed machines, you move the compute. It’s the approach that many in the large-scale distributed computing world have been using for a while, and of course, it is the core of the MapReduce, etc., systems in vogue today.

  2. @Deepak

    I probably didn’t make it clear what I meant. This isn’t the scheduling problem we are dealing with. The question is, when you have a run with 100k cores in it, how would you (a) write your algorithm to deal with what will very likely be topologically distant sets of computing nodes (say, on “opposite” ends of a 2D toroidal surface), and (b) handle data motion and storage for the same?

    MapReduce and variants are nice for some aspects of computing (specifically for search and things that can be reformulated into a reduction of a mapping operation). Not everything can, and I am not sure that even a majority of applications can be reformulated in this manner.

    The question boils down to this: at some point, computers will have to deal with non-local effects for their singular simulation tasks. How will we program the algorithms to be either aware of this, or tolerant of this?

  3. I’ve always been somewhat surprised that Charm++ and similar techniques haven’t gone further. Yes, there’s overhead, but tracking any irregularity induces similar overhead. Current schedulers appear mostly to reinvent this old wheel. sigh.
