lightcones, filesystems and messages, and large distributed clusters with non-infinite bandwidth and finite latency

A point I try to make to customers at the day job is that, as you scale up a systems size, your design will need to scale as well. And this begs the question. How will it need to change?

Currently (May 2007) simple NFS suffers from 1/N problems (1/N meaning that as the average number of requesters N increases, the average available fixed resource available per requester works out to about 1/N … modulo duty cycles, transients, etc). pNFS is coming, soon. Parallel file systems exist, though most of what I have seen/read about also have SPIFs (single points of information flow). SPIFs are bad if they are in critical paths. The designs may push the maximum scalability, and sure, someone can create a simple benchmark test showing writing to a file on the parallel file system is fast, as long as we are not moving metadata, acquiring/testing locks, etc.

The small clusters for which NFS is fine, call it up to 16-24 nodes in conventional designs, 64 if the people building the file system know what they are doing, might be called small in some sense. Not a micro-cluster, but that might not be a bad name. Rather it might make sense to demark cluster physical sizes in terms of the log of the number of nodes. The base of the log isn’t terribly relevant, so lets take 2 for convenience. Lets (for the moment) call clusters of size 6 (e.g. 2**6 nodes, or 2^6, or 2 to the power 6 nodes) as being “micro” clusters. Yes, the power/cooling bill isn’t so micro. At 300W/dual-opteron, and more than 400W/dual-woodcrest/clovertown, I am sure they don’t feel all that “micro”.

The point is that these clusters may allow you to get away with particular configurations and simplistic designs that you would not be able to use as the size increases. You can buy ethernet and faster fabric switches for these systems, so that everything appears local, e.g. one hop away to every node through the switch. That means, that all the nodes are in the same lightcone as each other. Shared data is easy, NFS provides files, and MPI messages can skit easily across a single hop.

From size 6+delta (e.g. 2 to the power 6+delta where delta is tiny compared to 1) to about size 12 (2 to the power 12, or 4096 nodes), I would call a meso-scale cluster. These systems require thinking about how to network, as you will need a fabric of networks. File systems need some thought as well, as few file systems can deal with 4096 simultaneous accesses over the net. In these designs, you start to get hierarchies of design, such as Clos networks. Work is pushed to have these units remain within each others lightcone. Serious expenditure of money and effort goes into this. Likewise on the parallel file system front, you need to use the parallel MPI hooks to get good performance. You have N*constant memory systems, N*constant CPUs, and likely N*constant disks. Yet it is fairly likely that the local disks would be use for OS if anything.

What this gets to is the real hard problem in programming/using clusters these days (and only going to get worse). Data motion. I have been talking about it for years, and people have been picking up on it recently. Data motion is hard. It is expensive. It is often a point of serialization. Exludus and others leverage pre-caching techniques to hide latency of the transfer, and while that is good for a subset of problems (especially if you can trick the job scheduler to believe that data availability is a prequisite to application start).

What I am gradually realizing is that within a lightcone, data motion is not as hard, though spreading out to a larger size cluster increases the difficulty in moving data. This isn’t a just a bandwidth issue. Nor a latency issue.

If I want to transfer a file from A to B, I have to move the blocks between the two. Even if I could move them at infinite speed, I still have read and write times to contend with. Of course, I could take the alternative approach and not move them. In which case I need to worry about that 1/N problem.

And this is somewhat problematic and painful in the mesoscale size, as we are trying to force everything into a single light cone. What I am wondering now is whether that is worth the effort.

In part, because the macro scale systems of size 12+delta to 20 can’t really think in terms of single unified light cones across the cluster. This means that if each node can pull from the file system, locking (latency) will become a massive issue. This is a global problem as we are making it as such. What if we simply accepted that larger macro scale cluster were made up of multiple light cones? So that each node might not be a single hop away? What if they were 2 hops. Or 3.

The global file system for these macro clusters is like adding additional spatial dimensions to a system. It allows us to say that all systems have a connection *here*, whatever *here* means.

In the case of messaging, it allows for that single hop. No hard routing.

So in the macro cluster size, what if we broke this assumption? File systems and messaging are all about moving data. File systems worry about locks (latency) and throughput (bandwidth). Messages do as well.

If we break the assumption, can we deal with file system consistency? Message ordering? Is message ordering strictly needed? File system consistency is needed, but how do you implement this globally as the number of nodes gets large? Maybe implement this in terms of no-local-writes, so all writes have to go out over the net, forcing a data motion.

Gets curiouser and curiouser. These are hard problems to solve, and I suspect that we will need to be solving them quite soon.

Viewed 12618 times by 2780 viewers