lightcones, filesystems and messages, and large distributed clusters with non-infinite bandwidth and finite latency

A point I try to make to customers at the day job is that, as you scale up a system's size, your design will need to scale as well. Which raises the question: how will it need to change?

Currently (May 2007) simple NFS suffers from 1/N problems (1/N meaning that as the average number of requesters N increases, the fixed resource available per requester works out to about 1/N … modulo duty cycles, transients, etc). pNFS is coming, soon. Parallel file systems exist, though most of what I have seen/read about also have SPIFs (single points of information flow). SPIFs are bad if they are in critical paths. The designs may push the maximum scalability, and sure, someone can create a simple benchmark showing that writing to a file on the parallel file system is fast, as long as we are not moving metadata, acquiring/testing locks, etc.
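The 1/N problem above can be sketched in a few lines. This is a toy model, not a measurement; the 100 MB/s server figure is an illustrative assumption, and it ignores the duty cycles and transients mentioned above.

```python
# Toy model of the 1/N problem: one fixed server resource divided
# among N concurrent requesters. The 100 MB/s figure is hypothetical.

def per_requester_bandwidth(total_mb_s, n_requesters):
    """Average share each requester sees of one fixed-capacity server."""
    return total_mb_s / n_requesters

# A single NFS server with ~100 MB/s of usable bandwidth:
for n in (1, 16, 64, 256, 1024):
    print(f"N={n:5d}  ~{per_requester_bandwidth(100.0, n):8.3f} MB/s each")
```

By the time N reaches four digits, each requester is down in the noise, which is exactly why a single server stops making sense.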
The small clusters for which NFS is fine, call it up to 16-24 nodes in conventional designs (64 if the people building the file system know what they are doing), might be called small in some sense. "Micro-cluster" might not be a bad name. Rather, it might make sense to demarcate cluster physical sizes in terms of the log of the number of nodes. The base of the log isn't terribly relevant, so let's take 2 for convenience. Let's (for the moment) call clusters up to size 6 (e.g. 2**6 nodes, or 2^6, or 2 to the power 6 nodes) "micro" clusters. Yes, the power/cooling bill isn't so micro. At 300W per dual-Opteron, and more than 400W per dual-Woodcrest/Clovertown, I am sure they don't feel all that "micro".
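The size scheme above (micro up to size 6, meso up to 12, macro up to 20, per the later paragraphs) is simple enough to write down. A minimal sketch, with the class names taken from this post:

```python
import math

def cluster_scale(nodes):
    """Classify a cluster by the log2 of its node count, per the
    micro/meso/macro scheme sketched in this post."""
    size = math.log2(nodes)
    if size <= 6:        # up to 2**6 = 64 nodes
        return "micro"
    elif size <= 12:     # up to 2**12 = 4096 nodes
        return "meso"
    elif size <= 20:     # up to 2**20, about a million nodes
        return "macro"
    return "beyond"

print(cluster_scale(24), cluster_scale(4096), cluster_scale(10**6))
```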
The point is that these clusters may allow you to get away with particular configurations and simplistic designs that you would not be able to use as the size increases. You can buy ethernet and faster fabric switches for these systems, so that everything appears local, e.g. one hop away from every node through the switch. That means that all the nodes are in the same lightcone as each other. Shared data is easy, NFS provides files, and MPI messages can skip easily across a single hop.
From size 6+delta (e.g. 2 to the power 6+delta, where delta is tiny compared to 1) to about size 12 (2 to the power 12, or 4096 nodes), I would call these meso-scale clusters. These systems require thinking about how to network, as you will need a fabric of networks. File systems need some thought as well, as few file systems can deal with 4096 simultaneous accesses over the net. In these designs, you start to get hierarchies of design, such as Clos networks. Work is pushed to have these units remain within each other's lightcone. Serious expenditure of money and effort goes into this. Likewise on the parallel file system front, you need to use the parallel MPI hooks to get good performance. You have N*constant memory systems, N*constant CPUs, and likely N*constant disks. Yet it is fairly likely that the local disks would be used for the OS, if anything.
What this gets to is the real hard problem in programming/using clusters these days (and it is only going to get worse): data motion. I have been talking about it for years, and people have been picking up on it recently. Data motion is hard. It is expensive. It is often a point of serialization. eXludus and others leverage pre-caching techniques to hide the latency of the transfer, and that is good for a subset of problems (especially if you can trick the job scheduler into believing that data availability is a prerequisite to application start).
What I am gradually realizing is that within a lightcone, data motion is not as hard, though spreading out to a larger size cluster increases the difficulty in moving data. This isn't just a bandwidth issue. Nor a latency issue.
If I want to transfer a file from A to B, I have to move the blocks between the two. Even if I could move them at infinite speed, I still have read and write times to contend with. Of course, I could take the alternative approach and not move them. In which case I need to worry about that 1/N problem.
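The point about infinite speed can be made concrete. A naive serial model (read, then transfer, then write, with no overlap or pipelining; the 10 GiB file and 100 MB/s disks are assumed numbers, not from any benchmark) shows the residual cost:

```python
# Naive serial model of moving a file from A to B: read it off disk,
# push it over the net, write it to disk, one step after the other.
# Real systems pipeline these, so treat this as a worst-case sketch.

def transfer_time(size_bytes, read_bw, net_bw, write_bw):
    """Seconds to move size_bytes when read, transfer, and write
    happen strictly in sequence (bandwidths in bytes/sec)."""
    return size_bytes / read_bw + size_bytes / net_bw + size_bytes / write_bw

GIB = 2**30
# 10 GiB file, 100 MB/s disks on each end, infinitely fast network:
t = transfer_time(10 * GIB, 100e6, float("inf"), 100e6)
print(f"{t:.1f} s spent just reading and writing the blocks")
```

Even with the network term at zero, the read and write terms alone come to several minutes for a 10 GiB file on commodity disks. That is the part of the cost you cannot wish away.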
And this is somewhat problematic and painful in the mesoscale size, as we are trying to force everything into a single light cone. What I am wondering now is whether that is worth the effort.
In part, because the macro scale systems of size 12+delta to 20 can't really think in terms of single unified light cones across the cluster. This means that if each node can pull from the file system, locking (latency) will become a massive issue. This is a global problem only because we are making it one. What if we simply accepted that larger macro scale clusters were made up of multiple light cones? So that each node might not be a single hop away? What if they were 2 hops away. Or 3.
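What multiple light cones could mean for hop counts can be sketched with a toy two-level leaf/spine fabric, in the spirit of the Clos networks mentioned earlier. The pod size of 64 here is a hypothetical choice for illustration, not anything from a real design:

```python
# Hypothetical two-level fabric: nodes in the same pod share a leaf
# switch (one hop apart); crossing pods goes leaf -> spine -> leaf
# (three hops). Pod size 64 is an assumed, illustrative number.

def hops(a, b, pod_size=64):
    """Switch hops between node ids a and b in this toy topology."""
    if a == b:
        return 0
    same_pod = (a // pod_size) == (b // pod_size)
    return 1 if same_pod else 3

print(hops(0, 1), hops(0, 64), hops(63, 64))
```

Within a pod you stay in one light cone; the jump from one to three hops is exactly the boundary between light cones that the macro scale forces on you.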
The global file system for these macro clusters is like adding additional spatial dimensions to a system. It allows us to say that all systems have a connection *here*, whatever *here* means.
In the case of messaging, it allows for that single hop. No hard routing.
So in the macro cluster size, what if we broke this assumption? File systems and messaging are all about moving data. File systems worry about locks (latency) and throughput (bandwidth). Messages do as well.
If we break the assumption, can we deal with file system consistency? Message ordering? Is message ordering strictly needed? File system consistency is needed, but how do you implement this globally as the number of nodes gets large? Maybe implement this in terms of no-local-writes, so all writes have to go out over the net, forcing a data motion.
Gets curiouser and curiouser. These are hard problems to solve, and I suspect that we will need to be solving them quite soon.

3 thoughts on “lightcones, filesystems and messages, and large distributed clusters with non-infinite bandwidth and finite latency”

  1. Hello, I'm trying to understand what you mean by "this assumption" in "So in the macro cluster size, what if we broke this assumption?". And could you explain what you mean by "adding additional spatial dimensions to a system"?
    I’ve read your post on Beowulf –
    My definition of a scalable distributed file system is, BTW, one that connects to every compute node, and gives local I/O speed to simultaneous reads and writes (to the same/different files) across the single namespace.
    Do you mean by "additional spatial dimensions" a file system that would be accessible from every node in one hop? And then you propose to break the assumption about that? Am I right?
    Thank you.
    Petrov Alexander

  2. Hi Didro
    Sorry about the comment moderation delay.
    Ok … first, my comment in 2002 applied to the world as I saw it then, 4.5 years ago.
    The assumption is that every node has simultaneous high speed access to the one namespace. Specifically, a single file system namespace provided over an arbitrarily sized span of nodes. This is the assumption I am wondering whether it makes sense to break. How to break it in a sensible way, I am not sure. Call this wondering aloud. The idea is not well formed at the moment.
    The extra spatial dimension. Each connection to a file system may be indexed by a number representing its namespace. If every node is connected to the same namespace, then all nodes have the same namespace index, and hence we can pass data between two nodes simply by creating a message in the namespace as a file which the other node can read. All nodes have access to all files without node-to-node traffic; no additional data motion is required. Now place the file on one namespace, and have another machine not connected to that namespace want to read/write it. First it has to move between namespaces. You can talk in terms of a distance between namespaces, as a distance between nodes on a line or on a graph.
    This idea isn’t well formed either. Would like to hash it out better.
    The point I had been getting to is quite simply that machines are growing larger than our initial assumptive and simplistic designs. These designs may break down at larger scales than we have considered. So how do we handle it?
    Basically asking how to design a machine with 1 million cores. And beyond.
    The “murky” imagery I have in my head is something that looks like a transition from discrete nodes to “continuous systems”, and what does that do for things like file systems, MPI/messaging, etc.

  3. I would agree with Joe that I/O is the limiting factor that has been long overlooked. With RepliCator we pre-cache data ahead of its use in parallel, so that applications requiring simultaneous access from multiple nodes to large data sets see immense improvements. But not all applications fit this description, Joe is right.
    On the other hand Grid Optimizer aims at the more general purpose case where not everyone is trying to run a "Grand Challenge" problem. And there again I would agree with Joe. Discrete nodes are a challenge to scaling and, bluntly said, exceed the specs of the technology they are based on. Would a continuous system be better? No, if one tries to build a single continuous system which includes everything (OS, file system, scheduling, memory, etc), as distributed systems of this sort require too much synchronization work, which we know is a bad thing for scalability. Yes, if one selects just what is required to manage the behavior of a continuous system. And this is exactly the path Grid Optimizer is on: a continuous virtualized scheduler layer.
    The real hard problem is then how to cope with dynamically changing continuous systems where nodes and subsets of clusters come and go constantly. This calls for on-the-fly provisioning capabilities which existing tools can't even dream of yet. And, paradoxically, this makes the whole I/O scaling problem even worse.
    These are the sorts of problems we are addressing at eXludus, trying hard to find innovative solutions.

Comments are closed.