As systems scale up, hard problems are exposed

You have a 4 node cluster. You want to share data among the nodes. Pretend it is a “desktop” machine. Fine. Setup NFS. Or Samba/CIFS. Share the data. End of story.

But this doesn’t work as well when you get to 40 nodes, and starts failing badly at 400. At 4000 nodes, well, you need a special filesystem design.

What happened? Why when we scale up do problems arise?

Well, you have several factors. Contention for shared resources (metadata, pipes), and data motion. The resource contention problem is a hard one. Metadata sharing is something that limited scalability of many cluster file system implementations. Some worked around it by keeping single metadata servers. This also limited metadata sharing.

Data pipes are also problematic. As you scale from N to N*M requesters for a particular resource, on average your available resource per requestor drops by a factor of M. So if you have a nice 1 GB/s pipe to your disks, and 10 clients using it, it may on average give you about 100 MB/s performance. Now bump that up by a factor of 10. Your file system should also scale up to handle this. But it doesn’t. Rather it goes in the opposite direction, and you get on average 10 MB/s.

Reliability is an issue. You need a way to guarantee preservation of state and ability to continue execution in the face of errors. How can you do this? Distributed parallel programs have local state, and less global state. Possibly replicating local state to additional nodes. Would cost more, and reduce performance, but compare that to the loss of a single node in an MPI code that doesn’t gracefully handle error.

Finally, the real hard problem. Data motion. A teraflop cluster needs something to operate on. This means you have to move data. You can either host all your data on a really insanely fast disk, or pre-distribute the data, and take advantage of the fact that local data IO is always faster in aggregate than centralized data IO for sufficient sized systems.

As your data sets grow, they become less mobile as well. You can distribute 1TB of data, but unless you are doing it at 1GB/s, be prepared to take at least a large fraction of a day to do it.

Data motion is a hard problem, and is tied to the problem of scalable file systems. We have been worried about data motion for a while. Since 1999 we had developed a number of data distribution tools, the latest, xcp, being a tool to move data efficiently and reliably in a cluster.

There are other hard problems not talked about here. I get the sense that we are at the real infancy of high performance and highly reliable computation, with high data rates. This is an exciting time, not in the Chinese proverb sense, but in terms of the possibilities of what we can do.

Viewed 11948 times by 2755 viewers