As systems scale up, hard problems are exposed

You have a 4-node cluster. You want to share data among the nodes. Pretend it is a “desktop” machine. Fine. Set up NFS. Or Samba/CIFS. Share the data. End of story.

But this doesn’t work as well when you get to 40 nodes, and starts failing badly at 400. At 4000 nodes, well, you need a special filesystem design.

What happened? Why do problems arise when we scale up?

Well, you have several factors: contention for shared resources (metadata, data pipes), and data motion. The resource contention problem is a hard one. Metadata sharing is something that has limited the scalability of many cluster file system implementations. Some worked around it by keeping a single metadata server, but that single server then becomes the bottleneck that limits metadata scaling.

Data pipes are also problematic. As you scale from N to N*M requesters for a particular resource, the available resource per requester drops, on average, by a factor of M. So if you have a nice 1 GB/s pipe to your disks and 10 clients using it, each may on average see about 100 MB/s. Now bump the client count up by a factor of 10. Your file system should also scale up to handle this. But it doesn’t. Rather, it goes in the opposite direction, and each client gets on average 10 MB/s.
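A quick back-of-the-envelope sketch of that fair-share arithmetic, assuming perfectly fair sharing and no protocol overhead (which is optimistic); the numbers are the hypothetical ones above:

```python
# Fair-share bandwidth per client on a shared pipe.  Optimistic model: ignores
# protocol overhead, caching, and contention pathologies.
def per_client_mb_s(pipe_gb_s: float, clients: int) -> float:
    """Average bandwidth per client, in MB/s."""
    return pipe_gb_s * 1000.0 / clients

for n in (10, 100, 1000):
    print(f"{n:5d} clients on a 1 GB/s pipe -> ~{per_client_mb_s(1.0, n):.0f} MB/s each")
#    10 clients -> ~100 MB/s each
#   100 clients -> ~10 MB/s each
#  1000 clients -> ~1 MB/s each
```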

Reliability is an issue. You need a way to guarantee preservation of state, and the ability to continue execution in the face of errors. How can you do this? Distributed parallel programs have mostly local state and relatively little global state. One option is replicating that local state to additional nodes. It would cost more and reduce performance, but compare that to the loss of a single node in an MPI code that doesn’t handle errors gracefully, which takes down the entire run.
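A minimal sketch of that replicate-the-local-state idea. The paths, ring buddy scheme, and pickle format here are illustrative assumptions, not any particular MPI fault-tolerance API; a real code would push the replica over the interconnect rather than through a shared scratch directory:

```python
# Each rank checkpoints its local state, then mirrors the file to a "buddy"
# rank's scratch area so the run can be restarted if one node dies.
import os
import pickle
import shutil

def checkpoint(state: dict, rank: int, nranks: int, step: int,
               scratch: str = "/tmp/scratch") -> str:
    os.makedirs(f"{scratch}/rank{rank}", exist_ok=True)
    local = f"{scratch}/rank{rank}/ckpt_step{step}.pkl"
    with open(local, "wb") as f:
        pickle.dump(state, f)                  # local write: fast
    buddy = (rank + 1) % nranks                # simple ring buddy scheme
    os.makedirs(f"{scratch}/rank{buddy}", exist_ok=True)
    shutil.copy(local, f"{scratch}/rank{buddy}/ckpt_rank{rank}_step{step}.pkl")
    # Replication costs time and space, but losing one node no longer kills the run.
    return local

# e.g. checkpoint({"iteration": 42, "local_grid": [0.0] * 1024}, rank=3, nranks=8, step=42)
```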

Finally, the real hard problem: data motion. A teraflop cluster needs something to operate on. This means you have to move data. You can either host all your data on a really, insanely fast central disk, or pre-distribute the data and take advantage of the fact that local data IO is always faster in aggregate than centralized data IO for sufficiently sized systems.
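A rough model of that crossover, assuming each node reads a local disk at about 70 MB/s while the central store sits behind a single shared 1 GB/s pipe (illustrative numbers, not measurements):

```python
# Aggregate local IO vs a single shared central pipe.
LOCAL_MB_S = 70.0      # per-node local disk read speed (assumed)
CENTRAL_MB_S = 1000.0  # shared central pipe (assumed)

for nodes in (4, 40, 400):
    aggregate_local = nodes * LOCAL_MB_S       # grows with the node count
    print(f"{nodes:4d} nodes: {aggregate_local:8.0f} MB/s local aggregate "
          f"vs {CENTRAL_MB_S:.0f} MB/s central")
# Past roughly 15 nodes, the local disks in aggregate already outrun the central pipe.
```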

As your data sets grow, they become less mobile as well. You can distribute 1 TB of data, but unless you are doing it at 1 GB/s, be prepared to spend anywhere from a few hours to the better part of a day moving it.
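The arithmetic behind that claim, for a few plausible sustained transfer rates:

```python
# Time to move a data set at a given sustained rate; transfer overhead ignored.
def transfer_hours(data_tb: float, rate_mb_s: float) -> float:
    return data_tb * 1_000_000 / rate_mb_s / 3600.0

for rate in (1000.0, 100.0, 10.0):             # fast SAN, gigabit-ish, slow NFS
    print(f"1 TB at {rate:6.0f} MB/s -> {transfer_hours(1.0, rate):5.1f} hours")
# 1 TB at   1000 MB/s ->   0.3 hours
# 1 TB at    100 MB/s ->   2.8 hours
# 1 TB at     10 MB/s ->  27.8 hours
```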

Data motion is a hard problem, and it is tied to the problem of scalable file systems. We have been worried about data motion for a while. Since 1999 we have developed a number of data distribution tools, the latest, xcp, being a tool to move data efficiently and reliably within a cluster.

There are other hard problems not talked about here. I get the sense that we are still in the infancy of high performance, highly reliable computation with high data rates. This is an exciting time, not in the Chinese proverb sense, but in terms of the possibilities of what we can do.


2 thoughts on “As systems scale up, hard problems are exposed”

  1. It isn’t a file system. It is pre-caching software that basically gates a run until the dependent files are there. We demonstrated similar functionality with mcp and xcp a few years ago. Exludus integrated theirs into the queuing system directly, rather than as part of the job script itself. They require some hooks to be able to hold jobs until all the files have been sent.

    It won’t help speed up runs. BLAST and HMMer will not take any less or more time to compute as a result of their software. Their time measurements run from the start of the job to the end of the job, and do not include the time spent moving data with their tool. Moreover, their data motion occurs (as far as I remember) over Ethernet. Our xcp uses whatever high performance network you have in place … if you have InfiniBand, your data will flow over that.

    For most users of smaller clusters, this is not so much of an issue, as most data motion can be pre-cached easily. For larger data sets pre-caching is needed, though the local drives may not be large enough for the data. The nt database is rapidly growing to the size of local drives. Add to this that the read speed of disk drives is on the order of 70 MB/s, so 1 GB will take about 14 seconds to read, and 10 GB at least 140 seconds. Multiple passes through the file may be problematic for significantly sized databases (larger than the memory cache). Then we need to start using standard IO performance techniques, such as blocking the calculations to operate on data in memory and streaming data into other buffers in RAM while calculations are going on in the first buffer. Exludus won’t help with any of this. It will help get the data to the compute node.
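    A rough sketch of that double-buffering pattern in Python, with a reader thread filling the next block while the main thread processes the current one. The block size and the process step are placeholders, not anything Exludus or xcp actually does:

    ```python
    import queue
    import threading

    BLOCK = 64 * 1024 * 1024                     # 64 MB read blocks (illustrative)

    def reader(path: str, q: "queue.Queue[bytes | None]") -> None:
        """Fill the queue with blocks from disk; None marks end of file."""
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                q.put(chunk)                     # blocks once both buffers are full
        q.put(None)

    def process(chunk: bytes) -> int:
        return len(chunk)                        # stand-in for the real computation

    def run(path: str) -> int:
        q: "queue.Queue[bytes | None]" = queue.Queue(maxsize=2)  # two in-flight buffers
        threading.Thread(target=reader, args=(path, q), daemon=True).start()
        total = 0
        while (chunk := q.get()) is not None:
            total += process(chunk)              # compute overlaps the next read
        return total
    ```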
