Brittle systems

Years ago, we helped a customer set up a Lustre 1.4.x system. This was … well … fun. And not in a good way. Right before the 1.6 transition, we had all sorts of problems. We skipped 1.6, and now we have set up a Lustre 1.8.2 system, and have several more out on quote for various RFPs.
From our experience with the 1.8.2 system … I have to say, I have a sense that it is brittle. Yeah, you can call it “subtle and quick to anger”, or even praise some of the design features/elements.
It just has many moving parts. Some work well (MDS); some … well … not so well (OST problem notifications). The failure surface is huge, and figuring out where you are on that surface has effectively become our morning cat-and-mouse game.
Speaking with other vendors and with friends running these systems, I get the sense that people design and build Lustre systems defensively. That is, they know and accept that it will break, so they aim to limit the damage from the breakage. Control or limit the unknowns.
This could be something of a harsh assessment, but I just caught myself doing exactly this for a customer’s configuration. They requested Lustre, and we went back and designed defensively.

For the moment, I guess we have to do this. I think we will be accelerating our Ceph work even more now. Ceph is an object file store with built-in replication, failure tolerance, and many other things that are bolt-ons for Lustre and that may show up in the Lustre 2.0 or 3.0 roadmaps at some point. In Ceph, they are already on the roadmap, or even present as stubs in the installation.
I just don’t like doing more “belt+suspender+suspenders” type designs if we can avoid them. Replicate when needed.