Years ago, we helped a customer set up a Lustre 1.4.x system. This was … well … fun. And not in a good way. Right before the 1.6 transition, we had all sorts of problems. We skipped 1.6, have since set up a Lustre 1.8.2 system, and have several more on quote now for various RFPs.
From our experience with the 1.8.2 system … I have to say, I have a sense that it is brittle. Yeah, you can call it “subtle and quick to anger”, or even praise some of the design features/elements.
It just has many moving parts: some work well (MDS), some … well … not so well (OST problem notifications). The failure surface is huge, and figuring out where you are on that surface has effectively become our morning cat-and-mouse game.
Speaking with other vendors and with friends running these systems, I get the sense that people design and build Lustre systems defensively. That is, they know and accept that it will break, so they aim to limit the damage when it does. Control or limit the unknowns.
This could be something of a harsh assessment, but I just caught myself doing exactly this for a customer’s configuration. They requested Lustre, and we went back and designed defensively.
