What a difference a distribution makes for Lustre

Lustre 1.8.2 on SuSE is, IMO, broken. I am not sure if it is repairable. Most of my comments on the brittle nature of Lustre come from this.
After reloading with CentOS 5.4, we are rock-solid stable. It's scary.
I am not sure what the issue is, but I think all future Lustre deployments we do will be on CentOS 5.4.

8 thoughts on “What a difference a distribution makes for Lustre”

  1. I’ll forward your praise to the people nearby who did the actual work to provide that stable platform. 😉 I’m sure they’ll appreciate it.

  2. From memory at SC’10 the Lustre folks said that they were dropping support for SLES on the server side, which annoyed my friend using Debian with the SLES kernel (the CentOS/RHEL Lustre kernel is apparently too old to work with the Debian user space).. 🙁

  3. @Jeff: 🙂 I still have to replace a few drivers, but for Lustre, this is about as stable as I’ve seen it. You can let them know that I can crash that stable platform under fairly intense loads (which is why we usually change out the default kernel with ours … can’t do this easily with Lustre … as we run into all sorts of … well … spectacular … failure modes).
    @Chris: I missed SC’10? Darn … 🙂 Ok, I know what you mean, at SC’09 they said this. Doesn’t surprise me, as I’ve found building drivers, and other bits to be a challenge under SuSE. SuSE doesn’t seem to ship -dev/-devel rpms. You have to generate these from the .src.rpm using zypper. Which means any time you have a package dependency that requires a -dev/-devel package, you have to go through the whole *&*%$^#^%& process to build the package from src.rpm, create the -dev/-devel package, just to install this.
    To put it kindly, this was not a very well thought out system.
    The SuSE kernel is arguably more modern than the Redhat kernel … though I’ve found that you also can’t update drivers within it easily … or put another way, you can’t do this in a stable manner. You destabilize the kernel even more by changing an older (known unstable) driver for a newer (stable) driver.
    That latter issue was IMO the source of my “brittle” issues. And this experience has caused me to rethink the use of SuSE in cluster and storage contexts.

  4. Grin, getting a bit ahead of myself there I guess.. 😉
    My experiences with SuSE have only been with SLES9 on IBM Power and I think the main issue we had with it we would have had with RHEL/CentOS too – basically it’s a 64-bit kernel with a 32-bit userspace with some 64-bit extra packages. That’s fun – not!
    In general I’m just not as familiar with SLES as with RHEL, it does seem that with SLES you can make changes and then have a config tool come and overwrite them for you. They also managed to ship an upgrade to yaboot (LILO for PPC) which didn’t work on our Power boxes – fortunately I tested that out on a spare box before applying it and was able to pin things at the old (working) version for the few months whilst they got around to fixing it. This was over 5 years ago now to be fair to them..

  5. @Chris
    RHEL/Centos user space is mostly 64 bit. We can remove all but a few non-64 bit packages, and it functions fine.
    yum remove "*.i?86"
    usually works fine in most cases (default install does install both … still working out how to fix/limit that during initial installation … seems silly to have the extra traffic and space which you will then throw away)

  6. Hmm, are you sure ? At the time I was told by IBM that all PowerPC Linux distros (except Gentoo) used 32-bit userland because it was faster than 64-bit because of weird architectural reasons..
    AMD64 is (of course), sane though.. 😉

  7. @Chris
    I can’t speak to PPC, but x64 is relatively sane. x64 is faster in most cases than ia32 on the same processor.

  8. Yup, that’s right, on x86-64 it works in your favour (extra registers, etc), whereas on PPC64 I think the architecture works against you – here’s a quote from a Debian developer:

    Well, the later is not true. 32 bits code tend to run faster than 64 bits code on ppc64. Unlike amd64 where you win by having access to more registers, on ppc64, you just end up having to use more instructions to load a full constant in a register 😉
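
For the 32-bit package cleanup mentioned in comment 5, here is a minimal Python sketch of how a shell-style glob such as `*.i?86` picks out packages by their architecture suffix. The package names are hypothetical, and `fnmatch` only approximates how yum matches `name.arch` strings, so treat this as an illustration rather than a description of yum itself.

```python
from fnmatch import fnmatchcase

# Hypothetical package names in name.arch form (not from a real system).
packages = [
    "glibc.i686",
    "glibc.x86_64",
    "zlib.i386",
    "bash.x86_64",
    "kernel-headers.noarch",
]

# Assumed pattern: "i", any single character, then "86" as the arch suffix,
# which covers i386, i586 and i686 but not x86_64 or noarch.
PATTERN = "*.i?86"


def matches_32bit(pkgs, pattern=PATTERN):
    """Return the package names whose name.arch string matches the glob."""
    return [p for p in pkgs if fnmatchcase(p, pattern)]


selected = matches_32bit(packages)
print(selected)
```

Running a check like this against a real package list before calling `yum remove` is a cheap way to see what the glob would actually select.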
