A nice loading test

A customer presented a nice test to us. We thought we had a good loading program going, running the units at heavy load for extended lengths of time. And these are good loading programs.
But they weren’t as intensive as this customer’s. They run 8 bonnie++ jobs simultaneously on the system. So we ran it. And promptly crashed the unit.

Believe it or not, that was good. In the process we exposed a corner case where the later rev driver crashes with the updated firmware, while the previous driver release does not crash with the same firmware. We have run the test on lots of our other units without a problem. This one got added to our normal testing regime. Nothing quite like pushing user loads to 12-16 to see what your box can do, and how it holds up to very intensive loads.
We have also rolled back to the previous release of the drivers as indicated. We will test all the drivers with this (as well as our normal suite).
While we prefer real use cases, tests like these do tend to stress the systems very hard. And that makes them good. What doesn’t crash the unit makes it stronger 🙂
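
For anyone who wants to try something like the customer's test, here is a rough sketch of launching several bonnie++ jobs at once. It is not the customer's actual script, which we have not shown here. The helper names (`bonnie_command`, `run_concurrent`) and the one-directory-per-job layout are our own assumptions for illustration. The only bonnie++ options used are `-d` (directory to test in), `-u` (user to run as), and `-q` (quiet mode).

```python
import os
import subprocess


def bonnie_command(workdir, user):
    """Build one bonnie++ command line (hypothetical helper).

    -d: directory to run the test in
    -u: user to run as
    -q: quiet mode
    """
    return ["bonnie++", "-d", workdir, "-u", user, "-q"]


def run_concurrent(n_jobs, base_dir, make_command):
    """Start n_jobs processes at once, each in its own subdirectory of
    base_dir, then wait for all of them and return their exit codes.

    make_command is any callable that maps a working directory to an
    argument list, so the same launcher can drive other load generators.
    """
    procs = []
    for i in range(n_jobs):
        # Assumption: one scratch directory per job so the jobs do not
        # step on each other's files.
        workdir = os.path.join(base_dir, f"job{i}")
        os.makedirs(workdir, exist_ok=True)
        procs.append(subprocess.Popen(make_command(workdir)))
    # All jobs are already running at this point; wait() only collects
    # their exit codes.
    return [p.wait() for p in procs]
```

To mirror the 8-job test, something like `run_concurrent(8, "/path/to/scratch", lambda d: bonnie_command(d, "someuser"))` would do, where the scratch path and user name are placeholders. Point it at the storage you actually want to stress, and expect the box to be busy (or down) for a while.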

1 thought on “A nice loading test”

  1. I’ve been doing something similar in my testing recently. I have a loop that untars bonnie++, builds it, runs it, deletes it, and then back again. That loop gets run on a hundred nodes simultaneously, each in a different directory on the same Lustre filesystem, and it shakes things out quite a bit.
    Oh, and what *does* crash the system makes it even stronger still. 😉
