blasting through heavy loads …

Previously I told you about octobonnie: 8 simultaneous bonnies run locally to beat the heck out of our servers. If we are going to catch a machine-based problem, it will likely show up under that kind of withering load.
But while that is a heavy load, it is nothing like what we have going on now.

I am sitting here in the office monitoring one of our boxes being tested by a customer before they put it into production (oil and gas market), as they load it from their cluster.
14 simultaneous bonnies running over gigabit, oversubscribing the networks by more than 2x, across multiple mount points. Random load to the mounts … that is, each bonnie run picks its mount at random.
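The load generator itself is nothing exotic. Here is a minimal sketch of the idea in bash; the mount paths, file size, and user are my illustrative assumptions, not the actual rig:

```shell
#!/bin/bash
# Hypothetical load generator: fan out N simultaneous bonnie++ runs,
# each against a randomly chosen mount point. Paths and sizes are
# made up for illustration.
MOUNTS=(/mnt/vol0 /mnt/vol1 /mnt/vol2 /mnt/vol3)
NCLIENTS=14
SIZE=32g   # keep this larger than RAM so the runs hit disk, not page cache

for i in $(seq 1 "$NCLIENTS"); do
  m=${MOUNTS[RANDOM % ${#MOUNTS[@]}]}   # random mount for this worker
  echo "client $i -> $m"
  if command -v bonnie++ >/dev/null 2>&1; then
    bonnie++ -d "$m" -s "$SIZE" -u nobody > "bonnie.$i.log" 2>&1 &
  fi
done
wait   # returns once every bonnie has finished
```

The point of the random mount selection is to avoid any accidental affinity between a client and a volume, so the load spreads unevenly and unpredictably, which is closer to what real users do.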
We're doing this because the channel bonding driver on Linux doesn't withstand heavy load in any mode other than mode 0. The unit was fine until they set up channel bonding with mode 6, which caused a soft-IRQ lockup. It's the same one I have seen since 2.6.9 and before (this is our kernel). The problem is likely still in 2.6.27 and 2.6.28, though I haven't tested those specifically yet. Will do at some point.
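For reference, the sort of setup that triggers it looks something like this. A sketch only, with made-up interface names and address; it needs root, and mode=6 is balance-alb, mode=0 is balance-rr:

```shell
# Hypothetical recreation of the customer's bonding config.
# mode=6 (balance-alb) is what locked up; mode=0 (balance-rr)
# is the only mode that has held up for us under heavy load.
modprobe bonding mode=6 miimon=100
ip link set bond0 up
ifenslave bond0 eth0 eth1
ip addr add 192.168.1.10/24 dev bond0
```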
Of course, when a partial kernel crash takes out an interrupt handler, other things tied to that pin could be compromised (the way PCI works is that it multiplexes its interrupt pins … interrupts can be assigned to specific ports, but this is done in a "soft" manner). Yeah … I am sure some other OSes can survive interrupt handler crashes. Microkernels, probably. Just restart the user-space service.
But the IRQ went away. Which eventually took one of the RAID cards down.
Because when a RAID card gets confused, your file systems go south, awful fast.
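You can see the sharing for yourself on a Linux box: in /proc/interrupts, a line serving more than one device lists the driver names comma-separated on a single row. A quick way to flag the shared ones, assuming nothing else in the table contains a comma:

```shell
# Shared IRQ lines list multiple driver names, comma-separated,
# on one row of /proc/interrupts; print just those rows.
awk '/,/' /proc/interrupts
```

If your NIC and your RAID card show up on the same row, a lockup in one handler is a problem for both.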
So this is more or less an extreme loading test. After setting up the clients and mounts, we fired up the load generator.
We’ve done this sort of sustained test before. But we are running this until next week. We will see how well it handles it.
For the record, the 8-processor, 16 GB RAM machine currently has a load average of about 30. And it is still quite responsive.

4 thoughts on “blasting through heavy loads …”

  1. I’m seeing this same problem everywhere, as well. Really annoying in an “Enterprise-Grade OS”. I see discussions on lkml (from 2007: possible fix code, but no notes on actual inclusion), in the Ubuntu bugzilla and the Debian bugzilla, and I added one for CentOS.
    Debian’s Maximillian got a little grumpy when it was reported to him as critical … but no actual fix was posted. Someone reported that 2.6.26 was “working”, and magically the bug was closed.

  2. Hmmm …. Ok. We are at for a group of customers; will look to test this. Just finished building, so we have to test this as well.
    It looks like some sort of corner case that tickles a “bug”. Likely more of a hardware design issue (200X and we are *still* sharing interrupts??? WTH?)

  3. Just thought I’d add that I am having the same problem with a Dell PowerEdge 2950 and a 2.6.29 kernel. The kernel crashes, freezing the machine, with the only information pointing to bnx2_poll_work in the bnx2 module. I am using “balance-alb” (mode 6) as the bonding scheme in this case.
