Rereading posts from 6 years ago …

NFS sucked then as well.
We’ve got a customer who occasionally pushes their hardware a wee bit too hard. And stuff comes crashing down. Basically it looks like a kernel bug, one I’ve not been able to ID for a number of reasons, and I can’t find a mechanism to reliably tickle it.
This is the definition of a Heisenbug.
Basically the problem is this. They use NFS, extensively. NFS is great at low IO rates. But ramp up a cluster of 20-ish machines, all banging away on it, and you may visit some of the nether reaches of the operational parameters. Bump the state from a well defined orbit of deterministic goodness, into a hard locked nightmare where we have to hit reset, and …
… there’s no log, anywhere.
Ok, when this happens, I think it’s ring 0 code or it’s a driver. So the easy thing to do is to (drastically) alter the drivers and the ring 0 code. Use a new version of the kernel, update the drivers.
Yet the crash keeps happening.
No AERs. With PCIe reporting on maximum, no data returns. Debug and other stuff on maximum levels … nothing. No machine check exceptions. No events on RAID cards. Nothing in the SEL. Nothing in any of the various hardware and software logs.
Occasional warnings on hung tasks (due to the nature of the kernel and some of the additional warnings we’ve built in to their kernels). Occasional stack traces from processes. Nothing that gives me a smoking gun … I’ve eliminated as many of the items as possible. We swapped out the MB and processors last year. Changed the 10GbE cards out. Updated/swapped RAIDs.
Nothing. It’s still an issue.
Somehow, somewhere, we are hitting a bug. I am thinking I might need to run a debugging kernel on their box, with a serial port gdb session going. I have not done that in a while, so I’ve got some mad skillz to refresh.
And it’s random … mostly. It looks, somehow, to be correlated with context switches, and interrupt rates, and possibly with some level of network processing. So I am trying a few things to throttle this. If they work (remember it’s a Heisenbug, so figuring out if it worked is an exercise in determining the exact momentum and position of the bug … it’s kinda hard, and there are factors of h-bar running around everywhere … I now work in units where h-bar == 1) then a cgroups container might make sense.
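To put numbers on the context switch and interrupt rates, something like the following could work. It is a minimal sketch, not what we actually run on the customer box: it assumes a Linux /proc/stat with the usual `ctxt` and `intr` lines, and the function names (`parse_proc_stat`, `rates`, `monitor`) are made up for illustration.

```python
import time

def parse_proc_stat(text):
    """Return (context_switches, total_interrupts) from /proc/stat text.

    Both are cumulative counters since boot. On the 'intr' line the first
    number is the total across all interrupt sources.
    """
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("ctxt", "intr"):
            counters[fields[0]] = int(fields[1])
    return counters["ctxt"], counters["intr"]

def rates(before, after, seconds):
    """Turn two cumulative samples into per-second rates."""
    return tuple((a - b) / seconds for b, a in zip(before, after))

def read_sample(path="/proc/stat"):
    with open(path) as f:
        return parse_proc_stat(f.read())

def monitor(count, interval=1.0):
    """Print `count` rate samples, one every `interval` seconds."""
    prev = read_sample()
    for _ in range(count):
        time.sleep(interval)
        cur = read_sample()
        ctxt_rate, intr_rate = rates(prev, cur, interval)
        # Flush every line: after a hard lock, whatever is still buffered is gone.
        print(f"{time.time():.0f} ctxt/s={ctxt_rate:.0f} intr/s={intr_rate:.0f}",
              flush=True)
        prev = cur
```

Calling `monitor(3600)` gives roughly an hour of one-second samples. Since the box leaves no log behind when it locks up, the output would have to go somewhere off the machine (a serial console, or an ssh session from another host) for the last few lines before the crash to survive.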
Not sure, have to think harder about this. Customer isn’t badly impacted (not losing data), but they are losing time, and that is a problem in their area. So we are trying to come up with a mechanism to fix it or control it. If we understand which shiny knob does what, then we have a fighting chance to do something.
But remember, Heisenbugs are powerful, quick to anger, and often require interferometric techniques to see (e.g. you prod one thing while looking somewhere else and taking differences between prodded and un-prodded).
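The "taking differences" part can be sketched in a few lines. This is only an illustration under assumptions: it supposes you have collected some per-interval measurement (say, interrupts per second) in two lists, one with the knob left alone and one with it prodded, and the function name `prodded_difference` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def prodded_difference(unprodded, prodded):
    """Return (difference of means, rough standard error of that difference).

    Each argument is a list of at least two samples of the same measurement.
    A difference that is small compared to the standard error is probably
    noise, not an effect of the prod.
    """
    diff = mean(prodded) - mean(unprodded)
    stderr = sqrt(variance(prodded) / len(prodded)
                  + variance(unprodded) / len(unprodded))
    return diff, stderr
```

If the difference is several standard errors away from zero, that knob probably does something. If not, prod a different one.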