the mystery of the week

Customer has had a machine for a while. Generally stable. Followed our advice on doing a reboot recently.

Unit started crashing Monday. Then again today. It is struggling to stay up and stable.

I asked if anything had changed, and haven’t gotten anything conclusive back … mostly “we don’t think so”.

About the crashes:

Nothing in the logs. Not a thing. None of the hardware subsystems with logging enabled (RAID, motherboard, PCIe, IPMI, …) report an error. Just a hard crash.

I did see something I thought was the error … basically a message from the kernel that the cfq scheduler had an issue. Which made no sense, as the system was running a different IO scheduler. Since then I’ve switched it to noop to take scheduler bugs out of consideration.
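Since the cfq message disagreed with what I believed was configured, it is worth confirming what each block device is actually running. A minimal sketch, assuming the usual Linux sysfs layout where `/sys/block/<dev>/queue/scheduler` lists the available schedulers with the active one in square brackets; the function names here are mine, purely for illustration:

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) scheduler name, or None if nothing is bracketed."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def report_schedulers(sys_block="/sys/block"):
    """Map each block device name to its active IO scheduler, read from sysfs."""
    result = {}
    for sched_file in Path(sys_block).glob("*/queue/scheduler"):
        try:
            # .../<dev>/queue/scheduler -> device name is two levels up
            device = sched_file.parent.parent.name
            result[device] = active_scheduler(sched_file.read_text())
        except OSError:
            # Device vanished or file unreadable; skip it rather than fail.
            continue
    return result


if __name__ == "__main__":
    for device, scheduler in sorted(report_schedulers().items()):
        print(device, scheduler)
```

If any device still reports cfq after the switch to noop, the setting did not stick for that device, and the kernel message stops being quite so mysterious.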

Crashes occur with multiple kernels. 2.6.32, 3.2.x.

Memory errors would manifest as ECC traps, and we’d see them in the BIOS messages as well as in the mcelog output. I am thinking of sending RAM over “just in case”. But I don’t believe this is it, and I hate wasting resources like that.

Thermal excursions would also show up in the BIOS logs as over-temperature errors, and the units would shut down. They don’t; they hard lock.

External power excursions are possible. Customers really hate hearing their power sucks. Quite a few have very crappy power. Doesn’t matter so much for a compute node. Matters tremendously for a high powered storage node.

Internal power excursions are possible. We’d see them in the IPMI logs (we don’t see them).

What else …

… PCIe cards locking the bus. Yeah, buggy cards are always possible. And we’d see a PCIe error (PCI AER) report … our kernels have that enabled, so even if the driver crashes, we at least get a report of why it crashed. And we aren’t getting that here.

… SNMP, which bangs on lm_sensors and other things. Yeah, this has serious possibilities. lm_sensors is a low-level interface to motherboard health sensors. Get these wrong, and crashing a machine hard is trivial.

… a monitoring agent which drives SNMP. See above.

… some sort of wildly escalating user load that drives the machine into apoplectic seizure.

And finally, what I still consider the most likely scenario:

… changes to the system which destabilized it. These include configs, drivers, … though they very likely should have left a log signature of some sort.

What is maddening about this is that there is no log trace or signature. None of the hardware logging subsystems are reporting problems. When we get ECC errors, we see them. When we get PCIe errors, we see those too. But this machine is simply not reporting anything.
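To make "nothing in the logs" a bit less of an eyeball exercise, a small scanner can sweep saved kernel log text for the usual hardware signatures. This is a rough sketch only: the keyword patterns are my assumption about what machine check, EDAC, PCIe AER and thermal messages tend to contain, and the exact wording varies by kernel version and driver, so an empty result is a hint, not proof.

```python
import re

# Assumed signatures; adjust to whatever your kernels actually emit.
HARDWARE_PATTERNS = {
    "mce": re.compile(r"machine check|\[Hardware Error\]", re.IGNORECASE),
    "edac": re.compile(r"\bEDAC\b"),
    "aer": re.compile(r"\bAER\b"),
    "thermal": re.compile(r"thermal|temperature above threshold", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Group log lines by the hardware signature they match.

    Returns a dict keyed by signature name; each value is the list of
    matching lines (trailing newline stripped). A line may land in more
    than one group.
    """
    hits = {name: [] for name in HARDWARE_PATTERNS}
    for line in lines:
        for name, pattern in HARDWARE_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def scan_log_file(path):
    """Scan one log file, tolerating undecodable bytes from a crash-truncated log."""
    with open(path, "r", errors="replace") as handle:
        return scan_log_lines(handle)
```

Running `scan_log_file` over each rotated log from before a crash gives one place to look for every class of error listed above.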

Of course, the reporting mechanism could be, itself, broken. Or the logging mechanism. That isn’t lost on me.

So we’ve got a machine … that crashes … and very little … effectively nothing to correlate it against. Well, apart from SNMP and the monitoring agent. The crashes do seem to come 1-2 hours after those start … could be a complete coincidence.
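One way to test whether that 1-2 hour pattern is real is to line up the agent start times against the crash times and measure the gap for each crash. A minimal sketch, assuming you can pull both sets of timestamps from somewhere (init logs, IPMI power events, the customer's own notes); the function names and the default window are mine, chosen to match the pattern described above:

```python
from datetime import timedelta


def delays_before_crash(agent_starts, crashes):
    """For each crash, time elapsed since the most recent agent start before it.

    Both arguments are iterables of datetime objects. Crashes with no
    preceding agent start are skipped.
    """
    starts = sorted(agent_starts)
    delays = []
    for crash in sorted(crashes):
        earlier = [start for start in starts if start <= crash]
        if earlier:
            delays.append(crash - earlier[-1])
    return delays


def fraction_in_window(delays, low=timedelta(hours=1), high=timedelta(hours=2)):
    """Fraction of delays falling inside [low, high]; 0.0 if there are none."""
    if not delays:
        return 0.0
    inside = sum(1 for delay in delays if low <= delay <= high)
    return inside / len(delays)
```

With only a handful of crashes, a high fraction is still weak evidence. The cheaper experiment is to leave the agent off for a day and see whether the machine stays up.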

Very annoying. And if you are the customer on the receiving end of these crashes, even more so.
