Another fun bit of debugging

Ok … so here you are doing a code build.

Your environment is all set. You have ample space. Lots of CPU, lots of RAM. All packages are up to date.

You start your make.

You have another window open with dstat running, just to kinda, sorta watch the system, while you are doing other things.

And while you are working, you realize dstat has stopped scrolling.

Strange. Why would that be?

Ping the machine.

Not responding.

Ok … hmmm … it crashed? Look in the BMC SEL (our kernel dumps panic messages there). Nothing.

Look at the system condition … overheating? Heck no, it’s actually running cool.


Ok. Maybe something spurious. Connect up the SOL console, watch it finish booting.

Iterate. Log in 2 windows. Start dstat in one, build in another.

and …

bang …

Hmmm … nothing on the console …

Ok, hook up icl (ipmi console logger) to it. Capture the data. Let’s see what is really happening.

Rinse, repeat.


Look in the log (ipmi console log that is, it will have everything).

Nope, completely blank.
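With the console log blank, about the only evidence left is the moment dstat stopped scrolling. Here is a minimal sketch of pinning that moment to disk. It is not something from the session above. It assumes a Linux box where `os.getloadavg()` works, and the function names and the log path are made up for illustration. Each sample is flushed and fsync'd, so the last complete line in the file should be close to the time of the hang.

```python
import os
import time


def append_heartbeat(path, load=None, now=None):
    """Append one 'timestamp load1 load5 load15' line and force it to disk."""
    now = time.time() if now is None else now
    load = os.getloadavg() if load is None else load
    line = "{:.3f} {:.2f} {:.2f} {:.2f}\n".format(now, *load)
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # ask the kernel to push the line to stable storage
    return line


def last_heartbeat(path):
    """Return (timestamp, (load1, load5, load15)) from the last complete line, or None."""
    with open(path) as fh:
        lines = [ln for ln in fh.read().split("\n") if ln.strip()]
    # Walk backwards so a line cut short by the crash is skipped.
    for ln in reversed(lines):
        parts = ln.split()
        if len(parts) != 4:
            continue
        try:
            ts, l1, l5, l15 = (float(p) for p in parts)
        except ValueError:
            continue
        return ts, (l1, l5, l15)
    return None
```

Run it in a third window with something like `while True: append_heartbeat("/var/tmp/heartbeat.log"); time.sleep(1)` (the path is hypothetical), then call `last_heartbeat()` on the file after the reboot. It will not say why the box died, only roughly when, and what the load looked like just before.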



Only happens under load? Could I have a blown CPU? I did see an EDAC memory error crop up once … ok, let’s try something stupid. Something that should not work.

Drop the memory frequency to lowest speed.


Turn off SMT (aka HT).


Ok, let’s go full moron, and assume hardware is the culprit, and is somehow … somehow not triggering the MCE or EDAC subsystems.

Let me remove 1/2 the memory.

Why not. Can’t hurt, easy to see if it works, right?

Start the build.


Do two intensive builds at once.


Do 3.



This is new memory, older board, older CPUs. Never given me a problem before.

Crashed with no message whatsoever.

I am going to assume something like a loading issue with the CPU. I can run this at 1/2 the RAM, though I’ll probably put 1/2 of what I took out back in to check, and see if it’s bad RAM or a loading problem. Bad RAM should have triggered EDAC/MCE. Loading problem … maybe not.
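One way to check the "bad RAM should have triggered EDAC" half of that is to snapshot the EDAC error counters before and after a run. This is a rough sketch, not what I ran above. It assumes the usual Linux sysfs layout, `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`, with the EDAC driver for the board loaded. The function names are made up for illustration.

```python
import os

# Assumption: standard Linux EDAC sysfs location; empty or missing if no driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory under root."""
    counts = {}
    if not os.path.isdir(root):
        return counts
    for name in sorted(os.listdir(root)):
        mc_dir = os.path.join(root, name)
        if not (name.startswith("mc") and os.path.isdir(mc_dir)):
            continue
        entry = {}
        # ce = correctable errors, ue = uncorrectable errors
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, fname)) as fh:
                    entry[key] = int(fh.read().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[name] = entry
    return counts


def edac_delta(before, after):
    """Per-controller increase in error counts between two snapshots."""
    delta = {}
    for mc, now in after.items():
        prev = before.get(mc, {})
        delta[mc] = {
            key: (now.get(key) or 0) - (prev.get(key) or 0)
            for key in ("ce", "ue")
        }
    return delta
```

Take `before = read_edac_counts()` ahead of the build and `after = read_edac_counts()` once it finishes (or dies), then look at `edac_delta(before, after)`. The counters reset on reboot, so after a hard crash the "before" snapshot has to have been saved to disk, and a clean delta only says EDAC saw nothing, not that the hardware is fine.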
