memtest delenda est

Ok … maybe not so much destroyed. More like “ignored as a reasonable test of anything but DIMM visibility and very basic functionality”.
Memtest has several variants running around, all of which purport to hammer on, and detect, bad RAM. The only problem is that they don’t really work well, apart from trivial cases. That is, if you have an iffy DIMM, you’d need days, weeks, or months of testing with this code, rather than putting the DIMM in a box and running a hard-pounding code on it.
We noted this some years ago, and largely stopped using it.
Recently we had someone yank a DIMM and try it with memtest. It passed memtest.
Well, the DIMM is probably bad, and we are getting them to retest the way we need them to retest. Differential diagnosis is terribly important for systems: it helps you isolate problems rapidly, by zeroing in on a failed subsystem and eliminating alternatives. Without it, you are left asking whether or not your test which gave a “pass” is in fact valid, correct, or even relevant.
When you test, you need to reduce the impact of additional dimensions and additional variables. Memtest doesn’t quite do this. Careful differential diagnosis does.
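To make the swap-and-retest idea concrete, here is a minimal sketch of the bookkeeping, not a description of our actual process. It assumes you record one (DIMM, slot, passed) tuple per stress run after each swap; the function name `localize_fault` and the labels are hypothetical.

```python
from collections import defaultdict


def localize_fault(runs):
    """Guess whether failures follow a DIMM or a slot.

    runs: iterable of (dimm_id, slot_id, passed) tuples, one per stress run,
    ideally recorded after swapping DIMMs between slots.
    Returns ("dimm", id), ("slot", id), or ("inconclusive", None).
    """
    dimm_results = defaultdict(list)
    slot_results = defaultdict(list)
    for dimm, slot, passed in runs:
        dimm_results[dimm].append(passed)
        slot_results[slot].append(passed)

    def always_fails(results):
        # Suspect only components seen in at least two runs that never passed.
        return sorted(k for k, v in results.items() if len(v) >= 2 and not any(v))

    bad_dimms = always_fails(dimm_results)
    bad_slots = always_fails(slot_results)
    if len(bad_dimms) == 1 and not bad_slots:
        return ("dimm", bad_dimms[0])
    if len(bad_slots) == 1 and not bad_dimms:
        return ("slot", bad_slots[0])
    return ("inconclusive", None)
```

A single run, pass or fail, comes back inconclusive here, which is the point: one “pass” on its own eliminates nothing.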

2 thoughts on “memtest delenda est”

  1. hmm, apart from mce *cough* and edac, what else do you use for identifying failing DIMMs?
    Regarding memtest (+/whatever), I would need to look through the logs to find the last time memtest actually DID find an error that a random application did not find within an hour of running.

    • We turn edac on full bore, look at MCEs, etc. for detection. Even these don’t work all the time. I especially like the no-MCE “hey, let’s lock the machine hard” memory errors that are a bear to diagnose.
      But apart from that, we run a set of very hard-pounding HPC apps that beat up the machine something fierce. They touch as much RAM as possible, and really exercise it. Then we see errors that memtest and its ilk never report, and they appear quickly.
      It’s not that memory testing is a dark art or anything like that. It’s that the meaning of “stressful patterns” has changed significantly over the past 20+ years. Back when the RAM was placed on expansion cards in the ISA bus, stressful memory patterns had meaning (specific traces/chips/…). These days, less so.
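
For the EDAC side of the reply above, a minimal sketch of snapshotting the counters before and after a stress run might look like the following. It assumes a Linux box with an EDAC driver loaded, which exposes per-controller `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. The helper names `read_edac_counts` and `new_errors` are made up for illustration.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; present only when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} error counts."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not Linux
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers we cannot read
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Return controllers whose counters rose between two snapshots."""
    grown = {}
    for name, (ce, ue) in after.items():
        ce0, ue0 = before.get(name, (0, 0))
        if ce > ce0 or ue > ue0:
            grown[name] = (ce - ce0, ue - ue0)
    return grown


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Take one snapshot before the stress run and one after, then pass both to `new_errors`. An empty result is not proof of good RAM, since the hard-lock failures mentioned above never reach these counters.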
