Fixing pausing Nehalem/Westmere units

Some Nehalem and Westmere units have … er … interesting unintended features … yeah, that’s the politically correct way to say it. We like Intel and their products (and we’ve liked AMD and their products in the past). But we gotta call this one.

As you watch dstat output, you see these occasional … hangs … for a few seconds. As if someone is monkeying with the clock.

And that is, to a degree, what appears to be happening. The TSC (time stamp counter) isn’t stable as a clock source, so you need to switch to a different clock source to get steady timekeeping. You generally have 3 options: tsc, hpet, and acpi_pm.

So we’ve found that a simple

echo "acpi_pm" > /sys/devices/system/clocksource/clocksource0/current_clocksource

does a pretty good job of fixing some of the weird latency. But under heavy loads, we see more latency.
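If you want to check what the kernel offers before switching, a minimal sketch along these lines should do it. The current_clocksource and available_clocksource files are the standard sysfs entries; the CS_DIR variable and the two function names are just ours for illustration, and writing the new value needs root.

```shell
#!/bin/sh
# CS_DIR is a hypothetical override (handy for testing); it defaults to the
# standard sysfs directory for clocksource0.
CS_DIR="${CS_DIR:-/sys/devices/system/clocksource/clocksource0}"

# Print the clocksource in use and the ones the kernel says it can use.
show_clocksource() {
    echo "current:   $(cat "$CS_DIR/current_clocksource")"
    echo "available: $(cat "$CS_DIR/available_clocksource")"
}

# Switch to the named clocksource, but only if the kernel lists it.
# Returns non-zero (and changes nothing) otherwise.
set_clocksource() {
    if grep -qw "$1" "$CS_DIR/available_clocksource"; then
        echo "$1" > "$CS_DIR/current_clocksource"
    else
        echo "clocksource '$1' not available" >&2
        return 1
    fi
}
```

Source the file, then run show_clocksource to see where you stand and set_clocksource acpi_pm (as root) to switch. The change does not survive a reboot.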

Honestly, I think the problem is in silicon. Newer revisions of chipsets have exhibited it more clearly than the older sets. Very annoying.

Unfortunately, as indicated, it shows up under load, such as when work has to get done during an interrupt service routine that is blocked for some reason while interrupts are turned off. This shouldn’t happen … ISRs should do as little work as possible, and never spin/sleep. Especially never sleep/spin with interrupts, like the clock timer’s, turned off.

This is what’s happening.

So, how to fix it?
