Caught a not-so-cool bug in a hypervisor running on a production machine

Not naming names. Its a good product. It just gives up the ghost when you request 1.5x available memory, and the OS actually tries … tries … to fulfill the request.

I thought I had set the maximum oversubscription amount to 85% of swap + physical. Yet, along came a nice spike and WHAMMO. Down the machine went.

That this was a high visibility production machine, with hard uptime requirements … not so good.

We were able to transition it back to bare metal for now, and for some reason this doesn’t cause grief when we see these huge memory allocation spikes.

Basically it was a user space code interacting very badly with the hypervisor/OS.

Caused me a few weeks of grief, and it delayed other work.

Viewed 24458 times by 1715 viewers