When infinite resources aren't, and why software assumes they are infinite

We’ve got customers with very large machines, and software that sees all those resources and goes “gimme!!!!”.
So people run.
And then more people use it. And more runs.
Until the resources are exhausted. And hilarity (of the bad kind) ensues.
These are fire drills. I get an open ticket claiming that “there must be something wrong with the hardware”, while all the messages in the console logs being pulled in from ICL are saying “zOMG I am out of ram …. allocations of all sorts failing …. must exterminate processes!”.
Sorry, that last bit is because my daughter had me watching Doctor Who with her recently, and that nasty “exterminate” keeps running through my head. Seriously, we need to instrument the OOM killer in the kernel to send that to the audio port when it shoots something.
Ok, you might say, why not set up swap? I mean that’s what it is for … right?
Swap is a band-aid, and a BAAAD thing to do to a good machine that has a large amount of RAM to begin with.
Machines aren’t infinitely elastic; they don’t have infinite resources. Many application codes seem to behave as if they are the only thing running, or the only instance running on the machine. Take a large enough machine, with many users, and this goes from slightly wrong to complete hogwash.
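To make the “only thing running” assumption concrete, here is a minimal sketch of the alternative: size a working set from what the kernel reports as available right now, not from total RAM. It assumes a Linux machine where `/proc/meminfo` exposes a `MemAvailable` field. The helper names `parse_meminfo` and `memory_budget` and the 25% default share are made up for illustration; they are not taken from any particular application.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Fields reported in kB are converted to bytes; unitless fields
    (for example HugePages_Total) are kept as plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def memory_budget(meminfo, share=0.25):
    """Return a byte budget as a share of currently available memory.

    The share is an arbitrary illustrative default: the point is that
    the budget tracks MemAvailable, not MemTotal.
    """
    available = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return int(available * share)
```

On a Linux box, passing the contents of `/proc/meminfo` to `parse_meminfo` and the result to `memory_budget` gives a number that shrinks as other users pile on. That is exactly the feedback a “gimme” application never looks at.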
So I am looking to use a variety of technological measures to impose discipline upon the applications themselves. Hopefully without impacting performance.
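As one small illustration of what such a measure can look like (a sketch, not necessarily what gets deployed here), the snippet below launches a child process with a hard cap on its address space using the standard `RLIMIT_AS` resource limit. `run_with_memory_cap` is a hypothetical helper name, and the sketch assumes Linux, where this limit is enforced. Containers and cgroups do the same kind of thing with more control, and the cap itself costs nothing at runtime until the process hits it.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(argv, max_bytes):
    """Run argv as a child process whose address space is capped at max_bytes."""

    def apply_cap():
        # Runs in the child between fork() and exec().
        # Never try to raise the limit above an existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        cap = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

    return subprocess.run(
        argv, preexec_fn=apply_cap, capture_output=True, text=True
    )


if __name__ == "__main__":
    # A deliberately greedy child: asks for 16 GiB under a 2 GiB cap.
    greedy = [sys.executable, "-c", "x = bytearray(16 * 1024**3)"]
    result = run_with_memory_cap(greedy, 2 * 1024**3)
    print("exit code:", result.returncode)
```

With the cap in place the greedy allocation fails inside the offending process, so it never gets far enough to drag the whole machine into the OOM killer.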
A job queuing system with a strong interactive component probably makes a great deal of sense right now, but I think I need to talk with the team using this … as many of them might not like that concept (it’s interactive, right? So why do we need to submit jobs?). This is why I am looking at whether I can contain the problem with containers, or whether I need to go to full-on VMs. For computationally heavy jobs, the VMs might be better … a simpler failure domain. For more cooperative/smaller jobs, the containers might be better.
I know the solutions that have existed for decades in HPC circles. I’ve used most of them, and configured/deployed/supported most of them over the last 20+ years.
What amuses me is the cyclical nature of these sorts of problems. Same problem, different type of domain. Not pure HPC any more, but big data analytics.