This isn’t what you might think from the title. It’s an observation. I hope I don’t misstate what I intend to say, so feel free to chime in if you don’t agree with the wording.
When you have a situation where a customer has a set of vendors, and a problem that needs resolution, the customer will gravitate towards assigning blame for the problem to the most competent of the vendors, the most proactive of the vendors, in the hope that it will be resolved, regardless of whether or not that vendor’s gear/stack is in any way involved.
Ok, one might say “Hah! Wishful thinking.” or “you must say this to keep yourself from going crazy over all the blame for problems you get.” or similar.
We get problem reports, and we take ownership of them. Doesn’t matter the cause, we will get to the bottom of them, figure out a workaround or a solution, and recommend or implement it. We don’t care if it goes outside of our box/stack; an issue outside can still affect us.
Sometimes the problems are self-inflicted (and we’ve seen some doozies). Sometimes they are transient. As often as not they are in hardware we cannot control or even look at.
But we do get requests, fairly often, to look at problems, even from customers who have none of our gear. Which is why I believe it’s harder to assign blame for failures to us in these cases … it would be a stretch.
In the last several weeks, we’ve seen a number of “we didn’t get the hardware from you, but can you help us solve this problem” type engagements. In some cases these are customer home-brew systems (rarely a good idea, but ok, they happen with significant frequency). In others, they were supplied by an IT vendor with no real understanding of what an HPC system is, or how to design, test, and debug one.
Some of the self-inflicted problems are due to software stack policies that prevent installation of updates or patches that solve problems.
All in all it gets easier for the user to pin the blame on one thing, supplied by one vendor, if that vendor will relentlessly pursue a solution.
This said, some of the things do deserve their bad reputations … early Lustre/Gluster/… were something of a challenge to stabilize.
So I find it interesting, as we approach 10 years in business, after a previous 6 years at SGI and 1.5 at MSC, that the pattern hasn’t changed that much. Find the most competent of your suppliers and don’t merely pick their brains, but actively get them to take ownership of the problems.
There is an easier way.
We do this for our paying customers.