This is something of a hard post to write, for a number of reasons, not the least of which is that the topic comes as something of a surprise to me.
I am just going to state it, and then discuss it.
The vast majority of people (and companies) out there who think they know something of hardware/software/system level diagnostics and problem identification (from newbie to “veteran”) are either full of it, or really clueless. Or a mixture of these.
There are precious few people who really, REALLY, grok problem analysis and diagnostics. They are at a limited number of places. Happily, we have a nice concentration of people who know this process and how to leverage it.
Over the last year, we’ve had people yell at us over (often self-induced) problems, and do RMAs of expensive and time-sensitive components for completely trivial issues that could have been fixed within minutes with proper support in place. We’ve watched profound failures being repeated again and again by people who I honestly thought would know better. We’ve seen “well known” people insist that particular problems they observed couldn’t possibly be what we indicated, as they know better (yet in the end it was exactly as we had indicated).
I am dumbfounded by this.
Note: this isn’t bragging. This isn’t “we are always right.” We follow a particular set of problem space reduction techniques to find out what the problem isn’t, and then invoke a very Sherlock Holmesian “if you eliminate the impossible, whatever remains, however improbable, must be the truth.” And we make “mistakes” in the sense that we try to provide rough guesses that we refine as we get more information.
That is, we follow a fairly intensive process to isolate the impossible (does the problem follow a change or remain in place, etc.) from the improbable. This process relies upon honest and accurate observation.
The interesting thing about this process is that it is fairly universal … we can diagnose problems at most any level in a stack with it. Isolate what is known to work from what doesn’t work, by honestly figuring out what is known to work. It works with other vendors’ kit as well. As long as the software/hardware monitoring doesn’t lie (and yes, some do lie, often badly), we can work this process.
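To make the “does the problem follow a change or remain in place” test concrete, here is a minimal sketch in Python. It is an illustration only, not our actual tooling: the function name `isolate_faults`, the slot/part dictionaries, and the `works` check are all hypothetical. It also assumes each swap can be tested honestly and independently, which is exactly the “monitoring doesn’t lie” condition above.

```python
from typing import Callable, Dict, List

System = Dict[str, str]  # hypothetical model: slot name -> installed part


def isolate_faults(
    suspect: System,
    known_good: System,
    works: Callable[[System], bool],
) -> List[str]:
    """Return the slots whose part carries the problem into a good system.

    One part at a time is moved from the suspect system into a copy of the
    known-good system. If the good system then fails, the problem followed
    the part. If it still works, that part is eliminated as a cause.
    """
    # Honest baseline first: the "known good" system must really be good.
    if not works(known_good):
        raise ValueError("baseline system is not actually known to work")

    followed = []
    for slot, part in suspect.items():
        trial = dict(known_good)  # copy, so every trial changes one thing only
        trial[slot] = part
        if not works(trial):
            followed.append(slot)
    return followed
```

For example, with a made-up `works` check that fails whenever a particular DIMM is installed, moving parts over one at a time flags only the `dimm` slot, and everything else is ruled out. If nothing is flagged, the parts themselves are cleared and the problem is staying in place, so the next layer to examine is whatever was not swapped (chassis, cabling, firmware, configuration).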
The other aspect of this is that this is also how we build/ship systems. We build them, test them, and verify them before shipping. We know they are working with fairly high accuracy when they leave our dock. So if something isn’t working on arrival at a customer site, we can figure it out and solve it pretty quickly.
What blows me away is how few people can do this sort of analysis. Or maybe it’s that they are not willing to do this sort of analysis.