We do everything we can to stop failing subsystems from ever entering our customers' hands. We beat on our systems, usually with loads far in excess of what our customers will do. No, not using memtest. We run real codes. And we catch lots of problems.
What surprises me, really gets to me, is that some motherboard makers (who shall remain nameless) ship product to their customers (us) for integration into our products, or as subsystems into products we buy from others, and this product does not work. Oh, well, it might work in a narrow regime of less than 4 GB of RAM, and one processor, and no PCI cards. But try to configure and push this unit to where our customers will? Fuhgeddaboutit.
This was so bad with one of our suppliers that we pushed our testing regime back into their site. What a difference that made. They caught problems before they put the motherboards into systems that they shipped to us. Which saved us some testing time and lots of headache. But they now have the headache, and from what they tell me, they are amazed at their observed failure rate.
I would much rather it fail before it gets to my customers than afterwards. With the current crop of motherboards, I am somewhat at a loss to understand what the really low end “rack-n-stack” vendors do. I guess they must not test their stuff the way that we do. From some of the horror stories I hear from customers, I guess not.
On a build we are working on now, one motherboard is happy with 2 slots of 2 GB DIMM each, on 2 processors. This is a common config. Now fully populate each node with the same 2 GB DIMMs on the motherboard and … it goes from 8 GB to 6 GB of RAM. Huh? I won’t talk about the PCI-e video not working; I am just happy we were able to find a simple PCI based video card.
Needless to say, I am not all that impressed. Sure, these are newer motherboards. Doesn’t matter. If you don’t test what you build, you might just be shipping lots of garbage. Or worse, if you ship things that only half-work, then only some fraction of your customers will catch that you are cheating.
A ways back, a cluster “vendor”, one of the rack-em-stack-em-ship-em variety, shipped a customer of ours a cluster. They said it was fully tested. To this day I am not sure how it was possible, as the PCI card cage carrying one of the Ethernet cards, which was on a PCI riser, was offset from the PCI slot by about a centimeter. There was no power to the card, so it couldn’t function, never mind participate in a network. This fully tested configuration never seemed to detect a critical mistake on their part. How they were able to point to an HPL run on this machine in their factory as evidence of its functionality is even more mind-boggling. Must have been some new definition of “fully tested” that I am unaware of.
Quality is, in part, paying (obscene) attention to details. It is minimizing the maximum pain. It is making sure that the thing you are most concerned about is not a significant consideration to your customer. Quality isn’t a brand name. It is not a label. Like security, quality is a process. No, not ISO9xxx. Anyone can document a (potentially non-working) process. Quality is making sure that stuff works, as advertised. And if it doesn’t, you need to be all over it, making it work in the minimum amount of time and pain for your customer.
Which leads me back to the motherboard. This failing motherboard has caused me lost time and effort. We are altering some of our own processes to detect failed motherboards far earlier in our work. I wonder if I should submit the manufacturer a bill for our time in diagnosing their issues. Giving that some thought.
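One early check of the sort described above is to compare the memory the operating system actually sees against what was physically installed, which would flag a board that quietly drops DIMMs. What follows is only a minimal sketch, not our actual test process: it assumes a Linux node exposing `/proc/meminfo`, and the function names, the DIMM count, and the 10% tolerance (to allow for memory the kernel and firmware reserve) are all hypothetical choices made for illustration.

```python
import re


def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    # /proc/meminfo reports kB (really KiB); 1 GiB = 1024 * 1024 KiB.
    return int(match.group(1)) / (1024 * 1024)


def memory_looks_complete(meminfo_text, dimm_count, dimm_gib, tolerance=0.10):
    """True if visible memory is within `tolerance` of the installed total.

    dimm_count, dimm_gib and tolerance are hypothetical parameters: supply
    whatever was actually populated on the board under test.
    """
    expected_gib = dimm_count * dimm_gib
    return mem_total_gib(meminfo_text) >= expected_gib * (1.0 - tolerance)
```

On a real node one would read the text with `open("/proc/meminfo").read()` and pass in the number and size of the DIMMs that were installed. A node that should show 8 GB but reports roughly 6 GB fails this check well before any real codes are run on it.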
BTW: another (formerly high quality) MB maker recently refused to honor a warranty. They refused to believe their board was the issue. We took all the components and moved them to another board and it worked fine. Yet still they refused. We had to do the right thing for our customer and honor our warranty, and we paid for a replacement MB from a different vendor to drop into their unit. When I see this MB vendor at SC06, I will be giving them a fairly sizable bill for our time.