Reducing risk: avoiding the bricking phenomenon

Something happened this week in a storage cluster we set up for a customer. You’ll hear more about the storage cluster at SC09, but that’s not what this is about.

This is about risk, and how to reduce it.

Risk is a complex thing to define in practice, but there are several … well … simple ways you can indicate relative risk.

A motherboard and power supply blew in one of our nodes. This happens; parts fail. We replaced them, and the node is back up and working.

Ok, so what does this say about risk?

Had we been another company with a proprietary motherboard, or a proprietary power supply/chassis, and had we had existential issues, would your ability to get your broken node fixed have been impacted?

That is, the more proprietary, sole-source content you have, the higher your risk. The more open, COTS (commercial off the shelf) parts you leverage, the lower your risk.

I should point out that, contrary to some … well … funny things … we have heard recently, COTS is not anything from the big vendors … . Seriously. I found that definition most amusing. Especially as it was being used as justification to avoid buying open systems which are COTS parts based, in favor of proprietary systems.

Your risk varies inversely with the number of sources of parts you have. That is, risk is proportional to 1/Number_of_sources for parts to fix the system.
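To put rough numbers on that rule of thumb, here is a minimal sketch. The function name relative_risk is made up for illustration, and since the proportionality constant is unknown, the values only mean something when compared against each other. Treating zero sources as infinite risk is my assumption about how to read the 1/N rule at its limit.

```python
import math


def relative_risk(num_sources: int) -> float:
    """Relative parts-availability risk, taken as 1 / num_sources.

    Hypothetical helper: the proportionality constant is unknown,
    so the result is only useful for comparing one system to another.
    """
    if num_sources < 0:
        raise ValueError("number of sources cannot be negative")
    if num_sources == 0:
        # Assumption: no remaining source means the part cannot be replaced.
        return math.inf
    return 1.0 / num_sources


# Compare a sole-source part against parts with more suppliers.
for n in (0, 1, 2, 5):
    print(f"{n} source(s): relative risk {relative_risk(n)}")
```

Going from one source to five cuts the relative risk to a fifth, while going from one source to none makes it unbounded.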

Say, for example (using our friends at Sun as the example here), you have a Thumper. And you have a motherboard issue, sort of like what we had. Say you needed it replaced. Once Oracle takes over, there is a very real question as to whether or not the number of sources for that motherboard goes from 1 to 0. So you are left with the spare parts kits you can get from eBay or elsewhere.

That is risk.

I could make the same point about the Panasas drive or director blades, and any other technology which might be neat technology, but whose longevity is directly tied to the health and ability of the company/division/group to continue operating.

Phrased like this, it is obvious where the risk lies.

More to the point, had we, Scalable, been run over by a wild herd of buses, or had our business been acquired and moved into a different area, or … our customer could still have serviced the system on their own, purchasing the needed parts from another source.

That is reduction of risk.

In this day and age, where we have to be more careful with our expenditures and guard against taking inappropriate risks for non-proportional reward, we have to ask whether or not the higher-risk non-COTS way is the right way to spend precious capital. That is, if your risk is tremendously magnified by the proprietary nature of the product, ought not your reward also be tremendously magnified by using it? And if it isn’t, you probably shouldn’t be using it.

Simple cost-benefit and risk-reward analysis.
