Henry Newman and a few other people I know are talking about RAID as being on the way out. John West pointed at this article this morning on InsideHPC. Their points are quite interesting.
It boils down to this: if the time to rebuild a failed RAID is comparable to the mean time between uncorrectable errors (UCEs), given the volume of reading/writing involved, then RAID, as it is currently thought of, is going to need some serious rethinking.
Put another way, if you are more likely than not to suffer an uncorrectable error during a rebuild, then rebuilding is a bad thing … and since this is one of the central pillars of RAID …
So what are the options?
Henry points to declustered RAID as one option. I won’t get into this. OSD is the other thing he points to. Honestly, I think OSD or something similar is likely to be one of the potential replacements.
The issue is that RAID, as implemented, is a global operation. The entire data store is in a particular state. RAID is all about preserving the state, and working hard to minimize and ameliorate state transitions.
The states I refer to are the 3 basic states of RAID: normal, degraded, and failed. Remember these are global states. So it is no wonder that, as storage capacities increase rapidly on a per disk basis, you will eventually read and write enough bits to have a realistic chance of hitting one of these UCEs.
Henry gives some tables on this.
For a UCE once in every 1E+15 bits, this is 1.25E+14 bytes, or 125 TB. This is reading and writing a 2TB drive 63 times. Now take say 24 of these drives and put them into a RAID6. This is 48TB. You’ll hit 125TB in two and a half rebuilds.
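The arithmetic above is easy to check. Here is a quick sketch, assuming decimal terabytes (1 TB = 1E+12 bytes) and counting the raw 48TB across all 24 drives; the variable names are mine, not Henry's.

```python
# Back-of-the-envelope check of the UCE numbers.
# Assumptions: 1 TB = 1e12 bytes (decimal), and "array size" means the raw
# capacity of all 24 drives, parity included.
BITS_PER_UCE = 1e15      # one uncorrectable error per 1E+15 bits
TB = 1e12                # bytes per terabyte
DRIVE_TB = 2             # capacity of one drive, in TB
DRIVES = 24              # drives in the RAID6

bytes_per_uce = BITS_PER_UCE / 8
tb_per_uce = bytes_per_uce / TB            # data volume per expected UCE
drive_passes = tb_per_uce / DRIVE_TB       # full passes over one 2TB drive
array_tb = DRIVES * DRIVE_TB               # raw array capacity
array_passes = tb_per_uce / array_tb       # full passes over the array

print(f"{tb_per_uce:.0f} TB per expected UCE")
print(f"{drive_passes:.1f} passes over a single drive")
print(f"{array_tb} TB array, {array_passes:.1f} passes over it")
```

So 62.5 passes over a single drive (call it 63), and about 2.6 passes over the full array.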
This assumes of course that the UCE numbers published are correct. Which they aren’t … that is, you don’t automatically get a UCE at that size. The figure is a rate: if errors are independent, you have roughly a 63% chance of having hit at least one UCE by that point, and the probability keeps approaching 1 as you read and write more bits.
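To put rough numbers on that probability, here is a small sketch. It assumes UCEs are independent and occur at a constant rate of one per 1E+15 bits (a Poisson model), which is my simplification and not something the drive vendors promise. The function name is made up for illustration.

```python
import math

def p_at_least_one_uce(tb_moved, bits_per_uce=1e15):
    """Chance of at least one UCE after reading/writing tb_moved terabytes.

    Assumes independent errors at a constant per-bit rate (Poisson model)
    and decimal terabytes (1 TB = 1e12 bytes).
    """
    bits = tb_moved * 1e12 * 8
    expected_errors = bits / bits_per_uce
    # 1 - exp(-x), written with expm1 to stay accurate for small x
    return -math.expm1(-expected_errors)

for tb in (2, 48, 125, 250):
    print(f"{tb:>4} TB moved: P(at least one UCE) = {p_at_least_one_uce(tb):.3f}")
```

Under these assumptions, one full pass over the 48TB array comes in at a bit under a one-in-three chance of a UCE, and you cross "more likely than not" well before you reach 125TB.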
Henry opines that RAID6 is a band-aid over RAID5 in this regard, as it handles another problem, basically the correlated second disk failure on rebuild. This is an interesting view, but I am not sure I agree that it is a “band-aid” … it is a necessary technology, in that RAID5 is badly exposed to correlated failures.
RAID provides resilience … which is a nice way to say it gives you time to fix your problem when it arises.
But it’s global.
The global nature is more of the problem than the specifics of RAID. Global means you read/write all of the data space … or just the used portion of the data space. And over 125TB of reads/writes, you have a fairly good probability of hitting a UCE.
OSD avoids the global problem, but it needs some RAID-like capabilities itself. And in the event of huge files, it will hit the same problem, albeit at a later time.
The techniques I have heard of all try to keep something of the semantics of RAID, while reducing the data traffic. Few try to solve the problem per se.
I won’t get into specifics, but there are solutions. They require a very deep re-think of some of the concepts. Hopefully I can talk about this at some point.