Interesting reading on SSD reliability

By joe

November 23, 2010

Been researching this more. The questions I am asking now are: are the MTBF numbers believable? Are there bad batches of NAND chips … SLC, MLC? What failure rates do people see with SLC? We have seen failures in both SLC and MLC units. MLC is generally indicated to be less reliable than SLC.

I am specifically looking for failure information, and what I am finding concerns me. Across all the controller chips out there, a number of people seem to be reporting sudden failures in 2-3 month windows. These weren’t reported last year; this seems to be a recent phenomenon. From process shrinks? We have calls in to a vendor we’ve been working with (not Corsair) to get a better read on what’s going on.

There are lots of people indicating that MLC is fine for enterprise use, based upon the MTBF numbers and the write/erase block statistics. My concern is, what if those numbers are wrong? I’ve always been somewhat suspicious of the bathtub analysis used to generate MTBFs for other systems. I need to think this through and do some calculations and modeling.

[update] This is the article that got me thinking. I’ve been worried that the failures were possibly external, or environmental. The authors make good arguments that some of these can be issues (power rail reliability, etc.). They also point out some of the more interesting analysis:

I need to look at this analysis. Is it possible that MTBF is completely meaningless for these devices? MTBF is a statistical measure over a large sample of units. It is not an experimental measure; it is an attempt to provide a predictive model over many units. It’s not really valid to apply it to a single unit. So what the number means depends on context: it describes a population of drives, not the one drive sitting in your server. I also have to track what SMART does with SSDs. But more to the point, the page also has this gem:

I am not so sure that power quality is as high as I wish it to be. I should note that I am not questioning SSD utility, just the basis for some of the reliability numbers, which, if you will allow me to describe them as such, might best be considered works of optimistic fiction. We don’t like it when things fail, and OS drives failing in RAID1s and then taking swap out with pages committed to the (now inaccessible) swap space … no, this is not something good. I saw a second customer failure like this. There is only one thing you can do when swap with committed pages goes away: reboot. Which you can’t always do. So I am looking into the failure modes and rates, and speaking with the vendor to see if we are missing something. I don’t think we are, but we will ask.
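To make the point about MTBF being a population statistic a bit more concrete, here is a rough sketch of the arithmetic that usually sits behind a quoted MTBF. It assumes a constant failure rate (the flat bottom of the bathtub curve, i.e. exponentially distributed lifetimes), which is exactly the assumption I am suspicious of. The 1,000,000 hour MTBF and the fleet sizes are made-up illustrative inputs, not numbers from any vendor or from the article, and the function names are my own.

```python
import math

HOURS_PER_YEAR = 8760.0


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Probability that a single unit fails within one year.

    Assumes a constant hazard rate of 1/MTBF (exponential lifetimes).
    Infant mortality and wear-out are ignored entirely.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(fleet_size: int, mtbf_hours: float, hours: float) -> float:
    """Expected number of failed units in a fleet after `hours` of operation."""
    p_fail = 1.0 - math.exp(-hours / mtbf_hours)
    return fleet_size * p_fail


def prob_at_least_one_failure(fleet_size: int, mtbf_hours: float, hours: float) -> float:
    """Probability that at least one unit in the fleet has failed by `hours`.

    Assumes failures are independent, which shared power rails or a bad
    batch of NAND would violate.
    """
    p_survive_one = math.exp(-hours / mtbf_hours)
    return 1.0 - p_survive_one ** fleet_size


if __name__ == "__main__":
    mtbf = 1_000_000.0  # hypothetical quoted MTBF, in hours

    print(f"AFR for one unit:             {annualized_failure_rate(mtbf):.4%}")
    print(f"Expected failures, 1000 units: {expected_failures(1000, mtbf, HOURS_PER_YEAR):.2f}")
    # A RAID1 pair: chance that at least one of the two drives fails in a year.
    print(f"P(>=1 failure) in RAID1 pair:  {prob_at_least_one_failure(2, mtbf, HOURS_PER_YEAR):.4%}")
```

Under these assumptions the model says something about how many units you should expect to replace across a large fleet in a year. It says nothing about clustered failures in the first few months, and it breaks down as soon as failures are correlated or the hazard rate is not constant, which is where the field reports seem to be pointing.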