Interesting reading on SSD reliability

Been researching this more. The questions I am asking now: are the MTBF numbers believable? Are there bad batches of NAND chips … SLC, MLC? What failure rates do people see with SLC?
We have seen failures in both SLC and MLC units. MLC is generally reported to be less reliable than SLC.

I am specifically looking for failure information, and what I am finding concerns me. Across all the controller chips out there, a number of people are reporting sudden failures in 2-3 month windows. These weren’t reported last year; this seems to be a recent phenomenon. From process shrinks? We have calls in to a vendor we’ve been working with (not Corsair) to get a better read on what’s going on.
There are lots of people indicating that MLC is fine for enterprise based upon the MTBF numbers, and the write/erase block statistics. My concern is, what if those numbers are wrong? I’ve always been somewhat suspicious about the bathtub analysis used to generate MTBFs for other systems. I need to think this through and do some calculations and modeling.
[update] this is the article that got me thinking. I’ve been worried that failures were possibly external, or environmental. The authors make good arguments that some of these can be issues (power rail reliability, etc.). They also point out some of the more interesting analysis:

Plugging these numbers in the same calculation gives an estimated MLC flash SSD operating life (at max write throughput) which is 6 months! (instead of 51 years for a 64GB SLC SSD).
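
As a rough sketch of the kind of calculation being quoted: the simplest endurance estimate multiplies capacity by rated erase cycles and divides by the sustained write rate. The inputs below (64 GB, 2 million cycles, 80 MB/s) are my assumptions, chosen because they land near the 51-year SLC figure; the function name is made up for illustration, and the model assumes perfect wear leveling and no write amplification, which real drives won't deliver.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def endurance_life_years(capacity_bytes, erase_cycles, write_bytes_per_sec):
    """Years until every block has used up its rated erase cycles.

    Assumes perfect wear leveling, no write amplification, and
    continuous writes at the given rate (a best-case upper bound).
    """
    total_writable_bytes = capacity_bytes * erase_cycles
    seconds = total_writable_bytes / write_bytes_per_sec
    return seconds / SECONDS_PER_YEAR


# Assumed inputs, not taken from the post: 64 GB, 2,000,000 cycles, 80 MB/s.
slc_years = endurance_life_years(64e9, 2_000_000, 80e6)
print(f"Estimated life at max write throughput: {slc_years:.1f} years")
```

The estimate is linear in the cycle rating, so a part rated for orders of magnitude fewer cycles drops from decades to months under the same write load. That is why the quoted result is so sensitive to the endurance number you plug in.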

I need to look at this analysis. Is it possible that MTBF is completely meaningless for these devices? MTBF is a statistical measure over a large sample of units. It is not an experimental measure; it attempts to provide a predictive model over many units. It’s not really valid to apply it to a single unit, so what the number means depends on the context in which you use it.
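
To make the population-versus-single-unit point concrete, here is a minimal sketch that assumes a constant failure rate (the flat bottom of the bathtub curve, i.e. an exponential lifetime model). That assumption is exactly the part I am suspicious of, and the 1,000,000-hour MTBF and the function names are hypothetical, not any vendor's numbers.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def failure_probability(mtbf_hours, service_hours):
    """Chance that one unit fails within service_hours.

    Valid only under a constant failure rate (exponential model).
    """
    return 1.0 - math.exp(-service_hours / mtbf_hours)


def expected_failures(fleet_size, mtbf_hours, service_hours):
    """Expected number of failed units in a fleet over service_hours."""
    return fleet_size * failure_probability(mtbf_hours, service_hours)


mtbf = 1_000_000  # hypothetical datasheet MTBF, in hours
p_year = failure_probability(mtbf, HOURS_PER_YEAR)
fleet_failures = expected_failures(1000, mtbf, HOURS_PER_YEAR)
print(f"Per-unit failure probability in one year: {p_year:.2%}")
print(f"Expected failures among 1000 units in one year: {fleet_failures:.1f}")
```

Under this model a million-hour MTBF does not mean any one drive lasts a century. It means a bit under 1% of a large fleet fails per year, and only if the failure rate really is constant. Early-life failures or wear-out clustered in a short window would break the model without changing the datasheet number.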
I also have to track what SMART does with SSDs.
But more to the point, the page also has this gem:

“With the voltage levels closer together for MLC flash the devices are again more susceptible to disturbs and transient occurrences, causing the generation of errors which then have to be detected and corrected. If that is not enough for the chip maker, it poses an even larger problem for the system designer, in that there is more of a variety of technologies employed among competing flash chip designs than DRAM makers, for example, would ever dream of.”

I am not so sure that power quality is as high as I wish it to be.
I should note that I am not questioning SSD utility, just the basis for some of the reliability numbers, which, if you will allow me to describe it as such, might best be considered works of optimistic fiction. We don’t like it when things fail, and OS drives failing in RAID1s and then taking swap out with pages committed to the (now inaccessible) swap space … no, this is not something good. I saw a second customer failure like this. The only thing you can do when swap with committed pages goes away is reboot, which you can’t always do.
So I am looking into the failure modes, rates, and speaking with the vendor to see if we are missing something. I don’t think we are, but we will ask.

2 thoughts on “Interesting reading on SSD reliability”

  1. Don’t we seem to be way too cautious about SSDs? Sure they fail, but we have developed technologies like RAID or self-healing filesystems (Sun’s ZFS, NetApp’s WAFL, etc) for decades to protect against known unreliable HDDs. So why can’t we trust these technologies to protect against SSDs?
    Why do you need to reboot the OS when a drive in a RAID1 array fails? This indicates the RAID layer is not providing the expected functionality.

  2. @MRB
    Unfortunately, those making SSDs aren’t that experienced making drives in general, so they don’t have important past experience to guide them. One SSD we’ve used, much to our chagrin, locked up the POST on a motherboard when it failed. The only possible solution was to remove it. Another had a soft fail, in that parts would respond correctly, so we saw the RAID mechanism doing what it was supposed to. But because the failure wasn’t staged correctly, the RAID mechanism was … er … abused. In this case, the RAID attempted something like 200+ rebuilds in ~ 5 seconds due to the soft failure.
    Basically, the RAID-level functionality is working fine. It’s the “how an SSD should be indistinguishable from a disk drive in operation” part that many of the vendors appear not to get. During failure, we need the drive not to lock POST, nor to soft fail. Remove it and be done. Not a RAID issue. It’s a drive implementation issue.
    As for self-healing file systems, yes, I’d advise using them with SSDs. RAID1 or RAID10 at minimum at a block level. ZFS has issues on Linux right now, and isn’t suitable for boot drives under Linux. WAFL isn’t available.
