When core assumptions that should never be wrong, do turn out to be wrong

So … where does this tale begin?
We had a nice backup system in place at the lab. Twice a week, all the important servers would happily sync their contents to this unit over Gigabit ethernet. It worked well, we were happy.
Place that snippet in the background, it will come up again.
I’ve told our customers for a long time that RAID is not a backup. RAID is RAID, it gives you time to recover from a failure. But it is not a backup in and of itself.
And it gives you time to recover from some failures. Not all failures.
RAID1 lets you survive a single disk failure.
RAID5 also lets you survive a single disk failure.
RAID6 lets you survive 2 disk failures.
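As a rough illustration, the tolerances above can be written down as a simple lookup. This is only a sketch: it assumes RAID1 means a two-disk mirror, it counts whole-disk failures within a single array, and the names `TOLERATED_FAILURES` and `array_survives` are made up for this example.

```python
# Disk failures each RAID level is designed to tolerate within one array.
# Assumption: RAID1 here is a plain two-disk mirror.
TOLERATED_FAILURES = {"RAID1": 1, "RAID5": 1, "RAID6": 2}


def array_survives(level: str, failed_disks: int) -> bool:
    """Return True if the array should still be readable after the failures."""
    return failed_disks <= TOLERATED_FAILURES[level]


if __name__ == "__main__":
    for level in ("RAID1", "RAID5", "RAID6"):
        print(level, "survives 2 failed disks:", array_survives(level, 2))
```

None of these levels says anything about a deleted file, a fried controller, or a fire, which is why RAID only buys time rather than replacing a backup.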
Our primary company storage is RAID6. With RAID1 for home directories, OS, important files, etc.
It was being backed up 2x per week.
At our new place, last week, Tuesday morning I believe, I came in and found circuit breakers tripped. Odd I thought. The UPSes were howling at me, one had already shut down.
Those among us who have managed hardware for any length of time know where this is going. And they are right. But please do continue to read on, as there are some twists. And not happy ones.

So I traced the circuit back, found the relevant units, made sure I didn’t smell anything burning.
Nope, nothing smoking.
Reset the breakers, turned the UPSes on, and WHAMMO.
The breaker popped again.
Hmmm …. inrush current too high? Fried UPS?
Couldn’t be the server … I mean, it’s rock solid … a deskside JackRabbit unit that’s been through some crazy power, cooling, and environmental quality issues … with nary a hiccup. Not the server. No way. Seriously. Not the server.
Fried UPS made too much sense. Initial playing with it suggested this was the issue.
Went out, bought a new unit. Put it on the circuit with the old UPS, removed the old UPS. Came up, no problem.
Ok, plug our little JackRabbit deskside unit in.
Turn on the power.
Breaker trips. UPS giving same display message as the “failed” unit.
Did that pop come from the JackRabbit chassis? Is that burnt resistors I smell?
Henry heard the pop too. It was more of a BANG than a POP.
Pulled the unit apart. Nothing obvious. Started looking at the motherboard. Had a strong whiff of something burnt. Nothing obvious. Disconnected power to everything, checked out the supply. This is a 1kW supply. It can supply enough current to make POPs turn into BANGs.
The supply was hooked up to several things:

  • The motherboard
  • The disks

Did the supply die and take stuff down with it? A quick check (at that point) showed, no, it did not. The supply was actually still viable.
The motherboard was unresponsive. Pull it out, put another in. Couldn’t find the blown spot on the motherboard. It had to be the motherboard.
It just couldn’t be a disk. I mean … really … disks never go all ‘splody on you.
It wasn’t the motherboard.
Sherlock Holmes had a simple diagnostic flowchart. Eliminate everything it can’t be, and at the end of the process, whatever is left is the thing that it is, whether you like it or not.
The disks.
We found this after transferring the contents of the unit to a DV4 chassis. As Henry decanted the disks from their holders, he noticed … a blown disk … or more correctly, a blown electronics package. And burn marks. On other disks.
Most of the people with hardware that’s blown up at them are now nodding their heads vigorously. They’ve experienced things going boom. They know what I mean about core assumptions …
er … isn’t that what I meant about core assumptions? You assume one thing and it’s something else? Sherlock Holmes and all?
Please refer to the opening paragraphs.
Blown up hardware is not a horror story. Blown up hardware plus backups that didn’t … yeah, that’s a horror show.
Our backups for this server are 8 months out of date.
We have all of the information we need scattered on other systems, and we can reconstruct. And it’s gonna take us a while to do this.
We did lose information. Real honest data loss, due to multiple failures, and an assumption that turned out to be silently incorrect.
Because of another change in our network around mid September last year, just after we got back from London for an install, this machine was physically disconnected from the backup network.
#LFMF (learn from my fail). Do not assume your backups are there. Prove they are. Restore to another machine. Do not assume one machine has backups intact. Make multiple copies.
Yeah, there’s some bozo screaming somewhere about cloud backups and such. When network costs for 1GbE connections become reasonable, sure, we will look at that. Sending 5TB over a 1MB/s pipe? Not such a wise idea.
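The arithmetic behind that last remark is easy to check. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a perfectly sustained rate, and no protocol overhead; `transfer_days` is a name invented for this example.

```python
TB = 10**12  # bytes, decimal units (an assumption)
MB = 10**6   # bytes


def transfer_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at a constant rate_bytes_per_s."""
    seconds = size_bytes / rate_bytes_per_s
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # 5 TB over a 1 MB/s pipe: roughly 58 days.
    print(f"1 MB/s:   {transfer_days(5 * TB, 1 * MB):.1f} days")
    # Same data at 1 GbE line rate (125 MB/s): about half a day.
    print(f"125 MB/s: {transfer_days(5 * TB, 125 * MB):.2f} days")
```

Real links rarely hold their nominal rate for weeks, so these figures are best-case estimates.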
Our backup strategy has, needless to say, changed significantly now. We are going to do the portable USB3/eSATA drive dance at first, and then a spare JackRabbit in my basement later on. Swap drives with the backup units every few weeks. Deltas via rsync or similar. Will probably upgrade the home connection to the 50M/10M tier to get this done.
I don’t mind running most of our web/mail in the cloud. Actually this gives me a little peace of mind. I can make inexpensive replicas of what we have set up. Launch fewer/more instances. Tweak/tune them.
Unfortunately data motion is going to continue to be an issue. The fastest data networks are UPS and Fedex, and this doesn’t look like it’s going to change for the foreseeable future. Moreover, as the data bandwidth wall gets higher with higher density disks, you need bigger, badder, faster units on the far end to send your disks to. Honestly, we make some of the units that can do a good job of scaling bandwidth so that your data is not frozen onto the platters. Not many others make them nearly as fast. And sadly, Amazon hasn’t bought any yet, so it’s kind of hard for us to move our data to them quickly, and access it quickly.
Learn from our failure. Just because RAID is ‘redundant’ doesn’t mean your data is safe, especially with what wound up being a quadruple failure. RAID6 isn’t able to survive this; very little can (though ask us again some time later about this, and what we are working on). Just because you have a backup in place doesn’t mean it’s really doing its job. It could be failing, silently, unless you can prove it isn’t. If your proof is being done at the first failure, you must love tachycardia and elevated blood pressure … you like living on the edge. I used our backup as recently as 3 weeks ago to recover another file I’d deleted by accident. Worked like a charm.
Don’t ever assume it works. Prove it works.
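One cheap way to catch a backup that has silently stopped is a freshness check. The sketch below assumes the backup job touches a marker file on the backup unit after every successful run; that marker, the 4-day threshold (chosen to fit a twice-a-week schedule), and the function names are all hypothetical, not something our current setup has.

```python
import time
from pathlib import Path
from typing import Optional


def backup_age_days(marker: Path, now: Optional[float] = None) -> float:
    """Days since the marker was last touched; infinity if it does not exist."""
    if not marker.exists():
        return float("inf")
    now = time.time() if now is None else now
    return (now - marker.stat().st_mtime) / 86_400


def is_stale(marker: Path, max_age_days: float = 4.0,
             now: Optional[float] = None) -> bool:
    """True when the last successful backup is older than max_age_days."""
    return backup_age_days(marker, now) > max_age_days
```

Run from cron on a machine other than the one being backed up, a check like this would complain within days of a cable being pulled, rather than 8 months later. A missing marker counts as stale on purpose, so an unreachable backup unit raises the alarm too.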
And for all of you out there assuming RAID is all you need for backup … well … reread this.
Our new backup will consist of 2x DV4 units running on different power, and a nightly email to an internal tech list. We’ll see exactly how much was backed up, and what was backed up.
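A nightly report of that kind could be assembled along these lines. This is a sketch under assumptions: the addresses, the host name, and the `copied` mapping of path to size in bytes are placeholders, and how that mapping gets collected from the backup tool is left out.

```python
from email.message import EmailMessage


def build_backup_report(host, copied, sender, recipient):
    """Build (but do not send) a summary email for one backup run.

    `copied` maps each backed-up path to its size in bytes.
    """
    total_bytes = sum(copied.values())
    # An empty run is flagged loudly instead of looking like a quiet success.
    status = "OK" if copied else "WARNING: nothing backed up"
    msg = EmailMessage()
    msg["Subject"] = (
        f"[backup] {host}: {status}, {len(copied)} files, "
        f"{total_bytes / 1e9:.2f} GB"
    )
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{size:>15,d}  {path}" for path, size in sorted(copied.items())]
    msg.set_content("\n".join(lines) or "No files were copied in this run.")
    return msg
```

Sending it is one more line, for example `smtplib.SMTP("localhost").send_message(msg)`, assuming a mail relay is listening on the local machine. The point of the design is that a run which copies nothing still produces a message, and a night with no message at all is itself a warning.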
Disks die. Subsystems die. A resilient solution can handle this. I assumed ours was resilient due to some of our testing. I didn’t prove it resilient every day (randomly restoring a recent file to a temp space, and generating MD5s or CRCs for every file for more rapid comparison).
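The daily proof described in that parenthesis might look something like the following. It is a minimal sketch: it assumes the restored copy mirrors the source directory layout, it uses MD5 only as a quick comparison checksum, and `md5sum` and `spot_check` are names invented here.

```python
import hashlib
import random
from pathlib import Path
from typing import List, Optional


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check(source: Path, restored: Path, sample_size: int = 20,
               seed: Optional[int] = None) -> List[str]:
    """Return relative paths from a random sample that are missing or differ."""
    files = sorted(p for p in source.rglob("*") if p.is_file())
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    failures = []
    for original in sample:
        relative = original.relative_to(source)
        copy = restored / relative
        if not copy.is_file() or md5sum(copy) != md5sum(original):
            failures.append(relative.as_posix())
    return sorted(failures)
```

Files that changed after the last backup run will show up as mismatches, so a real version would probably compare against checksums recorded at backup time rather than against the live tree.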
I think about this in the context of hearing/reading RFPs that describe multiple RAID cards connected to the same data backplane and power backplane as being redundant. They aren’t. Not even close. You need 2 completely electrically isolated pathways, and dual ported disks for “redundant” like this to actually work. Unless of course, your RAID cards are so crappy that they fail often enough that this is of value … This sort of solution would not only have not survived this failure, it would have destroyed 2 RAID cards rather than the 1 it did. And the data would still be gone.
The tale is not over. We are getting one of our DV4s (loaned to a customer to help them with a problem they had in moving data about) back, and it and its twin are going to be tasked to be our archives. Going to start a disk rotation schedule. And a mirroring system. Should be fun.

5 thoughts on “When core assumptions that should never be wrong, do turn out to be wrong”

  1. Just out of curiosity :
    What do you do that produces so much data ?
    With 50Mbit/s over 8 hours, you can already transport about 180 Gbyte.

  2. @Jan
    It’s 10Mb/s up, not 50Mb/s up. This comes in at ~1.1MB/s upload if I use all the bandwidth on this one thing now.
    1GB is therefore about 1000s. 3.6 GB in ~ 1 hour. 86.4 GB/day. This is about how long it takes for us to transfer one of the directory trees (one of the more important ones) to my home server (this one) over the network.
    The machine currently has about 3.2TB in use on the relevant directories. Many things in there, it’s our central server.
    Would take 37 days to transfer this once.
    This is, BTW, exactly what I mean by a bandwidth wall height. A time measured in seconds, to read or write all of your data once. And this is why this is so very important, and why cloud backup doesn’t make sense unless you use Fedex/UPS net … which sort of obviates the utility of the concept of the cloud.
    We’d like big fast 10GbE pipes in. Even have a router that could handle this. Just can’t afford the bandwidth from the providers (it’s more than 10x our current rent) for this. Even a 1GbE pipe would be great. This is about 2x our current rent.
    Sure, we can put everything at a hosting provider. And we still have the data motion problem. Doesn’t solve anything unless we are colocated with them (literally our entire business in the same building as the data center). This isn’t likely to happen any time soon.

  3. But what is it that you _produce_ every day ?
    Yes, a first sync hurts, but after that, you only send deltas over the line.
    Unless you produce a lot of data, and then I would just pull a line between your business and a neighbour across the street as an off-site backup.
    If the whole neighbourhood goes up in flames your problems are likely bigger than a backup can solve 🙂

  4. Thank you very much for sharing your experiences with failing backups and inevitable failures in storage systems. Too bad that they happened simultaneously, as is logical once one needs backups.
    @Jan van Haarst: “If the whole neighbourhood goes up in flames your problems are likely bigger than a backup can solve”
    Nowadays I just think your case and phrase are so wrong and overused, leading us to think of the wrong emergency scenarios and to a false feeling of data security (just as “RAID is not backup” commands one’s mind away from a false sense of security with RAID).
    Joe’s case shows once again the worst and most simplistic scenario, one that can badly ruin a fine and tested backup strategy and affect operations whether backups are in the cloud or not: a machine not connected, and no scheduled script checking its presence and complaining loudly when a check fails.
    From experience I know how easy it is to forget required changes while moving or in other hasty activities driven by business requirements, even if one had a scheduled script for checks. And I find it almost not worth mentioning how easy it is, in the first place, to leave a check out of the script with reasoning like “it is not worth checking the service’s (a machine’s) presence, since its availability is monitored anyway”. Then after a couple of years things have silently changed, and no one has noticed that an anonymous change has had an effect on a piece of the backups, which churn along silently as there isn’t anything but success to report until…
    I have collected a few links to published disaster descriptions. I share them here in case you find them informative, and as a sign of gratitude that Joe, without fear for SI’s company image, shared his description of events (I think the usual state is the opposite, which hinders the general state of system administration in the field; the links below show a big case of the opposite).
    ArsTechnica: Amazon’s lengthy cloud outage shows the danger of complexity
    Storage Mojo’s take on Amazon’s outage
    TheRegister: Titsup EMC VNX kit unleashes 5 days of chaos in Sweden
    ChannelRegister: Flash drive meltdown fingered in Swedish IT blackout
    StorageMojo asks: How fault tolerant are SANs? (I shared the above links about Tieto’s problems with EMC VNX in a comment in StorageMojo’s follow ups, which contain a few others.)
    “TD ? LED flashlight JeeLabs” [a data disaster recovery description under a subheader] (As a coincidence the link was my yesterday’s reading pleasure right after this blog post.)

    • @Erkki

      I have collected few links to published disaster descriptions. I share them here, if you find them informative and as a sign of gratitude that Joe unfearfully to SI’s company image shared his description of events (I think the usual state is the opposite, which hinders general state of system administration in the field, the links below show a big case of the opposite).

      I am reminded of a few events from employers past. At SGI, after the peak in the stock and then during monotonic decline into oblivion (the first time), management decided to monetize assets by selling buildings and consolidating locations. So they contracted a local Realtor to move the property where we were in. A very nice building. With a big old for-sale sign out front. Our competition didn’t say a word. They just suggested our potential customers drive by our building. The impact upon our business was, sadly, quite severe.
      With this anecdote, I hope it’s more clear why businesses feel compelled to “hide” their failures from public eyes.
      This said, the failure we experienced was a combination of things we could and could not control. Fundamentally, mistakes were made that, had they been corrected, we would have been laughing this off as near miss rather than what transpired.
      I thought the benefit associated with people seeing our near miss would exceed the potential hit our competitors could make saying “see their 4 year old machine gave up its ghost and their backup broke”. I am sure some competitors would do that (we’ve seen some real darn sleazy things attempted upon us) or worse. But at the end of the day, the real value is in seeing what we’ve experienced, learning from the failure.
      The storage isn’t/wasn’t the problem, it was the process. That was broken, and that’s the issue that needed to be fixed. We are increasing our use of our storage, haven’t seen anything else as reliable/stable out there for our own use. We eat our own dog food. And our backup is doubling down on our storage, using deltaVs locally, and a remote JackRabbit for resiliency, along with a disk rotation schedule, and possibly some rsyncs to a remote machine.
      For other businesses/organizations, it’s OK to post your own stories. Think of it as a morbidity and mortality conference. Discuss the failures. Discuss the failure modes. Help improve the state of knowledge, and help people to avoid making a similar mistake in the future. If you want, post them anonymously, and as Erkki did, please link back to us so we can see them.
