When core assumptions that should never be wrong do turn out to be wrong

By joe

May 23, 2012

So … where does this tale begin? We had a nice backup system in place at the lab. Twice a week, all the important servers would happily sync their contents to this unit over Gigabit Ethernet. It worked well, we were happy. Place that snippet in the background, it will come up again.

I’ve told our customers for a long time that RAID is not a backup. RAID is RAID: it gives you time to recover from a failure. But it is not a backup in and of itself. And it gives you time to recover from some failures. Not all failures. RAID1 lets you survive a single disk failure. RAID5 is similar to this. RAID6 lets you survive 2 disk failures. Our primary company storage is RAID6, with RAID1 for home directories, OS, important files, etc. It was being backed up 2x per week.

At our new place, last week, Tuesday morning I believe, I came in and found circuit breakers tripped. Odd, I thought. The UPSes were howling at me, one had already shut down. Those among us who have managed hardware for any length of time know where this is going. And they are right. But, please do continue to read on, as there are some twists. And not happy ones.

So I traced the circuit back, found the relevant units, made sure I didn’t smell anything burning. Nope, nothing smoking. Reset the breakers, turned the UPSes on, and WHAMMO. The breaker popped again. Hmmm …. inrush current too high? Fried UPS? Couldn’t be the server … I mean, it’s rock solid … a deskside JackRabbit unit that’s been through some crazy power, cooling, and environmental quality issues … with nary a hiccup. Not the server. No way. Seriously. Not the server.

Fried UPS made too much sense. Initial playing with it suggested this was the issue. Went out, bought a new unit. Put it on the circuit with the old UPS, then removed the old UPS. Came up, no problem. Ok, plug our little JackRabbit deskside unit in. Turn on the power. POP!!!! WHAMMO! Breaker trips. UPS giving same display message as the “failed” unit. Did that pop come from the JackRabbit chassis? Are those burnt resistors I smell? Henry heard the pop too. It was more of a BANG than a POP.

Pulled the unit apart. Nothing obvious. Started looking at the motherboard. Had a strong whiff of something burnt. Nothing obvious. Disconnected power to everything, checked out the supply. This is a 1kW supply. It can supply enough current to make POPs turn into BANGs. The supply was hooked up to several things:

The motherboard
The disks

Did the supply die and take stuff down with it? A quick check (at that point) showed, no, it did not. The supply was actually still viable. The motherboard was unresponsive. Pull it out, put another in. Couldn’t find the blown spot on the motherboard. It had to be the motherboard. It just couldn’t be a disk. I mean … really … disks never go all ‘splody on you. Never. Ever.

It wasn’t the motherboard. Sherlock Holmes had a simple diagnostic flowchart. At the end of the process, having eliminated everything else, what is left is the thing that it is, whether you like it or not. The disks. We found this after transferring the contents of the unit to a DV4 chassis. As Henry decanted the disks from their holders, he noticed … a blown disk … or more correctly, a blown electronics package. And burn marks. On other disks.

Ok. Most of the people with hardware that’s blown up at them are now nodding their heads vigorously. They’ve experienced things going boom. They know what I mean about core assumptions … er … isn’t that what I meant about core assumptions? You assume one thing and it’s something else? Sherlock Holmes and all? No. Please refer to the opening paragraphs.

Blown up hardware is not a horror story. Blown up hardware plus backups that didn’t … yeah, that’s a horror show. Our backups for this server are 8 months out of date. We have all of the information we need scattered on other systems, and we can reconstruct. And it’s gonna take us a while to do this. We did lose information. Real honest data loss, due to multiple failures, and an assumption that turned out to be silently incorrect. Because of another change in our network around mid-September last year, just after we got back from London for an install, this machine was physically disconnected from the backup network.

#LFMF (learn from my fail). Do not assume your backups are there. Prove they are. Restore to another machine. Do not assume one machine has backups intact. Make multiple copies.
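A machine that silently drops off the backup network for 8 months is the kind of thing a small freshness check could catch. What follows is only a sketch under stated assumptions, not what we actually run: it assumes each backup job touches a marker file named `<host>.last_success` in a shared directory after a successful sync. That marker convention, the directory, and the function name `stale_hosts` are all hypothetical. The 4-day default is just "twice a week, plus a little slack".

```python
import os
import time


def stale_hosts(marker_dir, expected_hosts, max_age_days=4.0, now=None):
    """Return (host, reason) pairs for hosts whose backup looks stale.

    Assumption (hypothetical convention): every successful backup run
    touches <marker_dir>/<host>.last_success when it finishes.
    """
    now = time.time() if now is None else now
    limit_seconds = max_age_days * 86400
    problems = []
    for host in expected_hosts:
        marker = os.path.join(marker_dir, host + ".last_success")
        try:
            age_seconds = now - os.stat(marker).st_mtime
        except FileNotFoundError:
            # No marker at all: this host has never reported a good backup.
            problems.append((host, "no marker: never backed up?"))
            continue
        if age_seconds > limit_seconds:
            days = age_seconds / 86400
            problems.append((host, "last success %.0f days ago" % days))
    return problems
```

The important design choice is that the list of expected hosts lives with the checker, not with the machines being backed up. A server that has been unplugged from the backup network cannot report its own absence, so something else has to notice the silence.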
Yeah, there’s some bozo screaming somewhere about cloud backups and such. When network costs for 1GbE connections become reasonable, sure, we will look at that. Sending 5TB over a 1MB/s pipe? Not such a wise idea.

Our backup strategy has, needless to say, changed significantly now. We are going to do the portable USB3/eSATA drive dance at first, and then a spare JackRabbit in my basement later on. Swap drives with the backup units every few weeks. Deltas via rsync or similar. Will probably upgrade the home to 50M/10M tier to get this done.

I don’t mind running most of our web/mail in the cloud. Actually this gives me a little peace of mind. I can make inexpensive replicas of what we have set up. Launch fewer/more instances. Tweak/tune them. Unfortunately data motion is going to continue to be an issue. The fastest data networks are UPS and FedEx, and this doesn’t look like it’s going to change for the foreseeable future. Moreover, as the data bandwidth wall gets higher with higher density disks, you need bigger, badder, faster units on the far end to send your disks to. Honestly, we make some of the units that can do a good job of scaling bandwidth so that your data is not frozen onto the platters. Not many others make them nearly as fast. And sadly, Amazon hasn’t bought any yet, so it’s kind of hard for us to move our data to them quickly, and access it quickly.

Learn from our failure. Just because RAID is ‘redundant’ doesn’t mean your data is safe, especially with what wound up being a quadruple failure. RAID6 isn’t able to survive this, very little can (though ask us again some time later about this, and what we are working on). Just because you have a backup in place doesn’t mean it’s really doing its job. It could be failing, silently, unless you can prove it isn’t. If your proof is being done at the first failure, you must love tachycardia and elevated blood pressure … you like living on the edge.
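To put rough numbers on the 5TB-over-a-slow-pipe point above, here is a back-of-the-envelope sketch. The assumptions are mine: decimal units (1TB = 10^12 bytes), the link running flat out with zero protocol overhead, and the "10M" in the 50M/10M tier read as 10 Mbit/s of upload. Real transfers will be slower.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes at a sustained rate, ignoring overhead."""
    return size_bytes / rate_bytes_per_s / 86400


size = 5 * TB
links = [
    ("1 MB/s pipe", 1 * MB),
    ("10 Mbit/s uplink", 10 * MB / 8),   # assumed upload side of 50M/10M
    ("1GbE line rate", 1000 * MB / 8),
]
for label, rate in links:
    print(f"{label:>18}: {transfer_days(size, rate):6.1f} days")
```

That works out to roughly 58 days at 1 MB/s and about 46 days on a 10 Mbit/s uplink, against about 11 hours at 1GbE line rate. Hence the drive swapping for the full copies, with rsync deltas over the wire in between.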
I used our backup as recently as 3 weeks ago to recover another file I’d deleted by accident. Worked like a charm. Don’t ever assume it works. Prove it works. And for all of you out there assuming RAID is all you need for backup … well … reread this.

Our new backup will consist of 2x DV4 units running on different power, and a nightly email to an internal tech list. We’ll see exactly how much was backed up, and what was backed up. Disks die. Subsystems die. A resilient solution can handle this. I assumed ours was resilient due to some of our testing. I didn’t prove it resilient every day (randomly restoring a recent file to a temp space, and generating MD5s or CRCs for every file for more rapid comparison).

I think about this in the context of hearing/reading RFP requests for multiple RAID cards connected to the same data backplane and power backplane as being redundant. They aren’t. Not even close. You need 2 completely electrically isolated pathways, and dual ported disks for “redundant” like this to actually work. Unless of course, your RAID cards are so crappy that they fail often enough that this is of value … This sort of solution would not only have not survived this failure, it would have destroyed 2 RAID cards rather than the 1 it did. And the data would still be gone.

The tale is not over. We are getting one of our DV4s (loaned to a customer to help them with a problem they had in moving data about) back, and it and its twin are going to be missioned to be our archives. Going to start a disk rotation schedule. And a mirroring system. Should be fun.
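For the daily "prove it" step mentioned above (restore a random file to temp space, compare checksums), something along these lines would do as a starting point. It is a minimal sketch, not our production tooling: the function names are made up for illustration, and it assumes the restore has already been done into `restored_root` with the same relative paths as the source tree.

```python
import hashlib
import os
import random


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """All regular files under root, as sorted paths relative to root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


def spot_check(source_root, restored_root, sample_size=20, rng=None):
    """Compare a random sample of source files against their restored copies.

    Returns (number_checked, failures), where failures is a list of
    (relative_path, reason) pairs.
    """
    rng = rng or random.Random()
    candidates = list_files(source_root)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    failures = []
    for rel in sample:
        restored = os.path.join(restored_root, rel)
        if not os.path.isfile(restored):
            failures.append((rel, "missing from restore"))
        elif file_md5(restored) != file_md5(os.path.join(source_root, rel)):
            failures.append((rel, "checksum mismatch"))
    return len(sample), failures
```

A file edited since the last backup run will legitimately show up as a mismatch, so a real version would want to skip files newer than the last successful backup, or compare against checksums recorded at backup time. The counts and the failure list are the sort of thing that belongs in that nightly email.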