Data loss, thanks to buggy driver or hardware

So this happened on the 3rd, on one of my systems:

Feb  3 03:02:39 calculon kernel: [195271.041118] INFO: task kworker/20:2:757 blocked for more than 120 seconds. 
Feb  3 03:02:39 calculon kernel: [195271.048116]       Not tainted 4.20.6.nlytiq #1 
Feb  3 03:02:39 calculon kernel: [195271.052678] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Feb  3 03:02:39 calculon kernel: [195271.060626] kworker/20:2    D    0   757      2 0x80000000 
Feb  3 03:02:39 calculon kernel: [195271.066238] Workqueue: md submit_flushes [md_mod] 
Feb  3 03:02:39 calculon kernel: [195271.071057] Call Trace: 
Feb  3 03:02:39 calculon kernel: [195271.073625]  ? __schedule+0x3f5/0x880 
Feb  3 03:02:39 calculon kernel: [195271.077406]  schedule+0x32/0x80 
Feb  3 03:02:39 calculon kernel: [195271.080646]  wait_barrier+0x146/0x1a0 [raid10] 
Feb  3 03:02:39 calculon kernel: [195271.085212]  ? remove_wait_queue+0x60/0x60 
Feb  3 03:02:39 calculon kernel: [195271.089425]  raid10_write_request+0x74/0x8e0 [raid10] 
Feb  3 03:02:39 calculon kernel: [195271.094596]  ? mempool_alloc+0x69/0x190
Feb  3 03:02:39 calculon kernel: [195271.098560]  ? md_write_start+0xd0/0x210 [md_mod] 
Feb  3 03:02:39 calculon kernel: [195271.103381]  ? __switch_to_asm+0x34/0x70 
Feb  3 03:02:39 calculon kernel: [195271.107416]  ? __switch_to_asm+0x40/0x70 
Feb  3 03:02:39 calculon kernel: [195271.111453]  ? __switch_to_asm+0x34/0x70 
Feb  3 03:02:39 calculon kernel: [195271.115497]  raid10_make_request+0xbf/0x140 [raid10] 
Feb  3 03:02:39 calculon kernel: [195271.120588]  md_handle_request+0x116/0x190 [md_mod] 
Feb  3 03:02:39 calculon kernel: [195271.125590]  md_make_request+0x72/0x170 [md_mod]

Basically, my RAID10 system seemed to go belly up. As it turns out, this isn’t where I lost data. I was able to determine that a large part of the problem was due to the higher RAID rebuild limits I had set. This RAID10 is a small SSD-based system I use for a local database and some VMs. This blog was (and is again) hosted on that RAID.

Also on this system is a larger archival RAID6. This is where I store data. Well, better said, this is where I pack-rat data that should have been deleted. I am a pack rat. I am not sure if I really need all my research directories from the early 90s on there. Or much of the other stuff I’ve saved. But, you know, pack rat. And cloud storage for pack-rats is still too expensive. Cost per TB is still above what I want to spend. And if I lose this stuff, well, meh. Serves me right.


Now couple this RAID10 flip-out with an updated kernel, one where I had left an IOMMU debugging entry enabled for some reason. It tested well on another box, so this was my second deploy.

Then this happened, with no notice in the log.

md20 : inactive sdp[8](S) sdm[5](S) sdo[7](S) sdn[6](S) sdq[9](S) sdc[0](S) sde[2](S) sdd[1](S) sdf[3](S) sdg[10](S)
      39068875120 blocks super 1.2

Yes, my RAID drives were suddenly all marked as spares (S). Looking at a drive in the mix in detail …

root@calculon:~# mdadm -E /dev/sdp
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Raid Level : raid6
Raid Devices : 8
Avail Dev Size : 7813775024 (3725.90 GiB 4000.65 GB)
Array Size : 23441324544 (22355.39 GiB 24003.92 GB)
Used Dev Size : 7813774848 (3725.90 GiB 4000.65 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=176 sectors
State : clean
Internal Bitmap : 8 sectors from superblock
Update Time : Mon Feb 4 22:29:24 2019
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 342a5811 - correct
Events : 99366
Layout : left-symmetric
Chunk Size : 128K
Device Role : Active device 4
Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
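One quick way to sanity-check a superblock dump like this is to see whether the sizes agree with each other. The sketch below is a minimal check, not something run at the time. It assumes that the per-device sizes are in 512-byte sectors and that Array Size is in KiB, which is inferred from the GiB/GB figures printed next to each number. The function and constant names are made up for illustration.

```python
# Values copied from the mdadm -E output above.
RAID_DEVICES = 8
PARITY_DEVICES = 2                    # RAID6 spends two devices' worth on parity
USED_DEV_SIZE_SECTORS = 7813774848    # assumed to be 512-byte sectors
ARRAY_SIZE_KIB = 23441324544          # assumed to be KiB

def raid6_capacity_kib(raid_devices, used_dev_size_sectors):
    """Usable RAID6 capacity in KiB: (n - 2) data devices, 2 sectors per KiB."""
    data_devices = raid_devices - PARITY_DEVICES
    return data_devices * used_dev_size_sectors // 2

expected = raid6_capacity_kib(RAID_DEVICES, USED_DEV_SIZE_SECTORS)
print(expected, expected == ARRAY_SIZE_KIB)
```

Under those unit assumptions, six data devices' worth of Used Dev Size comes out to exactly the reported Array Size.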

The RAID metadata on this drive was correct. Looking at the other drives, I found much the same, though this popped up on some of them:

Array State : AAAAA..A ('A' == active, '.' == missing, 'R' == replacing)

About half the drives had inconsistent metadata, indicating that 2 drives were missing. Which, sadly, jibed with the error messages I did see in the log. Looking at those drives, they were healthy. It looked like the controller had “crashed”. But this is a simple controller, one not prone to crashing. The driver, OTOH, came with the new testing kernel.
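Comparing the Array State line across ten drives by eye gets tedious, so here is a minimal sketch of how that comparison could be scripted. It only parses text in the format shown above. The device names `dev_a` and `dev_b` and the helper names are hypothetical; the two state strings are the ones from this post. In practice the text would come from running `mdadm -E` on each member.

```python
import re
from collections import defaultdict

def array_state(examine_text):
    """Return the Array State string (e.g. 'AAAAA..A') from mdadm -E text, or None."""
    m = re.search(r"^\s*Array State\s*:\s*(\S+)", examine_text, re.MULTILINE)
    return m.group(1) if m else None

def group_by_state(examine_by_device):
    """Group device names by the Array State each one's superblock records."""
    groups = defaultdict(list)
    for dev, text in examine_by_device.items():
        groups[array_state(text)].append(dev)
    return dict(groups)

def missing_slots(state):
    """Indices of the roles a superblock thinks are missing ('.')."""
    if state is None:
        return []
    return [i for i, c in enumerate(state) if c == "."]

# Hypothetical device names; only the state strings come from the post.
sample = {
    "dev_a": "Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)",
    "dev_b": "Array State : AAAAA..A ('A' == active, '.' == missing, 'R' == replacing)",
}
groups = group_by_state(sample)
for state, devs in groups.items():
    print(state, devs, "missing slots:", missing_slots(state))
```

More than one group in the result means the members disagree about who is in the array, which is the situation described above.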

And the upper layers … the MD RAID, the file system, all depend upon a reliable driver layer. Looking over driver commits now, but I don’t think that will be it. It looks like something in a different subsystem. But I can see the upper layers doing their job, saying “hey, something’s messed up, and you need to pay attention to me.”

So, here I am, running several long sequential xfs_repair commands. Knowing full well, from the messages, that I’ve lost stuff I should probably have deleted in the past anyway.

In some sense, this is a potential argument for using ZFS. And even FreeBSD or illumos on this server. Cost of converting would be high though. And I need kvm (or possibly bhyve). I’ve been thinking about putting a 10GbE backbone into the home network, so card compatibility/performance is an issue.

I dunno. Definitely sad for the lost bits … that I’ve probably not looked at for many years …

Seems like there is a solvable problem here w.r.t. cloud stuff. If only we could do it w/o breaking the bank. 1 USD/TB per month or lower would be great. Retrieval costs should be negligible. The economics of this for a 10 TB drive today: use a 10 TB Seagate drive at around $210, with a 5 year lifetime. That’s $3.50 USD/month for 5 years of 10 TB, or $0.35/TB per month.
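The back-of-envelope math above is easy to replay with other prices or lifetimes. This is a minimal sketch that reads the drive as 10 TB, which is what the "$3.5/month for 10 TB" figure implies. It covers raw media cost only: no redundancy, power, or hosting. The function name is just for illustration.

```python
def cost_per_tb_month(drive_price_usd, capacity_tb, lifetime_years):
    """Amortize a drive's purchase price over its lifetime.

    Returns (cost per month for the whole drive, cost per TB per month).
    """
    months = lifetime_years * 12
    per_month = drive_price_usd / months
    return per_month, per_month / capacity_tb

# Figures from the paragraph above: $210 drive, 10 TB, 5 year lifetime.
per_month, per_tb_month = cost_per_tb_month(210, 10, 5)
print(f"${per_month:.2f}/month for the drive, ${per_tb_month:.2f}/TB per month")
```

Even doubling the drive count for redundancy would leave the raw media cost under the 1 USD/TB per month target.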