Moving to zfs, a journey

The background.  I've maintained a server for my home systems for more than 15 years.  It was originally built as a Frankenstein system from random parts available during my time running Scalable Informatics (RIP), and it has been rebuilt a number of times since with parts procured from eBay.  This server runs a number of my VMs ... yeah yeah ... I can't quite convince myself to go full k8s or k3s for my services.  And I like the isolation of the VMs.

The disks I had in there, 4TB units, were used enterprise disks purchased about 5 years ago.  I have 10 of them, organized as an 8-disk RAID6 with 2 hot spares, built and managed with mdadm.
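
For the curious, an array shaped like that gets created with something along these lines (the device names and md number are placeholders, not my actual layout):

mdadm --create /dev/md0 --level=6 --raid-devices=8 --spare-devices=2 /dev/sd[b-k]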

I had a set of SSDs in there as well, since replaced (as SSDs wear out), in a small/fast 4-disk RAID10 with 1 hot spare.

Well, entropy happens.  This is entropy in the physical sense: things gradually fall apart (entropy is maximized).  Data decays.  Bitrot happens.  The spinning disks are all reporting north of 60k hours of operation, apart from the replacements, which range from 35k to 45k hours.  They are long in the tooth, and well beyond their 5-year warranty.
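
Those hour counts come straight from SMART; a quick check with smartmontools looks roughly like this (device name is a placeholder):

smartctl -A /dev/sda | grep -i power_on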

The mdadm array has rarely given me problems.  Not never, though.  A few times, on shutdown, the array ejected a disk or two and then deleted the array metadata.  That was ... unnerving.  I was able to fix it, but ... still ... I moved from hardware RAID to software RAID specifically because I can repair a software RAID myself.  With a hardware RAID, that was much harder.  And I recall customers occasionally losing data to a failed hardware RAID card during my time at Scalable.

Add to this that a great wipeout of data had occurred, thanks to a failed disk and a background file system optimization process I was running.  In the past, I'd never had an issue with xfs_fsr.  And even today, I don't blame xfs_fsr.

Basically, there was a write error on a drive.  I run scans monthly, and they are supposed to catch and flag exactly this kind of thing.  Needless to say, that didn't happen.  The write error was eventually traced to a failed SATA/SAS card and cable, both since replaced.
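
For context, the scans I mentioned are the periodic mdadm consistency checks (Debian schedules them from cron via the mdadm package); you can kick one off by hand with roughly this (md device name is a placeholder):

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat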

But the damage was done.  For those who don't know, xfs_fsr remaps/moves blocks of data around on the file system so that files end up sequentially accessible.  I'd been using it for years (well, more than a decade).  The tool does not check data integrity.  Nor does the file system, nor the mdadm RAID, on read or write.
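
For reference, a defrag pass is just something like this (mount point is a placeholder):

xfs_fsr -v /data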

It just moves blocks.

I think you can see where this was going.

I had a bad block generator, which occasionally wrote bad blocks during read-modify-write cycles, and a file system block mover wandering all over my file system ... my 20-ish TB file system, containing 20+ years of data ... it was/is my backup ...

I was an expert with xfs_repair.  And hardware debugging/remediation.  And Linux internals.  I found and fixed the hardware problems.  That took about a week.  The xfs_repair work (not a single repair cycle, but many) took about 2 weeks.  At the end of it, I had lost maybe 75% of my data.
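
Each cycle was essentially the classic unmount-and-repair dance, roughly this (device and mount point are placeholders):

umount /data
xfs_repair /dev/md0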

None of it was irreplaceable, and I recovered most of it.  But I got a really good scare out of that.  And I swore I would eventually shift to zfs when I had the chance.

That was 3 years ago.  

About a month ago, I was reviewing system logs, and saw enough SMART errors that I thought "I should really do something about that".

So, I bought 5 (new) 14TB enterprise drives.  Got them into the unit.  Turned it on, and ...

It powered off as it started spinning up the drives.  Sure enough, the power supply was old, and the disk rails couldn't deliver enough power.  So, let's get a new 1kW power supply.

Put that in, wire it all up and ...

Works.

Ok, let's pull down the latest zfs-on-linux (2.1.2 as of then) and build it against this kernel (5.10.x).
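
The build itself was the usual from-source routine for OpenZFS, roughly this (assuming kernel headers and the autotools/libtool toolchain are already installed):

sh autogen.sh
./configure
make -j$(nproc)
make install
ldconfig
depmod

Then, create my zpool: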

zpool create storage mirror ata-ST14000NM001G-2KJ103_ZL2H871A ata-ST14000NM001G-2KJ103_ZL2HBL8G mirror ata-ST14000NM001G-2KJ103_ZL2HCAQ7 ata-ST14000NM001G-2KJ103_ZL2HCP50 spare ata-ST14000NM001G-2KJ103_ZL2HQJ4W

zpool set autoreplace=on storage

Make sure it worked

# zpool status
  pool: storage
 state: ONLINE
config:

	NAME                                   STATE     READ WRITE CKSUM
	storage                                ONLINE       0     0     0
	  mirror-0                             ONLINE       0     0     0
	    ata-ST14000NM001G-2KJ103_ZL2H871A  ONLINE       0     0     0
	    ata-ST14000NM001G-2KJ103_ZL2HBL8G  ONLINE       0     0     0
	  mirror-1                             ONLINE       0     0     0
	    ata-ST14000NM001G-2KJ103_ZL2HCAQ7  ONLINE       0     0     0
	    ata-ST14000NM001G-2KJ103_ZL2HCP50  ONLINE       0     0     0
	spares
	  ata-ST14000NM001G-2KJ103_ZL2HQJ4W    AVAIL   

errors: No known data errors

That is, it is a stripe across a pair of mirrors (RAID1s), with a hot spare, and with autoreplace turned on so that if a drive fails, it is automatically replaced from the hot spare pool.
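
A quick way to confirm that property stuck:

zpool get autoreplace storage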

Then create the file system atop the pool

zfs create -o compression=zstd storage/data
zfs set atime=off storage/data
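
And a quick check that the dataset properties took, which is also handy later for seeing how well zstd is doing:

zfs get compression,compressratio,atime storage/data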

Then I rsynced the data over from the mdadm array to the zfs pool.  It was quite pleasing to see 500MB/s to 1GB/s of reads and writes during the transfer.  It took about 50k seconds, a bit under 14 hours, to move the data.
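
The copy itself was nothing fancy, roughly this (the source path is a placeholder for wherever the old xfs file system was mounted):

rsync -aHAX --info=progress2 /old-data/ /storage/data/

And now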

# zfs list
NAME           USED  AVAIL     REFER  MOUNTPOINT
storage       9.06T  16.3T      104K  /storage
storage/data  9.06T  16.3T     9.06T  /storage/data

I've got about 25TB total storage, using 9TB (with compression).

All working, right?  Right?

Well, no.  It turns out the zfs-on-linux package doesn't know much about Debian, and prefers to deal with the pools at the dracut level rather than the systemd level.  This causes pain.

A few frustrating hours and many reboots later, I ended up debugging what the project shipped as its startup mechanism.  I wrote my own unit file, added it to systemd, and voila, the pools get imported and mounted on reboot.
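
I won't claim this is exactly the file I shipped, but a minimal unit along these lines does the job (the unit name zfs-local-import.service and the /sbin paths are placeholders, adjust to taste):

# /etc/systemd/system/zfs-local-import.service
[Unit]
Description=Import ZFS pools and mount ZFS file systems
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/sbin/modprobe zfs
ExecStart=/sbin/zpool import -a
ExecStart=/sbin/zfs mount -a

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable zfs-local-import.service and the pool comes up on every boot.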

I've now commented out the /etc/fstab entry for the xfs file system, made a soft-link, and the system is working perfectly ... apart from the VMs not being launched automatically on boot.  That is a systemd ordering issue, and a solvable one at that; I've just not spent the time on it yet.
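
For what it's worth, the likely fix (assuming the VMs are libvirt guests, and reusing the placeholder unit name from above) is a drop-in that orders the daemon after the pool is mounted:

# systemctl edit libvirtd.service, then add:
[Unit]
Requires=zfs-local-import.service
After=zfs-local-import.service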

Next up is gutting this system, and replacing the motherboard and CPU with something from this decade, as compared to the 2000s ...

And finally figuring out the long term backup I need for this.
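
The likely shape of that is zfs snapshots shipped off-box with send/receive, something like this (snapshot, host, and pool names are placeholders):

zfs snapshot storage/data@backup1
zfs send storage/data@backup1 | ssh backuphost zfs receive backuppool/data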
