BTRFS is effectively stable

Yeah, I know, the web page says it’s under heavy development. And its on-disk format can change. And its mailing list is chock full of patches.

But it passed our stability test: 100 iterations (3.2TB written and read back in all, and compared against checksums) of the following fio test case.

[global]
size=8g
iodepth=32
blocksize=1m
numjobs=4
nrfiles=1
ioengine=vsync
rw=write

[sw1]
create_serialize=0
create_on_open=1
#directory=/data
directory=/mnt/btrfs
verify=crc32c-intel
verify_async=8
group_reporting

This is our baseline test. Failing RAIDs, failing drives, failing SSDs tend not to pass this test. Borked file systems tend not to pass this test.
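As a sanity check on the 3.2TB figure, the data volume can be derived from the job file itself. The following is a minimal sketch, not part of the original test setup: the helper names (`parse_size`, `bytes_per_run`, `total_bytes`) are made up for illustration, and it assumes fio's default behaviour of treating the `g` suffix as a power of 1024.

```python
import configparser

# The job file from above, embedded as text so the sketch is self-contained.
JOB_TEXT = """
[global]
size=8g
iodepth=32
blocksize=1m
numjobs=4
nrfiles=1
ioengine=vsync
rw=write

[sw1]
create_serialize=0
create_on_open=1
#directory=/data
directory=/mnt/btrfs
verify=crc32c-intel
verify_async=8
group_reporting
"""

# Assumption: suffixes are powers of 1024 (fio's default kb_base).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text):
    """Convert a size such as '8g' or '1m' into bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def bytes_per_run(job_text):
    """Bytes written by one fio run: size per job times number of jobs."""
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(job_text)
    settings = parser["global"]
    return parse_size(settings["size"]) * int(settings["numjobs"])


def total_bytes(job_text, iterations):
    """Bytes written across all iterations of the job."""
    return bytes_per_run(job_text) * iterations


if __name__ == "__main__":
    gib = 1024**3
    print(f"per run:  {bytes_per_run(JOB_TEXT) / gib:.0f} GiB")
    print(f"100 runs: {total_bytes(JOB_TEXT, 100) / gib:.0f} GiB")
```

Four jobs of 8g each come to 32 GiB per run, so 100 runs write 3200 GiB, which is the "3.2TB" quoted above in round numbers. The same amount is read back during verification.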

When something passes this test, again and again (3rd time we’ve run it), and does so without fail, we call it safe.
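A repeated run like this could be scripted along the following lines. This is a minimal sketch, not the harness used for the test above: the job file name `sw1.fio` and the function names are hypothetical, and it assumes that fio exits with a non-zero status when a run fails, including a failed verification.

```python
import subprocess


def fio_command(job_file):
    """Build the argument list for one fio run of the given job file."""
    return ["fio", job_file]


def run_iterations(job_file, iterations, runner=subprocess.run):
    """Run the job repeatedly and stop at the first failure.

    Returns the number of iterations that passed before a failure,
    or `iterations` if every run passed. `runner` is injectable so
    the loop can be exercised without fio installed.
    """
    for completed in range(iterations):
        result = runner(fio_command(job_file))
        if result.returncode != 0:
            return completed
    return iterations


# Hypothetical usage (requires fio and a job file named sw1.fio):
#   passed = run_iterations("sw1.fio", 100)
#   print(f"{passed} of 100 iterations passed")
```

Stopping at the first failure keeps the failing state on disk for inspection instead of overwriting it with the next run.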

This said, it’s not terribly fast on reads/writes. This is a 2.6.36 kernel. We are still working out the kinks, and trying to understand IO with this kernel. So far I like it, and we may start very serious testing with this series (though lots of good patches appear to be scheduled for the .37 and .38 time frame). I’ve not been too happy with .35, as IO performance is about 1/2 of what it should be (compared with the same tests on the same box/hardware/disks/file system running 2.6.32.22.scalable).

BTRFS is stable enough now for use. It just won’t (yet) win benchmarks (not counting the stuff at Phoronix and others).

[update] Some have complained about this (see the comments below). There are valid points to their complaints … there are real issues remaining with it. You can use it for storing data and retrieving it. But be aware of the fill issues and the current lack of a repair tool.

This said, all of those are solvable issues. I am not concerned by these.

Do I suggest widespread use? Not until we have those repair tools, and some of the major issues with the fill crash are solved.

15 thoughts on “BTRFS is effectively stable”

  1. An IO performance regression such as a 50% difference between 2.6.35 and 2.6.32.22 is quite scary.

    Do you have any insight as to what the causes are?

    Are the kernel devs not actively tracking such large regressions and viewing them as showstoppers??

  2. @Bill

    Not yet (insight). Need to do some tracing. The per-BDI bits came in at 2.6.32, so this shouldn’t be the issue.

    Devs are focused upon correctness first, regressions second.

  3. That’s really useful to know Joe! I’ve been reading many reports of issues that people have had going from 2.6.32 -> 2.6.35 with btrfs so this really cheers me up. 🙂

    One issue to be aware of is that currently you cannot remove a missing drive from a btrfs-managed RAID array due to a bug (it tries to write to the missing device and your kernel will OOPS). It’s fixed in a different tree of Chris Mason’s which is still under development. It is unlikely to make 2.6.37 given Linus has indicated the merge window will close at the end of October.

  4. It’s not stable until it passes the laptop test. I used the 2.6.36 BTRFS code and got an OOPS in BTRFS code after a couple of S3 suspend-resume cycles.

    Can you try doing your tests with a couple of S3 suspend-resume cycles thrown in?

  5. @schmoe

    We don’t normally use these on our servers, so it’s not in our testing matrix.

    Then again, by this criterion, I am not sure ext4 is stable.

  6. I have suspended my laptop to S3 32 times so far over the last 10 days, and it’s still up and running fine. ext4 is rock solid on my laptop!

  7. @schmoe

    ext4 has had some issues on my laptop. I usually power down before closing. Reboot is fast enough that it’s not a problem.

    On the other points, several are with 2.6.35 and earlier. 2.6.35 isn’t something I’d call good. Several kernel versions weren’t very good. 2.6.34 crashed on lots of our hardware, we had random crashes with 2.6.35, 2.6.33.

    So far, 2.6.36 looks good. I’ll try some of the no-space testing. Won’t do the power yanks right now. We are waiting for the “fsck”-like tool (actually more akin to xfs_repair/xfs_check than fsck).

    Corrupted directory entries are no fun, and we do need the ability to detect/fix these.

    But the file system survives our load case. Understand that zfs-on-fuse doesn’t, and a fair number of others don’t. If it passes the load test, then we don’t have to worry so much about blowing the file system up by shoving too much data down the pipe. Our units can sink and source in excess of 2GB/s sustained for TB sized files, and we absolutely cannot tolerate a file system that isn’t able to take that load.

    The other points on being able to repair it, to clean it … yes, these are very important. But they aren’t the stability issue. We aren’t ready to ship the file system to our customers, but it is getting closer. FWIW, the file system didn’t pass our load test in .33, .34, and .35 days.

  8. I was able to get a stack trace from a hung kernel thread:

    This is what I did:

    [root@jr4-lab ~]# dd if=/dev/zero of=/data/d1/btrfs.file bs=1M count=4k oflag=direct
    4096+0 records in
    4096+0 records out
    4294967296 bytes (4.3 GB) copied, 6.40079 seconds, 671 MB/s

    [root@jr4-lab ~]# losetup /dev/loop0 /data/d1/btrfs.file

    [root@jr4-lab ~]# uname -r
    2.6.36.scalable

    [root@jr4-lab ~]# /opt/scalable/bin/mkfs.btrfs /dev/loop0

    WARNING! - Btrfs v0.19-16-g075587c-dirty IS EXPERIMENTAL
    WARNING! - see http://btrfs.wiki.kernel.org before using

    fs created label (null) on /dev/loop0
    nodesize 4096 leafsize 4096 sectorsize 4096 size 4.00GB
    Btrfs v0.19-16-g075587c-dirty

    [root@jr4-lab ~]# mount -o loop,noatime -t btrfs /dev/loop0 /data/d2
    [root@jr4-lab ~]# df -H
    Filesystem Size Used Avail Use% Mounted on
    /dev/md0 47G 7.7G 37G 18% /
    tmpfs 13G 0 13G 0% /dev/shm
    /dev/sdd2 14T 242G 14T 2% /mnt/btrfs
    /dev/sdc2 14T 177G 14T 2% /data/d1
    /dev/loop0 4.3G 58k 4.3G 1% /data/d2
    [root@jr4-lab ~]# time dd if=/dev/zero of=/data/d2/tempfile count=7000 bs=512k
    7000+0 records in
    7000+0 records out
    3670016000 bytes (3.7 GB) copied, 3.04983 seconds, 1.2 GB/s

    real 0m3.051s
    user 0m0.000s
    sys 0m2.010s
    [root@jr4-lab ~]# df -H
    Filesystem Size Used Avail Use% Mounted on
    /dev/md0 47G 7.7G 37G 18% /
    tmpfs 13G 0 13G 0% /dev/shm
    /dev/sdd2 14T 242G 14T 2% /mnt/btrfs
    /dev/sdc2 14T 177G 14T 2% /data/d1
    /dev/loop0 4.3G 3.5G 855M 81% /data/d2

    [root@jr4-lab ~]# time cp -ar /usr/lib* /data/d2/
    ^C^C^C^C^Z^Z^Z

    and it generated a call trace, and the process is unkillable.

    [108065.624575] INFO: task loop1:17826 blocked for more than 120 seconds.
    [108065.624599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [108065.624624] loop1 D 0000000100a43b3b 0 17826 2 0x00000080
    [108065.624626] ffff880173699560 0000000000000046 ffff880173699500 0000000000013300
    [108065.624660] ffff880173699fd8 ffff880334c4ddd0 ffff880334c4da40 ffff88033e1c9690
    [108065.624692] ffff880173699fd8 00000000810720a4 ffff880173699540 0000000000013300
    [108065.624724] Call Trace:
    [108065.624741] [] io_schedule+0x6e/0xc0
    [108065.624761] [] sync_page+0x3b/0x50
    [108065.624779] [] __wait_on_bit+0x55/0x80
    [108065.624798] [] ? sync_page+0x0/0x50
    [108065.624816] [] wait_on_page_bit+0x70/0x80
    [108065.624836] [] ? wake_bit_function+0x0/0x30
    [108065.624857] [] shrink_page_list+0x1de/0x780
    [108065.624877] [] ? finish_wait+0x60/0x80
    [108065.624896] [] ? autoremove_wake_function+0x0/0x40
    [108065.624938] [] shrink_inactive_list+0x283/0x2d0
    [108065.624980] [] shrink_zone+0x305/0x460
    [108065.625019] [] zone_reclaim+0x1e9/0x290
    [108065.625060] [] ? zone_watermark_ok+0x25/0xf0
    [108065.625100] [] get_page_from_freelist+0x65e/0x830
    [108065.625142] [] ? zone_reclaim+0x1e9/0x290
    [108065.625182] [] __alloc_pages_nodemask+0x125/0x820
    [108065.625226] [] ? alloc_buffer_head+0x1a/0x80
    [108065.625269] [] alloc_pages_current+0x9e/0x100
    [108065.625311] [] new_slab+0x1fb/0x2a0
    [108065.625350] [] __slab_alloc+0x1c9/0x5d0
    [108065.625390] [] ? radix_tree_preload+0x61/0xd0
    [108065.625431] [] kmem_cache_alloc+0xb8/0x150
    [108065.625471] [] radix_tree_preload+0x61/0xd0
    [108065.625511] [] add_to_page_cache_locked+0x32/0x120
    [108065.625553] [] add_to_page_cache_lru+0x25/0x70
    [108065.625594] [] grab_cache_page_write_begin+0x99/0xc0
    [108065.625637] [] ? blkdev_get_block+0x0/0x70
    [108065.625678] [] block_write_begin+0x3a/0x90
    [108065.625719] [] blkdev_write_begin+0x20/0x30
    [108065.625759] [] pagecache_write_begin+0x18/0x20
    [108065.625805] [] do_lo_send_aops+0xa7/0x1a0 [loop]
    [108065.625847] [] loop_thread+0x374/0x500 [loop]
    [108065.625888] [] ? do_lo_send_aops+0x0/0x1a0 [loop]
    [108065.625930] [] ? autoremove_wake_function+0x0/0x40
    [108065.625972] [] ? loop_thread+0x0/0x500 [loop]
    [108065.626013] [] kthread+0x96/0xa0
    [108065.626051] [] kernel_thread_helper+0x4/0x10
    [108065.626091] [] ? kthread+0x0/0xa0
    [108065.626129] [] ? kernel_thread_helper+0x0/0x10
    [108065.626170] INFO: task btrfs-transacti:17840 blocked for more than 120 seconds.
    [108065.626234] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [108065.626299] btrfs-transac D 0000000100a44756 0 17840 2 0x00000080
    [108065.626301] ffff880235acbc60 0000000000000046 ffff880235acbc00 0000000000013300
    [108065.626375] ffff880235acbfd8 ffff88033e2ec740 ffff88033e2ec3b0 ffff88033e2b1690
    [108065.626448] ffff880235acbfd8 00000000810720a4 ffff880235acbc40 0000000000013300
    [108065.626522] Call Trace:
    [108065.626553] [] io_schedule+0x6e/0xc0
    [108065.626591] [] sync_page+0x3b/0x50
    [108065.626629] [] __wait_on_bit+0x55/0x80
    [108065.626668] [] ? sync_page+0x0/0x50
    [108065.626706] [] wait_on_page_bit+0x70/0x80
    [108065.626746] [] ? wake_bit_function+0x0/0x30
    [108065.626787] [] btrfs_wait_marked_extents+0x127/0x140
    [108065.626830] [] btrfs_write_and_wait_marked_extents+0x38/0x60
    [108065.626894] [] btrfs_write_and_wait_transaction+0x26/0x50
    [108065.626938] [] btrfs_commit_transaction+0x49a/0x670
    [108065.626980] [] ? autoremove_wake_function+0x0/0x40
    [108065.627022] [] transaction_kthread+0x232/0x240
    [108065.627063] [] ? transaction_kthread+0x0/0x240
    [108065.627104] [] ? transaction_kthread+0x0/0x240
    [108065.627144] [] kthread+0x96/0xa0
    [108065.627182] [] kernel_thread_helper+0x4/0x10
    [108065.627222] [] ? kthread+0x0/0xa0
    [108065.627260] [] ? kernel_thread_helper+0x0/0x10
    [root@jr4-lab ~]#

    Obviously this shouldn’t happen. The unit stayed up without crashing, but I had to reboot it hard.

  9. When the corruption happened, there was no way to do anything about it. In fact, I did not know that btrfsck doesn’t do anything. It was the first time ever that I had to restore stuff from a full backup.

    I have been a btrfs fanboy ever since it came out but I gave up on it because the basic issues are not addressed. Looking at the mailing list emails scared me a bit about the quality of work going in.

    Oracle doesn’t seem to be too keen on it either. No major commits have happened for a while (months).

  10. @shmoe – fortunately Chris M. managed to get a merge request in (though Linus wasn’t entirely happy about it, but for git reasons more than anything) and so a number of fixes for things you mentioned have now hit the mainline for 2.6.37. Should also have a nice performance boost too with a mount option (though older kernels won’t be able to mount the FS once you’ve done that).

  11. @Chris

    I saw that message from Linus this morning. We are going to give 2.6.37 an early try (e.g. early rc releases). If the performance is there, we’ll start to look into the other aspects.

    Is anyone in particular working on the btrfs repair code? We’d want to give that a shot as well. I am working on a whole mess of other coding, so I don’t think I can commit time to more than testing other people’s code at this point. But we are very interested in getting Ceph up (and we should have a test siCluster running it in the lab and visible for SC10).

  12. @Joe, re: functioning btrfsck repair code, Chris Mason wrote on 27th August that:

    We’re still actively developing it. I don’t have a release date planned yet but we should have betas coming out over the next few months.

    I know Chris has also been busy with the Linux kernel Oracle pushed out, which might explain why things have been a bit slow btrfs-wise recently.
