Ceph updates

rbd is in testing. Have a look at the link, but here are some of the highlights.

The basic feature set:

  • network block device backed by objects in the Ceph distributed object store (rados)
  • thinly provisioned
  • image resizing
  • image export/import/copy/rename
  • read-only snapshots
  • revert to snapshot
  • Linux and qemu/kvm clients
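
The core idea behind that feature set is simple: the block-device image is striped across fixed-size objects in RADOS, and only objects that have actually been written need to exist (which is where the thin provisioning comes from). Here is a toy sketch of that chunking arithmetic; the 4 MiB object size and the function name are illustrative assumptions, not the real librbd layout or API:

```python
# Toy model of striping a block-device image across fixed-size objects,
# in the spirit of RBD over RADOS. Object size here is an assumption
# for illustration; this is not the real librbd code.

OBJECT_SIZE = 4 * 1024 * 1024   # assume 4 MiB objects

def byte_range_to_objects(offset, length, object_size=OBJECT_SIZE):
    """Map an (offset, length) byte range of the image to a list of
    (object_index, offset_within_object, byte_count) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        idx = offset // object_size
        off_in_obj = offset % object_size
        n = min(object_size - off_in_obj, end - offset)
        pieces.append((idx, off_in_obj, n))
        offset += n
    return pieces

# A 6 MiB write starting at the 3 MiB mark spans objects 0, 1, and 2;
# in a thinly provisioned image, untouched objects are never created.
print(byte_range_to_objects(3 * 1024**2, 6 * 1024**2))
```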

We are doing something like this now, to a degree, with a mashup of tools in our target.pl creator, though likely not as nice or clean as this.
Ceph builds upon BTRFS, which is an excellent underlying file system maturing alongside Ceph. BTRFS has been called Linux's answer to ZFS, but if you go through a detailed design comparison, you will see that BTRFS gets a number of things right that ZFS doesn't. From the article at the always wonderful LWN.net:

In my opinion, the basic architecture of btrfs is more suitable to storage than that of ZFS. One of the major problems with the ZFS approach – “slabs” of blocks of a particular size – is fragmentation. Each object can contain blocks of only one size, and each slab can only contain blocks of one size. You can easily end up with, for example, a file of 64K blocks that needs to grow one more block, but no 64K blocks are available, even if the file system is full of nearly empty slabs of 512 byte blocks, 4K blocks, 128K blocks, etc. To solve this problem, we (the ZFS developers) invented ways to create big blocks out of little blocks (“gang blocks”) and other unpleasant workarounds. In our defense, at the time btrees and extents seemed fundamentally incompatible with copy-on-write, and the virtual memory metaphor served us well in many other respects.
In contrast, the items-in-a-btree approach is extremely space efficient and flexible. Defragmentation is an ongoing process – repacking the items efficiently is part of the normal code path preparing extents to be written to disk. Doing checksums, reference counting, and other assorted metadata busy-work on a per-extent basis reduces overhead and makes new features (such as fast reverse mapping from an extent to everything that references it) possible.
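
The fragmentation problem described above is easy to see in a toy model: if every slab holds blocks of exactly one size class, free space in one class cannot satisfy a request from another. The class and method names below are made up for illustration; this is a sketch of the allocation pattern, not real ZFS code:

```python
# Toy model of a slab allocator where each slab holds blocks of exactly
# one size class, as in the ZFS approach described above. Names are
# illustrative; this is not real ZFS code.

class Slab:
    def __init__(self, block_size, free_blocks):
        self.block_size = block_size   # every block in this slab is this size
        self.free = free_blocks        # number of free blocks remaining

class SlabAllocator:
    def __init__(self):
        self.slabs = []

    def add_slab(self, block_size, free_blocks):
        self.slabs.append(Slab(block_size, free_blocks))

    def alloc(self, size):
        """Allocate one block of exactly `size`. Fails if no slab of that
        size class has a free block, even if other classes have space."""
        for slab in self.slabs:
            if slab.block_size == size and slab.free > 0:
                slab.free -= 1
                return True
        return False

    def free_space(self):
        return sum(s.block_size * s.free for s in self.slabs)

a = SlabAllocator()
a.add_slab(512, 1000)    # plenty of free 512-byte blocks
a.add_slab(4096, 500)    # plenty of free 4K blocks
a.add_slab(65536, 0)     # the 64K slab is exhausted

print(a.free_space())    # well over 2 MB free overall...
print(a.alloc(65536))    # ...yet a 64K allocation still fails: False
```

An extent-based allocator, by contrast, can carve a 64K extent out of any sufficiently large run of free space, which is the flexibility the quote credits to the items-in-a-btree approach.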

Yeah, I know. This will bring the “zfs is the last file system you will ever need” folks out of the woodwork. Whatever.
The point is, Ceph builds upon BTRFS and exploits much of its goodness.
In the lab, we have our stable kernel, and I've tried some BTRFS bits. There are still some crashes, but we also have a testing kernel, and it looks like 2.6.36 is going to pop soon, so we might just wait for that for our next testing group. Our stable kernels have been 2.6.23.x, 2.6.28.x, and 2.6.32.x, and we are planning on a .36 or .37 kernel as the next one.
Also, we hope to have in the lab soon an siCluster-based testbed specifically for these nice parallel file systems.
Hopefully more news on this soon.

3 thoughts on “Ceph updates”

  1. Hi Joe,
Be warned, there is a nasty performance regression in 2.6.35 with btrfs in some configurations (details still unclear – one person reports it’s fine with btrfs in a VM on his hardware but goes like a slug on the bare metal). Not sure if it’s related to the DIRECT_IO issue where requests are getting split up – see https://patchwork.kernel.org/patch/119333/ .

  2. @Chris
    Thanks for the heads up. We’ll grab that patch and include it in our build. Should be available for testing tomorrow.
    Oddly, I’ve not been seeing traffic from btrfs list. I wonder if my subscription was axed … hmmm…

Comments are closed.