"New" File systems worth watching

The day job currently has siClusters in the field with GlusterFS, Lustre, and a few other “older” parallel file systems.
GlusterFS is a distributed file system with a very interesting and powerful design concept. It is under active development by a venture-backed company, Gluster, Inc. I can’t say enough good things about it, and the company behind it. The day job is in a relationship with them, so you may take this information for what it’s worth, and weigh it accordingly. Our view is, generally, that they have something very close to “the right design” going forward. There are occasional issues that pop up, usually connected with InfiniBand, that we can’t necessarily fault Gluster for, but they do bear the brunt of errors in the transport stacks. We’ve seen this derail installs at one location, during what was effectively corner-case testing … these weren’t Gluster issues per se, they were pretty definitively IB stack issues, but ones that couldn’t be easily worked around.

The day job uses Lustre as well. Well, we are getting back into using Lustre after a several-year hiatus. Where Gluster is neat and simple in overall design and implementation, Lustre is a complex beast, with many moving parts. This is a recipe for problems unfortunately, and we have encountered our fair share during bringup. Lustre’s design is an older one, with several centralized servers. This effectively rate limits scalability, and makes stability a function of the least stable centralized resource. This is a concern for us, as we have seen customers blame file systems for every problem they encounter, regardless of the merits of such blame. Having a system with a designed-in SPoF (single point of failure) is IMO a very bad idea. Sort of like putting permanent storage on RAID0. Yeah, it’s fast; no, you really don’t want to do this. It looks like Lustre 2.0 will remove the SPoF of the MDS/MGS device. But its kernel dependencies will again limit its utility across lots of installations.
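To put rough numbers on the SPoF argument, here is a minimal sketch of the usual availability arithmetic. It assumes components fail independently, and every number in it is made up for illustration; none of it is measured from Lustre, Gluster, or any real installation.

```python
from math import prod


def series_availability(components):
    """Availability when every component must be up at the same time."""
    return prod(components)


def parallel_availability(replicas):
    """Availability when at least one of the replicas must be up."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Hypothetical availabilities, chosen only for illustration.
storage_tier = 0.999
metadata_server = 0.99

# One centralized metadata server: the whole system is a series chain,
# so it can never be more available than its weakest link.
single_mds = series_availability([storage_tier, metadata_server])

# Same system with a second, redundant metadata server.
redundant_mds = series_availability(
    [storage_tier, parallel_availability([metadata_server, metadata_server])]
)

# The RAID0 analogy: eight striped disks, all of which must survive.
raid0_eight_disks = series_availability([0.99] * 8)

print(f"single MDS:    {single_mds:.5f}")
print(f"redundant MDS: {redundant_mds:.5f}")
print(f"8-disk RAID0:  {raid0_eight_disks:.5f}")
```

The point of the sketch is only the shape of the result: anything in series drags the total down to below its least reliable member, and adding stripes or centralized servers makes that worse, not better.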
Adding to our concerns has been the recent acquisition of Sun by Oracle. Cute/funny derisive versions of their combined name aside, we have very real business dependencies upon several pieces of their HPC stack, and while Marc Hamilton (now VP of HPC sales at Oracle) indicates that they have a long life, we are … concerned … about some of these. VirtualBox is IMO an excellent tool. GridEngine … we’ve got a love/hate relationship with it … it works well when it works, and when it fails, it can be really annoying. Lustre for siCluster is definitely an option and something we can offer (same hardware, select the parallel file system of your choice if you don’t want the default). Even for OpenSolaris, something we’ve not seen many requests for (though more for that than for Solaris itself), we have an interesting use case within siCluster.
Needless to say, the changes make us (and our customers) nervous.
These are the historical systems. But what about the “new” systems?
First, there is Ceph. Ceph is a distributed object store done right. We have set up a few test systems with it, and will get more aggressively into it later this year, including (likely) hosting it as a test option on an internal siCluster for customers to play with. It has a clustered MDS, and will use btrfs as the backend data store. Btrfs is something like a better zfs than zfs. Btrfs is part of the Linux kernel, and is being developed by Chris Mason and others, at Oracle. Some might point out the “missing raidz*” as a reason zfs is “better” than btrfs, but I’d not harp on that point too heavily, as btrfs will sit nicely upon the md/lvm/… bits, so it gets all the goodness of those as well.
But Ceph isn’t the only one of interest for siCluster. We are also looking at Tahoe-LAFS and Twisted Storage, among others. Which makes the most sense is very application dependent. With Twisted, we see whole new vistas of possible offerings opening up … some nice business models we can enable. Tahoe-LAFS is interesting in that it provides something akin to provably secure distributed data storage, something we think is going to become tremendously important in any cloud storage scenario, where data can span multiple legal regimes, some of which might not be friendly to the content stored.
This latter issue, spanning legal regimes for cloud storage, is one that hasn’t seen any testing in any court case that I am aware of (chime in if you know otherwise). Moreover, being able to deal with a loss of data from that legal regime’s confiscation of servers is going to become just as important.
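The "lose every server in one legal regime" scenario can be sketched with a k-of-n share model, where a file is split into n shares and any k of them are enough to rebuild it. The code below is an illustration under that assumption only: the regime names, share counts, and function names are hypothetical, and this is not Tahoe-LAFS code or its placement algorithm.

```python
from collections import Counter


def place_shares(total_shares, groups):
    """Assign shares to groups round-robin, so group counts differ by at most one."""
    return [groups[i % len(groups)] for i in range(total_shares)]


def survives_loss_of_any_group(placement, needed_shares):
    """True if at least `needed_shares` shares remain after losing any one group."""
    per_group = Counter(placement)
    total = len(placement)
    return all(total - lost >= needed_shares for lost in per_group.values())


# Hypothetical setup: 10 shares spread over servers in three legal regimes.
regimes = ["regime-a", "regime-b", "regime-c"]
placement = place_shares(10, regimes)

# With any 3 shares sufficient, confiscating one regime's servers is survivable.
ok_with_3 = survives_loss_of_any_group(placement, needed_shares=3)

# Requiring 7 shares is not: the largest regime holds 4 of the 10.
ok_with_7 = survives_loss_of_any_group(placement, needed_shares=7)

print(Counter(placement), ok_with_3, ok_with_7)
```

Spreading shares evenly matters here because the worst case is set by the group holding the most shares, so an uneven placement can quietly turn one regime into a single point of failure.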
I won’t go into what Twisted will let us do right now. The day job had an initial, very good call with them, and we have some very interesting ideas on this.
Of course, some of those ideas require a bit of cash to make happen, and this isn’t a friendly time to be raising capital (long story here).
These are some of the options that we are working on for storage going forward. We think that some of these concepts could be quite interesting to the market.

11 thoughts on “"New" File systems worth watching”

  1. Thanks for mentioning Tahoe-LAFS! We love bug reports, so if you decide not to keep going on Tahoe-LAFS, or if you do keep going but you hit some bumps, please let us know by mail to tahoe-dev@allmydata.org or by opening a trac ticket at http://tahoe-lafs.org . By the way, does this mean the Twisted Storage project is still going? Their web site doesn’t have any new news from recent years. If I recall correctly that project was also using the Twisted Python engine, just like Tahoe-LAFS does.

  2. @Zooko
    We are definitely looking at Tahoe-LAFS, and want to poke around with it more. It looks like it solves some of what we consider the harder aspects of true cloud storage.
    Twisted has been turned into a company. It’s into the content-accessible side of things. They appear to have solved another aspect of things that we like to see.
    I can see customers for all of these types of products.

  3. “Tahoe-LAFS is interesting in that it provides something akin to provably secure distributed data storage, something we think is going to become tremendously important in any cloud storage scenario, where data can span multiple legal regimes, some of which might not be friendly to the content stored.”
    One of the things we’re likely going to be working on in the next couple of versions is the ability to evenly distribute shares between groups of nodes that are co-located, or otherwise have correlated failures. This would enable you to set up a grid that guarantees some level of remaining redundancy if all shares are lost from all nodes in a group.

  4. Hi Joe,
    There’s also ExoFS in the kernel, which I’ve not played with yet. It was merged for 2.6.30 and (according to Linus) “implements a filesystem on top of an external object store (ie not a traditional storage of a linear array of anonymous blocks, but a “smart” disk that does objects)”.

  5. @Chris
    Yeah, I’ve seen Exofs. It was developed in part by Panasas. It looks interesting, especially since, if you take an intelligent hashing algorithm and a few other bits, you can build an in-kernel Gluster-like system. Or Gluster could leverage Exofs themselves to avoid going through FUSE …

  6. Hi Joe (?),
    Zooko pointed me at this post. I’m not really in the uber cool storage space, so I only half followed what you’re talking about. 🙂
    However, the reason I’m commenting is that I did follow the part about how “Twisted has been turned into a company.” Actually, I wanted to correct this part.
    As a core Twisted developer and a board member of the unofficial Twisted Software Foundation, I can assure you that “Twisted Storage” is in no official capacity affiliated with the Twisted project. It’s a separate project with a separate development team. I can understand your confusion, since the Twisted Storage website uses language which seems to conflate the two projects (not to mention the confusing name).
    I’m not suggesting there’s anything technically wrong with “Twisted Storage”, I just want to try to clarify the point about the relationship it has with the Twisted project. Hopefully this is just a point of confusion about naming. If there’s anything else I can help clarify, feel free to get in touch with me.
    Good luck with your project!

  7. @Joe,
    Regarding btrfs – it does have its own RAID code built into it (and SSD optimisations) as it needs to know which devices are returning blocks with corrupted checksums so it can reconstruct from others, so whilst it can sit on top of MD/LVM that’s not really what it’s designed for.

  8. It does RAID 0, 1 and 10 and there are patches for raid456 but they’ve not been merged yet. My bad, I’ve not been keeping up with the list and had assumed they’d gone in, but I can’t see what became of them.
    Interestingly it may well be that MD starts using some of btrfs’s infrastructure – namely their work queue implementation (see the “more raid456 thread pool experimentation” thread on linux-btrfs in March).

  9. Hey, I never have any ideas about GlusterFS, when I just searched on Google, I have got your page, and It seems, it is very useful, I had got some ideas about GlusterFS. Cheers dude.
