What are xfs's real limits?

Over at Enterprise Storage Forum, Henry Newman and Jeff Layton started a conversation that needs to be shared. This is a very good article.
In it, they reproduced a table comparing file systems, taken from this page at Redhat. The table compares the theoretical and practical “limits” of the file systems shipped across the various versions of the RHEL platform, and what you can do in each version.
This said, the table gets some things wrong. This isn’t a criticism of Henry and Jeff; it is a criticism of the table. We know it’s wrong, unless Redhat purposely limited the xfs code, which is not likely.
Remember, Redhat had to be dragged kicking and screaming into supporting xfs by its customer base, who were already using it. This support is … well … similar to the concept of damning with faint praise. It’s support in a passive-aggressive manner.
The RHEL table claims that the maximum size of an xfs file and/or file system is 100TB. It further claims that GFS is effectively much better than this.
Ok, I won’t fisk the GFS claim. I will fisk the xfs claim. Limits for xfs can be found here, at the source. Quoting them:

Maximum File Size
For Linux 2.4, the maximum accessible file offset is 16TB on 4K page size and 64TB on 16K page size. For Linux 2.6, when using 64 bit addressing in the block devices layer (CONFIG_LBD), file size limit increases to 9 million terabytes (or the device limits).
Maximum Filesystem Size
For Linux 2.4, 2 TB. For Linux 2.6 and beyond, when using 64 bit addressing in the block devices layer (CONFIG_LBD) and a 64 bit platform, filesystem size limit increases to 9 million terabytes (or the device limits). For these later kernels on 32 bit platforms, 16TB is the current limit even with 64 bit addressing enabled in the block layer.

This pretty much says it all. I’ve personally built file systems larger than 100TB with xfs, both in the past and, as you will see in a moment, right now. So unless Redhat altered the xfs source … their table is wrong. And I’d like to either get confirmation of their change (which would permanently rule out using a RHEL kernel), or confirmation that they will correct the table.
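For the record, those numbers are just address-width arithmetic. Here is a quick sanity check; the shift counts are my reading of the quoted limits, not anything taken from the xfs docs themselves:

```shell
#!/bin/sh
# 2.6 kernels with CONFIG_LBD on a 64-bit platform: block addresses are
# 64-bit signed, so the cap is 2^63 bytes = 2^23 TiB, i.e. the
# "9 million terabytes" (about 9.2 million decimal TB) quoted above.
echo "64-bit limit: $(( 1 << 23 )) TiB"

# 32-bit platforms index pages with a 32-bit value:
# 2^32 pages * 4 KiB/page = 2^44 bytes = 16 TiB.
echo "32-bit, 4K pages: $(( (1 << 44) / (1 << 40) )) TiB"

# With 16K pages that becomes 2^46 bytes = 64 TiB, matching the 2.4 numbers.
echo "32-bit, 16K pages: $(( (1 << 46) / (1 << 40) )) TiB"
```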
You can google around for the larger file system bits with xfs, but let me give you a demo on our lab gear. Real simple demo: I will create a sparse 1.1PB backing file, build an xfs file system on it, mount it, and use it.

[root@jr5-lab x]# dd if=/dev/zero of=big.data bs=1G count=1 seek=1M
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.33821 seconds, 802 MB/s
[root@jr5-lab x]# ls -alF
total 1048576
drwxr-xr-x  2 root root               21 Jun  3 00:11 ./
drwxrwxrwt 11 root root              165 Jun  3 00:10 ../
-rw-r--r--  1 root root 1125900980584448 Jun  3 00:11 big.data
[root@jr5-lab x]# ls -alFh
total 1.0G
drwxr-xr-x  2 root root   21 Jun  3 00:11 ./
drwxrwxrwt 11 root root  165 Jun  3 00:10 ../
-rw-r--r--  1 root root 1.1P Jun  3 00:11 big.data
[root@jr5-lab x]# losetup  /dev/loop1 big.data
[root@jr5-lab x]# mkfs.xfs  /dev/loop1
meta-data=/dev/loop1             isize=256    agcount=1025, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=274878169088, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@jr5-lab x]# mkdir /mnt/thats_ah_lotta_data
[root@jr5-lab x]# mount /dev/loop1 /mnt/thats_ah_lotta_data
[root@jr5-lab x]# df -h /mnt/thats_ah_lotta_data
Filesystem            Size  Used Avail Use% Mounted on
/dev/loop1            1.0P   37M  1.0P   1% /mnt/thats_ah_lotta_data
[root@jr5-lab x]# cd /mnt/thats_ah_lotta_data/
[root@jr5-lab thats_ah_lotta_data]# touch you-betcha
[root@jr5-lab thats_ah_lotta_data]# ls -alF
total 8
drwxr-xr-x 2 root root   23 Jun  3 00:30 ./
drwxr-xr-x 3 root root 4096 Jun  3 00:13 ../
-rw-r--r-- 1 root root    0 Jun  3 00:30 you-betcha
[root@jr5-lab thats_ah_lotta_data]# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/loop1            1.0P   37M  1.0P   1% /mnt/thats_ah_lotta_data

The fingers never left the hand. No rabbits (jack or otherwise) were harmed in the making of this file system.
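The dd incantation above works because seeking a million 1GB blocks before writing a single one produces a sparse file: the apparent size is ~1.1PB, but only the 1GB actually written (plus metadata) occupies disk. A minimal sketch of the same trick, scaled down (GNU stat assumed; the path is illustrative):

```shell
#!/bin/sh
# Create a sparse file: seek 1023 MiB into the file, then write one 1 MiB block.
dd if=/dev/zero of=/tmp/sparse.demo bs=1M count=1 seek=1023 2>/dev/null

# Apparent size: 1024 MiB (the seek offset plus the 1 MiB written).
stat -c 'apparent: %s bytes' /tmp/sparse.demo

# Actual allocation: %b blocks of %B bytes -- only about 1 MiB really on disk.
stat -c 'allocated: %b blocks of %B bytes' /tmp/sparse.demo

rm -f /tmp/sparse.demo
```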
Since this file system is 10x larger than the maximum Redhat’s table claims for xfs, I think we can call their myth “busted”. That is, unless they have made a change (which really should be undone). They either got this wrong, or they did something really dumb. I am hoping that it was a naive marketing type rather than an ill-considered engineering decision to promote their own file system (GFS) over the technology that actually works.
Let’s hope we get a response from them. I’ll ding some of the folks I know there and try to get them to fix the table. Because we know they would never do the other thing. That would be … er … bad.
The kernel here is our flavor. We are testing it for stability/performance/etc with our tunes/tweaks/patches. xfs is built in, and we use a somewhat updated xfsprogs. And a number of other things thrown in. All this sitting atop …

[root@jr5-lab thats_ah_lotta_data]# cat /etc/redhat-release
CentOS release 5.6 (Final)

Henry and Jeff didn’t get this wrong. The table is wrong. It’s an annoyance, but it doesn’t detract from the quality of the article. Their points are still completely valid, and dead on the money: metadata performance that is sub-par and non-scalable will lead to all manner of performance problems.

9 thoughts on “What are xfs's real limits?”

  1. Hi Joe,
    just to confirm your assumptions on a RHEL kernel, I successfully ran the same commands as you described on one of our test systems to get a 1PB xfs:
    $ mount | grep loop
    /dev/loop1 on /mnt/thats_ah_lotta_data type xfs (rw)
    $ df -h /mnt/thats_ah_lotta_data/
    Filesystem Size Used Avail Use% Mounted on
    /dev/loop1 1.1P 37M 1.1P 1% /mnt/thats_ah_lotta_data
    $ ls -lh /mnt/thats_ah_lotta_data/
    total 0
    -rw-r--r-- 1 root root 0 Jun 3 13:21 you-betcha
    $ uname -a
    Linux fhgfs04 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:53:46 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
    $ cat /etc/redhat-release
    Scientific Linux SL release 5.5 (Boron)

  2. (1) The company name is Red Hat, not Redhat.
    (2) Red Hat constantly changes the XFS source, because they are the upstream for XFS and employ practically all of the XFS developers. Many of these changes are specifically to make it scale *better* than before.
    (3) There’s a difference between what a technology is capable of and what a company explicitly supports. Folks here at Red Hat, including my own manager and peers, regularly test 100TB or billion-file workloads. The main constraints here are time and budget for the necessary equipment. I don’t speak for Red Hat, but if a customer came to us asking for support of larger filesystems we’d certainly be receptive to the idea. The limits just mean that such support doesn’t come automatically with every subscription.
    (4) Henry and Jeff characteristically fail to mention the point that if you have that much data you probably don’t want it on a single server for either performance or availability reasons. Contrary to what’s said in the article, several petabyte-scale filesystems do exist and are deployed at quite a few places. They’re just distributed, which BTW also means that things like fsck can be done in parallel. Of course you know this, Joe, but anyone who talks about high-scale filesystems without even mentioning the distributed option should have their soapbox taken away.
    Disclaimer: for those readers who might not be aware, I’m an “associate” at Red Hat. I’m also the project leader for CloudFS, which is probably at least somewhat relevant to this conversation. I’m not here to speak for either Red Hat or CloudFS, though. I’m just elaborating a bit on things that Joe already said, based on the perspective those roles give me.

    • @Jeff
      1) my apologies. There’s a long story I occasionally tell folks about when I joined SGI. It was my second day at the company, and it surprised me that one of their senior VPs of marketing got hung up on a question about the company’s product and competition … so much so that he was unable or unwilling to answer … as he was insisting upon correcting the name. It isn’t SGI (which everyone called it), it’s Silicon Graphics. At that point, my second day in the company, I had one of them “Oh Feces” moments. The company had an expiration date on it, before it would join “The Loyal Order of the Terminally Boned”, as it could not focus on what needed to be focused upon. I guessed 5 years; I was off by about a year.
      Why I mention this: while the company may be “Red Hat” or “The Red Hat Company” or any other variation thereof, this is inconsequential to the discussion. Once the hierarchy starts focusing on that, it’s time to jump ship.
      I’m not being critical of you on this, just pointing out that when a company is more focused upon its name than the issues it needs to address, that’s a sign of disease and necrosis.
      I hope “Red Hat” is not there. Many companies do get there.
      FWIW: We respond to SI, Scalable, Scalable Informatics, and occasionally Ham-n-eggs, Gynn-n-tonix, etc. 🙂 (this is the obligatory Hitchhikers Guide reference)
      2) My meaning must have been missed, and I apologize. Specifically, I am talking about inserting artificial limits into the source, not bug fixes or integration. That is, Eric Sandeen, Dave Chinner, etc. all contribute to the xfs base (as do many others). But none of them are purposely putting limits into the code base. It’s that latter thing I think is stupid. And I am absolutely convinced that Red Hat (the company) is not doing that to xfs. Which means that the table is wrong.
      3) Of course there are differences. There’s theory, and practice. Theory is nice, but at the end of the day, it’s what you can do with it that matters. And this is what we like to test.
      4) I’m not sure it’s “characteristically” … Henry and Jeff are bright guys, with lots of distributed storage chops. This said, I think this is an area we are all in violent agreement on.
      Even more to the point, I have huge concerns about large storage densities behind single controller heads. This model simply doesn’t scale, and is something we don’t generally like. Many of our competitors use this model. I often have to educate customers about why stringing hundreds of SAS/SATA/FC disks behind a cascading set of units is a really … really bad idea. This model is a fail for more than one reason … I am guessing that makes it a candidate for an epic failure? I dunno.
      But yes, you need to go distributed storage for huge capacities. This brings on many other interesting and additional problems which, you know very well, and often blog about (and I learn from reading your posts, and see different points of view than those I had considered, which is a good thing).
      Basically, I am convinced the table is a simple marketing oversight, and I doubt any significant nefarious intent. Though past history (prior to you joining them) with Red Hat (the company), suggests that there might be at least a little internal promotion and external negation going on.

  3. @Jeff Darcy: As a parallel file system developer, I totally agree that it often might not make sense to go to more than 100TB today with a single box for a number of reasons (As usual, there is probably also the higher-than-we-think number of non-standard users out there, with use-cases for which it actually might make total sense).
    However, I think the point here is that the Red Hat comparison table creates the impression that 100TB is an actual limit of xfs. Because the headline says
    “… limits (supported[/theoretical])”
    So in this case there should be a “100TB / ” in the xfs row, right?

    • @Sven
      In particular, the issue I am most concerned with is the storage bandwidth wall concept, in that a single box won’t be able to fsck its own file system in a realistic amount of time. I’ve talked about it in the blog a bit, but I am loath in general to build a file system that requires more than a day to read/write in its entirety.
      That limit, 1 day, is a simple approximation, a rough rather than rigorous definition of “reasonable” performance. Anything longer than a day could cause issues in the administration/use of a system. Anything shorter is goodness, though the price for lowering that wall height needs to be taken into account in the value proposition for any storage system. That is, if you could make it 100x faster, is it worth 100x the price? There’s an economic principle at work here, in addition to pragmatic ones. The storage bandwidth wall is an attempt to define a metric that we can use to help articulate this better for users.
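      To make that concrete, the wall is just capacity divided by sustained bandwidth, compared against the 86400 seconds in a day. The numbers below are illustrative, not measurements:

```shell
#!/bin/sh
# Storage bandwidth wall: time for one full read/write pass of a file system.
capacity_tb=100          # illustrative: a 100TB file system
bandwidth_mb_s=2000      # illustrative: 2 GB/s sustained to disk

seconds=$(( capacity_tb * 1000 * 1000 / bandwidth_mb_s ))
echo "full pass: ${seconds} s ($(( seconds / 3600 )) hours)"

# The 1-day rule of thumb, inverted: the largest file system this
# bandwidth can traverse in 86400 seconds.
max_tb=$(( bandwidth_mb_s * 86400 / 1000 / 1000 ))
echo "1-day wall at ${bandwidth_mb_s} MB/s: ${max_tb} TB"
```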
      BTW: We really like FhGFS here 🙂

  4. Sven: I understand that it would be nice to have a theoretical-limit number after the slash in each of those boxes, but what number? Largest configuration that has been assembled? Largest configuration that has ever survived something beyond setup/teardown activity? Developer’s guess for either? “None known”? The limits are rarely table sizes or bit counts any more, but are more often “soft limits” where the performance of some data structure or algorithm no longer yields good performance. Where the “limit” is depends on how “good performance” is defined, and that’s a notoriously tricky problem.
    I suspect that the fields were left blank because nobody could figure out what to put there that would be useful without lengthy explanation. At that point, might as well encourage people to ask. Again, though, not speaking for Red Hat etc. I just know that as an engineer I’d be reluctant to give numbers that somebody might take as gospel on any basis less than the testing that goes into the “supported” numbers.

    • @Jeff
      I agree … if I put a number down, I want to be able to back it up. If I haven’t measured it, I’ll report it as a theoretical number, and be explicit about it. If I have measured it, then I want to make sure that my measurement is sound (so I’ll make my measurement repeatable/transportable so others can measure as well).
      Actually, this gets into a discussion of testing, benchmarking, and other bits that I’d like to have. I’ve been saving this up for a really long time as I wanted my thoughts to gel on it, and they largely have. I’ll do a post for this soon.

  5. @Joe: Yes, I think the storage wall is a really nice and intuitive way to avoid or show misconceptions in storage design. But from reading your blog posts, I somehow get the impression that there are more than just a few people out there that still don’t seem to really understand it 😉

  6. Personally I find the 1-day fsck to be a bit of a red herring. If you have gotten to the point where a journal replay would not bring a filesystem clean, you probably have bigger problems with the reliability of the file data as much as the metadata, and should probably be looking at backups. If the data is really that important you must restore from a known good source rather than rely on rebuilding the metadata alone (since most file systems do not contain block-level checksums).
