Over at Enterprise Storage Forum, Henry Newman and Jeff Layton started a conversation that needs to be shared. This is a very good article.
In it, they reproduced a table comparing file systems, taken from this page at Redhat. The table compares the theoretical and practical “limits” of each file system across the various versions of the RHEL platform.
That said, the table gets some things wrong. This isn’t a criticism of Henry and Jeff; it is a criticism of the table. We know it’s wrong, unless Redhat purposefully limited the xfs code. Which is not likely.
Remember, Redhat had to be dragged kicking and screaming into supporting xfs by the customers who were already using it. This support is … well … similar to the concept of damning with faint praise. It’s support in a passive-aggressive manner.
The RHEL table claims that the maximum size of an xfs file and/or file system is 100TB. It further claims that GFS is effectively much better than this.
Ok, I won’t fisk the GFS claim. I will fisk the xfs claim. Limits for xfs can be found here, at the source. Quoting them:
Maximum File Size
For Linux 2.4, the maximum accessible file offset is 16TB on 4K page size and 64TB on 16K page size. For Linux 2.6, when using 64 bit addressing in the block devices layer (CONFIG_LBD), file size limit increases to 9 million terabytes (or the device limits).
Maximum Filesystem Size
For Linux 2.4, 2 TB. For Linux 2.6 and beyond, when using 64 bit addressing in the block devices layer (CONFIG_LBD) and a 64 bit platform, filesystem size limit increases to 9 million terabytes (or the device limits). For these later kernels on 32 bit platforms, 16TB is the current limit even with 64 bit addressing enabled in the block layer.
This pretty much says it all. I’ve personally built file systems larger than 100TB with xfs (in the past, and again in a moment, as you will see). So unless Redhat altered the xfs source … their table is wrong. And I’d like to get either confirmation of their change (which would permanently rule out using a RHEL kernel), or confirmation that they will correct the table.
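As a rough sanity check on the quoted numbers, here is a small Python sketch of where such limits could come from. The mapping of each limit to a cause (a signed 64 bit byte offset, a 32 bit page index, a 32 bit sector count with 512 byte sectors) is my assumption and is not stated in the quoted text; the arithmetic simply happens to land on the same figures.

```python
# Back-of-the-envelope arithmetic for the quoted xfs limits.
# ASSUMPTION: the causes named in the comments are a plausible reading,
# not something the quoted text itself states.

TIB = 2**40    # binary terabyte
TB = 10**12    # decimal terabyte

# A signed 64 bit byte offset tops out at 2**63 bytes.
max_offset_bytes = 2**63
million_tb = max_offset_bytes / TB / 1e6   # in millions of decimal TB

# A 32 bit page index times the page size.
limit_4k_pages = 2**32 * 4 * 1024          # 4K pages
limit_16k_pages = 2**32 * 16 * 1024        # 16K pages

# A 32 bit sector count times 512 byte sectors.
limit_32bit_sectors = 2**32 * 512

print(f"2**63 bytes         = {million_tb:.2f} million TB")
print(f"32 bit index, 4K    = {limit_4k_pages // TIB} TiB")
print(f"32 bit index, 16K   = {limit_16k_pages // TIB} TiB")
print(f"32 bit sectors, 512 = {limit_32bit_sectors // TIB} TiB")
```

None of these figures is anywhere near 100TB, which is the point: nothing in the quoted limits suggests a 100TB ceiling.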
You can google around for the larger file system bits with xfs, but let me give you a demo on our lab gear. Real simple demo. I will create a 1.1PB file system and mount it. And use it.
[root@jr5-lab x]# dd if=/dev/zero of=big.data bs=1G count=1 seek=1M
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.33821 seconds, 802 MB/s
[root@jr5-lab x]# ls -alF
total 1048576
drwxr-xr-x 2 root root 21 Jun 3 00:11 ./
drwxrwxrwt 11 root root 165 Jun 3 00:10 ../
-rw-r--r-- 1 root root 1125900980584448 Jun 3 00:11 big.data
[root@jr5-lab x]# ls -alFh
total 1.0G
drwxr-xr-x 2 root root 21 Jun 3 00:11 ./
drwxrwxrwt 11 root root 165 Jun 3 00:10 ../
-rw-r--r-- 1 root root 1.1P Jun 3 00:11 big.data
[root@jr5-lab x]# losetup /dev/loop1 big.data
[root@jr5-lab x]# mkfs.xfs /dev/loop1
meta-data=/dev/loop1             isize=256    agcount=1025, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=274878169088, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@jr5-lab x]# mkdir /mnt/thats_ah_lotta_data
[root@jr5-lab x]# mount /dev/loop1 /mnt/thats_ah_lotta_data
[root@jr5-lab x]# df -h /mnt/thats_ah_lotta_data
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 1.0P 37M 1.0P 1% /mnt/thats_ah_lotta_data
[root@jr5-lab x]# cd /mnt/thats_ah_lotta_data/
[root@jr5-lab thats_ah_lotta_data]# touch you-betcha
[root@jr5-lab thats_ah_lotta_data]# ls -alF
total 8
drwxr-xr-x 2 root root 23 Jun 3 00:30 ./
drwxr-xr-x 3 root root 4096 Jun 3 00:13 ../
-rw-r--r-- 1 root root 0 Jun 3 00:30 you-betcha
[root@jr5-lab thats_ah_lotta_data]# df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 1.0P 37M 1.0P 1% /mnt/thats_ah_lotta_data
The fingers never left the hand. No rabbits (jack or otherwise) were harmed in the making of this file system.
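For anyone wondering how a 1.1P file fits on lab gear while occupying only 1.0G: the backing file is sparse. The dd seek skips over a huge hole, and only the data actually written gets allocated. Here is a minimal Python sketch of the same idea at a much smaller scale. The helper name make_sparse_file and the 1 GiB hole size are mine, chosen for illustration, and the sketch assumes a Unix-like system whose file system supports sparse files.

```python
import os
import tempfile

GIB = 2**30

def make_sparse_file(path, hole_bytes, payload=b"x"):
    """Seek past hole_bytes without writing, then write a small payload.

    Hypothetical helper for illustration; it mimics what dd does with seek=.
    """
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # skip ahead, leaving a hole behind
        f.write(payload)     # only this data needs real blocks

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.data")
    make_sparse_file(path, 1 * GIB)
    st = os.stat(path)
    apparent_bytes = st.st_size            # what ls -l reports
    allocated_bytes = st.st_blocks * 512   # st_blocks is in 512 byte units (Unix only)
    print("apparent size :", apparent_bytes)
    print("allocated size:", allocated_bytes)
```

On a file system with sparse file support, the allocated size should come out far smaller than the apparent size, the same gap seen between the 1.1P file and the 1.0G total in the ls output above.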
Since this file system is 10x the size Redhat lists as the maximum, I think we can call their myth “busted”. That is, unless they have made a change (which really should be undone). They either got this wrong, or they did something really dumb. I am hoping that it was a naive marketing type rather than an ill-considered engineering decision to promote their own file system (GFS) over the technology that actually works.
Let’s hope we get a response from them. I’ll ding some of the folks I know there and try to get them to fix the table. Because we know they would never do the other thing. That would be … er … bad.
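The numbers in the demo can be checked by hand. A short Python sketch of the arithmetic follows, using the bs, seek, and count values from the dd command and the 4096 byte block size from the mkfs.xfs output. Whether the table's "100TB" means decimal or binary terabytes is not stated, so both are computed.

```python
# Arithmetic behind the demo: file size, block count, and ratio to 100TB.

bs = 2**30       # dd bs=1G
seek = 2**20     # dd seek=1M (blocks skipped before writing)
count = 1        # dd count=1

file_bytes = (seek + count) * bs    # size of big.data as reported by ls
fs_blocks = file_bytes // 4096      # 4096 byte blocks, per the mkfs.xfs output

claimed_decimal = 100 * 10**12      # 100 TB, decimal reading
claimed_binary = 100 * 2**40        # 100 TiB, binary reading

ratio_decimal = file_bytes / claimed_decimal
ratio_binary = file_bytes / claimed_binary

print("file bytes      :", file_bytes)
print("4K blocks       :", fs_blocks)
print(f"vs 100 TB  (10^12): {ratio_decimal:.2f}x")
print(f"vs 100 TiB (2^40) : {ratio_binary:.2f}x")
```

The byte count and block count match the ls and mkfs.xfs output, and either reading of "100TB" puts the file system at a little over 10x the claimed maximum.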
The kernel here is our 126.96.36.199.scalable flavor. We are testing it for stability/performance/etc with our tunes/tweaks/patches. xfs is built in, and we use a somewhat updated xfsprogs. And a number of other things thrown in. All this sitting atop …
[root@jr5-lab thats_ah_lotta_data]# cat /etc/redhat-release
CentOS release 5.6 (Final)
Henry and Jeff didn’t get this wrong. The table is wrong. It’s an annoyance, and it doesn’t detract from the quality of the article. Their points are still completely valid, and dead on the money. Metadata performance that is subpar and non-scalable will lead to all manner of performance problems.