This is a 5U 108TB (0.1 PB) usable high performance tightly coupled storage unit we are shipping to a customer this week.
This is a spinning rust machine. We’ve been busy little beavers. Tuning, tweaking. And tuning. And tweaking.
Did I mention the tuning and tweaking?
Run status group 0 (all jobs): WRITE: io=196236MB, aggrb=<strong>4155.7MB/s</strong>, minb=4255.4MB/s, maxb=4255.4MB/s, mint=47222msec, maxt=47222msec
Oh. My.
But … it gets … better.
Run status group 0 (all jobs): READ: io=196236MB, aggrb=<strong>5128.8MB/s</strong>, minb=5251.9MB/s, maxb=5251.9MB/s, mint=38262msec, maxt=38262msec
This is spinning rust. This is not SSD/Flash.
I think this just might be the fastest single spinning rust unit on the market. We are more than 2.5x faster at writes, and more than 3.5x faster on reads than the “world’s fastest” storage.
Now imagine building large storage clusters out of units like this. What sort of storage bandwidth wall should you expect? For a single box, 108TB / 5.1GB/s = 2.1 x 10^4 seconds, or about 1/4 of a day. Scale up to 10 machines and you get 1080TB and an aggregate 51 GB/s read speed. Capacity and bandwidth grow together, which gives you a constant storage bandwidth wall height.
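For anyone who wants to play with the bandwidth wall arithmetic, here is a minimal sketch in Python. The function name `bandwidth_wall_seconds` is made up for illustration, and it assumes decimal units (1 TB = 1000 GB) and perfectly linear scaling across machines, which a real cluster will not quite deliver.

```python
def bandwidth_wall_seconds(capacity_tb: float, bandwidth_gb_s: float) -> float:
    """Time in seconds to read the full capacity once at the given bandwidth.

    Assumes decimal units: 1 TB = 1000 GB.
    """
    return capacity_tb * 1000.0 / bandwidth_gb_s


SECONDS_PER_DAY = 86400

# Single unit: 108 TB usable, about 5.1 GB/s read.
single = bandwidth_wall_seconds(108, 5.1)

# Ten units, assuming capacity and bandwidth both scale linearly.
cluster = bandwidth_wall_seconds(10 * 108, 10 * 5.1)

print(f"single unit: {single:.0f} s ({single / SECONDS_PER_DAY:.2f} days)")
print(f"10 units:    {cluster:.0f} s ({cluster / SECONDS_PER_DAY:.2f} days)")
```

Both cases come out to the same wall height, about a quarter of a day, because the factor of 10 cancels in the ratio.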
These units are going to a financial services customer. We are building many more of them.
Nice numbers, Joe. Are those numbers on the box itself, or is that the bandwidth actually exported to clients elsewhere on the network?
@Jeff
On the box, but we are about to test it over QDR IB connected clients.
Linux? FreeBSD? Other?
I don’t expect you to reveal the full config, but dropping some details would be nice.
I imagine IB will be your limiting factor for ideal tests. Mind if I ask what kind of file system will eventually run on that box?
@Anonymous – pretty sure all Joe’s boxes are tested with Linux, but I suspect he’ll ship whatever the customer wants.
@Mark – again I reckon it’d be whatever the customer wants (or is needed to meet the acceptance criteria).
@anon
This is Linux 2.6.32.41 kernel with our tuned drivers/stack. It’s actually nothing out of the ordinary for our kit, the same basic JackRabbit kit we use for all machines. We have a 2.6.39.4 kernel that is doing exceptionally well, that we might transition to once we get the IB built for it.
@Mark
This is xfs. No one could hope to do anything like this with ext* or ldiskfs on a single machine. Closest I saw to this performance required a cluster file system and double the number of disks … and they never really measured the performance, they just guessed. See the previous postings on benchmarketing numbers about the skepticism that one should hold over such numbers, and the derision that should be heaped upon those who don’t measure but merely guess.
Again, implementation matters, config and setup matter.
@all
Tests were done using our sw.fio input deck, which is listed elsewhere on this blog. You can try it out yourself on your system(s).
[update] Over QDR IB, using nothing more than NFS over IPoIB, we got a little north of 2GB/s over a single cable.
Again, not bad at all. Could be better, but I am happy with this as a start.
@Joe
I’ve seen north of 10GB/s from XFS on a single node, but that was on some pretty beefy hardware that was intended to be used with CXFS.
Unfortunately there is a bug in XFS that forces you to use default extent sizes or face potential filesystem corruption:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=874
CXFS does all of its metadata traffic over the network, so small extent sizes mean tons of RPCs when doing initial writes with small transfers. Of course for reads it’s another story.
@Mark
Please send me your private email again (to joe at scalability dot org).
I just ran the test script in their report on our system:
I don’t think we are using default sized extents:
Could be with a specific kernel. We’ve seen bad xfs bugs in the CentOS/RHEL series kernels.