Raw unapologetic firepower in a single machine … a new record

This is a 5U, 108TB (0.1 PB) usable, high-performance, tightly coupled storage unit we are shipping to a customer this week.
This is a spinning rust machine. We’ve been busy little beavers. Tuning, tweaking. And tuning. And tweaking.
Did I mention the tuning and tweaking?

Run status group 0 (all jobs):
  WRITE: io=196236MB, aggrb=4155.7MB/s, minb=4255.4MB/s, maxb=4255.4MB/s, mint=47222msec, maxt=47222msec

Oh. My.
But … it gets … better.

Run status group 0 (all jobs):
   READ: io=196236MB, aggrb=5128.8MB/s, minb=5251.9MB/s, maxb=5251.9MB/s, mint=38262msec, maxt=38262msec
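
As a sanity check on the two fio summaries above, the aggregate bandwidth should be roughly the total I/O divided by the run time. The sketch below recomputes it from the `io=` and `maxt=` fields; the helper name `fio_bandwidth_mb_s` is made up for illustration and is not part of fio.

```python
import re

def fio_bandwidth_mb_s(line):
    """Return total I/O divided by runtime, in MB/s, for one fio summary line."""
    io_mb = float(re.search(r"io=([\d.]+)MB", line).group(1))
    maxt_ms = float(re.search(r"maxt=\s*([\d.]+)msec", line).group(1))
    return io_mb / (maxt_ms / 1000.0)

# The two summary lines reported in the post.
write_line = ("WRITE: io=196236MB, aggrb=4155.7MB/s, minb=4255.4MB/s, "
              "maxb=4255.4MB/s, mint=47222msec, maxt=47222msec")
read_line = ("READ: io=196236MB, aggrb=5128.8MB/s, minb=5251.9MB/s, "
             "maxb=5251.9MB/s, mint=38262msec, maxt=38262msec")

write_bw = fio_bandwidth_mb_s(write_line)
read_bw = fio_bandwidth_mb_s(read_line)

print(f"write: {write_bw:.1f} MB/s")
print(f"read:  {read_bw:.1f} MB/s")
```

Both values land within a fraction of a MB/s of the reported aggrb figures. The minb/maxb figures are higher by a factor of about 1.024, which looks like a 1000-versus-1024 unit difference in the report rather than a different measurement, though that is an inference from the numbers alone.
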

This is spinning rust. This is not SSD/Flash.
I think this just might be the fastest single spinning rust unit on the market. We are more than 2.5x faster at writes, and more than 3.5x faster on reads than the “world’s fastest” storage.
Now imagine building large storage clusters out of units like this. What sort of storage bandwidth wall should you expect? For a single box, 108TB / 5.1GB/s ≈ 2.1 x 10^4 seconds, or about 1/4 of a day. Scale up to 10 machines and you get 1080TB and an aggregate 51 GB/s read speed. Capacity and bandwidth grow together, which gives you a constant storage bandwidth wall height.
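
The bandwidth wall arithmetic can be sketched in a few lines. This assumes decimal units (1 TB = 1000 GB) and perfectly linear scaling across machines, which a real cluster will only approximate; the function name `bandwidth_wall_seconds` is hypothetical.

```python
def bandwidth_wall_seconds(capacity_tb, bandwidth_gb_s):
    """Time to read the full capacity once at the given sustained bandwidth."""
    capacity_gb = capacity_tb * 1000.0  # assumption: decimal TB to GB
    return capacity_gb / bandwidth_gb_s

# One unit: 108 TB at roughly 5.1 GB/s read.
single = bandwidth_wall_seconds(108, 5.1)

# Ten units, assuming capacity and bandwidth both scale linearly.
cluster = bandwidth_wall_seconds(10 * 108, 10 * 5.1)

print(f"single box : {single:.0f} s ({single / 86400:.2f} days)")
print(f"10 machines: {cluster:.0f} s ({cluster / 86400:.2f} days)")
```

Because capacity and bandwidth are multiplied by the same factor, the ratio, and so the wall height, does not change as units are added.
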
These units are going to a financial services customer. We are building many more of them.

9 thoughts on “Raw unapologetic firepower in a single machine … a new record”

  1. Nice numbers, Joe. Are those numbers on the box itself, or is that the bandwidth actually exported to clients elsewhere on the network?

  2. Linux? FreeBSD? Other?
    I don’t expect you to reveal the full config, but dropping some details would be nice.

  3. @Anonymous – pretty sure all Joe’s boxes are tested with Linux, but I suspect he’ll ship whatever the customer wants.
    @Mark – again I reckon it’d be whatever the customer wants (or is needed to meet the acceptance criteria).

  4. @anon
    This is a Linux kernel with our tuned drivers/stack. It’s actually nothing out of the ordinary for our kit, the same basic JackRabbit kit we use for all machines. We have a kernel that is doing exceptionally well, that we might transition to once we get the IB built for it.
    This is xfs. No one could hope to do anything like this with ext* or ldiskfs on a single machine. Closest I saw to this performance required a cluster file system and double the number of disks … and they never really measured the performance, they just guessed. See the previous postings on benchmarketing numbers about the skepticism that one should hold over such numbers, and the derision that should be heaped upon those who don’t measure but merely guess.
    Again, implementation matters, config and setup matter.
    Tests were done using our sw.fio input deck, which is listed elsewhere on this blog. You can try it out yourself on your system(s).

  5. [update] Over QDR IB, using nothing more than NFS over IPoIB, we got a little north of 2GB/s over a single cable.
    Again, not bad at all. Could be better, but I am happy with this as a start.

  6. @Joe
    I’ve seen north of 10GB/s from XFS on a single node, but that was on some pretty beefy hardware that was intended to be used with CXFS.
    Unfortunately there is a bug in XFS that forces you to use default extent sizes or face potential filesystem corruption:
    CXFS does all of its metadata traffic over the network, so small extent sizes mean tons of RPCs when doing initial writes with small transfers. Of course for reads it’s another story.

  7. @Mark
    Please send me your private email again (to joe at scalability dot org).
    I just ran the test script in their report on our system:

    [root@jr4-1 data]# ./test.sh
    test.sh: generating 10 files
    test.sh: comparing files
    cmp filea_0 filea_1
    cmp filea_0 filea_2
    cmp filea_0 filea_3
    cmp filea_0 filea_4
    cmp filea_0 filea_5
    cmp filea_0 filea_6
    cmp filea_0 filea_7
    cmp filea_0 filea_8
    cmp filea_0 filea_9
    cmp filea_1 filea_2
    cmp filea_1 filea_3
    cmp filea_1 filea_4
    cmp filea_1 filea_5
    cmp filea_1 filea_6
    cmp filea_1 filea_7
    cmp filea_1 filea_8
    cmp filea_1 filea_9
    cmp filea_2 filea_3
    cmp filea_2 filea_4
    cmp filea_2 filea_5
    cmp filea_2 filea_6
    cmp filea_2 filea_7
    cmp filea_2 filea_8
    cmp filea_2 filea_9
    cmp filea_3 filea_4
    cmp filea_3 filea_5
    cmp filea_3 filea_6
    cmp filea_3 filea_7
    cmp filea_3 filea_8
    cmp filea_3 filea_9
    cmp filea_4 filea_5
    cmp filea_4 filea_6
    cmp filea_4 filea_7
    cmp filea_4 filea_8
    cmp filea_4 filea_9
    cmp filea_5 filea_6
    cmp filea_5 filea_7
    cmp filea_5 filea_8
    cmp filea_5 filea_9
    cmp filea_6 filea_7
    cmp filea_6 filea_8
    cmp filea_6 filea_9
    cmp filea_7 filea_8
    cmp filea_7 filea_9
    cmp filea_8 filea_9
    test.sh: 0 errors

    I don’t think we are using default sized extents:

    [root@jr4-1 data]# xfs_info /data | grep extsz
    realtime =none                   extsz=2097152 blocks=0, rtextents=0

    Could be with a specific kernel. We’ve seen bad xfs bugs in the CentOS/RHEL series kernels.
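
The contents of test.sh are not shown in the thread, so the following is only a guess at its shape based on the output: write 10 files that should be identical, then compare every pair byte for byte and count mismatches. The helper names, the 1 MiB payload, and the way the files are generated are all assumptions for illustration.

```python
import filecmp
import itertools
import os
import tempfile

def write_copies(directory, payload, count=10):
    """Write `count` files with identical contents; return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"filea_{i}")
        with open(path, "wb") as fh:
            fh.write(payload)
        paths.append(path)
    return paths

def count_mismatches(paths):
    """Compare every pair of files byte for byte; return the number that differ."""
    errors = 0
    for a, b in itertools.combinations(paths, 2):
        # shallow=False forces a content comparison, like cmp(1).
        if not filecmp.cmp(a, b, shallow=False):
            errors += 1
    return errors

with tempfile.TemporaryDirectory() as tmp:
    paths = write_copies(tmp, os.urandom(1 << 20))  # assumed 1 MiB payload
    pair_count = len(list(itertools.combinations(paths, 2)))
    errors = count_mismatches(paths)
    print(f"compared {pair_count} pairs: {errors} errors")
```

Ten files give 45 pairs, which matches the number of cmp lines in the output above. To exercise a particular filesystem, point the directory at a path on that filesystem instead of a temporary directory.
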
