Testing iSCSI over 10 GbE, iSER over IB, SRPT over IB, …

This will be short, no long discussion of benchmarks.

Basically, we tried JackRabbit as a target for a number of block-oriented protocols, over 10 GbE and over IB. I thought 10 GbE would be badly beaten by IB in performance (real world, no RAM disks here).

I think I was wrong. 10 GbE-based iSCSI was quite simple to set up, pretty easy to tune, and actually nice to work with.

Compare this to building SCST-SRPT or the right version of iSER or the correctly patched OFED for SCSI-TGT, or …

I like IB. I really do. But building OFED is a bear. It is not easy. It is not a matter of downloading a tarball, compiling for 2 minutes, installing a kernel driver module, and bam, fast networks.

10 GbE is.

The stack is simple. Setting it up is simple. Using it is simple.

Simple implies lower barriers to usage.
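To give a sense of what simple means here, an iSCSI export over 10 GbE on a stock Linux box is roughly the following. This is a sketch using scsi-target-utils (tgtadm) on the target and open-iscsi on the initiator; the IQN, block device, and IP address are placeholders, and these are not necessarily the exact tools or names we used.

# on the target (tgtd running)
tgtadm --lld iscsi --op new --mode target --tid 1 \
    --targetname iqn.2008-01.com.example:jackrabbit.lun0
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
    --backing-store /dev/sdX          # the RAID6 block device
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL

# on the initiator
iscsiadm -m discovery -t sendtargets -p 192.168.10.1
iscsiadm -m node --login

That is more or less the whole exercise. Compare it with patching and rebuilding OFED.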

I bet that the IB people could (if they wanted) integrate IPoIB directly into the driver, and make the actual driver builds nearly as clean and simple as the 10 GbE ones.

The 10 GbE card I played with is a pre-production unit. I am hoping that we get to play with it some more. It is quite nice.

The nice thing was that the 10 GbE performance was comparable to (in the same ballpark as) the IB performance for the IO operations. I don’t care if I have double the bandwidth if I never have any hope of actually using that bandwidth. With a real storage system like a JackRabbit, you are not going to be able to push or pull more than some upper limit to or from the disks per unit time. What is nice is that the JackRabbit’s disks allow us to feed something at 10 GbE/IB speeds. What was remarkable to me was how well matched they were.

We were seeing in the vicinity of 450-550 MB/s sustained performance to physical disk on writes, and a little lower on reads, on pre-release 10 GbE hardware. Again, this is a sub-$10k storage unit, and the storage was RAID6 with 1 hot spare (13 drives in the RAID6). Native on-disk speed we tamped down to 750 MB/s (a slight de-tune to optimize overall storage throughput and avoid cache-buffer “beat” effects). With a little tuning, I bet we can get somewhat higher iSCSI performance. There will always be overhead, call it ~10% or so for the protocol and so on. So our numbers aren’t bad at all.
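For reference on how numbers like these are obtained, the sustained figures come from streaming large sequential IO against the imported block device, along these lines. The device name and size are placeholders; this is a sketch of the kind of run, not the exact commands we used.

# write test: stream ~32 GB to the iSCSI-backed device, bypassing the page cache
dd if=/dev/zero of=/dev/sdX bs=1M count=32000 oflag=direct

# read test: stream it back
dd if=/dev/sdX of=/dev/null bs=1M count=32000 iflag=direct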


2 thoughts on “Testing iSCSI over 10 GbE, iSER over IB, SRPT over IB, …”

  1. What was the hit on CPU utilization?
    The other promise of IB, beyond bandwidth, is that the bandwidth will not eat your CPU.

  2. For 10 GbE it wasn’t bad at all, about 25% utilization under heavy load. For IB it was lower, about 10%.

    One of the problems with earlier adapters was the interrupt and context switch rates. For SRPT and the various RDMA protocols, I was seeing interrupt rates north of 50k/s, and context switch rates in the 80-90k/s range. That is insane.

    This could be an incorrectly configured driver. Or card. Or something else.
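    If you want to see these rates on your own gear, vmstat’s in (interrupts) and cs (context switches) columns are where figures like the 50k/s and 80-90k/s above show up; sample once per second while the load runs. This is just the standard tooling, not anything specific to the cards here.

    # system-wide interrupt (in) and context switch (cs) rates, 1-second samples
    vmstat 1

    # per-source interrupt counts, useful for pinning the load on one HCA or NIC
    watch -n 1 cat /proc/interrupts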

    I noticed today on one of the iSCSI / OFED lists that someone made observations in line with what I had observed but not reported. In a nutshell, when doing real IO to real devices, RDMA things don’t seem to do as well as non-RDMA things.

    That is:

    very basic benchmarks and surprising (at least for me) results – it
    look’s like reading is much slower than writing and NFS/RDMA is twice
    slower in reading than classic NFS. 😮

    results below – comments appreciated!
    regards, Pawel

    both nfs server and client have 8-cores, 16 GB RAM, Mellanox DDR HCAs
    (MT25204) connected port-port (no switch).

    local_hdd – 2 sata2 disks in soft-raid0,
    nfs_ipoeth – classic nfs over ethernet,
    nfs_ipoib – classic nfs over IPoIB,
    nfs_rdma – NFS/RDMA.

    simple write of 36GB file with dd (both machines have 16GB RAM):
    /usr/bin/time -p dd if=/dev/zero of=/mnt/qqq bs=1M count=36000

    local_hdd sys 54.52 user 0.04 real 254.59

    nfs_ipoib sys 36.35 user 0.00 real 266.63
    nfs_rdma sys 39.03 user 0.02 real 323.77
    nfs_ipoeth sys 34.21 user 0.01 real 375.24

    remount /mnt to clear cache and read a file from nfs share and
    write it to /dev/:
    /usr/bin/time -p dd if=/mnt/qqq of=/scratch/qqq bs=1M

    nfs_ipoib sys 59.04 user 0.02 real 571.57
    nfs_ipoeth sys 58.92 user 0.02 real 606.61
    nfs_rdma sys 62.57 user 0.03 real 1296.36

    results from bonnie++:

    Version 1.03c ------Sequential Write------ --Sequential Read-- --Random-
    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
    Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
    local_hdd 35G:128k 93353 12 58329 6 143293 7 243.6 1
    local_hdd 35G:256k 92283 11 58189 6 144202 8 172.2 2
    local_hdd 35G:512k 93879 12 57715 6 144167 8 128.2 4
    local_hdd 35G:1024k 93075 12 58637 6 144172 8 95.3 7
    nfs_ipoeth 35G:128k 91325 7 31848 4 64299 4 170.2 1
    nfs_ipoeth 35G:256k 90668 7 32036 5 64542 4 163.2 2
    nfs_ipoeth 35G:512k 93348 7 31757 5 64454 4 85.7 3
    nfs_ipoeth 35G:1024k 91283 7 31869 5 64241 5 51.7 4
    nfs_ipoib 35G:128k 91733 7 36641 5 65839 4 178.4 2
    nfs_ipoib 35G:256k 92453 7 36567 6 66682 4 166.9 3
    nfs_ipoib 35G:512k 91157 7 37660 6 66318 4 86.8 3
    nfs_ipoib 35G:1024k 92111 7 35786 6 66277 5 53.3 4
    nfs_rdma 35G:128k 91152 8 29942 5 32147 2 187.0 1
    nfs_rdma 35G:256k 89772 7 30560 5 34587 2 158.4 3
    nfs_rdma 35G:512k 91290 7 29698 5 34277 2 60.9 2
    nfs_rdma 35G:1024k 91336 8 29052 5 31742 2 41.5 3
    ------Sequential Create------ --------Random Create--------
    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
    files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
    local_hdd 16 10587 36 +++++ +++ 8674 29 10727 35 +++++ +++ 7015 28
    local_hdd 16 11372 41 +++++ +++ 8490 29 11192 43 +++++ +++ 6881 27
    local_hdd 16 10789 35 +++++ +++ 8520 29 11468 46 +++++ +++ 6651 24
    local_hdd 16 10841 40 +++++ +++ 8443 28 11162 41 +++++ +++ 6441 22
    nfs_ipoeth 16 3753 7 13390 12 3795 7 3773 8 22181 16 3635 7
    nfs_ipoeth 16 3762 8 12358 7 3713 8 3753 7 20448 13 3632 6
    nfs_ipoeth 16 3834 7 12697 6 3729 8 3725 9 22807 11 3673 7
    nfs_ipoeth 16 3729 8 14260 10 3774 7 3744 7 25285 14 3688 7
    nfs_ipoib 16 6803 17 +++++ +++ 6843 15 6820 14 +++++ +++ 5834 11
    nfs_ipoib 16 6587 16 +++++ +++ 4959 9 6832 14 +++++ +++ 5608 12
    nfs_ipoib 16 6820 18 +++++ +++ 6636 15 6479 15 +++++ +++ 5679 13
    nfs_ipoib 16 6475 14 +++++ +++ 6435 14 5543 11 +++++ +++ 5431 11
    nfs_rdma 16 7014 15 +++++ +++ 6714 10 7001 14 +++++ +++ 5683 8
    nfs_rdma 16 7038 13 +++++ +++ 6713 12 6956 11 +++++ +++ 5488 8
    nfs_rdma 16 7058 12 +++++ +++ 6797 11 6989 14 +++++ +++ 5761 9
    nfs_rdma 16 7201 13 +++++ +++ 6821 12 7072 15 +++++ +++ 5609 9

    These are Pawel Dziekonski’s results, and they are quoted without permission from the nfs-rdma-devel list.

    Oddly, I observed very similar things, and had trouble explaining them. If the protocol is indeed very fast, but the connection to the back-end data store is slow, then the slow back end caps the end-to-end throughput and the fast protocol buys you nothing. Our back-end data store is fast. And we still had problems.
