SRP joy

ok, not really. Late last night, while benchmarking some alternative mechanisms to connect {MD,OS}S to their respective {MD,OS}T for a Lustre design we are proposing for an RFP, I decided to revisit SRP. I liked SRP in the past; it's a simple protocol, SCSI over RDMA. How could you go wrong with this?

Well, I found out last night.

I put our stack on a DeltaV connected via 10GbE and QDR IB ports to their respective switches. Lit up the target; iSCSI over 10GbE was nice, so I wanted to try IB. iSCSI (not iSER) over IB isn't so nice: about 1/2 the performance of 10GbE. Yeah, iSER will be tried next, but I wanted to see how SRP behaved.
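
(For context, the iSCSI runs were plain open-iscsi discoveries and logins; the portal address below is a placeholder, not the actual dv4 address, so treat this as a sketch rather than the exact commands used.)

# discover the targets the DeltaV exports (portal IP is a placeholder)
iscsiadm -m discovery -t sendtargets -p 192.168.10.1

# log in over whichever interface reaches that portal (10GbE or IPoIB)
iscsiadm -m node -T iqn.2011-10.com.scalableinformatics:blockio -p 192.168.10.1 --login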

So the target was set up, and it was visible:

-bash-4.1# scstadmin -list_target

Collecting current configuration: done.

	Driver  Target                                     
	---------------------------------------------------
	ib_srpt ib_srpt_target_0                           
	iscsi   iqn.2011-10.com.scalableinformatics:blockio
	        iqn.2012-10.com.scalableinformatics:fileio 
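
A target layout like the one above would typically live in /etc/scst.conf and be loaded with scstadmin -config. The sketch below is hypothetical (the backing device and LUN layout are guesses, not our actual config), but it shows the general shape:

# /etc/scst.conf (sketch; backing device and LUN layout are hypothetical)
HANDLER vdisk_blockio {
        DEVICE disk01 {
                filename /dev/sdb
        }
}

TARGET_DRIVER iscsi {
        enabled 1
        TARGET iqn.2011-10.com.scalableinformatics:blockio {
                enabled 1
                LUN 0 disk01
        }
}

TARGET_DRIVER ib_srpt {
        TARGET ib_srpt_target_0 {
                enabled 1
                LUN 0 disk01
        }
}

The fileio target would just be a second DEVICE under vdisk_fileio, exported the same way.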

Connect to it … and …
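
(On the initiator side, an SRP login is normally just a discovery plus a write into the ib_srp sysfs interface; the GUIDs and the HCA/port name below are placeholders, not the dv4's real values.)

# discover SRP targets visible from the local HCA port; prints add_target strings
ibsrpdm -c

# hand one of those strings (placeholder values here) to the ib_srp initiator
echo "id_ext=0002c9030001a2b4,ioc_guid=0002c9030001a2b4,dgid=fe800000000000000002c9030001a2b5,pkey=ffff,service_id=0002c9030001a2b4" \
    > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target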

Hey, the dv4 isn’t responding … why? It was 1am, so I figured “pilot error” and decided I’d look at it this morning.

Get in, fire up the console. Unit is up. Try connecting to this again and …

KERNEL PANIC on dv4

Damn. I liked SRP. No real time to debug this, so it's onto iSER.
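
(For the record, the iSER attempt is mostly a matter of pointing open-iscsi at the iser transport for the same target; the portal address here is again a placeholder, presumably the target's IPoIB address.)

# switch the node record to the iSER transport, then log in over IB
iscsiadm -m node -T iqn.2011-10.com.scalableinformatics:blockio -p 10.1.1.1 \
    -o update -n iface.transport_name -v iser
iscsiadm -m node -T iqn.2011-10.com.scalableinformatics:blockio -p 10.1.1.1 --login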

As a rule of thumb, I think it’s … I dunno … a bad idea to kernel panic on a connection attempt. It might put users off a bit.


2 thoughts on “SRP joy”

  1. Did you report it upstream? What does the trace look like?
    With recent kernel versions we also had some issues. I tested 3.1-rcX and was able to fix a serious IPoIB issue before 3.1 was released, but I only started testing 3.2 after it was released. The traces were a bit confusing at first, as it looked like FhGFS was failing, but it turned out to be an uninitialized variable in the IB stack…
    I guess linux-rdma needs a bit more manual or automated testing before making a release… And then I’m not sure how many people are using ib_srpt at all. If a hardware vendor has issues, they might simply fix or work around them in-house, without releasing the patch… (which is probably even allowed, as OFED is BSD/GPL dual-licensed).

    Cheers,
    Bernd

  2. For 10GbE it wasn’t bad at all, about 25% utilization under heavy load. For IB it was lower, about 10%.

    One of the problems with earlier adapters was the interrupt and context switch rates. For SRPT and the various RDMA protocols, I was seeing interrupt rates north of 50k/s, and context switches went into the 80-90k/s range. That is insane. This could be an incorrectly configured driver. Or card. Or something else. (A quick way to watch these rates is sketched at the end of this comment.)

    I noticed today on one of the iSCSI / OFED lists that someone made observations in line with what I had observed but not reported. In a nutshell, when doing real IO to real devices, RDMA things don’t seem to do as well as non-RDMA things. That is:

        very basic benchmarks and surprising (at least for me) results. it looks like reading is much slower than writing, and NFS/RDMA is twice as slow at reading as classic NFS. results below, comments appreciated!

        regards, Pawel

        both nfs server and client have 8 cores, 16 GB RAM, Mellanox DDR HCAs (MT25204) connected port-port (no switch).

        local_hdd   - 2 sata2 disks in soft-raid0
        nfs_ipoeth  - classic nfs over ethernet
        nfs_ipoib   - classic nfs over IPoIB
        nfs_rdma    - NFS/RDMA

        simple write of a 36GB file with dd (both machines have 16GB RAM):

        /usr/bin/time -p dd if=/dev/zero of=/mnt/qqq bs=1M count=36000

        local_hdd   sys 54.52  user 0.04  real  254.59
        nfs_ipoib   sys 36.35  user 0.00  real  266.63
        nfs_rdma    sys 39.03  user 0.02  real  323.77
        nfs_ipoeth  sys 34.21  user 0.01  real  375.24

        remount /mnt to clear the cache, then read the file from the nfs share and write it to /scratch:

        /usr/bin/time -p dd if=/mnt/qqq of=/scratch/qqq bs=1M

        nfs_ipoib   sys 59.04  user 0.02  real  571.57
        nfs_ipoeth  sys 58.92  user 0.02  real  606.61
        nfs_rdma    sys 62.57  user 0.03  real 1296.36

        results from bonnie++ (version 1.03c):

                                --Sequential Write--  --Rewrite--  --Sequential Read--  --Random Seeks--
        Machine     Size         K/sec  %CP            K/sec  %CP    K/sec  %CP            /sec  %CP
        local_hdd   35G:128k     93353   12            58329    6   143293    7           243.6    1
        local_hdd   35G:256k     92283   11            58189    6   144202    8           172.2    2
        local_hdd   35G:512k     93879   12            57715    6   144167    8           128.2    4
        local_hdd   35G:1024k    93075   12            58637    6   144172    8            95.3    7
        nfs_ipoeth  35G:128k     91325    7            31848    4    64299    4           170.2    1
        nfs_ipoeth  35G:256k     90668    7            32036    5    64542    4           163.2    2
        nfs_ipoeth  35G:512k     93348    7            31757    5    64454    4            85.7    3
        nfs_ipoeth  35G:1024k    91283    7            31869    5    64241    5            51.7    4
        nfs_ipoib   35G:128k     91733    7            36641    5    65839    4           178.4    2
        nfs_ipoib   35G:256k     92453    7            36567    6    66682    4           166.9    3
        nfs_ipoib   35G:512k     91157    7            37660    6    66318    4            86.8    3
        nfs_ipoib   35G:1024k    92111    7            35786    6    66277    5            53.3    4
        nfs_rdma    35G:128k     91152    8            29942    5    32147    2           187.0    1
        nfs_rdma    35G:256k     89772    7            30560    5    34587    2           158.4    3
        nfs_rdma    35G:512k     91290    7            29698    5    34277    2            60.9    2
        nfs_rdma    35G:1024k    91336    8            29052    5    31742    2            41.5    3

                                ------Sequential Create------   --------Random Create--------
                        files   Create      Read      Delete    Create      Read      Delete
                                 /sec %CP    /sec %CP  /sec %CP   /sec %CP    /sec %CP  /sec %CP
        local_hdd          16   10587  36  +++++ +++  8674  29   10727  35  +++++ +++  7015  28
        local_hdd          16   11372  41  +++++ +++  8490  29   11192  43  +++++ +++  6881  27
        local_hdd          16   10789  35  +++++ +++  8520  29   11468  46  +++++ +++  6651  24
        local_hdd          16   10841  40  +++++ +++  8443  28   11162  41  +++++ +++  6441  22
        nfs_ipoeth         16    3753   7  13390  12  3795   7    3773   8  22181  16  3635   7
        nfs_ipoeth         16    3762   8  12358   7  3713   8    3753   7  20448  13  3632   6
        nfs_ipoeth         16    3834   7  12697   6  3729   8    3725   9  22807  11  3673   7
        nfs_ipoeth         16    3729   8  14260  10  3774   7    3744   7  25285  14  3688   7
        nfs_ipoib          16    6803  17  +++++ +++  6843  15    6820  14  +++++ +++  5834  11
        nfs_ipoib          16    6587  16  +++++ +++  4959   9    6832  14  +++++ +++  5608  12
        nfs_ipoib          16    6820  18  +++++ +++  6636  15    6479  15  +++++ +++  5679  13
        nfs_ipoib          16    6475  14  +++++ +++  6435  14    5543  11  +++++ +++  5431  11
        nfs_rdma           16    7014  15  +++++ +++  6714  10    7001  14  +++++ +++  5683   8
        nfs_rdma           16    7038  13  +++++ +++  6713  12    6956  11  +++++ +++  5488   8
        nfs_rdma           16    7058  12  +++++ +++  6797  11    6989  14  +++++ +++  5761   9
        nfs_rdma           16    7201  13  +++++ +++  6821  12    7072  15  +++++ +++  5609   9

    These are Pawel Dziekonski’s results, quoted without permission from the nfs-rdma-devel list. Oddly, I observed very similar things, and had trouble explaining them. If the protocol is indeed very fast, but the connection to the back end data store is slow, that is a problem. Our back end data store is fast. And we still had problems.
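
    For anyone who wants to reproduce the interrupt / context-switch observation above, this is just a sketch: the grep pattern depends on which HCA driver is loaded (mthca, mlx4, …), and the 1-second interval is arbitrary.

    # system-wide interrupt (in) and context-switch (cs) rates, one sample per second
    vmstat 1

    # per-IRQ counters for the HCA; adjust the pattern to your driver (mthca, mlx4_core, ...)
    watch -n 1 'grep -i -E "mthca|mlx" /proc/interrupts'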

Comments are closed.