Design and driver issues exposed under very high loads

Most folks, when they build Fibre Channel systems, aren’t assuming a very high IOP rate. No, really. Each channel of an FC8 connection is about 1GB/s, which, with 4k operations (neglecting overheads and other things), would give you about 256k IOPs.
To date, most of these units have been connected to spinning disks, which individually might max out at about 300 IOPs. So from that design perspective, you could put about 874 disks per connection, assuming a perfect configuration, to max out the data channel.
Well, FC lets you connect something like 127 devices per channel, so, obviously, this is massive over-engineering … right?
And moreover, if you decided to save money somehow, and instead of putting more silicon on a card, you simply “oversubscribed” one controller chip with 3 of these 127-drive channels … we are talking about only 38k IOPs per channel, max … right?
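The arithmetic above is worth making explicit. Here is a quick back-of-the-envelope sketch using the same round numbers as the text (~1GB/s per FC8 channel, 4k ops, ~300 IOPs per spinning disk, 127 devices per FC loop) … these are the approximations from the discussion, not precise spec values:

```python
# Back-of-the-envelope FC8 IOP math, using the approximate numbers above.
CHANNEL_BW = 1 * 1024**3      # ~1 GB/s usable per FC8 channel (rough)
OP_SIZE = 4 * 1024            # 4 KiB per operation
DISK_IOPS = 300               # rough ceiling for a fast spinning disk
LOOP_DEVICES = 127            # FC arbitrated loop device limit

channel_iops = CHANNEL_BW // OP_SIZE            # max ops the channel can carry
disks_to_saturate = channel_iops // DISK_IOPS   # disks needed to fill the channel
loop_iops = LOOP_DEVICES * DISK_IOPS            # what one full loop of disks delivers

print(channel_iops)       # 262144 -> the ~256k IOPs figure
print(disks_to_saturate)  # 873    -> the ~874 disks figure
print(loop_iops)          # 38100  -> the ~38k IOPs per oversubscribed channel
```

So a full loop of spinning disks delivers roughly 15% of what the channel itself can carry, which is exactly the gap the oversubscription bet relies on.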
Well … no.
Very high IOP and very high bandwidth expose all manner of interesting design … er … features.
Currently dealing with one where it looks like an HBA with 2x FC controller chips has 1 port on one chip, and 3 ports on the other.
Now connect this to a speed demon like our soon-to-be-announced siCache unit. In the box in question, we are doing 650k+ IOPs. We expected that a quad-port FC card would be well designed, with a good driver, and that we’d be able to use … well … some significant fraction of those IOPs. The siCache is an FC target in this case, though it also has 10GbE and IB connections.
If you visit us at SC11 in booth 4101, you will see one of the variants on this unit.

Most 10GbE and IB cards are quite well designed for extremely high performance. They make few assumptions, and generally the silicon (from specific vendors) does a bang-up job. With our lab unit, coming with us to SC, we’ve measured sustained 950k+ IOPs locally, and about 600k+ IOPs over 10GbE. Will try it over QDR IB next.
And FC is (like it or not) a declining legacy interconnect technology. It is expensive and slow. It’s great for interconnecting older, large, rack-sized disk arrays. It’s really not good for much else. And it is most definitely not a high performance interconnect … not one people should give serious consideration to for new gear. 10GbE is cheaper, faster, and will be compatible with 40GbE. And QDR IB is here now, with EDR/FDR ramping. The rationale for FC is pretty much restricted to legacy systems.
So back to the performance.
Found this gem as we were trying to understand a very odd performance scenario for our unit. Remember, this is a 650k+ IOP box, in place at a customer location. The customer can start banging on one port, and they hit about 95-110k IOPs (though with the recent tuning, we’ve seen up to 130k IOPs sustained). Add in a second … CAREFULLY SELECTED … port and we can almost double this. Call it 220k IOPs for 2 ports. About 1/3 the native capacity of the unit.
Now add in a 3rd port. Performance rises … by about 25k IOPs. Huh?
Now add in the 4th port. Performance rises … by about 10k IOPs. Huh²?
What it looks like to me, after looking carefully at how the unit configures itself … is that one physical port maps to one of the FC cpu chips, and the second, third, and fourth physical ports all map to the second FC cpu chip.
Ummm … ok. I understand this from an old disk IOP scenario. You’ll never get more than 120k IOPs on 3 ports, so we can divide it up this way without any loss of performance. Keeps costs down.
But start dialing up the performance, and you start running into contention. When each port can sink/source 130k+ IOPs (the specs claim 200k+, but thems marketing numbers), yeah, your overall design needs to be able to scale as well. This sort of unbalanced, oversubscribed scenario won’t work well.
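For what it’s worth, a toy model of this port-to-chip mapping reproduces the shape of the scaling we measured. The numbers here are my guesses fitted to our observations (each port pushing ~110k IOPs, each controller chip capping out near 130k), not anything from a vendor spec:

```python
# Toy model: 1 port on controller chip A, ports 2-4 all sharing chip B.
# Per-port demand and per-chip ceiling are guesses fitted to our
# measurements, not vendor numbers.
CHIP_CAP = 130_000  # assumed per-chip IOP ceiling

def aggregate_iops(ports_on_a, ports_on_b, per_port=110_000):
    """Sum of per-port demand, capped per controller chip."""
    return (min(ports_on_a * per_port, CHIP_CAP)
            + min(ports_on_b * per_port, CHIP_CAP))

# Add ports one at a time: the lone port on chip A first, then the
# "carefully selected" second port (chip B), then ports 3 and 4 (chip B).
for a, b in [(1, 0), (1, 1), (1, 2), (1, 3)]:
    print(a + b, "ports:", aggregate_iops(a, b), "IOPs")
# 1 ports: 110000 ... 2 ports: 220000 ... 3 ports: 240000 ... 4 ports: 240000
```

The flat tail is the oversubscription in a nutshell: once chip B hits its ceiling, extra ports on it buy essentially nothing, which lines up with the small 25k/10k bumps we saw from ports 3 and 4.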
Yeah, this is what we are running into. Would like to fix it. Sadly, we have to work around their design issue by using multiple cards with fewer ports per card. Which takes up valuable PCIe lanes.
We can get many more PCIe lanes as needed, but this increases cost and complexity. We don’t want that.
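To see why the workaround stings, here is the lane-budget arithmetic, with hypothetical numbers (x8 PCIe per HBA and a 40-lane host budget are assumptions for illustration, not measurements from our box):

```python
# Hypothetical PCIe lane budgeting for the multi-card workaround.
LANES_PER_HBA = 8   # assume each FC HBA is an x8 PCIe card
TOTAL_LANES = 40    # hypothetical lane budget for one host

def lanes_used(cards):
    """Lanes consumed by a given number of HBAs."""
    return cards * LANES_PER_HBA

# One quad-port card vs. four single-port cards (so no two busy
# ports share a controller chip):
print(lanes_used(1), "lanes")              # 8 lanes
print(lanes_used(4), "lanes")              # 32 lanes
print(TOTAL_LANES - lanes_used(4), "left") # 8 lanes left for everything else
```

Four times the slots and four times the lanes to get the IOPs one card should have delivered … which is lane budget we would much rather spend on 10GbE and IB.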
We are pretty careful on the design side, but we didn’t look closely enough at the design of the FC side. We assumed it would scale. It’s FC, after all.
The reason our units generally blow the doors off most of our competitors’ benchmarks … why a single one of our units, with half or fewer the disks, does what takes them multiple chassis to “match” … is in part really good design. There are other issues (highly tuned hardware/software stacks, etc.), but if you start with a crappy design, there is nothing … NOTHING … that is going to solve that for you.
Apart from a forklift, and a new purchase.
Yeah, I am not all that impressed with FC at the moment. The design of the HBA is fine for fast spinning disk. It is terrible for high performance Flash/SSD. We might be able to work around some of these issues with a better designed HBA, and have been looking at these. The current cards are Qlogic. We are looking at Emulex, and Brocade was suggested (though alas, we may not have a target driver for it in our SCSI target stack … may be doable with a workaround … we will see).