Myths and hype: first of likely many articles

By joe

December 3, 2006 - 7 minutes read - 1411 words

We have spoken to many customers as of late about storage. Apparently there is this new high performance physical interconnect akin to the venerable and aging Fibre Channel, SCSI, and other related technologies. Its name? iSCSI. Can you tell what’s wrong with this?

The customers can’t. And we can blame the marketing hype machines for this situation. iSCSI is new, is quite interesting, and is the right solution for many users. But it is a protocol, not a physical layer. This means that its performance is fundamentally limited by whatever the underlying physical layer is.

Why is this important? Simple. Lots of storage vendors are being disingenuous when they tell customers, or imply to customers, that iSCSI will make them go faster. iSCSI has to sit on some physical layer. And most of the customers we have spoken to who have implemented iSCSI based upon vendor claims and benchmarks seem to have noticed that their performance is terrible. I ask them how they have implemented it. Most say “oh, we run it over gigabit.” I ask them if this is channel bonded gigabit. “No”.

iSCSI in a nutshell: it gives you a block device you can attach to over a network. If you want to run a cluster file system atop this, you can. Whether or not that is a good idea is another story. Multiple hosts cannot share the same block without some sort of locking and coherency protocol, which would significantly slow things down (metadata shuttling is why lots of cluster file systems don’t scale well beyond a certain point).

iSCSI will not: a) magically make your storage go faster, b) magically give you a cluster file system, c) instantly solve all of your problems. Despite what other storage vendors tell you.

Sounds like I am down on iSCSI, right? No. We think it is great. We use it, and deploy it. We just don’t like the hype and false marketing others seem to be engaged in.

If you are not running iSCSI over a fast enough connection, your performance is going to be lousy. Period. A “fast enough connection” will vary depending upon application. You can run over gigabit, but you should expect less than 100 MB/s to your block unless you channel bond.

You could run iSCSI over any network that lets you pass TCP/IP packets.
This includes Infiniband, 10 GbE, and even FC (yes, it is a network).

As of today (Dec-2006), 10GbE is simply not cost effective for very many applications. Seems that either the underlying technology is too expensive to build economically, or the vendors seem to think they don’t have any technology competitors (they do, and the technology competitors are winning).

Switch uplinks: Go get a 48 port high performance gigabit switch for $2-4k, and add a 10 GbE uplink. The uplink costs $10-15k depending upon the switch vendor. Sorta reminds me of the old days with Cray. You bought 1 GB of static RAM, and they threw in a free supercomputer. Except in this case, you buy the uplink, and for a small additional amount, you get the 48 port high performance gigabit switch that it attaches to. The point being that $10k++ per 10GbE port is a non-starter.

Infiniband looks to be the most reasonably priced high performance iSCSI technology to date. With SDR, and an adapter sitting in a PCI-X slot, you could get about 800 MB/s sustained to the iSCSI block. Put in a dual port 4 lane PCIe (PCIe x4) card, and you can (theoretically) get 2 GB/s. Use dual port DDR, and an 8 lane PCIe (PCIe x8) and you can (theoretically) get 4 GB/s. At some point you are going to be limited by the internal buffer to card transfers and kernel space to user space copying, unless you have a zero copy driver, in which case memory bandwidth will be your limiting factor. Assume about 5 GB/s for that. This is of course for a single machine to a single iSCSI block.

Let’s look at this with Gigabit, shall we? After all, this is what everyone has been selling. Single port gigabit will give you at most about 110 MB/s. If your system can do duplex, you might be able to do this in each direction (220 MB/s aggregate). Channel bond 2 gigabit links and you might be able to do 220 MB/s (440 MB/s aggregate).
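
The gigabit arithmetic above can be written down as a small sketch. It assumes the roughly 110 MB/s per link, per direction, quoted in this post, and it assumes bonding scales perfectly, which is a best case. The function name `bonded_throughput` is made up for illustration.

```python
# Best-case gigabit numbers, using the ~110 MB/s per link figure from the post.
# Assumption: channel bonding scales linearly, which real setups rarely achieve.
GIGABIT_MB_S = 110


def bonded_throughput(links: int, full_duplex: bool = True) -> tuple[int, int]:
    """Return (per-direction, aggregate) MB/s for `links` bonded gigabit ports."""
    per_direction = links * GIGABIT_MB_S
    # Full duplex lets the same rate flow in both directions at once.
    aggregate = per_direction * 2 if full_duplex else per_direction
    return per_direction, aggregate


if __name__ == "__main__":
    for n in (1, 2):
        one_way, both_ways = bonded_throughput(n)
        print(f"{n} link(s): {one_way} MB/s each way, {both_ways} MB/s aggregate")
```

These are ceilings on the wire only; anything slower between the NIC and memory lowers them further.
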
Of course most gigabit is being done over the motherboard NIC, and chances are, if they follow normal motherboard design practice, you will likely have the NIC residing on the slowest bus connection to the southbridge. Assume the NIC is on a PCI bus: they are cheap to build and simple to integrate. That allows you to hang a lower cost NIC off of them … and limits your performance to about 100 MB/s, no matter what you do. We have had discussions with customers about this, when they did not get the performance they were expecting out of their channel bonded gigabit; specifically they seemed to be stuck at about 105 MB/s no matter what they did, apart from very small packets which sit in NIC cache.

Well, you can use an offload accelerator. Sounds good, right? Adds a few hundred dollars to the equation. Makes you feel better. Except, you have that pesky little problem of the physical layer being the rate limiting issue, not the TCP/IP stack (except in some corner cases, such as many tiny operations where latency matters). Offload accelerators will help with latency. Not with bandwidth.

We have been telling customers for years that the fastest I/O is always local I/O. To get local I/O performance out of networked storage, you would have to spend quite a bit of money per node, such that local I/O will be orders of magnitude more cost effective for sizeable storage. iSCSI could change this, as local I/O could be transported out to a remote block server with enough bandwidth in and out that it could support a large number of requests … though you would need lots of pipes in and out.

For example: Each SATA II link can talk at 300 MB/s. Each U320 SCSI bus can talk at 320 MB/s. The disks themselves could talk at 70-90 MB/s. Call it 50 MB/s to make the numbers easier, assuming efficiency losses and contention in talking to the machine. To get 1 GB/s sustained performance (large block sequential IO), you would need about 20 disks. Curiously, this is about what we observe.
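
The disk count estimate above is simple division, sketched here with the post's own figures. The 50 MB/s default is the deliberately pessimistic per-disk rate used in the text, 1 GB/s is taken as 1000 MB/s, and `disks_needed` is a hypothetical helper name.

```python
import math


def disks_needed(target_mb_s: float, per_disk_mb_s: float = 50.0) -> int:
    """Disks required to sustain target_mb_s of large block sequential I/O.

    Assumes throughput adds up linearly across disks, as the estimate in the
    text does; controller and bus limits are ignored here.
    """
    return math.ceil(target_mb_s / per_disk_mb_s)


if __name__ == "__main__":
    print(disks_needed(1000))        # derated 50 MB/s per disk
    print(disks_needed(1000, 90.0))  # optimistic 90 MB/s per disk
```

At the derated 50 MB/s this gives the 20 disks quoted above; at the raw 70-90 MB/s rates the count drops to roughly 12-15, so the derating is what drives the estimate.
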
So to allocate this 1 GB/s among hosts that can talk to it at 100 MB/s, you would need to have about a 10:1 ratio. If your server can host multiple high speed NICs, say 2 dual port HCAs, you can get about 40 of these units per such a server. Not bad, right? But notice that the clients won’t see anything beyond the physical layer performance. This is one of those pesky limits that give many users fits, and that marketeers try to gloss over.

Limits are limits. You have to deal with them, or understand when you are hitting them. If you are running iSCSI over (put your favorite physical layer here), you need to understand that you will not see anything faster than the underlying physical layer connection to the block. No amount of marketing will change that fact. Then again, they will do everything in their power to gloss over it.

Just like cluster systems, you need to carefully design and architect for performance. High performance design and tuning is sort of like bringing a team of professionals to an auto race. If you bring the wrong team, regardless of the brand emblazoned upon their car, you are going to suffer. If you don’t start with a design that has a fighting chance of working the way you need, you are not likely to get the performance you require. In lots of cases, it appears that PT Barnum’s tome has been proven to hold yet again.

iSCSI is a great technology. It won’t by itself solve performance problems, but it will lower the cost of building SAN-like solutions. To get good performance out of it, you need to carefully consider all the aspects of the design, including the network, the iSCSI server, and the connection to the client. Only addressing one of these three is a guaranteed way to get terrible performance. Which is what our customers tell us that they have had with iSCSI.
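
As a rough illustration of the "slowest stage wins" point, here is a minimal sketch that takes the approximate figures used in this post (a 1 GB/s disk array, an 800 MB/s SDR Infiniband port on the server, a PCI-attached gigabit NIC on the client) and reports the bottleneck. The stage names and both function names are made up for the example; plug in your own measured numbers.

```python
def bottleneck(stages: dict[str, float]) -> tuple[str, float]:
    """Return the (name, MB/s) of the slowest stage in an I/O path."""
    name = min(stages, key=stages.get)
    return name, stages[name]


def clients_to_saturate(server_mb_s: float, client_mb_s: float) -> int:
    """How many clients at client_mb_s it takes to use up server_mb_s."""
    return int(server_mb_s // client_mb_s)


if __name__ == "__main__":
    # Illustrative values only, taken from the rough estimates in the text.
    path = {
        "disk array (20 disks x 50 MB/s)": 1000.0,
        "server SDR Infiniband port": 800.0,
        "client gigabit NIC on PCI": 100.0,
    }
    stage, rate = bottleneck(path)
    print(f"A single client is limited to {rate:.0f} MB/s by: {stage}")
    print(f"Clients needed to use 1 GB/s: {clients_to_saturate(1000, 100)}")
```

Whatever numbers you feed it, the answer for a single client is never better than the smallest entry, which is the whole argument of this post.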