What is the future of storage?

I am seeing lots of deep soul searching in pundit circles, as well as head scratching on the part of customers, as various vendors writhe and contort in their death throes. Pundits regularly trash that which they neither grasp, nor prefer. Customers wonder what the right path going forward is. Vendors struggle to figure out what the market really wants, and to be able to offer that (all the while the marketing teams are spinning hard and fast).

Most recently, I had read a different article somewhere claiming that all other pundits were wrong, that there was no storage crisis, that dedup has solved all problems, that tiering is the right way to go, that MAID is dead, that … etc. I think their conclusions and analyses are (quite) suspect, and what they ascribe as observations might in fact be quite off (they didn’t quite get what they reported … or spun it in a very strange way). Of course we are biased, but our bias doesn’t mean our criticism is invalid. We see our customers generating and analyzing ever more data. Needing to access and use it, needing to move and store it. Paraphrasing Inigo Montoya “that concept … I don’t think it means what you think it means.”

We have a fairly simple focus based upon our experience in high performance computing. This is a space whose use cases we do understand quite well.

Henry Newman also gets this space; he’s been doing this stuff longer than most. Henry is at Instrumental, and has a nice article on Enterprise Storage Forum.

In it, Henry puts on a pundit hat and thinks aloud about the future of storage. This is where it gets interesting, in that I see a convergence of our views.

We have been talking about streaming performance and good IOP performance (though not actively talking about the latter, but we have excellent numbers nonetheless) for a long time. Streaming performance is absolutely critical in many usage scenarios in high performance storage. We regularly are hit with requests on how to design a system for high streaming data rates, as well as providing low seek latencies … customers don’t always have control over how their programs are written, or do IO.

Henry’s article is on SSDs stressing RAID controller design. In it, he talks about the bandwidth issue, pointing out that SSDs have very high seek capability, and actually quite good streaming capability. He points to the PCIe bus as a potential bottleneck in RAID controller design … or rather, how it is used.

I want to delve into this a little, because the problem is actually far worse than he indicates.

PCIe v1.0 has 256 MB/s read and 256 MB/s write capability per lane. Well, sort of. Really it is 86% of this due to protocol overhead. Let’s call it 220 MB/s in each direction.

PCIe v2.0 doubles this, but doesn’t get any more efficient. Now you have 512 MB/s read and write, or 440 MB/s deliverable/usable to the card.

A PCIe v1.0 x8 link is about 1760 MB/s, which is, not so oddly, about the limit you saw in many Infiniband benchmarks for DDR data. DDR Infiniband could move 20 Gb/s in each direction (2.5 GB/s) … or really 8/10 of 20 Gb/s, or 2 GB/s, due to the Infiniband 8/10 encoding.

You couldn’t use all that bandwidth though, as the PCIe bus limited you to 1760 MB/s due to its overhead.

With PCIe v2.0, we get to double this. Now we have 3520 MB/s available for QDR Infiniband.

Also for storage.
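The arithmetic above can be sketched in a few lines of Python. The per-lane figures are the rounded ones used in this post (220 MB/s usable for v1.0, double that for v2.0); the function and constant names are made up for illustration, and this is a back-of-the-envelope model, not a measurement.

```python
# Usable per-lane rates from the text, in MB/s per direction.
# 220 is the rounded "86% of 256" figure; v2.0 simply doubles it.
USABLE_MB_PER_LANE = {"1.0": 220, "2.0": 440}


def pcie_usable_mb_s(version: str, lanes: int) -> int:
    """Usable bandwidth in one direction for a PCIe link, in MB/s."""
    return USABLE_MB_PER_LANE[version] * lanes


def infiniband_data_mb_s(signal_gbit_s: float) -> float:
    """Data rate in MB/s after the 8/10 encoding (1 GB = 1000 MB here)."""
    data_gbit_s = signal_gbit_s * 8 / 10
    return data_gbit_s * 1000 / 8


if __name__ == "__main__":
    pcie_v1_x8 = pcie_usable_mb_s("1.0", 8)
    ddr_ib = infiniband_data_mb_s(20)
    print("PCIe v1.0 x8:", pcie_v1_x8, "MB/s")
    print("PCIe v2.0 x8:", pcie_usable_mb_s("2.0", 8), "MB/s")
    print("DDR Infiniband data rate:", ddr_ib, "MB/s")
    # The slower of the two is what the application actually sees.
    print("Bottleneck:", min(pcie_v1_x8, ddr_ib), "MB/s")
```

The point of taking the minimum at the end is that the bus, not the fabric, sets the ceiling for a DDR card in a v1.0 x8 slot.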

And this is where Henry’s arguments lead:

“The reason I picked 8 lanes in a PCIe 2.0 bus is that is the highest performance I generally see in the design for controllers. I also saw a number of SSDs just below and around the performance range of the Intel SSD. Most RAID controllers use a PCIe bus to interface between either caches or communications to backend channels. There are actually some RAID controllers that do not even use PCIe 2.0 (500 MB/sec per lane) and still use PCIe 1.1 (250 MB/sec) per lane. As you can see, even when doing purely random I/O to the SSD, the I/O quickly becomes a bandwidth issue given the high performance of the SSD. 29 SSDs might seem like a lot, but if a few of them were doing larger blocks and sequential I/O at 300 MB/sec, four drives would saturate a PCIe 1.1 bus with 8 lanes.”

Well, let me do a little math correction here if possible. What we measure coming out of these nice Intel X25-E drives is about 170 – 190 MB/s sustained in file systems. Yeah, they claim 220 and some even 300. But you don’t really observe this, except under conditions you won’t (ever) see in a (deployed) server.

190 MB/s is one PCIe v1.0 lane (effectively). 8 drives per 8 lanes gets you 1.5 GB/s.

This is awfully close to the PCIe x8 v1.0 data rate maximum. But that isn’t the real issue right now, more on that in a moment.

Many users will happily connect these SATA devices to the SATA or SAS controllers on their motherboards.

You know, the ones on a PCIe x2 or x4 bus. We see this on systems with the v1 PCIe spec. The v2 systems are a bit better, but still no panacea.

Henry’s point about RAID cards still using the PCIe v1.0 spec is spot on. A v2 slot will negotiate back down to v1 if the card isn’t capable of talking v2.

This is why, if you get an SSD server unit (such as our JackRabbit-Flash systems), you need a well balanced design, specifically one that does not overload the RAID controller with too many SSDs, or you are throwing away bandwidth.
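To make the balance question concrete, here is a rough sketch that asks how many drives at our measured 190 MB/s fit behind a given link before bandwidth gets thrown away. The link rates are the usable PCIe figures from earlier in this post; the function names are hypothetical, and the model ignores controller overhead entirely, so treat the results as upper bounds.

```python
import math


def drives_that_fit(link_mb_s: float, drive_mb_s: float) -> int:
    """Largest number of drives whose combined streaming rate fits the link."""
    return math.floor(link_mb_s / drive_mb_s)


def wasted_mb_s(n_drives: int, drive_mb_s: float, link_mb_s: float) -> float:
    """Drive bandwidth the link cannot carry (0 if the link has headroom)."""
    return max(0.0, n_drives * drive_mb_s - link_mb_s)


if __name__ == "__main__":
    drive = 190  # MB/s, sustained, as measured in file systems
    for name, link in [("x4 v1.0", 880), ("x8 v1.0", 1760), ("x8 v2.0", 3520)]:
        print(name, "carries at most", drives_that_fit(link, drive), "drives")
    # Hypothetical overloaded build: 16 SSDs behind an x8 v1.0 card.
    print("16 drives on x8 v1.0 waste", wasted_mb_s(16, drive, 1760), "MB/s")
```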

But that’s not even the worst problem.

The real problem is that the dedicated storage processors on these RAID cards can rarely keep up with the lower speed I/O connections. Never mind really high speed storage units. They were not designed for moving GB/s in most cases.

Oh, sure, someone will claim theirs can do it, and provide some specialized test cases that show some kernel cranking through the crunching portions very quickly. But … plug these things into a real system and performance is quite different.

We regularly evaluate a range of RAID cards, and explore performance. What we see astounds us (and not in a good way).

This has led to renewed looks at doing the RAID on the processor. With enough cores, you can do this processing quite quickly.

This is one of the design elements behind ZFS used in various storage systems. It reduces bill of material costs, so you don’t have to spend on a RAID card. And it uses a fast processor. What’s not to like?

That it uses the processors which you want to dedicate to computing. Under very heavy IO and simultaneous computing, the I/O significantly reduces the computational cycles available, and under heavy computation the fewer cycles available for I/O reduce the bandwidth.

This makes for a sub-performing design, and is one of several reasons why we love benchmarking against certain vendors’ machines.

But the RAID silicon is usually poorly performing relative to what we need. This is because it wasn’t really designed for continuous streaming data in the GB/s region.

Add to this the cost to move data from disk to the network. Network drivers are (notoriously) inefficient. Networking silicon for 10GbE is getting better, but just try to get a single stream to use all 10Gb without doing handstands to reconfigure the machine, or massively tweak the kernel. You have to disable interrupt balancing, set interrupt affinity, lock pages in RAM, and do many other things to achieve near maximum performance.
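As one small illustration of the handstands involved: on Linux, interrupt affinity is set by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The sketch below builds that mask. The helper names and the CPU numbers are hypothetical, it assumes a machine with fewer than 32 CPUs (so a single hex word is enough), and the write itself needs root. If the irqbalance daemon is still running it may undo the setting, which is why interrupt balancing gets disabled first.

```python
def cpu_mask_hex(cpus):
    """Hex bitmask with one bit set per CPU number, e.g. CPUs 2 and 3 -> 'c'."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus):
    """Pin an IRQ to the given CPUs. Needs root on a Linux host."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask_hex(cpus))


if __name__ == "__main__":
    # Only print the mask here; calling set_irq_affinity() changes system state.
    print(cpu_mask_hex([2, 3]))
```

The actual IRQ numbers for a NIC's queues vary per machine and driver, so they are left out here on purpose.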

Or have multiple threads of lower performance going at once. This seems to be the default mode of the 10GbE designs. Many Rx/Tx queues.

This means that single streams might not be so great, but aggregate streams will do well.

So you have inefficient network units you need to deal with, and suboptimal storage units … at the end of the day, you lose performance transitioning the multiple boundaries between networking and IO. There are multiple buffer copies and multiple data transfers, with silicon on the storage side not well designed for these data rates, and NIC designs on the network side that seem to focus not on single stream performance but on aggregate performance across multiple streams.

Now add SSDs into this mix. Very low latencies. Think 0.5 milliseconds access time, or about 2000 IOPs/unit. I expect to see this dropping further. RAID cards were designed for systems with rotational latencies on the order of 2-10 milliseconds. Their internal controller pathways are not necessarily designed with performance in mind, but rather with silicon and board cost (and size, and power consumption) in mind. So they might not be able to handle N devices talking all at the same time to controller memory. More likely they are built to handle f(N) * N devices, where f(N) is some probability function estimating the fraction of devices ready to talk at a given moment after a request is made. That f(N) will guide how much board real estate, silicon, and bandwidth should be granted to the device channel.

If you have 10 devices, all able to talk at 190 MB/s, then you have to budget for 1900 MB/s of data on your RAID card. If you assume at any one given time that 1/2 of your data will be ready to be read/written (due to rotational latency, etc), you only need 950 MB/s. Designing for 950 MB/s is much less expensive than designing for 1900 MB/s.
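The budgeting argument is simple enough to write down. This is only a sketch of the reasoning above: f(N) is passed in as a plain number rather than modeled as a real probability function, and the function name is made up for illustration.

```python
def controller_budget_mb_s(n_devices, per_device_mb_s, ready_fraction):
    """Bandwidth to provision on the card: f(N) * N * per-device rate."""
    return ready_fraction * n_devices * per_device_mb_s


if __name__ == "__main__":
    # Rotating disks: assume half the devices are ready at any moment.
    disk_design = controller_budget_mb_s(10, 190, 0.5)
    # SSDs: with no rotational latency, assume every device can be ready.
    ssd_demand = controller_budget_mb_s(10, 190, 1.0)
    print("Designed for:", disk_design, "MB/s")
    print("SSDs can ask for:", ssd_demand, "MB/s")
    print("Shortfall:", ssd_demand - disk_design, "MB/s")
```

A card designed around the rotating disk assumption comes up short by exactly the fraction that was optimized away once SSDs push f(N) toward 1.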

1900 MB/s could be read by a 1 GHz processor by reading about 2 bytes per clock cycle in aggregate from the 10 IO controllers (1900 MB/s divided by 1000 MHz is 1.9 bytes per cycle). That means a 16 bit data path to the IO controllers, or a double pumped 8 bit data path, and it assumes every cycle is spent moving data.

The 1GHz processor is on the RAID card. Or could be. In many cases, the clock speeds are 200-600 MHz. Lower power ARM or PPC processors. This is one of the many issues also associated with very long RAID rebuild times BTW.
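A quick way to see how the load on the card's processor grows as the clock drops is to divide the data rate by the clock rate. The sketch below does only that; the function name is hypothetical, and it assumes every cycle is available for moving data, which a real RAID processor will not manage.

```python
def bytes_per_cycle(throughput_mb_s, clock_mhz):
    """Bytes the processor must move per clock cycle (MB/s divided by MHz)."""
    return throughput_mb_s / clock_mhz


if __name__ == "__main__":
    for clock_mhz in (1000, 600, 200):
        needed = bytes_per_cycle(1900, clock_mhz)
        print(f"{clock_mhz} MHz: {needed:.2f} bytes per cycle")
```

At the 200 MHz end of the range, the same 1900 MB/s needs close to five times the per-cycle data movement of the 1 GHz case.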

Well, you can see we can keep digging into this. The point is that most of the components of this system aren’t designed for the tasks they are being asked to do. While Henry is spot on in his observations, my thoughts are we ought to go further and articulate where we could be.

This said, there are a number of exciting developments going on, none of which I can talk about, which will solve a number of these problems (and some that I haven’t discussed, but are just as real, if not more pressing as time goes on).

High performance storage can look like streaming IO, and like seek bound IO. You don’t know what the programs will do with your input decks a priori. But you have to be ready for all of it.
