Misalignment of performance expectations and reality

We are working on a project for a consulting customer. They’ve hired us to help them figure out where their performance is being “lost”.
Obviously, without naming names or revealing information, I note something interesting about this, that I’ve alluded to many times before.
There is an often profound mismatch between expectations for a system and what it actually achieves.
This is, in large part, why we benchmark and test our systems in as realistic a configuration as possible, and report real numbers, while many (most) of our competitors make WAGs at best case/best effort/best condition theoretical numbers.
That said, part of the problem with performance expectations is the assumptions underlying them. One of the things I rail on the current Gluster marketing efforts about is related to the same assumptions. And these assumptions are used as the basis of statements (and in some cases, marketing materials) that are … wrong … at best.
But it’s not just Gluster marketing that has this problem. These core assumptions are often (completely) wrong for a fairly wide range of things, and yet they are used as the basis for many marketing claims that are, at best, specious. Worse than this are benchmark tests that are fatally flawed, unrepeatable, or somehow or other broken, and poorly representative of workloads.
Add it all up and you have no real mechanism to predict performance based upon what is published.
Ok … here’s a simple example. You have a disk. Let’s call it a SATA 7.2kRPM drive. You can get 100 MB/s out of it (ok, I know it’s a little more today, just using easy numbers to make life simpler). If I have 10 of these, I get 1GB/s, right? And 20 will give me 2GB/s. #winning !
Not quite.

That 100MB/s is for a single sequential streaming workload. Very little to no “seeking”, mostly streaming. Queues and pipelines (and caches) are working nearly optimally for this. You are rate limited by head speed above the disk itself … which impacts data rate.
Now make this say, I dunno, 10 streaming workloads. Now you have some seek overhead in this. You rapidly discover that

B(1 thread, streaming) = N(threads) * Baverage(per thread) + Overhead(N(threads))
Beffective = B(1 thread, streaming) - Overhead(N(threads)) = N(threads) * Baverage(per thread)

That is, your maximum bandwidth for a single thread streaming gets split into N(threads), with an average per thread bandwidth (Baverage), and some overhead as a function of the number of threads. That overhead is “vanishingly small” for a single thread, but grows at least linearly for N threads, due to seeks, and other issues. Your effective bandwidth drops to zero as the overhead increases.
So each user of this disk may get a Baverage (average bandwidth per thread) which is lower than the bandwidth for a single thread divided by the number of threads.
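To put some numbers on the equation above, here is a minimal sketch in Python. The linear overhead term and the 5 MB/s lost per additional thread are made-up illustrative values, not measurements; real overhead depends on the drive, the scheduler, and the access pattern. The function names are hypothetical as well.

```python
def per_thread_bandwidth(b_single, n_threads, overhead_per_extra_thread):
    """Baverage (MB/s) from: B(1 thread) = N * Baverage + Overhead(N).

    Assumption: Overhead(N) is zero for one thread and grows linearly
    with each additional thread. This is an illustrative model only.
    """
    overhead = overhead_per_extra_thread * (n_threads - 1)
    # Beffective = B(1 thread) - Overhead(N); it cannot go below zero.
    b_effective = max(b_single - overhead, 0.0)
    return b_effective / n_threads


def model_table(b_single=100.0, overhead_per_extra_thread=5.0,
                thread_counts=(1, 2, 5, 10, 20)):
    """Rows of (threads, Beffective, Baverage, naive B(1 thread)/N)."""
    rows = []
    for n in thread_counts:
        b_avg = per_thread_bandwidth(b_single, n, overhead_per_extra_thread)
        rows.append((n, b_avg * n, b_avg, b_single / n))
    return rows


if __name__ == "__main__":
    print("threads  Beffective  Baverage  naive(B/N)")
    for n, b_eff, b_avg, naive in model_table():
        print(f"{n:7d}  {b_eff:10.1f}  {b_avg:8.2f}  {naive:10.2f}")
```

With these assumed values, every row past a single thread shows Baverage below the naive B(1 thread)/N split, which is the whole point: the overhead comes off the top before anyone gets their share.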
This is the streaming case. Now, what about the random IO case?
Same equation holds. Remember, our naive expectation is that our usable bandwidth will be B(1 thread, streaming) even for the random IO case. It won’t be.
You can take your single thread random IO case, and re-imagine it as N streaming IOs. The overhead jumps enormously though, as each new random IO requires an expensive seek.
So if you think you have 100MB/s available, and you forget about the overhead per seek, yeah, you are going to be scratching your head where all your performance went, when you actually measure it.
The overhead is the killer. Most people building systems really don’t understand how much of a killer it is. IO is usually relegated to secondary or tertiary consideration, though as data sets grow large, this is a very bad/dangerous thing to do. Moreover, expectations of performance are far … far out of whack with reality. What a user seeing 100MB/s on a single threaded streaming case might expect for a random case, shouldn’t be 100MB/s as well (for spinning rust disks). For some of the simpler tests we are doing (random 32k reads and writes to mimic a particular use case), we are seeing from 1-10 MB/s on real hardware.
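A back-of-the-envelope estimate shows why single digit MB/s is about what you should expect for small random IO on a spinning disk. This is a rough sketch, not a benchmark: the 8.5 ms average seek time is an assumed typical figure for a 7.2kRPM SATA drive, the function name is hypothetical, and the model assumes one outstanding IO at a time with no caching or request reordering.

```python
def random_io_throughput_mb_s(io_size_kb=32.0, avg_seek_ms=8.5,
                              rpm=7200, streaming_mb_s=100.0):
    """Estimate random IO throughput (MB/s) for one spinning disk.

    Assumptions (illustrative, not measured): every IO pays a full
    average seek plus half a rotation, and IOs are issued one at a
    time (queue depth 1), with no cache hits.
    """
    # Average rotational latency is half of one revolution.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    # Time to actually move the data at the streaming rate (1 MB = 1024 KB).
    transfer_ms = (io_size_kb / 1024.0) / streaming_mb_s * 1000.0
    service_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    iops = 1000.0 / service_ms
    return iops * io_size_kb / 1024.0


if __name__ == "__main__":
    estimate = random_io_throughput_mb_s()
    print(f"Estimated 32k random IO throughput: {estimate:.2f} MB/s")
```

Under those assumptions the estimate comes out at about 2.4 MB/s, squarely inside the 1-10 MB/s range above. Nearly all of the service time is seek and rotation; the data transfer itself is a rounding error.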
This is almost like an Amdahl’s law of disk performance. Your measured aggregate performance per disk is not constant as you vary the number of threads or the number of (random) IOs.
The fundamental assumption, that performance is being “lost” is incorrect. It was never there to begin with.
But it gets worse.
Now take these disks, where your actual performance is out of line with your expectations, and use Gluster to aggregate them. Gluster promises

Red Hat’s alternative to costly proprietary storage
Paying too much for proprietary storage but worried open source, software-based storage can’t match its performance? Think again.

That sounds an awful lot like a performance promise …

High-performance storage that grows with you
Red Hat Storage Software Appliance is a software-based, scale-out network-attached storage (NAS) appliance. Deploy it in minutes for scalable, high-performance storage in your datacenter or on-premises private cloud. Add compute, I/O bandwidth, or storage as needed to meet changing capacity and performance needs without disrupting data access.

… as does this.
In the bullets on the site, you see this

Freedom from proprietary storage
Instead of becoming locked in to proprietary storage systems, enjoy the significant economic benefits of using commodity hardware and deploying exactly what you need, when you need it.

The deep problem with this is that most people won’t deploy “what they need”. They will deploy the lowest possible cost implementations (I simply cannot bring myself to call them “designs”, as they are not). In many/most cases, these will be individual disks sitting on motherboard attached SATA connectors. Which, given the propensity of tier 1 designs to (massively) oversubscribe minimal numbers of PCIe lanes for network, SATA, and peripherals, pretty much guarantees crappy performance. It’s not that the performance will not be good, it’s going to be crap. Bad … stinky … crap. I’m trying to convey a visual and an olfactory sensation. A big stinking pile of bits it will be.
So now couple the horrible actual performance of local disk with the silver bullet promises above. What do you get?
Badly underperforming systems. Disappointed customers, wondering why their tier 1 purchased hardware is sucking so badly.
We get (many new) customers hiring us to fix it for them.
So I guess I should be thanking Red Hat for all this new revenue?
Landman’s law of crappy designs: A bad design or implementation is going to suck most of the time.
Corollary to above law: No amount of tuning of a bad design will turn it into a good design.
Corollary to Corollary: No amount of money spent trying to tune a bad design will turn it into a good design, until you bit-can the whole kit-n-caboodle, and start fresh, with a good design. Also called “Multiply by zero and add what you need.”

3 thoughts on “Misalignment of performance expectations and reality”

  1. I believe the mismatch stems from currently impossible desires. On the science side, I know a few users who would benefit from near-instant streaming access to a terabyte or twenty slightly irregular subset from a few hundreds of terabytes. They could futz with analysis code in Octave, etc. and play with the data. This is what we want. It’s just not possible right now.
    This really would open doors to new ideas. Right now, they have to manage staging, set up long-running queue jobs, etc. Some of these pieces can be automated, but they still do not have fast enough turn-around time between the idea and the test results.
    Driving me nuts trying to spec out purchasable systems for grant applications. Even the WAGs you cite cannot handle them, and I don’t see any near-ish-future possibilities. SSDs / NVRAM? Cute, but not big enough at sane price points to handle 100s of TB. You fall back on the slowest piece, the “spinning rust,” even for analysis tools parallel across I/O channels. Both the network speeds and the disk system speeds need a major boost for analysis of data from climate, combustion, …
    (And don’t even get me started on the “but we have a bajillion cores!” lines about clusters. sigh. So few people can count the bottlenecks.)

  2. @Jason
    “SSDs / NVRAM? Cute, but not big enough at sane price points to handle 100s of TB.”
    Worth noting that they can handle it technologically. The objection is the price point. This is set by market demand and supply. The demand side is high, and the supply side is currently (very) low.
    This is, I believe, changing. We are being contacted fairly frequently by new (Chinese of course) manufacturers who promise the same performance and reliability at half or less the price of our existing kit. It’s interesting enough that it bears some investigation, but I suspect that pretty soon, the number of flash chip factories is going to explode.
    Give that 2-3 years I think. 2015 should look very different for storage than 2012.
    Networking side … yeah, you are dead on. Performance has not been keeping up with need. Demand remains very high for performance networks. 10GbE is IMO still WAY too expensive. Should be 1/2 or less the price of Infiniband now. 40GbE is out, and so far, only Mellanox seems to have a clue on pricing. Their pricing is on the higher side of reasonable. 100GbE is still a ways away (we just worked on a project designing a pair of storage units for 100GbE end to end testing … we needed to sink/source 12.5 GB/s on each end … and yes, we can do that).
    IMO, these crappy designs are being used in far too many cases, usually when people are unaware of the alternatives. Sometimes the alternatives are more pricey, but you have to balance performance, cost, and opportunity cost in the equation.
    For big data type systems, there is absolutely no reason to ever even consider these crappy designs … yet … we see them …

  3. Some funding agencies aren’t so good at the balance and expect everything for no cost. But climate, combustion, and the like *want* 10s of PBs and can generate more. My 100s of TBs is the best I can manage to propose in a typical funding call. I don’t see SSDs playing in that space even with a 75% drop in price, although I’d love to be wrong. In 3-5 years, increase the desires by at least two orders of magnitude.
    And I just put together one of those crappy designs. The hardware was sitting there, and this was the best I could do with zero chargeable time. wheeee… At least it gives access to 100TB that otherwise was just sitting there (turned on, connected, unconfigured).
