We are working on a project for a consulting customer. They’ve hired us to help them figure out where their performance is being “lost”.
Obviously, without naming names or revealing information, I note something interesting about this, that I’ve alluded to many times before.
There is an often profound mismatch between expectations for a system and what it actually achieves.
This is, in large part, why we benchmark and test our systems in configurations as close to real as possible, and report real numbers, while many (most) of our competitors make WAGs at best case/best effort/best condition theoretical numbers.
This said, part of the problem with performance expectations is the assumptions underlying them. One of the things I rail on the current Gluster marketing efforts about is related to these same assumptions. And these assumptions are used as the basis of statements (and in some cases, marketing materials) that are … wrong … at best.
But it's not just Gluster marketing that has this problem. These core assumptions are often (completely) wrong for a fairly wide range of things, and yet they are used as the basis for many marketing claims that are, at best, specious. Worse than this are benchmark tests that are fatally flawed, unrepeatable, or somehow or other broken, and poorly representative of real workloads.
Add it all up and you have no real mechanism to predict performance based upon what is published.
Ok … here’s a simple example. You have a disk. Let's call it a SATA 7.2kRPM drive. You can get 100 MB/s out of it (ok, I know it's a little more today, just using easy numbers to make life simpler). If I have 10 of these, I get 1GB/s, right? And 20 will give me 2GB/s. #winning !
That 100MB/s is for a single sequential streaming workload. Very little to no “seeking”, mostly streaming. Queues and pipelines (and caches) are working nearly optimally for this. You are rate limited by head speed above the disk itself … which impacts data rate.
Now make this say, I dunno, 10 streaming workloads. Now you have some seek overhead in this. You rapidly discover that
B(1 thread, streaming) = N(threads) * Baverage(per thread) + Overhead(N(threads))
Beffective = B(1 thread, streaming) - Overhead(N(threads)) = N(threads) * Baverage(per thread)
That is, your maximum bandwidth for a single thread streaming gets split across N(threads), with an average per thread bandwidth (Baverage), and some overhead as a function of the number of threads. That overhead is “vanishingly small” for a single thread, but grows at least linearly with N, due to seeks and other issues. Your effective bandwidth drops toward zero as the overhead grows to consume the single thread streaming bandwidth.
So each user of this disk may get a Baverage (average bandwidth per thread) which is lower than the bandwidth for a single thread divided by the number of threads.
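To put rough numbers on the equation above, here is a minimal Python sketch. It assumes the overhead grows exactly linearly with the number of additional threads, which is the most optimistic case the text allows ("at least linearly"). The function name per_thread_bandwidth and the 5 MB/s per extra thread cost are made up purely for illustration, they are not measurements.

```python
def per_thread_bandwidth(b_single_mb_s, n_threads, overhead_per_extra_thread_mb_s):
    """Solve B(1 thread) = N * Baverage + Overhead(N) for Baverage.

    Overhead(N) is ASSUMED linear: zero for one thread, plus a fixed
    (hypothetical) cost in MB/s for every additional thread.
    """
    overhead = overhead_per_extra_thread_mb_s * (n_threads - 1)
    # Beffective = B(1 thread) - Overhead(N); it cannot go below zero.
    b_effective = max(b_single_mb_s - overhead, 0.0)
    return b_effective / n_threads


if __name__ == "__main__":
    b_single = 100.0      # MB/s, the easy number used in the text
    overhead_step = 5.0   # MB/s lost per extra thread (illustrative guess)
    for n in (1, 2, 5, 10, 20):
        naive = b_single / n
        actual = per_thread_bandwidth(b_single, n, overhead_step)
        print(f"{n:2d} threads: naive {naive:6.2f} MB/s each, "
              f"with overhead {actual:6.2f} MB/s each")
```

Even with this gentle overhead model, the per thread number falls away from the naive "divide by N" figure quickly, and real overhead curves are usually less forgiving than a straight line.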
This is the streaming case. Now, what about the random IO case?
Same equation holds. Remember, our naive expectation is that our usable bandwidth for the random IO case will match B(1 thread, streaming). It won’t.
You can take your single thread random IO case, and re-imagine it as N streaming IOs. The overhead jumps enormously though, as each new random IO requires an expensive seek.
So if you think you have 100MB/s available, and you forget about the overhead per seek, yeah, you are going to be scratching your head over where all your performance went when you actually measure it.
The overhead is the killer. Most people building systems really don’t understand how much of a killer it is. IO is usually relegated to secondary or tertiary consideration, though as data sets grow large, this is a very bad/dangerous thing to do. Moreover, expectations of performance are far … far out of whack with reality. What a user seeing 100MB/s on a single threaded streaming case might expect for a random case, shouldn’t be 100MB/s as well (for spinning rust disks). For some of the simpler tests we are doing (random 32k reads and writes to mimic a particular use case), we are seeing from 1-10 MB/s on real hardware.
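A back of the envelope estimate shows why numbers in that range are not surprising. The Python sketch below assumes one outstanding IO at a time, no caching, and an average seek time of 8.5 ms, which is a guess at a typical figure for a 7.2kRPM SATA drive and not a measured value. The rotational latency is just half a revolution at the given RPM. The function name random_io_bandwidth_mb_s is hypothetical.

```python
def random_io_bandwidth_mb_s(io_size_bytes, rpm, avg_seek_ms, streaming_mb_s):
    """Estimate random IO bandwidth for one spinning disk, one IO in flight.

    Each IO pays: average seek + average rotational latency + transfer time.
    avg_seek_ms is an ASSUMED figure; take it from the drive's data sheet.
    """
    # On average the platter has to turn half a revolution to reach the sector.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    # Time to move the payload once the head is in place (MB = 1e6 bytes).
    transfer_ms = io_size_bytes / (streaming_mb_s * 1e6) * 1000.0
    service_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    iops = 1000.0 / service_ms
    return iops * io_size_bytes / 1e6


if __name__ == "__main__":
    bw = random_io_bandwidth_mb_s(
        io_size_bytes=32 * 1024,  # the 32k random IO size from the text
        rpm=7200,
        avg_seek_ms=8.5,          # assumption, not a measurement
        streaming_mb_s=100.0,     # the easy streaming number from the text
    )
    print(f"estimated random 32k bandwidth: {bw:.2f} MB/s")
```

Under these assumptions the estimate comes out at a few MB/s, nowhere near the 100 MB/s streaming figure. Almost all of the service time is seek and rotation, and the actual data transfer is a rounding error.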
This is almost like an Amdahl’s law of disk performance. Your measured aggregate performance per disk is not constant as you vary the number of threads or the number of (random) IOs.
The fundamental assumption, that performance is being “lost”, is incorrect. It was never there to begin with.
But it gets worse.
Now take these disks, where your actual performance is out of line with your expectations, and use Gluster to aggregate them. Gluster promises
That sounds an awful lot like a performance promise …
… as does this.
In the bullets on the site, you see this
The deep problem with this is that most people won’t deploy “what they need”. They will deploy the lowest possible cost implementations (I simply cannot bring myself to call them “designs”, as they are not). In many/most cases, these will be individual disks sitting on motherboard attached SATA connectors. This, given the propensity of tier 1 designs to (massively) oversubscribe minimal numbers of PCIe lanes for network, SATA, and peripherals, pretty much guarantees crappy performance. It's not that the performance will not be good, it's going to be crap. Bad … stinky … crap. I’m trying to convey a visual and an olfactory sensation. A big stinking pile of bits it will be.
So now couple the horrible actual performance of local disk with the silver bullet promises above. What do you get?
Badly underperforming systems. Disappointed customers, wondering why their tier 1 purchased hardware is sucking so badly.
We get (many new) customers hiring us to fix it for them.
So I guess I should be thanking Red Hat for all this new revenue?
Landman’s law of crappy designs: A bad design or implementation is going to suck most of the time.
Corollary to above law: No amount of tuning of a bad design will turn it into a good design.
Corollary to Corollary: No amount of money spent trying to tune a bad design will turn it into a good design, until you bit-can the whole kit-n-caboodle and start fresh with a good design. Also called “Multiply by zero and add what you need.”