A new spin on 'hard cases make for bad laws' … but with benchmark codes

We run (as you might imagine) lots of benchmarks. We do lots of system tuning. We start with null hypotheses and work from there. Sometimes you can call that the baseline expected measurements. Your call on what you want to call it. But a measurement implies a comparison to a known quantity.
In the case of the baseline or null hypothesis, you measure what you should believe to be a reasonable configuration, the way it would be used. You shouldn’t measure a configuration that is unreasonable, or try to measure in an unreasonable manner. Keep it simple. Repeat it. Get an average. See a distribution. Be happy.
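Here is a minimal sketch of the "repeat it, get an average, see a distribution" step in Python. The throughput numbers are made up purely for illustration, and the function name summarize is hypothetical; feed in whatever your own tool reports.

```python
import statistics

def summarize(samples_mb_s):
    """Return (mean, sample standard deviation, relative spread) for repeated runs."""
    mean = statistics.mean(samples_mb_s)
    # stdev needs at least two samples; with one run there is no spread to report
    stdev = statistics.stdev(samples_mb_s) if len(samples_mb_s) > 1 else 0.0
    rel = stdev / mean if mean else 0.0
    return mean, stdev, rel

# Illustrative values only, not real measurements
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
mean, stdev, rel = summarize(runs)
print(f"mean={mean:.1f} MB/s  stdev={stdev:.2f} MB/s  spread={rel:.1%}")
```

If the spread is small compared to the change you are trying to detect, the baseline is probably usable. If it is not, fix that before tuning anything.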
Then you make your changes, and measure. So the fundamental “belief” that you need to have is that your tools will give a generally reasonable result when testing. This is a testable “belief” and yeah, you should test it.
Because not all tools are the same. Some of them, like fio, are simply awesome. Others (I will not mention names) are crap, and very likely vendor specific.
We’ve been getting beaten up over the results from a particular tool. This tool originates at a competitor, comes in binary only form, and generates IO. Does it do it in a reasonable manner? I’ve looked at what it does … and I’ve been skeptical for a long time. So after dealing with another set of “measurements” today, I finally said “screw it, I’m gonna see if this thing can be trusted.”

No, I am not going to name the tool, or the vendor. You would know the latter even if you wouldn’t know the former. Far be it from me to suspect a vendor of rigging a binary only tool for their benefit … oh no … that NEVER happens.
And that isn’t a bash at Intel there (and they are not the vendor). It’s a simple acknowledgement that there are tools out there that will favor particular vendors, usually coming from the vendors themselves. Why would that be? Should be obvious.
This is why we like fio. It would be real hard to slip in code specific to a particular vendor without distributing/showing that code. And that would catch them … red handed.
So we like fio. And fio was what helped us catch this particular crappy code in the act.
We set up a 1.4TB cache on a JackRabbit unit. Set it up as a write through cache. Started writing. Wrote 256GB out (JR4 only has 48GB RAM, so we are really … REALLY … far outside RAM).
Did a no cache read (turned off caching). Not bad, reasonable overall performance. This is our baseline (null hypothesis) for comparison. Did a few measurements. Got about the same number +/- a reasonable error.
Turned on cache, and did a cold cache read. Slightly better than the no cache performance.
With the cache warm, read again.
Very good performance. Extremely good. About what I was hoping for (about 1.6x better than spinning rust, with very little hitting of the spinning rust).
Iterated that test a few times. Very similar results, within a reasonable margin of error. So I know the caching system works, and it works well.
Now use “This Other Tool(TM)” … let’s call that TOT for short. Cleared the cache. Measured against the non-cache config. Results about 20-25% lower than what fio gave us. Ok, I can deal with scale changes, as long as it’s consistent.
Turn on the cache. Do the first read.
OMG … it’s reading, but nothing is going into cache!
Then do an explicit cache warming. cat all the files from that directory into /dev/null. Pulls the files in. Populates the cache. I can see that with dstat.
Retry the cache warming, and I can see it pull only from cache.
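The explicit warming step is nothing fancy. A rough Python equivalent of cat-ing every file in one directory to /dev/null might look like the sketch below. The function name warm_cache and the 1 MiB chunk size are my own choices for illustration, not anything the cache requires.

```python
import os

def warm_cache(directory, chunk_size=1 << 20):
    """Read every regular file directly inside `directory` and discard the data.

    Returns the total number of bytes read. Like `cat * > /dev/null`,
    this does not descend into subdirectories.
    """
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            with open(entry.path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)  # data is dropped, only the count is kept
    return total
```

Call it with the directory that holds the test files, and watch dstat while it runs to confirm the reads land where you expect.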
Good. I know the data is cached.
Retry TOT.
OMG2: Pulls from spinning rust, completely ignoring cache.
Retry the other bits. Yeah, cache is still working nicely. Just this tool is obviously working around cache somehow.
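I watched this with dstat, but if you want a number rather than an eyeball, one option on Linux is to snapshot /proc/diskstats before and after a run and diff the "sectors read" counter for the backing device. This is a sketch under assumptions: it is not what I ran, the device name "sdb" in the usage comment is hypothetical, and it relies on the counter being in 512-byte units.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units

def sectors_read(diskstats_text, device):
    """Return the cumulative 'sectors read' counter for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) >= 6 and fields[2] == device:
            return int(fields[5])
    raise KeyError(device)

def bytes_read_between(before_text, after_text, device):
    """Bytes read from `device` between two snapshots of /proc/diskstats."""
    delta = sectors_read(after_text, device) - sectors_read(before_text, device)
    return delta * SECTOR_BYTES

def snapshot(path="/proc/diskstats"):
    """Read the current contents of /proc/diskstats (Linux only)."""
    with open(path) as f:
        return f.read()

# Usage (hypothetical device name):
#   before = snapshot()
#   ... run the benchmark ...
#   after = snapshot()
#   print(bytes_read_between(before, after, "sdb"))
```

If the data is supposedly warm in cache and the spinning rust still shows roughly the full data set's worth of bytes read, the tool is going around the cache.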
And since I don’t have source, all I can do is hack it at a binary level.
So why should I do this, when I have a tool that works, works well, I have source for, and is rapidly becoming an industry testing standard?
Yeah, I didn’t think so.
So we’ve noted that we are going to explicitly ignore complaints of bad performance from TOT. It obviously does a piss poor job of interacting with the file systems in a meaningful way. And there are better tools out there.
Why is it again that anyone would want to use an inferior binary only tool as compared to a superior open source tool?
Yeah. Didn’t think so.