On benchmarking in general

I wonder if the reason there are so many bad benchmarks, and so many incorrect conclusions drawn from them, comes down, to some significant degree, to a basic misunderstanding of measurement: how to perform one, and what you are actually measuring.
Several years ago, we watched folks who should know better insist that a 2GB bonnie++ run (that is, a 2GB file size) was the only relevant test for their storage systems, and that it told them everything they needed to know about storage. Which we found … well … amusing, in that, for the systems they were looking at, 2GB meant they were dealing with cache. Nothing but cache.
Rather hard to draw a legitimate conclusion about disk speed, when you rarely touch the disk. But that’s what happened.
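A quick sanity check along these lines can be sketched in a few lines of Python. This is only a rough sketch: the function names are made up for illustration, the factor of 2 is an assumed rule of thumb rather than a hard threshold, and it only considers host RAM (page cache), not any cache on a RAID controller or in the drives themselves.

```python
import os


def physical_ram_bytes():
    """Return physical RAM in bytes. Relies on sysconf names that exist on
    Linux and most Unix-likes; this will not work on Windows."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def likely_cache_bound(test_bytes, ram_bytes, factor=2.0):
    """True if the test data set is small enough that much of the I/O could
    be served from memory rather than disk.

    factor is an assumption (a common rule of thumb), not a measured limit.
    """
    return test_bytes < factor * ram_bytes


if __name__ == "__main__":
    ram = physical_ram_bytes()
    two_gb = 2 * 1024**3
    print(f"RAM: {ram / 1024**3:.1f} GiB")
    print(f"2GB test likely cache bound: {likely_cache_bound(two_gb, ram)}")
```

On a machine with, say, 16 GiB of RAM, a 2GB file falls well under the threshold, which is the situation described above: the benchmark mostly exercises memory.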
Part of this is architectural. Without a sound understanding of how the system works and how it is laid out, the system is basically just a black box. Detailed, architecturally sensitive probing measurements, which bonnie++ purports to provide, then don’t tell you much.
Then again, many people don’t really do the measurement part correctly either, even if they understand the architecture. What are you measuring? Are you sure you are measuring what you think you are? Are you running enough cycles for the result to be meaningful, or is OS jitter getting in the way?
We’ve seen benchmarks where people claimed that running in 0.21 seconds was much faster than running in 0.25 seconds. This betrays a fundamental lack of understanding of the precision and duration of benchmarks: with runs that short, a 0.04 second gap can easily sit inside the run-to-run noise. Yet we see this often enough that it makes us laugh.
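To make the point about precision concrete, here is a minimal sketch of what we would want to see before believing a 0.21 versus 0.25 second claim: repeated runs, a mean and a spread for each, and a check of whether the gap between the means is larger than the noise. The function names are hypothetical, the sample timings are invented for illustration, and the two-standard-error threshold is a crude heuristic, not a formal statistical test.

```python
import math
import statistics
import time


def time_runs(fn, repeats):
    """Time fn() repeats times and return the wall-clock durations in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def distinguishable(a, b, k=2.0):
    """Crude check: is the gap between the means larger than k standard
    errors of that gap? k=2.0 is an assumed threshold, not a rigorous test."""
    gap = abs(statistics.mean(a) - statistics.mean(b))
    std_err = math.sqrt(statistics.variance(a) / len(a) +
                        statistics.variance(b) / len(b))
    return gap > k * std_err


if __name__ == "__main__":
    # Invented timings: noisy runs centred near 0.22 s and 0.24 s.
    noisy_a = [0.21, 0.26, 0.19, 0.24, 0.22]
    noisy_b = [0.25, 0.20, 0.27, 0.23, 0.24]
    print("noisy runs distinguishable:", distinguishable(noisy_a, noisy_b))

    # Invented timings: the same nominal gap, but with very little jitter.
    quiet_a = [0.210, 0.211, 0.209, 0.210, 0.212]
    quiet_b = [0.250, 0.251, 0.249, 0.250, 0.252]
    print("quiet runs distinguishable:", distinguishable(quiet_a, quiet_b))
```

The same nominal 0.21 versus 0.25 comparison comes out differently depending on the spread, which is why a single pair of short runs says very little. Running the workload long enough that jitter is a small fraction of the total helps as well.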
Accurate measurement is hard. Meaningful measurement is hard. Being able to draw a meaningful conclusion from data is hard enough without the data being suspect itself.