A plea for sanity in benchmarking SSDs (and storage)

This is really starting to worry me. I see site after site running similar sets of programs against SSDs, generating the same numbers, within error bars.
The problem is that the numbers they generate are meaningless due to several measurement flaws.
First: SandForce controllers compress data. That means some data (say, simple repeating patterns of, oh, I dunno, zeros?) will compress really well and show bandwidths far higher than real use cases will measure. This is a profound problem with some controllers, and most of the testing programs out there aren’t writing effectively incompressible data, or mixing compressible with incompressible data. So for some set of controller chips, the reported bandwidth is (potentially significantly) higher than you will get in practice for streaming writes.
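
To get a feel for how lopsided this can be, here is a minimal Python sketch that compares how well a buffer of zeros, a buffer of random bytes, and a half-and-half mix compress. It uses zlib purely as a stand-in; this is an assumption for illustration, not what any particular controller does internally, and the 1 MiB buffer size is arbitrary.

```python
import os
import zlib

BLOCK_SIZE = 1 << 20  # 1 MiB test buffer (arbitrary size chosen for illustration)


def compression_ratio(buf: bytes) -> float:
    """Return compressed size / original size, using zlib as a stand-in compressor."""
    return len(zlib.compress(buf)) / len(buf)


# Three candidate write payloads a benchmark might use.
zeros = bytes(BLOCK_SIZE)                 # what many simple tests effectively write
random_data = os.urandom(BLOCK_SIZE)      # effectively incompressible
mixed = zeros[: BLOCK_SIZE // 2] + random_data[: BLOCK_SIZE // 2]

for name, buf in (("zeros", zeros), ("random", random_data), ("mixed", mixed)):
    print(f"{name:>6}: compressed to {compression_ratio(buf):.2%} of original size")
```

If your benchmark's write buffer looks like the first case and your real data looks like the second or third, the benchmark is telling you about the compressor, not about the flash.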

Second: some controllers do deduplication. Think of this as extreme run-length compression if you like. Same issue as above. If you write and rewrite the same data again and again, with a system that deduplicates writes … you aren’t actually measuring the underlying storage. You are measuring deduplication, in something of an unrealistic test scenario.
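
A quick way to check whether a test payload is dedup-friendly is to count how many of its blocks are actually distinct. The sketch below does that with fixed 4 KiB blocks and SHA-256 digests; both the block size and the hashing approach are assumptions for illustration, since real controllers may dedupe at a different granularity and by different means.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed dedup granularity; real controllers may differ


def unique_block_fraction(buf: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Fraction of fixed-size blocks in buf that are distinct (1.0 = nothing to dedupe)."""
    blocks = [buf[i:i + block_size] for i in range(0, len(buf), block_size)]
    digests = {hashlib.sha256(block).digest() for block in blocks}
    return len(digests) / len(blocks)


pattern = os.urandom(BLOCK_SIZE)
repeated = pattern * 256                  # the same 4 KiB block written 256 times
fresh = os.urandom(BLOCK_SIZE * 256)      # 256 independently generated blocks

print(f"repeated pattern: {unique_block_fraction(repeated):.4f} unique")
print(f"fresh random:     {unique_block_fraction(fresh):.4f} unique")
```

A payload with a tiny unique fraction gives a deduplicating device almost nothing to store, which is exactly the unrealistic scenario described above.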
Third: as I was broadly hinting above, a fair number of the testing codes aren’t generating what might be realistic workloads. Bonnie++, iozone, ior, … none of these are terribly representative of the load a real SSD would see.
The best test cases are your own codes, running the way you run them. From those you can learn how to change your codes to run more efficiently on the hardware, how to match the hardware to your needs, or something in between. IOMeter doesn’t really match anyone’s workload very well. This is why we like fio: with it we can do a reasonable job of approximating various types of static and dynamic workloads.
Not sure why these sites keep using the same programs with the same flaws. Maybe it’s easier than figuring out what the real issues are. The problem, then, is that people will continue to publish numbers that no one really achieves.
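
As a rough illustration of what approximating a workload with fio can look like, here is a small Python sketch that writes out a fio job description for a mixed random read/write test. The helper name `make_fio_job`, the file path, and every numeric value are hypothetical placeholders, not our actual test configuration; the point is only that the read/write mix, block size, and queue depth are knobs you set to mirror your own application.

```python
def make_fio_job(name, filename, read_pct=70, block_size="4k",
                 iodepth=16, runtime_s=60, size="1g"):
    """Build the text of a fio job file for a mixed random read/write test.

    All defaults are illustrative placeholders; tune them to your own workload.
    """
    lines = [
        f"[{name}]",
        f"filename={filename}",
        "ioengine=libaio",          # Linux asynchronous I/O engine
        "direct=1",                 # bypass the page cache
        "rw=randrw",                # mixed random reads and writes
        f"rwmixread={read_pct}",    # percentage of the mix that is reads
        f"bs={block_size}",
        f"iodepth={iodepth}",
        f"size={size}",
        f"runtime={runtime_s}",
        "time_based",               # run for the full runtime
        "refill_buffers",           # refill write buffers on every submit
    ]
    return "\n".join(lines) + "\n"


job = make_fio_job("mixed-70-30", "/tmp/fio-testfile")
print(job)
```

Saving the printed text to a file and passing that file to fio runs the job. The `refill_buffers` line matters for the compression and deduplication issues above, since it keeps fio from reusing the same buffer contents for every write.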
Before anyone jumps on me about our streaming tests: we have a specific end user application that is used on our systems (in the financial community) that does IO in a very similar way on data sets of these sizes. Our test is a fairly accurate predictor of actual performance on those applications, for our hardware. Moreover, as we learn about markets that are new to us, but in which we are developing a strong reputation, we focus upon their IO use cases, and in one particular instance our streaming tests, and soon other elements, play a very important role.
It’s important to focus upon the real IO, and not on some fictional result that no end user can actually realize. Really … we need our IO measurements to be sane. Especially on SSDs, which hold so much promise for specific use cases.
I could go into great depth, but suffice it to say that you really won’t be seeing 80k IOPS from an SSD whose marketing number is 80k IOPS. It will be much … much less. Which is why we need to test in a realistic manner. We need to know how much worse “much … much less” actually is.