Thoughts on SSDs, spinning rust, …

So SSDs are upon us with a vengeance. No one is actively predicting the death of spinning rust … yet. But it’s in the back of many folks’ minds, even if they aren’t saying it now. Similar to the death of tape. Yeah, I know, it’s still around.
Call that the long tail. Sequential storage mechanisms are going the way of the dodo bird. The issues everyone worries about are cost per data volume, and speed of access/recovery, not to mention longevity. Sure, tape could cost less than spinning rust, but it is serial, and while tapes can “last forever”, the drives certainly can’t. Inexpensive large volume SATA drives, used as integrated drive/media for backup, are rapidly supplanting tape at most of the non-diehard tape sites I am aware of.
Basically, tape is dying out, and being replaced by disks (yeah, there are “counter” examples of this, but they are growing fewer and further between, and actually lending strong support to the thesis that tape is in its long decline). There is an interesting concept coming in from the tape folks that is showing up in SSDs. I am not sure I like it, as it lends itself to incorrect expectations, very easily.
But spinning rust itself is “under attack”. SSDs have great hype, and great hope.

SSDs provide “performance” (purposefully in scare quotes) for end users. If you read the hype, it looks like they provide tremendous performance deltas.
The Sandforce SF-1200 controllers are a case in point. Currently they are reporting 285 MB/s read, and 275 MB/s write. They are the brand new controllers for MLC based units, and most of the press is fairly breathless about this performance.
We use SSDs, and I need to understand how close the marketing numbers are to the actual numbers. We need to establish a ratio for this. Call this the Benchmark Significance Ratio, or BS Ratio for short. Define BS Ratio as
BS Ratio = (what they claim) / (what you measure)
A BS Ratio close to 1 is good. A BS Ratio much greater than 1 is bad. Of course, a BS Ratio much less than 1 is either an indicator of a failed test, or an accidentally released product.
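To make the definition concrete, here is a minimal Python sketch. The function names and the 10% tolerance are made up for illustration. The example call uses the 275 MB/s claimed write speed and the 65 MB/s streaming write figure I report further down in this post.

```python
def bs_ratio(claimed_mb_s: float, measured_mb_s: float) -> float:
    """BS Ratio = (what they claim) / (what you measure)."""
    if measured_mb_s <= 0:
        raise ValueError("measured bandwidth must be positive")
    return claimed_mb_s / measured_mb_s


def verdict(ratio: float, tolerance: float = 0.1) -> str:
    """Classify a ratio; the tolerance is an arbitrary illustrative threshold."""
    if ratio > 1 + tolerance:
        return "claim exceeds measurement"
    if ratio < 1 - tolerance:
        return "measurement exceeds claim (check the test)"
    return "claim matches measurement"


if __name__ == "__main__":
    # Claimed write speed vs. the uncached fio streaming write result.
    r = bs_ratio(275, 65)
    print(f"BS Ratio = {r:.2f}: {verdict(r)}")
```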
So here I am with my nice shiny new SF-1200 based SSD. Actually 2 of them. We are looking at them for a product and an application.
This is not a bash on Sandforce BTW. Don’t read it as that, and it is not intended as that. The BS Ratio bit is more a bash at marketing numbers.
So I attach them to our JackRabbit system, create partitions, set up an xfs file system (also tried a number of others such as ext4, nilfs2).
Then I use a simple standard streaming write fio input file. And I get 65 MB/s for streaming writes (uncached).
Ok. Try streaming reads, also uncached. 200 MB/s.
I don’t mind the latter number, but I am worried about that former number.
So I tried a simple dd, which uses zeros. And I got the marketing rated speed.
Hmmm…. something doesn’t sound right.
So I tried bonnie++ (which I am not as fond of for real testing), and got the benchmark speed as reported by the media.
A quick strace (Strace Is Your Friend) on the dd confirmed it was writing zeros.
I went back to the fio documentation, and found a switch that fills the buffer with zeros.
And I got the rated speed.
Uh oh.
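The effect is easy to look for without fio. Here is a minimal Python sketch that times the same write with zeros and with random bytes. It is not the io-bm code used below, and `write_bandwidth` and the file name are made up for illustration. It ends the timed region with an fsync rather than using direct I/O, so treat it as a rough check, not a replacement for a proper benchmark.

```python
import os
import tempfile
import time


def write_bandwidth(path: str, total_bytes: int, block_bytes: int, zeros: bool) -> float:
    """Write total_bytes to path in block_bytes chunks; return MB/s (1 MB = 2**20 bytes)."""
    # Build the payload before starting the clock so that generating
    # random bytes does not count against the device.
    payload = bytes(total_bytes) if zeros else os.urandom(total_bytes)
    view = memoryview(payload)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        offset = 0
        while offset < total_bytes:
            # os.write may write fewer bytes than asked; advance by what it reports.
            offset += os.write(fd, view[offset:offset + block_bytes])
        os.fsync(fd)  # flush to the device before stopping the clock
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return total_bytes / elapsed / 2**20


if __name__ == "__main__":
    # A temporary directory keeps the example harmless; to test a real
    # device, point the path at a file on that device's mount instead.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "big.file")
        for zeros in (True, False):
            mb_s = write_bandwidth(target, 16 * 2**20, 2**20, zeros)
            label = "zeros" if zeros else "random"
            print(f"{label:>6}: {mb_s:.1f} MB/s")
```

On ordinary storage the two numbers should be about the same. The payload is held in memory, so keep the size well below available RAM, and use a file much larger than any cache on the device if you want numbers that mean anything.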
So I just added a -Z switch to io-bm (use zeros rather than random data), built a RAID0 out of my 2 units, and ran some tests. Same write, single thread, same file, same file name, same mount, file system, yadda yadda yadda.
Writing zeros:

[root@localhost ~]# mpirun -np 1 ./io-bm.exe -n 10 -f /data/d1/big.file -b 1 -w -d -Z
Thread=00000: host=localhost.localdomain time = 24.305 s IO bandwidth = 421.317 MB/s
Naive linear bandwidth summation = 421.317 MB/s
More precise calculation of Bandwidth = 421.317 MB/s

Writing random bits:

[root@localhost ~]# mpirun -np 1 ./io-bm.exe -n 10 -f /data/d1/big.file -b 1 -w -d
Thread=00000: host=localhost.localdomain time = 88.818 s IO bandwidth = 115.292 MB/s
Naive linear bandwidth summation = 115.292 MB/s
More precise calculation of Bandwidth = 115.292 MB/s

This is a BS Ratio of about 3.7. Ugh.
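As a sanity check on the two runs above, here is the arithmetic in Python. Note that the ratio uses the zero-fill bandwidth as the numerator, standing in for the claimed figure. Bandwidth times elapsed time gives the volume written, and both runs come out at about 10240 MB. That would be consistent with a 10 GB file counted in 1024 MB units, though that reading of -n 10 is my assumption here.

```python
# Figures copied from the two io-bm runs above.
zeros_time_s, zeros_mb_s = 24.305, 421.317
random_time_s, random_mb_s = 88.818, 115.292

# Bandwidth times elapsed time gives the volume written in each run.
zeros_volume_mb = zeros_mb_s * zeros_time_s
random_volume_mb = random_mb_s * random_time_s

# Zero-fill bandwidth over random-data bandwidth.
ratio = zeros_mb_s / random_mb_s

print(f"zeros run wrote  ~{zeros_volume_mb:.0f} MB")
print(f"random run wrote ~{random_volume_mb:.0f} MB")
print(f"ratio = {ratio:.2f}")
```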
With my naive understanding of the situation growing gradually more sophisticated, this is something of a redux of what we see in the tape world. They happily talk about compressed bandwidth of 2x native bandwidth and advertise this. But that’s only true for compressible data … not all data is compressible.
It appears to be the same case with some of the SSDs. There are valid reasons for the compression. But the performance difference is huge. Almost 4x.
We’ve got more testing to do on these SSDs. Suffice it to say that most of our customers aren’t storing zero bytes everywhere.
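To see how far apart zeros and random data sit in terms of compressibility, here is a small Python sketch. It uses zlib purely as a stand-in; I don’t know what algorithm the controller uses. The helper names are made up for illustration, and `mixed_buffer` is just one crude way to get data that is partly compressible.

```python
import os
import zlib


def compressed_fraction(data: bytes) -> float:
    """Return compressed size / original size, using zlib as a stand-in compressor."""
    return len(zlib.compress(data)) / len(data)


def mixed_buffer(size: int, random_fraction: float) -> bytes:
    """Build a buffer that starts with random bytes and is padded out with zeros."""
    n_random = int(size * random_fraction)
    return os.urandom(n_random) + bytes(size - n_random)


if __name__ == "__main__":
    size = 2**20
    for frac in (0.0, 0.5, 1.0):
        buf = mixed_buffer(size, frac)
        print(f"{frac:.0%} random -> {compressed_fraction(buf):.1%} of original size")
```

All zeros shrinks to almost nothing, while fully random data does not shrink at all. Real data lands somewhere in between, which is why a benchmark that only writes zeros tells you so little.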

10 thoughts on “Thoughts on SSDs, spinning rust, …”

  1. Hi Joe,
    Re: Bonnie++ – when I was testing ZFS compression way back I patched it to use data from /dev/random rather than just 0’s, absolutely essential for that case – sounds like something similar is needed here.

  2. For those who are just storing 0’s – I suspect you can optimise your code very easily by just opening /dev/null for writes and /dev/zero for reads. *Way* faster than any disk I’ve tested..

  3. @Chris:
    I am thinking of some interesting experiments relative to this. Basically providing a well defined “random” and “repeating” pattern. Then measuring performance. I am thinking there are some positional effects as well.
    Yeah, it compresses on the fly. My concern with this is that compression is a highly variable thing. It is also possible that compression causes a larger overall size than the original.
    But I suspect, if I didn’t indicate it explicitly before, that the 285 MB/s and 275 MB/s numbers are just like the tape 2x compression numbers. You won’t hit them most of the time … you will be about 1/2 or less.
    I’d like a way to turn off compression, and see if we still get performance. Something tells me that it isn’t likely.
    My concern is that people will base the price performance ratio upon the 2x numbers and not the real numbers. These tests have been very instructive on where our expectations should be.

  4. It’s nice to do compression on the fly for you (I assume it’s doing it in the controller since there should be no driver for the disk that would utilize the CPU) but of course the benchmark numbers will be all over the place depending upon the workload.
    I agree with Chris – /dev/random is the best way to go. Just need to push other benchmarks to do the same (I think IOZone uses zeros – need to fix that).

  5. Just a quick update on IOZone. It uses a data buffer that is variable in its ability to be compressed. You can control the level of compression (dedup) at the command line. Very cool stuff.

  6. Interestingly the compression means that (theoretically) the device can store more than its rated capacity, if it wasn’t for the fact it has to be faked to look like a spinning disk for the OS..

  7. @Chris
    The compression is there in order to reduce the number of erase blocks needed to store a file. So if I have 1MB of data that fits in 8 erase blocks by default, and I can compress this to 6 erase blocks, then I effectively increase my usable storage life as I have needed to erase and write fewer blocks. This is a different form of wear leveling.
    I think this is a way to detect these sorts of compression bits. Send compressible and then incompressible streams. Ordinary storage should provide the same bandwidth regardless of the streams.
    Overall this is an interesting bit. I’d like to see them report what people measure, not best theoretical case.

  8. @Joe
    Ahh, now that makes some sense, an interesting twist to that problem.
    Although I can see that now the lifetime of your SSD depends not just on your usage patterns but also now on the compressibility of the data. Hopefully it won’t get used as an excuse – “sorry sir, but you’ve been keeping the wrong type of data on your drive”.
