Setting expectations for SSDs versus Flash

Nomenclature: SSD is a physical device that plugs into an electrical disk slot. Flash is a PCIe card. Both use the same underlying back-end storage technology (SLC, MLC, and related flash chips).

I’ve had a while to do some testing with a large number of SSD units in a single device, so I can give you a definite sense of what I’ve been observing.

First: SSDs are, of course, fast for certain operations.

Second: there’s a whole lotta er … marketing numerology … around SSDs.

Ok. So imagine we have 48 very late model SandForce 22xx-equipped SSDs in a single chassis. Call this thing an SSD array. Imagine that we’ve done some experimentation with various RAID cards, including some from a vendor that has not been announced/released, and with some dumber HBAs.

If we believe the underlying theory behind SSDs, and aggregates of them, then for reasonable configurations of SSDs (that does not mean RAID0, but RAID5s with a smaller chunk size) we should be able to approach the theoretical maximum number of IOPs, assuming that the RAID calculation engine can keep up with the SSDs.

This assumption is, sadly, incorrect.

The theoretical maximum IOP rate for this unit, assuming the vendors don’t … er … embellish too badly … is about 2.4M IOPs, or roughly 50k IOPs per drive. After putting the drives into RAID5s, the best possible case should be about 2.1M IOPs.

What do we achieve?

Roughly 1/10th that number, for 8k random reads across files much larger than RAM. Well … that’s using the HBAs. Using the RAID cards, it’s 1/30th that number.

And we see that 1/10th number again and again in our measurements. Doesn’t matter what we measure, the IOP rates never come close to the theoretical max, and are always hovering around 10% of it.

This suggests a nice rule of thumb: take whatever they promise, and shift the decimal once to the left. That will be much closer to the IOP rate you will actually achieve.

What about with Flash cards?

There we have seen numbers from 100k to 1M IOPs. For the Virident cards we’ve measured, the quoted numbers were what we achieved without heroics.

Remember when I indicated I’d found a very legitimate reason for PCIe cards? Well, it’s IOPs. Streaming … not so much; we can get decent streaming performance out of spinning rust, so there’s no real advantage to SSD or Flash there. But IOPs … yeah … IOPs.

SSDs are in the $2.50 +/- 0.50 USD/GB region today. SLC flash ranges up to $50/GB, and MLC flash is on the order of, call it, $20/GB or less. Actually this varies quite a bit as well. I can’t talk publicly about some of the pricing we’ve seen, but it’s getting interesting.

So, a $2/GB SSD unit is, in reality, in the 5-10k IOP region for applications (not meaningless benchmarks). A $50/GB Flash unit (25x the pricing) is in the 300k-1M IOP range for applications. That’s 50-100x the performance.

If you look at it this way, for IOP-heavy apps, the Flash pricing isn’t outrageous. But you have to compare similar things to understand this. Don’t compare them to 7200 RPM spinning rust for bulk data service; that’s not what flash is good at. Compare them on the IOP-heavy apps, where that poor spinning rust disk is maxing out at 110 IOPs, the SSD is cruising along at 50-100x that, and the PCIe flash cruises along at 25-50x that of the SSD.

We have to set expectations correctly. Getting the nomenclature and real performance measures correct is very important. Avoiding the marketing numbers … also very important.


10 thoughts on “Setting expectations for SSDs versus Flash”

  1. Posting your benchmark (fio?) scripts would put much more meaning behind your various performance claims. Doubly so if you mention specific hardware (besides the jackrabbit) that you can talk about. I’ve seen even relatively small tweaks to workloads make huge differences in performance. Sadly, it seems that some popular benchmarks zero-fill blocks, and the SandForce recognizes these and produces mostly meaningless numbers.

  2. @Bill

    ; streaming write test: 8 jobs, sequential 1m writes, one 256g file per job, buffered I/O
    [global]
    size=256g
    iodepth=512
    blocksize=1m
    ioengine=vsync
    numjobs=8
    nrfiles=1
    create_serialize=0
    create_on_open=1
    group_reporting
    direct=0
    rw=write
    
    
    [s1]
    directory=/data
    
    

    and

    ; streaming read test: same layout, sequential 1m reads
    [global]
    size=256g
    iodepth=512
    blocksize=1m
    ioengine=vsync
    numjobs=8
    nrfiles=1
    create_serialize=0
    create_on_open=1
    group_reporting
    direct=0
    rw=read
    
    
    [s1]
    directory=/data
    
    

    These definitely aren’t doing zero fill. You have to work at getting fio to do zero fill. I made comments on that last year.
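
    To be explicit about what that takes: assuming a fio build recent enough to have the zero_buffers option, you would have to add something like this to the [global] section:

    ; force zero-filled I/O buffers (NOT what the jobs above do;
    ; by default fio fills buffers with pseudo-random data)
    zero_buffers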

  3. … erp … those are the streaming versions. To get the IOPs versions, change rw from read/write to randread/randwrite, and set the blocksize to 8k (something like the sketch below). It doesn’t hurt to look at using libaio, though I’ve found it really doesn’t do much.
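
    To spell it out, the randread (IOPs) version would look something like this (the same job as above, with just those two options changed):

    ; IOPs version of the jobs above: 8k random reads instead of 1m sequential I/O
    [global]
    size=256g
    iodepth=512
    blocksize=8k
    ioengine=vsync
    numjobs=8
    nrfiles=1
    create_serialize=0
    create_on_open=1
    group_reporting
    direct=0
    rw=randread

    [s1]
    directory=/data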

  4. Part of the reason for this post is that I’ve been getting annoyed at the completely meaningless “let’s RAID0 the SSDs and then read/write zeros as our test case … woo-hoo, we get some magical number …” reports I see in the “popular” benchmarking press.

    Start with real configurations that people would build for real use … and no, no one in their right mind would store data with any level of permanence on a RAID0 …

  5. Though I love this blog, I’m not in love with this terminology: “SSD is a physical device that plugs into an electrical disk slot. Flash is a PCIe card.” “Flash” should really not be used (in my opinion) other than to describe the underlying material of both Flash-based SSDs and PCI-e attached Flash-based storage. I’m not saying “PCI-e attached Flash-based storage” rolls off the tongue terribly well ;), which may be why you used the terminology in this post, but in general I think it’s best to avoid diluting terms for convenience if it makes it confusing in future uses (especially if future uses refer to new attaching architectures).

  6. Hey Joe – long time no talk! Hope everything is going well.

    Given the compression capability of SandForce SSDs I like to vary the compressibility of the input data to get an idea of how the performance changes. I use IOzone and vary the compressibility level (it’s called dedupability in IOzone) and measure the performance (sequential and IOPS).

    I also run the various IO patterns of IOzone. It’s very interesting to see how performance changes with compressibility for certain IO patterns. For the simple home SSD that I tested, some IO patterns don’t show as much impact from data compressibility as one would expect.

    Plus I compared the performance to an Intel X-25E and found that even for 98% incompressible data, the write performance for the SandForce SSD was still better than Intel (kind of surprising).

    Personally I love the data compression feature of SandForce. If I’ve got data that is fairly compressible then I get a speed boost. It may be interesting to think of padding the output to the drive with zeros so that the output is better aligned and more compressible. But this uses more space in the interest of some performance gains (definite trade-off there). But, back to the drive performance: I will also measure the SandForce SSD performance with almost incompressible data to get an idea of worst-case performance (it’s where us pessimists like to start 🙂 ).

    Thanks!

    Jeff

  7. Apparently the SandForce SF1200 does de-duplication as well as compression, which could be good for some things but has raised concerns for filesystems that deliberately store duplicate metadata for safety. Of course you’d have to be very unlucky for that block to die, but if it did, you could be in trouble …

  8. @Ellis

    I understand your concerns on the naming/notation aspects. They are all Flash technology; the interconnect mechanism and the software/firmware stacks traversed are what differ. I guess I could call them disk Flash versus PCIe Flash. Basically I am looking for a convenient, easy, and meaningful way of differentiating between the two.

    Think about disk technology. You have internally connected SAS, SATA, FC disk. You have external JBOD and RAID arrays. You have USB and Firewire attached disk.

    In most cases, the reference has been to the connection technology rather than the underlying physical technology. That is, you can say RAID disk, SAS disk, JBOD disk, …

    And in many cases, people drop the “disk” portion. So this becomes SAS, SATA, JBOD, RAID, … with any supplemental interconnect mechanism (JBOD SATA, RAID SAS, …) sometimes mentioned.

    I was attempting to use SSD Flash and PCIe Flash the same way to differentiate, so that SSD (Flash) is the instance that goes through the disk controller pathway, and PCIe (Flash) is the direct-attached block device over PCIe. I thought I could omit the word “Flash”.

    Maybe this makes it clearer … but if you still don’t like it, I’d welcome a clean/easy way to help people understand the differences. Basically it’s the same issue as with disk: if I talk simply about disk, then I need more specificity on the interconnect and grouping technology to discuss performance metrics. I’d argue the same is true for any instance of a Flash physical object.

    This said, feel free to suggest something. I’m open to good ideas.

  9. @Jeff

    Indeed … we should try to communicate more!

    It would be good to know the nature of the compression. Obviously lossless 🙂 , but still I’d like to know more specifics … RLE, other techniques, …

    I like the concept of what they are doing. From what I can see with the SF22xx controllers, the performance on less compressible data has improved immensely. I’ll generate some data for this in the next few days/weeks and post it.
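
    For anyone who wants to play along before then: newer fio builds can vary the compressibility of the write buffers directly. A minimal sketch, assuming your fio has the refill_buffers and buffer_compress_percentage options, would look something like:

    ; 8k random writes with buffer contents targeted at roughly 50% compressibility
    [global]
    size=256g
    blocksize=8k
    ioengine=vsync
    numjobs=8
    direct=0
    rw=randwrite
    refill_buffers
    buffer_compress_percentage=50
    group_reporting

    [s1]
    directory=/data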

  10. @Chris

    This is actually an issue for any block-level dedup. Some data should not be dedup’ed, IMO … it makes each bit more valuable, and potentially increases the risk of data loss if the dictionary gets corrupted.

    Dedup is great for some things … effectively random binary data isn’t one of them. VMs are a good use case for dedup, as are iSCSI boot images for a cluster.
