Using fio to probe IOPs and detect internal system features
By joe
- 4 minutes read - 775 words

Scalable Informatics JackRabbit JR3 16TB storage system, 12.3TB usable:
[root@jr3 ~]# df -m /data
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sdc2 12382376 425990 11956387 4% /data
[root@jr3 ~]# df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/sdc2 12T 417G 12T 4% /data
These tests are more to show the quite remarkable utility of the fio tool than anything else. You can probe real issues in your system (as compared to a broad swath of 'benchmark' tools that don't really provide a useful or meaningful measure of anything).

This is on a RAID6, so it's not really optimal for seeks. The benchmark is 8k random reads, with 16 threads, each reading 4GB of its own file (64GB in aggregate, well beyond cache, but we are using direct IO anyway). It is a 16 drive RAID6 with 1 hot spare and 2 parity drives, giving 13 data drives. Using a queue depth of 31 per drive, these 13 data drives have an aggregate queue depth of 403 (13 x 31). Of course, in RAID6 it's really less than that, as you are doing 3 drive reads for every short read.

We get asked often if customers can benchmark our units for databases, and we tell them yes, with the caveat that we need to make sure they are configured correctly for database workloads (SQL type, seek based). This configuration is quite important. Here is the fio input file:
[random]
rw=randread
size=4g
directory=/data
iodepth=403
direct=1
blocksize=8k
numjobs=16
nrfiles=1
group_reporting
ioengine=sync
loops=1
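The queue depth arithmetic behind that iodepth=403 choice can be sketched in a few lines. This is just a back-of-the-envelope check of the figures stated above (16 drives, 1 hot spare, 2 parity, queue depth of 31 per drive); the variable names are mine, not fio's:

```python
# Aggregate queue depth for the RAID6 layout described above.
# Figures come from the text; this just makes the arithmetic explicit.
drives_total = 16
hot_spares = 1
parity_drives = 2
per_drive_qd = 31   # per-drive (NCQ/TCQ) queue depth assumed in the text

data_drives = drives_total - hot_spares - parity_drives   # 13 data drives
aggregate_qd = data_drives * per_drive_qd                 # 13 x 31 = 403

print(data_drives, aggregate_qd)
```

As noted above, the effective depth is really lower than 403, since each short read in RAID6 costs roughly 3 drive reads.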
And here are the results:
[root@jr3 ~]# fio random.fio
random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=403
...
random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=403
Starting 16 processes
^Cbs: 16 (f=16): [rrrrrrrrrrrrrrrr] [2.0% done] [8,061K/0K /s] [984/0 iops] [eta 02h:27m:36s]
fio: terminating on signal 2
random: (groupid=0, jobs=16): err= 0: pid=30405
read : io=1,483MiB, bw=8,415KiB/s, iops=1,051, runt=180507msec
clat (usec): min=38, max=191K, avg=17101.10, stdev=2927.27
bw (KiB/s) : min= 257, max=11992, per=5.56%, avg=468.18, stdev=128.30
cpu : usr=0.07%, sys=0.23%, ctx=203801, majf=0, minf=1821
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=189886/0, short=0/0
lat (usec): 50=0.06%, 100=1.95%, 250=3.38%, 500=0.39%, 750=0.15%
lat (usec): 1000=0.05%
lat (msec): 2=0.06%, 4=1.10%, 10=23.37%, 20=48.04%, 50=19.95%
lat (msec): 100=1.45%, 250=0.04%
Run status group 0 (all jobs):
READ: io=1,483MiB, aggrb=8,415KiB/s, minb=8,415KiB/s, maxb=8,415KiB/s, mint=180507msec, maxt=180507msec
Disk stats (read/write):
sdc: ios=189723/0, merge=0/0, ticks=2877670/0, in_queue=2877810, util=100.00%
So this is looking like 1k IOPS for this test case, on a system not configured/designed for seek loads. In fact, if you look at the latency distribution, you can see a broad peak from 10 to 50 milliseconds. Seek time is ~8ms on these drives, and you need to do 3 drive reads. I'd expect that means your latency would land somewhere between 8 and 3x 8 milliseconds, but there is probably enough of a seek delay that, if you miss one rotation, you might be forced to 3x (8+8) or 48 milliseconds. That seems to be represented in the data.

OK, let's change this from direct IO (uncached) to regular IO (cached). Sometimes cache is a good thing. Sometimes it is not. For seek bound loads which are much larger than physical RAM, or local cache RAM, cache usage is problematic in that it is basically wasted. This is why we have fadvise and other POSIX-like mechanisms to help the system optimize its memory/cache usage. Don't cache what you won't reuse.
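The latency bounds argued for above, and the measured IOPS, can be checked with a few lines of arithmetic. This is a sketch using the figures from the text (the ~8ms seek time is the quoted drive spec, and "3 drive reads per short read" is the RAID6 estimate given earlier); the ios and runtime values are taken from the fio disk stats above:

```python
# Rough latency bounds for an 8k random read on this RAID6 setup.
seek_ms = 8        # per-drive seek time quoted in the text
reads_per_io = 3   # drive reads per short RAID6 read, per the text

best_case_ms = seek_ms                            # single seek, no penalty
serial_case_ms = reads_per_io * seek_ms           # 3 x 8 = 24 ms
# Missing one rotation on each read adds roughly another seek-equivalent:
worst_case_ms = reads_per_io * (seek_ms + seek_ms)  # 3 x 16 = 48 ms

# Cross-check the measured rate against fio's reported iops=1,051:
ios, runtime_ms = 189886, 180507   # from "issued r/w" and "runt" above
iops = ios / (runtime_ms / 1000)   # roughly 1k IOPS, as observed

print(best_case_ms, serial_case_ms, worst_case_ms, round(iops))
```

The 10-50ms latency peak in the fio histogram sits squarely between these bounds, which is what makes the rotation-miss explanation plausible.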
[root@jr3 ~]# fio random-cached.fio
random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=403
...
random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=403
Starting 16 processes
^Cbs: 16 (f=16): [rrrrrrrrrrrrrrrr] [2.5% done] [6,759K/0K /s] [825/0 iops] [eta 02h:54m:27s]
fio: terminating on signal 2
Jobs: 5 (f=5): [E_E_r_r____rrr_E] [2.8% done] [7,471K/0K /s] [912/0 iops] [eta 02h:40m:25s]
random: (groupid=0, jobs=16): err= 0: pid=30431
read : io=1,860MiB, bw=6,966KiB/s, iops=870, runt=273425msec
clat (usec): min=84, max=284K, avg=20382.77, stdev=3316.73
bw (KiB/s) : min= 204, max= 638, per=5.64%, avg=392.56, stdev=13.36
cpu : usr=0.06%, sys=0.33%, ctx=476943, majf=0, minf=2732
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=238101/0, short=0/0
lat (usec): 100=0.01%, 250=0.62%, 500=0.01%, 750=0.01%, 1000=0.08%
lat (msec): 2=0.31%, 4=0.22%, 10=15.48%, 20=54.51%, 50=26.09%
lat (msec): 100=2.56%, 250=0.12%, 500=0.01%
Run status group 0 (all jobs):
READ: io=1,860MiB, aggrb=6,966KiB/s, minb=6,966KiB/s, maxb=6,966KiB/s, mint=273425msec, maxt=273425msec
Disk stats (read/write):
sdc: ios=476192/0, merge=0/0, ticks=4358100/0, in_queue=4358070, util=100.00%
And again, you can see the wide peak which represents disk latency for 3 reads. You don't expect good IOP rates on a RAID6 … it is not designed for seek based loads. Streaming loads are what RAID6 is good for. Fio shows us why. That's why we like using it.