for our little conejo (L. Flavigularis).
Create a large file filled with zeros: 128 blocks of 1,024,000,000 bytes, about 131 GB (122 GiB).
    [root@jackrabbit 2]# time dd if=/dev/zero of=big_file bs=1024000000 count=128
    128+0 records in
    128+0 records out
    real 3m59.539s
    user 0m0.000s
    sys 3m38.978s
    [root@jackrabbit 2]# ls -alF big_file
    -rw-rw---- 1 root landman 131072000000 Nov 8 22:11 big_file
    [root@jackrabbit 2]# du -h big_file
    123G big_file
Call it 4 minutes, or 240 seconds, to create a file that du reports as 123 GB. That is a little north of 500 MB/s write, and a very cache-unfriendly write at that.
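As a sanity check on that arithmetic, here is a small Python sketch that recomputes the write rate from the dd figures above. The variable names are just for illustration; the only inputs are the block size, count, and wall-clock time from the transcript.

```python
# Figures taken from the dd run above.
block_size = 1_024_000_000    # bs=1024000000
count = 128                   # count=128
elapsed_s = 3 * 60 + 59.539   # "real 3m59.539s"

total_bytes = block_size * count
mb_per_s = total_bytes / elapsed_s / 1e6      # decimal megabytes per second
mib_per_s = total_bytes / elapsed_s / 2**20   # binary mebibytes per second

print(f"{total_bytes} bytes = {total_bytes / 2**30:.1f} GiB")
print(f"{mb_per_s:.0f} MB/s, {mib_per_s:.0f} MiB/s")
```

Whichever unit you prefer, the result lands a little north of 500.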
My goal was to time md5sum. Watching it run, it looks like it is limited by computation, at about 350 MB/s, so this may not be a good test, since the test maxes out before the hardware does. Then again, looking at the CPUs, I see about 14% user-space usage and another 8% in system usage. So my guess is that it is doing character reads, a small buffer at a time. Ugh. Should look at this code at some point.
    [root@jackrabbit 2]# time md5sum big_file
    035abc0213f9b8a5c2245d5093b8bbce big_file
    real 12m1.114s
    user 6m32.309s
    sys 3m5.664s
Ok, need a better reader. That was … slow, and it didn’t look like it was slow due to the disks …
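One way to test the small-buffer guess, without reading the md5sum source, would be to hash the same file with deliberately large reads and compare timings. The sketch below is an assumption about how one might do that in Python; `md5_file` and its `block_size` default are hypothetical names and values, not anything from md5sum itself.

```python
import hashlib

def md5_file(path, block_size=64 * 1024 * 1024):
    """Return the hex MD5 of a file, reading it in large blocks.

    block_size is an illustrative default; vary it to see whether
    read size changes the hashing rate at all.
    """
    digest = hashlib.md5()
    buf = bytearray(block_size)       # one reusable buffer
    view = memoryview(buf)            # lets us hash a slice without copying
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)
            if not n:                 # 0 bytes means end of file
                break
            digest.update(view[:n])
    return digest.hexdigest()
```

If the rate does not move as the block size grows, the hash computation itself is the more likely limit.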
Ok, so I wrote a quick reader. Read As Fast As Possible (RAFAP).
Running it. Looks like it's pegged at 650 MB/s according to dstat. Hmmm… vmstat and dstat are reporting numbers that differ by a factor of 2. Going to have to investigate that at some point (dstat has been quite reliable in the past).
    [root@jackrabbit 2]# ~/rafap.exe -f big_file -d -b 102400000
    default block_size = 102400000 bytes
    processing arguments ...
    filename = big_file
    len(filename) = 8
    block size = 102400000
    [main] file is 131072000000 bytes
    [main] block size = 102400000
    [main] n_blocks = 1280
    Done opening file on master node
    Allocating buffers ...
    [main] starting loop
    Milestone 0 to 1: time = 0.000s
    Milestone 1 to 2: time = 373.097s
    N(bytes) = 131072000000
    N(Megabytes) = 125000.000
    IOPs = 3.4
    IO BW (MB/s) = 335.033
    delta T (s) = 373.097294
(if you want a copy of rafap, send me a note)
So it looks like, using fopen/fread, we are limited by the operating system: the user load was 1-2%, while the system load was around 20%. IOzone and other tools are still pushing quite a bit higher than this.
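For anyone who wants to repeat the measurement without rafap, here is a minimal Python sketch of the same idea: read the file front to back in big blocks and time it. This is not rafap. The names `timed_read` and `summarize` are made up for this example, and it uses unbuffered `readinto` calls rather than fopen/fread.

```python
import time

def timed_read(path, block_size=100 * 1024 * 1024):
    """Read path front to back in block_size chunks; return (bytes_read, seconds)."""
    buf = bytearray(block_size)       # one reusable buffer, no per-read allocation
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:   # unbuffered: no extra stdio-style copy
        while True:
            n = f.readinto(buf)
            if not n:                 # 0 bytes means end of file
                break
            total += n
    return total, time.perf_counter() - start

def summarize(total_bytes, seconds, block_size):
    """Turn a byte count and elapsed time into the figures the run above printed."""
    n_blocks = -(-total_bytes // block_size)   # ceiling division
    return {
        "n_blocks": n_blocks,
        "mib": total_bytes / 2**20,
        "iops": n_blocks / seconds,
        "mib_per_s": total_bytes / 2**20 / seconds,
    }

# Plugging in the numbers from the rafap run above reproduces its summary lines.
print(summarize(131_072_000_000, 373.097294, 102_400_000))
```

Note that the "Megabytes" in the rafap output work out to units of 2^20 bytes, so the 335 figure is MiB/s.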
I tried some O_DIRECT bits to turn off caching. No impact. I wonder if I am getting zonked by kernel memory/user memory copying effects. Alas, I am running rPath OpenFiler, and it is somewhat short of tools, so I am trying to build them in an rPath VMware session and copy them over. It could also be file system journaling issues. I tried creating the journal on a different device, but it crashed the mount command when I tried mounting it. Gaak.
Hmmm… Maybe alignment issues, I didn’t take any pains to align the buffers. Will look at this.
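To make the alignment point concrete: O_DIRECT on Linux generally wants the buffer address, the read length, and the file offset all aligned to the device's block size. A rough Python sketch of an aligned reader follows. It assumes Linux, uses an anonymous `mmap` as an easy way to get a page-aligned buffer, and falls back to ordinary cached reads where O_DIRECT is refused (tmpfs, for example). The function name `read_aligned` is hypothetical.

```python
import mmap
import os

def read_aligned(path, block_size=1 << 20, direct=True):
    """Read a whole file through a page-aligned buffer.

    Returns (bytes_read, used_direct). Falls back to cached reads
    if the O_DIRECT attempt raises OSError.
    """
    if block_size % mmap.PAGESIZE:
        raise ValueError("block_size must be a multiple of the page size")
    direct = direct and hasattr(os, "O_DIRECT")
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    buf = mmap.mmap(-1, block_size)   # anonymous mapping, starts on a page boundary
    fd = -1
    try:
        fd = os.open(path, flags)
        total = 0
        while True:
            n = os.readv(fd, [buf])   # read straight into the aligned buffer
            total += n
            if n < block_size:        # short read on a regular file: end of file
                break
        return total, direct
    except OSError:
        if not direct:
            raise                     # a real error, not an O_DIRECT refusal
    finally:
        if fd >= 0:
            os.close(fd)
        buf.close()
    return read_aligned(path, block_size, direct=False)
```

Stopping on the first short read matters here: after a partial final block the file offset is no longer aligned, and a further O_DIRECT read at that offset may fail rather than return 0.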
FWIW: other simple tests seem to place many of the large block “random” reads at north of 500 MB/s. Would like to see this do better. Looking at block size effects. Will see if block size reduction helps or hurts random IO. I think it will actually hurt it, as the controllers will thrash. Will also try larger block sizes, to see if we can un-thrash it. I would rather be limited by larger block reads than by smaller ones, as I can hide some more latency in there.
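A block-size sweep like the one described could be sketched along these lines in Python. This is only an outline under assumptions: `random_read_bw` is a made-up helper, reads go through the page cache, and the numbers only mean something when the file is much larger than RAM (as the 131 GB file here is meant to be).

```python
import os
import random
import time

def random_read_bw(path, block_size, n_reads=64, seed=0):
    """Do n_reads block-aligned reads at random offsets; return (bytes_read, seconds)."""
    n_slots = os.path.getsize(path) // block_size   # whole blocks that fit in the file
    if n_slots == 0:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)         # fixed seed so runs are comparable
    fd = os.open(path, os.O_RDONLY)
    try:
        total = 0
        start = time.perf_counter()
        for _ in range(n_reads):
            offset = rng.randrange(n_slots) * block_size
            total += len(os.pread(fd, block_size, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, elapsed

def sweep(path, block_sizes, n_reads=64):
    """Run random_read_bw for each block size; return {block_size: (bytes, seconds)}."""
    return {bs: random_read_bw(path, bs, n_reads) for bs in block_sizes}
```

Something like `sweep("big_file", [1 << 20, 10 << 20, 100 << 20])` would then show whether smaller blocks help or hurt.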
Update: Ok, so I am thinking about this more, and wondering if the limitation I am running into is actually the processor on the RAID card. Basically it might be rigged to do the RAID calculations really fast, but isn’t clocked fast enough for high-throughput, non-calculation-intensive IO. Will need to look into this. I could always simply export all the drives as a big old JBOD, and build a RAID in software … but then if the RAID CPU can’t handle the IO now, it really won’t like having the system CPUs shoving bits down its throat at 4 GB/s per RAID card.
Going to have to think about this one, and look up this processor. If I am hitting its limits, I need to see how I can make more effective use of it, even if this means the corner cases remain corner cases.