Updated io-bm, and results from a system I was working on

For those who aren’t aware, a long, long time ago I wrote a simple IO benchmark, having been displeased with the (at the time) standard tools. Since then fio has come out and been quite useful, though somewhat orthogonal to what I wanted.

The new results are at the bottom; first, some background.

At a high level, you want your test runs to place your IO system under heavy sustained load, to explore the holistic system behavior. Micro-benchmarks are great for giving you a sense of how various subsystems perform; they aren’t so good at holistic system performance. It was the latter I was after.

The common tools at the time (pre-fio) were extensively used. But system architects/builders/designers at the time noticed that in many cases, the tools didn’t even generate real IO traffic; much of what they were doing wound up playing games in cache. Which is great if your storage system is nothing but cache. It’s terrible if your IO patterns are not cache friendly.

I wanted tooling that allowed me to explore that region where cache was irrelevant … or actually unhelpful. This provides a lower bound on your performance. This is what your customers will actually see in day-to-day use, not what they think they will get by reading marketing blurbs.

So I wrote io-bm. You can pick it up at the repository.

Basically, io-bm lets you construct an MPI- or an OpenMP-based binary. Here is an example of running the OpenMP version on my deskside system, bender.

joe@bender:~/dev/io-bm$ ./run-openmp.bash 
 starting parallel region
  write_flag set
  direct IO set
 N=1 gigabytes will be written in total
 each thread will output 0.250 gigabytes
 [tid=0] page size                     … 4096 bytes 
 [tid=0] number of elements per buffer … 131072  
 [tid=0] number of buffers per file    … 256  
 Thread=000: total IO = 256 MB , time = 2.254 s IO bandwidth = 113.574 MB/s
 Thread=001: total IO = 256 MB , time = 5.205 s IO bandwidth = 49.181 MB/s
 Thread=002: total IO = 256 MB , time = 7.944 s IO bandwidth = 32.224 MB/s
 Thread=003: total IO = 256 MB , time = 9.153 s IO bandwidth = 27.970 MB/s
 Total time = 9.175 s, B = 111.613 MB/s

Now, running the same thing with MPI. Different target location; this one is an SSD.

joe@bender:~/dev/io-bm$ ./run-mpi.bash 
 Thread=00000: host=bender time = 2.822 s IO bandwidth = 90.725 MB/s
 Thread=00001: host=bender time = 2.978 s IO bandwidth = 85.973 MB/s
 Thread=00002: host=bender time = 3.093 s IO bandwidth = 82.766 MB/s
 Thread=00003: host=bender time = 3.133 s IO bandwidth = 81.701 MB/s
 Naive linear bandwidth summation = 341.165 MB/s
 More precise calculation of Bandwidth = 326.803 MB/s

The write performance matches this SSD, and the spinning rust drive above, perfectly. io-bm is a very good tool for injecting real IO. You can use it to measure cache performance as well if you wish: just turn off direct IO (remove the -d switch).

Ok. During some of the testing I’d been engaged in over the last few weeks, I had a chance to run it on a large ClusterStor system. The original test I had run in 2014 showed I could sustain a 2TB write in about 73 seconds. That works out to about 36.5 seconds per TB, on what was (at that time) a 1PB system of 8 nodes of Scalable Informatics Unison storage. All spinning disk, no cache beyond local disk cache on the spinning rust drives. Roughly 20 drives per LUN; each box held 60 drives. 8 boxes total, 24 LUNs total, using FDR IB. Each box was measured as being able to provide (at the time) roughly 11GB/s raw to/from all the drives. After RAIDing them (RAID6), we sustained about 7.5GB/s, or 2.5GB/s per LUN. The IB 56Gbit network could provide around 6GB/s maximum sustained to each box. That gives us an upper pragmatic number of about 48 GB/s.

We sustained 46 GB/s in live customer tests with this tool, at their site. That’s quite good, even now. 5 years ago, it was phenomenal.

For this ClusterStor, there are about double the number of storage units, at 84 drives per unit versus 60; roughly 2.3x the number of drives in total. The RAID is different. It uses an EDR (100Gb) connection, so you should expect about 10.8 GB/s per box. For the number of units we had, this would correspond to about 139GB/s actual.

In my measurements (using the MPI version), a single node was able to hit about 10 GB/s. When I ran on 2 racks full of clients, I was able to sustain about 100 GB/s. When I doubled this to 4 racks full of clients, I hit a sustained 130 GB/s. No tuning on the file system to do this. Just pushing bits hard and fast.

This was wonderful.

I tried the 2TB write, but for some odd reason, I could only do a 1.6TB write. Not sure why 2TB failed. But still, watching that complete in under 14 seconds on this system made me happy.