# Taking a JackRabbit-M for a spin

This is a new 24TB raw JackRabbit-M system we are burning in for a customer. Unit will ship in short order, but I thought you might like to see what happens when we take it for a spin.
And when we crack the throttle.

First the basics:
24x 1TB drives (SATA II nearline drives, not desktop units) in a 4U case, with 2 hot spares, in RAID6 (yes, these numbers are with RAID6). The system has 16 GB RAM, so any file larger than 16 GB will be streaming from disk; cache won’t be involved. (A number of our competitors conveniently forget this and report benchmarks only for file sizes up to and including the size of their system memory.)
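A rough way to see how far each run in this post goes past RAM is to compare its total data size against the 16 GB of system memory. The sketch below uses the sizes reported in the outputs in this post (bonnie++ at 32168M, the dd file at 83886080000 bytes, IOzone with 4 files of 16777216 KB each). Treating "16 GB RAM" as 16 GiB, and reading bonnie++'s `M` and IOzone's `KB` as binary units, are assumptions; the names `tests` and `ram_multiple` are illustrative only.

```python
GIB = 2**30
RAM_BYTES = 16 * GIB  # assumption: "16 GB RAM" taken as 16 GiB

# Total bytes touched by each test, taken from the outputs in this post.
tests = {
    "bonnie++": 32168 * 2**20,              # "32168M", assumed MiB
    "dd big.file": 83_886_080_000,          # size shown by ls and dd
    "iozone (4 x 16g)": 4 * 16_777_216 * 1024,  # 4 files, KB assumed 1024 bytes
}

def ram_multiple(total_bytes, ram_bytes=RAM_BYTES):
    """Return the data set size as a multiple of system RAM."""
    return total_bytes / ram_bytes

for name, size in tests.items():
    print(f"{name:18s} {size / GIB:7.2f} GiB = {ram_multiple(size):.2f}x RAM")
```

A multiple above 1 means the data set cannot sit entirely in the page cache, though it does not by itself prove that no I/O was served from cache.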
First up, basic bonnie++:

[root@jackrabbit ~]# bonnie++ -u root -d /big -f
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
jackrabbit   32168M           639163  67 199640  32           924484  86 503.2   0
------Sequential Create------ --------Random Create--------
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
16 22183  88 +++++ +++ 23133  86 22058  92 +++++ +++ 11004  40
jackrabbit,32168M,,,639163,67,199640,32,,,924484,86,503.2,0,16,22183,88,+++++,+++,23133,86,22058,92,+++++,+++,11004,40


Yes, there is some sort of Linux bug with cached writes; we should be seeing about 1.8x the 639 MB/s we measure. It is likely due to this kernel (and associated patches). We will update later with numbers from the final OS load.
Bonnie is not, however, directly relevant to any workload that I am aware of. It is just a standard staple of IO benchmarking.
Our customers want to do things like stream lots of data off (or onto) these units. Really fast.
So let's see how well this unit can read. I created a big file named, curiously, /big/big.file. It is about 80 GB in size (remember 1 GiB != 1 GB, so there are rounding errors of a few percent if you play loosely with the conversion).

[root@jackrabbit ~]# ls -alF /big/big.file
-rw-r--r-- 1 root root 83886080000 2008-07-19 10:07 /big/big.file
[root@jackrabbit ~]# ls -alFh /big/big.file
-rw-r--r-- 1 root root 79G 2008-07-19 10:07 /big/big.file
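To make the GB versus GiB point concrete, here is a small sketch that converts the byte count from the `ls` output above into both units. The helper names are illustrative only.

```python
SIZE_BYTES = 83_886_080_000  # byte count reported by `ls -alF /big/big.file`

def to_gb(nbytes):
    """Decimal gigabytes (10**9 bytes), the unit dd prints as 'GB'."""
    return nbytes / 10**9

def to_gib(nbytes):
    """Binary gibibytes (2**30 bytes), the unit behind `ls -h` style sizes."""
    return nbytes / 2**30

gb = to_gb(SIZE_BYTES)
gib = to_gib(SIZE_BYTES)
print(f"{gb:.2f} GB, {gib:.3f} GiB, ratio {gb / gib:.4f}")
```

The same file is about 84 in decimal units and about 78 in binary units, which is why the "80 GB" label depends on how loosely you convert.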


OK, rounding errors are not so important. The performance is. How long does it take a simple dd to read this file?
Uncached read:

[root@jackrabbit ~]# dd if=/big/big.file ...
40000+0 records in
40000+0 records out
83886080000 bytes (84 GB) copied, 55.363 s, 1.5 GB/s


Cached read:

[root@jackrabbit ~]# dd if=/big/big.file ...
40000+0 records in
40000+0 records out
83886080000 bytes (84 GB) copied, 68.0477  s, 1.2 GB/s


Uncached write:

[root@jackrabbit ~]# dd  if=/dev/zero ...
...
83886080000 bytes (84 GB) copied, 71.9762 s, 1.2 GB/s


Cached write:

[root@jackrabbit ~]# dd  if=/dev/zero ...
...
83886080000 bytes (84 GB) copied, 99.9484 s, 839 MB/s
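The rates dd prints can be reproduced from the byte count and elapsed times it reports. The sketch below does that for the four runs above. Labeling the two `if=/dev/zero` runs as writes is an inference from the input device, since the rest of each command line is elided; the dictionary keys are labels chosen here, not dd output.

```python
BYTES = 83_886_080_000   # bytes copied in every run
RECORDS = 40_000         # "40000+0 records" shown for the read runs

# Elapsed seconds as printed by dd for each run.
runs = {
    "uncached read": 55.363,
    "cached read": 68.0477,
    "uncached write": 71.9762,  # assumption: the if=/dev/zero runs are writes
    "cached write": 99.9484,
}

def rate_mb_s(nbytes, seconds):
    """Throughput in decimal MB/s (10**6 bytes per second), as dd reports it."""
    return nbytes / seconds / 1e6

rates = {name: rate_mb_s(BYTES, secs) for name, secs in runs.items()}
for name, rate in rates.items():
    print(f"{name:15s} {rate:8.1f} MB/s")

# Bytes per record implied by the read output.
print(BYTES // RECORDS, "bytes per record")
```

These agree with the rounded 1.5 GB/s, 1.2 GB/s, 1.2 GB/s, and 839 MB/s figures dd printed.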


Note that for files of this size, cached reading and writing make no sense (i.e. you shouldn’t do it).
Some IOzone results:

	Run began: Sat Jul 19 20:03:57 2008
File size set to 16777216 KB
Record Size 1024 KB
Command line used: iozone -s 16g -r 1024 -t 4 -F /big/f.0 /big/f.1 /big/f.2 /big/f.3
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 4 processes
Each process writes a 16777216 Kbyte file in 1024 Kbyte records
...
Children see throughput for  4 initial writers 	=  628427.80 KB/sec
Parent sees throughput for  4 initial writers 	=  598968.88 KB/sec
Min throughput per process 			=  154616.17 KB/sec
Max throughput per process 			=  162537.95 KB/sec
Avg throughput per process 			=  157106.95 KB/sec
Min xfer 					= 15961088.00 KB
Children see throughput for  4 rewriters 	=  763924.11 KB/sec
Parent sees throughput for  4 rewriters 	=  751177.77 KB/sec
Min throughput per process 			=  186018.53 KB/sec
Max throughput per process 			=  195353.62 KB/sec
Avg throughput per process 			=  190981.03 KB/sec
Min xfer 					= 15975424.00 KB
Children see throughput for  4 readers 		=  822380.55 KB/sec
Parent sees throughput for  4 readers 		=  822353.59 KB/sec
Min throughput per process 			=  183661.97 KB/sec
Max throughput per process 			=  223631.67 KB/sec
Avg throughput per process 			=  205595.14 KB/sec
Min xfer 					= 13778944.00 KB
Children see throughput for 4 re-readers 	=  892697.84 KB/sec
Parent sees throughput for 4 re-readers 	=  892657.62 KB/sec
Min throughput per process 			=  215557.22 KB/sec
Max throughput per process 			=  233765.73 KB/sec
Avg throughput per process 			=  223174.46 KB/sec
Min xfer 					= 15470592.00 KB


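Performance is very good: with four processes, each working on its own 16 GB file, the children see about 628,000 KB/sec in aggregate for initial writes, about 764,000 KB/sec for rewrites, and between 822,000 and 893,000 KB/sec for reads and re-reads.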
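As a sanity check on the IOzone figures above, the "Children see" aggregate should be close to four times the per-process average, and the KB/sec values can be converted to MB/s for comparison with the dd runs. This is a minimal sketch; it assumes IOzone's "Kbytes" are 1024 bytes, and the names used are illustrative.

```python
PROCS = 4  # "-t 4" on the IOzone command line

# (children aggregate KB/sec, average per-process KB/sec) from the output above.
results = {
    "initial write": (628427.80, 157106.95),
    "rewrite": (763924.11, 190981.03),
    "read": (822380.55, 205595.14),
    "re-read": (892697.84, 223174.46),
}

def mb_per_s(kb_per_s):
    """Convert KB/sec (assumed 1024-byte KB) to decimal MB/s."""
    return kb_per_s * 1024 / 1e6

for name, (children, avg) in results.items():
    # Difference between the aggregate and PROCS times the average.
    drift = abs(children - PROCS * avg)
    print(f"{name:13s} {mb_per_s(children):7.1f} MB/s  (sum check off by {drift:.2f} KB/sec)")
```

Under that unit assumption the aggregate works out to roughly 644 MB/s for initial writes and roughly 842 to 914 MB/s for reads and re-reads.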

### 4 thoughts on “Taking a JackRabbit-M for a spin”

1. @Kent
Sadly there is very little exact comparative benchmark data available. There are TPC-like things, though we haven’t run them, so we can’t show data there.
We would like to do such relevant benchmarks. The hard part about this has to do with the widely differing nature of real application codes, their IO patterns, and how they interact with the underlying system.
IOzone is the one most frequently abused. If you are running IOzone with file sizes less than ram size, and not using flushes, you aren’t doing more than measuring cache speed.
Pointers to people doing this include http://milek.blogspot.com/2007_04_01_archive.html . In this, the tests are run without flushes, but with a 2GB file size … on a 16 GB machine.
There are simpler ways to test cache on these machines.
That said, it is important to understand that when benchmarking, there are different regions of behavior corresponding to differing sizes of RAM, etc., which directly impact the performance of disk systems. This is rarely if ever discussed in analysis.
Others we have seen that do similar things are here though they at least postulate that some of their results may be due to cache fills/reads. This document makes similar measurements with maximum of 4GB files on systems with 96 GB ram … I kid you not, read page 5. This document does a similar thing … their graphics actually show you the size of the relevant cache, and was run on an 8 GB server.
I could criticize the file system formatting claims on page 1 of the executive summary of this document, since the 17.5 second format of the zfs file system is obviously of great concern to the system authors. The time for a quick format of the volume using a GUI interface is on the order of a minute or less … I know as we have done this on our system. Select quick format, and you can start using the partition right away. Problems with this basic level of information suggest that there could be more issues later on.
[update: my fault, I hit post before finishing]
This document also compares cygwin gzip performance to native zfs on solaris. We purposely did not publish our IOzone numbers or other performance numbers for JackRabbit on windows with cygwin, as cygwin focuses upon application compatibility, and not on performance. We have raised this point with several groups at Microsoft in the past … Cygwin is superior to SUA (markedly so) in all areas but performance. It is in Microsoft’s best interest to improve the performance of Cygwin (and there are lots of other ancillary benefits to doing so) contributing this work back to the community. IO performance under cygwin is terrible, and this is well known. I think it is disingenuous to compare against a known poor performer. It doesn’t bolster the argument they are trying to make.
This blogger tested with a 5GB file size.
If you don’t force flushes, then you are simply testing cache speed. Arguments that these are the real sizes you work with are reasonable, but the counter criticism that you are testing cache only and not a file system, or a system per se, is hard to argue against. In fact it is a stronger argument.
I could write a test that only ever writes to processor cache memory. Then run it long enough that I get “meaningful” data. If I then try to use that data in any sort of predictive capacity, I am doing myself and my users a massive disservice. I haven’t addressed their usage patterns.
Very few programs actually operate the way streams does. Or iozone, or bonnie++. SPECFP/SPECINT are based upon real programs, and it is worth noting that they change them over time to reflect the increases in processor cache, as well as many other items. You can get a sense for how fast various subset codes run on each platform with well defined tests. There are lots of things I don’t like about the SPEC benchmarks, but in general they aren’t terrible as comparison tools (well, the individual components aren’t, subject to significant caveats).
This is the point. They are testing (largely) cache.
If all your file reads/writes are 32k and less, you want to make sure these will get to disk as quickly as possible, and the cache aspect is important. But cache speed is not file system speed. To measure the file system, you have to force flushes and effectively disable cache, which gives you your worst case performance. So do very large file tests, which is why we do them. You want to know how badly it will perform in the worst case. We want to maximize the minimum performance so that the worst case isn’t terrible, and the best case is excellent.

2. Ever since Joe came out with JackRabbit, he has been struggling with meaningful benchmarks. As he discovered, there really aren’t any 🙂 But I think he’s made more progress than anyone I know.
Like many others, I also struggled with this problem. I’ve looked at various codes such as IOZone, Bonnie++, etc. and found all of them lacking in some way, shape, or form as benchmarks. But the only one I use, and only when I have to, is IOZone.
On the other hand, there are some pretty good benchmarks for MPI-IO codes. There are a number of what I think of as good MPI-IO benchmarks. One of my favorites is LANL MPI-IO Test (http://public.lanl.gov/jnunez/benchmarks/mpiiotest.htm) written by James Nunez. It can test POSIX, MPI-IO, and HDF5 I/O interfaces, as well as about any option that the MPI-IO standard has included 🙂
There are others as well: IOR, mpi-tile-io, b_eff_io, pio-bench (http://www.clustermonkey.net//content/view/87/32/) and so on.
The only problem is that not many codes use MPI-IO (yet). I keep hoping that many people are working toward adding this capability to their codes, but I haven’t seen the uptake that I was hoping for.
I’ve also been working on an “IO simulator” that allows you to run code, get a list of IO functions using “strace”, and then create a simple C code simulator that simulates what the original did. This allows you to take the simulator and run it anywhere and everywhere you want 🙂 Now I just need to put my money where my fingers are, and actually finish off the simulator.
Joe – sorry I hijacked your blog for my soapbox, but I couldn’t resist. I’m always glad to see JackRabbit doing well.
Thanks!
Jeff

3. Joe – actually because I put -t 32 it will create a dataset which is 64GB in size (32x 2GB) which is 4 times more than memory. If you look at my numbers you will also find that the results are definitely not coming from cache only.

4. Iozone is actually great for straight throughput tests. Just specify the “-I” option to execute direct i/o, and mount your file systems as direct. You don’t even need the close option, and you don’t need to worry about file system caching or host memory size. The commit must be received from the array. Yes, the array cache may come into play, at which point you just increase the file size to see how well the caching algorithm on the array works. On most advanced arrays, the array cache will turn down on sequential writes, and will only retain pre-fetched blocks for short periods on sequential reads.