Final sprint before shipping

Now that I (think I) understand most of the major issues here, and can be reasonably sure I have a good grasp of the tuning, I want to take it out on the test track and give it one final once-over.

Let's open the throttle. Wide.


I can tune the IO scheduler, the number of outstanding IO requests (for sorting), various buffer cache settings, and the works. The clock is now left alone (I need to set it that way by default), so it is running at full speed.
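A minimal sketch of the knobs in question, poked from userspace; device names here are examples (sda as one member device, md0 as the array), so adjust to taste:

```shell
# IO elevator on a member device
echo deadline > /sys/block/sda/queue/scheduler
# number of outstanding IO requests available for sorting
echo 512 > /sys/block/sda/queue/nr_requests
# readahead on the md device (in 512-byte sectors)
blockdev --setra 16384 /dev/md0
# leave the clock alone: full speed, no powernow/ondemand throttling
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
```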

Sanity check: How does buffer cache look?


root@jackrabbit1:~# hdparm -tT /dev/md0

/dev/md0:
Timing cached reads: 4908 MB in 2.00 seconds = 2455.69 MB/sec
Timing buffered disk reads: 2060 MB in 3.00 seconds = 685.95 MB/sec

Good-n-fast. None of this 1.1 GB/s stuff we see when powernow is on.

Quick-n-dirty bonnie++


Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
jackrabbit1 32096M           568841  85 279055  54           834536  82 491.6   0
jackrabbit1,32096M,,,568841,85,279055,54,,,834536,82,491.6,0,,,,,,,,,,,,,

Not bad. I saw as high as 950 MB/s sequential input and 600+ MB/s in other runs during testing. Some additional IO tuning is possible (the deadline elevator favors reads over writes by default).

Since we have 2 GB of RAID cache, 1 GB per card, we need to let the test get past the RAID cache boundary. This was one of my objections to some of the other testing we have seen in the past: their test cases were entirely cache bound, and therefore effectively meaningless as an indicator of performance (other than cache performance). Let's get out of the RAID cache regime, into the region where IO needs to spill to disk. This hits the power curve hard. If your performance falls off a ledge at the size of your RAID cache, you have … problems.
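To force the test out of cache, the file size has to be comfortably larger than the 2 GB of controller cache (and ideally larger than system RAM as well). A hedged sketch of the sort of iozone invocation that does this; the path is an example:

```shell
# Write/rewrite (-i 0), read/reread (-i 1), and random IO (-i 2) on a
# 4 GB file, well past the 2 GB of controller cache. -e includes flush
# (fsync) in the timings, so cached-but-unwritten data is not counted
# as complete.
iozone -e -i 0 -i 1 -i 2 -s 4g -r 1024k -f /data/iozone.tmp
```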

For laughs, I grabbed a snapshot of a few seconds of dstat, running in a window above iozone.

----total-cpu-usage---- -dsk/total- --dsk/sda-- --dsk/sdc-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ: read  writ: read  writ| recv  send|  in   out | int   csw
0 27 71 0 0 2| 0 400M: 0 200M: 0 200M| 529B 1176B| 0 0 |2946 772
0 11 75 12 0 2| 0 964M: 0 483M: 0 482M| 192B 484B| 0 0 |6421 967
0 19 68 11 0 1| 0 684M: 0 341M: 0 342M| 192B 484B| 0 0 |4722 379
0 25 75 0 0 0| 0 80M: 0 40M: 0 40M| 126B 370B| 0 0 | 777 400
0 22 74 4 0 0| 0 642M: 0 322M: 0 320M| 340B 386B| 0 0 |4498 930
0 2 75 21 0 2| 0 1326M: 0 662M: 0 664M| 126B 386B| 0 0 |8902 564

Each adapter can provide about 780 MB/s during writes, and we are mostly filling two of them. We can add more.

The corresponding line from IOzone is

2097152 1024 699383 938361 1333639 1344250 1342067 1147908 1343132 1154050 1342384 555308 708427 1328317 1340149

I would argue that we are still in cache. This is bursty, and still only 2GB of IO.

Looking at a few lines of dstat while we are at the 4 GB size (64k record size), I see

----total-cpu-usage---- -dsk/total- --dsk/sda-- --dsk/sdc-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ: read  writ: read  writ| recv  send|  in   out | int   csw

0 35 64 0 0 1| 0 878M: 0 438M: 0 440M| 152B 370B| 0 0 |6079 1470
0 34 65 0 0 0| 0 880M: 0 440M: 0 440M| 198B 598B| 0 0 |6191 1378
0 32 68 0 0 1| 0 670M: 0 334M: 0 336M| 132B 484B| 0 0 |4973 1324
0 5 75 18 0 2| 0 1260M: 0 629M: 0 631M| 132B 484B| 0 0 |8354 660

with a corresponding output line of

4194304 64 749795 1027169 2223807 2255197 2209095 1239858 2256997 3067745 2227245 754630 963514 2191593 2228561

This is outside of the RAID controller cache. Running atop and looking at the IO per controller, it reports 15-50% utilization for these cases. We have headroom.

All of this running RAID6.

Will work on the benchmark report. This kernel is a step back in version, and we lose about 5-6% performance as compared to the later kernels. But we are seeing sustained data boluses of 0.8-0.9 GB/s and better, with bursts to 1.3-1.5 GB/s.

Since we are using PCIe x8 controllers, we have 4 GB/s to work with bidirectionally, or 2 GB/s in each direction. About 86% of that is actually available, due to the way PCIe protocol overhead works. This gives us a maximal bandwidth per controller of about 1.7 GB/s. Two controllers gets us to 3.4 GB/s; four would get us to 6.8 GB/s. The problem is that the DMA transfers to and from memory will likely be the bottleneck above this. At 1.7 GB/s we can support 24 disks per controller running at 70 MB/s (their current speed, and the current number of disks per controller). We can spread this out among more controllers and lower the load/contention per controller. That looks like it would increase the speed rather significantly.
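The back-of-envelope arithmetic above is easy to reproduce; the 86% efficiency figure is the assumption from the text:

```shell
# Per-direction PCIe x8 bandwidth in MB/s, and usable fraction after
# protocol overhead (assumed ~86%, as discussed above).
per_dir_mb=2000
eff=86

per_ctrl=$(( per_dir_mb * eff / 100 ))   # ~1.7 GB/s per controller
disks=$(( per_ctrl / 70 ))               # disks at 70 MB/s each

echo "per controller:    ${per_ctrl} MB/s"
echo "two controllers:   $(( 2 * per_ctrl )) MB/s"
echo "four controllers:  $(( 4 * per_ctrl )) MB/s"
echo "70 MB/s disks per controller: ${disks}"
```

This reproduces the 1.7 / 3.4 / 6.8 GB/s figures and the 24 disks per controller.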

Overall, I am quite pleased with this. This unit does appear to gain from using an external log for xfs, as well as from better tuning of the number of requests per device, the IO elevator, and other bits. I will update when I have graphs and analysis later this week.
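For reference, setting up an external xfs log looks something like this; the device names are hypothetical (a small, fast dedicated partition for the log, separate from the data array):

```shell
# /dev/md0 is the data array; /dev/sdq1 is a hypothetical small fast
# partition set aside for the xfs journal, keeping log traffic off the
# data spindles.
mkfs.xfs -f -l logdev=/dev/sdq1,size=128m /dev/md0
mount -o logdev=/dev/sdq1 /dev/md0 /data
```

The log device must then be given at every mount, since the filesystem will not mount without it.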
