times like this put a smile on my face …

We are running some burn-in tests on the JackRabbit storage cluster. 6 of 8 nodes are up, 2 need to be looked at tomorrow.
On one of the nodes, we have 3 RAID cards. Because of how the customer wants the unit, it is better for us to have 3 separate file systems. So that’s what we have. They will all be aggregated shortly (hopefully tomorrow) with a nice cluster file system and some InfiniBand goodness.
OK. I wanted to stream some writes and reads to each file system: 3 of each at a time, one to each file system. Make each stream larger than RAM, so there is no caching. Caching doesn’t mix well with streaming. And it interferes with measuring the raw horsepower of the underlying system.
So here I am with 3 writes. I lit off a vmstat 1 in another window, just to see what was happening.
The bo column is the number of 1 KB blocks written out in the time interval (1 second). So do a quick multiplication by 1000 (or 1024, to be exact) to get the aggregate bytes output per second.
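
If you want to script that arithmetic rather than doing it in your head, a rough sketch along these lines works (it assumes the standard procps vmstat column layout, with bo as the tenth column):

# Rough sketch: convert the bo column of "vmstat 1" into an approximate
# write rate in MB/s. Assumes the classic procps column layout:
#   r b swpd free buff cache si so bi bo in cs us sy id wa
import subprocess

vmstat = subprocess.Popen(["vmstat", "1"], stdout=subprocess.PIPE,
                          universal_newlines=True)
for line in vmstat.stdout:
    fields = line.split()
    if not fields or not fields[0].isdigit():
        continue                      # skip the two header lines
    bo_blocks = int(fields[9])        # 1 KB blocks written in the last second
    print("write rate: %.1f MB/s" % (bo_blocks * 1024.0 / 1e6))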


As the storage cluster builds …

Finally finished the Tiburon changes for the storage cluster config. Storage clusters are a bit different from computing clusters in a number of regards, not the least of which is the large RAID in the middle.
In this case, the storage cluster is 8 identical JackRabbit JR5 units, each with 24 TB of storage, 48 drives, 3 RAID cards, and dual-port QDR cards. For our testing, we are using an SDR network (as we don’t have a nice 8-port QDR switch in house).
Tiburon is our cluster load and configuration system. It is designed to be as simple as possible, as unobtrusive as you can make it … it does all the heavy lifting through our finishing scripts, which take a base OS install and configure it to whatever level of detail we require.


Is RAID over?

Henry Newman and a few other people I know are talking about RAID as being on the way out. John West pointed to this article on InsideHPC this morning. Their points are quite interesting.
It boils down to this: if the time to rebuild a failed RAID is comparable to the mean time between uncorrectable errors (UCEs), given the volume of data read and written during the rebuild, then RAID as it is currently thought of is going to need some serious rethinking.
Put another way, if you are more likely than not to suffer an uncorrectable error during a rebuild, then rebuilding is a bad thing … and since this is one of the central pillars of RAID …
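As a back-of-the-envelope sketch of that argument (the drive size, array width, and error rate below are illustrative assumptions, not figures from the article):

# Back-of-the-envelope sketch with assumed numbers: the probability of
# hitting at least one unrecoverable read error (URE) while reading every
# surviving drive during a rebuild.
import math

def p_ure_during_rebuild(surviving_drives, drive_bytes, ure_per_bit):
    bits_read = surviving_drives * drive_bytes * 8
    # P(at least one URE) = 1 - (1 - p)^bits_read
    return 1.0 - math.exp(bits_read * math.log1p(-ure_per_bit))

# Assumed example: RAID5 of 8 x 2 TB drives, one failed,
# spec'd URE rate of 1 per 1e14 bits read
print("P(URE during rebuild) ~ %.2f" % p_ure_during_rebuild(7, 2e12, 1e-14))
# ~0.67 -- better than even odds that the rebuild trips over a URE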
So what are the options?


Been horrifically busy … good busy … but busy

Will try to do updates soon, and I owe someone two articles (sorry!). Add to this fighting off a cold … not a happy camper.
Basically we are building an 8x JackRabbit JR5 storage cluster right now. I’ve caught a problem in Tiburon, our OS loader, in the process, and am fixing it. Tiburon is all about providing a very simple platform for PXE (and/or iSCSI) booting of OSes, to make installation and support simple. It uses our finishing scripts, which take a basic OS load and finish it, or polish it, for the task at hand.


M&A: Microsoft buys the *assets* of Interactive Supercomputing

As seen on InsideHPC, John West notes that the assets of Star-P were purchased by Microsoft today.
Parsing of words is important. The phrase “acquired the assets of X” means that the IP was purchased. John points to the blog post where Kyril Faenov mentions that some of the staff will work at the Microsoft Cambridge site.
This is, sadly, not a great exit for Star-P.
Acquiring assets usually means the choice was either to shut down the company and auction the bits off, or to find a buyer for the distressed assets and then wind down the rest of the organization that doesn’t go with them.


The looming (storage) bandwidth wall

This has been bugging me for a while. Here is a simple measure of the height of the bandwidth wall. Take the size of your storage, and divide it by the maximum speed of your access to the data. This is the height of your wall, as measured in seconds. The time to read your data. The higher the wall, the more time you need to read your data.
OK, let’s apply this in practice. Take a 160 GB drive that can read/write at 100 MB/s. Your wall height is 1600s (= 160GB / 0.1GB/s).
Take a large unit, like our 96TB high performance storage and processing unit. You get ~70TB available at 2GB/s. Your bandwidth wall height is then 35000s (= 70TB / 2E-3 TB/s).
I also wonder if it makes more sense to view this logarithmically … measure the wall height as the log base 10 of this ratio, lopping off the units (what is a log(second)?). So a 1600s wall height would be 3.2. A 35000s wall height would be 4.5. Sort of like the hurricane strength measures. A wall height of 1 second (say, a fast memory disk) would be a 0 on this log scale.
Using this, you could get a sense of where the design points are for nearline and offline/archival storage.
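
To put numbers to it, a quick sketch of the two examples above (decimal units, same figures as above):

# Sketch of the bandwidth wall arithmetic: capacity / bandwidth gives the
# wall height in seconds; log10 of that gives the hurricane-style scale.
import math

def wall_height_seconds(capacity_bytes, bandwidth_bytes_per_s):
    return capacity_bytes / bandwidth_bytes_per_s

# 160 GB drive at 100 MB/s
h = wall_height_seconds(160e9, 100e6)
print("%.0f s, or %.1f on the log scale" % (h, math.log10(h)))   # 1600 s, 3.2

# ~70 TB usable at 2 GB/s
h = wall_height_seconds(70e12, 2e9)
print("%.0f s, or %.1f on the log scale" % (h, math.log10(h)))   # 35000 s, 4.5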
This is part of a longer set of thought processes on why current large array designs, or Backblaze-like designs, are problematic at best for large storage systems.


We're Back!

We were knocked off the air around 11pm on 13-September by a machine finally deciding to give up the ghost. A partially retired machine which happened to run scalability.org decided, finally, that it no longer wished to correctly run grub.
Grub being the thing essential to booting.
Like the bootloader.
Yeah. It was one of those nights.


Using fio to probe IOPs and detect internal system features

Scalable Informatics JackRabbit JR3 16TB storage system, 12.3TB usable.

[root@jr3 ~]# df -m /data
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sdc2             12382376    425990  11956387   4% /data
[root@jr3 ~]# df -h /data
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc2              12T  417G   12T   4% /data

These tests are more to show the quite remarkable utility of the fio tool than anything else. You can probe real issues in your system (as compared to a broad swath of ‘benchmark’ tools that don’t really provide a useful or meaningful measure of anything).
This is on a RAID6, so it’s not really optimal for seeks. The benchmark is 8k random reads, with 16 threads, each reading 4GB of its own file (64GB in aggregate, well beyond cache, but we are using direct IO anyway). 16-drive RAID6, 1 hot spare, 2 parity, giving 13 physical drives. Using a queue depth of 31 per drive, these 13 data drives have an aggregate queue depth of 403 (13 x 31). Of course, in RAID6 it’s really less than that, as you are doing 3 reads for every short read.
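As a rough sketch of how those parameters map onto a fio job description (the ioengine, iodepth, and directory below are illustrative assumptions, not the actual settings used):

# Sketch only -- an illustrative job description matching the parameters
# described above. The ioengine, iodepth, and directory are assumptions.
[global]
directory=/data
ioengine=libaio
direct=1
rw=randread
bs=8k
size=4g
numjobs=16
iodepth=32
group_reporting

[randread-8k]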
We often get asked if customers can benchmark our units for databases, and we tell them yes, with the caveat that we need to make sure they are configured correctly for databases (SQL type, seek-based). This configuration is quite important.
Here is the fio input file:
