Deskside box with lotsa GPUs

Testing this for a partner.
The setup: a Pegasus deskside supercomputer with 12x X5690 CPU cores, 48 GB RAM, a 500 MB/s IO channel (soon to be 1 GB/s), and a GTX 260 graphics card, connected to an XCT a-Brix 2U unit with 4x NVIDIA Fermi C2050s (normally we’d use a JackRabbit unit, but they are all busy with customer projects right now).
First, let’s see what’s there:

[root@pegasus C]# lspci | grep nVidia | grep VGA
06:00.0 VGA compatible controller: nVidia Corporation Unknown device 06d1 (rev a3)
0b:00.0 VGA compatible controller: nVidia Corporation Unknown device 06d1 (rev a3)
84:00.0 VGA compatible controller: nVidia Corporation GT200 [GeForce GTX 260] (rev a1)
89:00.0 VGA compatible controller: nVidia Corporation Unknown device 06d1 (rev a3)
8e:00.0 VGA compatible controller: nVidia Corporation Unknown device 06d1 (rev a3)

Ahhh …. nice! And yes, you can order units like this now from the day job.
Now let’s have a little fun.


… and NetApp buys Engenio …

[updated]
Ok, this one is huge. Many of the higher end storage folks in the HPC world use this hardware, which NetApp will now own.
NetApp is not an HPC storage vendor, and I don’t think they have designs to be one [update] yes they do! But this hardware goes to Cray, SGI, Oracle, Dell, IBM, HP, and many others (DDN, BlueArc, Terascala, etc.) who do use Engenio.
We don’t use it, so it’s really not an issue for us.
To a degree, it makes sense for NetApp, as a way to go after EMC (and EMC’s recent acquisition of Isilon).
The impact in the HPC world should be interesting. We need to see NetApp’s intentions going forward.
For all HPC companies (and others) who are worried about these issues, feel free to drop us a line. We and our partners would be more than happy to provide you with the absolutely massive IO firepower and capability you need to stay competitive.
[update] According to NetApp’s own PR on this:

Engenio will enable NetApp to address emerging and fast-growing market segments such as video, including full-motion video capture and digital video surveillance, as well as high performance computing applications, such as genomics sequencing and scientific research.

Ok … the game has changed.


when failures stick out like a statistical sore thumb

Parts fail. Components fail. You have to operate assuming they will fail. A warranty is fundamentally a bet that parts will fail, and a willingness to place money (the price of the warranty) on that bet.
Over time, with enough components, you get a feel for how often parts fail. You get historical data. When one subset of components has a high failure rate (e.g. Corsair SSD disks), you know you can isolate the problem.
But what happens if you get a holistic failure? Say RAID cards, and disks, and power supplies.
On units you’ve burnt in for a while, so you know that there shouldn’t be any parts failures?
It’s hard to argue crappy parts when so many different subsystems have failures. It’s easier to argue that something environmental is a problem. It simply fits the data much better. Otherwise you have to claim that we had correlated failures with subsystem X, Y, and Z within a time period. Which raises the question of what is common to all those subsystems? And if you see these failures not on a single unit, but on multiple units …
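To put a rough number on that argument, here is a minimal sketch. It assumes failures within each subsystem are independent and arrive at a constant rate (a Poisson model). The annual failure rates and part counts below are made up purely for illustration, not measured data from any unit.

```python
import math


def p_at_least_one_failure(annual_rate, n_parts, window_days):
    """Chance that at least one of n_parts fails within the window,
    assuming independent parts with a constant (Poisson) failure rate."""
    expected_failures = annual_rate * n_parts * window_days / 365.0
    return 1.0 - math.exp(-expected_failures)


def p_all_subsystems_hit(subsystems, window_days):
    """Chance that every subsystem sees at least one failure in the same
    window, assuming the subsystems fail independently of each other."""
    probability = 1.0
    for annual_rate, n_parts in subsystems.values():
        probability *= p_at_least_one_failure(annual_rate, n_parts, window_days)
    return probability


# Hypothetical numbers: (annual failure rate per part, parts per unit).
HYPOTHETICAL_UNIT = {
    "raid_card": (0.01, 1),
    "disk": (0.03, 24),
    "power_supply": (0.02, 2),
}

if __name__ == "__main__":
    joint = p_all_subsystems_hit(HYPOTHETICAL_UNIT, window_days=30)
    print(f"P(all three subsystems fail within 30 days) = {joint:.2e}")
```

Under independence the joint probability is the product of small numbers, so it collapses quickly. Seeing that pattern on one unit, let alone several, is what makes a shared environmental cause the better fit.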


Single vs Multi-stream on JackRabbit JR5

A customer was playing with one of our lab machines (a JackRabbit JR5), and asked us if we could improve the multithread streaming performance. The way we had it set up (for internal testing) was non-optimal for their use case.
So we went back and did some simple tweaks. Somewhat better optimized for their use case. Remember, this is our previous generation unit. Next gen is … a little faster 🙂


Quick accounting tool for Torque

A long while ago, I had developed a usage summary tool for gridengine. For our small internal cluster, we are using Torque (we set it up just as the dejecta was hitting the high rotational rate elements w.r.t. gridengine at Oracle, link URL may not be safe for work, and you might be offended by it … if so, I apologize.).
This summary tool was a quick way to parse the accounting records. We developed even more tools, including converting the accounting records into databases, Excel spreadsheets (yeah, you can say “oooh” now), and a number of other things. usage.pl was something of a hit for our customers. Many used it for chargeback summaries.
Fast forward to today, and we are missing this in Torque.
Well, no longer.
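For a flavor of what such a summary involves, here is a minimal sketch in Python. It is not the tool itself, and it is not usage.pl. It assumes the usual Torque accounting record layout of `date time;record type;job id;key=value ...`, with job-end records marked `E` and walltime given as `HH:MM:SS`. The function names and sample records are made up for illustration.

```python
from collections import defaultdict


def walltime_to_seconds(text):
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize(lines):
    """Sum job counts and walltime per user over job-end ("E") records."""
    usage = defaultdict(lambda: {"jobs": 0, "wall_seconds": 0})
    for line in lines:
        parts = line.strip().split(";", 3)
        if len(parts) != 4 or parts[1] != "E":
            continue  # skip malformed lines and non-end records
        # Assumes values contain no spaces; real records may need more care.
        fields = dict(
            item.split("=", 1) for item in parts[3].split() if "=" in item
        )
        user = fields.get("user")
        wall = fields.get("resources_used.walltime")
        if user is None or wall is None:
            continue
        usage[user]["jobs"] += 1
        usage[user]["wall_seconds"] += walltime_to_seconds(wall)
    return dict(usage)


# Made-up sample records, not taken from a real cluster.
SAMPLE = [
    "03/10/2011 09:00:00;S;101.head;user=alice group=lab queue=batch",
    "03/10/2011 10:00:00;E;101.head;user=alice group=lab queue=batch "
    "resources_used.walltime=01:00:00",
    "03/10/2011 11:30:00;E;102.head;user=alice group=lab queue=batch "
    "resources_used.walltime=00:30:00",
    "03/10/2011 12:00:00;E;103.head;user=bob group=lab queue=batch "
    "resources_used.walltime=02:00:00",
]

if __name__ == "__main__":
    for user, totals in sorted(summarize(SAMPLE).items()):
        hours = totals["wall_seconds"] / 3600.0
        print(f"{user}: {totals['jobs']} jobs, {hours:.2f} wall hours")
```

A chargeback summary would then multiply the per-user wall hours by whatever rate applies.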
