IT storage

They see a shiny new storage chassis with a 6G backplane. They fill it with “fast” drives, and build “RAIDs” using the integrated RAID platform.
They insist it should be fast, showing calculations that suggest it should sustain near the theoretical maximum IO performance.
Yet the reality is 1/10th to 1/20th of that theoretical maximum.
What’s going on?
In the past, I’ve railed against “IT clusters” … basically clusters designed, built, and operated by IT staff unfamiliar with how HPC systems work. They share a number of traits, all partially or mostly anathema to high performance computing. I won’t re-hash that post; you can search for it.

Someone accused me of using IT as a pejorative … and this is incorrect. I am simply applying an appropriate label to a system.
Well, I think it is time to call out the same thing in storage. When you buy/build these sorts of storage units, you see low-cost interconnect mechanisms which “sound” like they “should” work. Throw lots of 10k or 15k RPM SAS drives into these large SAS JBOD chassis. Only need one “RAID” card (and since it’s an IT-designed cluster, it’s a pretty low-end … HBA … with some RAID capabilities, but not a real hardware-accelerated RAID).
And then your performance just sucks. C’mon now … 24 SAS drives should give me 2.4 GB/s, right? And with 8k random writes/reads I should get 300k IOPS, right?
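To make that gap concrete, here is a back-of-envelope sketch. The per-drive figures are my own illustrative assumptions (roughly what a 15k RPM SAS drive can do), not measurements from any particular chassis:

```python
# Back-of-envelope sketch of why the marketing math fails for random IO.
# All per-drive figures below are illustrative assumptions, not measurements.

N_DRIVES = 24
SEQ_MBPS_PER_DRIVE = 100      # assumed streaming rate per 15k RPM SAS drive
RANDOM_IOPS_PER_DRIVE = 180   # assumed 8k random IOPS (seek + rotational latency bound)
IO_SIZE_KB = 8

# The naive calculation: every drive streams at full speed, all the time.
naive_bandwidth = N_DRIVES * SEQ_MBPS_PER_DRIVE        # 2400 MB/s

# Random 8k IO is bounded by mechanical latency, not by the 6G interface.
random_iops = N_DRIVES * RANDOM_IOPS_PER_DRIVE         # 4320 IOPS, nowhere near 300k
random_bandwidth = random_iops * IO_SIZE_KB / 1024     # MB/s delivered at 8k random

print(f"naive sequential claim: {naive_bandwidth} MB/s")
print(f"realistic 8k random:    {random_iops} IOPS, about {random_bandwidth:.0f} MB/s")
print(f"ratio: roughly 1/{naive_bandwidth / random_bandwidth:.0f} of the claimed number")
```

Seek and rotational latency dominate random IO on spinning disks, so the 6G interface speed never enters the picture; and that is before the single low-end HBA becomes its own bottleneck.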
Yeah, similar to a number of conversations recently.
No, I am not kidding.
Chances are, if you are doing this, or if your reseller/vendor is doing this, then you/they really haven’t matched your needs to the products capable of meeting those needs.
But we run into this. And we are sometimes engaged to help.
How do you explain to someone that their design, that they are so proud of, won’t work?
IT storage falls short in this regard. It’s great for bulk storage. It’s fine for backups. It’s OK for low-bandwidth, low-IOPS service.
But it’s not so good for very intensive workloads. And that’s an issue, especially when the system needs to perform well under those intensive loads.
High performance ain’t easy. IT storage won’t get you there. Like IT clusters, it has designed-in limitations. Like IT clusters, it has its strong supporters.
I should point out that we have nothing against IT, or IT folks, or IT processes. It’s not a pejorative to point out a bad design, or to label it as bad. But it’s problematic when you have to fight battles to solve a problem.
Just because something says 6G doesn’t mean it’s fast. Just because it says SAS doesn’t mean it’s fast.

2 thoughts on “IT storage”

  1. I often describe it as ‘IT’ being an umbrella term – someone works in IT much like someone works in ‘health care’. And there are lots of important jobs in those fields. But when you’re in need of a specialized procedure like brain surgery, you don’t turn to, say, a pediatrician.
    Of course, medicine knows this, and sends you to the right specialist, whereas IT positions are generally seen as very interchangeable. Much to the detriment of the ‘patient’.
    Hope business is going well – I’ll check in sometime in the coming weeks.

    • @Brian
      It is an umbrella term, and IT folks generally have a (very) hard job. And I appreciate the job they do.
      Storage, HPC, and high performance network design … these are specialist tasks. Really. Rack-em-stack-em clusters and storage sorta do work for a while, until you start pressing them hard. Once you do, you see the problems. We see multiple groups at multiple sites doing stuff that ranges from bad to … well … really bad. And they aren’t interested in solutions that begin by replacing the badness with goodness.
A long while ago, we had one cluster support customer who was seeing problems with their cluster for runs above certain sizes. It took a little while for me to figure out what the problem was. An IT generalist would never have even considered what we found to be the problem, a problem. It simply would never have occurred to them that their network design was fundamentally broken, and was the cause of all their cascading issues. They joined two 144-port switches together with a single link to build their 256-node cluster. Yeah. Really. Running MPI codes. And lots of IO.
      Yet we see this same paradigm again and again. Different customer at another site did a similar thing (daisy chaining) with Infiniband and gigabit switches.
I guess it’s a risk. A cost-benefit analysis. You either design it right to begin with, or you pay the price for a poor design (if your code starts slamming on the design hard), with slower operations, and often intermittent, hard-to-diagnose failures.
      I am not being critical of the people who designed it. HPC is a specialist field for system design (for storage, clusters, networks). Less so for CUDA/GPU systems (though there are some small issues to deal with correctly there). I don’t expect a General Practitioner to be performing brain surgery. I don’t expect the brain surgeon to be diagnosing rhinitis. There’s nothing wrong with being a GP, or a brain surgeon. They are different specialties.
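The switch story in the reply above can be sketched numerically. The link speed here is an assumption purely for illustration; the point is the oversubscription ratio, not the absolute number:

```python
# Sketch of the bisection-bandwidth problem in that cluster design.
# Topology follows the comment: 2 switches, 256 nodes, 1 inter-switch link.
# The 10 Gb/s link speed is an assumed figure for illustration.

nodes = 256
switches = 2
nodes_per_switch = nodes // switches       # 128 nodes hang off each switch
inter_switch_links = 1

# For non-blocking cross-switch traffic, each node pair that crosses the
# switch boundary needs its own link's worth of capacity.
links_needed = nodes_per_switch                        # 128
oversubscription = links_needed / inter_switch_links   # 128:1

link_gbps = 10                                         # assumed link speed
per_node_gbps = link_gbps / nodes_per_switch           # worst-case share

print(f"oversubscription: {oversubscription:.0f}:1")
print(f"worst-case cross-switch bandwidth per node: {per_node_gbps * 1000:.0f} Mb/s")
```

A 128:1 oversubscribed boundary means that under all-to-all MPI traffic, each node gets roughly 1/128th of one link; exactly the kind of cascading, size-dependent failure the customer was seeing.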

Comments are closed.