Brittle systems

Years ago, we helped a customer set up a Lustre 1.4.x system. This was … well … fun. And not in a good way. Right before the 1.6 transition, we had all sorts of problems. We skipped 1.6, and now we have set up a Lustre 1.8.2 system, and have several on quote now for various RFPs.
From our experience with the 1.8.2 system … I have to say, I have a sense that it is brittle. Yeah, you can call it “subtle and quick to anger”, or even praise some of the design features/elements.
It just has many moving parts: some work well (the MDS), some … well … not so well (OST problem notifications). The failure surface is huge, and figuring out where you are on that surface has effectively become our morning cat-and-mouse game.
Speaking with other vendors and with friends running these systems, I get the sense that people design and build Lustre systems defensively. That is, they know and appreciate that it will break, so they aim to limit the damage from that breakage. Control or limit the unknowns.
This could be something of a harsh assessment, but I just caught myself doing exactly this for a customer’s configuration. They requested Lustre, and we went back and designed defensively.


Imagine … trying to get something as simple as a quote for Lustre support …

… and not being able to. Seems most of the folks at Sun/Oracle haven’t heard of Lustre. I had to explain it to them on several calls yesterday. They didn’t understand why someone would want to pay for support of a GPL licensed system … er … ah … mebbe we found some real nice …

The evolution of the data center

Way back in the day, data centers used to be cold. Cold air came in the front and, usually in hot-aisle/cold-aisle configs, left through the back.
Power per rack was measured in a few thousand watts.
Cooling per rack could be mebbe one ton of AC. Up to two in the worst case.
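As a quick back-of-the-envelope check (my numbers, not part of the original post): one ton of AC is 12,000 BTU/hr and one watt is about 3.412 BTU/hr, so a rack drawing a few thousand watts really does line up with roughly one ton of cooling. A tiny Perl sketch of the arithmetic:

#!/usr/bin/perl
# back-of-the-envelope rack cooling estimate -- my own illustration,
# not from the original post
use strict;
use warnings;

my $rack_watts   = shift @ARGV || 4000;     # "a few thousand watts" per rack
my $btu_per_hr   = $rack_watts * 3.412;     # 1 W is about 3.412 BTU/hr
my $tons_cooling = $btu_per_hr / 12_000;    # 1 ton of AC = 12,000 BTU/hr

printf "%d W per rack ~= %.0f BTU/hr ~= %.1f tons of cooling\n",
       $rack_watts, $btu_per_hr, $tons_cooling;

At 4,000 W that works out to about 1.1 tons; you need roughly 7,000 W per rack before you hit the two-ton worst case.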
Then stuff got denser. Somewhere along the line someone decided they could run their stuff at higher temperatures. This works fine for machines that are actually mostly open space (blades, sparsely populated server systems, …). It doesn’t work so well for densely populated server systems.
Inlet temps above 72F can be a problem for dense electronics. Poor airflow in a data center (e.g. no real positive pressure at the inlet, no real negative (relative) pressure at the outlet) is a real problem.
Yet we’ve seen more than our share of such data centers in the last 6 months, enough that I am starting to question some of the designs I see. We might have to start actively asking customers whether they have the conditions needed in their data center for the optimal use case (and then list them). If not, we’ll have to ask some defensive questions, such as: do you have inlet temperatures below 72F? Do you have positive pressure at the front and negative pressure at the back?


Don't share anything important or of value via Linkedin … they will own it!

[update] trackbacks/pingbacks temporarily disabled. Waaay too much spam. Seriously. From their updated user agreement: “License and warranty for your submissions to LinkedIn. You own the information you provide LinkedIn under this Agreement, and may request its deletion at any time, unless you have shared information or content with others and they have not deleted it, …”

Fixed up some of the siCluster tools

Well … more correctly, fixed the data model to be saner, so that the tools would be easier to develop and use. Still a few more things to do, and one (simple) presentation abstraction to set up.
The gist of it is that (apart from the automatically added nodes) adding nodes by hand should be easy. This also means by XML (not done yet, but I know how to do this), and by web (basically XML or CGI-like mechanisms).
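I haven’t shown the tool internals here, but conceptually the insert side is small. Here is a minimal sketch of what an add_nodes.pl-style insert could look like, assuming cluster.db is SQLite, lives at /etc/cluster/cluster.db, and has a nodes table with index/slot/name/location columns; all of that is my assumption, and the real siCluster schema and storage may differ:

#!/usr/bin/perl
# minimal sketch of an add_nodes.pl-style insert
# assumptions (mine, not siCluster's): cluster.db is SQLite at
# /etc/cluster/cluster.db with a "nodes" table
use strict;
use warnings;
use DBI;
use Getopt::Long;

my %opt;
GetOptions(\%opt, 'index=i', 'slot=i', 'name=s', 'location=s')
    or die "usage: $0 --index=N --slot=N --name=NAME --location=LOC\n";

my $dbh = DBI->connect("dbi:SQLite:dbname=/etc/cluster/cluster.db",
                       "", "", { RaiseError => 1, AutoCommit => 1 });

print "Inserting node into cluster.db\n";
$dbh->do(q{INSERT INTO nodes (node_index, slot, name, location)
           VALUES (?,?,?,?)},
         undef, @opt{qw(index slot name location)});
$dbh->disconnect;

Presumably the XML and web paths would then just be thin front ends that parse a node description and drive the same insert.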
So I want to add a node into our database.

root@manager:/etc/cluster/bin# ./add_nodes.pl --index=4 --slot=4 --name=paul --location=rack5
Inserting node into cluster.db

And sure enough, it’s there …

root@manager:/etc/cluster# bin/ls_nodes.pl
george, eth3=10.100.1.1/255.255.0.0, ipmi=10.101.1.1/255.255.0.0, wifi=10.102.1.3/255.255.0.0[fast]
harry
paul

Now let’s attach a network interface to this node …
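The excerpt cuts off here, so purely as an illustration of the data model implied by the ls_nodes.pl output above (interface name = IP/netmask, plus an optional tag like [fast]), an attach step might boil down to something like the following; the real siCluster tool, its flags, the addresses, and the schema here are all my assumptions:

#!/usr/bin/perl
# purely illustrative attach-interface step; the "interfaces" table,
# the node's addresses, and the storage details are all assumptions
use strict;
use warnings;
use DBI;

my ($node, $ifname, $ip, $mask, $tag) =
    ('paul', 'eth3', '10.100.1.4', '255.255.0.0', undef);

my $dbh = DBI->connect("dbi:SQLite:dbname=/etc/cluster/cluster.db",
                       "", "", { RaiseError => 1 });

$dbh->do(q{INSERT INTO interfaces (node_name, if_name, ip, netmask, tag)
           VALUES (?,?,?,?,?)},
         undef, $node, $ifname, $ip, $mask, $tag);
$dbh->disconnect;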
