What he said!

Up early on a Friday morning, working through today’s issues and … found this article on Linux Magazine by the esteemed Doug Eadline. I was in on the discussion that he refers to, and pointed out that you do in fact get what you pay for, and that you will not get an engineered system in many cases. Worse, the configs will likely be those that minimize vendor costs, as that is the problem they are attempting to solve in a low margin business (clusters).
You will not get an engineered or well designed system unless you, curiously enough, go with a group/shop that engineers/designs their clusters to fit your needs. This is not a minor point: a poorly designed system can be painful to work on. I know, I was working on such systems only yesterday.

Doug makes a point that I wish to emphasize, underscore, amplify …

I would listen, then sigh, and explain that selling a rack of servers is not a cluster, the cake still needs icing. Viewed only in terms of hardware, the ticket to the HPC ball got extremely cheap. Even the rackem-stackem-fly-by-night-want-to-be-HPC vendor could get in the door. And yet, most of these systems are not operational clusters.

Recently I ran head first into a poorly designed cluster. The end user wanted to do something on it which was, frankly, hard given the way they had built it. There were many things broken with this system. We wanted to help them get to an operational state, something where the cluster really worked the way it is supposed to. They bought their hardware (improperly configured) from one of the “major vendors” (more in a moment), along the lines of the one Doug describes. They ran this hardware in a manner which wasted roughly 50% or more of its potential right off the bat. I am not sure if this is due to past experiences, or just general IT focus, but this system was little more than a pile-o-PCs.
I call these, in general, IT clusters. They are not HPC clusters by any stretch of the imagination. They don’t really work well. Some things sorta-kinda work. Lots of things don’t or cannot. You have some interesting failure modes.
You can always tell an IT cluster: it has several features that cause it to stand out.
First, it has RHEL. Yes, that’s right, it has Red Hat as the OS. This is not sufficient in and of itself to guarantee an IT cluster. But it is a strong indicator that the people who put it together were not thinking HPC, or they don’t know/understand HPC enough to understand why this isn’t a good idea. In the simplest view, RHEL on every node often means a software support contract on every node, and a cost per node of the OS.
The RHEL kernel (in the 4.x series and now in the 5.x series) has a number of purposefully designed-in limitations. It does not have a meaningful file system offering for high performance/high capacity workloads. You have to explicitly add in/build xfs/jfs kernel modules to get what you need there. As of RHEL5 you now have 4k stacks foisted upon you. I could spend many words going through why this is a “Bad Idea”®. I need only one though. Drivers. There are other issues as well (turning off SELinux to have a fighting chance of running a cluster, LVM for disks by default, …).
Your standard IT shop will install 32-bit RHEL X.y on a 64-bit server with 4 GB of RAM and call it a cluster node. This is precisely why you should be working with a group that knows what the heck it is doing.
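A quick way to catch the 32-bit-OS-on-64-bit-hardware case on an existing node is to compare what the kernel reports against what the CPU can do. The sketch below is only an illustration: the function names are made up, and it assumes an x86 Linux node, where the `lm` (long mode) flag in /proc/cpuinfo marks a 64-bit capable CPU and a 32-bit kernel reports a machine type such as i686.

```python
import platform
import re

# Hypothetical helper names; not part of any cluster toolkit.
_X86_32BIT = re.compile(r"^i[3-6]86$")


def is_32bit_os_on_64bit_cpu(machine: str, cpuinfo_text: str) -> bool:
    """Return True when a 32-bit x86 kernel runs on a 64-bit capable CPU.

    machine      -- machine type as reported by the kernel (e.g. "i686", "x86_64")
    cpuinfo_text -- contents of /proc/cpuinfo
    """
    cpu_is_64bit = False
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            # The flags line looks like "flags : fpu vme ... lm ..."
            flags = line.split(":", 1)[1].split()
            if "lm" in flags:  # "lm" = long mode, i.e. x86-64 capable
                cpu_is_64bit = True
                break
    return cpu_is_64bit and bool(_X86_32BIT.match(machine))


def check_this_node() -> bool:
    """Run the check against the local node (Linux only)."""
    with open("/proc/cpuinfo") as f:
        return is_32bit_os_on_64bit_cpu(platform.machine(), f.read())
```

Running `check_this_node()` across the nodes (through whatever remote execution tool is already in place) would flag the ones that are leaving the 64-bit mode unused.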
Second, it has a SAN or a cheap NAS attached. SAN is a low performance system for HPC. No, really. 4Gb is not fast. We can move data within a single box at 20Gb+ and expose this out to the nodes at nearly that rate. A good cluster design will be able to move data very fast. Data motion is rapidly becoming the most painful aspect of internal/external cluster and distributed system usage. A poorly designed system will not be using a reasonable data storage or data motion fabric within the cluster. The cheap NAS is even worse than the SAN. Tell me if this sounds familiar. You have this large cluster served by a NAS unit with one or two gigabit ports. All those terabytes, effectively hidden behind a network bandwidth wall. This is one of the reasons we designed and built JackRabbit. It solves this problem without breaking the bank.
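To put rough numbers on that bandwidth wall, here is a back-of-the-envelope sketch. The node count and capacity are made-up illustrations, not measurements from any particular site, and the arithmetic ignores protocol overhead, so real numbers will be worse.

```python
def link_mb_per_s(link_gbit: float) -> float:
    """Raw link rate in MB/s (1 Gb/s = 125 MB/s, ignoring protocol overhead)."""
    return link_gbit * 1000.0 / 8.0


def per_node_mb_per_s(link_gbit: float, nodes: int) -> float:
    """Fair share of the storage link per node when every node reads at once."""
    return link_mb_per_s(link_gbit) / nodes


def hours_to_read_all(capacity_tb: float, link_gbit: float) -> float:
    """Hours needed to read the whole store once through the link."""
    capacity_mb = capacity_tb * 1_000_000.0
    return capacity_mb / link_mb_per_s(link_gbit) / 3600.0


if __name__ == "__main__":
    # Assumed example: 100 nodes behind a single gigabit NAS port.
    print(per_node_mb_per_s(1.0, 100))
    # Same 100 nodes behind a 20 Gb/s pipe.
    print(per_node_mb_per_s(20.0, 100))
    # Time to read 10 TB once through two bonded gigabit ports.
    print(hours_to_read_all(10.0, 2.0))
```

With these assumed figures, each of the 100 nodes gets about 1.25 MB/s from the single gigabit port, versus 25 MB/s from the 20 Gb/s pipe.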
Third, it has a poorly architected network. Ignoring the usual complete lack of a data transport network separate from the command and control network, we often see high end IT switches strung together. Yes, that’s right, daisy chained hundred port gigabit switches. If your HPC vendor cannot tell you why it is a “Bad Idea”® then you really, really, need to be speaking to a different HPC vendor. If your HPC vendor has not diagnosed and solved these problems, then they are unaware of the symptoms, and will likely start suggesting you put some really expensive things together to mitigate a bad design. A few of you are cringing? Yes.
A corollary to this: the cheap switch users. There are very real differences, and not just in $$, but honest performance impacts from your switch choices. We have solved numerous customer problems by tracing back to the cheap switching infrastructure people have built out.
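One way to see part of the daisy-chain problem is to count how much node traffic can pile onto a single inter-switch link. The sketch below is a simplified model under stated assumptions: switches joined in a straight chain by one uplink each, all node ports at the same speed, and hypothetical port counts. It is one reason among several, not a full network analysis.

```python
def oversubscription(node_ports: int, port_gbit: float, uplink_gbit: float) -> float:
    """Ratio of possible node traffic on one switch to its uplink capacity."""
    return node_ports * port_gbit / uplink_gbit


def chain_bisection_per_node_mbit(switches: int, node_ports: int,
                                  uplink_gbit: float) -> float:
    """Per-node bandwidth (Mb/s) across the middle of a daisy chain.

    Cutting a straight chain in the middle severs exactly one uplink, so all
    traffic between the two halves shares that single link.
    """
    if switches < 2:
        raise ValueError("a chain needs at least two switches")
    nodes_on_one_side = (switches // 2) * node_ports
    return uplink_gbit * 1000.0 / nodes_on_one_side


if __name__ == "__main__":
    # Assumed example: 96 gigabit node ports per switch, one gigabit uplink.
    print(oversubscription(96, 1.0, 1.0))
    # Four such switches in a chain: bandwidth per node across the middle.
    print(chain_bisection_per_node_mbit(4, 96, 1.0))
```

Under these assumptions each uplink is oversubscribed 96 to 1, and traffic crossing the middle of a four-switch chain gets only about 5 Mb/s per node.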
There are other examples of the IT cluster, but these are top of mind right now (having experienced every single one of them in the last 30 days or less).
Remember, a pile-o-PCs is *NOT* a cluster. It is a pile-o-PCs. Putting a cluster distribution on a pile-o-PCs doesn’t make it a cluster. It makes it a pile-o-PCs with a cluster distribution.
If you need a fleet of 18 wheelers (trucks) and you instead buy a bunch of VW bugs and argue that you have equivalent hauling capacity, you may have, I dunno, missed the point. You certainly shouldn’t expect to be able to haul what the fleet of 18 wheelers can haul. Same is true in HPC. Very true in HPC.
A good vendor will work with you to solve your problems, or help reduce the problem to a manageable situation. They will sit down with you to find solutions. Not impose them. They will help you achieve your goals. Not theirs.
Most of the IT and rack-em-stack-em shops are shipping volumes. Many of those VWs. Few solutions.
Doug’s company (Basement-supercomputing.com) does solutions. So does my company (Scalable Informatics). There are a few others. Decades of experience do matter.

4 thoughts on “What he said!”

  1. Well, I guess the University I work for doesn’t have an HPC cluster – at least according to you. While we don’t use RHEL, we use CentOS – and we’ve never had any driver issues by the way. Please go ahead and expend however many words you need to explain why this is a bad idea. Oh, and if you even say SLES I’ll laugh so loud you’ll hear me wherever you happen to be in the country in relation to me.
    We also just upgraded to 64-bit CentOS 5 – because we are getting ready for applications that will need to access more than 4 GB RAM on a node. I understand that there are performance tradeoffs involved, but you need to explain why this is *always* a mistake as you imply in your blog or retract the statement.
    We’ve also got a SAN … and we use GPFS as the filesystem. We can deliver I/O faster than our cluster users need, so why is that a problem? Because it’s cheaper than your Thumpers?

  2. Taking my time to respond to this.
    It looks like you are at Vanderbilt.edu. That you wrote


    suggests that you aren’t really after dialog.
    Ok, first off, CentOS is RHEL. Making a distinction doesn’t help, as it is the same technology. It is actually more than RHEL though, as the CentOS-extras repository includes some of the missing things that HPC users need (kmod-xfs, etc). This is an implicit acknowledgement on the part of the CentOS people that they need to support more than just vanilla RHEL rebuilds. Some users do have file systems that exceed 2TB. Some have a need for firewire. Some need jfs and xfs, and OCFS2, and …
    That you “just” upgraded to 64-bit CentOS … hmmm … was this a new cluster, or were you running 32-bit OSes on your 64-bit chips before? This was one of my points. Thank you for lending support to my thesis. There is a well known (at least in HPC circles) performance benefit to running 64-bit code. It is 5-40% for free, simply with a recompilation. On the same machine. If your code is binary only, then you are stuck. Most non-commercial codes are not binary only, so you have no issues in rebuilding. Again, we run into this all the time, and point this out to people. Some of the comments we have received back are similar to the 4GB comment you made. Which is, as noted, not the reason you need to go 64-bit. Which again, thank you, made another one of my points for me.
    The issue is, in part, register pressure, and more registers available to code operating in 64 bit mode than in 32 bit mode. Real world (e.g. real applications with real data) performance tests demonstrate better performance on the same machine … a 2GB ram machine at that.
    As noted, this isn’t the only example of this effect. This is not something most IT people know or care about. It is something that HPC people do care about, and a few of them know this pretty well. We have talked about this with many customers, including those that did not believe us. We simply asked them to rebuild/rerun their code on the same machine running a 64 bit version of their distribution (RHEL or otherwise). Sure enough, we haven’t heard of a single “slower” yet. We have heard everywhere from 15% through 100% faster. With all manner of compilers (gcc, Intel, PGI, …).
    This is why getting HPC people working with you is a good thing. If you were running 64 bit processors as 32 bit, then you were leaving lots of performance on the table. A 5 minute call to a company like Doug’s or mine would have helped you with this.
    If on the other hand, you simply didn’t think that going 64 bit was worth your while as no one needed more than 4 GB ram anyway … well …
    The point is that IT people have different knowledge/skill sets than HPC people. They use/build computers differently. They program them differently. They run them differently.
    With all due respect, we have not seen a single use case where 32-bit performance has been consistently better, by 15% or more, than 64-bit on the same hardware. You certainly are welcome to point this out, if you can find such a case [sound of Kevin madly typing into google to find such a case]. More to the point, I don’t think there is anything to retract, and more specifically, if you read the linked report (yes, I wrote it, after doing the experiments, and reporting on the results), you may find that many others have done similar work afterwards, and found similar things. Again, feel free to google this.
    As I noted, the performance deltas were observed with 2GB ram. Our customers who have done similar tests also had less than 4GB ram. Always? Yeah, I believe there are extremely few reasons to use 32 bit, and while corner cases may be found, and I expect that you will be searching for them to try to rebut this, from a general point of view, my experience has been that you are wasting resources by not exploiting the full capability of the system. If you want to have an offline discussion of register pressure in compilers, and the additional registers available in 64 bit mode, I would be more than happy to provide additional documentation for you.
    Ok, next topic. Distros. This is a dangerous area: people have replaced the editor wars of old, and the more recent dynamic language wars, with distro wars. It’s Ubuntu vs SuSE vs RHEL vs …. The distro is largely window dressing. The issue is the kernel. More in a moment.
    Your dislike of SLES is interesting. I might suggest you ask some of the people doing a good job of building high performance computing clusters whether or not they use SuSE. The reason most do use SuSE has to do with the kernel as I noted before.
    The current kernel is 2.6.25.x. RHEL5.x ships with a 2.6.18.x, and RHEL4 ships with 2.6.9.x . RHEL5 and RHEL4 before it, disabled many useful things in their builds, and not for technical reasons, but purely for marketing ones. Their arguments around not using/supporting xfs are weak at best … current claims are that it is equivalent in performance to ext3 so why bother. This isn’t true, ask the folks who have tested it and use it with real codes … not iozone/bonnie and other non-workload IO generators. Ask the people who need large file systems. It is somewhat hard to build an ext2/3 file system on more than 8 TB of space. We have customers who need 1PB+ of space. Precisely what should I tell them if they use RHEL’s kernel? Tough luck?
    The reason why SLES/SuSE is popular has to do with a more advanced, and generally perceived to be, better engineered (in terms of patch set compatibility) kernel. This means that IBM, and many others releasing new hardware, have a better than fighting chance to have it install correctly on the hardware with SuSE than with RHEL.
    We take a somewhat different tack. We build kernels, which are usable in any distro. Our customers use our kernels, usually comparing to their native installed kernel, usually the one “out of the box” or after a few updates. Invariably our kernel (at last build) has better performance, better stability under load, better driver support, … And we provide RPMs to our customers, as well as .debs and others.
    That is, the distro is window dressing. What we care about is the quality of the kernel. When you are pushing your machine really hard, you will care about that as well. We don’t want ours to crash.
    We have replaced kernels for customers with RHEL clusters, Fedora units, SuSE units, Ubuntu units, OpenFiler units, etc. We do this to increase stability, performance, and make the system more easily supportable. We don’t force this on our users, but we do recommend it strongly. Again, I am not aware of a customer, to date, that has voluntarily moved off our kernel once they have installed it. After that, RHEL is tolerable. But you lose your “support”. Such is life.
    Ok, now your “SAN” upon which you are running GPFS. Interesting. GPFS is a parallel file system. The SANs to which I was referring were the standard IT SANs upon which end users connect lots of disk through slow 4Gb channels. So the concept is that for a parallel file system to work well, you need as many parallel streams as possible. Which goes quite a bit against a standard IT SAN scenario. That is, you want as many direct connections to your disks in a parallel file system as you can get. The switch in a SAN is often the limiting factor, as are the 4Gb loops. SAS is looking better than this as they have ~12Gb loop capability. But then again, with Infiniband (or 10 GbE), I get 10-40Gb per link, and I can build my parallel file systems out of that. More in a moment.
    Delivering IO faster to a cluster than your users need. Interesting comment. Would love to ask your users about this. With all due respect, I might suggest you do this. I won’t ask you to retract this comment, but I will ask you to do the background homework on it. We have not yet run into a set of users that said they ever had enough IO bandwidth. No matter how much they had in-system to begin with. This includes users with large installations of Panasas boxes. So, again, as an HPC guy, I recommend having this private discussion with your users. Ask them if they could use more. Don’t jump down their throats (like you did with me here). Just ask. Amazing what comes out when this occurs.
    Onto the SAN being cheaper than our thumpers. Hmmm. So many ways to respond to this, so little time. Sort of like the Steve Martin character in the movie ‘Roxanne’, when providing the 20 better jibes as compared to the lame one thrown his way … No… better to take the high road here.
    What bandwidth do you see in and out of your SAN, in aggregate, when all nodes are pulling on files/blocks?
    If “cheap” is your main metric, I am going to guess that you really aren’t delivering the IOs your customers need.
    We don’t sell “thumpers”. Well, ok, if you press us, we are a Sun reseller (or at least we were for a while, and then seem to have disappeared from their site, go figure), so if you want a “thumper”, a real honest to goodness x4500, we could sell you one. But why bother when we have something faster and better (at least in our humble opinion)? And I would be careful about clicking on that link. If you like lower cost, you may have trouble swallowing your bile after looking at the pricing (on the web site).
    We deliver, per RAID card, 750+ MB/s. Our units scale from 1-4 RAID cards. We can push this out over SDR, DDR, QDR, 10 GbE, and others. We can aggregate units with Lustre, or GlusterFS (which we prefer for a number of reasons), or PVFS2, or Ibrix, or … And we just got a pointer about a month ago to speak with some folks at IBM about getting GPFS on here as well.
    Our units are doing all this at well under $1/GB. So a 48TB unit, capable of sinking/sourcing more than 1GB/s (a bit more than that 🙂 ), pushing the data out via 10 GbE or IB or channel bonded gigabit, will run you well under $48k. And if you want to build a cluster file system out of them, by all means, be our guest, we will help.
    More than that, we will, at your invitation, help you optimize your systems. Curious, our fastest growing support option is our remote support option. Our second fastest growing support option is our re-engineering support. We can help find the (nearly) smallest set of changes which will provide the maximal performance benefit to you.
    But enough about the commercial.
    Kevin, my email is joe _at_ scalability.org . Feel free to contact me to discuss if you would like. I will not retract my comments, I do stand by them and augment them with more information. You may not like them, but I am open to criticism and counter examples. In fact, I encourage it. I also stand by my calling these things IT clusters. We run into so many of them, we try our best to help our customers get past where they are to points where they can be more productive. You can hate me for saying this, you can disagree. You can agree. Whatever. The point being that IT clusters tend to appear to be minimally stacked, cost “optimized” systems.
    As I commented to John West some weeks ago, we live in the era of good enough. If I were you, the point I would have made (the best counter to my post as far as I am concerned) would have been “but it works, so why fix it.” I would have come back with “it sometimes works, we have run into issues with OFED/IB/drivers and file systems. But I would agree that for some users, this ‘good enough’ is fine.” Unfortunately, you did appear to get somewhat emotional over this, so we couldn’t really go down that route. I invite you to though. For some users, good enough is all they need, and my comments are so much hot air and entropy in the universe. For other users, the ones we deal with on a day to day basis, they run into problem after problem that needs serious attention paid to every element of the system and software. This is where knowledge and experience really helps.
    FWIW: I do contact people who are subject matter experts on a frequent basis to ask basic questions of. While I know my own area reasonably well, there are lots of things I don’t know. So I don’t pretend that I do. I call the experts (Jeff, Doug, Vipin, …) and I ask.

  3. Joe,
    You will note that my personal e-mail address is included above (and I meant to include it with my original response to you and realized that I had not a split-second after clicking on the “submit comment” button). The reason why I did not put the URL of my employer in the provided box is because I am speaking for myself only and not my employer and therefore wished to keep them anonymous (and I wasn’t singling you out either – I routinely insert “noneof@yourbusiness.com” for my e-mail address in web forms where they’re requiring an e-mail address for no reason other than to spam me). Your comment that I am not interested in dialog is totally incorrect – I asked you to respond. Congratulations on doing some very easy detective work, however. I just wish you hadn’t published my employer’s name in a public manner without knowing my real reasons first…
    OK, so it looks like we agree on more of the technical points than it first appeared. Your original blog post was worded in such a way that it implied that 64-bit was a mistake for HPC clusters. Your followup clarifies that, which is what I was after, and I don’t disagree with anything you said about the 32-bit versus 64-bit issue. BTW, just to clarify something … I’m well aware that CentOS is simply RHEL with the RedHat proprietary “stuff” stripped out.
    My dislike (and the dislike of my co-workers) for SLES comes from two factors: 1) we have found it to be a nightmare from a SysAdmin perspective. Things that “just work” with the RHEL clones fail in mysterious ways and for no apparent reason on SLES. 2) Novell’s support, at least here, was atrociously bad.
    You are correct – the issue is the kernel, not the distro. So if we start with CentOS for ease of administration, then customize as our users’ needs dictate, what’s the problem? In other words, my original response was partly because your post came across as implying that anyone who starts with RHEL or a RHEL clone is an idiot who doesn’t know what they’re doing.
    Similarly, you also imply that anyone who uses a SAN is also an idiot that doesn’t know what they’re doing. We looked at GPFS, Panasas, Lustre, PVFS2, and IBrix before choosing GPFS (this was several years ago). I spent quite a few months doing evaluations before we ended up choosing GPFS (Panasas was a close second). GPFS was chosen because it came out first or tied for first in all of our major criteria: performance, reliability, and cost.
    Speaking of performance, I don’t have to go ask my customers if I’m delivering all the I/O they need … I already know the answer to that question. I measure the actual I/O they’re doing – the fact that they rarely use 30% of the available throughput tells me more than asking them what they (think) they need ever could. We’ve also made modifications to our storage environment based on that monitoring of what’s really going on that have resulted in improved performance for our users.
    As to cost, comparing what you can sell me a thumper for today to what we paid for our SAN a few years ago is an unfair comparison for blatantly obvious reasons. Our SAN has been in place since before the products you recommend came to market, so (borrowing your own suggestion here…) why are we idiots for keeping something that “just works”?
    So, to summarize, I responded in the way I did for two reasons: 1) I’m blunt by nature (and evidently so are you, so let’s not be too hard on each other for that). 2) Your post came across as implying that the things we do make us a bunch of idiots who don’t know what we’re doing. Maybe I am, but my co-workers certainly aren’t.

  4. @Kevin
    Thank you for responding. Yes, I am blunt. It does rankle some people, but at the end of the day, people see where I stand and why I stand there. I make no apologies for being blunt … if I see crap foisted on the HPC world, I am one of the (sadly) very few people to call it what it is. Likewise, if I see good things, I point them out.
    Ok, onto the major points I want to address. IO performance first. We have watched a cluster at a customer site use 40% of their IO bandwidth. I asked them if they had enough, and their answer was “no” but that was all they could use at any one moment as others would complain (and did complain to us) when they pushed the system hard. They didn’t use our storage until late in the game. Once we had that hooked up, in all honesty, most of their run of the mill IO problems went away. It is frustrating to us as a company that we can’t use this success story as we are not allowed to identify the customer, due to our contract with them. That said, we have found, over and over again, that what we think people need for IO bandwidth is usually just the “tip of the iceberg”.
    On cost and comparing “old” SANs to “thumpers”. I am talking more along the lines of newer systems. Where we have a small/mid-size SAN switch coupled with various storage crates. These are designed explicitly to maximize the storage per loop, which runs counter to maximizing the performance per device. We are always happy when we run into comparisons with these, for obvious (performance and cost) reasons. If you are building out a new SAN for a cluster, you have to ask exactly how much performance you can pull out of it. If you design the SAN for performance, you wind up building a very expensive version of something we can do for both far less money, and at a higher aggregate performance. So in this sense, yes, it doesn’t make sense, apart from exceptional cases (corner cases) to use a new SAN for a cluster storage system.
    That said, in business, nothing is fair. No, really.
    Our current systems are most definitely more cost effective than the vast majority of older SAN systems. Of that we have little doubt. Performance wise, we have similar advantages to the old systems. So why not yank out what you have and replace them? Well, they work, and why add additional cost to an old cluster unless it is hitting an IO wall (and asking the customers is a good way to find that out). If you are hitting an IO wall, and you need to do something about it (a fair amount of our business), then by all means, we should be a group you speak to.
    If you need to scale out to tens of sustained GB/s, Panasas makes great kit, and our systems happily talk to panfs. If you need to add capacity to smaller clusters or build parallel file system clusters and don’t want to use Panfs, our kit is both high performance and cost effective. And they aren’t “thumpers” 🙂
    Ok, on being idiots … no. I am not implying or insinuating that you or your co-workers are idiots. We run into a large number of clusters designed by IT departments, and they look very similar. When you press them hard, things fail, and this was the point of my original post. We run into these again and again. I guess we shouldn’t complain as it generates quite a bit of support revenue for us. But at the same time, if you start out with a good design, things usually work better.
    I do like to tell our customers that systems designed to fail often do. Some of the clusters we see/work on are in this category. Often designed by the local IT person after reading a few web pages somewhere, and then cost optimizing the components.
    Again, I am not bashing you or your co-workers. Just muttering aloud over bad designs and implementations I see in modern cluster systems. I don’t expect 3 year old systems to have the same set of issues that today’s systems have. I don’t expect them to have the same designs. But I do expect that new systems will be considered more carefully than I have seen after being called in to help.
