This post has been through countless iterations. I have written/rewritten it a multitude of times, because I was looking for a way to say it, but never quite settled on a particular manner.
So here it is … stream of consciousness and all that.
We have lost a few bids recently, and while I don’t want to comment on to whom or for whom, I want to comment on “why”. The reason I want to comment on the “why” is that it has importance to the market, and those who forget this “why” are doomed to make the same mistakes.
The “why” is that we built configurations that were excellent for our customers, and cost a bit more than others that were “good enough”. That is, the items we had, that we felt made our bids better, also made our units cost more. And this has a number of interesting consequences… really more side effects of the deeper issue. Let’s deal with the consequences.
First: Being the high bidder, or even a higher bidder, is not a winning strategy for an RFP.
Second: Most budgets are fixed, not unbounded, which makes them a zero sum game. To win the customer’s business you need to provide maximal benefit within that budget. This means you need to identify what is valuable to them and address that in a meaningful way.
These are the consequences, but what are they consequences of?
Well, it’s a philosophy of “good enough”. That is, if the additional utility of what you want to add is low, then precisely why are you adding it? So you add a rack mount KVM over IP and other related bits. It costs a bit of money. Compare this to a terminal server, which could cost less for a larger system. Both will work. One won’t be as fancy, but it will work pretty well. It will be “good enough”.
Same thing with other components. Basically at the end of the day you need to maximize utility in a minimum budget, but do so globally across the HPC system, and not just subcomponents.
This of course assumes that someone has not badly under-spec’ed their response. But that is another story, for discussion over a beer.
So start out with the basic hardware. Add needed things that will make a substantial impact upon the users. When we have focused on doing this, we have had a positive impact. This is what we want, to add real realizable value.
Without beating up on other companies/products, let’s try a little gedanken experiment.
Suppose we have some product that adds a cost X per node to the cluster (in the case of switches it is a cost per port and per node, but same basic idea). Adding this product to an N node cluster increases the cost by N*X + invariant costs. These invariant costs would be a switch frame or similar. Basically the price you have to pay to use that technology.
This product at the end of the day now adds some additional cost to the cluster. Call this additional cost fraction d, so that the new cost to the cluster is (1+d) * original cost.
In the case of a fixed budget, d should be 0, and the only variable is the size of cluster offered.
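The arithmetic of this gedanken experiment can be sketched in a few lines of Python. This is only an illustration: the function names and every dollar figure in the usage note are hypothetical, and the "original cost" is assumed to be simply N times a base per-node price.

```python
def added_cost(n_nodes, per_node_cost, invariant_cost):
    """Extra spend for the add-on: N*X plus the invariant costs."""
    return n_nodes * per_node_cost + invariant_cost


def cost_fraction(base_node_cost, n_nodes, per_node_cost, invariant_cost):
    """d, defined so that new cost = (1 + d) * original cost.

    Assumption: original cost is n_nodes * base_node_cost.
    """
    original = n_nodes * base_node_cost
    return added_cost(n_nodes, per_node_cost, invariant_cost) / original


def nodes_within_budget(budget, base_node_cost, per_node_cost=0.0,
                        invariant_cost=0.0):
    """Largest N with N*(base + X) + invariant <= budget (fixed budget case)."""
    usable = budget - invariant_cost
    if usable <= 0:
        return 0
    return int(usable // (base_node_cost + per_node_cost))
```

With made-up numbers (a $100,000 budget, $2,500 base nodes, an add-on at $500 per node plus a $10,000 switch frame), `nodes_within_budget(100000, 2500)` and `nodes_within_budget(100000, 2500, 500, 10000)` show how many nodes the customer gives up to pay for the add-on, which is the trade the fixed budget forces.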
Many vendors will happily tell you that their technology is more valuable. So they will use dubious metrics to discuss this purported value, and make their case. A few people will believe them, and even parrot the marketing material.
That our SCSI vs SATA vs FC post is one of the most frequently read here, and is viewed by many with stakes on all sides of the issue is confirmation of this. People want to know that what they are considering is believed to be of value. They want to know that if they are going to pay d extra, that this money will buy them something that they are not getting with alternative technologies. That is, there is a perception of value for this choice.
Other people simply want “good enough”. They want something that will work, meet their needs, not add extraneous costs, regardless of what you may perceive to be of value.
This is important.
d for SCSI as compared to SATA at a fixed disk size (call it 80 GB) is 2-3. So precisely what do you get for that extra factor? Is it really worth that 2-3x? What benefit precisely will it bring the customer, that an alternative choice could not provide?
5 years ago, the choices were between FC and SCSI for storage. There were a few other technologies. 5 years ago, backup was done to tape. There were few other reasonable choices. STK and others made lots of money.
Now, today, the major storage media appear to be migrating to SATA, with SAS at the “higher” end. SAS does offer some differentiation over SATA, but I am not sure it is enough to justify the price differential. SATA is in many cases “good enough”. More than that, it is inexpensive enough that you can add additional units in higher reliability RAID1’s for less than the cost of the single spindle alternative technologies. So we are seeing the “good enough” factor re-align the storage markets. The added cost of that d is not viewed as being of value for most customers.
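A rough sketch of the mirrored-pair comparison, in Python. The prices and failure probability used with it are hypothetical, and the reliability function assumes independent drive failures with no rebuild window, which real arrays do not strictly satisfy.

```python
def mirrored_vs_single(cheap_price, price_ratio, mirror_width=2):
    """Return (cost of a RAID1 set of cheap disks, cost of one pricier disk).

    price_ratio is how many times more the alternative disk costs
    at the same capacity (the 2-3x quoted earlier for SCSI vs SATA).
    """
    mirror_cost = mirror_width * cheap_price
    single_cost = price_ratio * cheap_price
    return mirror_cost, single_cost


def mirror_loss_probability(p_fail, mirror_width=2):
    """Chance that every disk in the mirror fails in the same period.

    Assumption: failures are independent and nothing is rebuilt.
    """
    return p_fail ** mirror_width
```

At a ratio of 3, `mirrored_vs_single(100, 3)` gives a mirrored pair that costs less than the single spindle; at a ratio of exactly 2 the two only break even, so the argument depends on where in the 2-3x range the real prices fall.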
Well, that’s storage, what about HPC?
Good enough factors tremendously in HPC. It has been driving HPC downmarket and into wider radii for quite some time. I have been calling this the “80-20 rule”: it does 80% of what you need for 20% of the price. It started with the vector supers and the super-micros. RISC machines were good enough for most things, and pretty much decimated the vector supers market. It happened again with clusters, effectively decimating the super-micros market. You can see all of this reflected in the top500. Another interesting thing happened which we are enjoying now, specifically the rapid expansion of the HPC market. Making computing power and capability cost less allows larger groups to use the power. This is important.
Basically the lessons to be learned from this are that a) good enough works well for most people (e.g. get rid of the extra cost marginal utility items), b) the market can be grown by offering more for less cost, c) retreating to the high end is a failed strategy as it won’t stave off your competitors. If you don’t eat your lunch, your competitors will.
I think about this after having discussions with some of the 10GbE vendors and hearing multi-thousand-dollar per-port prices for reasonably sized clusters. I think about this when I see people insist upon a SCSI disk per node or SANs in a cluster. Is there a more effective use of this money? Is it better off not being spent?
In the case of our losses, we didn’t stick with our overall design guns, and allowed a little of that extra “bells and whistles” in. Which raised our costs. In one case against a fixed budget (which meant less stuff for the customer), and in another case it meant a higher price on an RFP.
What does this mean for the future of HPC? I saw an article today on a Celoxica card for $15,000USD with 2 FPGAs, a compiler, and a development environment. Now compare all the effort to get applications going on that to the sub $1,000 cards running Cell BE processors, or several thousand $ cards running Clearspeed chips. Or even sub $1,000 cards running pimped up GPUs.
What some computer scientists are doing now is demonstrating how to use GPGPU, Cells, and Clearspeed, and building the infrastructure to use them. FPGAs are in there, but the lower cost ones are likely to be “good enough”. $15,000USD cards won’t fly in HPC.
We work with some groups that have accelerators for informatics. They transitioned away from PCI bus based cards to USB2 based cards. We were concerned about the performance impact. The price impact is huge, and the performance impact only moderate. That is, they were right. The USB2 versions are good enough for most people.
A smart organization will learn from its errors, adapt, and thrive. An intransigent or ossified organization will not be able to adapt. One that drinks its own koolaid may not be able to grasp why it will likely fail. Simply adding cost without adding significant benefit doesn’t help anyone, and leaves lots of marketeers (and large companies) blinking rapidly with occasional sputtering when questioned on the utility of their products.
d has to be meaningful. You can’t paint a box purple and expect people to pay 3 times the price. The cost of high speed networking per cluster node is about 0; it comes built in with the box. You can (mis)design a network for it to work in quite easily … We have diagnosed and fixed such a beast recently (this is incidentally why you want experienced HPC people designing, building, and supporting HPC infrastructure, and not your neighborhood MCSE). So the need for the added performance, really the low latency nature of an Infiniband or Infinipath network, or of 10GbE, has to be obvious for you to make that choice. Otherwise GbE is good enough. Same with OS … though in this case, one of the choices is free (as in zero cost), so any non-zero cost OSes are immediately at a disadvantage, and need to be able to justify their costs in terms of savings elsewhere. The zero cost OS is good enough (well, better, really) in most cases, so this is usually a non-decision. The storage design and implementation is again something where “good enough” is fine, especially considering that in most cases good enough is as good as the best (and again, in some cases, better).
Forgetting “good enough” and adding extra bells and whistles, without a really good clear articulation of what the extra cost brings relative to the alternative choice, or worse, not having a good and reasonable articulation, just marketing bits, is a good way to lose.
I think the FPGA vendors haven’t internalized this yet. They have competitors. They just don’t want to admit it. Intel hadn’t internalized this with Itanium2: the alternatives were better in that they cost less, and delivered about the same value. Intel didn’t grasp this with Opteron, and you can see the impact this failure to acknowledge had upon them. Not that Opteron was merely “good enough”; it turned out to be a better Xeon than Xeon. IBM had the converse many years ago; it did have a better technology, which it completely bollixed up relative to its competitors. Now IBM has great technology again, and they play a much better game of articulating its value. Sun is still digesting that not only is Opteron “good enough”, but it is in fact better than Sparc for the same tasks. Sun hasn’t quite grasped the issue with Linux, still preferring to push its own OS, but hopefully they will finally see the writing on the wall. Whether or not Sun concedes that Linux is good enough is now irrelevant.
HPC has been the rapidly expanding market of “good enough”. Low cost accelerated computing will likely push this out into a wide/mass market in much the same way that the cluster market is fueling current HPC market growth. The growth comes downmarket. Suppose you could get cluster performance out of a $10k part. Would you buy it? The real question to ask is who wouldn’t? It would be “good enough”. And the market would continue to grow, albeit at a much faster clip.
The irrelevance comes in if you ignore “good enough”. Very few people will pay $50k for something that is 2-3x faster than something that you can get for $6k. GPUs can deliver 5-10x the FP performance of CPUs, possibly more. And they don’t cost that much. Nor does Cell. Neither of these technologies is irrelevant to the future of HPC.
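To put a number on the $50k versus $6k comparison, here is a small Python sketch of performance per dollar. The function names are hypothetical, and the 2.5x speedup in the usage note is just the midpoint of the 2-3x range above, not a measured figure.

```python
def perf_per_dollar(relative_speed, price):
    """Relative performance delivered per dollar spent."""
    return relative_speed / price


def value_advantage(cheap_price, pricey_price, pricey_speedup):
    """How many times more performance per dollar the cheaper part delivers.

    The cheaper part is taken as the 1.0x speed baseline.
    """
    cheap = perf_per_dollar(1.0, cheap_price)
    pricey = perf_per_dollar(pricey_speedup, pricey_price)
    return cheap / pricey
```

Calling `value_advantage(6000, 50000, 2.5)` gives the factor by which the $6k part wins on performance per dollar; the faster part only makes sense when the absolute speed of a single unit is worth paying that multiple for.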