HPC in the first decade of a new millennium: a perspective, part 2

The death of RISC …

obligatory Monty Python Holy Grail quote:
old man: I’m not dead yet, I think I’ll go for a walk
John Cleese: Look, you are not fooling anyone …

The RISC vendors (SGI, HP, IBM, …) realized that RISC was dead, and that EPIC would be the technology that killed it. I was at SGI at the time, and disagreed that EPIC (Itanium) was going to be the killer. I thought it was going to be x86.
My rationale was a set of benchmarks I had been running on systems around that time: some basic testing for projects I was working on, preserved in an email from 2001 (found recently while cleaning out an old, dying machine).

200 sequences (my ebola genome sequences from Entrez) run against dbEST on
R12k 400MHz single CPU n64 O3 optimized blast, and a single CPU 1.2 GHz
AMD AthlonMP gcc O2 optimized blast (badly handicapped, as the compiler is …)
(from memory, could be off in the 2nd digit)
MIPS        4400 s
AMD         2200 s
P3 1 GHz    3000 s
P4 1.5 GHz  3800 s
Blastn, no translation, E of 10 (normal default values).  Run in the last
month.  2.2.1 NCBI tools.
 I ran it on an HP N-Class 550 MHz CPU.  Got about 3500 s.

and an earlier, very prescient benchmark I had run.

  cpu+mhz         cache   qmd_a   qmd_b   mdb_a   mdb_b   str_a   str_b
  K7 650          0.5M    122.68  126.76  6.44    6.57    400.0   400.0
  Pent III 700    0.25M   169.32  170.96  8.20    8.19    342.9   331.0
  Xeon III 500    2.0M    189.62  189.45  10.79   10.78   252.6   246.2
  Xeon III 500    0.5M    221.06  226.58  11.23   11.25   290.9   290.9
  Pent III 700    0.25M   205.78  197.89  9.64    9.46    400.0   355.5
  Pent II 266     0.5M    713.45  507.63  23.95   23.70   104.3   104.3
  AMD K6-2 500    2.0M    449.53  450.52  24.59   24.64   90.57   90.57
  R12k 400        8M      86.04   85.93   3.99    3.99    634.4   627.3
  R12k 300        8M      117.5   116.22  5.37    5.39    345.1   344.1
  R12k 400        8M      87.53   86.76   4.11    4.10    378.3   377.8
        All times are in seconds except for str_{a,b}, which are MB/s
        from STREAM.  QMD is my quantum MD code.  MDB is the MDBNCH
        code.  STR is the STREAM triad.  All IA boxes used PGROUP
        compilers with maximal optimizations.  The SGI boxes used SGI
        compilers with maximal optimizations.

I remember arguing that I couldn’t see a customer spending 20x more for less than a 2x performance delta over the commodity platform. This was around the 1999–2000 time frame.
I looked at the data, and I could see RISC was doomed, but tests on EPIC yielded performance that was not significantly better than RISC, and often substantially worse.
And this gets into one of the failure modes I have observed for us in HPC. We have a tendency to sometimes believe our own marketing. Make the underlying silicon less smart, the pitch went, and it’s a SMOP to build smart compilers that will handle all the heavy lifting to make the silicon roar.
A simple matter of programming.
I used a Multiflow. I remember the day-long compilation cycle. No, I am not kidding: a make operation took more than a day, and often the build would bomb due to bugs in the compiler.
How can we work on systems like that?
In short, you can’t. It’s hard to put real smarts into compilers. Most of them are so highly focused on not violating language semantics that they willingly trade potentially beneficial performance optimizations for a more provably correct execution. This is due in part to the mismatch between the languages and the architecture.
A compiler is a transliteration device between expression media that don’t completely overlap … there aren’t necessarily provable one-to-one and onto mappings between the expression of an algorithm in a particular high-level language and the transliteration of that language into low-level assembly for the platform. Making a compiler “smart” should mean that it can “learn” how to do a better job, as well as do a better job of inferring meaning.
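A concrete example of that forced conservatism, my own hypothetical sketch rather than anything from the Multiflow days: in C, a compiler must assume that two pointer parameters may alias, which blocks vectorization and instruction reordering unless the programmer explicitly promises otherwise with `restrict`.

```c
/* Without restrict, the compiler must assume out and in may overlap:
 * a store to out[i] could change some later in[j], so it cannot
 * safely reorder or widen the loads and stores.  Correctness wins,
 * performance loses. */
void scale_may_alias(double *out, const double *in, double s, long n)
{
    for (long i = 0; i < n; i++)
        out[i] = s * in[i];
}

/* restrict is the programmer asserting "these do not overlap"; only
 * with that extra meaning, absent from the language's defaults, can
 * the compiler emit the wide, reordered code the silicon wants. */
void scale_no_alias(double *restrict out, const double *restrict in,
                    double s, long n)
{
    for (long i = 0; i < n; i++)
        out[i] = s * in[i];
}
```

Both functions compute the same thing; the semantics of the language, not the algorithm, are what keep the first one slow.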
I will touch on this later. This is a very important topic. And this is why VLIW systems for HPC are pretty much doomed from inception. But that is for later.
So to recap: at the end of 1999, RISC was generally acknowledged by all but Sun, and a few RISC bigots in high places at companies like SGI, to be losing out to a new architecture. Unfortunately for those who put all their eggs in that one basket, Itanium wasn’t to be. But we didn’t know this yet, though a few of us had guessed it. Looking over some of the sent emails in that erased box, it is scary how good those guesses were.