Need to understand the SGI RASC BLAST benchmark

Way back when, we developed a little scalable app called CT-BLAST that ran BLAST in parallel on clusters. I had been thinking about re-doing this outside SGI when I first learned of MPI-BLAST some years ago. Since then many folks have tried accelerating BLAST.
They do this because BLAST consumes so many cycles. Sadly, BLAST doesn’t seem to drive purchases …
That said, some people continue to target this as a core market.

SGI reported some sort of benchmark results. I need to look at these more and attempt to discern what was done.
Marketeers have this annoying habit of regurgitating “results” in many different ways without knowing the provenance or the quality of the tests, or in many cases, what the test actually was. Saying “50% faster” is easier than stating what happened.
So people like me (who occasionally wear marketing hats, though I insist upon real verifiable and repeatable numbers and tests, telling our customers how we did our tests) want to figure out what our competitors did.
So I tried running a test of 1000 A. thaliana cDNA clones, about 460 bp average length, against nt from 2007-07-01. According to the data from various online SGI presentations, this should run quite fast on their solution. The claim is 900x on 64 FPGA cores, which works out to about 14x per FPGA core.
This is similar to what we had seen in the past with the Progeniq units: an order of magnitude more performance for about the cost of a compute node.
What we see on the Intel 5482 unit we have had built for another customer (need to do some testing to burn it in) is this:
[landman@pegasus-i5482 ~]$ /usr/bin/time blastall -i thousand.fsa -o out -e 1e-9 -d nt -a 8 -p blastn
2715.09user 24.68system 7:31.05elapsed 607%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (14433major+16142322minor)pagefaults 0swaps
That is, running across all the processors, we finish this in 7m31s (451 seconds).
That averages out to 0.45 s/sequence.
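As a quick check on those numbers, here is a minimal Python sketch that converts the elapsed field printed by /usr/bin/time into seconds and derives the per-query average and the average number of busy cores. The helper name elapsed_to_seconds is made up for illustration; the input values are copied from the output above.

```python
def elapsed_to_seconds(elapsed: str) -> float:
    """Convert an elapsed field such as '7:31.05' or '1:02:03' to seconds."""
    seconds = 0.0
    for part in elapsed.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


wall = elapsed_to_seconds("7:31.05")  # elapsed field from the run above
user, system = 2715.09, 24.68         # user and system CPU seconds, same line
n_queries = 1000                      # sequences in thousand.fsa

per_query = wall / n_queries
cores_busy = (user + system) / wall   # should agree with the 607%CPU field

print(f"wall clock:     {wall:.2f} s")
print(f"per query:      {per_query:.3f} s")
print(f"avg cores busy: {cores_busy:.2f}")
```

The busy-core figure comes out at about 6.07, which matches the 607%CPU that time reported, so on average roughly six of the eight cores were doing work.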
This nt database is

Database: nt
           5,440,657 sequences; 21,048,893,533 total letters

the query set is

[landman@pegasus-i5482 ~]$ grep -v ">" < thousand.fsa | wc
   1000    1000  463386

463386 letters. So we have 9.7538e+15 cells, and 2.1627e+13 cells/second on this box.
Of course, this is an 8 processor run.
So per processor core, we are getting 2.7034e+12 cell updates per second.
This bugs me. There is something wrong with this calculation. I expected 1E10 cell updates per second. The CPU is able to execute on the order of 1E9 instructions/second. So this is more than 1000x the instruction issue rate. Hmmmm.....
Well, it's late, so I probably made a mistake in that calc (please let me know if you see it).
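For anyone who wants to recheck the arithmetic, here is a small Python sketch that recomputes the cell counts from the database and query sizes quoted above. The 1E9 instructions/second figure is the rough estimate from the text, not a measured value, and the variable names are only for illustration.

```python
db_letters = 21_048_893_533   # total letters in nt, from the database summary
query_letters = 463_386       # total letters in thousand.fsa, from wc
wall_seconds = 451            # 7m31s elapsed
cores = 8                     # blastall was run with -a 8

# Size of the full query x database matrix.
cells = db_letters * query_letters
rate = cells / wall_seconds
per_core = rate / cores

# Rough instruction issue rate assumed in the text, not a measurement.
instr_per_second = 1e9

print(f"cells:              {cells:.4e}")
print(f"cells/s (box):      {rate:.4e}")
print(f"cells/s (per core): {per_core:.4e}")
print(f"cells/instruction:  {per_core / instr_per_second:.0f}")
```

The multiplication and division reproduce the figures above, so the slip is probably not in the arithmetic itself. One possible reading, offered as an assumption rather than a conclusion: the product counts every cell of the full query x database matrix, which is what an exhaustive dynamic programming alignment would touch, whereas BLAST is a seed-and-extend heuristic that examines only a small fraction of those cells. On that reading these are equivalent cell updates, not actual ones.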
For an accelerator to be meaningful, we really need to see it providing significantly better performance and price performance.
Would the SGI be able to perform this calc in 0.5 seconds (900x faster)? This would be interesting.
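To make that target concrete, here is a short Python sketch of what the 900x claim would imply if it were applied to the 8-core run above. That is an assumption: the baseline SGI actually used is exactly what is not yet known.

```python
baseline_seconds = 451   # 8-core wall clock time measured above
n_queries = 1000         # sequences in the query set
claimed_speedup = 900    # SGI's marketing figure, not verified here
fpga_cores = 64          # FPGA cores quoted alongside the claim

# Assumes the claim is relative to this 8-core run, which is unconfirmed.
target_seconds = baseline_seconds / claimed_speedup
target_ms_per_query = 1000 * target_seconds / n_queries
speedup_per_fpga = claimed_speedup / fpga_cores

print(f"whole run at 900x: {target_seconds:.3f} s")
print(f"per query at 900x: {target_ms_per_query:.3f} ms")
print(f"speedup per FPGA:  {speedup_per_fpga:.1f}x")
```

If the claim were instead relative to a single core, the target would be far less demanding, which is one reason the baseline matters so much when reading a number like this.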
Will see if I can reverse engineer (benchmark forensics?) what benchmark they actually ran to get the "900x" marketing number.

1 thought on “Need to understand the SGI RASC BLAST benchmark”

  1. Also worth noting … I ran it with 4 cores. Here the execution time is 11m 45s. Not linearly scaling. Well, sort of. The system still has one memory bus, and each core can completely fill it. Very likely (without digging further at this time) we are running into memory contention. The memory bus is faster, but if each core can fill it, then you don’t need much traffic to overflow its capacity or cause contention.
    This is in part why Intel is moving to a NUMA model … as a single memory bus just will not scale at some point (unless your code is a simple Monte Carlo type).
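The scaling from 4 to 8 cores can be put into numbers with a short Python sketch. It uses only the two wall clock times quoted in the post and the comment; the variable names are illustrative.

```python
t4 = 11 * 60 + 45   # 4-core run: 11m45s, in seconds
t8 = 7 * 60 + 31    # 8-core run: 7m31s, in seconds

speedup = t4 / t8            # how much faster the 8-core run is
ideal = 8 / 4                # perfect linear scaling would double the speed
efficiency = speedup / ideal

print(f"speedup 4 -> 8 cores: {speedup:.2f}x (ideal {ideal:.0f}x)")
print(f"parallel efficiency:  {efficiency:.0%}")
```

Doubling the cores gives roughly 1.56x rather than 2x, or about 78% efficiency for the added cores. That is consistent with the memory contention guess, though these two data points alone cannot confirm the cause.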
