So we have something like 5 different CUDA-capable machines in the company: my laptop, an older Opteron 275 machine with a Quadro FX 1400, a GeForce 8800 based machine, a dual GTX 260 based machine, and a Tesla machine with 3 GPUs. The latter is to be our new desk(side|top) personal supercomputer offering, the Pegasus-(I|A)(3|4)G. It is a complex build … we are dealing with case/power supply issues now, but the rest of it works fine. The current unit is powered by a pair of Shanghais which we are testing with. I am quite impressed with it … the 2.3 GHz Shanghais are showing themselves to be competitive with the high end of the Intel line. Next month's (hopefully) Nehalem launch should change this a bit.
The nVidia folks asked me to work on something, to prove a concept. Looks like I am about 50% of the way there after about an hour of work in CUDA. I am tired, so I am gonna call it a night, but I expect to be able to get baseline testing done tomorrow.
My laptop has a more modern GPU than the Quadro FX 1400. It is a Quadro FX 350M. deviceQuery tells me it has 2 multiprocessors and 16 cores. The older machine reports 1 multiprocessor and 128 cores.
So I did some basic porting of a routine over to CUDA. Initial efforts went pretty fast … this was the gfortran stuff I was doing. Though it took me time to remember how to do some of the fortranny things from my days past (when I used to bang out Fortran code with ease). Multi-language programming for one … I had forgotten all the header bits I had to set up in order to enable the libraries to talk to the callers.
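For anyone else who has forgotten the multi-language bits, here is a minimal sketch of the C side of a Fortran-to-C call. It assumes gfortran's default conventions: external names are lowercased with one trailing underscore, every argument is passed by reference, and arrays are column-major. The routine `scale_vector_` and the helper `fortran_index` are made up for illustration; they are not the routine I was porting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine, callable from gfortran as
 *     call scale_vector(n, alpha, x)
 * By default gfortran lowercases the external name and appends one
 * underscore, and it passes every argument by reference, so scalars
 * arrive here as pointers. */
void scale_vector_(const int *n, const float *alpha, float *x)
{
    assert(n != NULL && alpha != NULL && x != NULL);
    for (int i = 0; i < *n; ++i) {
        x[i] *= *alpha;
    }
}

/* Hypothetical helper: offset of the Fortran element A(i,j) (1-based)
 * in the flat C array, for a column-major matrix whose leading
 * dimension is lda. */
size_t fortran_index(int i, int j, int lda)
{
    assert(i >= 1 && j >= 1 && lda >= i);
    return (size_t)(j - 1) * (size_t)lda + (size_t)(i - 1);
}
```

On the Fortran side nothing special is needed for this style of call beyond linking the object file in; the trailing underscore and the pointers are the parts that are easy to forget.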
But what I found amusing was that the old Quadro FX 1400 device was about an order of magnitude faster on this code than my laptop. Its CPU is slower than the laptop CPU, but its GPU appears to be faster. I do have a Quadro FX 3000 for this unit if I want. Might try that as well. FWIW, I have always liked the Quadro FX units.
Will try it on the Tesla, but I need to understand why it is slow on the laptop.
What is interesting is that on another test I ran into the 5 second “bug” for CUDA. Basically, if you have a routine running longer than about 5 seconds, the screen flashes and your routine gets killed. Neat … huh 🙁
It's not a bug, as CUDA is sharing the thing that draws the bits on the screen. Which means if I hog the GPU, say with a double precision LU matrix decomposition, at some point the display is going to need to re-assert its primary role.
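One common way to live with this on a display GPU, for work that can be cut into independent pieces, is to issue many short launches instead of one long one. A minimal sketch of the bookkeeping follows. The names `chunk_count` and `chunk_size` are hypothetical, and the right value for `max_per_launch` is an assumption you would have to find by timing on your own hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Number of launches needed to cover `total` work items when no single
 * launch may process more than `max_per_launch` items. */
size_t chunk_count(size_t total, size_t max_per_launch)
{
    assert(max_per_launch > 0);
    return total / max_per_launch + (total % max_per_launch != 0);
}

/* Number of items handled by launch number `index` (0-based).
 * Returns 0 when `index` is past the last chunk. */
size_t chunk_size(size_t total, size_t max_per_launch, size_t index)
{
    assert(max_per_launch > 0);
    size_t offset = index * max_per_launch;
    if (offset >= total) {
        return 0;
    }
    size_t remaining = total - offset;
    return remaining < max_per_launch ? remaining : max_per_launch;
}
```

Each chunk would then become one kernel launch that finishes well inside the time limit, which gives the display a chance to get its turn between launches. Something like LU has dependencies between steps, so the splitting there is less trivial than this.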
That said, the sgemm results for it, using the Fortran calls to the sgemm libs, put it at about 15 GFlop/s.
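For reference, the usual way to turn an sgemm timing into a rate is to count 2·m·n·k floating point operations (one multiply and one add per inner-product term) and divide by the wall clock time. A minimal sketch, with `gemm_gflops` as a made-up helper name and the timing assumed to come from whatever timer you trust:

```c
#include <assert.h>

/* Rate in GFlop/s for an (m x k) times (k x n) matrix multiply that
 * took `seconds` of wall clock time, using the conventional 2*m*n*k
 * operation count. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    assert(m > 0 && n > 0 && k > 0 && seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1.0e9;
}
```

As a sanity check, a 1000 x 1000 x 1000 multiply that takes one second works out to 2 GFlop/s.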
In 1993, I was happy when I could steal time on a Cray and get 1 GFlop. Now I get it on the display device on my laptop.
Maybe I will port cheevx next. Would love to be extracting eigenvalues/eigenvectors at a furious rate 🙂
There are times I miss the research life. This is one of them. I don’t get to play with fun toys as much anymore. I have others do it for me. Maybe later.