Just when I thought I understood things …
Reran the original test case from before, this time with GAMESS rebuilt using a modern compiler.
Ran one on the 2.66 GHz Woodcrest and one on the 2.2 GHz Opteron. Both are dual core; I don’t have a 2.4 or 2.6 GHz dual core set of Opterons to put into a machine. Used 4-way parallel on a shared memory machine. Woodcrest has a 2x cache size advantage, a 30% faster memory system, and about a 20% clock speed advantage. The Opteron has 6 GB RAM as compared to 4 GB on the Woodcrest, though that shouldn’t matter so much. The memory layout is likely to be more important.
Woodcrest 2.66 GHz: 1h 44m and change
Opteron 2.2 GHz: 1h 46m and change
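To put those two timings in per-clock terms, here is a rough back-of-the-envelope sketch. It assumes the "and change" seconds can be dropped (so the times are treated as 104 and 106 whole minutes), and the names `runs`, `speedup`, `clock_ratio`, and `cycle_ratio` are just illustrative. It says nothing about cache or memory effects, only wall clock versus clock speed.

```python
# Rough per-clock comparison of the two 4-way runs above.
# Assumption: the "and change" seconds are ignored, so times are whole minutes.
runs = {
    "Woodcrest": {"ghz": 2.66, "minutes": 1 * 60 + 44},
    "Opteron": {"ghz": 2.2, "minutes": 1 * 60 + 46},
}


def giga_cycles(run):
    """Billions of clock cycles one core ticks through during the run."""
    return run["ghz"] * run["minutes"] * 60


wood = runs["Woodcrest"]
opt = runs["Opteron"]

# How much faster Woodcrest finished in wall clock terms.
speedup = opt["minutes"] / wood["minutes"]

# How much faster Woodcrest is clocked.
clock_ratio = wood["ghz"] / opt["ghz"]

# How many more cycles Woodcrest spent on the same job.
cycle_ratio = giga_cycles(wood) / giga_cycles(opt)

print(f"wall clock speedup: {speedup:.2f}x")      # 1.02x
print(f"clock speed ratio:  {clock_ratio:.2f}x")  # 1.21x
print(f"cycles spent ratio: {cycle_ratio:.2f}x")  # 1.19x
```

So a roughly 21% clock advantage bought about a 2% faster run, which means the Woodcrest burned something like 19% more cycles per core on the same job. That is only as good as the rounded timings, but it is a long way from the picture I thought was emerging.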
Ok. I thought a clear picture was emerging. I need to rethink this and start playing with specific code paths. I am getting the sense that the Woodcrest advantage is specific to particular types of operations. This isn’t complete as analyses go. I also wonder if the memory on the Woodcrest is correctly organized.
Way back in the beginning of Opteron benchmarking, we saw lots of Xeon-minded folks putting all the RAM in a single bank and then declaring victory for the Xeon in memory bound cases, when all they were doing was (mostly inadvertently) biasing the results.
Our interest is in how to get the best performance out of the system and, once we figure that out, how fast we can run various codes and how well they will perform.
What is emerging is that
a) Woodcrest doesn’t make nearly everything go faster. This is different from what the Opteron did in 64 bit mode versus 32 bit mode.
b) Woodcrest is not automatically faster than Opteron at the same clock speed. There are some code paths that might not get any boost, and some that might in fact inhibit performance on Woodcrest.
c) There may be some learning we need to do with Woodcrest to understand how to organize/adjust memory for best performance.
d) To get the best performance out of Woodcrest, you are going to need to recompile your code. No guarantees that this will help, but we did see dramatic improvement in a single threaded case. There may have been a 4 MB vs 1 MB cache effect going on there, but still, that is a nice effect. AMD would do well to consider larger caches.
Still haven’t had time to play with the shiny new Intel compilers on this. Will do soon. I promise.