Woodcrest update, day N+1

So we have had a Woodcrest in house for a while now. When we have time we beat on it and run codes on it. My impressions are now well formed, and I understand where it makes sense as a platform, and where the competing technologies make sense. This is not from marketing documents, but from real-world testing.

Woodcrest is basically an AMD64 platform without the IOMMU. The processor architecture includes a much improved SSE engine, a larger shared cache, and theoretically, a larger memory bandwidth than its competitor.
Note: we have not (yet) tested the AMD 22xx series, though we will be in short order. This is the directly competitive platform. Our tests of a 5150 are relative to an Opteron 275, which is more than 1.5 years old as a platform. Not apples to apples.
Relative to the Opteron 275, the Woodcrest 5160 is slower in about 30% of the cases and codes, about even in another 30%, and faster in the remaining 40%. It's that last batch that is interesting.
Of that 40%, about 3/4 appears to be a cache size issue, as increasing processor speed has minimal impact upon performance. About 1/4 appears to be due to the better internal SSE engine. In that 1/4, the performance gain is modest. It's the 3/4 that is quite interesting. When you get lots of stuff fitting into cache, you can achieve much higher overall apparent performance. In fact, the trick of superlinear speedup is tied directly to getting your working set from operating mostly in RAM at 1 CPU to operating mostly in cache at N CPUs.
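To make the superlinear speedup point concrete, here is a rough back-of-the-envelope sketch. The sizes are made up for illustration (a 24 MiB working set and 4 MiB of cache per CPU are assumptions, not our measurements), and it assumes the problem splits evenly across CPUs with each CPU having its own cache.

```python
# Sketch: at what CPU count does the per-CPU working set drop into cache?
# All sizes below are hypothetical, chosen only to illustrate the arithmetic.

MiB = 1024 * 1024


def per_cpu_working_set(total_bytes, n_cpus):
    """Bytes each CPU touches, assuming the problem divides evenly."""
    return total_bytes / n_cpus


def fits_in_cache(total_bytes, n_cpus, cache_bytes_per_cpu):
    """True when each CPU's share of the data fits in its own cache."""
    return per_cpu_working_set(total_bytes, n_cpus) <= cache_bytes_per_cpu


total = 24 * MiB  # hypothetical total working set
cache = 4 * MiB   # hypothetical cache available to each CPU

for n in (1, 2, 4, 8, 16):
    share = per_cpu_working_set(total, n) / MiB
    where = "cache" if fits_in_cache(total, n, cache) else "RAM"
    print(f"{n:2d} CPUs: {share:5.1f} MiB per CPU, mostly in {where}")

# Smallest CPU count in the sweep whose per-CPU share fits in cache.
first_fit = next(n for n in (1, 2, 4, 8, 16) if fits_in_cache(total, n, cache))
```

Below that crossover each CPU is still streaming from RAM. At and above it, each CPU runs mostly out of cache, which is where a better-than-N speedup can come from.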
More cache helps most programs that have regular memory access patterns that map well into a cache architecture. An FFT butterfly is “regular” but very cache unfriendly.
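A quick sketch of why a butterfly is regular yet hard on a cache: in an iterative radix-2 FFT, the distance between the two elements of each butterfly doubles at every stage. The 16-byte element (a double-precision complex) and the 64-byte cache line are assumptions for illustration.

```python
# Sketch: partner distances in an iterative radix-2 FFT.
# ELEMENT_BYTES and LINE_BYTES are assumed values, not measured ones.

ELEMENT_BYTES = 16  # one double-precision complex number
LINE_BYTES = 64     # assumed cache line size


def butterfly_strides(n):
    """Distance (in elements) between butterfly partners at each stage."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    strides = []
    d = 1
    while d < n:
        strides.append(d)
        d *= 2
    return strides


n = 1 << 20
strides = butterfly_strides(n)

# Stages whose two operands can never share a single cache line.
beyond_line = [s for s in strides if s * ELEMENT_BYTES >= LINE_BYTES]

print(f"{len(strides)} stages, {len(beyond_line)} with partners on different lines")
print(f"largest stride: {strides[-1] * ELEMENT_BYTES // 1024} KiB")
```

The pattern is perfectly predictable, but most stages pair up elements that sit far apart at power-of-two distances, which is about the worst case for getting reuse out of a cache line.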
Our testing shows that Woodcrest makes a good Opteron, and in a few specific code cases, a better Opteron than Opteron, but there are some “gotchas”.
Memory bandwidth bound codes operating out of RAM appear not to fare as well on Woodcrest as they do on Opteron. That is, the much vaunted 21 GB/s memory bandwidth, as reported widely by misty-eyed “independent” press, isn’t actually realized. The best we could do is about 7.2 GB/s, roughly a third of the quoted figure. This shows up in memory bandwidth limited codes. Faster cores and faster SSE engines don’t matter one bit if you can’t get the data to the engines faster. Intel really needs to look more closely at the Opteron NUMA design; it is very good for memory bandwidth bound code.
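For anyone who wants a rough feel for this kind of number on their own box, here is a minimal copy-bandwidth sketch. It is not the benchmark we ran; it is a single-threaded bulk copy in the spirit of a STREAM-style copy test, and the function name and buffer size are assumptions. Expect it to understate what a tuned, multi-threaded test would report.

```python
# Sketch: crude single-threaded memory copy bandwidth estimate.
# Not the benchmark behind the figures in the post.
import time


def copy_bandwidth_gbs(n_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N bulk copy rate in GB/s, counting bytes read plus written."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    dst_view = memoryview(dst)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst_view[:] = src  # bulk copy; buffers are much larger than cache
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best / 1e9


# Fraction of the quoted bandwidth we actually saw (figures from the post).
realized_fraction = 7.2 / 21.0

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gbs():.2f} GB/s")
    print(f"realized fraction of quoted 21 GB/s: {realized_fraction:.0%}")
```

The buffers need to be well beyond cache size, otherwise you are measuring cache and not RAM.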
Obviously we need to test the Opteron 2218 as well (in our lab, awaiting a replacement motherboard, long painful story). Will do this soon.
Regardless, Intel has an impressive chip here. We are working on some very nice code refactoring that will try to exploit the heck out of that SSE pipeline. If you see me at SC, feel free to ask about it if you are interested in informatics.

2 thoughts on “Woodcrest update, day N+1”

  1. Ah, memory hierarchy exploitation. It has always been my belief that exploiting this should be the first optimization software developers make. The next I would vote for on today’s architectures is making the code “multicore friendly,” whether through OpenMP, Intel’s TBB, or even straight pthreads. The third wave would be vectorizing.
    Regarding memory bandwidth, this is a bottleneck in exploiting both the memory hierarchy and multicore chips. I agree that AMD’s NUMA is a much better approach, especially as more cores are added in the future. I keep hearing about 32-core machines, but I find this implausible with a bus architecture; the direct connect architecture appears to be the only way. HPCwire has been covering a few supercomputer startups that are adding tons of HyperTransport links to their machines to address this issue.
    And as a side note regarding vectorizing: perhaps built-in vector units like SSE and AltiVec will be of lesser importance in the technical computing market given the ready existence of co-processors, such as ClearSpeed, FPGAs, the Cell, and even GPUs.

  2. I agree about HT. I would like to see Intel adopt it as well. HT3 looks very interesting, and the upside is huge for people using it.
    There seems to be a tendency in machine designs to reach or surpass the point where they become bandwidth limited. Shared resources are points of contention, and hence they rate-limit the system. Yet the natural progression of design seems to be toward very high bandwidth shared resources.
