Woodcrest update, day N+1

So we have had a Woodcrest in house for a while now. When we have time, we beat on it and run codes on it. My impressions are now well formed: I understand where it makes sense as a platform, and where the competitive technologies make sense. This comes not from marketing documents, but from real-world testing.

Woodcrest is basically an AMD64 platform without the IOMMU. The processor architecture includes a much-improved SSE engine, a larger shared cache, and, theoretically, higher memory bandwidth than its competitor.

Note: we have not (yet) tested the AMD 22xx series, though we will in short order. That is the directly competitive platform. Our tests of a 5150 are relative to an Opteron 275, which is more than 1.5 years old as a platform. Not an apples-to-apples comparison.

Relative to the Opteron 275, the Woodcrest 5160 is slower in about 30% of the cases and codes, about even in another 30%, and faster in the remaining 40%. It's that last batch that is interesting.

Of that 40%, about 3/4 appear to be a cache-size effect, as increasing processor speed has minimal impact on performance. About 1/4 appear to be due to the better internal SSE engine. In that 1/4 of cases the performance gain is modest; it's the 3/4 that is quite interesting. When lots of your data fits into cache, you can achieve much higher overall apparent performance. In fact, the trick of superlinear speedup is tied directly to getting your working set from operating mostly in RAM at 1 CPU to operating mostly in cache at N CPUs.
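
Here is a minimal sketch of that working-set effect, with illustrative (not measured) sizes: the same total number of memory touches runs faster once the array being traversed fits in cache, which is exactly what happens when N CPUs each hold 1/N of the working set.

```c
/* Minimal sketch of the working-set effect behind superlinear speedup.
 * The 8 MB / 4 MB sizes are illustrative assumptions, not figures from
 * our tests. Compile with: gcc -O2 ws.c -o ws
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double touch(double *a, size_t n, int reps)
{
    double s = 0.0;
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            s += a[i];                 /* streaming read of the working set */
    return s;
}

int main(void)
{
    size_t full = 8 * 1024 * 1024 / sizeof(double); /* ~8 MB: spills a 4 MB cache */
    size_t half = full / 2;                         /* ~4 MB: one CPU's share of the split */
    double *a = malloc(full * sizeof(double));
    for (size_t i = 0; i < full; i++) a[i] = 1.0;

    clock_t t0 = clock();
    double s1 = touch(a, full, 100);   /* working set mostly in RAM */
    clock_t t1 = clock();
    double s2 = touch(a, half, 200);   /* same total touches, half the working set */
    clock_t t2 = clock();

    printf("full working set: %.2fs  half working set: %.2fs  (%g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}
```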

More cache helps most programs that have regular memory access patterns that map well onto a cache architecture. An FFT butterfly is “regular” but very cache-unfriendly.
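
To see why, here is a small sketch of the radix-2 butterfly access pattern (indices only, no twiddle-factor math, and a made-up transform size): the stride doubles every stage, so the later stages pair up elements far apart in memory and defeat cache-line reuse.

```c
/* Sketch of the radix-2 FFT butterfly index pattern, printing which
 * element pairs each stage touches. Illustrative only; n is tiny here.
 */
#include <stdio.h>

int main(void)
{
    int n = 16;                              /* illustrative transform size */
    for (int stride = 1; stride < n; stride *= 2) {
        printf("stage stride %2d:", stride);
        for (int base = 0; base < n; base += 2 * stride)
            for (int j = base; j < base + stride; j++)
                printf(" (%d,%d)", j, j + stride);  /* one butterfly pair */
        printf("\n");
    }
    return 0;
}
```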

Our testing shows that Woodcrest makes a good Opteron, and in a few specific code cases a better Opteron than Opteron, but… there are some “gotchas”.

Memory-bandwidth-bound codes operating out of RAM do not appear to fare as well on Woodcrest as they do on Opteron. That is, the much-vaunted 21 GB/s memory bandwidth, as reported widely by the misty-eyed “independent” press, isn’t actually realized. The best we could do is about 7.2 GB/s. This shows up in memory-bandwidth-limited codes: faster cores and faster SSE engines don’t matter one bit if you can’t get the data to the engines faster. Intel really needs to look more closely at the Opteron NUMA design; it is very good for memory-bandwidth-bound code.
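
For the curious, figures like that 7.2 GB/s come from STREAM-style measurements. A minimal triad-style sketch follows; the array size and repetition count are assumptions for illustration, and McCalpin’s actual STREAM benchmark is the proper tool.

```c
/* A minimal STREAM-triad-style bandwidth probe (single threaded).
 * Compile with: gcc -O2 triad.c -o triad
 */
#include <stdio.h>
#include <time.h>

#define N (8 * 1024 * 1024)   /* 64 MB per array: big enough to defeat cache */

int main(void)
{
    static double a[N], b[N], c[N];
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    int reps = 20;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];       /* triad: 2 reads + 1 write */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* three arrays of 8-byte doubles move across the bus each pass */
    double gbytes = (double)reps * 3.0 * N * sizeof(double) / 1e9;
    printf("approx bandwidth: %.2f GB/s\n", gbytes / secs);
    return 0;
}
```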

Obviously we need to test the Opteron 2218 as well (it is in our lab, awaiting a replacement motherboard; long painful story). We will do this soon.

Regardless, Intel has an impressive chip here. We are working on some very nice code refactoring that will try to exploit the heck out of that SSE pipeline. If you see me at SC, feel free to ask about it if you are interested in informatics.
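
That refactoring isn’t public, but for a flavor of what exploiting the SSE pipeline means, here is a minimal SSE2 intrinsics sketch (not our code, just the general technique): packed loads and multiplies process two doubles per instruction.

```c
/* Minimal SSE2 packed double-precision sketch.
 * Compile with: gcc -O2 -msse2 sse.c -o sse
 */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {10.0, 20.0, 30.0, 40.0};
    double c[4];

    for (int i = 0; i < 4; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);          /* load two doubles at once */
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_mul_pd(va, vb));  /* two multiplies per instruction */
    }
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```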


2 thoughts on “Woodcrest update, day N+1”

  1. Ah, memory hierarchy exploitation. It has always been my belief that exploiting this should be the first optimization software developers undertake. The next I would vote for on today’s architectures is making the code “multicore friendly,” whether through OpenMP, Intel’s TBB, or even straight pthreads (see the sketch after this comment). The third wave would be vectorizing.

    Regarding memory bandwidth, this is a bottleneck in exploiting both the memory hierarchy and multicore chips. I agree that AMD’s NUMA is a much better approach, especially as more cores are added in the future. I keep hearing about 32-core machines, but I find this implausible with a bus architecture; the direct-connect architecture appears to be the only way. HPCwire has been covering a few supercomputer startups that are adding tons of HyperTransport links in their machines to address this issue.

    And as a side note regarding vectorizing: perhaps built-in vector units like SSE and AltiVec will be of lesser importance in the technical computing market, given the ready availability of co-processors such as ClearSpeed, FPGAs, the Cell, and even GPUs.
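
    To make the “multicore friendly” step concrete, a minimal OpenMP sketch (TBB or raw pthreads would express the same split):

    ```c
    /* Minimal OpenMP sketch of splitting work across cores; each core
     * gets a contiguous chunk, so each per-core working set shrinks.
     * Compile with: gcc -O2 -fopenmp omp.c -o omp
     */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = (double)i;

        #pragma omp parallel for     /* divide the loop across threads */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[42] = %g, threads = %d\n", a[42], omp_get_max_threads());
        return 0;
    }
    ```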

  2. I agree about HT. I would like to see Intel adopt it as well. HT3 looks very interesting, and the upside is huge for people using it.

    It seems there is a trend, or tendency, in machine designs to reach or surpass a bandwidth-limited point; I have noticed this repeatedly. Shared resources are points of contention, and hence rate-limiting in systems. Yet the natural progression of design seems to be toward very high-bandwidth shared things.
