Interesting NUMA issues in current SuSE kernel

One of our development systems is a dual socket system with 2 dual core Opteron 275 chips, 4 GB RAM, a nice disk config, and a Quadro FX 1400. This is a good machine to work on.
I had set it up with SuSE 9.x, and had left it at 9.3 for quite a while. Recently we upgraded it to SuSE 10.0 Pro. More modern kernel, somewhat updated apps. I thought it would be nice to stay somewhat current.
We use this for building and testing our accelerated HMMer code. This code is 1.6-2.5x faster than the binaries from the WUSTL site. The source code has changed, and we are finalizing a new release of it.
I ran our old binary, and we consistently hit 41 seconds on the benchmark test. Our new version started out giving us 85 seconds as its run time. This didn’t make sense. OK, return to our original set of changes. 85 seconds. Except every now and then, we get one at about 40 seconds.
Hmmm…. Remember, Opteron is a NUMA architecture. Let’s see what happens when we pin the CPU and memory to a particular node.
Run the same binary that gives us 85 seconds this way with numactl. 39 seconds. Rerun it this way 10, 20, 30 times. Consistently 39 seconds. Take off the CPU pinning switch, but leave on the memory pinning switch. Varies between 39 and 41 seconds. Take off the memory pinning switch. Back to the 85 seconds with an occasional 39 second run. Force the memory onto one node and the CPU onto another. 41 seconds, consistently. Force the memory to the 0 node. 85 seconds, consistently.
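For anyone who wants to repeat this kind of experiment, here is a rough sketch of how the pinning combinations could be scripted. It is not the exact procedure used above: the benchmark name `./hmmer_bench` is a placeholder, the node numbers are assumptions for a two-node box, and the switches shown are numactl's `--cpunodebind` and `--membind` (older numactl releases spell the first one `--cpubind`).

```python
import subprocess
import time


def numactl_cmd(command, cpu_node=None, mem_node=None):
    """Prefix `command` with numactl switches for the requested pinning.

    cpu_node / mem_node are NUMA node numbers; None leaves that resource
    unpinned. With neither set, the command is returned unchanged.
    """
    switches = []
    if cpu_node is not None:
        switches.append(f"--cpunodebind={cpu_node}")  # run only on this node's CPUs
    if mem_node is not None:
        switches.append(f"--membind={mem_node}")  # allocate only from this node
    if not switches:
        return list(command)
    return ["numactl", *switches, *command]


def time_runs(cmd, repeats=10):
    """Run `cmd` `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical benchmark invocation; substitute the real binary and arguments.
BENCH = ["./hmmer_bench"]

# Pinning combinations to compare, as (label, cpu_node, mem_node).
# Node numbers are assumed for a two-node system.
CONFIGS = [
    ("cpu and memory on node 0", 0, 0),
    ("memory pinned only", None, 0),
    ("no pinning", None, None),
    ("cpu on node 0, memory on node 1", 0, 1),
]

# Only print the command lines here; nothing is executed yet.
for label, cpu, mem in CONFIGS:
    print(label, "->", " ".join(numactl_cmd(BENCH, cpu, mem)))
```

Calling something like `time_runs(numactl_cmd(BENCH, 0, 0), repeats=30)` would then give the per-run times for one configuration, which makes the "consistent" versus "occasional fast run" pattern easy to see.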
I don’t think this is a hardware problem. It is more likely a kernel bug. I noticed something similarly strange when running the full suite of HMMer benchmarks. Great scalability from 1-3 threads, and the 4th thread killed performance. This is the same binary that ran great on this same hardware with the older OS.
I might just load the older OS in a different set of partitions and see if we still see this. SuSE 10.1 is coming out soon, so maybe it is worth looking at that.