Simple problem. You have two “vectors” of 32 bit signed integers, and you want to do the vector equivalent of
vector_a = max_element_by_element(vector_a, vector_b);
Can you do this with SSE2?
As far as I can tell at the moment, you cannot. The MMX support in gcc gives you pmax/pmin instructions, with latencies of 2 cycles, and operating on 64 bits (__m64). The SSE instructions don’t extend this, and SSE2 give you MAX/MIN with 8 and 16 bit signed and unsigned bytes, but not 32.
We need this. Badly.
You can always argue that you can use code like this for MAX (or MIN):
MAX(a,b) = (a+b + abs(b-a)) >> 1;
MIN(a,b) = (a+b – abs(b-a)) >> 1;
Of course this is correct, it can be done. But now you have 1-2 additions, 1-2 subtractions, a sign removal and a shift rather than a simple comparison and conditional move. This makes these functions quite a bit slower, even if you could get them into SSE. The point of SSE being that you gain from the parallelism by reducing the amount of instructions you need to issue in order to achieve these results.
So if you can go 4x faster, but you need to do more than 4x more operations to get that faster code, you will run slower. This is what we observed for the code we are beating on.
Simple is almost always better. It would be great if the compilers were smart enough, and the instruction set capable enough, so that we could write
if (vector_a > vector_b) vector_a = vector_b;
and have that correctly SSE-ized (microparallelized). Unfortunately, you would need to PMAX’s to get there, and then you have to deal with getting the data in and out of the SSE registers to normal registers.
Ugh. Hopefully next round of chips will extend the ISA in a useful/meaningful manner. Lets get an *MAX|*MIN which operates on vectors. What would be really cool would be to be able to operate on multiple SSE registers at once, so I can load in 4, 8 , or 16 by 32 or 64 bit integers, and really get some performance…