Playing with AVX

I finally took some time from a busy schedule to play with AVX. I took my trusty old rzf code (Riemann Zeta function) and rewrote the time expensive inner loop in AVX primatives hooked to my C code.
As a reminder, this code is a very simple sum reduction, and can be trivially parallelized (sum reduction). Vectorization isn’t as straightforward, and I found that compiler auto-vectorization doesn’t work well for it. So you have to hand vectorize, and think through the algorithm more carefully.
The baseline code inner loop looks like this

for(k=inf-1;k>=1;k--)
{
sum += 1.0/pow(k,(double)n);
}

Remarkable how trivial this is. You can unroll it, play with a number of things, but you don’t start seeing serious performance until you vectorize it. Between standard -O3 level gcc optimization and the by-hand vectorization, we get a factor of 10 better performance.
So I thought, hmmm … why not transform the sse2 into avx and see how it runs?
The sse2 inner loop …

__m128d __P_SUM = _mm_set_pd1(0.0); // __P_SUM[0 ... VLEN] = 0
__m128d __ONE = _mm_set_pd1(1.); // __ONE[0 ... VLEN] = 1
__m128d __DEC = _mm_set_pd1((double)VLEN);
__m128d __L = _mm_load_pd(l);
/* parallel sum loop */
for(k=start_index;k>end_index;k-=unroll)
{
__D_POW = __L;
for (m=n;m>1;m--)
{
__D_POW = _mm_mul_pd(__D_POW, __L);
}
__P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
__L = _mm_sub_pd(__L, __DEC);
}
_mm_store_pd(p_sum,__P_SUM);
/* sum reduction loop */
for(k=0;k=1;k--)
{
sum += one/pow((double)k,(double)n);
}

There is a wind down loop at the end of the reduction to handle any final number of terms that are not integer multiples of the unroll amount (2 for sse2, 4 for avx).
Of course, AVX is the “new shiny hotness”. And we got 10x by vectorizing by hand. Might we get more with AVX?
My first pass is this

sum = 0.0;
/* pre-vector loop */
for(k=inf-1;k>start_index;k--)
{
sum += one/pow((double)k,(double)n);
}
// __P_SUM[0 ... VLEN] = 0 */
register __m256d __P_SUM asm ("ymm2") = _mm256_broadcast_sd(__zero);
// __ONE[0 ... VLEN] = 1
register __m256d __ONE asm ("ymm3") = _mm256_broadcast_sd(__one);
// __DEC[0 ... VLEN] = VLEN
register __m256d __DEC asm ("ymm4") = _mm256_broadcast_sd(__inc);
register __m256d __L asm ("ymm5") = _mm256_load_pd(l);
register __m256d __INV asm ("ymm6") ;
for(k=start_index;k>end_index;k-=2*unroll)
{
__D_POW = __L;
for (m=n;m>1;m--)
{
__D_POW = _mm256_mul_pd(__D_POW, __L);
}
__INV = _mm256_div_pd(__ONE, __D_POW);
__P_SUM = _mm256_add_pd(__P_SUM, __INV);
__L = _mm256_sub_pd(__L, __DEC);
__D_POW = __L;
for (m=n;m>1;m--)
{
__D_POW = _mm256_mul_pd(__D_POW, __L);
}
__INV = _mm256_div_pd(__ONE, __D_POW);
__P_SUM = _mm256_add_pd(__P_SUM, __INV);
__L = _mm256_sub_pd(__L, __DEC);
}
_mm256_store_pd(p_sum,__P_SUM);
/* sum reduction loop */
for(k=0;k=1;k--)
{
sum += one/pow((double)k,(double)n);
}

Performance gives a very minor regression relative to sse2. Same answers. And yes, I did unroll the core loop.
I think with some hand tweaking I could probably get more. Will look. But enjoy in the meanwhile.

1 thought on “Playing with AVX

Comments are closed.