Playing with AVX

I finally took some time out of a busy schedule to play with AVX. I took my trusty old rzf code (Riemann zeta function) and rewrote the time-expensive inner loop in AVX primitives hooked into my C code.

As a reminder, this code is a very simple sum reduction, and can be trivially parallelized. Vectorization isn’t as straightforward, and I found that compiler auto-vectorization doesn’t handle it well. So you have to vectorize by hand, and think through the algorithm more carefully.

The baseline code inner loop looks like this


for(k=inf-1;k>=1;k--)
{
sum += 1.0/pow(k,(double)n);
}

Remarkable how trivial this is. You can unroll it and play with a number of things, but you don’t start seeing serious performance until you vectorize it. Between standard -O3 gcc optimization and the by-hand vectorization, we get a factor of 10 better performance.
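To see what the hand vectorization actually has to do, it helps to write out the scalar restructuring it follows. Roughly (this is an illustrative sketch, not the actual rzf source), the loop is strip-mined into VLEN independent partial sums, and the pow() call is replaced by repeated multiplication since n is a small integer exponent:

/* illustrative sketch: strip-mined scalar form of the reduction;
   n, start_index, end_index, VLEN and sum are as in the real code */
double p_sum[VLEN];
int j, m;

for (m = 0; m < VLEN; m++)
    p_sum[m] = 0.0;

for (k = start_index; k > end_index; k -= VLEN)
{
    for (m = 0; m < VLEN; m++)   /* these VLEN lanes become one SIMD operation */
    {
        double d = (double)(k - m);
        double d_pow = d;

        for (j = n; j > 1; j--)
            d_pow *= d;          /* d^n by repeated multiplication, no pow() */

        p_sum[m] += 1.0 / d_pow;
    }
}

for (m = 0; m < VLEN; m++)       /* fold the partial sums back into the scalar sum */
    sum += p_sum[m];

The SSE2 and AVX versions below map that inner m loop onto a single vector register.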

So I thought, hmmm … why not transform the SSE2 into AVX and see how it runs?

The SSE2 inner loop …


__m128d __P_SUM = _mm_set_pd1(0.0); // __P_SUM[0 ... VLEN] = 0
__m128d __ONE = _mm_set_pd1(1.); // __ONE[0 ... VLEN] = 1
__m128d __DEC = _mm_set_pd1((double)VLEN);
__m128d __L = _mm_load_pd(l);

/* parallel sum loop */
for(k=start_index;k>end_index;k-=unroll)
{
__D_POW = __L;

for (m=n;m>1;m--)
{
__D_POW = _mm_mul_pd(__D_POW, __L);
}

__P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

__L = _mm_sub_pd(__L, __DEC);

}

_mm_store_pd(p_sum,__P_SUM);

/* sum reduction loop over the vector partial sums */
for(k=0;k<VLEN;k++)
{
sum += p_sum[k];
}

There is a wind-down loop at the end of the reduction to handle any leftover terms when the count is not an integer multiple of the unroll amount (2 for SSE2, 4 for AVX).
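That wind-down loop is not in the excerpt above; a minimal sketch of what it looks like, assuming end_index is where the vector loop stopped, would be:

/* wind down: pick up the leftover terms below end_index with plain scalar code */
for (k = end_index; k >= 1; k--)
{
    sum += one / pow((double)k, (double)n);
}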

Of course, AVX is the “new shiny hotness”. And we got 10x by vectorizing by hand. Might we get more with AVX?

My first pass is this


sum = 0.0;

/* pre-vector loop */
for(k=inf-1;k>start_index;k--)
{
sum += one/pow((double)k,(double)n);
}

// __P_SUM[0 ... VLEN] = 0
register __m256d __P_SUM asm ("ymm2") = _mm256_broadcast_sd(__zero);

// __ONE[0 ... VLEN] = 1
register __m256d __ONE asm ("ymm3") = _mm256_broadcast_sd(__one);

// __DEC[0 ... VLEN] = VLEN
register __m256d __DEC asm ("ymm4") = _mm256_broadcast_sd(__inc);
register __m256d __L asm ("ymm5") = _mm256_load_pd(l);
register __m256d __INV asm ("ymm6") ;

for(k=start_index;k>end_index;k-=2*unroll)
{
__D_POW = __L;

for (m=n;m>1;m--)
{
__D_POW = _mm256_mul_pd(__D_POW, __L);
}
__INV = _mm256_div_pd(__ONE, __D_POW);
__P_SUM = _mm256_add_pd(__P_SUM, __INV);

__L = _mm256_sub_pd(__L, __DEC);

__D_POW = __L;

for (m=n;m>1;m--)
{
__D_POW = _mm256_mul_pd(__D_POW, __L);
}
__INV = _mm256_div_pd(__ONE, __D_POW);
__P_SUM = _mm256_add_pd(__P_SUM, __INV);

__L = _mm256_sub_pd(__L, __DEC);

}

_mm256_store_pd(p_sum,__P_SUM);

/* sum reduction loop over the vector partial sums */
for(k=0;k<VLEN;k++)
{
sum += p_sum[k];
}

Performance shows a very minor regression relative to SSE2. Same answers, though. And yes, I did unroll the core loop.

I think with some hand tweaking I could probably get more. I will keep looking. But enjoy this in the meantime.
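One place I would look first (my speculation, not something measured here): both unrolled halves feed the same __P_SUM, and the second half’s power chain has to wait for __L to be updated by the first, so the long-latency divides and multiplies sit on a single dependency chain. A sketch of how the loop could be split into two independent chains, using the same variables as above plus a second lane vector and accumulator:

/* speculative tweak: two independent dependency chains, combined after the loop */
register __m256d __P_SUM0 asm ("ymm2") = _mm256_broadcast_sd(__zero);
register __m256d __P_SUM1 asm ("ymm7") = _mm256_broadcast_sd(__zero);
__m256d __DEC2   = _mm256_add_pd(__DEC, __DEC);      /* step both chains by 2*VLEN */
__m256d __L0     = _mm256_load_pd(l);                /* first VLEN lane values     */
__m256d __L1     = _mm256_sub_pd(__L0, __DEC);       /* next VLEN lane values      */
__m256d __D_POW0, __D_POW1;

for (k = start_index; k > end_index; k -= 2*unroll)
{
    __D_POW0 = __L0;
    __D_POW1 = __L1;

    for (m = n; m > 1; m--)
    {
        __D_POW0 = _mm256_mul_pd(__D_POW0, __L0);    /* the two power chains can */
        __D_POW1 = _mm256_mul_pd(__D_POW1, __L1);    /* overlap in the pipeline  */
    }

    __P_SUM0 = _mm256_add_pd(__P_SUM0, _mm256_div_pd(__ONE, __D_POW0));
    __P_SUM1 = _mm256_add_pd(__P_SUM1, _mm256_div_pd(__ONE, __D_POW1));

    __L0 = _mm256_sub_pd(__L0, __DEC2);
    __L1 = _mm256_sub_pd(__L1, __DEC2);
}

__P_SUM0 = _mm256_add_pd(__P_SUM0, __P_SUM1);        /* merge the two accumulators */
_mm256_store_pd(p_sum, __P_SUM0);

Whether that actually buys anything over what the compiler already schedules is exactly the kind of thing I would have to measure.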
