I **finally** took some time from a busy schedule to play with AVX. I took my trusty old rzf code (Riemann Zeta function) and rewrote the time expensive inner loop in AVX primatives hooked to my C code.

As a reminder, this code is a very simple sum reduction, and can be trivially parallelized (sum reduction). Vectorization isn’t as straightforward, and I found that compiler auto-vectorization doesn’t work well for it. So you have to hand vectorize, and think through the algorithm more carefully.

The baseline code inner loop looks like this

for(k=inf-1;k>=1;k--)

{

sum += 1.0/pow(k,(double)n);

}

Remarkable how trivial this is. You can unroll it, play with a number of things, but you don’t start seeing serious performance until you vectorize it. Between standard -O3 level gcc optimization and the by-hand vectorization, we get a factor of 10 better performance.

So I thought, hmmm … why not transform the sse2 into avx and see how it runs?

The sse2 inner loop …

__m128d __P_SUM = _mm_set_pd1(0.0); // __P_SUM[0 ... VLEN] = 0

__m128d __ONE = _mm_set_pd1(1.); // __ONE[0 ... VLEN] = 1

__m128d __DEC = _mm_set_pd1((double)VLEN);

__m128d __L = _mm_load_pd(l);

```
``` /* parallel sum loop */

for(k=start_index;k>end_index;k-=unroll)

{

__D_POW = __L;

for (m=n;m>1;m--)

{

__D_POW = _mm_mul_pd(__D_POW, __L);

}

__P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

__L = _mm_sub_pd(__L, __DEC);

}

_mm_store_pd(p_sum,__P_SUM);

` /* sum reduction loop */`

for(k=0;k

{

sum += one/pow((double)k,(double)n);

}

There is a wind down loop at the end of the reduction to handle any final number of terms that are not integer multiples of the unroll amount (2 for sse2, 4 for avx).

Of course, AVX is the “new shiny hotness”. And we got 10x by vectorizing by hand. Might we get more with AVX?

My first pass is this

sum = 0.0;

```
``` /* pre-vector loop */

for(k=inf-1;k>start_index;k--)

{

sum += one/pow((double)k,(double)n);

}

// __P_SUM[0 ... VLEN] = 0 */

register __m256d __P_SUM asm ("ymm2") = _mm256_broadcast_sd(__zero);

// __ONE[0 ... VLEN] = 1

register __m256d __ONE asm ("ymm3") = _mm256_broadcast_sd(__one);

// __DEC[0 ... VLEN] = VLEN

register __m256d __DEC asm ("ymm4") = _mm256_broadcast_sd(__inc);

register __m256d __L asm ("ymm5") = _mm256_load_pd(l);

register __m256d __INV asm ("ymm6") ;

for(k=start_index;k>end_index;k-=2*unroll)

{

__D_POW = __L;

for (m=n;m>1;m--)

{

__D_POW = _mm256_mul_pd(__D_POW, __L);

}

__INV = _mm256_div_pd(__ONE, __D_POW);

__P_SUM = _mm256_add_pd(__P_SUM, __INV);

__L = _mm256_sub_pd(__L, __DEC);

__D_POW = __L;

for (m=n;m>1;m--)

{

__D_POW = _mm256_mul_pd(__D_POW, __L);

}

__INV = _mm256_div_pd(__ONE, __D_POW);

__P_SUM = _mm256_add_pd(__P_SUM, __INV);

__L = _mm256_sub_pd(__L, __DEC);

}

_mm256_store_pd(p_sum,__P_SUM);

` /* sum reduction loop */`

for(k=0;k

{

sum += one/pow((double)k,(double)n);

}

Performance gives a very minor regression relative to sse2. Same answers. And yes, I did unroll the core loop.

I think with some hand tweaking I could probably get more. Will look. But enjoy in the meanwhile.

Viewed 87505 times by 7291 viewers