Quite a few people asked offline for a fortran version of the tests I indicated in the last post on this subject. So here is the basic code and its performance. What is interesting is that, for the same inputs as the C code, with no heroic loop unrolling/unwinding, the Fortran is about 3x faster than the C. Well, except for ifort. But we will get into that more later on.

program rzf

integer i,j,k,milestone,n,rc,inf

double precision total,p_sum,sum,delta_t,pi, pi_exact

n = 0

sum = 0.0d0

pi_exact = 4.0d0*atan(1.0d0)

print *,"Enter the order of the Riemann zeta function"

read *,n

print *,"Enter the limit of the summation (effectively infinity)"

read *,inf

do k=inf-1,1,-1

sum = sum + dble(k)**(-n)

enddo

print *,"zeta(",n,") = ",sum

if (n .eq. 2) then

pi = sqrt(sum*6.0d0)

print *,"pi = ",pi

print *,"error in pi = ",pi_exact-pi

print *,"relative error in pi = ",(pi_exact-pi)/pi_exact

endif

end

So I created a file named “in” which has these 2 lines.

2

100000000

These are the same as inputs I used in the other code.

With basic -O optimization

compiler | loop time (s) |
---|---|

gcc | 1.3 |

icc | 3.8 |

pgcc | 1.3 |

Ok.

This is about 3x the performance (for gfortran 4.2.1, pgf90 7.1-1) and a little better with ifort 10.0.023.

Hmm. Ok, lets turn on maximum optimization and see what happens.

compiler | loop time (s) |
---|---|

gcc | 1.2 |

icc | 3.6 |

pgcc | 1.2 |

Hmmm…

Ok, I won’t dig into the generated assembly language that much, but I am willing to bet that we would see what we saw before. SSE used with half registers.

Before we do that though, note that I exploited the intrinsic fortran power function with an integer argument. My guess is that it replaced the function call to pow with a simple loop, much like my hand optimized C version from the previous example.

Well, it is a little more complex than that. Notice how I used k**-n or more precisely, dble(k)**(-n). This effectively avoids a division. Divisions are expensive. And you might note that I used one in the C code. Ok, lets change than line back to

sum = sum + 1.0d0/dble(k)**(n)

What happens to the optimized performance?

compiler | loop time (s) |
---|---|

gcc | 1.1 |

icc | 3.5 |

pgcc | 1.0 |

Wow. ~10% performance delta (better) for letting a “more expensive” operation stay in.

Counter-intuitive.

Ok, what about a loop unrolling? What happens if I do the same thing I did with the C code? Not heroic effort, but reasonable effort?

Here is what the code looks like now:

program rzf

integer i,j,k,milestone,n,rc,inf

double precision total,sum,delta_t,pi, pi_exact

double precision l(4),p_sum(4),d_pow(4),unroll_array(4)

integer start_index,end_index,unroll

n = 0

sum = 0.0d0

pi_exact = 4.0d0*atan(1.0d0)

unroll = 4

print *,"Enter the order of the Riemann zeta function"

read *,n

print *,"Enter the limit of the summation (effectively infinity)"

read *,inf

start_index = inf - mod(inf,unroll)

end_index = unroll

print *," start index = ",start_index

print *," end index = ",end_index

do i=1,4

p_sum(i) = 0.0d0

unroll_array(i) = dble(unroll)

enddo

do k=inf-1,start_index+1,-1

sum = sum + dble(k)**(-n)

enddo

do i=1,4

l(i) = dble(start_index-i+1)

enddo

do k=start_index,end_index+1,-unroll

do i=1,4

d_pow(i) = l(i)

enddo

do j=n,2,-1

do i=1,4

d_pow(i) = d_pow(i) * l(i)

enddo

enddo

do i=1,4

p_sum(i) = p_sum(i) + 1.0d0 / d_pow(i)

enddo

do i=1,4

l(i) = l(i) - unroll_array(i)

enddo

enddo

do i=1,4

sum = sum + p_sum(i)

enddo

do k=end_index,1,-1

sum = sum + dble(k)**(-n)

enddo

print *,"zeta(",n,") = ",sum

if (n .eq. 2) then

pi = sqrt(sum*6.0d0)

print *,"pi = ",pi

print *,"error in pi = ",pi_exact-pi

print *,"relative error in pi = ",(pi_exact-pi)/pi_exact

endif

end

And this is the performance:

compiler | loop time (s) |
---|---|

gfortran | 0.85 (0.84 when we use -ftree-vectorize on gfortran 4.2) |

ifort | 1.0 (note: if we use -O2, and no -xW, we get 0.80) |

pgf90 | 1.0 |

Ok, this is showing that additional (possibly heroic) efforts to do non-algorithmic shift optimization in Fortran are going to give diminishing returns at best. It also shows that most of the fortran compilers do a pretty good job on optimization of simple loops.

It also shows that you have to work fairly hard (heroic efforts) to get the C compilers we tried to give good optimizations.

It is not wise to draw general conclusions about relative language performance from such a simple test. It does suggest that some serious thought should go into choice of implementation languages, that simply neglecting one because it is 52 years old and may not be the “hot” thing, may have a (potentially significant) negative performance impact.

What this does show me though is that none of the compilers make particularly efficient use of the systems resources. Why this is, I am not sure. Maybe a compiler designer/builder can correct my understanding.

SSE is a remarkable development, albeit not as well designed as one might hope. Compilers, while able to emit code for SSE don’t seem to do an exceptionally good job of mapping problems onto the SIMD (“vector”) registers.

As we increase the numbers of cores per physical processor, it would be good to make sure that we are using them more efficiently. This is unfortunately a simple matter of programming (SMOP). That is, it is a problem that is pretty nearly unsolvable, which is what SMOP really means. Sort of like the joke about the professor and the student, where the student asks for an explanation of an example that the professor called “trivial”. Hours into the effort of understanding the work, the student says, “yes, it is trivial”.

Compiler code generation efficiency is critical to good performance, but so is an understanding of the underlying processor architecture, and how to map code into it.

If I have time, I will try to rewrite these in SSE macros for C/Fortran. Also, my nVidia 8800 GTX came in, and it is meant for Cuda work. I would like to see if we can do something there. Yeah, I know it is single precision … Maybe on the Cell as well. Would be fun.