Performance of a julia code: Riemann ζ function implemented on CPU and GPU

I’ve been playing with Julia for a while now. As the language is evolving quickly, I have to learn and relearn various aspects. Happily, since 1.0 dropped a while ago, it’s been more consistent, and the rate of change is lower than in the past.

The Riemann ζ function is an interesting problem with which to test a number of mechanisms for parallelism. The function itself is simple to write:

$$\zeta(a) = \sum_{i=1}^\infty \frac{1}{i^a}$$

The value of a may be complex, though for our testing, we will restrict

$$a \in \mathbb{R}$$

And even more specifically, we will set a=2. When we do that, ζ(2) turns out to have a known closed-form solution, $$\frac{\pi^2}{6}$$

What is nice about that is that we can test both performance and accuracy simultaneously as we optimize.

Please note that the code for all of this is located on the github repository at https://github.com/joelandman/rzf . Feel free to clone it and experiment on your own machine(s).

Writing a simple, general Riemann zeta function in Julia turns out to be remarkably easy, with a few specific implementation considerations. The first is that a brute-force infinite sum will never complete. So rather than using ∞ as the upper limit of our summation, we will use N, for large values of N, and argue that the truncation of the sum won’t significantly impact the results.

Put another way

$$rzf(a,N) = \sum_{i=1}^N \frac{1}{i^a} $$

and

$$\lim_{N\rightarrow\infty} rzf(a,N) = \zeta(a) $$

What this means is that in the limit of N going to ∞, rzf(a,N) will be ζ(a). We could measure this truncation error, though for the purposes of this article we’ll avoid that. It’s a simple-to-specify function that provides an excellent laboratory for testing.

The basic function that we use for this in Julia is this

julia> function ζ(a,N)
          s = 0.0
          for i ∈ 1:N
            s+=(1.0/i)^a
          end
          s
        end
ζ (generic function with 1 method)

We can time execution of this relatively easily

julia> @time ζ(2,1000000000)
  6.684926 seconds (5 allocations: 176 bytes)
1.644934057834575
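
Since we know the closed form, we can check accuracy at the same time. From the value printed above, the result differs from π²/6 by about 9×10⁻⁹: roughly 1/N = 10⁻⁹ of that is truncation of the tail, and the rest is floating-point accumulation error from summing in ascending order. A one-liner for the check:

abs(ζ(2,1000000000) - pi^2/6)    # ≈ 9e-9 at this N, per the result above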

For those who wish to compare with other languages, here’s the equivalent Python 3:

#!/usr/bin/env python3


def ζ(a,N):
  s = 0.0
  # note: range(1,N) stops at N-1, one term short of Julia's 1:N loop;
  # at N = 10^9 that last term (1/N^2 = 1e-18) is far below the rounding error of the sum
  for i in range(1,N):
     s+=1.0/pow(i,a)
  return s

print("rzf(2) = %f " % ζ(2,1000000000))

And running that on the same machine as the Julia code

joe@gandalf:~/rzf/python$ /usr/bin/time ./rzf.py
rzf(2) = 1.644934 
311.16user 0.03system 5:11.21elapsed 99%CPU (0avgtext+0avgdata 9120maxresident)k
0inputs+0outputs (0major+1050minor)pagefaults 0swaps

That is roughly 46.5x longer to execute what is ostensibly the same algorithm, written in almost exactly the same manner, with no optimization. Call this a naive brute force approach.

What I want to do is to see what mechanisms I can use to make the Julia code go faster.

First though, I want to make it go slower. By increasing N. By 16 times. So now, if the time scales linearly with the number of iterations, we should see about 106.9 seconds for completion in Julia.

julia> @time ζ(2,16000000000)
107.503099 seconds (5 allocations: 176 bytes)
1.644934057834575

Not bad. But this is a 4-core machine, and I left SMT enabled.

joe@gandalf:~/rzf/python$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             3607.571
CPU max MHz:         3800.0000
CPU min MHz:         800.0000
BogoMIPS:            5599.85
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

This is my 2-year-old laptop, which replaced my 10-year-old laptop.

With 4 physical cores and 4 additional logical (SMT) cores, I should be able to reduce this execution time on the CPU by using basic parallelism.

So I tried several methods, using Julia’s distributed, SIMD, and threaded mechanisms, among other techniques. I also added in a better timing and observability library, which unfortunately requires that I run each calculation twice.
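
The table below looks like the output of TimerOutputs.jl, so here is a hedged sketch of what the timing harness might look like. The section labels are the ones that appear in the table, but the function signatures and the value of N are assumptions; the actual driver is in the repo.

using TimerOutputs

const to = TimerOutput()
# N = summation limit, set elsewhere in the script

# first call compiles (and is thrown away); the second, timed call is what the table reports
ζ_gen_thr_range(1,N)
@timeit to "ζ_gen_thr_range"   ζ_gen_thr_range(1,N)

ζ_2_sqr_thr_range(1,N)
@timeit to "ζ_2_sqr_thr_range" ζ_2_sqr_thr_range(1,N)

print_timer(to)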

The code returned this:

 ─────────────────────────────────────────────────────────────────────────────
                                      Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:            378s / 47.7%           1.78GiB / 0.04%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 ζ_gen_thr_range           1     126s  69.9%    126s    668KiB  98.1%   668KiB
 ζ_2_sqr_thr_range         1    41.4s  22.9%   41.4s   1.84KiB  0.27%  1.84KiB
 ζ_2_sqr_simd_range        1    11.3s  6.25%   11.3s     48.0B  0.01%    48.0B
 ζ_2_sqr_gpu_range         1    1.62s  0.90%   1.62s   11.3KiB  1.66%  11.3KiB
 ─────────────────────────────────────────────────────────────────────────────

The first was a general threaded approach, using Julia’s native threading. One needs to set an environment variable (export JULIA_NUM_THREADS=16) to pass the thread count in, similar to OpenMP. The second was also a threaded approach, but with a specialized implementation of the inner computation: since we are computing ζ(2), we can leverage that fact and hard-code the

$$i^{-2}$$

as a squaring operation, rather than a call to the power function.
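
In isolation, the specialization just replaces the generic power call in the inner loop with two multiplies. A minimal sketch of the difference (not the repo’s exact code):

s += (1.0/i)^a            # generic: works for any a, calls the power function every term
s += (1.0/i)*(1.0/i)      # specialized for a = 2: two multiplies, no power call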

The ζ_2_sqr_simd_range attempts to leverage the microparallelism of the processor: its ability to execute SIMD code via AVX/SSE.
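
A minimal sketch of that idea on a plain 1:N loop (the name here is made up for illustration; the actual ζ_2_sqr_simd_range in the repo works over a start/end range, like the threaded code below):

function ζ_2_sqr_simd(N)
    s = 0.0
    # @simd lets the compiler reassociate the reduction so it can vectorize with AVX/SSE
    @simd for i = 1:N
        s += (1.0/i)*(1.0/i)
    end
    s
end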

The ζ_2_sqr_gpu_range uses a GPU implementation. This machine has a GTX 1060M in it, and the CPU is a Kaby Lake i7-7700HQ.

Compare this to my deskside unit, which has an RTX 2060 and a much older pair of Sandy Bridge CPUs:

 ─────────────────────────────────────────────────────────────────────────────
                                      Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:            701s / 46.9%           2.03GiB / 0.03%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 ζ_gen_thr_range           1     181s  54.9%    181s    668KiB  98.0%   668KiB
 ζ_2_sqr_thr_range         1    99.4s  30.3%   99.4s   2.17KiB  0.32%  2.17KiB
 ζ_2_sqr_simd_range        1    47.4s  14.4%   47.4s     48.0B  0.01%    48.0B
 ζ_2_sqr_gpu_range         1    1.26s  0.38%   1.26s   11.3KiB  1.66%  11.3KiB
 ─────────────────────────────────────────────────────────────────────────────

The first set of results is from the laptop; the second is from the deskside. Maybe someday, the AMD fairy will drop off a nice Rome or newer system by me that I can play with. Until then, I won’t be able to get AMD numbers. Though I do have a Ryzen 5 laptop here … hmmm ….

Back to the deskside for a moment. When I set the thread count to 16, watch what happens to the timing

─────────────────────────────────────────────────────────────────────────────
                                      Time                   Allocations      
                              ──────────────────────   ───────────────────────
       Tot / % measured:            168s / 41.2%           1.78GiB / 0.04%    

 Section              ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────
 ζ_2_sqr_simd_range        1    47.4s  68.6%   47.4s     48.0B  0.01%    48.0B
 ζ_gen_thr_range           1    13.3s  19.2%   13.3s    683KiB  96.0%   683KiB
 ζ_2_sqr_thr_range         1    7.13s  10.3%   7.13s   17.0KiB  2.39%  17.0KiB
 ζ_2_sqr_gpu_range         1    1.27s  1.84%   1.27s   11.3KiB  1.59%  11.3KiB
 ─────────────────────────────────────────────────────────────────────────────

SIMD doesn’t change. But the generalized thread range calc went from 181s to 13.3s. That’s roughly 13.6x faster. The squared thread range calc was about 14x faster. The GPU version, however, was still about 6x faster than the best effort of the threaded code.

There’s still a great deal more experimentation I can do with this. That said, look at how simple this code is for the fastest CPU portion.

function ζ_2_sqr_thr_range(mstart,mend)
    s = 0
    cls  = 8  # cache line size in units of Float64 storage size
              # so we can avoid false sharing
    Nthr = Threads.nthreads()
    psum = zeros(Float64,Nthr*cls)
    ms   = zeros(Float64,Nthr*cls)   # per-thread loop start (upper end; the inner loop counts down)
    me   = zeros(Float64,Nthr*cls)   # per-thread loop end (lower end)

    # preparing each thread, start, end, and partial sums
    # (N here is the global summation limit set in the enclosing script)
    for i=1:Nthr
        ms[i*cls]   = (N/Nthr)*i
        me[i*cls]   = 1 + (N/Nthr)*(i-1)
        psum[i*cls] = 0.0
    end

    # actual function we will execute per thread
    # psums will be updated per thread id.  This is why
    # the psums are spaced 1 cache line apart, versus adjacent indices

    # Threaded inner loop, each thread has no dependence upon others
    Threads.@threads for t=1:Nthr
        innercpu_2!(psum,ms,me,2,cls)
    end

    # now sum up the partial sums and return s
    s = sum(psum)
    s
end

function innercpu_2!(psum,ms,me,x,cls)
        ndx     = Threads.threadid()
        # each thread walks its own sub-range downward; x (the exponent) is unused here
        # because the a = 2 case is hard-coded as two multiplies
        @simd for i=ms[ndx*cls]:-1:me[ndx*cls]
                @inbounds psum[ndx*cls] += (1.0/i)*(1.0/i)
        end
        return nothing
end
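
Usage is the same as the serial version, apart from launching Julia with the desired thread count. A sketch, assuming N is the global summation limit the function reads:

# export JULIA_NUM_THREADS=16   # set before starting Julia, as noted above
N = 16000000000                 # the global summation limit
ζ_2_sqr_thr_range(1, N)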

I purposefully structured it just like the GPU portion of this code, where instead of a threaded loop, I use a launch of the form

 @cuda blocks=Nblocks threads=Nthr  innergpu_1!(psum_d)

with the innergpu_1! function looking like this

function innergpu_1!(ps)
        ndx     = threadIdx().x                                  # slot in the partial-sum array
        i = (blockIdx().x-1) * blockDim().x + threadIdx().x      # global term index
        r = 1.0/i
        @atomic ps[ndx] += r*r   # threads in different blocks share slot ndx, hence the atomic
        return nothing
end
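
For completeness, here is a hedged sketch of what the host side driving this kernel might look like. The package (CUDA.jl here; the repo may use the older CUDAnative/CuArrays stack), the function name, the block/thread sizing, and the final reduction are my assumptions, not the repo’s exact ζ_2_sqr_gpu_range:

using CUDA

function ζ_2_sqr_gpu(N; Nthr=256)
    Nblocks = cld(N, Nthr)                 # enough blocks so Nblocks*Nthr covers 1:N
    psum_d  = CUDA.zeros(Float64, Nthr)    # one partial-sum slot per thread index
    @cuda blocks=Nblocks threads=Nthr innergpu_1!(psum_d)
    # note: the kernel as written sums over 1:(Nblocks*Nthr); any overshoot past N is negligible for large N
    sum(Array(psum_d))                     # copy the partial sums back and reduce on the host
end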

What’s wonderful about this is that it is very easy to use all of this power of the system, be it CPUs or GPUs (limited to NVIDIA right now, though I hope the good folks at AMD are paying attention to this). Writing parallel and fast code is not hard. Quite the opposite, actually.