# Riemann zeta function in parallel/vector data languages

Continuing the work of the previous post, I looked into rewriting the serial code to run in parallel/vector data languages. My original supposition about what would make a good data language is now in doubt as a result.

First, I used PDL in Perl. But its Perl, right? It can’t possibly be fast. That would be … like, I dunno … wrong? (yes, this is sarcasm).

This completes the task in 12s. Faster than everything but the JIT/static compiled languages, and about the same as Java.

First the vectorized Perl source:

#!/opt/scalable/bin/perl
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
clock_gettime clock_getres clock_nanosleep clock
stat );
use Math::Trig;
use PDL;
use PDL::NiceSlice;
my ($i,$j,$k,$milestone,$n,$rc,$v,$inc,$m); my (@array,$total,$p_sum,$sum,$delta_t,$pi, $pi_exact,@caliper); use constant true => (1==1); use constant false => (1==0); my$inf=1000000000 ;
$m = 10000; my$c;

# enable large data
$PDL::BIGPDL = 1;$milestone  = 0;
$n = 0;$sum                = 0.0;
$pi_exact = 4.0*atan(1.0);$caliper[$milestone] = [gettimeofday];$sum=0.0;

for($i=0;$i< $m;$i++) {
$v = int(($inf/$m)*$i)+ 1 + sequence $inf/$m;
$sum += sum (1.0 / ($v*$v)); }$milestone++;
$caliper[$milestone] = [gettimeofday];

printf("zeta(%i)  = %-.15f \n",$n,$sum);
$pi = sqrt($sum*6.0);
printf("pi = %-.15f \n",$pi); printf("error in pi = %-.15f \n",$pi_exact-$pi); printf("relative error in pi = %-.15f \n",($pi_exact-$pi)/$pi_exact);

#/* now report the milestone time differences */
for ($i=0;$i<=($milestone-1);$i++)
{
$delta_t = tv_interval($caliper[$i] ,$caliper[$i+1] ); printf("Milestone %i to %i: time = %-.3fs\n",$i,$i+1,$delta_t);
}



Here, all the work is in that middle for loop. We create a vector $v of size$inf/$m, offset it appropriately, and, as you can see in the next line, we just compute with it. Honestly, this was far easier than I thought it would be. But its still running on a single CPU. Can we get this going in parallel? The short answer is yes, but the mode I used for this is basically standard pthread create/join, which means you have to work out all of the issues associated with how the work is divided between threads. What we see in performance for 1 to 8 threads: NCPU Wall clock (seconds) 1 17.4 2 9.4 3 6.7 4 5.5 5 5.3 6 5.1 7 4.9 8 5.0 Here is the code #!/opt/scalable/bin/perl use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep clock_gettime clock_getres clock_nanosleep clock stat ); use Math::Trig ':pi'; use PDL; use PDL::NiceSlice; use PDL::AutoLoader; use threads; use threads::shared; use Probe::MachineInfo::NumCPUs; my ($i,$j,$k,$milestone,$n,$rc,$v,$inc,$m,$si,$ei,$thr,$th,$inf); my (@array,$total,$p_sum,$sum,$delta_t,$pi, $pi_exact,@caliper); my ($NCPU,@done,$alldone,$sumdone,@count,$tc); use constant true => (1==1); use constant false => (1==0); share($sum);
share($inf); share($m);
share($NCPU); share(@done); share(%caliper); share($milestone);
share($delta_t); share($pi);
share($pi_exact); share(@count); share($tc);

$inf = 1000000000;$m          = 1000;
$NCPU = 8; #Probe::MachineInfo::NumCPUs->get() ; # enable large data$PDL::BIGPDL = 1;

$milestone = 0;$n          = 0;
$sum = 0.0;$pi_exact   = pi;

$caliper[$milestone] = [gettimeofday];

$sum=0.0; map {$done[$_] = false} (0..$NCPU);

for($thr=0;$thr< $NCPU;$thr++) {
$th->{$thr} = threads->create({'void' => 1},'psum',$thr); } printf "waiting: NCPU=%i\n",$NCPU;
do {
usleep(10000);
$sumdone = 0; map {$sumdone++ if ($done[$_]) } @done;
} until ($sumdone >=$NCPU);

printf "joining\n";
for($thr=0;$thr< $NCPU;$thr++) {
$th->{$thr}->join();
}

$milestone++;$caliper[$milestone] = [gettimeofday]; printf("zeta(%i) = %-.15f \n",$n,$sum);$pi = sqrt($sum*6.0); printf("pi = %-.15f \n",$pi);

printf("error in pi = %-.15f \n",$pi_exact-$pi);
printf("relative error in pi = %-.15f \n",($pi_exact-$pi)/$pi_exact); #/* now report the milestone time differences */ for ($i=0;$i< =($milestone-1);$i++) {$delta_t = tv_interval($caliper[$i] ,  $caliper[$i+1] );
printf("Milestone %i to %i: time = %-.3fs\n",$i,$i+1,$delta_t); }$tc=0;
for($i=0;$i<$NCPU;$i++) {
printf "thread[%i] -> count=%i\n",$i,$count[$i];$tc+=$count[$i];
}
printf "total count = %i\n",$tc; exit; sub psum { my ($thread_i) = @_;
my ($bs,$be);
printf " -> psum[thread = %i] ",$thread_i;$done[$thread_i] = false;$si = ($thread_i + 0) * int($inf/($NCPU));$ei = ($thread_i + 1) * int($inf/($NCPU)); for(my$block=0;$block <$m ; $block++) {$bs = $si + ($block + 0) * int(($ei-$si)/$m);$be = $si + ($block + 1) * int(($ei-$si)/$m);$v = $bs + 1 + sequence ($be-$bs);$sum += sum (1.0 / ($v*$v));
$count[$thread_i]++;
#        printf "thread,block,count,si,ei: %i, %i, %i, %i, %i\n",$thread_i,$block,$count[$thread_i],$si,$ei;
}
$done[$thread_i] = true;
}



Here we create a thread which calls psum, which computes partial sum panels, 1 block at a time.

So this is running in parallel and “vectorized” in Perl. The vectorized version was trivial to set up. The parallelized less so.

I argue that parallelization needs to be seamless and painless to make work. Pthreads is anything but seamless and painless (and I didn’t even include the boilerplate exception handlers in this).

I tried some things with Python: First Cython, which compiles the Python to C, and then creates a DLL. Unfortunately, I ran into the differences between range and xrange in Python 3 vs Python 2, and Cython for Python 2. It turns out that the language has changed rather drastically between the two versions. I had to change the range back to xrange to get the cython compilation done for Python v2.x

After running it, execution time was 167s.

This is obviously not the solution to higher performance or parallel/vector processing in Python. The compiled code is slower than the interpreted Perl, kdb+/q, …

Lets try numpy, and work on vectors at a time. Using the relatively quick port of rzf.py to numpy, again as a vector language, it looks like 1.8s at the 100M element size, so 1B elements should be fast.

And … they are not. The code is spending most of its time in apport (the numpy bits). I suspect they have some implementation issues for very large data sets, where you have to explicitly tile things. Which is odd, as I keep hearing from folks how Python is the new new-thing for big data. I could paraphrase Inigo Montoya on this, but I don’t think I need to.

Just like with Perl, I’ll have to block this. Ok, we are at 7 minutes of run-time. Its slower than the simple loop. Time to kill it and try the blocked version.

Much better. With 1000 blocks and numpy, I can get it down to 14s run time. Slower than Perl with PDL, Java, and C.

Here’s the code

import math
import time
from numpy  import *

N       = 1000000000
m       = 1000
sum     = 0.0
pi_exact        = 4.0*math.atan(1.0)
start = time.clock()

for i in xrange(m,0,-1):
v       = arange( N*i/m, N*(i-1)/m, -1)
rzfk    = 1.0/(v*v)
sum     += rzfk.sum()

stop = time.clock()

timing  = stop - start
print("timing=", timing)
pi_comp = math.sqrt(sum*6.0)

print("computed pi=",pi_comp)
delta   = pi_exact-pi_comp
rerr    = delta / pi_exact

print("delta=",delta)
print("relative error",rerr)


The code is still pretty straight forward, but it belies the scaling issue I ran into. I am not sure why it is, I haven’t looked at the code, but I could speculate if needed (an algorithm switch between “big” and “small” when array size exceeds a certain size).

This also, curiously, suggests that people using it for “big data” might be doing themselves something of a disservice by doing so, if their data is even moderately big. 109 array elements is 4GB for ints, and 8GB for floats. Big data is often much larger than this.

FWIW: This is a laptop with 16 GB ram, 4x Core i7 CPUs. It is no slouch, and I can run these on my desktop if needed (12 processor cores + 40 GB ram), or on one of the big data boxen (24 processor cores + 512GB ram). Something tells me that not much will change. But I will try this at some point next week.

Moreover, looking at the parallel mechanisms for execution, it appears that they all make use of either a thread/join model, or some variation of distributed programming.

That is, they are also not (completely/nearly/partially) transparent. On the contrary, they are quite involved to get this to work correctly.

Ok, back to Julia. How to make this vectorized … fundamentally, this was/is the big draw of the language. Complex things are easy, simple things trivial.

My first pass through this, I managed to run out of memory. So, as with Perl and Python, I used 1000 blocks. Ran it. 33.4s. Nice.

Here’s the vectorized code:

N=1000000000
m=1000
r=(int)(N/m)
s       = 0.0
pi_exact        = 4.0*atan(1.0)

println("r = ",r)
tic()
for block=m:-1:1
v       = linspace(r*block,r*(block-1)+1,r)
rzfk    = 1.0 / (v .* v)
s       += sum(rzfk)
end
timing = toc()
println("sum reduction took ", timing  * 1000, " ms")

pi_comp = sqrt(s*6.0)
println("pi = ",pi_comp)

println("error in pi = ",pi_exact-pi_comp)
println("relative error in pi = ",(pi_exact-pi_comp)/pi_exact)


So far, the easiest vectorizations have been the kdb+, the Julia, and the Perl code. Python took longer than it should have, both on development time, and execution time, as I ran into gotchas.

Note: I’ve not done much with the Octave code. It looks like I am stumbling against a bug in the core w.r.t. array indexing.

What about parallelizing the vectorized Julia code? Right now, I really don’t want to work hard to get a sum reduction, and I want to understand my code. Pthreads and alike are fine for some stuff, but a scientific programmer, a big data analyst doesn’t really want to know from pthreads (or OpenACC/OpenMPI/OpenMP/…). These users, the ones we need to serve better, just need to know that they can exploit parallelism, trivially, as a built-in and natural extension of their language. So far we haven’t seen that.

Here is the parallel vectorized version of the code.

N=1000000000
m=1000
r=(int)(N/m)
s       = 0.0
pi_exact        = 4.0*atan(1.0)

println("r = ",r)
tic()
s = @parallel (+) for block=m:-1:1
v       = linspace(r*block,r*(block-1)+1,r)
psum    = 1.0 / (v .* v)
sum(psum)
end

timing = toc()
println("sum reduction took ", timing  * 1000, " ms")

pi_comp = sqrt(s*6.0)
println("pi = ",pi_comp)

println("error in pi = ",pi_exact-pi_comp)
println("relative error in pi = ",(pi_exact-pi_comp)/pi_exact)


Again, I did the blocking/paneling due to memory limitations on the laptop (it has enough memory, but none (apart from kdb+) of the tools seem to be able to use it all effectively).

Notice how the entire parallelism is expressed as a sum reduction macro modifier on the loop. This is just like kdb+ with peach (which I am not sure of exactly how to use just yet).

NCPU Wall clock (seconds)
1 34.4
2 18.6
3 13.7
4 11
5 11.4
6 11
7 10.6
8 10.2

This is about as painless parallelism as it gets. Start the code with julia -p NCPU script.jl and it just works. The right way. No fiddling.

Note, for kdb+/q, just like julia, you use q -s NCPU and then replace the central do [M; ... with a peach bit. I don’t quite grasp the syntax of that yet, will update when I do.

I’ll explain why I pursued this in the next post.

Viewed 56661 times by 3953 viewers

Optimization WordPress Plugins & Solutions by W3 EDGE