I am a self admitted speed freak … I like fast things. I like playing with the new shiny as well, to get a feel for its utility and capabilities.

I’ll freely admit I’ve been quite excited about the Julia language. I like the concept of a language specifically designed for high performance.

I am a user of Perl for many things. It is still, IMO, one of the most powerful languages available, and is reasonably fast. I am not sure I am happy with it as a data language (for massive data processing), as this is not its core competency. I’ve been hoping Julia would be able to fill this void eventually.

Matlab and more precisely, Octave, is my go-to modeling/analysis language. Its pretty trivial to express algorithms in it, and its designed to handle data.

I really need to get better at k/q in kdb+. It is very expressive, and very compact simultaneously.

Previously, I had written the rzf (Riemann zeta function calculator) in C, Fortran, node (Javascript), rewritten the C to have hand optimized inner loops, Perl. Now I’ll show the Perl, matlab/octave, julia, python, java, and kdb+ for fixed order 2, and 1 billion (10^{9} iterations). What I am curious about is how easy it is to express, and how fast the loop construct is, without significant hand optimization. Java, Perl and Julia use a JIT compilation system though Perl executes a mid-level set of op-codes, while Julia compiles via a LLVM based JIT, and Java compiles to optimized byte codes. Matlab/octave, python and kdb+ are strictly interpreted.

The julia code is pretty straightforward:

N = 1000000000.0 sum = 0.0 pi_exact = 4.0*atan(1.0) tic() for i=N:-1.0:1.0 sum += 1.0/(i*i) end timing = toc() println("sum reduction took ", timing * 1000, " ms") pi_comp = sqrt(sum*6.0) println("pi = ",pi_comp) println("error in pi = ",pi_exact-pi_comp) println("relative error in pi = ",(pi_exact-pi_comp)/pi_exact)

The time to execute this cam out to about 238.5s on my laptop

It turns out the matlab/octave version is almost identical.

format long N =1000000000.0 sum = 0.0 pi_exact = 4.0*atan(1.0) tic() for i=N:-1.0:1.0 sum += 1.0/(i*i); end timing = toc() "timing = ",timing pi_comp = sqrt(sum*6.0) "pi = ",pi_comp "error in pi = ",pi_exact-pi_comp "relative error in pi = ",(pi_exact-pi_comp)/pi_exact

It takes 2398s.

The kdb+/q version (serial loop)

N:1000000000 s:0.0f i:N \t do[N; s+:1.0%(i*i); i-:1; ] s*:6.0f pi_comp: sqrt s pi_exact: 4.0f * atan(1.0f) delta: pi_comp - pi_exact fdelta: delta % pi_exact pi_comp delta fdelta

takes 640.7 seconds.

The Perl (5.16.1) version

#!/opt/scalable/bin/perl use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep clock_gettime clock_getres clock_nanosleep clock stat ); use Math::Trig; my ($i,$j,$k,$milestone,$n,$rc); my (@array,$total,$p_sum,$sum,$delta_t,$pi, $pi_exact,@caliper); use constant true => (1==1); use constant false => (1==0); my $inf=1000000000 ; my $c; $milestone = 0; $n = 0; $sum = 0.0; $pi_exact = 4.0*atan(1.0); $caliper[$milestone] = [gettimeofday]; $sum=0.0; for($k=$inf-1;$k>=1;$k--) { $sum += 1.0/($k*$k); } $milestone++; $caliper[$milestone] = [gettimeofday]; printf("zeta(%i) = %-.15f \n",$n,$sum); $pi = sqrt($sum*6.0); printf("pi = %-.15f \n",$pi); printf("error in pi = %-.15f \n",$pi_exact-$pi); printf("relative error in pi = %-.15f \n",($pi_exact-$pi)/$pi_exact); #/* now report the milestone time differences */ for ($i=0;$i< =($milestone-1);$i++) { $delta_t = tv_interval($caliper[$i] , $caliper[$i+1] ); printf("Milestone %i to %i: time = %-.3fs\n",$i,$i+1,$delta_t); }

takes 152s.

The python code (works in Python 3 and 2.x, though you should change range to xrange for 2.x)

import math import time N = 1000000000 sum = 0.0 pi_exact = 4.0*math.atan(1.0) start = time.clock() for i in range(N,0,-1): sum += 1.0/(i*i) stop = time.clock() timing = stop - start print("timing=", timing) pi_comp = math.sqrt(sum*6.0) print("computed pi=",pi_comp) delta = pi_exact-pi_comp rerr = delta / pi_exact print("delta=",delta) print("relative error",rerr)

takes 275.1s with Python 3.3.1

The Java code is also fairly straightforward:

import java.*; public class rzf { public static void main (String[] args) { long N=1000000000; long start,stop; double sum=0.0,timing,rerr,delta,pi_comp,pi_exact; pi_exact=4.0*Math.atan(1.0); start = System.currentTimeMillis(); for( long i = N; i >= 1; i--) { sum += 1.0/((double)i*(double)i); } stop = System.currentTimeMillis(); timing = (double)(stop - start)/ 1000.0; System.out.println (timing); pi_comp = Math.sqrt(6.0*sum); System.out.println (pi_comp); delta = pi_comp - pi_exact; rerr = delta / pi_exact ; System.out.println (delta); System.out.println (rerr); } }

The java code takes 11 seconds.

The same code in C

/* Copyright (c) 2003-2007 Scalable Informatics */ #include <stdio .h> #include <math .h> #include <sys /timeb.h> #include <time .h> #include <stdlib .h> #include <unistd .h> #include "string.h" #define NUMBER_OF_CALIPER_POINTS 10 struct timeb t_initial,t_final,caliper[NUMBER_OF_CALIPER_POINTS]; int main(int argc, char **argv) { int i,j,k,milestone,n,rc; int NMAX=5000000, MMAX=10; double *array,total,p_sum,sum,delta_t,pi, pi_exact; int true = (1==1), false = (1==0),inf=0 ; int name_length=0; char *cpu_name,c; milestone = 0; n = 0; sum = 0.0; pi_exact = 4.0*atan(1.0); rc=ftime(&caliper[milestone]); inf=1000000000; n=2; sum=0.0; milestone++; rc=ftime(&caliper[milestone]); for(k=inf-1;k>=1;k--) { sum += 1.0/((double)k*(double)k); } milestone++; rc=ftime(&caliper[milestone]); printf("zeta(%i) = %-.15f \n",n,sum); pi = sqrt(sum*6.0); printf("pi = %-.15f \n",pi); printf("error in pi = %-.15f \n",pi_exact-pi); printf("relative error in pi = %-.15f \n",(pi_exact-pi)/pi_exact); /* now report the milestone time differences */ for (i=0;i< =(milestone-1);i++) { delta_t = (double)(caliper[i+1].time-caliper[i].time); delta_t += (double)(caliper[i+1].millitm-caliper[i].millitm)/1000.0; printf("Milestone %i to %i: time = %-.3fs\n",i,i+1,delta_t); } }

takes 7.5s

Whats interesting to me is that since none of these are vectorized/parallelized, we see something of raw compiler/interpreter performance. The optimization in terms of using vector operations, and exploiting parallelism is where this gets interesting.

Rewriting the kdb+/q code as a blocked set of vectors (I had to adjust this as it was running the 32 bit version and I kept running out of memory)

N:1000000000 M:10 s:0.0f i:0 rzfk: {[x] 1.0f % (x*x) } v: {[i;r] (i*6h$r)+1+til 6h$r } \t do [M; s+: sum rzfk 9h$(v[i;N%M]) ; i+:1; ] s*:6.0f pi_comp: sqrt s pi_exact: 4.0f * atan(1.0f) delta: pi_comp - pi_exact fdelta: delta % pi_exact pi_comp delta fdelta

we now execute this in 34.4 seconds. Within a factor of 5 of the C code, and within a factor of 3 of Java.

Whats more interesting than this is that this is the pure 32 bit version of kdb+, versus 64 bit for everything else. I am not sure what sort of performance hit this would take running this way.

The construction of the kernel (the rzfk function lambda) is quite interesting. It basically provides the template for the computation. The vector v is one group of vectors (M groups total).

But you can see the exercise of power from this code. I am working on the same thing in Octave, but running into some curious bugs:

octave:25> V=1:10 V = 1 2 3 4 5 6 7 8 9 10 octave:26> sum(V) error: A(I): index out of bounds; value 10 out of bound 1 octave:26> V=[1:10] V = 1 2 3 4 5 6 7 8 9 10 octave:27> sum(V) error: A(I): index out of bounds; value 10 out of bound 1

Cool (not).

Viewed 63668 times by 6478 viewers

Just a quick clarification — you say matlab/octave is “almost identical” to the julia code, but you list the matlab/octave time as 2398s. Do you mean 239.8s?

The code is almost identical. The runtimes were anything but identical.