On optimizing scripting languages, and where they are useful and where they are not

So yesterday, an article and discussion appeared on Hacker News. In the article, the author asks a reasonable question: how to optimize a piece of Python code. What happened next probably wasn’t what they intended.

The article was, ostensibly, about optimizing Python code. Yet after the 4th attempt at source-level optimization, it switched languages. This attempt to optimize Python code was no longer written in … Python.

Ok. So the code they were “optimizing” was trivial, and not really indicative of any particular workload. That concerns me, as one can (as the author showed) spend inordinate time and resources “optimizing” such a low-value piece of code.

When you optimize, there is implicitly a value proposition which goes something like: “the time/effort Y spent on making this faster will be worth it, because the code will be faster and I will be able to get X amount of additional work done per unit time.” That is, optimization is implicitly an economic argument, and that argument places implicit bounds on how hard you should work at it.
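To make that concrete, here is a back-of-envelope break-even calculation. The numbers are invented for illustration; they are not from the article.

```python
# Break-even check for an optimization effort.
# All numbers here are hypothetical, purely for illustration.
def break_even_runs(seconds_saved_per_run, hours_spent_optimizing):
    """How many runs before the optimization pays for itself."""
    return (hours_spent_optimizing * 3600) / seconds_saved_per_run

# Saving 2 s per run after a day (8 h) of optimization work:
runs = break_even_runs(2.0, 8.0)
# → 14400 runs before the time spent is recovered
```

If the code will never run anywhere near that many times, the effort is better spent elsewhere.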

The author’s thesis was that the interpreter was somehow in the way of optimization. By switching to C, with calls into Python, they removed the interpreter.

This also implicitly raises another question: using the right tool for the job. More on that in a moment.

The basic code they wanted to optimize was this:

#!/usr/bin/env python3

def main():
    for j in range(20):
        for i in range(1000000):
            str(i)

main()

When I run this here (Kaby Lake 4 core/8 thread laptop, 64 GB ram, Linux Mint 19.3 with 5.3.x kernel, and a nice NVidia GTX 1060 video card), I get this.

joe@gandalf:~/opt$ /usr/bin/time ./st.py
3.20user 0.00system 0:03.20elapsed 100%CPU

About 3.2s of wall clock. Python 3.8.2 compiled on this system.
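As an aside, /usr/bin/time includes interpreter startup. If you want to time just the loop in-process, the standard timeit module works. This is my own sketch, not from the article; absolute numbers will differ by machine.

```python
import timeit

# Time only the conversion loop, excluding interpreter startup.
# Mirrors the article's workload: 20 passes of 1,000,000 str() calls.
elapsed = timeit.timeit("for i in range(1000000): str(i)", number=20)
print(f"{elapsed:.2f}s for 20 x 1,000,000 str() calls")
```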

Before I continue: unless you are going to be running 10’s of millions of copies of this, thus hitting about a cpu-year of processing time, this is probably not a good optimization target. Or, put more correctly, your time is valuable, so the minimum effort you can put in to make this faster matters … if being faster will positively impact downstream outcomes.

Obviously, with this code, the answer is no … it won’t impact things, so as little effort as possible should be put into this.

What about writing a blog article (see what I did there?) on this? Well, I want to talk a bit higher level about appropriate tools for these sorts of things.

My core thesis is that Python is not a fast language, there are faster/better choices, and that it is hard to make Python faster without resorting to either a second language or herculean efforts.

I want to state this up front, and not hide it. You might disagree, but from a tool appropriateness viewpoint, this is important.

This, IMO, is a problem for the analytics and AI/ML/DL communities. They are working on tools such as JAX, which seek to implement changes and a “compiler” for the language to take the mixed-language code (Python + C/Fortran NumPy/etc.) and emit optimized code for GPUs/TPUs. I am not critical of JAX, but it again points to the appropriateness problem of tool choice.

First, I wanted to see how Perl performs with this, because, well, I use Perl for any scripting by default. I do this because it is immensely powerful and expressive, enabling easy transformation and munging of data.

Here’s the Perl version of this:

#!/usr/bin/env perl

use strict;

sub main {
    my ($i, $j);
    foreach $j (0 .. 20) {
        foreach $i (1 .. 1000000) {
            sprintf "%i", $i;
        }
    }
}

&main();

Note that the Perl code is very similar to the Python code. However, instead of a function dedicated to converting numbers to strings, I use the sprintf function. As was pointed out to me in the comments, sprintf is a much heavier-weight function than a dedicated number-to-string conversion. (Note also that 0 .. 20 runs 21 outer iterations to the Python version’s 20 — close enough for a rough comparison.)
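The same distinction exists within Python itself. A quick sketch of a micro-benchmark (my code, not the article’s) comparing str() against printf-style formatting, Python’s nearest analogue of sprintf:

```python
import timeit

# Compare a dedicated conversion (str) against printf-style formatting.
# Exact ratios vary by Python version and machine.
t_str = timeit.timeit("str(i)", setup="i = 123456", number=1_000_000)
t_fmt = timeit.timeit("'%i' % i", setup="i = 123456", number=1_000_000)
print(f"str(): {t_str:.3f}s   '%i' formatting: {t_fmt:.3f}s")
```

So the Perl version is doing somewhat more work per iteration than the Python version, which makes its timing below more interesting, not less.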

No worries, let’s just run with that.

joe@gandalf:~/opt$ /usr/bin/time ./st.pl
2.32user 0.00system 0:02.32elapsed 99%CPU 

That’s nearly a full second faster than the Python. For a code that runs around 3 seconds, this is moderately interesting, but not unexpected. Perl is fast.

And, for laughs, let’s pull in Julia.

#!/usr/bin/env julia

function main()
    for j in 0:20
        for i in 1:1000000
            string(i)
        end
    end
end

main()

and running it … as a script … which means the full JIT compilation time is included …

joe@gandalf:~/opt$ /usr/bin/time ./st.jl
1.42user 0.55system 0:01.31elapsed 150%CPU

So … yeah. Python is slower for this function than either Perl or Julia.

The author then optimized the code, though when I tried those optimizations on my machine, they made no real difference to the (pre-C) Python code.

What surprised me is, given that we are multi-core pretty much everywhere, why didn’t the author try to make this a nice multi-core code?
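For what it’s worth, here is one sketch of what that could look like, using Python’s standard multiprocessing module to sidestep the GIL. This is my code, not the article’s, and it is one of several possible approaches.

```python
#!/usr/bin/env python3
# A multi-core sketch of the benchmark using the standard library's
# multiprocessing module. Parallelizes the outer loop across processes.
from multiprocessing import Pool

def inner(_):
    for i in range(1000000):
        str(i)

def main():
    with Pool(4) as pool:           # 4 worker processes
        pool.map(inner, range(20))  # one task per outer iteration

if __name__ == "__main__":
    main()
```

Processes, not threads, because CPU-bound Python threads serialize on the GIL; the trade-off is process startup and pickling overhead.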

Ok, this is where I start thinking about how to optimize a code. What resources do you have to optimize with? How hard is it to get more performance out of the code without changing the programming idiom, or the language for that matter? Because once you start appealing to other languages to save yourself from the performance penalties of your original choice, the question becomes: why continue to optimize something that implicitly isn’t going to go very fast, no matter how hard you try, without changing the underlying paradigm?

That is the question I’ve asked.

You might (and probably should) ask: well … what about this optimization in Julia and Perl? If these are so great, the effort to speed them up should be minimal, and the results should speak for themselves, without needing to change languages.

Funny you should ask that.

Let’s start with Julia.

The way Julia works, you generally want to isolate performance-heavy regions into inner kernels that you call from a higher-level function. So take the inner loop, spin it out into its own function, and then parallelize the outer loop with threads …

#!/usr/bin/env julia

function inner!()
    for i in 1:1000000
        string(i)
    end
end

function main()
    Threads.@threads for j in 0:20
        inner!()
    end
end

main()

And running this with 4 threads …

joe@gandalf:~/opt$ export JULIA_NUM_THREADS=4


joe@gandalf:~/opt$ /usr/bin/time ./st_opt1.jl
2.74user 1.08system 0:01.14elapsed 334%CPU

Remember, this includes the JIT compilation time, the Julia startup time, etc. Looking at this “speedup”, it doesn’t look like there is enough work to keep multiple CPUs busy.
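Plugging the two elapsed times into Amdahl’s law supports that reading. This is a back-of-envelope estimate that treats the 1.31 s elapsed time of the original Julia run as the baseline, which is rough since both runs include startup and JIT.

```python
# Rough Amdahl's-law estimate from the elapsed times above:
# original run ~1.31 s, 4-thread run ~1.14 s.
t1, tn, n = 1.31, 1.14, 4
speedup = t1 / tn
# Amdahl: speedup = 1 / ((1 - p) + p / n); solve for parallel fraction p.
p = (1 - tn / t1) / (1 - 1 / n)
print(f"speedup {speedup:.2f}x, implied parallel fraction {p:.0%}")
# → speedup 1.15x, implied parallel fraction 17%
```

In other words, most of the wall clock here is serial overhead (startup, JIT), not the loop itself.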

But then again, the effort to do this, was, well, trivial.

We could work harder, but, why? This code is generally not worth harder work.

“Ok”, you say, “What about Perl?”. Pythonistas have been bashing Perl syntax and performance since the mid 90s. Python must be faster than Perl. Right? Right?

And making Perl multi-threaded, and able to execute on multiple cores, quickly and efficiently? Impossible you say.

#!/usr/bin/env perl

use strict;
use MCE::Loop;

MCE::Loop::init {
    max_workers => 8,
    chunk_size  => 'auto'
};

sub main {
    mce_loop {
        map { sprintf "%i", $_ } (1 .. 1000000);
    } (0 .. 20);
}

&main();

Let’s see how that performs on my 4-core laptop …

joe@gandalf:~/opt$ /usr/bin/time ./st_opt1.pl
2.67user 0.09system 0:00.51elapsed 538%CPU

Had we been able to statically compile the Julia, I imagine we’d see something similar. But remember, Julia’s time includes Julia’s startup time, and the JIT step as well as the execution. The Perl code is interpreted, there is no JIT.

More importantly, the Perl code was trivially accelerated, replacing one foreach statement with an mce_loop block. This remains an idiomatic single language code.

Similarly, the Julia code was also trivially accelerated, replacing an outer loop with a loop launching threads of the inner loop.

This of course, is not the only way to accelerate this code in Julia. We can use a mechanism I developed for my rzf code, which is also quite trivial, but a tiny bit more involved.

#!/usr/bin/env julia

function inner!(mstart, mend)
    for j in 0:20
        for i in mstart:mend
            string(i)
        end
    end
end

function main()
    N    = 1000000
    Nthr = Threads.nthreads()
    ms   = zeros(Int64, Nthr)
    me   = zeros(Int64, Nthr)

    # split 1:N into Nthr contiguous chunks (integer division)
    for i in 1:Nthr
        me[i] = div(N, Nthr) * i
        ms[i] = 1 + div(N, Nthr) * (i - 1)
    end

    Threads.@threads for t in 1:Nthr
        inner!(ms[t], me[t])
    end
end

main()

When we run this for JULIA_NUM_THREADS=4 we get

joe@gandalf:~/opt$ /usr/bin/time ./st_opt2.jl 
2.73user 1.09system 0:01.10elapsed 345%CPU

So technically, the optimizations put into this Julia version were not really worth the time/effort to do, as they gained nothing (as in statistical noise) in performance.

These codes are idiomatically Perl and Julia. They don’t escape to any second language to obtain performance. They could, if needed, but they don’t. While this particular example is trivial and not really worth the effort the original author, or I for that matter, put into it, it exposes the nature of the point I’ve been trying to make.

You choose the right tools for the job. These tools need to make your life easier, and not get in the way of helping you get better performance. If you need to switch languages for the better performance, you should probably be questioning your initial tool choice.

In this case, though this is a trivial and throw-away bit of code, the author had to resort to a language switch and continued performance optimization in order to keep a semblance of Python associated with the code. Meanwhile, in two other languages, with idiomatic and simple source-level edits and no appeal to external languages, we get code faster than the original, and faster than Python in all cases.

I did look for an MCE::Loop-like module in Python. Most of the multithreading models are variants of pthread interfaces, and a few are a bit of syntactic sugar atop the internal tooling; none of them are trivial. There is also much discussion around how multicore execution implicitly creates race conditions and is generally not advised. While that may be true in Python, it is decidedly not so in Perl and Julia: the necessary paradigms exist within both languages, and the tooling exists to let users trivially exploit them. Pthreads-style interfaces are inherently racy; better interfaces exist. Maybe not in Python, though.
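The closest standard-library analogue I’m aware of is process-based rather than thread-based: concurrent.futures with a chunksize. This is my sketch of what an MCE::Loop-style chunked map might look like in Python; it is an analogy, not a drop-in equivalent.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor

def work(i):
    str(i)

def main():
    # Chunked map over worker processes, the rough analogue of
    # MCE::Loop's max_workers / chunk_size knobs.
    with ProcessPoolExecutor(max_workers=8) as ex:
        # deque(maxlen=0) drains the lazy map iterator so the work runs
        deque(ex.map(work, range(1000000), chunksize=10000), maxlen=0)

if __name__ == "__main__":
    main()
```

It works, but note the shape of it: you tune chunksize by hand to amortize the pickling cost of shipping work between processes, overhead that a threaded chunked map simply doesn’t have.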

The argument I hear for using Python in these use cases centers around the speed of development. Then I see these sorts of herculean optimization efforts, which, in the case of Python, invariably lead to escaping the language. Which raises the question: how is it faster to work in an environment where you eventually have to change source language to obtain the performance you need?

Programmer productivity is important. For non-performance-sensitive code, things that have to interact with an OS/network/services etc., Perl and Python are fine.

For performance sensitive things, Julia is fantastic, often achieving C/Fortran levels of performance in a higher productivity environment. If code I am working on in Perl is slow because parts of Perl are slow, I’ll re-examine the premise upon which I selected Perl. If it’s one small piece that needs acceleration, that’s reasonable. If I have to rewrite the whole thing to achieve needed performance, then I would choose another language.

I won’t undertake a herculean project to rewrite and hook into Perl for a code that is better written in another language. I wouldn’t recommend that for Python either. Neither Perl nor Python are built for speed. They are built for programmer productivity.