Performance optimizing Python for #HPC ...

It sounds strange, as Python is not, by any stretch of the imagination, a high performance language.  If you work with a great deal of data, as I do for my $dayjob, and you need to performance optimize your code, you have a few specific options.

But first, as to why you'd consider doing this.  Python is supposed to be an easy language to write.  I disagree, as there are all sorts of odd patterns you have to work with, patterns that often appear to be at odds with each other in how the code is written or behaves.  Some of these are arguably problems with Python libraries/modules.

It is still far better than boilerplate-heavy languages (cough Java cough) that require significant additional text in the code base to accomplish anything.  But Python generally suffers from the design goal of "there's precisely one obvious way to do something," which can lead even experienced devs (I've been writing code since 1981, so yeah, that includes me) to go "huh" at times.  The "one true way" to do what you need may be idiomatic, but it can also be very low performance compared to the non-idiomatic mechanisms.

These days, I find myself keeping web pages open to relevant manual pages (for any language I write in) precisely because of the gap between what I want to do and what the standard library/modules allow me to do.  This is true of Python, C++, Julia, etc.  I try to describe what I want to do in comments ahead of the code block, explaining to future me/others what I was thinking at the time and why, as well as giving specific URLs to things I reference.

But I'm not trying to diss python over its development vagaries.  I am going to talk about patterns and often anti-patterns in higher performance python.

First, as I noted, Python really isn't fast when you are working with lots of data and large data structures.  For small things, yeah, it can be "fast" in the sense that the cost of developing in a compiled language with its own baggage (say, C++) is larger than the cost of running the slower Python code.

Some may take issue with this, and I'll get to some of the arguments they make.  But suffice it to say that idiomatic Python (pure Python code) working with large (many tens to hundreds of GB) data will not be fast, apart from the most trivial calculations (say, counting the number of rows).

Generally, when you are looking to optimize some code, you need to measure its baseline performance with your real workload sizes.  You need to see how it scales from your definition of small data size, through your definition of large data.  

Every person's definition of small and large data may be different.  For some people, huge data sets are maybe 10k rows in a database.  For others, small data sets start at 1PB.

For me, I'm typically working with 5-50 million unique records, with time series datasets of up to a billion rows.  Small to me is 50k unique records.  Large data could be billions or more unique records.  Curiously, when I was running Scalable Informatics (@scaleinfo on twitter), I had to ask every customer what their definition of large vs small was, in order to have a better sense of their needs.  Everyone is different.

So you have a sense of the scale I'm working with.  It's not huge by some people's standards, and it's not small by others'.  But I have to work with this data.

The tooling I am working on is in Python and Julia.  I've been trying to get the Python code performance competitive with the Julia code.  This is not trivial (or even possible) for a number of reasons.  But I'm not going to trash talk Python performance here, simply talk about the patterns/anti-patterns one might run into while tuning.

What it comes down to is this: once you know how your code behaves at various sizes, you can start to think about the options.  There are many options for measuring time spent in code.  You can look those up, though one nice one is pyflame.  Flamegraphs are very useful visual tools for understanding the time costs of code.  I know there are some who would blanch at the thought of running a Perl script (flamegraph.pl) to help generate one, but ... well ... get over it.  It's a good tool (Perl is).

You can also put timing calipers around your code.  You can do this in Jupyter using (for Python) the %time magic on your code, or use time.time() before and after.  In Julia you can use the @time macro.
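As a minimal sketch of the caliper approach (using time.perf_counter rather than time.time, which is a bit better suited to wall-clock deltas; the sort here is just a stand-in for whatever step you actually care about):

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"hash": rng.integers(0, 2**62, size=1_000_000)})

t0 = time.perf_counter()
out = df.sort_values("hash")       # stand-in for the step being measured
dt = time.perf_counter() - t0
print(f"step took {dt:.3f} s")     # repeat this at several data sizes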

One thing you see pretty quickly is where your code spends its time.  And by using a variety of sizes, you can see how this scales.  If, as you increase your data size by a factor of N, your run time (for that section, or for the whole thing) increases as some power of N greater than 1, the thing you should be considering is some sort of algorithmic shift.  That is, the algorithm you are using for this is not scaling well with your data, so you need to rethink what you implemented and how you implemented it.
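A quick way to put a number on "some power of N greater than 1" is a back-of-the-envelope fit between two measurements (the timings below are made up for illustration):

import math

# hypothetical timings measured at two data sizes (rows, seconds)
n1, t1 = 1_000_000, 12.0
n2, t2 = 10_000_000, 900.0

# assume t ~ n**k, so k = log(t2/t1) / log(n2/n1)
k = math.log(t2 / t1) / math.log(n2 / n1)
print(f"observed scaling exponent ~ {k:.2f}")   # well above 1.0 means rethink the algorithm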

This is where the "one true/correct way to do something" can, and will, bite you in the behind.

There are always multiple ways to look at, and to code, a problem.  You need, as the developer, to be cognizant of the trade-offs, and not be limited in what you can implement and how you implement it.

This is often where you also discover that idiomatic python does not perform well, for the problem sizes you need.  It could, if you have generally small data sets.  If so, none of this really matters.  

Naive algorithms, say brute force mechanisms for working through arrays/lists/dictionaries and other data structures, are easy to code in Python.  They aren't good enough for me in most cases.

This results in a pattern of needing to choose an often "non-obvious" way to implement something that should be simple.  The obvious (idiomatically Pythonic) way becomes the anti-pattern.

This is the segue into what people generally suggest for making Python go faster: use things such as numpy (C/Fortran/C++), pandas (C/C++), etc.  That is, to make Python go faster, spend less time in ... er ... Python.

This is a crutch though, as one quickly discovers with (again) large data.  Pandas isn't multithreaded, so even though you have a very large data set in memory, you can't actually share access to it amongst many threads: the GIL keeps pure Python threads from running CPU-bound work in parallel, and pandas doesn't use multiple threads by default either.  There are "vectorized" operations that can be fast, but you run into interesting pathologies with larger data.
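As a small illustration of what "vectorized" means in practice (synthetic columns; the gap is usually large, but measure it on your own data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000_000), "b": rng.random(1_000_000)})

# row-at-a-time: every iteration goes through the Python interpreter
slow = df.apply(lambda row: row["a"] * row["b"], axis=1)

# vectorized: the whole-column operation runs in pandas/numpy C code
fast = df["a"] * df["b"]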

Take the groupby operation.  It works great when you have a small (e.g. under 100k rows, under 100 groups) dataframe.  For my needs, I had to group on a UInt64 hash, with hundreds of thousands (small case) to many tens of millions of groups (medium case).  I haven't tried the large grouping yet.  As you increase the number of groups, the abstractions in place to represent the groups become a significant time expense.  I found this out the hard way.

Looping through all my groups and performing the analysis I needed became a very time expensive proposition (on the order of 20+ minutes for a few hundred thousand groups, with analysis applied to each group).  The solution was not to use the idiomatic method.  Don't use groupby.  Rather, have the data source sort by the hash, then do some array manipulation, array subtraction, and value detection in numpy, which is reasonable for dealing with small-ish arrays (a couple of million elements).
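I won't reproduce the real analysis here, but the general shape of the trick looks something like this (a sketch with synthetic hashes and a placeholder per-group reduction; the np.diff on the sorted keys is the "subtraction and value detection" that finds group boundaries):

import numpy as np

# synthetic stand-in: a column of UInt64 hashes and a value column
rng = np.random.default_rng(0)
hashes = rng.integers(0, 500_000, size=2_000_000, dtype=np.uint64)
values = rng.random(2_000_000)

# sort once by hash so each group becomes a contiguous slice
order = np.argsort(hashes, kind="stable")
h, v = hashes[order], values[order]

# group boundaries: positions where the sorted hash changes value
starts = np.concatenate(([0], np.nonzero(np.diff(h))[0] + 1))
ends = np.concatenate((starts[1:], [len(h)]))   # (starts[i], ends[i]) is the slice for group i

# per-group reduction without ever touching pandas groupby
group_keys = h[starts]                 # one hash per group
group_sums = np.add.reduceat(v, starts)   # placeholder reduction: sum per group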

Using this method, I could perform my analysis in a small number of minutes.  Which is nice, but still slow.

The next pattern is to think about parallelism.  Python is slow, but if we can get many different 'threads' (not necessarily threads, could be processes) going on the same data set, we could reduce the impact of this slowness.  That is, many things being slow, but in parallel.  This isn't a great approach, but it's better than nothing.

In languages that are well designed for parallelism, shared memory threading is a thing, and often it is easy to use.  Julia is an example of this.  That said, you do need to think about the best way to calculate in the presence of real architectural details.  You don't want threads touching memory adjacent to memory being touched by threads on other CPUs, or you'll wind up spending lots of time in the overhead of cache lines bouncing between cores (false sharing).  So some care is needed to maximize performance, but in general it's not that difficult to do.

If we don't use shared memory parallelism, i.e. common data structures shared between various independent threads, we need to distribute the data.  Generally this is a better idea for scaling purposes.  Though ... if your dataset is huge, the data distribution can become a significant overhead, which means you need to do it infrequently.  MPI is the standard HPC framework built specifically to enable this distributed, shared-nothing approach.  It is more involved to do correctly.  And more to the point, Amdahl's law bites hard into any non-parallel portion of your code: scaling will be limited by how much serial time is spent, including parsing, data distribution, etc.
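On the Python side that usually means mpi4py.  A minimal sketch (my example, not anyone's production code) of scattering chunks of work and gathering partial results looks like this; note that the generic scatter/gather calls pickle Python objects under the hood, which is exactly the cost discussed next:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # root splits the work into one chunk per rank
    data = np.arange(10_000_000, dtype=np.float64)
    chunks = np.array_split(data, size)
else:
    chunks = None

# each rank receives its chunk, computes, and the root gathers the partials
chunk = comm.scatter(chunks, root=0)
partial = chunk.sum()
totals = comm.gather(partial, root=0)

if rank == 0:
    print("total:", sum(totals))

You'd run it with something like mpiexec -n 8 python script.py.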

In Python, the data distribution mechanism is pickling: serialize the data structure, send the pickled bytes to the remote process, and unpickle on the other side.  This can be (really) expensive even for what I call small to moderate sized data.  So if I want to send a multi-million row dataframe to another process, that serialize-send-deserialize time is built into the execution time.  Which is unfortunate.
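It's easy to put a caliper on just the serialization half of that cost (a sketch with a synthetic dataframe; the transfer and the unpickle on the far side come on top of this):

import pickle
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"hash": rng.integers(0, 2**62, size=5_000_000),
                   "value": rng.random(5_000_000)})

t0 = time.perf_counter()
blob = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
dt = time.perf_counter() - t0
print(f"pickled {len(blob)/1e6:.0f} MB in {dt:.2f} s")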

MPI wouldn't help much here, unless we have a way to construct an MPI datatype that enabled us to use it as a dataframe.  So the pythonic way is going to be slow to run in parallel, no matter the mechanism.  More on that in a moment.

There are some possible drop-in modules, again built in languages other than Python, that try to address these shortcomings.  Currently in favor is Dask, a distributed memory library that simplifies parallelism with dataframes.  I've tried it and been somewhat underwhelmed performance-wise; it's likely I'd need to rewrite the code significantly to get the performance I need.  There is pola.rs, a shared memory Rust library that provides a (partial) pandas-like API.  I've used it and was impressed that it could get somewhat better performance than Python on my test cases, about 2-3x better, but not overwhelming.  There's vaex, which is also an external library written in a different language.  I've not tried it yet, so I don't have much to comment on.
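For reference, the Dask version of "partition a dataframe and run the same computation on the pieces" looks roughly like this (a sketch with placeholder column names; whether it actually beats plain pandas depends heavily on the workload):

import numpy as np
import pandas as pd
import dask.dataframe as dd

rng = np.random.default_rng(0)
pdf = pd.DataFrame({"hash": rng.integers(0, 1_000_000, size=5_000_000),
                    "value": rng.random(5_000_000)})

# split the pandas frame into partitions Dask can process in parallel
ddf = dd.from_pandas(pdf, npartitions=8)

# the groupby is built lazily, then executed across workers by .compute()
result = ddf.groupby("hash")["value"].sum().compute()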

My focus on using dataframes here is a limiting factor.  Most HPC development does not use this data structure as such.  For the user community I am working with, the dataframe is an excellent abstraction and mental model of what they need, so I'm trying to keep this as a core data structure, and not just part of the presentation layer.  I'm not currently aware of a CRDT version of a dataframe, that enables fast/performant usage.  I'd love to be wrong about this.

With this constraint, the code and data generally are much easier for people to reason about.  But I still need to get better performance, lest I have people running multi/many hour Python runs, which could be trivially replaced by multi-minute Julia runs.  I've developed both, and I've promised the team I'd try to get the Python code to be as fast as possible.  I may have hit a wall with that, though I need to think through this.

Where I am with this is that the pattern of using drop-in libraries isn't giving great results.  The currently favored distributed mechanisms aren't that great either.  So ...

Think about how the calculation proceeds, and break it up into layers of parallelism.  What you want, in general, is to avoid as much of the overhead as you can.  If you need to pay the overhead, only pay it on the most expensive computations.  Profile your code, find the expensive parts, and look carefully at them.

Look for algorithmic shifts (such as I'd indicated previously).  Changing the way you obtain a result can be quite useful, and if done carefully, you can turn slow sections into fast sections.

Look for ways to exploit embarrassingly parallel sections.  The way I constructed the loops, each iteration is independent of the last, so in theory all could be executed at once.  But ... given how the overhead works, you should group calculations together, so each parallel process does all of its work and returns its result, which goes into an independent batch and is later reconstructed into a single dataframe as needed (again, a simple mental model for users).

The pattern for this is decidedly non-Pythonic.

import multiprocessing as mp
import pandas as pd
...
# big list blist, that will be used with dataframe df, to 
# perform some function.
# Nproc = number of processes to distribute to

# create argument bundles: put the arguments for each call into a tuple
N_per_proc = -(-len(blist) // Nproc)    # ceiling division, so no elements are dropped
bundles = []
for proc in range(Nproc):
    i_start = proc * N_per_proc
    i_end = min((proc + 1) * N_per_proc, len(blist))    # slice end is exclusive
    bundles.append((blist[i_start:i_end], df))

# Launch the parallel processes and collect each batch of results
with mp.Pool(Nproc) as pool:
    batches = pool.starmap(function_kernel, bundles)

# reassemble dataframe
df = pd.concat(batches)

(modulo specific syntax bits I've managed to omit or mess up)

This is explicit decomposition of the data into parallel bundles, launching the processes against the function kernel ... think of it as similar to a GPU kernel, though rather than assigning one process (in place of a GPU) per iteration, you assign a bundle of values to each, so as to minimize the total overhead per iteration.  Gather up the results into a set of batches, and reassemble the data structure.

The idea is I want to hide much of this complexity from the user.  Provide them with functions that perform these tasks.  And in some cases, I can.  In others ... not so much.

I'm currently looking at the JuliaCall module on PyPI.  It looks like I could write the functions I need in Julia and call them "natively" as Python functions, much like pandas.  I still have the multi-language problem, but now the other language a) looks a great deal like Python (syntax and usage) to begin with, making the cognitive load much lower, b) does shared memory parallelism right, and c) can interface directly with the Python data structures without (much) marshalling.
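A rough sketch of what that calling pattern looks like with juliacall (the Julia function here is a trivial placeholder, not my actual kernel):

from juliacall import Main as jl
import numpy as np

# define a Julia function in the embedded runtime, then call it from Python
jl.seval("""
function sumsq(x)
    s = 0.0
    for v in x
        s += v * v
    end
    return s
end
""")

x = np.random.default_rng(0).random(1_000_000)
print(jl.sumsq(x))   # call the Julia function directly on the numpy array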

Or I could point people to the Julia code equivalents that do the same thing.  But much faster for the size of the data that we are working with.

This all said, it looks like the large/huge use case I want to work with will require some serious out-of-core work.  This will add another level of parallelism and scheduling.  I have to think about the right way to handle that.

All this said ... I'd seen the codon compiler announcement this past week, and I think I should play with it.  

Basically, what I've found from experience dealing with reasonably sized data sets (as indicated above) is that it's best to avoid pure Pythonic methods for working with them, and instead appeal to external libraries written in other languages.  Performance tuning these codes leads me to conclude that the less time spent computing in Python itself, as compared to in libraries, the better the pattern.
