
Spoiler alert: You can make your Python faster ...


Going through a couple of misconceptions about why Python is slow, and how to make it faster.

Given as a talk to the Theoretical Systems Biology group at Imperial College.

Please see the accompanying demo: http://lukauskas.co.uk/article/2014/02/12/how-to-make-python-faster-without-trying-that-much/


Saulius Lukauskas

February 04, 2014


Transcript

  1. Outline

     • Why is Python Slower than C
     • Implementations of Python
     • Idioms are important
     • Demo: minimal-effort-required optimisations of sample code
  2. Why is Python Slower than C? This question gets thrown

     around a lot and receives a wide range of answers. Usually these answers focus on the specifics of the language, e.g. dynamic typing. In most cases, however, there are other reasons.
  3. You code differently in Python This is one of the major

     reasons why Python (and similar languages, e.g. JavaScript) is slow - you just code differently. Let the people writing interpreters for Python worry about the cost of dynamic type checking and other things specific to the language. In the end, it all boils down to data structures and algorithms - you just use different ones in Python.
  4. point = {'x': 0, 'y': 0} For instance, this is

     how any reasonable programmer would code a point in Python. There is nothing bad about this code; it is elegant enough, and it would definitely pass a code review.
  5. struct Point { int x; int y; }; This is

     a perfectly acceptable way of defining a point in C. Again, it is elegant and easy to code.
  6. struct Point { int x; int y; }  !=  point = {'x': 0, 'y': 0}

     Now the problem is that these data structures are not equal. The C structure (on the left) allows instant memory access to the fields x and y, as they are stored at a constant offset from where point is. The Pythonic structure, on the right, is a hash table. In order to access the key 'y' in this structure, the interpreter needs to compute the hash of 'y', which takes a short but nonzero amount of time, and then access the memory slot indexed by hash('y'). While hash functions are usually very efficient, this is still a noticeable overhead compared to the C structure.
  7. Hash Tables in C++

     std::unordered_map<std::string, int> point;
     point["x"] = x;
     point["y"] = y;

     In fact, it would be ridiculous to code this structure in C/C++, and C++ makes sure nobody does that, by making it extremely painful [have a look at the code in the slide]. Python, however, recognises the power of these hash tables (or dictionaries, as they are called there), and lets you use them easily.
  8. Structs in Python?

     class Point(object):
         x, y = None, None

         def __init__(self, x, y):
             self.x, self.y = x, y

     Having said that, Python also provides a way to code a data structure that is similar to a C struct. [Have a look at the slide.] Here, we define two fields, x and y. In other words, we are telling the interpreter that we will always have exactly these two fields in this class, and that it can use this hint; in Python we do not need (and have no way) to declare the variable types, though. The additional __init__ method just says how to initialise these fields. NB: collections.namedtuple classes are designed exactly for these use cases, but let's not go off topic here.
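The namedtuple alternative mentioned in the note above can be sketched as follows (a minimal example, not from the original slides):

```python
from collections import namedtuple

# A struct-like point type with exactly two named fields, x and y.
Point = namedtuple("Point", ["x", "y"])

p = Point(x=1, y=2)
print(p.x, p.y)    # fields are accessed by name, like attributes
print(tuple(p))    # underneath, it is still an ordinary tuple
```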
  9. Why do we not use them? If there is a

     way to define a better structure in Python, why is it that nobody does that?
  10. Objects are slower in standard Python

     def sum_(points):
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point['x']
             sum_y += point['y']
         return sum_x, sum_y

     def sum_(points):
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point.x
             sum_y += point.y
         return sum_x, sum_y

     And this is because the structured way is actually slower in the default implementation of Python.
  11. Objects are slower in standard Python

     def sum_(points):                  # dict version: 186 µs
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point['x']
             sum_y += point['y']
         return sum_x, sum_y

     def sum_(points):                  # object version: 201 µs
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point.x
             sum_y += point.y
         return sum_x, sum_y
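The comparison can be reproduced with timeit; a rough sketch (absolute numbers will differ by machine and interpreter):

```python
import timeit

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

def sum_dicts(points):
    # dict version: key lookup per field
    sum_x, sum_y = 0, 0
    for point in points:
        sum_x += point['x']
        sum_y += point['y']
    return sum_x, sum_y

def sum_objects(points):
    # object version: attribute lookup per field
    sum_x, sum_y = 0, 0
    for point in points:
        sum_x += point.x
        sum_y += point.y
    return sum_x, sum_y

dict_points = [{'x': i, 'y': i} for i in range(1000)]
obj_points = [Point(i, i) for i in range(1000)]

print(timeit.timeit(lambda: sum_dicts(dict_points), number=1000))
print(timeit.timeit(lambda: sum_objects(obj_points), number=1000))
```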
  12. Implementation is the key

     def sum_(points):
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point.x
             sum_y += point.y
         return sum_x, sum_y

     In standard Python, point.x here is essentially a dictionary lookup, not a struct field access. Unfortunately, in standard Python this code is roughly equivalent to looking up 'x' in the instance's dictionary. By definition this is at least as slow as the hash table approach, with the extra attribute-lookup machinery on top. This led the whole community towards a "dicts are lightweight objects" mindset.
  13. Standard Python implementation: CPython

     • Most commonly used implementation of Python
     • "C" stands for, well, C.
     • Can run native C code from Python.
     • Designed to be a reference implementation rather than an optimised one.

     I have been referring to the "standard implementation of Python". Its technical name is "CPython".
  14. Smart Python interpreters: PyPy

     • http://pypy.org/
     • Fast, compliant alternative implementation of the Python language (2.7.3 and 3.2.3)
     • Just-in-time (JIT) compilation
     • On (geometric) average over all benchmarks, 6.2 times faster than CPython
     • Cannot run C code natively
     • Does not support numpy, yet

     There is an optimised implementation of Python that supports just-in-time (JIT) compilation: PyPy. It shows significant speedups compared to CPython, but is not implemented in C and therefore cannot run C code natively (meaning no numpy support for now; they are working on it, though).
  15. Objects are slower in standard Python

     def sum_(points):                  # dict version: 186 µs
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point['x']
             sum_y += point['y']
         return sum_x, sum_y

     def sum_(points):                  # object version: 201 µs
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point.x
             sum_y += point.y
         return sum_x, sum_y

     This is just a recap of the slide shown previously.
  16. Under PyPy

     def sum_(points):                  # dict version: 186 µs (CPython), 21.6 µs (PyPy)
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point['x']
             sum_y += point['y']
         return sum_x, sum_y

     def sum_(points):                  # object version: 201 µs (CPython), 3.75 µs (PyPy)
         sum_x, sum_y = 0, 0
         for point in points:
             sum_x += point.x
             sum_y += point.y
         return sum_x, sum_y

     In PyPy both tests are much faster. It is interesting to see that the object version of the code is 5.76 times faster than the dict version, as PyPy is able to optimise using the hints about the number of fields in the object that we gave earlier.
  17. Appropriate Data Structures make your code easier to optimise To

     recap, using appropriate data structures makes your code easier for compilers to optimise.
  18. Idioms can make the interpreter's job easier Similarly, using coding

     idioms can make the interpreter's job easier (and therefore your code faster).
  19. Idiom #1: Loops

     int ans[N];
     for (int i = 0; i < N; i++) {
         int datapoint = data[i];
         ans[i] = do_something_with(datapoint);
     }

     This is how one would write a loop in C. This is an idiom in C, and I don't think there is another way to do this.
  20. Naive Pythonic implementation

     ans = []
     for i in range(N):
         datapoint = data[i]
         ans.append(do_something_with(datapoint))

     Takes 335 µs to run in my test. Translating the C code to Python naively would result in similar code.
  21. Naive Pythonic implementation

     ans = []
     for i in range(N):
         datapoint = data[i]
         ans.append(do_something_with(datapoint))

     Allocates a list [0, 1, 2, ..., N-1] to loop through. In Python 2, range(N) forces the interpreter to allocate a list of size N in memory, just to loop through it.
  22. Naive Pythonic implementation

     ans = []
     for i in range(N):
         datapoint = data[i]
         ans.append(do_something_with(datapoint))

     Allocates a list [0, 1, 2, ..., N-1] to loop through, just to access the i-th data point. We do not use that allocated list anywhere except in the line where we access the i-th data point. NB: if CPython lists were actually linked lists (as the name suggests), and not dynamic arrays (as they are implemented), the data[i] lookup would be O(n), making this code completely sluggish.
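This eager allocation is Python 2 behaviour; there, the lazy idiom was xrange, which became Python 3's range. A quick way to see the difference under Python 3:

```python
import sys

lazy = range(10**6)          # Python 3: a small, constant-size object
eager = list(range(10**6))   # what Python 2's range(N) built up front

print(sys.getsizeof(lazy))   # a few dozen bytes
print(sys.getsizeof(eager))  # several megabytes
```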
  23. Idiom for looping through lists

     ans = []
     for datapoint in data:
         ans.append(do_something_with(datapoint))

     Takes 290 µs to run in my test (13.4% faster). There is a better way to do this kind of loop: use the Pythonic loop idiom that iterates over the data list directly, so we do not need to allocate a list of indices in memory. Intuitively, given the current position in the list, we always know how to reach the next one; this is true almost by definition. We can see that this makes our code run 13.4% faster already.
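The two loop styles side by side, as a runnable sketch (do_something_with is a hypothetical stand-in for whatever per-element work the slides assume):

```python
def do_something_with(datapoint):
    # hypothetical per-element work
    return datapoint * 2

data = list(range(10))

# index-based loop: the naive translation from C
ans_indexed = []
for i in range(len(data)):
    ans_indexed.append(do_something_with(data[i]))

# the Pythonic idiom: iterate over the list directly
ans_direct = []
for datapoint in data:
    ans_direct.append(do_something_with(datapoint))

print(ans_indexed == ans_direct)  # True
```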
  24. Idiom for applying function to lists

     map(do_something_with, data)

     Takes 189 µs to run in my test (35% faster than the previous version). It is worth noticing that we are just calling do_something_with on each of the elements in data and storing the results in a new list. As you might have guessed, there is an idiom for that: the map function, borrowed from functional languages. Here we could not be more explicit with the interpreter about what we intend to do with data; we explicitly say that we want to apply a function to every element of the list. This would: 1) allow the interpreter to preallocate a list of the correct size immediately (it will be of length len(data), no other), and 2) let it cache the function do_something_with so it does not need to look it up every time. We can see that this runs 35% faster than the previous version (which was already faster than the naive one).
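A small runnable version of the map idiom (do_something_with is again a hypothetical stand-in; note that in Python 2, where these timings were taken, map returned a list directly, while in Python 3 it returns a lazy iterator):

```python
def do_something_with(datapoint):
    # hypothetical per-element work
    return datapoint + 1

data = [1, 2, 3]

# In Python 3, wrap map in list() to materialise the results.
ans = list(map(do_something_with, data))
print(ans)  # [2, 3, 4]
```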
  25. Parallelisation in Python

     import multiprocessing
     p = multiprocessing.Pool()
     p.map(do_something_with, data)

     Using this idiom makes it easy to parallelise your code as well. Flick between the previous slide and this one a couple of times to notice how easy this is, once the correct idioms are used.
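A fuller sketch of the same idea (the worker function must be defined at module level so child processes can import it, and the __main__ guard is required on platforms that spawn rather than fork):

```python
import multiprocessing

def do_something_with(datapoint):
    # hypothetical CPU-bound per-element work
    return datapoint * datapoint

if __name__ == "__main__":
    data = list(range(8))
    # Pool.map has the same shape as the built-in map, but fans the
    # work out across worker processes.
    with multiprocessing.Pool() as pool:
        ans = pool.map(do_something_with, data)
    print(ans)  # [0, 1, 4, 9, 16, 25, 36, 49]
```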
  26. Light Reading: 10 Main Idioms of Python

     • Safe Hammad has recently compiled a list of ten main Python idioms.
     • It is a very short read of best practices.
     • Have a look: http://safehammad.com/downloads/python-idioms-2014-01-16.pdf

     There are more idioms in Python, for instance the "it is easier to ask forgiveness than permission" idiom, which I did not have time to mention in this talk. I suggest reading through Safe Hammad's slides for more info.
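The "easier to ask forgiveness than permission" (EAFP) idiom mentioned above can be sketched as follows, contrasted with the check-first style:

```python
point = {'x': 0, 'y': 0}

# LBYL ("look before you leap"): check first, then act
if 'z' in point:
    z = point['z']
else:
    z = 0

# EAFP ("easier to ask forgiveness than permission"): just act,
# and handle the failure if it happens
try:
    z = point['z']
except KeyError:
    z = 0

print(z)  # 0
```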
  27. Another JIT implementation: Numba

     • http://numba.pydata.org/
     • Unlike PyPy, it is not a separate implementation of Python
     • Supports numpy

     Like PyPy, Numba provides JIT compilation capabilities for Python. Unlike PyPy, it is not a separate implementation of Python, and it has support for numpy.
  28. Another JIT implementation: Numba

     from numba import autojit

     @autojit
     def sum2d(arr):
         M, N = arr.shape
         result = 0.0
         for i in range(M):
             for j in range(N):
                 result += arr[i, j]
         return result

     This is an example of how easy it is to use Numba in your code.
  29. Cython - Python Compiler

     • http://cython.org/
     • Compiles your Python code to C
     • Allows you to add static type declarations to Python
     • Relatively easy to reach C speeds for simple problems
     • Lots of integration with numpy; can use numpy's C backend natively

     Cython could be called a compiler for Python. It compiles your Python code into C directly, allows you to statically type your variables, and can be a good way to optimise your code to C-like speeds without writing C.
  30. Demo: Optimisation without effort The walkthrough for this demo is

    available at http://lukauskas.co.uk/article/2014/02/12/how-to-make-python-faster-without-trying-that-much/
  31. Further Reading / Watching

     • Alex Gaynor, "Why Python, Ruby and Javascript are Slow", Waza 2013
     • http://vimeo.com/61044810
     • Great talk about the same issues covered here, in a bit more detail.
     • Some of my material is heavily inspired by this talk.