Optimizing Python

Optimizing Python - FOSDEMx 2018-05-03

Hello! I am Eric Gazoni. I’m a Senior Python Developer at Adimian.
You can find me at @ericgazoni

Why optimize? And what to optimize

I/O
Improve read/write speed from network or filesystem
⬗ Data science (large data sets)
⬗ Databases
⬗ Telemetry (IoT)

MEMORY
Require less RAM from the system
⬗ Reduce hosting costs
⬗ Run on constrained devices (embedded systems)
⬗ Improve reliability

FAULT TOLERANCE / RESILIENCE
Continue operating even with bad or missing input
⬗ Web services
⬗ Medical devices
⬗ Distributed systems

CONCURRENCY
Serve more requests at the same time
⬗ Web servers
⬗ IoT controllers
⬗ Database engines
⬗ Web scrapers

CPU
Run code more efficiently
⬗ Reduce processing time (reporting, calculation)
⬗ Reduce response time (web pages)
⬗ Reduce energy consumption (and hosting costs)

ONLY ONE AT A TIME
⬗ Pick one category
⬗ Hack
⬗ Review
⬗ Rinse, repeat
Optimizing multiple domains at once = unpredictable results

General rules of optimization
Applies to all categories

TARGETS
Define clear targets or get lost in the performance maze
⬗ “This page must load below 200ms”
⬗ “One iteration of this loop must execute below 10ms”
⬗ “This must run on a controller with 8KB memory”

METRICS
⬗ You know if you improve or make things worse
  ◇ You can definitely make things worse!
⬗ You know if you reached your targets

3 RULES OF OPTIMIZATION
⬗ Benchmark
⬗ Benchmark
⬗ Benchmark
“Gut feeling” vs Reality

“Trust, but verify” (Russian proverb)

IT’S A JUNGLE OUT THERE

User land
⬗ Your program
⬗ Implementation of the interpreter (py2/py3/pypy)
⬗ Implementation of the interpreter language's standard library (C99/C11/…)

Operating system
⬗ Implementation of the OS kernel (linux/windows/unix/…)
⬗ Filesystem layout (ext4/NTFS/BTRFS/...)
⬗ Implementation of the hardware drivers (proprietary Nvidia drivers)

Hardware
⬗ CPU architecture (x86/ARM/…)
⬗ CPU extensions (SSE/MMX/…)
⬗ Memory / hard drive technology (spinning/flash/…)
⬗ Temperature (GPU/CPU/RAM/…)
⬗ Network card (Optical/Copper)

SAFETY NETS
⬗ Version control: rewind, pinpoint exactly what you did
⬗ Code coverage: make sure you didn’t break something

THE DEAD END
⬗ No shame in not succeeding
⬗ Know when to stop and change plans
⬗ There is always more than one tool in the box

Optimizer tools

YOUR TOOL BOX
⬗ Profiler
⬗ Profiling analyzer
⬗ timeit
⬗ Improved interpreter (ipython)
⬗ pytest-profiling

CAPTURING PROFILE
⬗ Profilers will capture all calls during program execution
⬗ Only capture what you need (reduce noise)
⬗ Stats (or aggregated calls) can be dumped in pstats binary format

PROFILING THE WHOLE PROGRAM
⬗ Will capture a lot of noise
⬗ Not invasive (can be run on any Python script)
$ python -m profile -o output.pstats myscript.py

NOTE ON PROFILERS
Running code with a profiler is similar to driving with the parking brake on!
Don’t forget to disable it when you are done!

EMBEDDING THE PROFILER
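The code shown on this slide is not part of the transcript. A minimal sketch of embedding the profiler with the standard library's cProfile could look like this; `interesting_function` and `run_profiled` are hypothetical names used only for illustration.

```python
import cProfile


def interesting_function(n):
    """Hypothetical hot spot: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_profiled(func, *args, stats_path="output.pstats"):
    """Profile a single call to func and dump the stats to stats_path."""
    profiler = cProfile.Profile()
    profiler.enable()            # start capturing calls
    try:
        result = func(*args)
    finally:
        profiler.disable()       # release the parking brake
    profiler.dump_stats(stats_path)  # pstats binary format
    return result
```

Only the wrapped call is captured, so the resulting file contains far less noise than a whole-program profile.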

Profiling the complete program - importlib sits at the top

Profiling only the interesting function

ANALYSIS OF THE PROFILE
1. Dump stats into a file
2. Load the file into gprof2dot
3. Use dot (from graphviz package) to generate png/svg representation
https://github.com/jrfonseca/gprof2dot
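When graphviz is not at hand, the same pstats file can also be inspected as text with the standard library's pstats module. This is a swapped-in alternative to the gprof2dot workflow above, not the method shown in the talk; `top_functions` is a hypothetical helper name.

```python
import io
import pstats


def top_functions(stats_path, limit=5):
    """Return a text report of the `limit` costliest entries by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(stats_path, stream=buffer)
    # strip_dirs() shortens file paths, sort_stats() orders the entries,
    # print_stats() writes the report to the stream given above.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Calling `print(top_functions("output.pstats"))` lists the entries with the highest cumulative time first.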

python -m cProfile -o output.pstats myprogram.py

python myprogram.py (with profiler enabled within code)

%timeit magic command in ipython (shorthand for timeit module)
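Outside ipython, the timeit module can be called directly. The sketch below compares two ways of joining strings; both functions and the `benchmark` helper are illustrative assumptions, not code from the talk.

```python
import timeit


def join_with_loop(words):
    """Concatenate with += inside a loop."""
    result = ""
    for word in words:
        result += word
    return result


def join_with_str_join(words):
    """Concatenate with a single str.join call."""
    return "".join(words)


def benchmark(func, words, repeat=3, number=100):
    """Best-of-`repeat` time in seconds for `number` calls to func(words)."""
    timer = timeit.Timer(lambda: func(words))
    return min(timer.repeat(repeat=repeat, number=number))
```

Taking the minimum of several repeats reduces the influence of other activity on the machine, which is the kind of noise the "jungle" slides warn about.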

pytest-profiling
⬗ Useful to run against your unit-tests
⬗ Integrated generation of pstats + svg output
https://github.com/manahl/pytest-plugins/tree/master/pytest-profiling
$ py.test test_cracking.py --profile-svg

Static analysis
The “low hanging fruits”

LOW HANGING FRUITS
⬗ Less intrusive
⬗ Low impact on maintenance
⬗ Usually bring the most significant improvements
E.g.: reducing the number of calls, removing nested loops

EXAMPLE: PASSWORD BRUTE-FORCING
⬗ CPU intensive
⬗ Straightforward
This is very bad cryptography, for demonstration purposes only. Don’t do this at home!

VOCABULARY
Hash: function that turns a given input into a given output
Brute-force: attempting random inputs in the hope of finding the one used initially, by comparing against a known output
Salt: additional factor added to increase the size of the input
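The three terms can be put together in a small sketch. The function names `numeric_salts` and `salts_space` appear later in the talk, but their bodies here, along with `hash_password`, `brute_force` and the choice of SHA-256, are assumptions made for illustration. As the slide says, this is not how real password storage should work.

```python
import hashlib


def numeric_salts(salts_space):
    """Every salt in the search space, as strings: '0', '1', ..."""
    return [str(n) for n in range(salts_space)]


def hash_password(cleartext, salt):
    """Hash: turn salt + cleartext into a fixed-size hex digest."""
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def brute_force(target_hash, candidates, salts_space):
    """Brute-force: try every (candidate, salt) pair against a known hash."""
    for cleartext in candidates:
        # Deliberately naive: the salts are regenerated for every candidate.
        for salt in numeric_salts(salts_space):
            if hash_password(cleartext, salt) == target_hash:
                return cleartext, salt
    return None
```

The nested loop makes the cost grow with the number of candidates multiplied by the number of salts, which is why the problem is CPU intensive.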

EXAMPLE

numeric_salts() is called 110x, accounts for ~10% of total time

FINDING INVARIANTS
⬗ If A calls B
⬗ And B does not use any input from A’s scope
⬗ Then B does not vary as a function of A
B could be called outside of A without affecting its output
B is invariant
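A minimal before/after sketch of hoisting an invariant call, assuming a `numeric_salts` helper like the one named in the talk; the two `count_pairs_*` functions are hypothetical and only count work instead of hashing.

```python
def numeric_salts(salts_space):
    """B: depends only on salts_space, never on the loop variable."""
    return [str(n) for n in range(salts_space)]


def count_pairs_slow(candidates, salts_space):
    """A calls B on every iteration, although B's output never changes."""
    pairs = 0
    for _ in candidates:
        salts = numeric_salts(salts_space)   # same list rebuilt each time
        pairs += len(salts)
    return pairs


def count_pairs_fast(candidates, salts_space):
    """B is invariant, so it is called once, outside the loop."""
    salts = numeric_salts(salts_space)
    pairs = 0
    for _ in candidates:
        pairs += len(salts)
    return pairs
```

Both versions return the same result; the second one calls `numeric_salts` once instead of once per candidate.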

generate_hashes() uses cleartext from the function scope

numeric_salts() uses salts_space, provided by caller

Extract numeric_salts() call into the main function, only pass result (salts)

numeric_salts() is only called once, and is no longer above profiler threshold (~10%)

The UNIX time command reports 99% CPU usage, and a total of 7.379 seconds (wall time)

Parallel computing

“[...] an embarrassingly parallel [...] problem [...] is one where little or no effort is needed to separate the problem into a number of parallel tasks.” (Wikipedia)

PARALLEL & SEQUENTIAL PROBLEMS
Parallel: if output from B does not depend on output from A
Sequential: if output from B depends on output from A

OUR PROBLEM?
Luckily, password cracking is embarrassingly parallel

pool.apply_async() will execute check_password on different processes (and CPUs)

In each process, we repeat the iterative checks for each salt, but for only 1 password
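The slides with the parallel code are not in the transcript. A sketch of the approach described above, using multiprocessing.Pool, might look like this; `check_password` and `pool.apply_async()` are named in the talk, while `hash_password`, `crack_parallel`, the signatures and the choice of SHA-256 are assumptions.

```python
import hashlib
from multiprocessing import Pool


def hash_password(cleartext, salt):
    """Hypothetical hash helper (SHA-256 of salt + cleartext)."""
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def check_password(target_hash, cleartext, salts):
    """Try every salt for ONE candidate; return (cleartext, salt) or None."""
    for salt in salts:
        if hash_password(cleartext, salt) == target_hash:
            return cleartext, salt
    return None


def crack_parallel(target_hash, candidates, salts, processes=4):
    """Submit one check_password task per candidate to a pool of processes."""
    with Pool(processes=processes) as pool:
        pending = [
            pool.apply_async(check_password, (target_hash, cleartext, salts))
            for cleartext in candidates
        ]
        for task in pending:
            match = task.get()       # blocks until this task has finished
            if match is not None:
                return match
    return None
```

On platforms where worker processes are started by spawning a fresh interpreter, `crack_parallel()` should be called from under an `if __name__ == "__main__":` guard so that the workers can import the module safely.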

The UNIX time command reports 353% CPU usage, and a total of 4.328 seconds (wall time)

CPU USAGE: single process vs. parallel over 4 cores

Throwing more hardware at it
Effective, but often overlooked

BETTER SPECS
CPU speed depends on:
⬗ Pipeline architecture
⬗ Clock speed
⬗ L2 cache
Non-parallel problems only need faster CPU clocks

PARALLEL + MORE CPUs = WIN
For parallel problems:
⬗ Add CPUs
⬗ Add more computers with more CPUs
  ◇ Need to think about networking, queues, failover, …
http://www.celeryproject.org/

High performance libraries
Not reinventing the wheel

UNDERSTANDING VECTORS
The iterative sum
⬗ Row after row
⬗ Each line can be different
The vectorized sum
⬗ Data is typed
⬗ Homogeneous dataset
⬗ Optimized operations on rows and columns
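The two kinds of sum can be sketched side by side. This assumes NumPy is installed; `iterative_sum` and `vectorized_sum` are hypothetical names.

```python
import numpy as np


def iterative_sum(rows):
    """Row after row: each value is handled as a separate Python object."""
    total = 0.0
    for value in rows:
        total += value
    return total


def vectorized_sum(rows):
    """Typed, homogeneous data: the loop runs inside NumPy's compiled code."""
    values = np.asarray(rows, dtype=np.float64)
    return float(values.sum())
```

In practice the data would be kept in an ndarray from the start, so that the conversion cost of `np.asarray` is paid only once.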

NUMPY
⬗ Centered around ndarray
⬗ Homogeneous type (if possible)
⬗ Non-sparse arrays (shape = rows * columns)
⬗ Close to C / Fortran API
⬗ Efficient numerical operations
⬗ Good integration with Cython
http://www.numpy.org/

PANDAS
⬗ Heavily based on NumPy
⬗ Series, DataFrame, Index
⬗ Batteries included:
  ◇ Integrations for reading/writing different formats
  ◇ Date/datetime/timezone handling
⬗ More user-friendly than NumPy
https://pandas.pydata.org/

Counting passwords containing the word “eric” in pure Python
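The slide's code is not in the transcript; a pure Python count with an explicit loop could be as simple as the sketch below. `count_matches` is a hypothetical name, and the case-sensitive match is an assumption.

```python
def count_matches(passwords, word="eric"):
    """Count the passwords that contain `word` (case-sensitive)."""
    count = 0
    for password in passwords:
        if word in password:
            count += 1
    return count
```

For example, `count_matches(["eric1", "generic", "Eric", "bob"])` counts "eric1" and "generic" but not "Eric".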

Pure Python solution finds 16681 matches in 23 seconds

Pandas version - No explicit loop
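A sketch of what a loop-free pandas version might look like, assuming pandas is installed. `count_matches_pandas` is a hypothetical name and the talk's actual code may differ.

```python
import pandas as pd


def count_matches_pandas(passwords, word="eric"):
    """Count the passwords containing `word` without an explicit Python loop."""
    series = pd.Series(passwords, dtype="object")
    # str.contains returns a boolean Series; summing it counts the True values.
    # regex=False asks for a plain substring match.
    return int(series.str.contains(word, regex=False).sum())
```

The loop still exists, but it runs inside pandas rather than in user code.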

Pandas finds 16625 matches in 19 seconds

Cython
Reinventing the wheel

WHY NOT JUST WRITE C?
⬗ Write C code
⬗ Compile C code
⬗ Use CFFI or ctypes to load and call code
⬗ In “C land”
  ◇ Untangle PyObject yourself
  ◇ No exception mechanism

CYTHON
⬗ Precompiles Python code into C
⬗ Automatically links and wraps the code so it can be imported
⬗ Seamless transition between “C” and “Python” contexts
  ◇ Exceptions
  ◇ print()
  ◇ PyObject untangling

Regular Python code

C-typing variables

C-typing function

Cython annotate - White = C / Yellow = Python

PACKAGING & DISTRIBUTION

PyPy
Just in time to save the day

WHAT IS JIT OPTIMIZATION
The CPython compiler optimizes bytecode based on guesses about how the code will run
What if the compiler could optimize for how the code actually runs?
Just In Time optimization monitors how the code is running and suggests bytecode optimizations on the fly

PYPY
⬗ Alternative Python implementation
  ◇ 100% compatible with Python 2.7 & 3.5
  ◇ not 100% compatible with (some) C libraries
⬗ Automatically rewrites internal logic for performance
⬗ Needs lots of data to make better decisions
http://pypy.org/

Create 5 million “messages”, count them and check the last one
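The benchmark code itself is not in the transcript. A pure-Python workload of this shape, the kind a JIT tends to handle well, could be sketched as follows; the message structure and both function names are assumptions.

```python
def build_messages(count):
    """Create `count` small dict 'messages' in a plain Python loop."""
    messages = []
    for i in range(count):
        messages.append({"id": i, "body": "message %d" % i})
    return messages


def count_and_check_last(count):
    """Build the messages, then return how many there are and the last id."""
    messages = build_messages(count)
    return len(messages), messages[-1]["id"]
```

Running `count_and_check_last(5_000_000)` under both interpreters, and again with a small count such as 500, is one way to observe the warm-up effect for yourself.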

CPython: 20.4 seconds vs PyPy: 6.6 seconds

JIT counter example - CPython is faster for 500 messages

JIT PROs & CONs
Pros:
⬗ Works on existing codebase
⬗ Ridiculously fast
⬗ Support for NumPy (not yet for pandas)
Cons:
⬗ No support for pandas
⬗ Another interpreter
⬗ Works best with pure-Python types
⬗ Needs “warm-up”

YOU CAN’T HAVE IT ALL
Optimization is always a trade-off with maintainability

Summary

⬗ Wide deployment
⬗ “Simple” codebase
  1. Low hanging fruits
  2. Vectors
  3. Better hardware
⬗ Sequential code
⬗ Limited deployment
  1. Better hardware
  2. PyPy
  3. Cython
⬗ Embarrassingly parallel code
  1. Worker threads
  2. Worker processes
  3. Throw more CPUs

Thanks! Any questions?
You can find me at @ericgazoni & [email protected]

Credits
Special thanks to all the people who made and released these awesome resources for free:
⬗ Presentation template by SlidesCarnival
⬗ Photographs by Unsplash
