Slide 1

Slide 1 text

Optimizing Python - FOSDEMx 2018-05-03

Slide 2

Slide 2 text

Hello! I am Eric Gazoni, Senior Python Developer at Adimian. You can find me at @ericgazoni

Slide 3

Slide 3 text

Why optimize?
And what to optimize

Slide 4

Slide 4 text

I/O
Improve read/write speed from network or filesystem
⬗ Data science (large data sets)
⬗ Databases
⬗ Telemetry (IoT)

Slide 5

Slide 5 text

MEMORY
Require less RAM from the system
⬗ Reduce hosting costs
⬗ Run on constrained devices (embedded systems)
⬗ Improve reliability

Slide 6

Slide 6 text

FAULT TOLERANCE / RESILIENCE
Continue operating even with bad or missing input
⬗ Web services
⬗ Medical devices
⬗ Distributed systems

Slide 7

Slide 7 text

CONCURRENCY
Serve more requests at the same time
⬗ Web servers
⬗ IoT controllers
⬗ Database engines
⬗ Web scrapers

Slide 8

Slide 8 text

CPU
Run code more efficiently
⬗ Reduce processing time (reporting, calculation)
⬗ Reduce response time (web pages)
⬗ Reduce energy consumption (and hosting costs)

Slide 9

Slide 9 text

ONLY ONE AT A TIME
⬗ Pick one category
⬗ Hack
⬗ Review
⬗ Rinse, repeat
Optimizing multiple domains at once = unpredictable results

Slide 10

Slide 10 text

General rules of optimization
They apply to all categories

Slide 11

Slide 11 text

TARGETS
Define clear targets or get lost in the performance maze
⬗ “This page must load below 200ms”
⬗ “One iteration of this loop must execute below 10ms”
⬗ “This must run on a controller with 8KB memory”

Slide 12

Slide 12 text

METRICS
⬗ You know if you improve or make things worse
  ◇ You can definitely make things worse!
⬗ You know if you reached your targets

Slide 13

Slide 13 text

3 RULES OF OPTIMIZATION
⬗ Benchmark
⬗ Benchmark
⬗ Benchmark
“Gut feeling” vs Reality

Slide 14

Slide 14 text

“Trust, but verify”
Russian proverb

Slide 15

Slide 15 text

IT’S A JUNGLE OUT THERE
User land
⬗ Your program
⬗ Implementation of the interpreter (py2/py3/pypy)
⬗ Implementation of the standard library of the language the interpreter is written in (C99/C11/…)

Slide 16

Slide 16 text

IT’S A JUNGLE OUT THERE
Operating system
⬗ Implementation of the OS kernel (linux/windows/unix/…)
⬗ Filesystem layout (ext4/NTFS/BTRFS/...)
⬗ Implementation of the hardware drivers (proprietary Nvidia drivers)

Slide 17

Slide 17 text

IT’S A JUNGLE OUT THERE
Hardware
⬗ CPU architecture (x86/ARM/…)
⬗ CPU extensions (SSE/MMX/…)
⬗ Memory / hard drive technology (spinning/flash/…)
⬗ Temperature (GPU/CPU/RAM/…)
⬗ Network card (Optical/Copper)

Slide 18

Slide 18 text

SAFETY NETS
⬗ Version control: rewind, pinpoint exactly what you did
⬗ Code coverage: make sure you didn’t break something

Slide 19

Slide 19 text


Slide 20

Slide 20 text

THE DEAD END
⬗ No shame in not succeeding
⬗ Know when to stop and change plans
⬗ There is always more than one tool in the box

Slide 21

Slide 21 text

Optimizer tools

Slide 22

Slide 22 text

YOUR TOOL BOX
⬗ Profiler
⬗ Profiling analyzer
⬗ timeit
⬗ Improved interpreter (ipython)
⬗ pytest-profiling

Slide 23

Slide 23 text

CAPTURING PROFILE
⬗ Profilers will capture all calls during program execution
⬗ Only capture what you need (reduce noise)
⬗ Stats (or aggregated calls) can be dumped in pstats binary format

Slide 24

Slide 24 text

PROFILING THE WHOLE PROGRAM
⬗ Will capture a lot of noise
⬗ Not invasive (can be run on any Python script without modifying it)
$ python -m profile -o output.pstats myscript.py

Slide 25

Slide 25 text

NOTE ON PROFILERS
Running code with a profiler is similar to driving with the parking brake on! Don’t forget to disable it when you are done!

Slide 26

Slide 26 text

EMBEDDING THE PROFILER

Slide 27

Slide 27 text

Profiling the complete program - importlib sits at the top

Slide 28

Slide 28 text

Profiling only the interesting function
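
The embedded-profiler code on these slides was shown as an image and is not part of this transcript. A minimal sketch of the idea, assuming the standard cProfile module; interesting_function, profile_interesting and the stats path are hypothetical names, not the talk's own code:

```python
import cProfile


def interesting_function(n):
    # Hypothetical stand-in for the code worth measuring.
    return sum(i * i for i in range(n))


def profile_interesting(n, stats_path):
    # Anything done before enable() (imports, setup) is not recorded.
    profiler = cProfile.Profile()
    profiler.enable()
    result = interesting_function(n)
    profiler.disable()

    # Dump the aggregated calls in pstats binary format for later analysis.
    profiler.dump_stats(stats_path)
    return result
```

Calling profile_interesting(100_000, "output.pstats") records only the call between enable() and disable(), so start-up work done beforehand stays out of the results.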

Slide 29

Slide 29 text

ANALYSIS OF THE PROFILE
1. Dump stats into a file
2. Load the file into gprof2dot
3. Use dot (from the graphviz package) to generate a png/svg representation
https://github.com/jrfonseca/gprof2dot
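
Before rendering a graph, the dumped file can also be read back as text with the standard pstats module. The sketch below is an illustration under assumptions: busy, profile_to_file and top_functions are hypothetical names, and the workload is a placeholder.

```python
import cProfile
import io
import pstats


def busy(n):
    # Hypothetical workload so there is something to measure.
    return sorted(str(i) for i in range(n))


def profile_to_file(path, n=10_000):
    # Step 1: run the workload under the profiler and dump the stats.
    profiler = cProfile.Profile()
    profiler.runcall(busy, n)
    profiler.dump_stats(path)


def top_functions(path, limit=5):
    # Load the pstats file and return the most expensive calls as text,
    # ordered by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(path, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

For steps 2 and 3, the gprof2dot README documents a pipeline of the form gprof2dot -f pstats output.pstats | dot -Tpng -o output.png.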

Slide 30

Slide 30 text

python -m cProfile -o output.pstats myprogram.py

Slide 31

Slide 31 text

python myprogram.py (with profiler enabled within code)

Slide 32

Slide 32 text

%timeit magic command in IPython (shorthand for the timeit module)
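
Outside IPython, the same measurement can be taken with the timeit module directly. A small sketch, assuming two hypothetical functions to compare (join_with_loop and join_with_builtin are illustrative, not from the talk):

```python
import timeit


def join_with_loop(n):
    # Builds the string piece by piece.
    out = ""
    for i in range(n):
        out += str(i)
    return out


def join_with_builtin(n):
    # Same result using str.join.
    return "".join(str(i) for i in range(n))


def best_time(func, n=1_000, number=200, repeat=5):
    # Run func(n) `number` times per trial and keep the fastest trial,
    # which is the least disturbed by other activity on the machine.
    timer = timeit.Timer(lambda: func(n))
    return min(timer.repeat(repeat=repeat, number=number))
```

Comparing best_time(join_with_loop) with best_time(join_with_builtin) replaces the gut feeling with a number; the actual figures depend on the interpreter and the machine.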

Slide 33

Slide 33 text

pytest-profiling
⬗ Useful to run against your unit tests
⬗ Integrated generation of pstats + svg output
https://github.com/manahl/pytest-plugins/tree/master/pytest-profiling
$ py.test test_cracking.py --profile-svg

Slide 34

Slide 34 text

Static analysis
The “low-hanging fruit”

Slide 35

Slide 35 text

LOW-HANGING FRUIT
⬗ Less intrusive
⬗ Low impact on maintenance
⬗ Usually brings the most significant improvements
E.g. reducing the number of calls, removing nested loops

Slide 36

Slide 36 text

EXAMPLE: PASSWORD BRUTE-FORCING
⬗ CPU intensive
⬗ Straightforward
This is very bad cryptography, for demonstration purposes only. Don’t try this at home!

Slide 37

Slide 37 text

VOCABULARY
Hash: function that turns a given input into a given output
Brute-force: attempting random inputs in the hope of finding the one used initially, by comparing against a known output
Salt: additional factor added to increase the size of the input
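
The three terms can be illustrated with a few lines of standard-library Python. This is a sketch under assumptions: the talk does not say which hash function it uses (SHA-256 is picked here), hash_password and brute_force are hypothetical names, and candidates are enumerated systematically instead of randomly.

```python
import hashlib
import string
from itertools import product


def hash_password(cleartext, salt):
    # Hash: the same salt + cleartext always yields the same digest.
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def brute_force(target_hash, salt, alphabet=string.ascii_lowercase, max_length=3):
    # Brute-force: try every candidate up to max_length characters and
    # compare its digest against the known output.
    for length in range(1, max_length + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if hash_password(candidate, salt) == target_hash:
                return candidate
    return None
```

With a known salt the search space is only the candidate passwords; an unknown salt multiplies it by the number of possible salts, which is why the example in the talk also iterates over salts.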

Slide 38

Slide 38 text

EXAMPLE

Slide 39

Slide 39 text


Slide 40

Slide 40 text


Slide 41

Slide 41 text


Slide 42

Slide 42 text


Slide 43

Slide 43 text

numeric_salts() is called 110 times and accounts for ~10% of total time

Slide 44

Slide 44 text

FINDING INVARIANTS
⬗ If A calls B
⬗ And B does not use any input from A’s scope
⬗ Then B does not vary as a function of A
B could be called outside of A without affecting its output
B is invariant

Slide 45

Slide 45 text


Slide 46

Slide 46 text

generate_hashes() uses cleartext from the function scope

Slide 47

Slide 47 text


Slide 48

Slide 48 text

numeric_salts() uses salts_space, provided by caller

Slide 49

Slide 49 text

Extract the numeric_salts() call into the main function and pass only the result (salts)
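
The original code was shown as an image, so the following is only a sketch of the refactoring. The names numeric_salts, generate_hashes, check_password, salts_space and cleartext come from the slides; their bodies, the SHA-256 choice, and the helpers hash_password, check_password_before and main are assumptions made for illustration.

```python
import hashlib


def hash_password(cleartext, salt):
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def numeric_salts(salts_space):
    # Depends only on salts_space, never on the candidate password.
    return [str(number) for number in range(salts_space)]


def generate_hashes(cleartext, salts):
    # Depends on cleartext, so it must stay inside the per-candidate check.
    return {hash_password(cleartext, salt) for salt in salts}


def check_password_before(cleartext, target_hash, salts_space):
    # Before: the invariant salt list is rebuilt for every candidate.
    salts = numeric_salts(salts_space)
    return target_hash in generate_hashes(cleartext, salts)


def check_password(cleartext, target_hash, salts):
    # After: the caller passes the precomputed salts.
    return target_hash in generate_hashes(cleartext, salts)


def main(candidates, target_hash, salts_space):
    salts = numeric_salts(salts_space)  # computed once, outside the loop
    for cleartext in candidates:
        if check_password(cleartext, target_hash, salts):
            return cleartext
    return None
```

Both versions return the same answers; the second simply stops paying for numeric_salts() on every candidate.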

Slide 50

Slide 50 text

numeric_salts() is only called once, and is no longer above profiler threshold (~10%)

Slide 51

Slide 51 text

The UNIX time command reports 99% CPU usage, and a total of 7.379 seconds (wall time)

Slide 52

Slide 52 text

Parallel computing

Slide 53

Slide 53 text

“[...] an embarrassingly parallel [...] problem [...] is one where little or no effort is needed to separate the problem into a number of parallel tasks.”
Wikipedia

Slide 54

Slide 54 text

PARALLEL & SEQUENTIAL PROBLEMS
Parallel: if output from B does not depend on output from A
Sequential: if output from B depends on output from A

Slide 55

Slide 55 text

OUR PROBLEM?
Luckily, password cracking is embarrassingly parallel

Slide 56

Slide 56 text


Slide 57

Slide 57 text

pool.apply_async() will execute check_password on different processes (and CPUs)

Slide 58

Slide 58 text

In each process, we repeat the iterative checks for each salt, but for only 1 password
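
The parallel code itself was shown as an image. A minimal sketch of the pattern described on these two slides, assuming multiprocessing.Pool from the standard library; check_password and pool.apply_async() are named on the slides, while the function bodies, hash_password, crack_parallel and the SHA-256 choice are assumptions:

```python
import hashlib
from multiprocessing import Pool


def hash_password(cleartext, salt):
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def check_password(cleartext, target_hash, salts):
    # Runs inside one worker: iterate over every salt for a single password.
    for salt in salts:
        if hash_password(cleartext, salt) == target_hash:
            return cleartext
    return None


def crack_parallel(candidates, target_hash, salts, processes=4):
    # Each candidate is an independent task, so the tasks can be handed
    # to separate processes without any coordination between them.
    with Pool(processes=processes) as pool:
        pending = [
            pool.apply_async(check_password, (cleartext, target_hash, salts))
            for cleartext in candidates
        ]
        for result in pending:
            match = result.get()
            if match is not None:
                return match
    return None
```

crack_parallel() should be called from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start workers by re-importing the main module.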

Slide 59

Slide 59 text

The UNIX time command reports 353% CPU usage, and a total of 4.328 seconds (wall time)

Slide 60

Slide 60 text

CPU USAGE: single process vs. parallel over 4 cores

Slide 61

Slide 61 text

Throwing more hardware at it
Effective, but often overlooked

Slide 62

Slide 62 text

BETTER SPECS
CPU speed depends on:
⬗ Pipeline architecture
⬗ Clock speed
⬗ L2 cache
Non-parallel problems only need faster CPU clocks

Slide 63

Slide 63 text

PARALLEL + MORE CPUs = WIN
For parallel problems:
⬗ Add CPUs
⬗ Add more computers with more CPUs
  ◇ Need to think about networking, queues, failover, …
http://www.celeryproject.org/

Slide 64

Slide 64 text

High performance libraries
Not reinventing the wheel

Slide 65

Slide 65 text

UNDERSTANDING VECTORS
The iterative sum
⬗ Row after row
⬗ Each line can be different
The vectorized sum
⬗ Data is typed
⬗ Homogeneous dataset
⬗ Optimized operations on rows and columns
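
The contrast between the two sums can be sketched as follows, assuming NumPy is installed; iterative_sum, vectorized_sum and column_sums are hypothetical names used only for this illustration:

```python
import numpy as np


def iterative_sum(rows):
    # Row after row, value after value: every element is a separate
    # Python object whose type is checked at run time.
    total = 0.0
    for row in rows:
        for value in row:
            total += value
    return total


def vectorized_sum(rows):
    # The data is converted once into a typed, homogeneous array,
    # then summed in a single optimized operation.
    data = np.asarray(rows, dtype=np.float64)
    return float(data.sum())


def column_sums(rows):
    # The same typed array supports operations along one axis.
    data = np.asarray(rows, dtype=np.float64)
    return data.sum(axis=0)
```

The conversion to an array has a cost of its own, so the vectorized form pays off when the data already lives in an array or is reused for several operations.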

Slide 66

Slide 66 text

NUMPY
⬗ Centered around ndarray
⬗ Homogeneous type (if possible)
⬗ Non-sparse arrays (shape = rows * columns)
⬗ Close to C / Fortran API
⬗ Efficient numerical operations
⬗ Good integration with Cython
http://www.numpy.org/

Slide 67

Slide 67 text

PANDAS
⬗ Heavily based on NumPy
⬗ Series, DataFrame, Index
⬗ Batteries included:
  ◇ Integrations for reading/writing different formats
  ◇ Date/datetime/timezone handling
⬗ More user-friendly than NumPy
https://pandas.pydata.org/

Slide 68

Slide 68 text

Counting passwords containing the word “eric” in pure Python

Slide 69

Slide 69 text

Pure Python solution finds 16681 matches in 23 seconds

Slide 70

Slide 70 text

Pandas version - No explicit loop
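
Neither version of the counting code survives in this transcript. The sketch below shows one way the two approaches could look, assuming pandas is installed and the passwords are already loaded into a list; count_pure_python and count_pandas are hypothetical names, and the talk's actual code and data loading may differ.

```python
import pandas as pd


def count_pure_python(passwords, word="eric"):
    # Explicit loop: one substring test per password.
    count = 0
    for password in passwords:
        if word in password:
            count += 1
    return count


def count_pandas(passwords, word="eric"):
    # No explicit loop: the substring test is applied to the whole Series,
    # and the resulting booleans are summed.
    series = pd.Series(passwords, dtype="object")
    return int(series.str.contains(word, regex=False).sum())
```

Both functions are case-sensitive and treat the word as a literal substring, so on the same input they return the same count.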

Slide 71

Slide 71 text

Pandas finds 16625 matches in 19 seconds

Slide 72

Slide 72 text

Cython
Reinventing the wheel

Slide 73

Slide 73 text

WHY NOT JUST WRITE C?
⬗ Write C code
⬗ Compile C code
⬗ Use CFFI or ctypes to load and call code
⬗ In “C land”
  ◇ Untangle PyObject yourself
  ◇ No exception mechanism

Slide 74

Slide 74 text

CYTHON
⬗ Precompiles Python code to C
⬗ Automatically links and wraps the code so it can be imported
⬗ Seamless transition between “C” and “Python” contexts
  ◇ Exceptions
  ◇ print()
  ◇ PyObject untangling

Slide 75

Slide 75 text

Regular Python code

Slide 76

Slide 76 text

C-typing variables

Slide 77

Slide 77 text

C-typing function

Slide 78

Slide 78 text

Cython annotate - White = C / Yellow = Python

Slide 79

Slide 79 text

PACKAGING & DISTRIBUTION

Slide 80

Slide 80 text

PyPy
Just in time to save the day

Slide 81

Slide 81 text

WHAT IS JIT OPTIMIZATION
The CPython compiler optimizes bytecode based on guessed processing
What if the compiler could optimize for actual processing?
Just In Time optimization monitors how the code is running and suggests bytecode optimizations on the fly

Slide 82

Slide 82 text

PYPY
⬗ Alternative Python implementation
  ◇ 100% compatible with Python 2.7 & 3.5
  ◇ not 100% compatible with (some) C libraries
⬗ Automatically rewrites internal logic for performance
⬗ Needs lots of data to make better decisions
http://pypy.org/

Slide 83

Slide 83 text

Create 5 million “messages”, count them and check the last one
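
The benchmark script was shown as an image. A sketch of what such a workload might look like, with the Message class, its fields, and the function names all assumed for illustration:

```python
class Message:
    # Hypothetical message type: a number and a short text body.
    def __init__(self, number):
        self.number = number
        self.body = "message #%d" % number


def build_messages(count):
    # Plain Python objects in a plain list: the kind of pure-Python
    # workload a JIT can observe and specialize.
    return [Message(number) for number in range(count)]


def run(count=5_000_000):
    messages = build_messages(count)
    # Count the messages and check the last one.
    return len(messages), messages[-1].body
```

The same file can be run unchanged under CPython and PyPy; with the default count it allocates millions of objects, so a smaller count is more practical for a quick try.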

Slide 84

Slide 84 text

CPython: 20.4 seconds vs PyPy: 6.6 seconds

Slide 85

Slide 85 text

JIT counter example - CPython is faster for 500 messages

Slide 86

Slide 86 text

JIT PROs & CONs
Pros:
⬗ Works on existing codebase
⬗ Ridiculously fast
⬗ Support for NumPy (not yet for Pandas)
Cons:
⬗ No support for pandas
⬗ Another interpreter
⬗ Works best with pure-Python types
⬗ Needs “warm-up”

Slide 87

Slide 87 text

YOU CAN’T HAVE IT ALL
Optimization is always a trade-off with maintainability

Slide 88

Slide 88 text

Summary

Slide 89

Slide 89 text

Summary
⬗ Wide deployment
⬗ “Simple” codebase
  1. Low-hanging fruit
  2. Vectors
  3. Better hardware
⬗ Sequential code
⬗ Limited deployment
  1. Better hardware
  2. PyPy
  3. Cython
⬗ Embarrassingly parallel code
  1. Worker threads
  2. Worker processes
  3. Throw more CPUs

Slide 90

Slide 90 text

Thanks! Any questions? You can find me at @ericgazoni & [email protected]

Slide 91

Slide 91 text

Credits
Special thanks to all the people who made and released these awesome resources for free:
⬗ Presentation template by SlidesCarnival
⬗ Photographs by Unsplash