Hello!
I am Eric Gazoni
I’m a Senior Python Developer at Adimian
You can find me at @ericgazoni
Why optimize?
And what to optimize
I/O
Improve read/write speed from network or filesystem
⬗ Data science (large data sets)
⬗ Databases
⬗ Telemetry (IoT)
MEMORY
Require less RAM from the system
⬗ Reduce hosting costs
⬗ Run on constrained devices (embedded systems)
⬗ Improve reliability
FAULT TOLERANCE / RESILIENCE
Continue operating even with bad or missing input
⬗ Web services
⬗ Medical devices
⬗ Distributed systems
CONCURRENCY
Serve more requests at the same time
⬗ Web servers
⬗ IoT controllers
⬗ Database engines
⬗ Web scrapers
CPU
Run code more efficiently
⬗ Reduce processing time (reporting, calculation)
⬗ Reduce response time (web pages)
⬗ Reduce energy consumption (and hosting costs)
ONLY ONE AT A TIME
⬗ Pick one category
⬗ Hack
⬗ Review
⬗ Rinse, repeat
Optimizing multiple domains at once = unpredictable results
General rules of optimization
These apply to all categories
TARGETS
Define clear targets or get lost in the performance maze
⬗ “This page must load below 200ms”
⬗ “One iteration of this loop must execute below 10ms”
⬗ “This must run on a controller with 8KB memory”
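A target is only useful if something checks it. Below is a minimal sketch of turning the first target into an automated check. It assumes a hypothetical `render_page()` function that stands in for the real code under test. The check keeps the best of several runs, because the fastest run is the one least affected by other activity on the machine.

```python
import time

TARGET_SECONDS = 0.200  # "This page must load below 200ms"


def render_page():
    # Hypothetical stand-in for the code under test.
    return "".join(str(i) for i in range(10_000))


def best_of(func, repeat=5):
    """Return the fastest of `repeat` wall-clock runs of func(), in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


elapsed = best_of(render_page)
print(f"best run: {elapsed * 1000:.1f} ms (target: {TARGET_SECONDS * 1000:.0f} ms)")
# Fails loudly as soon as a change pushes us past the target.
assert elapsed < TARGET_SECONDS, "performance target missed"
```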
METRICS
⬗ You know whether you are improving things or making them worse
◇ You can definitely make things worse!
⬗ You know whether you have reached your targets
3 RULES OF OPTIMIZATION
⬗ Benchmark
⬗ Benchmark
⬗ Benchmark
“Gut feeling” vs Reality
“Trust, but verify”
Russian proverb
IT’S A JUNGLE OUT THERE
User land
⬗ Your program
⬗ Implementation of the interpreter (py2/py3/pypy)
⬗ Implementation of the standard library the interpreter is built on (C99/C11/…)
IT’S A JUNGLE OUT THERE
Operating system
⬗ Implementation of the OS kernel (Linux/Windows/Unix/…)
⬗ Filesystem layout (ext4/NTFS/BTRFS/...)
⬗ Implementation of the hardware drivers (e.g. proprietary Nvidia drivers)
IT’S A JUNGLE OUT THERE
Hardware
⬗ CPU architecture (x86/ARM/…)
⬗ CPU extensions (SSE/MMX/…)
⬗ Memory / hard drive technology (spinning/flash/…)
⬗ Temperature (GPU/CPU/RAM/…)
⬗ Network card (Optical/Copper)
SAFETY NETS
⬗ Version control: rewind, and pinpoint exactly what you changed
⬗ Tests and code coverage: make sure you didn’t break anything
THE DEAD END
⬗ There is no shame in not succeeding
⬗ Know when to stop and change plans
⬗ There is always more than one tool in the box
CAPTURING A PROFILE
⬗ Profilers capture all calls made during program execution
⬗ Only capture what you need (reduce noise)
⬗ Stats (aggregated calls) can be dumped in the binary pstats format
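Below is a minimal sketch of capturing a profile with the standard library's `cProfile` and `pstats` modules. Only the code between `enable()` and `disable()` is recorded. The aggregated stats are dumped to a binary pstats file and then loaded back. `slow_sum()` is a hypothetical workload used only for illustration.

```python
import cProfile
import pstats


def slow_sum(n):
    # Deliberately naive loop, just to give the profiler something to record.
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()            # start capturing calls
result = slow_sum(200_000)
profiler.disable()           # stop capturing: anything after this is not recorded

profiler.dump_stats("output.pstats")   # binary pstats file, readable by other tools

stats = pstats.Stats("output.pstats")  # reload the aggregated stats from disk
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```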
PROFILING THE WHOLE PROGRAM
⬗ Will capture a lot of noise
⬗ Not invasive (works on any Python script, without modifying it)
$ python -m profile -o output.pstats myscript.py
NOTE ON PROFILERS
Running code with a profiler is like driving with the parking brake on!
Don’t forget to disable it when you are done!
EMBEDDING THE PROFILER
Profiling the complete program - importlib sits at the top
Profiling only the interesting function
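One way to profile only the interesting function is to wrap it. The sketch below assumes a hypothetical `profiled()` decorator, which is not part of any library, built on `cProfile.Profile.runcall()`. `setup()` and `interesting()` are placeholder functions.

```python
import cProfile
import functools
import pstats


def profiled(output_path):
    """Hypothetical decorator: profile only calls to the decorated function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            try:
                # runcall() enables the profiler, calls func, then disables it.
                return profiler.runcall(func, *args, **kwargs)
            finally:
                profiler.dump_stats(output_path)
        return wrapper
    return decorator


def setup():
    # Not profiled, so it stays out of the report (less noise).
    return list(range(100_000))


@profiled("interesting.pstats")
def interesting(data):
    return sorted(data, key=lambda x: -x)


data = setup()
top = interesting(data)[0]
pstats.Stats("interesting.pstats").sort_stats("tottime").print_stats(5)
```

Each call overwrites the output file, so this suits a function that is called once per run. For repeated calls, accumulating into a single `Profile` object would be the alternative.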
ANALYSIS OF THE PROFILE
1. Dump stats into a file
2. Load the file into gprof2dot
3. Use dot (from the graphviz package) to generate a png/svg representation
https://github.com/jrfonseca/gprof2dot
python myprogram.py (with profiler enabled within code)
%timeit magic command in IPython (shorthand for the timeit module)
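Outside IPython, you can take the same kind of measurement with the `timeit` module directly. The sketch below compares two hypothetical ways of building a list. `timeit.repeat()` returns one total time per repetition, and the minimum is taken as the least noisy estimate.

```python
import timeit

NUMBER = 500  # executions per repetition
REPEAT = 5    # number of repetitions

loop_stmt = "result = []\nfor i in range(1000):\n    result.append(i * 2)"
comprehension_stmt = "result = [i * 2 for i in range(1000)]"

# Each call returns a list of REPEAT totals; keep the fastest one.
loop_best = min(timeit.repeat(loop_stmt, number=NUMBER, repeat=REPEAT))
comp_best = min(timeit.repeat(comprehension_stmt, number=NUMBER, repeat=REPEAT))

print(f"for loop:      {loop_best / NUMBER * 1e6:.1f} µs per run")
print(f"comprehension: {comp_best / NUMBER * 1e6:.1f} µs per run")
```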
pytest-profiling
⬗ Useful to run against your unit tests
⬗ Integrated generation of pstats + svg output
https://github.com/manahl/pytest-plugins/tree/master/pytest-profiling
$ py.test test_cracking.py --profile-svg
Static analysis
The “low-hanging fruit”
LOW-HANGING FRUIT
⬗ Less intrusive
⬗ Low impact on maintenance
⬗ Usually bring the most significant improvements
E.g.: reducing the number of calls, removing nested loops
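Removing a nested loop often means replacing a repeated linear search with a lookup structure. The sketch below uses hypothetical data; the lists and function names are for illustration only and do not come from the talk. Both functions return the same result. The second one builds a `set` once and then tests membership in constant time on average.

```python
# Hypothetical data: which candidate passwords appear in a leaked list?
leaked = [f"pass{i}" for i in range(2_000)]
candidates = [f"pass{i}" for i in range(0, 4_000, 2)]


def common_nested(candidates, leaked):
    # Nested loops: up to len(candidates) * len(leaked) comparisons.
    found = []
    for candidate in candidates:
        for leak in leaked:
            if candidate == leak:
                found.append(candidate)
                break
    return found


def common_set(candidates, leaked):
    # Build the set once; each membership test is then O(1) on average.
    leaked_set = set(leaked)
    return [candidate for candidate in candidates if candidate in leaked_set]


assert common_nested(candidates, leaked) == common_set(candidates, leaked)
```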
EXAMPLE: PASSWORD BRUTE-FORCING
⬗ CPU intensive
⬗ Straightforward
This is very bad cryptography, for demonstration purposes only.
Don’t do this at home!
VOCABULARY
Hash: a function that turns a given input into a fixed output (the same input always gives the same output)
Brute-force: trying candidate inputs in the hope of finding the one used initially, by comparing their hashes against a known output
Salt: additional data added to the input to increase its size
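The three terms above can be illustrated with a toy sketch. It makes three assumptions: SHA-256 from `hashlib` is the hash (the talk does not say which one it uses), the attacker knows the salt, and `hash_password()` / `brute_force()` are hypothetical names. As stated earlier, this is not how passwords should be stored.

```python
import hashlib
import itertools
import string


def hash_password(cleartext, salt):
    # Hash: deterministic, so the same salted input always gives the same digest.
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def brute_force(known_hash, salt, alphabet=string.ascii_lowercase, max_length=3):
    # Brute force: try every candidate input until its hash matches the known output.
    for length in range(1, max_length + 1):
        for letters in itertools.product(alphabet, repeat=length):
            candidate = "".join(letters)
            if hash_password(candidate, salt) == known_hash:
                return candidate
    return None


salt = "42"  # Salt: extra data mixed into the input before hashing
known = hash_password("abc", salt)
print(brute_force(known, salt))  # → abc
```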
EXAMPLE
numeric_salts() is called 110x, accounts for ~10% of total time
FINDING INVARIANTS
⬗ If A calls B
⬗ And B does not use any input from A’s scope
⬗ Then B does not vary as a function of A
B could be called outside of A without affecting its output
B is invariant
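In practice, the rule means moving the invariant call out of the loop. In the minimal sketch below, `make_salts()` is a hypothetical helper that stands in for `numeric_salts()`, whose real body is not shown. It depends only on `salt_space` and never on the current password, so computing it once gives the same output.

```python
def make_salts(salt_space):
    # Hypothetical helper: depends only on salt_space, never on the loop variable.
    return [str(n) for n in range(salt_space)]


def salted_slow(passwords, salt_space):
    result = []
    for password in passwords:
        salts = make_salts(salt_space)  # invariant: same list rebuilt every iteration
        result.extend(salt + password for salt in salts)
    return result


def salted_fast(passwords, salt_space):
    salts = make_salts(salt_space)      # hoisted: computed once, then reused
    result = []
    for password in passwords:
        result.extend(salt + password for salt in salts)
    return result
```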
generate_hashes() uses cleartext from the function scope
numeric_salts() uses salts_space, provided by the caller
Move the numeric_salts() call into the main function and pass only its result (salts)
numeric_salts() is only called once, and is no longer above the profiler threshold (~10%)
The UNIX time command reports 99% CPU usage, and a total of 7.379 seconds (wall time)
Parallel computing
“[...] an embarrassingly parallel [...] problem [...] is one where little or no effort is needed to separate the problem into a number of parallel tasks.”
Wikipedia
PARALLEL & SEQUENTIAL PROBLEMS
Parallel: if output from B does not depend on output from A
Sequential: if output from B depends on output from A
OUR PROBLEM?
Luckily, password cracking is embarrassingly parallel
pool.apply_async() will execute check_password on different processes (and CPUs)
In each process, we repeat the iterative checks for each salt, but for only 1 password
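Below is a minimal sketch of that structure using `multiprocessing.Pool.apply_async()`: one task per password, and each task loops over every salt. It assumes SHA-256 as the hash and a salt prepended to the cleartext. The `check_password()` signature and the `crack()` helper are hypothetical and are not the talk's code. The `if __name__ == "__main__":` guard is needed on platforms that start worker processes by re-importing the main module.

```python
import hashlib
import multiprocessing


def hash_password(cleartext, salt):
    return hashlib.sha256((salt + cleartext).encode("utf-8")).hexdigest()


def check_password(cleartext, salts, known_hash):
    # Runs in a worker process: iterate over every salt, but for a single password.
    for salt in salts:
        if hash_password(cleartext, salt) == known_hash:
            return cleartext, salt
    return None


def crack(passwords, salts, known_hash, processes=4):
    with multiprocessing.Pool(processes=processes) as pool:
        # One task per password; apply_async() returns an AsyncResult immediately.
        pending = [pool.apply_async(check_password, (pw, salts, known_hash))
                   for pw in passwords]
        for async_result in pending:
            found = async_result.get()  # blocks until that task is done
            if found is not None:
                return found  # leaving the with-block terminates remaining tasks
    return None


if __name__ == "__main__":
    salts = [str(n) for n in range(1_000)]
    passwords = ["letmein", "hunter2", "correcthorse", "qwerty"]
    known = hash_password("hunter2", "737")
    print(crack(passwords, salts, known))  # → ('hunter2', '737')
```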
The UNIX time command reports 353% CPU usage, and a total of 4.328 seconds (wall time)
CPU USAGE
Single process
Parallel over 4 cores
Throwing more hardware at it
Effective, but often overlooked
BETTER SPECS
CPU speed depends on:
⬗ Pipeline architecture
⬗ Clock speed
⬗ L2 cache
Non-parallel problems only benefit from faster individual CPUs (clock speed, pipeline, cache), not from more of them
PARALLEL + MORE CPUs = WIN
For parallel problems:
⬗ Add CPUs
⬗ Add more computers with more CPUs
◇ Need to think about networking, queues, failover, …
http://www.celeryproject.org/
High-performance libraries
Not reinventing the wheel
UNDERSTANDING VECTORS
The iterative sum
⬗ Row after row
⬗ Each line can be different
The vectorized sum
⬗ Data is typed
⬗ Homogeneous dataset
⬗ Optimized operations on rows and columns
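The two approaches can be compared on a tiny dataset with NumPy. The sketch assumes NumPy is installed, and the numbers are for illustration only. The nested loop visits one Python object at a time. `ndarray.sum()` instead runs over a typed, contiguous block of memory, and it can aggregate by column (`axis=0`) or by row (`axis=1`).

```python
import numpy as np

rows = [[1.5, 2.0, 3.25], [4.0, 5.5, 6.0]]

# Iterative sum: visit each row, then each value, one Python object at a time.
total = 0.0
for row in rows:
    for value in row:
        total += value

# Vectorized sum: one typed, homogeneous block of memory; the loop runs in C.
array = np.array(rows, dtype=np.float64)
print(array.sum())         # whole array → 22.25
print(array.sum(axis=0))   # one total per column
print(array.sum(axis=1))   # one total per row
```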
NUMPY
⬗ Centered around ndarray
⬗ Homogeneous type (if possible)
⬗ Dense, non-sparse arrays (size = rows × columns)
⬗ Close to C / Fortran API
⬗ Efficient numerical operations
⬗ Good integration with Cython
http://www.numpy.org/
PANDAS
⬗ Heavily based on NumPy
⬗ Series, DataFrame, Index
⬗ Batteries included:
◇ Integrations for reading/writing different formats
◇ Date/datetime/timezone handling
⬗ More user-friendly than NumPy
https://pandas.pydata.org/
Counting passwords containing the word “eric” in pure Python
Pure Python solution finds 16681 matches in 23 seconds
Pandas version - No explicit loop
Pandas finds 16625 matches in 19 seconds
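Below is a minimal sketch of both approaches on a hypothetical in-memory sample. The talk used a large password list, which is not reproduced here. `Series.str.contains()` applies the test to every element without an explicit Python loop. `regex=False` treats the pattern as plain text, and `na=False` counts missing entries as non-matches. Both versions are case-sensitive.

```python
import pandas as pd

# Hypothetical sample; the talk's password list is not reproduced here.
passwords = ["eric123", "password", "Eric", "ericsson", "qwerty", None]

# Pure Python: explicit loop, skipping missing entries.
count_loop = sum(1 for pw in passwords if pw is not None and "eric" in pw)

# pandas: no explicit loop; the test is applied to the whole Series at once.
series = pd.Series(passwords)
count_pandas = int(series.str.contains("eric", regex=False, na=False).sum())

print(count_loop, count_pandas)  # → 2 2
```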
Cython
Reinventing the wheel
WHY NOT JUST WRITE C?
⬗ Write C code
⬗ Compile C code
⬗ Use CFFI or ctypes to load and call code
⬗ In “C land”
◇ Untangle PyObject yourself
◇ No exception mechanism
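For comparison, here is a minimal sketch of the ctypes route, calling `strlen()` from the C standard library. It assumes a POSIX system (Linux or macOS). Note that the argument and return types must be declared by hand, and Python strings must be converted to bytes before crossing into “C land”.

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. If find_library() returns None, CDLL(None)
# falls back to the symbols already loaded in the process (libc included).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature by hand: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# No automatic conversion: the str must be encoded to bytes first.
print(libc.strlen("brute force".encode("utf-8")))  # → 11
```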
CYTHON
⬗ Compiles Python code to C ahead of time
⬗ Automatically links and wraps the code so it can be imported
⬗ Seamless transition between “C” and “Python” contexts
◇ Exceptions
◇ print()
◇ PyObject untangling
Regular Python code
C-typing variables
C-typing function
Cython annotate (cython -a): white = pure C / yellow = interaction with Python
PACKAGING & DISTRIBUTION
PyPy
Just in time to save the day
WHAT IS JIT OPTIMIZATION?
The CPython compiler optimizes bytecode ahead of time, based only on what it can guess from the source
What if the compiler could optimize for what the code actually does at runtime?
Just-In-Time optimization monitors how the code is running and compiles the hot paths into optimized machine code on the fly
PYPY
⬗ Alternative Python implementation
◇ 100% compatible with Python 2.7 & 3.5
◇ not 100% compatible with (some) C libraries
⬗ Automatically rewrites internal logic for performance
⬗ Needs lots of data to make better decisions
http://pypy.org/
Create 5 million “messages”, count them and check the last one
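The benchmark's code is not reproduced here. A rough sketch, using a hypothetical `Message` class and `run()` helper, could look like the following. Running the same file with `python` and then with `pypy` compares the two interpreters. The sketch includes a small count, where the talk reports CPython to be faster because the JIT needs warm-up.

```python
import time


class Message:
    # Hypothetical message type; the talk's actual code is not shown.
    def __init__(self, index, body):
        self.index = index
        self.body = body


def run(count):
    # Create `count` messages, count them and return the last one's body.
    messages = [Message(i, f"message #{i}") for i in range(count)]
    return len(messages), messages[-1].body


if __name__ == "__main__":
    # The talk used 5 million; reduced here to limit memory use.
    for count in (500, 1_000_000):
        start = time.perf_counter()
        total, last = run(count)
        elapsed = time.perf_counter() - start
        print(f"{count:>9} messages: {elapsed:.3f} s (last: {last!r})")
```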
CPython: 20.4 seconds vs PyPy: 6.6 seconds
JIT counterexample: CPython is faster for 500 messages
JIT PROs & CONs
Pros:
⬗ Works on existing codebase
⬗ Ridiculously fast
⬗ Support for NumPy (not yet for pandas)
Cons:
⬗ No support for pandas
⬗ Another interpreter
⬗ Works best with pure-Python types
⬗ Needs “warm-up”
YOU CAN’T HAVE IT ALL
Optimization is always a trade-off with maintainability
Thanks!
Any questions?
You can find me at @ericgazoni & [email protected]
Credits
Special thanks to all the people who made and released these
awesome resources for free:
⬗ Presentation template by SlidesCarnival
⬗ Photographs by Unsplash