Optimizing Python

Presented at FOSDEMx 0 in May 2018.
Aimed at CS students, it gives some high-level pointers for improving code execution speed in several different ways, from multiprocessing to Cython and PyPy.

Eric Gazoni

May 03, 2018

Transcript

  1. Hello! I am Eric Gazoni. I’m Senior Python Developer at Adimian. You can find me at @ericgazoni.
  2. I/O: Improve read/write speed from network or filesystem
    ⬗ Data science (large data sets)
    ⬗ Databases
    ⬗ Telemetry (IoT)
  3. MEMORY: Require less RAM from the system
    ⬗ Reduce hosting costs
    ⬗ Run on constrained devices (embedded systems)
    ⬗ Improve reliability
  4. FAULT TOLERANCE / RESILIENCE: Continue operating even with bad or missing input
    ⬗ Web services
    ⬗ Medical devices
    ⬗ Distributed systems
  5. CONCURRENCY: Serve more requests at the same time
    ⬗ Web servers
    ⬗ IoT controllers
    ⬗ Database engines
    ⬗ Web scrapers
  6. CPU: Run code more efficiently
    ⬗ Reduce processing time (reporting, calculation)
    ⬗ Reduce response time (web pages)
    ⬗ Reduce energy consumption (and hosting costs)
  7. ONLY ONE AT A TIME
    ⬗ Pick one category
    ⬗ Hack
    ⬗ Review
    ⬗ Rinse, repeat
    Optimizing multiple domains at once = unpredictable results
  8. TARGETS: Define clear targets or get lost in the performance maze
    ⬗ “This page must load below 200ms”
    ⬗ “One iteration of this loop must execute below 10ms”
    ⬗ “This must run on a controller with 8KB memory”
  9. METRICS
    ⬗ You know if you improve or make things worse
      ◇ You can definitely make things worse!
    ⬗ You know if you reached your targets
  10. IT’S A JUNGLE OUT THERE: User land
    ⬗ Your program
    ⬗ Implementation of the interpreter (py2/py3/pypy)
    ⬗ Implementation of the interpreter language standard lib (C99/C11/…)
  11. IT’S A JUNGLE OUT THERE: Operating system
    ⬗ Implementation of the OS kernel (linux/windows/unix/…)
    ⬗ Filesystem layout (ext4/NTFS/BTRFS/...)
    ⬗ Implementation of the hardware drivers (proprietary Nvidia drivers)
  12. IT’S A JUNGLE OUT THERE: Hardware
    ⬗ CPU architecture (x86/ARM/…)
    ⬗ CPU extensions (SSE/MMX/…)
    ⬗ Memory / hard drive technology (spinning/flash/…)
    ⬗ Temperature (GPU/CPU/RAM/…)
    ⬗ Network card (Optical/Copper)
  13. SAFETY NETS
    ⬗ Version control: rewind, pinpoint exactly what you did
    ⬗ Code coverage: make sure you didn’t break something
  15. THE DEAD END
    ⬗ No shame for not succeeding
    ⬗ Know when to stop and change plans
    ⬗ There is always more than one tool in the box
  16. YOUR TOOL BOX
    ⬗ Profiler
    ⬗ Profiling analyzer
    ⬗ timeit
    ⬗ Improved interpreter (ipython)
    ⬗ pytest-profiling
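
As an illustration of the timeit entry in the tool box, here is a minimal sketch that compares two ways of building the same string. The functions `concat_loop` and `concat_join` are hypothetical examples that do not come from the talk, and the repeat counts are arbitrary.

```python
import timeit


def concat_loop(n):
    """Build a string by repeated concatenation (hypothetical example)."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def concat_join(n):
    """Build the same string with a single join (hypothetical example)."""
    return "".join(str(i) for i in range(n))


# Run each candidate 200 times per measurement and keep the best of
# 5 measurements, which reduces noise from other activity on the machine.
loop_time = min(timeit.repeat(lambda: concat_loop(1000), number=200, repeat=5))
join_time = min(timeit.repeat(lambda: concat_join(1000), number=200, repeat=5))

print(f"loop: {loop_time:.4f}s  join: {join_time:.4f}s")
```

The absolute numbers depend on the machine and interpreter, so they are only meaningful when compared against each other or against a target you defined beforehand.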
  17. CAPTURING PROFILE
    ⬗ Profilers will capture all calls during program execution
    ⬗ Only capture what you need (reduce noise)
    ⬗ Stats (or aggregated calls) can be dumped in pstats binary format
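
One possible way to capture only what you need is to enable the profiler around the code under investigation and dump the result to a pstats file. This sketch uses the standard library's `cProfile` (the C-accelerated sibling of the `profile` module); the functions `load_candidates` and `count_digits` are hypothetical stand-ins, not code from the talk.

```python
import cProfile
import os
import pstats
import tempfile


def load_candidates():
    """Setup work we do not want in the profile (hypothetical)."""
    return [str(i) for i in range(20_000)]


def count_digits(candidates):
    """The code under investigation (hypothetical)."""
    return sum(len(c) for c in candidates)


candidates = load_candidates()   # runs before the profiler is enabled

profiler = cProfile.Profile()
profiler.enable()                # start capturing calls
total = count_digits(candidates)
profiler.disable()               # stop capturing calls

# Dump the aggregated calls in the pstats binary format.
stats_path = os.path.join(tempfile.mkdtemp(), "output.pstats")
profiler.dump_stats(stats_path)

# The dump can be loaded again later for analysis.
loaded = pstats.Stats(stats_path)
print(loaded.total_calls, "calls captured in", stats_path)
```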
  18. PROFILING THE WHOLE PROGRAM
    ⬗ Will capture a lot of noise
    ⬗ Not invasive (can run against any Python script)
    $ python -m profile -o output.pstats myscript.py
  19. NOTE ON PROFILERS
    Running code with a profiler is similar to driving with the parking brake on! Don’t forget to disable it when you are done!
  20. ANALYSIS OF THE PROFILE
    1. Dump stats into a file
    2. Load the file into gprof2dot
    3. Use dot (from graphviz package) to generate png/svg representation
    https://github.com/jrfonseca/gprof2dot
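
Before (or instead of) rendering a graph, the captured stats can also be inspected as text with the standard library's `pstats` module. This is a minimal sketch, not the workflow shown in the talk: `inner` and `outer` are hypothetical functions, and the profile is kept in memory rather than loaded from a file.

```python
import cProfile
import io
import pstats


def inner(n):
    """Hypothetical hot function."""
    return sum(i * i for i in range(n))


def outer():
    """Hypothetical caller that invokes inner() many times."""
    return [inner(2_000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.runcall(outer)          # profile a single call to outer()

# Sort by cumulative time and print the ten most expensive entries
# into a string buffer instead of stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)

report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the call chains worth looking at first, which is roughly the same information the gprof2dot graph conveys visually.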
  21. pytest-profiling
    ⬗ Useful to run against your unit-tests
    ⬗ Integrated generation of pstats + svg output
    https://github.com/manahl/pytest-plugins/tree/master/pytest-profiling
    $ py.test test_cracking.py --profile-svg
  22. LOW HANGING FRUITS
    ⬗ Less intrusive
    ⬗ Low impact on maintenance
    ⬗ Usually bring the most significant improvements
    E.g.: reducing number of calls, removing nested loops
  23. EXAMPLE: PASSWORD BRUTE-FORCING
    ⬗ CPU intensive
    ⬗ Straightforward
    This is very bad cryptography, only for demonstration purposes. Don’t do this at home!
  24. VOCABULARY
    Hash: function that turns a given input into a given output
    Brute-force: attempting random inputs in the hope of finding the one used initially, by comparing against a known output
    Salt: additional factor added to increase the size of the input
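
The three terms above can be sketched in a few lines. This is not the code from the slides (which are not part of the transcript): the helper names, the choice of SHA-256, the salt value, and the lowercase-only alphabet are all assumptions made for illustration, and the search enumerates candidates in order rather than picking them at random.

```python
import hashlib
import itertools
import string


def hash_password(password, salt):
    """Hash: turn (salt + password) into a fixed hexadecimal digest."""
    return hashlib.sha256((salt + password).encode("utf-8")).hexdigest()


def brute_force(target_hash, salt, max_length=3):
    """Brute-force: try every lowercase candidate up to max_length
    characters and compare its hash against the known output."""
    for length in range(1, max_length + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if hash_password(candidate, salt) == target_hash:
                return candidate
    return None


# The salt is assumed to be known to the attacker in this toy setup.
known_hash = hash_password("abc", salt="x1")
print(brute_force(known_hash, salt="x1"))  # prints "abc"
```

Every extra character in the password multiplies the search space by the alphabet size, which is why this kind of problem is CPU intensive.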
  29. FINDING INVARIANTS
    ⬗ If A calls B
    ⬗ And B does not use any input from A’s scope
    ⬗ Then B does not vary as a function of A
    B could be called outside of A without affecting its output: B is invariant
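
A minimal sketch of this rule, with hypothetical names: `is_known_slow` plays the role of A, `load_known_hashes` plays the role of B. Since B takes nothing from A's scope, it can be called once outside of A.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_known_hashes():
    """B: uses nothing from its caller, so its output never changes."""
    return {sha256_hex(word) for word in ("secret", "hunter2")}


def is_known_slow(candidate):
    """A: calls B on every invocation, repeating identical work."""
    return sha256_hex(candidate) in load_known_hashes()


# B is invariant, so it is called once, outside of A.
KNOWN_HASHES = load_known_hashes()


def is_known_fast(candidate):
    """A, rewritten to reuse the precomputed result of B."""
    return sha256_hex(candidate) in KNOWN_HASHES


words = ["secret", "banana", "hunter2"]
# Both versions give the same answers; only the amount of work differs.
print([is_known_slow(w) for w in words] == [is_known_fast(w) for w in words])
```

This is the kind of change the talk files under low hanging fruits: fewer calls, same output.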
  32. The UNIX time command reports 99% CPU usage, and a total of 7.379 seconds (wall time)
  33. “[...] an embarrassingly parallel [...] problem [...] is one where little or no effort is needed to separate the problem into a number of parallel tasks.” (Wikipedia)
  34. PARALLEL & SEQUENTIAL PROBLEMS
    Parallel: if output from B does not depend on output from A
    Sequential: if output from B depends on output from A
  36. In each process, we repeat the iterative checks for each salt, but for only 1 password
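
The slide's code is not in the transcript, so the following is only a sketch of the idea under stated assumptions: each worker process receives one password hash and runs the full loop over salts for it. The salt list, the candidate list, and the function names are hypothetical.

```python
import hashlib
from multiprocessing import Pool

# Hypothetical search space, kept tiny so the example runs instantly.
SALTS = ["s0", "s1", "s2", "s3"]
CANDIDATES = ["apple", "banana", "cherry", "secret"]


def hash_password(password, salt):
    return hashlib.sha256((salt + password).encode("utf-8")).hexdigest()


def crack_one(target_hash):
    """Work done inside one process: check every salt and every
    candidate, but for a single password hash only."""
    for salt in SALTS:
        for candidate in CANDIDATES:
            if hash_password(candidate, salt) == target_hash:
                return candidate
    return None


def crack_all(target_hashes, processes=4):
    """Hand each password hash to a worker process. The tasks are
    independent, so the problem is embarrassingly parallel."""
    with Pool(processes=processes) as pool:
        return pool.map(crack_one, target_hashes)


if __name__ == "__main__":
    targets = [hash_password("secret", "s2"), hash_password("banana", "s0")]
    print(crack_all(targets))  # prints ['secret', 'banana']
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module.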
  37. The UNIX time command reports 353% CPU usage, and a total of 4.328 seconds (wall time)
  38. BETTER SPECS
    CPU speed depends on:
    ⬗ Pipeline architecture
    ⬗ Clock speed
    ⬗ L2 cache
    Non-parallel problems only need faster CPU clocks
  39. PARALLEL + MORE CPUs = WIN
    For parallel problems:
    ⬗ Add CPUs
    ⬗ Add more computers with more CPUs
      ◇ Need to think about networking, queues, failover, …
    http://www.celeryproject.org/
  40. UNDERSTANDING VECTORS
    The iterative sum
    ⬗ Row after row
    ⬗ Each line can be different
    The vectorized sum
    ⬗ Data is typed
    ⬗ Homogeneous dataset
    ⬗ Optimized operations on rows and columns
  41. NUMPY
    ⬗ Centered around ndarray
    ⬗ Homogeneous type (if possible)
    ⬗ Non-sparse arrays (shape = rows * columns)
    ⬗ Close to C / Fortran API
    ⬗ Efficient numerical operations
    ⬗ Good integration with Cython
    http://www.numpy.org/
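
To make the contrast between the iterative and the vectorized sum concrete, here is a small sketch using NumPy's `ndarray`. The 3x3 data set is made up for illustration and NumPy is assumed to be installed.

```python
import numpy as np

# Hypothetical 3x3 data set.
rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]

# Iterative sum: row after row, value after value, in Python.
iterative_total = 0.0
for row in rows:
    for value in row:
        iterative_total += value

# Vectorized sum: the data is typed and homogeneous, so the whole
# operation runs in NumPy's compiled code.
data = np.array(rows, dtype=np.float64)
vector_total = data.sum()
column_sums = data.sum(axis=0)   # one result per column
row_sums = data.sum(axis=1)      # one result per row

print(iterative_total, vector_total)   # 45.0 45.0
print(column_sums.tolist())            # [12.0, 15.0, 18.0]
print(row_sums.tolist())               # [6.0, 15.0, 24.0]
```

On a data set this small the difference is not measurable; the benefit shows up with large, homogeneous arrays.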
  42. PANDAS
    ⬗ Heavily based on NumPy
    ⬗ Series, DataFrame, Index
    ⬗ Batteries included:
      ◇ Integrations for reading/writing different formats
      ◇ Date/datetime/timezone handling
    ⬗ More user-friendly than NumPy
    https://pandas.pydata.org/
  43. WHY NOT JUST WRITE C?
    ⬗ Write C code
    ⬗ Compile C code
    ⬗ Use CFFI or ctypes to load and call code
    ⬗ In “C land”
      ◇ Untangle PyObject yourself
      ◇ No exception mechanism
  44. CYTHON
    ⬗ Precompiles Python code to C
    ⬗ Automatically links and wraps the code so it can be imported
    ⬗ Seamless transition between “C” and “Python” contexts
      ◇ Exceptions
      ◇ print()
      ◇ PyObject untangling
  45. WHAT IS JIT OPTIMIZATION
    The CPython compiler optimizes bytecode based on guessed processing.
    What if the compiler could optimize for actual processing?
    Just In Time optimization monitors how the code is running and suggests bytecode optimizations on the fly.
  46. PYPY
    ⬗ Alternative Python implementation
      ◇ 100% compatible with Python 2.7 & 3.5
      ◇ not 100% compatible with (some) C libraries
    ⬗ Automatically rewrites internal logic for performance
    ⬗ Needs lots of data to make better decisions
    http://pypy.org/
  47. JIT PROs & CONs
    Pros:
    ⬗ Works on existing codebase
    ⬗ Ridiculously fast
    ⬗ Support for NumPy (not yet for Pandas)
    Cons:
    ⬗ No support for pandas
    ⬗ Another interpreter
    ⬗ Works best with pure-Python types
    ⬗ Needs “warm-up”
  48. Summary
    ⬗ Wide deployment
    ⬗ “Simple” codebase
      1. Low hanging fruits
      2. Vectors
      3. Better hardware
    ⬗ Sequential code
    ⬗ Limited deployment
      1. Better hardware
      2. PyPy
      3. Cython
    ⬗ Embarrassingly parallel code
      1. Worker threads
      2. Worker processes
      3. Throw more CPUs
  49. Credits
    Special thanks to all the people who made and released these awesome resources for free:
    ⬗ Presentation template by SlidesCarnival
    ⬗ Photographs by Unsplash