The High Performance Python Landscape PyDataLondon2014

Write-up: http://ianozsvald.com/2014/02/23/high-performance-python-at-pydatalondon-2014/

Profiling (runsnake, line_profiler, memory_profiler, memit)
Compiling (Cython, ShedSkin)
JITs (PyPy, Pythran, Numba)

ianozsvald

February 23, 2014

Transcript

  1. www.morconsulting.com
     The High Performance Python Landscape - profiling and fast calculation
     Ian Ozsvald @IanOzsvald MorConsulting.com

  2. [email protected] @IanOzsvald PyDataLondon February 2014
     What is “high performance”?
     • Profiling to understand system behaviour
     • We often ignore this step...
     • Speeding up the bottleneck
     • Keeps you on 1 machine (if possible)
     • Keeping team speed high

  3. [email protected] @IanOzsvald PyDataLondon February 2014
     “High Performance Python”
     • “Practical Performant Programming for Humans”
     • Please join the mailing list via IanOzsvald.com

  4. [email protected] @IanOzsvald PyDataLondon February 2014
     line_profiler

     Line #      Hits        Time  Per Hit  % Time  Line Contents
     ==============================================================
          9                                         @profile
         10                                         def calculate_z_serial_purepython(maxiter, zs, cs):
         12         1        6870   6870.0     0.0      output = [0] * len(zs)
         13   1000001      781959      0.8     0.8      for i in range(len(zs)):
         14   1000000      767224      0.8     0.8          n = 0
         15   1000000      843432      0.8     0.8          z = zs[i]
         16   1000000      786013      0.8     0.8          c = cs[i]
         17  34219980    36492596      1.1    36.2          while abs(z) < 2 and n < maxiter:
         18  33219980    32869046      1.0    32.6              z = z * z + c
         19  33219980    27371730      0.8    27.2              n += 1
         20   1000000      890837      0.9     0.9          output[i] = n
         21         1           4      4.0     0.0      return output

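Per-line output like the above comes from running the script under kernprof, which injects the @profile decorator into builtins. A minimal sketch, assuming the code lives in a hypothetical julia.py:

    # julia.py -- run as:  kernprof -l -v julia.py
    # (2014-era installs name the launcher kernprof.py)
    # kernprof injects @profile at run time, so no import is needed.

    @profile
    def calculate_z_serial_purepython(maxiter, zs, cs):
        """Julia set calculation; the while loop dominates the runtime."""
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while abs(z) < 2 and n < maxiter:
                z = z * z + c
                n += 1
            output[i] = n
        return output
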
  5. [email protected] @IanOzsvald PyDataLondon February 2014
     memory_profiler

     Line #   Mem usage    Increment   Line Contents
     ================================================
          9    89.934 MiB    0.000 MiB  @profile
         10                             def calculate_z_serial_purepython(maxiter, zs, cs):
         12    97.566 MiB    7.633 MiB      output = [0] * len(zs)
         13   130.215 MiB   32.648 MiB      for i in range(len(zs)):
         14   130.215 MiB    0.000 MiB          n = 0
         15   130.215 MiB    0.000 MiB          z = zs[i]
         16   130.215 MiB    0.000 MiB          c = cs[i]
         17   130.215 MiB    0.000 MiB          while n < maxiter and abs(z) < 2:
         18   130.215 MiB    0.000 MiB              z = z * z + c
         19   130.215 MiB    0.000 MiB              n += 1
         20   130.215 MiB    0.000 MiB          output[i] = n
         21   122.582 MiB   -7.633 MiB      return output

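memory_profiler reuses the same @profile decorator; a minimal sketch of how a report like the above is produced, again assuming the hypothetical julia.py (line-by-line memory sampling makes the run far slower than normal):

    # Line-by-line memory report for every @profile-decorated function:
    #   $ python -m memory_profiler julia.py
    #
    # The %memit IPython magic (the "memit" mentioned in the summary above)
    # gives a quick peak-usage figure for a single statement instead:
    #   In [1]: %load_ext memory_profiler
    #   In [2]: %memit calculate_z_serial_purepython(300, zs, cs)
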
  6. [email protected] @IanOzsvald PyDataLondon February 2014
     Profiling possibilities
     • CPU (line by line or by function)
     • Memory (line by line)
     • Disk read/write (with some hacking)
     • Network read/write (with some hacking)
     • mmaps
     • File handles
     • Network connections
     • Cache utilisation via libperf?

  7. [email protected] @IanOzsvald PyDataLondon February 2014
     Cython 0.20 (pyx annotations)

     #cython: boundscheck=False
     def calculate_z(int maxiter, zs, cs):
         """Calculate output list using Julia update rule"""
         cdef unsigned int i, n
         cdef double complex z, c
         output = [0] * len(zs)
         for i in range(len(zs)):
             n = 0
             z = zs[i]
             c = cs[i]
             while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                 z = z * z + c
                 n += 1
             output[i] = n
         return output

     Pure CPython lists runtime 12s
     Cython lists runtime 0.19s
     Cython numpy runtime 0.16s

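A .pyx file like this is typically compiled via a small setup.py; a minimal sketch, assuming the code above is saved as a hypothetical cythonfn.pyx:

    # setup.py -- build with:  python setup.py build_ext --inplace
    from distutils.core import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("cythonfn.pyx"))  # hypothetical filename

    # Afterwards the compiled module imports like normal Python:
    #   >>> from cythonfn import calculate_z
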
  8. [email protected] @IanOzsvald PyDataLondon February 2014
     Cython + numpy + OMP nogil

     #cython: boundscheck=False
     from cython.parallel import parallel, prange
     import numpy as np
     cimport numpy as np

     def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
         cdef unsigned int i, length, n
         cdef double complex z, c
         cdef int[:] output = np.empty(len(zs), dtype=np.int32)
         length = len(zs)
         with nogil, parallel():
             for i in prange(length, schedule="guided"):
                 z = zs[i]
                 c = cs[i]
                 n = 0
                 while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                     z = z * z + c
                     n = n + 1
                 output[i] = n
         return output

     Runtime 0.05s

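prange only parallelises if the extension is built with OpenMP enabled, which needs extra compiler and linker flags; a minimal setup.py sketch for gcc (MSVC and clang need different switches), assuming a hypothetical cython_omp.pyx:

    # setup.py -- build with:  python setup.py build_ext --inplace
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize
    import numpy as np

    ext = Extension(
        "cython_omp", ["cython_omp.pyx"],   # hypothetical module name
        extra_compile_args=["-fopenmp"],    # generate OpenMP code
        extra_link_args=["-fopenmp"],       # link the OpenMP runtime
        include_dirs=[np.get_include()],    # headers for `cimport numpy`
    )
    setup(ext_modules=cythonize([ext]))
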
  9. [email protected] @IanOzsvald PyDataLondon February 2014
     ShedSkin 0.9.4 annotations

     def calculate_z(maxiter, zs, cs):
         # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
         output = [0] * len(zs)              # [list(int)]
         for i in range(len(zs)):            # [__iter(int)]
             n = 0                           # [int]
             z = zs[i]                       # [complex]
             c = cs[i]                       # [complex]
             while n < maxiter and (… < 4):  # [complex]
                 z = z * z + c               # [complex]
                 n += 1                      # [int]
             output[i] = n                   # [int]
         return output                       # [list(int)]

     Couldn't we generate Cython pyx?
     Runtime 0.22s

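ShedSkin infers the [type] comments above from a concrete example run, so the module needs a driver that exercises the function; a rough sketch of the workflow, assuming the file is a hypothetical julia_shedskin.py containing calculate_z from the slide:

    # Appended to the bottom of julia_shedskin.py so type inference
    # sees real argument types for every parameter:
    if __name__ == "__main__":
        zs = [0j, 0.5 + 0.5j]
        cs = [-0.62772 - 0.42193j] * len(zs)
        print(calculate_z(10, zs, cs))

    # Translate to C++ and compile (-e builds an importable extension):
    #   $ shedskin -e julia_shedskin.py
    #   $ make
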
  10. [email protected] @IanOzsvald PyDataLondon February 2014
      Pythran (0.4.0)

      #pythran export calculate_z_serial_purepython(int, complex list, complex list)
      def calculate_z_serial_purepython(maxiter, zs, cs):
          …

      Support for OpenMP on numpy arrays – author Serge made an overnight fix, superb support!
      List runtime 0.4s

      #pythran export calculate_z(int, complex[], complex[], int[])
      …
      #omp parallel for schedule(dynamic)

      OMP numpy runtime 0.10s

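The elided body above is the same pure-Python Julia function as on the earlier slides; a minimal end-to-end sketch, assuming a hypothetical julia_pythran.py:

    # julia_pythran.py -- ordinary Python plus one Pythran comment.
    #pythran export calculate_z_serial_purepython(int, complex list, complex list)
    def calculate_z_serial_purepython(maxiter, zs, cs):
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z = zs[i]
            c = cs[i]
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n += 1
            output[i] = n
        return output

    # Compile to a native module (add -fopenmp so #omp comments take effect):
    #   $ pythran julia_pythran.py
    #   >>> import julia_pythran
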
  11. [email protected] @IanOzsvald PyDataLondon February 2014
      PyPy nightly (and numpypy)
      • “It just works” on Python 2.7 code
      • Clever list strategies (e.g. unboxed, uniform)
      • Little support for pre-existing C extensions (e.g. the existing numpy)
      • multiprocessing, IPython etc. all work fine
      • Python list code runtime: 0.3s
      • (pypy)numpy support is incomplete and bugs are being tackled (numpy runtime 5s [CPython+numpy 56s])

  12. [email protected] @IanOzsvald PyDataLondon February 2014
      Numba 0.12

      from numba import jit

      @jit(nopython=True)
      def calculate_z_serial_purepython(maxiter, zs, cs, output):
          # couldn't create output, had to pass it in
          # output = numpy.zeros(len(zs), dtype=np.int32)
          for i in xrange(len(zs)):
              n = 0
              z = zs[i]
              c = cs[i]
              #while n < maxiter and abs(z) < 2:  # abs unrecognised
              while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
                  z = z * z + c
                  n += 1
              output[i] = n
          #return output

      Runtime 0.4s
      Some Python 3 support, some GPU support; prange support missing (was in 0.11)? 0.12 introduces temporary limitations.

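Because this 0.12-era nopython function cannot allocate its own result array, the caller pre-allocates it and the function fills it in place; a minimal calling sketch (the array values are illustrative):

    import numpy as np

    zs = np.array([0j, 0.5 + 0.5j], dtype=np.complex128)
    cs = np.array([-0.62772 - 0.42193j] * len(zs), dtype=np.complex128)
    output = np.zeros(len(zs), dtype=np.int32)  # pre-allocated result

    calculate_z_serial_purepython(300, zs, cs, output)  # fills output in place
    print(output)
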
  13. [email protected] @IanOzsvald PyDataLondon February 2014
      Tool tradeoffs
      • PyPy: no learning curve (pure Py only) – easy win?
      • ShedSkin: easy (pure Py only) but fairly rare
      • Cython pure Py: hours to learn – team cost low (and lots of online help)
      • Cython numpy OMP: days+ to learn – heavy team cost?
      • Numba/Pythran: hours to learn; install a bit tricky (Anaconda easiest for Numba)
      • Pythran OMP: very impressive result for little effort
      • Numba: big toolchain which might hurt productivity?
      • (numexpr not covered – great for numpy and easy to use; see the sketch below)

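As promised in the last bullet, a minimal numexpr sketch (the array names and expression are illustrative): it compiles a string expression and evaluates it in one multi-threaded pass, avoiding numpy's intermediate temporaries:

    import numexpr as ne
    import numpy as np

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    # One fused, multi-threaded evaluation instead of several numpy temporaries:
    result = ne.evaluate("2 * a**2 + 3 * b + 1")
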
  14. [email protected] @IanOzsvald PyDataLondon February 2014
      Wrap up
      • Our profiling options should be richer
      • 4-12 physical CPU cores are commonplace
      • The cost of hand-annotating code is reduced agility
      • JITs/AST compilers are getting fairly good, but manual intervention still gives the best results
      BUT! CONSIDER:
      • Automation should (probably) be embraced ($CPUs < $humans) as team velocity is probably higher

  15. [email protected] @IanOzsvald PyDataLondon February 2014
      Thank You
      • [email protected]
      • @IanOzsvald
      • MorConsulting.com
      • Annotate.io
      • GitHub/IanOzsvald