ianozsvald
February 23, 2014

The High Performance Python Landscape PyDataLondon2014

Write-up: http://ianozsvald.com/2014/02/23/high-performance-python-at-pydatalondon-2014/

Profiling (runsnake, line_profiler, memory_profiler, memit)
Compiling (Cython, ShedSkin)
JITs (PyPy, Pythran, Numba)


Transcript

  1. www.morconsulting.com
     The High Performance Python Landscape - profiling and fast calculation
     Ian Ozsvald @IanOzsvald MorConsulting.com

  2. What is “high performance”?
     • Profiling to understand system behaviour
     • We often ignore this step...
     • Speeding up the bottleneck
     • Keeps you on one machine (if possible)
     • Keeping team speed high

  3. “High Performance Python”
     • “Practical Performant Programming for Humans”
     • Please join the mailing list via IanOzsvald.com

  4. line_profiler

     Line #      Hits      Time  Per Hit  % Time  Line Contents
     ==============================================================
          9                                       @profile
         10                                       def calculate_z_serial_purepython(maxiter, zs, cs):
         12         1      6870   6870.0     0.0      output = [0] * len(zs)
         13   1000001    781959      0.8     0.8      for i in range(len(zs)):
         14   1000000    767224      0.8     0.8          n = 0
         15   1000000    843432      0.8     0.8          z = zs[i]
         16   1000000    786013      0.8     0.8          c = cs[i]
         17  34219980  36492596      1.1    36.2          while abs(z) < 2 and n < maxiter:
         18  33219980  32869046      1.0    32.6              z = z * z + c
         19  33219980  27371730      0.8    27.2              n += 1
         20   1000000    890837      0.9     0.9          output[i] = n
         21         1         4      4.0     0.0      return output

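     A driver sketch for reproducing this report (the function body is from
     the slide; the file name julia.py and the input grid are illustrative).
     kernprof injects the @profile decorator into builtins, so the script is
     run as "kernprof -l -v julia.py" rather than with plain python:

        # julia.py -- hypothetical driver for the line_profiler run above
        @profile  # supplied by kernprof at runtime, not an import
        def calculate_z_serial_purepython(maxiter, zs, cs):
            output = [0] * len(zs)
            for i in range(len(zs)):
                n = 0
                z = zs[i]
                c = cs[i]
                while abs(z) < 2 and n < maxiter:
                    z = z * z + c
                    n += 1
                output[i] = n
            return output

        if __name__ == "__main__":
            # small illustrative inputs so the profiled run stays quick
            zs = [complex(x / 500.0, y / 500.0)
                  for x in range(-500, 500) for y in range(-1, 1)]
            cs = [complex(-0.62772, -0.42193)] * len(zs)
            calculate_z_serial_purepython(300, zs, cs)
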
  5. memory_profiler

     Line #    Mem usage    Increment  Line Contents
     ================================================
          9   89.934 MiB    0.000 MiB  @profile
         10                            def calculate_z_serial_purepython(maxiter, zs, cs):
         12   97.566 MiB    7.633 MiB      output = [0] * len(zs)
         13  130.215 MiB   32.648 MiB      for i in range(len(zs)):
         14  130.215 MiB    0.000 MiB          n = 0
         15  130.215 MiB    0.000 MiB          z = zs[i]
         16  130.215 MiB    0.000 MiB          c = cs[i]
         17  130.215 MiB    0.000 MiB          while n < maxiter and abs(z) < 2:
         18  130.215 MiB    0.000 MiB              z = z * z + c
         19  130.215 MiB    0.000 MiB              n += 1
         20  130.215 MiB    0.000 MiB          output[i] = n
         21  122.582 MiB   -7.633 MiB      return output

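     The same decorated script can be run under memory_profiler instead of
     kernprof (assuming a pip-installed memory_profiler; %memit is the
     IPython magic named in the deck summary):

        # line-by-line memory report for the whole script:
        #   $ python -m memory_profiler julia.py
        # one-off peak-memory measurement from IPython:
        #   In [1]: %load_ext memory_profiler
        #   In [2]: %memit calculate_z_serial_purepython(300, zs, cs)
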
  6. Profiling possibilities
     • CPU (line by line or by function)
     • Memory (line by line)
     • Disk read/write (with some hacking)
     • Network read/write (with some hacking)
     • mmaps
     • File handles
     • Network connections
     • Cache utilisation via libperf?

  7. Cython 0.20 (pyx annotations)

        #cython: boundscheck=False
        def calculate_z(int maxiter, zs, cs):
            """Calculate output list using Julia update rule"""
            cdef unsigned int i, n
            cdef double complex z, c
            output = [0] * len(zs)
            for i in range(len(zs)):
                n = 0
                z = zs[i]
                c = cs[i]
                while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                    z = z * z + c
                    n += 1
                output[i] = n
            return output

     Pure CPython lists runtime: 12s
     Cython lists runtime: 0.19s
     Cython numpy runtime: 0.16s

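     A build sketch for the pyx above (the file names cythonfn.pyx and
     setup.py are illustrative; cythonize is Cython's standard build helper):

        # setup.py -- compile with: python setup.py build_ext --inplace
        from distutils.core import setup
        from Cython.Build import cythonize

        setup(ext_modules=cythonize("cythonfn.pyx"))

     after which the compiled module imports like any other:

        import cythonfn
        out = cythonfn.calculate_z(300, zs, cs)
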
  8. Cython + numpy + OMP nogil

        #cython: boundscheck=False
        from cython.parallel import parallel, prange
        import numpy as np
        cimport numpy as np

        def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
            cdef unsigned int i, length, n
            cdef double complex z, c
            cdef int[:] output = np.empty(len(zs), dtype=np.int32)
            length = len(zs)
            with nogil, parallel():
                for i in prange(length, schedule="guided"):
                    z = zs[i]
                    c = cs[i]
                    n = 0
                    while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                        z = z * z + c
                        n = n + 1
                    output[i] = n
            return output

     Runtime: 0.05s

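     prange only runs in parallel if the extension is compiled and linked
     against OpenMP; a setup.py sketch assuming gcc (MSVC would use /openmp;
     the module name cythonfn_omp is illustrative):

        from distutils.core import setup
        from distutils.extension import Extension
        from Cython.Build import cythonize

        # pass the OpenMP flag to both the compiler and the linker
        ext = Extension("cythonfn_omp", ["cythonfn_omp.pyx"],
                        extra_compile_args=["-fopenmp"],
                        extra_link_args=["-fopenmp"])
        setup(ext_modules=cythonize([ext]))
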
  9. ShedSkin 0.9.4 annotations

        def calculate_z(maxiter, zs, cs):   # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
            output = [0] * len(zs)          # [list(int)]
            for i in range(len(zs)):        # [__iter(int)]
                n = 0                       # [int]
                z = zs[i]                   # [complex]
                c = cs[i]                   # [complex]
                while n < maxiter and (… < 4):  # [complex]
                    z = z * z + c           # [complex]
                    n += 1                  # [int]
                output[i] = n               # [int]
            return output                   # [list(int)]

     Couldn't we generate Cython pyx?
     Runtime: 0.22s

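     A workflow sketch (the file name julia.py is illustrative): ShedSkin
     infers the bracketed types above from an example call, so the module
     needs a __main__ block that exercises calculate_z with real arguments:

        #   $ shedskin -e julia.py   # translate to C++ as an extension module
        #   $ make                   # build the importable julia extension
        import julia
        out = julia.calculate_z(300, zs, cs)
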
  10. Pythran (0.40)

        #pythran export calculate_z_serial_purepython(int, complex list, complex list)
        def calculate_z_serial_purepython(maxiter, zs, cs):
            …

     Support for OpenMP on numpy arrays. Author Serge made an overnight fix – superb support!
     List runtime: 0.4s

        #pythran export calculate_z(int, complex[], complex[], int[])
        …
        #omp parallel for schedule(dynamic)

     OMP numpy runtime: 0.10s

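     A usage sketch (the file name julia_pythran.py is illustrative): the
     #pythran export comment is the only annotation needed, and compilation
     yields a drop-in native module:

        #   $ pythran julia_pythran.py   # emits a julia_pythran native module
        import julia_pythran
        out = julia_pythran.calculate_z_serial_purepython(300, zs, cs)
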
  11. PyPy nightly (and numpypy)
     • “It just works” on Python 2.7 code
     • Clever list strategies (e.g. unboxed, uniform)
     • Little support for pre-existing C extensions (e.g. the existing numpy)
     • multiprocessing, IPython etc. all work fine
     • Python list code runtime: 0.3s
     • (pypy)numpy support is incomplete, bugs are being tackled (numpy runtime 5s [CPython+numpy 56s])

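     Because PyPy needs no source changes for the pure-Python list version,
     the comparison is just an interpreter swap (timings quoted from slides
     7 and 11; the file name julia.py is illustrative):

        #   $ python julia.py   # CPython list code: ~12s
        #   $ pypy julia.py     # PyPy nightly:      ~0.3s
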
  12. Numba 0.12

        from numba import jit

        @jit(nopython=True)
        def calculate_z_serial_purepython(maxiter, zs, cs, output):
            # couldn't create output, had to pass it in
            # output = numpy.zeros(len(zs), dtype=np.int32)
            for i in xrange(len(zs)):
                n = 0
                z = zs[i]
                c = cs[i]
                # while n < maxiter and abs(z) < 2:  # abs unrecognised
                while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
                    z = z * z + c
                    n += 1
                output[i] = n
            # return output

     Runtime: 0.4s
     Some Python 3 support, some GPU support; prange support missing (was in 0.11)?
     0.12 introduces temporary limitations

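     A driver sketch for the workaround noted in the comments (the array
     values are illustrative): since the nopython-mode function could not
     allocate its result, the caller preallocates output and the jitted code
     fills it in place:

        import numpy as np

        zs = np.array([complex(x / 500.0, 0.0) for x in range(1000)],
                      dtype=np.complex128)
        cs = np.full(len(zs), -0.62772 - 0.42193j, dtype=np.complex128)
        output = np.zeros(len(zs), dtype=np.int32)   # filled in place
        calculate_z_serial_purepython(300, zs, cs, output)
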
  13. Tool Tradeoffs
     • PyPy: no learning curve (pure Py only), an easy win?
     • ShedSkin: easy (pure Py only) but fairly rare
     • Cython on pure Py: hours to learn – team cost low (and lots of online help)
     • Cython with numpy and OMP: days+ to learn – heavy team cost?
     • Numba/Pythran: hours to learn, install a bit tricky (Anaconda easiest for Numba)
     • Pythran OMP: very impressive result for little effort
     • Numba: big toolchain which might hurt productivity?
     • (numexpr not covered – great for numpy and easy to use; see the sketch below)

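     A numexpr sketch of one Julia update step (not benchmarked in the talk;
     the arrays are illustrative): numexpr compiles the whole expression and
     evaluates it in a single multi-threaded pass, avoiding the intermediate
     temporaries that plain numpy would create:

        import numexpr as ne
        import numpy as np

        z = np.zeros(1000000, dtype=np.complex128)
        c = np.full(len(z), -0.62772 - 0.42193j, dtype=np.complex128)
        z = ne.evaluate("z * z + c")                          # one update step
        escaped = ne.evaluate("real(z)**2 + imag(z)**2 > 4")  # divergence test
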
  14. Wrap up
     • Our profiling options should be richer
     • 4-12 physical CPU cores are commonplace
     • The cost of hand-annotating code is reduced agility
     • JITs/AST compilers are getting fairly good; manual intervention still gives the best results
     • BUT CONSIDER: automation should (probably) be embraced ($CPUs < $humans), as team velocity is probably higher

  15. Thank You
     • [email protected]
     • @IanOzsvald
     • MorConsulting.com
     • Annotate.io
     • GitHub/IanOzsvald