
High Performance Python Landscape (PyDataLondon Sept 2014)

ianozsvald
September 03, 2014


Transcript

  1. The High Performance Python Landscape - profiling and fast calculation
     Ian Ozsvald @IanOzsvald MorConsulting.com
  2. What is “high performance”?
     • Profiling to understand system behaviour
     • We often ignore this step...
     • Speeding up the bottleneck
     • Keeps you on 1 machine (if possible)
     • Keeping team speed high
  3. “High Performance Python”
     • “Practical Performant Programming for Humans”
     • Please join the mailing list via IanOzsvald.com
  4. line_profiler

     Line #    Hits      Time      Per Hit   % Time   Line Contents
     ==============================================================
      9                                               @profile
     10                                               def calculate_z_serial_purepython(
                                                          maxiter, zs, cs):
     12        1         6870      6870.0    0.0          output = [0] * len(zs)
     13        1000001   781959    0.8       0.8          for i in range(len(zs)):
     14        1000000   767224    0.8       0.8              n = 0
     15        1000000   843432    0.8       0.8              z = zs[i]
     16        1000000   786013    0.8       0.8              c = cs[i]
     17        34219980  36492596  1.1       36.2             while abs(z) < 2 and n < maxiter:
     18        33219980  32869046  1.0       32.6                 z = z * z + c
     19        33219980  27371730  0.8       27.2                 n += 1
     20        1000000   890837    0.9       0.9              output[i] = n
     21        1         4         4.0       0.0          return output
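     line_profiler is usually driven from the command line via kernprof; a
     minimal sketch, assuming the Julia code above lives in julia1.py (the
     file name is an assumption):

         # Command-line usage (kernprof ships with line_profiler):
         #   $ kernprof -l -v julia1.py
         # Programmatic usage:
         from line_profiler import LineProfiler

         def busy_loop(n):
             total = 0
             for i in range(n):
                 total += i * i
             return total

         lp = LineProfiler()
         profiled = lp(busy_loop)   # wrap the function to be measured
         profiled(1000000)
         lp.print_stats()           # prints a per-line table like the one above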
  5. memory_profiler

     Line #   Mem usage    Increment    Line Contents
     ================================================
      9       89.934 MiB   0.000 MiB    @profile
     10                                 def calculate_z_serial_purepython(
                                            maxiter, zs, cs):
     12       97.566 MiB   7.633 MiB        output = [0] * len(zs)
     13       130.215 MiB  32.648 MiB       for i in range(len(zs)):
     14       130.215 MiB  0.000 MiB            n = 0
     15       130.215 MiB  0.000 MiB            z = zs[i]
     16       130.215 MiB  0.000 MiB            c = cs[i]
     17       130.215 MiB  0.000 MiB            while n < maxiter and abs(z) < 2:
     18       130.215 MiB  0.000 MiB                z = z * z + c
     19       130.215 MiB  0.000 MiB                n += 1
     20       130.215 MiB  0.000 MiB            output[i] = n
     21       122.582 MiB  -7.633 MiB       return output
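     memory_profiler follows the same workflow; a minimal sketch (the file
     name is again an assumption):

         # Command-line usage:
         #   $ python -m memory_profiler julia1.py
         # Or decorate explicitly and run the script normally:
         from memory_profiler import profile

         @profile
         def grow_a_list(n):
             data = [0] * n     # this allocation shows up as an Increment
             return sum(data)

         grow_a_list(1000000)   # the per-line table prints when the call returns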
  6. Don't sacrifice unit tests
     • It is possible (but not trivial) to maintain unit tests whilst profiling
     • See my book for examples (you make no-op @profile decorators; a sketch
       follows below)
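     A minimal sketch of the no-op fallback (the exact form in the book may
     differ): when kernprof or memory_profiler is attached it defines
     `profile` for you; when nothing is attached we supply a pass-through so
     the decorated code still runs under the unit tests.

         try:
             profile                # defined only while a profiler is attached
         except NameError:
             def profile(func):     # no-op stand-in for normal runs and tests
                 return func

         @profile
         def add(a, b):
             return a + b

         assert add(1, 2) == 3      # passes with or without a profiler present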
  7. ipython_memory_watcher.py

     # approx 750MB per matrix
     In [2]: a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)
     'a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)' used 2288.8750 MiB RAM
     in 1.02s, peaked 0.00 MiB above current, total RAM usage 2338.06 MiB

     In [3]: d=a*b+c
     'd=a*b+c' used 762.9453 MiB RAM in 0.91s, peaked 667.91 MiB above current,
     total RAM usage 3101.01 MiB

     (The ~668 MiB peak is the temporary array numpy allocates for a*b before
     c is added and the result is bound to d.)
  8. Profiling possibilities
     • CPU (line by line or by function)
     • Memory (line by line)
     • Disk read/write (with some hacking)
     • Network read/write (with some hacking)
     • mmaps, file handles, network connections
     • Why not watch memory flows on the machine? (a sketch follows)
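     The deck names no tool for that last point; a hedged sketch using psutil
     (an assumption, not something the talk shows) to watch a process's
     memory, file handles and sockets from the outside:

         import psutil

         proc = psutil.Process()    # current process; pass a pid to watch another
         print(proc.memory_info().rss / 2**20, "MiB resident")
         print(len(proc.open_files()), "open file handles")
         print(len(proc.connections()), "network connections")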
  9. Cython 0.20 (pyx annotations)

     #cython: boundscheck=False
     def calculate_z(int maxiter, zs, cs):
         """Calculate output list using Julia update rule"""
         cdef unsigned int i, n
         cdef double complex z, c
         output = [0] * len(zs)
         for i in range(len(zs)):
             n = 0
             z = zs[i]
             c = cs[i]
             while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                 z = z * z + c
                 n += 1
             output[i] = n
         return output

     Pure CPython list code: 12s
     Cython list runtime: 0.19s
     Cython numpy runtime: 0.16s
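     A minimal build-and-run sketch, assuming the code above is saved as
     cythonfn.pyx (the file and module names are assumptions):

         import pyximport
         pyximport.install()        # compiles .pyx files on first import

         import cythonfn            # triggers compilation of cythonfn.pyx

         zs = [0.0 + 0.0j] * 10     # toy inputs; the talk uses a large grid
         cs = [-0.62772 - 0.42193j] * 10
         out = cythonfn.calculate_z(300, zs, cs)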
  10. Cython + numpy + OMP nogil

      #cython: boundscheck=False
      from cython.parallel import prange
      import numpy as np

      def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
          cdef unsigned int i, length, n
          cdef double complex z, c
          cdef int[:] output = np.empty(len(zs), dtype=np.int32)
          length = len(zs)
          with nogil:
              for i in prange(length, schedule="guided"):
                  z = zs[i]
                  c = cs[i]
                  n = 0
                  while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                      z = z * z + c
                      n = n + 1
                  output[i] = n
          return output

      Runtime: 0.05s
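      prange only runs in parallel if the extension is built with OpenMP; a
      minimal setup.py sketch for gcc (file and module names are assumptions):

          from distutils.core import setup
          from distutils.extension import Extension
          from Cython.Build import cythonize

          ext = Extension("calculate",
                          sources=["calculate.pyx"],
                          extra_compile_args=["-fopenmp"],  # OpenMP for prange
                          extra_link_args=["-fopenmp"])
          setup(ext_modules=cythonize(ext))

      Built with: python setup.py build_ext --inplace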
  11. Pythran (0.40)

      #pythran export calculate_z_serial_purepython(int, complex list, complex list)
      def calculate_z_serial_purepython(maxiter, zs, cs):
          …

      List runtime: 0.4s

      Support for OpenMP on numpy arrays:

      #pythran export calculate_z(int, complex[], complex[], int[])
      …
      #omp parallel for schedule(dynamic)

      OMP numpy runtime: 0.10s

      Author Serge made an overnight fix – superb support!
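      Pythran compiles the annotated pure-Python file from the command line; a
      sketch assuming the module is saved as julia_pythran.py (the file name
      is an assumption):

          # $ pythran julia_pythran.py            # serial native extension
          # $ pythran -fopenmp julia_pythran.py   # honour the #omp annotations
          # The compiled module then imports like any other:
          import julia_pythran

          zs = [0.0 + 0.0j] * 10                  # toy inputs for illustration
          cs = [-0.62772 - 0.42193j] * 10
          out = julia_pythran.calculate_z_serial_purepython(300, zs, cs)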
  12. PyPy nightly (and numpypy)
      • “It just works” on Python 2.7 code
      • Clever list strategies (e.g. unboxed, uniform)
      • Software Transactional Memory LOOKS INTERESTING
      • Pure-Python libs (e.g. pymysql) work fine
      • Python list code runtime: 0.3s, faster on the second run (if in the same session)
      • No support cost if pypy is in PATH
  13. Numba 0.13

      from numba import jit

      @jit(nopython=True)
      def calculate_z_serial_purepython(maxiter, zs, cs, output):
          # couldn't create output inside the jitted function, had to pass it in
          # output = numpy.zeros(len(zs), dtype=np.int32)
          for i in xrange(len(zs)):
              n = 0
              z = zs[i]
              c = cs[i]
              while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
                  z = z * z + c
                  n += 1
              output[i] = n
          # return output

      Runtime: 0.4s (0.2s on subsequent runs)
      Some Python 3 support, some GPU support
      Not a silver bullet yet, but it might be...
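      A usage sketch with illustrative inputs (the array values are
      assumptions); output is pre-allocated and filled in place because, as
      noted above, the jitted function couldn't create it:

          import numpy as np

          zs = np.zeros(1000000, dtype=np.complex128)   # toy coordinate grid
          cs = np.full(1000000, -0.62772 - 0.42193j)    # constant c, Julia set
          output = np.zeros(len(zs), dtype=np.int32)    # result buffer

          calculate_z_serial_purepython(300, zs, cs, output)  # pays JIT cost
          calculate_z_serial_purepython(300, zs, cs, output)  # now full speed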
  14. Tool tradeoffs
      • Always profile first – maybe you just need a better algorithm?
      • Never sacrifice unit tests in the name of profiling
      • PyPy: no learning curve – an easy (non-numpy) win
      • Cython on pure Python: hours to learn – low team cost (and lots of online help)
      • Cython + numpy + OMP: days or more to learn – heavy team cost?
      • [R&D?] Numba: trivial to learn when it works (Anaconda only!)
      • [R&D?] Pythran: trivial to learn, OMP is an easy additional win, but it increases support cost
  15. Wrap up
      • Our profiling options should be richer
      • 4-12 physical CPU cores are commonplace
      • JITs/AST compilers are getting fairly good; manual intervention still gives the best results
      • Automation should be embraced: CPUs cost less than humans, and team velocity is probably higher
  16. Thank You
      • [email protected] @IanOzsvald
      • ModelInsight.io / MorConsulting.com
      • GitHub/IanOzsvald
      • I'm training on this in October in London!