Slide 1

The High Performance Python Landscape - profiling and fast calculation
Ian Ozsvald
@IanOzsvald
MorConsulting.com

Slide 2

What is “high performance”?
● Profiling to understand system behaviour
● We often ignore this step...
● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)
● Keeping team speed high

Slide 3

“High Performance Python”
• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com

Slide 4

line_profiler

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def calculate_z_serial_purepython( maxiter, zs, cs):
    12         1         6870   6870.0      0.0      output = [0] * len(zs)
    13   1000001       781959      0.8      0.8      for i in range(len(zs)):
    14   1000000       767224      0.8      0.8          n = 0
    15   1000000       843432      0.8      0.8          z = zs[i]
    16   1000000       786013      0.8      0.8          c = cs[i]
    17  34219980     36492596      1.1     36.2          while abs(z) < 2 and n < maxiter:
    18  33219980     32869046      1.0     32.6              z = z * z + c
    19  33219980     27371730      0.8     27.2              n += 1
    20   1000000       890837      0.9      0.9          output[i] = n
    21         1            4      4.0      0.0      return output
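
To reproduce output like this (a minimal sketch, not from the slides: the file name julia.py and the input values are assumptions; kernprof injects the @profile builtin at runtime):

    $ pip install line_profiler
    $ kernprof -l -v julia.py

    # julia.py
    @profile                     # provided by kernprof; see the no-op decorator trick on slide 6
    def calculate_z_serial_purepython(maxiter, zs, cs):
        output = [0] * len(zs)
        for i in range(len(zs)):
            n = 0
            z, c = zs[i], cs[i]
            while abs(z) < 2 and n < maxiter:
                z = z * z + c
                n += 1
            output[i] = n
        return output

    if __name__ == "__main__":
        zs = [complex(0.3, 0.6)] * 1000              # illustrative inputs only
        cs = [complex(-0.62772, -0.42193)] * 1000
        calculate_z_serial_purepython(300, zs, cs)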

Slide 5

memory_profiler

Line #    Mem usage    Increment   Line Contents
================================================
     9   89.934 MiB    0.000 MiB   @profile
    10                             def calculate_z_serial_purepython( maxiter, zs, cs):
    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)
    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):
    14  130.215 MiB    0.000 MiB           n = 0
    15  130.215 MiB    0.000 MiB           z = zs[i]
    16  130.215 MiB    0.000 MiB           c = cs[i]
    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:
    18  130.215 MiB    0.000 MiB               z = z * z + c
    19  130.215 MiB    0.000 MiB               n += 1
    20  130.215 MiB    0.000 MiB           output[i] = n
    21  122.582 MiB   -7.633 MiB       return output
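
memory_profiler reuses the same @profile decoration (a minimal sketch; the file name is an assumption, and psutil is optional but speeds up the sampling):

    $ pip install memory_profiler psutil
    $ python -m memory_profiler julia.py    # prints the per-line Mem usage / Increment table

    # julia.py can keep the @profile-decorated function from the previous sketch;
    # running it via -m memory_profiler supplies memory_profiler's own profile decorator.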

Slide 6

Don't sacrifice unit tests
● It is possible (but not trivial) to maintain unit tests whilst profiling
● See my book for examples (you make a no-op @profile decorator, as in the sketch below)
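
A minimal sketch of the no-op fallback (my wording, not the book's code): if no profiler has injected a profile builtin, define one that returns the function unchanged, so decorated code still imports and the unit tests keep running.

    # near the top of the module under test
    try:
        profile                      # injected as a builtin by kernprof / memory_profiler
    except NameError:
        def profile(func):
            """No-op stand-in so @profile-decorated code runs without a profiler."""
            return func

    @profile
    def add(a, b):                   # any decorated function now works with or without a profiler
        return a + b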

Slide 7

ipython_memory_watcher.py

# approx 750MB per matrix
In [2]: a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)
'a=np.ones(1e8); b=np.ones(1e8); c=np.ones(1e8)' used 2288.8750 MiB RAM in 1.02s,
peaked 0.00 MiB above current, total RAM usage 2338.06 MiB

In [3]: d=a*b+c
'd=a*b+c' used 762.9453 MiB RAM in 0.91s,
peaked 667.91 MiB above current, total RAM usage 3101.01 MiB
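
A quick sanity check on the "approx 750MB per matrix" figure (not from the slide): 1e8 float64 values at 8 bytes each is 800,000,000 bytes, i.e. roughly 762.9 MiB per array, so the three arrays account for the ~2289 MiB reported above.

    import numpy as np

    a = np.ones(10**8)              # 1e8 float64 values
    print(a.nbytes)                 # 800000000 bytes
    print(a.nbytes / 2.0**20)       # ~762.94 MiB for one array
    print(3 * a.nbytes / 2.0**20)   # ~2288.82 MiB for a, b and c together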

Slide 8

memory_profiler mprof
https://github.com/scikit-learn/scikit-learn/pull/2248
Before & After an improvement
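
Those plots come from mprof, memory_profiler's whole-process sampler (a minimal sketch; the script name is an assumption):

    $ mprof run julia.py    # samples the process's memory over the whole run, writes an mprofile_<timestamp>.dat file
    $ mprof plot            # plots the most recent .dat file (needs matplotlib)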

Slide 9

Transforming memory_profiler into a resource profiler?

Slide 10

Profiling possibilities
● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps, file handles, network connections
● Why not watch memory flows on the whole machine?

Slide 11

Cython 0.20 (pyx annotations)

#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

Pure CPython lists runtime 12s
Cython lists runtime 0.19s
Cython numpy runtime 0.16s
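
One way to build and inspect the annotated extension (a minimal sketch; the file names setup.py and cythonfn.pyx are assumptions, the rest is the standard cythonize workflow):

    # setup.py
    from distutils.core import setup
    from Cython.Build import cythonize

    setup(ext_modules=cythonize("cythonfn.pyx"))

    # build in place, then inspect the HTML annotation:
    #   python setup.py build_ext --inplace
    #   cython -a cythonfn.pyx     # writes cythonfn.html; yellow lines show Python-level overhead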

Slide 12

Cython + numpy + OMP nogil

#cython: boundscheck=False
from cython.parallel import prange
import numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil:
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output

Runtime 0.05s
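
The prange version needs OpenMP flags at compile and link time (a minimal sketch for gcc; the module and file names are assumptions):

    # setup.py
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    ext = Extension("cythonfn_omp", ["cythonfn_omp.pyx"],
                    extra_compile_args=["-fopenmp"],
                    extra_link_args=["-fopenmp"])

    setup(ext_modules=cythonize([ext]))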

Slide 13

Pythran (0.40)

#pythran export calculate_z_serial_purepython(int, complex list, complex list)
def calculate_z_serial_purepython(maxiter, zs, cs):
    …

Support for OpenMP on numpy arrays
Author Serge made an overnight fix – superb support!
List runtime 0.4s

#pythran export calculate_z(int, complex[], complex[], int[])
…
#omp parallel for schedule(dynamic)

OMP numpy runtime 0.10s
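
Pythran compiles the annotated pure-Python module from the command line into a native extension (a minimal sketch; the file name is an assumption and the -fopenmp flag assumes a gcc-style compiler):

    $ pip install pythran
    $ pythran julia_pythran.py              # emits a julia_pythran native module next to the source
    $ pythran -fopenmp julia_pythran.py     # enables the #omp annotations for the parallel version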

Slide 14

PyPy nightly (and numpypy)
● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Software Transactional Memory LOOKS INTERESTING
● Pure-Python libs (e.g. pymysql) work fine
● Python list code runtime: 0.3s, faster on second run (if in same session)
● No support cost if pypy is in PATH

Slide 15

Numba 0.13

from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output, had to pass it in
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    #return output

Runtime 0.4s (0.2s on subsequent runs)
Some Python 3 support, some GPU
Not a golden bullet yet but might be...
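
Because the nopython-mode function cannot allocate its own result array, the caller pre-allocates it (a minimal sketch; the input values are illustrative only):

    import numpy as np

    zs = np.array([complex(0.3, 0.6)] * 1000, dtype=np.complex128)
    cs = np.array([complex(-0.62772, -0.42193)] * 1000, dtype=np.complex128)
    output = np.zeros(len(zs), dtype=np.int32)            # pre-allocated, filled in place

    calculate_z_serial_purepython(300, zs, cs, output)    # first call includes JIT compile time
    print(output[:10])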

Slide 16

Tool Tradeoffs
● Always profile first – maybe you just need a better algorithm?
● Never sacrifice unit tests in the name of profiling
● PyPy: no learning curve – an easy (non-numpy) win
● Cython pure Py: hours to learn – team cost low (and lots of online help)
● Cython numpy OMP: days+ to learn – heavy team cost?
● [R&D?] Numba: trivial to learn when it works (Anaconda only!)
● [R&D?] Pythran: trivial to learn, OMP an easy additional win, increases support cost

Slide 17

Wrap up
● Our profiling options should be richer
● 4-12 physical CPU cores are commonplace
● JITs/AST compilers are getting fairly good, but manual intervention still gives the best results
● Automation should be embraced, as CPUs cost less than humans and team velocity is probably higher

Slide 18

Thank You
• [email protected]
• @IanOzsvald
• ModelInsight.io / MorConsulting.com
• GitHub/IanOzsvald
• I'm training on this in October in London!