Slide 1

The High Performance Python Landscape - profiling and fast calculation
Ian Ozsvald
@IanOzsvald
MorConsulting.com

Slide 2

[email protected] @IanOzsvald PyDataLondon February 2014 What is “high performance”? ● Profiling to understand system behaviour ● We often ignore this step... ● Speeding up the bottleneck ● Keeps you on 1 machine (if possible) ● Keeping team speed high

Slide 3

[email protected] @IanOzsvald PyDataLondon February 2014 “High Performance Python” • “Practical Performant Programming for Humans” • Please join the mailing list via IanOzsvald.com

Slide 4

[email protected] @IanOzsvald PyDataLondon February 2014 cProfile

Slide 5

[email protected] @IanOzsvald PyDataLondon February 2014 line_profiler Line # Hits Time Per Hit % Time Line Contents ============================================================== 9 @profile 10 def calculate_z_serial_purepython( maxiter, zs, cs): 12 1 6870 6870.0 0.0 output = [0] * len(zs) 13 1000001 781959 0.8 0.8 for i in range(len(zs)): 14 1000000 767224 0.8 0.8 n = 0 15 1000000 843432 0.8 0.8 z = zs[i] 16 1000000 786013 0.8 0.8 c = cs[i] 17 34219980 36492596 1.1 36.2 while abs(z) < 2 and n < maxiter: 18 33219980 32869046 1.0 32.6 z = z * z + c 19 33219980 27371730 0.8 27.2 n += 1 20 1000000 890837 0.9 0.9 output[i] = n 21 1 4 4.0 0.0 return output

Slide 6

[email protected] @IanOzsvald PyDataLondon February 2014 memory_profiler Line # Mem usage Increment Line Contents ================================================ 9 89.934 MiB 0.000 MiB @profile 10 def calculate_z_serial_purepython( maxiter, zs, cs): 12 97.566 MiB 7.633 MiB output = [0] * len(zs) 13 130.215 MiB 32.648 MiB for i in range(len(zs)): 14 130.215 MiB 0.000 MiB n = 0 15 130.215 MiB 0.000 MiB z = zs[i] 16 130.215 MiB 0.000 MiB c = cs[i] 17 130.215 MiB 0.000 MiB while n < maxiter and abs(z) < 2: 18 130.215 MiB 0.000 MiB z = z * z + c 19 130.215 MiB 0.000 MiB n += 1 20 130.215 MiB 0.000 MiB output[i] = n 21 122.582 MiB ­7.633 MiB return output

Slide 7

[email protected] @IanOzsvald PyDataLondon February 2014 memory_profiler mprof https://github.com/scikit-learn/scikit-l earn/pull/2248 Before & After an improvement

Slide 8

[email protected] @IanOzsvald PyDataLondon February 2014 Transforming memory_profiler into a resource profiler?

Slide 9

[email protected] @IanOzsvald PyDataLondon February 2014 Profiling possibilities ● CPU (line by line or by function) ● Memory (line by line) ● Disk read/write (with some hacking) ● Network read/write (with some hacking) ● mmaps ● File handles ● Network connections ● Cache utilisation via libperf?

Slide 10

[email protected] @IanOzsvald PyDataLondon February 2014 Cython 0.20 (pyx annotations) #cython: boundscheck=False def calculate_z(int maxiter, zs, cs): """Calculate output list using Julia update rule""" cdef unsigned int i, n cdef double complex z, c output = [0] * len(zs) for i in range(len(zs)): n = 0 z = zs[i] c = cs[i] while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4: z = z * z + c n += 1 output[i] = n return output Pure CPython lists code 12s Cython lists runtime 0.19s Cython numpy runtime 0.16s

Slide 11

[email protected] @IanOzsvald PyDataLondon February 2014 Cython + numpy + OMP nogil #cython: boundscheck=False from cython.parallel import parallel, prange import numpy as np cimport numpy as np def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs): cdef unsigned int i, length, n cdef double complex z, c cdef int[:] output = np.empty(len(zs), dtype=np.int32) length = len(zs) with nogil, parallel(): for i in prange(length, schedule="guided"): z = zs[i] c = cs[i] n = 0 while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4: z = z * z + c n = n + 1 output[i] = n return output Runtime 0.05s

Slide 12

[email protected] @IanOzsvald PyDataLondon February 2014 ShedSkin 0.9.4 annotations def calculate_z(maxiter, zs, cs): # maxiter: [int], zs: [list(complex)], cs: [list(complex)] output = [0] * len(zs) # [list(int)] for i in range(len(zs)): # [__iter(int)] n = 0 # [int] z = zs[i] # [complex] c = cs[i] # [complex] while n < maxiter and (… <4): # [complex] z = z * z + c # [complex] n += 1 # [int] output[i] = n # [int] return output # [list(int)] Couldn't we generate Cython pyx? Runtime 0.22s

Slide 13

[email protected] @IanOzsvald PyDataLondon February 2014 Pythran (0.40) #pythran export calculate_z_serial_purepython(int, complex list, complex list) def calculate_z_serial_purepython(maxiter, zs, cs): … Support for OpenMP on numpy arrays Author Serge made an overnight fix – superb support! List Runtime 0.4s #pythran export calculate_z(int, complex[], complex[], int[]) … #omp parallel for schedule(dynamic) OMP numpy Runtime 0.10s

Slide 14

[email protected] @IanOzsvald PyDataLondon February 2014 PyPy nightly (and numpypy) ● “It just works” on Python 2.7 code ● Clever list strategies (e.g. unboxed, uniform) ● Little support for pre-existing C extensions (e.g. the existing numpy) ● multiprocessing, IPython etc all work fine ● Python list code runtime: 0.3s ● (pypy)numpy support is incomplete, bugs are tackled (numpy runtime 5s [CPython+numpy 56s])

Slide 15

[email protected] @IanOzsvald PyDataLondon February 2014 Numba 0.12 from numba import jit @jit(nopython=True) def calculate_z_serial_purepython(maxiter, zs, cs, output): # couldn't create output, had to pass it in # output = numpy.zeros(len(zs), dtype=np.int32) for i in xrange(len(zs)): n = 0 z = zs[i] c = cs[i] #while n < maxiter and abs(z) < 2: # abs unrecognised while n < maxiter and z.real * z.real + z.imag * z.imag < 4: z = z * z + c n += 1 output[i] = n #return output Runtime 0.4s Some Python 3 support, some GPU prange support missing (was in 0.11)? 0.12 introduces temp limitations

Slide 16

[email protected] @IanOzsvald PyDataLondon February 2014 Tool Tradeoffs ● PyPy no learning curve (pure Py only) easy win? ● ShedSkin easy (pure Py only) but fairly rare ● Cython pure Py hours to learn – team cost low (and lots of online help) ● Cython numpy OMP days+ to learn – heavy team cost? ● Numba/Pythran hours to learn, install a bit tricky (Anaconda easiest for Numba) ● Pythran OMP very impressive result for little effort ● Numba big toolchain which might hurt productivity? ● (numexpr not covered – great for numpy and easy to use)

Slide 17

[email protected] @IanOzsvald PyDataLondon February 2014 Wrap up ● Our profiling options should be richer ● 4-12 physical CPU cores commonplace ● Cost of hand-annotating code is reduced agility ● JITs/AST compilers are getting fairly good, manual intervention still gives best results BUT! CONSIDER: ● Automation should (probably) be embraced ($CPUs < $humans) as team velocity is probably higher

Slide 18

[email protected] @IanOzsvald PyDataLondon February 2014 Thank You • [email protected] • @IanOzsvald • MorConsulting.com • Annotate.io • GitHub/IanOzsvald