• Profiling to understand system behaviour
• We often ignore this step...
• Speeding up the bottleneck
• Keeps you on 1 machine (if possible)
• Keeping team speed high
• CPU (line by line or by function)
• Memory (line by line)
• Disk read/write (with some hacking)
• Network read/write (with some hacking)
• mmaps
• File handles
• Network connections
• Cache utilisation via libperf?
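As a concrete starting point, function-level CPU profiling needs nothing beyond the standard library; a minimal sketch using cProfile and pstats (the function name and workload are illustrative, not from the talk):

```python
import cProfile
import io
import pstats

def hot_loop(n):
    """An illustrative CPU-bound function to profile."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop(200_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Line-by-line CPU and memory profiling come from third-party tools (e.g. line_profiler, memory_profiler) rather than the standard library.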
#cython: boundscheck=False
def calculate_z(int maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    cdef unsigned int i, n
    cdef double complex z, c
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

• Pure CPython lists runtime: 12s
• Cython lists runtime: 0.19s
• Cython numpy runtime: 0.16s
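For reference, the pure-CPython list version that the 12s figure refers to is the same loop with the cdef annotations stripped; a runnable sketch (the parameter values in the usage line are illustrative):

```python
def calculate_z(maxiter, zs, cs):
    """Calculate output list using the Julia update rule (pure Python)."""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Compare |z|^2 against 4 to avoid a sqrt via abs()
        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
            z = z * z + c
            n += 1
        output[i] = n
    return output

# A point that never escapes hits maxiter; one starting outside |z| = 2 escapes at once
print(calculate_z(300, [0j, 2 + 2j], [0j, 0j]))  # → [300, 0]
```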
#cython: boundscheck=False
from cython.parallel import parallel, prange
import numpy as np
cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):
    cdef unsigned int i, length, n
    cdef double complex z, c
    cdef int[:] output = np.empty(len(zs), dtype=np.int32)
    length = len(zs)
    with nogil, parallel():
        for i in prange(length, schedule="guided"):
            z = zs[i]
            c = cs[i]
            n = 0
            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:
                z = z * z + c
                n = n + 1
            output[i] = n
    return output

Runtime: 0.05s
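Compiling the prange version requires OpenMP flags to reach the C compiler and linker; a minimal setup.py sketch, assuming GCC-style flags and a source file named calculate.pyx (the module and file names are illustrative):

```python
# setup.py -- build with: python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "calculate",                      # illustrative module name
    ["calculate.pyx"],                # illustrative source file
    extra_compile_args=["-fopenmp"],  # enable OpenMP so prange runs in parallel
    extra_link_args=["-fopenmp"],
    include_dirs=[np.get_include()],  # headers for `cimport numpy`
)

setup(ext_modules=cythonize([ext]))
```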
def calculate_z(maxiter, zs, cs):
    # maxiter: [int], zs: [list(complex)], cs: [list(complex)]
    output = [0] * len(zs)              # [list(int)]
    for i in range(len(zs)):            # [__iter(int)]
        n = 0                           # [int]
        z = zs[i]                       # [complex]
        c = cs[i]                       # [complex]
        while n < maxiter and (… < 4):  # [complex]
            z = z * z + c               # [complex]
            n += 1                      # [int]
        output[i] = n                   # [int]
    return output                       # [list(int)]

Couldn't we generate Cython pyx from these inferred types?
Runtime: 0.22s
• “It just works” on Python 2.7 code
• Clever list strategies (e.g. unboxed, uniform)
• Little support for pre-existing C extensions (e.g. the existing numpy)
• multiprocessing, IPython etc. all work fine
• Python list code runtime: 0.3s
• (pypy)numpy support is incomplete; bugs are being tackled (numpy runtime 5s [CPython+numpy 56s])
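Because PyPy runs unmodified pure-Python code, the same timing harness works under both interpreters; a minimal standard-library sketch (the workload is illustrative):

```python
import sys
import time

def workload():
    # Illustrative pure-Python hot loop; PyPy's JIT speeds this up unmodified
    return sum(i * i for i in range(100_000))

start = time.perf_counter()
result = workload()
elapsed = time.perf_counter() - start

# Run the same script with `python` and with `pypy` to compare timings
print("%s: result=%d in %.4fs" % (sys.version.split()[0], result, elapsed))
```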
from numba import jit

@jit(nopython=True)
def calculate_z_serial_purepython(maxiter, zs, cs, output):
    # couldn't create output inside, had to pass it in:
    # output = numpy.zeros(len(zs), dtype=np.int32)
    for i in xrange(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # while n < maxiter and abs(z) < 2:  # abs unrecognised
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n
    # return output

Runtime: 0.4s
Some Python 3 support, some GPU support; prange support missing (was in 0.11)? 0.12 introduces temporary limitations.
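The pass-the-output-buffer-in workaround above can be sketched in plain Python (no Numba required to run it); under Numba the same body would carry the @jit decorator, and the stdlib array stands in for np.zeros(..., dtype=np.int32):

```python
from array import array

def calculate_z(maxiter, zs, cs, output):
    # Mutates a preallocated output buffer instead of returning a new one,
    # matching the calling convention the Numba version required.
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:
            z = z * z + c
            n += 1
        output[i] = n

zs = [0j, 2 + 2j]
cs = [0j, 0j]
out = array("i", [0] * len(zs))  # stands in for np.zeros(len(zs), dtype=np.int32)
calculate_z(300, zs, cs, out)
print(list(out))  # → [300, 0]
```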
• PyPy: no learning curve (pure Py only) – easy win?
• ShedSkin: easy (pure Py only) but fairly rare
• Cython pure Py: hours to learn – team cost low (and lots of online help)
• Cython numpy OMP: days+ to learn – heavy team cost?
• Numba/Pythran: hours to learn; install a bit tricky (Anaconda easiest for Numba)
• Pythran OMP: very impressive result for little effort
• Numba: big toolchain which might hurt productivity?
• (numexpr not covered – great for numpy and easy to use)
• Options should be richer
• 4-12 physical CPU cores commonplace
• Hand-annotating code costs agility
• JITs/AST compilers are getting fairly good; manual intervention still gives the best results
BUT! CONSIDER:
• Automation should (probably) be embraced ($CPUs < $humans) as team velocity is probably higher