
Compiling Python with Numba


Finance is full of numerical problems that must be solved rapidly, where speed of execution *and* speed of development both matter. Numba provides a mechanism to get native-code performance out of Python, a language already known for rapid development.

Travis E. Oliphant

November 18, 2013



Transcript

  1. Compiling Python with
    Numba
    Travis E. Oliphant, PhD
    Continuum Analytics, Inc
    Workshop in HPC in Finance
    Supercomputing 2013
    November 18, 2013


  2. Where I’m coming from
    After
    Before
    $\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$


  3. Founding sponsors of NumFOCUS and
    creators of PyData
    Spyder


  4. Enterprise
    Python
    Scientific
    Computing
    Data Processing
    Data Analysis
    Visualisation
    Scalable
    Computing
    • Products
    • Training
    • Support
    • Consulting
    Now: Improve Python for Enterprise
    Wakari


  5. What is Python?
    Python is an interpreted, object-oriented, high-level
    programming language with dynamic semantics. Its high-level
    built in data structures, combined with dynamic typing and
    dynamic binding, make it very attractive for Rapid Application
    Development, as well as for use as a scripting or glue language
    to connect existing components together.
    http://www.python.org/doc/essays/blurb.html
    Created in 1990 by Guido van Rossum
    -- inspired by teaching languages


  6. How many people are using Python?
    • Tens of millions of downloads from
    python.org each year
    • Does not include distribution downloads or
    pre-installed systems.
    • #10 most popular tag on Stack Overflow
    • Consistently ranks 6-8 in Tiobe index
    • Has both developers and Domain users


  7. A sample of users


  8. Why Python
    License: Free and Open Source, Permissive License
    Community:
    • Broad and friendly community
    • Over 36,000 packages on PyPI
    • Commercial Support
    • Many conferences (PyData, SciPy, PyCon...)
    Readable Syntax:
    • Executable pseudo-code
    • Can understand and edit code a year later
    • Fun to develop
    • Use of Indentation
    IPython:
    • Interactive prompt on steroids (Notebook)
    • Allows less working memory
    • Allows failing quickly for exploration
    Modern Constructs:
    • List comprehensions
    • Iterator protocol and generators
    • Meta-programming
    • Introspection
    • (JIT Compiler and Concurrency)
    Batteries Included:
    • Internet (FTP, HTTP, SMTP, XMLRPC)
    • Compression and Databases
    • Logging, unit-tests
    • Glue for other languages
    • Distribution has much, much more....


  9. Why Python for Technical Computing
    • Syntax (it gets out of your way)
    • Over-loadable operators
    • Complex numbers built-in early
    • Just enough language support for arrays
    • “Occasional” programmers can grok it
    • Supports multiple programming styles
    • Expert programmers can also use it effectively
    • Has a simple, extensible implementation
    • General-purpose language --- can build a system
    • Critical mass


  10. Python supports a developer spectrum
    Occasional:
    • Cut and paste
    • Modify a few variables
    • Call some functions
    • Typical Quant or Engineer who doesn’t become a programmer
    Scientist Developer:
    • Extends frameworks
    • Builds new objects
    • Wraps code
    • Quant / Engineer with decent developer skill
    Developer:
    • Creates frameworks
    • Creates compilers
    • Typical CS grad
    • Knows multiple languages
    Unique aspect of Python


  11. Easy to Learn
    What can be learned in 3 hours:
    • types: bool, int, str, float, complex, lists, tuples, sets, dicts
    • control: for-loops, while-loops, try-except, if-elif-else
    • organization: functions, classes, modules, packages
    What can be learned in a (dedicated) day:
    • python idioms
    • methods on common objects
    • tour of the library
    • special methods (__add__, __getitem__, ...)


  12. NumPy Array is “typed container”
    [diagram: ndarray structure, a data buffer plus metadata such as shape]
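    A minimal sketch (not from the slides) of what “typed container” means: every ndarray is one memory buffer described by a dtype plus shape/strides metadata, which is exactly the information a compiler needs to generate typed loops.

    import numpy as np

    a = np.arange(12, dtype=np.float64).reshape(3, 4)  # one contiguous buffer
    print(a.dtype)     # float64 : every element has the same type
    print(a.shape)     # (3, 4)  : logical dimensions
    print(a.strides)   # (32, 8) : bytes to step per dimension
    print(a.itemsize)  # 8       : bytes per element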


  13. NumPy Stack -- about 2,000,000 users
    NumPy
    SciPy Pandas Matplotlib
    scikit-learn
    scikit-image statsmodels
    PyTables
    OpenCV
    Cython
    Numba SymPy NumExpr
    astropy BioPython GDAL
    PySAL
    ... many many more ...


  14. Code that users want to write
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
    $O = I \star F$
    Slow!!!!


  15. Why is Python slow?
    1. Dynamic typing
    2. Attribute lookups
    3. NumPy get-item (a[...])
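    A hedged illustration (not from the slides) of where those costs show up: in a plain-Python loop every iteration re-resolves the types of the operands, looks up methods dynamically, and boxes each scalar that a[i] pulls out of the array.

    import numpy as np

    def py_sum_squares(a):
        # each a[i] becomes a boxed Python float; '+' and '*' are
        # dispatched dynamically on every single iteration
        total = 0.0
        for i in range(len(a)):
            total += a[i] * a[i]
        return total

    a = np.random.rand(1000000)
    # compare: %timeit py_sum_squares(a)   vs   %timeit np.dot(a, a)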


  16. What are Quants doing Now?
    • Writing critical parts in C/C++/Fortran and “wrapping” with:
      • SWIG
      • ctypes (see the sketch below)
      • Cython
      • f2py (or fwrap)
      • hand-coded wrappers
    • Writing new code in Cython directly
      • Cython is “modified Python” with type information everywhere.
      • It produces a C-extension module which is then compiled.
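    A minimal ctypes sketch (not from the slides) of the “wrap compiled code” pattern: load a shared library and declare argument/return types so Python can call the native function directly. The standard C math library is used here so nothing new has to be compiled; locating it is platform-dependent.

    import ctypes
    import ctypes.util

    # find and load the C math library (name differs per platform)
    libm = ctypes.CDLL(ctypes.util.find_library('m'))

    # declare the C signature: double cos(double)
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))   # 1.0, computed by the C library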


  17. Requirements for Python compiler
    • Work with CPython (we need the full scientific
    Python stack!)
    • Minimal modifications to code (use type inference)
    • Programmer control over what and when to “jit”
    • Ability to build static extensions (for libraries)
    • Fall back to Python C-API for “object” types.


  18. Requirements Part II
    • Produce code as fast as Fortran
    • Support NumPy array-expressions and be able to
    produce universal functions (e.g. y = sin(x))
    • Provide a tool that could adapt to provide
    parallelism and produce code for modern vector
    hardware (GPUs, accelerators, and many-core
    machines)


  19. Do we have to write the full compiler??
    No!
    LLVM has done much of the heavy lifting.
    LLVM = compilers for everybody


  20. Face of a modern compiler
    Front-End (Parsing): C, C++, Fortran, ObjC
    → Intermediate Representation (IR)
    → Back-End (Code Generation): x86, ARM, PTX


  21. Face of a modern compiler
    Front-End (Parsing): Python, handled by Numba
    → Intermediate Representation (IR)
    → Back-End (Code Generation), handled by LLVM: x86, ARM, PTX


  22. Example
    Numba


  23. NumPy + Mamba = Numba
    [diagram: LLVM Library (Intel, Nvidia, Apple, AMD; OpenCL, ISPC, CUDA, CLANG, OpenMP) + LLVMPY: Python Function → Machine Code (ARM, ...)]


  24. Simple API

    • jit --- provide type information (fastest to call at run-time)
    • autojit --- detects input types, infers output, generates code if needed, and dispatches (a little more run-time call overhead)

    #@jit('void(double[:,:], double, double)')
    @autojit
    def numba_update(u, dx2, dy2):
        nx, ny = u.shape
        for i in xrange(1, nx-1):
            for j in xrange(1, ny-1):
                u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                          (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

    Comment out one of jit or autojit (don’t use together).
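    A usage sketch (grid size and values are illustrative): the first call with float64 arrays triggers type inference and compilation; later calls with the same argument types reuse the cached machine code.

    import numpy as np

    u = np.zeros((100, 100))
    u[0, :] = 100.0              # boundary condition on one edge
    dx2 = dy2 = 0.01 ** 2

    numba_update(u, dx2, dy2)    # first call: infer types, compile, run
    numba_update(u, dx2, dy2)    # later calls: straight to machine code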


  25. Example
    import numba
    from math import sin, pi

    @numba.jit('f8(f8)')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
    Numba
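    Usage is unchanged from ordinary Python (a sketch; NumPy is only needed for the comparison): the compiled function handles the removable singularity at 0 and agrees with NumPy’s normalized sinc.

    import numpy as np

    print(sinc(0.0))                              # 1.0
    print(sinc(0.5))                              # ~0.63662
    print(np.allclose(sinc(0.5), np.sinc(0.5)))   # True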


  26. ~150x speed-up: Real-time image processing (50 fps Mandelbrot)


  27. Speeding up Math Expressions
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
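    One way to compile that recurrence with Numba (a sketch, not from the slides; the function name and signature are illustrative, and the output array is passed in, mirroring the filter example on the next slide):

    from numba import jit

    @jit('void(f8[:,:], f8[:], f8[:])')
    def correlate(k, a, x):
        # x[i] = sum over j < i of k[i-j, j] * a[i-j] * a[j]
        n = a.shape[0]
        for i in range(n):
            total = 0.0
            for j in range(i):
                total += k[i - j, j] * a[i - j] * a[j]
            x[i] = total

    # usage: k is an (n, n) coefficient array; a and x are length-n vectors
    # n = 1000; k = np.random.rand(n, n); a = np.random.rand(n); x = np.empty(n)
    # correlate(k, a, x)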


  28. Image Processing
    from numba import jit

    @jit('void(f8[:,:], f8[:,:], f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                output[i,j] = result

    ~1500x speed-up
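    A usage sketch (array sizes and filter contents are illustrative): the output array is allocated by the caller and filled in place by the compiled loops.

    import numpy as np

    image = np.random.rand(512, 512)
    filt = np.ones((5, 5)) / 25.0     # simple 5x5 box blur
    output = np.zeros_like(image)

    filter(image, filt, output)       # compiled nested loops fill `output`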


  29. Compile NumPy array expressions
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] *
                                m1[-2:1:-2,...,::2])
        return m2
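    A short usage sketch for the first function (shapes are illustrative): the sliced expression compiles to loops over the array views, updating a in place just as plain NumPy would.

    import numpy as np

    a = np.ones((4, 4))
    b = np.arange(16.0).reshape(4, 4)
    c = 2.0 * np.ones((4, 4))

    formula(a, b, c)    # writes into a[1:, 1:] in place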


  30. Fast vectorize
    NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays.
    Write kernels in Python!

    from numba.vectorize import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
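    Because @vectorize builds a real NumPy ufunc, the Python kernel is applied elementwise over whole arrays with no Python-level loop, and dtype-based dispatch and broadcasting come for free (a usage sketch):

    import numpy as np

    x = np.linspace(-5, 5, 1001)
    y = sinc(x)                          # elementwise, in compiled code
    y32 = sinc(x.astype(np.float32))     # picks the 'f4(f4)' specialization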


  31. Case-study -- j0 from scipy.special
    • scipy.special was one of the first libraries I wrote
    • extended “umath” module by adding new
    “universal functions” to compute many scientific
    functions by wrapping C and Fortran libs.
    • Bessel functions are solutions to a differential equation:
    $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0, \qquad y = J_\alpha(x)$
    $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$
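    The integral form above suggests a pure-Python kernel to hand to @vectorize (using the numba.vectorize API shown on the earlier slide). This is a simple quadrature sketch, not the Cephes algorithm that scipy.special.j0 actually wraps, so its accuracy and speed are only illustrative:

    from math import cos, sin, pi
    from numba.vectorize import vectorize

    @vectorize(['f8(f8)'])
    def j0_quad(x):
        # trapezoidal rule on J0(x) = (1/pi) * integral_0..pi of cos(x*sin(t)) dt
        n = 200
        h = pi / n
        total = 0.5 * (cos(0.0) + cos(x * sin(pi)))
        for k in range(1, n):
            total += cos(x * sin(k * h))
        return total * h / pi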


  32. scipy.special.j0 wraps the Cephes algorithm


  33. Result --- equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop
    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
    But! Now code is in Python and can be
    experimented with more easily (and moved to
    the GPU / accelerator more easily)!


  34. Laplace Example
    @jit('void(double[:,:], double, double)')
    def numba_update(u, dx2, dy2):
        nx, ny = u.shape
        for i in xrange(1, nx-1):
            for j in xrange(1, ny-1):
                u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                          (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

    Adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran

    @jit('void(double[:,:], double, double)')
    def numbavec_update(u, dx2, dy2):
        u[1:-1,1:-1] = ((u[2:,1:-1] + u[:-2,1:-1])*dy2 +
                        (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
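    A driver sketch for either kernel (grid size, boundary values, and sweep count are illustrative; the measured benchmark lives in the speed.git repo on the next slide):

    import numpy as np

    nx = ny = 150
    dx = dy = 0.1
    dx2, dy2 = dx * dx, dy * dy

    u = np.zeros((nx, ny))
    u[0, :] = 1.0                      # fixed boundary along one edge

    for sweep in range(8000):          # repeated relaxation sweeps
        numba_update(u, dx2, dy2)      # or: numbavec_update(u, dx2, dy2)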


  35. Results of Laplace example
    Version Time Speed Up
    NumPy 3.19 1.0
    Numba 2.32 1.38
    Vect. Numba 2.33 1.37
    Cython 2.38 1.34
    Weave 2.47 1.29
    Numexpr 2.62 1.22
    Fortran Loops 2.30 1.39
    Vect. Fortran 1.50 2.13
    https://github.com/teoliphant/speed.git


  36. Numba can change the game!
    [diagram: C, C++, Fortran, and Numba (Python) all emit LLVM IR, which is compiled to x86, ARM, and PTX]
    Numba turns Python into a “compiled language” (but much more flexible). You don’t have to reach for C/C++.


  37. New Numba Language
    • Jit classes (zero-cost abstraction)
    • Integrated compiled language with Python run-time for glue
    • Struct support (NumPy arrays can be structs)
    • SSA --- can refer to local variables as different types
    • Can create typed lists, typed dictionaries, and typed sets
    • pointer support
    • calling ctypes and CFFI functions natively
    • create stand-alone dynamic library and executable
    • create static extension module for Python
    • some support for GPUs in open source coming in Q1 of
    2014 (see Accelerate for GPU support today)


  38. Example
    @sjit('StaticTuple[a, b]')
    class StaticTuple(object):
        layout = [('hd', 'a'), ('tl', 'b')]

        @jit
        def __init__(self, hd, tl):
            self.hd = hd
            self.tl = tl

        @jit('a -> b : integral -> c')
        def __getitem__(self, item):
            if item == 0:
                return self.hd
            else:
                return self.tl[item - 1]

        @jit('a -> Iterator[T]')
        def __iter__(self):
            yield self.hd
            for x in self.tl:
                yield x

        @jit('a -> int64')
        def __len__(self):
            if self.hd is None:
                return 0
            elif self.tl is None:
                return 1
            else:
                return len(self.tl) + 1

        @jit('a -> StaticTuple[t1, t2] -> c')
        def __add__(self, other):
            if self.tl is None:
                return StaticTuple(self.hd, other)
            else:
                return StaticTuple(self.hd, self.tl + other)

        @jit('a -> str')
        def __repr__(self):
            return '(%s)' % ", ".join(map(str, self))

        def element_type(self):
            if self.hd is None:
                raise TypeError("Cannot compute element type of empty tuple!")
            else:
                type = typeof(self.hd)
                if self.tl is not None:
                    type = promote(type, self.tl.element_type())
                return type

        @staticmethod
        def fromobject(tuple, type):
            head, tail = type.parameters
            hd = fromobject(tuple[0], head)
            if tuple[1:]:
                tl = fromobject(tuple[1:], tail)
            else:
                tl = EmptyTuple()
            return StaticTuple(hd, tl)

        @staticmethod
        def toobject(value, type):
            head, tail = type.parameters
            hd = toobject(value.hd, head)
            if isinstance(value.tl, EmptyTuple):
                return (hd,)
            return (hd,) + toobject(value.tl, tail)

        @jit('a -> Tuple[T] -> Tuple[T]')
        def __add__(self, other):
            result = List[T]()
            for x in self:
                result.append(x)
            return tuple(result) + other

        @jit('a -> a -> bool')
        def __eq__(self, other):
            return self.hd == other.hd and self.tl == other.tl


  39. Numba Development
    1852 Mark Florisson
    252 Jon Riehl
    224 Siu Kwan Lam
    112 Travis E. Oliphant
    96 Jay Bourque
    68 Hernan Grecco
    30 Dag Sverre Seljebotn
    22 Ilan Schnell
    11 Mark Wiebe
    11 Björn Linse
    10 majidaldo
    8 James Bergstra
    8 Gaëtan de Menten
    4 Francesc Alted
    4 Alberto Valverde
    3 Thomas Kluyver
    3 GFKjunior
    3 Christoph Gohlke
    ...
    git log --format=format:%an | sort | uniq -c | sort -r
    Siu
    Mark
    Jon


  40. GPU and Parallel Support
    fast development
    and fast execution!
    Python and NumPy compiled to Parallel Architectures
    (GPUs and multi-core machines)
    $ conda install accelerate


  41. Examples
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
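    Calling the resulting ufuncs looks the same as calling any NumPy ufunc (a sketch assuming NumbaPro / Anaconda Accelerate is installed and, for target='gpu', a CUDA device is available):

    import numpy as np

    x = np.linspace(-10, 10, 1000000).astype(np.float32)
    y_gpu = sinc(x)      # kernel launched on the GPU, result copied back
    y_cpu = sinc2(x)     # same kernel spread across the CPU cores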


  42. Example
    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid / width
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255
    Kind     Time      Speed-up
    Python   263.6     1.0x
    CPU      2.639     100x
    GPU      0.1676    1573x
    (GPU: Tesla S2050)
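    A host-side driver sketch (image size, bounds, and iteration count are illustrative; it assumes the GPU ufunc broadcasts scalar arguments the way an ordinary NumPy ufunc does):

    import numpy as np

    width, height, iters = 1024, 768, 20
    tid = np.arange(width * height, dtype=np.uint32)   # one element per pixel

    pixels = mandel(tid, np.float32(-2.0), np.float32(1.0),
                    np.float32(-1.0), np.float32(1.0),
                    np.uint32(width), np.uint32(height), np.uint32(iters))
    image = pixels.reshape(height, width)              # uint8 iteration counts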


  43. Example with cuFFT
    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()

    @vectorize(['complex64(complex64, complex64)'],
               target='gpu')
    def vmult(a, b):
        return a * b
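    A preparation sketch for the snippet above (assumptions: the image and filter are complex64 arrays of the same shape, with the small filter zero-padded, since FFT-based filtering multiplies the two spectra elementwise):

    import numpy as np

    img = np.random.rand(512, 512).astype(np.complex64)   # image to filter

    small = np.ones((5, 5), dtype=np.float32) / 25.0       # 5x5 box filter
    fltr = np.zeros(img.shape, dtype=np.complex64)
    fltr[:5, :5] = small                                   # zero-pad to the image shape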


  44. More examples at
    https://github.com/ContinuumIO/numbapro-examples
    http://devblogs.nvidia.com/parallelforall/numbapro-high-performance-python-cuda-acceleration/
    CUDA CAST #10: https://www.youtube.com/watch?v=jKV1m8APttU
