Slide 1

Compiling Python with Numba
Travis E. Oliphant, PhD, Continuum Analytics, Inc.
Workshop on HPC in Finance, Supercomputing 2013
November 18, 2013

Slide 2

Where I'm coming from (Before / After)

$\rho_0\,(2\pi f)^2\, U_i(a, f) = \left[\, C_{ijkl}(a, f)\, U_{k,l}(a, f) \,\right]_{,j}$

Slide 3

Founding sponsors of NumFOCUS and creators of PyData Spyder

Slide 4

Enterprise Python: Scientific Computing, Data Processing, Data Analysis, Visualisation, Scalable Computing
• Products
• Training
• Support
• Consulting
Now: Improve Python for Enterprise
Wakari

Slide 5

What is Python?

"Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together."
http://www.python.org/doc/essays/blurb.html

Created in 1990 by Guido van Rossum -- inspired by teaching languages

Slide 6

How many people are using Python?
• Tens of millions of downloads from python.org each year
  • Does not include distribution downloads or pre-installed systems
• #10 most popular tag on Stack Overflow
• Consistently ranks 6-8 in the TIOBE index
• Has both developers and domain users

Slide 7

A sample of users

Slide 8

Why Python?

License
• Free and Open Source, Permissive License
Community
• Broad and friendly community
• Over 36,000 packages on PyPI
• Commercial Support
• Many conferences (PyData, SciPy, PyCon...)
Readable Syntax
• Executable pseudo-code
• Can understand and edit code a year later
• Fun to develop
• Use of Indentation
IPython
• Interactive prompt on steroids (Notebook)
• Allows less working memory
• Allows failing quickly for exploration
Modern Constructs
• List comprehensions
• Iterator protocol and generators
• Meta-programming
• Introspection
• (JIT Compiler and Concurrency)
Batteries Included
• Internet (FTP, HTTP, SMTP, XMLRPC)
• Compression and Databases
• Logging, unit-tests
• Glue for other languages
• Distribution has much, much more....

Slide 9

Why Python for Technical Computing
• Syntax (it gets out of your way)
• Over-loadable operators
• Complex numbers built-in early
• Just enough language support for arrays
• "Occasional" programmers can grok it
• Supports multiple programming styles
• Expert programmers can also use it effectively
• Has a simple, extensible implementation
• General-purpose language --- can build a system
• Critical mass

Slide 10

Python supports a developer spectrum (a unique aspect of Python)

Occasional
• Cut and paste
• Modify a few variables
• Call some functions
• Typical Quant or Engineer who doesn't become a programmer

Scientist Developer
• Extend frameworks
• Builds new objects
• Wraps code
• Quant / Engineer with decent developer skill

Developer
• Creates frameworks
• Creates compilers
• Typical CS grad
• Knows multiple languages

Slide 11

Easy to Learn

What can be learned in 3 hours:
• types: bool, int, str, float, complex, lists, tuples, sets, dicts
• control: for-loops, while-loops, try-except, if-elif-else
• organization: functions, classes, modules, packages

What can be learned in a (dedicated) day:
• python idioms
• methods on common objects
• tour of the library
• special methods (__add__, __getitem__, ...)
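
A minimal sketch touching several of the constructs listed above (a dict, a for-loop, an if-test, a function with a default argument); the data is made up for illustration:

# a dict (str -> float), a function, a loop, and a conditional in a few lines
prices = {"AAPL": 524.0, "MSFT": 37.5, "IBM": 182.0}

def below(limit=100.0):
    """Return the symbols whose price is below limit, sorted."""
    cheap = []
    for symbol, price in prices.items():
        if price < limit:
            cheap.append(symbol)
    return sorted(cheap)

print(below())        # ['MSFT']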

Slide 12

NumPy Array is a "typed container" [diagram: ndarray structure with shape and a typed data buffer]
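
A minimal illustration of the "typed container" idea, using only NumPy:

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(a.shape)    # (3, 4)   -- the container's dimensions
print(a.dtype)    # float64  -- one machine type shared by every element
print(a.strides)  # (32, 8)  -- byte steps used to walk the single data buffer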

Slide 13

NumPy Stack -- about 2,000,000 users NumPy SciPy Pandas Matplotlib scikit-learn scikit-image statsmodels PyTables OpenCV Cython Numba SymPy NumExpr astropy BioPython GDAL PySAL ... many many more ...

Slide 14

Code that users want to write

$x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$

$O = I \ast F$

Slow!!!!
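
As a concrete illustration, the first expression written as the pure-Python loops a user would naturally reach for; the nested interpreted loop is exactly what ends up being slow (array sizes are placeholders):

import numpy as np

def compute_x(k, a):
    # x[i] = sum over j < i of k[i-j, j] * a[i-j] * a[j]
    n = a.shape[0]
    x = np.zeros(n)
    for i in range(n):
        for j in range(i):
            x[i] += k[i - j, j] * a[i - j] * a[j]
    return x

a = np.random.rand(200)
k = np.random.rand(200, 200)
x = compute_x(k, a)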

Slide 15

Why is Python slow?
1. Dynamic typing
2. Attribute lookups
3. NumPy get-item (a[...])
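
A small sketch of points 1 and 3: indexing a NumPy array element-by-element pays the dynamic-typing and get-item cost on every access, while a vectorized call pays it once (the function names here are illustrative):

import numpy as np

def sum_sq_loop(a):
    total = 0.0
    for i in range(len(a)):      # every a[i] boxes a Python float object
        total += a[i] * a[i]
    return total

def sum_sq_vec(a):
    return float(np.dot(a, a))   # one call, one typed C loop

a = np.random.rand(1000000)
# %timeit sum_sq_loop(a)  vs  %timeit sum_sq_vec(a): the loop is typically
# orders of magnitude slower, for the reasons listed above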

Slide 16

What are Quants doing now?
• Writing critical parts in C/C++/Fortran and "wrapping" with:
  • SWIG
  • ctypes
  • Cython
  • f2py (or fwrap)
  • hand-coded wrappers
• Writing new code in Cython directly
  • Cython is "modified Python" with type information everywhere
  • It produces a C-extension module which is then compiled
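
A tiny sketch of the ctypes route listed above (assuming a Unix-like system where the C math library can be located):

import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # load compiled C library
libm.cos.argtypes = [ctypes.c_double]              # declare the C signature
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))   # 1.0, computed by the C routine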

Slide 17

Requirements for Python compiler
• Work with CPython (we need the full scientific Python stack!)
• Minimal modifications to code (use type inference)
• Programmer control over what and when to "jit"
• Ability to build static extensions (for libraries)
• Fall back to Python C-API for "object" types

Slide 18

Requirements, Part II
• Produce code as fast as Fortran
• Support NumPy array expressions and be able to produce universal functions (e.g. y = sin(x))
• Provide a tool that could adapt to provide parallelism and produce code for modern vector hardware (GPUs, accelerators, and many-core machines)

Slide 19

Do we have to write the full compiler? No!
LLVM has done much of the heavy lifting.
LLVM = compilers for everybody

Slide 20

Face of a modern compiler

Front-End (Parsing): C, C++, Fortran, ObjC
  -> Intermediate Representation (IR) ->
Back-End (Code Generation): x86, ARM, PTX

Slide 21

Face of a modern compiler

Front-End (Parsing): Python, via Numba
  -> Intermediate Representation (IR), via LLVM ->
Back-End (Code Generation): x86, ARM, PTX

Slide 22

Example Numba

Slide 23

NumPy + Mamba = Numba

[Diagram: Python Function -> LLVMPY -> LLVM Library -> Machine Code; LLVM backers and targets include Intel, Nvidia, Apple, AMD, ARM, OpenCL, ISPC, CUDA, CLANG, OpenMP]

Slide 24

Simple API
• jit --- provide type information (fastest to call at run-time)
• autojit --- detects input types, infers output, generates code if needed, and dispatches (a little more run-time call overhead)

#@jit('void(double[:,:], double, double)')
@autojit
def numba_update(u, dx2, dy2):
    nx, ny = u.shape
    for i in xrange(1, nx-1):
        for j in xrange(1, ny-1):
            u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                      (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

Comment out one of jit or autojit (don't use together).
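
A minimal way to call the function above (the grid size and boundary values are placeholders):

import numpy as np

u = np.zeros((100, 100))
u[0, :] = 1.0                 # a boundary condition, just for illustration
dx2 = dy2 = 0.01 ** 2
numba_update(u, dx2, dy2)     # first call compiles; later calls run at native speed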

Slide 25

Example

import numba
from math import sin, pi

@numba.jit('f8(f8)')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

Slide 26

~150x speed-up: real-time image processing (50 fps Mandelbrot)

Slide 27

Speeding up Math Expressions

$x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
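
A sketch of how this expression can be jitted with the decorator shown earlier, writing into a preallocated output array (names and shapes are assumptions):

import numpy as np
from numba import jit

@jit('void(f8[:,:], f8[:], f8[:])')
def compute_x(k, a, x):
    n = a.shape[0]
    for i in range(n):
        x[i] = 0.0
        for j in range(i):
            x[i] += k[i - j, j] * a[i - j] * a[j]

a = np.random.rand(500)
k = np.random.rand(500, 500)
x = np.empty_like(a)
compute_x(k, a, x)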

Slide 28

Image Processing (~1500x speed-up)

@jit('void(f8[:,:], f8[:,:], f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i,j] = result
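
A minimal driver for the filter above (the image and the 5x5 averaging kernel are placeholders; the untouched border of output stays at zero):

import numpy as np

image = np.random.rand(512, 512)
filt = np.ones((5, 5)) / 25.0
output = np.zeros_like(image)
filter(image, filt, output)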

Slide 29

Compile NumPy array expressions

from numba import autojit

@autojit
def formula(a, b, c):
    a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

@autojit
def express(m1, m2):
    m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                               m1[-2:1:-2, ..., ::2])
    return m2
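
One way to exercise these functions; the shapes below are only assumptions chosen to satisfy the slicing patterns:

import numpy as np

a = np.ones((4, 4)); b = np.ones((4, 4)); c = np.ones((4, 4))
formula(a, b, c)                 # a[1:,1:] becomes 3.0

m1 = np.random.rand(6, 4)        # 2-D input
m2 = np.zeros((6, 3, 4))         # 3-D output
express(m1, m2)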

Slide 30

Fast vectorize

NumPy's ufuncs take "kernels" and apply the kernel element-by-element over entire arrays. Write kernels in Python!

from numba.vectorize import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'])
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)
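
Once compiled, the result behaves like any other NumPy ufunc, so array arguments and broadcasting work as usual:

import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
y = sinc(x)                  # element-wise over the whole array, in compiled code
print(y.shape, y.dtype)      # (1001,) float64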

Slide 31

Case-study -- j0 from scipy.special
• scipy.special was one of the first libraries I wrote
• extended the "umath" module by adding new "universal functions" to compute many scientific functions by wrapping C and Fortran libs
• Bessel functions are solutions to a differential equation:

$x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + \left(x^2 - \alpha^2\right) y = 0, \qquad y = J_\alpha(x)$

$J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos\!\left(n\tau - x \sin(\tau)\right) d\tau$

Slide 32

scipy.special.j0 wraps cephes algorithm
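
The next slide compares a Numba-vectorized vj0 against scipy.special.j0. The deck does not show the kernel's source; the sketch below only illustrates the wrapping pattern, using the integral form of J0 from the previous slide with a crude midpoint rule rather than the cephes algorithm:

from numba.vectorize import vectorize
from math import sin, cos, pi

@vectorize(['f8(f8)'])
def vj0(x):
    # J_0(x) = (1/pi) * integral from 0 to pi of cos(x*sin(t)) dt
    n = 200
    h = pi / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += cos(x * sin(t))
    return total * h / pi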

Slide 33

Result --- equivalent to compiled code

In [6]: %timeit vj0(x)
10000 loops, best of 3: 75 us per loop

In [7]: from scipy.special import j0
In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop

But! Now the code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!

Slide 34

Laplace Example
Adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran

@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
    nx, ny = u.shape
    for i in xrange(1, nx-1):
        for j in xrange(1, ny-1):
            u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                      (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
    u[1:-1,1:-1] = ((u[2:,1:-1] + u[:-2,1:-1]) * dy2 +
                    (u[1:-1,2:] + u[1:-1,:-2]) * dx2) / (2*(dx2+dy2))
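
A minimal benchmark driver in the spirit of the PerformancePython page (grid size, spacing, and sweep count are assumptions):

import numpy as np

nx = ny = 500
dx2 = dy2 = (1.0 / (nx - 1)) ** 2
u = np.zeros((nx, ny))
u[0, :] = 1.0                  # fixed boundary along one edge

for _ in range(100):           # relaxation sweeps toward the Laplace solution
    numba_update(u, dx2, dy2)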

Slide 35

Results of Laplace example (https://github.com/teoliphant/speed.git)

Version          Time    Speed Up
NumPy            3.19    1.0
Numba            2.32    1.38
Vect. Numba      2.33    1.37
Cython           2.38    1.34
Weave            2.47    1.29
Numexpr          2.62    1.22
Fortran Loops    2.30    1.39
Vect. Fortran    1.50    2.13

Slide 36

Numba can change the game!

[Diagram: Python (Numba) alongside C, C++, and Fortran front-ends feeding LLVM IR, which targets x86, ARM, PTX]

Numba turns Python into a "compiled language" (but much more flexible). You don't have to reach for C/C++.

Slide 37

New Numba Language
• Jit classes (zero-cost abstraction)
• Integrated compiled language with the Python run-time for glue
• Struct support (NumPy arrays can be structs)
• SSA --- can refer to local variables as different types
• Can create typed lists, typed dictionaries, and typed sets
• Pointer support
• Calling ctypes and CFFI functions natively
• Create stand-alone dynamic library and executable
• Create static extension module for Python
• Some support for GPUs in open source coming in Q1 of 2014 (see Accelerate for GPU support today)

Slide 38

Example

@sjit('StaticTuple[a, b]')
class StaticTuple(object):
    layout = [('hd', 'a'), ('tl', 'b')]

    @jit
    def __init__(self, hd, tl):
        self.hd = hd
        self.tl = tl

    @jit('a -> b : integral -> c')
    def __getitem__(self, item):
        if item == 0:
            return self.hd
        else:
            return self.tl[item - 1]

    @jit('a -> Iterator[T]')
    def __iter__(self):
        yield self.hd
        for x in self.tl:
            yield x

    @jit('a -> int64')
    def __len__(self):
        if self.hd is None:
            return 0
        elif self.tl is None:
            return 1
        else:
            return len(self.tl) + 1

    @jit('a -> StaticTuple[t1, t2] -> c')
    def __add__(self, other):
        if self.tl is None:
            return StaticTuple(self.hd, other)
        else:
            return StaticTuple(self.hd, self.tl + other)

    @jit('a -> str')
    def __repr__(self):
        return '(%s)' % ", ".join(map(str, self))

    def element_type(self):
        if self.hd is None:
            raise TypeError("Cannot compute element type of empty tuple!")
        else:
            type = typeof(self.hd)
            if self.tl is not None:
                type = promote(type, self.tl.element_type())
            return type

    @staticmethod
    def fromobject(tuple, type):
        head, tail = type.parameters
        hd = fromobject(tuple[0], head)
        if tuple[1:]:
            tl = fromobject(tuple[1:], tail)
        else:
            tl = EmptyTuple()
        return StaticTuple(hd, tl)

    @staticmethod
    def toobject(value, type):
        head, tail = type.parameters
        hd = toobject(value.hd, head)
        if isinstance(value.tl, EmptyTuple):
            return (hd,)
        return (hd,) + toobject(value.tl, tail)

    @jit('a -> Tuple[T] -> Tuple[T]')
    def __add__(self, other):
        result = List[T]()
        for x in self:
            result.append(x)
        return tuple(result) + other

    @jit('a -> a -> bool')
    def __eq__(self, other):
        return self.hd == other.hd and self.tl == other.tl

Slide 39

Numba Development (pictured: Siu, Mark, Jon)

git log --format=format:%an | sort | uniq -c | sort -r

1852 Mark Florisson
 252 Jon Riehl
 224 Siu Kwan Lam
 112 Travis E. Oliphant
  96 Jay Bourque
  68 Hernan Grecco
  30 Dag Sverre Seljebotn
  22 Ilan Schnell
  11 Mark Wiebe
  11 Björn Linse
  10 majidaldo
   8 James Bergstra
   8 Gaëtan de Menten
   4 Francesc Alted
   4 Alberto Valverde
   3 Thomas Kluyver
   3 GFKjunior
   3 Christoph Gohlke
   ...

Slide 40

GPU and Parallel Support
Fast development and fast execution!
Python and NumPy compiled to parallel architectures (GPUs and multi-core machines)

$ conda install accelerate

Slide 41

Examples

from numbapro import vectorize
from math import sin, pi

@vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
def sinc(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

@vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
def sinc2(x):
    if x == 0.0:
        return 1.0
    else:
        return sin(x*pi)/(pi*x)

Slide 42

Example

from numbapro import vectorize

sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

@vectorize([sig], target='gpu')
def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    x = tid % width
    y = tid / width
    real = min_x + x * pixel_size_x
    imag = min_y + y * pixel_size_y
    c = complex(real, imag)
    z = 0.0j
    for i in range(iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return 255

Timings (Tesla S2050):
Kind      Time      Speed-up
Python    263.6     1.0x
CPU       2.639     100x
GPU       0.1676    1573x
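
One way to drive the kernel above, mapping one thread id per pixel (the image size and viewing window are assumptions):

import numpy as np

width, height, iters = 1024, 768, 20
tids = np.arange(width * height, dtype=np.uint32)
pixels = mandel(tids,
                np.float32(-2.0), np.float32(1.0),    # min_x, max_x
                np.float32(-1.0), np.float32(1.0),    # min_y, max_y
                np.uint32(width), np.uint32(height), np.uint32(iters))
image = pixels.reshape(height, width)   # uint8 escape-iteration counts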

Slide 43

Example with cuFFT

from numbapro import cuda, vectorize
from numbapro.cudalib import cufft

@vectorize(['complex64(complex64, complex64)'], target='gpu')
def vmult(a, b):
    return a * b

# host -> device
d_img = cuda.to_device(img)      # image
d_fltr = cuda.to_device(fltr)    # filter

# FFT forward
cufft.fft_inplace(d_img)
cufft.fft_inplace(d_fltr)

# multiply
vmult(d_img, d_fltr, out=d_img)  # in place

# FFT inverse
cufft.ifft_inplace(d_img)

# device -> host
filtered_img = d_img.copy_to_host()

Slide 44

More examples:
• https://github.com/ContinuumIO/numbapro-examples
• http://devblogs.nvidia.com/parallelforall/numbapro-high-performance-python-cuda-acceleration/
• CUDA CAST #10: https://www.youtube.com/watch?v=jKV1m8APttU