Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Compiling Python with Numba

Compiling Python with Numba

Finance is full of numerical problems to solve rapidly where speed of execution *and* development are both important. Numba provides a mechanism to get native-code performance out of Python syntax which is already known for rapid development.

Travis E. Oliphant

November 18, 2013
Tweet

More Decks by Travis E. Oliphant

Other Decks in Technology

Transcript

  1. Compiling Python with Numba Travis E. Oliphant, PhD Continuum Analytics,

    Inc Workshop in HPC in Finance Supercomputing 2013 November 18, 2013
  2. Enterprise Python Scientific Computing Data Processing Data Analysis Visualisation Scalable

    Computing • Products • Training • Support • Consulting Now: Improve Python for Enterprise Wakari
  3. What is Python? Python is an interpreted, object-oriented, high-level programming

    language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. http://www.python.org/doc/essays/blurb.html Created in 1990 by Guido Van Rossum -- inspired by teaching languages
  4. How many people are using Python? • Tens of millions

    of downloads from python.org each year • Does not include distribution downloads or pre-installed systems. • #10 most popular tag on Stack Overflow • Consistently ranks 6-8 in Tiobe index • Has both developers and Domain users
  5. Why Python License Community Readable Syntax Modern Constructs Batteries Included

    Free and Open Source, Permissive License • Broad and friendly community • Over 36,000 packages on PyPI • Commercial Support • Many conferences (PyData, SciPy, PyCon...) • Executable pseudo-code • Can understand and edit code a year later • Fun to develop • Use of Indentation IPython • Interactive prompt on steroids (Notebook) • Allows less working memory • Allows failing quickly for exploration • List comprehensions • Iterator protocol and generators • Meta-programming • Introspection • (JIT Compiler and Concurrency) • Internet (FTP, HTTP, SMTP, XMLRPC) • Compression and Databases • Logging, unit-tests • Glue for other languages • Distribution has much, much more....
  6. Why Python for Technical Computing • Syntax (it gets out

    of your way) • Over-loadable operators • Complex numbers built-in early • Just enough language support for arrays • “Occasional” programmers can grok it • Supports multiple programming styles • Expert programmers can also use it effectively • Has a simple, extensible implementation • General-purpose language --- can build a system • Critical mass
  7. Python supports a developer spectrum Developer Occasional Scientist Developer •

    Cut and paste • Modify a few variables • Call some functions • Typical Quant or Engineer who doesn’t become programmer • Extend frameworks • Builds new objects • Wraps code • Quant / Engineer with decent developer skill • Creates frameworks • Creates compilers • Typical CS grad • Knows multiple languages Unique aspect of Python
  8. Easy to Learn What can be learned in 3 hours

    What can be learned in a (dedicated) day types bool, int, str, float, complex, lists, tuples, sets, dicts control for-loops, while-loops, try-except, if-elif-else organization functions, classes, modules, packages python idioms methods on common objects tour of the library special methods (__add__, __getitem__, ...)
  9. NumPy Stack -- about 2,000,000 users NumPy SciPy Pandas Matplotlib

    scikit-learn scikit-image statsmodels PyTables OpenCV Cython Numba SymPy NumExpr astropy BioPython GDAL PySAL ... many many more ...
  10. Code that users want to write xi = i 1

    X j=0 ki j,jai jaj O = I ? F Slow!!!!
  11. What are Quants doing Now? • Writing critical parts in

    C/C++/Fortran and “wrapping” with • SWIG • ctypes • Cython • f2py (or fwrap) • hand-coded wrappers • Writing new code in Cython directly • Cython is “modified Python” with type information everywhere. • It produces a C-extension module which is then compiled
  12. Requirements for Python compiler • Work with CPython (we need

    the full scientific Python stack!) • Minimal modifications to code (use type inference) • Programmer control over what and when to “jit” • Ability to build static extensions (for libraries) • Fall back to Python C-API for “object” types.
  13. Requirements Part II • Produce code as fast as Fortran

    • Support NumPy array-expressions and be able to produce universal functions (e.g. y = sin(x)) • Provide a tool that could adapt to provide parallelism and produce code for modern vector hardware (GPUs, accelerators, and many-core machines)
  14. Do we have to write the full compiler?? No! LLVM

    has done much heavy lifting LLVM = Compilers for everybody
  15. Face of a modern compiler Intermediate Representation (IR) x86 C++

    ARM PTX C Fortran ObjC Parsing Code Generation Front-End Back-End
  16. Face of a modern compiler Intermediate Representation (IR) x86 ARM

    PTX Python Code Generation Back-End Numba LLVM Parsing Front-End
  17. NumPy + Mamba = Numba LLVM Library Intel Nvidia Apple

    AMD OpenCL ISPC CUDA CLANG OpenMP LLVMPY Python Function Machine Code ARM
  18. Simple API • jit --- provide type information (fastest to

    call at run-time) • autojit --- detects input types, infers output, generates code if needed, and dispatches (a little more run-time call overhead) #@jit('void(double[:,:], double, double)') @autojit def numba_update(u, dx2, dy2): nx, ny = u.shape for i in xrange(1,nx-1): for j in xrange(1, ny-1): u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 + (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2)) Comment out one of jit or autojit (don’t use together)
  19. Image Processing @jit('void(f8[:,:],f8[:,:],f8[:,:])') def filter(image, filt, output): M, N =

    image.shape m, n = filt.shape for i in range(m//2, M-m//2): for j in range(n//2, N-n//2): result = 0.0 for k in range(m): for l in range(n): result += image[i+k-m//2,j+l-n//2]*filt[k, l] output[i,j] = result ~1500x speed-up
  20. Compile NumPy array expressions from numba import autojit @autojit def

    formula(a, b, c): a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1] @autojit def express(m1, m2): m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] * m1[-2:1:-2,...,::2]) return m2
  21. Fast vectorize NumPy’s ufuncs take “kernels” and apply the kernel

    element-by-element over entire arrays Write kernels in Python! from numba.vectorize import vectorize from math import sin @vectorize([‘f8(f8)’, ‘f4(f4)’]) def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)
  22. Case-study -- j0 from scipy.special • scipy.special was one of

    the first libraries I wrote • extended “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs. • Bessel functions are solutions to a differential equation: x 2 d 2 y dx 2 + x dy dx + ( x 2 ↵ 2) y = 0 y = J↵ ( x ) Jn (x) = 1 ⇡ Z ⇡ 0 cos (n⌧ x sin (⌧)) d⌧
  23. Result --- equivalent to compiled code In [6]: %timeit vj0(x)

    10000 loops, best of 3: 75 us per loop In [7]: from scipy.special import j0 In [8]: %timeit j0(x) 10000 loops, best of 3: 75.3 us per loop But! Now code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
  24. Laplace Example @jit('void(double[:,:], double, double)') def numba_update(u, dx2, dy2): nx,

    ny = u.shape for i in xrange(1,nx-1): for j in xrange(1, ny-1): u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 + (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2)) Adapted from http://www.scipy.org/PerformancePython originally by Prabhu Ramachandran @jit('void(double[:,:], double, double)') def numbavec_update(u, dx2, dy2): u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 + (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
  25. Results of Laplace example Version Time Speed Up NumPy 3.19

    1.0 Numba 2.32 1.38 Vect. Numba 2.33 1.37 Cython 2.38 1.34 Weave 2.47 1.29 Numexpr 2.62 1.22 Fortran Loops 2.30 1.39 Vect. Fortran 1.50 2.13 https://github.com/teoliphant/speed.git
  26. Numba can change the game! LLVM IR x86 C++ ARM

    PTX C Fortran Numba Numba turns Python into a “compiled language” (but much more flexible). You don’t have to reach for C/C++
  27. New Numba Language • Jit classes (zero-cost abstraction) • Integrated

    compiled language with Python run-time for glue • Struct support (NumPy arrays can be structs) • SSA --- can refer to local variables as different types • Can create typed lists, typed dictionaries, and typed sets • pointer support • calling ctypes and CFFI functions natively • create stand-alone dynamic library and executable • create static extension module for Python • some support for GPUs in open source coming in Q1 of 2014 (see Accelerate for GPU support today)
  28. @sjit('StaticTuple[a,  b]') class  StaticTuple(object):        layout  =  [('hd',

     'a'),  ('tl',  'b')]        @jit        def  __init__(self,  hd,  tl):                self.hd  =  hd                self.tl  =  tl        @jit('a  -­‐>  b  :  integral  -­‐>  c')        def  __getitem__(self,  item):                if  item  ==  0:                        return  self.hd                else:                        return  self.tl[item  -­‐  1]        @jit('a  -­‐>  Iterator[T]')        def  __iter__(self):                yield  self.hd                for  x  in  self.tl:                        yield  x        @jit('a  -­‐>  int64')        def  __len__(self):                if  self.hd  is  None:                        return  0                elif  self.tl  is  None:                        return  1                else:                        return  len(self.tl)  +  1        @jit('a  -­‐>  StaticTuple[t1,  t2]  -­‐>  c')        def  __add__(self,  other):                if  self.tl  is  None:                        return  StaticTuple(self.hd,  other)                else:                        return  StaticTuple(self.hd,  self.tl  +  other)      @jit('a  -­‐>  str')        def  __repr__(self):                return  '(%s)'  %  ",  ".join(map(str,  self))        def  element_type(self):                if  self.hd  is  None:                        raise  TypeError("Cannot  compute  element  type  of  empty  tuple!")                else:                        type  =  typeof(self.hd)                        if  self.tl  is  not  None:                                type  =  promote(type,  self.tl.element_type())                        return  type        @staticmethod        def  fromobject(tuple,  type):                head,  tail  =  type.parameters                hd  =  fromobject(tuple[0],  head)                if  tuple[1:]:                        tl  =  fromobject(tuple[1:],  tail)                else:                        tl  =  EmptyTuple()                return  StaticTuple(hd,  tl)        @staticmethod        def  toobject(value,  type):                head,  tail  =  type.parameters                hd  =  toobject(value.hd,  head)                if  isinstance(value.tl,  EmptyTuple):                        return  (hd,)                return  (hd,)  +  toobject(value.tl,  tail)        @jit('a  -­‐>  Tuple[T]  -­‐>  Tuple[T]')        def  __add__(self,  other):                result  =  List[T]()                for  x  in  self:                        result.append(x)                return  tuple(result)  +  other        @jit('a  -­‐>  a  -­‐>  bool')        def  __eq__(self,  other):                return  self.hd  ==  other.hd  and  self.tl  ==  other.tl Example
  29. Numba Development 1852 Mark Florisson 252 Jon Riehl 224 Siu

    Kwan Lam 112 Travis E. Oliphant 96 Jay Bourque 68 Hernan Grecco 30 Dag Sverre Seljebotn 22 Ilan Schnell 11 Mark Wiebe 11 Björn Linse 10 majidaldo 8 James Bergstra 8 Gaëtan de Menten 4 Francesc Alted 4 Alberto Valverde 3 Thomas Kluyver 3 GFKjunior 3 Christoph Gohlke ... git log --format=format:%an | sort | uniq -c | sort -r Siu Mark Jon
  30. GPU and Parallel Support fast development and fast execution! Python

    and NumPy compiled to Parallel Architectures (GPUs and multi-core machines) $ conda install accelerate
  31. Examples from numbapro import vectorize from math import sin @vectorize([‘f8(f8)’,

    ‘f4(f4)’], target=‘gpu’) def sinc(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x) @vectorize([‘f8(f8)’, ‘f4(f4)’], target=‘parallel’) def sinc2(x): if x==0.0: return 1.0 else: return sin(x*pi)/(pi*x)
  32. Example from numbapro import vectorize sig = 'uint8(uint32, f4, f4,

    f4, f4, uint32, uint32, uint32)' @vectorize([sig], target='gpu') def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters): pixel_size_x = (max_x - min_x) / width pixel_size_y = (max_y - min_y) / height x = tid % width y = tid / width real = min_x + x * pixel_size_x imag = min_y + y * pixel_size_y c = complex(real, imag) z = 0.0j for i in range(iters): z = z * z + c if (z.real * z.real + z.imag * z.imag) >= 4: return i return 255 Kind Time Speed-up Python 263.6 1.0x CPU 2.639 100x GPU 0.1676 1573x Tesla S2050
  33. Example with cuFFT # host -> device d_img = cuda.to_device(img)

    # image d_fltr = cuda.to_device(fltr) # filter # FFT forward cufft.fft_inplace(d_img) cufft.fft_inplace(d_fltr) # multply vmult(d_img, d_fltr, out=d_img) # inplace # FFT inverse cufft.ifft_inplace(d_img) # device -> host filted_img = d_img.copy_to_host() @vectorize(['complex64(complex64, complex64)'], target='gpu') def vmult(a, b): return a * b