
Compiling Python with Numba


Finance is full of numerical problems that must be solved rapidly, where speed of execution *and* speed of development both matter. Numba provides a mechanism to get native-code performance out of Python, a language already known for rapid development.

Travis E. Oliphant

November 18, 2013



Transcript

  1. Compiling Python with
    Numba
    Travis E. Oliphant, PhD
    Continuum Analytics, Inc
    Workshop in HPC in Finance
    Supercomputing 2013
    November 18, 2013


  2. Where I’m coming from
    After
    Before
    $\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$


  3. Founding sponsors of NumFOCUS and
    creators of PyData
    Spyder


  4. Enterprise
    Python
    Scientific
    Computing
    Data Processing
    Data Analysis
    Visualisation
    Scalable
    Computing
    • Products
    • Training
    • Support
    • Consulting
    Now: Improve Python for Enterprise
    Wakari


  5. What is Python?
    Python is an interpreted, object-oriented, high-level
    programming language with dynamic semantics. Its high-level
    built in data structures, combined with dynamic typing and
    dynamic binding, make it very attractive for Rapid Application
    Development, as well as for use as a scripting or glue language
    to connect existing components together.
    http://www.python.org/doc/essays/blurb.html
    Created in 1990 by Guido van Rossum
    -- inspired by teaching languages


  6. How many people are using Python?
    • Tens of millions of downloads from
    python.org each year
    • Does not include distribution downloads or
    pre-installed systems.
    • #10 most popular tag on Stack Overflow
    • Consistently ranks 6-8 in Tiobe index
    • Has both developers and Domain users


  7. A sample of users


  8. Why Python
    License: Free and Open Source, Permissive License
    Community:
    • Broad and friendly community
    • Over 36,000 packages on PyPI
    • Commercial Support
    • Many conferences (PyData, SciPy, PyCon...)
    Readable Syntax:
    • Executable pseudo-code
    • Can understand and edit code a year later
    • Fun to develop
    • Use of Indentation
    IPython:
    • Interactive prompt on steroids (Notebook)
    • Allows less working memory
    • Allows failing quickly for exploration
    Modern Constructs:
    • List comprehensions
    • Iterator protocol and generators
    • Meta-programming
    • Introspection
    • (JIT Compiler and Concurrency)
    Batteries Included:
    • Internet (FTP, HTTP, SMTP, XMLRPC)
    • Compression and Databases
    • Logging, unit-tests
    • Glue for other languages
    • Distribution has much, much more....


  9. Why Python for Technical Computing
    • Syntax (it gets out of your way)
    • Over-loadable operators
    • Complex numbers built-in early
    • Just enough language support for arrays
    • “Occasional” programmers can grok it
    • Supports multiple programming styles
    • Expert programmers can also use it effectively
    • Has a simple, extensible implementation
    • General-purpose language --- can build a system
    • Critical mass


  10. Python supports a developer spectrum
    Occasional:
    • Cut and paste
    • Modify a few variables
    • Call some functions
    • Typical Quant or Engineer who doesn’t become a programmer
    Scientist Developer:
    • Extends frameworks
    • Builds new objects
    • Wraps code
    • Quant / Engineer with decent developer skill
    Developer:
    • Creates frameworks
    • Creates compilers
    • Typical CS grad
    • Knows multiple languages
    Unique aspect of Python


  11. Easy to Learn
    What can be learned in 3 hours:
    • types: bool, int, str, float, complex, lists, tuples, sets, dicts
    • control: for-loops, while-loops, try-except, if-elif-else
    • organization: functions, classes, modules, packages
    What can be learned in a (dedicated) day:
    • python idioms
    • methods on common objects
    • tour of the library
    • special methods (__add__, __getitem__, ...)


  12. NumPy Array is “typed container”
    [diagram: ndarray structure, a data buffer plus metadata such as shape]
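    A minimal sketch (not from the slides) of what “typed container” means: every ndarray is one memory buffer described by a dtype plus shape/strides metadata, which is exactly the information a compiler needs to generate typed loops.

    import numpy as np

    a = np.arange(12, dtype=np.float64).reshape(3, 4)  # one contiguous buffer
    print(a.dtype)     # float64 : every element has the same type
    print(a.shape)     # (3, 4)  : logical dimensions
    print(a.strides)   # (32, 8) : bytes to step per dimension
    print(a.itemsize)  # 8       : bytes per element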


  13. NumPy Stack -- about 2,000,000 users
    NumPy
    SciPy Pandas Matplotlib
    scikit-learn
    scikit-image statsmodels
    PyTables
    OpenCV
    Cython
    Numba SymPy NumExpr
    astropy BioPython GDAL
    PySAL
    ... many many more ...


  14. Code that users want to write
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
    $O = I \star F$
    Slow!!!!


  15. Why is Python slow?
    1. Dynamic typing
    2. Attribute lookups
    3. NumPy get-item (a[...])
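    A hedged illustration (not from the slides) of where those costs show up: in a plain-Python loop every iteration re-resolves the types of the operands, looks up methods dynamically, and boxes each scalar that a[i] pulls out of the array.

    import numpy as np

    def py_sum_squares(a):
        # each a[i] becomes a boxed Python float; '+' and '*' are
        # dispatched dynamically on every single iteration
        total = 0.0
        for i in range(len(a)):
            total += a[i] * a[i]
        return total

    a = np.random.rand(1000000)
    # compare: %timeit py_sum_squares(a)   vs   %timeit np.dot(a, a)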


  16. What are Quants doing Now?
    • Writing critical parts in C/C++/Fortran and “wrapping” with:
      • SWIG
      • ctypes (see the sketch below)
      • Cython
      • f2py (or fwrap)
      • hand-coded wrappers
    • Writing new code in Cython directly
      • Cython is “modified Python” with type information everywhere.
      • It produces a C-extension module which is then compiled.
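    A minimal ctypes sketch (not from the slides) of the “wrap compiled code” pattern: load a shared library and declare argument/return types so Python can call the native function directly. The standard C math library is used here so nothing new has to be compiled; locating it is platform-dependent.

    import ctypes
    import ctypes.util

    # find and load the C math library (name differs per platform)
    libm = ctypes.CDLL(ctypes.util.find_library('m'))

    # declare the C signature: double cos(double)
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))   # 1.0, computed by the C library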


  17. Requirements for Python compiler
    • Work with CPython (we need the full scientific
    Python stack!)
    • Minimal modifications to code (use type inference)
    • Programmer control over what and when to “jit”
    • Ability to build static extensions (for libraries)
    • Fall back to Python C-API for “object” types.


  18. Requirements Part II
    • Produce code as fast as Fortran
    • Support NumPy array-expressions and be able to
    produce universal functions (e.g. y = sin(x))
    • Provide a tool that could adapt to provide
    parallelism and produce code for modern vector
    hardware (GPUs, accelerators, and many-core
    machines)


  19. Do we have to write the full compiler??
    No!
    LLVM has done much of the heavy lifting.
    LLVM = compilers for everybody


  20. Face of a modern compiler
    Front-End (Parsing): C, C++, Fortran, ObjC
    → Intermediate Representation (IR)
    → Back-End (Code Generation): x86, ARM, PTX


  21. Face of a modern compiler
    Front-End (Parsing): Python, handled by Numba
    → Intermediate Representation (IR)
    → Back-End (Code Generation), handled by LLVM: x86, ARM, PTX


  22. Example
    Numba


  23. NumPy + Mamba = Numba
    [diagram: LLVM Library (Intel, Nvidia, Apple, AMD; OpenCL, ISPC, CUDA, CLANG, OpenMP) + LLVMPY: Python Function → Machine Code (ARM, ...)]


  24. Simple API

    • jit --- provide type information (fastest to call at run-time)
    • autojit --- detects input types, infers output, generates code if needed, and dispatches (a little more run-time call overhead)

    #@jit('void(double[:,:], double, double)')
    @autojit
    def numba_update(u, dx2, dy2):
        nx, ny = u.shape
        for i in xrange(1, nx-1):
            for j in xrange(1, ny-1):
                u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                          (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

    Comment out one of jit or autojit (don’t use together).
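    A usage sketch (grid size and values are illustrative): the first call with float64 arrays triggers type inference and compilation; later calls with the same argument types reuse the cached machine code.

    import numpy as np

    u = np.zeros((100, 100))
    u[0, :] = 100.0              # boundary condition on one edge
    dx2 = dy2 = 0.01 ** 2

    numba_update(u, dx2, dy2)    # first call: infer types, compile, run
    numba_update(u, dx2, dy2)    # later calls: straight to machine code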


  25. Example
    import numba
    from math import sin, pi

    @numba.jit('f8(f8)')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
    Numba
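    Usage is unchanged from ordinary Python (a sketch; NumPy is only needed for the comparison): the compiled function handles the removable singularity at 0 and agrees with NumPy’s normalized sinc.

    import numpy as np

    print(sinc(0.0))                              # 1.0
    print(sinc(0.5))                              # ~0.63662
    print(np.allclose(sinc(0.5), np.sinc(0.5)))   # True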


  26. ~150x speed-up: Real-time image processing (50 fps Mandelbrot)


  27. Speeding up Math Expressions
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
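    One way to compile that recurrence with Numba (a sketch, not from the slides; the function name and signature are illustrative, and the output array is passed in, mirroring the filter example on the next slide):

    from numba import jit

    @jit('void(f8[:,:], f8[:], f8[:])')
    def correlate(k, a, x):
        # x[i] = sum over j < i of k[i-j, j] * a[i-j] * a[j]
        n = a.shape[0]
        for i in range(n):
            total = 0.0
            for j in range(i):
                total += k[i - j, j] * a[i - j] * a[j]
            x[i] = total

    # usage: k is an (n, n) coefficient array; a and x are length-n vectors
    # n = 1000; k = np.random.rand(n, n); a = np.random.rand(n); x = np.empty(n)
    # correlate(k, a, x)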


  28. Image Processing
    from numba import jit

    @jit('void(f8[:,:], f8[:,:], f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                output[i,j] = result

    ~1500x speed-up
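    A usage sketch (array sizes and filter contents are illustrative): the output array is allocated by the caller and filled in place by the compiled loops.

    import numpy as np

    image = np.random.rand(512, 512)
    filt = np.ones((5, 5)) / 25.0     # simple 5x5 box blur
    output = np.zeros_like(image)

    filter(image, filt, output)       # compiled nested loops fill `output`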


  29. Compile NumPy array expressions
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:,1:] = a[1:,1:] + b[1:,:-1] + c[1:,:-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2,0,...,::2] = (m1[1:-1:2,...,::2] *
                                m1[-2:1:-2,...,::2])
        return m2
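    A short usage sketch for the first function (shapes are illustrative): the sliced expression compiles to loops over the array views, updating a in place just as plain NumPy would.

    import numpy as np

    a = np.ones((4, 4))
    b = np.arange(16.0).reshape(4, 4)
    c = 2.0 * np.ones((4, 4))

    formula(a, b, c)    # writes into a[1:, 1:] in place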


  30. Fast vectorize
    NumPy’s ufuncs take “kernels” and apply the kernel element-by-element over entire arrays.
    Write kernels in Python!

    from numba.vectorize import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
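    Because @vectorize builds a real NumPy ufunc, the Python kernel is applied elementwise over whole arrays with no Python-level loop, and dtype-based dispatch and broadcasting come for free (a usage sketch):

    import numpy as np

    x = np.linspace(-5, 5, 1001)
    y = sinc(x)                          # elementwise, in compiled code
    y32 = sinc(x.astype(np.float32))     # picks the 'f4(f4)' specialization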


  31. Case-study -- j0 from scipy.special
    • scipy.special was one of the first libraries I wrote
    • extended “umath” module by adding new
    “universal functions” to compute many scientific
    functions by wrapping C and Fortran libs.
    • Bessel functions are solutions to a differential equation:
    $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0, \qquad y = J_\alpha(x)$
    $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$
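    The integral form above suggests a pure-Python kernel to hand to @vectorize (using the numba.vectorize API shown on the earlier slide). This is a simple quadrature sketch, not the Cephes algorithm that scipy.special.j0 actually wraps, so its accuracy and speed are only illustrative:

    from math import cos, sin, pi
    from numba.vectorize import vectorize

    @vectorize(['f8(f8)'])
    def j0_quad(x):
        # trapezoidal rule on J0(x) = (1/pi) * integral_0..pi of cos(x*sin(t)) dt
        n = 200
        h = pi / n
        total = 0.5 * (cos(0.0) + cos(x * sin(pi)))
        for k in range(1, n):
            total += cos(x * sin(k * h))
        return total * h / pi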


  32. scipy.special.j0 wraps the Cephes algorithm


  33. Result --- equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop
    In [7]: from scipy.special import j0
    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
    But! Now code is in Python and can be
    experimented with more easily (and moved to
    the GPU / accelerator more easily)!


  34. Laplace Example
    @jit('void(double[:,:], double, double)')
    def numba_update(u, dx2, dy2):
        nx, ny = u.shape
        for i in xrange(1, nx-1):
            for j in xrange(1, ny-1):
                u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                          (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

    Adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran

    @jit('void(double[:,:], double, double)')
    def numbavec_update(u, dx2, dy2):
        u[1:-1,1:-1] = ((u[2:,1:-1] + u[:-2,1:-1])*dy2 +
                        (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
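    A driver sketch for either kernel (grid size, boundary values, and sweep count are illustrative; the measured benchmark lives in the speed.git repo on the next slide):

    import numpy as np

    nx = ny = 150
    dx = dy = 0.1
    dx2, dy2 = dx * dx, dy * dy

    u = np.zeros((nx, ny))
    u[0, :] = 1.0                      # fixed boundary along one edge

    for sweep in range(8000):          # repeated relaxation sweeps
        numba_update(u, dx2, dy2)      # or: numbavec_update(u, dx2, dy2)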


  35. Results of Laplace example
    Version Time Speed Up
    NumPy 3.19 1.0
    Numba 2.32 1.38
    Vect. Numba 2.33 1.37
    Cython 2.38 1.34
    Weave 2.47 1.29
    Numexpr 2.62 1.22
    Fortran Loops 2.30 1.39
    Vect. Fortran 1.50 2.13
    https://github.com/teoliphant/speed.git


  36. Numba can change the game!
    [diagram: C, C++, Fortran, and Numba (Python) all emit LLVM IR, which is compiled to x86, ARM, and PTX]
    Numba turns Python into a “compiled language” (but much more flexible). You don’t have to reach for C/C++.


  37. New Numba Language
    • Jit classes (zero-cost abstraction)
    • Integrated compiled language with Python run-time for glue
    • Struct support (NumPy arrays can be structs)
    • SSA --- can refer to local variables as different types
    • Can create typed lists, typed dictionaries, and typed sets
    • pointer support
    • calling ctypes and CFFI functions natively
    • create stand-alone dynamic library and executable
    • create static extension module for Python
    • some support for GPUs in open source coming in Q1 of
    2014 (see Accelerate for GPU support today)


  38. Example
    @sjit('StaticTuple[a, b]')
    class StaticTuple(object):
        layout = [('hd', 'a'), ('tl', 'b')]

        @jit
        def __init__(self, hd, tl):
            self.hd = hd
            self.tl = tl

        @jit('a -> b : integral -> c')
        def __getitem__(self, item):
            if item == 0:
                return self.hd
            else:
                return self.tl[item - 1]

        @jit('a -> Iterator[T]')
        def __iter__(self):
            yield self.hd
            for x in self.tl:
                yield x

        @jit('a -> int64')
        def __len__(self):
            if self.hd is None:
                return 0
            elif self.tl is None:
                return 1
            else:
                return len(self.tl) + 1

        @jit('a -> StaticTuple[t1, t2] -> c')
        def __add__(self, other):
            if self.tl is None:
                return StaticTuple(self.hd, other)
            else:
                return StaticTuple(self.hd, self.tl + other)

        @jit('a -> str')
        def __repr__(self):
            return '(%s)' % ", ".join(map(str, self))

        def element_type(self):
            if self.hd is None:
                raise TypeError("Cannot compute element type of empty tuple!")
            else:
                type = typeof(self.hd)
                if self.tl is not None:
                    type = promote(type, self.tl.element_type())
                return type

        @staticmethod
        def fromobject(tuple, type):
            head, tail = type.parameters
            hd = fromobject(tuple[0], head)
            if tuple[1:]:
                tl = fromobject(tuple[1:], tail)
            else:
                tl = EmptyTuple()
            return StaticTuple(hd, tl)

        @staticmethod
        def toobject(value, type):
            head, tail = type.parameters
            hd = toobject(value.hd, head)
            if isinstance(value.tl, EmptyTuple):
                return (hd,)
            return (hd,) + toobject(value.tl, tail)

        @jit('a -> Tuple[T] -> Tuple[T]')
        def __add__(self, other):
            result = List[T]()
            for x in self:
                result.append(x)
            return tuple(result) + other

        @jit('a -> a -> bool')
        def __eq__(self, other):
            return self.hd == other.hd and self.tl == other.tl


  39. Numba Development
    1852 Mark Florisson
    252 Jon Riehl
    224 Siu Kwan Lam
    112 Travis E. Oliphant
    96 Jay Bourque
    68 Hernan Grecco
    30 Dag Sverre Seljebotn
    22 Ilan Schnell
    11 Mark Wiebe
    11 Björn Linse
    10 majidaldo
    8 James Bergstra
    8 Gaëtan de Menten
    4 Francesc Alted
    4 Alberto Valverde
    3 Thomas Kluyver
    3 GFKjunior
    3 Christoph Gohlke
    ...
    git log --format=format:%an | sort | uniq -c | sort -r
    Siu
    Mark
    Jon


  40. GPU and Parallel Support
    fast development
    and fast execution!
    Python and NumPy compiled to Parallel Architectures
    (GPUs and multi-core machines)
    $ conda install accelerate


  41. Examples
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x*pi)/(pi*x)
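    Calling the resulting ufuncs looks the same as calling any NumPy ufunc (a sketch assuming NumbaPro / Anaconda Accelerate is installed and, for target='gpu', a CUDA device is available):

    import numpy as np

    x = np.linspace(-10, 10, 1000000).astype(np.float32)
    y_gpu = sinc(x)      # kernel launched on the GPU, result copied back
    y_cpu = sinc2(x)     # same kernel spread across the CPU cores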


  42. Example
    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid / width
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255
    Kind     Time      Speed-up
    Python   263.6     1.0x
    CPU      2.639     100x
    GPU      0.1676    1573x
    (GPU: Tesla S2050)
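    A host-side driver sketch (image size, bounds, and iteration count are illustrative; it assumes the GPU ufunc broadcasts scalar arguments the way an ordinary NumPy ufunc does):

    import numpy as np

    width, height, iters = 1024, 768, 20
    tid = np.arange(width * height, dtype=np.uint32)   # one element per pixel

    pixels = mandel(tid, np.float32(-2.0), np.float32(1.0),
                    np.float32(-1.0), np.float32(1.0),
                    np.uint32(width), np.uint32(height), np.uint32(iters))
    image = pixels.reshape(height, width)              # uint8 iteration counts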


  43. Example with cuFFT
    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()

    @vectorize(['complex64(complex64, complex64)'],
               target='gpu')
    def vmult(a, b):
        return a * b
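    A preparation sketch for the snippet above (assumptions: the image and filter are complex64 arrays of the same shape, with the small filter zero-padded, since FFT-based filtering multiplies the two spectra elementwise):

    import numpy as np

    img = np.random.rand(512, 512).astype(np.complex64)   # image to filter

    small = np.ones((5, 5), dtype=np.float32) / 25.0       # 5x5 box filter
    fltr = np.zeros(img.shape, dtype=np.complex64)
    fltr[:5, :5] = small                                   # zero-pad to the image shape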


  44. More examples at
    https://github.com/ContinuumIO/numbapro-examples
    http://devblogs.nvidia.com/parallelforall/numbapro-high-performance-python-cuda-acceleration/
    CUDA CAST #10: https://www.youtube.com/watch?v=jKV1m8APttU
