
Large scale array-oriented computing with Python

Talk given at PyCon Taiwan in 2012. Describes the history of SciPy and early thoughts on Numba and Blaze.

Travis E. Oliphant

April 12, 2014


Transcript

  1. Large-scale array-oriented
    computing with Python
    PyCon Taiwan, June 9, 2012
    Travis E. Oliphant


  2. My Roots


  3. My Roots
    Images from BYU Mers Lab


  4. Science led to Python
    Raja Muthupillai
    Armando Manduca
    Richard Ehman
    1997
    $\rho_0 (2\pi f)^2\, U_i(a, f) = \left[\, C_{ijkl}(a, f)\, U_{k,l}(a, f) \,\right]_{,j}$


  5. Finding derivatives of 5-d data
    $\Xi = \nabla \times U$


  6. Scientist at heart


  7. Python origins.
    Version Date
    0.9.0 Feb. 1991
    0.9.4 Dec. 1991
    0.9.6 Apr. 1992
    0.9.8 Jan. 1993
    1.0.0 Jan. 1994
    1.2 Apr. 1995
    1.4 Oct. 1996
    1.5.2 Apr. 1999
    http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html


  8. Brief History
    Person                                     Package                  Year
    Jim Fulton                                 Matrix Object in Python  1994
    Jim Hugunin                                Numeric                  1995
    Perry Greenfield, Rick White, Todd Miller  Numarray                 2001
    Travis Oliphant                            NumPy                    2005


  9. 1999 : Early SciPy emerges
    Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis
    environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
    and others. Activity in 1998 led to increased interest in 1999.

    In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
    present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
    be creating this uber-package, which eventually became SciPy.
    Gaussian quadrature 5 Jan 1999
    cephes 1.0 30 Jan 1999
    sigtools 0.40 23 Feb 1999
    Numeric docs March 1999
    cephes 1.1 9 Mar 1999
    multipack 0.3 13 Apr 1999
    Helper routines 14 Apr 1999
    multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999
    sparse plan described 30 May 1999
    multipack 0.7 14 Jun 1999
    SparsePy 0.1 5 Nov 1999
    cephes 1.2 (vectorize) 29 Dec 1999
    Plotting?? Options at the time: Gist, XPLOT, DISLIN, Gnuplot.
    Helping with f2py


  10. SciPy 2001
    Founded in 2001 with Travis Vaught
    Eric Jones: weave, cluster, GA*
    Pearu Peterson: linalg, interpolate, f2py
    Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc


  11. Community effort
    • Chuck Harris
    • Pauli Virtanen
    • David Cournapeau
    • Stefan van der Walt
    • Dag Sverre Seljebotn
    • Robert Kern
    • Warren Weckesser
    • Ralf Gommers
    • Mark Wiebe
    • Nathaniel Smith


  12. Why Python for Technical Computing
    • Syntax (it gets out of your way)

    • Over-loadable operators

    • Complex numbers built-in early

    • Just enough language support for arrays

    • “Occasional” programmers can grok it

    • Supports multiple programming styles

    • Expert programmers can also use it effectively

    • Has a simple, extensible implementation

    • General-purpose language --- can build a system

    • Critical mass


  13. What is wrong with Python?
    • Packaging is still not solved well (distribute, pip, and
    distutils2 don’t cut it)

    • Missing anonymous blocks

    • The CPython run-time is aged and needs an overhaul
    (GIL, global variables, lack of dynamic compilation
    support)

    • No approach to language extension except for
    “import hooks” (lightweight DSL need)

    • The distraction of multiple run-times...

    • Array-oriented and NumPy not really understood by
    most Python devs.


  14. Putting Science back in Comp Sci
    • Much of the software stack is for systems
    programming --- C++, Java, .NET, ObjC, web

    - Complex numbers?

    - Vectorized primitives?

    • Array-oriented programming has been
    supplanted by Object-oriented programming

    • Software stack for scientists is not as helpful
    as it should be

    • Fortran is still where many scientists end up


  15. Array-Oriented Computing
    Example 1: Fibonacci numbers
    $f_n = f_{n-1} + f_{n-2}$, with $f_0 = 0$, $f_1 = 1$
    $f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$


  16. Common Python approaches
    Recursive Iterative
    Algorithm matters!!
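The demo code on this slide was not captured in the transcript; a minimal sketch of the two approaches being contrasted (function names are mine):

```python
# Recursive: a direct translation of the recurrence, but exponential time,
# since each call recomputes the same subproblems.
def fib_rec(n):
    return n if n < 2 else fib_rec(n - 1) + fib_rec(n - 2)

# Iterative: O(n) time, O(1) space -- the algorithm matters.
def fib_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Both produce 0, 1, 1, 2, 3, 5, ..., but their running times diverge rapidly as n grows.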


  17. Array-oriented approaches
    Using LFilter
    Using Formula
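The slide's code is likewise missing from the transcript. A sketch of what the two array-oriented variants plausibly looked like: the recurrence treated as an IIR filter via scipy.signal.lfilter, and Binet's closed-form formula evaluated on a whole array at once (function names are mine):

```python
import numpy as np
from scipy.signal import lfilter

# "Using LFilter": y[n] = y[n-1] + y[n-2] + x[n] is an IIR filter with
# denominator a = [1, -1, -1]; feeding it a unit impulse produces
# f(1), f(2), ..., f(n) in one vectorized call.
def fib_lfilter(n):
    impulse = np.zeros(n)
    impulse[0] = 1.0
    return lfilter([1.0], [1.0, -1.0, -1.0], impulse)

# "Using Formula": Binet's closed form, evaluated element-wise on an
# array of indices; np.rint rounds away the floating-point error.
def fib_formula(n):
    k = np.arange(n)
    phi = (1 + np.sqrt(5)) / 2
    return np.rint((phi**k - (1 - phi)**k) / np.sqrt(5)).astype(int)
```

The lfilter version returns f(1)..f(n); the formula version returns f(0)..f(n-1).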


  18. Array-oriented approaches


  19. NumPy: an Array-Oriented Extension
    • Data: the array object
    – slicing and shaping
    – data-type maps to bytes
    • Fast Math:
    – vectorization
    – broadcasting
    – aggregations
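A small sketch of the three "fast math" ideas just listed:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3x4 array of 0..11

# Vectorization: one expression; the loops run in compiled code.
doubled = 2 * a + 1

# Broadcasting: a (3,1) column combines element-wise with a (3,4) array,
# as if the column were stretched across the 4 columns.
col = np.array([[10], [20], [30]])
shifted = a + col

# Aggregations: reductions along an axis.
col_sums = a.sum(axis=0)          # shape (4,)
row_max = a.max(axis=1)           # shape (3,)
```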


  20. NumPy Array
    [diagram of the ndarray data structure; the surviving label is "shape"]


  21. Zen of NumPy
    • strided is better than scattered
    • contiguous is better than strided
    • descriptive is better than imperative
    • array-oriented is better than object-oriented
    • broadcasting is a great idea
    • vectorized is better than an explicit loop
    • unless it’s too complicated --- then use Cython/Numba
    • think in higher dimensions
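A sketch of the "vectorized is better than an explicit loop" line (function names are mine):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_000)

# Explicit Python loop: interpreter overhead on every single element.
def poly_loop(x):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = 3.0 * x[i] ** 2 + 2.0 * x[i] + 1.0
    return out

# Vectorized: the same arithmetic, expressed on the whole array.
def poly_vec(x):
    return 3.0 * x**2 + 2.0 * x + 1.0
```

The two return identical results, but the vectorized form is typically orders of magnitude faster; when the expression gets too complicated to vectorize, that is where Cython/Numba come in.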


  22. More NumPy Demonstration


  23. Conway’s game of Life
    • Dead cell with exactly 3 live neighbors
    will come to life
    • A live cell with 2 or 3 neighbors will
    survive
    • With too few or too many neighbors, the
    cell dies
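These three rules vectorize directly; a sketch of one whole-grid update using only np.roll (a toroidal variant, since rolling wraps at the edges):

```python
import numpy as np

def life_step(grid):
    """One Game of Life update on a toroidal grid of 0s and 1s."""
    # Count the 8 neighbors of every cell at once by shifting the whole grid.
    neighbors = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3; all else dies.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)
```

A horizontal blinker flips to vertical and back, with no explicit loop over cells.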


  24. Interesting Patterns emerge


  25. APL : the first array-oriented language
    • Appeared in 1964
    • Originated by Ken Iverson
    • Direct descendants (J, K, Matlab) are still
    used heavily and people pay a lot of money
    for them
    • NumPy is a descendant
    [lineage diagram: APL → J, K, Matlab; APL → Numeric → NumPy]


  26. Conway’s Game of Life
    APL
    NumPy
    Initialization
    Update Step


  27. Demo
    Python Version
    Array-oriented NumPy Version


  28. Memory using Object-oriented
    [diagram: six separate objects scattered in memory, each carrying its own Attr1, Attr2, Attr3 fields]


  29. Array-oriented (Table) approach
    Attr1 Attr2 Attr3
    Object1
    Object2
    Object3
    Object4
    Object5
    Object6
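Today this table layout maps onto NumPy structured arrays; a sketch with illustrative field names mirroring the Attr1/Attr2/Attr3 columns:

```python
import numpy as np

# One contiguous block of memory; each column is addressable
# without touching the others.
table = np.zeros(6, dtype=[('attr1', 'f8'), ('attr2', 'i4'), ('attr3', 'U8')])
table['attr1'] = np.arange(6) * 1.5
table['attr2'] = [10, 20, 30, 40, 50, 60]
table['attr3'] = ['a', 'b', 'c', 'd', 'e', 'f']

mean1 = table['attr1'].mean()   # column-wise aggregation
row = table[2]                  # row access is still available when needed
```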


  30. Benefits of Array-oriented
    • Many technical problems are naturally array-
    oriented (easy to vectorize)

    • Algorithms can be expressed at a high-level

    • These algorithms can be parallelized more
    simply (quite often much information is lost in
    the translation to typical “compiled” languages)

    • Array-oriented algorithms map to modern
    hard-ware caches and pipelines.


  31. We need more focus on compiled array-oriented
    languages with fast compilers!


  32. What is good about NumPy?
    • Array-oriented

    • Extensive Dtype System (including structures)

    • C-API

    • Simple to understand data-structure

    • Memory mapping

    • Syntax support from Python

    • Large community of users

    • Broadcasting

    • Easy to interface C/C++/Fortran code
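The "memory mapping" bullet refers to np.memmap; a minimal sketch (the file path is a throwaway temp file):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'big.dat')

# Create a disk-backed array; pages are loaded on demand rather than
# reading the whole file into RAM.
m = np.memmap(path, dtype='float64', mode='w+', shape=(3, 4))
m[:] = np.arange(12).reshape(3, 4)
m.flush()                  # push pending changes to disk

# Reopen read-only and slice without loading the entire file.
view = np.memmap(path, dtype='float64', mode='r', shape=(3, 4))
```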


  33. What is wrong with NumPy
    • Dtype system is difficult to extend

    • Immediate mode creates huge temporaries
    (spawning Numexpr)

    • “Almost” an in-memory database comparable
    to SQLite (missing indexes)

    • Integration with sparse arrays

    • Lots of un-optimized parts

    • Minimal support for multi-core / GPU

    • Code-base is organic and hard to extend


  34. Improvements needed
    • NDArray improvements
    • Indexes (esp. for Structured arrays)
    • SQL front-end
    • Multi-level, hierarchical labels
    • selection via mappings (labeled arrays)
    • Memory spaces (array made up of regions)
    • Distributed arrays (global array)
    • Compressed arrays
    • Standard distributed persistence
    • fancy indexing as view and optimizations
    • streaming arrays


  35. Improvements needed
    • Dtype improvements
    • Enumerated types (including dynamic enumeration)
    • Derived fields
    • Specification as a class (or JSON)
    • Pointer dtype (i.e. C++ object, or varchar)
    • Finishing datetime
    • Missing data with bit-patterns
    • Parameterized field names


  36. Example of Object-defined Dtype
    @np.dtype
    class Stock(np.DType):
        symbol = np.Str(4)
        open = np.Int(2)
        close = np.Int(2)
        high = np.Int(2)
        low = np.Int(2)

        @np.Int(2)
        def mid(self):
            return (self.high + self.low) / 2.0


  37. Improvements needed
    • Ufunc improvements
    • Generalized ufuncs support more than just
    contiguous arrays
    • Specification of ufuncs in Python
    • Move most dtype “array functions” to ufuncs
    • Unify error-handling for all computations
    • Allow lazy-evaluation and remote computation ---
    streaming and generator data
    • Structured and string dtype ufuncs
    • Multi-core and GPU optimized ufuncs
    • Group-by reduction
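"Specification of ufuncs in Python" exists today only in a slow, object-dtype form via np.frompyfunc --- which is exactly the gap this wish-list item targets. A sketch of the current limited mechanism:

```python
import math

import numpy as np

# A scalar Python function promoted to a broadcasting "ufunc".
# frompyfunc returns object arrays and runs at Python speed -- a fast
# native path for this is what the slide asks for.
clip_log = np.frompyfunc(lambda x: math.log(x) if x > 0 else 0.0, 1, 1)

values = np.array([0.0, 1.0, math.e, math.e ** 2])
result = clip_log(values).astype(float)   # broadcasts like a built-in ufunc
```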


  38. More Improvements needed
    • Miscellaneous improvements
    • ABI-management
    • Eventual Move to library (NDLib)?
    • Integration with LLVM
    • Sparse dimensions
    • Remote computation
    • Fast I/O for CSV and Excel
    • Out-of-core calculations
    • Delayed-mode execution


  39. New Project
    Blaze: the next generation of NumPy
    Out-of-core
    Distributed tables

  40. Blaze Main Features
    • New ndarray with multiple memory segments

    • Distributed ndtable which can span the world

    • Fast, out-of-core algorithms for all functions

    • Delayed-mode execution: expressions build up
    graph which gets executed where the data is

    • Built-in Indexes (beyond searchsorted)

    • Built-in labels (data-array)

    • Sparse dimensions (defined by attributes or
    elements of another dimension)

    • Direct adapters to all data (move code to data)


  41. Delayed execution
    Demo
    Code Only


  42. Dimensions defined by Attributes
    Day  Month  Year  High  Low
    15   3      2012  30    20
    16   3      2012  35    25
    20   3      2012  40    30
    21   3      2012  41    29
    (dim1 is defined by the Day / Month / Year attribute columns)


  43. Outline
    NDTable
    NDArray
    Bytes
    GFunc DType
    Domain


  44. NDTable (Example)
    [diagram: an NDTable partitioned into a grid of chunks spread across Proc0–Proc4]
    Each Partition:
    • Remote
    • Expression
    • NDArray


  45. Data URLs
    • Variables in a script are global addresses (Data
    URLs). All the world’s data you can see via the web
    can be used as part of an algorithm by referencing
    it as part of an array.

    • Dynamically interpret bytes as a data-type

    • The scheduler will push code (chosen based on data-type)
    to the data instead of pulling data to the code.


  46. Overview
    [diagram: a main script pushes code out to many processing nodes]


  47. NDArray
    • Local ndarray (NumPy++)

    • Multiple byte-buffers (streaming or random
    access)

    • Variable-length arrays

    • All kinds of data-types (everything...)

    • Multiple patterns of memory access possible
    (Z-order, Fortran-order, C-order)

    • Sparse dimensions


  48. GFunc
    • Generalized Function

    • All NumPy functions

    • element-by-element

    • linear algebra

    • manipulation

    • Fourier Transform

    • Iteration and Dispatch to low-level kernels

    • Kernels can be written in anything that builds a
    C-like interface


  49. PyData
    All computing modules known to work with
    Blaze will be placed under PyData umbrella of
    projects over the coming years.


  50. Introducing Numba

    (lots of kernels to write)


  51. NumPy Users
    • Want to be able to write Python to get fast
    code that works on arrays and scalars

    • Need access to a boat-load of C-extensions
    (NumPy is just the beginning)
    PyPy doesn’t cut it for us!


  52. Dynamic compilation
    [diagram: a Python function is dynamically compiled to a function
    pointer that plugs into the NumPy runtime --- ufuncs, generalized
    ufuncs, function-based indexing, memory filters, window kernel
    funcs, I/O filters, reduction filters, computed columns]


  53. SciPy needs a Python compiler
    integrate, optimize, ode, special
    writing more of SciPy at a high level


  54. Numba -- a Python compiler
    • Replays byte-code on a stack with simple type-
    inference

    • Translates to LLVM (using LLVM-py)

    • Uses LLVM for code-gen

    • Resulting C-level function-pointer can be
    inserted into NumPy run-time

    • Understands NumPy arrays

    • Is NumPy / SciPy aware


  55. NumPy + Mamba = Numba
    [diagram: Python function → LLVM-PY → LLVM 3.1 → machine code,
    with backends CLANG, OpenCL, ISPC, CUDA, OpenMP targeting
    Intel, Nvidia, Apple, AMD]


  56. Examples


  57. Examples


  58. Software Stack Future?
    [diagram: LLVM as a common foundation under Python, C, ObjC,
    FORTRAN, R, C++ --- plateaus of code re-use + DSLs such as
    Matlab, SQL, TDPL]


  59. How to pay for all this?


  60. Dual strategy
    Blaze


  61. NumFOCUS
    Num(Py) Foundation for Open Code for Usable Science


  62. NumFOCUS
    • Mission
    • To initiate and support educational programs
    furthering the use of open source software in
    science.
    • To promote the use of high-level languages and
    open source in science, engineering, and math
    research
    • To encourage reproducible scientific research
    • To provide infrastructure and support for open
    source projects for technical computing


  63. NumFOCUS
    • Activities
    • Sponsor sprints and conferences
    • Provide scholarships and grants for people using
    these tools
    • Pay for documentation development and basic
    course development
    • Fund continuous integration and build systems
    • Work with domain-specific organizations
    • Raise funds from industries using Python and
    NumPy


  64. NumFOCUS
    Core Projects: NumPy, SciPy, IPython, Matplotlib
    Other Projects (seeking more --- need representatives): Scikits Image


  65. NumFOCUS
    • Directors
    • Perry Greenfield
    • John Hunter
    • Jarrod Millman
    • Travis Oliphant
    • Fernando Perez
    • Members
    • Basically people who donate for now. In time, a
    body that elects directors.


  66. • Large-scale data analysis products

    • Python training (data analysis and
    development)

    • NumPy support and consulting

    • Rich-client or web user-interfaces

    • Blaze and PyData Development
