Slide 1

Large-scale array-oriented computing with Python
PyCon Taiwan, June 9, 2012
Travis E. Oliphant

Slide 2

My Roots

Slide 3

My Roots (images from the BYU MERS Lab)

Slide 4

Science led to Python

Raja Muthupillai, Armando Manduca, Richard Ehman (1997):

    $-\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$

Slide 5

Finding derivatives of 5-d data: $\Xi = \nabla \times U$

Slide 6

Scientist at heart

Slide 7

Python origins

    Version   Date
    0.9.0     Feb. 1991
    0.9.4     Dec. 1991
    0.9.6     Apr. 1992
    0.9.8     Jan. 1993
    1.0.0     Jan. 1994
    1.2       Apr. 1995
    1.4       Oct. 1996
    1.5.2     Apr. 1999

Source: http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

Slide 8

Brief History

    Person                                      Package                   Year
    Jim Fulton                                  Matrix Object in Python   1994
    Jim Hugunin                                 Numeric                   1995
    Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
    Travis Oliphant                             NumPy                     2005

Slide 9

1999: Early SciPy emerges

Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.

In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.

    Gaussian quadrature                          5 Jan 1999
    cephes 1.0                                   30 Jan 1999
    sigtools 0.40                                23 Feb 1999
    Numeric docs                                 March 1999
    cephes 1.1                                   9 Mar 1999
    multipack 0.3                                13 Apr 1999
    Helper routines                              14 Apr 1999
    multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
    sparse plan described                        30 May 1999
    multipack 0.7                                14 Jun 1999
    SparsePy 0.1                                 5 Nov 1999
    cephes 1.2 (vectorize)                       29 Dec 1999

Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py.

Slide 10

SciPy 2001: founded in 2001 with Travis Vaught

• Eric Jones: weave, cluster, GA*
• Pearu Peterson: linalg, interpolate, f2py
• Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

Slide 11

Community effort
• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Dag Sverre Seljebotn
• Robert Kern
• Warren Weckesser
• Ralf Gommers
• Mark Wiebe
• Nathaniel Smith

Slide 12

Why Python for Technical Computing
• Syntax (it gets out of your way)
• Over-loadable operators
• Complex numbers built-in early
• Just enough language support for arrays
• “Occasional” programmers can grok it
• Supports multiple programming styles
• Expert programmers can also use it effectively
• Has a simple, extensible implementation
• General-purpose language --- can build a system
• Critical mass

Slide 13

What is wrong with Python?
• Packaging is still not solved well (distribute, pip, and distutils2 don’t cut it)
• Missing anonymous blocks
• The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support)
• No approach to language extension except for “import hooks” (lightweight DSLs need this)
• The distraction of multiple run-times...
• Array-oriented computing and NumPy are not really understood by most Python devs

Slide 14

Putting Science back in Comp Sci
• Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web
  - Complex numbers?
  - Vectorized primitives?
• Array-oriented programming has been supplanted by object-oriented programming
• The software stack for scientists is not as helpful as it should be
• Fortran is still where many scientists end up

Slide 15

Array-Oriented Computing

Example 1: Fibonacci Numbers

    $f_n = f_{n-1} + f_{n-2}, \quad f_0 = 0, \quad f_1 = 1$

    f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

Slide 16

Common Python approaches: recursive and iterative. Algorithm matters!
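
The two common approaches can be sketched as follows (a minimal reconstruction; the talk's actual demo code is not shown here):

```python
def fib_recursive(n):
    # direct translation of the recurrence; exponential time because
    # each call recomputes the same subproblems
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_iterative(n):
    # linear time: carry only the last two values forward
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Both return the same values, but for n around 35 the recursive version already takes seconds while the iterative one is instant --- the "algorithm matters" point.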

Slide 17

Array-oriented approaches: using lfilter, and using the closed-form formula.
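
The formula-based variant can be sketched with NumPy's vectorized arithmetic using Binet's closed form (an illustrative reconstruction, not the talk's exact code; the other variant would run the recurrence through scipy.signal.lfilter):

```python
import numpy as np

def fib_formula(n):
    # Binet's closed form evaluated for all indices 0..n-1 at once:
    #   f_k = (phi**k - (-phi)**(-k)) / sqrt(5)
    k = np.arange(n)
    phi = (1 + np.sqrt(5)) / 2
    return np.rint((phi**k - (-phi)**(-k)) / np.sqrt(5)).astype(int)
```

One vectorized expression replaces the explicit loop entirely.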

Slide 18

Array-oriented approaches

Slide 19

NumPy: an Array-Oriented Extension
• Data: the array object
  - slicing and shaping
  - data-types map to bytes
• Fast math:
  - vectorization
  - broadcasting
  - aggregations
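
These pieces show up in just a few lines (a minimal sketch of standard NumPy behavior):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # data: a shape over a flat buffer

row = a[1, :]               # slicing produces a view onto the same bytes, not a copy
col_means = a.mean(axis=0)  # aggregation along an axis
centered = a - col_means    # broadcasting: shape (3, 4) combined with shape (4,)
```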

Slide 20

NumPy Array shape

Slide 21

Zen of NumPy
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
• unless it’s too complicated --- then use Cython/Numba
• think in higher dimensions

Slide 22

More NumPy Demonstration

Slide 23

Conway’s Game of Life
• A dead cell with exactly 3 live neighbors will come to life
• A live cell with 2 or 3 neighbors will survive
• With too few or too many neighbors, the cell dies
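
The rules above vectorize directly. Here is a minimal NumPy update step (a sketch using np.roll for wrap-around edges, not necessarily the code shown in the talk):

```python
import numpy as np

def life_step(grid):
    """One Game of Life step on a 2-D array of 0/1 cells (toroidal edges)."""
    g = grid.astype(np.uint8)
    # count live neighbors by summing the eight shifted copies of the grid
    neighbors = sum(np.roll(np.roll(g, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # birth: exactly 3 neighbors; survival: a live cell with 2 (3 is covered above)
    return ((neighbors == 3) | ((g == 1) & (neighbors == 2))).astype(np.uint8)
```

For example, a vertical "blinker" of three cells becomes a horizontal one after a single step.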

Slide 24

Interesting Patterns emerge

Slide 25

APL: the first array-oriented language
• Appeared in 1964
• Originated by Ken Iverson
• Direct descendants (J, K, Matlab) are still used heavily, and people pay a lot of money for them
• NumPy is a descendant (lineage: APL → J, K, Matlab, Numeric → NumPy)

Slide 26

Conway’s Game of Life: APL vs. NumPy (initialization and update step)

Slide 27

Demo: plain Python version vs. array-oriented NumPy version

Slide 28

Memory using the object-oriented approach: each object carries its own Attr1, Attr2, Attr3, so like attributes are scattered across memory.

Slide 29

Array-oriented (table) approach: each attribute (Attr1, Attr2, Attr3) is stored as a contiguous column spanning Object1 through Object6.
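
The table layout maps naturally onto NumPy: keep each attribute in its own contiguous array, so whole-column operations become single vectorized expressions (a minimal sketch with invented values):

```python
import numpy as np

# table layout: one contiguous array per attribute (hypothetical data)
high = np.array([40.0, 35.0, 30.0])
low  = np.array([30.0, 25.0, 20.0])

# whole-column operations touch contiguous memory and vectorize naturally
mid = (high + low) / 2.0
spread = high - low
```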

Slide 30

Benefits of Array-oriented
• Many technical problems are naturally array-oriented (easy to vectorize)
• Algorithms can be expressed at a high level
• These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages)
• Array-oriented algorithms map to modern hardware caches and pipelines

Slide 31

We need more focus on compiled array-oriented languages with fast compilers!

Slide 32

What is good about NumPy?
• Array-oriented
• Extensive dtype system (including structures)
• C-API
• Simple-to-understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Broadcasting
• Easy to interface C/C++/Fortran code
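
Memory mapping, for instance, lets an on-disk file be treated as an array without reading it all into memory (a small sketch using a temporary file):

```python
import os
import tempfile
import numpy as np

# write a .npy file to disk, then map it back lazily
path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, np.arange(1000, dtype=np.float64))

m = np.load(path, mmap_mode='r')  # read-only memmap: pages load on access
total = m[::100].sum()            # strided access touches only a few pages
```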

Slide 33

What is wrong with NumPy?
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Poor integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
• Code-base is organic and hard to extend

Slide 34

Improvements needed
• NDArray improvements
  - Indexes (esp. for structured arrays)
  - SQL front-end
  - Multi-level, hierarchical labels
  - Selection via mappings (labeled arrays)
  - Memory spaces (an array made up of regions)
  - Distributed arrays (global arrays)
  - Compressed arrays
  - Standard distributed persistence
  - Fancy indexing as a view, and optimizations
  - Streaming arrays

Slide 35

Improvements needed
• Dtype improvements
  - Enumerated types (including dynamic enumeration)
  - Derived fields
  - Specification as a class (or JSON)
  - Pointer dtype (i.e. C++ object, or varchar)
  - Finishing datetime
  - Missing data with bit-patterns
  - Parameterized field names

Slide 36

Example of Object-defined Dtype

    @np.dtype
    class Stock(np.DType):
        symbol = np.Str(4)
        open   = np.Int(2)
        close  = np.Int(2)
        high   = np.Int(2)
        low    = np.Int(2)

        @np.Int(2)
        def mid(self):
            return (self.high + self.low) / 2.0

Slide 37

Improvements needed
• Ufunc improvements
  - Generalized ufuncs that support more than just contiguous arrays
  - Specification of ufuncs in Python
  - Move most dtype “array functions” to ufuncs
  - Unify error-handling for all computations
  - Allow lazy evaluation and remote computation --- streaming and generator data
  - Structured and string dtype ufuncs
  - Multi-core and GPU optimized ufuncs
  - Group-by reduction
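
Group-by reduction only has an approximation in today's NumPy: ufunc.reduceat over pre-sorted group boundaries (a sketch with invented data):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
starts = np.array([0, 2, 5])  # group boundaries: [1,2], [3,4,5], [6]

# np.add.reduceat sums each slice values[starts[i]:starts[i+1]]
# (the last group runs to the end of the array)
group_sums = np.add.reduceat(values, starts)
```

A true group-by would avoid the pre-sorting requirement entirely.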

Slide 38

More Improvements needed
• Miscellaneous improvements
  - ABI management
  - Eventual move to a library (NDLib)?
  - Integration with LLVM
  - Sparse dimensions
  - Remote computation
  - Fast I/O for CSV and Excel
  - Out-of-core calculations
  - Delayed-mode execution

Slide 39

New Project: Blaze --- the next generation of NumPy: out-of-core, distributed tables.

Slide 40

Blaze Main Features
• New ndarray with multiple memory segments
• Distributed ndtable which can span the world
• Fast, out-of-core algorithms for all functions
• Delayed-mode execution: expressions build up a graph which gets executed where the data is
• Built-in indexes (beyond searchsorted)
• Built-in labels (data-array)
• Sparse dimensions (defined by attributes or elements of another dimension)
• Direct adapters to all data (move code to data)

Slide 41

Delayed execution Demo Code Only

Slide 42

Dimensions defined by Attributes

    Day   Month   Year   High   Low
    15    3       2012   30     20
    16    3       2012   35     25
    20    3       2012   40     30
    21    3       2012   41     29

(the rows together form a single dimension, dim1)

Slide 43

Outline: NDTable, NDArray, Bytes, GFunc, DType, Domain

Slide 44

NDTable (Example): a table partitioned across processors Proc0 through Proc4. Each partition can be remote, an expression, or an NDArray.

Slide 45

Data URLs
• Variables in a script are global addresses (data URLs). All the world’s data you can see via the web can be used as part of an algorithm by referencing it as part of an array.
• Dynamically interpret bytes as a data-type
• The scheduler will push code, based on data-type, to the data instead of pulling data to the code

Slide 46

Overview: a main script dispatches code to many processing nodes.

Slide 47

NDArray
• Local ndarray (NumPy++)
• Multiple byte-buffers (streaming or random access)
• Variable-length arrays
• All kinds of data-types (everything...)
• Multiple patterns of memory access possible (Z-order, Fortran-order, C-order)
• Sparse dimensions

Slide 48

GFunc
• Generalized Function
• All NumPy functions
  - element-by-element
  - linear algebra
  - manipulation
  - Fourier transform
• Iteration and dispatch to low-level kernels
• Kernels can be written in anything that builds a C-like interface

Slide 49

PyData: all computing modules known to work with Blaze will be placed under the PyData umbrella of projects over the coming years.

Slide 50

Introducing Numba (lots of kernels to write)

Slide 51

NumPy Users
• Want to be able to write Python to get fast code that works on arrays and scalars
• Need access to a boat-load of C-extensions (NumPy is just the beginning)

PyPy doesn’t cut it for us!

Slide 52

Dynamic compilation: a Python function is compiled at run time into a function pointer that the NumPy runtime can use for ufuncs, generalized ufuncs, function-based indexing, memory filters, window kernel functions, I/O filters, reduction filters, and computed columns.

Slide 53

SciPy needs a Python compiler (integrate, optimize, ode, special): writing more of SciPy at a high level.

Slide 54

Numba --- a Python compiler
• Replays byte-code on a stack with simple type-inference
• Translates to LLVM (using LLVM-py)
• Uses LLVM for code-gen
• The resulting C-level function-pointer can be inserted into the NumPy run-time
• Understands NumPy arrays
• Is NumPy / SciPy aware
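
In use, Numba is a decorator on a typed loop. This is a sketch using the present-day Numba API (decorator spellings have changed since 2012), with a pure-Python fallback so the sketch still runs if Numba is not installed:

```python
import numpy as np

try:
    from numba import jit            # current spelling; early releases used autojit
except ImportError:
    def jit(func=None, **kwargs):    # fallback: run the loop as ordinary Python
        if func is None:
            return lambda f: f
        return func

@jit(nopython=True)
def total(a):
    # a plain typed loop: Numba type-infers it and emits machine code via LLVM
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i]
    return s
```

With Numba present, the compiled loop runs at C-like speed on a NumPy array; without it, the same source still produces correct (slow) results.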

Slide 55

NumPy + Mamba = Numba: a Python function goes through LLVM-PY into LLVM 3.1 and out to machine code, with backends from Intel, Nvidia, Apple, and AMD (OpenCL, ISPC, CUDA, CLANG, OpenMP).

Slide 56

Examples

Slide 57

Examples

Slide 58

Software Stack Future? LLVM underneath Python, C, ObjC, FORTRAN, R, and C++: plateaus of code re-use plus DSLs (Matlab, SQL, TDPL).

Slide 59

How to pay for all this?

Slide 60

Dual strategy Blaze

Slide 61

NumFOCUS: the Num(Py) Foundation for Open Code for Usable Science

Slide 62

NumFOCUS
• Mission
  - To initiate and support educational programs furthering the use of open source software in science
  - To promote the use of high-level languages and open source in science, engineering, and math research
  - To encourage reproducible scientific research
  - To provide infrastructure and support for open source projects for technical computing

Slide 63

NumFOCUS
• Activities
  - Sponsor sprints and conferences
  - Provide scholarships and grants for people using these tools
  - Pay for documentation development and basic course development
  - Fund continuous integration and build systems
  - Work with domain-specific organizations
  - Raise funds from industries using Python and NumPy

Slide 64

NumFOCUS Core Projects: NumPy, SciPy, IPython, Matplotlib, Scikits Image
Other projects (seeking more --- need representatives)

Slide 65

NumFOCUS
• Directors
  - Perry Greenfield
  - John Hunter
  - Jarrod Millman
  - Travis Oliphant
  - Fernando Perez
• Members
  - Basically people who donate for now. In time, a body that elects directors.

Slide 66

• Large-scale data analysis products
• Python training (data analysis and development)
• NumPy support and consulting
• Rich-client or web user-interfaces
• Blaze and PyData development