Slide 1

Slide 1 text

Python in Big Data Analytics: Past, Present, and Future
EuroPython, July 25, 2014
Travis E. Oliphant

Slide 2

Slide 2 text

My Roots

Slide 3

Slide 3 text

My Roots Images from BYU Mers Lab

Slide 4

Slide 4 text

Science led to Python
Raja Muthupillai, Armando Manduca, Richard Ehman, 1997
$\rho_0 (2\pi f)^2 U_i(a, f) = [C_{ijkl}(a, f)\, U_{k,l}(a, f)]_{,j}$

Slide 5

Slide 5 text

Finding derivatives of 5-d data
$\Xi = \nabla \times U$

Slide 6

Slide 6 text

Scientist at heart

Slide 7

Slide 7 text

Python origins.
Version   Date
0.9.0     Feb. 1991
0.9.4     Dec. 1991
0.9.6     Apr. 1992
0.9.8     Jan. 1993
1.0.0     Jan. 1994
1.2       Apr. 1995
1.4       Oct. 1996
1.5.2     Apr. 1999

Slide 8

Slide 8 text

First problem: Efficient Data Input
“It’s Always About the Data”
Reference Counting Essay, May 1998, Guido van Rossum
TableIO, April 1998, Michael A. Miller
NumPyIO, June 1998

Slide 9

Slide 9 text

Early pieces of SciPy cephesmodule fftw wrappers June 1998 November 1998 December 1998 Gary Strangman

Slide 10

Slide 10 text

1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
Gaussian quadrature                           5 Jan 1999
cephes 1.0                                    30 Jan 1999
sigtools 0.40                                 23 Feb 1999
Numeric docs                                  March 1999
cephes 1.1                                    9 Mar 1999
multipack 0.3                                 13 Apr 1999
Helper routines                               14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)    29 Apr 1999
sparse plan described                         30 May 1999
multipack 0.7                                 14 Jun 1999
SparsePy 0.1                                  5 Nov 1999
cephes 1.2 (vectorize)                        29 Dec 1999
Plotting?? Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py

Slide 11

Slide 11 text

Using Numeric circa 2000
Image from my PhD thesis, made using a Python interface to DISLIN and Numeric plus hand-written C extensions for inverting the wave equation.

Slide 12

Slide 12 text

SciPy 2001
Founded in 2001 with Travis Vaught
Eric Jones: weave, cluster, GA*
Pearu Peterson: linalg, interpolate, f2py
Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

Slide 13

Slide 13 text

Brief History
Person                                       Package                   Year
Jim Fulton                                   Matrix Object in Python   1994
Jim Hugunin                                  Numeric                   1995
Perry Greenfield, Rick White, Todd Miller    Numarray                  2001
Travis Oliphant                              NumPy                     2005

Slide 14

Slide 14 text

Now an impressive community effort
• Chuck Harris
• Pauli Virtanen
• Robert Kern
• Warren Weckesser
• Ralf Gommers
• Mark Wiebe
• Nathaniel Smith
• Nathan Bell
• Stefan van der Walt
• Matthew Brett
• Josef Perktold
…

Slide 15

Slide 15 text

Over 3,000,000 users of NumPy!

Slide 16

Slide 16 text

Keys to Success
• Hard work, especially up front
• Often lonely: initially nobody believes in your idea more than you do. Others need some “proof” before they join you.
• The more complicated your project is, the lonelier it will be initially.
• Examples:
  • I procrastinated my PhD at least 1 year to create the beginnings of SciPy (don’t tell my wife).
  • Pearu Peterson put in tremendous work to create f2py and scipy.linalg
  • I spent 18 months not publishing papers to write NumPy (despite many people telling me it was foolish).

Slide 17

Slide 17 text

Keys to Success
• Do what is “right”
• Timing is everything (sometimes you are the right person for the job)
• Having an urgency (it won’t wait)
• Striving for excellence
“Give the best you have…and it will never be enough. Give your best anyway.” (Mother Teresa)

Slide 18

Slide 18 text

Keys to Success
• Build a community
• You will need help to achieve your goals.
• This means other people. This will require sacrificing some of your ego to really listen!
• Someone will point out how you suck (listen to them, you probably do).
• Nurture empathy.
• Treat other people like they matter to you. The only successful way to do that is to actually care. This exposes you to getting hurt. Care about people anyway!
• Much more could be said on this topic…

Slide 19

Slide 19 text

Keys to Success
• Patience (and some luck)
• Good things take time
• The right factors have to come together: some you can influence and some you can’t.

Slide 20

Slide 20 text

So what is this NumPy thing?

Slide 21

Slide 21 text

NumPy: an Array-Oriented Extension
• Data: the array object
  – slicing and shaping
  – data-type map to bytes
• Fast Math:
  – vectorization
  – broadcasting
  – aggregations
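
The two bullet groups above can be illustrated with a small sketch. The array contents and variable names here are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Data: a 2-d array is one block of bytes plus shape and dtype metadata.
a = np.arange(12, dtype=np.int32).reshape(3, 4)

# Slicing and shaping return views onto the same bytes, not copies.
col = a[:, 1]             # second column
flat = a.reshape(-1)      # the same 12 elements seen as 1-d

# The data-type maps every element onto a fixed number of bytes.
bytes_per_item = a.dtype.itemsize    # 4 bytes for int32
total_bytes = a.nbytes               # 12 elements * 4 bytes

# Vectorization: one expression is applied to every element.
squared = a ** 2

# Broadcasting: the length-4 row is stretched across all 3 rows.
shifted = a + np.array([10, 20, 30, 40], dtype=np.int32)

# Aggregations reduce along an axis.
col_sums = a.sum(axis=0)
row_max = a.max(axis=1)
```

Because `col` and `flat` are views, writing through either one would also change `a`.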

Slide 22

Slide 22 text

NumPy Examples 2d array 3d array [439 472 477] [217 205 261 222 245 238] 9.98330639789 2.96677717122

Slide 23

Slide 23 text

NumPy Array shape

Slide 24

Slide 24 text

Zen of NumPy (à la import this)
• strided is better than scattered
• contiguous is better than strided
• descriptive is better than imperative
• array-oriented is better than object-oriented
• broadcasting is a great idea
• vectorized is better than an explicit loop
  • unless it’s too complicated, then use Cython/Numba
• think in higher dimensions
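
A minimal sketch of what "contiguous", "strided", and "vectorized" mean in practice. The array sizes and names are assumptions chosen for illustration.

```python
import numpy as np

a = np.zeros((4, 6), dtype=np.float64)   # C-contiguous by default

# Strides are the byte steps needed to move one index along each axis.
contiguous_strides = a.strides            # 6 items * 8 bytes, then 8 bytes

# A transpose is a strided view: same bytes, swapped strides, no copy.
t = a.T
transposed_strides = t.strides
t_is_c_contiguous = bool(t.flags['C_CONTIGUOUS'])

# Taking every other column is also a strided view.
every_other = a[:, ::2]

# "vectorized is better than an explicit loop": both compute a sum of squares.
x = np.arange(10_000, dtype=np.float64)
loop_total = 0.0
for value in x:
    loop_total += value * value
vector_total = float(np.dot(x, x))
```

Both totals agree; the vectorized form hands the whole loop to compiled code in one call.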

Slide 25

Slide 25 text

Array-Oriented Computing
Example 1: Fibonacci Numbers
$f_n = f_{n-1} + f_{n-2}$, with $f_0 = 0$ and $f_1 = 1$
$f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$

Slide 26

Slide 26 text

Common Python approaches Recursive Iterative Algorithm matters!!
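
The code for this slide did not survive in the transcript. The following is a minimal sketch of the two approaches it names; the function names are assumptions.

```python
def fib_recursive(n):
    """Naive recursion: exponential time, because subproblems are recomputed."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Linear time: only the last two values are carried forward."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


first_ten = [fib_iterative(i) for i in range(10)]
```

Both functions return the same values, but the recursive version slows down sharply as `n` grows, which is the point of "Algorithm matters!!".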

Slide 27

Slide 27 text

Array-oriented approaches Using LFilter Using Formula
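
The slide's code is not in the transcript, so this is one possible sketch of the two approaches it names. It assumes SciPy is installed; `fib_lfilter` and `fib_formula` are hypothetical names. The first treats the recurrence as a recursive (IIR) filter driven by a unit impulse; the second evaluates the closed-form (Binet) formula over a whole array of indices.

```python
import numpy as np
from scipy.signal import lfilter


def fib_lfilter(n):
    """First n Fibonacci numbers (n >= 2) from y[k] = x[k] + y[k-1] + y[k-2]."""
    x = np.zeros(n)
    x[1] = 1.0                       # unit impulse at k = 1, so that f_0 = 0
    y = lfilter([1.0], [1.0, -1.0, -1.0], x)
    return np.rint(y).astype(np.int64)


def fib_formula(n):
    """First n Fibonacci numbers from the closed form, all indices at once."""
    k = np.arange(n, dtype=np.float64)
    sqrt5 = np.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0
    psi = (1.0 - sqrt5) / 2.0
    return np.rint((phi ** k - psi ** k) / sqrt5).astype(np.int64)
```

The formula version relies on double precision, so it is only trustworthy for moderately small indices.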

Slide 28

Slide 28 text

Array-oriented approaches

Slide 29

Slide 29 text

APL: the first array-oriented language
• Appeared in 1964
• Originated by Ken Iverson
• Direct descendants (J, K, Matlab) are still used heavily and people pay a lot of money for them
• NumPy is a descendant
Lineage diagram: APL, J, K, Matlab, Numeric, NumPy, Q

Slide 30

Slide 30 text

Memory using Object-oriented
Diagram: six separate objects, each holding its own Attr1, Attr2, Attr3

Slide 31

Slide 31 text

Array-oriented (Table) approach
Diagram: one table with columns Attr1, Attr2, Attr3 and rows Object1 through Object6

Slide 32

Slide 32 text

Why Array-oriented...
• Today’s vector machines (and vector co-processors, or GPUs) were made for array-oriented computing.
• The software stack has just not caught up, which is unfortunate because APL came out in 1963.
• There is a reason Fortran remains popular.

Slide 33

Slide 33 text

Benefits of Array-oriented
• Many technical problems (advanced analytics) are naturally array-oriented (easy to vectorize)
• Algorithms can be expressed at a high level
• These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages)
• Array-oriented algorithms map to modern hardware caches and pipelines.

Slide 34

Slide 34 text

Complete Example travis/CircleMask

Slide 35

Slide 35 text

NumPy in Data Analytics
• NumPy is a decent array object with some user-friendly features.
• NumPy “arrays of structures” can be used to handle arbitrary data.
First Name   Last Name   Score
Dave         Thomas      89.4
Tasha        Hen         76.6
Cool         Python      100
Stack        Overflow    95.32
Py           Py          75
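
The table above can be held in a NumPy structured array. This is a sketch; the field names `first`, `last`, and `score` and the string widths are assumptions.

```python
import numpy as np

# Each record is a fixed-size structure: two 10-character strings and a float.
person = np.dtype([('first', 'U10'), ('last', 'U10'), ('score', 'f8')])

table = np.array([
    ('Dave', 'Thomas', 89.4),
    ('Tasha', 'Hen', 76.6),
    ('Cool', 'Python', 100.0),
    ('Stack', 'Overflow', 95.32),
    ('Py', 'Py', 75.0),
], dtype=person)

scores = table['score']               # a view onto one field of every record
top = table[scores > 90.0]            # boolean mask selects whole records
best = table[np.argmax(scores)]       # the single record with the highest score
mean_score = scores.mean()
```

Selecting a field gives a strided view into the records, so no data is copied until a mask or fancy index is applied.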

Slide 36

Slide 36 text

Pandas is “Structure of Arrays”
• Labels on the dimensions (indexes)
• Easy manipulation of new columns
• Missing Value handling
• Time Series handling
• General Split-Apply-Combine
• Merge and Join
• Integrated Plotting
• Chained method calls the norm
• Familiar to R users
More user-friendly features!

Slide 37

Slide 37 text

Current Key Libraries
• NumPy
• SciPy
• Pandas
• Matplotlib
• IPython (with notebook)
• PyTables (HDF5)
• scikit-learn
• Statsmodels (Patsy)
• SymPy
• Cython (Numba)
• NumExpr

Slide 38

Slide 38 text

Tools used for Data
Source: O’Reilly Strata attendee survey 2012 and 2013

Slide 39

Slide 39 text

Python for Data Science

Slide 40

Slide 40 text

Python is the top language in schools!

Slide 41

Slide 41 text

Why Python for Technical Computing
• Syntax (it gets out of your way)
• White space preserves “visual real-estate”
• Over-loadable operators
• Complex numbers built-in early
• Just enough language support for arrays
• “Occasional” programmers can grok it
• Packaging with conda is awesome!

Slide 42

Slide 42 text

What is great about Python
• Supports multiple programming styles (functional, object-oriented, scripts, etc.)
• Experienced programmers can also use it effectively (classes, meta-programming techniques)
• Has a simple, extensible implementation (so that C-extensions can exist)
• General-purpose language: can build a system
• Critical mass!
• Allows for community

Slide 43

Slide 43 text

What is wrong with Python?
• Missing anonymous blocks
• Some syntax warts (1:10:2 outside [ ] please)
• The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support)
• No approach to language extension except for abuse of meta-programming and “import hooks” (lightweight DSL need)
• The distraction of multiple run-times...
• Array-oriented and NumPy not really understood by many Python devs (but thank you for ‘@’)

Slide 44

Slide 44 text

What is good about NumPy?
• Array-oriented
• Extensive Dtype System (including structures)
• C-API
• Simple to understand data-structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Broadcasting
• Easy to interface C/C++/Fortran code
• PEP 3118

Slide 45

Slide 45 text

What is wrong with NumPy
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
• Code-base is organic and hard to extend (already inherited from Numeric)
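
The "huge temporaries" point can be seen in a sketch like the one below. The workaround shown, reusing one buffer through the `out=` argument of NumPy ufuncs, is an assumption added for illustration and is not from the slides; NumExpr takes a different route and evaluates the whole expression in chunks.

```python
import numpy as np

n = 100_000
a = np.linspace(0.0, 1.0, n)
b = np.linspace(1.0, 2.0, n)
c = np.linspace(2.0, 3.0, n)

# Immediate mode: a * b is materialized as a full n-element temporary
# before c is added to it, and then the temporary is thrown away.
result = a * b + c

# Same arithmetic, written so that only one output buffer is allocated.
out = np.empty_like(a)
np.multiply(a, b, out=out)   # out = a * b
np.add(out, c, out=out)      # out = out + c, in place
```

For a single expression the saving is small, but for long expressions on large arrays every intermediate result is another full-size allocation.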

Slide 46

Slide 46 text

The most important part
PEP 3118: Revising the buffer protocol
Basically the “structure” of NumPy arrays as a protocol in Python itself, to establish a memory-sharing standard between objects.
It makes possible a heterogeneous world of powerful array-like objects outside of NumPy that can communicate.
Falls short in not defining a general data description language (DDL).
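
A small sketch of the memory sharing that the buffer protocol enables, using only the standard library's `array` module, `memoryview`, and NumPy. The variable names are illustrative.

```python
import array
import numpy as np

# A standard-library array exposes its memory through the buffer protocol.
raw = array.array('d', [1.0, 2.0, 3.0, 4.0])

# NumPy wraps the same bytes without copying them.
view = np.frombuffer(raw, dtype=np.float64)
view[0] = 10.0                # writes through to the array.array object

# In the other direction, memoryview reads a NumPy array's description:
# element format, shape, strides, and item size.
grid = np.zeros((2, 3), dtype=np.float64)
m = memoryview(grid)
description = (m.format, m.shape, m.strides, m.itemsize)
```

Neither object knows about the other's type; they only agree on how the bytes are described.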

Slide 47

Slide 47 text

A Dual of Encapsulation?
• The buffer protocol and data-types provide a different view on encapsulation (the dual of typical encapsulation approaches, for those initiated to linear algebra)
• Rather than attach methods to data and hide the data, you describe data completely and declaratively. You can then later apply arbitrary code to this data at run-time.
• It makes it much easier to move code to data at all levels.

Slide 48

Slide 48 text

What of the future? I’ve watched most of the episodes, so there’s that… I will just describe what I would like to see…

Slide 49

Slide 49 text

“Data Has Mass”

Slide 50

Slide 50 text


Slide 51

Slide 51 text

Workflow Perspective Data-centric Perspective

Slide 52

Slide 52 text

Fundamentally Heterogeneous
• The future of Python data analytics will be heterogeneous
• There will be many projects
  - Biggus
  - DistArray
  - SciDB
  - Elemental
  - Spartan
• Not counting integration with the other projects
  - Spark
  - Impala
  - Disco

Slide 53

Slide 53 text

What about Continuum?
After watching NumPy and SciPy get used all over the world, what would we do differently?
• Blaze
• Numba
• Conda (Anaconda)
All Open Source!

Slide 54

Slide 54 text

Conda and Anaconda
• Cross-platform package management
• Multiple environments allow you to have multiple versions of packages installed on one system
• Easy app-deployment
• Taming open-source
• Users love it!
Free for all users
Enterprise support available!

Slide 55

Slide 55 text

Blaze and Numba
Blaze is motivated by generalizing PEP 3118 to all languages and data-sets: Python Glue 2.0
Numba is motivated by a desire for Python NumPy array-like code to reach the speeds of Fortran!

Slide 56

Slide 56 text

from data to code, seamlessly Blaze

Slide 57

Slide 57 text

Data Pain
• Dealing with data applications has numerous pain points
  - Hundreds of data formats
  - Basic programs expect all data to fit in memory
  - Data analysis pipelines constantly changing from one form to another
  - Sharing analysis involves significant overhead to configure systems
  - Parallelizing analysis requires an expert in a particular distributed computing stack

Slide 58

Slide 58 text

Blaze Architecture
• Flexible architecture to accommodate exploration
• Use compilation of deferred expressions to optimize data interactions
Diagram layers: Deferred Expr, Compilers, Interpreters, Data, Compute, API

Slide 59

Slide 59 text

Blaze Data
• Single interface for data layers
• Composition of different
• Simple API to add custom data formats
Data: SQL, CSV, HDFS, JSON, Mem, Custom, HDF5

Slide 60

Slide 60 text

Blaze Compute
• Computation abstraction over numerous data libraries
• Simple multi-dispatched visitors to implement new backends
• Allows plumbing between stacks to be seamless to user
Compute: DyND, Pandas, PyTables, Spark

Slide 61

Slide 61 text

Blaze Expr
• Lazy computation to minimize data movement
• Simple DAG for compilation to
  • parallel application
  • distributed memory
  • static optimizations
Deferred Expr diagram: temps.hdf5, nasdaq.sql, tweets.json; Join by date; Select NYC; Find Tech Selloff; Plot
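
The idea of a deferred expression can be shown with a toy sketch in plain Python. This is not Blaze's API; every class and function name here (`Expr`, `Symbol`, `Const`, `BinOp`, `wrap`, `evaluate`) is hypothetical. Building the expression only records a small graph, and nothing is computed until data is bound to the symbols.

```python
import operator


class Expr:
    """Node in a deferred expression graph; building one runs no computation."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))


class Symbol(Expr):
    def __init__(self, name):
        self.name = name


class Const(Expr):
    def __init__(self, value):
        self.value = value


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


def wrap(value):
    """Turn plain Python values into constant nodes."""
    return value if isinstance(value, Expr) else Const(value)


def evaluate(expr, bindings):
    """Walk the graph once concrete data is bound to the symbols."""
    if isinstance(expr, Symbol):
        return bindings[expr.name]
    if isinstance(expr, Const):
        return expr.value
    return expr.op(evaluate(expr.left, bindings),
                   evaluate(expr.right, bindings))


x, y = Symbol('x'), Symbol('y')
deferred = (x + y) * 2                       # only a graph so far
result = evaluate(deferred, {'x': 3, 'y': 4})
```

Because the graph exists before any data is touched, a real system has the chance to reorder, fuse, or ship the work to where the data lives; this toy interpreter simply walks it.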

Slide 62

Slide 62 text

Blaze Example - Counting Weblinks
Common Blaze Code
    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())

    # Data Load
    idx, arc = load_data()

    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})
    in_deg = dict(ans)
    in_deg[u'']

Slide 63

Slide 63 text

Blaze Example - Counting Weblinks
load_data using Spark + HDFS
    from pyspark import SparkContext

    sc = SparkContext("local", "Simple App")
    idx = sc.textFile("hdfs://")
    idx = idx.map(lambda x: x.split('\t'))\
             .map(lambda x: [x[0], int(x[1])])
    arc = sc.textFile("hdfs://")
    arc = arc.map(lambda x: x.split('\t'))\
             .map(lambda x: [int(x[0]), int(x[1])])
load_data using Pandas + Local Disk
    from pandas import DataFrame

    with open("example_index.txt") as f:
        idx = [ln.strip().split('\t') for ln in f.readlines()]
    idx = DataFrame(idx, columns=['name', 'node_id'])

    with open("example_arcs.txt") as f:
        arc = [ln.strip().split('\t') for ln in f.readlines()]
    arc = DataFrame(arc, columns=['node_out', 'node_id'])

Slide 64

Slide 64 text

Blaze Ecosystem
• dynd: next-generation NumPy
• libdynd: C++ library for multi-dimensional arrays
• datashape: general data-description language (what PEP 3118 was missing)
• Blaze
  - Data (adapt data from many different silos)
  - Compute (interpreters to run expressions on backends)
  - Expr (Symbolic expressions)
  - Interfaces (Focusing on Tables)
• BLZ: experimental storage format (unsupported)

Slide 65

Slide 65 text

Numba CPython compatible JIT compiler

Slide 66

Slide 66 text

Code that users might write
$x_i = \sum_{j=0}^{i-1} k_{i-j,j}\, a_{i-j}\, a_j$
$O = I \star F$
Slow!!!!

Slide 67

Slide 67 text

Face of a modern compiler
Front-End (Parsing): C++, C, Fortran, ObjC
Intermediate Representation (IR)
Back-End (Code Generation): x86, ARM, PTX

Slide 68

Slide 68 text

Face of a modern compiler
Front-End (Parsing): Python, via Numba
Intermediate Representation (IR): LLVM
Back-End (Code Generation): x86, ARM, PTX

Slide 69

Slide 69 text

Example Numba

Slide 70

Slide 70 text

NumPy + Mamba = Numba
Diagram labels: Python Function, LLVMPY, LLVM Library, Machine Code; Intel, Nvidia, Apple, AMD, ARM; OpenCL, ISPC, CUDA, CLANG, OpenMP

Slide 71

Slide 71 text

Speeding up Math Expressions
$x_i = \sum_{j=0}^{i-1} k_{i-j,j}\, a_{i-j}\, a_j$
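
A sketch of how the sum above might be written as explicit loops and handed to Numba. The function name `weighted_self_sum` is hypothetical, and the fallback decorator is an assumption added so the code still runs as plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit          # compiles the loops when Numba is available
except ImportError:
    def njit(func):                 # fallback: leave the function uncompiled
        return func


@njit
def weighted_self_sum(k, a):
    """x[i] = sum over j = 0 .. i-1 of k[i-j, j] * a[i-j] * a[j]."""
    n = a.shape[0]
    x = np.zeros(n)
    for i in range(n):
        total = 0.0
        for j in range(i):
            total += k[i - j, j] * a[i - j] * a[j]
        x[i] = total
    return x


k = np.ones((5, 5))
a = np.ones(5)
x = weighted_self_sum(k, a)
```

With all weights and inputs equal to one, each `x[i]` simply counts the terms in its sum. The inner index pattern is awkward to vectorize, which is the kind of loop the talk suggests compiling instead.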

Slide 72

Slide 72 text

Image Processing
    from numba import jit

    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                output[i,j] = result
~1500x speed-up

Slide 73

Slide 73 text

Case-study -- j0 from scipy.special
• scipy.special was one of the first libraries I wrote
• extended “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to a differential equation:
$x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0$
$y = J_\alpha(x)$
$J_n(x) = \frac{1}{\pi} \int_0^\pi \cos(n\tau - x \sin\tau)\, d\tau$
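
The integral form of $J_n(x)$ above can be checked numerically with NumPy alone. This is a sketch using a midpoint rule; it is not the cephes algorithm that scipy.special.j0 wraps, and the name `bessel_jn` and the number of sample points are assumptions.

```python
import numpy as np


def bessel_jn(n, x, m=200):
    """Approximate J_n(x) = (1/pi) * integral_0^pi cos(n*tau - x*sin(tau)) dtau."""
    x = np.asarray(x, dtype=np.float64)
    # Midpoints of m equal sub-intervals of [0, pi].
    tau = (np.arange(m) + 0.5) * np.pi / m
    # Broadcasting: every x value is paired with every tau sample.
    integrand = np.cos(n * tau - x[..., np.newaxis] * np.sin(tau))
    # (1/pi) * (pi/m) * sum(integrand) is just the mean over tau.
    return integrand.mean(axis=-1)


j0_at_zero = bessel_jn(0, 0.0)
j0_on_grid = bessel_jn(0, np.linspace(0.0, 10.0, 5))
```

Because the integrand is smooth and periodic, a few hundred sample points are enough for close agreement with the known values of $J_0$ at moderate $x$.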

Slide 74

Slide 74 text

scipy.special.j0 wraps cephes algorithm

Slide 75

Slide 75 text

Result: equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop

    In [7]: from scipy.special import j0

    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop
But! Now code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!

Slide 76

Slide 76 text

Numba can change the game!
Numba turns Python into a “compiled language” (but much more flexible). You don’t have to reach for C/C++
Diagram: front-ends C++, C, Fortran, Python; LLVM IR; back-ends x86, ARM, PTX

Slide 77

Slide 77 text

CUDA-Python
    # API as shown on the slide (2014)
    import numpy as np
    from numba import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x

        i = tid + blkid * blkdim

        # Threads past the end of the array do nothing
        if i >= src.shape[0]:
            return

        dst[i] = src[i] * scale

    # N, grid, and block are not defined on the slide
    src = np.arange(N, dtype=np.float64)
    dst = np.empty_like(src)

    array_scale[grid, block](src, dst, 5.0)
CUDA Development using Python syntax for optimal performance!

Slide 78

Slide 78 text

Example: Black-Scholes
    import math
    from numba import cuda, double

    # Device function: cumulative normal distribution (polynomial approximation)
    @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
    def cnd_cuda(d):
        A1 = 0.31938153
        A2 = -0.356563782
        A3 = 1.781477937
        A4 = -1.821255978
        A5 = 1.330274429
        RSQRT2PI = 0.39894228040143267793994605993438
        K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
        ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
                   (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
        if d > 0:
            ret_val = 1.0 - ret_val
        return ret_val

    @cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:], double, double))
    def black_scholes_cuda(callResult, putResult, S, X, T, R, V):
        # S = stockPrice
        # X = optionStrike
        # T = optionYears
        # R = Riskfree
        # V = Volatility
        i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        if i >= S.shape[0]:
            return
        sqrtT = math.sqrt(T[i])
        d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
        d2 = d1 - V * sqrtT
        cndd1 = cnd_cuda(d1)
        cndd2 = cnd_cuda(d2)

        expRT = math.exp((-1. * R) * T[i])
        callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)
        putResult[i] = (X[i] * expRT * (1.0 - cndd2) - S[i] * (1.0 - cndd1))

    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()

Slide 79

Slide 79 text

Black-Scholes: Results
Core i7 vs. GeForce GTX 560 Ti
About 9x faster on this GPU
~ same speed as CUDA-C

Slide 80

Slide 80 text

Summary
• Python has had a long and fruitful history in Data Analytics
• It will have a long and bright future with your help!
• Join the PyData Community and make the world a better place!

Slide 81

Slide 81 text

Dedication
Nothing I have done or am would be were it not for your patience, love, and support!
Thank you!
Amy Pennock Oliphant

Slide 82

Slide 82 text

Community Commercials
Donate to NumFOCUS
Become a PSF Member:
Attend PyData Berlin and Present at Future PyData events: