EuroPython Keynote July 25, 2014

Python and Big Data Analytics: Past, Present, and Future

Travis E. Oliphant

July 25, 2014

Transcript

  1. Python in Big Data
    Analytics: Past, Present, and
    Future
    EuroPython, July 25, 2014
    Travis E. Oliphant

  2. My Roots

  3. My Roots
Images from BYU MERS Lab

  4. Science led to Python
    Raja Muthupillai
    Armando Manduca
    Richard Ehman
    1997
    $\rho_0\,(2\pi f)^2\, U_i(a, f) = \left[\,C_{ijkl}(a, f)\, U_{k,l}(a, f)\,\right]_{,j}$

  5. Finding derivatives of 5-d data
    $\Xi = \nabla \times U$

  6. Scientist at heart

  7. Python origins.
    Version Date
    0.9.0 Feb. 1991
    0.9.4 Dec. 1991
    0.9.6 Apr. 1992
    0.9.8 Jan. 1993
    1.0.0 Jan. 1994
    1.2 Apr. 1995
    1.4 Oct. 1996
    1.5.2 Apr. 1999
    http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

  8. First problem: Efficient Data Input
    “It’s Always About the Data”
    Reference Counting Essay (May 1998), Guido van Rossum
    http://www.python.org/doc/essays/refcnt/
    TableIO (April 1998), Michael A. Miller
    NumPyIO (June 1998)

  9. Early pieces of SciPy
    cephesmodule (June 1998)
    fftw wrappers (November 1998)
    stats.py (December 1998), Gary Strangman

  10. 1999 : Early SciPy emerges
    Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis
    environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
    and others. Activity in 1998 led to increased interest in 1999.

    In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
    present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
    be creating this uber-package, which eventually became SciPy.

    Gaussian quadrature                          5 Jan 1999
    cephes 1.0                                   30 Jan 1999
    sigtools 0.40                                23 Feb 1999
    Numeric docs                                 March 1999
    cephes 1.1                                   9 Mar 1999
    multipack 0.3                                13 Apr 1999
    Helper routines                              14 Apr 1999
    multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
    sparse plan described                        30 May 1999
    multipack 0.7                                14 Jun 1999
    SparsePy 0.1                                 5 Nov 1999
    cephes 1.2 (vectorize)                       29 Dec 1999

    Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py.

  11. Using Numeric circa 2000
    Image from my PhD thesis, made using a Python interface to DISLIN and using
    Numeric + hand-written C-extensions for inverting the wave equation.

  12. SciPy 2001
    Founded in 2001 with Travis Vaught
    Eric Jones: weave, cluster, GA*
    Pearu Peterson: linalg, interpolate, f2py
    Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

  13. Brief History
    Person                                      Package                   Year
    Jim Fulton                                  Matrix Object in Python   1994
    Jim Hugunin                                 Numeric                   1995
    Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
    Travis Oliphant                             NumPy                     2005

  14. Now an impressive community effort
    • Chuck Harris
    • Pauli Virtanen
    • Robert Kern
    • Warren Weckesser
    • Ralf Gommers
    • Mark Wiebe
    • Nathaniel Smith
    • Nathan Bell
    • Stefan van der Walt
    • Matthew Brett
    • Josef Perktold …

  15. Over 3,000,000 users of NumPy!

  16. Keys to Success
    • Hard work — especially up front

    • Often lonely — initially nobody believes in your idea
    more than you do. Others need some “proof” before
    they join you.

    • The more complicated what you are doing is, the
    lonelier it will be initially.

    • Examples:

    • I procrastinated my PhD at least 1 year to create the
    beginnings of SciPy (don’t tell my wife).

    • Pearu Peterson put in tremendous work to create f2py and
    scipy.linalg

    • I spent 18 months not publishing papers to write NumPy
    (despite many people telling me it was foolish).

  17. Keys to Success
    • Do what is “right”

    • Timing is everything (sometimes
    you are the right person for the
    job)

    • Having an urgency (it won’t wait)

    • Striving for excellence
    Give the best you have…and it will never be
    enough. Give your best anyway.

    — Mother Teresa

  18. Keys to Success
    • Build a community

    • You will need help to achieve your goals.

    • This means other people. This will require sacrificing
    some of your ego to really listen!

    • Someone will point out how you suck (listen to them,
    you probably do).

    • Nurture empathy.

    • Treat other people like they matter to you — the only
    successful way to do that is to actually care. This
    exposes you to being hurt — care about people anyway!

    • Much more could be said on this topic…

  19. Keys to Success
    • Patience (and some luck)

    • Good things take time

    • The right factors have to
    come together — some
    you can influence and some
    you can’t.

  20. So what is this NumPy thing?

  21. NumPy: an Array-Oriented Extension
    • Data: the array object
    – slicing and shaping
    – data-type: a map to bytes
    • Fast Math:
    – vectorization
    – broadcasting
    – aggregations

  22. NumPy Examples
    (live code: constructing 2-d and 3-d arrays and computing aggregations;
    outputs captured from the screenshots: [439 472 477], [217 205 261 222 245 238],
    9.98330639789, 2.96677717122)
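    A minimal sketch of the kind of examples shown (the data here is random, so
    the printed values will not match the slide):

    import numpy as np

    a = np.random.randint(0, 200, size=(5, 3))    # 2d array
    b = np.random.rand(2, 3, 4)                   # 3d array

    print(a.sum(axis=0))   # one aggregate per column
    print(a.sum(axis=1))   # one aggregate per row
    print(b.mean())        # full reduction to a scalar
    print(b.std())         # another scalar aggregation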

  23. NumPy Array
    (diagram of the NumPy array object and its shape)

  24. Zen of NumPy (à la import this)
    • strided is better than scattered
    • contiguous is better than strided
    • descriptive is better than imperative
    • array-oriented is better than object-oriented
    • broadcasting is a great idea
    • vectorized is better than an explicit loop
    • unless it’s too complicated --- then use Cython/Numba
    • think in higher dimensions
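    Two of these aphorisms, vectorization and broadcasting, made concrete in a
    small illustrative sketch (not from the slides):

    import numpy as np

    x = np.linspace(0.0, 1.0, 1000000)

    # explicit loop: one Python-level operation per element (slow)
    y_loop = np.empty_like(x)
    for i in range(x.size):
        y_loop[i] = 3.0 * x[i] ** 2 + 1.0

    # vectorized: the same computation as whole-array expressions (fast)
    y_vec = 3.0 * x ** 2 + 1.0

    # broadcasting: a (3, 1) column against a (4,) row yields a (3, 4) grid
    grid = np.arange(3).reshape(3, 1) * 10 + np.arange(4)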

  25. Array-Oriented Computing
    Example 1: Fibonacci Numbers
    $f_n = f_{n-1} + f_{n-2}, \qquad f_0 = 0, \quad f_1 = 1$
    $f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$

  26. Common Python approaches
    Recursive vs. iterative
    Algorithm matters!!
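    The two versions appear on the slide as screenshots; minimal reconstructions
    of the standard approaches:

    def fib_recursive(n):
        # exponential time: recomputes the same subproblems over and over
        if n < 2:
            return n
        return fib_recursive(n - 1) + fib_recursive(n - 2)

    def fib_iterative(n):
        # linear time: carry the last two values forward
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a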

  27. Array-oriented approaches
    Using lfilter
    Using a formula
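    Also shown as screenshots; hedged sketches of what these plausibly look like
    (the lfilter version treats the recurrence as an IIR filter driven by a unit
    impulse):

    import numpy as np
    from scipy.signal import lfilter

    def fib_lfilter(n):
        # y[k] = x[k] + y[k-1] + y[k-2]  ->  b = [1], a = [1, -1, -1]
        x = np.zeros(n)
        x[0] = 1.0
        return lfilter([1.0], [1.0, -1.0, -1.0], x)   # 1, 1, 2, 3, 5, ...

    def fib_formula(n):
        # Binet's closed form, evaluated for all n at once
        k = np.arange(n)
        phi = (1 + np.sqrt(5)) / 2
        return np.round((phi ** k - (-phi) ** (-k)) / np.sqrt(5))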

  28. Array-oriented approaches

  29. APL : the first array-oriented language
    • Appeared in 1964
    • Originated by Ken Iverson
    • Direct descendants (J, K, Matlab) are still used
    heavily, and people pay a lot of money for them
    • NumPy is a descendant
    (lineage diagram: APL → J, K, Q, Matlab, Numeric → NumPy)

  30. Memory using Object-oriented
    (diagram: six Object boxes, each carrying its own Attr1, Attr2, and Attr3;
    attributes are scattered object-by-object across memory)

  31. Array-oriented (Table) approach
    (diagram: a table with columns Attr1, Attr2, Attr3 and rows Object1 through
    Object6; each attribute is stored as one contiguous column)
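    A hedged sketch of the contrast in NumPy terms (field names illustrative):

    import numpy as np

    # array-of-structures: one record per object (the object-oriented layout)
    aos = np.zeros(6, dtype=[('attr1', 'f8'), ('attr2', 'f8'), ('attr3', 'f8')])

    # structure-of-arrays: one contiguous column per attribute (the table layout)
    soa = {name: np.zeros(6) for name in ('attr1', 'attr2', 'attr3')}

    aos['attr1']   # a strided view across all records
    soa['attr1']   # fully contiguous: cache- and SIMD-friendly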

  32. Why Array-oriented...
    • Today’s vector machines (and vector co-processors,
    or GPUs) were made for array-oriented computing.

    • The software stack has just not caught up ---
    unfortunate because APL came out in 1963.

    • There is a reason Fortran remains popular.

  33. Benefits of Array-oriented
    • Many technical problems (advanced analytics)
    are naturally array-oriented (easy to vectorize)

    • Algorithms can be expressed at a high-level

    • These algorithms can be parallelized more
    simply (quite often much information is lost in
    the translation to typical “compiled” languages)

    • Array-oriented algorithms map to modern
    hardware caches and pipelines.

  34. Complete Example
    https://www.wakari.io/sharing/bundle/travis/CircleMask

  35. NumPy in Data Analytics
    • NumPy is a decent array object with some user-
    friendly features.

    • NumPy “arrays of structures” can be used to
    handle arbitrary data.
    http://people.rit.edu/blbgse/pythonNotes/numpy.html
    First Name   Last Name   Score
    Dave         Thomas      89.4
    Tasha        Hen         76.6
    Cool         Python      100
    Stack        Overflow    95.32
    Py           Py          75
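    A hedged sketch of that table as a NumPy “array of structures” (the dtype
    field names are illustrative):

    import numpy as np

    scores = np.array(
        [('Dave', 'Thomas', 89.4), ('Tasha', 'Hen', 76.6),
         ('Cool', 'Python', 100.0), ('Stack', 'Overflow', 95.32),
         ('Py', 'Py', 75.0)],
        dtype=[('first', 'U10'), ('last', 'U10'), ('score', 'f8')])

    scores['score'].mean()          # aggregate a single field
    scores[scores['score'] > 80]    # boolean selection of whole records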

  36. Pandas is “Structure of Arrays”
    • Labels on the dimensions (indexes)
    • Easy manipulation of new columns
    • Missing Value handling
    • Time Series handling
    • General Split-Apply-Combine
    • Merge and Join
    • Integrated Plotting
    • Chained method calls the norm
    • Familiar to R users — more user-friendly features!
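    A brief hedged illustration of that style (the data is invented):

    import pandas as pd

    df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Berlin', 'Paris'],
                       'sales': [10.0, 12.5, 11.0, None]},
                      index=pd.date_range('2014-07-01', periods=4))

    df['sales'] = df['sales'].fillna(0.0)        # missing-value handling
    by_city = df.groupby('city')['sales'].sum()  # split-apply-combine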

  37. Current Key Libraries
    • NumPy

    • SciPy

    • Pandas

    • Matplotlib

    • IPython (with notebook)

    • PyTables (HDF5)

    • Scikit learn

    • Statsmodels (Patsy)

    • SymPy

    • Cython (Numba)

    • NumExpr

  38. Tools used for Data
    Source: O’Reilly Strata attendee survey 2012 and 2013

  39. Python for Data Science
    http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science

  40. Python is the top language in schools!

  41. Why Python for Technical Computing
    • Syntax (it gets out of your way)

    • White space preserves “visual real-estate”

    • Over-loadable operators

    • Complex numbers built-in early

    • Just enough language support for arrays

    • “Occasional” programmers can grok it

    • Packaging with conda is awesome!

  42. What is great about Python
    • Supports multiple programming styles (functional,
    object-oriented, scripts, etc.)

    • Experienced programmers can also use it
    effectively (classes, meta-programming techniques)

    • Has a simple, extensible implementation (so that
    C-extensions can exist)

    • General-purpose language --- can build a system

    • Critical mass!

    • Allows for community

  43. What is wrong with Python?
    • Missing anonymous blocks

    • Some syntax warts (1:10:2 outside [ ] please)

    • The CPython run-time is aged and needs an overhaul
    (GIL, global variables, lack of dynamic compilation
    support)

    • No approach to language extension except for abuse
    of meta-programming and “import
    hooks” (lightweight DSL need)

    • The distraction of multiple run-times...

    • Array-oriented and NumPy not really understood by
    many Python devs (but thank you for ‘@’)

  44. What is good about NumPy?
    • Array-oriented

    • Extensive Dtype System (including structures)

    • C-API

    • Simple to understand data-structure

    • Memory mapping

    • Syntax support from Python

    • Large community of users

    • Broadcasting

    • Easy to interface C/C++/Fortran code

    • PEP 3118
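    Two of these features, memory mapping and broadcasting, in a short hedged
    sketch (the file name is illustrative):

    import numpy as np

    # memory mapping: work with an on-disk array without reading it into RAM
    data = np.memmap('samples.bin', dtype='f8', mode='w+', shape=(1000, 3))

    # broadcasting: subtract a per-column mean with no explicit loop
    data -= data.mean(axis=0)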

  45. What is wrong with NumPy
    • Dtype system is difficult to extend

    • Immediate mode creates huge temporaries
    (spawning Numexpr)

    • “Almost” an in-memory database comparable
    to SQLite (missing indexes)

    • Integration with sparse arrays

    • Lots of un-optimized parts

    • Minimal support for multi-core / GPU

    • Code-base is organic and hard to extend
    (already inherited from Numeric)

  46. The most important part
    PEP 3118: Revising the buffer protocol
    Basically the “structure” of NumPy arrays
    as a protocol in Python itself to establish a
    memory-sharing standard between objects.
    It makes possible a heterogeneous world of powerful
    array-like objects outside of NumPy that can
    communicate with each other.

    Falls short in not defining a general data-description language (DDL).
    http://python.org/dev/peps/pep-3118/
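    A minimal illustration of the protocol using only the standard library and
    NumPy:

    import array
    import numpy as np

    buf = array.array('d', [1.0, 2.0, 3.0])

    m = memoryview(buf)                 # a PEP 3118 view: format 'd', shape (3,)
    a = np.frombuffer(buf, dtype='d')   # NumPy consumes the same buffer, zero-copy

    a[0] = 42.0
    print(buf[0])                       # 42.0: both names share one memory block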

  47. A Dual of Encapsulation?
    • The buffer protocol and data-types provide a different view on
    encapsulation (the dual of typical encapsulation approaches, for those
    initiated to linear algebra)

    • Rather than attach methods to data and hide the data, you describe data
    completely and declaratively. You can then later apply arbitrary code to
    this data at run-time.

    • It makes it much easier to move code to data at all levels.

  48. What of the future?
    I’ve watched most of the
    episodes, so there’s that…
    I will just describe what
    I would like to see…

  49. “Data Has Mass”
    http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/

  50. Uh...
    http://datagravity.org/2012/06/26/a-formula-for-data-gravity/

  51. Workflow Perspective vs. Data-centric Perspective
  52. Fundamentally Heterogeneous
    • The future of Python data analytics will be heterogeneous

    • There will be many projects

    - Biggus

    - DistArray

    - SciDB

    - Elemental

    - Spartan

    • Not counting integration with the other projects

    - Spark

    - Impala

    - Disco

  53. What about Continuum?
    After watching NumPy and SciPy get used all over the
    world --- what would we do differently?
    Blaze

    Numba

    Conda (Anaconda)
    All Open Source!

  54. Conda and Anaconda
    • Cross-platform package management

    • Multiple environments allow you to have multiple
    versions of packages installed on a system

    • Easy app-deployment

    • Taming open-source

    • Users love it!
    Free for all users; enterprise support available!

  55. Blaze and Numba
    Blaze is motivated by generalizing PEP 3118 to all languages and
    data-sets — Python Glue 2.0.
    Numba is motivated by a desire for Python NumPy array-like code to
    reach the speeds of Fortran!

  56. from data to code, seamlessly
    Blaze

  57. Data Pain
    • Dealing with data applications has numerous pain points:

    - Hundreds of data formats

    - Basic programs expect all data to fit in memory

    - Data analysis pipelines constantly changing from one form to another

    - Sharing analysis carries significant overhead to configure systems

    - Parallelizing analysis requires expertise in a particular distributed
    computing stack

  58. Blaze Architecture
    (diagram: Deferred Expr feeding Compilers and Interpreters, layered over
    Data, Compute, and API)

    • Flexible architecture to accommodate exploration

    • Use compilation of deferred expressions to optimize data interactions

  59. Blaze Data
    • Single interface for data layers

    • Composition of different formats

    • Simple API to add custom data formats

    (diagram: a Data hub connecting SQL, CSV, HDFS, JSON, HDF5, Mem, and
    Custom formats)

  60. Blaze Compute
    (diagram: Compute dispatching to DyND, Pandas, PyTables, and Spark)

    • Computation abstraction over numerous data libraries

    • Simple multi-dispatched visitors to implement new backends

    • Allows plumbing between stacks to be seamless to the user

  61. Blaze Expr
    (diagram: a deferred-expression DAG in which temps.hdf5, nasdaq.sql, and
    tweets.json feed Join by date, Select NYC, Find Tech Selloff, and Plot)

    • Lazy computation to minimize data movement

    • Simple DAG for compilation to

    • parallel application

    • distributed memory

    • static optimizations

  62. Blaze Example - Counting Weblinks
    Common Blaze Code
    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())

    # Data Load
    idx, arc = load_data()

    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})

    in_deg = dict(ans)
    in_deg[u'blogspot.com']

  63. Blaze Example - Counting Weblinks
    Using Spark + HDFS (load_data):
    sc = SparkContext("local", "Simple App")
    idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
    idx = idx.map(lambda x: x.split('\t')) \
             .map(lambda x: [x[0], int(x[1])])
    arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
    arc = arc.map(lambda x: x.split('\t')) \
             .map(lambda x: [int(x[0]), int(x[1])])

    Using Pandas + Local Disk (load_data):
    with open("example_index.txt") as f:
        idx = [ln.strip().split('\t') for ln in f.readlines()]
    idx = DataFrame(idx, columns=['name', 'node_id'])

    with open("example_arcs.txt") as f:
        arc = [ln.strip().split('\t') for ln in f.readlines()]
    arc = DataFrame(arc, columns=['node_out', 'node_id'])

  64. Blaze Ecosystem
    • dynd — next-generation NumPy

    • libdynd — C++ library for multi-dimensional arrays

    • datashape — general data-description language (what
    PEP 3118 was missing)

    • Blaze

    - Data (adapt data from many different silos)

    - Compute (interpreters to run expressions on backends)

    - Expr (Symbolic expressions)

    - Interfaces (Focusing on Tables)

    • BLZ — experimental storage format (unsupported)

  65. Numba

    CPython compatible JIT compiler

  66. Code that users might write
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j \qquad O = I \star F$
    Slow!!!!

  67. Face of a modern compiler
    (diagram) Front-End (Parsing: C, C++, Fortran, ObjC) → Intermediate
    Representation (IR) → Back-End (Code Generation: x86, ARM, PTX)

  68. Face of a modern compiler
    (diagram) Front-End (Parsing: Python, via Numba) → Intermediate
    Representation (IR) → Back-End (Code Generation via LLVM: x86, ARM, PTX)

  69. Example
    Numba
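    The example itself is a screenshot; a minimal hedged reconstruction of a
    typical first Numba example:

    import numpy as np
    from numba import jit

    @jit
    def sum2d(arr):
        # plain nested loops, compiled to machine code on first call
        M, N = arr.shape
        total = 0.0
        for i in range(M):
            for j in range(N):
                total += arr[i, j]
        return total

    print(sum2d(np.ones((100, 100))))   # 10000.0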

  70. NumPy + Mamba = Numba
    (diagram: the LLVM library, relied on by Intel, Nvidia, Apple, and AMD
    through projects such as CLANG, CUDA, OpenCL, ISPC, and OpenMP; llvmpy
    bridges a Python function to machine code for x86 and ARM)

  71. Speeding up Math Expressions
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
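    The accompanying code is a screenshot; a hedged sketch of a Numba-compiled
    version of this expression (array shapes assumed):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def expr(k, a):
        # x_i = sum_{j=0}^{i-1} k[i-j, j] * a[i-j] * a[j]
        n = a.shape[0]
        x = np.zeros(n)
        for i in range(n):
            acc = 0.0
            for j in range(i):
                acc += k[i - j, j] * a[i - j] * a[j]
            x[i] = acc
        return x

    x = expr(np.random.rand(1000, 1000), np.random.rand(1000))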

  72. Image Processing
    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                output[i, j] = result
    ~1500x speed-up
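    A hedged usage sketch for the kernel above (image and filter invented):

    import numpy as np

    image = np.random.rand(512, 512)
    filt = np.ones((9, 9)) / 81.0       # simple box blur
    output = np.zeros_like(image)
    filter(image, filt, output)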

  73. Case-study -- j0 from scipy.special
    • scipy.special was one of the first libraries I wrote

    • extended “umath” module by adding new
    “universal functions” to compute many scientific
    functions by wrapping C and Fortran libs.

    • Bessel functions are solutions to a differential
    equation:
    $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0,
    \qquad y = J_\alpha(x)$

    $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$

  74. scipy.special.j0 wraps cephes algorithm
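    The wrapped source appears as a screenshot; a hedged sketch of the idea, a
    ufunc-like j0 written in pure Python with Numba's @vectorize, here using a
    truncated ascending series rather than cephes' full rational approximations:

    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64)'])
    def vj0(x):
        # J0(x) = sum_m (-1)^m (x^2/4)^m / (m!)^2, truncated (fine for small x)
        z = 0.25 * x * x
        term, total = 1.0, 1.0
        for m in range(1, 12):
            term *= -z / (m * m)
            total += term
        return total

    x = np.linspace(0.0, 5.0, 10000)
    y = vj0(x)   # behaves like a NumPy ufunc, broadcasting and all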

  75. Result --- equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop

    In [7]: from scipy.special import j0

    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop

    But! Now the code is in Python and can be experimented with more easily
    (and moved to the GPU / accelerator more easily)!

  76. Numba can change the game!
    (diagram: C, C++, Fortran, and now Python via Numba all lower to LLVM IR,
    then to x86, ARM, or PTX back-ends)
    Numba turns Python into a “compiled language” (but much more flexible).
    You don’t have to reach for C/C++.

  77. CUDA-Python
    from numba import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x

        i = tid + blkid * blkdim

        if i >= n:   # n: the array length, defined on the host (elided on the slide)
            return

        dst[i] = src[i] * scale

    src = np.arange(N, dtype=np.float)
    dst = np.empty_like(src)

    array_scale[grid, block](src, dst, 5.0)

    CUDA development using Python syntax for optimal performance!

  78. Example: Black-Scholes
    @cuda.jit(argtypes=(double[:], double[:], double[:], double[:],
                        double[:], double, double))
    def black_scholes_cuda(callResult, putResult, S, X, T, R, V):
        # S = stockPrice, X = optionStrike, T = optionYears
        # R = Riskfree, V = Volatility
        i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        if i >= S.shape[0]:
            return
        sqrtT = math.sqrt(T[i])
        d1 = (math.log(S[i] / X[i]) +
              (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
        d2 = d1 - V * sqrtT
        cndd1 = cnd_cuda(d1)
        cndd2 = cnd_cuda(d2)

        expRT = math.exp((-1. * R) * T[i])
        callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)
        putResult[i] = (X[i] * expRT * (1.0 - cndd2) -
                        S[i] * (1.0 - cndd1))

    @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
    def cnd_cuda(d):
        A1 = 0.31938153
        A2 = -0.356563782
        A3 = 1.781477937
        A4 = -1.821255978
        A5 = 1.330274429
        RSQRT2PI = 0.39894228040143267793994605993438
        K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
        ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
                   (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
        if d > 0:
            ret_val = 1.0 - ret_val
        return ret_val

    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N) / blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()

  79. Black-Scholes: Results
    (benchmark chart: Core i7 CPU vs. GeForce GTX 560 Ti; about 9x faster on
    this GPU, ~same speed as CUDA-C)

  80. Summary
    • Python has had a long and fruitful history in Data
    Analytics

    • It will have a long and bright future with your help!

    • Join the PyData Community and make the world a
    better place!

  81. Dedication
    Nothing I have done or am would be were it not for your patience, love,
    and support!

    Thank you!
    Amy Pennock Oliphant

  82. Donate to NumFOCUS: http://numfocus.org
    Become a PSF member: https://www.python.org/psf/
    Attend PyData Berlin and present at future PyData events: http://pydata.org
    Community Commercials
