
EuroPython Keynote July 25, 2014

Python and Big Data Analytics: Past, Present, and Future

Travis E. Oliphant


Transcript

  1. Science led to Python

     Raja Muthupillai, Armando Manduca, Richard Ehman (1997)

     $\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$

  2. Python origins

     Version   Date
     0.9.0     Feb. 1991
     0.9.4     Dec. 1991
     0.9.6     Apr. 1992
     0.9.8     Jan. 1993
     1.0.0     Jan. 1994
     1.2       Apr. 1995
     1.4       Oct. 1996
     1.5.2     Apr. 1999

     http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

  3. First problem: efficient data input — "It's Always About the Data"

     Reference Counting Essay, May 1998 — Guido van Rossum
     (http://www.python.org/doc/essays/refcnt/)
     TableIO, April 1998 — Michael A. Miller
     NumPyIO, June 1998

  4. Early pieces of SciPy

     cephesmodule, June 1998
     fftw wrappers, November 1998
     stats.py, December 1998 — Gary Strangman

  5. 1999: early SciPy emerges

     Discussions on the matrix-sig from 1997 to 1999 called for a complete data
     analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul
     Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest
     in 1999.

     In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt
     needed to be present and began wrapping/writing in earnest. On 6 April 1999 I
     announced I would be creating this uber-package, which eventually became SciPy.

     Gaussian quadrature                          5 Jan 1999
     cephes 1.0                                   30 Jan 1999
     sigtools 0.40                                23 Feb 1999
     Numeric docs                                 March 1999
     cephes 1.1                                   9 Mar 1999
     multipack 0.3                                13 Apr 1999
     Helper routines                              14 Apr 1999
     multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
     sparse plan described                        30 May 1999
     multipack 0.7                                14 Jun 1999
     SparsePy 0.1                                 5 Nov 1999
     cephes 1.2 (vectorize)                       29 Dec 1999

     Plotting? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py.

  6. Using Numeric circa 2000

     Image from my PhD thesis, made using a Python interface to DISLIN and
     Numeric + hand-written C extensions for inverting the wave equation.

  7. SciPy 2001

     Founded in 2001 with Travis Vaught.

     Eric Jones: weave, cluster, GA*
     Pearu Peterson: linalg, interpolate, f2py
     Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal,
     stats, fftpack, misc

  8. Brief history

     Person                                      Package                   Year
     Jim Fulton                                  Matrix Object in Python   1994
     Jim Hugunin                                 Numeric                   1995
     Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
     Travis Oliphant                             NumPy                     2005

  9. Now an impressive community effort

     Chuck Harris, Pauli Virtanen, Robert Kern, Warren Weckesser, Ralf Gommers,
     Mark Wiebe, Nathaniel Smith, Nathan Bell, Stefan van der Walt, Matthew Brett,
     Josef Perktold, …

  10. Over 3,000,000 users of NumPy!

  11. Keys to success

     • Hard work — especially up front.
     • Often lonely — initially nobody believes in your idea more than you do.
       Others need some "proof" before they join you.
     • The more complicated the thing you are doing, the lonelier it will be
       initially.
     • Examples:
       • I procrastinated my PhD at least a year to create the beginnings of SciPy
         (don't tell my wife).
       • Pearu Peterson put in tremendous work to create f2py and scipy.linalg.
       • I spent 18 months not publishing papers in order to write NumPy (despite
         many people telling me it was foolish).

  12. Keys to success

     • Do what is "right".
     • Timing is everything (sometimes you are the right person for the job).
     • Having an urgency (it won't wait).
     • Striving for excellence.

     "Give the best you have… and it will never be enough. Give your best anyway."
     — Mother Teresa

  13. Keys to success

     • Build a community.
     • You will need help to achieve your goals. This means other people, and it
       will require sacrificing some of your ego to really listen!
     • Someone will point out how you suck (listen to them; you probably do).
     • Nurture empathy.
     • Treat other people like they matter to you — the only successful way to do
       that is to actually care. This exposes you to getting hurt — care about
       people anyway!
     • Much more could be said on this topic…

  14. Keys to success

     • Patience (and some luck).
     • Good things take time.
     • The right factors have to come together — some you can influence and some
       you can't.

  15. NumPy: an array-oriented extension

     • Data: the array object
       – slicing and shaping
       – data-type map to bytes
     • Fast math:
       – vectorization
       – broadcasting
       – aggregations

  16. NumPy examples

     (Slide shows live examples: a 2-d and a 3-d array, axis sums such as
     [439 472 477] and [217 205 261 222 245 238], and scalar aggregations
     9.98330639789 and 2.96677717122. A sketch of such examples follows.)

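     The transcript keeps only the outputs, not the code; a minimal sketch of the
     kind of examples shown (the arrays here are random, so the numbers will
     differ from the slide's):

        import numpy as np

        a = np.random.randint(0, 100, size=(5, 3))     # 2-d array
        b = np.random.randint(0, 100, size=(2, 3, 4))  # 3-d array

        print(a.sum(axis=0))   # column sums, e.g. [439 472 477]
        print(b.sum(axis=2))   # a (2, 3) table of sums over the last axis

        x = np.random.normal(10, 3, size=1000)
        print(x.mean(), x.std())   # scalar aggregations
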
  17. Zen of NumPy (à la import this)

     • strided is better than scattered
     • contiguous is better than strided
     • descriptive is better than imperative
     • array-oriented is better than object-oriented
     • broadcasting is a great idea
     • vectorized is better than an explicit loop
     • unless it's too complicated --- then use Cython/Numba
     • think in higher dimensions

  18. Array-oriented computing — example 1: Fibonacci numbers

     $f_n = f_{n-1} + f_{n-2}, \qquad f_0 = 0, \quad f_1 = 1$

     $f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$

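     The slide's code is not in the transcript; a minimal sketch of one
     array-oriented way to express this, via powers of the 2x2 companion matrix:

        import numpy as np

        # f_n appears in entry (0, 1) of the n-th power of [[1, 1], [1, 0]].
        # int64 overflows past f_92, so keep n modest in this sketch.
        F = np.array([[1, 1], [1, 0]], dtype=np.int64)
        print([np.linalg.matrix_power(F, n)[0, 1] for n in range(10)])
        # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
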
  19. APL: the first array-oriented language

     • Appeared in 1964.
     • Originated by Ken Iverson.
     • Direct descendants (J, K, Matlab) are still used heavily, and people pay a
       lot of money for them.
     • NumPy is a descendant.

     (Family-tree diagram: APL → J, K, Q, Matlab; APL → Numeric → NumPy.)

  20. Memory using object-oriented

     (Diagram: six separate objects, each carrying its own Attr1, Attr2, Attr3,
     scattered across memory.)

  21. Why array-oriented...

     • Today's vector machines (and vector co-processors, or GPUs) were made for
       array-oriented computing.
     • The software stack has just not caught up --- unfortunate, because APL came
       out in the early 1960s.
     • There is a reason Fortran remains popular.

  22. Benefits of array-oriented

     • Many technical problems (advanced analytics) are naturally array-oriented
       (easy to vectorize).
     • Algorithms can be expressed at a high level (see the sketch after this
       list).
     • These algorithms can be parallelized more simply (quite often much
       information is lost in the translation to typical "compiled" languages).
     • Array-oriented algorithms map to modern hardware caches and pipelines.

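     A minimal sketch (my own illustration, not from the slides) of "expressed at
     a high level": all pairwise distances between two point sets in one broadcast
     expression, with no explicit loops:

        import numpy as np

        x = np.random.rand(100, 3)   # 100 points in 3-d
        y = np.random.rand(50, 3)    # 50 points in 3-d

        # (100, 1, 3) - (1, 50, 3) broadcasts to (100, 50, 3); reduce last axis.
        d = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))
        print(d.shape)               # (100, 50)
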
  23. NumPy in data analytics

     • NumPy is a decent array object with some user-friendly features.
     • NumPy "arrays of structures" can be used to handle arbitrary data, such as
       the table below (a sketch follows).

     First Name   Last Name   Score
     Dave         Thomas      89.4
     Tasha        Hen         76.6
     Cool         Python      100
     Stack        Overflow    95.32
     Py           Py          75

     http://people.rit.edu/blbgse/pythonNotes/numpy.html

  24. Pandas is "structure of arrays"

     • Labels on the dimensions (indexes)
     • Easy manipulation of new columns
     • Missing-value handling
     • Time-series handling
     • General split-apply-combine
     • Merge and join
     • Integrated plotting
     • Chained method calls are the norm
     • Familiar to R users — more user-friendly features! (Sketch below.)

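     A minimal sketch of a few of these features (the data are made up for
     illustration):

        import pandas as pd

        df = pd.DataFrame(
            {"city": ["Berlin", "Berlin", "Paris", "Paris"],
             "temp": [21.0, None, 25.0, 24.0]},
            index=pd.date_range("2014-07-22", periods=4))   # labeled index

        df["temp"] = df["temp"].fillna(df["temp"].mean())   # missing values
        print(df.groupby("city")["temp"].mean())            # split-apply-combine
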
  25. Current key libraries

     • NumPy
     • SciPy
     • Pandas
     • Matplotlib
     • IPython (with notebook)
     • PyTables (HDF5)
     • scikit-learn
     • statsmodels (Patsy)
     • SymPy
     • Cython (Numba)
     • NumExpr

  26. Why Python for technical computing

     • Syntax (it gets out of your way)
     • Whitespace preserves "visual real estate"
     • Overloadable operators
     • Complex numbers built in early
     • Just enough language support for arrays
     • "Occasional" programmers can grok it
     • Packaging with conda is awesome!

  27. What is great about Python

     • Supports multiple programming styles (functional, object-oriented,
       scripts, etc.)
     • Experienced programmers can also use it effectively (classes,
       meta-programming techniques)
     • Has a simple, extensible implementation (so that C extensions can exist)
     • General-purpose language --- you can build a system with it
     • Critical mass!
     • Allows for community

  28. What is wrong with Python?

     • Missing anonymous blocks
     • Some syntax warts (1:10:2 outside [ ], please — see the note below)
     • The CPython run-time is aged and needs an overhaul (GIL, global variables,
       lack of dynamic-compilation support)
     • No approach to language extension except abuse of meta-programming and
       "import hooks" (a lightweight-DSL need)
     • The distraction of multiple run-times…
     • Array-oriented computing and NumPy are not really understood by many
       Python devs (but thank you for '@')

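     For readers unfamiliar with the 1:10:2 wart: slice syntax is only legal
     inside square brackets, so a reusable slice has to be spelled out (a small
     illustration, not from the slides):

        import numpy as np

        a = np.arange(20)
        print(a[1:10:2])        # slice syntax works inside [ ]
        # s = 1:10:2            # SyntaxError outside [ ]
        s = slice(1, 10, 2)     # the only way to name that slice
        print(a[s])
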
  29. What is good about NumPy?

     • Array-oriented
     • Extensive dtype system (including structures)
     • C-API
     • Simple-to-understand data structure
     • Memory mapping
     • Syntax support from Python
     • Large community of users
     • Broadcasting
     • Easy to interface C/C++/Fortran code
     • PEP 3118

  30. What is wrong with NumPy

     • The dtype system is difficult to extend
     • Immediate mode creates huge temporaries (spawning NumExpr; see the sketch
       below)
     • "Almost" an in-memory database comparable to SQLite (missing indexes)
     • Integration with sparse arrays
     • Lots of un-optimized parts
     • Minimal support for multi-core / GPU
     • The code base is organic and hard to extend (much of it inherited from
       Numeric)

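     A minimal sketch of the temporaries point (my own illustration): NumPy's
     immediate mode allocates an intermediate array for every sub-expression,
     while NumExpr fuses the whole expression into one pass over the data:

        import numpy as np
        import numexpr as ne

        a = np.random.rand(1000000)
        b = np.random.rand(1000000)

        r1 = 2*a + 3*b                 # allocates temporaries for 2*a and 3*b
        r2 = ne.evaluate("2*a + 3*b")  # single fused loop, no large temporaries
        assert np.allclose(r1, r2)
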
  31. The most important part

     PEP 3118: revising the buffer protocol. Basically the "structure" of NumPy
     arrays as a protocol in Python itself, establishing a memory-sharing
     standard between objects. It makes possible a heterogeneous world of
     powerful array-like objects outside of NumPy that can still communicate.

     It falls short in not defining a general data-description language (DDL).

     http://python.org/dev/peps/pep-3118/

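     A minimal sketch of the protocol at work (my own illustration): memoryview
     sees a NumPy array's shape, strides, and element format without copying,
     and the round-trip shares memory:

        import numpy as np

        a = np.arange(12, dtype=np.int32).reshape(3, 4)
        m = memoryview(a)                    # PEP 3118 view of the array
        print(m.shape, m.strides, m.format)  # (3, 4) (16, 4) 'i'

        b = np.asarray(m)                    # zero-copy back into NumPy
        b[0, 0] = 99
        print(a[0, 0])                       # 99: same memory
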
  32. A dual of encapsulation?

     • The buffer protocol and data-types provide a different view on
       encapsulation (the dual of typical encapsulation approaches, for those
       initiated to linear algebra).
     • Rather than attaching methods to data and hiding the data, you describe
       the data completely and declaratively. You can then later apply arbitrary
       code to this data at run time.
     • It makes it much easier to move code to data at all levels.

  33. What of the future?

     I've watched most of the episodes, so there's that… I will just describe
     what I would like to see.

  34. Fundamentally heterogeneous

     • The future of Python data analytics will be heterogeneous.
     • There will be many projects: Biggus, DistArray, SciDB, Elemental, Spartan.
     • Not counting integration with the other projects: Spark, Impala, Disco.

  35. What about Continuum?

     After watching NumPy and SciPy get used all over the world --- what would we
     do differently?

     Blaze, Numba, Conda (Anaconda). All open source!

  36. Conda and Anaconda

     • Cross-platform package management
     • Multiple environments allow you to have multiple versions of packages
       installed on one system
     • Easy app deployment
     • Taming open source
     • Users love it! Free for all users; enterprise support available!

  37. Blaze and Numba

     Blaze is motivated by generalizing PEP 3118 to all languages and data-sets —
     Python Glue 2.0.

     Numba is motivated by a desire for Python NumPy array-like code to reach the
     speeds of Fortran!

  38. Data pain

     Dealing with data applications has numerous pain points:
     - Hundreds of data formats
     - Basic programs expect all data to fit in memory
     - Data analysis pipelines constantly change data from one form to another
     - Sharing analyses carries significant overhead to configure systems
     - Parallelizing analysis requires an expert in a particular
       distributed-computing stack

  39. Blaze architecture

     (Diagram: a Deferred Expr layer sits over Compilers and Interpreters, which
     sit over the Data and Compute APIs.)

     • Flexible architecture to accommodate exploration
     • Uses compilation of deferred expressions to optimize data interactions

  40. Blaze Data

     • Single interface for data layers
     • Composition of different formats
     • Simple API to add custom data formats

     (Diagram: the Data layer fronting SQL, CSV, HDFS, JSON, Mem, HDF5, and
     Custom sources.)

  41. Blaze Compute

     • Computation abstraction over numerous data libraries (DyND, Pandas,
       PyTables, Spark)
     • Simple multi-dispatched visitors to implement new backends
     • Allows plumbing between stacks to be seamless to the user

  42. Deferred Expr

     (Diagram: a Blaze Expr DAG over temps.hdf5, nasdaq.sql, and tweets.json —
     Join by date → Select NYC → Find Tech Selloff → Plot.)

     • Lazy computation to minimize data movement
     • A simple DAG for compilation to:
       • parallel application
       • distributed memory
       • static optimizations

  43. Blaze example — counting weblinks

     Common Blaze code:

        # Expr
        t_idx = TableSymbol('{name: string, node_id: int32}')
        t_arc = TableSymbol('{node_out: int32, node_id: int32}')
        joined = Join(t_arc, t_idx, "node_id")
        t = By(joined, joined['name'], joined['node_id'].count())

        # Data load
        idx, arc = load_data()

        # Computations
        ans = compute(t, {t_arc: arc, t_idx: idx})
        in_deg = dict(ans)
        in_deg[u'blogspot.com']

  44. Blaze example — counting weblinks

     load_data using Spark + HDFS:

        sc = SparkContext("local", "Simple App")
        idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
        idx = idx.map(lambda x: x.split('\t'))\
                 .map(lambda x: [x[0], int(x[1])])
        arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
        arc = arc.map(lambda x: x.split('\t'))\
                 .map(lambda x: [int(x[0]), int(x[1])])

     Using Pandas + local disk:

        with open("example_index.txt") as f:
            idx = [ln.strip().split('\t') for ln in f.readlines()]
        idx = DataFrame(idx, columns=['name', 'node_id'])

        with open("example_arcs.txt") as f:
            arc = [ln.strip().split('\t') for ln in f.readlines()]
        arc = DataFrame(arc, columns=['node_out', 'node_id'])

  45. Blaze ecosystem

     • dynd — next-generation NumPy
     • libdynd — C++ library for multi-dimensional arrays
     • datashape — general data-description language (what PEP 3118 was missing)
     • Blaze
       - Data (adapt data from many different silos)
       - Compute (interpreters to run expressions on backends)
       - Expr (symbolic expressions)
       - Interfaces (focusing on tables)
     • BLZ — experimental storage format (unsupported)

  46. Code that users might write

     $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j \qquad O = I \star F$

     Slow!!!!

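     A minimal sketch (my reconstruction, array shapes assumed) of the first
     formula written as the plain Python loops a user might write — exactly the
     style that is slow without a compiler:

        import numpy as np

        def x_loop(k, a):
            # k is assumed to be an (n, n) array; a has length n.
            n = len(a)
            x = np.zeros(n)
            for i in range(n):
                for j in range(i):
                    x[i] += k[i - j, j] * a[i - j] * a[j]
            return x
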
  47. Face of a modern compiler

     (Diagram: front ends — parsing — for C++, C, Fortran, and ObjC produce a
     common Intermediate Representation (IR); back ends — code generation —
     target x86, ARM, and PTX.)

  48. Face of a modern compiler

     (Diagram: Numba is the front end that parses Python into LLVM's Intermediate
     Representation (IR); LLVM's back ends generate code for x86, ARM, and PTX.)

  49. NumPy + Mamba = Numba

     (Diagram: a Python function flows through LLVMPY into the LLVM library —
     alongside CLANG, OpenCL, ISPC, CUDA, and OpenMP, with vendor backing from
     Intel, Nvidia, Apple, and AMD — and out to machine code for x86 and ARM.)

  50. Image processing

        from numba import jit

        @jit('void(f8[:,:],f8[:,:],f8[:,:])')
        def filter(image, filt, output):
            M, N = image.shape
            m, n = filt.shape
            for i in range(m//2, M-m//2):
                for j in range(n//2, N-n//2):
                    result = 0.0
                    for k in range(m):
                        for l in range(n):
                            result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                    output[i, j] = result

     ~1500x speed-up

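     A hypothetical driver (array sizes are my choice, not from the slide)
     showing how the compiled filter would be called; the first call triggers
     the JIT compile:

        import numpy as np

        image = np.random.rand(512, 512)
        filt = np.random.rand(15, 15)
        output = np.zeros_like(image)

        filter(image, filt, output)   # the loop nest now runs at C-like speed
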
  51. Case study — j0 from scipy.special

     • scipy.special was one of the first libraries I wrote.
     • It extended the "umath" module by adding new "universal functions" that
       compute many scientific functions by wrapping C and Fortran libraries.
     • Bessel functions are solutions to a differential equation:

       $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0,
       \qquad y = J_\alpha(x)$

       $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$

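     The next slide times a vectorized vj0 whose source is not in the transcript;
     a hedged sketch of how such a ufunc could be built with numba.vectorize,
     evaluating the slide's integral for n = 0 with a simple trapezoid rule:

        import math
        from numba import vectorize

        @vectorize(['float64(float64)'])
        def vj0(x):
            # J_0(x) = (1/pi) * integral over [0, pi] of cos(x * sin(tau)) d tau
            n = 1000
            h = math.pi / n
            s = 0.5 * (math.cos(0.0) + math.cos(x * math.sin(math.pi)))
            for k in range(1, n):
                s += math.cos(x * math.sin(k * h))
            return s * h / math.pi
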
  52. Result --- equivalent to compiled code

        In [6]: %timeit vj0(x)
        10000 loops, best of 3: 75 us per loop

        In [7]: from scipy.special import j0

        In [8]: %timeit j0(x)
        10000 loops, best of 3: 75.3 us per loop

     But! Now the code is in Python and can be experimented with more easily (and
     moved to the GPU / an accelerator more easily)!

  53. Numba can change the game!

     (Diagram: Python joins C++, C, and Fortran as a front end to LLVM IR, with
     back ends for x86, ARM, and PTX.)

     Numba turns Python into a "compiled language" (but a much more flexible
     one). You don't have to reach for C/C++.

  54. CUDA-Python

        import numpy as np
        from numba import cuda
        from numba import autojit

        @autojit(target='gpu')
        def array_scale(src, dst, scale):
            tid = cuda.threadIdx.x
            blkid = cuda.blockIdx.x
            blkdim = cuda.blockDim.x

            i = tid + blkid * blkdim

            if i >= n:
                return

            dst[i] = src[i] * scale

        # n, N, grid, and block are defined elsewhere on the slide
        src = np.arange(N, dtype=np.float)
        dst = np.empty_like(src)

        array_scale[grid, block](src, dst, 5.0)

     CUDA development using Python syntax for optimal performance!

  55. Example: Black-Scholes

        # assumes: import math; from numba import cuda, double;
        # input arrays and constants are defined elsewhere on the slide
        @cuda.jit(argtypes=(double[:], double[:], double[:], double[:],
                            double[:], double, double))
        def black_scholes_cuda(callResult, putResult, S, X, T, R, V):
            # S = stockPrice, X = optionStrike, T = optionYears
            # R = Riskfree,   V = Volatility
            i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
            if i >= S.shape[0]:
                return
            sqrtT = math.sqrt(T[i])
            d1 = (math.log(S[i] / X[i]) +
                  (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
            d2 = d1 - V * sqrtT
            cndd1 = cnd_cuda(d1)
            cndd2 = cnd_cuda(d2)
            expRT = math.exp((-1. * R) * T[i])
            callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)
            putResult[i] = (X[i] * expRT * (1.0 - cndd2) -
                            S[i] * (1.0 - cndd1))

        @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
        def cnd_cuda(d):
            A1 = 0.31938153
            A2 = -0.356563782
            A3 = 1.781477937
            A4 = -1.821255978
            A5 = 1.330274429
            RSQRT2PI = 0.39894228040143267793994605993438
            K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
            ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
                       (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
            if d > 0:
                ret_val = 1.0 - ret_val
            return ret_val

        blockdim = 1024, 1
        griddim = int(math.ceil(float(OPT_N) / blockdim[0])), 1
        stream = cuda.stream()
        d_callResult = cuda.to_device(callResultNumbapro, stream)
        d_putResult = cuda.to_device(putResultNumbapro, stream)
        d_stockPrice = cuda.to_device(stockPrice, stream)
        d_optionStrike = cuda.to_device(optionStrike, stream)
        d_optionYears = cuda.to_device(optionYears, stream)
        for i in range(iterations):
            black_scholes_cuda[griddim, blockdim, stream](
                d_callResult, d_putResult, d_stockPrice,
                d_optionStrike, d_optionYears, RISKFREE, VOLATILITY)
            d_callResult.to_host(stream)
            d_putResult.to_host(stream)
        stream.synchronize()

  56. Black-Scholes: results

     Core i7 vs. GeForce GTX 560 Ti: about 9x faster on this GPU, roughly the
     same speed as CUDA-C.

  57. Summary

     • Python has had a long and fruitful history in data analytics.
     • It will have a long and bright future with your help!
     • Join the PyData community and make the world a better place!

  58. Dedication

     Nothing I have done or am would be, were it not for your patience, love,
     and support! Thank you, Amy Pennock Oliphant!

  59. Community commercials

     • Donate to NumFOCUS: http://numfocus.org
     • Become a PSF member: https://www.python.org/psf/
     • Attend PyData Berlin and present at future PyData events: http://pydata.org