EuroPython Keynote July 25, 2014

Python and Big Data Analytics: Past, Present, and Future

Travis E. Oliphant

July 25, 2014

Transcript

  1. Python in Big Data
    Analytics: Past, Present, and
    Future
    EuroPython, July 25, 2014
    Travis E. Oliphant

  2. My Roots

  3. My Roots
Images from BYU MERS Lab

  4. Science led to Python
    Raja Muthupillai
    Armando Manduca
    Richard Ehman
    1997
    $\rho_0\,(2\pi f)^2\, U_i(a, f) = \left[\,C_{ijkl}(a, f)\, U_{k,l}(a, f)\,\right]_{,j}$

  5. Finding derivatives of 5-d data
    $\Xi = \nabla \times U$

  6. Scientist at heart

  7. Python origins.
    Version Date
    0.9.0 Feb. 1991
    0.9.4 Dec. 1991
    0.9.6 Apr. 1992
    0.9.8 Jan. 1993
    1.0.0 Jan. 1994
    1.2 Apr. 1995
    1.4 Oct. 1996
    1.5.2 Apr. 1999
    http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html

  8. First problem: Efficient Data Input
    “It’s Always About the Data”
    Reference Counting Essay (May 1998), Guido van Rossum
    http://www.python.org/doc/essays/refcnt/
    TableIO (April 1998), Michael A. Miller
    NumPyIO (June 1998)

  9. Early pieces of SciPy
    cephesmodule (June 1998)
    fftw wrappers (November 1998)
    stats.py (December 1998), Gary Strangman

  10. 1999 : Early SciPy emerges
    Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis
    environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
    and others. Activity in 1998 led to increased interest in 1999.

    In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be
    present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
    be creating this uber-package, which eventually became SciPy.

    Gaussian quadrature                          5 Jan 1999
    cephes 1.0                                   30 Jan 1999
    sigtools 0.40                                23 Feb 1999
    Numeric docs                                 March 1999
    cephes 1.1                                   9 Mar 1999
    multipack 0.3                                13 Apr 1999
    Helper routines                              14 Apr 1999
    multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
    sparse plan described                        30 May 1999
    multipack 0.7                                14 Jun 1999
    SparsePy 0.1                                 5 Nov 1999
    cephes 1.2 (vectorize)                       29 Dec 1999

    Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py.

  11. Using Numeric circa 2000
    Image from my PhD thesis, made using a Python interface to DISLIN and using
    Numeric + hand-written C-extensions for inverting the wave equation.

  12. SciPy 2001
    Founded in 2001 with Travis Vaught
    Eric Jones: weave, cluster, GA*
    Pearu Peterson: linalg, interpolate, f2py
    Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc

  13. Brief History
    Person                                      Package                   Year
    Jim Fulton                                  Matrix Object in Python   1994
    Jim Hugunin                                 Numeric                   1995
    Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
    Travis Oliphant                             NumPy                     2005

  14. Now an impressive community effort
    • Chuck Harris
    • Pauli Virtanen
    • Robert Kern
    • Warren Weckesser
    • Ralf Gommers
    • Mark Wiebe
    • Nathaniel Smith
    • Nathan Bell
    • Stefan van der Walt
    • Matthew Brett
    • Josef Perktold …

  15. Over 3,000,000 users of NumPy!

  16. Keys to Success
    • Hard work — especially up front

    • Often lonely — initially nobody believes in your idea
    more than you do. Others need some “proof” before
    they join you.

    • The more complicated what you are doing is, the
    lonelier it will be initially.

    • Examples:

    • I procrastinated my PhD at least 1 year to create the
    beginnings of SciPy (don’t tell my wife).

    • Pearu Peterson put in tremendous work to create f2py and
    scipy.linalg

    • I spent 18 months not publishing papers to write NumPy
    (despite many people telling me it was foolish).

  17. Keys to Success
    • Do what is “right”

    • Timing is everything (sometimes
    you are the right person for the
    job)

    • Having an urgency (it won’t wait)

    • Striving for excellence
    Give the best you have…and it will never be
    enough. Give your best anyway.

    — Mother Teresa

  18. Keys to Success
    • Build a community

    • You will need help to achieve your goals.

    • This means other people. This will require sacrificing
    some of your ego to really listen!

    • Someone will point out how you suck (listen to them,
    you probably do).

    • Nurture empathy.

    • Treat other people like they matter to you — the only
    successful way to do that is to actually care. This
    exposes you to being hurt — care about people anyway!

    • Much more could be said on this topic…

  19. Keys to Success
    • Patience (and some luck)

    • Good things take time

    • The right factors have to
    come together — some
    you can influence and some
    you can’t.

  20. So what is this NumPy thing?

  21. NumPy: an Array-Oriented Extension
    • Data: the array object
    – slicing and shaping
    – data-type: a map to bytes
    • Fast Math:
    – vectorization
    – broadcasting
    – aggregations

  22. NumPy Examples
    (live code: constructing 2-d and 3-d arrays and computing aggregations;
    outputs captured from the screenshots: [439 472 477], [217 205 261 222 245 238],
    9.98330639789, 2.96677717122)
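    A minimal sketch of the kind of examples shown (the data here is random, so
    the printed values will not match the slide):

    import numpy as np

    a = np.random.randint(0, 200, size=(5, 3))    # 2d array
    b = np.random.rand(2, 3, 4)                   # 3d array

    print(a.sum(axis=0))   # one aggregate per column
    print(a.sum(axis=1))   # one aggregate per row
    print(b.mean())        # full reduction to a scalar
    print(b.std())         # another scalar aggregation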

  23. NumPy Array
    (diagram of the NumPy array object and its shape)

  24. Zen of NumPy (à la import this)
    • strided is better than scattered
    • contiguous is better than strided
    • descriptive is better than imperative
    • array-oriented is better than object-oriented
    • broadcasting is a great idea
    • vectorized is better than an explicit loop
    • unless it’s too complicated --- then use Cython/Numba
    • think in higher dimensions
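    Two of these aphorisms, vectorization and broadcasting, made concrete in a
    small illustrative sketch (not from the slides):

    import numpy as np

    x = np.linspace(0.0, 1.0, 1000000)

    # explicit loop: one Python-level operation per element (slow)
    y_loop = np.empty_like(x)
    for i in range(x.size):
        y_loop[i] = 3.0 * x[i] ** 2 + 1.0

    # vectorized: the same computation as whole-array expressions (fast)
    y_vec = 3.0 * x ** 2 + 1.0

    # broadcasting: a (3, 1) column against a (4,) row yields a (3, 4) grid
    grid = np.arange(3).reshape(3, 1) * 10 + np.arange(4)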

  25. Array-Oriented Computing
    Example 1: Fibonacci Numbers
    $f_n = f_{n-1} + f_{n-2}, \qquad f_0 = 0, \quad f_1 = 1$
    $f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, \ldots$

  26. Common Python approaches
    Recursive vs. iterative
    Algorithm matters!!
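    The two versions appear on the slide as screenshots; minimal reconstructions
    of the standard approaches:

    def fib_recursive(n):
        # exponential time: recomputes the same subproblems over and over
        if n < 2:
            return n
        return fib_recursive(n - 1) + fib_recursive(n - 2)

    def fib_iterative(n):
        # linear time: carry the last two values forward
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a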

  27. Array-oriented approaches
    Using lfilter
    Using a formula
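    Also shown as screenshots; hedged sketches of what these plausibly look like
    (the lfilter version treats the recurrence as an IIR filter driven by a unit
    impulse):

    import numpy as np
    from scipy.signal import lfilter

    def fib_lfilter(n):
        # y[k] = x[k] + y[k-1] + y[k-2]  ->  b = [1], a = [1, -1, -1]
        x = np.zeros(n)
        x[0] = 1.0
        return lfilter([1.0], [1.0, -1.0, -1.0], x)   # 1, 1, 2, 3, 5, ...

    def fib_formula(n):
        # Binet's closed form, evaluated for all n at once
        k = np.arange(n)
        phi = (1 + np.sqrt(5)) / 2
        return np.round((phi ** k - (-phi) ** (-k)) / np.sqrt(5))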

  28. Array-oriented approaches

  29. APL : the first array-oriented language
    • Appeared in 1964
    • Originated by Ken Iverson
    • Direct descendants (J, K, Matlab) are still used
    heavily, and people pay a lot of money for them
    • NumPy is a descendant
    (lineage diagram: APL → J, K, Q, Matlab, Numeric → NumPy)

  30. Memory using Object-oriented
    (diagram: six Object boxes, each carrying its own Attr1, Attr2, and Attr3;
    attributes are scattered object-by-object across memory)

  31. Array-oriented (Table) approach
    (diagram: a table with columns Attr1, Attr2, Attr3 and rows Object1 through
    Object6; each attribute is stored as one contiguous column)
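    A hedged sketch of the contrast in NumPy terms (field names illustrative):

    import numpy as np

    # array-of-structures: one record per object (the object-oriented layout)
    aos = np.zeros(6, dtype=[('attr1', 'f8'), ('attr2', 'f8'), ('attr3', 'f8')])

    # structure-of-arrays: one contiguous column per attribute (the table layout)
    soa = {name: np.zeros(6) for name in ('attr1', 'attr2', 'attr3')}

    aos['attr1']   # a strided view across all records
    soa['attr1']   # fully contiguous: cache- and SIMD-friendly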

  32. Why Array-oriented...
    • Today’s vector machines (and vector co-processors,
    or GPUs) were made for array-oriented computing.

    • The software stack has just not caught up ---
    unfortunate because APL came out in 1963.

    • There is a reason Fortran remains popular.

  33. Benefits of Array-oriented
    • Many technical problems (advanced analytics)
    are naturally array-oriented (easy to vectorize)

    • Algorithms can be expressed at a high-level

    • These algorithms can be parallelized more
    simply (quite often much information is lost in
    the translation to typical “compiled” languages)

    • Array-oriented algorithms map to modern
    hardware caches and pipelines.

  34. Complete Example
    https://www.wakari.io/sharing/bundle/travis/CircleMask

  35. NumPy in Data Analytics
    • NumPy is a decent array object with some user-
    friendly features.

    • NumPy “arrays of structures” can be used to
    handle arbitrary data.
    http://people.rit.edu/blbgse/pythonNotes/numpy.html
    First Name   Last Name   Score
    Dave         Thomas      89.4
    Tasha        Hen         76.6
    Cool         Python      100
    Stack        Overflow    95.32
    Py           Py          75
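    A hedged sketch of that table as a NumPy “array of structures” (the dtype
    field names are illustrative):

    import numpy as np

    scores = np.array(
        [('Dave', 'Thomas', 89.4), ('Tasha', 'Hen', 76.6),
         ('Cool', 'Python', 100.0), ('Stack', 'Overflow', 95.32),
         ('Py', 'Py', 75.0)],
        dtype=[('first', 'U10'), ('last', 'U10'), ('score', 'f8')])

    scores['score'].mean()          # aggregate a single field
    scores[scores['score'] > 80]    # boolean selection of whole records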

  36. Pandas is “Structure of Arrays”
    • Labels on the dimensions (indexes)
    • Easy manipulation of new columns
    • Missing Value handling
    • Time Series handling
    • General Split-Apply-Combine
    • Merge and Join
    • Integrated Plotting
    • Chained method calls the norm
    • Familiar to R users — more user-friendly features!
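    A brief hedged illustration of that style (the data is invented):

    import pandas as pd

    df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Berlin', 'Paris'],
                       'sales': [10.0, 12.5, 11.0, None]},
                      index=pd.date_range('2014-07-01', periods=4))

    df['sales'] = df['sales'].fillna(0.0)        # missing-value handling
    by_city = df.groupby('city')['sales'].sum()  # split-apply-combine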

  37. Current Key Libraries
    • NumPy

    • SciPy

    • Pandas

    • Matplotlib

    • IPython (with notebook)

    • PyTables (HDF5)

    • Scikit learn

    • Statsmodels (Patsy)

    • SymPy

    • Cython (Numba)

    • NumExpr

  38. Tools used for Data
    Source: O’Reilly Strata attendee survey 2012 and 2013

  39. Python for Data Science
    http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science

  40. Python is the top language in schools!

  41. Why Python for Technical Computing
    • Syntax (it gets out of your way)

    • White space preserves “visual real-estate”

    • Over-loadable operators

    • Complex numbers built-in early

    • Just enough language support for arrays

    • “Occasional” programmers can grok it

    • Packaging with conda is awesome!

  42. What is great about Python
    • Supports multiple programming styles (functional,
    object-oriented, scripts, etc.)

    • Experienced programmers can also use it
    effectively (classes, meta-programming techniques)

    • Has a simple, extensible implementation (so that
    C-extensions can exist)

    • General-purpose language --- can build a system

    • Critical mass!

    • Allows for community

  43. What is wrong with Python?
    • Missing anonymous blocks

    • Some syntax warts (1:10:2 outside [ ] please)

    • The CPython run-time is aged and needs an overhaul
    (GIL, global variables, lack of dynamic compilation
    support)

    • No approach to language extension except for abuse
    of meta-programming and “import
    hooks” (lightweight DSL need)

    • The distraction of multiple run-times...

    • Array-oriented and NumPy not really understood by
    many Python devs (but thank you for ‘@’)

  44. What is good about NumPy?
    • Array-oriented

    • Extensive Dtype System (including structures)

    • C-API

    • Simple to understand data-structure

    • Memory mapping

    • Syntax support from Python

    • Large community of users

    • Broadcasting

    • Easy to interface C/C++/Fortran code

    • PEP 3118
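    Two of these features, memory mapping and broadcasting, in a short hedged
    sketch (the file name is illustrative):

    import numpy as np

    # memory mapping: work with an on-disk array without reading it into RAM
    data = np.memmap('samples.bin', dtype='f8', mode='w+', shape=(1000, 3))

    # broadcasting: subtract a per-column mean with no explicit loop
    data -= data.mean(axis=0)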

  45. What is wrong with NumPy
    • Dtype system is difficult to extend

    • Immediate mode creates huge temporaries
    (spawning Numexpr)

    • “Almost” an in-memory database comparable
    to SQLite (missing indexes)

    • Integration with sparse arrays

    • Lots of un-optimized parts

    • Minimal support for multi-core / GPU

    • Code-base is organic and hard to extend
    (already inherited from Numeric)

  46. The most important part
    PEP 3118: Revising the buffer protocol
    Basically the “structure” of NumPy arrays
    as a protocol in Python itself to establish a
    memory-sharing standard between objects.
    It makes possible a heterogeneous world of powerful
    array-like objects outside of NumPy that can
    communicate with each other.

    Falls short in not defining a general data-description language (DDL).
    http://python.org/dev/peps/pep-3118/
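    A minimal illustration of the protocol using only the standard library and
    NumPy:

    import array
    import numpy as np

    buf = array.array('d', [1.0, 2.0, 3.0])

    m = memoryview(buf)                 # a PEP 3118 view: format 'd', shape (3,)
    a = np.frombuffer(buf, dtype='d')   # NumPy consumes the same buffer, zero-copy

    a[0] = 42.0
    print(buf[0])                       # 42.0: both names share one memory block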

  47. A Dual of Encapsulation?
    • The buffer protocol and data-types provide a different view on
    encapsulation (the dual of typical encapsulation approaches, for those
    initiated to linear algebra)

    • Rather than attach methods to data and hide the data, you describe data
    completely and declaratively. You can then later apply arbitrary code to
    this data at run-time.

    • It makes it much easier to move code to data at all levels.

  48. What of the future?
    I’ve watched most of the
    episodes, so there’s that…
    I will just describe what
    I would like to see…

  49. “Data Has Mass”
    http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/

  50. Uh...
    http://datagravity.org/2012/06/26/a-formula-for-data-gravity/

  51. Workflow Perspective vs. Data-centric Perspective
  52. Fundamentally Heterogeneous
    • The future of Python data analytics will be heterogeneous

    • There will be many projects

    - Biggus

    - DistArray

    - SciDB

    - Elemental

    - Spartan

    • Not counting integration with the other projects

    - Spark

    - Impala

    - Disco

  53. What about Continuum?
    After watching NumPy and SciPy get used all over the
    world --- what would we do differently?
    Blaze

    Numba

    Conda (Anaconda)
    All Open Source!

  54. Conda and Anaconda
    • Cross-platform package management

    • Multiple environments allow you to have multiple
    versions of packages installed on a system

    • Easy app-deployment

    • Taming open-source

    • Users love it!
    Free for all users; enterprise support available!

  55. Blaze and Numba
    Blaze is motivated by generalizing PEP 3118 to all languages and
    data-sets — Python Glue 2.0.
    Numba is motivated by a desire for Python NumPy array-like code to
    reach the speeds of Fortran!

  56. from data to code, seamlessly
    Blaze

  57. Data Pain
    • Dealing with data applications has numerous pain points:

    - Hundreds of data formats

    - Basic programs expect all data to fit in memory

    - Data analysis pipelines constantly changing from one form to another

    - Sharing analysis carries significant overhead to configure systems

    - Parallelizing analysis requires expertise in a particular distributed
    computing stack

  58. Blaze Architecture
    (diagram: Deferred Expr feeding Compilers and Interpreters, layered over
    Data, Compute, and API)

    • Flexible architecture to accommodate exploration

    • Use compilation of deferred expressions to optimize data interactions

  59. Blaze Data
    • Single interface for data layers

    • Composition of different formats

    • Simple API to add custom data formats

    (diagram: a Data hub connecting SQL, CSV, HDFS, JSON, HDF5, Mem, and
    Custom formats)

  60. Blaze Compute
    (diagram: Compute dispatching to DyND, Pandas, PyTables, and Spark)

    • Computation abstraction over numerous data libraries

    • Simple multi-dispatched visitors to implement new backends

    • Allows plumbing between stacks to be seamless to the user

  61. Blaze Expr
    (diagram: a deferred-expression DAG in which temps.hdf5, nasdaq.sql, and
    tweets.json feed Join by date, Select NYC, Find Tech Selloff, and Plot)

    • Lazy computation to minimize data movement

    • Simple DAG for compilation to

    • parallel application

    • distributed memory

    • static optimizations

  62. Blaze Example - Counting Weblinks
    Common Blaze Code
    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())

    # Data Load
    idx, arc = load_data()

    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})

    in_deg = dict(ans)
    in_deg[u'blogspot.com']

  63. Blaze Example - Counting Weblinks
    Using Spark + HDFS (load_data):
    sc = SparkContext("local", "Simple App")
    idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
    idx = idx.map(lambda x: x.split('\t')) \
             .map(lambda x: [x[0], int(x[1])])
    arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
    arc = arc.map(lambda x: x.split('\t')) \
             .map(lambda x: [int(x[0]), int(x[1])])

    Using Pandas + Local Disk (load_data):
    with open("example_index.txt") as f:
        idx = [ln.strip().split('\t') for ln in f.readlines()]
    idx = DataFrame(idx, columns=['name', 'node_id'])

    with open("example_arcs.txt") as f:
        arc = [ln.strip().split('\t') for ln in f.readlines()]
    arc = DataFrame(arc, columns=['node_out', 'node_id'])

  64. Blaze Ecosystem
    • dynd — next-generation NumPy

    • libdynd — C++ library for multi-dimensional arrays

    • datashape — general data-description language (what
    PEP 3118 was missing)

    • Blaze

    - Data (adapt data from many different silos)

    - Compute (interpreters to run expressions on backends)

    - Expr (Symbolic expressions)

    - Interfaces (Focusing on Tables)

    • BLZ — experimental storage format (unsupported)

  65. Numba

    CPython compatible JIT compiler

  66. Code that users might write
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j \qquad O = I \star F$
    Slow!!!!

  67. Face of a modern compiler
    (diagram) Front-End (Parsing: C, C++, Fortran, ObjC) → Intermediate
    Representation (IR) → Back-End (Code Generation: x86, ARM, PTX)

  68. Face of a modern compiler
    (diagram) Front-End (Parsing: Python, via Numba) → Intermediate
    Representation (IR) → Back-End (Code Generation via LLVM: x86, ARM, PTX)

  69. Example
    Numba
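    The example itself is a screenshot; a minimal hedged reconstruction of a
    typical first Numba example:

    import numpy as np
    from numba import jit

    @jit
    def sum2d(arr):
        # plain nested loops, compiled to machine code on first call
        M, N = arr.shape
        total = 0.0
        for i in range(M):
            for j in range(N):
                total += arr[i, j]
        return total

    print(sum2d(np.ones((100, 100))))   # 10000.0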

  70. NumPy + Mamba = Numba
    (diagram: the LLVM library, relied on by Intel, Nvidia, Apple, and AMD
    through projects such as CLANG, CUDA, OpenCL, ISPC, and OpenMP; llvmpy
    bridges a Python function to machine code for x86 and ARM)

  71. Speeding up Math Expressions
    $x_i = \sum_{j=0}^{i-1} k_{i-j,\,j}\, a_{i-j}\, a_j$
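    The accompanying code is a screenshot; a hedged sketch of a Numba-compiled
    version of this expression (array shapes assumed):

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def expr(k, a):
        # x_i = sum_{j=0}^{i-1} k[i-j, j] * a[i-j] * a[j]
        n = a.shape[0]
        x = np.zeros(n)
        for i in range(n):
            acc = 0.0
            for j in range(i):
                acc += k[i - j, j] * a[i - j] * a[j]
            x[i] = acc
        return x

    x = expr(np.random.rand(1000, 1000), np.random.rand(1000))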

  72. Image Processing
    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                output[i, j] = result
    ~1500x speed-up
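    A hedged usage sketch for the kernel above (image and filter invented):

    import numpy as np

    image = np.random.rand(512, 512)
    filt = np.ones((9, 9)) / 81.0       # simple box blur
    output = np.zeros_like(image)
    filter(image, filt, output)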

  73. Case-study -- j0 from scipy.special
    • scipy.special was one of the first libraries I wrote

    • extended “umath” module by adding new
    “universal functions” to compute many scientific
    functions by wrapping C and Fortran libs.

    • Bessel functions are solutions to a differential
    equation:
    $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0,
    \qquad y = J_\alpha(x)$

    $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$

  74. scipy.special.j0 wraps cephes algorithm
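    The wrapped source appears as a screenshot; a hedged sketch of the idea, a
    ufunc-like j0 written in pure Python with Numba's @vectorize, here using a
    truncated ascending series rather than cephes' full rational approximations:

    import numpy as np
    from numba import vectorize

    @vectorize(['float64(float64)'])
    def vj0(x):
        # J0(x) = sum_m (-1)^m (x^2/4)^m / (m!)^2, truncated (fine for small x)
        z = 0.25 * x * x
        term, total = 1.0, 1.0
        for m in range(1, 12):
            term *= -z / (m * m)
            total += term
        return total

    x = np.linspace(0.0, 5.0, 10000)
    y = vj0(x)   # behaves like a NumPy ufunc, broadcasting and all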

  75. Result --- equivalent to compiled code
    In [6]: %timeit vj0(x)
    10000 loops, best of 3: 75 us per loop

    In [7]: from scipy.special import j0

    In [8]: %timeit j0(x)
    10000 loops, best of 3: 75.3 us per loop

    But! Now the code is in Python and can be experimented with more easily
    (and moved to the GPU / accelerator more easily)!

  76. Numba can change the game!
    (diagram: C, C++, Fortran, and now Python via Numba all lower to LLVM IR,
    then to x86, ARM, or PTX back-ends)
    Numba turns Python into a “compiled language” (but much more flexible).
    You don’t have to reach for C/C++.

  77. CUDA-Python
    from numba import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x

        i = tid + blkid * blkdim

        if i >= n:   # n: the array length, defined on the host (elided on the slide)
            return

        dst[i] = src[i] * scale

    src = np.arange(N, dtype=np.float)
    dst = np.empty_like(src)

    array_scale[grid, block](src, dst, 5.0)

    CUDA development using Python syntax for optimal performance!

  78. Example: Black-Scholes
    @cuda.jit(argtypes=(double[:], double[:], double[:], double[:],
                        double[:], double, double))
    def black_scholes_cuda(callResult, putResult, S, X, T, R, V):
        # S = stockPrice, X = optionStrike, T = optionYears
        # R = Riskfree, V = Volatility
        i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        if i >= S.shape[0]:
            return
        sqrtT = math.sqrt(T[i])
        d1 = (math.log(S[i] / X[i]) +
              (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
        d2 = d1 - V * sqrtT
        cndd1 = cnd_cuda(d1)
        cndd2 = cnd_cuda(d2)

        expRT = math.exp((-1. * R) * T[i])
        callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)
        putResult[i] = (X[i] * expRT * (1.0 - cndd2) -
                        S[i] * (1.0 - cndd1))

    @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
    def cnd_cuda(d):
        A1 = 0.31938153
        A2 = -0.356563782
        A3 = 1.781477937
        A4 = -1.821255978
        A5 = 1.330274429
        RSQRT2PI = 0.39894228040143267793994605993438
        K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
        ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
                   (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
        if d > 0:
            ret_val = 1.0 - ret_val
        return ret_val

    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N) / blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()

  79. Black-Scholes: Results
    (benchmark chart: Core i7 CPU vs. GeForce GTX 560 Ti; about 9x faster on
    this GPU, ~same speed as CUDA-C)

  80. Summary
    • Python has had a long and fruitful history in Data
    Analytics

    • It will have a long and bright future with your help!

    • Join the PyData Community and make the world a
    better place!

  81. Dedication
    Nothing I have done or am would be were it not for your patience, love,
    and support!

    Thank you!
    Amy Pennock Oliphant

  82. Donate to NumFOCUS: http://numfocus.org
    Become a PSF member: https://www.python.org/psf/
    Attend PyData Berlin and present at future PyData events: http://pydata.org
    Community Commercials
