Numba and Dask at the PyData Foundations Workshop

An overview of the PyData workshop held at IBM's INDEX conference plus information about Conda, Numba, Dask

Travis E. Oliphant

February 20, 2018

Transcript

  1. Scaling PyData
    PyData Workshop at IBM Index Conference
    Travis E. Oliphant
    Founder, Quansight
    Credit to Anaconda, Inc. for some of these slides


  2. PyData Foundations Workshop
    Jake VanderPlas, Jim Bednar, Phillip Cloud, Steven Silvester, Jason Grout, Travis Oliphant


  3. • MS/BS degrees in Elec. Comp. Engineering
    • PhD from Mayo Clinic in Biomedical Engineering (Ultrasound and MRI)
    • Creator and Developer of SciPy (1998-2009)
    • Professor at BYU (2001-2007) Inverse Problems
    • Creator and Developer of NumPy (2005-2012)
    • Started Numba (2012)
    • Founder of NumFOCUS / PyData
    • Python Software Foundation Director (2012)
    • Co-founder of Continuum Analytics => Anaconda, Inc.
    • CEO (2012) => Chief Data Scientist (2017)
    • Founder (2018) of Quansight


  4. Empower domain experts with high-level tools that exploit
    modern hardware
    Array Oriented Computing
    expertise


  5. Why Array-oriented computing
    • Express domain knowledge directly in arrays (tensors, matrices, vectors), which makes it easier to teach programming in the specific domain
    • Can take advantage of parallelism and accelerators
    • Array expressions
    [Figure: six objects, each holding Attr1, Attr2, and Attr3, contrasted with a table whose rows are Object1 to Object6 and whose columns are Attr1, Attr2, Attr3]
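The contrast between a collection of objects and a table of attribute columns can be sketched with NumPy. This is only an illustration: the field names `mass` and `velocity` and the kinetic-energy formula are assumptions chosen for the demo and do not come from the slides.

```python
import numpy as np

# Object-oriented layout: one record per object (hypothetical fields).
objects = [
    {"mass": 2.0, "velocity": 3.0},
    {"mass": 1.0, "velocity": 4.0},
    {"mass": 4.0, "velocity": 0.5},
]

# Per-object loop: the interpreter handles one element at a time.
energy_loop = [0.5 * o["mass"] * o["velocity"] ** 2 for o in objects]

# Array-oriented layout: one array per attribute (a column).
mass = np.array([o["mass"] for o in objects])
velocity = np.array([o["velocity"] for o in objects])

# A single array expression replaces the loop and runs in compiled code.
energy_array = 0.5 * mass * velocity ** 2
```

The array form states the domain formula once for the whole dataset, which is the property that lets libraries hand the work to vector units or accelerators.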


  6. More reasons for array-oriented
    • Today’s vector machines (and vector co-processors, or GPUs) were made for array-oriented computing.
    • The software stack has just not caught up, which is unfortunate because APL came out in 1963.
    • There is a reason Fortran remains popular.


  7. Python and in particular PyData is Growing


  8. Data Science Workflow
    New Data
    Notebooks
    Understand Data
    Getting Data
    Understand World
    Reports
    Microservices
    Dashboards
    Applications
    Decisions and Actions
    Models
    Exploratory Data Analysis and Viz
    Data Products


  9. Machine Learning Explosion
    Scikit-Learn
    Tensorflow
    Keras
    XGBoost
    Torch
    MxNet
    theano
    lasagne
    caffe/caffe2
    minpy
    neon
    CNTK
    DAAL
    Chainer
    Dynet
    Apache Singa
    Shogun
    CuPy
    https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
    http://deeplearning.net/software_links/
    http://scikit-learn.org/stable/related_projects.html


  10. A Representation of Packages for ML
    © 2017 Anaconda, Inc. - Confidential & Proprietary


  11. NumPy and Packages that Depend on It


  12. pandas Depends on NumPy
    (and other packages depend on pandas)


  13. Caffe Depends on pandas and NumPy


  14. Embrace Innovation Without Anarchy
    From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft
    Reproducibility


  15. Conda: a cross-platform and language-agnostic package and environment manager
    Conda Forge: a community-led collection of recipes, build infrastructure, and packages for conda
    Conda Environments: custom isolated software sandboxes to allow easy reproducibility and sharing of data-science work
    Anaconda Project: reproducible, executable project directories


  16. “conda – package everything”
    • Language independent
    • Platform independent
    • No special privileges required
    • No VMs or containers
    • Enables:
      - Reproducibility
      - Collaboration
      - Scaling
    [Figure: Conda Sandboxing Technology. conda manages isolated environments A, B, and C; labels in the diagram: Python v2.7, Python v3.4, Pandas v0.18, Pandas v0.16, NumPy v1.11, NumPy v1.10, Jupyter, R, R Essentials]


  17. Basic Conda Usage
    Install a package            conda install sympy
    List all installed packages  conda list
    Search for packages          conda search llvm
    Create a new environment     conda create -n py3k python=3
    Remove a package             conda remove nose
    Get help                     conda install --help


  18. Advanced Conda Usage
    Install a package in an environment         conda install -n py3k sympy
    Update all packages                         conda update --all
    Export list of packages                     conda list --export > packages.txt
    Install packages from an export             conda install --file packages.txt
    See package history                         conda list --revisions
    Revert to a revision                        conda install --revision 23
    Remove unused packages and cached tarballs  conda clean -pt


  19. 19
    Development Deployment
    Conda eases rapid deployment of ML


  20. Scaling Up and Out with Numba and Dask


  21. Scale Up vs Scale Out
    [Figure: two axes, Scale Up (Bigger Nodes) and Scale Out (More Nodes)]
    Numba: big memory and many cores / GPU box
    Dask: many commodity nodes in a cluster
    Dask with Numba: best of both (e.g. GPU cluster)


  22. Development
    Name     Latest Release  Number of Releases  GitHub Stars  Contributors  Downloads in 2017
    numba    0.37            101                 2907          77            ~2m
    dask     0.17.0          41                  2444          136           ~1.5m
    dask-ml  0.4.1           9                   188           8             New
    Numba: http://numba.pydata.org http://github.com/numba
    Dask: http://dask.pydata.org http://github.com/dask
    Dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html http://github.com/dask/dask-ml


  23. Numba
    Credit: Stan Seibert for many of these slides


  24. Numba (compile Python to CPUs and GPUs)
    conda install numba
    [Figure: compilation pipeline. Python source enters the Numba parsing frontend, which emits LLVM Intermediate Representation (IR); the LLVM code generation backend then targets x86, ARM, or PTX]


  25. Image Processing
    from numba import jit

    # Explicit signature: three 2-D float64 arrays in, nothing returned.
    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        # Skip the border so the filter window always stays inside the image.
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                output[i,j] = result
    ~1500x speed-up
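A usage sketch for a kernel of this shape, assuming NumPy is installed and Numba is optional. The 5x5 test image, the 3x3 box filter, and the name `box_filter` are made up for illustration, and type inference is used here instead of an explicit signature.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speed-up).
    def njit(func):
        return func

@njit
def box_filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m // 2, M - m // 2):
        for j in range(n // 2, N - n // 2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i + k - m // 2, j + l - n // 2] * filt[k, l]
            output[i, j] = result

image = np.ones((5, 5))
filt = np.ones((3, 3))
output = np.zeros_like(image)
box_filter(image, filt, output)
# Each interior pixel sums nine ones; the one-pixel border is never written.
```

The caller allocates `output`, so the compiled loop only reads and writes existing arrays.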


  26. Numba Features
    Works with and does not replace the standard Python interpreter
    (all of your existing Python libraries are still available)


  27. 7 things about Numba you may not know
    1. Numba is 100% Open Source
    2. Numba + Jupyter = Rapid CUDA Prototyping
    3. Numba can compile for the CPU and the GPU at the same time
    4. Numba makes array processing easy with @(gu)vectorize
    5. Numba comes with a CUDA Simulator
    6. You can send Numba functions over the network
    7. Numba developers are working on a GPU DataFrame (pygdf)


  28. Other Numba topics
    CUDA Python — write general GPU kernels with Python
    Device Arrays — manage memory transfer from host to GPU
    Streaming — manage asynchronous and parallel GPU compute streams
    CUDA Simulator in Python — to help debug your kernels
    HSA Support — early support for HSA-based GPUs and APUs
    Pyculib — access to cuFFT, cuBLAS, cuSPARSE, cuRAND, CUDA Sorting
    Parallel Acceleration — prange, parallel functions, and more
    https://github.com/ContinuumIO/gtc2017-numba


  29. Numba Features
    • Detects CPU model during compilation and optimizes for that target
    • Automatic type inference: No need to give type signatures for functions
    • Dispatches to multiple type-specializations for the same function
    • Call out to C libraries with CFFI and ctypes
    • Special "callback" mode for creating C callbacks to use with external libraries
    • Optional caching to disk, and ahead-of-time creation of shared libraries
    • Compiler is extensible with new data types and functions
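Type inference and per-type specialization can be seen in a small sketch. The function `add` is a made-up example, and the fallback decorator is an assumption added so the snippet also runs where Numba is missing (in that case no signatures are recorded).

```python
try:
    from numba import njit
except ImportError:
    # Fallback: plain Python function, no compilation.
    def njit(func):
        return func

@njit
def add(a, b):
    return a + b

int_result = add(1, 2)        # first call with integers compiles an integer version
float_result = add(1.5, 2.5)  # first call with floats compiles a separate float version

# With Numba installed, the dispatcher keeps one signature per specialization.
compiled = getattr(add, "signatures", [])
```

No signature string is written anywhere; the argument types of each call select, or trigger compilation of, the matching specialization.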


  30. Parallel Computing
    • Three main technologies for parallelism: SIMD, Multi-threading, Distributed Computing
    [Figure: elements x0, x1, x2, x3 shown being processed under each of the three technologies]


  31. SIMD: Single Instruction Multiple Data
    • Numba's CPU detection will enable LLVM to autovectorize for the appropriate SIMD instruction set:
      • SSE, AVX, AVX2, AVX-512
    • Will become even more important as AVX-512 is now available on both Xeon Phi and Skylake Xeon processors


  32. Manual Multithreading: Release the GIL
    Option to release the GIL
    Using Python concurrent.futures
    [Chart: Speedup Ratio (axis 0 to 3.5) versus Number of Threads (1, 2, 4)]
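A minimal sketch of this pattern, assuming NumPy is installed and Numba is optional. The function `sum_of_squares`, the array size, and the choice of four threads are illustrative assumptions; without Numba the fallback runs plain Python and the GIL is not released.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback: accepts the same keyword arguments but does not compile.
    def njit(*args, **kwargs):
        return lambda func: func

@njit(nogil=True)
def sum_of_squares(chunk):
    total = 0.0
    for value in chunk:
        total += value * value
    return total

data = np.arange(100_000, dtype=np.float64)
chunks = np.array_split(data, 4)      # one chunk per worker thread

sum_of_squares(chunks[0])             # warm-up call so compilation happens once, up front

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))

total = sum(partials)
```

Because the compiled function does not hold the GIL while it loops, the four threads can run on separate cores; the threads themselves are ordinary standard-library threads.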


  33. Universal Functions (Ufuncs)
    Ufuncs are a core concept in NumPy for array-oriented computing.
    ◦ A function with scalar inputs is broadcast across the elements of the input arrays:
      • np.add([1,2,3], 3) == [4, 5, 6]
      • np.add([1,2,3], [10, 20, 30]) == [11, 22, 33]
    ◦ Parallelism is present, by construction. Numba will generate loops and can automatically multi-thread if requested.
    ◦ Before Numba, creating fast ufuncs required writing C. No longer!
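The broadcasting rules above, plus a custom ufunc written as scalar Python, can be sketched as follows. `clipped_diff` is a hypothetical function invented for this example, and the fallback to `np.vectorize` is an assumption so the snippet runs without Numba (same broadcasting, no compilation).

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    vectorize = np.vectorize   # fallback: same broadcasting, no compilation

@vectorize
def clipped_diff(a, b):
    # Scalar logic only; the ufunc machinery supplies the loop.
    diff = a - b
    return diff if diff > 0.0 else 0.0

scalar_case = np.add([1, 2, 3], 3)              # scalar broadcast against an array
array_case = np.add([1, 2, 3], [10, 20, 30])    # element-wise across two arrays
custom_case = clipped_diff(np.array([1.0, 5.0, 9.0]), 4.0)
```

The custom function broadcasts exactly like `np.add`, even though its body only ever sees two scalars.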


  34. Universal Functions (Ufuncs)
    Different decorator!
    1.8x speedup!


  35. Multi-threaded Ufuncs
    Specify type signature
    Select parallel target
    Automatically uses all CPU cores!
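The code for this slide did not survive in the transcript. The two annotated steps (a type signature and a parallel target) might look like the sketch below; `squared_norm` and the input sizes are assumptions, and the fallback ignores both arguments when Numba is not installed.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback: ignores the signature list and the target, no multithreading.
    def vectorize(signatures, target="cpu"):
        return np.vectorize

# An explicit signature is required when a non-default target is selected.
@vectorize(["float64(float64, float64)"], target="parallel")
def squared_norm(x, y):
    return x * x + y * y

x = np.linspace(0.0, 1.0, 100_000)
y = np.ones_like(x)
result = squared_norm(x, y)
```

Only the decorator arguments differ from a single-threaded ufunc; the function body stays scalar.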


  36. ParallelAccelerator
    • ParallelAccelerator is a special compiler pass contributed by Intel Labs
      • Todd A. Anderson, Ehsan Totoni, Paul Liu
      • Based on a similar contribution to Julia
    • Automatically generates multithreaded code in a Numba-compiled function:
      • Array expressions and reductions
      • Random functions
      • Dot products
      • Explicit loops indicated with prange() call
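An explicit parallel loop can be sketched as below, assuming NumPy is installed and Numba is optional. `row_norms` and the small test matrix are illustrative assumptions; without Numba the fallback turns `prange` into a plain `range`.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback: serial execution in plain Python.
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func

@njit(parallel=True)
def row_norms(matrix):
    n_rows, n_cols = matrix.shape
    out = np.empty(n_rows)
    # Rows are independent, so prange may spread iterations across threads.
    for i in prange(n_rows):
        acc = 0.0
        for j in range(n_cols):
            acc += matrix[i, j] * matrix[i, j]
        out[i] = np.sqrt(acc)
    return out

matrix = np.arange(12, dtype=np.float64).reshape(4, 3)
norms = row_norms(matrix)
```

Only the outer loop is marked with `prange`; each iteration writes a distinct element of `out`, so no synchronization is needed.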


  37. ParallelAccelerator: Example #1
    [Bar chart: Time (ms), axis 0 to 4000, for NumPy, Numba, and Numba+PA; speedup labels 1.8x and 3.6x]
    1000000x10 input, Core i7 Quad Core CPU


  38. ParallelAccelerator: prange()
    [Bar chart: Time (ms), axis 0 to 100, for NumPy, Numba, and Numba+PA; speedup labels 4.3x and 50x]
    1000000x10 input, Core i7 Quad Core CPU


  39. ParallelAccelerator: prange()
    [Bar chart: Time (ms), axis 0 to 100, for NumPy, Numba, and Numba+PA; speedup labels 2x and 3.6x]
    1000000x10 input, Core i7 Quad Core CPU


  40. ParallelAccelerator: Image Resampling
    https://github.com/bokeh/datashader/blob/master/examples/landsat.ipynb
    Interactive image resampling with Holoviews + Datashader
    Datashader resampling implemented with Numba + prange()


  41. ParallelAccelerator: Stencils
    730x547 image w/ 21x21 pixel blur
    half the lines of code and 4x faster on a quad core
    CPU than equivalent non-stencil Numba code


  42. Dask
    Credit: Matthew Rocklin for many of these slides


  43. • Designed to parallelize the Python ecosystem
    • Handles complex algorithms
    • Co-developed with Pandas/SKLearn/Jupyter teams
    • Familiar APIs for Python users
    • Scales
    • Scales from multicore to 1000-node clusters
    • Resilient, responsive, and real-time


  44. • Parallelizes NumPy, Pandas, SKLearn
    • Satisfies subset of these APIs
    • Uses these libraries internally
    • Co-developed with these teams
    • Task scheduler supports custom algorithms
    • Parallelize existing code
    • Build novel real-time systems
    • Arbitrary task graphs with data dependencies
    • Same scalability
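A small task graph with data dependencies can be sketched with `dask.delayed`, assuming Dask is installed. The functions `load`, `clean`, and `combine` and their toy data are hypothetical stand-ins for real work.

```python
from dask import delayed

@delayed
def load(part):
    # Stand-in for reading one partition of data.
    return list(range(part * 3, part * 3 + 3))

@delayed
def clean(values):
    # Keep even numbers only.
    return [v for v in values if v % 2 == 0]

@delayed
def combine(parts):
    return sum(sum(p) for p in parts)

# Building the graph runs nothing yet; each call only records a task.
cleaned = [clean(load(i)) for i in range(4)]
total = combine(cleaned)

# compute() hands the graph to a scheduler, which runs independent tasks in parallel.
result = total.compute()
```

The four `load` and `clean` chains do not depend on each other, so the scheduler is free to run them concurrently before `combine` collects the results.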


  45. Dask: From User Interaction to Execution
    delayed


  46. Dask DataFrame is like Pandas
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998
    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998


  47. Example 1: Using Dask DataFrames on a cluster with CSV data
    • Built from Pandas DataFrames
    • Match Pandas interface
    • Access data from HDFS, S3, local, etc.
    • Fast, low latency
    • Responsive user interface


  48. Dask Array is like NumPy
    >>> import numpy as np
    >>> np_ones = np.ones((5000, 1000))
    >>> np_ones
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
    >>> np_y
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])
    >>> import dask.array as da
    >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
    # da_ones is lazy. Calling da_ones.compute() would try to materialize all
    # 5e12 elements (about 40 TB as float64), so only a small reduction is computed.
    >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    >>> np_da_y = np.array(da_y)  # fits in memory: 5 values, each about 1000000 * log(2)
    # If result doesn’t fit in memory
    >>> da_y.to_hdf5('myfile.hdf5', 'result')


  49. Example 3: Using Dask Arrays with global temperature data
    • Built from NumPy n-dimensional arrays
    • Matches NumPy interface (subset)
    • Solve medium-large problems
    • Complex algorithms


  50. Dask Schedulers: Distributed Scheduler

  51. Scheduler Visualization with Bokeh

  52. Ten Reasons People Choose Dask


  53. Scalable Pandas DataFrames
    • Same API
        import dask.dataframe as dd
        df = dd.read_parquet('s3://bucket/accounts/2017')
        df.groupby(df.name).value.mean().compute()
    • Efficient Timeseries Operations
        df.loc['2017-01-01']             # Uses the Pandas index…
        df.value.rolling(10).std()       # for efficient…
        df.value.resample('10m').mean()  # operations.
    • Co-developed with Pandas and by the Pandas developer community


  54. Scalable NumPy Arrays
    • Same API
        import dask.array as da
        x = da.from_array(my_hdf5_file)
        y = x.dot(x.T)
    • Applications
    • Atmospheric science
    • Satellite imagery
    • Biomedical imagery
    • Optimization algorithms
        check out dask-glm


  55. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…, n_jobs=8)
        estimator.fit(train_data, train_labels)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    Joblib: Thread pool


  56. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…, n_jobs=8)
        estimator.fit(train_data, train_labels)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    Joblib: Dask


  57. Many Other Libraries in Anaconda
    • Scikit-Image uses dask to break down images and speed
    up algorithms with overlapping regions
    • Geopandas can use Dask to partition data
    spatially and accelerate spatial joins


  58. Dask Scales Up
    • Thousand node clusters
    • Cloud computing
    • Super computers
    • Gigabyte/s bandwidth
    • 200 microsecond task overhead
    Dask Scales Down (the median cluster size is one)
    • Can run in a single Python thread pool
    • Almost no performance penalty (microseconds)
    • Lightweight
    • Few dependencies
    • Easy install
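The 200 microsecond figure gives a quick rule of thumb for task granularity. The arithmetic below is a back-of-the-envelope sketch that assumes the overhead is paid once per task by a single scheduler; the task durations are made-up examples.

```python
TASK_OVERHEAD_S = 200e-6   # per-task scheduler overhead quoted on the slide

def overhead_fraction(task_duration_s, overhead_s=TASK_OVERHEAD_S):
    """Fraction of total time spent on scheduling rather than on work."""
    return overhead_s / (task_duration_s + overhead_s)

# One million tasks cost about 200 seconds of scheduling in total.
total_overhead_s = 1_000_000 * TASK_OVERHEAD_S

tiny = overhead_fraction(100e-6)    # 100 microsecond tasks: mostly overhead
large = overhead_fraction(100e-3)   # 100 millisecond tasks: overhead is negligible
```

Under these assumptions, tasks should run much longer than the overhead, which usually means batching many small operations into each task.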


  59. Parallelize Web Backends
    • Web servers process thousands of small computations asynchronously for web pages or REST endpoints
    • Dask provides dynamic, heterogeneous computation
    • Supports small data
    • 10ms roundtrip times
    • Dynamic scaling for different loads
    • Supports asynchronous Python (like GoLang)
        async def serve(request):
            future = dask_client.submit(process, request)
            result = await future
            return result


  60. Debugging support
    • Clean Python tracebacks when user code breaks
    • Connect to remote workers with IPython sessions for advanced debugging


  61. Resource constraints
    • Define limited hardware resources for workers
    • Specify resource constraints when submitting tasks
        $ dask-worker … --resources GPU=2
        $ dask-worker … --resources GPU=2
        $ dask-worker … --resources special-db=1
        future = client.submit(my_function, resources={'GPU': 1})
    • Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.


  62. Collaboration
    • Many users can share the same cluster simultaneously
    • Define public datasets
    • Repeated computation and data use is shared among everyone
    df = dd.read_parquet(…).persist()
    client.publish_dataset(accounts=df)
    df = client.get_dataset('accounts')


  63. Beautiful Diagnostic Dashboards
    • Fast responsive dashboards
    • Provide users performance insight
    • Powered by Bokeh


  64. Some Reasons not to Choose Dask


  65. Dask’s limitations
    • Dask is not a SQL database.
      Does Pandas well, but won’t optimize complex queries.
    • Dask is not MPI
      Very fast, but does leave some performance on the table
      200us task overhead
      a couple copies in the network stack
    • Dask is not a JVM technology
      It’s a Python library
      (although Julia bindings available)
    • Dask is not always necessary
      You may not need parallelism


  66. dask.pydata.org
    conda install dask
