Scaling Python Up and Out with Numba and Dask

An overview of Python for data science, describing how Numba can speed up your Python code by compiling array-oriented code to native machine code, and how Dask can run your code in parallel across multiple cores and multiple machines.

Travis E. Oliphant

October 05, 2018

Transcript

  1. © 2017 Continuum Analytics - Confidential & Proprietary
    © 2018 Quansight - Confidential & Proprietary
    Scaling Python Up and Out with Numba
    and Dask
    Travis E. Oliphant
    PyCon India Tutorial
    October 5, 2018

  2. • MS/BS degrees in Elec. Comp. Engineering
    • PhD from Mayo Clinic in Biomedical Engineering
    (Ultrasound and MRI)
    • Creator and Developer of SciPy (1998-2009)
    • Professor at BYU (2001-2007) Inverse Problems
    • Creator and Developer of NumPy (2005-2012)
    • Started Numba and Conda (2012 - )
    • Founder of NumFOCUS / PyData
    • Python Software Foundation Director (2012)
    • Co-founder of Continuum Analytics => Anaconda, Inc.
    • CEO (2012) => Chief Data Scientist (2017)
    • Founder (2018) of Quansight
    SciPy

  3. Company
    2012 - Created Two Orgs for Sustainable Open Source
    Community
    Enterprise software company initially
    built on services and supporting
    open-source.
    Became

  4. Data Science Workflow
    New Data
    Notebooks
    Understand Data
    Getting Data
    Understand World
    Reports
    Microservices
    Dashboards
    Applications
    Decisions
    and
    Actions
    Models
    Exploratory Data Analysis and Viz
    Data Products

  5. Quansight — continuing Continuum momentum
    [Diagram: spin-out timeline marked 2012, 2015, 2018, and "2019 and beyond…", with arrows labeled "Spin Out" and "Replaced by"]
    Key. Members of the management team at Continuum
    Analytics ==> Anaconda was our first (spin-out) company.

  6. What We Do
    Connecting companies and communities
    We build and connect companies and open-source
    communities to sustainably solve problems with data.

  7. Core Business
    Quansight Labs Membership
    Staffing / Mentoring
    Custom Data-Science/ML Consulting
    Sustainable Open Source Partnerships

  8. Open Source Directions
    Webinar series to promote and encourage open-source Roadmaps.
    We also help communities publicize these roadmaps.

  9. LABS
    Sustaining the Future
    Open-source innovation and
    maintenance around the entire data-
    science and AI workflow.
    • NumPy ecosystem maintenance (fund developers)
    • Improve connection of NumPy to ML Frameworks
    • GPU Support for NumPy Ecosystem
    • Improve foundations of Array computing
    • JupyterLab
    • Data Catalog standards
    • Packaging (conda-forge, PyPA, etc.)
    uarray — unified array interface and symbolic NumPy
    xnd — re-factored NumPy (low-level cross-language
    libraries for N-D (tensor) computing)
    Partnered with NumFOCUS and
    Ursa Labs (supporting Arrow)
    Bokeh
    Adapted from Jake Vanderplas
    PyCon 2017 Keynote

  10. Python Data Analysis and Machine Learning Time-Line
    [Figure: project logos placed on a timeline from 1991 to 2018, with markers at 1991, 2001, 2003, 2005, 2006, 2009, 2010, 2011, 2012, 2014, 2015, 2016, and 2018]

  11. Empower domain experts with high-level tools that exploit
    modern hardware
    Array Oriented Computing
    expertise

  12. Why Array-oriented computing
    • Express domain knowledge directly in arrays (tensors, matrices, vectors) --- easier to teach programming in domain
    • Can take advantage of parallelism and accelerators
    • Array expressions
    [Figure: separate objects, each holding Attr1, Attr2, and Attr3, contrasted with a single table whose columns are Attr1, Attr2, Attr3 and whose rows are Object1 through Object6]
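
The contrast in the figure (one object per record versus one array per attribute) can be sketched as follows. This is a minimal illustration assuming NumPy is installed; the `Particle` class, its fields, and the numbers are hypothetical and do not come from the slides.

```python
import numpy as np

# Object-oriented layout: one Python object per record (hypothetical data).
class Particle:
    def __init__(self, mass, velocity):
        self.mass = mass
        self.velocity = velocity

particles = [Particle(1.0, 2.0), Particle(2.0, 3.0), Particle(4.0, 0.5)]
# Kinetic energy computed one object at a time in the interpreter.
energy_objects = [0.5 * p.mass * p.velocity ** 2 for p in particles]

# Array-oriented layout: one array per attribute, one entry per object.
mass = np.array([1.0, 2.0, 4.0])
velocity = np.array([2.0, 3.0, 0.5])
# The same formula written once as an array expression over all objects.
energy_arrays = 0.5 * mass * velocity ** 2
```

The array expression reads like the domain formula and leaves the looping to compiled code, which is what makes parallelism and accelerators reachable.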

  13. Reasons for array-oriented
    • Today’s vector machines (and vector co-processors, or GPUs) were made for array-oriented computing.
    • The software stack has just not caught up --- unfortunate because APL came out in 1963.
    • There is a reason Fortran remains popular among High Performance groups.

  14. Python and in particular PyData is Growing

  15. Bokeh
    Adapted from Jake Vanderplas
    PyCon 2017 Keynote

  16. Conda: A cross-platform and language agnostic package and environment manager
    Conda Forge: A community-led collection of recipes, build infrastructure, and packages for conda.
    Conda Environments: Custom isolated software sandboxes to allow easy reproducibility and sharing of data-science work.
    Anaconda.org: Web-site for freely hosting public packages and environments. Example of conda repository.

  17. Conda Sandboxing Technology
    “conda – package everything”
    • Language independent
    • Platform independent
    • No special privileges required
    • No VMs or containers
    • Enables:
    - Reproducibility
    - Collaboration
    - Scaling
    [Figure: conda managing three isolated sandboxes labeled A (Python v2.7), B (Python v3.4, Pandas v0.18, Jupyter), and C (R, R Essentials); NumPy v1.11, NumPy v1.10, and Pandas v0.16 also appear among the packages]

  18. Basic Conda Usage
    Install a package              conda install sympy
    List all installed packages    conda list
    Search for packages            conda search llvm
    Create a new environment       conda create -n py3k python=3
    Remove a package               conda remove nose
    Get help                       conda install --help

  19. Advanced Conda Usage
    Install a package in an environment           conda install -n py3k sympy
    Update all packages                           conda update --all
    Export list of packages                       conda list --export > packages.txt
    Install packages from an export               conda install --file packages.txt
    See package history                           conda list --revisions
    Revert to a revision                          conda install --revision 23
    Remove unused packages and cached tarballs    conda clean -pt

  20. Conda eases rapid deployment
    [Figure: Development and Deployment]

  21. NumPy

  22. Without NumPy
    # functions.py
    from math import sin, pi
    def sinc(x):
        if x == 0:
            return 1.0
        else:
            pix = pi*x
            return sin(pix)/pix
    def step(x):
        if x > 0:
            return 1.0
        elif x < 0:
            return 0.0
        else:
            return 0.5
    >>> import functions as f
    >>> xval = [x/3.0 for x in range(-10,10)]
    >>> yval1 = [f.sinc(x) for x in xval]
    >>> yval2 = [f.step(x) for x in xval]
    Python is a great language but needed a way to operate quickly and cleanly over multi-dimensional arrays.

  23. With NumPy
    # Option 1: vectorize the scalar functions from functions.py
    from numpy import vectorize
    import functions as f
    vsinc = vectorize(f.sinc)
    vstep = vectorize(f.step)
    # Option 2: functions2.py, written with array operations
    from numpy import sin, pi
    def sinc(x):
        pix = pi*x
        val = sin(pix)/pix
        val[x==0] = 1.0
        return val
    def step(x):
        y = x*0.0
        y[x>0] = 1
        y[x==0] = 0.5
        return y
    >>> import functions2 as f
    >>> from numpy import *
    >>> x = r_[-10:10]/3.0
    >>> y1 = f.sinc(x)
    >>> y2 = f.step(x)
    Offers N-D array, element-by-element functions, and basic random numbers, linear algebra, and FFT capability for Python
    http://numpy.org
    Fiscally sponsored by NumFOCUS

  24. NumPy: an Array Extension of Python
    • Data: the array object
    – slicing and shaping
    – data-type map to Bytes
    • Fast Math (ufuncs):
    – vectorization
    – broadcasting
    – aggregations
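
The three "fast math" ideas listed above can each be shown in a line or two. This is a small sketch assuming NumPy is installed; the array values are arbitrary.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)       # [[0, 1, 2], [3, 4, 5]]

# Vectorization: the ufunc np.sqrt is applied to every element
# without an explicit Python loop.
roots = np.sqrt(a)

# Broadcasting: the length-3 row is stretched across both rows of `a`.
shifted = a + np.array([10, 20, 30])

# Aggregation: reduce along an axis (here, sum down each column).
column_sums = a.sum(axis=0)
```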

  25. shape
    NumPy Array
    Key Attributes
    • dtype
    • shape
    • ndim
    • strides
    • data
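
The key attributes listed above can be inspected directly. A short sketch, assuming NumPy is installed and the default C-ordered memory layout:

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)

print(a.dtype)    # element type
print(a.shape)    # length of each dimension
print(a.ndim)     # number of dimensions
print(a.strides)  # bytes to step along each axis
print(a.data)     # buffer object exposing the underlying memory

# A transpose is a view: it shares the same data buffer and only
# swaps shape and strides.
t = a.T
```

Because shape and strides are just metadata over one block of bytes, operations such as transposing or slicing can avoid copying the data.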

  26. NumPy Examples
    2d array
    3d array
    [439 472 477]
    [217 205 261 222 245 238]
    9.98330639789 2.96677717122

  27. NumPy Slicing (Selection)
    >>> a[0,3:5]
    array([3, 4])
    >>> a[4:,4:]
    array([[44, 45],
    [54, 55]])
    >>> a[:,2]
    array([2,12,22,32,42,52])
    >>> a[2::2,::2]
    array([[20, 22, 24],
    [40, 42, 44]])
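
The transcript does not show how `a` was built. One array consistent with the outputs above is a 6x6 array with `a[i, j] == 10*i + j`; that construction is an assumption used here only to make the slicing reproducible.

```python
import numpy as np

# Assumed construction: a[i, j] == 10*i + j, which matches the outputs shown.
a = np.arange(6) + 10 * np.arange(6)[:, np.newaxis]

row_part = a[0, 3:5]      # row 0, columns 3 and 4
corner = a[4:, 4:]        # bottom-right 2x2 block
column = a[:, 2]          # every row, column 2
strided = a[2::2, ::2]    # rows 2 and 4, every other column
```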

  28. Summary
    • Provides foundational N-dimensional array composed
    of homogeneous elements of a particular “dtype”
    • The dtype of the elements is extensive (but difficult to
    extend)
    • Arrays can be sliced and diced with simple syntax to
    provide easy manipulation and selection.
    • Provides fast and powerful math, statistics, and linear
    algebra functions that operate over arrays.
    • Utilities for sorting, reading and writing data also
    provided.

  29. Scaling Up and Out with Numba
    and Dask

  30. Scale Up vs Scale Out
    Big Memory &
    Many Cores
    / GPU Box
    Best of Both
    (e.g. GPU Cluster)
    Many commodity
    nodes in a cluster
    Scale Up
    (Bigger Nodes)
    Scale Out
    (More Nodes)
    Numba
    Dask
    Dask with Numba

  31. Development
    Name     Latest Release  Number of Releases  GitHub Stars  Contributors
    numba    0.40.0          113                 3476          96
    dask     0.19.2          52                  3507          195
    dask-ml  0.10.0          15                  104           23
    numpy    1.15.2          144                 8298          694
    pandas   0.23.4          97                  16,276        1285
    Numba: http://numba.pydata.org http://github.com/numba
    Dask: http://dask.pydata.org http://github.com/dask
    Dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html http://github.com/dask/dask-ml

  32. Numba

  33. A Compiler for Python?
    Combining Productivity and Performance
    • Python is one of the most popular languages for data science
    • Python integrates well with compiled, accelerated libraries (MKL, TensorFlow, etc)
    • But what about custom algorithms and data processing tasks?
    • Our goal was to make a compiler that:
      • Works within the standard Python interpreter rather than replacing it
      • Integrates tightly with NumPy
      • Is compatible with both multithreaded and distributed computing paradigms

  34. Numba: A JIT Compiler for Python
    • An open-source, function-at-a-time compiler library for Python
    • Compiler toolbox for different targets and execution models:
      • single-threaded CPU, multi-threaded CPU, GPU
      • regular functions, “universal functions” (array functions), etc
    • Speedup: 2x (compared to basic NumPy code) to 200x (compared to pure Python)
    • Combine ease of writing Python with speeds approaching FORTRAN
    • Empowers data scientists who make tools for themselves and other data scientists

  35. 7 things about Numba you may not know
    1. Numba is 100% Open Source
    2. Numba + Jupyter = Rapid CUDA Prototyping
    3. Numba can compile for the CPU and the GPU at the same time
    4. Numba makes array processing easy with @(gu)vectorize
    5. Numba comes with a CUDA Simulator
    6. You can send Numba functions over the network
    7. Numba developers contributing to a GPU DataFrame (pygdf)

  36. Numba (compile Python to CPUs and GPUs)
    conda install numba
    [Diagram: Python → Numba (parsing frontend) → Intermediate Representation (IR) → LLVM (code generation backend) → x86, ARM, PTX]

  37. How does Numba work?
    [Diagram: Python Function (bytecode) and Functions Arguments → Bytecode Analysis → Numba IR → Type Inference → Rewrite IR → Lowering → LLVM IR → LLVM/NVVM JIT → Machine Code → Cache → Execute!]
    @jit
    def do_math(a, b):
        …
    >>> do_math(x, y)
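
The slide leaves the body of `do_math` out. The sketch below fills it with a hypothetical dot-product loop just to show the decorate-then-call pattern; the body, the inputs, and the pure-Python fallback (used only when Numba is not installed) are all assumptions, not the talk's code.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) where Numba is missing.
    def jit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

@jit(nopython=True)
def do_math(a, b):
    # Hypothetical body: an explicit loop that Numba can compile.
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

x = np.arange(5, dtype=np.float64)
y = np.ones(5)

# The first call with these argument types triggers type inference
# and compilation.
result = do_math(x, y)
# A later call with the same argument types reuses the compiled code.
result_again = do_math(x, y)
```

Nothing is compiled at decoration time; the work happens on the first call, once the argument types are known.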

  38. Supported Platforms and Hardware
    OS: Windows (7 and later), OS X (10.9 and later), Linux (RHEL 6 and later)
    HW: 32 and 64-bit CPUs (Incl Xeon Phi), CUDA & HSA GPUs, some support for ARM and ROCm
    SW: Python 2.7, 3.4-3.7; NumPy 1.10 and later

  39. Basic Example

  40. Basic Example
    Array Allocation
    Looping over ndarray x as an iterator
    Using numpy math functions
    Returning a slice of the array
    2.7x speedup!
    Numba decorator (nopython=True not required)
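
The code on this slide is an image and is not in the transcript. The sketch below is a stand-in that exercises each annotated feature (array allocation, iterating over an ndarray, a NumPy math function, returning a slice); the function name `nan_compact` and the sample data are assumptions, and the fallback decorator only exists so the example runs without Numba.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) where Numba is missing.
    def jit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

@jit
def nan_compact(x):
    out = np.empty_like(x)             # array allocation
    out_index = 0
    for element in x:                  # looping over ndarray x as an iterator
        if not np.isnan(element):      # using a NumPy math function
            out[out_index] = element
            out_index += 1
    return out[:out_index]             # returning a slice of the array

values = np.array([1.0, np.nan, 2.0, np.nan, 3.0])
compacted = nan_compact(values)
```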

  41. Numba Features
    • Detects CPU model during compilation and optimizes for that target
    • Automatic type inference: No need to give type signatures for functions
    • Dispatches to multiple type-specializations for the same function
    • Call out to C libraries with CFFI and ctypes
    • Special "callback" mode for creating C callbacks to use with external libraries
    • Optional caching to disk, and ahead-of-time creation of shared libraries
    • Compiler is extensible with new data types and functions

  42. Parallel Computing
    • Three main technologies for parallelism: SIMD, Multi-threading, Distributed Computing
    [Figure: elements x0 through x3 shown under each of the three models]

  43. SIMD: Single Instruction Multiple Data
    • Numba's CPU detection will enable LLVM to autovectorize for appropriate SIMD instruction set:
      • SSE, AVX, AVX2, AVX-512
    • Will become even more important as AVX-512 is now available on both Xeon Phi and Skylake Xeon processors

  44. Manual Multithreading: Release the GIL
    Option to release the GIL
    Using Python concurrent.futures
    [Chart: speedup ratio (axis 0 to 3.5) versus number of threads (1, 2, 4)]
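
A minimal sketch of the pattern named on this slide: compile with `nogil=True`, then fan chunks out over a thread pool from `concurrent.futures`. The function `sum_of_squares`, the chunk count, and the data are hypothetical; the fallback decorator only keeps the example runnable without Numba (in which case the GIL is not released and no speedup should be expected).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback: leave the function uncompiled if Numba is not installed.
    def jit(*args, **kwargs):
        return lambda func: func

@jit(nopython=True, nogil=True)
def sum_of_squares(chunk):
    # With nogil=True the compiled loop runs without holding the GIL,
    # so several threads can execute it at the same time.
    total = 0.0
    for i in range(chunk.shape[0]):
        total += chunk[i] * chunk[i]
    return total

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 4)

# Each chunk is handed to a worker thread; partial sums are combined after.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))

total = sum(partials)
```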

  45. Universal Functions (Ufuncs)
    Ufuncs are a core concept in NumPy for array-oriented
    computing.
    ◦ A function with scalar inputs is broadcast across the elements of
    the input arrays:
    • np.add([1,2,3], 3) == [4, 5, 6]
    • np.add([1,2,3], [10, 20, 30]) == [11, 22, 33]
    ◦ Parallelism is present, by construction. Numba will generate
    loops and can automatically multi-thread if requested.
    ◦ Before Numba, creating fast ufuncs required writing C. No
    longer!

  46. Universal Functions (Ufuncs)
    Different decorator!
    1.8x speedup!

  47. Multi-threaded Ufuncs
    Specify type signature
    Select parallel target
    Automatically uses all CPU cores!
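
The slide's code is not in the transcript. As a sketch of the two annotations (a type signature plus the parallel target), the example below uses Numba's `@vectorize`; the function `rel_diff` and its inputs are hypothetical, and the fallback to `np.vectorize` is only there so the example runs without Numba.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback: NumPy's interpreted vectorize, ignoring the Numba options.
    def vectorize(*args, **kwargs):
        return lambda func: np.vectorize(func)

# Scalar function in, ufunc out. The signature fixes the element types and
# target="parallel" asks Numba to spread the loop over CPU cores.
@vectorize(["float64(float64, float64)"], target="parallel")
def rel_diff(a, b):
    return 2.0 * (a - b) / (a + b)

x = np.arange(1.0, 9.0)
y = x + 2.0
out = rel_diff(x, y)   # applied element by element, with broadcasting
```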

  48. ParallelAccelerator
    • ParallelAccelerator is a special compiler pass contributed by Intel Labs
    • Todd A. Anderson, Ehsan Totoni, Paul Liu
    • Based on similar contribution to Julia
    • Automatically generates multithreaded code in a Numba compiled-function:
    • Array expressions and reductions
    • Random functions
    • Dot products
    • Explicit loops indicated with prange() call
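
An explicit parallel loop with `prange()` might look like the sketch below. The function `row_norms` and the input matrix are hypothetical; when Numba is not installed the fallback simply runs the same loop serially with `range`.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback: run the same loop serially if Numba is not installed.
    prange = range
    def njit(*args, **kwargs):
        return lambda func: func

@njit(parallel=True)
def row_norms(matrix):
    n_rows = matrix.shape[0]
    out = np.empty(n_rows)
    # prange marks the iterations as independent, so they may be
    # divided among threads.
    for i in prange(n_rows):
        acc = 0.0
        for j in range(matrix.shape[1]):
            acc += matrix[i, j] * matrix[i, j]
        out[i] = np.sqrt(acc)
    return out

m = np.arange(12, dtype=np.float64).reshape(4, 3)
norms = row_norms(m)
```

Each iteration writes only to its own `out[i]`, which is what makes the loop safe to run in parallel.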

  49. ParallelAccelerator: Example #1
    [Chart: time in ms (axis 0 to 4000) for NumPy, Numba, and Numba+PA; annotated speedups 1.8x and 3.6x]
    1000000x10 input, Core i7 Quad Core CPU

  50. ParallelAccelerator: prange()
    [Chart: time in ms (axis 0 to 100) for NumPy, Numba, and Numba+PA; annotated speedups 4.3x and 50x]
    1000000x10 input, Core i7 Quad Core CPU

  51. ParallelAccelerator: prange()
    [Chart: time in ms (axis 0 to 100) for NumPy, Numba, and Numba+PA; annotated speedups 2x and 3.6x]
    1000000x10 input, Core i7 Quad Core CPU

  52. ParallelAccelerator: Image Resampling
    https://github.com/bokeh/datashader/blob/master/examples/landsat.ipynb
    Interactive image resampling
    with Holoviews + Datashader
    Datashader resampling implemented
    with Numba + prange()

  53. ParallelAccelerator: Stencils
    730x547 image w/ 21x21 pixel blur
    half the lines of code and 4x faster on a quad core CPU
    than equivalent non-stencil Numba code

  54. Distributed Computing

    Example: Dask
    @jit
    def f(x):
        …
    - Serialize with pickle module
    - Works with Dask and Spark (and others)
    - Automatic recompilation for each target
    [Diagram: a Dask Client (Haswell) connected through the Dask Scheduler to three Dask Workers (Skylake, Skylake, Knight’s Landing), each running f(x)]

  55. Other Numba topics
    CUDA Python — write general GPU kernels with Python
    Device Arrays — manage memory transfer from host to GPU
    Streaming — manage asynchronous and parallel GPU compute streams
    CUDA Simulator in Python — to help debug your kernels
    HSA Support — early support for HSA-based GPUs and APUs
    Pyculib — access to cuFFT, cuBLAS, cuSPARSE, cuRAND, CUDA Sorting
    https://github.com/ContinuumIO/gtc2017-numba

  56. Dask

  57. • Designed to parallelize the Python ecosystem
    • Handles complex algorithms
    • Co-developed with Pandas/SKLearn/Jupyter teams
    • Familiar APIs for Python users
    • Scales
    • Scales from multicore to 1000-node clusters
    • Resilient, responsive, and real-time

  58. • Parallelizes NumPy, Pandas, SKLearn
    • Satisfies subset of these APIs
    • Uses these libraries internally
    • Co-developed with these teams
    • Task scheduler supports custom algorithms
    • Parallelize existing code
    • Build novel real-time systems
    • Arbitrary task graphs with data dependencies
    • Same scalability

  59. demo video
    • High level: Scaling Pandas
    • Same Pandas look and feel
    • Uses Pandas under the hood
    • Scales nicely onto many machines
    • Low level: Arbitrary task scheduling
    • Parallelize normal Python code
    • Build custom algorithms
    • React real-time
    • Demo deployed with
    • dask-kubernetes, Google Compute Engine
    • github.com/dask/dask-kubernetes
    • Youtube link
    • https://www.youtube.com/watch?v=ods97a5Pzw0

  60. Why do people choose Dask?
    • Familiar with Python:
    • Drop-in NumPy/Pandas/SKLearn APIs
    • Native memory environment
    • Easy debugging and diagnostics
    • Have complex problems:
    • Parallelize existing code without expensive rewrites
    • Sophisticated algorithms and systems
    • Real-time response to small-data
    • Scales up and down:
    • Scales to 1000-node clusters
    • Also runs cheaply on a laptop
    #import pandas as pd
    import dask.dataframe as dd

  61. Dask
    • Started as part of Blaze in early 2014.
    • General parallel programming engine
    • Flexible and therefore highly suited for
    • Commodity Clusters
    • Advanced Algorithms
    • Wide community adoption and use
    conda install -c conda-forge dask
    pip install dask[complete] distributed --upgrade

  62. [Figure: Big Data, Small Data, Numba]

  63. Dask: From User Interaction to Execution
    delayed

  64. Dask: Parallel Data Processing
    Synthetic views of
    Numpy ndarrays
    Synthetic views of
    Pandas DataFrames
    with HDFS support
    DAG construction and
    workflow manager

  65. Dask DataFrame is like Pandas
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998
    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998

  66. Dask Graphs: Example Machine Learning Pipeline
    New Spark/Hadoop clusters
    • Create and provision a Spark/Hadoop cluster with a few simple steps
    • Work on the cloud or with your existing in-house servers

  67. Example 1: Using Dask DataFrames on a cluster with CSV data
    • Built from Pandas DataFrames
    • Match Pandas interface
    • Access data from HDFS, S3, local, etc.
    • Fast, low latency
    • Responsive user interface

  68. Dask Array is like NumPy
    >>> import numpy as np
    >>> np_ones = np.ones((5000, 1000))
    >>> np_ones
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
    >>> np_y
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])
    >>> import dask.array as da
    >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
    >>> da_ones.compute()
    array([[ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           ...,
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.],
           [ 1., 1., 1., ..., 1., 1., 1.]])
    >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    >>> np_da_y = np.array(da_y) #fits in memory
    array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056])
    # If result doesn’t fit in memory
    >>> da_y.to_hdf5('myfile.hdf5', 'result')
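
A laptop-sized version of the comparison above can be run as a quick check. This sketch deliberately uses the small 5000x1000 shape for both libraries (an assumption made so the example finishes quickly); the `except` branch only keeps it runnable where Dask is not installed.

```python
import numpy as np

# NumPy: evaluated eagerly, entirely in memory.
np_y = np.log(np.ones((5000, 1000)) + 1)[:5].sum(axis=1)

try:
    import dask.array as da

    # Dask: the same expression over a chunked array builds a task graph.
    da_ones = da.ones((5000, 1000), chunks=(1000, 1000))
    da_y = da.log(da_ones + 1)[:5].sum(axis=1)
    # Nothing has been computed yet; compute() runs the graph and
    # returns a NumPy array.
    result = da_y.compute()
except ImportError:
    result = np_y  # Dask not installed; reuse the NumPy answer
```

Because only the first five rows are requested, Dask only needs the chunks that overlap those rows.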

  69. Example 3: Using Dask Arrays with global temperature data
    • Built from NumPy n-dimensional arrays
    • Matches NumPy interface (subset)
    • Solve medium-large problems
    • Complex algorithms

  70. Dask Schedulers: Distributed Scheduler

  71. Dask Scheduler
    • Scheduling arbitrary graphs is hard.
    • Optimal graph scheduling is NP-hard
    • Scalable Scheduling requires Linear time solutions
    • Fortunately dask does well with a lot of heuristics
    • … and a lot of monitoring and data about sizes
    • … and how long functions take.

  72. Cluster Architecture Diagram
    [Diagram: Client Machine, Head Node, and three Compute Nodes]

  73. Using Anaconda and Dask on your Cluster
    • Single machine with multiple threads or processes
    • On a cluster with SSH (dcluster)
    • Resource management: YARN (knit), SGE, Slurm
    • On the cloud with Amazon EC2 (dec2) or Google CE
    • On a cluster with Anaconda for cluster management
    • Manage multiple conda environments and packages on bare-metal or cloud-based clusters

  74. Scheduler Visualization with Bokeh

  75. What makes Dask different?
    Let's look at some pictures of directed graphs

  78. Most Parallel Framework Architectures
    User API
    High Level Representation
    Logical Plan
    Low Level Representation
    Physical Plan
    Task scheduler
    for execution

  79. SQL Database Architecture
    SELECT avg(value)
    FROM accounts
    INNER JOIN customers ON …
    WHERE name = 'Alice'

  80. SQL Database Architecture
    SELECT avg(value)
    FROM accounts
    WHERE name = 'Alice'
    INNER JOIN customers ON …
    Optimize

  81. Spark Architecture
    df.join(df2, …)
    .select(…)
    .filter(…)
    Optimize

  82. Large Matrix Architecture
    (A’ * A) \ A’ * b
    Optimize

  83. Dask Architecture

  84. Dask Architecture
    accts=dd.read_parquet(…)
    accts=accts[accts.name == 'Alice']
    df=dd.merge(accts, customers)
    df.value.mean().compute()

  85. Dask Architecture
    u, s, v = da.linalg.svd(X)
    Y = u.dot(da.diag(s)).dot(v.T)
    da.linalg.norm(X - Y)

  86. Dask Architecture
    for i in range(256):
        x = dask.delayed(f)(i)
        y = dask.delayed(g)(x)
        z = dask.delayed(add)(x, y)
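
The loop above only builds a task graph; something still has to execute it. The sketch below completes it under stated assumptions: `f` and `g` are hypothetical stand-ins (the slide does not define them), and the eager shims in the `except` branch exist only so the example runs where Dask is not installed.

```python
from operator import add

try:
    from dask import compute, delayed
except ImportError:
    # Eager stand-ins so the sketch runs without Dask (no parallelism).
    def delayed(func):
        return func

    def compute(*values):
        return values

def f(i):
    # Hypothetical first stage.
    return i + 1

def g(x):
    # Hypothetical second stage, depends on the output of f.
    return 2 * x

results = []
for i in range(256):
    x = delayed(f)(i)        # records a task, does not run f yet
    y = delayed(g)(x)        # depends on x
    z = delayed(add)(x, y)   # depends on both x and y
    results.append(z)

# One call executes the whole graph; the 256 independent branches
# are free to run in parallel.
totals = compute(*results)
```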

  87. Dask Architecture
    async def func():
        client = await Client()
        futures = client.map(…)
        async for f in as_completed(…):
            result = await f

  88. Dask Architecture
    Your own
    system here

  89. By dropping the high level representation
    Costs
    • Lose specialization
    • Lose opportunities for high level optimization
    Benefits
    • Become generalists
    • More flexibility for new domains and algorithms
    • Access to smarter algorithms
    • Better task scheduling: resource constraints, GPUs, multiple clients, async-real-time, etc.

  90. Ten Reasons People Choose
    Dask

  91. Scalable Pandas DataFrames
    • Same API
        import dask.dataframe as dd
        df = dd.read_parquet('s3://bucket/accounts/2017')
        df.groupby(df.name).value.mean().compute()
    • Efficient Timeseries Operations
        df.loc['2017-01-01'] # Uses the Pandas index…
        df.value.rolling(10).std() # for efficient…
        df.value.resample('10m').mean() # operations.
    • Co-developed with Pandas and by the Pandas developer community

  92. Scalable NumPy Arrays
    • Same API
        import dask.array as da
        x = da.from_array(my_hdf5_file)
        y = x.dot(x.T)
    • Applications
    • Atmospheric science
    • Satellite imagery
    • Biomedical imagery
    • Optimization algorithms (check out dask-glm)

  93. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…, n_jobs=8)
        estimator.fit(train_data, train_labels)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    Joblib
    Thread pool

  94. Parallelize Scikit-Learn/Joblib
    • Scikit-Learn parallelizes with Joblib
        estimator = RandomForest(…, n_jobs=8)
        estimator.fit(train_data, train_labels)
    • Joblib can use Dask
        from sklearn.externals.joblib import parallel_backend
        with parallel_backend('dask', scheduler='…'):
            estimator.fit(train_data, train_labels)
    https://pythonhosted.org/joblib/
    http://distributed.readthedocs.io/en/latest/joblib.html
    Joblib
    Dask

  95. Many Other Libraries in Anaconda
    • Scikit-Image uses dask to break down images and speed up
    algorithms with overlapping regions
    • Geopandas can use Dask to partition data spatially
    and accelerate spatial joins

  96. Dask Scales Up
    • Thousand node clusters
    • Cloud computing
    • Super computers
    • Gigabyte/s bandwidth
    • 200 microsecond task overhead
    Dask Scales Down (the median cluster size is one)
    • Can run in a single Python thread pool
    • Almost no performance penalty (microseconds)
    • Lightweight
    • Few dependencies
    • Easy install

  97. Parallelize Web Backends
    • Web servers process thousands of small computations asynchronously for web pages or REST endpoints
    • Dask provides dynamic, heterogeneous computation
    • Supports small data
    • 10ms roundtrip times
    • Dynamic scaling for different loads
    • Supports asynchronous Python (like GoLang)
        async def serve(request):
            future = dask_client.submit(process, request)
            result = await future
            return result

  98. Debugging support
    • Clean Python tracebacks when user code breaks
    • Connect to remote workers with IPython sessions for advanced debugging

  99. Resource constraints
    • Define limited hardware resources for workers
    • Specify resource constraints when submitting tasks
    $ dask-worker … --resources GPU=2
    $ dask-worker … --resources GPU=2
    $ dask-worker … --resources special-db=1
    future = client.submit(my_function, resources={'GPU': 1})
    • Used for GPUs, big-memory machines, special hardware, database connections, I/O machines, etc.

  100. Collaboration
    • Many users can share the same cluster simultaneously
    • Define public datasets
    • Repeated computation and data use is shared among everyone
    df = dd.read_parquet(…).persist()
    client.publish_dataset(accounts=df)
    df = client.get_dataset('accounts')

  101. Beautiful Diagnostic Dashboards
    • Fast responsive dashboards
    • Provide users performance insight
    • Powered by Bokeh

  102. Some Reasons not to Choose
    Dask

  103. Dask’s limitations
    • Dask is not a SQL database. Does Pandas well, but won’t optimize complex queries.
    • Dask is not MPI. Very fast, but does leave some performance on the table: 200us task overhead, a couple copies in the network stack.
    • Dask is not a JVM technology. It’s a Python library (although Julia bindings available).
    • Dask is not always necessary. You may not need parallelism.

  104. dask.pydata.org
    conda install dask
