Scaling DS in Python

Scaling Data Science in Python, Data Science Summit SF 2016

Christine Doig

July 13, 2016

Transcript

  1. Scaling Data Science in Python
    Christine Doig
    Senior Data Scientist
    Continuum Analytics

  2. About me
    @ch_doig
    Senior Data Scientist, Product Manager for Anaconda Fusion,
    Product Marketing Manager & Technology Evangelist at
    Continuum Analytics
    M.S. in Industrial Engineering - UPC, Barcelona
    Experience in energy, manufacturing, banking and defense
    (E.ON, P&G, LaCaixa, DARPA)
    chdoig
    Connecting Open Data Science
    to Microsoft Excel

  3. [Logo] is ...
    the leading open data science platform
    powered by Python, the fastest growing open data science language
    • Consulting
    • Training
    • Open Source Innovation

  4. Tutorial content
    Slides: https://speakerdeck.com/scaling-ds-in-python
    Notebooks: https://github.com/chdoig/dss-scaling-tutorial
    Support: [email protected]

  5. Thanks to:
    • Matthew Rocklin
    • Jim Crist
    • Jim Bednar
    • Brendan Collins
    • All Bokeh, Dask, datashader, pandas, scikit-learn contributors

  6. Introduction

  7. Data Science in Python
    Data Science
    • Data Analysis
    • Data Visualization: Bokeh
    • Machine Learning

  8. Scaling Data Science in Python
    Data Science
    • Data Analysis
    • Data Visualization: Bokeh
    • Machine Learning
    Scaling
    • Small data: Easily fits in memory ~GBs
    • Medium data: Easily fits on disk or a small cluster ~GBs - TBs
    • Large data: Requires a large cluster with many nodes ~TBs - PBs
    * Amazon X1 instances - 2TB of memory
    [Diagram: client machine, head node and compute nodes]

  9. Scaling Data Science in Python
    [Diagram: client machine, head node and compute nodes]
    Bokeh

  10. Challenges
    • Time spent learning a new ecosystem
    • Refactoring workflows to scale
    • Efficiently handling medium data
    • Lack of extensibility to express custom algorithms

  11. About this tutorial
    1 - Scaling Data Analysis - From pandas to dask.dataframe
    2 - Scaling Interactive Visualizations - From bokeh to datashader
    3 - Scaling Machine Learning - From sklearn to dask-learn

  12. 1 - Scaling Data Analysis

  13. Motivation
    • Data > memory in laptop
    • Similar to a pandas solution

  14. Dask Dataframes
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> # The species column holds 'Iris-setosa', so filter on that exact label
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998

    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> # Lazy: this builds a task graph instead of computing the maximum right away
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> # .compute() runs the graph and returns the concrete result
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998
    Dask dataframes look and feel like pandas
    dataframes, but operate on datasets larger than
    memory using multiple threads
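
A rough sketch of the idea behind this slide: a reduction such as a filtered maximum can be computed one chunk at a time and the partial results combined, so the full dataset never has to sit in memory at once. The example uses plain pandas with `chunksize` rather than dask. The inline CSV text, the `chunked_max` helper and the chunk size are assumptions made for illustration and are not taken from the tutorial notebooks.

```python
import io

import pandas as pd

# Tiny inline stand-in (assumption) for a CSV file too large to load at once.
CSV_TEXT = """sepal_length,species
5.1,Iris-setosa
4.9,Iris-setosa
7.0,Iris-versicolor
5.8,Iris-setosa
6.3,Iris-virginica
5.0,Iris-setosa
"""


def chunked_max(csv_source, species, chunksize=2):
    """Return the max sepal_length for one species, reading chunk by chunk.

    Returns None when no row matches the requested species.
    """
    partial_maxima = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        selected = chunk.loc[chunk["species"] == species, "sepal_length"]
        if not selected.empty:
            partial_maxima.append(selected.max())
    # Combine step: the overall maximum is the maximum of the per-chunk maxima.
    return max(partial_maxima) if partial_maxima else None


result = chunked_max(io.StringIO(CSV_TEXT), "Iris-setosa")
print(result)
```

Dask dataframes apply the same split-and-combine pattern across many pandas partitions and schedule the per-partition work for you.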

  15. Distributed
    http://distributed.readthedocs.io/en/latest/
    Distributed is a lightweight library for distributed computing in Python.
    It extends dask APIs to moderate sized clusters.

  16. Web UI
    Dask.distributed includes a web
    interface that reports the current
    state of the network. It helps to
    track progress, identify performance
    issues, and debug failures over a
    normal web page in real time.

  17. Scaling Data Analysis
    NYC Taxi Data
    • One month CSV file ~ 2GB
    • Two years CSV files ~ 50GB
    HDFS + [logo] + distributed
    [Diagram: client machine, head node and compute nodes]

  18. Notebooks in Scaling Data Analysis
    1 - pandas.ipynb
    2 - dask.ipynb
    3 - dask + distributed.ipynb

  19. Useful resources
    Documentation: http://dask.readthedocs.io/en/latest/
    Dask - Matthew Rocklin’s blog: http://matthewrocklin.com/blog/
    Dask-learn - Jim Crist’s blog: http://jcrist.github.io/dask-sklearn-part-1.html
    Dask + YARN - Ben Zaitlen’s blog: http://quasiben.github.io/blog/2016/4/8/dask-yarn/
    Dask YouTube playlist: https://www.youtube.com/playlist?list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri

  20. 2 - Scaling Interactive Visualizations

  21. Motivation
    • Visualize large amounts of data in a meaningful way
    • Interactively explore the data

  22. Bokeh
    Interactive visualization
    framework that targets modern
    web browsers for presentation
    • No JavaScript
    • Python, R, Scala and Lua
    bindings
    • Easy to embed in web
    applications
    • Server apps: data can be
    updated, and UI and
    selection events can be
    processed to trigger more
    visual updates.
    http://bokeh.pydata.org/en/latest/

  23. Datashader
    Overplotting:
    Oversaturation:
    Undersampling:
    https://anaconda.org/jbednar/plotting_pitfalls/notebook

  24. Datashader
    graphics pipeline system for
    creating meaningful
    representations of
    large amounts of data
    • Provides automatic, nearly parameter-free
    visualization of datasets
    • Allows extensive customization of each step in
    the data-processing pipeline
    • Supports automatic downsampling and re-rendering
    with Bokeh and the Jupyter notebook
    • Works well with dask and numba to handle very
    large datasets in and out of core (with
    examples using billions of datapoints)
    https://github.com/bokeh/datashader
    [Image: NYC census data by race]
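
One way to picture the aggregation step in a pipeline like this: every data point is binned into a fixed-size grid of counts, and only the grid is handed to the plot, so the rendering cost depends on the grid size rather than on the number of points. The sketch below is plain Python and is not datashader's API. `aggregate_counts`, the value ranges and the 8 x 8 grid are assumptions chosen for illustration.

```python
import random


def aggregate_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the viewport
        # Map each coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Hypothetical data: 100,000 normally distributed points reduced to 64 counts.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
grid = aggregate_counts(points, (-3, 3), (-3, 3), width=8, height=8)
total_in_view = sum(sum(row) for row in grid)
print(len(grid), len(grid[0]), total_in_view)
```

Because each cell stores a count instead of overlapping markers, dense regions stay distinguishable, which is how this kind of aggregation addresses the overplotting and oversaturation pitfalls listed earlier. A color mapping over the counts would be the next step.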

  25. Scaling Interactive Visualizations
    Data Visualization
    • Small data points: Bokeh
    • Large data points: Bokeh + datashader
    [Diagram: client machine, head node and compute nodes]

  26. Examples in Scaling Interactive Visualizations
    bokeh:
    - clustering app
    datashader:
    1 - census.ipynb
    2 - nyc_taxi.ipynb

  27. Useful resources
    Bokeh documentation: http://bokeh.pydata.org/en/latest/
    Bokeh demos: https://demo.bokehplots.com/
    Datashader documentation: http://datashader.readthedocs.org/
    Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
    Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
    Datashader webinar: http://go.continuum.io/datashader/

  28. 3 - Scaling Machine Learning

  29. Motivation
    • Easily parallelizing GridSearch

  30. Scaling Machine Learning
    Scaling + Machine Learning = dask-learn
    • GridSearchCV (optionally, with joblib)
    • DaskGridSearchCV
    [Diagram: client machine, head node and compute nodes]
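
Grid search parallelizes easily because every parameter combination can be scored independently of the others. The sketch below shows that structure using only the standard library; it is not the dask-learn or scikit-learn API. `PARAM_GRID`, `score` and `parallel_grid_search` are hypothetical names, and `score` is a toy objective standing in for fitting and validating a model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical parameter grid; names and values are illustrative only.
PARAM_GRID = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}


def score(params):
    """Stand-in for 'fit a model with params and return its validation score'."""
    # Toy objective that peaks at C=1.0, gamma=0.1.
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2


def parallel_grid_search(param_grid, score_fn, max_workers=4):
    """Score every parameter combination concurrently and return the best one."""
    names = sorted(param_grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]
    # Each candidate is independent, so the scoring calls can run side by side.
    # Threads keep the sketch self-contained; CPU-bound model fitting would
    # normally use processes or a cluster instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score


best_params, best_score = parallel_grid_search(PARAM_GRID, score)
print(best_params, best_score)
```

The same independence is what lets a scheduler spread the candidate fits over the compute nodes of a cluster rather than over local workers.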

  31. Notebook in Scaling Machine Learning
    dask-learn.ipynb

  32. Useful resources
    Dask-learn repository: https://github.com/jcrist/dask-learn
    Blogpost: http://jcrist.github.io/dask-sklearn-part-1.html

  33. Thank you!
    Questions?
    @ch_doig
    chdoig
    [email protected]
