Scaling DS in Python

Scaling Data Science in Python, Data Science Summit SF 2016

Christine Doig

July 13, 2016

Transcript

  1. Scaling Data Science in Python
    Christine Doig, Senior Data Scientist, Continuum Analytics
  2. About me
    Senior Data Scientist, Product Manager for Anaconda Fusion, Product Marketing Manager & Technology Evangelist at Continuum Analytics
    M.S. in Industrial Engineering - UPC, Barcelona
    Experience in energy, manufacturing, banking and defense (E.ON, P&G, LaCaixa, DARPA)
    @ch_doig, chdoig
    Connecting Open Data Science to Microsoft Excel
  3. … is the leading open data science platform powered by Python, the fastest growing open data science language
    • Consulting
    • Training
    • Open Source Innovation
  4. Tutorial content
    Slides: https://speakerdeck.com/scaling-ds-in-python
    Notebooks: https://github.com/chdoig/dss-scaling-tutorial
    Support: christine.doig@continuum.io
  5. Thanks to:
    • Matthew Rocklin
    • Jim Crist
    • Jim Bednar
    • Brendan Collins
    • All Bokeh, Dask, datashader, pandas, scikit-learn contributors
  6. Introduction

  7. Data Science in Python
    [Diagram: Data Science comprising Machine Learning, Data Analysis, and Data Visualization (Bokeh)]
  8. Scaling Data Science in Python
    [Diagram: Machine Learning, Data Analysis, and Data Visualization (Bokeh) scaling from a client machine to a head node with compute nodes]
    Small data: Easily fits in memory ~GBs
    Medium data: Easily fits on disk or a small cluster ~GBs - TBs
    Large data: Requires a large cluster with many nodes ~TBs - PBs
    * Amazon X1 instances - 2TB of memory
  9. Scaling Data Science in Python
    [Diagram: Bokeh; client machine connected to a head node and compute nodes]
  10. Challenges
    • Time spent learning a new ecosystem
    • Refactoring workflows to scale
    • Efficiently handling medium data
    • Lack of extensibility to express custom algorithms
  11. About this tutorial
    1 - Scaling Data Analysis - From pandas to dask.dataframe
    2 - Scaling Interactive Visualizations - From bokeh to datashader
    3 - Scaling Machine Learning - From sklearn to dask-learn
  12. 1 - Scaling Data Analysis

  13. Motivation
    • Data > memory in laptop
    • Similar to a pandas solution
  14. Dask Dataframes
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> # The species labels in this file carry the 'Iris-' prefix
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998

    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    …
    >>> # Lazy: this only describes the computation
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> # compute() runs it and returns the result
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998

    Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
  15. Distributed
    http://distributed.readthedocs.io/en/latest/
    Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
  16. Web UI
    Dask.distributed includes a web interface that delivers information about the current state of the network over a normal web page in real time. This information helps to track progress, identify performance issues, and debug failures.
  17. Scaling Data Analysis
    Scaling NYC Taxi Data
    [Diagram: one month CSV file ~ 2GBs on a client machine; two years CSV files ~ 50GB on a cluster (head node and compute nodes) with HDFS + [logo] + distributed]
  18. Notebooks in Scaling Data Analysis
    1 - pandas.ipynb
    2 - dask.ipynb
    3 - dask + distributed.ipynb
  19. Useful resources
    Documentation: http://dask.readthedocs.io/en/latest/
    Dask - Matthew Rocklin’s blog: http://matthewrocklin.com/blog/
    Dask-learn - Jim Crist’s blog: http://jcrist.github.io/dask-sklearn-part-1.html
    Dask + YARN - Ben Zaitlen’s blog: http://quasiben.github.io/blog/2016/4/8/dask-yarn/
    Dask Youtube Playlist: https://www.youtube.com/playlist?list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri
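The Dask Dataframes slide says a dask dataframe operates on data larger than memory using multiple threads. The sketch below illustrates that general idea (reduce each partition on its own, then combine the partial results, and do nothing until asked) using only the Python standard library. It is not how Dask is implemented; `partition_max`, `lazy_max`, and the two in-memory "files" are hypothetical names and data made up for illustration.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

HEADER = "sepal_length,sepal_width,petal_length,petal_width,species\n"

# Two small in-memory "files" standing in for the partitions matched by '*.csv'.
# Illustrative rows only, not the full iris dataset.
PART_A = HEADER + (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
PART_B = HEADER + (
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)


def partition_max(text, species):
    """Max sepal_length for one species within a single partition, or None."""
    values = [
        float(row["sepal_length"])
        for row in csv.DictReader(io.StringIO(text))
        if row["species"] == species
    ]
    return max(values) if values else None


def lazy_max(partitions, species):
    """Return a zero-argument function; no partition is read until it is called."""
    def compute():
        # Each partition is reduced independently on a worker thread ...
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda p: partition_max(p, species), partitions))
        # ... and the small per-partition results are combined at the end.
        found = [value for value in partials if value is not None]
        return max(found) if found else None
    return compute


task = lazy_max([PART_A, PART_B], "Iris-setosa")  # nothing computed yet
result = task()                                    # plays the role of .compute()
print(result)
```

Only one partition per worker needs to be in memory at a time, which is why the same filter-and-max expression can be applied to a directory of files that would not fit in memory as a single frame.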
  20. 2 - Scaling Interactive Visualizations

  21. Motivation
    • Visualize large amounts of data in a meaningful way
    • Interactively explore the data
  22. Bokeh
    Interactive visualization framework that targets modern web browsers for presentation
    • No JavaScript
    • Python, R, Scala and Lua bindings
    • Easy to embed in web applications
    • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
    http://bokeh.pydata.org/en/latest/
  23. Datashader
    [Figures: Overplotting, Oversaturation, Undersampling]
    https://anaconda.org/jbednar/plotting_pitfalls/notebook
  24. Datashader
    Graphics pipeline system for creating meaningful representations of large amounts of data
    • Provides automatic, nearly parameter-free visualization of datasets
    • Allows extensive customization of each step in the data-processing pipeline
    • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
    • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
    https://github.com/bokeh/datashader
    [Figure: NYC census data by race]
  25. Scaling Interactive Visualizations
    [Diagram: scaling data visualization from a client machine to a head node with compute nodes]
    Small data points: Bokeh
    Large data points: Bokeh + datashader
  26. Examples in Scaling Data Analysis
    bokeh:
    - clustering app
    datashader:
    1 - census.ipynb
    2 - nyc_taxi.ipynb
  27. Useful resources
    Bokeh documentation: http://bokeh.pydata.org/en/latest/
    Bokeh demos: https://demo.bokehplots.com/
    Datashader documentation: http://datashader.readthedocs.org/
    Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
    Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
    Datashader webinar: http://go.continuum.io/datashader/
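The Datashader slides describe a pipeline that turns a very large set of points into a meaningful image. One way to picture the core step is binning: every point is counted into a fixed-size grid, and the counts (not the individual points) are mapped to intensities. The sketch below shows that idea in plain Python. It is not datashader's API or its default shading method; `aggregate_counts` and `shade` are hypothetical names, and the linear scaling in `shade` is an assumption chosen for brevity.

```python
import random


def aggregate_counts(xs, ys, x_range, y_range, width, height):
    """Count how many points fall in each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # point lies outside the viewport
        # Map each coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade(grid):
    """Scale counts to intensities in [0, 1] relative to the densest cell."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[count / peak for count in row] for row in grid]


# Synthetic data: many points, small fixed-size output.
rng = random.Random(0)
n_points = 100_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
ys = [rng.gauss(0.0, 1.0) for _ in range(n_points)]

grid = aggregate_counts(xs, ys, (-3.0, 3.0), (-3.0, 3.0), width=40, height=20)
image = shade(grid)
print(len(image), len(image[0]))
```

The output size depends only on the grid, not on the number of points, so overplotted regions show up as higher counts instead of as a solid blob of markers. Zooming amounts to re-running the aggregation with a narrower `x_range` and `y_range`.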
  28. 3 - Scaling Machine Learning

  29. Motivation
    • Easily parallelizing GridSearch
  30. Scaling Machine Learning
    [Diagram: scaling from a client machine to a head node with compute nodes]
    [logo] + Machine Learning = dask-learn
    GridSearchCV (optionally, with joblib)
    DaskGridSearchCV
  31. Notebook in Scaling Machine Learning
    dask-learn.ipynb
  32. Useful resources
    Dask-learn repository: https://github.com/jcrist/dask-learn
    Blogpost: http://jcrist.github.io/dask-sklearn-part-1.html
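
The motivation for this section is easily parallelizing a grid search. A grid search parallelizes naturally because each parameter combination can be scored independently of the others. The sketch below shows that structure with the standard library only. It does not use scikit-learn's `GridSearchCV` or dask-learn's `DaskGridSearchCV`; `expand_grid`, `toy_score`, and `parallel_grid_search` are hypothetical names, and `toy_score` is a made-up stand-in for fitting and cross-validating a model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def toy_score(params):
    """Stand-in for 'fit a model and cross-validate'; peaks at C=1.0, gamma=0.1."""
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2


def parallel_grid_search(score_fn, param_grid, max_workers=4):
    """Score every combination concurrently and return the best (params, score)."""
    candidates = list(expand_grid(param_grid))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to candidates[i].
        scores = list(pool.map(score_fn, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = parallel_grid_search(toy_score, param_grid)
print(best_params, best_score)
```

Threads keep the sketch short, but in CPython they do not speed up CPU-bound pure-Python scoring; a process pool or a cluster scheduler is the usual choice when each fit is expensive.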

  33. Thank you! Questions? @ch_doig chdoig christine.doig@continuum.io