
Scaling DS in Python


Scaling Data Science in Python, Data Science Summit SF 2016

Christine Doig

July 13, 2016


Transcript

  1. About me

    • @ch_doig
    • Senior Data Scientist, Product Manager for Anaconda Fusion, Product Marketing Manager & Technology Evangelist at Continuum Analytics
    • M.S. in Industrial Engineering - UPC, Barcelona
    • Experience in energy, manufacturing, banking and defense (E.ON, P&G, LaCaixa, DARPA)
    • chdoig
    • Connecting Open Data Science to Microsoft Excel
  2. … is the leading open data science platform powered by Python, the fastest growing open data science language

    • Consulting
    • Training
    • Open Source Innovation
  3. Thanks to:

    • Matthew Rocklin
    • Jim Crist
    • Jim Bednar
    • Brendan Collins
    • All Bokeh, Dask, datashader, pandas, scikit-learn contributors
  4. Scaling Data Science in Python

    Data Science: Machine Learning, Data Analysis, Data Visualization (Bokeh)

    Scaling:
    • Small data: easily fits in memory, ~GBs *
    • Medium data: easily fits on disk or a small cluster, ~GBs - TBs
    • Large data: requires a large cluster with many nodes, ~TBs - PBs

    * Amazon X1 instances - 2TB of memory

    [Diagram: client machine, head node and compute nodes]
  5. Scaling Data Science in Python

    [Diagram: client machine, head node and compute nodes; Bokeh]
  6. Challenges

    • Time spent learning a new ecosystem
    • Refactoring workflows to scale
    • Efficiently handling medium data
    • Lack of extensibility to express custom algorithms
  7. About this tutorial

    1 - Scaling Data Analysis - From pandas to dask.dataframe
    2 - Scaling Interactive Visualizations - From bokeh to datashader
    3 - Scaling Machine Learning - From sklearn to dask-learn
  8. Dask Dataframes

        >>> import pandas as pd
        >>> df = pd.read_csv('iris.csv')
        >>> df.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        >>> # The species column holds 'Iris-setosa', so filter on that exact label
        >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
        >>> max_sepal_length_setosa
        5.7999999999999998

        >>> import dask.dataframe as dd
        >>> ddf = dd.read_csv('*.csv')
        >>> ddf.head()
           sepal_length  sepal_width  petal_length  petal_width      species
        0           5.1          3.5           1.4          0.2  Iris-setosa
        1           4.9          3.0           1.4          0.2  Iris-setosa
        2           4.7          3.2           1.3          0.2  Iris-setosa
        3           4.6          3.1           1.5          0.2  Iris-setosa
        4           5.0          3.6           1.4          0.2  Iris-setosa
        …
        >>> # Builds a lazy task graph; nothing is computed until .compute()
        >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
        >>> d_max_sepal_length_setosa.compute()
        5.7999999999999998

    Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
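The idea behind the Dask dataframe slide, a reduction computed per partition on several threads and then combined, can be sketched with the standard library alone. This is a conceptual illustration, not how dask is implemented: `PARTITIONS`, `partition_max` and `partitioned_max` are hypothetical names, and the in-memory strings stand in for separate CSV files on disk.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions standing in for separate CSV files on disk.
PARTITIONS = [
    "sepal_length,species\n5.1,Iris-setosa\n7.0,Iris-versicolor\n",
    "sepal_length,species\n5.8,Iris-setosa\n6.3,Iris-virginica\n",
    "sepal_length,species\n6.4,Iris-versicolor\n",
]


def partition_max(text, species):
    """Return the max sepal_length for `species` in one partition, or None."""
    rows = csv.DictReader(io.StringIO(text))
    values = [float(r["sepal_length"]) for r in rows if r["species"] == species]
    return max(values) if values else None


def partitioned_max(partitions, species, max_workers=4):
    """Reduce each partition on a thread pool, then combine the partial maxima."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda p: partition_max(p, species), partitions))
    partials = [v for v in partials if v is not None]
    return max(partials) if partials else None


print(partitioned_max(PARTITIONS, "Iris-setosa"))  # 5.8
```

Only one partition needs to be held in memory by each worker at a time, which is why this pattern extends to data that does not fit in memory as a whole.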
  9. Web UI

    Information about the current state of the network helps to track progress, identify performance issues, and debug failures. Dask.distributed includes a web interface that delivers this information over a normal web page in real time.
  10. Scaling Data Analysis

    Scaling NYC Taxi Data:
    • One month CSV file, ~2 GB
    • Two years CSV files, ~50 GB: HDFS + [logo] + distributed

    [Diagram: client machine, head node and compute nodes]
  11. Notebooks in Scaling Data Analysis

    1 - pandas.ipynb
    2 - dask.ipynb
    3 - dask + distributed.ipynb
  12. Useful resources

    • Documentation: http://dask.readthedocs.io/en/latest/
    • Dask - Matthew Rocklin’s blog: http://matthewrocklin.com/blog/
    • Dask-learn - Jim Crist’s blog: http://jcrist.github.io/dask-sklearn-part-1.html
    • Dask + YARN - Ben Zaitlen’s blog: http://quasiben.github.io/blog/2016/4/8/dask-yarn/
    • Dask Youtube Playlist: https://www.youtube.com/playlist?list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri
  13. Motivation

    • Visualize large amounts of data in a meaningful way
    • Interactively explore the data
  14. Bokeh

    Interactive visualization framework that targets modern web browsers for presentation

    • No JavaScript
    • Python, R, Scala and Lua bindings
    • Easy to embed in web applications
    • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates

    http://bokeh.pydata.org/en/latest/
  15. Datashader

    Graphics pipeline system for creating meaningful representations of large amounts of data

    • Provides automatic, nearly parameter-free visualization of datasets
    • Allows extensive customization of each step in the data-processing pipeline
    • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
    • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)

    https://github.com/bokeh/datashader

    [Figure: NYC census data by race]
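The kind of pipeline the Datashader slide describes, aggregating many points onto a fixed-size grid and then mapping the counts to pixel intensities, can be sketched in plain Python. This is a rough illustration of the idea only, not datashader's API: `aggregate_counts`, `shade_log` and the sample points are hypothetical, and the log scaling is an assumed choice of transfer function.

```python
import math


def aggregate_counts(points, x_range, y_range, width, height):
    """Bin (x, y) points into a height x width grid of counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # points outside the viewport are dropped
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade_log(grid):
    """Map counts to 0..255 intensities on a log scale."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(c) * scale) for c in row] for row in grid]


# Hypothetical sample data; the last point lies outside the viewport.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.95, 0.8), (0.9, 0.85), (2.0, 2.0)]
counts = aggregate_counts(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
image = shade_log(counts)
print(counts)  # [[2, 0], [0, 3]]
```

The output size depends on the grid, not on the number of input points, which is what makes re-rendering on zoom feasible for very large datasets.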
  16. Scaling Interactive Visualizations

    Scaling Data Visualization:
    • Small data points: Bokeh
    • Large data points: Bokeh + datashader

    [Diagram: client machine, head node and compute nodes]
  17. Examples in Scaling Data Analysis

    bokeh:
    - clustering app

    datashader:
    1 - census.ipynb
    2 - nyc_taxi.ipynb
  18. Useful resources

    • Bokeh documentation: http://bokeh.pydata.org/en/latest/
    • Bokeh demos: https://demo.bokehplots.com/
    • Datashader documentation: http://datashader.readthedocs.org/
    • Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
    • Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
    • Datashader webinar: http://go.continuum.io/datashader/
  19. Scaling Machine Learning

    [logo] + Machine Learning = dask-learn

    Scaling:
    • GridSearchCV (optionally, with joblib)
    • DaskGridSearchCV

    [Diagram: client machine, head node and compute nodes]
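What a cross-validated grid search does, and why it parallelizes well, can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the scikit-learn or dask-learn implementation: the one-feature ridge model, the `alpha` and `power` parameters, and the helper names are all hypothetical, and a thread pool stands in for whichever parallel backend is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge fit for y ~ w * x (no intercept)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + alpha
    return num / den


def cv_score(params, xs, ys, k=3):
    """Mean squared error averaged over k folds for one parameter combination."""
    feats = [x ** params["power"] for x in xs]
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        w = fit_ridge_1d([feats[i] for i in train], [ys[i] for i in train], params["alpha"])
        fold_errors.append(sum((ys[i] - w * feats[i]) ** 2 for i in test) / len(test))
    return sum(fold_errors) / k


def grid_search(param_grid, xs, ys, max_workers=4):
    """Score every parameter combination in parallel and return the best one."""
    names = sorted(param_grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(param_grid[n] for n in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda p: cv_score(p, xs, ys), candidates))
    best_score, best_params = min(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score, list(zip(candidates, scores))


# Hypothetical toy data: y = 2x exactly, so alpha=0.0 with power=1 fits perfectly.
xs = [float(i) for i in range(1, 13)]
ys = [2.0 * x for x in xs]
best_params, best_score, results = grid_search(
    {"alpha": [0.0, 1.0, 10.0], "power": [1, 2]}, xs, ys
)
print(best_params, best_score)
```

Each candidate is scored independently of the others, so the work can be spread across threads, processes or cluster nodes without changing the search logic.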