Scaling DS in Python

Scaling Data Science in Python
Christine Doig
Senior Data Scientist, Continuum Analytics

About me
• Senior Data Scientist, Product Manager for Anaconda Fusion, Product Marketing Manager & Technology Evangelist at Continuum Analytics
• M.S. in Industrial Engineering - UPC, Barcelona
• Experience in energy, manufacturing, banking and defense (E.ON, P&G, LaCaixa, DARPA)
• @ch_doig, chdoig
• Connecting Open Data Science to Microsoft Excel

… is the leading open data science platform powered by Python, the fastest growing open data science language
• Consulting
• Training
• Open Source Innovation

Tutorial content
• Slides: https://speakerdeck.com/scaling-ds-in-python
• Notebooks: https://github.com/chdoig/dss-scaling-tutorial
• Support: [email protected]

Thanks to:
• Matthew Rocklin
• Jim Crist
• Jim Bednar
• Brendan Collins
• All Bokeh, Dask, datashader, pandas, scikit-learn contributors

Introduction

Data Science in Python
[Diagram: Data Science as the combination of Machine Learning, Data Analysis and Data Visualization (Bokeh)]

Scaling Data Science in Python
• Small data: easily fits in memory (~GBs)
• Medium data: easily fits on disk or a small cluster (~GBs - TBs)
• Large data: requires a large cluster with many nodes (~TBs - PBs)
* Amazon X1 instances - 2TB of memory
[Diagram: Machine Learning, Data Analysis and Data Visualization (Bokeh) scaling from a client machine to clusters of a head node and compute nodes]

Scaling Data Science in Python
[Diagram: client machines connected to clusters of a head node and compute nodes; Bokeh]

Challenges
• Time spent learning a new ecosystem
• Refactoring workflows to scale
• Efficiently handling medium data
• Lack of extensibility to express custom algorithms

About this tutorial
1 - Scaling Data Analysis - From pandas to dask.dataframe
2 - Scaling Interactive Visualizations - From bokeh to datashader
3 - Scaling Machine Learning - From sklearn to dask-learn

1 - Scaling Data Analysis

Motivation
• Data > memory in laptop
• Similar to a pandas solution

Dask Dataframes

>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> # Largest sepal length among the setosa rows (label as it appears in the data)
>>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
>>> max_sepal_length_setosa
5.7999999999999998

>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> # Same expression as in pandas; nothing is computed until .compute() is called
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998

Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
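To make the "larger than memory using multiple threads" point concrete, here is a minimal sketch of the pattern behind a partitioned reduction: split the data into chunks, reduce each chunk in its own thread, then combine the partial results. It uses plain pandas and `concurrent.futures`, not dask itself. The helper names (`partition_max`, `partitioned_max`) and the tiny inline dataset are assumptions made for illustration, and dask's actual scheduler is considerably more elaborate.

```python
import io
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Tiny stand-in for a CSV that would not fit in memory (illustrative data only).
CSV_TEXT = """sepal_length,species
5.1,Iris-setosa
4.9,Iris-setosa
5.8,Iris-setosa
7.0,Iris-versicolor
6.4,Iris-versicolor
4.7,Iris-setosa
"""


def partition_max(chunk):
    """Reduce one partition: max sepal_length among setosa rows (NaN if none)."""
    setosa = chunk[chunk.species == "Iris-setosa"]
    return setosa.sepal_length.max()


def partitioned_max(csv_buffer, rows_per_partition=2):
    """Read the CSV in fixed-size chunks, reduce each in a thread, combine."""
    chunks = pd.read_csv(csv_buffer, chunksize=rows_per_partition)
    with ThreadPoolExecutor() as pool:
        partial_maxima = list(pool.map(partition_max, chunks))
    # Series.max() skips NaN, so partitions without setosa rows are ignored.
    return pd.Series(partial_maxima).max()


result = partitioned_max(io.StringIO(CSV_TEXT))
print(result)
```

Note that `pool.map` submits every chunk up front, so this sketch shows the split, reduce and combine pattern rather than a bounded-memory implementation.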

Distributed
http://distributed.readthedocs.io/en/latest/
Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.

Web UI
Dask.distributed includes a web interface that delivers information about the current state of the network. It helps to track progress, identify performance issues, and debug failures over a normal web page in real time.
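As a rough sketch of how this looks from Python, the snippet below starts a local, in-process cluster with `dask.distributed.Client`, prints the address of the web interface, and runs a few tasks on the workers. The `square` function and the choice of a local cluster are assumptions for illustration; on a real cluster the client would instead be pointed at the scheduler's address.

```python
from dask.distributed import Client


def square(x):
    """Toy task used only for this illustration."""
    return x * x


# processes=False keeps the scheduler and workers inside this process (threads
# only), which is convenient for a laptop demo. dashboard_address=":0" lets the
# web interface pick any free port.
client = Client(processes=False, dashboard_address=":0")

# Address of the web UI described above.
print(client.dashboard_link)

# Send tasks to the workers, then reduce the resulting futures on the cluster.
futures = client.map(square, range(10))
total = client.submit(sum, futures).result()
print(total)

client.close()
```

Opening the printed link in a browser while tasks are running shows their progress, provided the dashboard's optional dependencies are installed.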

Scaling Data Analysis: NYC Taxi Data
• One month CSV file: ~2 GB
• Two years of CSV files: ~50 GB
[Diagram: scaling data analysis from a client machine to a cluster of a head node and compute nodes; tools shown include HDFS and distributed]

Notebooks in Scaling Data Analysis
1 - pandas.ipynb
2 - dask.ipynb
3 - dask + distributed.ipynb

Useful resources
• Documentation: http://dask.readthedocs.io/en/latest/
• Dask - Matthew Rocklin’s blog: http://matthewrocklin.com/blog/
• Dask-learn - Jim Crist’s blog: http://jcrist.github.io/dask-sklearn-part-1.html
• Dask + YARN - Ben Zaitlen’s blog: http://quasiben.github.io/blog/2016/4/8/dask-yarn/
• Dask Youtube Playlist: https://www.youtube.com/playlist?list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri

2 - Scaling Interactive Visualizations

Motivation
• Visualize large amounts of data in a meaningful way
• Interactively explore the data

Bokeh
Interactive visualization framework that targets modern web browsers for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
http://bokeh.pydata.org/en/latest/
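As a small illustration of the "no JavaScript" and "easy to embed" points, the sketch below builds a scatter plot purely in Python and renders it to a standalone HTML string that a web application could serve. The data and titles are made up for the example.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Illustrative data: a handful of points.
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# The plot is described in Python; BokehJS draws it in the browser.
plot = figure(title="Example scatter", tools="pan,wheel_zoom,box_select,reset")
plot.scatter(x, y, size=10)

# Standalone HTML document that loads BokehJS from the CDN.
html = file_html(plot, CDN, "Example scatter")
print(len(html))
```

The resulting string can be written to a file or returned from a web framework's view function.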

Datashader
Plotting pitfalls:
• Overplotting
• Oversaturation
• Undersampling
https://anaconda.org/jbednar/plotting_pitfalls/notebook

Datashader
Graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
https://github.com/bokeh/datashader
[Figure: NYC census data by race]

Scaling Interactive Visualizations
• Small data points: Bokeh
• Large data points: Bokeh + datashader
[Diagram: scaling data visualization from a client machine to a cluster of a head node and compute nodes]

Examples in Scaling Interactive Visualizations
bokeh:
- clustering app
datashader:
1 - census.ipynb
2 - nyc_taxi.ipynb

Useful resources
• Bokeh documentation: http://bokeh.pydata.org/en/latest/
• Bokeh demos: https://demo.bokehplots.com/
• Datashader documentation: http://datashader.readthedocs.org/
• Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
• Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
• Datashader webinar: http://go.continuum.io/datashader/

3 - Scaling Machine Learning

Motivation
• Easily parallelizing GridSearch

Scaling Machine Learning
[Diagram: scaling machine learning from a client machine to a cluster of a head node and compute nodes. Labels: "+ Machine Learning = dask-learn"; "GridSearchCV (optionally, with joblib)"; "DaskGridSearchCV"]
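For reference, the single-machine starting point looks roughly like the sketch below: scikit-learn's `GridSearchCV`, with `n_jobs` asking joblib to evaluate candidates in parallel. The estimator, parameter grid and dataset are assumptions picked for illustration. The dask-learn counterpart is not shown because its import path does not appear in the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid: 3 x 2 = 6 candidates, each cross-validated on 3 folds.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}

# n_jobs=2 asks joblib to fit candidates in two parallel workers.
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))
```

Each candidate and fold is an independent fit, which is what makes grid search a natural target for parallel and distributed execution.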

Notebook in Scaling Machine Learning
• dask-learn.ipynb

Useful resources
• Dask-learn repository: https://github.com/jcrist/dask-learn
• Blogpost: http://jcrist.github.io/dask-sklearn-part-1.html

Thank you! Questions? @ch_doig chdoig [email protected]
