Scaling DS in Python - Speaker Deck

Slide 1

Slide 1 text

Scaling Data Science in Python
Christine Doig
Senior Data Scientist
Continuum Analytics

Slide 2

Slide 2 text

2 About me
@ch_doig
chdoig
Senior Data Scientist, Product Manager for Anaconda Fusion, Product Marketing Manager & Technology Evangelist at Continuum Analytics
Anaconda Fusion: Connecting Open Data Science to Microsoft Excel
M.S. in Industrial Engineering - UPC, Barcelona
Experience in energy, manufacturing, banking and defense (E.ON, P&G, LaCaixa, DARPA)

Slide 3

Slide 3 text

3 [logo not captured] is… the leading open data science platform powered by Python, the fastest growing open data science language
• Consulting
• Training
• Open Source Innovation

Slide 4

Slide 4 text

4 Tutorial content
Slides: https://speakerdeck.com/scaling-ds-in-python
Notebooks: https://github.com/chdoig/dss-scaling-tutorial
Support: [email protected]

Slide 5

Slide 5 text

5 Thanks to:
• Matthew Rocklin
• Jim Crist
• Jim Bednar
• Brendan Collins
• All Bokeh, Dask, datashader, pandas, scikit-learn contributors

Slide 6

Slide 6 text

Introduction

Slide 7

Slide 7 text

7 Data Science in Python
[Diagram: Data Science, made up of Data Analysis, Data Visualization (Bokeh) and Machine Learning]

Slide 8

Slide 8 text

8 Scaling Data Science in Python
[Diagram: Data Science (Data Analysis, Data Visualization with Bokeh, Machine Learning) scaling from a client machine to clusters of a head node plus compute nodes]
Small data: Easily fits in memory, ~GBs
Medium data: Easily fits on disk or a small cluster, ~GBs - TBs
Large data: Requires a large cluster with many nodes, ~TBs - PBs
* Amazon X1 instances - 2TB of memory

Slide 9

Slide 9 text

9 Scaling Data Science in Python
[Diagram: client machine and clusters of a head node plus compute nodes; Bokeh]

Slide 10

Slide 10 text

10 Challenges
• Time spent learning a new ecosystem
• Refactoring workflows to scale
• Efficiently handling medium data
• Lack of extensibility to express custom algorithms

Slide 11

Slide 11 text

11 About this tutorial
1 - Scaling Data Analysis - From pandas to dask.dataframe
2 - Scaling Interactive Visualizations - From bokeh to datashader
3 - Scaling Machine Learning - From sklearn to dask-learn

Slide 12

Slide 12 text

1 - Scaling Data Analysis

Slide 13

Slide 13 text

13 Motivation
• Data > memory in laptop
• Similar to a pandas solution

Slide 14

Slide 14 text

14 Dask Dataframes
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
>>> max_sepal_length_setosa
5.7999999999999998
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
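The idea of building a computation first and running it partition by partition only when asked can be sketched with the standard library alone. This is not how dask is implemented; `PART_1`, `PART_2`, `partition_max` and `lazy_max` are hypothetical names, and the in-memory strings stand in for the `*.csv` files on disk. A real dask computation could also process partitions in parallel, which this sequential sketch does not attempt.

```python
import csv
import io

# Hypothetical in-memory "files" standing in for CSV partitions on disk.
PART_1 = """sepal_length,species
5.1,Iris-setosa
7.0,Iris-versicolor
5.8,Iris-setosa
"""
PART_2 = """sepal_length,species
4.9,Iris-setosa
6.3,Iris-virginica
5.4,Iris-setosa
"""


def partition_max(handle, species):
    """Return the max sepal_length for one species in one partition, or None."""
    best = None
    for row in csv.DictReader(handle):  # streams one row at a time
        if row["species"] == species:
            value = float(row["sepal_length"])
            if best is None or value > best:
                best = value
    return best


def lazy_max(partitions, species):
    """Describe the computation now; run it only when the result is called."""
    def compute():
        partials = [partition_max(io.StringIO(text), species) for text in partitions]
        partials = [p for p in partials if p is not None]
        # Combine the per-partition results into the final answer.
        return max(partials) if partials else None
    return compute


task = lazy_max([PART_1, PART_2], "Iris-setosa")  # nothing has been read yet
result = task()  # plays the role of .compute()
print(result)
```

Only one row is held in memory at a time, and each partition contributes a single number to the final reduction, which is why the total data size can exceed memory.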

Slide 15

Slide 15 text

15 Distributed
http://distributed.readthedocs.io/en/latest/
Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
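The submit-then-collect style of working with futures can be illustrated with the standard library's `concurrent.futures`, used here as a local stand-in: the tasks run on threads in one process, not on a cluster. The `trip_count` function and its numbers are hypothetical. Distributed exposes a comparable futures interface through its `Client` (for example `Client.submit` returning a future with `.result()`), but this sketch does not call it.

```python
from concurrent.futures import ThreadPoolExecutor


def trip_count(month):
    """Hypothetical per-file task: pretend to count rows in one monthly file."""
    return 1000 + month


# Submit one task per monthly file, then collect the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(trip_count, m) for m in range(1, 13)]
    counts = [f.result() for f in futures]  # blocks until each task finishes

total = sum(counts)
print(total)
```

The calling code only deals with functions and futures, so the same structure carries over when the executor is replaced by a scheduler that places tasks on remote workers.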

Slide 16

Slide 16 text

16 Web UI
Dask.distributed includes a web interface that reports the current state of the network in real time on a normal web page. It helps to track progress, identify performance issues, and debug failures.

Slide 17

Slide 17 text

17 Scaling Data Analysis
Scaling NYC Taxi Data
[Diagram: Data Analysis scaling from a client machine to clusters of a head node plus compute nodes]
One month CSV file ~ 2GB
Two years CSV files ~ 50GB
HDFS + [logo not captured] + distributed

Slide 18

Slide 18 text

18 Notebooks in Scaling Data Analysis
1 - pandas.ipynb
2 - dask.ipynb
3 - dask + distributed.ipynb

Slide 19

Slide 19 text

19 Useful resources
Documentation: http://dask.readthedocs.io/en/latest/
Dask - Matthew Rocklin’s blog: http://matthewrocklin.com/blog/
Dask-learn - Jim Crist’s blog: http://jcrist.github.io/dask-sklearn-part-1.html
Dask + YARN - Ben Zaitlen’s blog: http://quasiben.github.io/blog/2016/4/8/dask-yarn/
Dask Youtube Playlist: https://www.youtube.com/playlist?list=PLRtz5iA93T4PQvWuoMnIyEIz1fXiJ5Pri

Slide 20

Slide 20 text

2 - Scaling Interactive Visualizations

Slide 21

Slide 21 text

21 Motivation
• Visualize large amounts of data in a meaningful way
• Interactively explore the data

Slide 22

Slide 22 text

22 Bokeh
Interactive visualization framework that targets modern web browsers for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
http://bokeh.pydata.org/en/latest/

Slide 23

Slide 23 text

23 Datashader
Plotting pitfalls (example images not captured): overplotting, oversaturation, undersampling
https://anaconda.org/jbednar/plotting_pitfalls/notebook

Slide 24

Slide 24 text

24 Datashader
Graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
https://github.com/bokeh/datashader
[Figure: NYC census data by race]
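The general shape of such a pipeline, aggregating points into a fixed grid and then mapping the counts to pixel intensities, can be sketched in plain Python. This is an illustration of the idea only, not datashader's API or implementation; `aggregate`, `shade`, the log scaling and the sample points are all assumptions made for the example.

```python
import math


def aggregate(points, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid."""
    grid = [[0] * width for _ in range(height)]
    x0, x1 = x_range
    y0, y1 = y_range
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # outside the viewport
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade(grid):
    """Map counts to 0..255 intensities on a log scale."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(count) * scale) for count in row] for row in grid]


# A dense cluster, a sparse cluster and a single isolated point.
points = [(0.1, 0.1)] * 1000 + [(0.9, 0.9)] * 10 + [(0.5, 0.5)]
counts = aggregate(points, (0.0, 1.0), (0.0, 1.0), width=4, height=4)
image = shade(counts)
print(image[0][0], image[3][3], image[2][2])
```

Because the image size is fixed by the grid rather than by the number of points, the cost of drawing stays constant as the dataset grows, and the log scale keeps the single isolated point visible next to the cluster of 1000.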

Slide 25

Slide 25 text

25 Scaling Interactive Visualizations
[Diagram: Data Visualization scaling from a client machine to a cluster of a head node plus compute nodes]
Small data points: Bokeh
Large data points: Bokeh + datashader

Slide 26

Slide 26 text

26 Examples in Scaling Interactive Visualizations
bokeh:
- clustering app
datashader:
1 - census.ipynb
2 - nyc_taxi.ipynb

Slide 27

Slide 27 text

27 Useful resources
Bokeh documentation: http://bokeh.pydata.org/en/latest/
Bokeh demos: https://demo.bokehplots.com/
Datashader documentation: http://datashader.readthedocs.org/
Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
Datashader webinar: http://go.continuum.io/datashader/

Slide 28

Slide 28 text

3 - Scaling Machine Learning

Slide 29

Slide 29 text

29 Motivation • Easily parallelizing GridSearch

Slide 30

Slide 30 text

30 Scaling Machine Learning
[Diagram: Machine Learning scaling from a client machine to a cluster of a head node plus compute nodes]
[logo not captured] + Machine Learning = dask-learn
Scaling: GridSearchCV (optionally, with joblib) to DaskGridSearchCV
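Grid search parallelizes well because every parameter combination can be evaluated independently. The sketch below shows that structure with the standard library only; it uses neither scikit-learn nor dask-learn, and `param_grid`, `score` and its toy scoring formula are hypothetical stand-ins for fitting and cross-validating a real model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical parameter grid, in the dict-of-lists style used for grid search.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}


def score(params):
    """Stand-in for fit + cross-validate; higher is better."""
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)


# Expand the grid into one dict per parameter combination.
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is independent, so they can be scored concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(score, candidates))

best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
print(best_params, best_score)
```

With a real estimator the scoring step dominates the runtime, so spreading the candidates over more workers, locally or on a cluster, is where the speedup comes from.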

Slide 31

Slide 31 text

31 Notebook in Scaling Machine Learning
dask-learn.ipynb

Slide 32

Slide 32 text

32 Useful resources
Dask-learn repository: https://github.com/jcrist/dask-learn
Blogpost: http://jcrist.github.io/dask-sklearn-part-1.html

Slide 33

Slide 33 text

Thank you! Questions? @ch_doig chdoig [email protected]