2 About me • @ch_doig • Senior Data Scientist, Product Manager for Anaconda Fusion, Product Marketing Manager & Technology Evangelist at Continuum Analytics • M.S. in Industrial Engineering - UPC, Barcelona • Experience in energy, manufacturing, banking and defense (E.ON, P&G, LaCaixa, DARPA) • chdoig • Connecting Open Data Science to Microsoft Excel
3 is…. the leading open data science platform powered by Python, the fastest growing open data science language • Consulting • Training • Open Source Innovation
8 Scaling Data Science in Python • Data Science: Machine Learning, Data Analysis, Data Visualization • Small data: easily fits in memory (~GBs) • Medium data: easily fits on disk or a small cluster (~GBs - TBs) • Large data: requires a large cluster with many nodes (~TBs - PBs) • * Amazon X1 instances - 2TB of memory [Diagrams: Bokeh; client machine, head node and compute nodes]
10 Challenges • Time spent learning a new ecosystem • Refactoring workflows to scale • Efficiently handling medium data • Lack of extensibility to express custom algorithms
11 About this tutorial 1 - Scaling Data Analysis - From pandas to dask.dataframe 2 - Scaling Interactive Visualizations - From bokeh to datashader 3 - Scaling Machine Learning - From sklearn to dask-learn
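The move from pandas to dask.dataframe rests on one pattern: split the data into partitions, compute a partial result per partition, then combine the partial results. The block below is a minimal pure-Python sketch of that pattern for a group-wise mean. It does not use dask's actual API; the helper names (`partition`, `partial_sums`, `grouped_mean`) and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def partition(rows, size):
    """Yield consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def partial_sums(chunk):
    """Per-key [sum, count] for one chunk: the work done on a single partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in chunk:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def grouped_mean(rows, chunk_size):
    """Combine per-chunk partial results into a final group-wise mean."""
    total = defaultdict(lambda: [0.0, 0])
    for chunk in partition(rows, chunk_size):
        for key, (s, n) in partial_sums(chunk).items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


# Hypothetical (key, value) rows; only one chunk is processed at a time.
rows = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("a", 5.0)]
means = grouped_mean(rows, chunk_size=2)
print(means)
```

Because each chunk only contributes a small `[sum, count]` pair, the chunks could in principle be read from disk one at a time or handled by different workers, which is what makes this style of computation suitable for medium data.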
15 Distributed http://distributed.readthedocs.io/en/latest/ Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
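As a rough illustration of the programming model, the sketch below submits tasks and gathers results with the standard library's `concurrent.futures`. It runs on local threads rather than a cluster and is not Distributed's own code. Distributed's `Client` offers a similar `submit` / `result` style of interface, so the same shape of program can be pointed at a scheduler instead of a thread pool. The function `square` and the inputs are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for a task that would run on a worker."""
    return x * x


# Local thread pool standing in for a cluster of workers (an assumption
# made so the example runs anywhere without a scheduler).
with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a future immediately; the work happens in the background.
    futures = [executor.submit(square, n) for n in range(5)]
    # result() blocks until the corresponding task has finished.
    results = [f.result() for f in futures]

total = sum(results)
print(results, total)
```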
16 Web UI Information about the current state of the network helps to track progress, identify performance issues, and debug failures. Dask.distributed includes a web interface that delivers this information over a normal web page in real time.
22 Bokeh: an interactive visualization framework that targets modern web browsers for presentation • No JavaScript • Python, R, Scala and Lua bindings • Easy to embed in web applications • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates. http://bokeh.pydata.org/en/latest/
24 Datashader: a graphics pipeline system for creating meaningful representations of large amounts of data • Provides automatic, nearly parameter-free visualization of datasets • Allows extensive customization of each step in the data-processing pipeline • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints) https://github.com/bokeh/datashader [Figure: NYC census data by race]
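The key step that lets a pipeline like this cope with very large datasets is aggregation: every point is binned into a fixed-size grid, so the amount of data sent to the browser depends on the image resolution, not on the number of points. Below is a minimal pure-Python sketch of that binning step. It is not datashader's API; `rasterize_counts` and the sample points are hypothetical.

```python
def rasterize_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Skip points outside the requested viewport.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map coordinates to cell indices; clamp so the upper edge
        # lands in the last cell instead of one past it.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Hypothetical sample points in the unit square.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.95, 0.2), (1.0, 1.0)]
grid = rasterize_counts(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(grid)
```

The resulting grid of counts can then be mapped to colors and displayed as an image. Zooming or panning simply repeats the aggregation with a new `x_range` and `y_range`.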
25 Scaling Interactive Visualizations • Small numbers of data points: Bokeh • Large numbers of data points: Bokeh + datashader [Diagram: Bokeh; client machine, head node and compute nodes]