Embracing Open Data Science in Your Organization

Slide 1

Slide 1 text

Embracing Open Data Science in your Organization Christine Doig Senior Data Scientist Continuum Analytics

Slide 2

Slide 2 text

Agenda
• Introduction to Data Science
• Data Science Challenges in Organizations
• Anaconda Distribution
• Anaconda Community Innovation
• Anaconda Enterprise Platform

Slide 3

Slide 3 text

INTRODUCTION TO DATA SCIENCE

Slide 4

Slide 4 text

The Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Slide 5

Slide 5 text

The Data Science Venn Diagram Revisited
[Diagram labels: Machine Learning, Big Data, Visualization, Analytics, HPC, CS / Programming, DS]

Slide 6

Slide 6 text

Data scientists come with different skills and backgrounds
[Diagram, shown once per profile, with labels: Machine Learning, Big Data, Visualization, Analytics, HPC, CS / Programming, DS]
• Statistician / Analyst
• Research / Computational Scientist
• Developer / Engineer

Slide 7

Slide 7 text

Data Science in summary:
• is a team sport
• formed by team members with very diverse backgrounds
• both in terms of knowledge (CS, Statistics, Viz, ML…)
• and technology stacks (R, SAS, Python…)
How can companies organize efficiently in this environment?

Slide 8

Slide 8 text

© 2016 Continuum Analytics - Confidential & Proprietary
Open Data Science
An inclusive movement that makes open source tools for data science (data, analytics, and computation) easily work together as a connected ecosystem

Slide 9

Slide 9 text

Open Data Science: Vibrant and Growing Community
• Python Community: 30M+
• Anaconda Downloads*: 3M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 50%+
* As of Dec 2015. Another 2.7M downloads YTD.

Slide 10

Slide 10 text

Open Data Science means…
• Availability
• Innovation
• Interoperability
• Transparency
For everyone in the data science team
OPEN DATA SCIENCE is the FOUNDATION TO MODERNIZATION

Slide 11

Slide 11 text

Data scientists are not the only players on the data science team
• Roles: Data Scientist, Biz Analyst, Data Engineer, Developer, DevOps
• Workflows: Explore & Analyze, Collaborate & Publish, Deploy & Operate

Slide 12

Slide 12 text

Data Science assets
• Roles: Data Scientist, Biz Analyst, Developer
• Assets: Spreadsheets, Reports, Presentations, Notebooks, Scripts, Visualizations, Software packages, Web applications

Slide 13

Slide 13 text

Data Scientist assets: Notebooks, Scripts, Interactive Data Visualizations

Slide 14

Slide 14 text

14 Data Science workflows Explore & Analyze Data Query Visualize Clean & Tidy Predict, Simulate, & Optimize Interactive Reports Interactive Presentations Interactive Notebooks Interactive Apps Predictive Models Collaborate & Publish Interactive Notebooks Predictive Models Interactive Apps Code Applications

Slide 15

Slide 15 text

15 Data Science workflows Deploy & Operate Querying & Reports Web Services Data Warehouse HDFS Streaming Data Flat Files NoSQL Model Building Integrate DEPLOY OPERATE Cloud Computing Web Services On-Premise Internal Cluster

Slide 16

Slide 16 text

DATA SCIENCE CHALLENGES IN ORGANIZATIONS

Slide 17

Slide 17 text

Challenges
• Manage reproducible heterogeneous Data Science environments
• Distribute, share and publish Data Science assets
• Get diverse data scientists (languages, tools, data models, assets…) to collaborate effectively
• Enable Data Scientists to easily leverage Big Data technologies
• Deploy data science assets into production applications
• Share insights with decision makers
• Enable Business Analysts and Managers to leverage Data Science

Slide 18

Slide 18 text

How we are solving those challenges:
• Anaconda Distribution
• Anaconda Community Innovation
• Jupyter, JupyterLab and extensions
• Bokeh for interactive data visualizations
• Datashader for large scale visualizations
• Dask for parallel computing
• Numba for high performance computing
• Anaconda Enterprise

Slide 19

Slide 19 text

ANACONDA DISTRIBUTION

Slide 20

Slide 20 text

20 The Distribution for Data Science Machine Learning Big Data Visualization Analytics Scientific computing CS / Programming Numba Blaze Bokeh

Slide 21

Slide 21 text

21 … with an amazing community!

Slide 22

Slide 22 text

22 Download for free: www.continuum.io/downloads

Slide 23

Slide 23 text

Anaconda Distribution Glossary
[Diagram: the Anaconda distribution bundles Python, conda, and NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages; Miniconda bundles just Python and conda]
• Anaconda distribution: Python distribution that includes 150+ packages for data science (in the installer)
• Miniconda: Lightweight version of Anaconda, with just Python and conda
• Anaconda Cloud: Cloud service to host and share public (free) and private data science assets
• Anaconda Navigator: Anaconda distribution UI to manage environments, launch applications and learn about what’s happening in the community

Slide 24

Slide 24 text

Anaconda Navigator
• Launch applications
• Learn about the Anaconda community
• Manage environments

Slide 25

Slide 25 text

• conda: Cross-platform and language-agnostic package and environment manager
• conda-forge: A community-led collection of recipes, build infrastructure and distributions for the conda package manager
• conda environments: custom isolated sandboxes to easily reproduce and share data science projects
• conda kapsel: reproducible, executable project directories

Slide 26

Slide 26 text

Install dependencies
$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install -c conda-forge tensorflow
Manage multiple environments
name: myenv
channels:
  - chdoig
  - r
  - foo
dependencies:
  - python=2.7
  - r
  - r-ldavis
  - pandas
  - mongodb
  - spark=1.5
  - pip
  - pip:
    - flask-migrate
    - bar==1.4
$ conda env create
$ source activate myenv
Deploy an interactive visualization
$ conda kapsel run plot --show
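To make the layout of such an environment file explicit, here is a minimal sketch that renders one from Python data. It is only an illustration of the file's shape (name, channels, dependencies, with pip packages nested under a `pip:` entry); `render_environment` is a hypothetical helper, not part of conda, and the package list is a shortened version of the one on the slide.

```python
def render_environment(name, channels, dependencies, pip_packages=()):
    """Return environment-file text with name, channels, and dependencies sections."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {spec}" for spec in dependencies]
    if pip_packages:
        # Packages installed with pip are nested under a "pip:" entry.
        lines.append("  - pip:")
        lines += [f"    - {spec}" for spec in pip_packages]
    return "\n".join(lines) + "\n"


# Hypothetical, shortened spec based on the slide's example.
environment_text = render_environment(
    name="myenv",
    channels=["r"],
    dependencies=["python=2.7", "pandas", "pip"],
    pip_packages=["flask-migrate"],
)
print(environment_text)
```

Saving this text as an environment file next to a project is what lets a teammate recreate the same sandbox with the `conda env create` command shown on the slide.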

Slide 27

Slide 27 text

What challenges does Anaconda Distribution solve?
• Easy to install on all platforms
• Language agnostic - Python, R, Scala…
• Trusted by industry leaders
• Trusted by the community - Large user base: 3M+ downloads
• BSD license
• Extensible - easily build, share and install proprietary libraries with Anaconda Cloud
• Allows isolated custom sandboxes with different versions of packages - conda environments
• Allows for easy encapsulation and deployment of data science assets - conda kapsel

Slide 28

Slide 28 text

ANACONDA COMMUNITY INNOVATION

Slide 29

Slide 29 text

• Anaconda Distribution
• Anaconda Community Innovation
• Jupyter, JupyterLab and extensions
• Bokeh for interactive data visualizations
• Datashader for large scale visualizations
• Dask for parallel computing
• Anaconda Enterprise

Slide 30

Slide 30 text

Continuum Analytics contributions to the Python ODS ecosystem
• Bokeh: Web interactive data visualizations (no JS)
• Datashader: Graphics pipeline system for creating meaningful representations of large amounts of data
• Dask: Parallel computing framework
• JupyterLab: Next generation Data Science IDE

Slide 31

Slide 31 text

Jupyter Notebook
Web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
$ jupyter notebook
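On disk, a notebook is a JSON document, which is what makes it easy to share. The sketch below builds a tiny one with the standard library only, assuming the nbformat 4 layout (a list of cells plus metadata). The helper names and the cell contents are hypothetical.

```python
import json


def markdown_cell(text):
    """Return a cell holding explanatory text (Markdown, which may include LaTeX)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return an unexecuted code cell: no outputs and no execution count yet."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_notebook(cells):
    """Wrap cells in the top-level structure assumed for nbformat 4."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_notebook([
    markdown_cell("# Iris summary\nMean: $\\bar{x} = \\frac{1}{n}\\sum_i x_i$"),
    code_cell("values = [5.1, 4.9, 4.7]\nsum(values) / len(values)"),
])

# Serialize to the text that would be saved as a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Writing `notebook_json` to a file with an `.ipynb` extension should give a document that the `jupyter notebook` command above can open, with the text cell rendered and the code cell ready to run.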

Slide 32

Slide 32 text

32 JupyterLab: the next generation

Slide 33

Slide 33 text

Sharing insights with decision makers
From text, code and visualizations directly to slides

Slide 34

Slide 34 text

Jupyter: Extensions - nbpresent
Remix your Jupyter Notebooks as interactive slideshows with a UI editor
• Edit slides, layout and themes
$ conda install -c anaconda-nb-extensions nbpresent
$ jupyter notebook

Slide 35

Slide 35 text

Jupyter extensions - anaconda-nb-extensions
• nb_condakernel: use the kernel-switching dropdown inside the notebook UI to switch between conda envs
• nb_conda: helps manage conda envs from inside the file viewer of Jupyter Notebook
[Screenshots: nb_condakernel, nb_conda]

Slide 36

Slide 36 text

Jupyter: IRkernel
https://www.continuum.io/blog/developer/jupyter-and-conda-r
$ conda config --add channels r
$ conda install r-essentials
$ jupyter notebook
It is trivial to get started writing R notebooks the same way you write Python ones.

Slide 37

Slide 37 text

Bokeh
Interactive visualization framework that targets modern web browsers for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
http://bokeh.pydata.org/en/latest/

Slide 38

Slide 38 text

Datashader - Plotting pitfalls
• Overplotting
• Undersampling
https://anaconda.org/jbednar/plotting_pitfalls/notebook

Slide 39

Slide 39 text

Datashader
Graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
https://github.com/bokeh/datashader
[Image: NYC census data by race]
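The core idea of such a pipeline, aggregating points into a fixed grid before mapping counts to colors, can be sketched in plain Python. This is an illustration of the technique only, not Datashader's API or implementation. The function names, the grid size, and the synthetic data are all assumptions.

```python
import math
import random


def aggregate_counts(points, x_range, y_range, width, height):
    """Bin (x, y) points into a height x width grid of counts."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the viewport
        # Map the coordinate to a cell index; clamp so the upper edge lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade_log(grid):
    """Map counts to intensities in [0, 1] on a log scale so sparse cells stay visible."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0 for _ in row] for row in grid]
    return [[math.log1p(count) / math.log1p(peak) for count in row] for row in grid]


# Synthetic data (made up for illustration): a dense cluster plus a sparse background.
rng = random.Random(0)
points = [(rng.gauss(0.5, 0.05), rng.gauss(0.5, 0.05)) for _ in range(20_000)]
points += [(rng.random(), rng.random()) for _ in range(500)]

counts = aggregate_counts(points, (0.0, 1.0), (0.0, 1.0), width=20, height=20)
image = shade_log(counts)
```

Because every point contributes to a count rather than being drawn on top of its neighbors, the dense cluster cannot hide the background (overplotting), and no data has to be thrown away to keep the plot readable (undersampling).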

Slide 40

Slide 40 text

Datashader
More examples: https://anaconda.org/jbednar/notebooks

Slide 41

Slide 41 text

Dask: Scaling Data Analysis
[Diagram: three scenarios, each drawn with a client machine, a head node, and three compute nodes: one-month CSV file (~2 GB), six-month CSV file (~12 GB), two years of CSV files (~50 GB); additional labels: HDFS, distributed]

Slide 42

Slide 42 text

Dask Dataframes
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
>>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
>>> max_sepal_length_setosa
5.7999999999999998
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998
Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
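The idea of computing on data larger than memory can be sketched without Dask: process one file (partition) at a time, keep only a small partial result per file, and combine the partials at the end. The code below is a standard-library illustration of that pattern, not Dask's implementation. The file names and rows are made up, and `partial_max` and `chunked_max` are hypothetical helpers.

```python
import csv
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def partial_max(path, species, column):
    """Stream one CSV file row by row and return the max of `column` for `species`."""
    best = None
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["species"] == species:
                value = float(row[column])
                best = value if best is None else max(best, value)
    return best


def chunked_max(pattern, species, column, workers=4):
    """Compute per-file maxima in a thread pool, then reduce them to one result."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda p: partial_max(p, species, column), paths))
    found = [value for value in partials if value is not None]
    return max(found) if found else None


# Tiny demo files (made-up rows in the iris layout) written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_rows = [
    [("5.1", "Iris-setosa"), ("7.0", "Iris-versicolor")],
    [("5.8", "Iris-setosa"), ("4.9", "Iris-setosa")],
]
for index, rows in enumerate(demo_rows):
    with open(os.path.join(demo_dir, f"part{index}.csv"), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sepal_length", "species"])
        writer.writerows(rows)

result = chunked_max(os.path.join(demo_dir, "*.csv"), "Iris-setosa", "sepal_length")
```

Only one row per worker is held in memory at any time, so the same code would work unchanged on files far larger than RAM. Dask applies this kind of split-and-combine strategy behind a pandas-like interface.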

Slide 43

Slide 43 text

Distributed
http://distributed.readthedocs.io/en/latest/
Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
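Distributed offers a futures-style interface loosely modeled on Python's `concurrent.futures`. The sketch below shows that submit-then-gather pattern using a local `ThreadPoolExecutor` as a stand-in for a cluster, so it runs without Distributed installed. The `load` and `summarize` functions and their data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load(month):
    """Pretend to load one month of data; returns made-up numbers."""
    return [month * 10 + day for day in range(3)]


def summarize(values):
    """Reduce one partition to a single number."""
    return sum(values)


# A local thread pool stands in for the cluster in this sketch.
with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit one task per partition; each call returns a future immediately.
    loaded = [executor.submit(load, month) for month in range(1, 4)]
    # Chain a second stage on the results of the first.
    summaries = [executor.submit(summarize, future.result()) for future in loaded]
    # Gather the final results back on the client.
    totals = [future.result() for future in summaries]
```

With a real cluster the work would run on remote workers instead of local threads, while the client code keeps the same submit-and-collect shape.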

Slide 44

Slide 44 text

Web UI
Dask.distributed includes a web interface that reports the current state of the network. It helps track progress, identify performance issues, and debug failures in real time from a normal web page.

Slide 45

Slide 45 text

ANACONDA ENTERPRISE PLATFORM

Slide 46

Slide 46 text

ANACONDA platform
• ANACONDA Distribution: Open Data Science Core
• ANACONDA Repository: Open Data Science Repository
• ANACONDA Accelerate: High Performance Computing
• ANACONDA Scale: Distributed Computing
• ANACONDA Enterprise Notebooks: Data Science Collaboration
• ANACONDA Mosaic: Heterogeneous Data Exploration
• ANACONDA Fusion: Excel Data Science

Slide 47

Slide 47 text

• Anaconda Repository
• Anaconda Enterprise Notebooks
• Anaconda Mosaic
• Anaconda Fusion

Slide 48

Slide 48 text

Challenges revisited
• Manage reproducible Data Science environments
• Distribute Data Science assets
• Get diverse data scientists (languages, tools, data models, deliverables…) to collaborate effectively
• Enable Data Scientists to easily leverage Big Data technologies
• Deploy data science assets into production applications
• Share insights with decision makers

Slide 49

Slide 49 text

Learn more
https://www.continuum.io/
• Whitepapers: https://www.continuum.io/whitepapers
• Webinars: https://www.continuum.io/webinars
• Presentations: https://www.continuum.io/presentations
• Videos: https://www.continuum.io/videos

Slide 50

Slide 50 text

Thank you! Twitter: @ch_doig [email protected] [email protected]