Slide 1

Slide 1 text

The Hitchhiker's Guide to Data Science
Christine Doig
Senior Data Scientist
Continuum Analytics

Slide 2

Slide 2 text

“Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun…”

Slide 3

Slide 3 text

“Orbiting this at a distance of roughly ninety-eight million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea…”

Slide 4

Slide 4 text

“This planet has—or rather had—a problem, which was this:”
It had a lot of DATA!
and not enough DATA SCIENTISTS!

Slide 5

Slide 5 text

“…the Hitchhiker’s Guide has already supplanted the great Encyclopedia Galactica as the standard repository of all knowledge and wisdom…”
DON’T PANIC
The Hitchhiker’s Guide to Data Science

Slide 6

Slide 6 text

The Data Science Galaxy
[Diagram. Labels shown: Setup, Explore, Learn, Scale, DS development environments, Transformations, Visualization, Queries, Statistics, Machine Learning, Deep Learning, Parallelism, Optimization, Compiled Assets, Real-time processing, Distributed file systems and databases]
© 2015 Continuum Analytics - Confidential & Proprietary

Slide 7

Slide 7 text

DON’T PANIC
The Hitchhiker’s Guide to Data Science
• Setup:
  • Anaconda
  • Jupyter
• Explore:
  • Pandas
  • Bokeh
• Scale:
  • Numba
  • Dask

Slide 8

Slide 8 text

SETUP: ANACONDA + JUPYTER

Slide 9

Slide 9 text

Anaconda: Intro
[Diagram: Python, conda, NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages]
• Anaconda: Python distribution that includes 150+ packages for data science
• conda: Cross-platform, language-agnostic package and environment manager
• Miniconda: Minimal version of Anaconda, with just Python and conda
• Anaconda Cloud: Cloud service to host and share public and private packages, environments and notebooks
• conda environments: Custom isolated sandboxes to easily reproduce and share data science projects

Slide 10

Slide 10 text

Anaconda: Intro
Why Anaconda?
• Easy to install on all platforms
• Trusted by industry leaders, e.g. Microsoft Azure ML
• Large user base: 3M+ downloads
• BSD license
• Extensible: easily build, share and install proprietary libraries with Anaconda Cloud
• Language agnostic: Python, R, Scala…
• Allows isolated custom sandboxes with different versions of packages

Slide 11

Slide 11 text

Anaconda: What’s new?
• MKL optimization
• Navigator
• conda-forge
• R

Slide 12

Slide 12 text

Anaconda: MKL optimization
Starting with release 2.5 (February 2016), Anaconda includes the Intel Math Kernel Library (MKL) optimizations (version 11.3.1) for improved performance. They are available by default and free for all users.
    conda update conda
    conda install anaconda=2.5
https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations

Slide 13

Slide 13 text

Anaconda: Navigator
A desktop graphical user interface included in Anaconda
• Launch applications and easily manage conda packages, environments and channels.
• No need to use the command line.
• Available for Windows, OS X and Linux.
• Anaconda Navigator has replaced Launcher.
• Integration with Anaconda Cloud.

Slide 14

Slide 14 text

Anaconda: conda-forge
A community-led collection of recipes, build infrastructure and distributions for the conda package manager.
• Each repository (feedstock) is built automatically with CI (AppVeyor, CircleCI and TravisCI)
• Builds are uploaded to Anaconda Cloud
    conda config --add channels conda-forge
    conda install
https://conda-forge.github.io/

Slide 15

Slide 15 text

Anaconda: R
Added support for R
• R-Essentials: A conda metapackage with 80+ R packages for data science
    conda config --add channels r
    conda install r-essentials
• MRO: Microsoft R Open distribution with MKL
    conda config --add channels mro
    conda install r
https://www.continuum.io/blog/developer/jupyter-and-conda-r

Slide 16

Slide 16 text

Jupyter: Intro
Web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
    jupyter notebook

Slide 17

Slide 17 text

Jupyter: What’s new?
• Extensions: nbpresent, conda environments
• Kernels: IRkernel

Slide 18

Slide 18 text

Jupyter: Extensions - nbpresent
Remix your Jupyter Notebooks as interactive slideshows with a UI editor
• Edit slides, layout and themes
    conda install -c anaconda-nb-extensions nbpresent
    jupyter notebook

Slide 19

Slide 19 text

Jupyter: Extensions - anaconda-nb-extensions
• nb_condakernel: use the kernel-switching dropdown inside the notebook UI to switch between conda envs
• nb_conda: helps manage conda envs from inside the file viewer of the Jupyter notebook

Slide 20

Slide 20 text

Jupyter: IRkernel
It is trivial to get started writing R notebooks the same way you write Python ones.
    conda config --add channels r
    conda install r-essentials
    jupyter notebook
https://www.continuum.io/blog/developer/jupyter-and-conda-r

Slide 21

Slide 21 text

EXPLORE: PANDAS + BOKEH

Slide 22

Slide 22 text

Pandas: Intro
High-performance, easy-to-use data structures and data analysis tools
• Designed to make working with "relational" or "labeled" data both easy and intuitive.
• Building block for doing practical, real-world data analysis in Python.
Features:
• Automatic data alignment
• Rolling, expanding, and EWM operations
• Timeseries operations, including filling or dropping missing values
• Resampling & ordered merges
• Timezone handling
• Date offsets & holiday support
• Intelligent interactive indexing
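As a rough illustration of "automatic data alignment" and handling of missing values, here is a minimal sketch. The series names and dates are made up for the example and are not from the talk.

```python
import pandas as pd

# Two series that only partly share an index (illustrative dates).
a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
)
b = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2016-01-02", "2016-01-03", "2016-01-04"]),
)

# Arithmetic aligns on index labels, not on position.
# Labels present in only one operand produce NaN.
total = a + b

# Missing values can then be dropped...
dropped = total.dropna()

# ...or avoided by treating absent labels as 0 during the addition.
filled = a.add(b, fill_value=0.0)

print(total)
print(dropped)
print(filled)
```

The point of the sketch is that no manual join is needed: the index labels decide which values are combined.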

Slide 23

Slide 23 text

Pandas: What’s new?
• New logo
• Window functions are now methods
    Before: pd.rolling_mean(df,window=3)
    Now: r = df.rolling(window=3)
• The .to_xarray() function has been added for compatibility with the xarray package
• pd.read_sas() has gained the ability to read SAS7BDAT files
• Conditional formatting, the visual styling of a DataFrame depending on the data within, by using the DataFrame.style property.
    http://pandas.pydata.org/pandas-docs/stable/style.html
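A small sketch of the method-style window API mentioned above. The DataFrame contents are invented for illustration; the older function form appears only in a comment because it is no longer available in current pandas releases.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Older function form from the slide: pd.rolling_mean(df, window=3)
# Method form: build a window object first, then pick an aggregation.
r = df.rolling(window=3)

means = r.mean()  # first two rows are NaN (window not yet full)
sums = r.sum()    # the same window object can be reused

print(means)
print(sums)
```

One design benefit of the method form is that the window object `r` can be reused for several aggregations without restating the window size.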

Slide 24

Slide 24 text

Bokeh: Intro
Interactive visualization framework that targets modern web browsers for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.

Slide 25

Slide 25 text

Bokeh: What’s new?
• Bokeh server
• rBokeh + Shiny
• Datashader

Slide 26

Slide 26 text

Bokeh: Server
• New Tornado and websocket-based Bokeh Server
• bokeh command line tool for creating applications
• Expanded docs, including deployment guidance
• Video demonstrations and tutorials
• Supports async, periodic, timeout and model event callbacks
• Python client API

Slide 27

Slide 27 text

Bokeh: rBokeh + Shiny
• rBokeh is a library providing an R interface to Bokeh.
• rBokeh integrates with Shiny to build dynamic interactive web visualizations.

Slide 28

Slide 28 text

Bokeh: Datashader
Graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
https://github.com/bokeh/datashader
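To make the "pipeline" idea concrete, here is a minimal sketch of the general approach of aggregating points onto a fixed grid before rendering. It uses plain NumPy and is not Datashader's API; the function name `rasterize`, the grid size and the random data are assumptions made for the example.

```python
import numpy as np


def rasterize(x, y, width=200, height=100, extent=(-4.0, 4.0)):
    """Turn a point cloud into a small grayscale image (hypothetical helper)."""
    # Step 1 (aggregate): count points per cell of a fixed-size grid,
    # so the output size no longer depends on the number of points.
    counts, _, _ = np.histogram2d(
        x, y, bins=(width, height), range=[extent, extent]
    )
    # Step 2 (transform): compress the dynamic range so dense regions
    # do not saturate and sparse regions stay visible.
    scaled = np.log1p(counts)
    # Step 3 (map to pixels): normalize to 0..255.
    normalized = scaled / scaled.max()
    return (255 * normalized).astype(np.uint8)


rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
image = rasterize(x, y)
print(image.shape, image.dtype)
```

Because every step works on the grid rather than on individual points, each one can be swapped or tuned independently, which is the kind of customization the slide describes.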

Slide 29

Slide 29 text

Bokeh: Datashader
[Figures: plotting pitfalls. Panels: Overplotting, Oversaturation, Undersampling]
https://anaconda.org/jbednar/plotting_pitfalls/notebook

Slide 30

Slide 30 text

Bokeh: Datashader
https://anaconda.org/jbednar/notebooks

Slide 31

Slide 31 text

SCALE: NUMBA + DASK

Slide 32

Slide 32 text

Numba: Intro
Speed up your applications with high-performance functions written directly in Python.
• Just-in-time compiled to native machine instructions
• Similar in performance to C, C++ and Fortran
• Supports compilation of Python to run on either CPU or GPU hardware
• Integrates well with the Python scientific software stack
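A minimal sketch of what "written directly in Python" looks like in practice, assuming Numba's `njit` decorator. The function `sum_of_squares` is invented for illustration, and the `ImportError` fallback exists only so the sketch still runs (uncompiled) where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Fallback for this sketch only: without Numba the function
    # runs as ordinary, uncompiled Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A plain Python loop; with Numba it is compiled to machine code
    # the first time the function is called with a given argument type.
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(4))
```

The first call includes compilation time, so any speedup typically shows up on later calls or on large inputs.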

Slide 33

Slide 33 text

Dask: Intro
A parallel computing framework through task scheduling and blocked algorithms
• Familiar: Implements parallel NumPy and Pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: For sophisticated and messy algorithms
• Scales up: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
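The phrase "task scheduling and blocked algorithms" can be sketched with the standard library alone. This is not Dask's API, only the underlying idea: split the data into blocks, run one task per block, then combine the partial results. The helper names `chunked` and `blocked_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(seq, size):
    """Split a sequence into consecutive blocks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def blocked_sum(values, chunk_size=4):
    blocks = chunked(values, chunk_size)
    # One task per block; the executor decides when each task runs.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, blocks))
    # Combine the per-block results into the final answer.
    return sum(partials)


total = blocked_sum(list(range(10)), chunk_size=4)
print(total)
```

A real framework adds what this sketch leaves out, such as a task graph with dependencies, out-of-core blocks and distribution across machines.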

Slide 34

Slide 34 text

Dask: What’s new?
• Distributed scheduler
• Hadoop integration (HDFS + YARN)
• Notebook widgets
• More diagnostics
http://quasiben.github.io/blog/2016/4/8/dask-yarn/
http://matthewrocklin.com/blog/

Slide 35

Slide 35 text

DON’T PANIC
The Hitchhiker’s Guide to Data Science
• Setup:
  • Anaconda
  • Jupyter
• Explore:
  • Pandas
  • Bokeh
• Scale:
  • Numba
  • Dask
42
The Answer to the Ultimate Question of Life, The Universe, and Everything.

Slide 36

Slide 36 text

So Long, and Thanks for All the Fish!
Questions?