
The Hitchhiker's Guide to Data Science

The Hitchhiker's Guide to Data Science, PyData Madrid 2016


Christine Doig

April 09, 2016


  1. The Hitchhiker's Guide to Data Science Christine Doig Senior Data

    Scientist Continuum Analytics
  2. 2 “Far out in the uncharted backwaters of the unfashionable

    end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun…”
  3. 3 “Orbiting this at a distance of roughly ninety-eight million

    miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea…”
  4. 4 “This planet has—or rather had—a problem, which was this:”

    It had a lot of DATA! and not enough DATA SCIENTISTS!
  5. 5 “…the Hitchhiker’s Guide has already supplanted the great Encyclopedia

    Galactica as the standard repository of all knowledge and wisdom…” DON’T PANIC The Hitchhiker’s Guide to Data Science
  6. 6 The Data Science Galaxy

    Setup • Explore • Learn • Scale • DS development environments • Transformations • Visualization • Queries • Statistics • Machine Learning • Deep Learning • Parallelism • Optimization • Compiled Assets • Real-time processing • Distributed file systems and databases
  7. 7 DON’T PANIC • Setup: • Anaconda • Jupyter •

    Explore: • Pandas • Bokeh • Scale: • Numba • Dask The Hitchhiker’s Guide to Data Science

  9. 9 Anaconda: Intro PYTHON NumPy, SciPy, Pandas, Scikit-learn, Jupyter /

    IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages • Anaconda: Python distribution that includes 150+ packages for data science • conda: cross-platform, language-agnostic package and environment manager • Miniconda: minimal version of Anaconda, with just Python and conda • Anaconda Cloud: cloud service to host and share public and private packages, environments and notebooks • conda environments: custom isolated sandboxes to easily reproduce and share data science projects
  10. 10 Anaconda: Intro • Why Anaconda?

    PYTHON NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages • Easy to install on all platforms • Trusted by industry leaders, e.g. Microsoft Azure ML • Large user base: 3M+ downloads • BSD license • Extensible: easily build, share and install proprietary libraries with Anaconda Cloud • Language-agnostic: Python, R, Scala… • Allows isolated custom sandboxes with different versions of packages
  11. 11 Anaconda: What’s new? • MKL optimization • Navigator •

    Conda forge • R
  12. 12 Anaconda: MKL optimization https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations

    Starting with release 2.5 (February 2016), Anaconda includes the Intel Math Kernel Library (MKL) optimizations (version 11.3.1) for improved performance. Available by default and free for all. To upgrade:
    conda update conda
    conda install anaconda=2.5
  13. 13 Anaconda: Navigator • A desktop graphical user interface included in Anaconda

    • Launch applications and easily manage conda packages, environments and channels • No need to use the command line • Available for Windows, OS X and Linux • Anaconda Navigator has replaced Launcher • Integration with Anaconda Cloud
  14. 14 Anaconda: Conda-forge https://conda-forge.github.io/

    A community-led collection of recipes, build infrastructure and distributions for the conda package manager. • Each repo (feedstock) builds automatically with CI (AppVeyor, CircleCI and Travis CI) • Builds are uploaded to Anaconda Cloud
    conda config --add channels conda-forge
    conda install <package_name>
  15. 15 Anaconda: R https://www.continuum.io/blog/developer/jupyter-and-conda-r

    Added support for R • R-Essentials: a conda metapackage with 80+ R packages for data science • MRO: Microsoft R Open distribution with MKL
    conda config --add channels r
    conda install r-essentials
    conda config --add channels mro
    conda install r
  16. 16 Jupyter: Intro Web application that allows you to create

    and share documents that contain live code, equations, visualizations and explanatory text. jupyter notebook
  17. 17 Jupyter: What’s new? • Extensions - nbpresent, conda environments

    • Kernels - IRKernel
  18. 18 Jupyter: Extensions - nbpresent remix your Jupyter Notebooks as

    interactive slideshows with a UI editor • Edit slides, layout and themes
    conda install -c anaconda-nb-extensions nbpresent
    jupyter notebook
  19. 19 Jupyter: Extensions - anaconda-nb-extensions • nb_condakernel: use the kernel-switching

    dropdown inside the notebook UI to switch between conda envs • nb_conda: helps manage conda envs from inside the file viewer of Jupyter Notebook
  20. 20 Jupyter: IRkernel https://www.continuum.io/blog/developer/jupyter-and-conda-r

    Trivial to get started writing R notebooks the same way you write Python ones.
    conda config --add channels r
    conda install r-essentials
    jupyter notebook

  22. 22 Pandas: Intro • Designed to make working with "relational"

    or "labeled" data both easy and intuitive. • Building block for doing practical, real-world data analysis in Python. • High-performance, easy-to-use data structures and data analysis tools • Automatic data alignment • Rolling, expanding, and EWM operations • Timeseries operations, including fill or drop missing values • Resampling & ordered merges • Timezone handling • Date offsets & holiday support • Intelligent interactive indexing
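A minimal sketch of three of the features listed above (automatic data alignment, filling missing values, and resampling). The series names and values are made up for illustration and do not come from the talk.

```python
import pandas as pd

# Two labeled series whose indexes only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic data alignment: values are matched by label, not by position.
# Labels missing from one side produce NaN in the result.
total = a + b

# Fill the missing value left by the alignment.
filled = total.fillna(0.0)

# Resampling: aggregate a half-daily time series into daily sums.
stamps = pd.to_datetime(
    ["2016-04-09 00:00", "2016-04-09 12:00", "2016-04-10 00:00", "2016-04-10 12:00"]
)
ts = pd.Series([1, 2, 3, 4], index=stamps)
daily = ts.resample("D").sum()

print(filled)
print(daily)
```

Alignment by label is what makes the addition safe here: `x` has no partner in `b`, so it becomes missing instead of being silently paired with the wrong value.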
  23. 23 Pandas: What’s new? • New logo • Window functions are now methods:

    pd.rolling_mean(df, window=3) is replaced by r = df.rolling(window=3) followed by r.mean() • The .to_xarray() method has been added for compatibility with the xarray package • pd.read_sas() has gained the ability to read SAS7BDAT files • Conditional formatting, the visual styling of a DataFrame depending on the data within, is available through the DataFrame.style property: http://pandas.pydata.org/pandas-docs/stable/style.html
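As a small illustration of the method-based window API mentioned on this slide, the sketch below builds a rolling object and takes its mean. The DataFrame contents are invented for the example.

```python
import pandas as pd

# Illustrative data, not from the talk.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Window functions are methods: build a rolling object, then aggregate it.
r = df.rolling(window=3)
means = r.mean()

# The first two rows are NaN because a full 3-row window is not yet available.
print(means)
```

The rolling object can be reused for other aggregations, for example `r.sum()` or `r.std()`, without restating the window size.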
  24. 24 Bokeh: Intro Interactive visualization framework that targets modern web

    browser presentation • No JavaScript • Python, R, Scala and Lua bindings • Easy to embed in web applications • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
  25. 25 Bokeh: What’s new? • Bokeh server • rBokeh +

    Shiny • Datashader
  26. 26 Bokeh: Server • New Tornado and websocket-based Bokeh Server

    • bokeh command line tool for creating applications • expanded docs including deployment guidance • video demonstrations and tutorials • supports async, periodic, timeout and model event callbacks • python client API
  27. 27 Bokeh: rBokeh + Shiny • rBokeh is a library providing

    an R interface to Bokeh. • rBokeh integrates with Shiny to build dynamic interactive web visualizations
  28. 28 Bokeh: Datashader graphics pipeline system for creating meaningful representations

    of large amounts of data • Provides automatic, nearly parameter-free visualization of datasets • Allows extensive customization of each step in the data-processing pipeline • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints) https://github.com/bokeh/datashader
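The idea of turning many points into a fixed-size image can be sketched with plain NumPy. This is not Datashader's API: it is an assumed, simplified stand-in that bins points into a grid and rescales the counts, and the grid size, value range and random data are all arbitrary choices for the example.

```python
import numpy as np

# Illustrative data: 100,000 random points (not from the talk).
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)

# Step 1: aggregate points into a fixed 64x64 grid of counts.
# Points outside the chosen range are dropped.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(64, 64), range=[[-4.0, 4.0], [-4.0, 4.0]]
)

# Step 2: map counts to intensities in [0, 1]. A log scale keeps dense
# regions from saturating while sparse regions remain visible.
shaded = np.log1p(counts)
shaded = shaded / shaded.max()

print(counts.shape, float(shaded.min()), float(shaded.max()))
```

Because the output size depends only on the grid, not on the number of points, the same two steps apply whether the input has thousands or billions of rows.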
  29. 29 Bokeh: Datashader • Plotting pitfalls: overplotting, oversaturation, undersampling https://anaconda.org/jbednar/plotting_pitfalls/notebook

  30. 30 Bokeh: Datashader https://anaconda.org/jbednar/notebooks


  32. 32 Numba: Intro Speed up your applications with high performance

    functions written directly in Python. • Just-in-time compiled to native machine instructions • Similar in performance to C, C++ and Fortran • Supports compilation of Python to run on either CPU or GPU hardware • Integrates well with the Python scientific software stack.
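A minimal sketch of just-in-time compiling a plain Python loop with Numba's `njit` decorator. The function name `sum_of_squares` is hypothetical, and the `ImportError` fallback is an assumption added only so the example still runs (uncompiled) where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumed fallback for environments without Numba: run as plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # An explicit loop like this is slow in pure Python; with Numba it is
    # compiled to native machine code the first time the function is called.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The first call includes compilation time, so any timing comparison should measure a second call.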
  33. 33 Dask: Intro A parallel computing framework through task scheduling

    and blocked algorithms • Familiar: Implements parallel NumPy and Pandas objects • Fast: Optimized for demanding numerical applications • Flexible: for sophisticated and messy algorithms • Scales up: Runs resiliently on clusters of 100s of machines • Scales down: Pragmatic in a single process on a laptop • Interactive: Responsive and fast for interactive data science
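To make "blocked algorithms" concrete, the sketch below splits an array into blocks, reduces each block as a separate task, and combines the partial results. It uses only NumPy and the standard library and is an illustration of the idea, not Dask's implementation; `blocked_sum` and the block size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def blocked_sum(values, block_size):
    # Split the array into contiguous blocks.
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # Each block sum is an independent task that a scheduler can run in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(np.sum, blocks))
    # Combine the per-block results into the final answer.
    return sum(partials)


data = np.arange(1_000_000, dtype=np.int64)
result = blocked_sum(data, block_size=100_000)
print(result)
```

With Dask itself, a comparable computation would be written as `dask.array.from_array(data, chunks=100_000).sum().compute()`, leaving the blocking and scheduling to the library.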
  34. 34 Dask: What’s new? • Distributed scheduler • Hadoop integration

    (HDFS + YARN) • Notebook widgets • More diagnostics http://quasiben.github.io/blog/2016/4/8/dask-yarn/ http://matthewrocklin.com/blog/
  35. 35 DON’T PANIC • Setup: • Anaconda • Jupyter •

    Explore: • Pandas • Bokeh • Scale: • Numba • Dask The Hitchhiker’s Guide to Data Science 42 The Answer to the Ultimate Question of Life, The Universe, and Everything.
  36. So Long, and Thanks for All the Fish! Questions?