Data Science Python Ecosystem

Navigating the Data Science Python Ecosystem talk at PyConES 2015

Christine Doig

November 21, 2015

Transcript

  1. Navigating the Data Science Python Ecosystem Christine Doig PyConES 2015

    Nov 2015
  2. 2 Christine Doig is… a Data Scientist at Continuum Analytics

    originally from Barcelona living in Austin, Texas loving Python and Data Science @ch_doig / @ContinuumIO chdoig.github.io
  3. 3 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM
    1 What is Data Science?
    2 What is the state of the Python ecosystem for Data Science?
    3 How to get started?
  4. 4 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM The guide I

    wish I had when I started doing Data Science with Python
  5. DATA SCIENCE

  6. 6 From the lab to the factory - Data Day Texas
    Slides: http://www.slideshare.net/joshwills/production-machine-learninginfrastructure
    Video: https://www.youtube.com/watch?v=v-91JycaKjc
  7. 7

  8. 8 “Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.”
  9. 9 http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  10. 10 Data Scientist Job posting

  11. 11 Scientific Computing, Distributed Systems, BI / Analytics, Machine Learning / Statistics, Web
  12. 12 “Forget unicorns!” Data Science is a team sport!
  13. 13 Areas: Scientific Computing, Distributed Systems, Analytics, Machine Learning/Stats, Web
    Roles: Data Scientists/Modeler, Data/Business Analyst, Research/Computational Scientist, Data Engineers/Architects, Developer
    Outputs: Model, Algorithm, Report, Application, Pipeline/Architecture
  14. 14 Areas: Scientific Computing, Distributed Systems, Analytics, Machine Learning/Stats, Web
    Roles: Data Scientists/Modeler, Data/Business Analyst, Research/Computational Scientist, Data Engineers/Architects, Developer
    Example tools: PyMC, Numba, xlwings, Bokeh, Kafka, RDFLib, mrjob, Airflow
  15. DATA SCIENCE PYTHON ECOSYSTEM

  16. 16 Why Python?
    • General purpose language, easy and fun to learn
    • Rich ecosystem of libraries: high quality and quantity. Growing at a rapid pace.
    • Developer community. Conferences: SciPy, PyData…
    • Mature core Scientific Computing libraries (bindings to C/C++ or Fortran)
    • Glue language
    • Diverse users: SysAdmins, Web developers, Scientists, Statisticians… which enables cross-team collaboration
    • Analysis -> Production (vs R, Matlab…)
    • R: "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." Bo Cowgill
    • Matlab: $$$, not open
    http://nbviewer.ipython.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb
  17. 17

  18. 18 3 Talks about the DS Python ecosystem that inspired

    me
  19. 19 https://www.youtube.com/watch?v=5GlNDD7qbP4 Keynote: State of the Tools | SciPy 2015

    | Jake VanderPlas https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote
  20. 20 http://nbviewer.ipython.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb

  21. 21 Rob Story: Python Data Bikeshed https://www.youtube.com/watch?v=RTiAMB2tQjo
    I have data. It’s July 2015. I want to group things. Or count things. Or average things. Or add things. What Python library do I use?
  22. 22 This talk

  23. 23 THIS TALK | RELATED / ALTERNATIVES
    SETUP: Miniconda, Canopy, Python + pip…
    IDE: Spyder, PyCharm, Atom, Rodeo, Sublime
    DATA MUNGING: numpy, xray
    DATA VISUALIZATION: matplotlib, seaborn, pyxley, plotly, lightning (this talk: Bokeh)
    Sat - 12:30 p.m. Introducción a visualizaciones interactivas con Bokeh (Introduction to interactive visualizations with Bokeh), Alejandro Vidal
    Sat - 3:40 p.m. Data structures beyond dicts and lists, Sergi Sorribas
  24. 24 THIS TALK | RELATED / ALTERNATIVES
    MACHINE LEARNING: Orange, Pylearn…
    DEEP LEARNING: Caffe, Keras, TensorFlow… (this talk: Theano + Lasagne)
    BIG DATA: dask, bcolz…; Hadoop, Spark, Impala, Ibis…
  25. 25

  26. 26 THIS TALK | RELATED / ALTERNATIVES
    MACHINE LEARNING: Orange, Pylearn…
    DEEP LEARNING: Caffe, Keras, TensorFlow… (this talk: Theano + Lasagne)
    BIG DATA: dask, bcolz (Medium Data and Distributed computing); Hadoop, Spark, Impala, Ibis
  27. 27 THIS TALK | RELATED / ALTERNATIVES
    MACHINE LEARNING: Orange, Pylearn
    DEEP LEARNING: Caffe, Keras, TensorFlow… (this talk: Theano + Lasagne)
    BIG DATA: dask, bcolz (Medium Data and Distributed computing); Hadoop, Spark, Impala, Ibis
    Sun - 11:50 a.m. Trolling Detection with Scikit-learn and NLTK, Rafa Haro
    Sun - 12:30 p.m. Tratando datos más allá de los límites de la memoria (Handling data beyond the limits of memory), Francesc Alted
  28. 28 If I had more time…

  29. 29 THIS TALK | RELATED / ALTERNATIVES
    WEB SCRAPING: beautifulsoup
    WORKFLOW / PIPELINES: Airflow, Luigi
    NLP: spaCy, Gensim, NLTK
    STATISTICS: PyMC, PyMC3…
    Sat - 3 p.m. Know your models - Statsmodels! Israel Saeta Pérez and Miquel Camprodon
    Sun - 1:10 p.m. Dive into Scrapy, Juan Riaza
    Another time
  30. 30 THIS TALK
    Another time: IMAGE, AUDIO (PyAudio), GRAPH (NetworkX), WEB FRAMEWORKS
  31. NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM

  32. 32 Bokeh Blaze dask

  33. ANACONDA Setup

  34. 34 PYTHON NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba,

    Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 330+ packages conda
  35. 35 Conda
    • Package and environment manager
    • Language agnostic (Python, R, Java…)
    • Cross-platform (Windows, OS X, Linux)
    $ conda install python=2.7
    $ conda install pandas
    $ conda install -c r r
    $ conda install mongodb
  36. 36 https://www.continuum.io/downloads Anaconda and conda are BSD licensed

  37. 37 Miniconda = Python + conda http://conda.pydata.org/miniconda.html

  38. 38 http://docs.continuum.io/anaconda/pkg-docs

  39. 39 Learn more
    TALK: Reproducible Multi-language Data Science with Conda, PyData Dallas 2015 http://chdoig.github.io/pydata2015-dallas-conda
    BLOGPOST: Conda for Data Science https://www.continuum.io/content/conda-data-science
  40. JUPYTER IDE

  41. 41 http://jupyter.org/
    Open source, interactive data science and scientific computing across more than 40 programming languages.
  42. 42 The Jupyter Notebook is a web application that allows

    you to create and share documents that contain live code, equations, visualizations and explanatory text.
  43. 43 IPython IPython notebook nbviewer tmpnb binder Jupyter https://try.jupyter.org/

  44. 44 Binder

  45. 45 http://mybinder.org/

  46. 46

  47. 47

  48. 48 Notebook -> Slides

  49. 49

  50. 50

  51. 51

  52. 52 $ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve

  53. 53

  54. 54 RISE

  55. 55 https://github.com/damianavila/RISE RISE: "Live" Reveal.js Jupyter/IPython Slideshow Extension A notebook

    rendered as a Reveal.js-based slideshow, where you can execute code or show to the audience whatever you can show/do inside the notebook itself
  56. 56 Learn more
    BLOGPOST: Jupyter and conda for R, Christine Doig, Sep. 2015 https://www.continuum.io/blog/developer/jupyter-and-conda-r
    TALK: RISE, Damian Avila, SciPy 2014 https://www.youtube.com/watch?v=sZBKruEh0jI
  57. PANDAS Data munging

  58. 58 pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. http://pandas.pydata.org/
  59. 59 pandas.DataFrame Two-dimensional size-mutable, potentially heterogeneous tabular data structure with

    labeled axes (rows and columns). https://github.com/jreback/PyDataNYC2015
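
The DataFrame definition above can be made concrete with a small sketch. The rows and column names below are invented for illustration and are not taken from the tutorial data.

```python
import pandas as pd

# Hypothetical beer-review rows; names and values are assumptions for illustration.
reviews = pd.DataFrame(
    {
        "beer_style": ["IPA", "Stout", "Pilsner"],
        "abv": [6.5, 8.0, 4.7],
        "review_overall": [4.0, 4.5, 3.5],
    },
    index=["r1", "r2", "r3"],  # labeled row axis
)

# Potentially heterogeneous: each column keeps its own dtype.
print(reviews.dtypes)

# Size-mutable: a column can be added after construction.
reviews["is_strong"] = reviews["abv"] > 6.0

# Labeled axes: select one cell by row label and column label.
print(reviews.loc["r2", "beer_style"])
```
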
  60. 60 https://github.com/jreback/PyDataNYC2015

  61. 61 I/O https://github.com/jreback/PyDataNYC2015
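
The slide content for I/O is not in the transcript, so the following is only a minimal sketch of a CSV round trip with `pd.read_csv` and `DataFrame.to_csv`. The in-memory text stands in for a file, and its columns are assumptions.

```python
import io

import pandas as pd

# In-memory CSV used in place of a file on disk (hypothetical columns).
csv_text = "beer_style,abv,review_overall\nIPA,6.5,4.0\nStout,8.0,4.5\n"

# Read: any file path or file-like object works here.
reviews = pd.read_csv(io.StringIO(csv_text))

# Write: drop the row index so the output matches the input layout.
buffer = io.StringIO()
reviews.to_csv(buffer, index=False)

# Reading the written text back should give an identical frame.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```
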

  62. 62 Data Exploration Get beers with abv (alcohol by volume)

    less than 5 and just return columns ‘beer_style’ and ‘review_overall’ https://github.com/jreback/PyDataNYC2015
  63. 63 Data Exploration Get beers with abv (alcohol by volume)

    less than 5 and time after June 2009, or with review overall rating higher than 4.5 https://github.com/jreback/PyDataNYC2015
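
The two Data Exploration queries above might look like this in pandas. This is a sketch on made-up rows: the column names `abv`, `beer_style`, `review_overall` and `time` follow the slide wording, and "after June 2009" is assumed to mean after June 30, 2009.

```python
import pandas as pd

# Hypothetical review rows (values invented for illustration).
reviews = pd.DataFrame(
    {
        "beer_style": ["Pilsner", "IPA", "Stout", "Lager"],
        "abv": [4.7, 6.5, 8.0, 4.2],
        "review_overall": [3.5, 4.0, 5.0, 4.0],
        "time": pd.to_datetime(
            ["2009-03-01", "2010-01-15", "2008-07-04", "2009-08-20"]
        ),
    }
)

# Slide 62: abv below 5, returning only two columns.
low_abv = reviews.loc[reviews["abv"] < 5, ["beer_style", "review_overall"]]

# Slide 63: (abv below 5 AND time after June 2009) OR overall rating above 4.5.
# Each comparison needs its own parentheses because & and | bind tighter.
cutoff = pd.Timestamp("2009-06-30")
mask = ((reviews["abv"] < 5) & (reviews["time"] > cutoff)) | (
    reviews["review_overall"] > 4.5
)
selected = reviews[mask]

print(low_abv)
print(selected)
```
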
  64. 64 Groupby https://github.com/jreback/PyDataNYC2015
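
The Groupby slide content is not in the transcript. As a hedged illustration, grouping, counting and averaging in one step could be written as follows, on invented data.

```python
import pandas as pd

# Hypothetical ratings; the column names are assumptions for illustration.
reviews = pd.DataFrame(
    {
        "beer_style": ["IPA", "IPA", "Stout", "Stout", "Pilsner"],
        "review_overall": [4.0, 3.0, 4.5, 5.0, 3.5],
    }
)

# Split rows by style, then compute the count and mean of the ratings per group.
by_style = reviews.groupby("beer_style")["review_overall"].agg(["count", "mean"])
print(by_style)
```
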

  65. 65 …and much more!
    Missing values -> resample(fill_method=…)
    Computational tools -> rolling_mean()
    Timezone handling -> pd.date_range('20130101 09:00:00',periods=5,tz='US/Eastern')
    Timeseries
    Tidy data
    melt() pivot() dropna() pipe() stack()
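
A few of the features listed above, sketched on invented data. The slide uses the 2015 spellings `rolling_mean()` and `resample(fill_method=…)`. Later pandas releases express the same ideas with `.rolling(...).mean()` and explicit fill methods, which is what this sketch assumes.

```python
import pandas as pd

# Timezone handling: the date_range call from the slide (daily frequency by default).
idx = pd.date_range("20130101 09:00:00", periods=5, tz="US/Eastern")
temps = pd.Series([1.0, 2.0, None, 4.0, 5.0], index=idx)  # invented values

# Missing values: forward-fill the gap with the previous observation.
filled = temps.ffill()

# Computational tools: two-point rolling mean (modern form of rolling_mean).
rolling = filled.rolling(window=2).mean()

# Convert the same instants to another timezone.
madrid = filled.tz_convert("Europe/Madrid")

# Tidy data: melt a wide table into long form, then pivot it back.
wide = pd.DataFrame(
    {"city": ["Austin", "Barcelona"], "2014": [10, 20], "2015": [11, 21]}
)
tidy = wide.melt(id_vars="city", var_name="year", value_name="value")
back = tidy.pivot(index="city", columns="year", values="value")

print(rolling)
print(tidy)
```
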
  66. 66 Learn more
    TUTORIAL: Performance Pandas, Jeff Reback, PyDataNYC 2015 https://github.com/jreback/PyDataNYC2015/
    BLOGPOST: Analyzing Pronto CycleShare Data with Python and Pandas, Jake VanderPlas https://jakevdp.github.io/blog/2015/10/17/analyzing-pronto-cycleshare-data-with-python-and-pandas/
  67. BOKEH Data visualization

  68. 68 Visualization libraries in Python: matplotlib, seaborn, bokeh, pyxley, lightning, plotly
  69. 69 Bokeh: Custom visualizations, Dashboards, Streaming/Animations, Charts, Tools, Widgets, Maps, Hover
  70. 70

  71. 71

  72. 72 Learn more
    TUTORIAL: Getting started with Bokeh, Bryan Van De Ven, Sarah Bird, PyDataLDN 2015 https://www.youtube.com/watch?v=XBiS0oBzX3o
    http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/index.ipynb#Tutorial
  73. SCIKIT-LEARN Machine Learning

  74. 74

  75. 75 Machine Learning
    Supervised learning
      Classification (categorical)
      Regression (quantitative)
      Linear regression, Logistic regression, SVM, Decision trees, k-NN
    Unsupervised learning
      Clustering: K-means, Hierarchical clustering, *Topic modeling
      Latent variables/structure: Dimensionality reduction, *Topic modeling
  76. 76 Machine Learning: Exploratory (Unsupervised learning) and Predictive (Supervised learning)

    Unsupervised learning: group similar individuals together

    id  gender  age  job_id
    1   F       67   1
    2   M       32   2
    3   M       45   1
    4   F       18   2

    Supervised learning

    id  gender  age  job_id  buy/click_ad  money_spent
    1   F       67   1       Yes           $1,000
    2   M       32   2       No            -
    3   M       45   1       No            -
    4   F       18   2       Yes           $300

    Classification (categorical): predict whether an individual is going to buy/click or not
    Regression (quantitative): predict how much the individual is going to spend
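
The three tasks on this slide can be sketched with scikit-learn on the slide's four example rows. The numeric encoding of gender and the choice to treat "-" in money_spent as 0 are assumptions made only to get runnable input. Four rows are far too few to learn anything meaningful, so this shows the API shape and nothing more.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features: gender (F=0, M=1, an assumed encoding), age, job_id.
X = np.array(
    [[0, 67, 1], [1, 32, 2], [1, 45, 1], [0, 18, 2]],
    dtype=float,
)
bought = np.array([1, 0, 0, 1])               # buy/click_ad: Yes=1, No=0
spent = np.array([1000.0, 0.0, 0.0, 300.0])   # "-" treated as 0 (assumption)

# Unsupervised: group similar individuals together.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised classification: will an individual buy/click or not?
clf = LogisticRegression().fit(X, bought)

# Supervised regression: how much is an individual going to spend?
reg = LinearRegression().fit(X, spent)

# A hypothetical new individual: F, age 50, job_id 1.
new_person = np.array([[0, 50, 1]], dtype=float)
print(clusters, clf.predict(new_person), reg.predict(new_person))
```
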
  77. 77 http://scikit-learn.org/

  78. 78

  79. 79 Learn more
    TUTORIAL: Scikit-learn Tutorial, Jake VanderPlas, PyData Seattle 2015 http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/
    TUTORIAL: Scikit-learn Tutorial, Andreas Mueller, SciPy 2015 https://github.com/amueller/scipy_2015_sklearn_tutorial
  80. THEANO + LASAGNE Deep learning

  81. 81 Inside the Artificial Neural Network: a visual and intuitive journey to understand how artificial neural networks store knowledge and how they make decisions (no code, no math included) http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07
  82. 82 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

  83. 83 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

  84. 84 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

  85. 85 http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/3%20-%20Convolutional%20Networks/Art%20Style%20Transfer.ipynb Art Style Transfer

  86. 86 Image Recognition http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/2%20-%20Lasagne%20Basics/Digit%20Recognizer.ipynb

  87. 87 Image Recognition http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/3%20-%20Convolutional%20Networks/Finetuning%20for%20Image%20Classification.ipynb

  88. 88 Learn more
    TUTORIAL: Neural networks with Theano and Lasagne, Eben Olson, PyDataNYC http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/
  89. WHAT TO DO NEXT?

  90. 90 My favorite blogs
    • http://blaze.pydata.org/
    • http://matthewrocklin.com/blog/
    • http://danielfrg.com/
    • https://jakevdp.github.io/
    • https://www.continuum.io/blog/developer-blog
    • http://nerds.airbnb.com/
    • http://blog.yhathq.com/
    • http://multithreaded.stitchfix.com/blog/
    • https://labs.spotify.com/
  91. 91 Conferences
    • General: PyCon/Europython
    • Scientific: SciPy/EuroSciPy
    • Data Science: PyData
    • Spain: PyConES, PySS, EP 2016 Bilbao
    http://www.pyvideo.org/
    https://www.youtube.com/user/PyDataTV
  92. 92 Find a local Python meetup near you or start

    one!
  93. Thank you! Christine Doig PyConES 2015 Nov 2015 Twitter: ch_doig

    Github: chdoig Site: chdoig.github.io Email: cdoig@continuum.io