Data Science Python Ecosystem

Navigating the Data Science Python Ecosystem talk at PyConES 2015

Christine Doig

November 21, 2015

Transcript

  1. 2 Christine Doig is… a Data Scientist at Continuum Analytics

    originally from Barcelona, living in Austin, Texas, loving Python and Data Science. @ch_doig / @ContinuumIO chdoig.github.io
  2. 3 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM 1 What is

    Data Science? 2 What is the state of the Python ecosystem for Data Science? 3 How to get started?
  3. 4 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM The guide I

    wish I had when I started doing Data Science with Python
  4. 6 From the lab to the factory - Data Day Texas

    Slides: http://www.slideshare.net/joshwills/production-machine-learninginfrastructure Video: https://www.youtube.com/watch?v=v-91JycaKjc
  5. 7

  6. 8 “Data Scientist (n.): Person who is worse at

    statistics than any statistician and worse at software engineering than any software engineer”
  7. 13 Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data

    Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Developer Model Algorithm Report Application Pipeline/ Architecture
  8. 14 Scientific Computing Distributed Systems Analytics Machine Learning/Stats Web Data

    Scientists/ Modeler Data/Business Analyst Research/Computational Scientist Data Engineers/ Architects Developer PyMC Numba xlwings Bokeh Kafka RDFLib mrjob Airflow
  9. 16 Why Python? • General purpose language, easy and fun

    to learn
    • Rich ecosystem of libraries: high quality and quantity. Growing at a rapid pace.
    • Developer community - Conferences: SciPy, PyData…
    • Mature core Scientific Computing libraries (bindings to C/C++ or Fortran)
    • Glue language
    • Diverse users: SysAdmins, Web developers, Scientists, Statisticians… enables cross-team collaboration
    • Analysis -> Production (vs R, Matlab…)
    • R: "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." Bo Cowgill
    • Matlab: $$$, not open
    http://nbviewer.ipython.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb
  10. 17

  11. 19 https://www.youtube.com/watch?v=5GlNDD7qbP4 Keynote: State of the Tools | SciPy 2015

    | Jake VanderPlas https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote
  12. 21 https://www.youtube.com/watch?v=RTiAMB2tQjo Rob Story: Python Data Bikeshed I have data.

    It’s July 2015. I want to group things. Or count things. Or average things. Or add things. What Python library do I use?
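    One possible answer to the question on this slide, among the many libraries the talk goes on to list, is pandas. The following is a minimal sketch only; the `city` and `sales` columns and their values are hypothetical and do not come from the talk.

```python
import pandas as pd

# Hypothetical toy data, not from the talk.
df = pd.DataFrame({
    "city": ["Austin", "Austin", "Barcelona", "Barcelona", "Barcelona"],
    "sales": [10, 20, 5, 15, 40],
})

grouped = df.groupby("city")["sales"]   # group things
counts = grouped.count()                # count things
means = grouped.mean()                  # average things
totals = grouped.sum()                  # add things
```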
  13. 23 THIS TALK RELATED / ALTERNATIVES SETUP Miniconda, Canopy, Python

    + pip… IDE Spyder, PyCharm, Atom, Rodeo, Sublime DATA MUNGING numpy, xray DATA VISUALIZATION matplotlib, seaborn, pyxley, plotly, lightning Bokeh Sat - 12:30 p.m. Introducción a visualizaciones interactivas con Bokeh (Introduction to interactive visualizations with Bokeh), Alejandro Vidal Sat - 3:40 p.m. Data structures beyond dicts and lists, Sergi Sorribas
  14. 24 THIS TALK RELATED / ALTERNATIVES MACHINE LEARNING Orange, Pylearn..

    DEEP LEARNING Caffe, Keras, TensorFlow… BIG DATA - dask, bcolz… - Hadoop, Spark, Impala, Ibis… + Lasagne +
  15. 25

  16. 26 THIS TALK RELATED / ALTERNATIVES MACHINE LEARNING Orange, Pylearn…

    DEEP LEARNING Caffe, Keras, TensorFlow… BIG DATA - dask, bcolz - Hadoop, Spark, Impala, Ibis Medium Data and Distributed computing + Lasagne +
  17. 27 THIS TALK RELATED / ALTERNATIVES MACHINE LEARNING Orange, Pylearn,

    DEEP LEARNING Caffe, Keras, TensorFlow… BIG DATA - dask, bcolz - Hadoop, Spark, Impala, Ibis Sun - 11:50 a.m. Trolling Detection with Scikit-learn and NLTK, Rafa Haro Sun - 12:30 p.m. Tratando datos más allá de los límites de la memoria (Handling data beyond the limits of memory), Francesc Alted Medium Data and Distributed computing + Lasagne +
  18. 29 THIS TALK RELATED / ALTERNATIVES WEB SCRAPING beautifulsoup WORKFLOW

    / PIPELINES Airflow, Luigi NLP Spacy, Gensim, NLTK STATISTICS PyMC, PyMC3… Sat - 3 p.m. Know your models - Statsmodels!, Israel Saeta Pérez and Miquel Camprodon Sun - 1:10 p.m. Dive into Scrapy, Juan Riaza Another time
  19. 34 PYTHON NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba,

    Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 330+ packages conda
  20. 35 Conda • Package and environment manager • Language agnostic

    (Python, R, Java…) • Cross-platform (Windows, OS X, Linux)
    $ conda install python=2.7
    $ conda install pandas
    $ conda install -c r r
    $ conda install mongodb
  21. 39 Learn more TALK: Reproducible Multi-language Data Science with

    Conda, PyData Dallas 2015 http://chdoig.github.io/pydata2015-dallas-conda BLOGPOST: Conda for Data Science https://www.continuum.io/content/conda-data-science
  22. 42 The Jupyter Notebook is a web application that allows

    you to create and share documents that contain live code, equations, visualizations and explanatory text.
  23. 46

  24. 47

  25. 49

  26. 50

  27. 51

  28. 53

  29. 55 https://github.com/damianavila/RISE RISE: "Live" Reveal.js Jupyter/IPython Slideshow Extension A notebook

    rendered as a Reveal.js-based slideshow, where you can execute code or show to the audience whatever you can show/do inside the notebook itself
  30. 56 Learn more BLOGPOST: Jupyter and conda for R, Christine

    Doig, Sep. 2015 https://www.continuum.io/blog/developer/jupyter-and-conda-r TALK: RISE, Damian Avila, SciPy 2014 https://www.youtube.com/watch?v=sZBKruEh0jI
  31. 58 pandas is an open source, BSD-licensed library providing high-

    performance, easy-to-use data structures and data analysis tools for the Python programming language. http://pandas.pydata.org/
  32. 59 pandas.DataFrame Two-dimensional size-mutable, potentially heterogeneous tabular data structure with

    labeled axes (rows and columns). https://github.com/jreback/PyDataNYC2015
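    A minimal sketch of what that definition means in practice. The column names echo the beer reviews data referenced on the following slides, but the values and the row labels `a` to `d` are made up for illustration.

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame(
    {"beer_style": ["IPA", "Stout", "Pilsner"], "abv": [6.5, 8.0, 4.7]},
    index=["a", "b", "c"],  # labeled rows; the dict keys become labeled columns
)

# Potentially heterogeneous: each column carries its own dtype.
dtypes = df.dtypes

# Size-mutable: add a column, then add a row.
df["review_overall"] = [4.0, 4.5, 3.5]
df.loc["d"] = ["Lager", 4.2, 3.0]

shape = df.shape
```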
  33. 62 Data Exploration Get beers with abv (alcohol by volume)

    less than 5 and just return columns ‘beer_style’ and ‘review_overall’ https://github.com/jreback/PyDataNYC2015
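    The slide's code is not part of this transcript, so here is a sketch of how such a query can be written. The tiny `reviews` frame is a hypothetical stand-in for the tutorial's dataset, and the column name `abv` is an assumption; `beer_style` and `review_overall` are the names given on the slide.

```python
import pandas as pd

# Hypothetical stand-in for the beer reviews data; values are made up.
reviews = pd.DataFrame({
    "beer_style": ["IPA", "Pilsner", "Stout", "Witbier"],
    "abv": [6.5, 4.7, 8.0, 4.9],          # column name is an assumption
    "review_overall": [4.0, 3.5, 4.5, 4.0],
})

# The boolean mask picks the rows; the list picks the columns.
low_abv = reviews.loc[reviews["abv"] < 5, ["beer_style", "review_overall"]]
```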
  34. 63 Data Exploration Get beers with abv (alcohol by volume)

    less than 5 and time after June 2009, or with review overall rating higher than 4.5 https://github.com/jreback/PyDataNYC2015
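    A sketch of the combined condition, reading it as "(abv < 5 and time after June 2009) or (review overall > 4.5)"; that grouping is an interpretation of the slide text. The data and the column names `abv` and `time` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the beer reviews data; values are made up.
reviews = pd.DataFrame({
    "beer_style": ["IPA", "Pilsner", "Stout", "Witbier"],
    "abv": [6.5, 4.7, 8.0, 4.9],
    "review_overall": [4.0, 3.5, 5.0, 4.0],
    "time": pd.to_datetime(
        ["2009-05-01", "2009-08-15", "2008-01-10", "2009-03-20"]
    ),
})

# "After June 2009" is taken to mean on or after 1 July 2009.
recent_low_abv = (reviews["abv"] < 5) & (reviews["time"] >= pd.Timestamp("2009-07-01"))
top_rated = reviews["review_overall"] > 4.5

# Parentheses matter: & and | bind tighter than comparisons in Python.
selected = reviews[recent_low_abv | top_rated]
```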
  35. 65 …and much more!

    Missing values -> resample(fill_method=…)
    Computational tools -> rolling_mean()
    Timezone handling -> pd.date_range('20130101 09:00:00', periods=5, tz='US/Eastern')
    Timeseries
    Tidy data
    melt() pivot() dropna() pipe() stack()
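    The calls on this slide reflect pandas as of 2015. In current pandas releases, `rolling_mean()` and `resample(fill_method=…)` are written as method chains instead. The sketch below uses those newer spellings with made-up data; only the `date_range` call is taken from the slide.

```python
import numpy as np
import pandas as pd

# Timezone handling: the date_range call from the slide.
idx = pd.date_range("20130101 09:00:00", periods=5, tz="US/Eastern")
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=idx)

# Missing values: forward-fill them, or drop them.
filled = ts.ffill()
dropped = ts.dropna()

# Computational tools: rolling mean over a 2-observation window
# (the successor of rolling_mean()).
rolling = filled.rolling(window=2).mean()

# Upsampling daily data to 12-hour steps with forward fill
# (the successor of resample(fill_method=...)).
daily = pd.Series([1.0, 2.0], index=pd.date_range("2013-01-01", periods=2, freq="D"))
upsampled = daily.resample("12h").ffill()

# Tidy data: wide -> long with melt(), and back with pivot().
wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})
long = wide.melt(id_vars="id", var_name="variable", value_name="value")
back = long.pivot(index="id", columns="variable", values="value")
```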
  36. 66 Learn more TUTORIAL: Performance Pandas, Jeff Reback, PyDataNYC 2015

    https://github.com/jreback/PyDataNYC2015/ BLOGPOST: Analyzing Pronto CycleShare Data with Python and Pandas, Jake VanderPlas https://jakevdp.github.io/blog/2015/10/17/analyzing-pronto-cycleshare-data-with-python-and-pandas/
  37. 70

  38. 71

  39. 72 Learn more TUTORIAL: Getting started with Bokeh, Bryan Van

    De Ven, Sarah Bird PyDataLDN 2015 https://www.youtube.com/watch?v=XBiS0oBzX3o http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/index.ipynb#Tutorial
  40. 74

  41. 75 Machine Learning Unsupervised learning Supervised learning Classification Regression Clustering

    Latent variables/structure categorical quantitative Linear regression Logistic regression SVM Decision trees k-NN K-means Hierarchical clustering *Topic modeling Dimensionality reduction *Topic modeling
  42. 76 Machine Learning: Unsupervised learning (exploratory), Supervised learning (predictive)

    Unsupervised learning: group similar individuals together

    id  gender  age  job_id
    1   F       67   1
    2   M       32   2
    3   M       45   1
    4   F       18   2

    Supervised learning:

    id  gender  age  job_id  buy/click_ad  money_spent
    1   F       67   1       Yes           $1,000
    2   M       32   2       No            -
    3   M       45   1       No            -
    4   F       18   2       Yes           $300

    Classification (categorical): predict whether an individual is going to buy/click or not
    Regression (quantitative): predict how much the individual is going to spend
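    The three tasks on this slide can be sketched with scikit-learn on the same four-row table. This is an illustration, not code from the talk: the choice of `KMeans`, `LogisticRegression` and `LinearRegression` is one option among the algorithms named on the previous slide, the "-" entries in `money_spent` are treated as 0 (an assumption), and four rows are far too few for a meaningful model.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# The toy table from the slide.
people = pd.DataFrame({
    "gender": ["F", "M", "M", "F"],
    "age": [67, 32, 45, 18],
    "job_id": [1, 2, 1, 2],
    "buy": ["Yes", "No", "No", "Yes"],
    "money_spent": [1000, 0, 0, 300],  # "-" on the slide treated as 0 (assumption)
})

# One-hot encode the categorical column so every feature is numeric.
X = pd.get_dummies(
    people[["gender", "age", "job_id"]], columns=["gender"]
).astype(float)

# Unsupervised learning (clustering): group similar individuals together.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised learning (classification): predict buy/click, a categorical target.
clf = LogisticRegression().fit(X, people["buy"])
buy_pred = clf.predict(X)

# Supervised learning (regression): predict money spent, a quantitative target.
reg = LinearRegression().fit(X, people["money_spent"])
spend_pred = reg.predict(X)
```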
  43. 78

  44. 79 Learn more TUTORIAL: Scikit-learn Tutorial, Jake VanderPlas, PyData Seattle

    2015 http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/ TUTORIAL: Scikit-learn Tutorial, Andreas Mueller, SciPy 2015 https://github.com/amueller/scipy_2015_sklearn_tutorial
  45. 81 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07 Inside the Artificial Neural Network: A visual

    and intuitive journey to understand how artificial neural networks store knowledge and how they make decisions (no code, no math included)
  46. 88 Learn more TUTORIAL: Neural networks with Theano and Lasagne

    Eben Olson, PyDataNYC http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/
  47. My favorite blogs 90 • http://blaze.pydata.org/ • http://matthewrocklin.com/blog/ • http://danielfrg.com/

    • https://jakevdp.github.io/ • https://www.continuum.io/blog/developer-blog • http://nerds.airbnb.com/ • http://blog.yhathq.com/ • http://multithreaded.stitchfix.com/blog/ • https://labs.spotify.com/
  48. Conferences 91 • General: PyCon/Europython • Scientific: SciPy/EuroSciPy • Data

    Science: PyData • Spain: PyConES, PySS, EP 2016 Bilbao http://www.pyvideo.org/ https://www.youtube.com/user/PyDataTV