Slide 1

Slide 1 text

Navigating the Data Science Python Ecosystem Christine Doig PyConES 2015 Nov 2015

Slide 2

Slide 2 text

2 Christine Doig is…
• a Data Scientist at Continuum Analytics
• originally from Barcelona
• living in Austin, Texas
• loving Python and Data Science
@ch_doig / @ContinuumIO chdoig.github.io

Slide 3

Slide 3 text

3 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM
1 What is Data Science?
2 What is the state of the Python ecosystem for Data Science?
3 How to get started?

Slide 4

Slide 4 text

4 NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM The guide I wish I had when I started doing Data Science with Python

Slide 5

Slide 5 text

DATA SCIENCE

Slide 6

Slide 6 text

6 From the lab to the factory - Data Day Texas
Slides: http://www.slideshare.net/joshwills/production-machine-learninginfrastructure
Video: https://www.youtube.com/watch?v=v-91JycaKjc

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

8 “Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer”

Slide 9

Slide 9 text

9 http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Slide 10

Slide 10 text

10 Data Scientist Job posting

Slide 11

Slide 11 text

11 Scientific Computing, Distributed Systems, BI / Analytics, Machine Learning / Statistics, Web

Slide 12

Slide 12 text

12 “Forget unicorns!” Data Science is a team sport!

Slide 13

Slide 13 text

13 Scientific Computing, Distributed Systems, Analytics, Machine Learning/Stats, Web
Data Scientists/Modeler, Data/Business Analyst, Research/Computational Scientist, Data Engineers/Architects, Developer
Model, Algorithm, Report, Application, Pipeline/Architecture

Slide 14

Slide 14 text

14 Scientific Computing, Distributed Systems, Analytics, Machine Learning/Stats, Web
Data Scientists/Modeler, Data/Business Analyst, Research/Computational Scientist, Data Engineers/Architects, Developer
PyMC, Numba, xlwings, Bokeh, Kafka, RDFLib, mrjob, Airflow

Slide 15

Slide 15 text

DATA SCIENCE PYTHON ECOSYSTEM

Slide 16

Slide 16 text

16 Why Python?
• General purpose language, easy and fun to learn
• Rich ecosystem of libraries: high quality and quantity. Growing at a rapid pace.
• Developer Community - Conferences: SciPy, PyData…
• Mature core Scientific Computing libraries (bindings to C/C++ or Fortran)
• Glue language
• Diverse users: SysAdmins, Web developers, Scientists, Statisticians… enables cross-team collaboration
• Analysis -> Production (vs R, Matlab…)
• R: "The best thing about R is that it was written by statisticians. The worst thing about R is that it was written by statisticians." Bo Cowgill
• Matlab: $$$, not open
http://nbviewer.ipython.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb

Slide 17

Slide 17 text

17

Slide 18

Slide 18 text

18 3 Talks about the DS Python ecosystem that inspired me

Slide 19

Slide 19 text

19 https://www.youtube.com/watch?v=5GlNDD7qbP4 Keynote: State of the Tools | SciPy 2015 | Jake VanderPlas https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote

Slide 20

Slide 20 text

20 http://nbviewer.ipython.org/github/twiecki/pydata_ninja/blob/master/PyData%20Ninja.ipynb

Slide 21

Slide 21 text

21 https://www.youtube.com/watch?v=RTiAMB2tQjo Rob Story: Python Data Bikeshed
I have data. It’s July 2015. I want to group things. Or count things. Or average things. Or add things.
What Python library do I use?

Slide 22

Slide 22 text

22 This talk

Slide 23

Slide 23 text

23 THIS TALK | RELATED / ALTERNATIVES
SETUP: Anaconda | Miniconda, Canopy, Python + pip…
IDE: Jupyter | Spyder, PyCharm, Atom, Rodeo, Sublime
DATA MUNGING: pandas | numpy, xray
DATA VISUALIZATION: Bokeh | matplotlib, seaborn, pyxley, plotly, lightning
Sat - 12:30 p.m. Introducción a visualizaciones interactivas con Bokeh (Introduction to interactive visualizations with Bokeh), Alejandro Vidal
Sat - 3:40 p.m. Data structures beyond dicts and lists, Sergi Sorribas

Slide 24

Slide 24 text

24 THIS TALK | RELATED / ALTERNATIVES
MACHINE LEARNING: scikit-learn | Orange, Pylearn…
DEEP LEARNING: Theano + Lasagne | Caffe, Keras, TensorFlow…
BIG DATA:
- dask, bcolz…
- Hadoop, Spark, Impala, Ibis…

Slide 25

Slide 25 text

25

Slide 26

Slide 26 text

26 THIS TALK | RELATED / ALTERNATIVES
MACHINE LEARNING: scikit-learn | Orange, Pylearn…
DEEP LEARNING: Theano + Lasagne | Caffe, Keras, TensorFlow…
BIG DATA:
- dask, bcolz
- Hadoop, Spark, Impala, Ibis
Medium Data and Distributed computing

Slide 27

Slide 27 text

27 THIS TALK | RELATED / ALTERNATIVES
MACHINE LEARNING: scikit-learn | Orange, Pylearn
DEEP LEARNING: Theano + Lasagne | Caffe, Keras, TensorFlow…
BIG DATA:
- dask, bcolz
- Hadoop, Spark, Impala, Ibis
Medium Data and Distributed computing
Sun - 11:50 a.m. Trolling Detection with Scikit-learn and NLTK, Rafa Haro
Sun - 12:30 p.m. Tratando datos más allá de los límites de la memoria (Handling data beyond the limits of memory), Francesc Alted

Slide 28

Slide 28 text

28 If I had more time…

Slide 29

Slide 29 text

29 THIS TALK / RELATED / ALTERNATIVES
WEB SCRAPING: beautifulsoup
WORKFLOW / PIPELINES: Airflow, Luigi
NLP: Spacy, Gensim, NLTK
STATISTICS: PyMC, PyMC3…
Sat - 3 p.m. Know your models - Statsmodels!, Israel Saeta Pérez and Miquel Camprodon
Sun - 1:10 p.m. Dive into Scrapy, Juan Riaza
Another time

Slide 30

Slide 30 text

30 THIS TALK
IMAGE
AUDIO: PyAudio
GRAPH: NetworkX
WEB FRAMEWORKS
Another time

Slide 31

Slide 31 text

NAVIGATING THE DATA SCIENCE PYTHON ECOSYSTEM

Slide 32

Slide 32 text

32 Bokeh Blaze dask

Slide 33

Slide 33 text

ANACONDA Setup

Slide 34

Slide 34 text

34 PYTHON NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 330+ packages conda

Slide 35

Slide 35 text

35 Conda
• Package and environment manager
• Language agnostic (Python, R, Java…)
• Cross-platform (Windows, OS X, Linux)
$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb

Slide 36

Slide 36 text

36 https://www.continuum.io/downloads Anaconda and conda are BSD licensed

Slide 37

Slide 37 text

37 Miniconda = Python + conda http://conda.pydata.org/miniconda.html

Slide 38

Slide 38 text

38 http://docs.continuum.io/anaconda/pkg-docs

Slide 39

Slide 39 text

39 Learn more
TALK: Reproducible Multi-language Data Science with Conda, PyData Dallas 2015 http://chdoig.github.io/pydata2015-dallas-conda
BLOGPOST: Conda for Data Science https://www.continuum.io/content/conda-data-science

Slide 40

Slide 40 text

JUPYTER IDE

Slide 41

Slide 41 text

41 http://jupyter.org/ Open source, interactive data science and scientific computing across over 40 programming languages.

Slide 42

Slide 42 text

42 The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
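As a rough illustration of what such a document looks like on disk, the sketch below builds a minimal notebook as plain JSON with only the standard library. It assumes the nbformat 4 layout (a top-level `cells` list plus `nbformat` fields); the cell contents and the helper name `build_notebook` are made up for this example.

```python
import json


def build_notebook():
    """Return a minimal notebook as a plain dict (assumed nbformat 4 layout)."""
    # Explanatory text and equations live in markdown cells.
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": ["# Explanatory text\n", "An equation: $e^{i\\pi} + 1 = 0$"],
    }
    # Live code lives in code cells; outputs are filled in when the cell runs.
    code_cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": ["print(1 + 1)"],
    }
    return {
        "cells": [markdown_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 0,
    }


notebook = build_notebook()
# Serialized form, i.e. what would be written to a hypothetical example.ipynb.
notebook_json = json.dumps(notebook, indent=1)
```

Saving `notebook_json` to a file with an `.ipynb` extension should give something the Notebook application can open, though real notebooks usually carry more metadata (kernel name, language info).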

Slide 43

Slide 43 text

43 IPython IPython notebook nbviewer tmpnb binder Jupyter https://try.jupyter.org/

Slide 44

Slide 44 text

44 Binder

Slide 45

Slide 45 text

45 http://mybinder.org/

Slide 46

Slide 46 text

46

Slide 47

Slide 47 text

47

Slide 48

Slide 48 text

48 Notebook -> Slides

Slide 49

Slide 49 text

49

Slide 50

Slide 50 text

50

Slide 51

Slide 51 text

51

Slide 52

Slide 52 text

52 $ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve

Slide 53

Slide 53 text

53

Slide 54

Slide 54 text

54 RISE

Slide 55

Slide 55 text

55 https://github.com/damianavila/RISE RISE: "Live" Reveal.js Jupyter/IPython Slideshow Extension A notebook rendered as a Reveal.js-based slideshow, where you can execute code or show to the audience whatever you can show/do inside the notebook itself

Slide 56

Slide 56 text

56 Learn more
BLOGPOST: Jupyter and conda for R, Christine Doig, Sep. 2015 https://www.continuum.io/blog/developer/jupyter-and-conda-r
TALK: RISE, Damian Avila, SciPy 2014 https://www.youtube.com/watch?v=sZBKruEh0jI

Slide 57

Slide 57 text

PANDAS Data munging

Slide 58

Slide 58 text

58 pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. http://pandas.pydata.org/

Slide 59

Slide 59 text

59 pandas.DataFrame Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). https://github.com/jreback/PyDataNYC2015
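A minimal sketch of what that definition means in practice. The rows are invented for illustration; only the column names echo the beer-review data used in the tutorial.

```python
import pandas as pd

# Hypothetical beer-review rows (made-up values).
df = pd.DataFrame(
    {
        "beer_style": ["IPA", "Stout", "Pilsner"],
        "abv": [6.5, 8.0, 4.7],
        "review_overall": [4.0, 4.5, 3.5],
    },
    index=["r1", "r2", "r3"],  # row labels
)

# Heterogeneous: each column keeps its own dtype.
print(df.dtypes)

# Size-mutable: a column can be added in place.
df["strong"] = df["abv"] > 6.0

# Labeled axes: select one value by row label and column label.
print(df.loc["r2", "beer_style"])
```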

Slide 60

Slide 60 text

60 https://github.com/jreback/PyDataNYC2015

Slide 61

Slide 61 text

61 I/O https://github.com/jreback/PyDataNYC2015

Slide 62

Slide 62 text

62 Data Exploration Get beers with abv (alcohol by volume) less than 5 and just return columns ‘beer_style’ and ‘review_overall’ https://github.com/jreback/PyDataNYC2015
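The query described on this slide can be sketched as follows. The sample rows are invented; the notebook in the linked repository works on a much larger review dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the beer-review data.
reviews = pd.DataFrame(
    {
        "beer_style": ["IPA", "Pilsner", "Stout", "Lager"],
        "abv": [6.5, 4.7, 8.0, 4.2],
        "review_overall": [4.0, 3.5, 4.5, 3.0],
    }
)

# The boolean mask picks the rows, the list picks the columns.
low_abv = reviews.loc[reviews["abv"] < 5, ["beer_style", "review_overall"]]
print(low_abv)
```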

Slide 63

Slide 63 text

63 Data Exploration Get beers with abv (alcohol by volume) less than 5 and time after June 2009, or with review overall rating higher than 4.5 https://github.com/jreback/PyDataNYC2015
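A sketch of the combined condition, again on invented rows. Reading "after June 2009" as "on or after 2009-07-01" is an assumption, as is the name of the `time` column's dtype handling.

```python
import pandas as pd

# Hypothetical sample; the "time" column is parsed into timestamps.
reviews = pd.DataFrame(
    {
        "beer_style": ["IPA", "Pilsner", "Stout", "Lager"],
        "abv": [6.5, 4.7, 8.0, 4.2],
        "review_overall": [4.0, 3.5, 5.0, 3.0],
        "time": pd.to_datetime(
            ["2009-08-01", "2009-09-15", "2008-01-10", "2009-03-02"]
        ),
    }
)

# Assumption: "after June 2009" means on or after 2009-07-01.
recent_and_light = (reviews["abv"] < 5) & (
    reviews["time"] >= pd.Timestamp("2009-07-01")
)
highly_rated = reviews["review_overall"] > 4.5

# Each comparison is parenthesized because & and | bind tighter than < and >.
selected = reviews[recent_and_light | highly_rated]
print(selected)
```

Naming the two masks before combining them keeps the "and ... or ..." structure of the sentence visible in the code.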

Slide 64

Slide 64 text

64 Groupby https://github.com/jreback/PyDataNYC2015
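A small split-apply-combine sketch on invented rows: reviews are split by style, aggregated per group, and combined into one table.

```python
import pandas as pd

# Hypothetical reviews (made-up values).
reviews = pd.DataFrame(
    {
        "beer_style": ["IPA", "IPA", "Stout", "Stout", "Pilsner"],
        "review_overall": [4.0, 5.0, 4.5, 3.5, 3.0],
    }
)

# Mean rating and number of reviews per beer style.
by_style = reviews.groupby("beer_style")["review_overall"].agg(["mean", "count"])
print(by_style)
```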

Slide 65

Slide 65 text

65 …and much more!
Missing values -> resample(fill_method=…)
Computational tools -> rolling_mean()
Timezone handling -> pd.date_range('20130101 09:00:00',periods=5,tz='US/Eastern')
Timeseries
Tidy data
melt() pivot() dropna() pipe() stack()
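A brief sketch touching a few of these features on made-up data. Note that `rolling_mean()` and `resample(fill_method=…)` are the spellings from the time of the talk; later pandas releases replaced them with `.rolling(...).mean()` and methods such as `.ffill()`, which is what this sketch assumes.

```python
import pandas as pd

# Time zone aware index, using the call shown on the slide.
idx = pd.date_range("20130101 09:00:00", periods=5, tz="US/Eastern")
series = pd.Series([1.0, 2.0, None, 4.0, 5.0], index=idx)

# Missing values: drop them, or forward-fill them.
dropped = series.dropna()
filled = series.ffill()

# Rolling mean over two consecutive observations.
rolling = filled.rolling(window=2).mean()

# Tidy data: melt a wide table into long form, then pivot it back.
wide = pd.DataFrame(
    {"beer": ["A", "B"], "taste": [4.0, 3.0], "aroma": [5.0, 2.0]}
)
long_form = wide.melt(id_vars="beer", var_name="aspect", value_name="score")
back = long_form.pivot(index="beer", columns="aspect", values="score")
print(back)
```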

Slide 66

Slide 66 text

66 Learn more
TUTORIAL: Performance Pandas, Jeff Reback, PyDataNYC 2015 https://github.com/jreback/PyDataNYC2015/
BLOGPOST: Analyzing Pronto CycleShare Data with Python and Pandas, Jake VanderPlas https://jakevdp.github.io/blog/2015/10/17/analyzing-pronto-cycleshare-data-with-python-and-pandas/

Slide 67

Slide 67 text

BOKEH Data visualization

Slide 68

Slide 68 text

68 matplotlib seaborn bokeh pyxley lightning Visualization libraries in Python plotly

Slide 69

Slide 69 text

69 Bokeh: Custom visualizations, Dashboards, Streaming/Animations, Charts, Tools, Widgets, Maps, Hover

Slide 70

Slide 70 text

70

Slide 71

Slide 71 text

71

Slide 72

Slide 72 text

72 Learn more
TUTORIAL: Getting started with Bokeh, Bryan Van De Ven, Sarah Bird, PyDataLDN 2015 https://www.youtube.com/watch?v=XBiS0oBzX3o
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/index.ipynb#Tutorial

Slide 73

Slide 73 text

SCIKIT-LEARN Machine Learning

Slide 74

Slide 74 text

74

Slide 75

Slide 75 text

75 Machine Learning
Supervised learning: Classification (categorical), Regression (quantitative)
Unsupervised learning: Clustering, Latent variables/structure
Linear regression, Logistic regression, SVM, Decision trees, k-NN, K-means, Hierarchical clustering, *Topic modeling, Dimensionality reduction, *Topic modeling

Slide 76

Slide 76 text

76 Machine Learning
Unsupervised learning (Exploratory)
id  gender  age  job_id
1   F       67   1
2   M       32   2
3   M       45   1
4   F       18   2
group similar individuals together
Supervised learning (Predictive)
id  gender  age  job_id  buy/click_ad  money_spent
1   F       67   1       Yes           $1,000
2   M       32   2       No            -
3   M       45   1       No            -
4   F       18   2       Yes           $300
Classification (categorical): predict whether an individual is going to buy/click or not
Regression (quantitative): predict how much the individual is going to spend
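The three tasks on this slide can be sketched with scikit-learn on the same four rows. The numeric encoding of gender, the choice of two clusters, treating "-" as 0 spent, and the particular estimators are all assumptions made for illustration; four rows are far too few for a meaningful model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features from the slide's table: gender (F=0, M=1, arbitrary encoding), age, job_id.
X = np.array([[0, 67, 1], [1, 32, 2], [1, 45, 1], [0, 18, 2]], dtype=float)

# Unsupervised (clustering): group similar individuals; two clusters is an assumption.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised classification: buy/click_ad encoded as Yes=1, No=0.
bought = np.array([1, 0, 0, 1])
classifier = LogisticRegression().fit(X, bought)
predicted_buy = classifier.predict(X)

# Supervised regression: money_spent, with "-" treated as 0 (an assumption).
spent = np.array([1000.0, 0.0, 0.0, 300.0])
regressor = LinearRegression().fit(X, spent)
predicted_spend = regressor.predict(X)

print(clusters, predicted_buy, predicted_spend)
```

All three estimators share the same `fit` / `predict` pattern, which is a large part of why switching between models in scikit-learn is cheap.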

Slide 77

Slide 77 text

77 http://scikit-learn.org/

Slide 78

Slide 78 text

78

Slide 79

Slide 79 text

79 Learn more
TUTORIAL: Scikit-learn Tutorial, Jake VanderPlas, PyData Seattle 2015 http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/
TUTORIAL: Scikit-learn Tutorial, Andreas Mueller, SciPy 2015 https://github.com/amueller/scipy_2015_sklearn_tutorial

Slide 80

Slide 80 text

THEANO + LASAGNE Deep learning

Slide 81

Slide 81 text

81 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07 Inside the Artificial Neural Network: A visual and intuitive journey to understand how artificial neural networks store knowledge and how they make decisions (no code, no math included)

Slide 82

Slide 82 text

82 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

Slide 83

Slide 83 text

83 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

Slide 84

Slide 84 text

84 http://www.slideshare.net/XavierArrufat/20141120-python-bcninsideannsrev07

Slide 85

Slide 85 text

85 http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/3%20-%20Convolutional%20Networks/Art%20Style%20Transfer.ipynb Art Style Transfer

Slide 86

Slide 86 text

86 Image Recognition http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/2%20-%20Lasagne%20Basics/Digit%20Recognizer.ipynb

Slide 87

Slide 87 text

87 Image Recognition http://nbviewer.ipython.org/github/ebenolson/pydata2015/blob/master/3%20-%20Convolutional%20Networks/Finetuning%20for%20Image%20Classification.ipynb

Slide 88

Slide 88 text

88 Learn more
TUTORIAL: Neural networks with Theano and Lasagne, Eben Olson, PyDataNYC http://nbviewer.ipython.org/github/ebenolson/pydata2015/tree/master/

Slide 89

Slide 89 text

WHAT TO DO NEXT?

Slide 90

Slide 90 text

90 My favorite blogs
• http://blaze.pydata.org/
• http://matthewrocklin.com/blog/
• http://danielfrg.com/
• https://jakevdp.github.io/
• https://www.continuum.io/blog/developer-blog
• http://nerds.airbnb.com/
• http://blog.yhathq.com/
• http://multithreaded.stitchfix.com/blog/
• https://labs.spotify.com/

Slide 91

Slide 91 text

91 Conferences
• General: PyCon/EuroPython
• Scientific: SciPy/EuroSciPy
• Data Science: PyData
• Spain: PyConES, PySS, EP 2016 Bilbao
http://www.pyvideo.org/ https://www.youtube.com/user/PyDataTV

Slide 92

Slide 92 text

92 Find a local Python meetup near you or start one!

Slide 93

Slide 93 text

Thank you! Christine Doig PyConES 2015 Nov 2015 Twitter: ch_doig Github: chdoig Site: chdoig.github.io Email: cdoig@continuum.io