Slide 1

Slide 1 text

scikit-learn: Machine Learning in Python
Paris.py, June 12, 2013
Wednesday, June 12, 13

Slide 2

Slide 2 text

Machine Learning == Building Executable Data Summaries

Slide 3

Slide 3 text

Supervised Machine Learning

Slide 4

Slide 4 text

Possible Applications
• Text Classification / Sequence Tagging NLP
• Spam Filtering, Sentiment Analysis...
• Computer Vision / Speech Recognition
• Learning To Rank - IR and advertisement
• Science: Statistical Analysis of the Brain, Astronomy, Biology, Social Sciences...

Slide 5

Slide 5 text

Spam Classification

X (word counts per email):
          word 1  word 2  word 3  word 4  word 5  word 6
email 1     0       2       1       0       0       1
email 2     0       0       0       1       1       0
email 3     1       1       0       0       0       1
email 4     1       0       1       1       2       3
email 5     0       0       1       1       0       0

y (Spam?), one label per email: 0 1 1 0 0
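A minimal sketch of how a count matrix like X on this slide can be built from raw text and used to train a spam classifier. The four toy emails, their labels, and the choice of `MultinomialNB` are illustrative assumptions, not taken from the talk; `CountVectorizer` is scikit-learn's bag-of-words extractor.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not from the slides): 1 = spam, 0 = not spam.
emails = [
    "win money now",
    "meeting at noon",
    "win win prize money",
    "lunch at noon",
]
y = [1, 0, 1, 0]

# Each row of X is one email, each column one vocabulary word,
# and each entry counts how often that word occurs in that email.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)  # sparse matrix, one row per email
vocabulary = sorted(vectorizer.vocabulary_)

# Fit a classifier on the counts (the model choice is an assumption).
clf = MultinomialNB().fit(X, y)

# New text must go through the same fitted vectorizer before predict.
X_new = vectorizer.transform(["win a prize"])
prediction = clf.predict(X_new)

print(vocabulary)
print(X.toarray())
print(prediction)
```

The vectorizer is fitted once on the training emails and then reused unchanged on new text, so that column j always refers to the same word.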

Slide 6

Slide 6 text

Topic Classification

X (word counts per news item):
         word 1  word 2  word 3  word 4  word 5  word 6
news 1     0       2       1       0       0       1
news 2     0       0       0       1       1       0
news 3     1       1       0       0       0       1
news 4     1       0       1       1       2       3
news 5     0       0       1       1       0       0

y (label indicators for Sport, Business, Tech.): 0 0 1 1 1 0 1 0 0 0 1 1 0 0 0
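With more than two classes, y becomes a matrix with one indicator column per topic. A small sketch of producing such a matrix with scikit-learn's `LabelBinarizer`, assuming each news item carries exactly one topic; the five labels below are made up for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical topic labels, one per news item (not from the slides).
topics = ["Sport", "Business", "Tech.", "Sport", "Business"]

binarizer = LabelBinarizer()
Y = binarizer.fit_transform(topics)  # one row per item, one column per class

# Columns follow the sorted class names stored in classes_.
print(binarizer.classes_)
print(Y)

# The indicator matrix can be mapped back to the original labels.
recovered = binarizer.inverse_transform(Y)
```

Most scikit-learn classifiers also accept the plain list of labels directly, so this step is only needed when an explicit indicator matrix is wanted.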

Slide 7

Slide 7 text

Sentiment Analysis

X (word counts per review):
           word 1  word 2  word 3  word 4  word 5  word 6
review 1     0       2       1       0       0       1
review 2     0       0       0       1       1       0
review 3     1       1       0       0       0       1
review 4     1       0       1       1       2       3
review 5     0       0       1       1       0       0

y (Positive?), one label per review: 0 1 1 0 0

Slide 8

Slide 8 text

Vegetation Cover Type

X (one row per location):
             Latitude  Altitude  Distance to closest river  Altitude closest river  Slope  Slope orientation
location 1   46.       200.      1                          0                       0.0    N
location 2   -30.      150.      2.                         149                     0.1    S
location 3   87.       50        1000                       10                      0.1    W
location 4   45.       10        10.                        1                       0.4    NW
location 5   5.        2.        67.                        1.                      0.2    E

y (label indicators for Rain forest, Grassland, Arid, Ice): 0 1 0 1 0 0 0 0 1 0 1 0 1 0 0
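The slope orientation column is categorical, so it has to be turned into numbers before an estimator can use it. One possible sketch uses scikit-learn's `DictVectorizer`, which passes numeric values through and one-hot encodes string values. The feature values are the ones on this slide; the dictionary key names are invented for the example.

```python
from sklearn.feature_extraction import DictVectorizer

# One dict per location; key names are hypothetical, values come from the slide.
locations = [
    {"latitude": 46.0, "altitude": 200.0, "river_distance": 1.0,
     "river_altitude": 0.0, "slope": 0.0, "orientation": "N"},
    {"latitude": -30.0, "altitude": 150.0, "river_distance": 2.0,
     "river_altitude": 149.0, "slope": 0.1, "orientation": "S"},
    {"latitude": 87.0, "altitude": 50.0, "river_distance": 1000.0,
     "river_altitude": 10.0, "slope": 0.1, "orientation": "W"},
    {"latitude": 45.0, "altitude": 10.0, "river_distance": 10.0,
     "river_altitude": 1.0, "slope": 0.4, "orientation": "NW"},
    {"latitude": 5.0, "altitude": 2.0, "river_distance": 67.0,
     "river_altitude": 1.0, "slope": 0.2, "orientation": "E"},
]

# Numeric features keep their value; each distinct string value of
# "orientation" becomes its own 0/1 column such as "orientation=NW".
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(locations)

# Column names, sorted alphabetically, then the resulting dense matrix.
print(sorted(vec.vocabulary_))
print(X)
```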

Slide 9

Slide 9 text

Object Classification in Images

X (SIFT visual-word counts per image):
          SIFT word 1  SIFT word 2  SIFT word 3  SIFT word 4  SIFT word 5  SIFT word 6
image 1       0            2            1            0            0            1
image 2       0            0            0            1            1            0
image 3       1            1            0            0            0            1
image 4       1            0            1            1            2            3
image 5       0            0            1            1            0            0

y (label indicators for Cat, Car, Pedestrian): 0 0 1 1 1 0 1 0 0 0 1 1 0 0 0

Slide 10

Slide 10 text

• Library for Machine Learning
• Open Source (BSD)
• Simple fit / predict / transform API
• Python / NumPy / SciPy / Cython
• Model Assessment, Selection & Ensembles
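The fit / predict / transform API mentioned above can be sketched as follows. The synthetic dataset and the choice of `StandardScaler` and `LogisticRegression` are assumptions made for the example; the import path `sklearn.model_selection` is the one used by current releases (in 2013 the splitting helpers lived in a different module).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 10 numeric features, 2 classes.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Transformers learn parameters with fit() and apply them with transform().
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse training statistics

# Predictors learn with fit(X, y) and output labels with predict(X).
clf = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)
accuracy = clf.score(X_test_scaled, y_test)
print(accuracy)
```

Every estimator follows the same convention, so swapping the scaler or the classifier for another one leaves the rest of the code unchanged.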

Slide 11

Slide 11 text

Supervised ML with sklearn

Slide 12

Slide 12 text

Credits: Andreas Mueller

Slide 13

Slide 13 text


Slide 14

Slide 14 text


Slide 15

Slide 15 text

Training a Model for Face Recognition
http://j.mp/face-recognition-notebook

Slide 16

Slide 16 text

Total dataset size:
n_samples: 1288, n_features: 1850, n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 0.466s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.056s
Fitting the SVM classifier to the training set
done in 18.549s
Predicting people's names on the test set
done in 0.062s

                   precision    recall  f1-score   support

     Ariel Sharon       0.90      0.75      0.82        12
     Colin Powell       0.78      0.94      0.85        62
  Donald Rumsfeld       0.86      0.72      0.78        25
    George W Bush       0.89      0.96      0.92       141
Gerhard Schroeder       0.92      0.74      0.82        31
      Hugo Chavez       0.90      0.53      0.67        17
       Tony Blair       0.81      0.74      0.77        34

      avg / total       0.86      0.86      0.86       322
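The log above follows four steps: extract a PCA basis, project the data onto it, fit an SVM, then predict and report per-class scores. The sketch below mirrors those steps on synthetic data, so it is not the notebook's code: the dataset, the number of components, and the default SVM settings are all assumptions. The 75% / 25% split matches the 966 training and 322 test faces reported above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the face data: 400 samples, 100 features, 3 classes.
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# PCA reduces the dimension (the "eigenfaces" step), then an RBF-kernel
# SVM classifies the projected samples.
model = make_pipeline(
    PCA(n_components=20, whiten=True, random_state=0),
    SVC(kernel="rbf"),
)
model.fit(X_train, y_train)

# Predict on held-out data and summarize precision, recall, and F1 per class.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```

Wrapping both steps in a pipeline means the PCA basis is learned from the training split only and then reapplied to the test split at prediction time.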

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Learned Eigen Faces
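Eigenfaces are the principal components of the training images, reshaped back to image dimensions so they can be displayed. A sketch on random data, assuming 50 x 37 pixel images (which gives the 1850 features reported earlier; the exact image shape is an assumption) and far fewer samples and components than in the talk:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for flattened grayscale images (assumed 50 x 37 pixels).
n_samples, h, w = 60, 50, 37
rng = np.random.RandomState(0)
X = rng.rand(n_samples, h * w)

# Learn an orthonormal basis of the 10 directions of largest variance.
n_components = 10
pca = PCA(n_components=n_components, whiten=True, random_state=0).fit(X)

# Each component has one weight per pixel, so it can be viewed as an image.
eigenfaces = pca.components_.reshape((n_components, h, w))

# Projecting the images gives the low-dimensional input for a classifier.
X_projected = pca.transform(X)
print(eigenfaces.shape, X_projected.shape)
```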

Slide 19

Slide 19 text

scikit-learn contributors
• GitHub-centric contribution workflow
• each pull request needs 2 x [+1] reviews
• code + tests + doc + example
• 92% test coverage / Continuous Integration
• 4 major releases per year + 4 bugfix releases
• 66 contributors for release 0.13

Slide 20

Slide 20 text

scikit-learn users
• We support users on & ML
• 200+ questions tagged with [scikit-learn]
• Many competitors + benchmarks
• 500+ answers on ongoing user survey
• 60% academics / 40% from industry
• Some data-driven startups use sklearn

Slide 21

Slide 21 text

Caveat Emptor
• Domain-specific tooling kept to a minimum
• Some feature extraction for Bag of Words Text Analysis
• Some functions for extracting image patches
• Domain integration is the responsibility of the user or 3rd party libraries
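As an illustration of the image patch helpers mentioned above, here is a short sketch using `extract_patches_2d` from `sklearn.feature_extraction.image`. The 6 x 6 toy image and the 3 x 3 patch size are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# Toy 6 x 6 grayscale "image" whose pixel values are 0..35.
image = np.arange(36).reshape(6, 6)

# Extract every 3 x 3 patch: (6 - 3 + 1) ** 2 = 16 overlapping patches.
patches = extract_patches_2d(image, (3, 3))

# Flatten each patch into a row so the result can be fed to an estimator.
X = patches.reshape(len(patches), -1)
print(patches.shape, X.shape)
```

Anything beyond this, such as loading image files or computing SIFT descriptors, is left to other libraries.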

Slide 22

Slide 22 text

Thank you!
• http://scikit-learn.org
• http://github.com/scikit-learn/scikit-learn
• http://github.com/ogrisel/notebooks
• @ogrisel on Twitter