Slide 1

Machine Learning in Python with scikit-learn
DataJob 2013, Paris

Slide 2

Outline

• Predictive Modeling & scikit-learn
• Application to “Learning to Rank” for web search
• Forests of Randomized Trees
• The Python ecosystem: IPython, StarCluster

Slide 3

Predictive Modeling Data Flow

[Diagram: training data: text docs, images, sounds, transactions]

Slide 4

Predictive Modeling Data Flow

[Diagram: training data (text docs, images, sounds, transactions) plus labels]

Slide 5

Predictive Modeling Data Flow

[Diagram: training data and labels are turned into feature vectors and fed to a machine learning algorithm]

Slide 6

Predictive Modeling Data Flow

[Diagram: training data and labels -> feature vectors -> machine learning algorithm -> model]

Slide 7

Predictive Modeling Data Flow

[Diagram: full pipeline: training data and labels -> feature vectors -> machine learning algorithm -> model; a new item (text doc, image, sound, transaction) -> feature vector -> model -> expected label]

Slide 8

Predictive Modeling Data Flow

[Diagram: prediction path only: a new item (text doc, image, sound, transaction) -> feature vector -> model -> expected label]

Slide 9

Possible Applications

• Text Classification / Sequence Tagging (NLP)
• Spam Filtering, Sentiment Analysis...
• Computer Vision / Speech Recognition
• Learning to Rank: IR and advertisement
• Science: statistical analysis of the brain, astronomy, biology, social sciences...

Slide 10

scikit-learn

• Library for Machine Learning
• Open Source (BSD)
• Simple fit / predict / transform API
• Python / NumPy / SciPy / Cython
• Model Assessment, Selection & Ensembles
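The fit / predict / transform API mentioned above can be sketched on toy data (the dataset and parameter values here are illustrative, not from the talk):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative toy data: 100 samples, 5 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# transform: estimators that re-represent the data
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)   # zero mean, unit variance per feature

# fit / predict: estimators that learn a decision function
model = SVC(kernel="rbf", C=1.0).fit(X_scaled, y)
y_pred = model.predict(X_scaled)
```

The same three verbs cover nearly every estimator in the library, which is what makes models easy to swap.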

Slide 11

Support Vector Machine

from sklearn.svm import SVC

model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
model.fit(X_train, y_train)

from sklearn.metrics import f1_score

y_predicted = model.predict(X_test)
f1_score(y_test, y_predicted)

Slide 12

Linear Classifier

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
model.fit(X_train, y_train)

from sklearn.metrics import f1_score

y_predicted = model.predict(X_test)
f1_score(y_test, y_predicted)

Slide 13

Random Forests

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)

from sklearn.metrics import f1_score

y_predicted = model.predict(X_test)
f1_score(y_test, y_predicted)

Slide 14

No content

Slide 15

scikit-learn contributors

• GitHub-centric contribution workflow
• each pull request needs 2 x [+1] reviews
• code + tests + doc + example
• 92% test coverage / Continuous Integration
• 2-3 major releases per year + bug-fix releases
• 80+ contributors for release 0.14

Slide 16

scikit-learn users

• We support users on & ML
• 650+ questions tagged with [scikit-learn]
• Many competitors + benchmarks
• 500+ answers on 0.13 release user survey
• 60% academics / 40% from industry
• Some data-driven startups use sklearn

Slide 17

Example: Learning to Rank

Slide 18

Example: Learning to Rank

• Learning to rank web search results
• Input: numerical descriptors for query / result pairs
• Target: relevance score
  • 0: irrelevant
  • 1: somewhat relevant
  • 2: relevant

Slide 19

Input Features

• Result page descriptors:
  • PageRank, Click Through Rate, last update time…
• Query / Result page descriptors:
  • BM25, TF*IDF cosine similarity
  • Ratio of covered query terms
• User context descriptors: past user interactions (clicks, +1), time of the day, day of the month, month of the year, user language
• … typically more than 40 descriptors and up to several hundred

Slide 20

Quantifying Success

• Measure the discrepancy between predicted and true relevance scores
• Traditional regression metrics:
  • Mean Absolute Error
  • Explained Variance
• But ranking quality matters more than the predicted scores…
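As a concrete example of such a regression metric, Mean Absolute Error on a handful of hypothetical relevance scores (the values are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted relevance levels for 4 query/result pairs
y_true = np.array([2, 0, 1, 2])
y_pred = np.array([1, 0, 2, 2])

# Mean of |y_true - y_pred| over all pairs
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 0) / 4 = 0.5
```

Note that the metric is indifferent to ordering: swapping two equally wrong predictions leaves it unchanged, which is why a ranking metric is needed.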

Slide 21

NDCG: a ranking metric

Slide 22

NDCG in Greek

DCG_k == Discounted Cumulative Gain at rank k; NDCG_k is DCG_k divided by the ideal (best achievable) DCG_k, so a perfect ranking scores 1.0.

Slide 23

Data from Microsoft Bing

• http://research.microsoft.com/en-us/projects/mslr
• 10K or 30K anonymized queries (terms and result URLs)
• The 10K-query dataset:
  • ~1.2M search results
  • 136 descriptors
  • 5 target relevance levels
  • ~650MB in NumPy

Slide 24

Disclaimer: this is not Big Data

• A couple of GB: fits in RAM on my laptop
• But painful to download / upload over the internet
• Processing and modeling can be CPU intensive (and sometimes distributed)

Slide 25

Growing randomized trees

Slide 26

Training a Decision Tree

Slide 27

Training a Decision Tree

[Diagram: root node testing Term Match Rate]

Slide 28

Training a Decision Tree

[Diagram: root node testing Term Match Rate, with branches < 0.2 and > 0.2]

Slide 29

Training a Decision Tree

[Diagram: the Term Match Rate < 0.2 branch ends in a leaf predicting Score == 0]

Slide 30

Training a Decision Tree

[Diagram: the Term Match Rate > 0.2 branch leads to a node testing PageRank, with branches < 3 and > 3]

Slide 31

Training a Decision Tree

[Diagram: the PageRank < 3 branch ends in a leaf predicting Score == 1]

Slide 32

Training a Decision Tree

[Diagram: the PageRank > 3 branch ends in a leaf predicting Score == 2]

Slide 33

Training a Randomized Tree

• Pick a random subset of features (e.g. TFIDF, BM25, PageRank, CTR…)
• Find the feature that best splits the dataset
• Randomize the split threshold between the observed min and max values
• Send each half of the split dataset down to build the 2 subtrees
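The split procedure above might be sketched as follows. This is a simplified, illustrative version: the function name, the number of candidate features, and the variance-reduction criterion are assumptions, not scikit-learn's actual implementation.

```python
import numpy as np

def random_split(X, y, n_candidate_features=3, rng=None):
    """One extremely-randomized split over a random subset of features.

    Draws a random threshold between the observed min and max of each
    candidate feature, and keeps the candidate that best reduces the
    variance of the target y.
    """
    if rng is None:
        rng = np.random.RandomState(0)
    n_features = X.shape[1]
    candidates = rng.choice(n_features, size=n_candidate_features, replace=False)

    best = None
    for f in candidates:
        lo, hi = X[:, f].min(), X[:, f].max()
        threshold = rng.uniform(lo, hi)   # randomized split threshold
        left = X[:, f] < threshold
        if left.all() or (~left).all():
            continue                      # degenerate split, skip
        # Weighted variance of the two halves (lower is better)
        score = left.mean() * y[left].var() + (~left).mean() * y[~left].var()
        if best is None or score < best[0]:
            best = (score, f, threshold)
    return best  # (score, feature index, threshold), or None
```

Applied recursively to each half of the data, this grows one randomized tree.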

Slide 34

Training a Forest

• Train n random trees independently
• Use different PRNG seeds
• At prediction time, make each tree predict its best guess, then:
  • make them vote (classification)
  • average the predicted values (regression)
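The recipe above can be sketched by hand with scikit-learn's single-tree estimator (toy data, made up for illustration; in practice you would simply use ExtraTreesRegressor or RandomForestRegressor, which do this internally):

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

# Toy regression data
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] * 2 + rng.randn(200) * 0.1

# Train n trees independently, each with its own PRNG seed
trees = [ExtraTreeRegressor(random_state=seed).fit(X, y) for seed in range(10)]

# Regression: average the per-tree predictions
y_forest = np.mean([t.predict(X) for t in trees], axis=0)
```

Because the trees only differ by their seed, they can be trained in parallel with no communication, which is what makes the cluster setup later in the talk so simple.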

Slide 35

Extra Trees: one node with 8 CPUs

[Benchmark chart not preserved in the transcript]

Slide 36

Growing randomized trees on the cloud

Slide 37

A 10 x 8-core cluster on EC2 in 20 minutes

Slide 38

No content

Slide 39

>>> Configuring cluster took 12.865 mins
>>> Starting cluster took 20.144 mins

Slide 40

IPython

• Notebook interface: in-browser, interactive data exploration environment
• IPython.parallel: asynchronous, load-balancing API for interactively dispatching processing tasks
• Based on ZeroMQ and msgpack for IPC

Slide 41

No content

Slide 42

Grow random trees in parallel in the cloud

Slide 43

Fetch back all the trees as a big forest on one node

Slide 44

No content

Slide 45

Demo http://j.mp/pyrallel-mslr

Slide 46

Results

• NDCG@5: ~0.52 for 500 trees on MSLR-WEB10K
• Could maybe be improved by:
  • increasing the number of trees (but the model gets too big in memory and slower at prediction time)
  • replacing the base trees with bagged GBRT models
  • pair-wise or list-wise ranking models (not in sklearn)
• Linear regression baseline: NDCG@5: ~0.43

Slide 47

Your turn now!

Slide 48

No content

Slide 49

Thank you!

• http://ipython.org
• http://scikit-learn.org
• http://star.mit.edu/cluster
• https://github.com/pydata/pyrallel
• http://github.com/ogrisel/notebooks

@ogrisel

Slide 50

Backup slides

Slide 51

No content

Slide 52

Caveat Emptor

• Domain-specific tooling is kept to a minimum:
  • some feature extraction for bag-of-words text analysis
  • some functions for extracting image patches
• Domain integration is the responsibility of the user or of 3rd-party libraries
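For instance, the bag-of-words feature extraction mentioned above (the document strings are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning in python",
        "learning to rank web search results"]

# Build the vocabulary and the sparse document-term count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # shape: (n_documents, n_unique_terms)
```

Anything more domain-specific, such as crawling or HTML parsing, is left to other libraries.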

Slide 53

Loading the data with Scikit-learn
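This backup slide's code is not in the transcript. The MSLR files are in the svmlight / libsvm text format with a qid field per query; a round-trip sketch on a tiny synthetic file (the file path and data are made up):

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny synthetic stand-in for an MSLR file: 3 results, 2 features
X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
y = np.array([2.0, 0.0, 1.0])   # relevance levels
qid = np.array([1, 1, 2])       # query ids

path = os.path.join(tempfile.mkdtemp(), "tiny_mslr.txt")
dump_svmlight_file(X, y, path, query_id=qid)

# query_id=True also returns the qid column, needed to group results per query
X_loaded, y_loaded, qid_loaded = load_svmlight_file(path, query_id=True)
```

X_loaded comes back as a SciPy sparse matrix, which is convenient since most of the 136 MSLR descriptors are dense but the format allows omitting zeros.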

Slide 54

NDCG in Python
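The slide's code is not in the transcript; a minimal NumPy sketch of NDCG@k, assuming the common 2^relevance - 1 gain with a log2 position discount (the talk's exact formula may differ):

```python
import numpy as np

def ndcg(y_true, y_pred, k=5):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG.

    y_true: true relevance levels (e.g. 0, 1, 2)
    y_pred: predicted scores used to rank the results
    """
    order = np.argsort(y_pred)[::-1]                   # rank by predicted score
    gains = 2 ** y_true[order][:k] - 1                 # gain: 2^relevance - 1
    discounts = np.log2(np.arange(2, gains.size + 2))  # position i -> log2(i + 1)
    dcg = np.sum(gains / discounts)

    ideal_gains = 2 ** np.sort(y_true)[::-1][:k] - 1   # best possible ordering
    idcg = np.sum(ideal_gains / discounts[:ideal_gains.size])
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores exactly 1.0; any mis-ordering of unequal relevance levels scores less.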