
Machine Learning in Python with scikit-learn

This deck is a remix of slides from past presentations. It was presented in the workshop track of DataJob 2013 in Paris.

Olivier Grisel

November 20, 2013

Transcript

  1. Machine Learning in Python with scikit-learn (DataJob 2013, Paris)
  2. Outline • Predictive Modeling & scikit-learn • Application to “Learning to Rank” for web search • Forests of Randomized Trees • The Python ecosystem: IPython, StarCluster
  3. Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions)
  4. Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions) + labels
  5. Predictive Modeling Data Flow: training data + labels → feature vectors → machine learning algorithm
  6. Predictive Modeling Data Flow: training data + labels → feature vectors → machine learning algorithm → model
  7. Predictive Modeling Data Flow: new text doc / image / sound / transaction → feature vector → model → expected label (shown alongside the full training flow)
  8. Predictive Modeling Data Flow: new text doc / image / sound / transaction → feature vector → model → expected label
  9. Possible Applications • Text Classification / Sequence Tagging (NLP) • Spam Filtering, Sentiment Analysis… • Computer Vision / Speech Recognition • Learning to Rank: IR and advertisement • Science: Statistical Analysis of the Brain, Astronomy, Biology, Social Sciences…
  10. scikit-learn • Library for Machine Learning • Open Source (BSD) • Simple fit / predict / transform API (illustrated below) • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles
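    A minimal illustration of the transformer side of that API (the toy array and the choice of StandardScaler are illustrative, not from the slides; the fit / predict side is shown in the next three slides):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

    scaler = StandardScaler()           # a transformer: fit it, then transform data
    X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per column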
  11. Support Vector Machine

    from sklearn.svm import SVC

    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)

    from sklearn.metrics import f1_score

    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)
  12. Linear Classifier

    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)

    from sklearn.metrics import f1_score

    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)
  13. Random Forests

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)

    from sklearn.metrics import f1_score

    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)
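    The three snippets above assume X_train, X_test, y_train and y_test already exist. One hedged way to produce them, with the module layout of that era (now sklearn.model_selection), is a random split of placeholder arrays X and y:

    from sklearn.cross_validation import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)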
  14. (no transcript text for this slide)
  15. scikit-learn contributors • GitHub-centric contribution workflow • each pull request needs 2 x [+1] reviews • code + tests + doc + example • 92% test coverage / Continuous Integration • 2-3 major releases per year + bug-fix releases • 80+ contributors for release 0.14
  16. scikit-learn users • We support users on Stack Overflow & the mailing list • 650+ questions tagged with [scikit-learn] • Many competitors + benchmarks • 500+ answers to the 0.13 release user survey • 60% academics / 40% from industry • Some data-driven startups use sklearn
  17. Example: Learning to Rank

  18. Example: Learning to Rank • Learning to rank web search results • Input: numerical descriptors for query / result pairs • Target: relevance score • 0: irrelevant • 1: somewhat relevant • 2: relevant
  19. Input Features • Result page descriptors: PageRank, Click-Through Rate, last update time… • Query / result page descriptors: BM25, TF*IDF cosine similarity, ratio of covered query terms • User context descriptors: past user interactions (clicks, +1), time of the day, day of the month, month of the year and user language • … typically more than 40 descriptors and up to several hundred
  20. Quantifying Success • Measure the discrepancy between predicted and true relevance scores • Traditional regression metrics: Mean Absolute Error, Explained Variance • But the ranking quality is more important than the predicted scores…
  21. NDCG: a ranking metric

  22. NDCG in Greek: DCGk == Discounted Cumulative Gain at rank k
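    The slide's exact formula is not recoverable from this transcript; one common formulation, the exponential-gain variant typically used for web search relevance, is:

    \mathrm{DCG}_k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}
    \qquad
    \mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}

    where IDCG_k is the DCG_k of the ideal, relevance-sorted ordering of the same results.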
  23. Data from Microsoft Bing • http://research.microsoft.com/en-us/projects/mslr • 10K or 30K anonymized queries (terms and result URLs) • 10K queries: ~1.2M search results • 136 descriptors • 5 target relevance levels • ~650MB in NumPy
  24. Disclaimer: this is not Big Data • A couple of GB: fits in RAM on my laptop • But painful to download / upload over the internet • Processing and modeling can be CPU intensive (and sometimes distributed)
  25. Growing randomized trees

  26. Training a Decision Tree

  27. Training a Decision Tree: root split on Term Match Rate
  28. Training a Decision Tree: Term Match Rate < 0.2 / > 0.2
  29. Training a Decision Tree: Term Match Rate < 0.2 → Score == 0
  30. Training a Decision Tree: Term Match Rate > 0.2 → split on PageRank (< 3 / > 3)
  31. Training a Decision Tree: PageRank < 3 → Score == 1
  32. Training a Decision Tree: PageRank > 3 → Score == 2
  33. Training a Randomized Tree • Pick a random subset of features (e.g. TFIDF, BM25, PageRank, CTR…) • Find the feature that best splits the dataset • Randomize the split threshold between observed min and max values • Send each half of the split dataset to build the 2 subtrees (see the sketch below)
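    A rough sketch of one such randomized split for a regression target, scored by weighted variance reduction (the function name and the NumPy implementation are illustrative, not from the talk):

    import numpy as np

    def random_split(X, y, n_candidate_features, rng):
        """Pick one randomized (feature, threshold) split, extra-trees style."""
        n_samples, n_features = X.shape
        candidates = rng.choice(n_features, size=n_candidate_features, replace=False)
        best = None
        for f in candidates:
            lo, hi = X[:, f].min(), X[:, f].max()
            if lo == hi:
                continue
            threshold = rng.uniform(lo, hi)      # random threshold between observed min and max
            left = X[:, f] < threshold
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            # weighted variance of the targets after the split (lower is better)
            score = (left.sum() * y[left].var() + right.sum() * y[right].var()) / n_samples
            if best is None or score < best[0]:
                best = (score, f, threshold)
        return best  # (score, feature index, threshold) or None if no valid split

    rng = np.random.RandomState(0)
    X, y = rng.rand(100, 10), rng.rand(100)
    print(random_split(X, y, n_candidate_features=3, rng=rng))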
  34. Training a Forest • Train n random trees independently • Use different PRNG seeds • At prediction time, make each tree predict its best guess and: • make them vote (classification) • average the predicted values (regression)
  35. Extra Trees on one node with 8 CPUs
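    A hedged sketch of what this step might look like with scikit-learn, in the regression formulation (the parameter values are illustrative, not taken from the slide):

    from sklearn.ensemble import ExtraTreesRegressor

    # n_jobs=8 grows the trees in parallel on 8 CPU cores of a single node
    model = ExtraTreesRegressor(n_estimators=500, n_jobs=8, random_state=0)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)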

  36. Growing randomized trees on the cloud

  37. A 10 x 8-core cluster on EC2 in 20 min

  38. (no transcript text for this slide)
  39. >>> Configuring cluster took 12.865 mins
    >>> Starting cluster took 20.144 mins
  40. IPython • Notebook interface: in-browser, interactive data exploration environment • IPython.parallel: async load-balancing API for interactively dispatching processing tasks (sketched below) • Based on ZeroMQ and msgpack for IPC
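    A minimal IPython.parallel usage sketch, assuming engines are already running (for instance started by StarCluster); the squaring task is just a placeholder:

    from IPython.parallel import Client   # moved to the ipyparallel package in later releases

    def square(x):
        return x ** 2

    rc = Client()                      # connect to the engines of the running cluster
    lview = rc.load_balanced_view()    # asynchronous, load-balanced scheduling
    result = lview.map_async(square, range(32))
    print(result.get())                # block until all tasks have completed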
  41. (no transcript text for this slide)
  42. Grow random trees in parallel in the cloud

  43. Fetch back all the trees as a big forest on one node (see the sketch below)
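    A hedged sketch of the pattern in slides 42-43: grow small sub-forests on each engine, fetch them back, and pool their trees into one big forest. Merging through the estimators_ attribute is a well-known trick rather than an official API, and the helper name is made up; X_train and y_train are assumed to be available on every engine:

    from IPython.parallel import Client
    from sklearn.ensemble import ExtraTreesRegressor

    def grow_subforest(seed):
        # runs on an engine; X_train / y_train must already live in its namespace
        model = ExtraTreesRegressor(n_estimators=10, random_state=seed)
        return model.fit(X_train, y_train)

    rc = Client()
    lview = rc.load_balanced_view()
    subforests = lview.map_sync(grow_subforest, range(50))   # 50 sub-forests of 10 trees each

    big_forest = subforests[0]
    for other in subforests[1:]:
        big_forest.estimators_ += other.estimators_          # pool all 500 fitted trees
    big_forest.n_estimators = len(big_forest.estimators_)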
  44. (no transcript text for this slide)
  45. Demo http://j.mp/pyrallel-mslr

  46. Results • NDCG@5: ~0.52 for 500 trees on MSLR-WEB10K • Could maybe be improved by: • increasing the number of trees (but the model gets too big in memory and slower to predict) • replacing the base trees by bagged GBRT models • pair-wise or list-wise ranking models (not in sklearn) • Linear regression model baseline: NDCG@5: ~0.43
  47. Your turn now!

  48. (no transcript text for this slide)
  49. Thank you! • http://ipython.org • http://scikit-learn.org • http://star.mit.edu/cluster • https://github.com/pydata/pyrallel • http://github.com/ogrisel/notebooks • @ogrisel
  50. Backup slides

  51. (no transcript text for this slide)
  52. Caveat Emptor • Domain-specific tooling is kept to a minimum • Some feature extraction for Bag-of-Words text analysis (example below) • Some functions for extracting image patches • Domain integration is the responsibility of the user or of 3rd-party libraries
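    For example, the bag-of-words tooling mentioned above (the toy documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)   # sparse document-term matrix of TF-IDF weights
    print(X.shape)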
  53. Loading the data with Scikit-learn
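    The MSLR files use an svmlight / LETOR style format with qid annotations, so a hedged guess at what this slide showed is scikit-learn's svmlight loader (the file path is a placeholder):

    import numpy as np
    from sklearn.datasets import load_svmlight_file

    X, y, qid = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)
    X = X.toarray().astype(np.float32)   # ~1.2M x 136 as float32 is roughly 650MB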

  54. NDCG in Python
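    A hedged sketch of what this slide likely contained, using the exponential-gain formulation given after slide 22 (not necessarily the exact code from the talk):

    import numpy as np

    def dcg(relevances, k=5):
        """Discounted cumulative gain at rank k (exponential gain variant)."""
        relevances = np.asarray(relevances)[:k]
        gains = 2 ** relevances - 1
        discounts = np.log2(np.arange(len(relevances)) + 2)
        return np.sum(gains / discounts)

    def ndcg(relevances, k=5):
        """DCG at rank k normalized by the DCG of the ideal ordering."""
        best_dcg = dcg(sorted(relevances, reverse=True), k)
        if best_dcg == 0:
            return 0.0
        return dcg(relevances, k) / best_dcg

    print(ndcg([2, 0, 1, 0, 2], k=5))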