Slide 1

Predictive Modeling with IPython & scikit-learn (Strata 2014, Santa Clara)

Slide 2

Outline • Predictive Modeling & scikit-learn • Application to “Learning to Rank” for web search • Forests of Randomized Trees • The Python ecosystem: IPython, StarCluster

Slide 3

Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions)

Slide 4

Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions) + labels

Slide 5

Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions) → feature vectors; feature vectors + labels → machine learning algorithm

Slide 6

Predictive Modeling Data Flow: training data (text docs, images, sounds, transactions) → feature vectors; feature vectors + labels → machine learning algorithm → model

Slide 7

Predictive Modeling Data Flow (complete): training data (text docs, images, sounds, transactions) → feature vectors; feature vectors + labels → machine learning algorithm → model; new text doc / image / sound / transaction → feature vector → model → expected label

Slide 8

Predictive Modeling Data Flow (prediction): new text doc / image / sound / transaction → feature vector → model → expected label

Slide 9

Possible Applications • Text Classification / Sequence Tagging (NLP) • Spam Filtering, Sentiment Analysis... • Computer Vision / Speech Recognition • Learning To Rank - IR and advertisement • Science: Statistical Analysis of the Brain, Astronomy, Biology, Social Sciences...

Slide 10

• Library of Machine Learning algorithms • Focus on standard methods (e.g. ESL-II) • Open Source (BSD) • Simple fit / predict / transform API • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles

Slide 11

Example: predicting Vegetation Cover Type from location descriptors

X (feature matrix, one row per location):

             | Latitude | Altitude | Distance to closest river | Altitude of closest river | Slope | Slope orientation
  location 1 |   46.    |  200.    |      1                    |        0                  |  0.0  |  N
  location 2 |  -30.    |  150.    |      2.                   |      149                  |  0.1  |  S
  location 3 |   87.    |   50     |   1000                    |       10                  |  0.1  |  W
  location 4 |   45.    |   10     |     10.                   |        1                  |  0.4  |  NW
  location 5 |    5.    |    2.    |     67.                   |        1.                 |  0.2  |  E

y (target, one value per location): Vegetation Cover Type, one-hot encoded over {Rain forest, Grassland, Arid, Ice}

Slide 12

Support Vector Machine

from sklearn.svm import SVC

model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)

Slide 13

Linear Classifier

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)

Slide 14

Random Forests

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

from sklearn.metrics import f1_score
f1_score(y_test, y_predicted)

Slide 15

No content

Slide 16

scikit-learn contributors • GitHub-centric contribution workflow • each pull request needs 2 x [+1] reviews • code + tests + doc + example • 92% test coverage / Continuous Integration • 2-3 major releases per year + bug-fix releases • 80+ contributors for release 0.14

Slide 17

Example: Learning to Rank

Slide 18

Example: Learning to Rank • Learning to rank web search results • Input: numerical descriptors for query / result pairs • Target: relevance score • 0: irrelevant • 1: somewhat relevant • 2: relevant

Slide 19

Input Features
• Result page descriptors: PageRank, Click Through Rate, last update time…
• Query / Result page descriptors: BM25, TF*IDF cosine similarity, ratio of covered query terms
• User context descriptors: past user interactions (clicks, +1), time of the day, day of the month, month of the year and user language
• … typically more than 40 descriptors and up to several hundred

Slide 20

Quantifying Success • Measure discrepancy between predicted and true relevance scores • Traditional Regression Metrics: • Mean Absolute Error • Explained Variance • But the ranking quality is more important than the predicted scores…
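
For instance, with scikit-learn's regression metrics (a quick sketch reusing the y_test / y_predicted names from the earlier snippets):

from sklearn.metrics import mean_absolute_error, explained_variance_score

mean_absolute_error(y_test, y_predicted)       # average |y_true - y_pred|
explained_variance_score(y_test, y_predicted)  # 1.0 means a perfect fit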

Slide 21

NDCG: a ranking metric

Slide 22

NDCG in Greek: DCG_k = Discounted Cumulative Gain at rank k
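
The formula itself was an image on the slide; for reference, a standard formulation in LaTeX (assuming the exponential-gain variant commonly used with graded relevance labels, which may differ from the slide's exact notation):

\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
\qquad
\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}

where IDCG_k is the DCG_k of the ideal (relevance-sorted) ordering, so that NDCG_k lies in [0, 1].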

Slide 23

Data from Microsoft Bing • http://research.microsoft.com/en-us/projects/mslr • 10K or 30K anonymized queries (terms and results URLs) • 10K queries: • ~1.2M search results • 136 descriptors • 5 target relevance levels • ~650MB in NumPy
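
The ~650MB figure is consistent with storing the ~1.2M x 136 feature matrix as 32-bit floats (a quick sanity check, not from the slides):

n_rows, n_features = 1200000, 136
n_rows * n_features * 4 / 1e6   # ≈ 653 MB with float32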

Slide 24

Disclaimer: this is not Big Data • A couple of GB: fits in RAM on my laptop • But painful to download / upload over the internet • Processing and modeling can be CPU intensive (and sometimes distributed)

Slide 25

Growing randomized trees

Slide 26

Training a Decision Tree

Slide 27

Training a Decision Tree: the root node tests the Term Match Rate feature

Slide 28

Training a Decision Tree: root split on Term Match Rate (< 0.2 vs > 0.2)

Slide 29

Training a Decision Tree: Term Match Rate < 0.2 → leaf with Score == 0

Slide 30

Training a Decision Tree: the Term Match Rate > 0.2 branch is split again on PageRank (< 3 vs > 3)

Slide 31

Training a Decision Tree: PageRank < 3 → leaf with Score == 1

Slide 32

Training a Decision Tree (complete tree):

Term Match Rate
  < 0.2 → Score == 0
  > 0.2 → PageRank
            < 3 → Score == 1
            > 3 → Score == 2

Slide 33

Training a Randomized Tree • Pick a random subset of features (e.g. TFIDF, BM25, PageRank, CTR…) • Find the feature that best splits the dataset • Randomize the split threshold between observed min and max values • Send each half of the split dataset to build the 2 subtrees
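
A minimal NumPy sketch of one such randomized split (illustrative only: the function name best_random_split is made up here, and scikit-learn's actual implementation is in Cython and differs in many details):

import numpy as np

def best_random_split(X, y, n_candidate_features, rng):
    """Pick one split as in extremely randomized trees (sketch)."""
    n_samples, n_features = X.shape
    candidates = rng.choice(n_features, size=n_candidate_features, replace=False)
    best = None
    for f in candidates:
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:
            continue  # constant feature, nothing to split on
        threshold = rng.uniform(lo, hi)  # randomized threshold between observed min and max
        left = X[:, f] <= threshold
        # score the candidate split by the reduction in target variance (regression criterion)
        score = y.var() - (left.mean() * y[left].var() + (~left).mean() * y[~left].var())
        if best is None or score > best[0]:
            best = (score, f, threshold)
    return best  # (score, feature_index, threshold); apply recursively to each half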

Slide 34

Training a Forest • Train n random trees independently • Use different PRNG seeds • At prediction time, make each tree predict its best guess and: • make them vote (classification) • average predicted values (regression)
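
A hedged sketch of that recipe with scikit-learn building blocks (in practice ExtraTreesRegressor(n_estimators=..., n_jobs=...) already does this for you; X_train, y_train and X_test are assumed to exist as in the earlier snippets):

import numpy as np
from sklearn.tree import ExtraTreeRegressor

# Train n randomized trees independently, each with its own PRNG seed.
trees = []
for seed in range(10):
    tree = ExtraTreeRegressor(random_state=seed)
    tree.fit(X_train, y_train)
    trees.append(tree)

# Regression: average the individual predictions (for classification, take a majority vote).
y_pred = np.mean([tree.predict(X_test) for tree in trees], axis=0)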

Slide 35

Extra Trees: one node with 8 CPUs

Slide 36

Growing randomized trees on the cloud

Slide 37

A 10-node x 8-core cluster on EC2 in 20 min

Slide 38

No content

Slide 39

>>> Configuring cluster took 12.865 mins
>>> Starting cluster took 20.144 mins

Slide 40

IPython • Notebook interface: in-browser, interactive data exploration environment • IPython.parallel: async load-balancing API for interactively dispatching processing tasks • Based on ZeroMQ and msgpack for IPC
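
As a rough illustration of the IPython.parallel API mentioned above (IPython 1.x/2.x era; assumes an ipcluster is already running, and costly_function is just a placeholder):

from IPython.parallel import Client

client = Client()                    # connect to the running IPython cluster
lview = client.load_balanced_view()  # asynchronous, load-balanced task scheduling

def costly_function(seed):
    # stand-in for any CPU-intensive task, e.g. training one tree
    return seed ** 2

async_results = [lview.apply_async(costly_function, s) for s in range(100)]
results = [ar.get() for ar in async_results]  # block until all tasks are done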

Slide 41

No content

Slide 42

Grow random trees in parallel in the cloud
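
The linked notebook has the real code; a simplified sketch of the idea (assuming the lview handle from the previous snippet, and that the MSLR training arrays are available on every engine):

from sklearn.ensemble import ExtraTreesRegressor

def train_extra_trees(seed, n_estimators=20):
    # Runs on a remote engine: grow a small forest with its own PRNG seed.
    # X_train and y_train are assumed to have been loaded / pushed to the engine beforehand.
    model = ExtraTreesRegressor(n_estimators=n_estimators, random_state=seed, n_jobs=-1)
    model.fit(X_train, y_train)
    return model

# One training task per seed, dispatched across the cluster.
async_models = [lview.apply_async(train_extra_trees, seed) for seed in range(80)]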

Slide 43

Fetch back all the trees as a big forest on one node
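
One common way to do that merge (a sketch, not necessarily what the notebook does) is to concatenate the fitted trees of the sub-forests into a single estimator:

# Gather the fitted sub-forests from the engines.
sub_forests = [ar.get() for ar in async_models]

# Reuse the first forest as a container and graft all the other trees into it.
big_forest = sub_forests[0]
for other in sub_forests[1:]:
    big_forest.estimators_ += other.estimators_
big_forest.n_estimators = len(big_forest.estimators_)

predictions = big_forest.predict(X_test)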

Slide 44

No content

Slide 45

One big notebook: http://j.mp/pyrallel-mslr

Slide 46

No content

Slide 47

No content

Slide 48

No content

Slide 49

Results • NDCG@10: ~0.53 for 200 trees on MSLR-WEB10K • Could be improved by: • increasing the number of trees (but the model gets too big in memory and slower to predict) • replacing the base trees by bagged GBRT models • LambdaMART list-wise ranking models (under development for inclusion in sklearn) • Linear regression model baseline: NDCG@10: ~0.45

Slide 50

No content

Slide 51

http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/

Slide 52

Thank you! • http://ipython.org • http://scikit-learn.org • http://star.mit.edu/cluster • https://github.com/pydata/pyrallel • https://speakerdeck.com/ogrisel @ogrisel

Slide 53

Backup slides

Slide 54

No content

Slide 55

Loading the data with Scikit-learn
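
The code from this slide is not in the transcript; a plausible minimal version, given that the MSLR data ships in svmlight/libsvm format with query ids (file paths are assumptions):

from sklearn.datasets import load_svmlight_file

# Each line holds a relevance label, a qid, and 136 features.
X_train, y_train, qid_train = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)
X_test, y_test, qid_test = load_svmlight_file("MSLR-WEB10K/Fold1/test.txt", query_id=True)

# load_svmlight_file returns scipy.sparse matrices; tree models prefer dense arrays.
X_train = X_train.toarray()
X_test = X_test.toarray()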

Slide 56

NDCG in Python
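
Again, the slide's code is an image; a small NumPy implementation of NDCG@k under the same exponential-gain convention as above (illustrative, not necessarily the version shown):

import numpy as np

def dcg_at_k(relevance, k=10):
    """Discounted cumulative gain, given true relevance grades in rank order."""
    relevance = np.asarray(relevance, dtype=np.float64)[:k]
    gains = 2 ** relevance - 1
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return np.sum(gains / discounts)

def ndcg_at_k(y_true, y_pred, k=10):
    """NDCG@k for one query: DCG of the predicted ordering over DCG of the ideal ordering."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_pred)[::-1]   # documents sorted by predicted score
    ideal = np.sort(y_true)[::-1]      # best achievable ordering
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0                     # query with no relevant documents
    return dcg_at_k(y_true[order], k) / ideal_dcg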

Slide 57

No content

Slide 58

No content

Slide 59

No content

Slide 60

scikit-learn users • We support users on Stack Overflow & the mailing list • 650+ questions tagged with [scikit-learn] • Many competitors + benchmarks • 500+ answers to the 0.13 release user survey • 60% academics / 40% from industry • Some data-driven startups use sklearn

Slide 61

No content

Slide 62

No content

Slide 63

No content

Slide 64

No content

Slide 65

scikit-learn users • We support users on Stack Overflow & the mailing list • 820+ questions tagged with [scikit-learn] • Many competitors + benchmarks • 500+ answers to the 0.13 release user survey • 60% academics / 40% from industry • Some data-driven startups use sklearn

Slide 66

No content