Slide 1

Growing Randomized Trees in the Cloud
PyCon FR 2013 - Strasbourg

Slide 2

This talk is about Data Science

Slide 3

Outline
• Machine Learning
• Application to “Learning to Rank” for web search
• Forests of Randomized Trees
• Scikit-learn, IPython, StarCluster, Apache Libcloud

Slide 4

Supervised Machine Learning

Slide 5

Example: Learning to Rank

Slide 6

Example: Learning to Rank
• Learning to rank web search results
• Input: numerical descriptors for query / result pairs
• Target: relevance score
  • 0: irrelevant
  • 1: somewhat relevant
  • 2: relevant

Slide 7

Input Features
• Result page descriptors:
  • PageRank, Click-Through Rate, last update time…
• Query / result page descriptors:
  • BM25, TF*IDF cosine similarity
  • Ratio of covered query terms
• User context descriptors: past user interactions (clicks, +1), time of day, day of the month, month of the year and user language
• … typically more than 40 descriptors, and up to several hundred

Slide 8

Quantifying Success
• Measure the discrepancy between predicted and true relevance scores
• Traditional regression metrics:
  • Mean Absolute Error
  • Explained Variance
• But ranking quality matters more than the predicted scores themselves…
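
A minimal sketch of those two traditional metrics with scikit-learn; y_true and y_pred are hypothetical arrays of true and predicted relevance scores:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, explained_variance_score

y_true = np.array([2, 0, 1, 2, 0])            # true relevance scores
y_pred = np.array([1.8, 0.3, 0.9, 1.5, 0.1])  # model predictions

print(mean_absolute_error(y_true, y_pred))       # mean of |y_true - y_pred|
print(explained_variance_score(y_true, y_pred))  # 1.0 means a perfect fit
```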

Slide 9

NDCG: a ranking metric

Slide 10

NDCG in Greek
DCG_k = Discounted Cumulative Gain at rank k
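
The formula itself did not survive extraction; the standard definition, in the common variant with exponential gain (which rewards highly relevant results), is:

```latex
% Discounted Cumulative Gain at rank k, for relevance labels rel_i:
DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

% NDCG normalizes by the DCG of the ideal (relevance-sorted) ordering,
% so that NDCG_k = 1 for a perfect ranking:
NDCG_k = \frac{DCG_k}{IDCG_k}
```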

Slide 11

Data from Microsoft Bing
• http://research.microsoft.com/en-us/projects/mslr
• 10K or 30K anonymized queries (terms and result URLs)
• 10K queries:
  • ~1.2M search results
  • 136 descriptors
  • 5 target relevance levels
  • ~650MB as NumPy arrays

Slide 12

Disclaimer: this is not Big Data
• A couple of GB: fits in RAM on my shiny Mac laptop
• But painful to download / upload over the internet
• Processing and modeling can be CPU intensive (and sometimes distributed)

Slide 13

Growing randomized trees

Slide 14

Training a Decision Tree

Slides 15-20

Slides 15-20 build the “Training a Decision Tree” figure one node at a time; the completed tree:

Term Match Rate
├─ < 0.2 → Score == 0
└─ > 0.2 → PageRank
   ├─ < 3 → Score == 1
   └─ > 3 → Score == 2
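
A minimal sketch of fitting a tree of this shape with scikit-learn; the toy X and y below are hypothetical (two descriptors, term match rate and PageRank, with the 0/1/2 relevance scores as the regression target):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.1, 1.0],   # [term match rate, PageRank]
              [0.5, 2.0],
              [0.6, 5.0],
              [0.9, 8.0]])
y = np.array([0, 1, 2, 2])  # relevance scores

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[0.7, 4.0]]))  # predicted relevance for a new pair
```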

Slide 21

Training a Randomized Tree
• Pick a random subset of features (e.g. TF*IDF, BM25, PageRank, CTR…)
• Find the feature that best splits the dataset
• Randomize the split threshold between the observed min and max values
• Send each half of the split dataset to build the 2 subtrees
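
An illustrative sketch of that split procedure (not scikit-learn's actual implementation): draw one random threshold per candidate feature and keep the candidate that best reduces the target variance:

```python
import numpy as np

def random_split(X, y, n_candidate_features, rng):
    """Return (score, feature, threshold) for the best randomized split."""
    features = rng.choice(X.shape[1], n_candidate_features, replace=False)
    best = None
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        threshold = rng.uniform(lo, hi)  # randomized split threshold
        mask = X[:, f] < threshold
        if mask.all() or not mask.any():
            continue  # degenerate split: all samples on one side
        # weighted variance of the two halves: lower means a purer split
        score = y[mask].var() * mask.sum() + y[~mask].var() * (~mask).sum()
        if best is None or score < best[0]:
            best = (score, f, threshold)
    return best

# each half of the dataset then recursively gets its own random_split
```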

Slide 22

Training a Forest
• Train n random trees independently
  • Use different PRNG seeds
• At prediction time, make each tree predict its best guess and:
  • make them vote (classification)
  • average predicted values (regression)
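
A minimal sketch of that idea for the regression case, using scikit-learn's single-tree ExtraTreeRegressor as the base learner:

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

def train_forest(X, y, n_trees=10):
    # n independent trees, each with a different PRNG seed
    return [ExtraTreeRegressor(random_state=seed).fit(X, y)
            for seed in range(n_trees)]

def predict_forest(trees, X):
    # regression: average the per-tree predictions
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```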

Slide 23

Extra Trees: one node with 8 CPUs
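
In scikit-learn this is a one-liner: the ensemble module parallelizes tree growing across local CPUs. The toy data below is a stand-in for the MSLR feature matrix and relevance scores:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

X_train = np.random.rand(1000, 136)      # stand-in for the 136 descriptors
y_train = np.random.randint(0, 5, 1000)  # stand-in relevance scores 0..4

# grow 100 extra-trees, using 8 worker processes on one node
model = ExtraTreesRegressor(n_estimators=100, n_jobs=8, random_state=0)
model.fit(X_train, y_train)
```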

Slide 24

Growing randomized trees on the cloud

Slide 25

A 10-node × 8-core cluster on EC2 in 20 minutes

Slide 26

No content

Slide 27

>>> Configuring cluster took 12.865 mins
>>> Starting cluster took 20.144 mins
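
For context, a sketch of the kind of StarCluster configuration that launches such a cluster; the key names are StarCluster's, but the template name, key name, instance type and AMI below are assumptions, not taken from the slides:

```ini
# ~/.starcluster/config (sketch only)
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster

[cluster mslrcluster]
KEYNAME = mykey
CLUSTER_SIZE = 10               # 10 nodes...
NODE_INSTANCE_TYPE = c1.xlarge  # ...with 8 virtual cores each
NODE_IMAGE_ID = ami-...         # a StarCluster-compatible AMI (elided)
PLUGINS = ipcluster
```

The cluster is then started with `starcluster start mslrcluster`.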

Slide 28

• Notebook interface: in-browser, interactive data exploration environment
• IPython.parallel: async load-balancing API for interactively dispatching work
• Based on ZeroMQ and msgpack for IPC
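
A minimal sketch of the IPython.parallel API of that era: connect to a running ipcluster, take a load-balanced view, and dispatch work asynchronously:

```python
from IPython.parallel import Client

client = Client()                   # connect to the running ipcluster
view = client.load_balanced_view()  # async load-balancing scheduler

# dispatch a task to whichever engine is free, then fetch the result
async_result = view.apply_async(lambda x: x ** 2, 4)
print(async_result.get())           # blocks until the result arrives
```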

Slide 29

No content

Slide 30

Grow random trees in parallel in the cloud
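
A hedged sketch of this step, assuming an IPython.parallel cluster and a hypothetical load_mslr_data() helper available on every engine:

```python
from IPython.parallel import Client

def build_forest(seed):
    # runs remotely on one engine: load the data, grow one sub-forest
    from sklearn.ensemble import ExtraTreesRegressor
    X_train, y_train = load_mslr_data()  # hypothetical loading helper
    return ExtraTreesRegressor(n_estimators=50, random_state=seed,
                               n_jobs=8).fit(X_train, y_train)

client = Client()
view = client.load_balanced_view()
# one task per node, each sub-forest with its own PRNG seed
tasks = [view.apply_async(build_forest, seed) for seed in range(10)]
all_forests = [task.get() for task in tasks]
```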

Slide 31

Fetch back all the trees as a big forest on one node
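
A sketch of the merge, relying on scikit-learn's public estimators_ and n_estimators attributes (all_forests is the list gathered in the previous step):

```python
from copy import deepcopy

def combine(all_forests):
    # concatenate the fitted trees of every sub-forest into the first one
    big = deepcopy(all_forests[0])
    for forest in all_forests[1:]:
        big.estimators_ += forest.estimators_
    big.n_estimators = len(big.estimators_)
    return big
```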

Slide 32

No content

Slide 33

Demo http://j.mp/pyrallel-mslr

Slide 34

Results
• NDCG@5: ~0.52 for 500 trees on MSLR-WEB10K
• Could maybe be improved by:
  • increasing the number of trees (but the model gets too big in memory and slower at prediction time)
  • replacing the base trees with bagged GBRT models
  • pairwise or list-wise ranking models (not in sklearn)
• Linear regression model baseline: NDCG@5: ~0.45

Slide 35

Your turn now!

Slide 36

No content

Slide 37

Questions?
• http://ipython.org
• http://scikit-learn.org
• http://star.mit.edu/cluster
• https://github.com/pydata/pyrallel
• http://github.com/ogrisel/notebooks

Slide 38

Backup slides

Slide 39

Loading the data with Scikit-learn
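
A sketch of the loading step: the MSLR files are in SVMLight format with qid annotations, which scikit-learn can parse directly (the path below assumes the extracted MSLR-WEB10K fold layout):

```python
from sklearn.datasets import load_svmlight_file

# X: sparse matrix of the 136 descriptors, y: relevance scores 0..4,
# qid: the query id shared by all results of the same query
X, y, qid = load_svmlight_file("MSLR-WEB10K/Fold1/train.txt", query_id=True)
```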

Slide 40

NDCG in Python
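
A minimal NumPy implementation matching the DCG/NDCG definition given earlier (the slide's exact code is not reproduced here):

```python
import numpy as np

def dcg(relevances, k=5):
    """Discounted cumulative gain at rank k, exponential gain variant."""
    relevances = np.asarray(relevances)[:k]
    gains = 2 ** relevances - 1
    discounts = np.log2(np.arange(len(relevances)) + 2)  # log2(i + 1)
    return float((gains / discounts).sum())

def ndcg(relevances, k=5):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1, 2, 0]))  # relevance labels in predicted rank order
```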