Slide 1

Kaggle Criteo Challenge and Online Learning
Kaggle Paris Meetup – October 23rd, 2014 – Paris, France
Christophe Bourguignat – AXA Data Innovation Lab – @chris_bour

Slide 2

Kaggle Criteo Challenge
• Develop a model predicting ad click-through rate (CTR)
• Train: a portion of Criteo's traffic over a period of 7 days
• Test: events on the day following the training period

Slide 3

Kaggle Criteo Challenge
• Total:
  – 52 million rows
  – 13 GB (uncompressed)

Slide 4

Kaggle Criteo Challenge
• The detailed meaning of the features is undisclosed
• 13 numerical features
  – Mostly counters: number of times the user visited the advertiser website, …
• 26 categorical features
  – Publisher features: the domain of the URL where the ad was displayed, …
  – Advertiser features: advertiser id, type of products, …
  – User features: browser type, …

Slide 5

Cardinalities & ranges examples
• I5: between 0 and 23 159 456
• C3: 10 131 226 distinct categories

Slide 6

Online Machine Learning
• Model that learns one instance at a time
• Soon after the prediction is made, the true label of the instance is discovered

Slide 7

Online Machine Learning
• 1: receive the instance
• 2: predict the label for the instance
• 3: the algorithm receives the true label of the instance
  – use this label feedback to update the hypothesis for future trials
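The three-step loop can be sketched as follows; the "hypothesis" here is just a running average of past labels, a hypothetical stand-in for any real online learner:

```python
# A minimal sketch of the three-step online learning loop.
# The toy "hypothesis" is a running average of past labels; any
# online learner could be plugged into the predict/update steps.

def online_loop(stream):
    total, count = 0.0, 0
    losses = []
    for features, true_label in stream:              # 1: receive the instance
        prediction = total / count if count else 0.5  # 2: predict the label
        losses.append(abs(prediction - true_label))   # 3: receive the true label...
        total += true_label                           # ...and update the hypothesis
        count += 1
    return losses

# Usage: a tiny stream of (features, label) pairs
stream = [({"x": 1}, 1), ({"x": 2}, 1), ({"x": 3}, 0)]
print(online_loop(stream))  # one loss per instance, computed before learning from it
```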

Slide 8

Typical use cases
• Applications of a sequential nature
• Applications with huge amounts of data
  – traditional learning approaches that use the entire data set are computationally infeasible

Slide 9

Solution with 75 lines of Python Code

from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt

Source: https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10322/beat-the-benchmark-with-less-then-200mb-of-memory

Slide 10

Solution with 75 lines of Python Code
• 1: receive the instance

D = 2 ** 20  # number of weights used for learning

def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # weakest hash ever ;)
        x.append(index)
    return x  # x contains indices of features that have a value of 1

Slide 11

Solution with 75 lines of Python Code
• 2: predict the label for the instance

# initialize our model
D = 2 ** 20  # number of weights used for learning
w = [0.] * D  # weights

def get_p(x, w):
    wTx = 0.
    for i in x:  # do wTx
        wTx += w[i] * 1.  # w[i] * x[i], but if i in x we got x[i] = 1.
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

Slide 12

Solution with 75 lines of Python Code
• 3: the algorithm receives the true label of the instance
  – use this label feedback to update the hypothesis for future trials

# initialize our model
w = [0.] * D  # weights
n = [0.] * D  # number of times we've encountered a feature
alpha = .1  # learning rate for sgd optimization

def update_w(w, n, x, p, y):
    for i in x:
        # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
        # (p - y) * x[i] is the current gradient
        # note that in our case, if i in x then x[i] = 1
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    return w, n
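Wired together, the three functions above form one complete online pass. The sketch below runs them on two hand-made rows (hypothetical data, not the real Criteo file):

```python
# Sketch: one online pass with the slide's get_x / get_p / update_w.
from math import exp, sqrt

D = 2 ** 20        # number of weights used for learning
w = [0.] * D       # weights
n = [0.] * D       # per-feature update counts
alpha = .1         # learning rate for sgd optimization

def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        x.append(int(value + key[1:], 16) % D)
    return x

def get_p(x, w):
    wTx = sum(w[i] for i in x)
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    return w, n

# Two hypothetical rows: ({feature name: hex string value}, label)
rows = [({'I1': '2', 'C1': 'a8'}, 1.), ({'I1': '3', 'C1': 'a8'}, 0.)]
preds = []
for row, y in rows:
    x = get_x(row, D)
    p = get_p(x, w)               # predict before seeing the label
    preds.append(p)
    w, n = update_w(w, n, x, p, y)  # then learn from the label
print(preds)  # first prediction is 0.5 (all-zero weights); the second has moved
```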

Slide 13

Hashing trick (feature hashing)
• Typically used for text classification
• Using dictionaries to represent features has drawbacks:
  – Takes a large amount of memory
  – Grows in size as the training set grows
  – The model can be attacked (e.g. using misspelled words not in the stored vocabulary)
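To make the contrast concrete, here is a minimal sketch (hypothetical tokens, with Python's built-in hash standing in for a real hash function) of a growing dictionary vocabulary versus fixed-size hashed indices:

```python
# Dictionary vocabulary vs. the hashing trick, on toy tokens.
D = 2 ** 10  # fixed number of feature slots (kept small for the example)

def hash_index(token, D):
    return hash(token) % D  # any hash function works; Python's built-in here

# Dictionary approach: the vocabulary grows with every new token ever seen
vocab = {}
def dict_index(token):
    if token not in vocab:
        vocab[token] = len(vocab)  # memory grows without bound
    return vocab[token]

for token in ["nike", "adidas", "niike"]:  # a misspelling still gets its own slot
    dict_index(token)
print(len(vocab))  # 3 entries, and counting

# Hashing approach: constant memory; misspellings just map somewhere in [0, D)
indices = [hash_index(t, D) for t in ["nike", "adidas", "niike"]]
print(all(0 <= i < D for i in indices))
```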

Slide 14

Improvements
• Hashing trick dimension:
  D = 2 ** 20 -> D = 2 ** 29
• Hashing function:
  index = int(value + key[1:], 16) % D
  -> index = mmh3.hash(str(i) + value) % D
• MurmurHash performs well on a random distribution of regular keys

Slide 15

Online ML with scikit-learn

import pandas as pd
from sklearn import linear_model

model = linear_model.SGDClassifier()
train = pd.read_csv("train.csv", chunksize=100000, iterator=True)
for chunk in train:
    X = chunk.drop("Label", axis=1)  # features (column names depend on the file)
    y = chunk["Label"]               # target
    model.partial_fit(X, y, classes=[0, 1])  # classes required on the first call
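A runnable variant of the same chunked partial_fit loop, using a small synthetic stream in place of train.csv (the data, the toy clickthrough rule, and the chunk count are made up for illustration):

```python
# Sketch: scikit-learn online learning with partial_fit on synthetic chunks.
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
model = linear_model.SGDClassifier(random_state=0)

for _ in range(10):                              # each iteration plays the role of a chunk
    X = rng.rand(100, 5)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)      # hypothetical clickthrough rule
    model.partial_fit(X, y, classes=[0, 1])      # classes required on the first call

# The model never held more than one chunk in memory at a time
X_test = rng.rand(50, 5)
y_test = (X_test[:, 0] + X_test[:, 1] > 1).astype(int)
acc = model.score(X_test, y_test)
print(acc)
```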

Slide 16


Thank You!