Kaggle Criteo Challenge and Online Learning

Kaggle Paris Meetup – October 23rd, 2014 - Paris, France

Christophe Bourguignat

October 23, 2014

Transcript

  1. Kaggle Criteo Challenge and Online Learning
     Kaggle Paris Meetup – October 23rd, 2014 - Paris, France
     Christophe Bourguignat – AXA Data Innovation Lab - @chris_bour
  2. Kaggle Criteo Challenge
     • Develop a model predicting ad click-through rate (CTR)
     • Train: a portion of Criteo's traffic over a period of 7 days
     • Test: events on the day following the training period
  3. Kaggle Criteo Challenge
     • The detailed meaning of the features is undisclosed
     • 13 numerical features
       – Mostly counters: number of times the user visited the advertiser website, …
     • 26 categorical features
       – Publisher features: the domain of the URL where the ad was displayed, …
       – Advertiser features: advertiser id, type of products, …
       – User features: browser type, …
  4. Cardinality & range examples
     • I5: between 0 and 23 159 456
     • C3: 10 131 226 distinct categories
  5. Online Machine Learning
     • A model that learns one instance at a time
     • Soon after the prediction is made, the true label of the instance is discovered
  6. Online Machine Learning
     • 1: receive the instance
     • 2: predict the label for the instance
     • 3: the algorithm receives the true label of the instance – use this feedback to update the hypothesis for future trials
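The three-step loop above can be sketched as a tiny generic online learner. This is a toy illustration of the protocol (not code from the deck); `predict` and `update` are hypothetical callbacks, and the usage example learns a constant target with a step-size-0.1 correction.

```python
def online_learning(stream, predict, update, state):
    """The receive -> predict -> feedback loop from the slide."""
    for x, y_true in stream:                      # 1: receive the instance
        y_pred = predict(state, x)                # 2: predict its label
        state = update(state, x, y_pred, y_true)  # 3: true label arrives; update hypothesis
    return state

# toy usage: track a constant target of 1.0, starting from 0.0
stream = [(None, 1.0)] * 100
final = online_learning(
    stream,
    predict=lambda s, x: s,                       # the state itself is the prediction
    update=lambda s, x, yp, yt: s + 0.1 * (yt - yp),  # move toward the observed label
    state=0.0,
)
```

The state converges toward the target because each step closes 10% of the remaining prediction error, exactly the "use this feedback to update the hypothesis" idea.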
  7. Typical use cases
     • Applications of a sequential nature
     • Applications with huge amounts of data – traditional learning approaches that use the entire data set are computationally infeasible
  8. Solution with 75 lines of Python code

     from datetime import datetime
     from csv import DictReader
     from math import exp, log, sqrt

     Source: https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10322/beat-the-benchmark-with-less-then-200mb-of-memory
  9. Solution with 75 lines of Python code
     • 1: receive the instance

     D = 2 ** 20  # number of weights used for learning

     def get_x(csv_row, D):
         x = [0]  # 0 is the index of the bias term
         for key, value in csv_row.items():
             index = int(value + key[1:], 16) % D  # weakest hash ever ;)
             x.append(index)
         return x  # x contains indices of features that have a value of 1
  10. Solution with 75 lines of Python code
      • 2: predict the label for the instance

      # initialize our model
      D = 2 ** 20  # number of weights used for learning
      w = [0.] * D  # weights

      def get_p(x, w):
          wTx = 0.
          for i in x:  # do wTx
              wTx += w[i] * 1.  # w[i] * x[i], but if i is in x then x[i] = 1.
          return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid
  11. Solution with 75 lines of Python code
      • 3: the algorithm receives the true label of the instance – use this feedback to update the hypothesis for future trials

      # initialize our model
      w = [0.] * D  # weights
      n = [0.] * D  # number of times we've encountered a feature
      alpha = .1  # learning rate for SGD optimization

      def update_w(w, n, x, p, y):
          for i in x:
              # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
              # (p - y) * x[i] is the current gradient
              # note that in our case, if i is in x then x[i] = 1
              w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
              n[i] += 1.
          return w, n
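Putting slides 9–11 together, one full online step looks like this. The deck's three functions are reproduced in compact form; the input row is a fabricated example (the keys mimic Criteo's I1…/C1… column names, the values are made up), not real challenge data.

```python
from math import exp, sqrt

D = 2 ** 20
w = [0.] * D  # weights
n = [0.] * D  # per-feature update counts
alpha = .1    # SGD learning rate

def get_x(csv_row, D):
    x = [0]  # bias term at index 0
    for key, value in csv_row.items():
        x.append(int(value + key[1:], 16) % D)
    return x

def get_p(x, w):
    wTx = sum(w[i] for i in x)  # x[i] = 1 for every listed index
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)  # adaptive learning rate
        n[i] += 1.
    return w, n

# 1: receive a (fabricated) instance   2: predict   3: true label arrives, update
row = {'I1': '3', 'C1': '68fd1e64'}
y = 1
x = get_x(row, D)
p = get_p(x, w)          # 0.5 on the first call, since all weights start at zero
w, n = update_w(w, n, x, p, y)
```

After the update, a second call to `get_p(x, w)` on the same instance yields a probability above 0.5, showing that the feedback moved the hypothesis toward the observed label.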
  12. Hashing trick (feature hashing)
      • Typically used for text classification
      • Using dictionaries to represent features has drawbacks
        – Takes a large amount of memory
        – Grows in size as the training set grows
        – The model can be attacked (e.g. using misspelled words not in the stored vocabulary)
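The trick in miniature, as a toy sketch of my own (not the deck's code): instead of building a vocabulary dictionary that grows with the data, each feature string is hashed directly into a fixed number of buckets, so the feature vector has a constant size no matter what tokens arrive.

```python
D = 16  # fixed number of buckets, chosen up front

def featurize(tokens, D):
    """Map tokens to a fixed-length count vector via hashing: no vocabulary stored."""
    v = [0] * D
    for t in tokens:
        v[hash(t) % D] += 1  # collisions are accepted as the price of fixed memory
    return v

# the vector length never grows, even for tokens never seen before
a = featurize("the cat sat".split(), D)
b = featurize("a never-seen misspeling".split(), D)
```

Misspelled or unseen words simply land in some bucket instead of breaking a vocabulary lookup, which is exactly the robustness point made on the slide.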
  13. Improvements
      • Hashing trick dimension: D = 2 ** 20 -> D = 2 ** 29
      • Hashing function: index = int(value + key[1:], 16) % D -> index = mmh3.hash(str(i) + value) % D
      • MurmurHash spreads even regular keys in a near-random distribution
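Why the "weakest hash ever" needed replacing: `int(value + key[1:], 16) % D` preserves arithmetic structure, so regular (e.g. counter-like) values can only ever reach a structured subset of buckets. The sketch below demonstrates this with the low 4 bits of the index; it uses the stdlib `hashlib.md5` as a stand-in for MurmurHash (the deck's `mmh3` is a third-party package), and the concatenation scheme is simplified from the slide.

```python
import hashlib

D = 2 ** 20

def weak_index(value, key):
    return int(value + key[1:], 16) % D  # the deck's original hash

def md5_index(value, key):
    # stand-in for mmh3.hash(...) % D: any well-mixed hash makes the same point
    return int(hashlib.md5((value + key).encode()).hexdigest(), 16) % D

values = [str(c) for c in range(1000)]  # counter-like feature values
# with key 'I5', every weak index ends in the hex digit 5,
# so all 1000 values share the same residue mod 16
weak_residues = {weak_index(v, 'I5') % 16 for v in values}
md5_residues = {md5_index(v, 'I5') % 16 for v in values}
```

The weak hash pins all 1000 values to a single residue class (1/16 of the buckets), while the well-mixed hash reaches all 16, which is the "near-random distribution of regular keys" the slide credits MurmurHash with.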
  14. Online ML with scikit-learn

      import pandas as pd
      from sklearn import linear_model

      model = linear_model.SGDClassifier()
      train = pd.read_csv("train.csv", chunksize=100000, iterator=True)
      for chunk in train:
          # X, y: features and labels extracted from the chunk (not shown on the slide)
          model.partial_fit(X, y, classes=[0, 1])  # classes is required on the first call
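A runnable version of the same chunked `partial_fit` pattern, using synthetic NumPy data in place of `train.csv` (the file, its columns, and the data-generating rule here are all stand-ins, not the Criteo data):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
model = linear_model.SGDClassifier()  # linear model trained with SGD, supports partial_fit

for _ in range(10):  # each iteration stands in for one 100 000-row CSV chunk
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic linearly separable labels
    # partial_fit must be told the full set of classes on the first call,
    # because no single chunk is guaranteed to contain every class
    model.partial_fit(X, y, classes=[0, 1])

preds = model.predict(rng.normal(size=(3, 5)))
```

Only one chunk is in memory at a time, so the memory footprint is bounded by the chunk size rather than the full training set, which is the point of pairing `read_csv(chunksize=...)` with `partial_fit`.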