• 13 numerical features
  – Mostly counters: number of times the user visited the advertiser website, …
• 26 categorical features
  – Publisher features: the domain of the URL where the ad was displayed, …
  – Advertiser features: advertiser id, type of products, …
  – User features: browser type, …
the label for the instance
• 3: the algorithm receives the true label of the instance
  – use this label feedback to update the hypothesis for future trials

Online Machine Learning
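The three-step trial protocol above can be sketched as a loop. This is a minimal sketch, not the slide's implementation: the names `online_learn`, `predict`, `update`, and the toy stream are assumptions introduced here for illustration.

```python
# Sketch of one online-learning run over a stream of (instance, label) pairs.
# predict(model, x) and update(model, x, p, y) are hypothetical callbacks.
def online_learn(stream, predict, update, model):
    losses = []
    for x, y in stream:             # 1. receive an instance
        p = predict(model, x)       # 2. predict its label
        losses.append(abs(y - p))   # 3. receive the true label...
        update(model, x, p, y)      # ...and use it to update the hypothesis
    return losses
```

The key property is that the prediction for each instance is made before its label is seen, so the reported losses measure generalization on unseen data.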
D = 2 ** 20  # number of weights used for learning

def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # weakest hash ever ;)
        x.append(index)
    return x  # x contains indices of features that have a value of 1

Solution with 75 lines of Python Code
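As a usage sketch of `get_x`: Criteo's categorical values are hexadecimal strings, and a key like `C14` contributes its numeric suffix `14`, so concatenating value and suffix still parses as hex. The sample row below is hypothetical, not from the dataset.

```python
D = 2 ** 20  # number of weights used for learning

def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # hex-parse, then fold into D buckets
        x.append(index)
    return x

# Hypothetical row with two categorical features (hex-string values)
row = {'C14': 'a8cd5504', 'C17': '68fd1e64'}
x = get_x(row, D)
# x[0] is the bias index 0; the remaining entries are hashed indices below D
```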
# initialize our model
D = 2 ** 20  # number of weights used for learning
w = [0.] * D  # weights

def get_p(x, w):
    wTx = 0.
    for i in x:  # do wTx
        wTx += w[i] * 1.  # w[i] * x[i], but if i is in x then x[i] = 1
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid
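Why bound the sigmoid's input to [-20, 20]? A large raw dot product would make `exp()` overflow, and bounding also keeps predictions strictly between 0 and 1, which keeps the log of the prediction finite. A small sketch:

```python
from math import exp

def bounded_sigmoid(wTx):
    # clipping wTx to [-20, 20] avoids OverflowError in exp()
    # and keeps p strictly inside (0, 1)
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))

p_hi = bounded_sigmoid(1000.)   # clipped: same result as bounded_sigmoid(20.)
p_lo = bounded_sigmoid(-1000.)  # clipped: same result as bounded_sigmoid(-20.)
```

Without the clipping, `exp(1000.)` raises `OverflowError` in CPython.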
# initialize our model
w = [0.] * D  # weights
n = [0.] * D  # number of times we've encountered a feature
alpha = .1  # learning rate for SGD optimization

def update_w(w, n, x, p, y):
    for i in x:
        # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
        # (p - y) * x[i] is the current gradient
        # note that in our case, if i is in x then x[i] = 1
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    return w, n
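Putting the pieces together, a self-contained sketch of one training pass: predict with `get_p`, then learn from the label with `update_w`. The two-instance stream below is hypothetical; the real solution derives the index lists from the Criteo CSV via `get_x`.

```python
from math import exp, sqrt

D = 2 ** 20
w = [0.] * D   # weights
n = [0.] * D   # per-feature update counts
alpha = .1     # learning rate for SGD

def get_p(x, w):
    wTx = sum(w[i] for i in x)  # x holds the indices where x[i] = 1
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)  # adaptive step size
        n[i] += 1.
    return w, n

# Hypothetical stream: (active feature indices, 0/1 label)
stream = [([0, 3, 7], 1), ([0, 3, 9], 0)]
for x, y in stream:
    p = get_p(x, w)                 # predict before seeing the label
    w, n = update_w(w, n, x, p, y)  # then update from the label
```

Note how features seen more often (`n[i]` large) get smaller steps, which is the adaptive learning-rate heuristic from the slide.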
• Using dictionaries to represent features has drawbacks
  – They take a large amount of memory
  – They grow in size as the training set grows
  – The model can be attacked (e.g. by using misspelled words that are not in the stored vocabulary)
-> D = 2 ** 29
• Hashing function:
  index = int(value + key[1:], 16) % D
  -> index = mmh3.hash(str(i) + value) % D
• MurmurHash performs well, producing a random distribution even for regular keys
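The hashing trick above can be sketched in a few lines. The slide uses `mmh3.hash(str(i) + value) % D`; the sketch below substitutes the standard library's MD5 for MurmurHash so it runs without the `mmh3` package (a stand-in, not the slide's exact hash), and the sample values are hypothetical.

```python
import hashlib

D = 2 ** 29  # enlarged weight space, as on the slide

def hash_index(i, value, D):
    # Stand-in for mmh3.hash(str(i) + value) % D: any string is folded
    # into one of D buckets, so memory stays fixed regardless of vocabulary.
    h = hashlib.md5((str(i) + value).encode()).hexdigest()
    return int(h, 16) % D

# Unseen or misspelled values still land in a valid bucket, so the
# feature space no longer grows with the training set.
idx_known = hash_index(3, 'a8cd5504', D)
idx_typo = hash_index(3, 'a8cd55O4', D)  # typo'd value still gets a slot
```

The index is deterministic for a given (column, value) pair, which is what lets training and prediction agree without storing a vocabulary.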