A Tale of Two Models: "Traditional" Machine Learning vs Deep Learning

I took a close look at a recent predictive modelling competition on Kaggle.com. The challenge was to predict grocery re-orders on Instacart.com, based on historical transaction data and each user's own order history.

I focused on two submissions, ranked #2 and #3 on the leaderboard. These two solutions achieved nearly identical scores using completely different approaches.

The two solutions are emblematic of two broad approaches to machine learning today.

Robin Ranjit Singh Chauhan

November 03, 2017
Transcript

  1. Market Basket Analysis A Tale of Two Models “Traditional” ML

    vs Deep Learning Analysis of https://www.kaggle.com/c/instacart-market-basket-analysis By Robin Chauhan [email protected]
  2. Who I am • Robin Ranjit Singh Chauhan https://ca.linkedin.com/in/robinc • Head of Engineering,

    and Part-time Data Scientist at AgFunder • Founder, Pathway Intelligence Inc • New to kaggle.com and this group (my 3rd sesh) • Indo-Canadian • Live on the Sunshine Coast • New here, so grateful to meet people (i.e., you) • Would love to hear about what you're into, both on data science and domain sides! 2
  3. Credits • Kaggle.com competitors ONODERA and seanjv for sharing their

    fascinating models • Kernel authors on Kaggle.com for sharing their insight • Bruce Sharpe and Mike Irvine for their advice in preparing this presentation • Matt Kierans, Charles Iliya Krempeaux, and Bruce Sharpe for organizing great data science events in Vancouver • SFU VentureLabs for the meeting space 3
  4. Requests: Please Share - If you understand the Thing better, or in

    a different way - If a Thing is confusing - Share an anecdote about using a related Thing in your domain of interest - Goal: Mutual Enlightenment - Sorry in advance if I get something wrong, please correct me :) - I am here to learn - Afterwards, I would love to hear what I could have improved on 4
  5. Almost picked: MSK Personalized Cancer Treatment • Data leakage •

    New dataset released 5 days before deadline • Private leaderboard scores were astoundingly worse than public • Suspicion that data was mislabelled 7
  6. [image-only slide] 9

  7. Instacart • US grocery delivery company • Orders primarily made

    via phone app + website • “the Most Promising Company in America” -- Forbes in 2015 • Allows shopping at in-store prices (delivery charge?) • Investors include Whole Foods (now owned by Amazon...), Y Combinator, Sequoia, Khosla, Andreessen Horowitz, FundersClub, etc. • As of March 2017, Instacart services 36 markets, composed of 1,200 cities in 25 states (US only atm) • Recently valued at $3.4B (raised $400M in March 2017) 10
  8. Instacart - Data Science Stack • Strong programming skills in

    Python and fluency in data manipulation (SQL, Spark, Pandas) and machine learning (scikit-learn, XGBoost, Keras/Tensorflow) tools • Expertise in machine learning for search optimization, discovery driven recommendations, or advertising optimization • Expertise in natural language and/or image processing for e-commerce catalog quality, enrichment and optimization • Expertise in combining machine learning and operations research to solve optimal routing or inventory management problems • Expertise with deep learning methods and their practical application at scale • Writing production applications using Python, R, and SQL • Blog post: Instacart currently uses XGBoost, word2vec and Annoy in production on similar data to sort items for users to “buy again” ◦ Annoy github: Approximate Nearest Neighbors Oh Yeah is a C++ library with Python bindings to search for points in space that are close to a given query point 11
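
For context, a minimal sketch of typical Annoy usage (the dimensionality, item count, and vectors here are made up for illustration):

    from annoy import AnnoyIndex
    import random

    f = 40                             # vector dimensionality
    index = AnnoyIndex(f, 'angular')   # cosine-style distance
    for i in range(1000):              # e.g., one vector per product
        index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
    index.build(10)                    # 10 trees; more trees, better recall
    similar = index.get_nns_by_item(0, 5)   # 5 nearest neighbours of item 0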
  9. Goal “Currently they use transactional data to develop models that

    predict which products a user will buy again, try for the first time, or add to their cart next during a session” “use this anonymized data on customer orders over time to predict which [if any] previously purchased products will be in a user’s next order” Winners get: “cash prize and a fast track through the recruiting process” 12
  10. Quirks • Different from the general recommendation problem ◦ cold

    start issue of making predictions for new users and new items that we’ve never seen before ◦ E.g., a movie site may need to recommend new movies and make recommendations for brand-new users • Sequential + temporal aspects ◦ How do we take the time since a user last purchased an item into account? ◦ Do users have specific purchase patterns? ◦ Do they buy different kinds of items at different times of the day? 13
  11. Vital statistics • $12k, 8k, 5k prizes • Deadline Aug

    14, 2017 • Submission ◦ predict a space-delimited list of product_ids for that order ◦ If you wish to predict an empty order, you should submit an explicit 'None' value:
    order_id,products
    17,1 2
    34,None
    137,1 2 3
    etc. 14
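
For illustration, a minimal pandas sketch that writes the example predictions above in this format:

    import pandas as pd

    preds = {17: [1, 2], 34: None, 137: [1, 2, 3]}     # order_id -> product_ids
    sub = pd.DataFrame({
        'order_id': list(preds),
        'products': ['None' if p is None else ' '.join(map(str, p))
                     for p in preds.values()],
    })
    sub.to_csv('submission.csv', index=False)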
  12. Evaluation: F1 Metric • Range: 0.0 (worst) to 1.0 (best),

    inclusive • Requires both good Precision and Recall • Implies picking a threshold (unlike AUC) 15
  13. Precision + Recall Refresher Q: “How well did we do?”

    has 2 different answers: “Did we succeed in picking mostly just green ones?” “Out of all the green ones, how many did we pick?” 16 Chart credit: https://commons.wikimedia.org/wiki/User:Walber
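
In symbols, those two questions and their combination are (standard definitions, added here for reference):

    \mathrm{precision} = \frac{TP}{TP+FP}, \qquad
    \mathrm{recall} = \frac{TP}{TP+FN}, \qquad
    F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}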
  14. [image-only slide] 17

  15. [image-only slide] 18

  16. The Instacart Online Grocery Shopping Dataset 2017 • Anonymized •

    over 3 million grocery orders from more than 200,000 Instacart users • For each user, between 4 and 100 of their orders, with the sequence of products purchased in each order. • the week and hour of day the order was placed • a relative measure of time between orders 19
  17. Orders 3.4m rows, 206k users • order_id: order identifier •

    user_id: customer identifier • eval_set: which evaluation set this order belongs in (see SET described below) • order_number: the order sequence number for this user (1 = first, n = nth) • order_dow: the day of the week the order was placed on • order_hour_of_day: the hour of the day the order was placed on • days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1) 21
  18. Orders • order_products__SET (30m+ rows): ◦ order_id: foreign key ◦

    product_id: foreign key ◦ add_to_cart_order: order in which each product was added to cart ◦ reordered: 1 if this product has been ordered by this user in the past, 0 otherwise • where SET is one of the three following evaluation sets (eval_set in orders): ◦ "prior": orders prior to that user's most recent order (~3.2m orders) ◦ "train": training data supplied to participants (~131k orders) ◦ "test": test data reserved for machine learning competitions (~75k orders) 22
  19. [image-only slide] 23

  20. Reorder Frequency 59% of the ordered items are reorders 34

    Chart credit: https://www.kaggle.com/philippsp
  21. Add to cart order vs reorder ratio 36 Chart credit:

    https://www.kaggle.com/sudalairajkumar
  22. Time of last order vs probability of reorder “We can

    see that if people order again on the same day, they order the same product more often. Whereas when 30 days have passed, they tend to try out new things in their order.” [Ed: contrast with “when do reorders happen” slide where 30 days is a peak...] 39 Chart credit: https://www.kaggle.com/philippsp
  23. # of orders vs Probability of reordering “Products with a

    high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect.” 40 Chart credit: https://www.kaggle.com/philippsp
  24. By frequency of purchase [Ed: wish the placement was the same :|]

    43 Chart credit: https://www.kaggle.com/frednavruzov
  25. Model • ReorderPrediction( U, i ) ◦ each of 6

    GBDTs uses a different random seed • NonePrediction( U ) ◦ “11 of these use an eta parameter (a step size shrinkage) set to 0.01, and the others use an eta parameter set to 0.002” 53 View model code: https://github.com/KazukiOnodera/Instacart/ Interview + Explanations: http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/
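
A minimal sketch of this kind of seed ensemble in XGBoost (the features, labels, and hyperparameters are placeholders, not ONODERA's actual setup):

    import numpy as np
    import xgboost as xgb

    # X, y: engineered (user, item) features and reorder labels (placeholders)
    dtrain = xgb.DMatrix(X, label=y)
    dtest = xgb.DMatrix(X_test)

    preds = []
    for seed in range(6):                      # 6 GBDTs, different seeds
        params = {'objective': 'binary:logistic', 'eta': 0.01, 'seed': seed}
        bst = xgb.train(params, dtrain, num_boost_round=500)
        preds.append(bst.predict(dtest))
    reorder_prob = np.mean(preds, axis=0)      # average the ensemble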
  26. None Model “One way to think about None is

    as the probability (1 - Item A) * (1 - Item B) * … But another method is to try to predict None as a special case. By creating a None model and treating None as just another item, I was able to boost my F1 score from 0.400 to 0.407.” 54
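
The first framing, P(None) as a product over per-item probabilities, in a couple of lines (values are illustrative):

    import numpy as np

    # model's reorder probabilities for a user's candidate items
    item_probs = np.array([0.9, 0.4, 0.1])
    p_none = np.prod(1 - item_probs)   # P(no item is reordered)
    # = 0.1 * 0.6 * 0.9 = 0.054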
  27. Data Augmentation Used the past 3 prior orders as additional training

    data: “Instead of only using the provided training set (“tr”), I also looked a short window back in time (the cells shaded in yellow) to gather more data.” 55
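
A sketch of that windowing idea under stated assumptions (an orders frame with a per-user order_number; the names are hypothetical, not ONODERA's code):

    import pandas as pd

    # orders: one row per order, with user_id and order_number columns
    last = orders.groupby('user_id')['order_number'].transform('max')

    # treat the provided target (n=0) plus each of the 3 prior orders
    # (n=1..3) as extra prediction targets; features for each target
    # would be built only from the orders that precede it
    extra_targets = pd.concat(
        [orders[orders['order_number'] == last - n] for n in range(4)]
    )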
  28. Feature Engineering “I believe my strength is feature engineering” --ONODERA

    • User Features • Item Features • ( User, Item ) features • Datetime features 56
  29. Feature Importance • total_buy_n5(User A, Item B) is the total

    number of times User A bought Item B out of the 5 most recent orders • total_buy_ratio_n5 is the proportion of A's 5 most recent orders in which A bought B [ed: how different is that from total_buy_n5?] • useritem_order_days_max_n5 is the longest that A has recently gone without buying B • order_ratio_by_chance_n5 is the proportion of recent orders in which A had the chance to buy B, and did indeed do so ◦ A "chance" refers to the number of opportunities the user had for buying the item after first encountering it ◦ For example, if user A had order numbers 1-5, and bought item B at order number 2, then the user had 4 chances to buy the item, at order numbers 2, 3, 4, and 5 • useritem_order_days_median_n5 is the median number of days that A has recently gone without buying B 58
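
For instance, total_buy_n5 could be computed roughly like this (a sketch over a hypothetical long-format frame df with one row per user/product/order, not ONODERA's code):

    import pandas as pd

    # df: one row per (user_id, product_id, order_number) purchase event
    last = df.groupby('user_id')['order_number'].transform('max')
    recent = df[df['order_number'] > last - 5]        # 5 most recent orders

    # times each user bought each product within those orders
    total_buy_n5 = (recent.groupby(['user_id', 'product_id'])
                          .size()
                          .rename('total_buy_n5'))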
  30. Feature Importance: None Prediction • useritem_sum_pos_cart-mean(User A): whether the user tends to buy a

    lot of items at once • total_buy-max: max number of times the user has bought any item • total_buy_ratio_n5-max: the maximum proportion of the 5 most recent orders in which the user bought a certain item. E.g., if there was an item the user bought in 4 out of their 5 most recent orders, but no other item more often than that, this feature would be 0.8 • total_buy-mean: mean number of times the user has bought any item • t-1_reordered_ratio: proportion of items in the previous order that were repurchases 60
  31. Insight: When user does *not* order “This user pretty much

    always orders Cola. But at order number 8, the user didn’t. Why not? Probably because the user bought Fridge Pack Cola instead.” 61
  32. Insight: days_last_order-max • days_since_last_order_this_item(User A, Item B): # of days

    since User A last ordered Item B • useritem_orders_days_max(User A, Item B): max of the above feature across time, i.e., the longest that User A has ever gone without ordering B • days_last_order-max(User A, Item B): the difference between these two, i.e., how ready the user is to repurchase the item (e.g., if A last bought B 25 days ago and has never gone more than 30 days without it, the current gap is close to the longest gap ever, so A is nearly "due") 62
  33. F1 Maximization • Initial grid search for a global p_min yielded 0.2 • Discussion

    of how each order should have its own threshold; code from Faron • Row 1: we should predict that Item A and only Item A will be reordered; needs a threshold between 0.3 and 0.9 • Row 2: the optimal choice is to predict that Items A and B will both be reordered; needs a threshold less than 0.2 (the probability that Item B will be reordered) [Table: model predictions vs. our submission, with expected F1 for each (repeated/average) case] 63
  34. F1 Maximization: Threshold Selection “So each order should have its

    own threshold. To determine this threshold I wrote a simulation algorithm as follows. Simulate 10k itemsets using probabilities from model…” 64
  35. F1 Maximization: Threshold Selection Calculate the expected F1 score for

    each set of labels, starting from the highest-probability items. Add items (e.g., [A], then [A, B], then [A, B, C], etc.) until the F1 score peaks and then decreases [ed: at time of prediction] 65
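
A minimal sketch of that simulation (the function and variable names are mine, and it ignores the None option and ONODERA's exact implementation details):

    import numpy as np

    def best_topk_by_expected_f1(probs, n_sim=10000, seed=0):
        # probs: the model's reorder probabilities for one order's candidates
        rng = np.random.default_rng(seed)
        probs = np.sort(np.asarray(probs))[::-1]          # highest first
        # simulate ground-truth itemsets: item i reordered with prob probs[i]
        truth = rng.random((n_sim, len(probs))) < probs   # (n_sim, n_items)
        n_true = truth.sum(axis=1)
        best_k, best_f1 = 0, 0.0
        for k in range(1, len(probs) + 1):                # predict top-k items
            tp = truth[:, :k].sum(axis=1)                 # hits per simulation
            f1 = np.mean(2.0 * tp / (k + n_true))         # F1 = 2TP/(|pred|+|truth|)
            if f1 < best_f1:                              # peaks, then falls
                break
            best_k, best_f1 = k, f1
        return best_k, best_f1

The submission for the order would then be its best_k highest-probability items.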
  36. 3rd Place: seanjv • Deep learning (plus lightgbm)! • Tensorflow

    (no keras) • Polar opposite of ONODERA solution: little feature engineering • Github: “Student at MIT” • https://github.com/sjvasquez/instacart-basket-prediction 66
  37. Top Level Architecture [Diagram: inputs feed the first-level models

    (Product RNN w/LSTM + Wavenet CNN, Aisle RNN, Dept RNN, Product RNN Bernoulli mixture model, Order size RNN, Order size RNN mixture model, Skip-gram w/negative sampling, NNMF); their outputs feed a second level of lightgbm and a FF NN (2 layer, 1 skip), combined by weighted average and F1 maximization] 67 Diagram credit: Robin Chauhan [email protected]
  38. Deep learning solution: First level • Product RNN/CNN (code): a

    combined RNN and CNN trained to predict the probability that a user will order a product at each timestep. The RNN is a single-layer LSTM and the CNN is a 6-layer causal CNN with dilated convolutions. • Aisle RNN (code): an RNN similar to the first model, but trained at the aisle level (predict whether a user purchases any products from a given aisle at each timestep). • Department RNN (code): an RNN trained at the department level. • Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model. • Order size RNN (code): an RNN trained to predict the next order size, minimizing RMSE. • Order size RNN mixture model (code): an RNN trained to predict the next order size, maximizing the likelihood of a Gaussian mixture model. • Skip-Gram with Negative Sampling (SGNS) (code): SGNS trained on sequences of ordered products. • Non-Negative Matrix Factorization (NNMF) (code): NNMF trained on a matrix of user-product order counts. 68
  39. Product RNN/CNN: Detail [Diagram: inputs feed a 6-layer Wavenet of dilated

    causal convolutions and an LSTM layer in parallel; their outputs pass through a time-distributed dense layer (TDDL) with ReLU, then a TDDL with sigmoid] 69 Diagram credit: Robin Chauhan [email protected]
  40. Product RNN/CNN

    h = lstm_layer(x, self.history_length, self.lstm_size)
    # wavenet: time_distributed_dense_layer, multiple (6) temporal_convolution_layers,
    # then time_distributed_dense_layer
    c = wavenet(x, self.dilations, self.filter_widths, self.skip_channels,
                self.residual_channels)
    # wavenet(x) and lstm(x) run in parallel with x
    h = tf.concat([h, c, x], axis=2)
    # time_distributed_dense_layer: applies a shared dense layer to each timestep
    # of a tensor of shape [batch_size, max_seq_len, input_units]
    self.h_final = time_distributed_dense_layer(h, 50, activation=tf.nn.relu,
                                                scope='dense-1')
    y_hat = time_distributed_dense_layer(self.h_final, 1, activation=tf.nn.sigmoid,
                                         scope='dense-2')
    y_hat = tf.squeeze(y_hat, 2)
    70
  41. Product RNN/CNN: Inputs

    def get_input_sequences(self):
        self.user_id = tf.placeholder(tf.int32, [None])
        self.product_id = tf.placeholder(tf.int32, [None])
        self.aisle_id = tf.placeholder(tf.int32, [None])
        self.department_id = tf.placeholder(tf.int32, [None])
        self.is_none = tf.placeholder(tf.int32, [None])
        self.history_length = tf.placeholder(tf.int32, [None])
        self.is_ordered_history = tf.placeholder(tf.int32, [None, 100])  # note dimensions
        self.index_in_order_history = tf.placeholder(tf.int32, [None, 100])
        self.order_dow_history = tf.placeholder(tf.int32, [None, 100])
        self.order_hour_history = tf.placeholder(tf.int32, [None, 100])
        self.days_since_prior_order_history = tf.placeholder(tf.int32, [None, 100])
        self.order_size_history = tf.placeholder(tf.int32, [None, 100])
        self.reorder_size_history = tf.placeholder(tf.int32, [None, 100])
        self.order_number_history = tf.placeholder(tf.int32, [None, 100])
        self.product_name = tf.placeholder(tf.int32, [None, 30])
        self.product_name_length = tf.placeholder(tf.int32, [None])
        self.next_is_ordered = tf.placeholder(tf.int32, [None, 100])
        ...
    71
  42. Product RNN/CNN: Wavenet

    def wavenet(x, dilations, filter_widths, skip_channels, residual_channels,
                scope='wavenet', reuse=False):
        """
        A stack of causal dilated convolutions with parameterized residual and
        skip connections, as described in the WaveNet paper (with some minor
        differences).
        """
        ...
    72
  43. TDDL: time_distributed_dense_layer

    def time_distributed_dense_layer(inputs, output_units, bias=True,
                                     activation=None, batch_norm=None,
                                     dropout=None,
                                     scope='time-distributed-dense-layer',
                                     reuse=False):
        """
        Applies a shared dense layer to each timestep of a tensor of shape
        [batch_size, max_seq_len, input_units] to produce a tensor of shape
        [batch_size, max_seq_len, output_units].

        Args:
            inputs: Tensor of shape [batch size, max sequence length, ...].
            output_units: Number of output units.
            activation: activation function.
            dropout: dropout keep prob.

        Returns:
            Tensor of shape [batch size, max sequence length, output_units].
        """
    74
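
One way such a layer can be implemented (a minimal sketch under TF 1.x conventions, not the author's code: a dense layer applied to the last axis of a 3-D tensor already shares its weights across timesteps):

    import tensorflow as tf

    def time_distributed_dense_sketch(inputs, output_units, activation=None,
                                      scope='tdd', reuse=False):
        # inputs: [batch_size, max_seq_len, input_units]; the same dense
        # weights are applied independently at every timestep
        with tf.variable_scope(scope, reuse=reuse):
            return tf.layers.dense(inputs, output_units, activation=activation)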
  44. Product RNN/CNN: Embeddings

    product_embeddings = tf.get_variable(
        name='product_embeddings', shape=[50000, self.lstm_size], dtype=tf.float32)
    aisle_embeddings = tf.get_variable(
        name='aisle_embeddings', shape=[250, 50], dtype=tf.float32)
    department_embeddings = tf.get_variable(
        name='department_embeddings', shape=[50, 10], dtype=tf.float32)
    user_embeddings = tf.get_variable(
        name='user_embeddings', shape=[207000, self.lstm_size], dtype=tf.float32)
    75
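
For context, such a table is consumed via an embedding lookup: each integer id selects one row of the matrix (a generic sketch, not necessarily the author's exact wiring; the 300-dim size is made up):

    import tensorflow as tf

    product_embeddings = tf.get_variable(
        name='product_embeddings', shape=[50000, 300], dtype=tf.float32)
    product_ids = tf.placeholder(tf.int32, [None])   # a batch of product ids
    # one dense 300-dim vector per id, learned during training
    product_vectors = tf.nn.embedding_lookup(product_embeddings, product_ids)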
  45. Deep learning solution: Second level • GBM (code): a lightgbm

    model. • Feedforward NN (code): a feedforward neural network. The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is chosen by using these probabilities and choosing the product subset with maximum expected F1-score. 76
  46. Other solutions • CatBoost popular • Some users isolated the

    most popular products • Association rule learning (mlxtend), moving averages, moments of distributions • “you can do (almost) anything in pandas in reasonable time if you spend enough time on planning and vectorizing” -- Kucsko ◦ “we could do a groupby user+product apply diff, however this is extremely slow ◦ instead we can be a bit clever and do a groupby user+product shift and then take the difference between the shifted and unshifted vectors ◦ after that we can easily take mean, median, std, etc.” 77
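
Kucsko's shift trick, sketched over a hypothetical frame df with an order timestamp column (names are illustrative):

    import pandas as pd

    # slow: df.groupby(['user_id', 'product_id'])['order_ts'].apply(lambda s: s.diff())
    # fast: shift within each group, then one vectorized subtraction
    shifted = df.groupby(['user_id', 'product_id'])['order_ts'].shift()
    df['gap'] = df['order_ts'] - shifted

    # then mean, median, std, etc. come cheaply
    gap_stats = df.groupby(['user_id', 'product_id'])['gap'].agg(['mean', 'median', 'std'])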
  47. Words of wisdom from ONODERA What have you taken away

    from this competition? All metrics can be hacked, I think. Especially metrics where we have to convert probabilities to binary scores. (Although metrics like AUC are rarely hacked.) 78
  48. Words of wisdom from ONODERA Do you have any advice

    for those just getting started in data science? Join the competitions you like. But never give up before the end, and try every approach you come up with. I know it’s a tradeoff between sleep and your leaderboard ranking. It’s common for features that take a lot of time to construct to wind up doing nothing. But we can’t know the result if we don't do anything. So the most important thing is to participate in the delusion that you’ll get a better result if you try! 79
  49. “Traditional” Machine Learning vs Deep Learning

    • User: ONODERA (“Traditional” ML) vs seanjv (Deep Learning)
    • Final private leaderboard score: 0.4082039 vs 0.4081041
    • Primary modelling libraries: XGBoost vs Tensorflow, XGBoost
    • Feature engineering: many hand-built features, explicitly the focus here vs “feature engineering” of a sort in designing the sub-networks, with other features “discovered” by deep learning training
    • Modelling order history: implicit, via a variety of features based on order history data vs explicitly modelled as an input sequence
    • Ensemble: multiple similar XGBoost models, None-models + product models vs multiple different TF models trained to predict different aspects of the output, whose outputs became inputs to a parallel [NN, XGBoost] layer, finally a weighted average
    • Hyperparameter optimization: unknown vs unknown
    • Leaderboard trajectory: leader for weeks, fell to #2, #3, then ended at #2 vs late entry, quick ascent to the top, ending at #3 80