Slide 1

Implicit and explicit feedback recommenders And the curse of RMSE Maciej Kula @Maciej_Kula

Slide 2

Purpose of this talk Convince you that: 1. RMSE is never an appropriate evaluation metric for a recommender system, and 2. Implicit feedback is far more valuable than explicit feedback (in most cases).

Slide 3

Terminology

Slide 4

Explicit feedback recommender system A system where we rely on the user giving us explicit signals about their preferences. Most famously, ratings. Could also be thumbs up, thumbs down.

Slide 5

Implicit feedback recommender system No explicit feedback. Use user clicks/queries/watches to infer preference. Lack of clicks is implicit lack of preference.

Slide 6

Root mean square error (RMSE) Evaluation metric: how well can we predict the ratings users give to movies they watched.
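For reference, the standard definition (not spelled out on the slide), where O is the set of observed (user, item) pairs, r_ui the rating given and r̂_ui the model's prediction:

\mathrm{RMSE} = \sqrt{ \frac{1}{|O|} \sum_{(u,i) \in O} \left( r_{ui} - \hat{r}_{ui} \right)^2 }

Note that the sum runs over observed (user, item) pairs only; this is exactly the limitation the rest of the talk is about.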

Slide 7

Historical note

Slide 8

The Netflix Challenge There were other collaborative filtering datasets before, but it was the Netflix Challenge that really generated momentum in the field. A $1,000,000 prize was offered for a 10% RMSE improvement over Netflix's existing Cinematch system. Importantly, the dataset contained ratings, and accuracy was evaluated using RMSE.

Slide 9

Rapid pace of innovation Lots of innovative solutions were devised. Variants of matrix factorization proved most successful.
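As a reminder (notation added here, not on the slide), a typical factorization variant from that period predicts a rating as a dot product of learned user and item factors plus bias terms,

\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

and is fit by minimising squared error on the observed ratings.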

Slide 10

Implicit feedback Separately, there was also work on implicit feedback recommenders: Hu, Koren, and Volinsky (2008), Collaborative Filtering for Implicit Feedback Datasets. But the approach was still treated as a fallback solution for when explicit feedback was not available.

Slide 11

But implicit feedback is more useful than that

Slide 12

In fact, there is a problem with recommender systems built solely on explicit feedback Excellent paper by Steck (2010), Training and Testing of Recommender Systems on Data Missing Not at Random.

Slide 13

We want to define a ranking over all items So we shouldn’t evaluate our system only on observed ratings. In general, we can have recommenders that achieve a perfect RMSE score and yet are utterly useless: a model that predicts every held-out rating exactly but scores the items a user never rated arbitrarily is one example. The implicit assumption behind models trained and evaluated only on observed ratings is that the ratings that are not observed are missing at random.

Slide 14

Are ratings missing at random? For this to hold, both of the following must be true: 1. Once a user watches a movie, how much they enjoyed it does not influence the likelihood that they will leave a rating. 2. The likelihood that a user watches a movie is not correlated with how highly they rate it: that is, watching or not watching a movie carries no information about whether the user likes it. Both are patently false.

Slide 15

Truncated variable model We need to model both components: the conditional ratings and the truncation mechanism. P(rating, observed) = P(rating | observed) x P(observed) This situation is common in econometrics; without taking truncation into account, the estimated coefficients may even have the wrong sign.
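Written per user-item pair (indices added here, not on the slide):

P(r_{ui}, o_{ui} = 1) = P(r_{ui} \mid o_{ui} = 1) \, P(o_{ui} = 1)

A model fit on observed ratings alone estimates only the first factor; producing recommendations also requires the second, the probability that the user interacts with the item at all.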

Slide 16

Empirical evaluation Does modelling the truncation mechanism improve the resulting recommender model? Steck (2010) runs the following experiment: 1. Train a classic factorization model on observed ratings only. 2. Train a logistic regression model, setting the outcome to 1 if rating is 5, and 0 otherwise. 3. Compare the two models using ranking metrics.
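A minimal sketch of the two training objectives, the explicit (squared-error) one and the implicit (logistic, rating of 5 as positive) one, on toy data. Sizes and hyperparameters are hypothetical; this is not the code behind the slides or Steck's paper.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 200, 500, 10

# Toy observed ratings: (user, item, rating) triples covering ~2% of the matrix.
n_obs = int(0.02 * n_users * n_items)
obs_users = rng.integers(0, n_users, n_obs)
obs_items = rng.integers(0, n_items, n_obs)
obs_ratings = rng.integers(1, 6, n_obs).astype(float)

def explicit_mf(epochs=10, lr=0.05, reg=0.05):
    """Classic factorization: minimise squared error on observed ratings only."""
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in zip(obs_users, obs_items, obs_ratings):
            err = r - U[u] @ V[i]
            U[u], V[i] = (U[u] + lr * (err * V[i] - reg * U[u]),
                          V[i] + lr * (err * U[u] - reg * V[i]))
    return U, V

def implicit_logistic_mf(epochs=10, lr=0.05, reg=0.05):
    """Logistic factorization: outcome 1 if the rating is 5, 0 otherwise;
    unobserved entries are stood in for by one sampled negative per observation."""
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    labels = (obs_ratings == 5.0).astype(float)
    for _ in range(epochs):
        for u, i, y in zip(obs_users, obs_items, labels):
            for uu, ii, yy in ((u, i, y), (u, rng.integers(0, n_items), 0.0)):
                p = 1.0 / (1.0 + np.exp(-(U[uu] @ V[ii])))
                g = yy - p
                U[uu], V[ii] = (U[uu] + lr * (g * V[ii] - reg * U[uu]),
                                V[ii] + lr * (g * U[uu] - reg * V[ii]))
    return U, V

The two models are then compared by ranking every item for each user with a ranking metric, rather than by prediction error on the ratings that happen to be observed.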

Slide 17

Empirical evaluation

Slide 18

Empirical evaluation It’s not even close. Implicit feedback alone is much better than explicit feedback alone. Putting the two together gives the best result.

Slide 19

Easily reproducible I ran an experiment with the same general setup. The code is at https://github.com/maciejkula/explicit-vs-implicit as a Jupyter notebook. The results are the same: an implicit feedback model achieves an MRR of 0.07, compared to 0.02 from an explicit feedback model.
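For completeness, MRR over all items can be computed along these lines (a generic sketch, not the notebook linked above), given factor matrices like those from the earlier sketch:

import numpy as np

def mrr(U, V, test_users, test_items):
    """Mean reciprocal rank: for each held-out (user, item) pair, rank the held-out
    item's score against every item's score for that user (training items are not
    excluded here, for brevity)."""
    scores = U[test_users] @ V.T                              # (n_test, n_items)
    target = scores[np.arange(len(test_users)), test_items]   # score of held-out item
    ranks = (scores > target[:, None]).sum(axis=1) + 1        # 1 is the best possible rank
    return float(np.mean(1.0 / ranks))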

Slide 20

This is a (mostly) well-known conclusion Netflix don’t use stars any more (Goodbye stars, hello thumbs). But every year, new (otherwise great) papers come out that use explicit feedback only and evaluate on observed ratings. So if there are two things you take away from this talk…

Slide 21

Never use RMSE Or any metric on observed ratings only.

Slide 22

Implicit feedback beats explicit feedback (with caveats)

Slide 23

Thanks! Find me on Twitter @Maciej_Kula