• Serve new users ASAP
• Fine to run a special system for either case
• Most content leaks
• … or is leaked
• All serious commercial content is annotated
“…approach is that all elements the model should rank in the future are presented to the learning algorithm as negative feedback during training. That means a model with enough expressiveness (that can fit the training data exactly) cannot rank at all as it predicts only 0s. The only reason why such machine learning methods can predict rankings are strategies to prevent overfitting, like regularization.”

Rendle et al., BPR: Bayesian Personalized Ranking from Implicit Feedback, UAI 2009
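The quoted point is easy to see in miniature: if every unobserved user-item pair is treated as a 0 label, a model that fits the training matrix exactly scores every candidate item 0, so all candidates tie. A toy numpy illustration (the matrix values are invented purely for this example):

```python
import numpy as np

# Toy implicit-feedback matrix: 1 = observed interaction, 0 = unobserved.
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)

# A model expressive enough to fit the training data exactly is,
# in effect, the matrix itself.
scores = R.copy()

# Candidate items for user 0 are exactly the unobserved ones, and the
# "perfect" model scores them all 0 -- there is nothing to rank by.
candidates = np.flatnonzero(R[0] == 0)
print(scores[0, candidates])  # [0. 0.]
```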
“…the rating values for those items that a user has deliberately chosen to rate. This kind of data can be collected easily, moreover [RMSE] … can easily be evaluated on the user-item pairs that actually have a rating value in the data. The objective of common real-world rating prediction tasks, however, is often different from this scenario: typically, the goal is to predict the rating value for any item in the collection, independent of the fact if a user rates it or not.”

Steck, Evaluation of Recommendations: Rating-Prediction and Ranking, ACM RecSys 2013
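In practice Steck's point means ranking metrics should be computed over the whole item collection, not only over the pairs that happen to have ratings. A minimal sketch (the function name and signature are my own, for illustration):

```python
import numpy as np

def recall_at_k(scores, held_out_items, k=20):
    """Recall@k for one user, ranking *every* item in the collection.

    scores         : model score for each item in the catalogue (1-d array)
    held_out_items : indices of items the user actually interacted with
                     in the held-out test data
    """
    top_k = np.argsort(-scores)[:k]
    hits = len(set(top_k) & set(held_out_items))
    return hits / min(k, len(held_out_items))
```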
…problem – with a constrained target item set
• Possible state of the art c. April 2013:
– pimped old skool neighbourhood method
– matrix factorization and then neighbourhood
– something dynamic (?)
– SLIM
• Use some side data
…[1]
– Easy extensions to include side data [2]
• Not so good:
– Reported to be slow beyond small datasets [1]

[1] X. Ning and G. Karypis, SLIM: Sparse Linear Methods for Top-N Recommender Systems, Proc. IEEE ICDM, 2011.
[2] X. Ning and G. Karypis, Sparse Linear Methods with Side Information for Top-N Recommendations, Proc. ACM RecSys, 2012.
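For context, [1] learns a sparse item-item weight matrix $W$ over the user-item matrix $A$ by solving (as formulated in the paper, with $\beta$ and $\lambda$ its two regularization constants):

$$
\min_{W}\; \tfrac{1}{2}\,\lVert A - AW \rVert_F^2
\;+\; \tfrac{\beta}{2}\,\lVert W \rVert_F^2
\;+\; \lambda\,\lVert W \rVert_1
\qquad \text{s.t.}\quad W \ge 0,\;\; \mathrm{diag}(W) = 0
$$

The $\ell_1$ term produces the sparse similarity lists, and each column of $W$ is an independent subproblem.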
• 5M docs/keywords, 1M users, 140M non-zeros
• Constrained to ~100k target users
• Python implementation on top of scikit-learn (see the sketch below)
– Trivially parallelized with IPython
– eats CPU but easy on AWS
• 10% improvement over nearest neighbours
• 5% CTR in email test
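The slides name only scikit-learn, so the following is a guess at the shape of such an implementation rather than the actual code: one elastic-net regression per target item, with the item's own column masked out. The helper name fit_slim_column, the SGDRegressor solver, and the hyperparameter values are all assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor

def fit_slim_column(A, j, alpha=0.001, l1_ratio=0.5):
    """Learn column j of SLIM's item-item weight matrix W.

    A : user-item interactions, scipy.sparse CSC, shape (n_users, n_items)
    j : index of the target item
    """
    y = np.asarray(A.getcol(j).todense()).ravel()

    # Mask out item j's own column so it cannot predict itself (w_jj = 0).
    start, end = A.indptr[j], A.indptr[j + 1]
    saved = A.data[start:end].copy()
    A.data[start:end] = 0.0

    model = SGDRegressor(penalty='elasticnet', alpha=alpha,
                         l1_ratio=l1_ratio, fit_intercept=False)
    model.fit(A, y)

    A.data[start:end] = saved  # restore the masked column

    # Clip negative weights -- a crude stand-in for SLIM's W >= 0 constraint.
    return sparse.csr_matrix(np.maximum(model.coef_, 0.0))
```

Because each column is an independent regression, scattering item indices across an IPython cluster parallelizes this trivially, which matches the bullet above.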
…want the sparsest similarity lists that are not too sparse
– “too sparse” = # items with < k similar items
• Grid search with a small sample of items (sketch below)
• Empirically corresponds well to optimising recommendation accuracy on a validation set
– but faster and easier
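A hedged sketch of that model-selection loop, reusing the hypothetical fit_slim_column above (A and sample_items are assumed defined; the grid values and k are invented):

```python
import numpy as np
from scipy import sparse

def n_too_sparse(W, k=10):
    """Number of items whose similarity list has fewer than k entries."""
    nnz_per_row = np.diff(W.tocsr().indptr)  # one row per target item
    return int((nnz_per_row < k).sum())

# Grid search on a small sample of items, strongest regularization first:
# the first alpha whose lists are all dense enough gives the sparsest
# acceptable model.
best_alpha = None
for alpha in (0.1, 0.01, 0.001, 0.0001):
    W_sample = sparse.vstack([fit_slim_column(A, j, alpha=alpha)
                              for j in sample_items])
    if n_too_sparse(W_sample, k=10) == 0:
        best_alpha = alpha
        break
```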
• Model fit metrics (e.g. validation loss) don't count
• Need a transparent “audit trail” of data to support genuine reproducibility
– Just using public datasets doesn't ensure this
– Integrate with recommender implementations
– Handle data formats and preprocessing
– Handle splitting, cross-validation, side datasets
– Save everything to file (see the sketch below)
– Work from file inputs so not tied to one framework
– Generate meaningful metrics
– Well documented and easy to use
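As a concrete example of the “save everything to file” and “handle splitting” items, a minimal sketch (the file format and function are illustrative, not any particular framework's API):

```python
import numpy as np

def save_split(ratings, train_path, test_path, test_fraction=0.2, seed=42):
    """Randomly split (user, item, value) triples and write both halves
    as TSV, so any framework -- or a later audit -- can read them back.

    ratings : ndarray of shape (n, 3) holding (user, item, value) rows
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(len(ratings)) < test_fraction
    np.savetxt(train_path, ratings[~mask], fmt='%d\t%d\t%g')
    np.savetxt(test_path, ratings[mask], fmt='%d\t%d\t%g')
```

Recording the seed and split fraction alongside the files would extend this toward the audit-trail point above.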
– Reports meaningful metrics
– Handles cross-validation
– Data splitting not transparent
– No support for pre-processing
– No built-in support for standalone evaluation
– API is capable but current utils don't meet the wishlist