
Item Similarity Revisited

Talk by Mark Levy, Sr. Data Scientist @Mendeley, at the Data Science London meetup.

Data Science London

January 12, 2014

Transcript

  1. Real World Requirements • Cheap to compute • Explicable • Easy to tweak + combine with business rules • Not a black box • Work without ratings • Handle multiple data sources • Offer something to anon users
  2. Cold Start? • Can't work miracles for new items • Serve new users asap • Fine to run a special system for either case • Most content leaks • … or is leaked • All serious commercial content is annotated
  3. Life Before Netflix • Customers who buy X also buy Y • You might also like item-based k-NN • Not much “Recommender Systems”
  4. Degrees of Freedom for k-NN • Input numbers from mining logs • Temporal “modelling” (e.g. fake users) • Data pruning (scalability, popularity bias, quality) • Preprocessing (tf-idf, log/sqrt, …) • Hand-crafted similarity metric • Hand-crafted aggregation formula • Postprocessing (popularity matching) • Diversification • Attention profile
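
Several of these knobs are easiest to see in code. Below is a minimal item-based k-NN sketch (all names are illustrative, not from the talk): log preprocessing of counts, cosine as the hand-crafted similarity, top-k pruning, and a plain weighted sum as the aggregation formula. It assumes an item set small enough for a dense item-item matrix.

```python
import numpy as np
from sklearn.preprocessing import normalize

def topk_item_similarity(R, k=50):
    """R: sparse user x item matrix of implicit counts; returns dense item x item weights."""
    X = R.copy()
    X.data = np.log1p(X.data)            # preprocessing: damp raw counts with log
    X = normalize(X, axis=0)             # unit-norm columns, so X.T @ X gives cosine similarity
    S = np.asarray((X.T @ X).todense())  # item x item cosine similarities
    np.fill_diagonal(S, 0.0)             # an item is not its own neighbour
    for j in range(S.shape[1]):          # pruning: keep only the k largest weights per item
        cutoff = np.sort(S[:, j])[-k] if S.shape[0] > k else 0.0
        S[S[:, j] < cutoff, j] = 0.0
    return S

def recommend(R, S, user, n=10):
    """Aggregation: score items by a weighted sum of similarities to the user's items."""
    profile = np.asarray(R[user].todense()).ravel()
    scores = profile @ S
    scores[profile > 0] = -np.inf        # don't re-recommend items the user already has
    return np.argsort(scores)[::-1][:n]
```
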
  5. $1M's Worth of Innovation • Formal models (for rating prediction) • Latent factors • Efficient SVD for incomplete data • Recommender Systems • Cult of RMSE
  6. Solving the Wrong Problem • “The problem with [the rating prediction] approach is that all elements the model should rank in the future are presented to the learning algorithm as negative feedback during training. That means a model with enough expressiveness (that can fit the training data exactly) cannot rank at all as it predicts only 0s. The only reason why such machine learning methods can predict rankings are strategies to prevent overfitting, like regularization.” (Bayesian Personalized Ranking from Implicit Feedback, Rendle et al., 2009)
  7. Solving the Wrong Problem • “The literature has focused on predicting the rating values for those items that a user has deliberately chosen to rate. This kind of data can be collected easily, moreover [RMSE] … can easily be evaluated on the user-item pairs that actually have a rating value in the data. The objective of common real-world rating prediction tasks, however, is often different from this scenario: typically, the goal is to predict the rating value for any item in the collection, independent of the fact if a user rates it or not.” (Evaluation of Recommendations: Rating-Prediction and Ranking, Steck, 2013)
  8. Implicit Feedback and Ranking • Most people won't give ratings • Rating prediction doesn't learn ranking • New algorithms to optimize AUC, MRR: BPR, RankALS, CCF, CliMF, CoFiSet, ... • Still learn latent factors
  9. From Factors to Recommendations • Predict preferences for every item, sort them • Wait for someone to invent the Maximum Inner Product tree • What about anon users?
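
Taken literally, the first bullet is just a scored sort over every item. A tiny sketch with illustrative names, brute force because the Maximum Inner Product tree hasn't been invented yet:

```python
import numpy as np

def top_n(user_factors, item_factors, n=10):
    """user_factors: (d,) vector; item_factors: (num_items, d) matrix."""
    scores = item_factors @ user_factors   # predicted preference for every item
    return np.argsort(scores)[::-1][:n]    # sort and keep the best n
```
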
  10. A Use Case at Mendeley • Social network products @Mendeley • ERASM Eurostars project • Make newsfeed more interesting • First task: who to follow
  11. A Use Case at Mendeley • Multiple datasets – 2M users, many active – ~100M documents – author keywords, social tags – 50M physical pdfs – 15M <user,document> per month – User-document matrix currently ~250M non-zeros
  12. A Use Case at Mendeley • Approach as item similarity problem – with a constrained target item set • Possible state of the art c. April 2013: – pimped old skool neighbourhood method – matrix factorization and then neighbourhood – something dynamic (?) – SLIM • Use some side data
  13. SLIM • Learn sparse item similarity weights • No explicit neighbourhood or metric • L1 regularization gives sparsity • Bound-constrained least squares:
  14. SLIM • [Diagram: R ≈ R × W, rows = users, columns = items; each column r_j of R is approximated by R times a sparse weight column w_j of W] • model.fit(R, r_j); w_j = model.coef_
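
The fit/coef_ pseudocode above maps directly onto scikit-learn. Here is a minimal sketch of fitting one column of W, assuming the bound-constrained, L1-regularized least squares is approximated by ElasticNet with positive=True; the function and variable names are mine, not the SLIM authors' code.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_slim_column(R, j, l1_reg=0.001, l2_reg=0.0001):
    """Learn sparse similarity weights w_j such that R[:, j] ~= R @ w_j."""
    model = ElasticNet(alpha=l1_reg + l2_reg,
                       l1_ratio=l1_reg / (l1_reg + l2_reg),
                       positive=True,                 # bound constraint: weights >= 0
                       fit_intercept=False)
    r_j = np.asarray(R[:, j].todense()).ravel()       # target column of R
    R_train = R.tolil(copy=True)
    R_train[:, j] = 0                                 # exclude the trivial self-similarity w_jj
    model.fit(R_train.tocsc(), r_j)
    return model.coef_                                # w_j, mostly zeros thanks to the L1 term
```

Fitting all of W is then just a loop (or a parallel map) of fit_slim_column over the target items.
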
  15. SLIM • Good: – Outperforms MF methods on implicit ratings data [1] – Easy extensions to include side data [2] • Not so good: – Reported to be slow beyond small datasets [1] • [1] X. Ning and G. Karypis, SLIM: Sparse Linear Methods for Top-N Recommender Systems, Proc. IEEE ICDM, 2011. [2] X. Ning and G. Karypis, Sparse Linear Methods with Side Information for Top-N Recommendations, Proc. ACM RecSys, 2012.
  16. From SLIM to regression • Avoid constraints • Regularized regression, learn with SGD • Easy to implement • Faster on large datasets
  17. From SLIM to regression • [Diagram: R ≈ R´ × W as before, rows = users, columns = items; each target column r_j is fitted by regression on R´] • model.fit(R´, r_j); w_j = model.coef_
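
Dropping the constraint, the same per-column fit can be done with scikit-learn's SGDRegressor and an elastic-net penalty, as the slides suggest; the hyperparameters and names below are illustrative rather than the exact Mendeley implementation.

```python
from sklearn.linear_model import SGDRegressor

def fit_similarity_column_sgd(R_train, r_j, l1_reg=0.001, l2_reg=0.0001, n_epochs=10):
    """R_train: sparse user x item matrix with column j zeroed out; r_j: dense target column."""
    model = SGDRegressor(penalty='elasticnet',
                         alpha=l1_reg + l2_reg,
                         l1_ratio=l1_reg / (l1_reg + l2_reg),
                         fit_intercept=False,
                         max_iter=n_epochs)
    model.fit(R_train, r_j)            # stochastic gradient passes over the rows of R_train
    return model.coef_                 # w_j: one column of the item similarity matrix W
```
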
  18. Results on Mendeley data • Stacked readership counts, keyword counts • 5M docs/keywords, 1M users, 140M non-zeros • Constrained to ~100k target users • Python implementation on top of scikit-learn – Trivially parallelized with IPython – eats CPU but easy on AWS • 10% improvement over nearest neighbours • 5% CTR in email test
  19. Our use case • [Diagram: R´ ≈ R × W; columns j of R´ are the target users, columns of R are all users, and the rows of both stack readership and keyword data over items; W holds the learned weights for each target user j]
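
One rough reading of the picture above, as a sketch: readership counts and keyword counts are stacked as extra rows, so each column describes a user both by the documents they read and by the keywords on those documents. The matrices here are randomly generated placeholders, not Mendeley data.

```python
import numpy as np
from scipy.sparse import random as sparse_random, vstack

seed = 0  # fixed seeds just to make the sketch reproducible
doc_readership = sparse_random(1000, 200, density=0.01, format='csr', random_state=seed)      # documents x users
keyword_counts = sparse_random(300, 200, density=0.02, format='csr', random_state=seed + 1)   # keywords x users

# stacked data: each column is one user's combined readership + keyword profile
R_all = vstack([doc_readership, keyword_counts]).tocsc()

# per target user j, regress their column on the columns of all users
# (e.g. with fit_similarity_column_sgd above), giving sparse "who to follow" weights
target_user = 0
r_j = np.asarray(R_all[:, target_user].todense()).ravel()
```
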
  20. Tuning regularization constants • Relate directly to business logic – want sparsest similarity lists that are not too sparse – “too sparse” = # items with < k similar items • Grid search with a small sample of items • Empirically corresponds well to optimising recommendation accuracy on a validation set – but faster and easier
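
A sketch of that grid search in terms of the per-column fit sketched earlier; the grid values, the threshold and the helper names are assumptions, not numbers from the talk.

```python
import numpy as np

def frac_too_sparse(fit_column, R, sample_items, l1_reg, k=10):
    """Fraction of sampled items whose learned w_j has fewer than k non-zero weights."""
    nnz = [np.count_nonzero(fit_column(R, j, l1_reg=l1_reg)) for j in sample_items]
    return np.mean(np.array(nnz) < k)

def choose_l1(fit_column, R, sample_items,
              grid=(0.1, 0.03, 0.01, 0.003, 0.001), max_frac=0.05):
    """Pick the sparsest (strongest) L1 setting that still leaves most sampled items
    with at least k similar items."""
    for l1_reg in grid:                        # strongest regularization first
        if frac_too_sparse(fit_column, R, sample_items, l1_reg) <= max_frac:
            return l1_reg
    return grid[-1]                            # fall back to the weakest setting
```
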
  21. Software release: mrec • Wrote our own small framework • Includes parallelized SLIM, WRMF, WARP • Support for evaluation • BSD licence • https://github.com/mendeley/mrec • Please use it or contribute!
  22. Isn't there software for that? Rules of the game: – Model fit metrics (e.g. validation loss) don't count – Need a transparent “audit trail” of data to support genuine reproducibility – Just using public datasets doesn't ensure this
  23. Isn't there software for that? Wish list for reproducible evaluation: – Integrate with recommender implementations – Handle data formats and preprocessing – Handle splitting, cross-validation, side datasets – Save everything to file – Work from file inputs so not tied to one framework – Generate meaningful metrics – Well documented and easy to use
  24. Isn't there software for that? Current offerings: • GraphChi/GraphLab – Model validation loss, doesn't count • Mahout – Only rating prediction accuracy, doesn't count • LensKit – Too hard to understand, won't use
  25. Isn't there software for that? Current offerings: • MyMediaLite – Reports meaningful metrics – Handles cross-validation – Data splitting not transparent – No support for pre-processing – No built-in support for standalone evaluation – API is capable but current utils don't meet wishlist