Recommender Systems in R

Talk by Tamas Jambor, Data Scientist at Sky, given at the Data Science London (@ds_ldn) meetup on 12/02/2013

Data Science London

February 18, 2013

Transcript

  1. Short Bio

    • Data Scientist at BSkyB (Big Data team)
      – Descriptive/exploratory statistics
      – Clustering
      – Recommender systems
    • PhD in Recommender Systems and Machine Learning
      – Recommender System algorithm design
      – Optimisation
      – Dynamic systems
  2. Type of data for recommendation

    • Explicit preferences
      – User ratings
      – User reviews
    • Implicit preferences
      – Preference inferred from logs
      – Purchases
    • Additional (content based) data features
      – Type of the item (e.g. genre for movies)
    • Context aware features
      – Time of recommendation
      – Mood, weather, location
      – Business objectives (e.g. profit, risk)
  3. Inferring preferences

    • Example signals from set-top box logs

      Action                  Signal strength
      Set reminder            Very high
      Repeat watch            Very high
      Length of video watch   High
      Explicit rating         High
      Record product          High
      Purchase product        High
      Search for product      Low
      Skip product            Low
      Pause live TV           Low
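One way to turn logged actions like these into a single implicit preference score is to give each signal strength a numeric weight and sum the weights per user and item. The sketch below uses base R only. The numeric weights and the toy event log are assumptions made for illustration; the slide gives only the qualitative labels.

```r
# Hypothetical numeric weights for the qualitative strengths on the slide.
strength_weight <- c("Very high" = 3, "High" = 2, "Low" = 1)

# Action-to-strength mapping for a few of the actions listed on the slide.
action_strength <- c(
  "Set reminder"       = "Very high",
  "Repeat watch"       = "Very high",
  "Purchase product"   = "High",
  "Search for product" = "Low"
)

# Toy event log (invented for illustration).
events <- data.frame(
  user   = c("u1", "u1", "u1", "u2", "u2"),
  item   = c("i1", "i1", "i2", "i1", "i2"),
  action = c("Set reminder", "Repeat watch", "Search for product",
             "Purchase product", "Search for product"),
  stringsAsFactors = FALSE
)

# Look up the weight of each event, then sum the weights per user-item pair.
events$weight <- unname(strength_weight[action_strength[events$action]])
implicit <- aggregate(weight ~ user + item, data = events, FUN = sum)
print(implicit)
```

The resulting data frame has one row per user-item pair and could be reshaped into a user-by-item matrix before being passed to a recommender.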
  4. Data aggregations in R

    Useful packages
    • base package (aggregate, tapply)
    • sqldf package
      – SQL like queries
    • plyr package
      – Splits the data and combines it after processing
      – Wide variety of aggregate functions
      – Intuitive syntax
    • data.table package
      – Very fast (ideal for bigger data sets)
    • reshape package
      – Additional package for data manipulation (e.g. convert table from wide to long format)
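As a minimal sketch of the base-package route, the two functions named above can summarise a log as follows. The `views` data frame and its column names are invented for illustration.

```r
# Toy viewing log (invented for illustration).
views <- data.frame(
  user    = c("u1", "u1", "u2", "u2", "u2"),
  genre   = c("drama", "sport", "drama", "drama", "sport"),
  minutes = c(30, 45, 20, 60, 15),
  stringsAsFactors = FALSE
)

# aggregate(): total minutes per user and genre, returned as a data frame.
per_user_genre <- aggregate(minutes ~ user + genre, data = views, FUN = sum)

# tapply(): mean minutes per user, returned as a named vector.
mean_per_user <- tapply(views$minutes, views$user, mean)

print(per_user_genre)
print(mean_per_user)
```

`aggregate` keeps the result in data-frame form, which suits further joins, while `tapply` returns an array indexed by the grouping values.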
  5. Quick prototyping in R

    R> library(recommenderlab) # version 0.1-4
    R> m <- matrix(sample(c(as.numeric(1:5), NA), 50, replace=TRUE,
                          prob=c(0.1,0.05,0.1,0.15,0.1,0.5)),
                   ncol=10,
                   dimnames=list(user=paste("user", 1:5, sep=''),
                                 item=paste("item", 1:10, sep='')))
    R> m

          item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
    user1    NA    NA    NA    NA     2    NA    NA    NA    NA      4
    user2     3     4     4    NA    NA     2     5     4     5      4
    user3    NA    NA     2    NA    NA     1     3    NA     3      2
    user4    NA    NA    NA     4    NA     4    NA     4    NA     NA
    user5     1    NA    NA    NA    NA     4     2    NA    NA     NA
  6. Storing, pre-processing data

    • Sparse matrix
      R> r <- as(m, "realRatingMatrix")
    • Normalising data (remove user bias) (centre or Z-score)
      R> r_m <- normalize(r)
    • Binarising data
      R> r_b <- binarize(r, minRating=3)

         item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
    [1,]     0     0     0     0     0     0     0     0     0      1
    [2,]     0     1     1     0     0     0     1     1     1      1
    [3,]     0     0     0     0     0     0     0     0     0      0
    [4,]     0     0     0     1     0     1     0     1     0      0
    [5,]     0     0     0     0     0     1     0     0     0      0
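To make the normalisation step concrete, the base R sketch below shows what row-wise centring and Z-scoring amount to on a small dense matrix. It is an illustration of the idea, not recommenderlab's internal code, and the toy matrix is invented.

```r
# Toy ratings matrix: rows are users, columns are items, NA = not rated.
ratings <- matrix(c(5,  3, NA, 1,
                    2, NA,  4, 3),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("userA", "userB"), paste0("item", 1:4)))

# Centring: subtract each user's mean rating (ignoring NAs) to remove user bias.
user_mean <- rowMeans(ratings, na.rm = TRUE)
centred   <- ratings - user_mean   # the vector is recycled down the columns, so row i gets user_mean[i]

# Z-score: additionally divide by each user's standard deviation.
user_sd <- apply(ratings, 1, sd, na.rm = TRUE)
zscore  <- centred / user_sd

print(centred)
print(zscore)
```

After centring, a user who rates everything high and a user who rates everything low end up on the same scale, which is the bias removal the slide refers to.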
  7. Exploring data

    • MovieLens data (http://www.movielens.org)
      R> data(MovieLense)
      R> MovieLense
      943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.
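  8. Top-N recommendation

    – Precision: $\text{Precision} = \frac{|\{\text{relevant}\} \cap \{\text{recommended}\}|}{|\{\text{recommended}\}|} = \frac{TP}{TP + FP}$
    – Recall: $\text{Recall} = \frac{|\{\text{relevant}\} \cap \{\text{recommended}\}|}{|\{\text{relevant}\}|} = \frac{TP}{TP + FN}$
    – Issues
      • Definition of relevance
      • Missing items
      • Varying user profile size

                                  Predicted class (expectation)
                                  positive    negative
    Observed class     positive   TP          FN
    (observation)      negative   FP          TN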
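For a single user, precision and recall of a top-N list reduce to set operations. The sketch below uses invented item names; which held-out items count as relevant depends on the relevance definition, which is one of the issues listed above.

```r
# Top-N list produced by some recommender (invented for illustration).
recommended <- c("item2", "item3", "item7", "item9", "item10")

# Held-out items the user actually rated as good (invented for illustration).
relevant <- c("item2", "item7", "item8")

# True positives: recommended items that are also relevant.
hits <- length(intersect(recommended, relevant))

precision <- hits / length(recommended)  # TP / (TP + FP)
recall    <- hits / length(relevant)     # TP / (TP + FN)

cat("precision:", precision, "recall:", recall, "\n")
```

Because recall divides by the number of relevant items, users with larger profiles tend to get lower recall for the same N, which is the varying profile size issue.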
  9. Evaluation strategies

    • Split by user
      R> scheme <- evaluationScheme(MovieLense, method="split", train=0.8)
    • k-fold cross-validation
      R> scheme <- evaluationScheme(MovieLense, method="cross-validation", k=10)
    • Bootstrap sampling
      R> scheme <- evaluationScheme(MovieLense, method="bootstrap", k=10, train=0.8)
    • Additional parameters
      – goodRating – defines relevance threshold for top-N evaluators
      – given – number of (fixed length) items given to use for prediction
      – train – percentage of users used for training
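The three strategies differ only in how users are assigned to training and test sets. The base R sketch below illustrates the idea on ten hypothetical user ids; it is not recommenderlab's implementation, and details such as how bootstrap test users are chosen are assumptions.

```r
set.seed(42)
users   <- paste0("user", 1:10)          # hypothetical user ids
n_train <- round(0.8 * length(users))    # train = 0.8

# Split: one random partition of users into training and test sets.
split_train <- sample(users, n_train)
split_test  <- setdiff(users, split_train)

# k-fold cross-validation: every user lands in exactly one fold;
# each fold is used once as the test set.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = length(users)))
names(fold) <- users

# Bootstrap: training users are drawn with replacement;
# users never drawn form the test set (assumption for this sketch).
boot_train <- sample(users, n_train, replace = TRUE)
boot_test  <- setdiff(users, boot_train)

print(split_test)
print(table(fold))
print(boot_test)
```

Splitting by user rather than by rating keeps all of a test user's ratings out of training, so the `given` parameter can then decide how many of them are revealed at prediction time.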
  10. Evaluation on predicted ratings

    • UBCF - Recommender based on user-based collaborative filtering (explicit ratings).
    • IBCF - Recommender based on item-based collaborative filtering (explicit ratings).
    • SVD - Recommender based on SVD approximation (explicit ratings).
    • POPULAR - Recommender based on item popularity (explicit ratings).
    • RANDOM - Produce random recommendations (explicit ratings).
  11. Algorithms

    R> es <- evaluationScheme(MovieLense, method="cross-validation",
                              goodRating=4, k=4, given=10)
    R> algorithms <- list(
         RANDOM  = list(name = "RANDOM", param = NULL),
         POPULAR = list(name = "POPULAR", param = NULL),
         UBCF    = list(name = "UBCF",
                        param = list(normalize=NULL, method="Cosine", nn=50)),
         IBCF    = list(name = "IBCF", param = list(normalize=NULL)),
         SVD     = list(name = "SVD",
                        param = list(categories=30, normalize=NULL,
                                     treat_na = "median")))
    R> evlist <- evaluate(es, algorithms, n=c(1, 3, 5, 10, 15, 20))
  12. Implementing new algorithms

    Main functions to implement
    • Train function
      – Process the data and train the model
    • Predict function
      – Return top-N list or ratings for a given vector/matrix
    • Top-N function
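The three functions can be sketched in base R for a simple item-mean recommender. This is a standalone illustration and does not use recommenderlab's registration mechanism; the names `train_popular`, `predict_ratings` and `top_n` and the toy matrix are hypothetical.

```r
# Train: learn a model from a ratings matrix (users x items, NA = unrated).
train_popular <- function(ratings) {
  list(item_means = colMeans(ratings, na.rm = TRUE))
}

# Predict: give every unrated item its mean rating across users.
predict_ratings <- function(model, ratings) {
  pred <- matrix(model$item_means, nrow = nrow(ratings), ncol = ncol(ratings),
                 byrow = TRUE, dimnames = dimnames(ratings))
  pred[!is.na(ratings)] <- NA   # do not predict items the user already rated
  pred
}

# Top-N: names of the n highest predicted items for each user.
top_n <- function(pred, n) {
  out <- lapply(seq_len(nrow(pred)), function(i) {
    ranked <- sort(pred[i, ], decreasing = TRUE)   # sort() drops NAs
    names(ranked)[seq_len(min(n, length(ranked)))]
  })
  names(out) <- rownames(pred)
  out
}

# Toy data (invented for illustration).
ratings <- matrix(c( 5, NA,  3, NA,
                    NA,  4, NA,  2,
                     4, NA, NA, NA),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

model <- train_popular(ratings)
recs  <- top_n(predict_ratings(model, ratings), n = 2)
print(recs)
```

Keeping training, rating prediction and top-N selection as separate functions mirrors the split on the slide and lets the same trained model serve both rating prediction and ranking.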
  13. Why R?

    • Advantages
      – Great selection of packages to enhance recommendation
      – Most statistical tools are available
      – Quick prototyping
      – Compact code
      – Good visualisation tools
    • Disadvantages
      – Some base functions are slow (e.g. loops)
      – Steep learning curve
      – Scaling