Recommender Systems in R

Tamas Jambor (@jamborta)

Short Bio
• Data Scientist at BSkyB (Big Data team)
  – Descriptive/exploratory statistics
  – Clustering
  – Recommender systems
• PhD in Recommender Systems and Machine Learning
  – Recommender System algorithm design
  – Optimisation
  – Dynamic systems

Type of data for recommendation
• Explicit preferences
  – User ratings
  – User reviews
• Implicit preferences
  – Preference inferred from logs
  – Purchases
• Additional (content-based) data features
  – Type of the item (e.g. genre for movies)
• Context-aware features
  – Time of recommendation
  – Mood, weather, location
  – Business objectives (e.g. profit, risk)

Inferring preferences
• Example signals from set-top box logs

Action                   Signal strength
Set reminder             Very high
Repeat watch             Very high
Length of video watch    High
Explicit rating          High
Record product           High
Purchase product         High
Search for product       Low
Skip product             Low
Pause live TV            Low
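One way to turn such signals into an implicit preference score is to map each strength level to a number and sum over a user's logged actions per item. The sketch below uses base R only; the numeric weights (3, 2, 1) and the toy event log are assumptions made for illustration, not values from the talk.

```r
# Hypothetical numeric weights for the three strength levels (assumption).
strength_weight <- c("Very high" = 3, "High" = 2, "Low" = 1)

# Action-to-strength lookup, copied from the table of example signals.
signals <- data.frame(
  action   = c("Set reminder", "Repeat watch", "Length of video watch",
               "Explicit rating", "Record product", "Purchase product",
               "Search for product", "Skip product", "Pause live TV"),
  strength = c("Very high", "Very high", "High", "High", "High", "High",
               "Low", "Low", "Low"),
  stringsAsFactors = FALSE
)

# Toy event log (made-up data).
events <- data.frame(
  user   = c("user1", "user1", "user1", "user2", "user2"),
  item   = c("item1", "item1", "item2", "item1", "item2"),
  action = c("Set reminder", "Repeat watch", "Search for product",
             "Purchase product", "Pause live TV"),
  stringsAsFactors = FALSE
)

# Look up the strength of each logged action, then its numeric weight.
event_strength <- signals$strength[match(events$action, signals$action)]
events$weight  <- unname(strength_weight[event_strength])

# Implicit preference score: sum of weights per (user, item) pair.
scores <- aggregate(weight ~ user + item, data = events, FUN = sum)
print(scores)
```

A simple sum treats every action as positive evidence; a real system would probably treat signals such as skips differently.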

Data aggregations in R
Useful packages
• base package (aggregate, tapply)
• sqldf package
  – SQL-like queries
• plyr package
  – Splits the data and combines it after processing
  – Wide variety of aggregate functions
  – Intuitive syntax
• data.table package
  – Very fast (ideal for bigger data sets)
• reshape package
  – Additional package for data manipulation (e.g. convert table from wide to long format)
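As a small illustration of the base-package route, the sketch below aggregates a toy ratings table (made-up data) with `tapply` and `aggregate`. The other packages listed offer equivalent operations with their own syntax.

```r
# Toy ratings in long format (made-up data).
ratings <- data.frame(
  user   = c("u1", "u1", "u2", "u2", "u3"),
  item   = c("i1", "i2", "i1", "i3", "i1"),
  rating = c(4, 2, 5, 3, 3),
  stringsAsFactors = FALSE
)

# Mean rating per item: tapply returns a named vector.
mean_by_item <- tapply(ratings$rating, ratings$item, mean)

# Number of ratings per user: aggregate returns a data frame.
count_by_user <- aggregate(rating ~ user, data = ratings, FUN = length)

print(mean_by_item)
print(count_by_user)
```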

Quick prototyping in R
R> library(recommenderlab) # version 0.1-4
R> m <- matrix(sample(c(as.numeric(1:5), NA), 50, replace=TRUE,
                      prob=c(0.1, 0.05, 0.1, 0.15, 0.1, 0.5)),
               ncol=10,
               dimnames=list(user=paste("user", 1:5, sep=''),
                             item=paste("item", 1:10, sep='')))
R> m
      item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
user1    NA    NA    NA    NA     2    NA    NA    NA    NA      4
user2     3     4     4    NA    NA     2     5     4     5      4
user3    NA    NA     2    NA    NA     1     3    NA     3      2
user4    NA    NA    NA     4    NA     4    NA     4    NA     NA
user5     1    NA    NA    NA    NA     4     2    NA    NA     NA

Storing, pre-processing data
• Sparse matrix
R> r <- as(m, "realRatingMatrix")
• Normalising data to remove user bias (centring or Z-score)
R> r_m <- normalize(r)
• Binarising data
R> r_b <- binarize(r, minRating=3)
     item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
[1,]     0     0     0     0     0     0     0     0     0      1
[2,]     0     1     1     0     0     0     1     1     1      1
[3,]     0     0     0     0     0     0     0     0     0      0
[4,]     0     0     0     1     0     1     0     1     0      0
[5,]     0     0     0     0     0     1     0     0     0      0
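To make the normalisation step concrete, the sketch below centres and Z-scores a small dense matrix by hand in base R. It illustrates the idea of removing user bias and is not recommenderlab's own implementation; the toy matrix is made-up data.

```r
# Toy rating matrix (made-up data); NA marks an unrated item.
m_toy <- matrix(c( 5,  3, NA,  1,
                   4, NA,  4,  2,
                  NA,  2,  3,  1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

# Centring: subtract each user's mean rating, ignoring missing values.
user_means <- rowMeans(m_toy, na.rm = TRUE)
centred    <- sweep(m_toy, 1, user_means, "-")

# Z-score: additionally divide by each user's standard deviation.
user_sds <- apply(m_toy, 1, sd, na.rm = TRUE)
z_scored <- sweep(centred, 1, user_sds, "/")

print(round(centred, 2))
print(round(z_scored, 2))
```

After centring, every user's ratings average to zero, so a generous rater and a harsh rater become comparable.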

Storing, pre-processing data
R> image(r)

Exploring data
• MovieLens data (http://www.movielens.org)
R> data(MovieLense)
R> MovieLense
943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.

Exploring data
R> image(sample(MovieLense, 500))

Exploring data
R> library(ggplot2)
R> ggplot(data.frame(ratings=getRatings(MovieLense)), aes(ratings)) +
       geom_histogram(binwidth=1) + theme_bw()

Exploring data
R> ggplot(data.frame(ratings=getRatings(normalize(MovieLense))), aes(ratings)) +
       geom_density() + theme_bw()

Exploring data
R> ggplot(data.frame(ratings=rowCounts(MovieLense)), aes(ratings)) +
       geom_histogram(binwidth=5) + theme_bw()

Evaluation on predicted ratings
– Mean absolute error (MAE)
– Root mean squared error (RMSE)
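Both metrics compare predicted ratings with the ratings users actually gave. A minimal base R sketch, using made-up numbers:

```r
# Held-out ratings and the corresponding predictions (made-up data).
actual    <- c(4, 3, 5, 2, 1)
predicted <- c(3.5, 3, 4, 2.5, 2)

errors <- predicted - actual

# Mean absolute error: average size of the error.
mae <- mean(abs(errors))

# Root mean squared error: penalises large errors more heavily.
rmse <- sqrt(mean(errors^2))

print(c(MAE = mae, RMSE = rmse))
```

RMSE is never smaller than MAE on the same data; the gap grows when a few predictions are far off.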

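Top-N recommendation
– Precision
  $\text{Precision} = \frac{|\{\text{relevant items}\} \cap \{\text{recommended items}\}|}{|\{\text{recommended items}\}|}$
– Recall
  $\text{Recall} = \frac{|\{\text{relevant items}\} \cap \{\text{recommended items}\}|}{|\{\text{relevant items}\}|}$
– Issues
  • Definition of relevance
  • Missing items
  • Varying user profile size

                                   Observed class (observation)
Predicted class (expectation)      Relevant    Not relevant
  Recommended                      TP          FP
  Not recommended                  FN          TN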
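For a single user, precision and recall of a top-N list reduce to set operations. The sketch below uses base R and made-up item names; which items count as relevant is itself an assumption, as the issues listed above point out.

```r
# Top-5 list produced by some recommender (made-up data).
recommended <- c("item3", "item7", "item1", "item9", "item4")

# Items the user actually found relevant, e.g. rated at or above a threshold.
relevant <- c("item7", "item9", "item2", "item8")

hits <- intersect(recommended, relevant)

# Share of the recommended items that are relevant.
precision <- length(hits) / length(recommended)

# Share of the relevant items that were recommended.
recall <- length(hits) / length(relevant)

print(c(precision = precision, recall = recall))
```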

Evaluation strategies
• Split by user
R> scheme <- evaluationScheme(MovieLense, method="split", train=0.8)
• k-fold cross-validation
R> scheme <- evaluationScheme(MovieLense, method="cross-validation", k=10)
• Bootstrap sampling
R> scheme <- evaluationScheme(MovieLense, method="bootstrap", k=10, train=0.8)
• Additional parameters
  – goodRating – rating threshold that defines relevance for top-N evaluation
  – given – fixed number of items per test user given to the recommender for prediction
  – train – fraction of users used for training
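The first two strategies partition users rather than individual ratings. The base R sketch below shows the idea on ten made-up user IDs; it illustrates the splitting logic only and is not what `evaluationScheme` does internally.

```r
set.seed(42)  # fixed seed so the split is reproducible

users <- paste0("user", 1:10)  # made-up user IDs

# Split by user: 80% of users for training, the rest for testing.
train_idx   <- sample(length(users), size = floor(0.8 * length(users)))
train_users <- users[train_idx]
test_users  <- users[-train_idx]

# k-fold cross-validation: assign each user to one of k folds;
# each fold serves once as the test set.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = length(users)))

print(split(users, fold))
```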

Evaluation on predicted ratings
• UBCF: recommender based on user-based collaborative filtering (explicit ratings).
• IBCF: recommender based on item-based collaborative filtering (explicit ratings).
• SVD: recommender based on SVD approximation (explicit ratings).
• POPULAR: recommender based on item popularity (explicit ratings).
• RANDOM: produces random recommendations (explicit ratings).
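To show what user-based collaborative filtering does, here is a deliberately simplified base R sketch: cosine similarity over co-rated items, then a similarity-weighted mean of the neighbours' ratings. The helper names `cosine_sim` and `predict_ubcf` are hypothetical, the data is made up, and the sketch omits normalisation and the neighbourhood size (`nn`), so it should not be read as recommenderlab's UBCF.

```r
# Toy rating matrix (made-up data); NA marks an unrated item.
m_toy <- matrix(c(5,  4, NA,  1,
                  4,  5,  2, NA,
                  1, NA,  5,  4),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

# Cosine similarity restricted to the items both users have rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(0)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Predict a rating as the similarity-weighted mean of the ratings
# given to the item by all other users who rated it.
predict_ubcf <- function(m, user, item) {
  others <- setdiff(rownames(m)[!is.na(m[, item])], user)
  sims   <- sapply(others, function(u) cosine_sim(m[user, ], m[u, ]))
  sum(sims * m[others, item]) / sum(sims)
}

pred <- predict_ubcf(m_toy, "user1", "item3")
print(pred)
```

Because user1 is more similar to user2 than to user3 here, the prediction lands closer to user2's rating of item3.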

Algorithms
R> es <- evaluationScheme(MovieLense, method="cross-validation",
                          goodRating=4, k=4, given=10)
R> algorithms <- list(
       RANDOM  = list(name = "RANDOM", param = NULL),
       POPULAR = list(name = "POPULAR", param = NULL),
       UBCF    = list(name = "UBCF",
                      param = list(normalize=NULL, method="Cosine", nn=50)),
       IBCF    = list(name = "IBCF", param = list(normalize=NULL)),
       SVD     = list(name = "SVD",
                      param = list(categories=30, normalize=NULL,
                                   treat_na = "median")))
R> evlist <- evaluate(es, algorithms, n=c(1, 3, 5, 10, 15, 20))

Comparing results (TPR/FPR)
R> plot(evlist, annotate=1:5, legend="topleft")
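Each point on such a TPR/FPR curve corresponds to one list length N. The base R sketch below computes those points for one made-up ranking, assuming a catalogue of ten items of which four are relevant.

```r
# Recommender's ranking of the whole catalogue, best first (made-up data).
ranked   <- paste0("item", 1:10)
relevant <- c("item1", "item3", "item4", "item8")
n_values <- c(1, 3, 5, 10)

roc <- t(sapply(n_values, function(n) {
  top_n <- head(ranked, n)
  tp <- sum(top_n %in% relevant)  # relevant items in the top-N list
  fp <- n - tp                    # non-relevant items in the top-N list
  c(n   = n,
    TPR = tp / length(relevant),
    FPR = fp / (length(ranked) - length(relevant)))
}))

print(roc)
```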

Comparing results (Precision/Recall)
R> plot(evlist, "prec", annotate=1:5, legend="bottomright")

Implementing new algorithms
Main functions to implement
• Train function
  – Process the data and train the model
• Predict function
  – Return top-N list or ratings for a given vector/matrix
• Top-N function
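The three pieces can be sketched in plain R for a simple popularity-style recommender. The function names `train_popular`, `predict_ratings` and `top_n` are hypothetical and the data is made up; the sketch shows the division of labour only and does not use recommenderlab's interface for registering new algorithms.

```r
# Toy training data (made-up); NA marks an unrated item.
m_train <- matrix(c( 5, NA,  3,  4,
                     4,  2, NA,  5,
                    NA,  1,  4,  5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("user", 1:3), paste0("item", 1:4)))

# Train: learn the model, here simply the mean rating of each item.
train_popular <- function(ratings) {
  list(item_means = colMeans(ratings, na.rm = TRUE))
}

# Predict: return predicted ratings for the items the user has not rated.
predict_ratings <- function(model, user_ratings) {
  pred <- model$item_means
  pred[!is.na(user_ratings)] <- NA  # do not re-recommend rated items
  pred
}

# Top-N: turn predicted ratings into a ranked list of item names.
top_n <- function(predicted, n) {
  ranked <- sort(predicted, decreasing = TRUE)  # sort() drops NA by default
  names(head(ranked, n))
}

model    <- train_popular(m_train)
new_user <- c(item1 = NA, item2 = 3, item3 = NA, item4 = NA)
top2     <- top_n(predict_ratings(model, new_user), 2)
print(top2)
```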

Why R?
• Advantages
  – Great selection of packages to enhance recommendation
  – Most statistical tools are available
  – Quick prototyping
  – Compact code
  – Good visualisation tools
• Disadvantages
  – Some base functions are slow (e.g. loops)
  – Steep learning curve
  – Scaling

Data Science London
