
How to Design and Build a Recommendation Pipeline in Python

Jill Cates
November 10, 2018


Personalized recommendation systems play an integral role in e-commerce platforms, with the goal of driving user engagement. While there is extensive literature on the theory behind recommendation systems, there is limited material that describes the underlying infrastructure of a recommendation system pipeline.

This talk walks through the steps involved in building a recommendation pipeline, from data cleaning and hyperparameter tuning to model training and evaluation.


Transcript

  1. How to Design and Build a Recommendation Pipeline in Python

    Jill Cates November 10th, 2018 PyCon Canada
  2. Overview of the Recommender Pipeline 1. Pre-processing 2. Hyperparameter Tuning

    3. Model Training and Prediction 4. Post-processing 5. Evaluation
  3. Recommender Systems in the Wild Spotify Discover Weekly Amazon Customers

    who bought this item also bought Netflix Because you watched this show… OkCupid Finding your best match LinkedIn Jobs recommended for you New York Times Recommended Articles for You Medicine Facilitating clinical decision making GitHub Repos “based on your interest”
  4. Before e-commerce, things were sold exclusively in brick-and-mortar stores:
     limited inventory, mainstream products. With e-commerce: unlimited inventory,
     niche products.
  6. Recommender Systems in the Wild: The Tasting Booth Experiment. 6 jam samples
     vs. 24 jam samples. "30% of the consumers in the limited-choice condition
     subsequently purchased a jar of jam; in contrast, only 3% of the consumers in
     the extensive-choice condition did so"
  7. Recommender Crash Course: a recommender system turns user preferences into
     recommendations by predicting future behaviour. Preferences come as explicit
     or implicit feedback. Two main approaches: collaborative filtering (similar
     users like similar things) and content-based filtering (considers item/user
     features).
  8. Recommender Crash Course: collaborative filtering. Similar users like
     similar things; future behaviour is predicted from user preferences
     (explicit or implicit feedback).
     • "Because you watched Movie X"
     • "Customers who bought this item also bought"
  9. Recommender Crash Course: content-based filtering. Uses user and item
     features to predict future behaviour.
     • user features: age, gender, country, spoken language
     • item features: movie genre, year of release, cast
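The "similar users like similar things" idea behind collaborative filtering can be made concrete with a tiny sketch (the rating vectors below are invented for illustration; the user names come from the slides):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# each user's ratings over the same five movies (toy data)
john = np.array([5.0, 4.0, 1.0, 0.0, 2.0])
jim  = np.array([4.5, 5.0, 2.0, 0.0, 1.0])
anne = np.array([1.0, 0.0, 5.0, 4.5, 4.0])

# John's tastes look much more like Jim's than Anne's,
# so Jim's highly rated movies are good candidates for John
print(cosine(john, jim), cosine(john, anne))
```

A real system would compute these similarities over the full (sparse) user-item matrix rather than dense toy vectors.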
  10. Overview of the Recommender Pipeline 1. Pre-processing 2. Hyperparameter Tuning

    3. Model Training and Prediction 4. Post-processing 5. Evaluation
  11. Step 1: Data Pre-processing. Raw ratings data:

      user_id  movie_id  rating
      2        439       4.0
      10       368       4.5
      14       114       5.0
      19       371       1.0
      2        371       3.0
      19       114       4.5
      3        439       3.5
      54       421       2.0
      32       114       3.0
      10       369       1.0
  12. Step 1: Data Pre-processing. Transform the original ratings data into a
      user-item (utility) matrix: users as rows, items as columns, ratings as
      the values.
  13. Step 1: Data Pre-processing. Store the user-item matrix as a sparse
      matrix (scipy.sparse.csr_matrix), since most entries are missing.
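A minimal sketch of this transformation, using the rating triples from the slide (the index-mapping helper names are illustrative, not from the talk):

```python
from scipy.sparse import csr_matrix

# rating triples from the slide: (user_id, movie_id, rating)
ratings = [
    (2, 439, 4.0), (10, 368, 4.5), (14, 114, 5.0), (19, 371, 1.0),
    (2, 371, 3.0), (19, 114, 4.5), (3, 439, 3.5), (54, 421, 2.0),
    (32, 114, 3.0), (10, 369, 1.0),
]

# map raw IDs to contiguous row/column indices
users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
u_index = {u: n for n, u in enumerate(users)}
i_index = {i: n for n, i in enumerate(items)}

rows = [u_index[u] for u, _, _ in ratings]
cols = [i_index[i] for _, i, _ in ratings]
vals = [r for _, _, r in ratings]

# sparse user-item (utility) matrix
X = csr_matrix((vals, (rows, cols)), shape=(len(users), len(items)))
```

Keeping the ID-to-index mappings around matters: they are needed later to translate recommended matrix columns back into movie IDs.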
  14. Step 1: Data Pre-processing. Calculate matrix sparsity:
      sparsity = # ratings / total # of elements
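Following the slide's definition (the fraction of cells that actually hold a rating), this is one line on a sparse matrix; the toy matrix below is illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy user-item matrix; zeros stand for missing ratings
X = csr_matrix(np.array([
    [1.5, 0.0, 2.0, 0.0],
    [0.0, 3.5, 0.0, 4.5],
    [2.0, 0.0, 3.0, 0.0],
]))

# slide's definition: # ratings / total # of elements
sparsity = X.nnz / (X.shape[0] * X.shape[1])
print(f"{sparsity:.1%}")
```

On real datasets this number is tiny (the MovieLens matrices are well over 90% empty), which is why a sparse representation is worthwhile.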
  15. Step 1: Data Pre-processing. Normalization:
      • Optimists → rate everything 4 or 5
      • Pessimists → rate everything 1 or 2
      • Need to normalize ratings by accounting for user and item bias
      • Mean-normalization baseline: b_ui = μ + b_i + b_u, where b_ui is the
        user-item rating baseline, μ is the global average rating, b_i is the
        item's average deviation from μ, and b_u is the user's average
        deviation from μ
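The baseline above can be computed directly from a ratings matrix. A minimal sketch with invented toy data (NaN marks a missing rating; variable names are mine, not the talk's):

```python
import numpy as np

# toy ratings: rows = users, cols = items; np.nan marks a missing rating
R = np.array([
    [4.0, np.nan, 5.0],
    [1.0, 2.0, np.nan],
    [np.nan, 4.5, 4.0],
])

mu = np.nanmean(R)                     # global average rating
b_u = np.nanmean(R, axis=1) - mu       # user bias (optimist vs. pessimist)
b_i = np.nanmean(R, axis=0) - mu       # item bias (widely loved vs. panned)

# baseline prediction for every (user, item) pair: b_ui = mu + b_u + b_i
B = mu + b_u[:, None] + b_i[None, :]
```

Subtracting this baseline from the observed ratings before model fitting removes most of the user and item bias; the model then only has to explain the residuals.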
  16. Pick a Model: Matrix Factorization
      • factorize the user-item matrix to get 2 latent factor matrices:
        a user-factor matrix and an item-factor matrix
      • missing ratings are predicted from the inner product of these two
        factor matrices: X_mn ≈ P_mk × Q^T_nk = X̂
  17. Pick a Model: Matrix Factorization, X_mn ≈ P_mk × Q^T_nk = X̂.
      Algorithms that perform matrix factorization:
      - Alternating Least Squares (ALS)
      - Stochastic Gradient Descent (SGD)
      - Singular Value Decomposition (SVD)
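To make the ALS idea concrete, here is a minimal pure-NumPy sketch: holding one factor matrix fixed turns the problem into a ridge regression for the other, and the two are solved alternately. All data and parameter choices below are invented for illustration; the talk itself uses a library implementation rather than hand-rolled code:

```python
import numpy as np

rng = np.random.default_rng(0)

# observed ratings (0 = missing) for a tiny user-item matrix
X = np.array([
    [4.0, 0.0, 5.0, 1.0],
    [5.0, 4.0, 0.0, 1.0],
    [0.0, 1.0, 2.0, 5.0],
    [1.0, 0.0, 1.0, 4.0],
])
mask = X > 0
k, lam = 2, 0.1                                    # latent factors, regularization
P = rng.normal(scale=0.1, size=(X.shape[0], k))    # user-factor matrix
Q = rng.normal(scale=0.1, size=(X.shape[1], k))    # item-factor matrix

for _ in range(20):
    # fix Q, solve a small ridge regression for each user's factors
    for u in range(X.shape[0]):
        Ju = mask[u]
        A = Q[Ju].T @ Q[Ju] + lam * np.eye(k)
        P[u] = np.linalg.solve(A, Q[Ju].T @ X[u, Ju])
    # fix P, solve for each item's factors
    for i in range(X.shape[1]):
        Ji = mask[:, i]
        A = P[Ji].T @ P[Ji] + lam * np.eye(k)
        Q[i] = np.linalg.solve(A, P[Ji].T @ X[Ji, i])

X_hat = P @ Q.T   # predicted ratings, including the previously missing cells
```

Each inner solve only touches the ratings a single user (or item) actually has, which is what makes ALS scale to very sparse matrices and parallelize well.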
  19. Pick an Evaluation Metric: Precision@K. Of the top K recommendations,
      what proportion are relevant to the user?
  20. Pick an Evaluation Metric: Precision@10. Of the top 10 recommendations,
      what proportion are relevant to the user?
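Precision@K is only a few lines; a minimal sketch (the function name and toy data are mine, not from the talk):

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items the user actually found relevant."""
    top_k = recommended[:k]
    hits = sum(item in relevant for item in top_k)
    return hits / k

# the user liked 3 of the 5 items we recommended
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, k=5))  # 0.6
```

In practice this is averaged over all users, with `relevant` taken from a held-out test set of each user's interactions.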
  21. Step 2: Hyperparameter Tuning. What is a hyperparameter? A model
      hyperparameter is configuration that is external to the model and is set
      before training.
  22. Step 2: Hyperparameter Tuning. Alternating Least Squares' hyperparameters:
      • k (# of factors)
      • λ (regularization parameter)
      Goal: find the hyperparameters that give the best precision@10* (*or any
      other evaluation metric that you want to optimize)
  23. Step 2: Hyperparameter Tuning. Search the (# factors, regularization λ)
      space with Grid Search (sklearn.model_selection.GridSearchCV) or Random
      Search (sklearn.model_selection.RandomizedSearchCV). (source: blog.kaggle.com)
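Since an ALS recommender is not a scikit-learn estimator, a grid search over (k, λ) is often written as a plain loop. A runnable sketch, where `evaluate` is a stand-in for "train ALS with these hyperparameters and return precision@10 on a validation set" (its body below is a dummy scorer so the example runs end to end):

```python
from itertools import product

def evaluate(n_factors, regularization):
    # STAND-IN: a real pipeline would train the model here and score
    # precision@10 on held-out data. This dummy peaks at (20, 0.01).
    return 1.0 / (1 + abs(n_factors - 20) + abs(regularization - 0.01) * 100)

grid = {
    "n_factors": [10, 20, 40, 80],
    "regularization": [0.001, 0.01, 0.1],
}

best_score, best_params = -1.0, None
for k, lam in product(grid["n_factors"], grid["regularization"]):
    score = evaluate(k, lam)
    if score > best_score:
        best_score, best_params = score, (k, lam)

print(best_params)  # (20, 0.01) with this stand-in scorer
```

Random search swaps the exhaustive `product` loop for a fixed number of draws from distributions over each hyperparameter, which usually finds good settings with far fewer trials.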
  24. Step 2: Hyperparameter Tuning. Sequential Model-Based Optimization tools:
      • scikit-optimize (skopt)
      • hyperopt
      • Metric Optimization Engine (MOE)
  25. Step 3: Model Training. Fit the model, e.g.
      AlternatingLeastSquares(k=8, regularization=0.001), to the sparse
      user-item matrix; the trained model produces a dense matrix of predicted
      ratings, filling in the previously missing entries.
  26. Step 4: Post-processing
      • Sort predicted ratings and get top N
      • Filter out items that a user has already purchased, watched, or
        interacted with
      • Item-item recommendations: use a similarity metric (e.g., cosine
        similarity) for "Because you watched Movie X"
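The first two post-processing steps amount to an argsort plus a filter. A minimal sketch over one user's row of predicted scores (toy numbers, illustrative variable names):

```python
import numpy as np

# predicted ratings for one user across six items (toy data)
scores = np.array([4.74, 3.12, 1.22, 4.39, 2.75, 3.99])
seen = {0, 3}    # item indices the user has already interacted with
N = 3

# rank all items by predicted score, then drop the already-seen ones
ranked = np.argsort(scores)[::-1]
top_n = [i for i in ranked if i not in seen][:N]
print(top_n)  # [5, 1, 4]
```

Filtering before truncating to N matters: dropping seen items after taking the top N can leave a user with fewer than N recommendations.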
  27. Step 5: Evaluation. How do we evaluate recommendations? Traditional ML
      metrics vs. recommendation-system metrics.
  28. Step 5: Evaluation. Traditional ML metrics:
      RMSE = √( Σ_{i=1..N} (y_i − ŷ_i)² / N )
      precision = TP / (TP + FP)
      recall = TP / (TP + FN)
      F1 = 2 · precision · recall / (precision + recall)
  29. Step 5: Evaluation. Ranking metrics, defined via the confusion matrix
      (Reality vs. Predicted, liked vs. did not like: true positive, false
      positive, false negative, true negative):
      • Precision@K: of the top k recommendations, what proportion are actually
        "relevant"? precision = TP / (TP + FP)
      • Recall@K: what proportion of the relevant items were found in the top k
        recommendations? recall = TP / (TP + FN)
  31. Python Tools
      • import surprise (@NicolasHug)
      • import implicit (@benfred)
      • import lightfm (@lyst)
      • import pyspark.mllib.recommendation