Slide 1

Big, Practical Recommendations

Slide 2

WHERE’S BIG LEARNING?
- Next: Application Layer
  - Analytics
  - Machine Learning
- Like Apache Mahout
  - Common Big Data app today
  - Clustering, recommenders, …

Slide 3

A RECOMMENDER SHOULD …
- Answer in Real-time
  - Ingest new data, now
  - Modify recommendations based on newest data
  - No “cold start” for new data
- Scale Horizontally
  - For queries per second
  - For size of data set
- Accept Diverse Input
  - Not just people and products
  - Not just explicit ratings
  - Clicks, views, buys
  - Side information
- Be “Pretty Accurate”

Slide 4

NEED: 2-TIER ARCHITECTURE
- Real-time Serving Layer
  - Quick results based on …

Slide 5

A PRACTICAL ALGORITHM: MATRIX FACTORIZATION BENEFITS
- Factor user-item matrix into user-feature + feature-item matrices
- Well understood in ML, as:
  - Principal Component Analysis
  - Latent Semantic Indexing
- Several algorithms, like:
  - Singular Value Decomposition
  - Alternating Least Squares
- Models intuition
- Factorization is batch parallelizable
- Reconstruction (recs) in …

Slide 6

A PRACTICAL IMPLEMENTATION: ALTERNATING LEAST SQUARES BENEFITS
- Simple factorization P ≈ X Yᵀ
- Approximate: X, Y are “skinny” (low-rank)
- Faster than the SVD
- Trivially parallel, iterative
- Dumber than the SVD
- No singular values, …

Slide 7

ALS ALGORITHM 1
- Input: (user, item, strength) tuples
- Anything you can quantify is input
- Strength is positive
- Many tuples per user-item pair
- R is a sparse user-item matrix
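The input stage above can be sketched in a few lines of Python. This is a hypothetical minimal representation, not Myrrix's actual API: repeated (user, item, strength) tuples are summed into a sparse R held as nested dicts, and the sample events are illustrative.

```python
from collections import defaultdict

def build_r(tuples):
    """Aggregate raw (user, item, strength) tuples into a sparse R,
    stored as {user: {item: total_strength}}."""
    r = defaultdict(lambda: defaultdict(float))
    for user, item, strength in tuples:
        # Many tuples may exist per user-item pair; sum their strengths.
        r[user][item] += strength
    return r

events = [
    ("alice", "item1", 1.0),   # e.g. a click
    ("alice", "item1", 5.0),   # e.g. a purchase
    ("bob",   "item2", 1.0),
]
R = build_r(events)
print(R["alice"]["item1"])  # 6.0
```

Only observed pairs are stored, which is what makes R sparse: unobserved user-item pairs simply have no entry.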

Slide 8

ALS ALGORITHM 2
- Follows “Collaborative Filtering for Implicit Feedback Datasets” (Hu, Koren & Volinsky, 2008)

Slide 9

ALS ALGORITHM 3
- P is m × n
- Choose k ≪ m, n
- Factor P as Q = X Yᵀ, with Q ≈ P
- X is m × k; Yᵀ is k × n
- Find the best approximation Q
- Minimize the L2 norm of the difference: ‖P − Q‖₂
- Minimal squared error: …

Slide 10

ALS ALGORITHM 4
- Optimizing X and Y simultaneously is non-convex: hard
- With X or Y fixed, it is a system of linear equations: convex, easy
- Initialize Y with random values
- Solve for X
- Fix X, solve for Y
- Repeat (“Alternating”)
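The alternating scheme above can be sketched with NumPy. This is a minimal unweighted, ridge-regularized variant on a dense toy matrix, not yet the implicit-feedback version the next slides describe; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam = 5, 6, 3, 0.1   # lam is the ridge term, an assumed value

P = rng.random((m, n))        # dense toy matrix; a real R would be sparse
Y = rng.random((n, k))        # initialize Y with random values
X = np.zeros((m, k))

for _ in range(10):
    # Fix Y, solve the ridge-regularized least squares for X:
    #   X = P Y (YᵀY + λI)⁻¹
    X = P @ Y @ np.linalg.inv(Y.T @ Y + lam * np.eye(k))
    # Fix X, solve for Y symmetrically ("alternating")
    Y = P.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(k))

err = np.linalg.norm(P - X @ Y.T)
```

Each half-step solves a convex linear-least-squares problem exactly, which is why the loop is simple and trivially parallel across rows.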

Slide 11

ALS ALGORITHM 5
- Define confidence weights c_ui = 1 + α r_ui
- Minimize: Σ_{u,i} c_ui (p_ui − x_uᵀ y_i)² + λ (Σ_u ‖x_u‖² + Σ_i ‖y_i‖²)

Slide 12

ALS ALGORITHM 6
- With Y fixed, compute the optimal X
- Each row x_u is independent
- Define C_u as the diagonal matrix of c_u (user strength weights)
- x_u = (Yᵀ C_u Y + λI)⁻¹ Yᵀ C_u p_u
- Compare to the simple least-squares regression solution (Yᵀ Y)⁻¹ Yᵀ p_u
  - Adds the Tikhonov / ridge regression regularization term λI
  - Attaches the c_u weights to Yᵀ
- See the paper for how Yᵀ C_u Y is computed efficiently; …
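The per-user solve above translates directly to NumPy. A minimal sketch, assuming the slide's notation (Y is n × k, p_u and c_u of length n); the function name is illustrative.

```python
import numpy as np

def user_factor(Y, p_u, c_u, lam):
    """Solve x_u = (Yᵀ C_u Y + λI)⁻¹ Yᵀ C_u p_u for one user row.

    Y:   n x k item factors
    p_u: length-n preference vector
    c_u: length-n strength weights (e.g. c_u = 1 + α r_u)
    lam: regularization λ
    """
    k = Y.shape[1]
    # Never materialize the n x n diagonal C_u: multiplying Yᵀ by c_u
    # column-wise equals Yᵀ C_u. (The cited paper goes further, using
    # Yᵀ C_u Y = YᵀY + Yᵀ(C_u − I)Y so YᵀY is precomputed once and only
    # the user's observed items contribute; here it is computed directly.)
    YtCu = Y.T * c_u
    A = YtCu @ Y + lam * np.eye(k)
    b = YtCu @ p_u
    return np.linalg.solve(A, b)
```

Because each x_u depends only on Y and that user's own data, the rows can be solved independently, which is exactly what makes the factorization batch-parallelizable.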

Slide 13

EXAMPLE FACTORIZATION
- k = 3, λ = 2, α = 40, 10 iterations
- Input P:
    1 1 1 0 0 0
    0 1 0 0 0 1
    0 1 1 1 0 1
    0 1 0 0 0 1
    0 1 1 0 0 0
- Reconstruction Q = X·Yᵀ ≈
    0.96  0.99  0.99  0.38  0.93  0.44
    0.39  0.98 -0.11  0.39  0.70  0.99
    0.42  0.98  0.98  1.00  1.04  0.99
    0.44  0.98  0.11  0.51 -0.13  1.00
    0.57  0.97  1.00  0.68  0.47  0.91
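An example like this can be re-run with a short script. The 5 × 6 reading of the flattened matrix and all variable names are assumptions, and with random initialization the exact numbers will differ, so no exact output is claimed; the qualitative pattern (values near 1 at observed entries) should hold.

```python
import numpy as np

rng = np.random.default_rng(42)

R = np.array([                 # assumed 5 users x 6 items, binary strengths
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
], dtype=float)
k, lam, alpha = 3, 2.0, 40.0   # hyperparameters from the slide

P = (R > 0).astype(float)      # binary preference matrix
C = 1.0 + alpha * R            # confidence weights c_ui = 1 + alpha * r_ui

X = rng.random((R.shape[0], k))
Y = rng.random((R.shape[1], k))

def solve_rows(F, Pm, Cm):
    """For each row u of Pm: (Fᵀ C_u F + λI)⁻¹ Fᵀ C_u p_u."""
    out = np.empty((Pm.shape[0], k))
    for u in range(Pm.shape[0]):
        FtCu = F.T * Cm[u]
        out[u] = np.linalg.solve(FtCu @ F + lam * np.eye(k), FtCu @ Pm[u])
    return out

for _ in range(10):            # 10 alternating iterations, as on the slide
    X = solve_rows(Y, P, C)
    Y = solve_rows(X, P.T, C.T)

Q = X @ Y.T
print(np.round(Q, 2))
```

Observed entries should reconstruct close to 1, while unobserved entries spread out; the larger of those values are the recommendations.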

Slide 14

FOLD-IN
- Need immediate, if approximate, updates for new data
- New user u needs a new row …
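Fold-in for a new user can be sketched as the same weighted per-user solve applied once against the fixed item factors Y, without re-running the batch factorization. A minimal sketch; the function name and default hyperparameters are illustrative, not from the deck.

```python
import numpy as np

def fold_in_user(Y, r_u, lam=2.0, alpha=40.0):
    """Approximate a new user's factor row x_u from fixed item factors Y.

    Same solve as in training, applied once:
        x_u = (Yᵀ C_u Y + λI)⁻¹ Yᵀ C_u p_u
    Y: n x k item factors; r_u: the new user's length-n strengths.
    """
    k = Y.shape[1]
    p_u = (r_u > 0).astype(float)   # binary preferences
    c_u = 1.0 + alpha * r_u         # confidence weights
    YtCu = Y.T * c_u
    x_u = np.linalg.solve(YtCu @ Y + lam * np.eye(k), YtCu @ p_u)
    # The new user's scores are the reconstructed row q_u = x_u Yᵀ
    return x_u, x_u @ Y.T
```

This gives an immediate, approximate answer for brand-new data; the next batch factorization then folds the user in exactly, which is the point of the two-tier split between serving and computation.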

Slide 15

THIS IS MYRRIX
- Soft-launched
- Serving Layer available …

Slide 16

APPENDIX

Slide 17

EXAMPLES

STACKOVERFLOW TAGS
- Recommend tags to questions
- Tag questions automatically, improve tag coverage
- 3.5M questions × 30K tags
- 4.3 hours × 5 machines on Amazon EMR
- $3.03 ≈ $0.08 per 100,000 recs

WIKIPEDIA LINKS
- Recommend new linked articles from existing links
- Propose missing, related links
- 2.5M articles × 1.8M articles
- 28 hours × 2 PCs on …