Demo - Query-Based Simple and Scalable Recommender Systems with Apache Hivemall

Takuya Kitazawa

October 04, 2018

Transcript

  1. Query-Based Simple and Scalable Recommendation with Apache Hivemall

    Easy-to-use
    ‣ ML in SQL
    ‣ No expertise
    ‣ Sharable

        SELECT
          train_classifier( -- train_regressor(
            features,
            label,
            '-loss logloss -optimizer AdaGrad -reg L1'
          ) as (feature, weight)
        FROM
          training

    ‣ Loss function
    ‣ Optimizer
    ‣ Regularization
    ‣ Learning rate
    ‣ Mini-batch

    Scalable
    ‣ Runs in parallel
    ‣ Hadoop ecosystem

    Versatile
    ‣ Regression
    ‣ Classification
    ‣ Feature engineering
    ‣ Evaluation
    ‣ Topic modeling
    ‣ Anomaly detection
    ‣ NLP
    ‣ Generic array/map operations

    Multi-platform
    ‣ Flexible selection of each layer:
      - HiveQL
      - Pig
      - Spark
      - DataFrame
      - HiveContext
  2. Item-Based Collaborative Filtering in Query Language

    Count # of transactions for each user-item pair:

        userid  itemid  purchased_at
        1       31231   2015-04-09 00:29:02
        1       13212   2016-05-24 16:29:02
        2       312     2016-06-03 23:29:02
        3       2312    2016-06-04 19:29:02
        …       …       …

        CREATE TABLE user_purchased as
        SELECT
          userid, itemid, count(1) as purchase_count
        FROM
          history
        GROUP BY
          userid, itemid

    Compute item-item co-count:

        CREATE TABLE cooccurrence as
        SELECT
          u1.itemid, u2.itemid as other, count(1) as cnt
        FROM
          user_purchased u1
          JOIN user_purchased u2 ON (u1.userid = u2.userid)
        WHERE
          u1.itemid != u2.itemid
        GROUP BY
          u1.itemid, u2.itemid

        itemid  other   cnt
        583266  621056  231
        583266  583266  923
        31231   13212   129
        31231   31231   542
        …       …       …

    What's next?
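The two queries on this slide use only standard SQL, so they can be tried outside Hive. The following is a minimal sketch, assuming an in-memory SQLite database in place of Hive and a small made-up `history` table; the rows are illustrative and do not come from the talk.

```python
import sqlite3

# Hypothetical purchase log; the rows are illustrative, not data from the talk.
history = [
    (1, 31231, "2015-04-09 00:29:02"),
    (1, 13212, "2016-05-24 16:29:02"),
    (2, 31231, "2016-06-03 23:29:02"),
    (2, 13212, "2016-06-04 19:29:02"),
    (2, 312, "2016-06-05 08:00:00"),
    (2, 312, "2016-06-06 09:00:00"),  # repeat purchase by the same user
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (userid INTEGER, itemid INTEGER, purchased_at TEXT)"
)
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", history)

# Step 1: count transactions for each user-item pair.
conn.execute("""
    CREATE TABLE user_purchased AS
    SELECT userid, itemid, count(1) AS purchase_count
    FROM history
    GROUP BY userid, itemid
""")

# Step 2: self-join on userid to count the users who bought both items.
conn.execute("""
    CREATE TABLE cooccurrence AS
    SELECT u1.itemid, u2.itemid AS other, count(1) AS cnt
    FROM user_purchased u1
    JOIN user_purchased u2 ON (u1.userid = u2.userid)
    WHERE u1.itemid != u2.itemid
    GROUP BY u1.itemid, u2.itemid
""")

cooccurrence = conn.execute(
    "SELECT itemid, other, cnt FROM cooccurrence ORDER BY itemid, other"
).fetchall()
for row in cooccurrence:
    print(row)
```

Because `user_purchased` holds one row per user-item pair, the repeat purchase of item 312 by user 2 contributes only once to each co-count; `cnt` is therefore the number of distinct users who bought both items.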
  3. Matrix Factorization in Query Language

        CREATE TABLE sgd_model as
        SELECT
          idx,
          array_avg(u_rank) as Pu,
          array_avg(i_rank) as Qi,
          avg(u_bias) as Bu,
          avg(i_bias) as Bi
        FROM (
          SELECT
            train_mf_sgd(
              user_id, item_id, rating,
              '-factor ${factor} -mu ${mu} -iter ${iters}'
            ) as (idx, u_rank, i_rank, u_bias, i_bias)
          FROM
            training
        ) t
        GROUP BY idx

        SELECT
          mf_predict(t2.Pu, m2.Qi, t2.Bu, m2.Bi, ${mu}) as predicted
        FROM (
          SELECT
            t1.user_id, t1.item_id, m1.Pu, m1.Bu
          FROM
            target t1
            LEFT OUTER JOIN sgd_model m1 ON (t1.user_id = m1.idx)
        ) t2
        LEFT OUTER JOIN sgd_model m2 ON (t2.item_id = m2.idx)
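To make the prediction step concrete, here is a minimal sketch of the usual biased matrix factorization rule, which adds the global mean `mu`, the user bias `Bu`, the item bias `Bi`, and the dot product of the factor vectors `Pu` and `Qi`. It assumes that `mf_predict` follows this standard rule. The function name `predict_rating` is hypothetical, and treating a missing row from the LEFT OUTER JOIN as a zero contribution is an assumption of this sketch rather than something the slide states.

```python
from typing import Optional, Sequence


def predict_rating(
    pu: Optional[Sequence[float]],
    qi: Optional[Sequence[float]],
    bu: Optional[float],
    bi: Optional[float],
    mu: float,
) -> float:
    """Biased matrix factorization prediction: mu + bu + bi + dot(pu, qi).

    None stands for a row the LEFT OUTER JOIN could not match (an unseen
    user or item). Treating it as a zero contribution is an assumption of
    this sketch.
    """
    score = mu
    if bu is not None:
        score += bu
    if bi is not None:
        score += bi
    if pu is not None and qi is not None:
        score += sum(p * q for p, q in zip(pu, qi))
    return score


# A user and item that both appear in the model.
known = predict_rating([0.5, -1.0], [2.0, 0.5], 0.1, -0.2, 3.5)

# A user missing from the model: only the mean and the item bias remain.
cold_start_user = predict_rating(None, [2.0, 0.5], None, -0.2, 3.5)

print(known, cold_start_user)
```

Under this assumption, the outer joins keep every row of `target` in the output even when a user or item has no entry in `sgd_model`, and the prediction then falls back toward the global mean.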
  4. List of Recommender Related Capabilities

    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
      - Euclid
      - Cosine
      - Jaccard
      - Angular

    Efficient top-k retrieval
    ‣ List top-3 items per user:

        item  user  score
        1     B     70
        2     A     80
        3     A     90
        4     B     60
        5     A     70
        …     …     …

        SELECT
          each_top_k(
            2, user, score,
            user, item -- output columns
          ) as (rank, score, user, item)
        FROM (
          SELECT * FROM table
          CLUSTER BY user
        ) t

      Completes in 2 hrs.

        SELECT
          item, user, score, rank
        FROM (
          SELECT
            item, user, score,
            rank() over (PARTITION BY user ORDER BY score DESC) as rank
          FROM
            table
        ) t
        WHERE rank <= 2

      Does NOT finish in 24 hrs. for 20M users and ~1k items in each.

    Efficient item-based CF techniques
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)

    Matrix completion
    ‣ Matrix Factorization
    ‣ (Field-Aware) Factorization Machines
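The two top-k queries on this slide return the same rows by different routes. The sketch below runs the window-function variant on SQLite (3.25 or later) with the slide's sample rows, then reproduces the result in a single pass over input grouped by user, keeping only k rows per group. The table name `scores`, the alias `rnk`, and the helper `top_k_per_group` are hypothetical, and the single-pass version only illustrates the idea; it is not Hivemall's `each_top_k` implementation.

```python
import heapq
import sqlite3
from itertools import groupby

# Sample rows from the slide: (item, user, score).
rows = [(1, "B", 70), (2, "A", 80), (3, "A", 90), (4, "B", 60), (5, "A", 70)]
K = 2

# Window-function version: rank every row within its user, then filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (item INTEGER, user TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)
windowed = conn.execute(
    """
    SELECT user, item, score, rnk
    FROM (
        SELECT item, user, score,
               rank() OVER (PARTITION BY user ORDER BY score DESC) AS rnk
        FROM scores
    ) t
    WHERE rnk <= ?
    ORDER BY user, rnk
    """,
    (K,),
).fetchall()


def top_k_per_group(grouped_rows, k):
    """Yield (user, item, score, rank) for the k best rows of each user.

    Expects rows for the same user to be adjacent, as CLUSTER BY provides.
    """
    for user, group in groupby(grouped_rows, key=lambda r: r[1]):
        best = heapq.nlargest(k, group, key=lambda r: r[2])
        for rank, (item, _, score) in enumerate(best, start=1):
            yield (user, item, score, rank)


clustered = sorted(rows, key=lambda r: r[1])  # stands in for CLUSTER BY user
streamed = list(top_k_per_group(clustered, K))

print(windowed)
print(streamed)
```

The window-function query assigns a rank to every row before filtering, whereas the single-pass version never holds more than k candidate rows per user. That difference is one plausible reason for the gap in running time reported on the slide.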