
Demo - Query-Based Simple and Scalable Recommender Systems with Apache Hivemall


Takuya Kitazawa

October 04, 2018

Transcript

  1. Query-Based Simple and Scalable Recommendation with Apache Hivemall
    Easy-to-use
    ‣ ML in SQL
    ‣ No expertise
    ‣ Sharable
    SELECT
    train_classifier( -- train_regressor(
    features, label,
    '-loss logloss -optimizer AdaGrad -reg L1'
    ) as (feature, weight)
    FROM
    training
    ‣ Loss function
    ‣ Optimizer
    ‣ Regularization
    ‣ Learning rate
    ‣ Mini-batch
    Scalable
    ‣ Runs in parallel
    ‣ Hadoop ecosystem
    ‣ Flexible selection
    of each layer:
    - HiveQL
    - Pig
    - Spark
    - DataFrame
    - HiveContext
    Versatile
    ‣ Regression
    ‣ Classification
    ‣ Feature engineering
    ‣ Evaluation
    ‣ Topic modeling
    ‣ Anomaly detection
    ‣ NLP
    ‣ Generic array/map
    operations
    Multi-platform

  2. Item-Based Collaborative Filtering in Query Language
    itemid other cnt
    583266 621056 231
    583266 583266 923
    31231 13212 129
    31231 31231 542
    … … …
    CREATE TABLE cooccurrence as
    SELECT
    u1.itemid, u2.itemid as other,
    count(1) as cnt
    FROM
    user_purchased u1
    JOIN user_purchased u2 ON (u1.userid = u2.userid)
    WHERE
    u1.itemid != u2.itemid
    GROUP BY
    u1.itemid, u2.itemid
    userid itemid purchased_at
    1 31231 2015-04-09 00:29:02
    1 13212 2016-05-24 16:29:02
    2 312 2016-06-03 23:29:02
    3 2312 2016-06-04 19:29:02
    … … …
    CREATE TABLE user_purchased as
    SELECT
    userid,
    itemid,
    count(1) as purchase_count
    FROM
    history
    GROUP BY
    userid,
    itemid
    Count # of transactions for each user-item pair
    Compute item-item co-count
    What’s next?
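
The two queries on this slide can be sketched outside of Hive to show what they compute. This is a minimal Python sketch, not part of the original deck: the `history` rows are made up for illustration, and the in-memory grouping only mirrors the `GROUP BY` and self-join logic, not how Hive executes them.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase history rows: (userid, itemid, purchased_at).
history = [
    (1, 31231, "2015-04-09 00:29:02"),
    (1, 13212, "2016-05-24 16:29:02"),
    (1, 31231, "2016-06-01 10:00:00"),
    (2, 31231, "2016-06-03 23:29:02"),
    (2, 13212, "2016-06-04 19:29:02"),
    (3, 312, "2016-06-05 08:00:00"),
]

# Step 1: count transactions per (user, item) pair, like user_purchased.
user_purchased = Counter((userid, itemid) for userid, itemid, _ in history)

# Step 2: self-join on userid, keeping only pairs of distinct items,
# then count how many users bought both items, like cooccurrence.
items_by_user = {}
for userid, itemid in user_purchased:
    items_by_user.setdefault(userid, set()).add(itemid)

cooccurrence = Counter()
for items in items_by_user.values():
    for item, other in permutations(sorted(items), 2):
        cooccurrence[(item, other)] += 1

for (item, other), cnt in sorted(cooccurrence.items()):
    print(item, other, cnt)
```

Because the join keeps both orderings of each pair, every co-count appears twice, once per direction, which is what lets a later step look up "items bought together with X" by a single key.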

  3. Matrix Factorization in Query Language
    CREATE TABLE sgd_model
    as
    SELECT
    idx,
    array_avg(u_rank) as Pu,
    array_avg(i_rank) as Qi,
    avg(u_bias) as Bu,
    avg(i_bias) as Bi
    FROM (
    SELECT
    train_mf_sgd(
    user_id, item_id, rating,
    '-factor ${factor} -mu ${mu} -iter ${iters}'
    ) as (idx, u_rank, i_rank, u_bias, i_bias)
    FROM
    training
    ) t
    GROUP BY idx
    SELECT
    mf_predict(t2.Pu, m2.Qi, t2.Bu, m2.Bi, ${mu}) as predicted
    FROM (
    SELECT
    t1.user_id,
    t1.item_id,
    m1.Pu,
    m1.Bu
    FROM
    target t1
    LEFT OUTER JOIN
    sgd_model m1
    ON (t1.user_id = m1.idx)
    ) t2
    LEFT OUTER JOIN
    sgd_model m2
    ON (t2.item_id = m2.idx)
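
The prediction query joins the model table twice, once for the user's factors and once for the item's. The Python sketch below illustrates that lookup under the assumption that `mf_predict` follows the usual biased matrix factorization rule, `mu + Bu + Bi + Pu · Qi`. The model values, the helper names, and the choice to treat a missing row as zero are all hypothetical and are not taken from Hivemall's implementation.

```python
def predict_rating(pu, qi, bu, bi, mu):
    """Biased MF prediction: mu + bu + bi + dot(pu, qi).

    Assumption: parts missing after an outer join (None) contribute zero.
    """
    bu = 0.0 if bu is None else bu
    bi = 0.0 if bi is None else bi
    dot = 0.0
    if pu is not None and qi is not None:
        dot = sum(p * q for p, q in zip(pu, qi))
    return mu + bu + bi + dot


# Hypothetical model keyed by idx, shaped like one row of sgd_model.
mu = 3.0
sgd_model = {
    1: {"Pu": [1.0, 0.5], "Bu": 0.25, "Qi": [0.0, 0.0], "Bi": 0.0},
    2: {"Pu": [0.0, 0.0], "Bu": 0.0, "Qi": [0.5, 1.0], "Bi": -0.5},
}


def predict_for(user_id, item_id):
    # Two lookups stand in for the two LEFT OUTER JOINs on sgd_model.
    user_row = sgd_model.get(user_id, {})
    item_row = sgd_model.get(item_id, {})
    return predict_rating(
        user_row.get("Pu"), item_row.get("Qi"),
        user_row.get("Bu"), item_row.get("Bi"), mu,
    )


print(predict_for(1, 2))    # known user and item
print(predict_for(99, 2))   # unknown user falls back to mu + item bias
```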

  4. List of Recommender Related Capabilities
    ‣ List top-2 items per user:
    item user score
    1 B 70
    2 A 80
    3 A 90
    4 B 60
    5 A 70
    … … …
    SELECT
    each_top_k(
    2, user, score,
    user, item -- output columns
    ) as (rank, score, user, item)
    FROM (
    SELECT * FROM table
    CLUSTER BY user
    ) t
    Completes in 2 hrs.
    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
    - Euclid
    - Cosine
    - Jaccard
    - Angular
    Efficient top-k retrieval
    Efficient item-based CF techniques
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion
    ‣ Matrix Factorization
    ‣ (Field-Aware) Factorization Machines
    SELECT
    item, user, score, rank
    FROM (
    SELECT
    item, user, score,
    rank() over
    (PARTITION BY user
    ORDER BY score DESC) as rank
    FROM
    table
    ) t
    WHERE rank <= 2
    Does NOT finish in 24 hrs. for 20M users and ~1k items each
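
Both queries on this slide return the same kind of result: the k best-scoring items for each user. The Python sketch below shows that result on the slide's sample rows. It is only an illustration of per-group top-k selection with a heap, under the assumption that the output columns are `(rank, score, user, item)` as in the `each_top_k` query. The function name is hypothetical, and this is not a description of how Hivemall implements the UDF.

```python
import heapq
from collections import defaultdict

# Sample rows from the slide: (item, user, score).
rows = [
    (1, "B", 70),
    (2, "A", 80),
    (3, "A", 90),
    (4, "B", 60),
    (5, "A", 70),
]


def top_k_per_user(rows, k):
    """Return (rank, score, user, item) for the k highest scores per user."""
    by_user = defaultdict(list)
    for item, user, score in rows:
        by_user[user].append((score, item))

    result = []
    for user in sorted(by_user):
        # nlargest selects the k best entries without sorting the whole group.
        best = heapq.nlargest(k, by_user[user])
        for rank, (score, item) in enumerate(best, start=1):
            result.append((rank, score, user, item))
    return result


for row in top_k_per_user(rows, 2):
    print(row)
```

Selecting only k entries per group avoids ranking every row in the group, which is the general idea behind preferring a top-k function over a full window `rank()` when each user has many candidate items.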