
Query-Based Simple and Scalable Recommender Systems with Apache Hivemall


An introduction to Apache Hivemall https://github.com/apache/incubator-hivemall and its recommendation features. See https://youtu.be/cMUsuA9KZ_c for the recorded presentation.

Takuya Kitazawa

July 03, 2018

Transcript

  1. Query-Based Simple and Scalable Recommender Systems with Apache Hivemall

  2. Apache Hivemall
    Easy-to-use: ML in SQL
    Scalable: runs in parallel on the Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: efficient, generic functions
    ‣ Scalable ML library implemented as Hive UDFs
    ‣ OSS project under the Apache Software Foundation, released on Mar 5, 2018

  3. Easy-to-use and scalable
    Example: Scalable Logistic Regression written in ~10 lines of SQL
    Automatically runs in parallel on the Hadoop ecosystem

  4. Feature representation in Hivemall
    A feature is written as index:weight or as index alone, where index is INT, BIGINT, or TEXT and weight is FLOAT.
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ index can be text: price:600, size:2.5
    ‣ index-only means weight = 1.0 (e.g., categorical): male = male:1.0
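    Hivemall also ships helper functions for building such feature vectors directly in SQL. The sketch below uses quantitative_features() and categorical_features(); the purchases table and its price, size, and gender columns are hypothetical examples, not from the deck.
    SELECT
      array_concat(
        quantitative_features(array('price', 'size'), price, size),  -- e.g., "price:600.0", "size:2.5"
        categorical_features(array('gender'), gender)                -- e.g., "gender#male" (weight 1.0)
      ) AS features
    FROM
      purchases;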

  5. Training and prediction over feature-label pairs
    A Train SQL builds a model table from historical customer-product pairs: feature vectors such as
    0 … 0 1 1 0 0 8.29 1 0 … (Saturday, Male, Book, price, …) paired with a "Bought" label of 1 or 0.
    A Predict SQL then scores unforeseen pairs: will a male customer who viewed a $9.99 book on
    Saturday (features 0 … 0 1 1 0 0 9.99 1 0 …) buy it?

  6. CREATE TABLE lr_model AS
    SELECT
      feature,
      avg(weight) as weight -- reducers perform model averaging in parallel
    FROM (
      SELECT
        logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
      FROM
        training
    ) t -- map-only task
    GROUP BY feature; -- shuffled to reducers
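    The deck shows the Train SQL; as a companion, here is a minimal sketch of the Predict SQL mentioned on slide 5, assuming the test features have already been exploded into a testing_exploded table of (rowid, feature, value) rows (that table name and layout are assumptions). Hivemall's sigmoid() turns the weighted sum into a probability.
    SELECT
      t.rowid,
      sigmoid(sum(m.weight * t.value)) AS probability
    FROM
      testing_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
    GROUP BY
      t.rowid;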

  7. Variety of classification and regression algorithms
    Classification
    ‣ Generic classifier
    - HingeLoss
    - LogLoss (a.k.a. logistic loss)
    - SquaredHingeLoss
    - ModifiedHuberLoss
    - Perceptron
    ‣ Passive Aggressive (PA, PA1, PA2)
    ‣ Confidence Weighted (CW)
    ‣ Adaptive Regularization of Weight Vectors (AROW)
    ‣ Soft Confidence Weighted (SCW)
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
    Regression
    ‣ Generic regressor
    - SquaredLoss
    - QuantileLoss
    - EpsilonInsensitiveLoss
    - SquaredEpsilonInsensitiveLoss
    - HuberLoss
    ‣ PA Regression
    ‣ AROW Regression
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
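    As a sketch of how the generic classifier above can be driven purely by options (the training table and the exact option string are assumptions; consult the Hivemall docs for the options your version supports), switching the loss function is a one-line change:
    CREATE TABLE generic_model AS
    SELECT
      feature,
      avg(weight) AS weight
    FROM (
      SELECT
        -- logistic loss here; e.g., hingeloss or squaredhingeloss could be passed instead
        train_classifier(features, label, '-loss_function logloss') AS (feature, weight)
      FROM
        training
    ) t
    GROUP BY feature;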

  8. Multi-platform

  9. Apache Pig
    a = load 'a9a.train'
    as (rowid:int, label:float, features:{(featurepair:chararray)});
    b = foreach a generate flatten(
    logress(features, label, '-total_steps ${total_steps}')
    ) as (feature, weight);
    c = group b by feature;
    d = foreach c generate group, AVG(b.weight);
    store d into 'a9a_model';

  10. Apache Spark
    DataFrames
    val trainDf =
    spark.read.format("libsvm").load("a9a.train")
    val modelDf =
    trainDf.train_logregr(append_bias($"features"), $"label")
    .groupBy("feature").avg("weight")
    .toDF("feature", "weight")
    .cache

  11. Apache Spark
    SQL in HiveContext
    context = HiveContext(sc)
    context.sql("""
      SELECT
        feature,
        avg(weight) as weight
      FROM (
        SELECT
          train_logregr(features, label) as (feature, weight)
        FROM
          training
      ) t
      GROUP BY feature
    """)

  12. Apache Spark
    Online prediction on Spark Streaming
    val testData =
    ssc.textFileStream(...).map(LabeledPoint.parse)
    testData.predict { case testDf =>
    // Explode features in input streams
    val testDf_exploded = ...
    val predictDf = testDf_exploded
    .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
    .select($"rowid", ($"weight" * $"value").as("value"))
    .groupBy("rowid").sum("value")
    .select($"rowid", sigmoid($"SUM(value)"))
    predictDf
    }

  13. Multi-platform

  14. Versatile
    Feature engineering
    - Feature hashing
    - Feature scaling (normalization, z-score)
    - Feature binning
    - TF-IDF vectorizer
    - Polynomial expansion
    - Amplifier
    Evaluation metrics
    - AUC, nDCG, log loss, precision recall, …
    Array and maps
    - Concatenation
    - Intersection
    - Remove
    - Sort
    - Average
    - Sum
    - …
    Bit, compress, character encoding
    Efficient top-k query processing
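    To give a feel for these helpers, here is a minimal sketch of two of them, feature hashing and min-max scaling; the training table, its price column, and the min/max bounds are illustrative assumptions:
    SELECT
      feature_hashing(features) AS hashed_features,  -- hash textual feature indices into a bounded integer space
      rescale(price, 100.0, 1000.0) AS price_scaled  -- min-max normalization of price into [0, 1]
    FROM
      training;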

  15. Efficient top-k query processing
    student | class | score
    1       | B     | 70
    2       | A     | 80
    3       | A     | 90
    4       | B     | 60
    5       | A     | 70
    …       | …     | …
    List the top-2 students for each class.
    With a rank() window function (does not finish within 24 hrs. for 20M classes with ~1k students each):
    SELECT
      student, class, score, rank
    FROM (
      SELECT
        student, class, score,
        rank() over (PARTITION BY class ORDER BY score DESC) as rank
      FROM
        table
    ) t
    WHERE rank <= 2
    With Hivemall's each_top_k (finishes in 2 hrs.):
    SELECT
      each_top_k(
        2, class, score,
        class, student -- output columns
      ) as (rank, score, class, student)
    FROM (
      SELECT * FROM table
      CLUSTER BY class
    ) t

  16. Recommendation in Hivemall with each_top_k()
    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
    - Euclid
    - Cosine
    - Jaccard
    - Angular
    Efficient item-based collaborative filtering
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion
    ‣ Matrix Factorization
    ‣ Factorization Machines
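    For instance, a naive item-based collaborative filtering pass can combine a pairwise similarity function with each_top_k. This is only a minimal sketch, not the SLIM/DIMSUM methods listed above, and the item_features table and its columns are assumptions:
    WITH similarities AS (
      SELECT
        t1.item_id,
        t2.item_id AS other_item_id,
        cosine_similarity(t1.features, t2.features) AS similarity
      FROM
        item_features t1
        CROSS JOIN item_features t2
      WHERE
        t1.item_id != t2.item_id
    )
    SELECT
      each_top_k(
        10, item_id, similarity, -- top-10 most similar items per item
        item_id, other_item_id
      ) AS (rank, similarity, item_id, other_item_id)
    FROM (
      SELECT * FROM similarities
      CLUSTER BY item_id
    ) t;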

  17. Both explicit and implicit recommendation
    Find each user's most promising items
    ‣ Improve customers' engagement
    ‣ Encourage customers to buy more products on e-commerce sites
    Explicit feedback input:
    User | Item         | Rating
    Tom  | Laptop       | 3 ★★★☆☆
    Jack | Coffee beans | 5 ★★★★★
    Mike | Watch        | 1 ★☆☆☆☆
    …    | …            | …
    Implicit feedback input:
    User | Bought item
    Tom  | Laptop
    Jack | Coffee beans
    Mike | Watch
    …    | …
    Output (in both cases):
    User | Top-3 recommended items
    Tom  | Headphone, USB charger, 4K monitor
    Jack | Mug, Coffee machine, Chocolate
    Mike | Ring, T-shirt, Bag
    …    | …

  18. Building a Recommendation Model with Hivemall
    Explicit rating prediction on the MovieLens 20M (ML20M) data
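    The deck demonstrates this live; below is a minimal sketch of how such an explicit rating model can be trained with Hivemall's matrix factorization, following the same model-averaging pattern as the logistic regression query earlier. The ratings table, its column names, and the -factor value are assumptions here:
    CREATE TABLE mf_model AS
    SELECT
      idx,
      array_avg(u_rank) AS Pu, -- user latent factors
      array_avg(m_rank) AS Qi, -- item latent factors
      avg(u_bias) AS Bu,       -- user bias
      avg(m_bias) AS Bi        -- item bias
    FROM (
      SELECT
        train_mf_sgd(userid, movieid, rating, '-factor 10')
          AS (idx, u_rank, m_rank, u_bias, m_bias)
      FROM
        ratings
    ) t
    GROUP BY idx;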

  19. –Johnny Appleseed
    “Type a quote here.”

  20. github.com/apache/incubator-hivemall

  21. Natural Language Processing: English, Japanese and Chinese tokenizers, word N-grams, …
    ‣ select tokenize('Hello, world!')   returns ["Hello", "world"]
    ‣ select singularize('apples')       returns apple
    Sketching
    ‣ SELECT count(distinct user_id) FROM t   can be approximated by   SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions
    SELECT
      map_url(lat, lon, zoom) as osm_url,
      map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
    FROM (
      SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
      UNION ALL
      SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
    ) t

  22. Anomaly / Change-point detection
    ‣ Local outlier factor (k-NN-based technique)
    ‣ ChangeFinder
    ‣ Singular Spectrum Transformation
    Clustering / Topic modeling
    ‣ Latent Dirichlet Allocation
    ‣ Probabilistic Latent Semantic Analysis
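    As one example from this list, change-point detection is also just a query. A minimal sketch, assuming a timeseries table with num (ordering) and value columns and that changefinder() emits outlier and change-point scores plus boolean flags when thresholds are given; the threshold values are purely illustrative:
    SELECT
      changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035')
        AS (outlier_score, changepoint_score, is_outlier, is_changepoint)
    FROM (
      SELECT num, value FROM timeseries
      ORDER BY num ASC -- ChangeFinder is sequential, so feed rows in time order
    ) t;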

  23. Apache Hivemall
    Easy-to-use: ML in SQL
    Scalable: runs in parallel on the Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: efficient, generic functions
    github.com/apache/incubator-hivemall
