Query-Based Simple and Scalable Recommender Systems with Apache Hivemall

Apache Hivemall
‣ Scalable ML library implemented as Hive UDFs
‣ OSS project under Apache Software Foundation
Released on Mar 5, 2018
- Easy-to-use: ML in SQL
- Scalable: Runs in parallel on Hadoop ecosystem
- Multi-platform: Hive, Spark, Pig
- Versatile: Efficient, generic functions

Easy-to-use and scalable
Example: Scalable Logistic Regression written in ~10 lines of SQL
Automatically runs in parallel on Hadoop ecosystem

Feature representation in Hivemall
A feature is written as index:weight, or as index alone.
- index: INT, BIGINT, or TEXT
- weight: FLOAT
‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
‣ index can be text: price:600, size:2.5
‣ index-only means weight = 1.0 (e.g., categorical): male = male:1.0
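The feature format above can be illustrated with a small parser. This is only a sketch in Python, not Hivemall code: the function name `parse_feature` is hypothetical, and it assumes the last colon separates the index from the weight.

```python
def parse_feature(feature):
    """Split one 'index:weight' string into (index, weight).

    An index-only feature such as 'male' is treated as weight 1.0.
    Assumption: the last ':' separates the index from the weight.
    """
    index, sep, weight = feature.rpartition(":")
    if not sep:  # no colon found: index-only (e.g., categorical) feature
        return feature, 1.0
    return index, float(weight)


features = ["10:3.4", "price:600", "male"]
parsed = [parse_feature(f) for f in features]
print(parsed)
```

Numeric and text indexes are both kept as strings here; a real pipeline would probably cast or hash them before training.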

Training and prediction over features-label pairs
Train SQL → Model Table → Predict SQL
[Figure: table of historical customer-product pairs. Each row is a sparse feature vector (columns include Saturday, Male, Book, and a price such as 8.29, 5.25, or 195.00) with a label "Buy?" (1 = bought, 0 = not bought).]
Historically, for this kind of customer-product pairs, …
So, how about a male customer who watched a $9.99 book?
Unforeseen: 0 … 0 1 1 0 0 9.99 1 0 … → Buy? = ?

    CREATE TABLE lr_model AS
    SELECT
      feature,
      avg(weight) as weight -- reducers perform model averaging in parallel
    FROM (
      SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
      FROM training
    ) t -- map-only task
    GROUP BY feature; -- shuffled to reducers
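The map-then-average pattern in this query can be mimicked on a single machine. The following Python sketch is an illustration under assumptions: `train_partition` stands in for one map task and runs plain SGD on logistic loss, which is not necessarily the optimizer `logress` uses, and all names, the toy data, and the hyperparameters are made up.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Stand-in for one map task: plain SGD on logistic loss over its own rows."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            gradient = sigmoid(margin) - label
            for f, v in features.items():
                weights[f] -= eta * gradient * v
    return dict(weights)


def average_models(models):
    """Stand-in for GROUP BY feature + avg(weight) on the reducers."""
    collected = defaultdict(list)
    for model in models:
        for feature, weight in model.items():
            collected[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in collected.items()}


# Two hypothetical partitions of (features, label) pairs.
partitions = [
    [({"male": 1.0, "book": 1.0}, 1), ({"female": 1.0, "book": 1.0}, 0)],
    [({"male": 1.0, "watch": 1.0}, 1), ({"female": 1.0, "watch": 1.0}, 0)],
]
lr_model = average_models(train_partition(p) for p in partitions)
print(lr_model)
```

As with `avg(weight)` in SQL, a feature seen by only one partition is averaged over that single weight.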

Variety of classification and regression algorithms
Classification
‣ Generic classifier
  - HingeLoss
  - LogLoss (a.k.a. logistic loss)
  - SquaredHingeLoss
  - ModifiedHuberLoss
  - Perceptron
‣ Passive Aggressive (PA, PA1, PA2)
‣ Confidence Weighted (CW)
‣ Adaptive Regularization of Weight Vectors (AROW)
‣ Soft Confidence Weighted (SCW)
‣ (Field-Aware) Factorization Machines
‣ RandomForest
Regression
‣ Generic regressor
  - SquaredLoss
  - QuantileLoss
  - EpsilonInsensitiveLoss
  - SquaredEpsilonInsensitiveLoss
  - HuberLoss
‣ PA Regression
‣ AROW Regression
‣ (Field-Aware) Factorization Machines
‣ RandomForest

Multi-platform

Apache Pig
    a = load 'a9a.train'
        as (rowid:int, label:float, features:{(featurepair:chararray)});
    b = foreach a generate flatten(
          logress(features, label, '-total_steps ${total_steps}')
        ) as (feature, weight);
    c = group b by feature;
    d = foreach c generate group, AVG(b.weight);
    store d into 'a9a_model';

Apache Spark DataFrames
    val trainDf = spark.read.format("libsvm").load("a9a.train")
    val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
      .groupBy("feature").avg("weight")
      .toDF("feature", "weight")
      .cache

Apache Spark SQL in HiveContext
    context = HiveContext(sc)
    # Triple quotes let the query span several lines
    context.sql("""
      SELECT
        feature, avg(weight) as weight
      FROM (
        SELECT train_logregr(features, label) as (feature, weight)
        FROM training
      ) t
      GROUP BY feature
    """)

Apache Spark
Online prediction on Spark Streaming
    val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
    testData.predict { case testDf =>
      // Explode features in input streams
      val testDf_exploded = ...
      val predictDf = testDf_exploded
        // Attach the model weight to every (rowid, feature, value) row
        .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
        .select($"rowid", ($"weight" * $"value").as("value"))
        // Sum weight * value per row, then squash into a probability
        .groupBy("rowid").sum("value")
        .select($"rowid", sigmoid($"SUM(value)"))
      predictDf
    }
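The join-and-sum prediction above boils down to a weighted sum per row followed by a sigmoid. A minimal Python sketch of that arithmetic, assuming the test rows are already exploded into (rowid, feature, value) triples; the function name `predict` and the toy model are hypothetical.

```python
import math


def predict(rows, model):
    """rows: iterable of (rowid, feature, value); model: dict of feature -> weight."""
    totals = {}
    for rowid, feature, value in rows:
        # Mirrors the LEFT OUTER JOIN: a feature missing from the model adds nothing.
        weight = model.get(feature, 0.0)
        totals[rowid] = totals.get(rowid, 0.0) + weight * value
    return {rowid: 1.0 / (1.0 + math.exp(-total)) for rowid, total in totals.items()}


model = {"male": 2.0, "book": -1.0}  # made-up weights
rows = [(1, "male", 1.0), (1, "book", 1.0), (2, "unseen", 1.0)]
scores = predict(rows, model)
print(scores)
```

One difference from the SQL semantics: a row whose features are all unknown scores 0.5 here, whereas a sum over only NULL values would yield NULL.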

Multi-platform

Versatile
Feature engineering
- Feature hashing
- Feature scaling (normalization, z-score)
- Feature binning
- TF-IDF vectorizer
- Polynomial expansion
- Amplifier
Evaluation metrics
- AUC, nDCG, log loss, precision recall, …
Array and maps
- Concatenation
- Intersection
- Remove
- Sort
- Average
- Sum
- …
Bit, compress, character encoding
- …
Efficient top-k query processing
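Two of the feature engineering steps listed here, feature hashing and z-score scaling, can be sketched in a few lines of Python. This is an illustration only: MD5 is used as a convenient deterministic hash rather than whatever hash function Hivemall applies, and `hash_feature`, `zscore`, and the default `num_buckets` are assumptions.

```python
import hashlib
import statistics


def hash_feature(name, num_buckets=2 ** 24):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)
    if stddev == 0:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mean) / stddev for v in values]


print(hash_feature("price"))
print(zscore([8.29, 5.25, 195.00]))
```

Hashing trades a small chance of collisions for a fixed-size feature space, which avoids building a dictionary of feature names.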

Efficient top-k query processing
student  class  score
1        B      70
2        A      80
3        A      90
4        B      60
5        A      70
…        …      …
List top-2 students for each class:
    SELECT
      student, class, score, rank
    FROM (
      SELECT
        student, class, score,
        rank() over (PARTITION BY class ORDER BY score DESC) as rank
      FROM table
    ) t
    WHERE rank <= 2
Does not finish within 24 hrs for 20M classes and ~1k students in each.
    SELECT
      each_top_k(
        2, class, score,
        class, student -- output columns
      ) as (rank, score, class, student)
    FROM (
      SELECT * FROM table
      CLUSTER BY class
    ) t
Finishes in 2 hrs.
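One way to picture the per-class top-k computation is a scan over rows that are already clustered by group, keeping only k candidates per group. The Python sketch below is an analogy, not Hivemall's implementation; it borrows the name `each_top_k` for readability and ranks ties by arrival order, which may differ from the UDF.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def each_top_k(k, rows):
    """Yield (rank, score, group, value) for the k best rows of each group.

    rows: (group, score, value) tuples already clustered by group,
    mirroring the CLUSTER BY class subquery.
    """
    for group, members in groupby(rows, key=itemgetter(0)):
        # nlargest keeps at most k candidates instead of sorting the whole group
        best = heapq.nlargest(k, members, key=itemgetter(1))
        for rank, (_, score, value) in enumerate(best, start=1):
            yield rank, score, group, value


# (class, score, student) rows from the table above, clustered by class
table = [("B", 70, 1), ("A", 80, 2), ("A", 90, 3), ("B", 60, 4), ("A", 70, 5)]
clustered = sorted(table, key=itemgetter(0))
top2 = list(each_top_k(2, clustered))
print(top2)
```

The point of the sketch is the memory profile: each group needs only k slots, so no full sort of a class with ~1k students is required.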

Recommendation in Hivemall with each_top_k()
k-nearest-neighbor
‣ MinHash and b-Bit MinHash (LSH)
‣ Similarities
  - Euclid
  - Cosine
  - Jaccard
  - Angular
Efficient item-based collaborative filtering
‣ Sparse Linear Method (SLIM)
‣ Approximated all-pair similarities (DIMSUM)
Matrix completion
‣ Matrix Factorization
‣ Factorization Machines
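To make the MinHash and Jaccard entries concrete, here is a small Python sketch that compares exact Jaccard similarity with a MinHash estimate. It is a textbook-style illustration under assumptions: seeded MD5 plays the role of the hash family, item sets must be non-empty, and none of the function names come from Hivemall.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def _hash(seed, item):
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """One minimum hash value per seeded hash function (items must be non-empty)."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


tom = {"Laptop", "Headphone", "USB charger"}
jack = {"Laptop", "Headphone", "Mug"}
print(jaccard(tom, jack))
print(estimate_jaccard(minhash_signature(tom), minhash_signature(jack)))
```

The estimate gets closer to the exact value as `num_hashes` grows; signatures can also be bucketed so that only likely neighbors are compared.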

Both explicit and implicit recommendation
Find each user's most promising items
‣ Improve customers' engagement
‣ Encourage customers to buy more products on e-commerce sites

Explicit feedback (ratings)
Input:
User  Item          Rating
Tom   Laptop        3 ★★★☆☆
Jack  Coffee beans  5 ★★★★★
Mike  Watch         1 ★☆☆☆☆
…     …             …
Output:
User  Top-3 recommended items
Tom   Headphone, USB charger, 4K monitor
Jack  Mug, Coffee machine, Chocolate
Mike  Ring, T-shirt, Bag
…     …

Implicit feedback (purchases)
Input:
User  Bought item
Tom   Laptop
Jack  Coffee beans
Mike  Watch
…     …
Output:
User  Top-3 recommended items
Tom   Headphone, USB charger, 4K monitor
Jack  Mug, Coffee machine, Chocolate
Mike  Ring, T-shirt, Bag
…     …

Building Recommendation Model with Hivemall
Explicit rating prediction on the MovieLens 20M (ML20M) data
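Explicit rating prediction is commonly approached with matrix factorization. The Python sketch below shows the basic SGD update on a toy rating list; it is a simplified illustration without bias terms, not the Hivemall trainer, and the function names, hyperparameters, and data are assumptions.

```python
import random


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def train_mf(ratings, factors=2, eta=0.05, lam=0.01, epochs=500, seed=0):
    """Fit rating ~ mu + P[user] . Q[item] by SGD with L2 regularization."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(factors)] for i in items}
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + dot(P[u], Q[i]))
            for f in range(factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - lam * pu)
                Q[i][f] += eta * (err * pu - lam * qi)
    return mu, P, Q


def predict_rating(mu, P, Q, user, item):
    return mu + dot(P[user], Q[item])


ratings = [("Tom", "Laptop", 3.0), ("Jack", "Coffee beans", 5.0), ("Mike", "Watch", 1.0)]
mu, P, Q = train_mf(ratings)
print(predict_rating(mu, P, Q, "Jack", "Coffee beans"))
```

Once the factors are learned, scoring every user-item pair and keeping the best few per user is a top-k-per-group problem.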


github.com/apache/incubator-hivemall

Natural Language Processing: English, Japanese and Chinese tokenizer, word N-grams, …
‣ select tokenize('Hello, world!')
  → ["Hello", "world"]
‣ select singularize('apples')
  → apple
Sketching
‣ SELECT count(distinct user_id) FROM t
‣ SELECT approx_count_distinct(user_id) FROM t
Geospatial functions
    SELECT
      map_url(lat, lon, zoom) as osm_url,
      map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
    FROM (
      SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
      UNION ALL
      SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
    ) t

Anomaly / Change-point detection
‣ Local outlier factor (k-NN-based technique)
‣ ChangeFinder
‣ Singular Spectrum Transformation
Clustering / Topic modeling
‣ Latent Dirichlet Allocation
‣ Probabilistic Latent Semantic Analysis

Apache Hivemall
- Easy-to-use: ML in SQL
- Scalable: Runs in parallel on Hadoop ecosystem
- Multi-platform: Hive, Spark, Pig
- Versatile: Efficient, generic functions
github.com/apache/incubator-hivemall


Takuya Kitazawa
