Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Apache Hivemall: Easy-to-use ML in SQL
‣ Scalable: runs in parallel on the Hadoop ecosystem
‣ Multi-platform: Hive, Spark, Pig
‣ Versatile: efficient, generic functions
‣ Scalable ML library implemented as Hive UDFs
‣ OSS project under the Apache Software Foundation, released on Mar 5, 2018

Slide 3

Slide 3 text

Easy-to-use and scalable. Example: scalable logistic regression written in ~10 lines of SQL that automatically runs in parallel on the Hadoop ecosystem.

Slide 4

Slide 4 text

Feature representation in Hivemall: each feature is a string of the form index:weight or just index (index can be INT, BIGINT, or TEXT; weight is FLOAT)
‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
‣ The index can be text: price:600, size:2.5
‣ An index-only feature means weight = 1.0 (e.g., categorical features): male becomes male:1.0
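To make this concrete, here is a minimal sketch of building such feature arrays in plain HiveQL (the purchases table and its columns are hypothetical):

  SELECT
    array(
      concat('price:', price),  -- quantitative feature: index:weight
      concat('size:', size),    -- quantitative feature: index:weight
      gender                    -- categorical feature: index only, weight = 1.0
    ) as features,
    label
  FROM purchases;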

Slide 5

Slide 5 text

Training and prediction over feature-label pairs
‣ Train SQL: learns a model table from historical customer-product rows, each a feature vector (e.g., Saturday, Male, Book, price 8.29, …) with a Buy? label (1 = bought, 0 = not bought).
‣ Predict SQL: applies the model table to unforeseen rows, e.g., "how about a male customer who watched a $9.99 book?"

Slide 6

Slide 6 text

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
  FROM training
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
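The matching Predict SQL is not shown here; a hedged sketch, assuming a hypothetical testing_exploded table with one (rowid, feature, value) row per non-zero test feature and using Hivemall's sigmoid():

  SELECT
    t.rowid,
    sigmoid(sum(m.weight * t.value)) as probability -- logistic regression score
  FROM testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
  GROUP BY t.rowid;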

Slide 7

Slide 7 text

Variety of classification and regression algorithms

Classification
‣ Generic classifier with pluggable loss functions: HingeLoss, LogLoss (a.k.a. logistic loss), SquaredHingeLoss, ModifiedHuberLoss, Perceptron (see the sketch after this list)
‣ Passive Aggressive (PA, PA1, PA2)
‣ Confidence Weighted (CW)
‣ Adaptive Regularization of Weight Vectors (AROW)
‣ Soft Confidence Weighted (SCW)
‣ (Field-Aware) Factorization Machines
‣ RandomForest

Regression
‣ Generic regressor with pluggable loss functions: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss
‣ PA Regression
‣ AROW Regression
‣ (Field-Aware) Factorization Machines
‣ RandomForest
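As a hedged sketch of the generic classifier: train_classifier() follows the same map-then-average pattern as the logress() query on the previous slide; the exact option names (-loss_function, -optimizer) are assumptions to be checked against the Hivemall documentation:

  CREATE TABLE classifier_model AS
  SELECT
    feature,
    avg(weight) as weight -- model averaging, as with logress()
  FROM (
    SELECT train_classifier(
             features, label,
             '-loss_function logloss -optimizer adagrad' -- option names are assumptions
           ) as (feature, weight)
    FROM training
  ) t
  GROUP BY feature;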

Slide 8

Slide 8 text

Multi-platform

Slide 9

Slide 9 text

Apache Pig

a = load 'a9a.train'
    as (rowid:int, label:float, features:{(featurepair:chararray)});
b = foreach a generate flatten(
      logress(features, label, '-total_steps ${total_steps}')
    ) as (feature, weight);
c = group b by feature;
d = foreach c generate group, AVG(b.weight);
store d into 'a9a_model';

Slide 10

Slide 10 text

Apache Spark DataFrames

val trainDf = spark.read.format("libsvm").load("a9a.train")
val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
  .groupBy("feature").avg("weight")
  .toDF("feature", "weight")
  .cache

Slide 11

Slide 11 text

Apache Spark SQL in HiveContext

context = HiveContext(sc)
context.sql("""
  SELECT feature, avg(weight) as weight
  FROM (
    SELECT train_logregr(features, label) as (feature, weight)
    FROM training
  ) t
  GROUP BY feature
""")

Slide 12

Slide 12 text

Apache Spark: online prediction on Spark Streaming

val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
testData.predict { case testDf =>
  // Explode features in input streams
  val testDf_exploded = ...
  val predictDf = testDf_exploded
    .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
    .select($"rowid", ($"weight" * $"value").as("value"))
    .groupBy("rowid").sum("value")
    .select($"rowid", sigmoid($"SUM(value)"))
  predictDf
}

Slide 13

Slide 13 text

Multi-platform

Slide 14

Slide 14 text

Versatile

Feature engineering (see the sketch after this list)
‣ Feature hashing
‣ Feature scaling (normalization, z-score)
‣ Feature binning
‣ TF-IDF vectorizer
‣ Polynomial expansion
‣ Amplifier

Evaluation metrics
‣ AUC, nDCG, log loss, precision, recall, …

Array and map functions
‣ Concatenation
‣ Intersection
‣ Remove
‣ Sort
‣ Average
‣ Sum
‣ …

Bit, compress, and character-encoding functions

Efficient top-k query processing
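A hedged sketch of two of the feature-engineering helpers above, min-max feature scaling and feature hashing (the table t and its columns are hypothetical; verify the exact signatures in the Hivemall docs):

  SELECT
    rescale(price, 100.0, 1000.0) as scaled_price,                -- min-max scaling into [0, 1]
    feature_hashing(array('price:600.0', 'male')) as hashed_ftvec -- hash text indices to int indices
  FROM t;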

Slide 15

Slide 15 text

Efficient top-k query processing

Example table (student, class, score):
  student  class  score
  1        B      70
  2        A      80
  3        A      90
  4        B      60
  5        A      70
  …        …      …

Task: list the top-2 students for each class.

With a window function (did not finish in 24 hrs. for 20M classes with ~1k students each):

  SELECT student, class, score, rank
  FROM (
    SELECT student, class, score,
           rank() over (PARTITION BY class ORDER BY score DESC) as rank
    FROM table
  ) t
  WHERE rank <= 2

With Hivemall's each_top_k() (finished in 2 hrs.):

  SELECT
    each_top_k(
      2, class, score,
      class, student -- output columns
    ) as (rank, score, class, student)
  FROM (
    SELECT * FROM table CLUSTER BY class
  ) t

Slide 16

Slide 16 text

Recommendation in Hivemall with each_top_k()

k-nearest neighbor
‣ MinHash and b-Bit MinHash (LSH)
‣ Similarities: Euclidean, Cosine, Jaccard, Angular

Efficient item-based collaborative filtering (see the sketch below)
‣ Sparse Linear Method (SLIM)
‣ Approximated all-pairs similarity (DIMSUM)

Matrix completion
‣ Matrix Factorization
‣ Factorization Machines
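A hedged sketch of how these pieces combine for item-based recommendation: compute item-item similarity with cosine_similarity() and keep each item's 10 nearest neighbors with each_top_k() (the item_features table is hypothetical, and the naive all-pairs join is only for illustration; MinHash or DIMSUM would prune it in practice):

  WITH pairs AS (
    SELECT
      t1.item_id AS item,
      t2.item_id AS other,
      cosine_similarity(t1.feature_vector, t2.feature_vector) AS similarity
    FROM item_features t1
    CROSS JOIN item_features t2
    WHERE t1.item_id != t2.item_id
  )
  SELECT
    each_top_k(10, item, similarity, item, other) as (rank, similarity, item, other)
  FROM (
    SELECT * FROM pairs CLUSTER BY item
  ) t;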

Slide 17

Slide 17 text

Both explicit and implicit recommendation

Find each user's most promising items
‣ Improve customers' engagement
‣ Encourage customers to buy more products on e-commerce sites

Explicit input (ratings): User / Item / Rating, e.g., Tom / Laptop / 3 (★★★☆☆), Jack / Coffee beans / 5 (★★★★★), Mike / Watch / 1 (★☆☆☆☆), …
Implicit input (purchases): User / Bought item, e.g., Tom / Laptop, Jack / Coffee beans, Mike / Watch, …
Output in either case: User / Top-3 recommended items, e.g., Tom / Headphone, USB charger, 4K monitor; Jack / Mug, Coffee machine, Chocolate; Mike / Ring, T-shirt, Bag; …

Slide 18

Slide 18 text

Building a Recommendation Model with Hivemall: explicit rating prediction on the MovieLens 20M (ML20M) dataset
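A hedged sketch of what such a model-building query can look like, following the general Hivemall matrix-factorization pattern; train_mf_sgd(), array_avg(), the output column names, and the ratings schema are assumptions to verify against the Hivemall user guide:

  CREATE TABLE mf_model AS
  SELECT
    idx,
    array_avg(u_rank) as Pu, -- averaged user latent factors
    array_avg(m_rank) as Qi, -- averaged item latent factors
    avg(u_bias)       as Bu,
    avg(m_bias)       as Bi
  FROM (
    SELECT train_mf_sgd(userid, movieid, rating) as (idx, u_rank, m_rank, u_bias, m_bias)
    FROM training
  ) t
  GROUP BY idx;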

Slide 19

Slide 19 text


Slide 20

Slide 20 text

github.com/apache/incubator-hivemall

Slide 21

Slide 21 text

Natural Language Processing: English, Japanese, and Chinese tokenizers, word N-grams, …
‣ select tokenize('Hello, world!')
  → ["Hello", "world"]
‣ select singularize('apples')
  → apple

Sketching
‣ SELECT count(distinct user_id) FROM t
  → SELECT approx_count_distinct(user_id) FROM t

Geospatial functions
‣ SELECT
    map_url(lat, lon, zoom) as osm_url,
    map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
  FROM (
    SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
    UNION ALL
    SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
  ) t

Slide 22

Slide 22 text

Anomaly / change-point detection
‣ Local Outlier Factor (a k-NN-based technique)
‣ ChangeFinder
‣ Singular Spectrum Transformation

Clustering / topic modeling
‣ Latent Dirichlet Allocation
‣ Probabilistic Latent Semantic Analysis

Slide 23

Slide 23 text

Apache Hivemall: Easy-to-use ML in SQL
‣ Scalable: runs in parallel on the Hadoop ecosystem
‣ Multi-platform: Hive, Spark, Pig
‣ Versatile: efficient, generic functions

github.com/apache/incubator-hivemall