Slide 1

Slide 1 text

Apache Hivemall: Query-Based, Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall

Slide 2

Slide 2 text

Machine Learning in Query Language

Slide 3

Slide 3 text

BigQuery ML at Google I/O 2018 https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

Slide 4

Slide 4 text

Q. Solve a regression problem on massive data stored in a data warehouse

Slide 5

Slide 5 text

What it takes: practical experience in science and engineering, theoretical understanding, a tool (Python?), and scalability. Q. Solve a regression problem on massive data stored in a data warehouse

Slide 6

Slide 6 text

Done in fewer than 10 lines of queries. https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

Slide 7

Slide 7 text

Machine Learning for everyone

Slide 8

Slide 8 text

Open source query-based machine learning solution github.com/apache/incubator-hivemall

Slide 9

Slide 9 text

Hivemall in the real-world ML workflow: why Hivemall is notably preferable, and who can get the benefit

Slide 10

Slide 10 text

The real-world ML workflow is for experts: data scientists and ML engineers who have a solid relevant background. Problem (what you want to “predict”) → Hypothesis & Proposal → Historical data → Cleanse data → Build machine learning model → Evaluate → Deploy to production

Slide 11

Slide 11 text

The real-world ML workflow is for experts: data scientists and ML engineers who have a solid relevant background. Problem (what you want to “predict”) → Hypothesis & Proposal → Historical data → Cleanse data → Build machine learning model → Evaluate → Deploy to production. Do you really need such complexity and flexibility?

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics. Problem (what you want to “predict”) → Hypothesis & Proposal → Historical data → Cleanse data → Build machine learning model → Evaluate → Deploy to production. Easily try, save, share, and schedule via a simple interface, in a scalable manner.

Slide 15

Slide 15 text

How can we organize query fragments? Each step of the workflow becomes one or more queries: Extract → Filter → Interpolate → Normalize → …; Train data → Get features → Train → …; Test data → Get features → Predict → … → Accuracy.

Slide 16

Slide 16 text

Tip: Combine with a workflow engine. For example, Digdag (https://www.digdag.io/) lets you define a highly dependent ML workflow (Extract → Filter → Interpolate → Normalize → …; Train data → Get features → Train; Test data → Get features → Predict → Accuracy) as a YAML file; see the sketch below.
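
For illustration, a minimal Digdag workflow sketch. The task names and SQL file paths are hypothetical; each step shells out to Hive via Digdag's sh> operator, and tasks run in order, so each query sees the previous step's output:

# ml_pipeline.dig (hypothetical file and query names)
+extract:
  sh>: hive -f queries/extract.sql

+normalize:
  sh>: hive -f queries/normalize.sql

+train:
  sh>: hive -f queries/train.sql

+evaluate:
  sh>: hive -f queries/evaluate.sql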

Slide 17

Slide 17 text

Query-based ML: a new option outside of common ML toolkits. Your data-related work can be simpler. OSS solution <3 workflow engine

Slide 18

Slide 18 text

Introduction to query-based machine learning with Apache Hivemall

Slide 19

Slide 19 text

Apache Hive
‣ Data warehousing solution built on top of Apache Hadoop
‣ Efficiently access and analyze large-scale data via a SQL-like interface, HiveQL: create table, select, join, group by, count(), sum(), …, order by, cluster by, …

Slide 20

Slide 20 text

Apache Hivemall
‣ OSS project under the Apache Software Foundation since 2017
‣ Scalable ML library implemented as Hive user-defined functions:
  - UDF: row-wise transformation of a column (e.g., l1_normalize())
  - UDAF: aggregation of many rows into a scalar (e.g., rmse())
  - UDTF: tabular output with multiple rows and columns (e.g., train_regressor())
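
To make the three function types concrete, a minimal sketch using the functions named above (the predictions table and its predicted/actual columns are hypothetical):

-- UDF: one output value per input row
SELECT l1_normalize(features) FROM training;

-- UDAF: aggregates many rows into a single scalar
SELECT rmse(predicted, actual) FROM predictions;

-- UDTF: emits a table, here one (feature, weight) row per model weight
SELECT train_regressor(features, label) as (feature, weight) FROM training;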

Slide 21

Slide 21 text

Apache Hivemall
‣ Easy-to-use: ML in SQL
‣ Scalable: runs in parallel on the Hadoop ecosystem
‣ Multi-platform: Hive, Spark, Pig
‣ Versatile: efficient, generic functions

Slide 22

Slide 22 text

Easy-to-use and scalable

Slide 23

Slide 23 text

Example: scalable logistic regression written in ~10 lines of queries. It automatically runs in parallel on Hadoop; a sketch of such a query follows.
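
As a minimal sketch (the full query appears on the Apache Hive slide later in this deck), training reduces to one UDTF call plus an aggregation; each task trains on its split of training in parallel, and averaging the per-task weights by feature yields the final model:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label) as (feature, weight)
  FROM training
) t
GROUP BY feature;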

Slide 24

Slide 24 text

Space-efficient feature representation in Hivemall: a feature is index:value or just index, where index is INT, BIGINT, or TEXT and value is FLOAT.
‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
‣ index can be text: price:600, size:2.5
‣ index-only means value = 1.0 (e.g., categorical): gender#male = gender#male:1.0

Slide 25

Slide 25 text

Feature vector = array of strings.

Array of quantitative features (index:value):
select quantitative_features(array("price", "size"), 600, 2.5)
→ ["price:600.0", "size:2.5"]

Array of categorical features (index only):
select categorical_features(array("gender", "category"), "male", "book")
→ ["gender#male", "category#book"]

* NULL is automatically omitted. Hivemall internally does one-hot encoding (e.g., book → 1, 0, 0, …).

Slide 26

Slide 26 text

Feature hashing: approximation improves scalability. It simplifies the indices of quantitative and categorical features:
select feature_hashing(array("price:600", "category#book"))
→ ["14142887:600", "10413006"]
(Default upper limit: 2^24 + 1 = 16777217)

Slide 27

Slide 27 text

Example: Table “purchase_history” http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html

Slide 28

Slide 28 text

select
  array_concat( -- Concatenate features as a feature vector
    quantitative_features( -- Create quantitative features
      array("price"),
      price
    ),
    categorical_features( -- Create categorical features
      array("day of week", "gender", "category"),
      day_of_week, gender, category
    )
  ) as features,
  label
from
  purchase_history

Slide 29

Slide 29 text

Resulting table “training”

Slide 30

Slide 30 text

select
  add_bias(
    array_concat( -- Concatenate features as a feature vector
      quantitative_features( -- Create quantitative features
        array("price"),
        price
      ),
      categorical_features( -- Create categorical features
        array("day of week", "gender", "category"),
        day_of_week, gender, category
      )
    )
  ) as features,
  label
from
  purchase_history

Slide 31

Slide 31 text

select
  feature_hashing(
    add_bias(
      array_concat( -- Concatenate features as a feature vector
        quantitative_features( -- Create quantitative features
          array("price"),
          price
        ),
        categorical_features( -- Create categorical features
          array("day of week", "gender", "category"),
          day_of_week, gender, category
        )
      )
    )
  ) as features,
  label
from
  purchase_history

Slide 32

Slide 32 text

Supervised learning by a unified function:

SELECT
  train_classifier( -- or train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
  ) as (feature, weight)
FROM
  training

Loss functions for classification: ‣ HingeLoss ‣ LogLoss (a.k.a. logistic loss) ‣ SquaredHingeLoss ‣ ModifiedHuberLoss
Loss functions for regression: ‣ SquaredLoss ‣ QuantileLoss ‣ EpsilonInsensitiveLoss ‣ SquaredEpsilonInsensitiveLoss ‣ HuberLoss

Slide 33

Slide 33 text

Supervised learning by a unified function:

SELECT
  train_classifier( -- or train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
  ) as (feature, weight)
FROM
  training

Optimizers: ‣ SGD ‣ AdaGrad ‣ AdaDelta ‣ ADAM
Regularization: ‣ L1 ‣ L2 ‣ ElasticNet ‣ RDA
Also: ‣ iteration with learning rate control ‣ mini-batch training ‣ early stopping

Slide 34

Slide 34 text

Model = table
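
Because the model is just a table of (feature, weight) rows, prediction is an ordinary join and aggregation. A minimal sketch, assuming a hypothetical testing(rowid, features) table and the lr_model table shown earlier; extract_feature(), extract_weight(), and sigmoid() are Hivemall functions:

-- Explode each feature vector, join it with the model table, and
-- apply the sigmoid to the weighted sum of feature values.
WITH exploded AS (
  SELECT
    rowid,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  FROM testing LATERAL VIEW explode(features) t AS fv
)
SELECT
  e.rowid,
  sigmoid(sum(m.weight * e.value)) as probability
FROM exploded e
LEFT OUTER JOIN lr_model m ON e.feature = m.feature
GROUP BY e.rowid;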

Slide 35

Slide 35 text

Classification and regression with a variety of algorithms.

Classification: ‣ Generic classifier ‣ Perceptron ‣ Passive Aggressive (PA, PA1, PA2) ‣ Confidence Weighted (CW) ‣ Adaptive Regularization of Weight Vectors (AROW) ‣ Soft Confidence Weighted (SCW) ‣ (Field-Aware) Factorization Machines ‣ RandomForest

Regression: ‣ Generic regressor ‣ PA Regression ‣ AROW Regression ‣ (Field-Aware) Factorization Machines ‣ RandomForest

Slide 36

Slide 36 text

Factorization Machines. S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.

Slide 37

Slide 37 text

Factorization Machines
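
For reference, the second-order FM model as defined in the Rendle paper cited on the previous slide:

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

Each feature i has a scalar weight w_i and a latent vector \mathbf{v}_i; modeling pairwise interactions through \langle \mathbf{v}_i, \mathbf{v}_j \rangle lets FM generalize to feature pairs that never co-occur in the training data.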

Slide 38

Slide 38 text

RandomForest training:

CREATE TABLE rf_model AS
SELECT
  train_randomforest_classifier(
    features,
    label,
    '-trees 50 -seed 71' -- hyperparameters
  ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
FROM
  training;

Slide 39

Slide 39 text

RandomForest Model table

Slide 40

Slide 40 text

RandomForest prediction:

CREATE TABLE rf_predicted as
SELECT
  rowid,
  rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
FROM (
  SELECT
    t.rowid,
    m.model_weight,
    tree_predict(m.model_id, m.model, t.features, ${classification}) as predicted
  FROM
    testing t
    CROSS JOIN rf_model m
) t1
GROUP BY
  rowid

Slide 41

Slide 41 text

RandomForest: export decision trees for visualization.

SELECT
  tree_export(model, "-type javascript", ...) as js,
  tree_export(model, "-type graphvis", ...) as dot
FROM
  rf_model

Slide 42

Slide 42 text

Versatile

Slide 43

Slide 43 text

Feature engineering: ‣ feature hashing ‣ feature scaling (normalization, z-score) ‣ feature binning ‣ TF-IDF vectorizer ‣ polynomial expansion ‣ amplifier
Evaluation metrics: ‣ AUC ‣ nDCG ‣ log loss ‣ precision ‣ recall ‣ …
Arrays and maps: ‣ concatenation ‣ intersection ‣ remove ‣ sort ‣ average ‣ sum ‣ …
Also: ‣ bit, compress, and character encoding functions ‣ efficient top-k query processing
A feature-scaling sketch follows.
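
As one example of the feature-engineering helpers, min-max feature scaling with Hivemall's rescale() function; a minimal sketch reusing the purchase_history table from the earlier tutorial slides:

SELECT
  rescale(price, stats.min_price, stats.max_price) as scaled_price
FROM purchase_history
CROSS JOIN (
  -- compute the global min/max once, then scale every row into [0, 1]
  SELECT min(price) as min_price, max(price) as max_price
  FROM purchase_history
) stats;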

Slide 44

Slide 44 text

Efficient top-k retrieval: internally holds a bounded priority queue. Example: list the top-2 items per user.

Input (item, user, score):
item  user  score
1     B     70
2     A     80
3     A     90
4     B     60
5     A     70
…     …     …

With a vanilla window function (does not finish in 24 hrs. for 20M users with ~1k items each):

SELECT item, user, score, rank
FROM (
  SELECT item, user, score,
    rank() over (PARTITION BY user ORDER BY score DESC) as rank
  FROM table
) t
WHERE rank <= 2

With each_top_k (finishes in 2 hrs.):

SELECT
  each_top_k(
    2, user, score,
    user, item -- output columns
  ) as (rank, score, user, item)
FROM (
  SELECT * FROM table CLUSTER BY user
) t

Slide 45

Slide 45 text

Recommendation <3 tabular form

Input (ratings):
User | Item         | Rating
Tom  | Laptop       | 3 ★★★☆☆
Jack | Coffee beans | 5 ★★★★★
Mike | Watch        | 1 ★☆☆☆☆
…    | …            | …

Input (purchase log):
User | Bought item
Tom  | Laptop
Jack | Coffee beans
Mike | Watch
…    | …

Output:
User | Top-3 recommended items
Tom  | Headphone, USB charger, 4K monitor
Jack | Mug, Coffee machine, Chocolate
Mike | Ring, T-shirt, Bag
…    | …

Slide 46

Slide 46 text

Recommendation with Hivemall

k-nearest-neighbor: ‣ MinHash and b-Bit MinHash (LSH) ‣ similarities: Euclid, Cosine, Jaccard, Angular
Efficient item-based collaborative filtering: ‣ Sparse Linear Method (SLIM) ‣ approximated all-pair similarities (DIMSUM)
Matrix completion: ‣ Matrix Factorization ‣ Factorization Machines
An example of the matrix-factorization path follows.
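
A minimal matrix-factorization sketch in the style of the Hivemall userguide; ratings(userid, itemid, rating) is a hypothetical input table, and train_mf_sgd emits per-task factors that are averaged into the final model, mirroring the train-then-aggregate pattern used elsewhere in this deck:

CREATE TABLE mf_model AS
SELECT
  idx,
  array_avg(u_rank) as Pu,  -- user latent factors
  array_avg(m_rank) as Qi,  -- item latent factors
  avg(u_bias) as Bu,
  avg(m_bias) as Bi
FROM (
  SELECT
    train_mf_sgd(userid, itemid, rating) as (idx, u_rank, m_rank, u_bias, m_bias)
  FROM ratings
) t
GROUP BY idx;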

Slide 47

Slide 47 text

Natural Language Processing: English, Japanese, and Chinese tokenizers, word N-grams, …

select tokenize('Hello, world!')
→ ["Hello", "world"]

select singularize('apples')
→ apple

Sketching (approximate distinct count):

SELECT count(distinct user_id) FROM t
→ SELECT approx_count_distinct(user_id) FROM t

Geospatial functions:

SELECT
  map_url(lat, lon, zoom) as osm_url,
  map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
FROM (
  SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
  UNION ALL
  SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
) t

Slide 48

Slide 48 text

Anomaly / change-point detection: ‣ Local Outlier Factor (k-NN-based technique) ‣ ChangeFinder ‣ Singular Spectrum Transformation
Clustering / topic modeling: ‣ Latent Dirichlet Allocation ‣ Probabilistic Latent Semantic Analysis

Slide 49

Slide 49 text

Multi-platform

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Apache Hive:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT
    logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
  FROM
    training
) t
GROUP BY feature;

Slide 52

Slide 52 text

Apache Pig:

a = load 'a9a.train'
    as (rowid:int, label:float, features:{(featurepair:chararray)});
b = foreach a generate flatten(
      logress(features, label, '-total_steps ${total_steps}')
    ) as (feature, weight);
c = group b by feature;
d = foreach c generate group, AVG(b.weight);
store d into 'a9a_model';

Slide 53

Slide 53 text

Apache Spark DataFrames:

val trainDf = spark.read.format("libsvm").load("a9a.train")
val modelDf = trainDf
  .train_logregr(append_bias($"features"), $"label")
  .groupBy("feature").avg("weight")
  .toDF("feature", "weight")
  .cache

Slide 54

Slide 54 text

Apache Spark: query in HiveContext

context = HiveContext(sc)
context.sql("
  SELECT feature, avg(weight) as weight
  FROM (
    SELECT train_logregr(features, label) as (feature, weight)
    FROM training
  ) t
  GROUP BY feature
")

Slide 55

Slide 55 text

Apache Spark: online prediction on Spark Streaming

val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
testData.predict { case testDf =>
  // Explode features in input streams
  val testDf_exploded = ...

  val predictDf = testDf_exploded
    .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
    .select($"rowid", ($"weight" * $"value").as("value"))
    .groupBy("rowid").sum("value")
    .select($"rowid", sigmoid($"SUM(value)"))

  predictDf
}

Slide 56

Slide 56 text

Future development plan ‣ word2vec ‣ Field-aware factorization machines stability improvements ‣ XGBoost ‣ Gradient boosting, LightGBM ‣ Hivemall on Kafka KSQL ‣ …

Slide 57

Slide 57 text

XGBoost with Hivemall (experimental)

Training:

CREATE TABLE xgboost_models AS
SELECT
  train_xgboost_classifier(features, label) as (model_id, model)
FROM
  training;

Prediction:

SELECT
  rowid,
  avg(predicted) as predicted
FROM (
  -- predict with each model,
  -- joining each test record with each model
  SELECT
    xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
  FROM
    xgboost_models CROSS JOIN testing
) t
GROUP BY rowid

Slide 58

Slide 58 text

Installation
http://hivemall.incubator.apache.org/userguide/getting_started/installation.html

$ hive
add jar /path/to/hivemall-all-VERSION.jar;
source /path/to/define-all.hive;

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

github.com/apache/incubator-hivemall: Docker image, documentation, and a step-by-step tutorial are available

Slide 61

Slide 61 text

Making machine learning Easy, Scalable, Sharable, and Clean with query language and Apache Hivemall

Slide 62

Slide 62 text

Apache Hivemall: Query-Based, Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall