
Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive

Takuya Kitazawa
September 22, 2018


Transcript

  1. Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall
  2. Q. Solve a regression problem on massive data stored in a data warehouse.
    What it takes: practical experience in science and engineering, theoretical understanding, a tool (Python?), and scalability.
  3. The real-world ML workflow is for experts: data scientists and ML engineers with a solid relevant background.
    Problem → what you want to "predict" → hypothesis & proposal → historical data → cleanse data → build machine learning model → evaluate → deploy to production
  4. The same expert workflow again, from problem through cleansing, modeling, and evaluation to production deployment: do you really need such complexity and flexibility?
  5. Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics can easily try, save, share, and schedule every step of the workflow through a simple interface, in a scalable manner.
  6. How can we organize query fragments? Each step of the workflow maps to a query: extract → filter → interpolate → normalize → ...; train data → get features → train; test data → get features → predict → accuracy.
  7. Tip: combine Hivemall with a workflow engine. For example, Digdag (https://www.digdag.io/) lets you define a highly dependent ML workflow, with each extract/filter/train/predict step expressed as a query, in a YAML file.
  8. Query-based ML is a new option outside of the common ML toolkits. Your data-related work can be simpler: OSS solution <3 workflow engine.
  9. Apache Hive
    ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via a SQL-like interface, HiveQL: create table, select, join, group by, count(), sum(), order by, cluster by, ...
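    A minimal HiveQL sketch of the kind of aggregation Hive runs at scale; the purchase_history table here is the hypothetical one used on later slides:

        -- Count purchases and sum revenue per category (illustrative only)
        SELECT category, count(*) AS purchases, sum(price) AS revenue
        FROM purchase_history
        GROUP BY category
        ORDER BY revenue DESC;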
  10. Apache Hivemall
    ‣ OSS project under the Apache Software Foundation since 2017
    ‣ Scalable ML library implemented as Hive user-defined functions (UDFs)
    Three function types: a UDF maps a column to a new column row by row (e.g., l1_normalize()), a UDAF aggregates a column into a scalar (e.g., rmse()), and a UDTF emits tabular output with multiple rows and columns (e.g., train_regressor()).
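    A sketch of each function type in use; the three Hivemall functions are the ones named above, while the table and column names are assumptions:

        -- UDF: row-wise transformation of a feature vector
        SELECT l1_normalize(features) FROM train;

        -- UDAF: aggregate predicted vs. actual values into a single scalar
        SELECT rmse(predicted, actual) FROM predictions;

        -- UDTF: one call emits a table of (feature, weight) rows
        SELECT train_regressor(features, label, '-loss squaredloss') as (feature, weight)
        FROM train;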
  11. Apache Hivemall
    ‣ Easy to use: ML in SQL
    ‣ Scalable: runs in parallel on the Hadoop ecosystem
    ‣ Multi-platform: Hive, Spark, Pig
    ‣ Versatile: efficient, generic functions
  12. Space-efficient feature representation in Hivemall: each feature is TEXT of the form index:value (or index alone), where the index is INT, BIGINT, or TEXT and the value is FLOAT.
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ The index can be text: price:600, size:2.5
    ‣ An index-only feature means value = 1.0 (e.g., categorical): gender#male = gender#male:1.0
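    Since a feature vector is ultimately just an array of such strings (see the next slide), one can also be written by hand; a trivial sketch:

        SELECT array("price:600.0", "size:2.5", "gender#male") AS features;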
  13. Feature vector = array of strings. NULL is automatically omitted, and Hivemall internally does one-hot encoding (e.g., book → 1, 0, 0, ...).
    Array of quantitative features:
        select quantitative_features(array("price", "size"), 600, 2.5)
        → ["price:600.0", "size:2.5"]
    Array of categorical features:
        select categorical_features(array("gender", "category"), "male", "book")
        → ["gender#male", "category#book"]
  14. Feature hashing: approximation improves scalability by simplifying the names of quantitative and categorical features.
        select feature_hashing(array("price:600", "category#book"))
        → ["14142887:600", "10413006"]
    (Default upper limit: 2^24 + 1 = 16777217)
  15. select
          array_concat(                -- Concatenate features as a feature vector
            quantitative_features(     -- Create quantitative features
              array("price"),
              price
            ),
            categorical_features(      -- Create categorical features
              array("day of week", "gender", "category"),
              day_of_week, gender, category
            )
          ) as features,
          label
        from purchase_history
  16. select
          add_bias(                      -- Append a constant bias term
            array_concat(                -- Concatenate features as a feature vector
              quantitative_features(     -- Create quantitative features
                array("price"),
                price
              ),
              categorical_features(      -- Create categorical features
                array("day of week", "gender", "category"),
                day_of_week, gender, category
              )
            )
          ) as features,
          label
        from purchase_history
  17. select
          feature_hashing(                 -- Hash feature names into integer indices
            add_bias(
              array_concat(                -- Concatenate features as a feature vector
                quantitative_features(     -- Create quantitative features
                  array("price"),
                  price
                ),
                categorical_features(      -- Create categorical features
                  array("day of week", "gender", "category"),
                  day_of_week, gender, category
                )
              )
            )
          ) as features,
          label
        from purchase_history
  18. Supervised learning via a unified function:
        SELECT
          train_classifier(  -- or train_regressor(
            features,
            label,
            '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
          ) as (feature, weight)
        FROM training
    Classification losses: HingeLoss, LogLoss (a.k.a. logistic loss), SquaredHingeLoss, ModifiedHuberLoss
    Regression losses: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss
  19. The same unified function covers optimization and regularization:
        SELECT
          train_classifier(  -- or train_regressor(
            features,
            label,
            '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
          ) as (feature, weight)
        FROM training
    Optimizers: SGD, AdaGrad, AdaDelta, ADAM
    Regularization: L1, L2, ElasticNet, RDA
    Plus: iteration with learning-rate control, mini-batch training, early stopping
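    The deck shows only the training side; below is a minimal sketch of the matching prediction query, assuming the (feature, weight) rows above were stored in a table named model and the test set lives in testing(rowid, features) (both names are assumptions):

        WITH test_exploded AS (
          -- Explode each test feature vector into (rowid, feature, value) rows
          SELECT
            t.rowid,
            extract_feature(fv) AS feature,
            extract_weight(fv) AS value
          FROM testing t
          LATERAL VIEW explode(features) t2 AS fv
        )
        SELECT
          t.rowid,
          sigmoid(sum(m.weight * t.value)) AS probability  -- logistic output for -loss logloss
        FROM test_exploded t
        LEFT OUTER JOIN model m ON t.feature = m.feature
        GROUP BY t.rowid;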
  20. Classification and regression with a variety of algorithms.
    Classification: generic classifier, Perceptron, Passive Aggressive (PA, PA1, PA2), Confidence Weighted (CW), Adaptive Regularization of Weight Vectors (AROW), Soft Confidence Weighted (SCW), (field-aware) Factorization Machines, RandomForest
    Regression: generic regressor, PA regression, AROW regression, (field-aware) Factorization Machines, RandomForest
  21. Factorization Machines. S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
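    For reference, the second-order FM model from the cited paper combines a global bias, linear weights, and pairwise interactions factorized through k-dimensional latent vectors:

        \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j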
  22. RandomForest training:
        CREATE TABLE rf_model AS
        SELECT
          train_randomforest_classifier(
            features,
            label,
            '-trees 50 -seed 71'  -- hyperparameters
          ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
        FROM training;
  23. RandomForest prediction:
        CREATE TABLE rf_predicted AS
        SELECT
          rowid,
          rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
        FROM (
          SELECT
            t.rowid,
            m.model_weight,
            tree_predict(m.model_id, m.model, t.features, ${classification}) as predicted
          FROM testing t
          CROSS JOIN rf_model m
        ) t1
        GROUP BY rowid;
  24. RandomForest: export decision trees for visualization:
        SELECT
          tree_export(model, "-type javascript", ...) as js,
          tree_export(model, "-type graphvis", ...) as dot
        FROM rf_model
  25. ‣ Feature engineering: feature hashing, feature scaling (normalization, z-score; see the sketch after this list), feature binning, TF-IDF vectorizer, polynomial expansion, amplifier
    ‣ Evaluation metrics: AUC, nDCG, log loss, precision, recall, ...
    ‣ Array and map operations: concatenation, intersection, remove, sort, average, sum, ...
    ‣ Bit, compression, and character-encoding utilities
    ‣ Efficient top-k query processing
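    A hedged sketch of feature scaling with Hivemall's rescale() (min-max normalization) and zscore() UDFs; the raw_data table and its price column are assumptions:

        -- Compute the statistics once, then scale each row
        SELECT
          rescale(price, stats.min_price, stats.max_price) AS price_minmax,  -- maps into [0, 1]
          zscore(price, stats.mean_price, stats.sd_price)  AS price_zscore   -- (x - mean) / stddev
        FROM raw_data
        CROSS JOIN (
          SELECT
            min(price) AS min_price, max(price) AS max_price,
            avg(price) AS mean_price, stddev_pop(price) AS sd_price
          FROM raw_data
        ) stats;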
  26. Efficient top-k retrieval: each_top_k internally holds a bounded priority queue. Example: list the top-2 items per user from a table like:
        item | user | score
        1    | B    | 70
        2    | A    | 80
        3    | A    | 90
        4    | B    | 60
        5    | A    | 70
        ...  | ...  | ...
    Window-function version (did not finish within 24 hours for 20M users with ~1k items each):
        SELECT item, user, score, rank
        FROM (
          SELECT item, user, score,
                 rank() over (PARTITION BY user ORDER BY score DESC) as rank
          FROM table
        ) t
        WHERE rank <= 2
    each_top_k version (finished in 2 hours):
        SELECT
          each_top_k(
            2, user, score,
            user, item  -- output columns
          ) as (rank, score, user, item)
        FROM (
          SELECT * FROM table CLUSTER BY user
        ) t
  27. Recommendation <3 tabular form.
    Input (purchase history):
        User | Bought item
        Tom  | Laptop
        Jack | Coffee beans
        Mike | Watch
        ...  | ...
    Input (ratings):
        User | Item         | Rating
        Tom  | Laptop       | 3 (★★★☆☆)
        Jack | Coffee beans | 5 (★★★★★)
        Mike | Watch        | 1 (★☆☆☆☆)
        ...  | ...          | ...
    Output:
        User | Top-3 recommended items
        Tom  | Headphone, USB charger, 4K monitor
        Jack | Mug, Coffee machine, Chocolate
        Mike | Ring, T-shirt, Bag
        ...  | ...
  28. Recommendation with Hivemall:
    ‣ k-nearest neighbor: MinHash and b-Bit MinHash (LSH); similarity measures: Euclid, cosine, Jaccard, angular
    ‣ Efficient item-based collaborative filtering: Sparse Linear Method (SLIM), approximated all-pair similarities (DIMSUM)
    ‣ Matrix completion: Matrix Factorization, Factorization Machines
  29. Natural language processing: English, Japanese, and Chinese tokenizers, word N-grams, ...
        select tokenize('Hello, world!')   →  ["Hello", "world"]
        select singularize('apples')       →  apple
    Sketching (approximate distinct counts):
        SELECT count(distinct user_id) FROM t
        →  SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions:
        SELECT
          map_url(lat, lon, zoom) as osm_url,
          map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
        FROM (
          SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
          UNION ALL
          SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
        ) t
  30. Anomaly / change-point detection:
    ‣ Local Outlier Factor (a k-NN-based technique)
    ‣ ChangeFinder
    ‣ Singular Spectrum Transformation
    Clustering / topic modeling:
    ‣ Latent Dirichlet Allocation
    ‣ Probabilistic Latent Semantic Analysis
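    A hedged sketch of ChangeFinder on a univariate series, assuming a hypothetical timeseries(ts, value) table; the exact shape of the returned struct is an assumption based on the Hivemall documentation:

        SELECT
          ts,
          -- Returns outlier and change-point scores (plus flags when thresholds are given)
          changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035') AS result
        FROM timeseries
        ORDER BY ts ASC;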
  31. Apache Hive:
        CREATE TABLE lr_model AS
        SELECT
          feature,
          avg(weight) as weight
        FROM (
          SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
          FROM training
        ) t
        GROUP BY feature;
  32. Apache Pig:
        a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});
        b = foreach a generate flatten(
              logress(features, label, '-total_steps ${total_steps}')
            ) as (feature, weight);
        c = group b by feature;
        d = foreach c generate group, AVG(b.weight);
        store d into 'a9a_model';
  33. Apache Spark DataFrames:
        val trainDf = spark.read.format("libsvm").load("a9a.train")
        val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
          .groupBy("feature").avg("weight")
          .toDF("feature", "weight")
          .cache
  34. Apache Spark: query in HiveContext:
        context = HiveContext(sc)
        context.sql("""
          SELECT feature, avg(weight) as weight
          FROM (
            SELECT train_logregr(features, label) as (feature, weight)
            FROM training
          ) t
          GROUP BY feature
        """)
  35. Apache Spark: online prediction on Spark Streaming:
        val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
        testData.predict { case testDf =>
          // Explode features in input streams
          val testDf_exploded = ...
          val predictDf = testDf_exploded
            .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
            .select($"rowid", ($"weight" * $"value").as("value"))
            .groupBy("rowid").sum("value")
            .select($"rowid", sigmoid($"SUM(value)"))
          predictDf
        }
  36. Future development plan:
    ‣ word2vec
    ‣ Field-aware Factorization Machines stability improvements
    ‣ XGBoost
    ‣ Gradient boosting, LightGBM
    ‣ Hivemall on Kafka KSQL
    ‣ ...
  37. XGBoost with Hivemall (experimental):
        SELECT train_xgboost_classifier(features, label) as (model_id, model)
        FROM training;

        SELECT
          rowid,
          avg(predicted) as predicted
        FROM (
          -- predict with each model;
          -- join each test record with each model
          SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
          FROM xgboost_models CROSS JOIN testing
        ) t
        GROUP BY rowid;
  38. Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall