
What's New and Coming to Apache Hivemall 
#ACNA19


Takuya Kitazawa

September 12, 2019


Transcript

  1. What's New and Coming to Apache Hivemall 
 Building More

    Flexible Machine Learning Solution for
 Apache Hive and Spark
 
 Takuya Kitazawa @takuti 
 Makoto Yui @myui Apache Hivemall PPMCs
  2. Practical experience in science and engineering, ML theory, tool /

    data model, scalability. Q. Solve a regression problem on massive data stored in a data warehouse
  3. Machine Learning for everyone
 Open source query-based machine learning solution

    - Incubating since Sept 13, 2016 - @ApacheHivemall - GitHub: apache/incubator-hivemall - Team: 6 PPMCs + 3 committers - Latest release: v0.5.2 (Dec 3, 2018) - Toward graduation: ✓ Community growth ✓ 1+ Apache releases ✓ Documentation improvements
  4. ‣ Data warehousing solution built on top of Apache Hadoop

    ‣ Efficiently access and analyze large-scale data via SQL-like interface, HiveQL - create table - select - join - group by - count() - sum() - … - order by - cluster by - … Apache Hive
  5. ‣ OSS project under Apache Software Foundation ‣ Scalable ML

    library implemented as Hive user-defined functions (UDFs) Apache Hivemall [Figure: a UDF maps each value of a column to a new value (e.g., l1_normalize()), a UDAF aggregates a column into a scalar (e.g., rmse()), and a UDTF emits tabular output (e.g., train_regressor())]
  6. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions Apache Hivemall
  7. Problem What you want to “predict” Hypothesis & Proposal Build

    machine learning model Historical data Cleanse data Evaluate. Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics can easily try, save, share, and schedule models via a simple interface in a scalable manner, and deploy to production
  8. Recommendation <3 tabular form User Item Rating Tom Laptop 3

    ★★★☆☆ Jack Coffee beans 5 ★★★★★ Mike Watch 1 ★☆☆☆☆ … … … User Top-3 recommended items Tom Headphone, USB charger, 4K monitor Jack Mug, Coffee machine, Chocolate Mike Ring, T-shirt, Bag … … Input Output User Bought item Tom Laptop Jack Coffee beans Mike Watch … …
  9. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  10. Preprocessing select array_concat( -- Concatenate features as a feature vector

    quantitative_features( -- Create quantitative features array("price"), price ), categorical_features( -- Create categorical features array("day of week", "gender", "category"), day_of_week, gender, category ) ) as features, label from purchase_history
  11. SELECT train_classifier( -- train_regressor( features, label, '-loss logloss -opt SGD

    -reg no -eta simple -total_steps ${total_steps}' ) as (feature, weight) FROM training Classification ‣ HingeLoss ‣ LogLoss (a.k.a. logistic loss) ‣ SquaredHingeLoss ‣ ModifiedHuberLoss Regression ‣ SquaredLoss ‣ QuantileLoss ‣ EpsilonInsensitiveLoss ‣ SquaredEpsilonInsensitiveLoss ‣ HuberLoss Supervised learning by unified function
  12. SELECT train_classifier( -- train_regressor( features, label, '-loss logloss -opt SGD

    -reg no -eta simple -total_steps ${total_steps}' ) as (feature, weight) FROM training Optimizer ‣ SGD ‣ AdaGrad ‣ AdaDelta ‣ ADAM Regularization ‣ L1 ‣ L2 ‣ ElasticNet ‣ RDA ‣ Iteration with learning rate control ‣ Mini-batch training ‣ Early stopping Supervised learning by unified function
  13. Classification ‣ Generic classifier ‣ Perceptron ‣ Passive Aggressive (PA,

    PA1, PA2) ‣ Confidence Weighted (CW) ‣ Adaptive Regularization of Weight Vectors (AROW) ‣ Soft Confidence Weighted (SCW) ‣ (Field-Aware) Factorization Machines ‣ RandomForest Regression ‣ Generic regressor ‣ PA Regression ‣ AROW Regression ‣ (Field-Aware) Factorization Machines ‣ RandomForest Classification and regression with a variety of algorithms
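Each algorithm above is exposed as its own UDTF that emits the same (feature, weight) rows as train_classifier, so models are built with the same averaging pattern. As a hedged sketch (table and column names are placeholders), AROW training could look like:

```sql
-- Train an AROW classifier; like train_classifier, it emits
-- (feature, weight) rows that are averaged into the final model.
CREATE TABLE arow_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT train_arow(features, label) as (feature, weight)
  FROM training
) t
GROUP BY feature;
```

Swapping in train_perceptron, train_pa1, or train_scw follows the same shape.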
  14. Factorization Machines S. Rendle. Factorization Machines with libFM. ACM Transactions

    on Intelligent Systems and Technology, 3(3), May 2012.
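A minimal sketch of factorization machine training in Hivemall, assuming the usual train_fm UDTF; the hyperparameter values are illustrative, and the output column names may vary by version:

```sql
-- Train a factorization machine classifier; emits the linear weight
-- and latent factors per feature (hyperparameters are illustrative).
SELECT
  train_fm(features, label, '-classification -factors 10 -iters 50')
    as (feature, Wi, Vif)
FROM
  training
```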
  15. RandomForest Training CREATE TABLE rf_model AS SELECT train_randomforest_classifier( features, label,

    '-trees 50 -seed 71' -- hyperparameters ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests) FROM training;
  16. RandomForest Export decision trees for visualization SELECT tree_export(model, "-type javascript",

    ...) as js, tree_export(model, "-type graphvis", ...) as dot FROM rf_model
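Prediction with the trained forest joins each test row against every tree and aggregates the votes. This is a hedged sketch only: the exact tree_predict and rf_ensemble signatures vary across Hivemall versions, and the table names come from the slides above.

```sql
-- Apply every tree in rf_model to every test row, then take the
-- ensemble vote per row (signatures vary by Hivemall version).
SELECT
  p.rowid,
  rf_ensemble(predicted) as (label, probability)
FROM (
  SELECT
    p.rowid,
    tree_predict(m.model_id, m.model, p.features, true) as predicted
  FROM
    rf_model m
    CROSS JOIN testing p
) p
GROUP BY
  p.rowid;
```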
  17. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  18. - Feature hashing - Feature scaling (normalization, z-score) - Feature

    binning - TF-IDF vectorizer - Polynomial expansion - Amplifier - AUC, nDCG, log loss, precision, recall, … - Concatenation - Intersection - Remove - Sort - Average - Sum - … - Feature engineering Evaluation metrics Array and maps Bit, compress, character encoding Efficient top-k query processing
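Two hedged one-liners for the feature-engineering functions listed above (table and column names are placeholders): feature_hashing() maps feature names into a bounded index space, and rescale() performs min-max scaling.

```sql
-- Hash "name:value" features into a fixed index space.
SELECT feature_hashing(features) as hashed_features FROM training;

-- Min-max scale a raw value into [0, 1]; min_price and max_price
-- are assumed to be precomputed columns.
SELECT rescale(price, min_price, max_price) as scaled_price
FROM products_with_stats;
```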
  19. Efficient top-k retrieval Internally hold bounded priority queue List top-2

    items per user: item user score 1 B 70 2 A 80 3 A 90 4 B 60 5 A 70 … … … SELECT item, user, score, rank FROM ( SELECT item, user, score, rank() over (PARTITION BY user ORDER BY score DESC) as rank FROM table ) t WHERE rank <= 2 SELECT each_top_k( 2, user, score, user, item -- output columns ) as (rank, score, user, item) FROM ( SELECT * FROM table CLUSTER BY user ) t Does not finish in 24 hrs for 20M users and ~1k items each. Finishes in 2 hrs.
  20. Recommendation with Hivemall k-nearest-neighbor ‣ MinHash and b-Bit MinHash (LSH)

    ‣ Similarities - Euclid - Cosine - Jaccard - Angular Efficient item-based collaborative filtering ‣ Sparse Linear Method (SLIM) ‣ Approximated all-pair similarities (DIMSUM) Matrix completion ‣ Matrix Factorization ‣ Factorization Machines
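The similarity functions above take two sparse feature vectors; as a hedged sketch of item-item similarity for collaborative filtering (item_features is a placeholder table of "index:value" arrays):

```sql
-- All-pairs item-item cosine similarity over sparse feature vectors.
-- A CROSS JOIN is quadratic; MinHash or DIMSUM (above) prune candidates
-- for large catalogs.
SELECT
  t1.itemid,
  t2.itemid as other,
  cosine_similarity(t1.features, t2.features) as similarity
FROM
  item_features t1
  CROSS JOIN item_features t2
WHERE
  t1.itemid != t2.itemid
```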
  21. Natural Language Processing — English, Japanese and Chinese tokenizer, word

    N-grams, … ‣ select tokenize('Hello, world!') returns ["Hello", "world"] ‣ select singularize('apples') returns apple Sketching ‣ SELECT count(distinct user_id) FROM t approximated by SELECT approx_count_distinct(user_id) FROM t Geospatial functions ‣ SELECT map_url(lat, lon, zoom) as osm_url, map_url(lat, lon, zoom, '-type googlemaps') as gmap_url FROM ( SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom UNION ALL SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom ) t
  22. Anomaly / Change-point detection ‣ Local outlier factor (k-NN-based technique)

    ‣ ChangeFinder ‣ Singular Spectrum Transformation Clustering / Topic modeling ‣ Latent Dirichlet Allocation ‣ Probabilistic Latent Semantic Analysis
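As a hedged sketch, ChangeFinder runs as a single UDTF pass over a univariate series, scoring each row for outlier-ness and change points (the table, column, and threshold values are placeholders, not tuned settings):

```sql
-- Score each observation; rows whose scores exceed the given
-- thresholds are flagged as outliers / change points.
SELECT
  changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035')
    as (outlier_score, changepoint_score, is_outlier, is_changepoint)
FROM
  timeseries
```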
  23. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  24. CREATE TABLE lr_model AS SELECT feature, avg(weight) as weight FROM

    ( SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight) FROM training ) t GROUP BY feature; Apache Hive
  25. Apache Pig a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});

    b = foreach a generate flatten( logress(features, label, '-total_steps ${total_steps}') ) as (feature, weight); c = group b by feature; d = foreach c generate group, AVG(b.weight); store d into 'a9a_model';
  26. Apache Spark DataFrames val trainDf = spark.read.format("libsvm").load("a9a.train") val modelDf =

    trainDf.train_logregr(append_bias($"features"), $"label") .groupBy("feature").avg("weight") .toDF("feature", "weight") .cache
  27. context = HiveContext(sc) context.sql(" SELECT feature, avg(weight) as weight FROM

    ( SELECT train_logregr(features, label) as (feature, weight) FROM training ) t GROUP BY feature ") Apache Spark Query in HiveContext
  28. Apache Spark Online prediction on Spark Streaming val testData =

    ssc.textFileStream(...).map(LabeledPoint.parse) testData.predict { case testDf => // Explode features in input streams val testDf_exploded = ... val predictDf = testDf_exploded .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER") .select($"rowid", ($"weight" * $"value").as("value")) .groupBy("rowid").sum("value") .select($"rowid", sigmoid($"SUM(value)")) predictDf }
  29. Hivemall meets PySpark from pyspark.sql import SparkSession spark = SparkSession

    \ .builder \ .master('local[*]') \ .config('spark.jars', 'hivemall-spark2.3-0.5.2-incubating-with-dependencies.jar') \ .enableHiveSupport() \ .getOrCreate()
  30. WITH high_rated_items as ( SELECT bloom(itemid) as items FROM (

    SELECT itemid FROM ratings GROUP BY itemid HAVING avg(rating) >= 4.0 ) t ) SELECT l.rating, count(distinct l.userid) as cnt FROM ratings l CROSS JOIN high_rated_items r WHERE bloom_contains(r.items, l.itemid) GROUP BY l.rating; Bloom Filters: Probabilistic data structures Build Bloom Filter (i.e., probabilistic set of) high-rated items Check if item is in Bloom Filter, and see their actual ratings:
  31. Working with JSON SELECT from_json( '{ "location" : { "Country"

    : "Japan" , "City" : "Osaka" } }', -- json 'map<string,string>', -- return type array('location') -- key ) to_json()
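to_json() goes the other way, serializing a Hive map or struct to a JSON string; a minimal sketch:

```sql
-- Serialize a Hive map to a JSON string,
-- e.g. something like {"Country":"Japan","City":"Osaka"}
SELECT to_json(map('Country', 'Japan', 'City', 'Osaka'))
```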
  32. Array / Vector ‣ Append ‣ Element-at ‣ Union ‣

    First/last element ‣ Flatten ‣ Vector add/dot Map ‣ Convert into array of key-value pairs ‣ Filter elements by keys Sanity check ‣ Assert ‣ Raise error Misc ‣ Try-cast ‣ Sessionize records by time ‣ Moving average More utility functions
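A few of the array utilities above as hedged one-liners:

```sql
-- Append an element to an array.
SELECT array_append(array(1, 2), 3);
-- Flatten a nested array into a flat one.
SELECT array_flatten(array(array(1, 2), array(3)));
-- Pick the first and last elements.
SELECT first_element(array('a', 'b')), last_element(array('a', 'b'));
```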
  33. Field-Aware Factorization Machines Y. Juan, Y. Zhuang, W. Chin, and

    C. Lin. Field-Aware Factorization Machines for CTR Prediction. RecSys 2016.
  34. SELECT train_ffm( features, label, '-init_v random -max_init_value 0.5 -classification -iterations

    15
 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002' ) FROM ( SELECT features, label FROM train_vectorized CLUSTER BY rand(1) ) t
  35. HIVEMALL-118: word2vec SELECT train_word2vec( r.negative_table, l.words, "-n {n} -win 5

    -neg 15 -iters 5 -model cbow" ) FROM train_docs l CROSS JOIN negative_table r
  36. XGBoost with Hivemall (experimental) SELECT train_xgboost_classifier(features, label) as (model_id, model)

    FROM training SELECT rowid, avg(predicted) as predicted FROM ( -- predict with each model SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted) -- join each test record with each model FROM xgboost_models CROSS JOIN testing ) t GROUP BY rowid
  37. WE NEED YOUR CONTRIBUTION From documentation and utility UDFs, to

    state-of-the-art ML algorithms and toolkits github.com/apache/incubator-hivemall
  38. What's New and Coming to Apache Hivemall 
 Building More

    Flexible Machine Learning Solution for
 Apache Hive and Spark
 
 Takuya Kitazawa @takuti 
 Makoto Yui @myui Apache Hivemall PPMCs