What's New and Coming to Apache Hivemall 
#ACNA19

Takuya Kitazawa

September 12, 2019

Transcript

  1. What's New and Coming to Apache Hivemall: Building More Flexible Machine Learning Solution for Apache Hive and Spark
    Takuya Kitazawa @takuti, Makoto Yui @myui (Apache Hivemall PPMCs)
  2. Machine Learning in Query Language

  3. Q. Solve regression problem on massive data stored in data warehouse
  4. Practical experience in science and engineering; ML theory; Tool / Data model; Scalability
    Q. Solve regression problem on massive data stored in data warehouse
  5. Done by ~10 lines of queries

  6. Machine Learning for everyone
    Open source query-based machine learning solution
    - Incubating since Sept 13, 2016
    - @ApacheHivemall
    - GitHub: apache/incubator-hivemall
    - Team: 6 PPMCs + 3 committers
    - Latest release: v0.5.2 (Dec 3, 2018)
    - Toward graduation: ✓ Community growth ✓ 1+ Apache releases ✓ Documentation improvements
  7. Introduction to Apache Hivemall / What’s New in v0.5.2 / What’s Coming Next?
  8. Introduction to Apache Hivemall / What’s New in v0.5.2 / What’s Coming Next?
  9. Apache Hive
    ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via SQL-like interface, HiveQL
      - create table - select - join - group by - count() - sum() - … - order by - cluster by - …
  10. Apache Hivemall
    ‣ OSS project under Apache Software Foundation
    ‣ Scalable ML library implemented as Hive user-defined functions (UDFs)
    ‣ UDF (column in, column out), e.g. l1_normalize()
    ‣ UDAF (aggregation: column in, scalar out), e.g. rmse()
    ‣ UDTF (tabular: column in, multiple columns out), e.g. train_regressor()
  11. Apache Hivemall
    Easy-to-use: ML in SQL
    Scalable: Runs in parallel on Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: Efficient, generic functions
  12. Use case #1: Enterprise Big Data analytics platform

  13. Problem: What you want to “predict”
    Hypothesis & Proposal; Build machine learning model; Historical data; Cleanse data; Evaluate; Deploy to production
    Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics
    Easily try, save, share, schedule via simple I/F in scalable manner
  14. Use case #2: Large-scale recommender systems (Demo paper @ ACM RecSys 2018)
  15. Recommendation <3 tabular form
    Input:
      User | Item | Rating
      Tom | Laptop | 3 ★★★☆☆
      Jack | Coffee beans | 5 ★★★★★
      Mike | Watch | 1 ★☆☆☆☆
      … | … | …

      User | Bought item
      Tom | Laptop
      Jack | Coffee beans
      Mike | Watch
      … | …
    Output:
      User | Top-3 recommended items
      Tom | Headphone, USB charger, 4K monitor
      Jack | Mug, Coffee machine, Chocolate
      Mike | Ring, T-shirt, Bag
      … | …
  16. Use case #3: E-learning: “New in Big Data” Machine Learning with SQL @ Udemy
  17. None
  18. Easy-to-use: ML in SQL
    Scalable: Runs in parallel on Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: Efficient, generic functions
  19. Example: Scalable Logistic Regression written in ~10 lines of queries
    Automatically runs in parallel on Hadoop
  20. Example: Table “purchase_history” http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html

  21. Preprocessing
      select
        array_concat( -- Concatenate features as a feature vector
          quantitative_features( -- Create quantitative features
            array("price"),
            price
          ),
          categorical_features( -- Create categorical features
            array("day of week", "gender", "category"),
            day_of_week, gender, category
          )
        ) as features,
        label
      from purchase_history
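To make the output of the preprocessing query concrete, here is a small Python sketch of the feature-vector strings it builds. It assumes the `name:value` form for quantitative features and the `name#value` form for categorical ones, as the Hivemall user guide describes them. The two helpers are hypothetical Python stand-ins named after the UDFs, not Hivemall code, and the sample row is invented for illustration.

```python
def quantitative_features(names, *values):
    # Assumed output format: "name:value", with the value rendered as a float.
    # Missing (None) values are skipped in this sketch.
    return [f"{n}:{float(v)}" for n, v in zip(names, values) if v is not None]


def categorical_features(names, *values):
    # Assumed output format: "name#value".
    return [f"{n}#{v}" for n, v in zip(names, values) if v is not None]


# Invented sample row standing in for one record of purchase_history.
row = {"price": 600, "day_of_week": "Saturday", "gender": "male",
       "category": "book", "label": 1}

# array_concat(...) in the query corresponds to list concatenation here.
features = quantitative_features(["price"], row["price"]) + categorical_features(
    ["day of week", "gender", "category"],
    row["day_of_week"], row["gender"], row["category"],
)
print(features, row["label"])
```

Each training record then becomes one array of strings plus a label, which is the shape the training functions in the following slides consume.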
  22. Resulting table “training”

  23. Supervised learning by unified function
      SELECT
        train_classifier( -- train_regressor(
          features,
          label,
          '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
        ) as (feature, weight)
      FROM training
    Classification ‣ HingeLoss ‣ LogLoss (a.k.a. logistic loss) ‣ SquaredHingeLoss ‣ ModifiedHuberLoss
    Regression ‣ SquaredLoss ‣ QuantileLoss ‣ EpsilonInsensitiveLoss ‣ SquaredEpsilonInsensitiveLoss ‣ HuberLoss
  24. Supervised learning by unified function
      SELECT
        train_classifier( -- train_regressor(
          features,
          label,
          '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
        ) as (feature, weight)
      FROM training
    Optimizer ‣ SGD ‣ AdaGrad ‣ AdaDelta ‣ ADAM
    Regularization ‣ L1 ‣ L2 ‣ ElasticNet ‣ RDA
    ‣ Iteration with learning rate control ‣ Mini-batch training ‣ Early stopping
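For readers unfamiliar with what `-loss logloss -opt SGD -reg no` amounts to, the following is a minimal single-process Python sketch of logistic regression trained by plain SGD over sparse string features, returning (feature, weight) pairs like the query above. It is not Hivemall's implementation: the fixed learning rate `eta`, the epoch count, the 0/1 labels, and the toy rows are all assumptions made for illustration.

```python
import math


def parse(feature):
    # "name:value" -> (name, value); a bare name such as "gender#male" gets value 1.0.
    name, sep, value = feature.rpartition(":")
    return (name, float(value)) if sep else (feature, 1.0)


def train_logloss_sgd(rows, eta=0.1, epochs=20):
    """Plain SGD on the logistic loss, no regularization (assumed hyperparameters)."""
    weights = {}
    for _ in range(epochs):
        for features, label in rows:
            x = dict(parse(f) for f in features)
            z = sum(weights.get(k, 0.0) * v for k, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            for k, v in x.items():
                # Gradient of the log loss w.r.t. weight k is (p - label) * v.
                weights[k] = weights.get(k, 0.0) - eta * (p - label) * v
    return weights


# Invented toy data: "book" purchases are positive, "food" purchases negative.
training = [
    (["gender#male", "category#book"], 1),
    (["gender#female", "category#food"], 0),
    (["gender#female", "category#book"], 1),
    (["gender#male", "category#food"], 0),
]
model = train_logloss_sgd(training)
for feature, weight in sorted(model.items()):
    print(feature, round(weight, 3))
```

The resulting dictionary plays the role of the (feature, weight) model table: one row per feature, with the sign of the weight indicating the direction of its effect.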
  25. Model = table

  26. Classification and regression with variety of algorithms
    Classification ‣ Generic classifier ‣ Perceptron ‣ Passive Aggressive (PA, PA1, PA2) ‣ Confidence Weighted (CW) ‣ Adaptive Regularization of Weight Vectors (AROW) ‣ Soft Confidence Weighted (SCW) ‣ (Field-Aware) Factorization Machines ‣ RandomForest
    Regression ‣ Generic regressor ‣ PA Regression ‣ AROW Regression ‣ (Field-Aware) Factorization Machines ‣ RandomForest
  27. Factorization Machines
    S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
  28. Factorization Machines
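As a reminder of what a second-order factorization machine computes, the sketch below evaluates the standard model from Rendle's paper, y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, in two ways: the direct pairwise sum and the well-known linear-time reformulation. The parameter values are arbitrary numbers chosen for illustration, and nothing here reflects how Hivemall stores or trains its FM models.

```python
def fm_predict_naive(w0, w, V, x):
    # Direct evaluation of the pairwise term: O(k * n^2).
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(V[i][f] * V[j][f] for f in range(len(V[i]))) * x[i] * x[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return w0 + linear + pairwise


def fm_predict_fast(w0, w, V, x):
    # Same value in O(k * n), using
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ].
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Arbitrary illustrative parameters: 3 features, 2 latent factors.
w0 = 0.1
w = [0.2, -0.3, 0.5]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
x = [1.0, 0.0, 2.0]

print(fm_predict_naive(w0, w, V, x), fm_predict_fast(w0, w, V, x))
```

The linear-time form is what makes factorization machines practical on sparse, high-dimensional feature vectors.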

  29. RandomForest: Training
      CREATE TABLE rf_model AS
      SELECT
        train_randomforest_classifier(
          features,
          label,
          '-trees 50 -seed 71' -- hyperparameters
        ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
      FROM training;
  30. RandomForest Model table

  31. RandomForest: Export decision trees for visualization
      SELECT
        tree_export(model, "-type javascript", ...) as js,
        tree_export(model, "-type graphvis", ...) as dot
      FROM rf_model
  32. Easy-to-use: ML in SQL
    Scalable: Runs in parallel on Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: Efficient, generic functions
  33. Feature engineering
      - Feature hashing - Feature scaling (normalization, z-score) - Feature binning - TF-IDF vectorizer - Polynomial expansion - Amplifier
    Evaluation metrics
      - AUC, nDCG, log loss, precision, recall, …
    Array and maps
      - Concatenation - Intersection - Remove - Sort - Average - Sum - …
    Bit, compress, character encoding
    Efficient top-k query processing
  34. Efficient top-k retrieval
    Internally hold bounded priority queue
    List top-3 items per user:
      item | user | score
      1 | B | 70
      2 | A | 80
      3 | A | 90
      4 | B | 60
      5 | A | 70
      … | … | …
    Does not finish within 24 hrs. for 20M users and ~1k items in each:
      SELECT item, user, score, rank
      FROM (
        SELECT item, user, score,
               rank() over (PARTITION BY user ORDER BY score DESC) as rank
        FROM table
      ) t
      WHERE rank <= 2
    Finishes in 2 hrs.:
      SELECT
        each_top_k(
          2, user, score,
          user, item -- output columns
        ) as (rank, score, user, item)
      FROM (
        SELECT * FROM table CLUSTER BY user
      ) t
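The bounded priority queue idea can be sketched in a few lines of Python: keep a min-heap of at most k entries per user, so each incoming row costs O(log k) and no group is ever fully sorted. This is an illustration of the technique only, not Hivemall's code. Unlike `each_top_k`, which relies on the input being clustered by the grouping key, this sketch simply keeps one heap per user in a dictionary. The rows are the five sample rows from the slide.

```python
import heapq
from collections import defaultdict


def each_top_k(k, rows):
    """rows: iterable of (user, item, score).

    Returns {user: [(rank, score, item), ...]} with the k highest scores per user.
    """
    heaps = defaultdict(list)  # user -> min-heap holding at most k (score, item) pairs
    for user, item, score in rows:
        heap = heaps[user]
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif (score, item) > heap[0]:
            # New row beats the smallest kept entry: replace it in O(log k).
            heapq.heapreplace(heap, (score, item))
    return {
        user: [
            (rank, score, item)
            for rank, (score, item) in enumerate(sorted(heap, reverse=True), start=1)
        ]
        for user, heap in heaps.items()
    }


# (user, item, score) rows taken from the sample table on the slide.
rows = [("B", 1, 70), ("A", 2, 80), ("A", 3, 90), ("B", 4, 60), ("A", 5, 70)]
top2 = each_top_k(2, rows)
print(top2)
```

Memory per group stays bounded by k, which is the property that lets the approach scale where a full window-function sort does not.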
  35. Recommendation with Hivemall
    k-nearest-neighbor
      ‣ MinHash and b-Bit MinHash (LSH)
      ‣ Similarities - Euclid - Cosine - Jaccard - Angular
    Efficient item-based collaborative filtering
      ‣ Sparse Linear Method (SLIM)
      ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion
      ‣ Matrix Factorization
      ‣ Factorization Machines
  36. Natural Language Processing: English, Japanese and Chinese tokenizer, word N-grams, …
      ‣ select tokenize('Hello, world!') returns ["Hello", "world"]
      ‣ select singularize('apples') returns apple
    Sketching
      SELECT count(distinct user_id) FROM t
      SELECT approx_count_distinct(user_id) FROM t
    ‣ Geospatial functions
      SELECT
        map_url(lat, lon, zoom) as osm_url,
        map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
      FROM (
        SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
        UNION ALL
        SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
      ) t
  37. Anomaly / Change-point detection
      ‣ Local outlier factor (k-NN-based technique)
      ‣ ChangeFinder
      ‣ Singular Spectrum Transformation
    Clustering / Topic modeling
      ‣ Latent Dirichlet Allocation
      ‣ Probabilistic Latent Semantic Analysis
  38. Easy-to-use: ML in SQL
    Scalable: Runs in parallel on Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: Efficient, generic functions
  39. None
  40. Apache Hive
      CREATE TABLE lr_model AS
      SELECT feature, avg(weight) as weight
      FROM (
        SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
        FROM training
      ) t
      GROUP BY feature;
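The outer `GROUP BY feature` with `avg(weight)` can be read as model averaging: when the inner training function runs in several parallel tasks, each task emits its own (feature, weight) rows, and the average merges them into a single model table. A minimal Python sketch of that merge step follows, under that reading. The two partial models are invented numbers, and `average_weights` is a hypothetical helper, not a Hivemall function.

```python
from collections import defaultdict


def average_weights(partial_models):
    """Merge per-task {feature: weight} dicts by averaging each feature's weights.

    Mirrors SELECT feature, avg(weight) ... GROUP BY feature: a feature is
    averaged only over the tasks that actually emitted a row for it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Invented outputs of two parallel training tasks.
partials = [
    {"price": 0.25, "gender#male": -0.5},
    {"price": 0.75},
]
lr_model = average_weights(partials)
print(lr_model)
```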
  41. Apache Pig
      a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});
      b = foreach a generate flatten(
            logress(features, label, '-total_steps ${total_steps}')
          ) as (feature, weight);
      c = group b by feature;
      d = foreach c generate group, AVG(b.weight);
      store d into 'a9a_model';
  42. Apache Spark: DataFrames
      val trainDf = spark.read.format("libsvm").load("a9a.train")
      val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
        .groupBy("feature").avg("weight")
        .toDF("feature", "weight")
        .cache
  43. Apache Spark: Query in HiveContext
      context = HiveContext(sc)
      context.sql("""
        SELECT feature, avg(weight) as weight
        FROM (
          SELECT train_logregr(features, label) as (feature, weight)
          FROM training
        ) t
        GROUP BY feature
      """)
  44. Apache Spark: Online prediction on Spark Streaming
      val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
      testData.predict { case testDf =>
        // Explode features in input streams
        val testDf_exploded = ...
        val predictDf = testDf_exploded
          .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
          .select($"rowid", ($"weight" * $"value").as("value"))
          .groupBy("rowid").sum("value")
          .select($"rowid", sigmoid($"SUM(value)"))
        predictDf
      }
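The prediction step above is a join of exploded test features against the model table, a per-row sum of weight times value, and a sigmoid. The same arithmetic in plain Python, as a sketch with invented weights and test rows: features missing from the model contribute nothing to the sum here, which is how this sketch approximates the LEFT OUTER join (a row whose features are all unknown gets 0.5, whereas the SQL-style sum would be null).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(test_rows, model):
    """test_rows: {rowid: {feature: value}}; model: {feature: weight}.

    Returns {rowid: probability}. Unknown features are skipped (weight treated as 0).
    """
    return {
        rowid: sigmoid(sum(model.get(f, 0.0) * v for f, v in features.items()))
        for rowid, features in test_rows.items()
    }


# Invented model weights and test rows for illustration.
model = {"category#book": 1.5, "category#food": -1.5, "price": 0.0}
test_rows = {
    1: {"category#book": 1.0, "price": 600.0},
    2: {"category#food": 1.0, "price": 300.0},
    3: {"category#unseen": 1.0},
}
predictions = predict(test_rows, model)
print(predictions)
```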
  45. Hivemall meets PySpark
      from pyspark.sql import SparkSession
      spark = SparkSession \
          .builder \
          .master('local[*]') \
          .config('spark.jars', 'hivemall-spark2.3-0.5.2-incubating-with-dependencies.jar') \
          .enableHiveSupport() \
          .getOrCreate()
  46. None
  47. Introduction to Apache Hivemall / What’s New in v0.5.2 / What’s Coming Next?
  48. None
  49. The Brickhouse collection of Hive UDFs has been merged into Hivemall
    Welcome Jerome to the team!
  50. Bloom Filters: Probabilistic data structures
      -- Build Bloom Filter (i.e., probabilistic set of) high-rated items
      WITH high_rated_items as (
        SELECT bloom(itemid) as items
        FROM (
          SELECT itemid FROM ratings GROUP BY itemid HAVING avg(rating) >= 4.0
        ) t
      )
      -- Check if item is in Bloom Filter, and see their actual ratings
      SELECT l.rating, count(distinct l.userid) as cnt
      FROM ratings l CROSS JOIN high_rated_items r
      WHERE bloom_contains(r.items, l.itemid)
      GROUP BY l.rating;
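For readers new to Bloom filters, here is a minimal Python sketch of the data structure behind the `bloom` / `bloom_contains` pair: a fixed-size bit array plus several hash functions, giving membership tests with possible false positives but no false negatives. The bit-array size, the number of hashes, the use of SHA-256, and the sample ratings are all assumptions for illustration and say nothing about Hivemall's actual parameters.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True if every bit is set: "probably present". False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Invented (userid, itemid, rating) rows standing in for the ratings table.
ratings = [(1, "i1", 5), (2, "i1", 4), (1, "i2", 2), (3, "i3", 4), (2, "i3", 5)]

# Build the filter from items whose average rating is at least 4.0.
by_item = defaultdict(list)
for _, itemid, rating in ratings:
    by_item[itemid].append(rating)

high_rated = BloomFilter()
for itemid, values in by_item.items():
    if sum(values) / len(values) >= 4.0:
        high_rated.add(itemid)

print([itemid for itemid in by_item if itemid in high_rated])
```

Because a filter of fixed size can report an item it never saw, results filtered this way are a superset of the exact answer, which is the trade-off for the compact representation.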
  51. Working with JSON
      SELECT
        from_json(
          '{ "location" : { "Country" : "Japan" , "City" : "Osaka" } }', -- json
          'map<string,string>', -- return type
          array('location') -- key
        )
    to_json()
  52. More utility functions
    Array / Vector ‣ Append ‣ Element-at ‣ Union ‣ First/last element ‣ Flatten ‣ Vector add/dot
    Map ‣ Convert into array of key-value pairs ‣ Filter elements by keys
    Sanity check ‣ Assert ‣ Raise error
    Misc ‣ Try-cast ‣ Sessionize records by time ‣ Moving average
  53. Field-Aware Factorization Machines

  54. Field-Aware Factorization Machines
    Y. Juan, Y. Zhuang, W. Chin, and C. Lin. Field-Aware Factorization Machines for CTR Prediction. RecSys 2016.
  55. SELECT
        train_ffm(
          features,
          label,
          '-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002'
        )
      FROM (
        SELECT features, label FROM train_vectorized CLUSTER BY rand(1)
      ) t
  56. Introduction to Apache Hivemall / What’s New in v0.5.2 / What’s Coming Next?
  57. HIVEMALL-118: word2vec
      SELECT
        train_word2vec(
          r.negative_table,
          l.words,
          "-n {n} -win 5 -neg 15 -iters 5 -model cbow"
        )
      FROM train_docs l CROSS JOIN negative_table r
  58. HIVEMALL-126: Multi-Nominal Logistic Regression
      SELECT train_maxent_classifier(features, label, "-attrs C,Q,Q")
      FROM train
  59. XGBoost with Hivemall (experimental)
      SELECT train_xgboost_classifier(features, label) as (model_id, model)
      FROM training

      SELECT rowid, avg(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN testing
      ) t
      GROUP BY rowid
  60. WE NEED YOUR CONTRIBUTION
    From documentation and utility UDFs, to state-of-the-art ML algorithms and toolkits
    github.com/apache/incubator-hivemall
  61. Installation
    http://hivemall.incubator.apache.org/userguide/getting_started/installation.html
      $ hive
      add jar /path/to/hivemall-all-VERSION.jar;
      source /path/to/define-all.hive;
    Docker image, documentation and step-by-step tutorial are available
  62. What's New and Coming to Apache Hivemall: Building More Flexible Machine Learning Solution for Apache Hive and Spark
    Takuya Kitazawa @takuti, Makoto Yui @myui (Apache Hivemall PPMCs)