
What's New and Coming to Apache Hivemall 
#ACNA19


Takuya Kitazawa

September 12, 2019


Transcript

  1. What's New and Coming to Apache Hivemall 
 Building More

    Flexible Machine Learning Solution for
 Apache Hive and Spark
 
 Takuya Kitazawa @takuti 
 Makoto Yui @myui Apache Hivemall PPMCs
  2. Practical experience in science and engineering, ML theory, tool /

    data model, scalability. Q. Solve a regression problem on massive data stored in a data warehouse
  3. Machine Learning for everyone
 Open source query-based machine learning solution

    - Incubating since Sept 13, 2016 - @ApacheHivemall - GitHub: apache/incubator-hivemall - Team: 6 PPMCs + 3 committers - Latest release: v0.5.2 (Dec 3, 2018) - Toward graduation: ✓ Community growth ✓ 1+ Apache releases ✓ Documentation improvements
  4. ‣ Data warehousing solution built on top of Apache Hadoop

    ‣ Efficiently access and analyze large-scale data via SQL-like interface, HiveQL - create table - select - join - group by - count() - sum() - … - order by - cluster by - … Apache Hive
  5. ‣ OSS project under Apache Software Foundation ‣ Scalable ML

    library implemented as Hive user-defined functions (UDFs) Apache Hivemall [Figure: a UDF maps each value of a column to a new value (e.g., l1_normalize()), a UDAF aggregates a column into a scalar (e.g., rmse()), and a UDTF emits tabular output (e.g., train_regressor())]
  6. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions Apache Hivemall
  7. Problem What you want to “predict” Hypothesis & Proposal Build

    machine learning model Historical data Cleanse data Evaluate. Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics can easily try, save, share, and schedule models via a simple interface in a scalable manner, and deploy to production
  8. Recommendation <3 tabular form User Item Rating Tom Laptop 3

    ★★★☆☆ Jack Coffee beans 5 ★★★★★ Mike Watch 1 ★☆☆☆☆ … … … User Top-3 recommended items Tom Headphone, USB charger, 4K monitor Jack Mug, Coffee machine, Chocolate Mike Ring, T-shirt, Bag … … Input Output User Bought item Tom Laptop Jack Coffee beans Mike Watch … …
  9. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  10. Preprocessing select array_concat( -- Concatenate features as a feature vector

    quantitative_features( -- Create quantitative features array("price"), price ), categorical_features( -- Create categorical features array("day of week", "gender", "category"), day_of_week, gender, category ) ) as features, label from purchase_history
  11. SELECT train_classifier( -- train_regressor( features, label, '-loss logloss -opt SGD

    -reg no -eta simple -total_steps ${total_steps}' ) as (feature, weight) FROM training Classification ‣ HingeLoss ‣ LogLoss (a.k.a. logistic loss) ‣ SquaredHingeLoss ‣ ModifiedHuberLoss Regression ‣ SquaredLoss ‣ QuantileLoss ‣ EpsilonInsensitiveLoss ‣ SquaredEpsilonInsensitiveLoss ‣ HuberLoss Supervised learning by unified function
  12. SELECT train_classifier( -- train_regressor( features, label, '-loss logloss -opt SGD

    -reg no -eta simple -total_steps ${total_steps}' ) as (feature, weight) FROM training Optimizer ‣ SGD ‣ AdaGrad ‣ AdaDelta ‣ ADAM Regularization ‣ L1 ‣ L2 ‣ ElasticNet ‣ RDA ‣ Iteration with learning rate control ‣ Mini-batch training ‣ Early stopping Supervised learning by unified function
  13. Classification ‣ Generic classifier ‣ Perceptron ‣ Passive Aggressive (PA,

    PA1, PA2) ‣ Confidence Weighted (CW) ‣ Adaptive Regularization of Weight Vectors (AROW) ‣ Soft Confidence Weighted (SCW) ‣ (Field-Aware) Factorization Machines ‣ RandomForest Regression ‣ Generic regressor ‣ PA Regression ‣ AROW Regression ‣ (Field-Aware) Factorization Machines ‣ RandomForest Classification and regression with a variety of algorithms
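Each algorithm above is exposed as its own UDTF that emits the same (feature, weight) rows as train_classifier, so models are built with the same averaging pattern. As a hedged sketch (table and column names are placeholders), AROW training could look like:

```sql
-- Train an AROW classifier; like train_classifier, it emits
-- (feature, weight) rows that are averaged into the final model.
CREATE TABLE arow_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT train_arow(features, label) as (feature, weight)
  FROM training
) t
GROUP BY feature;
```

Swapping in train_perceptron, train_pa1, or train_scw follows the same shape.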
  14. Factorization Machines S. Rendle. Factorization Machines with libFM. ACM Transactions

    on Intelligent Systems and Technology, 3(3), May 2012.
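A minimal sketch of factorization machine training in Hivemall, assuming the usual train_fm UDTF; the hyperparameter values are illustrative, and the output column names may vary by version:

```sql
-- Train a factorization machine classifier; emits the linear weight
-- and latent factors per feature (hyperparameters are illustrative).
SELECT
  train_fm(features, label, '-classification -factors 10 -iters 50')
    as (feature, Wi, Vif)
FROM
  training
```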
  15. RandomForest Training CREATE TABLE rf_model AS SELECT train_randomforest_classifier( features, label,

    '-trees 50 -seed 71' -- hyperparameters ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests) FROM training;
  16. RandomForest Export decision trees for visualization SELECT tree_export(model, "-type javascript",

    ...) as js, tree_export(model, "-type graphvis", ...) as dot FROM rf_model
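Prediction with the trained forest joins each test row against every tree and aggregates the votes. This is a hedged sketch only: the exact tree_predict and rf_ensemble signatures vary across Hivemall versions, and the table names come from the slides above.

```sql
-- Apply every tree in rf_model to every test row, then take the
-- ensemble vote per row (signatures vary by Hivemall version).
SELECT
  p.rowid,
  rf_ensemble(predicted) as (label, probability)
FROM (
  SELECT
    p.rowid,
    tree_predict(m.model_id, m.model, p.features, true) as predicted
  FROM
    rf_model m
    CROSS JOIN testing p
) p
GROUP BY
  p.rowid;
```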
  17. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  18. - Feature hashing - Feature scaling (normalization, z-score) - Feature

    binning - TF-IDF vectorizer - Polynomial expansion - Amplifier - AUC, nDCG, log loss, precision, recall, … - Concatenation - Intersection - Remove - Sort - Average - Sum - … - Feature engineering Evaluation metrics Array and maps Bit, compress, character encoding Efficient top-k query processing
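Two hedged one-liners for the feature-engineering functions listed above (table and column names are placeholders): feature_hashing() maps feature names into a bounded index space, and rescale() performs min-max scaling.

```sql
-- Hash "name:value" features into a fixed index space.
SELECT feature_hashing(features) as hashed_features FROM training;

-- Min-max scale a raw value into [0, 1]; min_price and max_price
-- are assumed to be precomputed columns.
SELECT rescale(price, min_price, max_price) as scaled_price
FROM products_with_stats;
```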
  19. Efficient top-k retrieval Internally hold bounded priority queue List top-2

    items per user: item user score 1 B 70 2 A 80 3 A 90 4 B 60 5 A 70 … … … SELECT item, user, score, rank FROM ( SELECT item, user, score, rank() over (PARTITION BY user ORDER BY score DESC) as rank FROM table ) t WHERE rank <= 2 SELECT each_top_k( 2, user, score, user, item -- output columns ) as (rank, score, user, item) FROM ( SELECT * FROM table CLUSTER BY user ) t Does not finish in 24 hrs for 20M users and ~1k items each. Finishes in 2 hrs.
  20. Recommendation with Hivemall k-nearest-neighbor ‣ MinHash and b-Bit MinHash (LSH)

    ‣ Similarities - Euclid - Cosine - Jaccard - Angular Efficient item-based collaborative filtering ‣ Sparse Linear Method (SLIM) ‣ Approximated all-pair similarities (DIMSUM) Matrix completion ‣ Matrix Factorization ‣ Factorization Machines
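The similarity functions above take two sparse feature vectors; as a hedged sketch of item-item similarity for collaborative filtering (item_features is a placeholder table of "index:value" arrays):

```sql
-- All-pairs item-item cosine similarity over sparse feature vectors.
-- A CROSS JOIN is quadratic; MinHash or DIMSUM (above) prune candidates
-- for large catalogs.
SELECT
  t1.itemid,
  t2.itemid as other,
  cosine_similarity(t1.features, t2.features) as similarity
FROM
  item_features t1
  CROSS JOIN item_features t2
WHERE
  t1.itemid != t2.itemid
```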
  21. Natural Language Processing — English, Japanese and Chinese tokenizer, word

    N-grams, … ‣ select tokenize('Hello, world!') returns ["Hello", "world"] ‣ select singularize('apples') returns apple Sketching ‣ SELECT count(distinct user_id) FROM t approximated by SELECT approx_count_distinct(user_id) FROM t Geospatial functions ‣ SELECT map_url(lat, lon, zoom) as osm_url, map_url(lat, lon, zoom, '-type googlemaps') as gmap_url FROM ( SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom UNION ALL SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom ) t
  22. Anomaly / Change-point detection ‣ Local outlier factor (k-NN-based technique)

    ‣ ChangeFinder ‣ Singular Spectrum Transformation Clustering / Topic modeling ‣ Latent Dirichlet Allocation ‣ Probabilistic Latent Semantic Analysis
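As a hedged sketch, ChangeFinder runs as a single UDTF pass over a univariate series, scoring each row for outlier-ness and change points (the table, column, and threshold values are placeholders, not tuned settings):

```sql
-- Score each observation; rows whose scores exceed the given
-- thresholds are flagged as outliers / change points.
SELECT
  changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035')
    as (outlier_score, changepoint_score, is_outlier, is_changepoint)
FROM
  timeseries
```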
  23. Easy-to-use ML in SQL Scalable Runs in parallel on Hadoop

    ecosystem Multi-platform Hive, Spark, Pig Versatile Efficient, generic functions
  24. CREATE TABLE lr_model AS SELECT feature, avg(weight) as weight FROM

    ( SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight) FROM training ) t GROUP BY feature; Apache Hive
  25. Apache Pig a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});

    b = foreach a generate flatten( logress(features, label, '-total_steps ${total_steps}') ) as (feature, weight); c = group b by feature; d = foreach c generate group, AVG(b.weight); store d into 'a9a_model';
  26. Apache Spark DataFrames val trainDf = spark.read.format("libsvm").load("a9a.train") val modelDf =

    trainDf.train_logregr(append_bias($"features"), $"label") .groupBy("feature").avg("weight") .toDF("feature", "weight") .cache
  27. context = HiveContext(sc) context.sql(" SELECT feature, avg(weight) as weight FROM

    ( SELECT train_logregr(features, label) as (feature, weight) FROM training ) t GROUP BY feature ") Apache Spark Query in HiveContext
  28. Apache Spark Online prediction on Spark Streaming val testData =

    ssc.textFileStream(...).map(LabeledPoint.parse) testData.predict { case testDf => // Explode features in input streams val testDf_exploded = ... val predictDf = testDf_exploded .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER") .select($"rowid", ($"weight" * $"value").as("value")) .groupBy("rowid").sum("value") .select($"rowid", sigmoid($"SUM(value)")) predictDf }
  29. Hivemall meets PySpark from pyspark.sql import SparkSession spark = SparkSession

    \ .builder \ .master('local[*]') \ .config('spark.jars', 'hivemall-spark2.3-0.5.2-incubating-with-dependencies.jar') \ .enableHiveSupport() \ .getOrCreate()
  30. WITH high_rated_items as ( SELECT bloom(itemid) as items FROM (

    SELECT itemid FROM ratings GROUP BY itemid HAVING avg(rating) >= 4.0 ) t ) SELECT l.rating, count(distinct l.userid) as cnt FROM ratings l CROSS JOIN high_rated_items r WHERE bloom_contains(r.items, l.itemid) GROUP BY l.rating; Bloom Filters: Probabilistic data structures Build Bloom Filter (i.e., probabilistic set of) high-rated items Check if item is in Bloom Filter, and see their actual ratings:
  31. Working with JSON SELECT from_json( '{ "location" : { "Country"

    : "Japan" , "City" : "Osaka" } }', -- json 'map<string,string>', -- return type array('location') -- key ) to_json()
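to_json() goes the other way, serializing a Hive map or struct to a JSON string; a minimal sketch:

```sql
-- Serialize a Hive map to a JSON string,
-- e.g. something like {"Country":"Japan","City":"Osaka"}
SELECT to_json(map('Country', 'Japan', 'City', 'Osaka'))
```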
  32. Array / Vector ‣ Append ‣ Element-at ‣ Union ‣

    First/last element ‣ Flatten ‣ Vector add/dot Map ‣ Convert into array of key-value pairs ‣ Filter elements by keys Sanity check ‣ Assert ‣ Raise error Misc ‣ Try-cast ‣ Sessionize records by time ‣ Moving average More utility functions
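A few of the array utilities above as hedged one-liners:

```sql
-- Append an element to an array.
SELECT array_append(array(1, 2), 3);
-- Flatten a nested array into a flat one.
SELECT array_flatten(array(array(1, 2), array(3)));
-- Pick the first and last elements.
SELECT first_element(array('a', 'b')), last_element(array('a', 'b'));
```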
  33. Field-Aware Factorization Machines Y. Juan, Y. Zhuang, W. Chin, and

    C. Lin. Field-Aware Factorization Machines for CTR Prediction. RecSys 2016.
  34. SELECT train_ffm( features, label, '-init_v random -max_init_value 0.5 -classification -iterations

    15
 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002' ) FROM ( SELECT features, label FROM train_vectorized CLUSTER BY rand(1) ) t
  35. HIVEMALL-118: word2vec SELECT train_word2vec( r.negative_table, l.words, "-n {n} -win 5

    -neg 15 -iters 5 -model cbow" ) FROM train_docs l CROSS JOIN negative_table r
  36. XGBoost with Hivemall (experimental) SELECT train_xgboost_classifier(features, label) as (model_id, model)

    FROM training SELECT rowid, avg(predicted) as predicted FROM ( -- predict with each model SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted) -- join each test record with each model FROM xgboost_models CROSS JOIN testing ) t GROUP BY rowid
  37. WE NEED YOUR CONTRIBUTION From documentation and utility UDFs, to

    state-of-the-art ML algorithms and toolkits github.com/apache/incubator-hivemall
  38. What's New and Coming to Apache Hivemall 
 Building More

    Flexible Machine Learning Solution for
 Apache Hive and Spark
 
 Takuya Kitazawa @takuti 
 Makoto Yui @myui Apache Hivemall PPMCs