Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive


Takuya Kitazawa

September 22, 2018

Transcript

  1. Apache Hivemall Query-Based Handy, Scalable Machine Learning on Hive Takuya

    Kitazawa @takuti Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall
  2. Machine Learning in Query Language

  3. BigQuery ML at Google I/O 2018 https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

  4. Q. Solve a regression problem on massive data stored in a data warehouse
  5. Q. Solve a regression problem on massive data stored in a data warehouse. What it takes: practical experience in science and engineering, theoretical understanding, a tool (Python?), and scalability.
  6. Done in fewer than 10 lines of queries https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html

  7. Machine Learning for everyone

  8. Open source query-based machine learning solution github.com/apache/incubator-hivemall

  9. Hivemall in the real-world ML workflow: why Hivemall is notably preferable, and who benefits from it
  10. The real-world ML workflow is for experts: data scientists and ML engineers who have a solid relevant background. Problem (what you want to "predict") → Hypothesis & Proposal → Historical data → Cleanse data → Build machine learning model → Evaluate → Deploy to production
  11. The real-world ML workflow is for experts: data scientists and ML engineers who have a solid relevant background. Problem (what you want to "predict") → Hypothesis & Proposal → Historical data → Cleanse data → Build machine learning model → Evaluate → Deploy to production. Do you really need such complexity and flexibility?
  12. None
  13. None
  14. Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics can easily try, save, share, and schedule every step of the same workflow (cleanse data → build machine learning model → evaluate → deploy to production) via a simple interface, in a scalable manner.
  15. How can we organize query fragments? Every step is its own query: Extract → Filter → Interpolate → Normalize → …; train data → get features → train; test data → get features → predict → accuracy.
  16. Tip: combining with a workflow engine, e.g., Digdag, allows you to define a highly dependent ML workflow (extract, filter, interpolate, normalize, train, predict, and evaluate queries) as a YAML file. https://www.digdag.io/
  17. Query-based ML: a new option outside of the common ML toolkits. Your data-related work can be simpler. OSS solution <3 workflow engine
  18. Introduction to query-based machine learning with Apache Hivemall

  19. Apache Hive
    ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via a SQL-like interface, HiveQL: create table, select, join, group by, count(), sum(), order by, cluster by, …
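    For instance, an everyday HiveQL aggregation reads just like standard SQL. A minimal sketch, assuming a hypothetical access_log table:

      SELECT user_id, count(*) as pv   -- page views per user
      FROM access_log
      GROUP BY user_id
      ORDER BY pv DESC
      LIMIT 10;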
  20. Apache Hivemall
    ‣ OSS project under the Apache Software Foundation since 2017
    ‣ Scalable ML library implemented as Hive user-defined functions: a UDF maps a column to a new column (e.g., l1_normalize()), a UDAF aggregates a column into a scalar (e.g., rmse()), and a UDTF emits a table (e.g., train_regressor())
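    To illustrate the three flavors, a minimal sketch using the three functions named above (samples, predictions, and training are hypothetical tables):

      -- UDF: transforms each row's value into a new value
      SELECT l1_normalize(features) FROM samples;
      -- UDAF: aggregates many rows into a single scalar
      SELECT rmse(predicted, actual) FROM predictions;
      -- UDTF: emits a table, here one row per learned feature weight
      SELECT train_regressor(features, label) as (feature, weight) FROM training;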
  21. Apache Hivemall ‣ Easy-to-use: ML in SQL ‣ Scalable: runs in parallel on the Hadoop ecosystem ‣ Multi-platform: Hive, Spark, Pig ‣ Versatile: efficient, generic functions
  22. Easy-to-use and scalable

  23. Example: Scalable logistic regression written in ~10 lines of queries; automatically runs in parallel on Hadoop
  24. Space-efficient feature representation in Hivemall: each feature is "index:value" or just "index" (index: INT, BIGINT, or TEXT; value: FLOAT)
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ index can be TEXT: price:600, size:2.5
    ‣ index-only means value = 1.0 (e.g., categorical): gender#male = gender#male:1.0
  25. Feature vector = array of strings; Hivemall internally does one-hot encoding (e.g., book → 1, 0, 0, …), and NULL is automatically omitted.
    Array of quantitative features (index:value):
      select quantitative_features(array("price", "size"), 600, 2.5)
      > ["price:600.0", "size:2.5"]
    Array of categorical features (index#value):
      select categorical_features(array("gender", "category"), "male", "book")
      > ["gender#male", "category#book"]
  26. Feature hashing: approximation improves scalability. Simplifies the names of quantitative and categorical features by hashing them into integer indices (default upper limit: 2^24 + 1 = 16,777,217):
      select feature_hashing(array("price:600", "category#book"))
      > ["14142887:600", "10413006"]
  27. Example: Table “purchase_history” http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html

  28. select
        array_concat( -- Concatenate features as a feature vector
          quantitative_features( -- Create quantitative features
            array("price"), price
          ),
          categorical_features( -- Create categorical features
            array("day_of_week", "gender", "category"),
            day_of_week, gender, category
          )
        ) as features,
        label
      from purchase_history
  29. Resulting table “training”

  30. select
        add_bias(
          array_concat( -- Concatenate features as a feature vector
            quantitative_features( -- Create quantitative features
              array("price"), price
            ),
            categorical_features( -- Create categorical features
              array("day_of_week", "gender", "category"),
              day_of_week, gender, category
            )
          )
        ) as features,
        label
      from purchase_history
  31. select
        feature_hashing(
          add_bias(
            array_concat( -- Concatenate features as a feature vector
              quantitative_features( -- Create quantitative features
                array("price"), price
              ),
              categorical_features( -- Create categorical features
                array("day_of_week", "gender", "category"),
                day_of_week, gender, category
              )
            )
          )
        ) as features,
        label
      from purchase_history
  32. Supervised learning by a unified function
      SELECT train_classifier( -- or train_regressor(
        features, label,
        '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
      ) as (feature, weight)
      FROM training
    Classification losses ‣ HingeLoss ‣ LogLoss (a.k.a. logistic loss) ‣ SquaredHingeLoss ‣ ModifiedHuberLoss
    Regression losses ‣ SquaredLoss ‣ QuantileLoss ‣ EpsilonInsensitiveLoss ‣ SquaredEpsilonInsensitiveLoss ‣ HuberLoss
  33. Supervised learning by a unified function
      SELECT train_classifier( -- or train_regressor(
        features, label,
        '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
      ) as (feature, weight)
      FROM training
    Optimizer ‣ SGD ‣ AdaGrad ‣ AdaDelta ‣ ADAM
    Regularization ‣ L1 ‣ L2 ‣ ElasticNet ‣ RDA
    Also ‣ Iteration with learning rate control ‣ Mini-batch training ‣ Early stopping
  34. Model = table
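    Because the model is a plain table of (feature, weight) rows, prediction is an ordinary join plus aggregation. A minimal sketch along the lines of the tutorial linked on slide 27, assuming a testing table with a rowid and an array-typed features column, and a model table named lr_model:

      WITH test_exploded as (
        SELECT
          rowid,
          extract_feature(fv) as feature,  -- split each "index:value" pair
          extract_weight(fv) as value
        FROM testing LATERAL VIEW explode(features) t2 as fv
      )
      SELECT
        t.rowid,
        sigmoid(sum(m.weight * t.value)) as probability  -- dot product, then sigmoid
      FROM test_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid;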

  35. Classification and regression with a variety of algorithms
    Classification ‣ Generic classifier ‣ Perceptron ‣ Passive Aggressive (PA, PA1, PA2) ‣ Confidence Weighted (CW) ‣ Adaptive Regularization of Weight Vectors (AROW) ‣ Soft Confidence Weighted (SCW) ‣ (Field-Aware) Factorization Machines ‣ RandomForest
    Regression ‣ Generic regressor ‣ PA Regression ‣ AROW Regression ‣ (Field-Aware) Factorization Machines ‣ RandomForest
  36. Factorization Machines. S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
  37. Factorization Machines

  38. RandomForest Training
      CREATE TABLE rf_model AS
      SELECT train_randomforest_classifier(
        features, label,
        '-trees 50 -seed 71' -- hyperparameters
      ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
      FROM training;
  39. RandomForest Model table

  40. RandomForest Prediction
      CREATE TABLE rf_predicted as
      SELECT
        rowid,
        rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
      FROM (
        SELECT
          t.rowid, m.model_weight,
          tree_predict(m.model_id, m.model, t.features, ${classification}) as predicted
        FROM testing t CROSS JOIN rf_model m
      ) t1
      GROUP BY rowid
  41. RandomForest: export decision trees for visualization
      SELECT
        tree_export(model, "-type javascript", ...) as js,
        tree_export(model, "-type graphvis", ...) as dot
      FROM rf_model
  42. Versatile

  43. Feature engineering ‣ Feature hashing ‣ Feature scaling (normalization, z-score) ‣ Feature binning ‣ TF-IDF vectorizer ‣ Polynomial expansion ‣ Amplifier
    Evaluation metrics ‣ AUC, nDCG, log loss, precision, recall, …
    Arrays and maps ‣ Concatenation ‣ Intersection ‣ Remove ‣ Sort ‣ Average ‣ Sum ‣ …
    Bit, compression, and character-encoding functions
    Efficient top-k query processing
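    As one example from the feature-engineering list, min-max feature scaling combines plain aggregates with Hivemall's rescale(). A minimal sketch, assuming a hypothetical table t with a numeric price column:

      WITH stats as (
        SELECT min(price) as min_price, max(price) as max_price FROM t
      )
      SELECT rescale(t.price, stats.min_price, stats.max_price) as price_scaled  -- maps into [0, 1]
      FROM t CROSS JOIN stats;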
  44. Efficient top-k retrieval: internally holds a bounded priority queue. List the top-k items per user (k = 2 in the queries below):
      item  user  score
      1     B     70
      2     A     80
      3     A     90
      4     B     60
      5     A     70
      …     …     …
    Window function (does not finish within 24 hours for 20M users with ~1k items each):
      SELECT item, user, score, rank
      FROM (
        SELECT item, user, score,
               rank() over (PARTITION BY user ORDER BY score DESC) as rank
        FROM table
      ) t
      WHERE rank <= 2
    each_top_k (finishes in 2 hours):
      SELECT each_top_k(
        2, user, score,
        user, item -- output columns
      ) as (rank, score, user, item)
      FROM (
        SELECT * FROM table CLUSTER BY user
      ) t
  45. Recommendation <3 tabular form
    Input (ratings):
      User  Item          Rating
      Tom   Laptop        3
      Jack  Coffee beans  5
      Mike  Watch         1
      …     …             …
    Input (purchases):
      User  Bought item
      Tom   Laptop
      Jack  Coffee beans
      Mike  Watch
      …     …
    Output:
      User  Top-3 recommended items
      Tom   Headphone, USB charger, 4K monitor
      Jack  Mug, Coffee machine, Chocolate
      Mike  Ring, T-shirt, Bag
      …     …
  46. Recommendation with Hivemall
    k-nearest-neighbor ‣ MinHash and b-Bit MinHash (LSH) ‣ Similarities: Euclid, Cosine, Jaccard, Angular
    Efficient item-based collaborative filtering ‣ Sparse Linear Method (SLIM) ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion ‣ Matrix Factorization ‣ Factorization Machines
  47. Natural Language Processing: English, Japanese, and Chinese tokenizers, word N-grams, …
      select tokenize('Hello, world!')
      > ["Hello", "world"]
      select singularize('apples')
      > "apple"
    Sketching:
      SELECT count(distinct user_id) FROM t
      -- approximated by:
      SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions:
      SELECT
        map_url(lat, lon, zoom) as osm_url,
        map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
      FROM (
        SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
        UNION ALL
        SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
      ) t
  48. Anomaly / change-point detection ‣ Local outlier factor (k-NN-based technique) ‣ ChangeFinder ‣ Singular Spectrum Transformation
    Clustering / topic modeling ‣ Latent Dirichlet Allocation ‣ Probabilistic Latent Semantic Analysis
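    ChangeFinder, for example, is exposed as a function over a time-ordered column. A rough sketch, assuming a hypothetical timeseries table; the option names mirror the user-guide examples but should be treated as assumptions and verified there:

      SELECT
        seq,
        changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035') as scores
      FROM timeseries
      ORDER BY seq ASC;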
  49. Multi-platform

  50. None
  51. Apache Hive
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight
      FROM (
        SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
        FROM training
      ) t
      GROUP BY feature;
  52. Apache Pig
      a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});
      b = foreach a generate flatten(
            logress(features, label, '-total_steps ${total_steps}')
          ) as (feature, weight);
      c = group b by feature;
      d = foreach c generate group, AVG(b.weight);
      store d into 'a9a_model';
  53. Apache Spark DataFrames
      val trainDf = spark.read.format("libsvm").load("a9a.train")
      val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
        .groupBy("feature").avg("weight")
        .toDF("feature", "weight")
        .cache
  54. Apache Spark: query in HiveContext
      context = HiveContext(sc)
      context.sql("
        SELECT feature, avg(weight) as weight
        FROM (
          SELECT train_logregr(features, label) as (feature, weight)
          FROM training
        ) t
        GROUP BY feature
      ")
  55. Apache Spark: online prediction on Spark Streaming
      val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
      testData.predict { case testDf =>
        // Explode features in input streams
        val testDf_exploded = ...
        val predictDf = testDf_exploded
          .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
          .select($"rowid", ($"weight" * $"value").as("value"))
          .groupBy("rowid").sum("value")
          .select($"rowid", sigmoid($"SUM(value)"))
        predictDf
      }
  56. Future development plan ‣ word2vec ‣ Field-aware factorization machines stability improvements ‣ XGBoost ‣ Gradient boosting, LightGBM ‣ Hivemall on Kafka KSQL ‣ …
  57. XGBoost with Hivemall (experimental)
      SELECT train_xgboost_classifier(features, label) as (model_id, model)
      FROM training

      SELECT
        rowid,
        avg(predicted) as predicted
      FROM (
        -- predict with each model
        SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
        -- join each test record with each model
        FROM xgboost_models CROSS JOIN testing
      ) t
      GROUP BY rowid
  58. Installation http://hivemall.incubator.apache.org/userguide/getting_started/installation.html
      $ hive
      add jar /path/to/hivemall-all-VERSION.jar;
      source /path/to/define-all.hive;
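    After loading the jar and the DDL script, a quick way to verify the setup is to ask for the library version:

      hive> select hivemall_version();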

  59. None

  60. github.com/apache/incubator-hivemall Docker image, documentation and step-by-step tutorial are available

  61. Making machine learning Easy, Scalable, Sharable, and Clean with query language and Apache Hivemall
  62. Apache Hivemall Query-Based Handy, Scalable Machine Learning on Hive Takuya

    Kitazawa @takuti Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall