
Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive

Takuya Kitazawa
September 22, 2018


Transcript

  1. Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall
  2. Q. Solve a regression problem on massive data stored in a data warehouse.
    What it takes: practical experience in science and engineering, theoretical understanding, a tool (Python?), and scalability.
  3. The real-world ML workflow is for experts: data scientists and ML engineers with a solid relevant background.
    Problem → what you want to "predict" → hypothesis & proposal → historical data → cleanse data → build machine learning model → evaluate → deploy to production
  4. The same expert workflow again, from problem through cleansing, modeling, and evaluation to production deployment: do you really need such complexity and flexibility?
  5. Hivemall makes ML simpler and handier for non-experts: anybody who knows SQL basics can easily try, save, share, and schedule every step of the workflow through a simple interface, in a scalable manner.
  6. How can we organize query fragments? Each step of the workflow maps to a query: extract → filter → interpolate → normalize → ...; train data → get features → train; test data → get features → predict → accuracy.
  7. Tip: combine Hivemall with a workflow engine. For example, Digdag (https://www.digdag.io/) lets you define a highly dependent ML workflow, with each extract/filter/train/predict step expressed as a query, in a YAML file.
  8. Query-based ML is a new option outside of the common ML toolkits. Your data-related work can be simpler: OSS solution <3 workflow engine.
  9. Apache Hive
    ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via a SQL-like interface, HiveQL: create table, select, join, group by, count(), sum(), order by, cluster by, ...
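    A minimal HiveQL sketch of the kind of aggregation Hive runs at scale; the purchase_history table here is the hypothetical one used on later slides:

        -- Count purchases and sum revenue per category (illustrative only)
        SELECT category, count(*) AS purchases, sum(price) AS revenue
        FROM purchase_history
        GROUP BY category
        ORDER BY revenue DESC;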
  10. Apache Hivemall
    ‣ OSS project under the Apache Software Foundation since 2017
    ‣ Scalable ML library implemented as Hive user-defined functions (UDFs)
    Three function types: a UDF maps a column to a new column row by row (e.g., l1_normalize()), a UDAF aggregates a column into a scalar (e.g., rmse()), and a UDTF emits tabular output with multiple rows and columns (e.g., train_regressor()).
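    A sketch of each function type in use; the three Hivemall functions are the ones named above, while the table and column names are assumptions:

        -- UDF: row-wise transformation of a feature vector
        SELECT l1_normalize(features) FROM train;

        -- UDAF: aggregate predicted vs. actual values into a single scalar
        SELECT rmse(predicted, actual) FROM predictions;

        -- UDTF: one call emits a table of (feature, weight) rows
        SELECT train_regressor(features, label, '-loss squaredloss') as (feature, weight)
        FROM train;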
  11. Apache Hivemall
    ‣ Easy to use: ML in SQL
    ‣ Scalable: runs in parallel on the Hadoop ecosystem
    ‣ Multi-platform: Hive, Spark, Pig
    ‣ Versatile: efficient, generic functions
  12. Space-efficient feature representation in Hivemall: each feature is TEXT of the form index:value (or index alone), where the index is INT, BIGINT, or TEXT and the value is FLOAT.
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ The index can be text: price:600, size:2.5
    ‣ An index-only feature means value = 1.0 (e.g., categorical): gender#male = gender#male:1.0
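    Since a feature vector is ultimately just an array of such strings (see the next slide), one can also be written by hand; a trivial sketch:

        SELECT array("price:600.0", "size:2.5", "gender#male") AS features;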
  13. Feature vector = array of strings. NULL is automatically omitted, and Hivemall internally does one-hot encoding (e.g., book → 1, 0, 0, ...).
    Array of quantitative features:
        select quantitative_features(array("price", "size"), 600, 2.5)
        → ["price:600.0", "size:2.5"]
    Array of categorical features:
        select categorical_features(array("gender", "category"), "male", "book")
        → ["gender#male", "category#book"]
  14. Feature hashing: approximation improves scalability by simplifying the names of quantitative and categorical features.
        select feature_hashing(array("price:600", "category#book"))
        → ["14142887:600", "10413006"]
    (Default upper limit: 2^24 + 1 = 16777217)
  15. select
          array_concat(                -- Concatenate features as a feature vector
            quantitative_features(     -- Create quantitative features
              array("price"),
              price
            ),
            categorical_features(      -- Create categorical features
              array("day of week", "gender", "category"),
              day_of_week, gender, category
            )
          ) as features,
          label
        from purchase_history
  16. select
          add_bias(                      -- Append a constant bias term
            array_concat(                -- Concatenate features as a feature vector
              quantitative_features(     -- Create quantitative features
                array("price"),
                price
              ),
              categorical_features(      -- Create categorical features
                array("day of week", "gender", "category"),
                day_of_week, gender, category
              )
            )
          ) as features,
          label
        from purchase_history
  17. select
          feature_hashing(                 -- Hash feature names into integer indices
            add_bias(
              array_concat(                -- Concatenate features as a feature vector
                quantitative_features(     -- Create quantitative features
                  array("price"),
                  price
                ),
                categorical_features(      -- Create categorical features
                  array("day of week", "gender", "category"),
                  day_of_week, gender, category
                )
              )
            )
          ) as features,
          label
        from purchase_history
  18. Supervised learning via a unified function:
        SELECT
          train_classifier(  -- or train_regressor(
            features,
            label,
            '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
          ) as (feature, weight)
        FROM training
    Classification losses: HingeLoss, LogLoss (a.k.a. logistic loss), SquaredHingeLoss, ModifiedHuberLoss
    Regression losses: SquaredLoss, QuantileLoss, EpsilonInsensitiveLoss, SquaredEpsilonInsensitiveLoss, HuberLoss
  19. The same unified function covers optimization and regularization:
        SELECT
          train_classifier(  -- or train_regressor(
            features,
            label,
            '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
          ) as (feature, weight)
        FROM training
    Optimizers: SGD, AdaGrad, AdaDelta, ADAM
    Regularization: L1, L2, ElasticNet, RDA
    Plus: iteration with learning-rate control, mini-batch training, early stopping
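    The deck shows only the training side; below is a minimal sketch of the matching prediction query, assuming the (feature, weight) rows above were stored in a table named model and the test set lives in testing(rowid, features) (both names are assumptions):

        WITH test_exploded AS (
          -- Explode each test feature vector into (rowid, feature, value) rows
          SELECT
            t.rowid,
            extract_feature(fv) AS feature,
            extract_weight(fv) AS value
          FROM testing t
          LATERAL VIEW explode(features) t2 AS fv
        )
        SELECT
          t.rowid,
          sigmoid(sum(m.weight * t.value)) AS probability  -- logistic output for -loss logloss
        FROM test_exploded t
        LEFT OUTER JOIN model m ON t.feature = m.feature
        GROUP BY t.rowid;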
  20. Classification and regression with a variety of algorithms.
    Classification: generic classifier, Perceptron, Passive Aggressive (PA, PA1, PA2), Confidence Weighted (CW), Adaptive Regularization of Weight Vectors (AROW), Soft Confidence Weighted (SCW), (field-aware) Factorization Machines, RandomForest
    Regression: generic regressor, PA regression, AROW regression, (field-aware) Factorization Machines, RandomForest
  21. Factorization Machines. S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
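    For reference, the second-order FM model from the cited paper combines a global bias, linear weights, and pairwise interactions factorized through k-dimensional latent vectors:

        \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j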
  22. RandomForest training:
        CREATE TABLE rf_model AS
        SELECT
          train_randomforest_classifier(
            features,
            label,
            '-trees 50 -seed 71'  -- hyperparameters
          ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
        FROM training;
  23. RandomForest prediction:
        CREATE TABLE rf_predicted AS
        SELECT
          rowid,
          rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
        FROM (
          SELECT
            t.rowid,
            m.model_weight,
            tree_predict(m.model_id, m.model, t.features, ${classification}) as predicted
          FROM testing t
          CROSS JOIN rf_model m
        ) t1
        GROUP BY rowid;
  24. RandomForest: export decision trees for visualization:
        SELECT
          tree_export(model, "-type javascript", ...) as js,
          tree_export(model, "-type graphvis", ...) as dot
        FROM rf_model
  25. ‣ Feature engineering: feature hashing, feature scaling (normalization, z-score; see the sketch after this list), feature binning, TF-IDF vectorizer, polynomial expansion, amplifier
    ‣ Evaluation metrics: AUC, nDCG, log loss, precision, recall, ...
    ‣ Array and map operations: concatenation, intersection, remove, sort, average, sum, ...
    ‣ Bit, compression, and character-encoding utilities
    ‣ Efficient top-k query processing
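    A hedged sketch of feature scaling with Hivemall's rescale() (min-max normalization) and zscore() UDFs; the raw_data table and its price column are assumptions:

        -- Compute the statistics once, then scale each row
        SELECT
          rescale(price, stats.min_price, stats.max_price) AS price_minmax,  -- maps into [0, 1]
          zscore(price, stats.mean_price, stats.sd_price)  AS price_zscore   -- (x - mean) / stddev
        FROM raw_data
        CROSS JOIN (
          SELECT
            min(price) AS min_price, max(price) AS max_price,
            avg(price) AS mean_price, stddev_pop(price) AS sd_price
          FROM raw_data
        ) stats;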
  26. Efficient top-k retrieval: each_top_k internally holds a bounded priority queue. Example: list the top-2 items per user from a table like:
        item | user | score
        1    | B    | 70
        2    | A    | 80
        3    | A    | 90
        4    | B    | 60
        5    | A    | 70
        ...  | ...  | ...
    Window-function version (did not finish within 24 hours for 20M users with ~1k items each):
        SELECT item, user, score, rank
        FROM (
          SELECT item, user, score,
                 rank() over (PARTITION BY user ORDER BY score DESC) as rank
          FROM table
        ) t
        WHERE rank <= 2
    each_top_k version (finished in 2 hours):
        SELECT
          each_top_k(
            2, user, score,
            user, item  -- output columns
          ) as (rank, score, user, item)
        FROM (
          SELECT * FROM table CLUSTER BY user
        ) t
  27. Recommendation <3 tabular form.
    Input (purchase history):
        User | Bought item
        Tom  | Laptop
        Jack | Coffee beans
        Mike | Watch
        ...  | ...
    Input (ratings):
        User | Item         | Rating
        Tom  | Laptop       | 3 (★★★☆☆)
        Jack | Coffee beans | 5 (★★★★★)
        Mike | Watch        | 1 (★☆☆☆☆)
        ...  | ...          | ...
    Output:
        User | Top-3 recommended items
        Tom  | Headphone, USB charger, 4K monitor
        Jack | Mug, Coffee machine, Chocolate
        Mike | Ring, T-shirt, Bag
        ...  | ...
  28. Recommendation with Hivemall:
    ‣ k-nearest neighbor: MinHash and b-Bit MinHash (LSH); similarity measures: Euclid, cosine, Jaccard, angular
    ‣ Efficient item-based collaborative filtering: Sparse Linear Method (SLIM), approximated all-pair similarities (DIMSUM)
    ‣ Matrix completion: Matrix Factorization, Factorization Machines
  29. Natural language processing: English, Japanese, and Chinese tokenizers, word N-grams, ...
        select tokenize('Hello, world!')   →  ["Hello", "world"]
        select singularize('apples')       →  apple
    Sketching (approximate distinct counts):
        SELECT count(distinct user_id) FROM t
        →  SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions:
        SELECT
          map_url(lat, lon, zoom) as osm_url,
          map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
        FROM (
          SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
          UNION ALL
          SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
        ) t
  30. Anomaly / change-point detection:
    ‣ Local Outlier Factor (a k-NN-based technique)
    ‣ ChangeFinder
    ‣ Singular Spectrum Transformation
    Clustering / topic modeling:
    ‣ Latent Dirichlet Allocation
    ‣ Probabilistic Latent Semantic Analysis
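    A hedged sketch of ChangeFinder on a univariate series, assuming a hypothetical timeseries(ts, value) table; the exact shape of the returned struct is an assumption based on the Hivemall documentation:

        SELECT
          ts,
          -- Returns outlier and change-point scores (plus flags when thresholds are given)
          changefinder(value, '-outlier_threshold 0.03 -changepoint_threshold 0.0035') AS result
        FROM timeseries
        ORDER BY ts ASC;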
  31. Apache Hive:
        CREATE TABLE lr_model AS
        SELECT
          feature,
          avg(weight) as weight
        FROM (
          SELECT logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
          FROM training
        ) t
        GROUP BY feature;
  32. Apache Pig:
        a = load 'a9a.train' as (rowid:int, label:float, features:{(featurepair:chararray)});
        b = foreach a generate flatten(
              logress(features, label, '-total_steps ${total_steps}')
            ) as (feature, weight);
        c = group b by feature;
        d = foreach c generate group, AVG(b.weight);
        store d into 'a9a_model';
  33. Apache Spark DataFrames:
        val trainDf = spark.read.format("libsvm").load("a9a.train")
        val modelDf = trainDf.train_logregr(append_bias($"features"), $"label")
          .groupBy("feature").avg("weight")
          .toDF("feature", "weight")
          .cache
  34. Apache Spark: query in HiveContext:
        context = HiveContext(sc)
        context.sql("""
          SELECT feature, avg(weight) as weight
          FROM (
            SELECT train_logregr(features, label) as (feature, weight)
            FROM training
          ) t
          GROUP BY feature
        """)
  35. Apache Spark: online prediction on Spark Streaming:
        val testData = ssc.textFileStream(...).map(LabeledPoint.parse)
        testData.predict { case testDf =>
          // Explode features in input streams
          val testDf_exploded = ...
          val predictDf = testDf_exploded
            .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
            .select($"rowid", ($"weight" * $"value").as("value"))
            .groupBy("rowid").sum("value")
            .select($"rowid", sigmoid($"SUM(value)"))
          predictDf
        }
  36. Future development plan:
    ‣ word2vec
    ‣ Field-aware Factorization Machines stability improvements
    ‣ XGBoost
    ‣ Gradient boosting, LightGBM
    ‣ Hivemall on Kafka KSQL
    ‣ ...
  37. XGBoost with Hivemall (experimental):
        SELECT train_xgboost_classifier(features, label) as (model_id, model)
        FROM training;

        SELECT
          rowid,
          avg(predicted) as predicted
        FROM (
          -- predict with each model;
          -- join each test record with each model
          SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
          FROM xgboost_models CROSS JOIN testing
        ) t
        GROUP BY rowid;
  38. Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive. Takuya Kitazawa (@takuti), Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall