
Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive

Takuya Kitazawa
September 22, 2018


Transcript

  1. Apache Hivemall
    Query-Based Handy, Scalable Machine Learning on Hive
    Takuya Kitazawa @takuti
    Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall


  2. Machine Learning in Query Language


  3. BigQuery ML at Google I/O 2018
    https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html


  4. Q. Solve a regression problem on massive data stored in a data warehouse


  5. Q. Solve a regression problem on massive data stored in a data warehouse
    What it takes: practical experience in science and engineering,
    theoretical understanding, tooling (Python?), and scalability


  6. Done in fewer than 10 lines of queries
    https://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html


  7. Machine Learning for everyone


  8. Open source query-based machine learning solution
    github.com/apache/incubator-hivemall


  9. Hivemall in real-world ML workflow
    Why Hivemall is notably preferable, and who benefits from it


  10. Real-world ML workflow is for experts
    Data scientists and ML engineers who have a solid relevant background:
    Problem (what you want to “predict”) → Hypothesis & Proposal
    → Historical data → Cleanse data → Build machine learning model
    → Evaluate → Deploy to production


  11. Real-world ML workflow is for experts
    Data scientists and ML engineers who have a solid relevant background:
    Problem (what you want to “predict”) → Hypothesis & Proposal
    → Historical data → Cleanse data → Build machine learning model
    → Evaluate → Deploy to production
    Do you really need such complexity and flexibility?


  12. (image slide)

  13. (image slide)

  14. Hivemall makes ML simpler and handier for non-experts
    Anybody who knows SQL basics can easily try, save, share, and schedule
    the same workflow via a simple interface, in a scalable manner:
    Problem (what you want to “predict”) → Hypothesis & Proposal
    → Historical data → Cleanse data → Build machine learning model
    → Evaluate → Deploy to production


  15. How can we organize query fragments?
    Every step (Extract, Filter, Interpolate, Normalize, …) becomes its own query:
    Train data → queries → Get features → Train
    Test data → queries → Get features → Predict → Accuracy


  16. Tip: Combining with a workflow engine
    e.g., Digdag lets you define a highly dependent ML workflow as a YAML file:
    Train data → queries → Get features → Train
    Test data → queries → Get features → Predict → Accuracy
    https://www.digdag.io/
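    A minimal sketch of what such a workflow file could look like, assuming
    Digdag's td> operator for Treasure Data; the task names and query files
    are illustrative, not taken from the deck:

    # ml_workflow.dig (hypothetical)
    +extract_features:
      td>: queries/extract_features.sql  # Extract / Filter / Interpolate / Normalize
      create_table: training

    +train:
      td>: queries/train_regressor.sql   # Hivemall train_regressor() query
      create_table: lr_model

    +predict:
      td>: queries/predict.sql           # join test features with the model table
      create_table: predictions

    +evaluate:
      td>: queries/evaluate.sql          # e.g., compute accuracy / RMSE with Hivemall
      create_table: accuracy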


  17. Query-based ML: New option outside of common ML toolkit
    Your data-related work can be simpler
    OSS solution <3 workflow engine


  18. Introduction to query-based machine learning with
    Apache Hivemall


  19. Apache Hive
    ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via a SQL-like interface, HiveQL
    - create table
    - select
    - join
    - group by
    - count()
    - sum()
    - order by
    - cluster by
    - …
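    For instance, a typical HiveQL aggregation looks like this (the table and
    columns here are hypothetical):

    -- purchases and revenue per category
    SELECT category, count(*) as purchases, sum(price) as revenue
    FROM purchase_history
    GROUP BY category
    ORDER BY revenue DESC;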


  20. Apache Hivemall
    ‣ OSS project under the Apache Software Foundation since 2017
    ‣ Scalable ML library implemented as Hive user-defined functions (UDFs)
    UDF: maps column 1 to column 1’ row by row, e.g., l1_normalize()
    UDAF (aggregation): reduces column 1 to a scalar, e.g., rmse()
    UDTF (tabular): expands column 1 into a table (column 2, column 3), e.g., train_regressor()
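    A sketch of the three kinds in action; the function names are real Hivemall
    functions, while the table and column names are assumed:

    SELECT l1_normalize(features) FROM training;        -- UDF: one row in, one row out
    SELECT rmse(predicted, actual) FROM predictions;    -- UDAF: many rows in, one scalar out
    SELECT train_regressor(features, label)
      as (feature, weight) FROM training;               -- UDTF: many rows in, a table out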


  21. Apache Hivemall
    Easy-to-use: ML in SQL
    Scalable: runs in parallel on the Hadoop ecosystem
    Multi-platform: Hive, Spark, Pig
    Versatile: efficient, generic functions


  22. Easy-to-use and scalable


  23. Example: Scalable Logistic Regression written in ~10 lines of queries
    Automatically runs in parallel on Hadoop
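    The query on this slide has the same shape as the Hive example shown later
    (slide 51): a Hivemall UDTF learns (feature, weight) pairs in parallel map
    tasks, and a GROUP BY averages the weights into the final model. A sketch,
    with the table name assumed:

    CREATE TABLE lr_model AS
    SELECT
      feature,
      avg(weight) as weight
    FROM (
      SELECT logress(features, label, '-total_steps ${total_steps}') as (feature, weight)
      FROM training
    ) t
    GROUP BY feature;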


  24. Space-efficient feature representation in Hivemall
    A feature is written as index:value, or as index alone, where index is
    INT, BIGINT, or TEXT, and value is FLOAT:
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ index can be TEXT: price:600, size:2.5
    ‣ index-only means value = 1.0 (e.g., categorical): gender#male = gender#male:1.0


  25. Feature vector = array of string
    Array of quantitative features (index:value):
    select quantitative_features(array("price", "size"), 600, 2.5)
    ["price:600.0", "size:2.5"]
    Array of categorical features (index#value):
    select categorical_features(array("gender", "category"), "male", "book")
    ["gender#male", "category#book"]
    * NULL is automatically omitted
    Hivemall internally does one-hot encoding (e.g., book → 1, 0, 0, …)


  26. Feature hashing: Approximation improves scalability
    Simplify the index of quantitative (index:value) and categorical (index#value) features:
    select feature_hashing(array("price:600", "category#book"))
    ["14142887:600", "10413006"]
    (Default upper limit: 2^24 + 1 = 16777217)


  27. Example: Table “purchase_history”
    http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html
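    A sketch of the table's shape, with column types inferred from the queries
    on the following slides:

    CREATE TABLE purchase_history (
      day_of_week STRING,
      gender STRING,
      price DOUBLE,
      category STRING,
      label INT  -- the supervised target (see the linked tutorial)
    );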


  28. select
    array_concat( -- Concatenate features as a feature vector
    quantitative_features( -- Create quantitative features
    array("price"),
    price
    ),
    categorical_features( -- Create categorical features
    array("day of week", "gender", "category"),
    day_of_week, gender, category
    )
    )
    as features,
    label
    from
    purchase_history


  29. Resulting table “training”


  30. select
    add_bias(
    array_concat( -- Concatenate features as a feature vector
    quantitative_features( -- Create quantitative features
    array("price"),
    price
    ),
    categorical_features( -- Create categorical features
    array("day of week", "gender", "category"),
    day_of_week, gender, category
    )
    )
    )
    as features,
    label
    from
    purchase_history


  31. select
    feature_hashing(
    add_bias(
    array_concat( -- Concatenate features as a feature vector
    quantitative_features( -- Create quantitative features
    array("price"),
    price
    ),
    categorical_features( -- Create categorical features
    array("day of week", "gender", "category"),
    day_of_week, gender, category
    )
    )
    )
    )
    as features,
    label
    from
    purchase_history


  32. SELECT
    train_classifier( -- train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
    ) as (feature, weight)
    FROM
    training
    Classification
    ‣ HingeLoss
    ‣ LogLoss (a.k.a. logistic loss)
    ‣ SquaredHingeLoss
    ‣ ModifiedHuberLoss
    Regression
    ‣ SquaredLoss
    ‣ QuantileLoss
    ‣ EpsilonInsensitiveLoss
    ‣ SquaredEpsilonInsensitiveLoss
    ‣ HuberLoss
    Supervised learning with a unified function
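    Because the UDTF emits (feature, weight) rows, the usual pattern is to
    materialize them as a model table (a sketch; the table names are assumed):

    CREATE TABLE classifier_model AS
    SELECT
      train_classifier(features, label, '-loss logloss -opt SGD') as (feature, weight)
    FROM training;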


  33. SELECT
    train_classifier( -- train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
    ) as (feature, weight)
    FROM
    training
    Optimizer
    ‣ SGD
    ‣ AdaGrad
    ‣ AdaDelta
    ‣ ADAM
    Regularization
    ‣ L1
    ‣ L2
    ‣ ElasticNet
    ‣ RDA
    ‣ Iteration with learning rate control
    ‣ Mini-batch training
    ‣ Early stopping
    Supervised learning with a unified function


  34. Model = table
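    Since the trained model is just a Hive table of (feature, weight) pairs,
    prediction is a join plus a weighted sum. A sketch, assuming a
    testing_exploded table holding one (rowid, feature, value) row per feature,
    and the lr_model table from earlier:

    SELECT
      t.rowid,
      sigmoid(sum(m.weight * t.value)) as probability
    FROM testing_exploded t
    LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
    GROUP BY t.rowid;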


  35. Classification
    ‣ Generic classifier
    ‣ Perceptron
    ‣ Passive Aggressive (PA, PA1, PA2)
    ‣ Confidence Weighted (CW)
    ‣ Adaptive Regularization of Weight Vectors (AROW)
    ‣ Soft Confidence Weighted (SCW)
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
    Regression
    ‣ Generic regressor
    ‣ PA Regression
    ‣ AROW Regression
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
    Classification and regression with a variety of algorithms


  36. Factorization Machines
    S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
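    For reference, the second-order FM model from the cited paper predicts, in
    LaTeX notation,

    \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

    where each feature i has a low-dimensional latent vector v_i, so pairwise
    feature interactions can be estimated even on very sparse data.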


  37. Factorization Machines


  38. RandomForest
    Training
    CREATE TABLE rf_model
    AS
    SELECT
    train_randomforest_classifier(
    features,
    label,
    '-trees 50 -seed 71' -- hyperparameters
    ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
    FROM
    training;


  39. RandomForest
    Model table


  40. RandomForest
    Prediction
    CREATE TABLE rf_predicted
    as
    SELECT
    rowid,
    rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
    FROM (
    SELECT
    t.rowid,
    m.model_weight,
    tree_predict(m.model_id, m.model, t.features, ${classification}) as predicted
    FROM
    testing t
    CROSS JOIN
    rf_model m
    ) t1
    GROUP BY
    rowid


  41. RandomForest
    Export decision trees for visualization
    SELECT
    tree_export(model, "-type javascript", ...) as js,
    tree_export(model, "-type graphvis", ...) as dot
    FROM
    rf_model


  42. Versatile


  43. Feature engineering
    - Feature hashing
    - Feature scaling (normalization, z-score; see the sketch below)
    - Feature binning
    - TF-IDF vectorizer
    - Polynomial expansion
    - Amplifier
    Evaluation metrics
    - AUC, nDCG, log loss, precision, recall, …
    Array and map utilities
    - Concatenation, intersection, remove, sort, average, sum, …
    Bit operations, compression, and character encoding
    Efficient top-k query processing
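    For example, min-max feature scaling is a plain UDF call. A sketch using
    Hivemall's rescale(); the table t(price) is hypothetical:

    SELECT rescale(price, stats.min_price, stats.max_price) as scaled_price
    FROM t
    CROSS JOIN (
      SELECT min(price) as min_price, max(price) as max_price FROM t
    ) stats;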


  44. Efficient top-k retrieval
    Internally holds a bounded priority queue
    List the top-2 items per user from a table (item, user, score):
    item  user  score
    1     B     70
    2     A     80
    3     A     90
    4     B     60
    5     A     70
    …     …     …
    With a window function (did not finish within 24 hrs. for 20M users with ~1k items each):
    SELECT
      item, user, score, rank
    FROM (
      SELECT
        item, user, score,
        rank() over (PARTITION BY user ORDER BY score DESC)
          as rank
      FROM
        table
    ) t
    WHERE rank <= 2
    With each_top_k (finishes in 2 hrs.):
    SELECT
      each_top_k(
        2, user, score,
        user, item -- output columns
      ) as (rank, score, user, item)
    FROM (
      SELECT * FROM table
      CLUSTER BY user
    ) t


  45. Recommendation <3 tabular form
    Input (explicit ratings):
    User  Item          Rating
    Tom   Laptop        3 (★★★☆☆)
    Jack  Coffee beans  5 (★★★★★)
    Mike  Watch         1 (★☆☆☆☆)
    …     …             …
    or implicit feedback:
    User  Bought item
    Tom   Laptop
    Jack  Coffee beans
    Mike  Watch
    …     …
    Output:
    User  Top-3 recommended items
    Tom   Headphone, USB charger, 4K monitor
    Jack  Mug, Coffee machine, Chocolate
    Mike  Ring, T-shirt, Bag
    …     …


  46. Recommendation with Hivemall
    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
    - Euclid
    - Cosine
    - Jaccard
    - Angular
    Efficient item-based collaborative filtering
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion
    ‣ Matrix Factorization
    ‣ Factorization Machines


  47. Natural Language Processing — English, Japanese and Chinese tokenizers, word N-grams, …
    ‣ select tokenize('Hello, world!') → ["Hello", "world"]
    ‣ select singularize('apples') → apple
    Sketching (approximate distinct counts):
    SELECT count(distinct user_id) FROM t  ≈  SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions:
    SELECT
      map_url(lat, lon, zoom) as osm_url,
      map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
    FROM (
      SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
      UNION ALL
      SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
    ) t


  48. Anomaly / Change-point detection
    ‣ Local outlier factor (k-NN-based technique)
    ‣ ChangeFinder
    ‣ Singular Spectrum Transformation
    Clustering / Topic modeling
    ‣ Latent Dirichlet Allocation
    ‣ Probabilistic Latent Semantic Analysis


  49. Multi-platform


  50. (image slide)

  51. CREATE TABLE lr_model
    AS
    SELECT
    feature,
    avg(weight) as weight
    FROM (
    SELECT
    logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
    FROM
    training
    ) t
    GROUP BY feature;
    Apache Hive


  52. Apache Pig
    a = load 'a9a.train'
    as (rowid:int, label:float, features:{(featurepair:chararray)});
    b = foreach a generate flatten(
    logress(features, label, '-total_steps ${total_steps}')
    ) as (feature, weight);
    c = group b by feature;
    d = foreach c generate group, AVG(b.weight);
    store d into 'a9a_model';


  53. Apache Spark
    DataFrames
    val trainDf =
    spark.read.format("libsvm").load("a9a.train")
    val modelDf =
    trainDf.train_logregr(append_bias($"features"), $"label")
    .groupBy("feature").avg("weight")
    .toDF("feature", "weight")
    .cache


  54. Apache Spark
    Query in HiveContext
    context = HiveContext(sc)
    context.sql("""
    SELECT
      feature,
      avg(weight) as weight
    FROM (
      SELECT
        train_logregr(features, label) as (feature, weight)
      FROM
        training
    ) t
    GROUP BY feature
    """)


  55. Apache Spark
    Online prediction on Spark Streaming
    val testData =
    ssc.textFileStream(...).map(LabeledPoint.parse)
    testData.predict { case testDf =>
    // Explode features in input streams
    val testDf_exploded = ...
    val predictDf = testDf_exploded
    .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
    .select($"rowid", ($"weight" * $"value").as("value"))
    .groupBy("rowid").sum("value")
    .select($"rowid", sigmoid($"SUM(value)"))
    predictDf
    }


  56. Future development plan
    ‣ word2vec
    ‣ Field-aware factorization machines stability improvements
    ‣ XGBoost
    ‣ Gradient boosting, LightGBM
    ‣ Hivemall on Kafka KSQL
    ‣ …


  57. XGBoost with Hivemall (experimental)
    CREATE TABLE xgboost_models AS
    SELECT train_xgboost_classifier(features, label) as (model_id, model)
    FROM training;
    SELECT
      rowid,
      avg(predicted) as predicted
    FROM ( -- predict with each model
      SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
      -- join each test record with each model
      FROM xgboost_models CROSS JOIN testing
    ) t
    GROUP BY
      rowid


  58. Installation
    http://hivemall.incubator.apache.org/userguide/getting_started/installation.html
    $ hive
    add jar /path/to/hivemall-all-VERSION.jar;
    source /path/to/define-all.hive;
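    To verify the setup, you can call Hivemall's version function:

    hive> select hivemall_version();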



  60. github.com/apache/incubator-hivemall
    Docker image, documentation and step-by-step tutorial are available


  61. Making machine learning easy, scalable, sharable, and clean
    with query language and Apache Hivemall


  62. Apache Hivemall
    Query-Based Handy, Scalable Machine Learning on Hive
    Takuya Kitazawa @takuti
    Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall
