
What's New and Coming to Apache Hivemall 
#ACNA19


Takuya Kitazawa

September 12, 2019



Transcript

  1. What's New and Coming to Apache Hivemall 

    Building More Flexible Machine Learning Solution for

    Apache Hive and Spark


    Takuya Kitazawa @takuti 

    Makoto Yui @myui
    Apache Hivemall PPMCs


  2. Machine Learning in Query Language


  3. Q. Solve regression problem on massive data stored in data warehouse


  4. Practical experience in science and engineering
    ML theory Tool / Data model
    Scalability
    Q. Solve regression problem on massive data stored in data warehouse


  5. Done by ~10 lines of queries


  6. Machine Learning for everyone

    Open source query-based machine learning solution
    - Incubating since Sept 13, 2016
    - @ApacheHivemall
    - GitHub: apache/incubator-hivemall
    - Team: 6 PPMCs + 3 committers
    - Latest release: v0.5.2 (Dec 3, 2018)
    - Toward graduation:
    ✓ Community growth
    ✓ 1+ Apache releases
    ✓ Documentation improvements


  7. Introduction to Apache Hivemall
    What’s New in v0.5.2
    What’s Coming Next?


  8. Introduction to Apache Hivemall
    What’s New in v0.5.2
    What’s Coming Next?


  9. ‣ Data warehousing solution built on top of Apache Hadoop
    ‣ Efficiently access and analyze large-scale data via SQL-like interface, HiveQL
    - create table
    - select
    - join
    - group by
    - count()
    - sum()
    - …
    - order by
    - cluster by
    - …
    Apache Hive


  10. ‣ OSS project under Apache Software Foundation
    ‣ Scalable ML library implemented as Hive user-defined functions (UDFs)
    Apache Hivemall
    Three function types, illustrated on the slide with example tables:
    - UDF: row-wise transformation (column 1 → column 1’), e.g. l1_normalize()
    - UDAF: aggregation (column → scalar), e.g. rmse()
    - UDTF: tabular output (input columns → multiple output columns), e.g. train_regressor()


  11. Easy-to-use
    ML in SQL
    Scalable
    Runs in parallel on
    Hadoop ecosystem
    Multi-platform
    Hive, Spark, Pig
    Versatile
    Efficient, generic
    functions
    Apache Hivemall


  12. Use case #1: Enterprise Big Data analytics platform


  13. Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Hivemall makes ML simpler and handier for non-experts
    Anybody who knows SQL basics
    Deploy to production
    Easily try, save, share, and schedule
    via a simple interface, in a scalable manner


  14. Use case #2: Large-scale recommender systems

    Demo paper @ ACM RecSys 2018


  15. Recommendation <3 tabular form
    Input: purchase log and/or explicit ratings
    User | Item | Rating
    Tom | Laptop | 3 ★★★☆☆
    Jack | Coffee beans | 5 ★★★★★
    Mike | Watch | 1 ★☆☆☆☆
    … | … | …
    User | Bought item
    Tom | Laptop
    Jack | Coffee beans
    Mike | Watch
    … | …
    Output: top-3 recommendations per user
    User | Top-3 recommended items
    Tom | Headphone, USB charger, 4K monitor
    Jack | Mug, Coffee machine, Chocolate
    Mike | Ring, T-shirt, Bag
    … | …


  16. Use case #3: E-learning
    “New in Big Data” Machine Learning with SQL @ Udemy


  17. (image-only slide)

  18. Easy-to-use
    ML in SQL
    Scalable
    Runs in parallel on
    Hadoop ecosystem
    Multi-platform
    Hive, Spark, Pig
    Versatile
    Efficient, generic
    functions


  19. Example: Scalable Logistic Regression written in ~10 lines of queries
    Automatically runs in parallel on Hadoop


  20. Example: Table “purchase_history”
    http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html


  21. Preprocessing
    select
    array_concat( -- Concatenate features as a feature vector
    quantitative_features( -- Create quantitative features
    array("price"),
    price
    ),
    categorical_features( -- Create categorical features
    array("day of week", "gender", "category"),
    day_of_week, gender, category
    )
    )
    as features,
    label
    from
    purchase_history
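A minimal Python sketch of what these two helper UDFs produce: quantitative_features() emits "name:value" strings and categorical_features() emits "name#value" strings, which array_concat() merges into one feature vector. The row values below are made up for illustration:

```python
def quantitative_features(names, *values):
    # "name:value" strings for numeric columns
    return [f"{n}:{float(v)}" for n, v in zip(names, values)]

def categorical_features(names, *values):
    # "name#value" strings for categorical columns
    return [f"{n}#{v}" for n, v in zip(names, values)]

row = {"price": 600, "day of week": "Saturday", "gender": "male", "category": "Book"}
features = (quantitative_features(["price"], row["price"])
            + categorical_features(["day of week", "gender", "category"],
                                   row["day of week"], row["gender"], row["category"]))
print(features)
# ['price:600.0', 'day of week#Saturday', 'gender#male', 'category#Book']
```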


  22. Resulting table “training”


  23. SELECT
    train_classifier( -- train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
    ) as (feature, weight)
    FROM
    training
    Classification
    ‣ HingeLoss
    ‣ LogLoss (a.k.a. logistic loss)
    ‣ SquaredHingeLoss
    ‣ ModifiedHuberLoss
    Regression
    ‣ SquaredLoss
    ‣ QuantileLoss
    ‣ EpsilonInsensitiveLoss
    ‣ SquaredEpsilonInsensitiveLoss
    ‣ HuberLoss
    Supervised learning by unified function


  24. SELECT
    train_classifier( -- train_regressor(
    features,
    label,
    '-loss logloss -opt SGD -reg no -eta simple -total_steps ${total_steps}'
    ) as (feature, weight)
    FROM
    training
    Optimizer
    ‣ SGD
    ‣ AdaGrad
    ‣ AdaDelta
    ‣ ADAM
    Regularization
    ‣ L1
    ‣ L2
    ‣ ElasticNet
    ‣ RDA
    ‣ Iteration with learning rate control
    ‣ Mini-batch training
    ‣ Early stopping
    Supervised learning by unified function
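Conceptually, training with '-loss logloss -opt SGD -eta simple' is plain stochastic gradient descent on the logistic loss over sparse feature vectors, emitting learned weights as (feature, weight) rows. The Python sketch below illustrates only that core idea; it is not Hivemall's implementation, which adds the iteration control, mini-batching, and early stopping listed above:

```python
import math

def train_logress(rows, eta0=0.1, total_steps=100):
    """SGD with logistic loss over sparse [(feature, value)] rows.

    Conceptual sketch of '-loss logloss -opt SGD -eta simple',
    not Hivemall's actual implementation."""
    weights = {}
    for step in range(total_steps):
        features, label = rows[step % len(rows)]
        eta = eta0 / (1.0 + step / total_steps)   # "simple" learning-rate decay
        margin = sum(weights.get(f, 0.0) * v for f, v in features)
        p = 1.0 / (1.0 + math.exp(-margin))
        grad = p - label                          # d(logloss)/d(margin)
        for f, v in features:
            weights[f] = weights.get(f, 0.0) - eta * grad * v
    return weights                                # the "model table": feature -> weight

# Toy data: feature "x" is positively correlated with the label
rows = [([("x", 1.0)], 1), ([("x", -1.0)], 0)]
model = train_logress(rows)
```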


  25. Model = table
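Because the model is just a (feature, weight) table, prediction is also just relational algebra: join test features with the model table, sum weight × value, and pass the margin through a sigmoid. A toy Python equivalent (the feature names and weights are made up):

```python
import math

model = {"a": 0.5, "b": -0.25}           # the (feature, weight) "model table"

def predict(features):
    # LEFT OUTER JOIN on feature, SUM(weight * value), then sigmoid
    margin = sum(model.get(f, 0.0) * v for f, v in features)
    return 1.0 / (1.0 + math.exp(-margin))

p = predict([("a", 1.0), ("b", 2.0)])    # margin = 0.5 - 0.5 = 0.0, so p = 0.5
```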


  26. Classification
    ‣ Generic classifier
    ‣ Perceptron
    ‣ Passive Aggressive (PA, PA1, PA2)
    ‣ Confidence Weighted (CW)
    ‣ Adaptive Regularization of Weight Vectors (AROW)
    ‣ Soft Confidence Weighted (SCW)
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
    Regression
    ‣ Generic regressor
    ‣ PA Regression
    ‣ AROW Regression
    ‣ (Field-Aware) Factorization Machines
    ‣ RandomForest
    Classification and regression with a variety of algorithms


  27. Factorization Machines
    S. Rendle. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.
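The model from the Rendle paper cited above scores an instance as w0 + Σᵢ wᵢxᵢ + Σᵢ₍ⱼ ⟨vᵢ, vⱼ⟩xᵢxⱼ, and the pairwise term can be computed in O(k·n) rather than O(k·n²). A small Python sketch of that prediction (all parameter values below are illustrative):

```python
def fm_predict(x, w0, w, V):
    """Factorization Machines prediction:
    w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j,
    with the pairwise term computed in O(k*n) via
    0.5 * ((sum_i V_if x_i)^2 - sum_i (V_if x_i)^2) per factor f."""
    linear = w0 + sum(w.get(i, 0.0) * xi for i, xi in x.items())
    k = len(next(iter(V.values())))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        s2 = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s2)
    return linear + pairwise

# Illustrative parameters (k = 2 latent factors per feature)
x = {"a": 1.0, "b": 2.0}
w0, w = 0.1, {"a": 0.2, "b": 0.3}
V = {"a": [1.0, 0.0], "b": [0.5, 1.0]}
y = fm_predict(x, w0, w, V)  # linear 0.9 + pairwise <V_a,V_b>*x_a*x_b = 1.0
```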


  28. Factorization Machines


  29. RandomForest
    Training
    CREATE TABLE rf_model
    AS
    SELECT
    train_randomforest_classifier(
    features,
    label,
    '-trees 50 -seed 71' -- hyperparameters
    ) as (model_id, model_weight, model, var_importance, oob_errors, oob_tests)
    FROM
    training;


  30. RandomForest
    Model table


  31. RandomForest
    Export decision trees for visualization
    SELECT
    tree_export(model, "-type javascript", ...) as js,
    tree_export(model, "-type graphvis", ...) as dot
    FROM
    rf_model


  32. Easy-to-use
    ML in SQL
    Scalable
    Runs in parallel on
    Hadoop ecosystem
    Multi-platform
    Hive, Spark, Pig
    Versatile
    Efficient, generic
    functions


  33. Feature engineering
    - Feature hashing
    - Feature scaling (normalization, z-score)
    - Feature binning
    - TF-IDF vectorizer
    - Polynomial expansion
    - Amplifier
    Evaluation metrics
    - AUC, nDCG, log loss, precision, recall, …
    Array and maps
    - Concatenation
    - Intersection
    - Remove
    - Sort
    - Average
    - Sum
    - …
    Bit, compress, character encoding
    Efficient top-k query processing


  34. Efficient top-k retrieval
    Internally holds a bounded priority queue
    List top-2 items per user:
    item | user | score
    1 | B | 70
    2 | A | 80
    3 | A | 90
    4 | B | 60
    5 | A | 70
    … | … | …
    Window function — did not finish in 24 hrs for 20M users with ~1k items each:
    SELECT
      item, user, score, rank
    FROM (
      SELECT
        item, user, score,
        rank() over (PARTITION BY user ORDER BY score DESC)
          as rank
      FROM
        table
    ) t
    WHERE rank <= 2
    each_top_k — finished in 2 hrs:
    SELECT
      each_top_k(
        2, user, score,
        user, item -- output columns
      ) as (rank, score, user, item)
    FROM (
      SELECT * FROM table
      CLUSTER BY user
    ) t
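The reason each_top_k scales is the bounded priority queue: per user it keeps only k candidates in memory instead of ranking everything. A Python sketch of the same idea (the function name and output columns mirror the slide, not Hivemall's internals):

```python
import heapq
from collections import defaultdict

def each_top_k(k, rows):
    """Emit (rank, score, user, item) for the top-k items per user.

    Keeps a size-k min-heap per user (a bounded priority queue),
    so memory stays O(users * k) regardless of item count."""
    heaps = defaultdict(list)
    for user, item, score in rows:
        h = heaps[user]
        if len(h) < k:
            heapq.heappush(h, (score, item))
        elif score > h[0][0]:
            heapq.heapreplace(h, (score, item))  # evict the current minimum
    out = []
    for user, h in heaps.items():
        for rank, (score, item) in enumerate(sorted(h, reverse=True), start=1):
            out.append((rank, score, user, item))
    return out

# The table from the slide, as (user, item, score) rows
rows = [("B", 1, 70), ("A", 2, 80), ("A", 3, 90), ("B", 4, 60), ("A", 5, 70)]
result = each_top_k(2, rows)
```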


  35. Recommendation with Hivemall
    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
    - Euclid
    - Cosine
    - Jaccard
    - Angular
    Efficient item-based collaborative filtering
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)
    Matrix completion
    ‣ Matrix Factorization
    ‣ Factorization Machines


  36. Natural Language Processing — English, Japanese and Chinese tokenizer, word N-grams, …
    select tokenize('Hello, world!')
    → ["Hello", "world"]
    select singularize('apples')
    → apple
    Sketching
    SELECT count(distinct user_id) FROM t → SELECT approx_count_distinct(user_id) FROM t
    Geospatial functions
    SELECT
      map_url(lat, lon, zoom) as osm_url,
      map_url(lat, lon, zoom, '-type googlemaps') as gmap_url
    FROM (
      SELECT 51.51202 as lat, 0.02435 as lon, 17 as zoom
      UNION ALL
      SELECT 51.51202 as lat, 0.02435 as lon, 4 as zoom
    ) t
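approx_count_distinct trades exactness for small, constant memory via a cardinality sketch (a HyperLogLog-style structure). The simpler linear-counting sketch below illustrates the same trade-off in Python; it is an illustrative stand-in, not Hivemall's algorithm:

```python
import hashlib
import math

def linear_counting(values, m=8192):
    """Estimate the number of distinct values with an m-bit bitmap.

    Each value sets one hashed bit; the estimate is -m * ln(V),
    where V is the fraction of bits still zero."""
    bitmap = bytearray(m // 8)
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % m
        bitmap[h // 8] |= 1 << (h % 8)
    zeros = sum(8 - bin(b).count("1") for b in bitmap)
    return -m * math.log(zeros / m)

# 100k rows but only 1,000 distinct user ids
est = linear_counting(i % 1000 for i in range(100_000))
```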


  37. Anomaly / Change-point detection
    ‣ Local outlier factor (k-NN-based technique)
    ‣ ChangeFinder
    ‣ Singular Spectrum Transformation
    Clustering / Topic modeling
    ‣ Latent Dirichlet Allocation
    ‣ Probabilistic Latent Semantic Analysis


  38. Easy-to-use
    ML in SQL
    Scalable
    Runs in parallel on
    Hadoop ecosystem
    Multi-platform
    Hive, Spark, Pig
    Versatile
    Efficient, generic
    functions


  39. (image-only slide)

  40. CREATE TABLE lr_model
    AS
    SELECT
    feature,
    avg(weight) as weight
    FROM (
    SELECT
    logress(features, label, "-total_steps ${total_steps}") as (feature, weight)
    FROM
    training
    ) t
    GROUP BY feature;
    Apache Hive


  41. Apache Pig
    a = load 'a9a.train'
    as (rowid:int, label:float, features:{(featurepair:chararray)});
    b = foreach a generate flatten(
    logress(features, label, '-total_steps ${total_steps}')
    ) as (feature, weight);
    c = group b by feature;
    d = foreach c generate group, AVG(b.weight);
    store d into 'a9a_model';


  42. Apache Spark
    DataFrames
    val trainDf =
    spark.read.format("libsvm").load("a9a.train")
    val modelDf =
    trainDf.train_logregr(append_bias($"features"), $"label")
    .groupBy("feature").avg("weight")
    .toDF("feature", "weight")
    .cache


  43. context = HiveContext(sc)
    context.sql("
    SELECT
    feature,
    avg(weight) as weight
    FROM (
    SELECT
    train_logregr(features, label) as (feature, weight)
    FROM
    training
    ) t
    GROUP BY feature
    ")
    Apache Spark
    Query in HiveContext


  44. Apache Spark
    Online prediction on Spark Streaming
    val testData =
    ssc.textFileStream(...).map(LabeledPoint.parse)
    testData.predict { case testDf =>
    // Explode features in input streams
    val testDf_exploded = ...
    val predictDf = testDf_exploded
    .join(model, testDf_exploded("feature") === model("feature"), "LEFT_OUTER")
    .select($"rowid", ($"weight" * $"value").as("value"))
    .groupBy("rowid").sum("value")
    .select($"rowid", sigmoid($"SUM(value)"))
    predictDf
    }


  45. Hivemall meets PySpark
    from pyspark.sql import SparkSession
    spark = SparkSession \
    .builder \
    .master('local[*]') \
    .config('spark.jars',
    'hivemall-spark2.3-0.5.2-incubating-with-dependencies.jar') \
    .enableHiveSupport() \
    .getOrCreate()


  46. (image-only slide)

  47. Introduction to Apache Hivemall
    What’s New in v0.5.2
    What’s Coming Next?


  48. (image-only slide)

  49. The Brickhouse Hive UDF collection has been merged into Hivemall
    Welcome Jerome to the team!


  50. WITH high_rated_items as (
    SELECT bloom(itemid) as items
    FROM (
    SELECT itemid
    FROM ratings
    GROUP BY itemid
    HAVING avg(rating) >= 4.0
    ) t
    )
    SELECT
    l.rating,
    count(distinct l.userid) as cnt
    FROM
    ratings l
    CROSS JOIN high_rated_items r
    WHERE
    bloom_contains(r.items, l.itemid)
    GROUP BY
    l.rating;
    Bloom Filters: Probabilistic data structures
    Build a Bloom filter (i.e., a probabilistic set) of high-rated items
    Check whether each rated item is in the Bloom filter, and see its actual ratings:
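Under the hood, a Bloom filter like the one bloom() builds is a bit array probed by several hash functions, so membership checks can return false positives but never false negatives — acceptable for a pre-filter like the query above. A minimal Python sketch (bit-array size, hash scheme, and item ids are chosen arbitrarily):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array.

    contains() may return false positives, never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # derive k independent positions by salting the hash input
        for i in range(self.k):
            yield int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item):
        for h in self._positions(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def contains(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._positions(item))

# Insert the "high-rated items" (ids are made up)
bf = BloomFilter()
for itemid in (10, 42, 99):
    bf.add(itemid)
```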


  51. Working with JSON
    SELECT
    from_json(
    '{ "location" : { "Country" : "Japan" , "City" : "Osaka" } }', -- json
    'map<string,string>', -- return type
    array('location') -- key
    )
    to_json()


  52. Array / Vector
    ‣ Append
    ‣ Element-at
    ‣ Union
    ‣ First/last element
    ‣ Flatten
    ‣ Vector add/dot
    Map
    ‣ Convert into array of key-value pairs
    ‣ Filter elements by keys
    Sanity check
    ‣ Assert
    ‣ Raise error
    Misc
    ‣ Try-cast
    ‣ Sessionize records by time
    ‣ Moving average
    More utility functions


  53. Field-Aware Factorization Machines


  54. Field-Aware Factorization Machines
    Y. Juan, Y. Zhuang, W. Chin, and C. Lin. Field-Aware Factorization Machines for CTR Prediction. RecSys 2016.


  55. SELECT
    train_ffm(
    features,
    label,
    '-init_v random -max_init_value 0.5 -classification -iterations 15 -factors 4 -eta 0.2 -optimizer adagrad -lambda 0.00002'
    )
    FROM (
    SELECT
    features, label
    FROM
    train_vectorized
    CLUSTER BY rand(1)
    ) t


  56. Introduction to Apache Hivemall
    What’s New in v0.5.2
    What’s Coming Next?


  57. HIVEMALL-118: word2vec
    SELECT
    train_word2vec(
    r.negative_table,
    l.words,
    "-n {n} -win 5 -neg 15 -iters 5 -model cbow"
    )
    FROM
    train_docs l
    CROSS JOIN negative_table r


  58. HIVEMALL-126: Multinomial Logistic Regression
    SELECT
    train_maxent_classifier(features, label, "-attrs C,Q,Q")
    FROM
    train


  59. XGBoost with Hivemall (experimental)
    SELECT train_xgboost_classifier(features, label) as (model_id, model)
    FROM training
    SELECT
    rowid,
    avg(predicted) as predicted
    FROM ( -- predict with each model
    SELECT xgboost_predict(rowid, features, model_id, model) AS (rowid, predicted)
    -- join each test record with each model
    FROM xgboost_models CROSS JOIN testing
    ) t
    GROUP BY
    rowid


  60. WE NEED YOUR CONTRIBUTION
    From documentation and utility UDFs, to state-of-the-art ML algorithms and toolkits
    github.com/apache/incubator-hivemall


  61. Installation
    http://hivemall.incubator.apache.org/userguide/getting_started/installation.html
    $ hive
    add jar /path/to/hivemall-all-VERSION.jar;
    source /path/to/define-all.hive;
    Docker image, documentation and step-by-step tutorial are available


  62. What's New and Coming to Apache Hivemall 

    Building More Flexible Machine Learning Solution for

    Apache Hive and Spark


    Takuya Kitazawa @takuti 

    Makoto Yui @myui
    Apache Hivemall PPMCs
