Mbed Connect USA 2018 Workshop: Introduction to Data Science

Takuya Kitazawa

October 22, 2018

Transcript

  1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
    Introduction to Data Science
    Takuya Kitazawa | @takuti
    Data Science Engineer at Arm Treasure Data / Committer of Apache Hivemall

  2.
    1. Understand what the Arm Treasure Data Customer Data Platform is [5 mins]
       Data management layer of the Arm Pelion IoT platform
    2. Learn how machine learning and data science work [5 mins]
       Capture characteristics of historical data and predict unseen results
    3. See every step of a real-world data science workflow on TD [50 mins]
       IoT-ish sample scenario based on real-life environmental data from the City of Chicago

  3. Enterprise Customer Data Platform on big data infrastructure

  6.
    ‣ Predictive analytics on UI: for non-technical people like marketers
    ‣ Data science in query language: for everyone who knows SQL basics
    ‣ Integration with third-party ML toolkit

  7. Data science in query language
    Handy, flexible way to leverage machine learning

  9. Data analytics in query language

  10. Difference between “Presto” and “Hive”
    Presto
    ‣ Lightweight and interactive data access
    ‣ Fast ↔ Not suited for batch processing on massive data
    Hive
    ‣ Heavy data processing tasks like daily batch
    ‣ Slow ↔ Can process massive records at once

  11. How to analyze your data on Treasure Data at scale
    Data
    SQL (SELECT * FROM data …): heavy or lightweight
    + ML with Apache Hivemall
    3rd-party tools (e.g., visualization)
    Schedule: Treasure Workflow (a.k.a. Digdag)

  12. TD’s ML capability: Apache Hivemall
    ‣ Scalable ML library implemented as Hive UDFs
    ‣ OSS project under Apache Software Foundation
    ‣ TD bundles Hivemall and has 3 developers (original creator + 2 core committers)
    https://github.com/apache/incubator-hivemall
    Easy-to-use: ML in SQL
    Scalable: Runs in parallel on Hadoop ecosystem
    Versatile: Efficient, generic functions

  13. What ML internally does: Learning from data
    Historical data and problem (e.g., purchase log and # of sales prediction)
    → Model: characteristics of historical data

  14. What ML internally does: Predicting unforeseen results
    Unforeseen data → Model (characteristics of historical data) → Prediction result

  15. Real-world ML & data science workflow for experts
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  16. Hivemall makes it more handy
    Automatically runs in parallel on Hadoop

  17. Query-based simple, scalable data science workflow on TD
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production
    Easily try, save, share, schedule via simple I/F in scalable manner

  18. Treasure Workflow: Manage highly-dependent query fragments
    Queries: Extract → Filter → Interpolate → Normalize → …
    Train data: Get features → Train (one query per step)
    Test data: Get features → Predict (one query per step)
    Accuracy
    https://www.digdag.io/

  19. Define your data science workflow in YAML format
    _export:
      !include : config/general.yml
      td:
        engine: hive

    +prepare:
      call>: common/prepare_data.dig

    +main:
      +logress_train:
        td>: queries/logress_train.sql
        create_table: logress_model

      +compute_downsampling_rate:
        td>: queries/downsampling_rate.sql
        engine: presto
        store_last_results: true

      +logress_predict:
        td>: queries/logress_predict.sql
        create_table: prediction

    +evaluate:
      td>: queries/evaluate.sql
      store_last_results: true

    +show_accuracy:
      echo>: "Logloss (smaller is better): ${td.last_results.logloss}"

  20. Visually check progress on console

  21. Hivemall documentation
    http://hivemall.incubator.apache.org/userguide/
    Step-by-step ML on Hivemall tutorial
    http://hivemall.incubator.apache.org/userguide/supervised_learning/tutorial.html
    Treasure Data ML workflow examples
    https://github.com/treasure-data/workflow-examples/tree/master/machine-learning

  22. Demo and hands-on
    Step-by-step guide to running data science workflow on TD

  23. https://console.treasuredata.com/users/sign_in

  24.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  25. Sample scenario: City has historical energy benchmarking data
    https://data.cityofchicago.org/Environment-Sustainable-Development/Chicago-Energy-Benchmarking/xq83-jr8c
    https://www.cityofchicago.org/city/en/progs/env/building-energy-benchmarking---transparency.html

  26. Unify a variety of data on a single platform
    https://console.treasuredata.com/app/integrations/catalog
    Original proprietary data source

  27. [TASK] Upload a dump of the benchmarking result to TD

  28.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  29. ML problems Hivemall can solve
    Classification
    - Binary: Purchase or not / Spam detection
    - Multi-class: Tomorrow’s weather / This user’s generation
    Regression
    - Tomorrow’s temperature / Next month’s sales / This user’s income
    Recommendation
    - Customers who viewed this item also viewed…

  30. ML problems Hivemall can solve
    Anomaly detection
    - Find exceptionally high error rate from time series data sent by IoT device
    Natural language processing
    - Tokenize sentence and extract keywords
    Clustering
    - Grouping users based on their similarities

  31. Example: ML-based customer segmentation
    1. Predict probability of churn (i.e., binary classification)
    2. Aggressively reach out to “likely to churn” customers
    https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix (Japanese)
    OISIX’s data
    ‣ Source: Web, Mobile
    ‣ Customer attr., Behavior on web, Complaint log, Signed-up services
    ‣ Actions (direct): Point, Call
    ‣ Actions (indirect): Guide to success, UI

  32. Our goal: Beyond rating and encouragement
    Predict future energy consumption (electricity use)
    https://www.cityofchicago.org/city/en/depts/mayor/supp_info/chicago-energy-benchmarking/Chicago_Energy_Benchmarking_Beyond_Benchmarking.html

  33.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  34. [TASK] Understand your data with Presto ad-hoc queries
    ‣ What each column means
      https://data.cityofchicago.org/Environment-Sustainable-Development/Chicago-Energy-Benchmarking/xq83-jr8c
    ‣ Total number of records
    ‣ Benchmarking time period and frequency
    ‣ Distribution of different community_area and primary_property_type
    ‣ Max and min values in num_of_buildings and electricity_use__kbtu_ columns
    ‣ Missing value rate in each *_use__kbtu_ (kBtu; thousand British thermal units) column
    Presto aggregation functions: https://prestodb.io/docs/current/functions/aggregate.html
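
The checks listed in this task (record count, value distribution, min/max, missing value rate) can be sketched outside of Presto as well. The following is a minimal illustration in plain Python over a few made-up rows; the rows, the `summarize` helper, and the `LOOP` area value are assumptions for illustration only, not part of the workshop material.

```python
from collections import Counter

# Toy rows standing in for the energy benchmarking table (values are made up).
rows = [
    {"community_area": "WEST TOWN", "num_of_buildings": 1, "electricity_use__kbtu_": 100.0},
    {"community_area": "WEST TOWN", "num_of_buildings": 3, "electricity_use__kbtu_": None},
    {"community_area": "LOOP", "num_of_buildings": 2, "electricity_use__kbtu_": 250.0},
    {"community_area": "LOOP", "num_of_buildings": 1, "electricity_use__kbtu_": None},
]


def summarize(rows, category_col, numeric_col, nullable_col):
    """Hypothetical helper: the basic statistics an ad-hoc query would return."""
    values = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    missing = sum(1 for r in rows if r[nullable_col] is None)
    return {
        "total": len(rows),                                          # COUNT(1)
        "distribution": Counter(r[category_col] for r in rows),      # GROUP BY + COUNT
        "min": min(values),                                          # MIN(col)
        "max": max(values),                                          # MAX(col)
        "missing_rate": missing / len(rows),                         # share of NULLs
    }


summary = summarize(rows, "community_area", "num_of_buildings", "electricity_use__kbtu_")
print(summary)
```

On the real table, each dictionary entry corresponds to one aggregate in a single `SELECT`.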

  35.
    $ pip install pandas-td

    import pandas_td as td

    con = td.connect(apikey=...,
                     endpoint=...)
    presto = td.create_engine('presto:mydb', con=con)
    hive = td.create_engine('hive:mydb', con=con)
    df = td.read_td('SELECT COUNT(1) FROM www_access', presto)

    https://github.com/treasure-data/pandas-td

  36. Exploratory data analysis (EDA) with visualization

  37.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  38. Key: Feature engineering
    Data: Historical energy consumption at buildings
    Problem: Predict future electricity use

    | ID     | Community area  | Property type | … | Gross floor area | Built year | … | Usage (kBtu) |
    |--------|-----------------|---------------|---|------------------|------------|---|--------------|
    | 100001 | WEST TOWN       | Hospital      | … | 309056           | 1928       | … | 21470037     |
    | 100256 | ARCHER HEIGHTS  | K-12 School   | … | 447330           | 1990       | … | 35792767     |
    | …      | …               | …             | … | …                | …          | … | …            |
    | 250150 | NEAR NORTH SIDE | Office        | … | 335281           | 1912       | … | 24220915     |

    1 kWh × 3.412 = 3.412 kBtu
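
For readers who want to move between the two energy units, here is a small conversion sketch. It assumes the standard factor of about 3412.14 Btu per kWh (roughly 3.412 kBtu per kWh); the function names are illustrative.

```python
# Approximate standard conversion factor: 1 kWh is about 3412.14 Btu.
BTU_PER_KWH = 3412.14


def kwh_to_kbtu(kwh: float) -> float:
    """Convert kilowatt-hours to thousand British thermal units."""
    return kwh * BTU_PER_KWH / 1000.0


def kbtu_to_kwh(kbtu: float) -> float:
    """Convert thousand British thermal units to kilowatt-hours."""
    return kbtu * 1000.0 / BTU_PER_KWH


print(round(kwh_to_kbtu(1.0), 3))
# Usage value of the first sample building, expressed in kWh.
print(round(kbtu_to_kwh(21470037)))
```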

  39. Different types of features
    ‣ Categorical feature: Community area, Property type
    ‣ Quantitative feature: Gross floor area
    ‣ Timestamp / Year: Built year (need to convert, e.g., to elapsed years)
    ‣ Numeric objective (solution to problem): Usage (kBtu)

    | ID     | Community area  | Property type | … | Gross floor area | Built year | … | Usage (kBtu) |
    |--------|-----------------|---------------|---|------------------|------------|---|--------------|
    | 100001 | WEST TOWN       | Hospital      | … | 309056           | 1928       | … | 21470037     |
    | 100256 | ARCHER HEIGHTS  | K-12 School   | … | 447330           | 1990       | … | 35792767     |
    | …      | …               | …             | … | …                | …          | … | …            |
    | 250150 | NEAR NORTH SIDE | Office        | … | 335281           | 1912       | … | 24220915     |

  40. Make features “machine readable”
    ‣ Categorical: Set “1” or “0” to corresponding position
    ‣ Quantitative: Directly use it
    ‣ Label: “1” for purchase, “0” for non-purchase
    ‣ + Converting year into elapsed years as of benchmarking year

    Community area: NEAR NORTH SIDE, …, ARCHER HEIGHTS, WEST TOWN. Property type: Hospital, …, Office.

    | NEAR NORTH SIDE | … | ARCHER HEIGHTS | WEST TOWN | Hospital | … | Office | Area   | (A) Data year | (B) Built year | Building age (A) - (B) | Usage (kBtu) |
    |-----------------|---|----------------|-----------|----------|---|--------|--------|---------------|----------------|------------------------|--------------|
    | 0               | … | 0              | 1         | 1        | … | 0      | 309056 | 2014          | 1928           | 86                     | 21470037     |
    | 0               | … | 1              | 0         | 0        | … | 0      | 447330 | 2014          | 1990           | 24                     | 35792767     |
    | …               | … | …              | …         | …        | … | …      | …      | …             | …              | …                      | …            |
    | 1               | … | 0              | 0         | 0        | … | 1      | 335281 | 2014          | 1912           | 102                    | 24220915     |
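
The conversion on this slide can be sketched in a few lines of Python. This is an illustration only: the vocabularies are shortened to three values each, and `one_hot` and `to_vector` are hypothetical helper names, not Hivemall functions.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == v else 0 for v in vocabulary]


def to_vector(row, areas, types):
    """Turn one building record into a flat numeric vector."""
    # Elapsed years as of the benchmarking year: (A) data year - (B) built year.
    age = row["data_year"] - row["year_built"]
    return (
        one_hot(row["community_area"], areas)
        + one_hot(row["property_type"], types)
        + [row["floor_area"], age]
    )


# Shortened vocabularies (assumption: the real data has many more values).
areas = ["NEAR NORTH SIDE", "ARCHER HEIGHTS", "WEST TOWN"]
types = ["Hospital", "K-12 School", "Office"]

row = {
    "community_area": "WEST TOWN",
    "property_type": "Hospital",
    "floor_area": 309056,
    "data_year": 2014,
    "year_built": 1928,
}
vector = to_vector(row, areas, types)
print(vector)  # [0, 0, 1, 1, 0, 0, 309056, 86]
```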

  41. Feature representation in Hivemall
    Feature = index:value, or index only
    (index: INT, BIGINT, or TEXT; value: FLOAT)
    ‣ libSVM format: 10:3.4, 123:0.5, 34567:0.231
    ‣ index can be text: age:86, area:447330
    ‣ index-only means value = 1.0 (e.g., categorical): type#office = type#office:1.0

  42. Create feature vector in SQL
    Array of quantitative features (index:value)
    select quantitative_features(array("age", "area"), 86, 447330)
    ["age:86.0", "area:447330"]
    Array of categorical features (index#value)
    select categorical_features(array("commun", "type"), "NEAR NORTH", "office")
    ["commun#NEAR NORTH", "type#office"]
    * NULL is automatically omitted
    Hivemall internally does one-hot encoding (e.g., office → 1, 0, 0, …)
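
To make the string format concrete, the sketch below mimics the two helpers in plain Python. It is not Hivemall's implementation: the functions only reuse the names from the slide, `parse_feature` is a hypothetical helper, and the exact number formatting of the real UDFs may differ.

```python
def quantitative_features(names, *values):
    """Build "name:value" strings, skipping missing (None) values."""
    return [f"{n}:{float(v)}" for n, v in zip(names, values) if v is not None]


def categorical_features(names, *values):
    """Build "name#value" strings, skipping missing (None) values."""
    return [f"{n}#{v}" for n, v in zip(names, values) if v is not None]


def parse_feature(feature):
    """Split a feature string into (index, value); index-only means value 1.0."""
    if ":" in feature:
        index, value = feature.rsplit(":", 1)
        return index, float(value)
    return feature, 1.0


print(quantitative_features(["age", "area"], 86, 447330))
print(categorical_features(["commun", "type"], "NEAR NORTH", "office"))
print(parse_feature("type#office"))
```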

  43. Feature hashing: Approximation improves scalability
    Simplify names of quantitative features (:) and categorical features (#)
    select feature_hashing(array("age:86", "type#office"))
    ["14142887:600", "10413006"]
    (Default upper limit: 2^24 + 1 = 16777217)
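
The idea behind feature hashing can be illustrated with a short sketch: each feature name is mapped to an integer index in a fixed range, while any value after the colon is kept as-is. The code below is an assumption-laden illustration that uses MD5 from the standard library as a stand-in hash, so its indices will not match the ones Hivemall produces.

```python
import hashlib

# Size of the hashed index space assumed for this sketch.
NUM_FEATURES = 2 ** 24


def hash_index(name, num_features=NUM_FEATURES):
    """Map a feature name to an integer in [1, num_features] (stand-in hash)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features + 1


def feature_hashing(features):
    """Replace each feature name with its hashed index, keeping the value."""
    hashed = []
    for feature in features:
        if ":" in feature:
            name, value = feature.rsplit(":", 1)
            hashed.append(f"{hash_index(name)}:{value}")
        else:
            # Index-only (categorical) feature: the whole string is the name.
            hashed.append(str(hash_index(feature)))
    return hashed


result = feature_hashing(["age:86", "type#office"])
print(result)
```

Different names can collide on the same index; that approximation is the price paid for a bounded, memory-friendly feature space.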

  44. [TASK] Create feature vector and check output (Hive)
    select
      id,
      array_concat(
        categorical_features(
          array('Chicago community area', 'Primary use of property'),
          community_area, primary_property_type
        ),
        quantitative_features(
          array('Total interior floor space', 'Building age', 'Number of buildings'),
          gross_floor_area___buildings__sq_ft_, data_year - year_built, num_of_buildings
        )
      ) as features,
      electricity_use__kbtu_ as annual_electricity_consumption
    from
      chicago_smart_green.energy_benchmarking
    where
      electricity_use__kbtu_ is not null

  45. [TASK] Advanced technique: add_bias() and feature_hashing()
    select
      id,
      feature_hashing(
        add_bias(
          array_concat(
            categorical_features(
              array('Chicago community area', 'Primary use of property'),
              community_area, primary_property_type
            ),
            quantitative_features(
              array('Total interior floor space', 'Building age', 'Number of buildings'),
              gross_floor_area___buildings__sq_ft_, data_year - year_built, num_of_buildings
            )
          )
        )
      ) as features,
      electricity_use__kbtu_ as annual_electricity_consumption
    from
      chicago_smart_green.energy_benchmarking
    where
      electricity_use__kbtu_ is not null

  46.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  47. Training and prediction over features-label pairs
    Historically, this building consumed…

    | Features                | Usage |
    |-------------------------|-------|
    | 0 … 0 1 1 … 0 309k 86   | 21M   |
    | 0 … 1 0 0 … 0 447k 24   | 35M   |
    | …                       | …     |
    | 1 … 0 0 0 … 1 335k 102  | 24M   |

    Train (SQL) → Model (table) → Predict (SQL)

    How about a hospital in West Town built 10 years ago?
    Unforeseen: 0 … 0 1 1 … 0 500k 10 (West Town, Hospital, Age) → ?

  48. Design features that clearly represent “characteristics” of data
    Historically, this building consumed…

    | Features                | Usage |
    |-------------------------|-------|
    | 0 … 0 1 1 … 0 309k 86   | 21M   |
    | 0 … 1 0 0 … 0 447k 24   | 35M   |
    | …                       | …     |
    | 1 … 0 0 0 … 1 335k 102  | 24M   |

    How about a hospital in West Town built 10 years ago?
    Unforeseen: 0 … 0 1 1 … 0 500k 10 (West Town, Hospital, Age) → Model → 20M
    Similar features lead to similar predicted usage

  49. Building prediction model from historical data
    SELECT
      train_regressor( -- train_classifier(
        features,
        annual_electricity_consumption,
        '-loss squared -opt SGD -reg no -eta simple -total_steps ${total_steps}'
      ) as (feature, weight)
    FROM
      training

    Classification
    ‣ HingeLoss
    ‣ LogLoss (a.k.a. logistic loss)
    ‣ SquaredHingeLoss
    ‣ ModifiedHuberLoss
    Regression
    ‣ SquaredLoss
    ‣ QuantileLoss
    ‣ EpsilonInsensitiveLoss
    ‣ SquaredEpsilonInsensitiveLoss
    ‣ HuberLoss
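
What a training call like the one above conceptually produces is a table of (feature, weight) pairs. The sketch below shows plain stochastic gradient descent on squared loss with a fixed learning rate. It is a simplified illustration rather than Hivemall's implementation; the toy samples, the `eta` and `epochs` defaults, and the dictionary-based feature format are assumptions.

```python
def train_regressor(samples, eta=0.05, epochs=2000):
    """Fit a linear model with SGD on squared loss.

    samples: list of (features, label), where features maps name -> value.
    Returns a dict mapping feature name -> learned weight.
    """
    weights = {}
    for _ in range(epochs):
        for features, label in samples:
            predicted = sum(weights.get(f, 0.0) * v for f, v in features.items())
            error = predicted - label
            # Gradient of 0.5 * error^2 with respect to each weight is error * value.
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) - eta * error * v
    return weights


# Toy data generated from y = 2 * x + 1 (the "bias" feature is always 1.0).
samples = [
    ({"bias": 1.0, "x": 1.0}, 3.0),
    ({"bias": 1.0, "x": 2.0}, 5.0),
    ({"bias": 1.0, "x": 3.0}, 7.0),
]
model = train_regressor(samples)
print({name: round(weight, 3) for name, weight in model.items()})
```

The returned dictionary plays the role of the model table: one row per feature with its weight.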

  50. Building prediction model from historical data
    SELECT
      train_regressor( -- train_classifier(
        features,
        annual_electricity_consumption,
        '-loss squared -opt SGD -reg no -eta simple -total_steps ${total_steps}'
      ) as (feature, weight)
    FROM
      training

    Optimizer
    ‣ SGD
    ‣ AdaGrad
    ‣ AdaDelta
    ‣ ADAM
    Regularization
    ‣ L1
    ‣ L2
    ‣ ElasticNet
    ‣ RDA

    ‣ Iteration with learning rate control
    ‣ Mini-batch training
    ‣ Early stopping

  51. Model = table

  52.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  53. Evaluate accuracy of ML model
    Historical data → Training data + Validation data
    Training data → ML model → Prediction result (on validation data)
    Measure prediction accuracy

    | Prediction | Actual value |
    |------------|--------------|
    | 38.5       | 30           |
    | 12.1       | 18           |
    | 25.2       | 20           |

    Q. How good (or bad) is this overall prediction result?
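
Splitting historical data into training and validation parts can be sketched as a shuffle followed by a cut. The snippet below is a minimal illustration; the 80/20 ratio, the fixed seed, and the function name are assumptions, not values from the workshop.

```python
import random


def train_validation_split(rows, validation_ratio=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (training, validation)."""
    shuffled = rows[:]                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]


rows = list(range(10))                      # stand-in for historical records
training, validation = train_validation_split(rows)
print(len(training), len(validation))       # 8 2
```

The model is trained only on the first part, and its predictions on the second part are compared with the actual values.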

  54. Evaluation metric example: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)

    | Prediction | Actual value |
    |------------|--------------|
    | 38.5       | 30           |
    | 12.1       | 18           |
    | 25.2       | 20           |
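
Both metrics can be computed directly from the three prediction/actual pairs on this slide. The sketch below uses the standard definitions: MAE is the mean of absolute errors, and RMSE is the square root of the mean of squared errors.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error: average of |prediction - actual|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Squared Error: sqrt of the average squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


predicted = [38.5, 12.1, 25.2]
actual = [30, 18, 20]
print(round(mae(predicted, actual), 2))   # 6.53
print(round(rmse(predicted, actual), 2))  # 6.69
```

RMSE penalizes large individual errors more heavily than MAE, which is why it is slightly higher here.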

  55. [TASK] Run simple regression workflow
    _export:
      !include : config.yml
      td:
        database: ${database}
        engine: hive

    +vectorize:
      td>: queries/vectorize.sql
      create_table: vectors

    +shuffle:
      td>: queries/shuffle.sql
      create_table: samples
      engine: presto

    +train:
      td>: queries/train.sql
      create_table: model

    +evaluate:
      td>: queries/evaluate.sql
      store_last_results: true

    +show_accuracy:
      echo>: "RMSE: ${td.last_results.rmse}, MAE: ${td.last_results.mae}"

  56. [TASK] Check if model weights make sense

  57.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  58. Possible directions
    ‣ Design a different feature vector
    ‣ Normalize and re-scale feature values
    ‣ Collect more data
    ‣ Join with different types of data
    ‣ Tune hyper-parameters
    ‣ Use another ML model
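
As one example of the "normalize and re-scale" direction, the sketch below applies min-max scaling and z-score standardization to the three floor-area values used earlier in the deck. The function names are illustrative, and which scaling suits a given model is a judgment call.

```python
import statistics


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]        # all values identical
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


floor_areas = [309056, 447330, 335281]
scaled = min_max(floor_areas)
standardized = z_score(floor_areas)
print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardized])
```

Rescaling keeps a large-valued column such as floor area from dominating gradient updates over 0/1 categorical features.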

  59. Possible scenario: Chicago smart green infrastructure monitoring data
    https://data.cityofchicago.org/Environment-Sustainable-Development/Smart-Green-Infrastructure-Monitoring-Sensors-Hist/ggws-77ih
    https://github.com/BlackstoneEngineering/mbed-os-example-treasuredata-rest
    Each data stream captures:
    ‣ Temperature
    ‣ Wind speed and direction
    ‣ Rainfall
    ‣ Pressure
    ‣ Soil moisture
    Connector → chicago_smart_green.sensors_history

  60. Database: chicago_smart_green
    ‣ sensors_history: Environmental monitoring results from many sensors deployed in the city
    ‣ community_areas (auxiliary data from CSV import): Definition of community area boundaries
    ‣ energy_benchmarking: Result of annual energy benchmarking for large buildings in the city
    Join keys: Latitude, Longitude; Area

  61. Joining different datasets on geospatial information
    ST_Contains(
      ST_GeometryFromText(community_areas.the_geom),
      ST_Point(sensors_history.longitude, sensors_history.latitude)
    )
    https://prestodb.io/docs/current/functions/geospatial.html
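
Conceptually, the join condition asks whether a sensor's point lies inside a community area's polygon. The sketch below shows that test with a basic ray-casting check in plain Python. It is not how Presto implements `ST_Contains`; the square areas, sensor coordinates, and names are made up, and points are written as (x, y), matching the (longitude, latitude) order of `ST_Point`.

```python
def contains(polygon, point):
    """Ray-casting test: is `point` strictly inside the simple polygon?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical area boundaries and sensor locations.
areas = {
    "AREA_A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "AREA_B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
sensors = {"sensor_1": (2.0, 2.0), "sensor_2": (6.0, 1.0), "sensor_3": (9.0, 9.0)}

# Emulate the join: list the areas that contain each sensor.
joined = {
    sensor_id: [name for name, polygon in areas.items() if contains(polygon, point)]
    for sensor_id, point in sensors.items()
}
print(joined)
```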

  62.
    Problem
    What you want to “predict”
    Hypothesis & Proposal
    Build machine learning model
    Historical data
    Cleanse data
    Evaluate
    Deploy to production

  63. [TASK] Predict unknown NULL electricity usage
    with null_samples as (
      select
        id,
        array_concat(
          categorical_features(
            array('Chicago community area', 'Primary use of property'),
            community_area, primary_property_type
          ),
          quantitative_features(
            array('Total interior floor space', 'Building age', 'Number of buildings'),
            gross_floor_area___buildings__sq_ft_, data_year - year_built, num_of_buildings
          )
        ) as features
      from
        chicago_smart_green.energy_benchmarking
      where
        electricity_use__kbtu_ is null
    ),
    features_exploded as (
      select
        id,
        extract_feature(fv) as feature,
        extract_weight(fv) as value
      from null_samples t1
      LATERAL VIEW explode(features) t2 as fv
    )
    select
      t1.id,
      sum(p1.weight * t1.value) as predicted_electricity_consumption
    from
      features_exploded t1
      LEFT OUTER JOIN model p1 ON (t1.feature = p1.feature)
    group by
      t1.id
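
The prediction query boils down to a weighted sum: each feature of a sample is matched against the model table, and the products of weight and value are added up. The sketch below restates that logic in plain Python; the model weights and feature strings are made-up examples, and features missing from the model contribute nothing, mirroring how `sum()` skips the NULLs produced by the outer join.

```python
def split_feature(feature):
    """Split "name:value" into (name, value); index-only features get 1.0."""
    if ":" in feature:
        name, value = feature.rsplit(":", 1)
        return name, float(value)
    return feature, 1.0


def predict(features, model):
    """Linear prediction: sum of weight * value over features known to the model."""
    total = 0.0
    for feature in features:
        name, value = split_feature(feature)
        if name in model:                   # unknown features are skipped
            total += model[name] * value
    return total


# Hypothetical model table (feature -> weight) and one sample's feature array.
model = {"Building age": 2.0, "Primary use of property#Hospital": 10.0}
features = ["Building age:10", "Primary use of property#Hospital", "unknown#x"]
prediction = predict(features, model)
print(prediction)  # 30.0
```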

  64. Deploy ML model to production
    https://hivemall.incubator.apache.org/userguide/tips/rt_prediction.html
    Production env: External signal → Predict (vector/matrix computation)

  65. Predictive analytics on UI
    Minimal and powerful ML capabilities on unified interface

  68. Word-based customer tagging and categorization
    Store customers’ browsing log from TD JavaScript SDK
      td_client_id: XXX-YYY-ZZZZZ
      td_title: Today’s news
      td_description: The Olympic game has been started …
      td_host: www.td-news.com
      td_path: /2017/10/01/olympic
    STEP 1: Extract keywords from each article
      Example keywords: Olympic game medal / president citizen rule law / data cloud CDP / politics law US nation / equation math / curry rice history
      Example categories: Society / Science / Food, Culture
    STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
      td_client_id: XXX-YYY-ZZZZZ
      td_interest_words: Olympic, baseball, game
      td_affinity_categories: Sports, Entertainment
    Create audience

  69. Predictive customer scoring
    Profile set
      td_client_id: XXX-YYY-ZZZZZ
      td_ip: 192.168.0.1
      td_referrer: http://google.com/…
      spend_time: 1.5
      …
      td_interest_words: Olympic, baseball, game
      td_affinity_categories: Sports, Entertainment
    Steps: What you want to predict / Guess how to cleanse data / Build predictive model / Evaluation (accuracy: Sufficient?)
    GUESS: Automatically select and transform customer attributes (e.g., Japan, google.com, 1.5)
    1ST PASS: Treasure CDP does everything for you
    FROM 2ND PASS: You can make your predictive model better with ML experts
    SCORE CUSTOMERS: Unlikely / Marginally / Possibly / Likely
    SYNDICATE: Base segment, Population, Segment, Audience

  71. Treasure Data audience suite release announcement
    https://blog.treasuredata.com/blog/2018/10/02/audience-suite-intuitive-actionable-customer-data-platform/
    Predictive scoring documentation
    https://support.treasuredata.com/hc/en-us/articles/360001458407-Predicting-Customer-Behavior
    Developer’s tech talk slides explaining technical detail
    https://speakerdeck.com/takuti/machine-learning-and-natural-language-processing-on-treasure-cdp
