
Machine Learning and Natural Language Processing on Treasure CDP

PLAZMA TD Internal Day: TD Tech Talk 2018: https://techplay.jp/event/650390

Video: https://youtu.be/RzQT_9jcrx8?t=2h4m17s

Takuya Kitazawa

February 19, 2018

Transcript

  1. Machine Learning and Natural Language Processing
    on Treasure CDP
    Takuya Kitazawa @takuti
    Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer

  2. takuti.me

  3. Word-based customer tagging and categorization (2017)
    Store customers’ browsing log from the TD JavaScript SDK.
    STEP 1: Extract keywords from each article.
    STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories, then create an audience.
    (Example keyword clusters: Society — president, citizen, rule, law / politics, law, US, nation;
    Science — equation, math; Food, Culture — curry, rice, history; Olympic, game, medal; data, cloud, CDP)

    Input pageview record:
    td_client_id    XXX-YYY-ZZZZZ
    td_title        Today’s news
    td_description  The Olympic game has been started …
    td_host         www.td-news.com
    td_path         /2017/10/01/olympic

    Derived customer profile:
    td_client_id            XXX-YYY-ZZZZZ
    td_interest_words       Olympic, baseball, game
    td_affinity_categories  Sports, Entertainment

  4. Predictive customer scoring (2018)

  5. Past: Hivemall
    Present: Hivemall + Digdag
    Future: ML & NLP on UI

  6. Past and present:
    Machine-learning-related capabilities on TD

  7. (image-only slide; no text transcribed)

  8. Data
    (Diagram: two ways to work with TD data — heavyweight: export to 3rd-party tools, e.g., for visualization;
    lightweight: SQL on TD with Treasure ML (SELECT * FROM data …) or Pandas TD)

  9. To be released…
    (Diagram: server / store / load)

  10. ML-related capability on / (1/2)
    Classification — Soft Confidence-Weighted, Random Forest, Logistic Regression, …
    ‣ Binary: “Likely to buy our product?” “Is this email spam?”
    ‣ Multi-class: “Will it be sunny, cloudy, or rainy?” “Which group does this user belong to?”
    Regression — Random Forest, AdaDelta, Factorization Machines, …
    ‣ “Tomorrow’s temperature” “Estimated product sales next month” “This user’s annual income”
    Recommendation — Matrix Factorization, Factorization Machines, …
    ‣ “Customers who bought this also bought …”
    Anomaly Detection — Local Outlier Factor, ChangeFinder, …
    ‣ “Suddenly increased # of visitors on our web site”
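
    All of the above run as plain SQL via Hivemall UDFs. As a rough illustration (the table name
    train and its columns are assumptions, not taken from the deck), training a logistic regression
    classifier looks roughly like this:

    SELECT
      feature,
      avg(weight) AS weight
    FROM (
      SELECT logress(features, label) AS (feature, weight)  -- Hivemall's logistic regression trainer
      FROM train                                            -- features: array<string>, label: 0/1
    ) t
    GROUP BY
      feature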

  11. ML-related capability on / (2/2)
    Natural Language Processing — Sentence tokenization, finding the singular form of an English word, …
    ‣ select tokenize('Hello, world!') → ["Hello", "world"]
    ‣ select singularize('apples') → apple
    Clustering — Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
    ‣ “Which articles are similar to this one?”
    Geospatial Functions
    ‣ “Show the map around a specific pair of latitude and longitude”
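
    These functions compose with plain HiveQL. A small sketch (table and column names are
    assumptions) that tokenizes page titles and counts word frequencies:

    SELECT
      w.word,
      count(1) AS cnt
    FROM
      pageviews t
      LATERAL VIEW explode(tokenize(t.td_title, true)) w AS word  -- lower-cased English tokens
    GROUP BY
      w.word
    ORDER BY
      cnt DESC
    LIMIT 10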

  12. Use case: ML-based customer segmentation at OISIX
    1. Predict probability of churn
    2. Aggressively reach out to “likely to churn” customers
    https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
    OISIX’s data sources: Web, Mobile, Customer attr., Behavior on web, Complaint log,
    Signed-up services, Actions (direct), Actions (indirect), Point, Call
    → Guide to success via the UI

  13. Real-world ML workflow
    Problem — what you want to “predict” → Hypothesis & Proposal
    Cleanse historical data: extract, filter, interpolate, normalize, …
    Build machine learning model: get features from train data and train;
    get features from test data, predict, and measure accuracy
    Evaluate: sufficient accuracy? Which columns should we use?
    Ship to production
    (Each step in the diagram is a query.)

  14. Split samples
    Rescale and vectorize samples
    Train model

  15. Digdag…!
    The same workflow as a chain of queries: cleanse data (extract, filter, interpolate, normalize)
    → build machine learning model (get features from train data and train;
    get features from test data, predict, and measure accuracy) → evaluate

  16. +preprocess:
        _parallel: true
        +train:
          td>: ../queries/preprocess_train.sql
          create_table: train
        +test:
          td>: ../queries/preprocess_test.sql
          create_table: test
      +logress_train:
        td>: queries/logress_train.sql
        create_table: logress_model
      +compute_downsampling_rate:
        td>: queries/downsampling_rate.sql
        engine: presto
        store_last_results: true
      +logress_predict:
        td>: queries/logress_predict.sql
        create_table: prediction
      +evaluate:
        td>: queries/evaluate.sql
        store_last_results: true
      +show_accuracy:
        echo>: "Logloss (smaller is better): ${td.last_results.logloss}"

  17. treasure-data/workflow-examples

  18. A Customer Data Platform is a marketer-controlled, integrated customer database that
    can support coordinated programs across multiple channels.
    Treasure CDP: Data Collection → ID Unification, Segmentation, Syndication → Campaign Execution
    (backed by Workflow, Query, Reporting, Data Warehouse, and Machine Learning)

  19. (image-only slide; no text transcribed)

  20. The system is scalable, but the ML team is NOT scalable

  21. Future:
    ML & NLP solutions for everyone on CDP
    Providing a unified interface to all TD users

  22. “customer” = attributes + behaviors on the CDP application
    Behavior (pageviews):
    Time        Host       Path    Browser  …
    1514899923  takuti.me  /about  Chrome   …
    1517305451  takuti.me  /       Safari   …
    1518765966  takuti.me  /note   Chrome   …
    …           …          …       …        …

    Attributes:
    Age      24
    Sex      Male
    Email    [email protected]
    Address  Nakano, Tokyo, Japan
    …        …

    Behavior (items):
    Time        Item ID  Referrer      OS       …
    1513080070  XXX      twitter.com   macOS    …
    1515488949  YYY      google.com    iOS      …
    1518766618  ZZZ      facebook.com  Android  …
    …           …        …             …        …

    All tied to cdp_customer_id “aaa-bbb-cccc”

  23. “audience” = set of customers
    Audience

  24. 1. Word-based customer tagging and categorization
    for Japanese and English
    Store customers’ browsing log from the TD JavaScript SDK.
    STEP 1: Extract keywords from each article.
    STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories, then create an audience.
    (Example keyword clusters: Society — president, citizen, rule, law / politics, law, US, nation;
    Science — equation, math; Food, Culture — curry, rice, history; Olympic, game, medal; data, cloud, CDP)

    Input pageview record:
    td_client_id    XXX-YYY-ZZZZZ
    td_title        Today’s news
    td_description  The Olympic game has been started …
    td_host         www.td-news.com
    td_path         /2017/10/01/olympic

    Derived customer profile:
    td_client_id            XXX-YYY-ZZZZZ
    td_interest_words       Olympic, baseball, game
    td_affinity_categories  Sports, Entertainment

  25. Challenges
    Short input texts and wide-ranging content types, depending on the data
    Unsupervised customer categorization with fewer false positives
    Tokenizing new words — newly coined Japanese proper nouns and titles should each become a single token
    Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting

  26. Digdag workflow built by API
    Preprocess
    SELECT
    ${join_column_name},
    concat(td_host, td_path) AS article_id,
    concat(
    -- remove site name which commonly occurs at the foot of page title
    regexp_replace(
    -- "(xxx)" is generally meaningless, accessory part of page title
    regexp_replace(
    td_title,
    '[(（].+?[)）]', ''
    ),
    '[|-] .+$', ''
    ),
    ' ',
    coalesce(td_description, '')
    ) AS content
    FROM
    ${behavior}
    WHERE
    td_title IS NOT NULL
    AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))

  27. Digdag workflow built by API
    Tokenize (Japanese)
    SELECT
    article_id,
    word
    FROM
    article t1
    LATERAL VIEW explode(
    tokenize_ja(
    normalize_unicode(content, 'NFKC'),
    "normal",
    array("a", "about", "above", "across", "after", "again", ...),
    array("副詞", "助詞", "動詞", "記号", "名詞-数", "副詞-一般", "助詞-特殊", "動詞-接尾", ...),
    "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
    )
    ) t2 AS word
    WHERE
    length(word) >= 2
    AND word RLIKE '^[ぁ-んーァ-ヶー一-龠a-zA-ZＡ-Ｚａ-ｚ・]+$'   -- acceptable characters
    AND word NOT RLIKE '^([^一-龠]{1,2}|[ぁ-んー]{1,3})$'       -- even if a word consists of acceptable characters,
                                                                -- reject "len-2 non-kanji words" and "len-3 hiragana-only words"

  28. NEologd-based custom Kuromoji dictionary
    github.com/neologd/mecab-ipadic-neologd / github.com/atilika/kuromoji
    Kuromoji format for
    tokenize_ja()
    Filter useless words

  29. Digdag workflow built by API
    TF-IDF weighting and keyword extraction takuti.me/note/tf-idf
    article_keyword AS (
    SELECT
    tf.article_id,
    tf.word,
    tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
    FROM
    tf
    JOIN
    df
    ON tf.word = df.word
    WHERE
    df.cnt >= 2
    AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)} -- ignore too common words
    )
    SELECT
    each_top_k(
    20, article_id, tfidf,
    article_id, word
    ) AS (rank, score, article_id, word)
    FROM (
    SELECT
    article_id,
    word,
    tfidf
    FROM
    article_keyword
    CLUSTER BY
    article_id
    ) t

  30. Aggregate over customers’ behaviors
    STEP 1 produced per-article keyword clusters (Society, Science, Food/Culture, …).
    STEP 2: JOIN each customer’s visited articles with those keywords, then
    sum() → l1_normalize() → each_top_k() → td_interest_words
    Next: map words into categories → td_affinity_categories
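
    Concretely, picking each customer's top interest words could look roughly like the following
    sketch. The table customer_article_keyword (each customer's visited articles joined with their
    per-article keyword TF-IDF scores) is an assumption, not from the deck:

    SELECT
      each_top_k(
        10, cdp_customer_id, score,
        cdp_customer_id, word
      ) AS (rank, score, cdp_customer_id, word)
    FROM (
      SELECT
        cdp_customer_id,
        word,
        sum(tfidf) AS score        -- sum keyword scores over all articles the customer visited
      FROM
        customer_article_keyword
      GROUP BY
        cdp_customer_id, word
      CLUSTER BY
        cdp_customer_id            -- each_top_k requires rows clustered by the grouping key
    ) t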

  31. Map words into IAB categories in a relational schema
    support.aerserv.com/hc/en-us/articles/207148516-List-of-IAB-Categories
    td_interest_words (score = TF-IDF):
    cdp_customer_id  word      score
    aaa-bbb-cccc     politics  0.3
    aaa-bbb-cccc     law       0.2
    …                …         …
    ddd-eee-ffff     math      0.7
    …                …         …
    xxx-yyy-zzzz     history   0.4

    Mapping table:
    word      category                     probability
    anime     IAB1  Arts & Entertainment   0.4
    anime     IAB5  Education              0.1
    anime     IAB9  Hobbies & Interests    0.5
    politics  IAB11 Law, Gov’t & Politics  0.8
    …         …                            …
    coffee    IAB8  Food & Drink           0.9

    JOIN

  32. Join “inverted” mapping table
    td_interest_words (score = TF-IDF):
    cdp_customer_id  word      score
    aaa-bbb-cccc     politics  0.3
    aaa-bbb-cccc     law       0.2
    …                …         …
    ddd-eee-ffff     math      0.7
    …                …         …
    xxx-yyy-zzzz     history   0.4

    Inverted mapping table:
    word      category:probability
    anime     [ IAB1:0.4, IAB5:0.1, IAB9:0.5 ]
    politics  [ IAB11:0.8, … ]
    …         …
    coffee    [ IAB8:0.9, … ]

    SELECT
      sum(score * probability)
    GROUP BY
      cdp_customer_id,
      category
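
    Spelled out with the relational form of the mapping table from the previous slide, the
    aggregation above could look roughly like this (td_interest_words and mapping are the table
    names used on the slides; everything else is an assumption):

    SELECT
      w.cdp_customer_id,
      m.category,
      sum(w.score * m.probability) AS affinity   -- word-level interest weighted by word→category probability
    FROM
      td_interest_words w
      JOIN mapping m ON w.word = m.word
    GROUP BY
      w.cdp_customer_id,
      m.category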

  33. Create mapping table from Wikipedia dump
    word      category                     probability
    anime     IAB1  Arts & Entertainment   0.4
    anime     IAB5  Education              0.1
    anime     IAB9  Hobbies & Interests    0.5
    politics  IAB11 Law, Gov’t & Politics  0.8
    …         …                            …
    coffee    IAB8  Food & Drink           0.9

    Corpus — pairs of articles

  34. Create mapping table from Wikipedia dump
    word      category                     probability
    anime     IAB1  Arts & Entertainment   0.4
    anime     IAB5  Education              0.1
    anime     IAB9  Hobbies & Interests    0.5
    politics  IAB11 Law, Gov’t & Politics  0.8
    …         …                            …
    coffee    IAB8  Food & Drink           0.9

    IAB category                   Wikipedia category (English / Japanese)
    IAB1  Arts & Entertainment     Entertainment / 娯楽
    IAB2  Automotive               Automobilities / 自動車
    …                              …
    IAB23 Religion & Spirituality  Religion / 宗教

    Find related articles from the root category (e.g., Entertainment)
    github.com/takuti/fastcat

  35. Create mapping table from Wikipedia dump
    word      category                     probability
    anime     IAB1  Arts & Entertainment   0.4
    anime     IAB5  Education              0.1
    anime     IAB9  Hobbies & Interests    0.5
    politics  IAB11 Law, Gov’t & Politics  0.8
    …         …                            …
    coffee    IAB8  Food & Drink           0.9

    IAB category                   Wikipedia category (English / Japanese)
    IAB1  Arts & Entertainment     Entertainment / 娯楽
    IAB2  Automotive               Automobilities / 自動車
    …                              …
    IAB23 Religion & Spirituality  Religion / 宗教

    From the corpus:
    1) Aggregate word scores per category
    2) Normalize them per word
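
    In SQL terms, steps 1) and 2) could look roughly like the following sketch (the table
    wikipedia_article_keyword and its columns are assumptions, not from the deck):

    WITH word_category AS (
      SELECT
        word,
        category,
        sum(tfidf) AS score                            -- 1) aggregate word scores per category
      FROM
        wikipedia_article_keyword
      GROUP BY
        word, category
    )
    SELECT
      word,
      category,
      score / sum(score) OVER (PARTITION BY word) AS probability   -- 2) normalize per word
    FROM
      word_category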

  36. Put sub categories in parallel, and filter out unconfident ones
    cdp_customer_id td_affinity_main_categories td_affinity_sub_categories
    aaa-bbb-cccc [ IAB11, IAB23 ] [ IAB2-4, IAB11-1, IAB12-3 ]
    ddd-eee-ffff [ IAB9, IAB15 ] [ IAB8-3, IAB20-8 ]
    … … …
    xxx-yyy-zzzz [ IAB14 ] [ IAB14-1, IAB14-3, IAB19-7 ]

  37. New challenge:
    Computationally heavy…

  38. (image-only slide; no text transcribed)

  39. 2. Predictive customer scoring
    UI-assisted binary classification (logistic regression)

  40. “segment” = subset of audience customers
    Audience
    Segment

  41. Create segment corresponding to positive samples
    Audience
    Segment
    Already “converted” customers

  42. Select features and their preprocessing rule
    “Guess features”
    Suggest useful features

  43. RUN…!
    Distribution of
    predictive scores
    Classify customers
    by predictive scores

  44. Accuracy and metrics

  45. Challenges
    Guessing feature representation, along with detecting “categorical” and “quantitative” columns, to apply min-max normalization
    Calibrating the number of positive/negative samples for differently sized data
    Providing enough information to refine features and prevent “leakage”, even for non-ML experts

  46. Guess feature representation API
    For sampled values:
    ‣ Column name, type
    ‣ Cardinality
    ‣ Mean, variance, percentile
    ‣ Regular expression
    ‣ …
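
    As a rough illustration of the statistics that can be computed from sampled values (the table
    and column names below are assumptions), e.g., in Presto:

    SELECT
      approx_distinct(age)         AS age_cardinality,
      avg(age)                     AS age_mean,
      variance(age)                AS age_variance,
      approx_percentile(age, 0.5)  AS age_median
    FROM
      sampled_customers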

  47. Guess:
    Test on Criteo data from Kaggle competition

  48. Guess:
    Correctly detect categorical and quantitative columns

  49. How I integrated ML-related knowledge with API code:
    Write everything in comments, documents and commits

  50. Calibrating # of samples: Over-sample minor class
    takuti.me/note/adjusting-for-oversampling-and-undersampling
    WITH label2cnt AS (
    SELECT
    map_agg(label, cnt) AS kv
    FROM (
    SELECT
    label,
    CAST(COUNT(1) AS double) AS cnt
    FROM
    cdp_tmp_${model_table_name}_samples_${scope}
    GROUP BY
    label
    ) t
    )
    SELECT
    -- If % of minor samples is very small (less than 0.1%),
    -- amplify them so that at least 1% of samples are occupied by the minors.
    IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
    cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
    IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
    cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
    -- Amplify very small data regardless of its label, because tiny dataset
    -- possibly shows poor accuracy.
    IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
    FROM
    label2cnt
    (In label2cnt, kv[0] is the count of negative samples and kv[1] the count of positive samples.)

  51. To refine the predictive model and prevent leakage:
    show evaluation results and feature importance.
    Split the audience/segment samples 80% (train) / 20% (test), predict on the test split,
    and report accuracy (AUC, LogLoss) — this model is for validation,
    kept separate from the model for production.
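
    A minimal sketch of such an 80/20 split in HiveQL (the table audience_features and the seed are
    assumptions):

    WITH samples AS (
      SELECT t.*, rand(31) AS rnd
      FROM audience_features t
    )
    SELECT * FROM samples WHERE rnd < 0.8    -- 80% used to train the model for validation
    -- the complementary condition (rnd >= 0.8) gives the 20% held out for testing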

  52. Overview: How predictive customer scoring works
    Input per-customer attributes, e.g.:
    td_client_id            XXX-YYY-ZZZZZ
    td_ip                   192.168.0.1
    td_referrer             http://google.com/…
    spend_time              1.5
    …                       …
    td_interest_words       Olympic, baseball, game
    td_affinity_categories  Sports, Entertainment

    GUESS — Automatically select and transform customer attributes
    (guess how to cleanse the data, e.g., td_ip → Japan, td_referrer → google.com, spend_time → 1.5)
    1ST PASS — Treasure CDP does everything for you: take the already “converted” customers in a segment
    as positive samples, build the predictive model, and evaluate it (is the accuracy sufficient?)
    FROM 2ND PASS — You can make your predictive model better with ML experts
    SCORE CUSTOMERS — Bucket the audience by predictive score: Unlikely / Marginally / Possibly / Likely
    SYNDICATE

  53. How an enterprise-grade ML/NLP solution should be
    Scalable — Digdag, Hivemall, Presto, Hadoop, Embulk, …
    Accurate — with no crucial mistakes and no trivial false positives
    Interpretable — in terms of both algorithm and UI design, for all users

  54. MVP = classic algorithms and heuristics
    because there is no free lunch
    ajustchicago.org/2016/01/aint-no-free-lunch

  55. Machine Learning and Natural Language Processing
    on Treasure CDP
    Takuya Kitazawa @takuti
    Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer
