
Machine Learning and Natural Language Processing on Treasure CDP

PLAZMA TD Internal Day: TD Tech Talk 2018: https://techplay.jp/event/650390

Video: https://youtu.be/RzQT_9jcrx8?t=2h4m17s

Takuya Kitazawa

February 19, 2018

Transcript

  1. Machine Learning and Natural Language Processing on Treasure CDP

    Takuya Kitazawa @takuti, Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer
  2. takuti.me

  3. Word-based customer tagging and categorization (2017)

    Store customers’ browsing log from TD JavaScript SDK:

        td_client_id    XXX-YYY-ZZZZZ
        td_title        Today’s news
        td_description  The Olympic game has been started …
        td_host         www.td-news.com
        td_path         /2017/10/01/olympic

    STEP 1: Extract keywords from each article
    (example words: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP)

    STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
    (example words: politics, law, US, nation, equation, math, curry, rice, history; example categories: Society, Science, Food, Culture)

        td_client_id            XXX-YYY-ZZZZZ
        td_interest_words       Olympic, baseball, game
        td_affinity_categories  Sports, Entertainment

    Create audience
  4. Predictive customer scoring (2018)

  5. Past: Hivemall / Present: + Digdag / Future: ML & NLP on UI
  6. Past and present: Machine-learning-related capabilities on TD

  7. None
  8. Data 3rd-party tools (e.g., visualization) SQL + heavy lightweight Treasure

    ML SELECT * FROM data … Pandas TD
  9. To be released… server / store load

  10. ML-related capability (1/2)

    Classification: Soft Confidence-Weighted, Random Forest, Logistic Regression, …
    ‣ Binary: “Likely to buy our product?” “Is this email spam?”
    ‣ Multi-class: “Will be sunny, cloudy, or rainy?” “Which group does this user belong to?”

    Regression: Random Forest, AdaDelta, Factorization Machines, …
    ‣ “Tomorrow’s temperature” “Estimated product sales in next month” “This user’s annual income”

    Recommendation: Matrix Factorization, Factorization Machines, …
    ‣ “Customers who bought this also bought …”

    Anomaly Detection: Local Outlier Detection, ChangeFinder, …
    ‣ “Suddenly increased # of visitors on our web site”
  11. ML-related capability (2/2)

    Natural Language Processing: Sentence tokenization, Find singular form of English word, …
    ‣ select tokenize('Hello, world!') returns ["Hello", "world"]
    ‣ select singularize('apples') returns apple

    Clustering: Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
    ‣ “Which articles are similar to this one?”

    Geospatial Functions
    ‣ “I love to see map around specific pair of latitude and longitude”
  12. Use case: ML-based customer segmentation at OISIX

    1. Predict probability of churn
    2. Aggressively reach out to “likely to churn” customers

    https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

    OISIX’s data: Web, Mobile, Customer attr., Behavior on web, Complaint log, Source, Signed-up services, Actions (direct), Actions (indirect), Point, Call, Guide to success, UI
  13. Real-world ML workflow Problem What you want to “predict” Hypothesis

    & Proposal Evaluate Build machine learning model Historical data Cleanse data Ship to production Sufficient accuracy? Which columns should we use? Extract Filter Interpolate Normalize … … Query Query Query Query Train data Get features Train … Query Query Query Test data Get features Predict … Accuracy Query Query Query Query
  14. Split samples Rescale and vectorize samples Train model

  15. Digdag…! Evaluate Build machine learning model Cleanse data Extract Filter

    Interpolate Normalize … Train data Get features Train … … Test data Get features Predict … Accuracy Query Query Query Query Query Query Query Query Query Query Query
  16. Digdag workflow:

        +preprocess:
          _parallel: true

          +train:
            td>: ../queries/preprocess_train.sql
            create_table: train

          +test:
            td>: ../queries/preprocess_test.sql
            create_table: test

        +logress_train:
          td>: queries/logress_train.sql
          create_table: logress_model

        +compute_downsampling_rate:
          td>: queries/downsampling_rate.sql
          engine: presto
          store_last_results: true

        +logress_predict:
          td>: queries/logress_predict.sql
          create_table: prediction

        +evaluate:
          td>: queries/evaluate.sql
          store_last_results: true

        +show_accuracy:
          echo>: "Logloss (smaller is better): ${td.last_results.logloss}"
  17. treasure-data/workflow-examples

  18. A Customer Data Platform is a marketer-controlled integrated customer database that can support coordinated programs across multiple channels.

    Data Collection / ID Unification, Segmentation, Syndication / Campaign Execution

    Treasure CDP:
    ‣ ID Unification, Segmentation, Syndication
    ‣ Workflow, Query, Reporting, Data Warehouse, Machine Learning
  19. None
  20. System is scalable, but ML team is NOT scalable

  21. Future: ML & NLP solutions for everyone on CDP

    Providing a unified interface to all TD users
  22. “customer” = attributes + behaviors on CDP application

    cdp_customer_id: “aaa-bbb-cccc”

    Attributes:

        Age      24
        Sex      Man
        Email    k.takuti@gmail.com
        Address  Nakano, Tokyo, Japan
        …        …

    Behaviors:

        Time        Host       Path    Browser  …
        1514899923  takuti.me  /about  Chrome   …
        1517305451  takuti.me  /       Safari   …
        1518765966  takuti.me  /note   Chrome   …
        …           …          …       …        …

        Time        Item ID  Referrer      OS       …
        1513080070  XXX      twitter.com   macOS    …
        1515488949  YYY      google.com    iOS      …
        1518766618  ZZZ      facebook.com  Android  …
        …           …        …             …        …
  23. “audience” = set of customers Audience

  24. 1. Word-based customer tagging and categorization for Japanese and English

    Store customers’ browsing log from TD JavaScript SDK:

        td_client_id    XXX-YYY-ZZZZZ
        td_title        Today’s news
        td_description  The Olympic game has been started …
        td_host         www.td-news.com
        td_path         /2017/10/01/olympic

    STEP 1: Extract keywords from each article
    (example words: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP)

    STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
    (example words: politics, law, US, nation, equation, math, curry, rice, history; example categories: Society, Science, Food, Culture)

        td_client_id            XXX-YYY-ZZZZZ
        td_interest_words       Olympic, baseball, game
        td_affinity_categories  Sports, Entertainment

    Create audience
  25. Challenges

    ‣ Short input texts and wide-ranging content types depending on data
    ‣ Unsupervised customer categorization with fewer false positives
    ‣ Tokenizing new words: 「ゆるキャン」 and 「からかい上手の高木さん」 are each a single word

    Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting
  26. Digdag workflow built by API: Preprocess

        SELECT
          ${join_column_name},
          concat(td_host, td_path) AS article_id,
          concat(
            -- remove site name which commonly occurs at the foot of page title
            regexp_replace(
              -- "(xxx)" is generally meaningless, accessory part of page title
              regexp_replace(td_title, '[((].+?[))]', ''),
              '[|-] .+$',
              ''
            ),
            ' ',
            coalesce(td_description, '')
          ) AS content
        FROM
          ${behavior}
        WHERE
          td_title IS NOT NULL
          AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))
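The title clean-up in this query can be tried outside Hive with a minimal Python sketch. The function name `build_content` and the sample title are hypothetical, the second character in each bracket class is assumed to be a full-width parenthesis, and Python's `re` stands in for Hive's `regexp_replace`, so edge cases may behave differently.

```python
import re

def build_content(td_title, td_description=None):
    """Mimic the preprocessing query: clean the page title, then append the description."""
    # "(xxx)" (ASCII or full-width parentheses) is usually an accessory part of a title
    title = re.sub(r"[((].+?[))]", "", td_title)
    # drop the site name that commonly follows "| " or "- " at the end of a title
    title = re.sub(r"[|-] .+$", "", title)
    # same role as coalesce(td_description, '') in the query
    return title + " " + (td_description or "")

content = build_content("Olympic opens (photos) | TD News", "The games began today")
print(content)
```

A hyphen inside a word such as "well-known" is left alone, because the second pattern requires a space right after the separator.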
  27. Digdag workflow built by API: Tokenize (Japanese)

        SELECT
          article_id,
          word
        FROM
          article t1
        LATERAL VIEW explode(
          tokenize_ja(
            normalize_unicode(content, 'NFKC'),
            "normal",
            array("a", "about", "above", "across", "after", "again", ...),
            array("副詞", "助詞", "動詞", "記号", "名詞-数", "副詞-一般", "助詞-特殊", "動詞-接尾", ...),
            "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
          )
        ) t2 AS word
        WHERE
          length(word) >= 2
          AND word RLIKE '^[ぁ-んーァ-ヶー一-龠a-zA-Za-zA-Z・!?]+$' -- acceptable characters
          -- even if word consists of acceptable characters,
          -- reject "len-2 non-kanji word" and "len-3 hiragana-only word"
          AND word NOT RLIKE '^([^一-龠]{1,2}|[ぁ-んー]{1,3})$'
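The word filter in the WHERE clause can be sketched in Python as follows. This is an illustration under assumptions: the character classes are taken to be hiragana, katakana, kanji, ASCII and full-width letters, and a few symbols, and `keep_word` is a hypothetical helper name rather than part of Hivemall.

```python
import re

# Assumed character classes: hiragana, katakana, kanji, ASCII/full-width letters, symbols
ACCEPTABLE = re.compile(r"^[ぁ-んーァ-ヶー一-龠a-zA-Za-zA-Z・!?]+$")
# Words of 1-2 non-kanji characters, or 1-3 hiragana-only characters, are rejected
TOO_SHORT = re.compile(r"^([^一-龠]{1,2}|[ぁ-んー]{1,3})$")

def keep_word(word):
    """Return True if a token passes the three conditions of the WHERE clause."""
    return (
        len(word) >= 2
        and ACCEPTABLE.match(word) is not None
        and TOO_SHORT.match(word) is None
    )

for w in ["東京", "これ", "ような", "オリンピック", "data", "AI"]:
    print(w, keep_word(w))
```

Under these assumptions a two-character kanji word survives while a two-letter acronym such as "AI" does not, which shows how aggressively the heuristic trades recall for fewer noisy keywords.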
  28. NEologd-based custom Kuromoji dictionary

    github.com/neologd/mecab-ipadic-neologd / github.com/atilika/kuromoji

    ‣ Kuromoji format for tokenize_ja()
    ‣ Filter useless words
  29. Digdag workflow built by API: TF-IDF weighting and keyword extraction

    takuti.me/note/tf-idf

        article_keyword AS (
          SELECT
            tf.article_id,
            tf.word,
            tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
          FROM
            tf
          JOIN
            df ON tf.word = df.word
          WHERE
            df.cnt >= 2
            -- ignore too common words
            AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)}
        )
        SELECT
          each_top_k(20, article_id, tfidf, article_id, word) AS (rank, score, article_id, word)
        FROM (
          SELECT article_id, word, tfidf
          FROM article_keyword
          CLUSTER BY article_id
        ) t
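The same keyword extraction can be illustrated in plain Python. This is a minimal sketch, not the production query: it assumes raw term counts for `tf` and the weighting `tf * (log10(N / df) + 1)`, and the function name `top_keywords` and its `max_df` default are hypothetical.

```python
import math
from collections import Counter

def top_keywords(articles, k=20, min_df=2, max_df=None):
    """articles: {article_id: [word, ...]} -> {article_id: [(word, tfidf), ...]}"""
    n_docs = len(articles)
    # term frequency per article and document frequency per word
    tf = {a: Counter(words) for a, words in articles.items()}
    df = Counter(w for counts in tf.values() for w in counts)
    if max_df is None:
        max_df = n_docs  # hypothetical default; the query computes its own upper bound
    result = {}
    for a, counts in tf.items():
        scored = [
            # assumed weighting: tf * (log10(N / df) + 1)
            (w, freq * (math.log10(n_docs / df[w]) + 1.0))
            for w, freq in counts.items()
            if min_df <= df[w] <= max_df  # drop too rare and too common words
        ]
        # highest score first, ties broken alphabetically, keep the top k per article
        scored.sort(key=lambda x: (-x[1], x[0]))
        result[a] = scored[:k]
    return result

keywords = top_keywords({
    "a1": ["olympic", "olympic", "game"],
    "a2": ["olympic", "game", "medal"],
    "a3": ["game", "law"],
})
print(keywords["a1"])
```

Words that occur in only one article ("medal", "law") are dropped by the `min_df` filter, mirroring `df.cnt >= 2` in the query.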
  30. Aggregate over customers’ behaviors STEP 1 STEP 2 Society Olympic

    game medal president citizen rule law data cloud CDP politics law US nation equation math curry rice history Science Food, Culture sum() l1_normalize() each_top_k() td_interest_words Next: Map words into categories td_affinity_categories JOIN
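The aggregation named on this slide (JOIN, then `sum()`, `l1_normalize()`, and `each_top_k()`) can be sketched in Python. The function name `interest_words`, the input shapes, and the sample scores are assumptions made for illustration only.

```python
from collections import defaultdict

def interest_words(visits, article_keywords, k=3):
    """visits: [(customer_id, article_id)]; article_keywords: {article_id: {word: score}}"""
    totals = defaultdict(lambda: defaultdict(float))
    # JOIN visits with article keywords, then sum() scores per (customer, word)
    for customer_id, article_id in visits:
        for word, score in article_keywords.get(article_id, {}).items():
            totals[customer_id][word] += score
    result = {}
    for customer_id, scores in totals.items():
        # l1_normalize(): divide by the sum of absolute scores (guard against zero)
        norm = sum(abs(s) for s in scores.values()) or 1.0
        # each_top_k(): keep the k highest-scoring words per customer
        ranked = sorted(scores.items(), key=lambda x: (-x[1], x[0]))[:k]
        result[customer_id] = [(w, s / norm) for w, s in ranked]
    return result

words = interest_words(
    [("c1", "a1"), ("c1", "a2"), ("c2", "a2")],
    {"a1": {"olympic": 2.0, "game": 1.0}, "a2": {"game": 1.0}},
)
print(words)
```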
  31. Map words into IAB categories in relational schema

    support.aerserv.com/hc/en-us/articles/207148516-List-of-IAB-Categories

    td_interest_words:

        cdp_customer_id  word      score (TF-IDF)
        aaa-bbb-cccc     politics  0.3
        aaa-bbb-cccc     law       0.2
        …                …         …
        ddd-eee-ffff     math      0.7
        …                …         …
        xxx-yyy-zzzz     history   0.4

    JOIN

    Mapping table:

        word      category                     probability
        anime     IAB1 Arts & Entertainment    0.4
        anime     IAB5 Education               0.1
        anime     IAB9 Hobbies & Interests     0.5
        politics  IAB11 Law, Gov’t & Politics  0.8
        …         …                            …
        coffee    IAB8 Food & Drink            0.9
  32. Join “inverted” mapping table

    td_interest_words:

        cdp_customer_id  word      score (TF-IDF)
        aaa-bbb-cccc     politics  0.3
        aaa-bbb-cccc     law       0.2
        …                …         …
        ddd-eee-ffff     math      0.7
        …                …         …
        xxx-yyy-zzzz     history   0.4

    Mapping table:

        word      category:probability
        anime     [ IAB1:0.4, IAB5:0.1, IAB9:0.5 ]
        politics  [ IAB11:0.8, … ]
        …         …
        coffee    [ IAB8:0.9, … ]

    SELECT sum(score * probability) GROUP BY cdp_customer_id, category
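The join against the inverted mapping table followed by `sum(score * probability)` per customer and category can be sketched in Python. The function name `affinity_scores` is hypothetical, and the sample rows only borrow values shown in the tables for illustration.

```python
from collections import defaultdict

def affinity_scores(interest_words, mapping):
    """interest_words: [(customer_id, word, score)];
    mapping: {word: {category: probability}}, i.e. the "inverted" mapping table."""
    scores = defaultdict(float)
    for customer_id, word, score in interest_words:
        # words missing from the mapping table contribute nothing (inner join)
        for category, probability in mapping.get(word, {}).items():
            # sum(score * probability) GROUP BY cdp_customer_id, category
            scores[(customer_id, category)] += score * probability
    return dict(scores)

mapping = {
    "anime": {"IAB1": 0.4, "IAB5": 0.1, "IAB9": 0.5},
    "politics": {"IAB11": 0.8},
}
rows = [("aaa-bbb-cccc", "politics", 0.3), ("aaa-bbb-cccc", "anime", 0.2)]
scores = affinity_scores(rows, mapping)
print(scores)
```

Keeping one row per word with a list of category probabilities means each interest word is looked up once, instead of being matched against one mapping row per category.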
  33. Create mapping table from Wikipedia dump

    Corpus: <word, score> pairs of articles

    Mapping table:

        word      category                     probability
        anime     IAB1 Arts & Entertainment    0.4
        anime     IAB5 Education               0.1
        anime     IAB9 Hobbies & Interests     0.5
        politics  IAB11 Law, Gov’t & Politics  0.8
        …         …                            …
        coffee    IAB8 Food & Drink            0.9
  34. Create mapping table from Wikipedia dump

    Mapping table:

        word      category                     probability
        anime     IAB1 Arts & Entertainment    0.4
        anime     IAB5 Education               0.1
        anime     IAB9 Hobbies & Interests     0.5
        politics  IAB11 Law, Gov’t & Politics  0.8
        …         …                            …
        coffee    IAB8 Food & Drink            0.9

    IAB category to Wikipedia category:

        IAB category                   Wikipedia category (English)  Wikipedia category (Japanese)
        IAB1 Arts & Entertainment      Entertainment                 娯楽
        IAB2 Automotive                Automobiles                   自動車
        …                              …                             …
        IAB23 Religion & Spirituality  Religion                      宗教

    Find related articles from root category (e.g., Entertainment): github.com/takuti/fastcat
  35. Create mapping table from Wikipedia dump

    Mapping table:

        word      category                     probability
        anime     IAB1 Arts & Entertainment    0.4
        anime     IAB5 Education               0.1
        anime     IAB9 Hobbies & Interests     0.5
        politics  IAB11 Law, Gov’t & Politics  0.8
        …         …                            …
        coffee    IAB8 Food & Drink            0.9

    IAB category to Wikipedia category:

        IAB category                   Wikipedia category (English)  Wikipedia category (Japanese)
        IAB1 Arts & Entertainment      Entertainment                 娯楽
        IAB2 Automotive                Automobiles                   自動車
        …                              …                             …
        IAB23 Religion & Spirituality  Religion                      宗教

    Corpus <word, score>:
    1) Aggregate word scores per category
    2) Normalize them per word
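The two steps for building the mapping table, aggregating word scores per category and then normalizing them per word, can be sketched in Python. The function name `build_mapping`, the input shape, and the toy scores are assumptions; the real pipeline works on a Wikipedia dump.

```python
from collections import defaultdict

def build_mapping(category_articles):
    """category_articles: {category: [{word: score}, ...]} -> {word: {category: probability}}"""
    totals = defaultdict(lambda: defaultdict(float))
    # 1) aggregate word scores per category
    for category, articles in category_articles.items():
        for word_scores in articles:
            for word, score in word_scores.items():
                totals[word][category] += score
    # 2) normalize per word so that each word's probabilities sum to 1
    mapping = {}
    for word, by_category in totals.items():
        z = sum(by_category.values())
        if z > 0:
            mapping[word] = {c: s / z for c, s in by_category.items()}
    return mapping

table = build_mapping({
    "IAB1": [{"anime": 2.0, "movie": 1.0}, {"anime": 2.0}],
    "IAB5": [{"anime": 1.0}],
    "IAB9": [{"anime": 5.0}],
})
print(table["anime"])
```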
  36. Put sub categories in parallel, and filter out unconfident ones

        cdp_customer_id  td_affinity_main_categories  td_affinity_sub_categories
        aaa-bbb-cccc     [ IAB11, IAB23 ]             [ IAB2-4, IAB11-1, IAB12-3 ]
        ddd-eee-ffff     [ IAB9, IAB15 ]              [ IAB8-3, IAB20-8 ]
        …                …                            …
        xxx-yyy-zzzz     [ IAB14 ]                    [ IAB14-1, IAB14-3, IAB19-7 ]
  37. New challenge: Computationally heavy…

  38. None
  39. 2. Predictive customer scoring

    UI-assisted binary classification (logistic regression)

  40. “segment” = subset of audience customers Audience Segment

  41. Create segment corresponding to positive samples

    Audience / Segment: already “converted” customers
  42. Select features and their preprocessing rule

    “Guess features”: suggest useful features
  43. RUN…! Distribution of predictive scores Classify customers by predictive scores

  44. Accuracy and metrics

  45. Challenges

    ‣ Guessing feature representation along with detecting “categorical” and “quantitative” columns to apply min-max normalization
    ‣ Calibrating number of positive/negative samples for differently sized data
    ‣ Providing enough information to refine features and prevent “leakage”, even for non-ML experts
  46. Guess feature representation API

    For sampled values:
    ‣ Column name, type
    ‣ Cardinality
    ‣ Mean, variance, percentile
    ‣ Regular expression
    ‣ …
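To make the idea of guessing a column's representation from sampled values concrete, here is a deliberately simple Python heuristic. It is not the actual API logic: the function names, the `max_categories` threshold, and the rule itself (type plus cardinality only) are hypothetical.

```python
def guess_representation(values, max_categories=20):
    """Guess "categorical" or "quantitative" from sampled column values."""
    non_null = [v for v in values if v is not None]
    # numeric type check (bool is excluded because it behaves like a flag)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    cardinality = len(set(non_null))
    # hypothetical rule: numeric columns with many distinct values are quantitative
    if numeric and cardinality > max_categories:
        return "quantitative"
    return "categorical"

def min_max_normalize(values):
    """Rescale quantitative values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(guess_representation(["Chrome", "Safari", "Chrome"]))
print(guess_representation(list(range(100))))
print(min_max_normalize([10, 20, 30]))
```

A numeric column with few distinct values, such as a 0/1 flag, is treated as categorical here, which is one reason cardinality is inspected in addition to the column type.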
  47. Guess: Test on Criteo data from Kaggle competition

  48. Guess: Correctly detect categorical and quantitative columns

  49. How I integrated ML-related knowledge with API code: Write everything in comments, documents and commits
  50. Calibrating # of samples: Over-sample minor class

    takuti.me/note/adjusting-for-oversampling-and-undersampling

        WITH label2cnt AS (
          SELECT map_agg(label, cnt) AS kv
          FROM (
            SELECT label, CAST(COUNT(1) AS double) AS cnt
            FROM cdp_tmp_${model_table_name}_samples_${scope}
            GROUP BY label
          ) t
        )
        SELECT
          -- If % of minor samples is very small (less than 0.1%),
          -- amplify them so that at least 1% of samples are occupied by the minors.
          IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
             cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
          IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
             cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
          -- Amplify very small data regardless of its label, because tiny dataset
          -- possibly shows poor accuracy.
          IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
        FROM label2cnt

    (Figure: negative samples vs. positive samples)
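The arithmetic of the query is easier to follow with concrete numbers, so here is a Python sketch of the same three rates. The function name `oversample_rates` is hypothetical, and the sketch assumes both classes are non-empty so the ratios are defined.

```python
import math

def oversample_rates(num_pos, num_neg, num_samples):
    """Mirror the query: returns (pos_rate, neg_rate, all_rate)."""
    pos_ratio = num_pos / num_neg  # kv[1] / kv[0]
    neg_ratio = num_neg / num_pos  # kv[0] / kv[1]
    # amplify a class only when it is rarer than 0.1%, aiming for roughly 1%
    pos_rate = math.floor(0.01 / pos_ratio) if pos_ratio < 0.001 else 1
    neg_rate = math.floor(0.01 / neg_ratio) if neg_ratio < 0.001 else 1
    # tiny datasets are amplified 10x regardless of label
    all_rate = 1 if num_samples > 100000 else 10
    return pos_rate, neg_rate, all_rate

print(oversample_rates(300, 1_000_000, 1_000_300))
print(oversample_rates(5_000, 5_000, 10_000))
```

With 300 positives against 1,000,000 negatives, the positive class is repeated 33 times, giving 9,900 positive samples, which is close to 1% of the negatives.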
  51. To refine predictive model and prevent leakage: Show evaluation results and feature importance

    Audience / Segment: 80% Train, 20% Test (Predict)

    Accuracy: AUC, LogLoss

    Model for validation / Model for production
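The evaluation step (hold out 20% of samples, then report AUC and LogLoss) can be sketched with the Python standard library. The helper names and the toy labels are hypothetical, and the pairwise AUC below is a simple quadratic-time formulation meant for illustration rather than large data.

```python
import math
import random

def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle and split samples into train (80%) and test (20%) parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels; smaller is better."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def auc(labels, probs):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train, test = train_test_split(list(range(10)))
print(len(train), len(test))
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
print(logloss([1, 0], [0.5, 0.5]))
```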
  52. Overview: How predictive customer scoring works

    GUESS: Automatically select and transform customer attributes

        td_client_id            XXX-YYY-ZZZZZ
        td_ip                   192.168.0.1
        td_referrer             http://google.com/…
        spend_time              1.5
        …                       …
        td_interest_words       Olympic, baseball, game
        td_affinity_categories  Sports, Entertainment

    Guess how to cleanse data (e.g., Japan, google.com, 1.5)

    1ST PASS: Treasure CDP does everything for you
    (Audience / Segment of already “converted” customers; build predictive model; evaluation)

    FROM 2ND PASS: You can make your predictive model better with ML experts
    (accuracy: sufficient?)

    SCORE CUSTOMERS: Audience scored as Unlikely, Marginally, Possibly, Likely
    (example scores: 12, 20, 3, 34, 40, 72, 58, 82, 93, 99, 78)

    SYNDICATE
  53. How enterprise-grade ML/NLP solution should be

    ‣ Scalable: Digdag, Hivemall, Presto, Hadoop, Embulk, …
    ‣ Accurate: with no crucial mistakes and trivial false positives
    ‣ Interpretable: in terms of both algorithm and UI design for all users
  54. MVP = classic algorithms and heuristics because there is no free lunch

    ajustchicago.org/2016/01/aint-no-free-lunch
  55. Machine Learning and Natural Language Processing on Treasure CDP

    Takuya Kitazawa @takuti, Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer