Machine Learning and Natural Language Processing on Treasure CDP

Slide 1

Slide 1 text

Machine Learning and Natural Language Processing on Treasure CDP Takuya Kitazawa @takuti Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer

Slide 2

Slide 2 text

takuti.me

Slide 3

Slide 3 text

Word-based customer tagging and categorization (2017)
Store customers’ browsing log from TD JavaScript SDK
STEP 1: Extract keywords from each article
STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
Create audience
Example categories: Society, Science, Food, Culture
Example keywords: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
Browsing log record:
  td_client_id    XXX-YYY-ZZZZZ
  td_title        Today’s news
  td_description  The Olympic game has been started …
  td_host         www.td-news.com
  td_path         /2017/10/01/olympic
Aggregated customer record:
  td_client_id            XXX-YYY-ZZZZZ
  td_interest_words       Olympic, baseball, game
  td_affinity_categories  Sports, Entertainment

Slide 4

Slide 4 text

Predictive customer scoring (2018)

Slide 5

Slide 5 text

Past: Hivemall
Present: + Digdag
Future: ML & NLP on UI

Slide 6

Slide 6 text

Past and present: Machine-learning-related capabilities on TD

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Data 3rd-party tools (e.g., visualization) SQL + heavy lightweight Treasure ML SELECT * FROM data … Pandas TD

Slide 9

Slide 9 text

To be released… server / store load

Slide 10

Slide 10 text

ML-related capability (1/2)
Classification: Soft Confidence-Weighted, Random Forest, Logistic Regression, …
‣ Binary: “Likely to buy our product?” “Is this email spam?”
‣ Multi-class: “Will it be sunny, cloudy, or rainy?” “Which group does this user belong to?”
Regression: Random Forest, AdaDelta, Factorization Machines, …
‣ “Tomorrow’s temperature” “Estimated product sales in next month” “This user’s annual income”
Recommendation: Matrix Factorization, Factorization Machines, …
‣ “Customers who bought this also bought …”
Anomaly Detection: Local Outlier Factor, ChangeFinder, …
‣ “Suddenly increased # of visitors on our web site”

Slide 11

Slide 11 text

ML-related capability (2/2)
Natural Language Processing: Sentence tokenization, Find singular form of English word, …
‣ select tokenize('Hello, world!') → ["Hello", "world"]
‣ select singularize('apples') → apple
Clustering: Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
‣ “Which articles are similar to this one?”
Geospatial Functions
‣ “I'd like to see a map around a specific pair of latitude and longitude”

Slide 12

Slide 12 text

Use case: ML-based customer segmentation at OISIX
1. Predict probability of churn
2. Aggressively reach out to “likely to churn” customers
https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
OISIX’s data (diagram labels): Web, Mobile, Customer attr., Behavior on web, Complaint log, Source, Signed-up services, Actions (direct), Actions (indirect), Point, Call, Guide to success, UI

Slide 13

Slide 13 text

Real-world ML workflow
Problem: What you want to “predict”
Hypothesis & Proposal: Which columns should we use?
Build machine learning model (every step is one or more queries):
‣ Historical data
‣ Cleanse data: Extract, Filter, Interpolate, Normalize, …
‣ Train data: Get features, Train, …
‣ Test data: Get features, Predict, …
Evaluate: Accuracy. Sufficient accuracy?
Ship to production

Slide 14

Slide 14 text

Split samples Rescale and vectorize samples Train model

Slide 15

Slide 15 text

Digdag…!
Build machine learning model (every step is one or more queries):
‣ Cleanse data: Extract, Filter, Interpolate, Normalize, …
‣ Train data: Get features, Train, …
‣ Test data: Get features, Predict, …
Evaluate: Accuracy

Slide 16

Slide 16 text

+preprocess:
  _parallel: true

  +train:
    td>: ../queries/preprocess_train.sql
    create_table: train

  +test:
    td>: ../queries/preprocess_test.sql
    create_table: test

+logress_train:
  td>: queries/logress_train.sql
  create_table: logress_model

+compute_downsampling_rate:
  td>: queries/downsampling_rate.sql
  engine: presto
  store_last_results: true

+logress_predict:
  td>: queries/logress_predict.sql
  create_table: prediction

+evaluate:
  td>: queries/evaluate.sql
  store_last_results: true

+show_accuracy:
  echo>: "Logloss (smaller is better): ${td.last_results.logloss}"

Slide 17

Slide 17 text

treasure-data/workflow-examples

Slide 18

Slide 18 text

A Customer Data Platform is a marketer-controlled integrated customer database that can support coordinated programs across multiple channels.
Treasure CDP
‣ Data Collection
‣ Workflow, Query, Reporting, Data Warehouse, Machine Learning
‣ ID Unification, Segmentation, Syndication
‣ Campaign Execution

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

System is scalable, but ML team is NOT scalable

Slide 21

Slide 21 text

Future: ML & NLP solutions for everyone on CDP Providing unified interface to all TD users

Slide 22

Slide 22 text

“customer” = attributes + behaviors on CDP application
cdp_customer_id: “aaa-bbb-cccc”

Attributes:
  Age      24
  Sex      Man
  Email    [email protected]
  Address  Nakano, Tokyo, Japan
  …        …

Behaviors (page views):
  Time        Host       Path    Browser  …
  1514899923  takuti.me  /about  Chrome   …
  1517305451  takuti.me  /       Safari   …
  1518765966  takuti.me  /note   Chrome   …
  …           …          …       …        …

Behaviors (item events):
  Time        Item ID  Referrer      OS       …
  1513080070  XXX      twitter.com   macOS    …
  1515488949  YYY      google.com    iOS      …
  1518766618  ZZZ      facebook.com  Android  …
  …           …        …             …        …

Slide 23

Slide 23 text

“audience” = set of customers Audience

Slide 24

Slide 24 text

1. Word-based customer tagging and categorization for Japanese and English
Store customers’ browsing log from TD JavaScript SDK
STEP 1: Extract keywords from each article
STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
Create audience
Example categories: Society, Science, Food, Culture
Example keywords: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
Browsing log record:
  td_client_id    XXX-YYY-ZZZZZ
  td_title        Today’s news
  td_description  The Olympic game has been started …
  td_host         www.td-news.com
  td_path         /2017/10/01/olympic
Aggregated customer record:
  td_client_id            XXX-YYY-ZZZZZ
  td_interest_words       Olympic, baseball, game
  td_affinity_categories  Sports, Entertainment

Slide 25

Slide 25 text

Challenges
‣ Short input texts and wide-ranging content types depending on data
‣ Unsupervised customer categorization with fewer false positives
‣ Tokenizing new words: titles such as 「ゆるキャン」 and 「からかい上手の高木さん」 should each be treated as a single word
Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting

Slide 26

Slide 26 text

Digdag workflow built by API: Preprocess

SELECT
  ${join_column_name},
  concat(td_host, td_path) AS article_id,
  concat(
    -- remove site name which commonly occurs at the foot of page title
    regexp_replace(
      -- "(xxx)" is generally meaningless, accessory part of page title
      regexp_replace(td_title, '[(（].+?[)）]', ''),
      '[|-] .+$',
      ''
    ),
    ' ',
    coalesce(td_description, '')
  ) AS content
FROM
  ${behavior}
WHERE
  td_title IS NOT NULL
  AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))
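The title cleanup in the query above can be tried outside the warehouse with a small Python sketch. It assumes the two patterns mean "drop parenthesized fragments (ASCII or full-width parentheses)" and "drop a trailing site name after `| ` or `- `"; the function name `clean_title` is hypothetical and not part of the workflow.

```python
import re
from typing import Optional


def clean_title(title: str, description: Optional[str] = None) -> str:
    """Mimic the nested regexp_replace + concat of the preprocess query."""
    # Remove "(xxx)" fragments; both ASCII and full-width parentheses are assumed.
    text = re.sub(r"[(（].+?[)）]", "", title)
    # Remove a site name that follows "| " or "- " up to the end of the title.
    text = re.sub(r"[|-] .+$", "", text)
    # concat(title, ' ', coalesce(description, ''))
    return f"{text} {description or ''}"


print(clean_title("Olympic opens (photos) | TD News", "The games began").split())
```

Whitespace left behind by the removals is kept, as in the SQL; the tokenizer in the next step is what ultimately discards it.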

Slide 27

Slide 27 text

Digdag workflow built by API: Tokenize (Japanese)

SELECT
  article_id,
  word
FROM
  article t1
LATERAL VIEW explode(
  tokenize_ja(
    normalize_unicode(content, 'NFKC'),
    "normal",
    array("a", "about", "above", "across", "after", "again", ...),
    array("副詞", "助詞", "動詞", "記号", "名詞-数", "副詞-一般", "助詞-特殊", "動詞-接尾", ...),
    "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
  )
) t2 AS word
WHERE
  length(word) >= 2
  -- acceptable characters
  AND word RLIKE '^[ぁ-んーァ-ヶー一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$'
  -- even if word consists of acceptable characters, reject "len-2 non-kanji word" and "len-3 hiragana-only word"
  AND word NOT RLIKE '^([^一-龠]{1,2}|[ぁ-んー]{1,3})$'
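The WHERE clause of the tokenize step is a pure string filter, so it can be sketched in Python. The character classes below (hiragana, katakana, kanji, half- and full-width Latin letters, and a few full-width punctuation marks) are an assumption about what the query's ranges cover, and `keep_token` is a hypothetical helper name.

```python
import re

# Assumed "acceptable characters": hiragana, katakana, kanji, Latin letters, ・！？
ACCEPTABLE = re.compile(r"^[ぁ-んーァ-ヶ一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$")
# Reject words of 1-2 non-kanji characters and hiragana-only words of up to 3 characters.
TOO_WEAK = re.compile(r"^([^一-龠]{1,2}|[ぁ-んー]{1,3})$")


def keep_token(word: str) -> bool:
    """Return True if the token would survive the three WHERE conditions."""
    return (
        len(word) >= 2
        and ACCEPTABLE.match(word) is not None
        and TOO_WEAK.match(word) is None
    )


for w in ["オリンピック", "東京", "これ", "ありが", "data", "ab", "2017"]:
    print(w, keep_token(w))
```

Two-character kanji words such as 東京 pass because the rejection rule only targets non-kanji and hiragana-only tokens, which tend to be particles or fragments.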

Slide 28

Slide 28 text

NEologd-based custom Kuromoji dictionary github.com/neologd/mecab-ipadic-neologd / github.com/atilika/kuromoji Kuromoji format for tokenize_ja() Filter useless words

Slide 29

Slide 29 text

Digdag workflow built by API: TF-IDF weighting and keyword extraction
takuti.me/note/tf-idf

article_keyword AS (
  SELECT
    tf.article_id,
    tf.word,
    tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
  FROM
    tf
  JOIN
    df ON tf.word = df.word
  WHERE
    df.cnt >= 2
    AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)} -- ignore too common words
)
SELECT
  each_top_k(
    20, article_id, tfidf,
    article_id, word
  ) AS (rank, score, article_id, word)
FROM (
  SELECT
    article_id, word, tfidf
  FROM
    article_keyword
  CLUSTER BY
    article_id
) t
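As a rough illustration of this step, the sketch below computes TF-IDF per article, applies the same document-frequency bounds, and keeps the top-k words. The weighting formula (relative term frequency times `log10(N / df) + 1`) is an assumption, since the slides do not spell out what `tfidf()` computes, and `top_keywords` is a hypothetical name.

```python
import math
from collections import Counter


def top_keywords(articles, k=20, min_df=2):
    """articles: {article_id: [word, ...]} -> {article_id: [(word, score), ...]}"""
    n_docs = len(articles)
    max_df = max(100000, n_docs / 2)  # same upper bound as in the query
    # Document frequency: number of articles containing each word.
    df = Counter(w for words in articles.values() for w in set(words))
    result = {}
    for article_id, words in articles.items():
        tf = Counter(words)
        scored = [
            (w, (c / len(words)) * (math.log10(n_docs / df[w]) + 1.0))
            for w, c in tf.items()
            if min_df <= df[w] <= max_df
        ]
        # Highest score first; ties are broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[article_id] = scored[:k]
    return result


docs = {
    "a": ["olympic", "game", "medal", "game"],
    "b": ["game", "law"],
    "c": ["law", "politics"],
}
print(top_keywords(docs))
```

Words that appear in only one article are dropped by the `df >= 2` bound, which removes typos and one-off tokens before ranking.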

Slide 30

Slide 30 text

Aggregate over customers’ behaviors
STEP 1: keywords per article (Society, Science, Food, Culture; Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history)
STEP 2: sum() → l1_normalize() → each_top_k() → td_interest_words
Next: Map words into categories (JOIN) → td_affinity_categories
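A minimal sketch of the aggregation named on this slide: sum keyword scores over the articles a customer visited, L1-normalize per customer, then keep the top-k words. The input shapes and the name `interest_words` are assumptions made for illustration.

```python
from collections import defaultdict


def interest_words(visits, article_keywords, k=3):
    """visits: iterable of (customer_id, article_id);
    article_keywords: {article_id: {word: tfidf}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for customer_id, article_id in visits:
        for word, score in article_keywords.get(article_id, {}).items():
            totals[customer_id][word] += score  # sum()
    result = {}
    for customer_id, scores in totals.items():
        norm = sum(abs(v) for v in scores.values())  # l1_normalize()
        ranked = sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]  # each_top_k()
        result[customer_id] = [(w, s / norm) for w, s in ranked] if norm else []
    return result


keywords = {"a1": {"olympic": 3.0, "game": 1.0}, "a2": {"game": 1.0, "law": 1.0}}
visits = [("c1", "a1"), ("c1", "a2"), ("c2", "a2")]
print(interest_words(visits, keywords, k=2))
```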

Slide 31

Slide 31 text

Map words into IAB categories in relational schema
support.aerserv.com/hc/en-us/articles/207148516-List-of-IAB-Categories

td_interest_words (score = TF-IDF):
  cdp_customer_id  word      score
  aaa-bbb-cccc     politics  0.3
  aaa-bbb-cccc     law       0.2
  …                …         …
  ddd-eee-ffff     math      0.7
  …                …         …
  xxx-yyy-zzzz     history   0.4

Mapping table:
  word      category                     probability
  anime     IAB1 Arts & Entertainment    0.4
  anime     IAB5 Education               0.1
  anime     IAB9 Hobbies & Interests     0.5
  politics  IAB11 Law, Gov’t & Politics  0.8
  …         …                            …
  coffee    IAB8 Food & Drink            0.9

JOIN td_interest_words with the mapping table

Slide 32

Slide 32 text

Join “inverted” mapping table

td_interest_words (score = TF-IDF):
  cdp_customer_id  word      score
  aaa-bbb-cccc     politics  0.3
  aaa-bbb-cccc     law       0.2
  …                …         …
  ddd-eee-ffff     math      0.7
  …                …         …
  xxx-yyy-zzzz     history   0.4

Mapping table:
  word      category:probability
  anime     [ IAB1:0.4, IAB5:0.1, IAB9:0.5 ]
  politics  [ IAB11:0.8, … ]
  …         …
  coffee    [ IAB8:0.9, … ]

SELECT sum(score * probability) GROUP BY cdp_customer_id, category
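The join and aggregation on this slide amount to a weighted sum per (customer, category) pair. Below is a small sketch using the slide's mapping values; the `anime` interest row for the customer is hypothetical, and `affinity_scores` is an illustrative name.

```python
from collections import defaultdict


def affinity_scores(interest_words, mapping):
    """interest_words: iterable of (customer_id, word, score);
    mapping: {word: {category: probability}} (the "inverted" mapping table)."""
    scores = defaultdict(float)
    for customer_id, word, score in interest_words:
        # Words missing from the mapping table contribute nothing (inner join).
        for category, probability in mapping.get(word, {}).items():
            scores[(customer_id, category)] += score * probability
    return dict(scores)


mapping = {
    "anime": {"IAB1": 0.4, "IAB5": 0.1, "IAB9": 0.5},
    "politics": {"IAB11": 0.8},
}
rows = [("aaa-bbb-cccc", "politics", 0.3), ("aaa-bbb-cccc", "anime", 0.2)]
print(affinity_scores(rows, mapping))
```

Storing one row per word with all of its categories keeps the join key unique, so each interest word is matched once rather than once per category.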

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Create mapping table from Wikipedia dump

Mapping table:
  word      category                     probability
  anime     IAB1 Arts & Entertainment    0.4
  anime     IAB5 Education               0.1
  anime     IAB9 Hobbies & Interests     0.5
  politics  IAB11 Law, Gov’t & Politics  0.8
  …         …                            …
  coffee    IAB8 Food & Drink            0.9

IAB category to Wikipedia category:
  IAB category                   English         Japanese
  IAB1 Arts & Entertainment      Entertainment   娯楽
  IAB2 Automotive                Automobilities  自動車
  …                              …               …
  IAB23 Religion & Spirituality  Religion        宗教

Find related articles from root category (e.g., Entertainment → …)
github.com/takuti/fastcat

Slide 35

Slide 35 text

Create mapping table from Wikipedia dump

Mapping table:
  word      category                     probability
  anime     IAB1 Arts & Entertainment    0.4
  anime     IAB5 Education               0.1
  anime     IAB9 Hobbies & Interests     0.5
  politics  IAB11 Law, Gov’t & Politics  0.8
  …         …                            …
  coffee    IAB8 Food & Drink            0.9

IAB category to Wikipedia category:
  IAB category                   English         Japanese
  IAB1 Arts & Entertainment      Entertainment   娯楽
  IAB2 Automotive                Automobilities  自動車
  …                              …               …
  IAB23 Religion & Spirituality  Religion        宗教

Corpus:
1) Aggregate word scores per category
2) Normalize them per word
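The two corpus steps (aggregate word scores per category, then normalize per word) can be sketched as follows. The raw scores are made up so that the result reproduces the `anime` row of the mapping table, and `build_mapping` is a hypothetical name.

```python
from collections import defaultdict


def build_mapping(word_scores):
    """word_scores: iterable of (category, word, score), e.g. TF-IDF of a word
    in an article that belongs to the category."""
    totals = defaultdict(lambda: defaultdict(float))
    for category, word, score in word_scores:
        totals[word][category] += score  # 1) aggregate per category
    mapping = {}
    for word, per_category in totals.items():
        z = sum(per_category.values())
        if z > 0:
            # 2) normalize per word so the probabilities sum to 1
            mapping[word] = {c: s / z for c, s in per_category.items()}
    return mapping


rows = [
    ("IAB1", "anime", 1.5),
    ("IAB1", "anime", 2.5),
    ("IAB5", "anime", 1.0),
    ("IAB9", "anime", 5.0),
]
print(build_mapping(rows))
```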

Slide 36

Slide 36 text

Put sub categories in parallel, and filter out unconfident ones

  cdp_customer_id  td_affinity_main_categories  td_affinity_sub_categories
  aaa-bbb-cccc     [ IAB11, IAB23 ]             [ IAB2-4, IAB11-1, IAB12-3 ]
  ddd-eee-ffff     [ IAB9, IAB15 ]              [ IAB8-3, IAB20-8 ]
  …                …                            …
  xxx-yyy-zzzz     [ IAB14 ]                    [ IAB14-1, IAB14-3, IAB19-7 ]

Slide 37

Slide 37 text

New challenge: Computationally heavy…

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

2. Predictive customer scoring UI-assisted binary classification (logistic regression)

Slide 40

Slide 40 text

“segment” = subset of audience customers Audience Segment

Slide 41

Slide 41 text

Create segment corresponding to positive samples Audience Segment Already “converted” customers

Slide 42

Slide 42 text

Select features and their preprocessing rule “Guess features” Suggest useful features

Slide 43

Slide 43 text

RUN…! Distribution of predictive scores Classify customers by predictive scores

Slide 44

Slide 44 text

Accuracy and metrics

Slide 45

Slide 45 text

Challenges
‣ Guessing feature representation, along with detecting “categorical” and “quantitative” columns to apply min-max normalization
‣ Calibrating number of positive/negative samples for differently sized data
‣ Providing enough information to refine features and prevent “leakage”, even for non-ML experts

Slide 46

Slide 46 text

Guess feature representation API
For sampled values:
‣ Column name, type
‣ Cardinality
‣ Mean, variance, percentile
‣ Regular expression
‣ …
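One way such a guess could work is sketched below: a column is treated as quantitative only if its sampled values are numeric and have high cardinality, and quantitative columns are then min-max normalized. The threshold and both function names are assumptions; the actual rules of the API are not given in the slides.

```python
def guess_representation(values, max_categories=20):
    """Guess "categorical" or "quantitative" from sampled column values.
    max_categories is a hypothetical cardinality threshold."""
    non_null = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    cardinality = len(set(non_null))
    if non_null and numeric and cardinality > max_categories:
        return "quantitative"
    return "categorical"


def min_max_normalize(values):
    """Rescale numeric values into [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(guess_representation(["Chrome", "Safari", "Chrome"]))
print(guess_representation([i * 0.5 for i in range(100)]))
print(min_max_normalize([12, 20, 3]))
```

Low-cardinality numeric columns (flags, small codes) are deliberately treated as categorical here, which is the kind of case the statistics listed above would help detect.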

Slide 47

Slide 47 text

Guess: Test on Criteo data from Kaggle competition

Slide 48

Slide 48 text

Guess: Correctly detect categorical and quantitative columns

Slide 49

Slide 49 text

How I integrated ML-related knowledge with API code: Write everything in comments, documents and commits

Slide 50

Slide 50 text

Calibrating # of samples: Over-sample minor class
takuti.me/note/adjusting-for-oversampling-and-undersampling

WITH label2cnt AS (
  SELECT
    map_agg(label, cnt) AS kv
  FROM (
    SELECT
      label,
      CAST(COUNT(1) AS double) AS cnt
    FROM
      cdp_tmp_${model_table_name}_samples_${scope}
    GROUP BY
      label
  ) t
)
SELECT
  -- If % of minor samples is very small (less than 0.1%),
  -- amplify them so that at least 1% of samples are occupied by the minors.
  IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
     cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
  IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
     cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
  -- Amplify very small data regardless of its label, because tiny dataset
  -- possibly shows poor accuracy.
  IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
FROM
  label2cnt

Diagram labels: Negative samples, Positive samples
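The rate calculation in the query can be mirrored in a few lines of Python to see what it returns for concrete counts. This is a sketch that follows the SQL literally, including its use of the positive-to-negative ratio rather than a true percentage; `oversample_rates` is a hypothetical name.

```python
import math


def oversample_rates(n_pos, n_neg, num_samples):
    """Return (pos_oversample_rate, neg_oversample_rate, all_oversample_rate)."""
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one sample")

    def rate(minor, major):
        ratio = minor / major
        # Amplify the minor class toward roughly 1% when it is below 0.1%.
        return math.floor(0.01 / ratio) if ratio < 0.001 else 1

    all_rate = 1 if num_samples > 100000 else 10
    return rate(n_pos, n_neg), rate(n_neg, n_pos), all_rate


print(oversample_rates(30, 1_000_000, 1_000_030))
print(oversample_rates(500, 1000, 1500))
```

With 30 positives against one million negatives, the positive samples are replicated a few hundred times, while a small balanced dataset is amplified as a whole.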

Slide 51

Slide 51 text

To refine predictive model and prevent leakage: Show evaluation results and feature importance
‣ Samples from Audience and Segment are split 80% / 20% into Train / Test
‣ Model for validation: Train, Predict, Accuracy (AUC, LogLoss)
‣ Model for production
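The two metrics named on this slide can be computed from held-out labels and predicted probabilities as in the sketch below. The helper names `log_loss` and `auc` are illustrative; the AUC here uses the pairwise-ranking definition, which is quadratic in the number of samples and only suitable for small checks.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels (smaller is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0 for pp in pos for pn in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
print(auc(labels, [0.9, 0.8, 0.3, 0.1]))
print(log_loss(labels, [0.9, 0.1, 0.8, 0.4]))
```

A suspiciously perfect AUC on the 20% test split is often the first visible symptom of leakage, which is why showing these numbers alongside feature importance helps non-experts.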

Slide 52

Slide 52 text

Overview: How predictive customer scoring works
GUESS: Automatically select and transform customer attributes
1ST PASS: Treasure CDP does everything for you
FROM 2ND PASS: You can make your predictive model better with ML experts
SCORE CUSTOMERS
SYNDICATE
Flow (diagram): Audience and Segment (already “converted” customers) → Guess how to cleanse data → Build predictive model → Evaluation (accuracy sufficient?) → Audience scored as Unlikely, Marginally, Possibly, Likely
Example customer attributes:
  td_client_id            XXX-YYY-ZZZZZ
  td_ip                   192.168.0.1
  td_referrer             http://google.com/…
  spend_time              1.5
  …                       …
  td_interest_words       Olympic, baseball, game
  td_affinity_categories  Sports, Entertainment
Example cleansed values: Japan, google.com, 1.5
Example predictive scores: 12, 20, 3, 34, 40, 72, 58, 82, 93, 99, 78

Slide 53

Slide 53 text

How enterprise-grade ML/NLP solution should be
‣ Scalable: Digdag, Hivemall, Presto, Hadoop, Embulk, …
‣ Accurate: with no crucial mistakes and trivial false positives
‣ Interpretable: in terms of both algorithm and UI design for all users

Slide 54

Slide 54 text

MVP = classic algorithms and heuristics because there is no free lunch ajustchicago.org/2016/01/aint-no-free-lunch

Slide 55

Slide 55 text

Machine Learning and Natural Language Processing on Treasure CDP Takuya Kitazawa @takuti Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer