Machine Learning and Natural Language Processing on Treasure CDP

Takuya Kitazawa (@takuti)
Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer

takuti.me

Word-based customer tagging and categorization (2017)
Store customers’ browsing log from TD JavaScript SDK
STEP 1: Extract keywords from each article
STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
Create audience
[Figure: keywords per article (Olympic, game, medal / president, citizen, rule, law / data, cloud, CDP) and keywords per category (Society: politics, law, US, nation / Science: equation, math / Food, Culture: curry, rice, history)]
Browsing log record:
  td_client_id            XXX-YYY-ZZZZZ
  td_title                Today’s news
  td_description          The Olympic game has been started …
  td_host                 www.td-news.com
  td_path                 /2017/10/01/olympic
Resulting customer attributes:
  td_client_id            XXX-YYY-ZZZZZ
  td_interest_words       Olympic, baseball, game
  td_affinity_categories  Sports, Entertainment

Predictive customer scoring (2018)

Past: Hivemall
Present: + Digdag
Future: ML & NLP on UI

Past and present: Machine-learning-related capabilities on TD

[Diagram: Data; SQL (SELECT * FROM data …); Treasure ML; Pandas; TD; 3rd-party tools (e.g., visualization); workloads labeled heavy and lightweight]

To be released… server / store load

ML-related capability (1/2)
Classification: Soft Confidence-Weighted, Random Forest, Logistic Regression, …
  ‣ Binary: “Likely to buy our product?” “Is this email spam?”
  ‣ Multi-class: “Will it be sunny, cloudy, or rainy?” “Which group does this user belong to?”
Regression: Random Forest, AdaDelta, Factorization Machines, …
  ‣ “Tomorrow’s temperature” “Estimated product sales in the next month” “This user’s annual income”
Recommendation: Matrix Factorization, Factorization Machines, …
  ‣ “Customers who bought this also bought …”
Anomaly Detection: Local Outlier Factor, ChangeFinder, …
  ‣ “Suddenly increased # of visitors on our web site”

ML-related capability (2/2)
Natural Language Processing: sentence tokenization, finding the singular form of an English word, …
  ‣ select tokenize('Hello, world!') returns ["Hello", "world"]
  ‣ select singularize('apples') returns apple
Clustering: Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
  ‣ “Which articles are similar to this one?”
Geospatial Functions
  ‣ “I’d love to see a map around a specific pair of latitude and longitude”

Use case: ML-based customer segmentation at OISIX
1. Predict probability of churn
2. Aggressively reach out to “likely to churn” customers
https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
OISIX’s data: Web, Mobile, Customer attr., Behavior on web, Complaint log, Source, Signed-up services, Actions (direct), Actions (indirect), Point, Call, Guide to success, UI

Real-world ML workflow
Problem: What you want to “predict”
Hypothesis & Proposal: Which columns should we use?
Build machine learning model:
  ‣ Historical data: Cleanse data (Extract, Filter, Interpolate, Normalize, …), one query per step
  ‣ Train data: Get features, Train, …, one query per step
  ‣ Test data: Get features, Predict, …, Accuracy, one query per step
Evaluate: Sufficient accuracy?
Ship to production

1. Split samples
2. Rescale and vectorize samples
3. Train model
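The three steps on this slide can be sketched outside SQL as follows. This is only a minimal pure-Python illustration, not the Hivemall queries from the talk: the toy data, the 80/20 split ratio, the learning rate, and every function name are assumptions made for the example.

```python
import math
import random

def fit_min_max(rows):
    """Learn per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]

def rescale(row, ranges):
    """Min-max rescale one row; constant columns map to 0.0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, epochs=200, learning_rate=0.5):
    """Stochastic gradient descent on the logistic loss; returns (weights, bias)."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            weights = [w - learning_rate * error * v for w, v in zip(weights, xi)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, xi):
    return sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias)

# Toy samples (made up): [age, visits] and a 0/1 label
samples = [([20, 1], 0), ([25, 2], 0), ([30, 8], 1), ([35, 9], 1), ([40, 10], 1),
           ([22, 0], 0), ([28, 7], 1), ([45, 12], 1), ([19, 1], 0), ([50, 11], 1)]

# 1. Split samples (80% train, 20% test)
random.Random(0).shuffle(samples)
cut = int(len(samples) * 0.8)
train, test = samples[:cut], samples[cut:]

# 2. Rescale and vectorize samples (ranges come from the training set only)
ranges = fit_min_max([x for x, _ in train])
X_train = [rescale(x, ranges) for x, _ in train]
y_train = [y for _, y in train]

# 3. Train model, then score the held-out samples
weights, bias = train_logistic_regression(X_train, y_train)
for x, y in test:
    print(x, y, round(predict(weights, bias, rescale(x, ranges)), 3))
```

Fitting the min/max ranges on the training split alone keeps test-set statistics out of the model, which matters for the leakage concern raised later in the deck.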

Digdag…!
The whole chain of queries runs as one Digdag workflow:
  ‣ Cleanse data: Extract, Filter, Interpolate, Normalize, …
  ‣ Build machine learning model: Train data (Get features, Train, …) and Test data (Get features, Predict, …)
  ‣ Evaluate: Accuracy

    +preprocess:
      _parallel: true
      +train:
        td>: ../queries/preprocess_train.sql
        create_table: train
      +test:
        td>: ../queries/preprocess_test.sql
        create_table: test
    +logress_train:
      td>: queries/logress_train.sql
      create_table: logress_model
    +compute_downsampling_rate:
      td>: queries/downsampling_rate.sql
      engine: presto
      store_last_results: true
    +logress_predict:
      td>: queries/logress_predict.sql
      create_table: prediction
    +evaluate:
      td>: queries/evaluate.sql
      store_last_results: true
    +show_accuracy:
      echo>: "Logloss (smaller is better): ${td.last_results.logloss}"

treasure-data/workflow-examples

A Customer Data Platform is a marketer-controlled integrated customer database that can support coordinated programs across multiple channels.
Treasure CDP: Data Collection / ID Unification, Segmentation, Syndication / Campaign Execution
Underneath: Workflow, Query, Reporting, Data Warehouse, Machine Learning

System is scalable, but ML team is NOT scalable

Future: ML & NLP solutions for everyone on CDP
Providing a unified interface to all TD users

“customer” = attributes + behaviors on CDP application
cdp_customer_id: “aaa-bbb-cccc”
Attributes:
  Age      24
  Sex      Man
  Email    [email protected]
  Address  Nakano, Tokyo, Japan
  …        …
Behaviors:
  Time        Host       Path    Browser  …
  1514899923  takuti.me  /about  Chrome   …
  1517305451  takuti.me  /       Safari   …
  1518765966  takuti.me  /note   Chrome   …
  …           …          …       …        …
Behaviors:
  Time        Item ID  Referrer      OS       …
  1513080070  XXX      twitter.com   macOS    …
  1515488949  YYY      google.com    iOS      …
  1518766618  ZZZ      facebook.com  Android  …
  …           …        …             …        …

“audience” = set of customers Audience

1. Word-based customer tagging and categorization for Japanese and English
Store customers’ browsing log from TD JavaScript SDK
STEP 1: Extract keywords from each article
STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
Create audience
Input columns: td_client_id, td_title, td_description, td_host, td_path
Output columns: td_client_id, td_interest_words (e.g., Olympic, baseball, game), td_affinity_categories (e.g., Sports, Entertainment)

Challenges
  ‣ Short input texts and wide-ranging content types depending on data
  ‣ Unsupervised customer categorization with fewer false positives
  ‣ Tokenizing new words (e.g., newly coined Japanese phrases that should each be treated as a single word)
Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting

Digdag workflow built by API: Preprocess
    SELECT
      ${join_column_name},
      concat(td_host, td_path) AS article_id,
      concat(
        -- remove site name which commonly occurs at the foot of page title
        regexp_replace(
          -- "(xxx)" is generally meaningless, accessory part of page title
          regexp_replace(td_title, '[(（].+?[)）]', ''),
          '[|-] .+$', ''
        ),
        ' ',
        coalesce(td_description, '')
      ) AS content
    FROM ${behavior}
    WHERE td_title IS NOT NULL
      AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))
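The title clean-up in this query can be tried locally with the same two regular expressions. The sketch below is a Python rendering under the assumption that the first character class matches both ASCII and full-width parentheses; the function name `build_content` and the sample title are made up for illustration.

```python
import re

def build_content(td_title, td_description=None):
    """Mimic the query's concat(regexp_replace(regexp_replace(...)), ' ', coalesce(...))."""
    # "(xxx)" or full-width "（xxx）" is treated as an accessory part of the title
    title = re.sub(r"[(（].+?[)）]", "", td_title)
    # Drop a trailing site name that follows "| " or "- "
    title = re.sub(r"[|-] .+$", "", title)
    # coalesce(td_description, '')
    return title + " " + (td_description or "")

content = build_content("Today's news (Sports) | TD News", "The Olympic game has been started")
print(repr(content))
```

Because the second pattern requires a space after `|` or `-`, hyphenated words inside a title (such as "e-mail") are left alone.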

Digdag workflow built by API: Tokenize (Japanese)
    SELECT article_id, word
    FROM article t1
    LATERAL VIEW explode(
      tokenize_ja(
        normalize_unicode(content, 'NFKC'),
        "normal",
        array("a","about","above","across","after","again",...),
        array("副詞","助詞","動詞","記号","名詞-数","副詞-一般","助詞-特殊","動詞-接尾",...),
        "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
      )
    ) t2 AS word
    WHERE length(word) >= 2
      AND word RLIKE '^[ぁ-んーァ-ヶー一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$' -- acceptable characters
      -- even if word consists of acceptable characters, reject "len-2 non-kanji word" and "len-3 hiragana-only word"
      AND word NOT RLIKE '^([^一-龠]{1,2}|[ぁ-んー]{1,3})$'
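The effect of the two RLIKE filters is easier to see on a handful of words. The sketch below assumes the character classes on the slide are hiragana, katakana, kanji, ASCII and full-width letters, and the punctuation marks ・！？; the helper name `keep_word` is hypothetical and the sample words are made up.

```python
import re

# Acceptable characters: hiragana, katakana, kanji, ASCII/full-width letters, and ・！？
ACCEPTABLE = re.compile(r"^[ぁ-んーァ-ヶー一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$")
# Rejected even when acceptable: non-kanji words of up to 2 characters,
# and hiragana-only words of up to 3 characters
TOO_SHORT = re.compile(r"^([^一-龠]{1,2}|[ぁ-んー]{1,3})$")

def keep_word(word):
    """Return True when a token would pass the WHERE clause of the tokenize query."""
    return (
        len(word) >= 2
        and ACCEPTABLE.match(word) is not None
        and TOO_SHORT.match(word) is None
    )

for word in ["Olympic", "of", "これ", "ところ", "東京", "データ", "CDP2017"]:
    print(word, keep_word(word))
```

Two-character kanji words such as 東京 survive, while short function-like tokens ("of", これ, ところ) and tokens with digits are dropped.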

NEologd-based custom Kuromoji dictionary
github.com/neologd/mecab-ipadic-neologd / github.com/atilika/kuromoji
  ‣ Filter useless words
  ‣ Kuromoji format for tokenize_ja()

Digdag workflow built by API: TF-IDF weighting and keyword extraction
takuti.me/note/tf-idf
    article_keyword AS (
      SELECT
        tf.article_id,
        tf.word,
        tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
      FROM tf
      JOIN df ON tf.word = df.word
      WHERE df.cnt >= 2
        AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)} -- ignore too common words
    )
    SELECT
      each_top_k(20, article_id, tfidf, article_id, word) AS (rank, score, article_id, word)
    FROM (
      SELECT article_id, word, tfidf
      FROM article_keyword
      CLUSTER BY article_id
    ) t
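A small in-memory version of this step may help when checking the document-frequency filter and the per-article top-k. The sketch below uses one common TF-IDF formulation, relative term frequency times (log10(N / df) + 1), which is an assumption and not necessarily identical to the `tfidf` function in the query; the function name and the toy articles are made up.

```python
import math
from collections import Counter

def tfidf_keywords(articles, k=20, min_df=2, max_df=None):
    """articles: {article_id: [word, ...]} -> {article_id: [(word, score), ...]}"""
    n_article = len(articles)
    if max_df is None:
        # Same upper bound as the query: ignore too common words
        max_df = max(100000, n_article / 2)

    # Document frequency: number of articles containing each word
    df = Counter()
    for words in articles.values():
        df.update(set(words))

    result = {}
    for article_id, words in articles.items():
        scored = []
        for word, count in Counter(words).items():
            if not (min_df <= df[word] <= max_df):
                continue
            tf = count / len(words)
            idf = math.log10(n_article / df[word]) + 1.0
            scored.append((word, tf * idf))
        # Keep the k highest-scoring words per article (ties broken alphabetically)
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[article_id] = scored[:k]
    return result

articles = {
    "a1": ["olympic", "game", "medal", "game"],
    "a2": ["olympic", "president", "law"],
    "a3": ["law", "game", "data"],
}
keywords = tfidf_keywords(articles, k=2)
print(keywords["a1"])
```

Words seen in only one article ("medal", "president", "data") are dropped by the `min_df` bound, mirroring `df.cnt >= 2` in the query.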

Aggregate over customers’ behaviors
STEP 1: keywords per article (e.g., Olympic, game, medal / president, citizen, rule, law / data, cloud, CDP)
STEP 2: sum(), l1_normalize(), each_top_k() over each customer’s visited articles give td_interest_words
Next: Map words into categories (JOIN) to get td_affinity_categories
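The sum(), l1_normalize(), each_top_k() chain named on this slide can be sketched as below. The order of the three steps and all data are assumptions for illustration; `interest_words` is a hypothetical helper, not a function from the talk.

```python
from collections import defaultdict

def interest_words(visits, article_keywords, k=3):
    """visits: [(customer_id, article_id)]; article_keywords: {article_id: {word: score}}"""
    totals = defaultdict(lambda: defaultdict(float))
    for customer_id, article_id in visits:
        for word, score in article_keywords.get(article_id, {}).items():
            totals[customer_id][word] += score  # sum() over visited articles

    result = {}
    for customer_id, scores in totals.items():
        norm = sum(abs(s) for s in scores.values())
        # l1_normalize(): scores of one customer add up to 1
        normalized = {w: s / norm for w, s in scores.items()} if norm else {}
        # each_top_k(): keep the k strongest words per customer
        ranked = sorted(normalized.items(), key=lambda pair: (-pair[1], pair[0]))
        result[customer_id] = ranked[:k]
    return result

article_keywords = {
    "a1": {"olympic": 0.5, "game": 0.25},
    "a2": {"olympic": 0.25, "law": 0.5},
}
visits = [("c1", "a1"), ("c1", "a2"), ("c2", "a2")]
words = interest_words(visits, article_keywords)
print(words["c1"])
```

L1 normalization makes customers with many page views comparable to customers with few, since every customer's word scores sum to 1 before ranking.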

Map words into IAB categories in relational schema
support.aerserv.com/hc/en-us/articles/207148516-List-of-IAB-Categories
td_interest_words:
  cdp_customer_id  word      score (TF-IDF)
  aaa-bbb-cccc     politics  0.3
  aaa-bbb-cccc     law       0.2
  …                …         …
  ddd-eee-ffff     math      0.7
  …                …         …
  xxx-yyy-zzzz     history   0.4
Mapping table:
  word      category                     probability
  anime     IAB1 Arts & Entertainment    0.4
  anime     IAB5 Education               0.1
  anime     IAB9 Hobbies & Interests     0.5
  politics  IAB11 Law, Gov’t & Politics  0.8
  …         …                            …
  coffee    IAB8 Food & Drink            0.9
JOIN td_interest_words with the mapping table on word

Join “inverted” mapping table
td_interest_words:
  cdp_customer_id  word      score (TF-IDF)
  aaa-bbb-cccc     politics  0.3
  aaa-bbb-cccc     law       0.2
  …                …         …
  ddd-eee-ffff     math      0.7
  …                …         …
  xxx-yyy-zzzz     history   0.4
Mapping table:
  word      category:probability
  anime     [ IAB1:0.4, IAB5:0.1, IAB9:0.5 ]
  politics  [ IAB11:0.8, … ]
  …         …
  coffee    [ IAB8:0.9, … ]
SELECT sum(score * probability) GROUP BY cdp_customer_id, category
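The join and the `sum(score * probability)` aggregation amount to a weighted vote per customer and category. A minimal sketch follows; the rows for "politics" and "anime" follow the slide, while the category probability for "law" is a made-up value and `affinity_scores` is a hypothetical name.

```python
from collections import defaultdict

def affinity_scores(interest_words, word_to_categories):
    """interest_words: [(cdp_customer_id, word, score)];
    word_to_categories: {word: {category: probability}} (the "inverted" mapping table)."""
    scores = defaultdict(float)
    for customer_id, word, score in interest_words:
        # Words missing from the mapping table contribute nothing (inner join)
        for category, probability in word_to_categories.get(word, {}).items():
            scores[(customer_id, category)] += score * probability
    return dict(scores)

interest = [
    ("aaa-bbb-cccc", "politics", 0.3),
    ("aaa-bbb-cccc", "law", 0.2),
    ("ddd-eee-ffff", "math", 0.7),
]
mapping = {
    "anime": {"IAB1": 0.4, "IAB5": 0.1, "IAB9": 0.5},
    "politics": {"IAB11": 0.8},
    "law": {"IAB11": 0.9},  # hypothetical probability, not shown on the slide
}
category_scores = affinity_scores(interest, mapping)
print(category_scores)
```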

Create mapping table from Wikipedia dump
Corpus: <word, score> pairs of articles
Resulting mapping table:
  word      category                     probability
  anime     IAB1 Arts & Entertainment    0.4
  anime     IAB5 Education               0.1
  anime     IAB9 Hobbies & Interests     0.5
  politics  IAB11 Law, Gov’t & Politics  0.8
  …         …                            …
  coffee    IAB8 Food & Drink            0.9

Find related articles from root category
github.com/takuti/fastcat
  IAB category                   Wikipedia category (English)  Wikipedia category (Japanese)
  IAB1 Arts & Entertainment      Entertainment                 娯楽
  IAB2 Automotive                Automobiles                   自動車
  …                              …                             …
  IAB23 Religion & Spirituality  Religion                      宗教

From the corpus of <word, score> pairs, grouped by IAB category through the Wikipedia category mapping:
1) Aggregate word scores per category
2) Normalize them per word
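The two steps, aggregating word scores per category and then normalizing per word, can be sketched as below. The input scores are invented so that the output reproduces the "anime" row of the mapping table (0.4 / 0.1 / 0.5); `build_mapping_table` is a hypothetical helper name.

```python
from collections import defaultdict

def build_mapping_table(category_articles):
    """category_articles: {category: [{word: score, ...}, ...]}
    Returns {word: {category: probability}} with probabilities summing to 1 per word."""
    totals = defaultdict(lambda: defaultdict(float))
    for category, articles in category_articles.items():
        for article in articles:
            for word, score in article.items():
                totals[word][category] += score  # 1) aggregate word scores per category

    mapping = {}
    for word, per_category in totals.items():
        total = sum(per_category.values())
        if total > 0:
            # 2) normalize per word
            mapping[word] = {c: s / total for c, s in per_category.items()}
    return mapping

# Made-up <word, score> pairs of Wikipedia articles under each IAB category
corpus = {
    "IAB1": [{"anime": 3.0, "movie": 2.0}, {"anime": 1.0}],
    "IAB5": [{"anime": 1.0, "school": 4.0}],
    "IAB9": [{"anime": 5.0}],
}
table = build_mapping_table(corpus)
print(table["anime"])
```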

Put sub-categories in parallel, and filter out low-confidence ones
  cdp_customer_id  td_affinity_main_categories  td_affinity_sub_categories
  aaa-bbb-cccc     [ IAB11, IAB23 ]             [ IAB2-4, IAB11-1, IAB12-3 ]
  ddd-eee-ffff     [ IAB9, IAB15 ]              [ IAB8-3, IAB20-8 ]
  …                …                            …
  xxx-yyy-zzzz     [ IAB14 ]                    [ IAB14-1, IAB14-3, IAB19-7 ]

New challenge: Computationally heavy…

2. Predictive customer scoring
UI-assisted binary classification (logistic regression)

“segment” = subset of audience customers Audience Segment

Create a segment corresponding to positive samples
Segment within the audience: already “converted” customers

Select features and their preprocessing rules
“Guess features”: suggest useful features

RUN…!
  ‣ Distribution of predictive scores
  ‣ Classify customers by predictive scores

Accuracy and metrics

Challenges
  ‣ Guessing feature representation along with detecting “categorical” and “quantitative” columns to apply min-max normalization
  ‣ Calibrating number of positive/negative samples for differently sized data
  ‣ Providing enough information to refine features and prevent “leakage” even for non-ML experts

Guess feature representation API
For sampled values:
  ‣ Column name, type
  ‣ Cardinality
  ‣ Mean, variance, percentile
  ‣ Regular expression
  ‣ …
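As a rough illustration of how sampled values could drive such a guess, the sketch below looks only at value type and cardinality and then applies min-max normalization to a quantitative column. The rules, the `max_categories` threshold, and both function names are assumptions for this example; the actual API described in the talk also uses statistics and regular expressions that are not reproduced here.

```python
def guess_feature_type(values, max_categories=10):
    """Guess "categorical" or "quantitative" from sampled column values.
    max_categories is a made-up default, not a value from the talk."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if not is_numeric:
        return "categorical"
    # Integer-valued columns with few distinct values behave like category codes
    all_integers = all(float(v).is_integer() for v in non_null)
    if all_integers and len(set(non_null)) <= max_categories:
        return "categorical"
    return "quantitative"

def min_max_normalize(values):
    """Rescale quantitative values into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

print(guess_feature_type(["Chrome", "Safari", "Chrome"]))
print(guess_feature_type([1.5, 0.2, 3.7, 2.2]))
print(guess_feature_type([0, 1, 1, 0, 1]))
print(min_max_normalize([24, 30, 36]))
```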

Guess: Test on Criteo data from Kaggle competition

Guess: Correctly detect categorical and quantitative columns

How I integrated ML-related knowledge with API code: Write everything
in comments, documents and commits

Calibrating # of samples: Over-sample minor class
takuti.me/note/adjusting-for-oversampling-and-undersampling
    WITH label2cnt AS (
      SELECT map_agg(label, cnt) AS kv
      FROM (
        SELECT label, CAST(COUNT(1) AS double) AS cnt
        FROM cdp_tmp_${model_table_name}_samples_${scope}
        GROUP BY label
      ) t
    )
    SELECT
      -- If % of minor samples is very small (less than 0.1%),
      -- amplify them so that at least 1% of samples are occupied by the minors.
      IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
         cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
      IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
         cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
      -- Amplify very small data regardless of its label, because tiny dataset
      -- possibly shows poor accuracy.
      IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
    FROM label2cnt
[Figure: negative samples vs. positive samples]
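The three rates computed by this query can be checked by hand with a direct Python transcription. The function name and the sample counts below are made up, and the sketch assumes both classes are non-empty (the query would otherwise divide by zero as well).

```python
import math

def oversample_rates(n_pos, n_neg, num_samples):
    """Return (pos_oversample_rate, neg_oversample_rate, all_oversample_rate)."""
    pos_ratio = n_pos / n_neg  # kv[1] / kv[0]
    neg_ratio = n_neg / n_pos  # kv[0] / kv[1]
    # Amplify a class only when it is rarer than 0.1% relative to the other one
    pos_rate = math.floor(0.01 / pos_ratio) if pos_ratio < 0.001 else 1
    neg_rate = math.floor(0.01 / neg_ratio) if neg_ratio < 0.001 else 1
    # Amplify very small data regardless of label
    all_rate = 1 if num_samples > 100000 else 10
    return pos_rate, neg_rate, all_rate

# 300 positives against 1,000,000 negatives: positives are repeated 33 times
print(oversample_rates(300, 1_000_000, 1_000_300))
# A tiny balanced dataset: only the overall amplification applies
print(oversample_rates(50, 50, 100))
```

With 300 positives repeated 33 times, the positive class grows to 9,900 samples, which is close to 1% of the negatives, matching the intent stated in the SQL comments.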

To refine predictive model and prevent leakage: Show evaluation results and feature importance
  ‣ Samples from Audience and Segment are split into 80% (Train) and 20% (Test)
  ‣ Model for validation: Train on the 80%, Predict on the 20%, and report Accuracy (AUC, LogLoss)
  ‣ Model for production
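For readers less familiar with the two metrics reported on the held-out 20%, here is a minimal pure-Python sketch of LogLoss and AUC. It uses the standard textbook definitions rather than the evaluation queries from the talk, and the labels and scores are made up.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels; smaller is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative
    (ties count as 0.5). Quadratic in sample size, so only suitable for small data."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_test = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(auc(y_test, scores))
print(round(log_loss(y_test, scores), 4))
```

An AUC far above what the features plausibly justify is a typical symptom of leakage, which is why showing these numbers next to feature importance helps non-experts spot a suspicious column.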

Overview: How predictive customer scoring works
GUESS: Automatically select and transform customer attributes (guess how to cleanse data)
1ST PASS: Treasure CDP does everything for you
FROM 2ND PASS: You can make your predictive model better with ML experts
SCORE CUSTOMERS: Unlikely / Marginally / Possibly / Likely (example scores: 12, 20, 3, 34, 40, 72, 58, 82, 93, 99, 78)
SYNDICATE
Flow: Audience and Segment (already “converted” customers), Build predictive model, Evaluation (sufficient accuracy?), then score the Audience
Example customer record: td_client_id XXX-YYY-ZZZZZ, td_ip 192.168.0.1, td_referrer http://google.com/…, spend_time 1.5, …, td_interest_words Olympic, baseball, game, td_affinity_categories Sports, Entertainment
Example transformed attributes: Japan, google.com, 1.5
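The four score labels on the overview slide (Unlikely, Marginally, Possibly, Likely) suggest a simple bucketing of predictive scores. The sketch below assumes scores on a 0 to 100 scale and uses cut points of 25, 50, and 75, which are hypothetical; the talk does not state the actual thresholds.

```python
from bisect import bisect_right

THRESHOLDS = [25, 50, 75]  # hypothetical cut points, not taken from the talk
LABELS = ["Unlikely", "Marginally", "Possibly", "Likely"]

def classify(score):
    """Map a predictive score (0 to 100) to one of the four labels."""
    return LABELS[bisect_right(THRESHOLDS, score)]

for score in [12, 20, 3, 34, 40, 72, 58, 82, 93, 99, 78]:
    print(score, classify(score))
```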

How an enterprise-grade ML/NLP solution should be
  ‣ Scalable: Digdag, Hivemall, Presto, Hadoop, Embulk, …
  ‣ Accurate, with no crucial mistakes and trivial false positives
  ‣ Interpretable in terms of both algorithm and UI design for all users

MVP = classic algorithms and heuristics, because there is no free lunch
ajustchicago.org/2016/01/aint-no-free-lunch

