Forest, Logistic Regression, …
  ‣ Binary: “Likely to buy our product?” “Is this email spam?”
  ‣ Multi-class: “Will it be sunny, cloudy, or rainy?” “Which group does this user belong to?”
Regression: Random Forest, AdaDelta, Factorization Machines, …
  ‣ “Tomorrow’s temperature” “Estimated product sales for next month” “This user’s annual income”
Recommendation: Matrix Factorization, Factorization Machines, …
  ‣ “Customers who bought this also bought …”
Anomaly Detection: Local Outlier Detection, ChangeFinder, …
  ‣ “Suddenly increased # of visitors on our web site”
ML-related capability (2/2)
English word, …
  ‣ select tokenize('Hello, world!') returns ["Hello", "world"]
  ‣ select singularize('apples') returns apple
Clustering: Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
  ‣ “Which articles are similar to this one?”
Geospatial Functions
  ‣ “I'd love to see a map around a specific pair of latitude and longitude”
of churn
2. Aggressively reach out to “likely to churn” customers
https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
[Figure: OISIX’s data. Labels: Source, Web, Mobile, Customer attr., Behavior on web, Complaint log, Signed-up services, Actions (direct), Actions (indirect), Point, Call, Guide to success, UI]
& Proposal, Evaluate
[Figure: workflow for building a machine learning model, where every step is a separate query. Historical data; Cleanse data (Extract, Filter, Interpolate, Normalize, …; “Which columns should we use?”); Train data: Get features, Train, …; Test data: Get features, Predict, …; Accuracy; “Sufficient accuracy?”; Ship to production]
that can support coordinated programs across multiple channels.
Treasure CDP
  ‣ Data Collection
  ‣ ID Unification, Segmentation, Syndication
  ‣ Workflow, Query, Reporting, Data Warehouse, Machine Learning
  ‣ Campaign Execution
Store customers’ browsing log from TD JavaScript SDK
  td_client_id: XXX-YYY-ZZZZZ
  td_title: Today’s news
  td_description: The Olympic game has been started …
  td_host: www.td-news.com
  td_path: /2017/10/01/olympic
STEP 1: Extract keywords from each article
  Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
STEP 2: Aggregate customers’ visits as td_interest_words and td_affinity_categories
  Categories: Society, Science, Food, Culture
  td_client_id: XXX-YYY-ZZZZZ
  td_interest_words: Olympic, baseball, game
  td_affinity_categories: Sports, Entertainment
Create audience
data
Unsupervised customer categorization with fewer false positives
Tokenizing new words
Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting
AS article_id,
  concat(
    -- remove site name which commonly occurs at the foot of page title
    regexp_replace(
      -- "(xxx)" is generally meaningless, accessory part of page title
      regexp_replace(td_title, '[(（].+?[)）]', ''),
      '[|-] .+$', ''
    ),
    ' ',
    coalesce(td_description, '')
  ) AS content
FROM ${behavior}
WHERE td_title IS NOT NULL
  AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))
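The title cleansing in the query above can be tried outside the database. The following is a minimal Python sketch of the same two regular-expression passes, assuming the first character class covers both half-width and full-width parentheses; the function names `clean_title` and `build_content` are hypothetical and do not come from the slides.

```python
import re
from typing import Optional


def clean_title(title: str) -> str:
    """Mirror the two nested regexp_replace calls from the query."""
    # Pass 1: drop parenthesized parts such as "(xxx)" or "（xxx）".
    title = re.sub(r"[(（].+?[)）]", "", title)
    # Pass 2: drop a trailing site name that follows "| " or "- ".
    return re.sub(r"[|-] .+$", "", title)


def build_content(title: str, description: Optional[str]) -> str:
    """Concatenate the cleaned title and the description (empty if missing)."""
    return clean_title(title) + " " + (description or "")
```

Neither pass trims whitespace, so a cleaned title can keep trailing spaces; the later tokenization step is what makes that harmless.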
FROM article t1
LATERAL VIEW explode(
  tokenize_ja(
    normalize_unicode(content, 'NFKC'),
    "normal",
    array("a", "about", "above", "across", "after", "again", ...),
    array("副詞", "助詞", "動詞", "記号", "名詞-", "副詞-一般", "助詞-特殊", "動詞-尾", ...),
    "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
  )
) t2 AS word
WHERE length(word) >= 2
  -- acceptable characters
  AND word RLIKE '^[ぁ-んーァ-ヶー一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$'
  -- even if word consists of acceptable characters, reject "len-2 non-kanji word" and "len-3 hiragana-only word"
  AND word NOT RLIKE '^([^一-龠]{1,2}|[ぁ-んー]{1,3})$'
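The WHERE clause above acts as a word filter applied after tokenization. As a rough illustration, the same three conditions can be written in Python, assuming the character classes are hiragana, katakana, kanji, ASCII and full-width Latin letters, plus a few punctuation marks. The name `keep_word` is hypothetical, and the tokenizer itself (`tokenize_ja`) is not reproduced here.

```python
import re

# Assumed reading of the "acceptable characters" class in the query.
ACCEPTABLE = re.compile(r"^[ぁ-んーァ-ヶ一-龠a-zA-Zａ-ｚＡ-Ｚ・！？]+$")
# Words of up to 2 non-kanji characters, or up to 3 hiragana characters.
TOO_SHORT = re.compile(r"^([^一-龠]{1,2}|[ぁ-んー]{1,3})$")


def keep_word(word: str) -> bool:
    """Return True if the token passes all three conditions of the WHERE clause."""
    return (
        len(word) >= 2
        and ACCEPTABLE.match(word) is not None
        and TOO_SHORT.match(word) is None
    )
```

With this reading, a short Latin token such as "US" is rejected while a two-character kanji word is kept, which matches the intent stated in the SQL comment.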
takuti.me/note/tf-idf

article_keyword AS (
  SELECT
    tf.article_id,
    tf.word,
    tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
  FROM tf
  JOIN df ON tf.word = df.word
  WHERE df.cnt >= 2
    -- ignore too common words
    AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)}
)
SELECT
  each_top_k(20, article_id, tfidf, article_id, word) AS (rank, score, article_id, word)
FROM (
  SELECT article_id, word, tfidf
  FROM article_keyword
  CLUSTER BY article_id
) t
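To make the keyword-ranking step concrete, here is a small Python sketch of TF-IDF scoring followed by a per-article top-k. The slide does not show how `tfidf()` is defined, so this assumes one common form, tf × (log10(N / df) + 1), with tf taken as relative frequency within the article. The function name `top_keywords` and its parameters are hypothetical.

```python
import math
from collections import Counter


def top_keywords(articles, k=20, min_df=2, max_df=None):
    """articles: dict of article_id -> list of words. Returns top-k (word, score) per article."""
    n_docs = len(articles)
    counts = {aid: Counter(words) for aid, words in articles.items()}

    # Document frequency: in how many articles each word appears.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    result = {}
    for aid, c in counts.items():
        total = sum(c.values())
        scored = []
        for word, cnt in c.items():
            # Skip words that are too rare or (optionally) too common.
            if df[word] < min_df or (max_df is not None and df[word] > max_df):
                continue
            idf = math.log10(n_docs / df[word]) + 1.0  # assumed IDF form
            scored.append((word, (cnt / total) * idf))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[aid] = scored[:k]
    return result
```

The `min_df` filter plays the role of `df.cnt >= 2` in the query, and `max_df` stands in for the upper bound on document frequency.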
Keywords per article: game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
Categories: Science, Food, Culture
sum(), l1_normalize(), each_top_k(): td_interest_words
Next: Map words into categories
JOIN: td_affinity_categories
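The aggregation named on this slide (sum, L1 normalization, top-k) can be sketched in a few lines of Python. This is an illustration under assumptions, not the production query: it assumes each visited article contributes its keyword scores once per visit, and the names `interest_words`, `visits`, and `article_keywords` are hypothetical.

```python
from collections import defaultdict


def interest_words(visits, article_keywords, k=3):
    """visits: list of article ids one customer viewed.
    article_keywords: dict of article_id -> list of (word, score)."""
    totals = defaultdict(float)
    for article_id in visits:
        for word, score in article_keywords.get(article_id, []):
            totals[word] += score  # corresponds to sum()

    norm = sum(abs(v) for v in totals.values())
    if norm == 0:
        return []

    # Corresponds to l1_normalize(): weights sum to 1 per customer.
    weights = {word: value / norm for word, value in totals.items()}
    # Corresponds to each_top_k(): keep the k heaviest words.
    return sorted(weights.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
```

Normalizing per customer keeps heavy and light visitors comparable, since only the relative weight of each word survives.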
columns to apply min-max normalization
Calibrating number of positive/negative samples for differently sized data
Providing enough information to refine features and prevent “leakage”, even for non-ML experts
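For readers unfamiliar with min-max normalization, the transformation rescales a numeric column to the range [0, 1]. A minimal Python sketch follows; the function name is hypothetical, and mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```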
AS (
  SELECT map_agg(label, cnt) AS kv
  FROM (
    SELECT label, CAST(COUNT(1) AS double) AS cnt
    FROM cdp_tmp_${model_table_name}_samples_${scope}
    GROUP BY label
  ) t
)
SELECT
  -- If % of minor samples is very small (less than 0.1%),
  -- amplify them so that at least 1% of samples are occupied by the minors.
  IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
     cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
  IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
     cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
  -- Amplify very small data regardless of its label, because tiny dataset
  -- possibly shows poor accuracy.
  IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
FROM label2cnt

[Figure labels: Negative samples, Positive samples]
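The oversampling rule in the query above is easy to check with concrete numbers. The Python sketch below mirrors the three IF expressions, assuming that label 0 is negative, label 1 is positive, and that `num_samples` equals the sum of both counts. The function name `oversample_rates` is hypothetical.

```python
import math


def oversample_rates(n_neg, n_pos):
    """Return (pos_oversample_rate, neg_oversample_rate, all_oversample_rate)."""
    if n_neg <= 0 or n_pos <= 0:
        raise ValueError("both classes need at least one sample")

    def rate(minor, major):
        ratio = minor / major
        # Amplify only when the minority is below 0.1% of the other class.
        return int(math.floor(0.01 / ratio)) if ratio < 0.001 else 1

    pos_rate = rate(n_pos, n_neg)
    neg_rate = rate(n_neg, n_pos)
    # Small datasets are amplified 10x regardless of label (assumed total).
    all_rate = 1 if (n_neg + n_pos) > 100000 else 10
    return pos_rate, neg_rate, all_rate
```

For example, with 1,000,000 negatives and 300 positives the positive ratio is 0.03%, so positives are repeated 33 times, which brings them to roughly 1% of the negatives.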
Overview: How predictive customer scoring works
GUESS: Automatically select and transform customer attributes
  ‣ Example attributes: td_interest_words (Olympic, baseball, game), td_affinity_categories (Sports, Entertainment), Japan, google.com, 1.5
  ‣ Guess how to cleanse data
1ST PASS: Treasure CDP does everything for you
  ‣ Build predictive model FROM Audience Segment (already “converted” customers)
  ‣ Evaluation: accuracy. Sufficient?
2ND PASS: You can make your predictive model better with ML experts
SCORE CUSTOMERS
  ‣ Audience: Unlikely, Marginally, Possibly, Likely
  ‣ Example scores: 12, 20, 3, 34, 40, 72, 58, 82, 93, 99, 78
SYNDICATE