NLP_D2C (final version)

NLP with Scala at D2C
Masaki Rikitoku
@Big Data Scala By the Bay, August 18, 2015

About Me
Masaki Rikitoku
•  NLP/Data engineer @ D2C
•  NLP/Machine Learning
   ➢ Japanese morphological analysis
   ➢ Text classification
•  Big Data Processing
   ➢ In-memory aggregation engine for BI
   ➢ Big Data processing for a global social game

About Our Company – D2C
•  D2C is a mobile online advertising company
•  Subsidiary company of NTT DOCOMO INC. (mobile carrier)
•  Our services are
   ➢ D2C Listing/Display Ads
   ➢ D2C PLA
   ➢ Mail Push Ads
   ➢ Etc.
http://smt.docomo.ne.jp/

Overview
1.  NLP web services for Listing Ads powered by the Play framework
2.  Mining search query logs by Spark
   ➢ Extracting/Categorizing frequent keywords

NLP web services for Listing Ads by Play

D2C Listing Ads
[Figure: a search for "Sneaker" showing D2C Listing Ads]
•  Process time: < 200 ms
•  PV/day: 28 million
•  UU/week: 10 million

Which WAF (web application framework) is suitable for us?
We chose the Play2 Scala framework!

NLP web services for D2C Listing ads
[Diagram components: Front server, LB, NLP web services (NLP web APIs 1 … 6), Ad keywords db, Listing Ads servers, Searcher, Ads db, Visualizer. Flow labels: Query, Keyword set, Ads. Total: 15 servers]
Production NLP web services are composed of 6 Play2 servers!

Performance
•  Play2 Scala has good performance.
•  No downtime since this April! Stable!
Over 3500 tps. Avg. 10 ms/query

Word splitting by Part-Of-Speech Tagger
すももももものうち (SUMOMOMOMOMONOUCHI): "Plums are part of peaches"
Japanese text has no spaces between words! We need to split text into words.
POS Tagger output:
SUMOMO (plum)    noun
MO               particle
MOMO (peach)     noun
NO               particle
UCHI             noun
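
The splitting step can be sketched as a tiny lattice search. This is only an illustrative toy in Python, not D2C's Java module: the dictionary entries, the part-of-speech labels on them, and the connection costs below are all hypothetical values chosen so that the example sentence splits the way the slide shows. The sketch shows why a tagger needs more than greedy longest-match: a greedy split would pick すもも|もも|も, while a cost on noun-noun and particle-particle transitions recovers すもも|も|もも.

```python
# Toy dictionary (hypothetical): surface form -> part of speech.
DICTIONARY = {
    "すもも": "noun",      # plum
    "もも": "noun",        # peach
    "うち": "noun",
    "も": "particle",
    "の": "particle",
}

# Hypothetical connection costs between adjacent parts of speech (lower is better).
CONNECTION_COST = {
    ("BOS", "noun"): 0, ("BOS", "particle"): 5,
    ("noun", "particle"): 0, ("particle", "noun"): 0,
    ("noun", "noun"): 3, ("particle", "particle"): 3,
}

def segment(text):
    """Return the lowest-cost [(word, pos), ...] split of text, or [] if none exists."""
    # best[(end, pos)] = (cost, path): cheapest path ending at offset `end` with POS `pos`.
    best = {(0, "BOS"): (0, [])}
    for start in range(len(text)):
        # Snapshot the states that end exactly at `start` before extending them.
        states = [(pos, val) for (end, pos), val in best.items() if end == start]
        for prev_pos, (cost, path) in states:
            for word, pos in DICTIONARY.items():
                if text.startswith(word, start):
                    new_cost = cost + CONNECTION_COST[(prev_pos, pos)]
                    key = (start + len(word), pos)
                    if key not in best or new_cost < best[key][0]:
                        best[key] = (new_cost, path + [(word, pos)])
    finals = [val for (end, _), val in best.items() if end == len(text)]
    return min(finals, key=lambda v: v[0])[1] if finals else []

print(segment("すももももものうち"))
```

A production tagger would use a full dictionary with learned word and connection costs; the dynamic-programming shape stays the same.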

Japanese NLP module for listing ads
•  NLP for ads is 2-step processing
•  Keyword detection is based on the Ad keywords db.
•  NLP module is a pure Java implementation
NLP module flow:
Query: SUMOMOMOMOMONOUCHI
  → POS Tagger → |SUMOMO|MO|MOMO|NO|UCHI|
  → Keyword detection (keyword matching against the Ad keywords db) → KW set: {SUMOMO, MOMO}
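
The second step, keyword detection, can be sketched as a lookup of each token against the ad keyword set. This is a minimal Python illustration, not the Java module itself; `AD_KEYWORDS` is a hypothetical in-memory stand-in for the Ad keywords db, and `detect_keywords` is a name invented for this sketch.

```python
# Hypothetical stand-in for the Ad keywords db: a plain in-memory set.
AD_KEYWORDS = {"SUMOMO", "MOMO"}

def detect_keywords(tokens, ad_keywords=AD_KEYWORDS):
    """Step 2: keep tokens found in the ad keyword set, in order, without duplicates."""
    hits = []
    for token in tokens:
        if token in ad_keywords and token not in hits:
            hits.append(token)
    return hits

tokens = ["SUMOMO", "MO", "MOMO", "NO", "UCHI"]  # step 1 output, as on the slide
print(detect_keywords(tokens))  # ['SUMOMO', 'MOMO']
```

Keeping the two steps separate means the tagger does not need to know about ad inventory, and the keyword set can change without touching the tagger.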

Summary
•  Play2 Scala enables D2C's service launch
   ➢ Only 6 Play2 servers!
   ➢ Covers the entire DOCOMO portal for mobile
•  Only one person developed it
   ➢ Our NLP Java module was embedded in Play2 Scala smoothly.

Search Query Log Mining for Online Ads by Spark

Search Query Log Mining for Ads
•  Large search query logs covering over 2 years.
•  The number of queries increases day after day.
•  The logs represent individual interests and likes.
Extracting valuable information
Query logs are our treasure!

Mining Frequent Keywords with freq. and UU (Unique User)
Search query logs = List(query, date, transaction_id, user_id)
  → Log mining →
Keyword set with freq. and UU = List(keyword, date, freq., user list)
Purpose:
•  Obtain user clusters by frequent keywords
•  Optimize Ad keywords

Mining Frequent Keywords with freq. and UU
Search query logs
query          user_id
------------------------
tennis shoes   user1
tennis ball    user2
shoes          user3
…
  → Query log mining →
Frequent keywords with UU
keyword   freq.   user_list
------------------------------
tennis    2       {user1, user2}
shoes     2       {user1, user3}
…
Estimate user clusters by frequent keywords with UU
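
The table transformation above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the production job: `mine_keywords` is a hypothetical name, the date and transaction_id columns are left out, and a whitespace split stands in for the POS tagger.

```python
from collections import defaultdict

def mine_keywords(logs):
    """logs: iterable of (query, user_id). Returns {keyword: (freq, set_of_users)}."""
    freq = defaultdict(int)
    users = defaultdict(set)
    for query, user_id in logs:
        for keyword in query.split():  # whitespace split stands in for the POS tagger
            freq[keyword] += 1
            users[keyword].add(user_id)
    return {k: (freq[k], users[k]) for k in freq}

logs = [("tennis shoes", "user1"), ("tennis ball", "user2"), ("shoes", "user3")]
for keyword, (count, user_set) in mine_keywords(logs).items():
    # UU (unique users) is the size of the user set.
    print(keyword, count, sorted(user_set), len(user_set))
```

Keeping the user set rather than only its size is what makes the later user clustering possible: two keywords can be compared by how much their user sets overlap.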

Extracting Frequent Keywords by Spark RDD 1/3
Queries:            tennis ball | tennis shoes | shoes
segment & expand:   tennis, ball, tennis_ball | tennis, shoes, tennis_shoes | shoes
mapping:            (tennis, 1) (ball, 1) (tennis_ball, 1) | (tennis, 1) (shoes, 1) (tennis_shoes, 1) | (shoes, 1)
shuffling:          (tennis, 1) (tennis, 1) | (tennis_ball, 1) | (ball, 1) | (shoes, 1) (shoes, 1) | (tennis_shoes, 1)
reducing & result:  (tennis, 2) (shoes, 2) (ball, 1) (tennis_ball, 1) (tennis_shoes, 1)
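
The four stages above can be mimicked on a single machine with plain Python collections. This sketch does not use Spark; the comments only name the RDD operations (`flatMap`, `map`, `reduceByKey`) that each step corresponds to. The expansion rule, adding one underscore-joined compound for a multi-word query, is an assumption read off the slide's two-word examples; how longer queries are expanded is not stated.

```python
from collections import Counter
from itertools import chain

def segment_and_expand(query):
    """Split a query into words and, for multi-word queries, add the joined compound."""
    words = query.split()
    return words + (["_".join(words)] if len(words) > 1 else [])

def frequent_keywords(queries):
    # segment & expand: one query yields several keywords (flatMap on an RDD)
    expanded = chain.from_iterable(segment_and_expand(q) for q in queries)
    # mapping: each keyword becomes a (keyword, 1) pair (map on an RDD)
    pairs = ((keyword, 1) for keyword in expanded)
    # shuffling and reducing: sum the ones per keyword (reduceByKey on an RDD)
    counts = Counter()
    for keyword, one in pairs:
        counts[keyword] += one
    return dict(counts)

print(frequent_keywords(["tennis ball", "tennis shoes", "shoes"]))
```

On a cluster the same summation is associative and commutative, which is what lets `reduceByKey` combine partial counts on each partition before the shuffle.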

Extracting Frequent Keywords by Spark RDD 2/3
[Code screenshots: RDD for search queries; frequent keyword mining process]

Extracting Frequent Keywords by Spark 3/3
[Code screenshot of the mining process, annotated: distributed word splitting; aggregation by reduceByKey]

Categorize the Frequent Keywords
Frequent keyword set:
Soccer-shoes, Baseball-shoes, Guam-Travel, Kyushu-Tour, Color-Skirt-Summer, Reversible-Tulle-Skirt, Cancer-metastasis, Hoge-Hospital, …
How do we categorize? Categorize the keywords automatically:
Sports:   Soccer-shoes, Baseball-shoes, ..
Travel:   Guam-Travel, Kyushu-Tour, ..
Fashion:  Color-Skirt-Summer, Reversible-Tulle-Skirt, ..
Health:   Cancer-metastasis, Hoge-Hospital, ..

Spark Classification algorithms
Classification (supervised)
1.  Naive Bayes (NB)
2.  Support Vector Machine
3.  Logistic Regression
4.  Decision Tree
5.  Random Forest
6.  Gradient Boosted Tree
Clustering (unsupervised)
1.  K-means
2.  Gaussian Mixture
There are no semi-supervised learning algorithms yet. So we made one!

Naive Bayes EM (NBEM) on Spark
•  Created a semi-supervised extension of Naive Bayes
•  Training is based on the EM algorithm
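
The idea of training Naive Bayes with EM can be sketched on one machine as follows. This is not the Spark implementation from the talk; it is a minimal pure-Python sketch of a common formulation, assuming a multinomial model with add-one smoothing, hard labels for labeled documents, and soft labels for unlabeled ones. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def m_step(docs, resp, n_classes, vocab, alpha=1.0):
    """Fit smoothed log-priors and per-class word log-probabilities from soft labels."""
    mass = [sum(r[c] for r in resp) + alpha for c in range(n_classes)]
    log_prior = [math.log(m / sum(mass)) for m in mass]
    log_word = []
    for c in range(n_classes):
        counts = Counter()
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[c]  # fractional count weighted by class responsibility
        denom = sum(counts.values()) + alpha * len(vocab)
        log_word.append({w: math.log((counts[w] + alpha) / denom) for w in vocab})
    return log_prior, log_word

def e_step(doc, model):
    """Posterior class probabilities for one document (unknown words are ignored)."""
    log_prior, log_word = model
    scores = [lp + sum(lw[w] for w in doc if w in lw)
              for lp, lw in zip(log_prior, log_word)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    return [e / sum(exps) for e in exps]

def train_nbem(labeled, unlabeled, n_classes, iterations=10):
    lab_docs = [d for d, _ in labeled]
    lab_resp = [[1.0 if c == y else 0.0 for c in range(n_classes)] for _, y in labeled]
    vocab = {w for d in lab_docs + unlabeled for w in d}
    model = m_step(lab_docs, lab_resp, n_classes, vocab)  # start from plain NB
    for _ in range(iterations):
        unl_resp = [e_step(d, model) for d in unlabeled]  # E-step: soft-label unlabeled docs
        model = m_step(lab_docs + unlabeled, lab_resp + unl_resp, n_classes, vocab)  # M-step
    return model

def predict(doc, model):
    post = e_step(doc, model)
    return post.index(max(post))

# Toy data: class 0 = Sports, class 1 = Travel.
labeled = [(["soccer", "shoes"], 0), (["guam", "travel"], 1)]
unlabeled = [["soccer", "ball"], ["baseball", "shoes"],
             ["kyushu", "tour", "travel"], ["guam", "tour"]]
model = train_nbem(labeled, unlabeled, n_classes=2)
print(predict(["tour"], model), predict(["ball"], model))
```

In this toy run the words "tour" and "ball" never occur in a labeled document, so plain NB has no evidence for them; the unlabeled documents that share a word with a labeled one carry the class signal across.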

NBEM vs. NB
[Figure: Accuracy for the news20 group data. x-axis: train data ratio (0.001 to 0.1); y-axis: Accuracy (0.4 to 1); series: NaiveBayes, NaiveBayesEM]
NBEM outperforms NB at a low train data ratio

Summary
•  Spark RDD was used for the search query log mining
   ➢ Obtained a frequent keyword set for user-interest targeting
•  A semi-supervised extension of the Spark Naive Bayes classifier was proposed.
   ➢ NBEM outperforms NB when training data is small
   ➢ NBEM is used for categorizing keywords.
