
NLP_D2C (final version)


Presentation slides for Big Data Scala by the Bay

Masaki Rikitoku

October 24, 2015


Transcript

  1. NLP with Scala at D2C
     Masaki Rikitoku
     @Big Data Scala By the Bay, August 18, 2015
  2. About Me: Masaki Rikitoku
     • NLP/Data engineer @ D2C
     • NLP/Machine Learning
       - Japanese morphological analysis
       - Text classification
     • Big Data Processing
       - In-memory aggregation engine for BI
       - Big data processing for a global social game
  3. About Our Company: D2C
     • D2C is a mobile online advertising company
     • Subsidiary of NTT DOCOMO INC. (mobile carrier)
     • Our services are
       - D2C Listing/Display Ads
       - D2C PLA
       - Mail Push Ads
       - Etc.
     http://smt.docomo.ne.jp/
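  4. Overview
     1. NLP web services for Listing Ads powered by the [Play] framework
     2. Mining search query logs with [Spark]
        - Extracting/categorizing frequent keywords
     (Bracketed names replace logos lost in extraction; they are inferred from the two summary slides.)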
  5. D2C Listing Ads (example search: "Sneaker")
     • Process time: <200 ms
     • PV/day: 28 million
     • UU/week: 10 million
  6. NLP web services for D2C Listing Ads
     [Architecture diagram. Components: Front server, Visualizer, Searcher, Listing Ads servers, Ads db, LB, NLP web APIs 1 … 6, Ad keywords db. Flows labeled: Query, Ads, Keyword set. Total: 15 servers.]
     Production NLP web services are composed of 6 Play2 servers!
  7. Performance
     • Play2 Scala has good performance.
     • No downtime since this April! Stable!
     Over 3500 tps. Avg. 10 ms/query.
  8. Word splitting by Part-of-Speech (POS) Tagger
     すももももものうち (SUMOMOMOMOMONOUCHI): "Plums are part of peaches"
     Japanese text is not split by spaces between words!
     We need to split text into words.
     POS Tagger output:
       SUMOMO (plum)    noun
       MO               particle
       MOMO (peach)     noun
       NO               particle
       UCHI             noun
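A minimal sketch of why a tagger is needed rather than plain dictionary lookup, assuming a toy lexicon and invented costs (neither comes from the talk, and the D2C tagger itself is not described beyond being pure Java). A greedy longest-match splitter would take MOMO right after SUMOMO and produce SUMOMO|MOMO|MO|NO|UCHI. A Viterbi search over a word lattice with a penalty for two adjacent words of the same part of speech recovers the split shown on the slide.

```python
# Toy lexicon: surface form -> part of speech (hypothetical entries for this example only).
LEXICON = {
    "SUMOMO": "noun",
    "MOMO": "noun",
    "UCHI": "noun",
    "MO": "particle",
    "NO": "particle",
}
WORD_COST = 1         # assumed flat cost per word
SAME_POS_PENALTY = 5  # assumed penalty for noun->noun or particle->particle


def connection_cost(prev_tag, tag):
    """Penalize two adjacent words that share a part of speech."""
    return SAME_POS_PENALTY if prev_tag == tag else 0


def segment(text):
    """Return the cheapest (word, tag) sequence covering text, found by Viterbi search."""
    # best[i][tag] = (cost, start, prev_tag, word) for the cheapest path
    # that ends at offset i with a word tagged `tag`.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, None, None, None)
    for i in range(len(text)):
        for prev_tag, (cost, _, _, _) in best[i].items():
            for word, tag in LEXICON.items():
                if not text.startswith(word, i):
                    continue
                end = i + len(word)
                new_cost = cost + WORD_COST + connection_cost(prev_tag, tag)
                if tag not in best[end] or new_cost < best[end][tag][0]:
                    best[end][tag] = (new_cost, i, prev_tag, word)
    if not best[-1]:
        raise ValueError("text cannot be segmented with this lexicon")
    # Walk the back-pointers from the cheapest final state to the start.
    tag = min(best[-1], key=lambda t: best[-1][t][0])
    pos, result = len(text), []
    while pos > 0:
        _, start, prev_tag, word = best[pos][tag]
        result.append((word, tag))
        pos, tag = start, prev_tag
    return result[::-1]


for word, tag in segment("SUMOMOMOMOMONOUCHI"):
    print(word, tag)
```

Production taggers learn word and connection costs from annotated corpora; the flat costs here exist only to make the example small.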
  9. Japanese NLP module for listing ads
     Query: SUMOMOMOMOMONOUCHI
       -> POS Tagger -> |SUMOMO|MO|MOMO|NO|UCHI|
       -> Keyword matching (against the Ad keywords db) -> KW set: {SUMOMO, MOMO}
     • NLP for ads is a 2-step process (POS tagging, then keyword detection)
     • Keyword detection is based on the Ad keywords db.
     • The NLP module is a pure Java implementation
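The second step, keyword detection, can be sketched as a set lookup over the tagger output. This assumes the Ad keywords db can be treated as an in-memory set of strings; the function name and the extra entry RINGO are hypothetical.

```python
def detect_keywords(tokens, ad_keywords):
    """Keep tokens found in the ad keyword set, in order and without duplicates."""
    seen = set()
    matched = []
    for token in tokens:
        if token in ad_keywords and token not in seen:
            seen.add(token)
            matched.append(token)
    return matched


tokens = ["SUMOMO", "MO", "MOMO", "NO", "UCHI"]  # POS tagger output from the slide
ad_keywords = {"SUMOMO", "MOMO", "RINGO"}        # hypothetical db contents
print(detect_keywords(tokens, ad_keywords))      # ['SUMOMO', 'MOMO']
```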
  10. Summary
     • Play2 Scala enabled D2C's service launch
       - Only 6 Play2 servers!
       - Covers the entire DOCOMO portal for mobile
     • Developed by only one person
       - Our NLP Java module was embedded in Play2 Scala smoothly.
  11. Search Query Log Mining for Ads
     • Large search query logs covering over 2 years
     • The number of queries increases day after day
     • The logs represent individual interests and likes.
     Extracting valuable information: query logs are our treasure!
  12. Mining Frequent Keywords with freq. and UU (Unique Users)
     Search query logs = List(query, date, transaction_id, user_id)
       -> Log mining ->
     Keyword set with freq. and UU = List(keyword, date, freq., user list)
     Purpose:
     • Obtain user clusters by frequent keywords
     • Optimize Ad keywords
  13. Mining Frequent Keywords with freq. and UU
     Search query logs:
       query           user_id
       ------------------------
       tennis shoes    user1
       tennis ball     user2
       shoes           user3
       …
     Query log mining yields frequent keywords with UU:
       keyword    freq.    user_list
       ----------------------------------
       tennis     2        {user1, user2}
       shoes      2        {user1, user3}
       …
     Estimate user clusters by frequent keywords with UU
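The table above can be reproduced with a small aggregation. This is a sketch that assumes whitespace splitting is enough for the English example (the real pipeline would use the POS tagger) and that freq. counts the queries containing a keyword; `mine_keywords` is a hypothetical name.

```python
from collections import defaultdict


def mine_keywords(log):
    """Map each keyword to (freq, set of user ids) from (query, user_id) records."""
    freq = defaultdict(int)
    users = defaultdict(set)
    for query, user_id in log:
        # set() so a word repeated inside one query is counted once
        for keyword in set(query.split()):
            freq[keyword] += 1
            users[keyword].add(user_id)
    return {keyword: (freq[keyword], users[keyword]) for keyword in freq}


log = [
    ("tennis shoes", "user1"),
    ("tennis ball", "user2"),
    ("shoes", "user3"),
]
result = mine_keywords(log)
for keyword in sorted(result):
    count, user_set = result[keyword]
    print(keyword, count, sorted(user_set))
```

The UU value for a keyword is the size of its user set, and the user sets themselves are what make clustering users by keyword possible.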
  14. Extracting Frequent Keywords by Spark RDD 1/3
     Queries:            tennis ball; tennis shoes; shoes
     Segment & expand:   tennis, ball, tennis_ball; tennis, shoes, tennis_shoes; shoes
     Mapping:            (tennis, 1), (ball, 1), (tennis_ball, 1), (tennis, 1), (shoes, 1), (tennis_shoes, 1), (shoes, 1)
     Shuffling:          (tennis, 1), (tennis, 1) | (tennis_ball, 1) | (ball, 1) | (shoes, 1), (shoes, 1) | (tennis_shoes, 1)
     Reducing & result:  (tennis, 2), (shoes, 2), (ball, 1), (tennis_ball, 1), (tennis_shoes, 1)
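The four stages can be imitated on plain Python lists to make the data flow concrete. This is a single-machine sketch, not the Spark code from the talk. The expansion rule (each word plus the whole query joined by underscores) is inferred from the slide's example, and the function names are hypothetical. On an RDD the stages would presumably map to `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict


def segment_and_expand(query):
    """Split a query into words and, for multi-word queries, add the joined phrase."""
    words = query.split()
    if len(words) > 1:
        return words + ["_".join(words)]
    return words


def frequent_keywords(queries):
    # Segment & expand, then mapping: emit one (keyword, 1) pair per keyword.
    mapped = [(keyword, 1) for q in queries for keyword in segment_and_expand(q)]
    # Shuffling: bring all pairs with the same key together.
    shuffled = defaultdict(list)
    for keyword, one in mapped:
        shuffled[keyword].append(one)
    # Reducing: sum the counts per key.
    return {keyword: sum(ones) for keyword, ones in shuffled.items()}


counts = frequent_keywords(["tennis ball", "tennis shoes", "shoes"])
for keyword, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(keyword, count)
```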
  15. Extracting Frequent Keywords by Spark RDD 2/3
     [Code screenshot, annotated: "RDD for search queries" and "Frequent keyword mining process"]
  16. Extracting Frequent Keywords by Spark 3/3
     [Code screenshot of the mining process, annotated: "Distributed word splitting" and "Aggregation by reduceByKey"]
  17. Categorize the Frequent Keywords
     Frequent keyword set: Soccer-shoes, Baseball-shoes, Guam-Travel, Kyushu-Tour, Color-Skirt-Summer, Reversible-Tulle-Skirt, Cancer-metastasis, Hoge-Hospital, ..
     How do we categorize?
       Sports:   Soccer-shoes, Baseball-shoes, ..
       Travel:   Guam-Travel, Kyushu-Tour, ..
       Fashion:  Color-Skirt-Summer, Reversible-Tulle-Skirt, …
       Health:   Cancer-metastasis, Hoge-Hospital, ..
     Categorize the keywords automatically
  18. Spark Classification algorithms
     Classification (supervised)
       1. Naive Bayes (NB)
       2. Support Vector Machine
       3. Logistic Regression
       4. Decision Tree
       5. Random Forest
       6. Gradient Boosted Tree
     Clustering (unsupervised)
       1. K-means
       2. Gaussian Mixture
     There are no semi-supervised learning algorithms yet. So we made one!
  19. Naive Bayes EM (NBEM) on Spark
     • Created a semi-supervised extension of Naive Bayes
     • Training is based on the EM algorithm
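The slide gives no implementation details, so the following is only a sketch of the usual way to combine multinomial Naive Bayes with EM, not the author's Spark code. It assumes Laplace smoothing, labeled documents with fixed hard labels, and unlabeled documents that contribute fractional counts weighted by their current class posteriors. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def m_step(weighted_docs, classes, vocab, alpha=1.0):
    """Estimate log priors and log likelihoods from (tokens, {class: weight}) pairs."""
    class_weight = {c: alpha for c in classes}
    word_weight = {c: defaultdict(float) for c in classes}
    for tokens, weights in weighted_docs:
        for c, w in weights.items():
            class_weight[c] += w
            for t in tokens:
                word_weight[c][t] += w
    total = sum(class_weight.values())
    log_prior = {c: math.log(class_weight[c] / total) for c in classes}
    log_lik = {}
    for c in classes:
        denom = sum(word_weight[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((word_weight[c][t] + alpha) / denom) for t in vocab}
    return classes, log_prior, log_lik


def predict(tokens, model):
    """E-step for one document: posterior P(class | tokens) under the model."""
    classes, log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
              for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def fit_nbem(labeled, unlabeled, iterations=10):
    """Train on labeled docs, then alternate E and M steps over the unlabeled docs."""
    classes = sorted({label for _, label in labeled})
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    hard = [(tokens, {label: 1.0}) for tokens, label in labeled]
    model = m_step(hard, classes, vocab)  # plain supervised Naive Bayes
    for _ in range(iterations):
        soft = [(tokens, predict(tokens, model)) for tokens in unlabeled]
        model = m_step(hard + soft, classes, vocab)
    return model


labeled = [(["soccer", "shoes"], "Sports"), (["guam", "travel"], "Travel")]
unlabeled = [["baseball", "shoes"], ["kyushu", "tour", "travel"], ["baseball", "glove"]]
print(predict(["baseball", "glove"], fit_nbem(labeled, unlabeled, iterations=0)))
print(predict(["baseball", "glove"], fit_nbem(labeled, unlabeled)))
```

With `iterations=0` the model is ordinary Naive Bayes and cannot tell the two classes apart for "baseball glove", because neither word occurs in the labeled data. After the EM iterations the unlabeled "baseball shoes" query ties "baseball" to Sports, so the prediction leans toward Sports.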
  20. NBEM vs. NB
     [Plot: "Accuracy for the news20 group data". x-axis: train data ratio (0.001 to 0.1); y-axis: accuracy (0.4 to 1); series: NaiveBayes, NaiveBayesEM]
     NBEM outperforms NB at a low train data ratio
  21. Summary
     • Spark RDD was used for the search query log mining
       - Obtained a frequent keyword set for user interest targeting
     • A semi-supervised extension of the Spark Naive Bayes classifier was proposed.
       - NBEM outperforms NB when training data is small
       - NBEM is used for categorizing keywords.