Challenges in Search in Global Services

rejasupotaro
February 18, 2018

Transcript

  1. There's no way you wouldn't enjoy a search system that supports 20+ languages
    Kentaro Takiguchi / @rejasupotaro
  2. Share challenges in search
  3. I worked in...
    - US: 6 months
    - Spain: 6 months
    - Indonesia: 3 months
    - UAE: 2 weeks
    - Vietnam, Taiwan, Germany, Italy: 1 week
    - UK: 1 year and 3 months (now)
  4. Experienced different cultures in different countries
  5. https://sourcediving.com/localization-at-cookpad-aa4e11498564
  6. Discovery Team: providing users with the right content at the right time
    - Search
    - Recommendation
    The difference is whether an explicit query is given or not.
  8. Search
    1. Index Construction
    2. Query Understanding
    3. Dictionary
    4. Scoring
  9. 1. Index Construction
    - Elasticsearch = the DB in search
    - "Focusing on search" ≈ "Focusing on index"
  10. Index = Schema + Preprocessing
    - Schema: title, country, cuisine, ingredients, ...
    - Preprocessing: https://www.codeschool.com/blog/2016/03/25/machine-learning-working-with-stop-words-stemming-and-spam/
  11. Challenge: the number of indices = Types × Languages
    - n = the number of types
    - m = the number of supported languages
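
The n × m growth can be made concrete with a small sketch. The document types, language codes, and index naming scheme below are hypothetical and not taken from the talk; the analyzer names are Elasticsearch built-in language analyzers, and the mapping layout assumes the typeless format of Elasticsearch 7 and later.

```python
from itertools import product

# Hypothetical document types and languages (assumptions for illustration).
TYPES = ["recipe", "user"]
ANALYZERS = {"en": "english", "es": "spanish", "id": "indonesian", "ar": "arabic"}


def index_definitions(types, analyzers):
    """Return {index_name: index_body} for every (type, language) pair."""
    definitions = {}
    for doc_type, (lang, analyzer) in product(types, analyzers.items()):
        definitions[f"{doc_type}_{lang}"] = {
            "mappings": {
                "properties": {
                    # Schema fields named on the "Index = Schema + Preprocessing" slide.
                    "title": {"type": "text", "analyzer": analyzer},
                    "ingredients": {"type": "text", "analyzer": analyzer},
                    "country": {"type": "keyword"},
                    "cuisine": {"type": "keyword"},
                }
            }
        }
    return definitions


definitions = index_definitions(TYPES, ANALYZERS)
print(len(definitions))  # n * m index definitions to maintain
```

With 2 types and 4 languages this already yields 8 index definitions, each of which needs its own preprocessing choices.
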
  12. 2. Query Understanding: what is a query?
    - Text
    - Voice
    - Image
    - ...
  13. Intention = Query (+ Hidden Context)
    [User] => (Query) => [Server] => (Query + Context) => [Search Engine]
    - Locale (=> Index, Dictionary, ...)
    - Location (=> Scoring, ...)
    - Synonyms (=> Dictionary, Query, ...)
    - Misspellings (=> Language Model, Query, ...)
  14. "Desayuno" (Breakfast)
    {
      "_source": false,
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": [
                { "term": { "forbidden": { "value": false, "boost": 0 } } },
                { "term": { "approved": { "value": true, "boost": 0 } } },
                {
                  "function_score": {
                    "boost_mode": "replace",
                    "score_mode": "sum",
                    "functions": [
                      {
                        "filter": {
                          "span_or": {
                            "clauses": [
                              { "span_term": { "title.raw": "desayuno" } },
                              { "span_term": { "title.raw": "desayunos" } }
                            ]
                          }
                        },
                        "weight": 300
                      },
                      {
                        "filter": {
                          "simple_query_string": {
                            "fields": [ "title", "dish_name" ],
                            "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                            "default_operator": "AND"
                          }
                        },
                        "weight": 100
                      }
                    ],
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "simple_query_string": {
                              "fields": [ "title", "dish_name" ],
                              "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                              "default_operator": "AND"
                            }
                          }
                        ],
                        "must_not": [
                          {
                            "simple_query_string": {
                              "fields": [ "title", "ingredients", "dish_name" ],
                              "query": "(flan | pollo | salsa | tarta | helado | pastel | postre | salmón | ensalada | hamburguesa)",
                              "default_operator": "AND"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }
          },
          "functions": [
            { "filter": { "term": { "with_image": true } }, "weight": 100 },
            { "exp": { "published_at": { "origin": "2018-02-12T06:06:42.266Z", "scale": "30d", "decay": 0.99 } }, "weight": 1 },
            { "filter": { "term": { "region_id": 1 } }, "weight": 1000.0 }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      }
    }
    This is the actual Elasticsearch query. How does it become so long?
  15. Do you understand what "Desayuno" is? It doesn't mean "the user wants to see recipes that have Desayuno in their title".
    => Query expansion
    - Desayuno
    - tostada, avena, galleta, crêpe, churros con chocolate, ...
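
Query expansion as described above can be sketched roughly as follows. This is a much simpler stand-in for the production query shown on the previous slide, not a reconstruction of it: the `EXPANSIONS` dictionary, the `expand_query` helper, and the boost value are all hypothetical. It only uses the standard Elasticsearch `bool`, `match`, and `match_phrase` query clauses.

```python
# Hypothetical expansion dictionary; in practice this is curated domain knowledge.
EXPANSIONS = {
    "desayuno": ["tostada", "avena", "galleta", "crêpe", "churros con chocolate"],
}


def expand_query(query, expansions, exact_boost=3.0):
    """Build a bool query that matches the literal term or any related dish."""
    term = query.strip().lower()
    # The literal term still counts, and counts more than the expansions.
    should = [{"match": {"title": {"query": term, "boost": exact_boost}}}]
    for related in expansions.get(term, []):
        # Phrase match so that multi-word dishes stay together.
        should.append({"match_phrase": {"title": related}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


body = expand_query("Desayuno", EXPANSIONS)
print(len(body["query"]["bool"]["should"]))  # one clause per term
```

A query with no dictionary entry falls back to a single `match` clause, so the expansion step is safe to apply to every query.
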
  16. Challenge
    - Knowledge of food culture is required
    - Knowledge of language/grammar is required
    Spanish breakfast, Japanese breakfast
  17. "Query Preprocessing"
    - Cleaning
    - Tokenization
    - Normalization
    - Removing stopwords
    - ...
  18. Cleaning
    - Force encode Latin-1 to UTF-8 ('tortas de cumpleaÃ±os' => 'tortas de cumpleaños')
    - "の" in Taiwan
    mysql> select title from recipes where title like "%の%" and provider_id = 12 order by id desc limit 3;
    +-----------------------------------------------------------+
    | title                                                     |
    +-----------------------------------------------------------+
    | ຑ᫤∍໦ࣖྋ፩ࢁᠴਯɽᐬᇯᇯの২෺ԏᜰ |
    | ؔ੢෩のᆹتᗑ |
    | ὑѼのਗ਼ྋҰՆ ♥ ྉཧ ✿ ᫊ᱷ༻ిುࣽ“→౾౬” |
    +-----------------------------------------------------------+
    3 rows in set (0.24 sec)
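
The Latin-1 to UTF-8 repair can be sketched as a round trip through bytes. This is a minimal sketch, assuming the input is UTF-8 text that was wrongly decoded as Latin-1; `fix_mojibake` is a hypothetical helper name, not something shown in the talk.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes decoded as Latin-1'; return the input if it looks fine."""
    try:
        # Recover the original bytes, then decode them the way they were meant.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text contains characters outside Latin-1 (e.g. Japanese),
        # or its bytes are not valid UTF-8, so it was probably not mojibake.
        return text


print(fix_mojibake("tortas de cumplea\u00c3\u00b1os"))
```

Text that is already correct is left alone, because "ñ" on its own encodes to a single Latin-1 byte that is not valid UTF-8.
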
  20. Diacritical Marks in Vietnamese
    https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%88%E3%83%8A%E3%83%A0%E8%AA%9E
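
One common way to handle diacritics at query time is to fold them away so that "pho" and "phở" hit the same terms. The sketch below uses only Python's standard `unicodedata` module; whether folding is appropriate for Vietnamese is a product decision the slides do not settle, since tone marks distinguish words, so treat this as an illustration rather than the approach used in the talk.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks, e.g. Vietnamese tone and vowel marks."""
    # "đ" has no canonical decomposition, so it needs an explicit mapping.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(fold_diacritics("phở bò"))
print(fold_diacritics("bánh mì"))
```

A middle ground is to index both the original and the folded form and score exact matches higher.
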

  22. Tokenization
    1. Simple Tokenizer
    2. Ngram Tokenizer
    3. Language Dependent Tokenizer
  23. 1. Simple Tokenizer
    - "Cheese Cake" => "Cheese", "Cake"
    - "Water/Salt" => "Water", "Salt"
  24. 2. Ngram Tokenizer
  25. 2. Ngram Tokenizer
    "Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft"
    ドナウ汽船電気事業本工場工事部門下級官吏組合
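
The first two tokenizer styles can be sketched in a few lines of plain Python. These are illustrative stand-ins, not the Elasticsearch implementations; the function names and default n-gram sizes are assumptions.

```python
import re


def simple_tokenize(text):
    """Split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"\w+", text)


def ngram_tokenize(text, min_n=2, max_n=3):
    """Emit every substring whose length is between min_n and max_n."""
    grams = []
    for start in range(len(text)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


print(simple_tokenize("Cheese Cake"))  # ['Cheese', 'Cake']
print(simple_tokenize("Water/Salt"))   # ['Water', 'Salt']
print(ngram_tokenize("cake"))          # ['ca', 'cak', 'ak', 'ake', 'ke']
```

The simple tokenizer returns a long compound word as a single token, which is why n-grams help for languages that write compounds or whole sentences without spaces: a 4-gram index of the German word above contains "werk", so a query for that part can still match.
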

  27. 3. Language Dependent Tokenizer
    - "私は人参を1本だけ買った" ("I bought just one carrot")
    - "私", "は", "人参", "を", "1", "本", "だけ", "買っ", "た"
  28. Negation Handling
    - en: "without", "no", "with no"
    - es: "sin"
    - id: "tanpa", "no"
    - vi: "không cần dùng", "không cần", "không dùng", "không"
    - ar: "دون", "بدون", "من دون", "من غير", "بلا"
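
A rough sketch of how such per-language negation words might be used: split the query into a part to include and a part to exclude. The `split_negation` helper and the rule that everything after the negation word is negated are assumptions for illustration; the slides list only the words themselves.

```python
# Negation phrases per language, taken from the list above (Arabic omitted for brevity).
NEGATIONS = {
    "en": ["without", "no", "with no"],
    "es": ["sin"],
    "id": ["tanpa", "no"],
    "vi": ["không cần dùng", "không cần", "không dùng", "không"],
}


def split_negation(query, lang):
    """Return (wanted, unwanted) parts of a query such as 'pasta sin gluten'."""
    tokens = query.lower().split()
    # Try longer phrases first so "with no" wins over "no".
    phrases = sorted((p.split() for p in NEGATIONS.get(lang, [])), key=len, reverse=True)
    for i in range(len(tokens)):
        for phrase in phrases:
            if tokens[i:i + len(phrase)] == phrase:
                wanted = " ".join(tokens[:i])
                unwanted = " ".join(tokens[i + len(phrase):])
                return wanted, unwanted
    return " ".join(tokens), ""


print(split_negation("Pasta sin gluten", "es"))   # ('pasta', 'gluten')
print(split_negation("cake with no eggs", "en"))  # ('cake', 'eggs')
```

The unwanted part would then go into something like a `must_not` clause, as in the breakfast query earlier.
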
  29. Normalization: Misspellings
    Which is the correct spelling?
    1. brocoli
    2. brocolli
    3. broccoli
    4. broccolli
    5. brroccolli
  30. Normalization: Spelling Correction
    Affix File + Custom Dictionary File
    TRY esiaénrtolcdugmfphbyvkw-'.zqjxSNRTLCGDMFPHBEAUYOIVKWóöâôZQJXÅçèîêàïüäñ
    NOSUGGEST !
    COMPOUNDMIN 1
    ONLYINCOMPOUND _
    COMPOUNDRULE 2
    COMPOUNDRULE #*0{
    COMPOUNDRULE #*@}
    WORDCHARS 0123456789’
    REP 27
    REP f ph
    REP ph f
    PFX A Y 2
    PFX A 0 re [^e]
    ...
  31. Normalization: Spelling Correction
    Affix File + Custom Dictionary File
    Aachen/M
    aardvark/MS
    aardwolf
    aardwolves
    aargh
    Aarhus/M
    Aaron/M
    Aaronvitch/M
    ab
    Ababa/M
    aback
    abacus/SM
    ...
  33. Normalization: Spelling Correction
    Autoencoder: Query + Noise which simulates human misspellings
    ☑ protyein => protein
    ☑ chorrilltana => chorrillana
    ☑ obusuuma => obusuma
    ☑ noirdic => nordic
    ☑ midwlestern => midwestern
    ☑ fklower => flower
    ☑ gtoast => toast
    ☑ byat => bat
    ☑ yoogurt => yogurt
    ☑ lgrill => grill
    ☑ romsanian => romanian
    ☑ bulgaroia => bulgaria
    ☑ bzarbecue => barbecue
    ☑ krombu => kombu
    ☑ wwine => wine
    ☑ icelanldic => icelandic
    ...
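
The "Query + Noise" idea can be illustrated with a tiny training-pair generator. The slides do not specify the noise model; since every example above looks like a single inserted letter, this sketch assumes insertion-only noise. `add_typo` and `training_pairs` are hypothetical names, and the autoencoder itself is out of scope here.

```python
import random
import string


def add_typo(word, rng):
    """Insert one random lowercase letter at a random position."""
    position = rng.randint(0, len(word))  # inclusive, so appending is possible
    letter = rng.choice(string.ascii_lowercase)
    return word[:position] + letter + word[position:]


def training_pairs(words, rng, per_word=3):
    """Return (noisy, clean) pairs: the model reads noisy and learns to emit clean."""
    return [(add_typo(word, rng), word) for word in words for _ in range(per_word)]


rng = random.Random(0)  # seeded so runs are repeatable
for noisy, clean in training_pairs(["protein", "yogurt"], rng, per_word=2):
    print(f"{noisy} => {clean}")
```

A fuller noise model would also delete, swap, and substitute characters, ideally weighted by keyboard layout for each locale.
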
  34. Normalization
    "cheesecakes" => "cheesecake"
    "cheescake" => "cheesecake"
    "tarta de queso" => "cheesecake"
    "tarta de quesos" => "cheesecake"
    "tartas de queso" => "cheesecake"
    "tartas de quesos" => "cheesecake"
  36. 3. Dictionary = Stored Domain Knowledge
    - Categories (Cuisine, Event, Occasion, ...)
    - Types (Dish, Ingredient, Tool, ...)
    - Parent-child relationships
    - Synonyms
    - Negatives
  37. Challenge: dictionaries are prepared not only by language but also by region
  38. Synonyms, e.g. weenie
  39. Knowledge... depends on the individual
    - What if the dictionary maintainer is good at cooking seafood but not good at making sweets?
  40. Ontology (Knowledge Graph)
  41. Synonyms
    Dictionary Tool => SPARQL => Ontology
    SELECT $redirectsTo WHERE {
      FILTER (?uri = dbr:#{word})
      ?redirectsTo dbo:wikiPageRedirects ?uri .
    }
    => Chili, Chili Pepper, Red Chili, Hot Pepper, ...
  42. Synonyms: needs to be specialized in cooking
    | issue                 | word    | example of retrieved result |
    |-----------------------|---------|-----------------------------|
    | irrelevant to cooking | dinner  | the dinner (cinema)         |
    | not useful for search | mexican | united states of mexico     |
    | child included 1      | chicken | chicken leg                 |
    | child included 2      | donburi | ikuradon                    |
    | a bit (?) different   | donburi | tenshin-han                 |
  43. Misspellin... Synonyms?
    siomay, somay, sio may, shomay, shumai, shumay, syomay, siomai, ...
  44. Synonyms: how about word embeddings?
    "tortilla" is similar to...
    ('tortillas', 0.9796085953712463)
    ('enchiladas', 0.977012574672699)
    ('enchilada', 0.9445205926895142)
    ('taco', 0.9363764524459839)
    ('pepperoni', 0.9355688095092773)
    ('doritos', 0.9267387390136719)
    ('burritos', 0.9246858954429626)
    ('jack', 0.9222062230110168)
    ('burrito', 0.919107973575592)
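
Scores like the ones above are typically cosine similarities between word vectors. The sketch below shows the computation with made-up two-dimensional vectors; the numbers and the `most_similar` helper are purely illustrative and unrelated to the model behind the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for this example.
TOY_VECTORS = {
    "tortilla": [1.0, 0.0],
    "tortillas": [0.9, 0.1],
    "taco": [0.5, 0.5],
    "wine": [0.0, 1.0],
}

print(most_similar("tortilla", TOY_VECTORS, top_n=2))
```

As the list on the slide suggests, near neighbours are related words ("taco", "pepperoni") rather than strict synonyms, so embeddings alone would need filtering before feeding a synonym dictionary.
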
  45. Negatives, e.g. $\text{Picnic Shoulder} \notin \text{Picnic}$
  46. 4. Scoring
  47. What users want to see varies by region
  48. Recommendation
    - Provide recipes without an explicit query (Query + Context)
    - What the user tends to like/cook
    - Taste preferences
    - Available ingredients
    - For what? Breakfast? Lunch? Dinner? Party?
  49. Each market has different features
    - Number of recipes
      - Mature Markets >>>>> New Markets
    - Priority
    - Available data/logs
    - Quality of product
  50. Understand Users ❤ Recipes
  51. A lot of challenges
    - Knowledge of Food
    - Locality (languages, traditions, religions, …)
    - Linguistics Issues
    - How to build/store knowledge
    - We ❤ Machine Learning
  52. We are hiring!
    - https://info.cookpad.com/en/careers
    - https://info.cookpad.com/careers