Challenges in Search in Global Services

rejasupotaro
February 18, 2018

Transcript

  1. There's no way you wouldn't enjoy a search system that supports 20+ languages

    Kentaro Takiguchi / @rejasupotaro
  2. I worked in...

    - US: 6 months
    - Spain: 6 months
    - Indonesia: 3 months
    - UAE: 2 weeks
    - Vietnam, Taiwan, Germany, Italy: 1 week
    - UK: 1 year and 3 months (now)
  3. Discovery Team

    Providing users with the right content at the right time
    - Search
    - Recommendation

    The difference is whether an explicit query is given or not.
  5. 1. Index Construction

    - Elasticsearch = DB in search
    - "Focusing on search" ≈ "Focusing on index"
  6. Index = Schema + Preprocessing

    - Schema: title, country, cuisine, ingredients, ...
    - Preprocessing: https://www.codeschool.com/blog/2016/03/25/machine-learning-working-with-stop-words-stemming-and-spam/
  7. Challenge

    The number of indices = types × languages = n × m
    - n = the number of types
    - m = the number of supported languages
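To make the n × m growth concrete, here is a minimal sketch that enumerates one index per (type, language) pair. The type names, the locale list, and the `type_language` naming scheme are assumptions for illustration; the talk does not list the real ones.

```python
from itertools import product

# Hypothetical document types and locales (not taken from the talk).
TYPES = ["recipes", "users", "keywords"]
LANGUAGES = ["en", "es", "id", "vi", "ar"]


def index_names(types, languages):
    """Return one index name per (type, language) pair."""
    return [f"{t}_{lang}" for t, lang in product(types, languages)]


names = index_names(TYPES, LANGUAGES)
print(len(names))  # n * m = 3 * 5 = 15
```

Every new type or language adds a whole row or column of indices, each with its own schema and preprocessing to maintain.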
  8. Intention = Query (+ Hidden Context)

    [User] => (Query) => [Server] => (Query + Context) => [Search Engine]
    - Locale (→ Index, Dictionary, ...)
    - Location (→ Scoring, ...)
    - Synonyms (→ Dictionary, Query, ...)
    - Misspellings (→ Language Model, Query, ...)
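The flow above can be sketched as a server-side function that takes the raw query and attaches hidden context before calling the search engine. This is only an illustration: the `Context` class, the `build_search_request` helper, and the `recipes_<locale>` index naming are assumptions, while `function_score`, `simple_query_string`, and the `region_id` boost follow the Elasticsearch query shown later in the deck.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hidden context the server knows but the user never typed (hypothetical)."""
    locale: str     # selects the index and dictionaries, e.g. "es"
    region_id: int  # used for a location-based scoring boost


def build_search_request(query: str, ctx: Context) -> dict:
    """Combine the explicit query with hidden context into one request."""
    body = {
        "query": {
            "function_score": {
                "query": {
                    "simple_query_string": {"fields": ["title"], "query": query}
                },
                # Location context: boost recipes from the user's region.
                "functions": [
                    {
                        "filter": {"term": {"region_id": ctx.region_id}},
                        "weight": 1000.0,
                    }
                ],
                "score_mode": "sum",
                "boost_mode": "sum",
            }
        }
    }
    # Locale context: pick the language-specific index (assumed naming).
    return {"index": f"recipes_{ctx.locale}", "body": body}


request = build_search_request("desayuno", Context(locale="es", region_id=1))
```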
  9. "Desayuno" (Breakfast)

    {
      "_source": false,
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": [
                { "term": { "forbidden": { "value": false, "boost": 0 } } },
                { "term": { "approved": { "value": true, "boost": 0 } } },
                {
                  "function_score": {
                    "boost_mode": "replace",
                    "score_mode": "sum",
                    "functions": [
                      {
                        "filter": {
                          "span_or": {
                            "clauses": [
                              { "span_term": { "title.raw": "desayuno" } },
                              { "span_term": { "title.raw": "desayunos" } }
                            ]
                          }
                        },
                        "weight": 300
                      },
                      {
                        "filter": {
                          "simple_query_string": {
                            "fields": [ "title", "dish_name" ],
                            "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                            "default_operator": "AND"
                          }
                        },
                        "weight": 100
                      }
                    ],
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "simple_query_string": {
                              "fields": [ "title", "dish_name" ],
                              "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                              "default_operator": "AND"
                            }
                          }
                        ],
                        "must_not": [
                          {
                            "simple_query_string": {
                              "fields": [ "title", "ingredients", "dish_name" ],
                              "query": "(flan | pollo | salsa | tarta | helado | pastel | postre | salmón | ensalada | hamburguesa)",
                              "default_operator": "AND"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }
          },
          "functions": [
            { "filter": { "term": { "with_image": true } }, "weight": 100 },
            { "exp": { "published_at": { "origin": "2018-02-12T06:06:42.266Z", "scale": "30d", "decay": 0.99 } }, "weight": 1 },
            { "filter": { "term": { "region_id": 1 } }, "weight": 1000.0 }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      }
    }

    This is the actual Elasticsearch query. Why does it become so long?
  10. Do you understand what "Desayuno" is?

    It doesn't mean "the user wants to see recipes that have Desayuno in the title".

    → Query expansion
    - Desayuno
      - tostada, avena, galleta, crêpe, churros con chocolate, ...
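A minimal sketch of the expansion step, assuming a hand-maintained dictionary that maps a concept word to dish names. The `EXPANSIONS` table and the `expand_query` helper are hypothetical; the output format mirrors the `simple_query_string` syntax used in the Elasticsearch query on the previous slide, where `|` means OR and multi-word terms are quoted as phrases.

```python
# Hypothetical expansion dictionary; the terms come from the slide.
EXPANSIONS = {
    "desayuno": ["tostada", "avena", "galleta", "crêpe", "churros con chocolate"],
}


def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as a phrase."""
    return f'"{term}"' if " " in term else term


def expand_query(query: str, expansions=EXPANSIONS) -> str:
    """Return an OR-joined expression, or the query unchanged if unknown."""
    terms = expansions.get(query.strip().lower())
    if not terms:
        return query
    return "(" + " | ".join(quote_if_phrase(t) for t in terms) + ")"


print(expand_query("Desayuno"))
# (tostada | avena | galleta | crêpe | "churros con chocolate")
```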
  11. Challenge

    - Knowledge of food culture is required
    - Knowledge of language/grammar is required

    e.g. Spanish breakfast, Japanese breakfast
  12. Cleaning

    - Force encode Latin-1 to UTF-8 ('tortas de cumpleaños' → 'tortas de cumpleaños')
    - "の" in Taiwan

    mysql> select title from recipes where title like "%の%" and provider_id = 12 order by id desc limit 3;
    +-----------------------------------------------------------+
    | title                                                     |
    +-----------------------------------------------------------+
    | ຑ᫤∍໦ࣖྋ፩ࢁᠴਯɽᐬᇯᇯの২෺ԏᜰ |
    | ؔ੢෩のᆹتᗑ |
    | ὑѼのਗ਼ྋҰՆ ♥ ྉཧ ✿ ᫊ᱷ༻ిುࣽ“→౾౬” |
    +-----------------------------------------------------------+
    3 rows in set (0.24 sec)
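The Latin-1 cleaning step can be sketched as follows, assuming the damage is the common case where UTF-8 bytes were decoded as Latin-1 somewhere upstream. The `fix_mojibake` helper is hypothetical and not the talk's actual implementation; text that is already correct, or that cannot be Latin-1 at all (such as CJK titles), is returned unchanged.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 encodable, or not valid UTF-8 afterwards: leave as is.
        return text


# Simulate the damage: UTF-8 bytes read back as Latin-1.
broken = "tortas de cumpleaños".encode("utf-8").decode("latin-1")
print(broken)                # the "ñ" shows up as two wrong characters
print(fix_mojibake(broken))  # tortas de cumpleaños
```

Catching both exceptions makes the function safe to run over a whole column, since clean rows fail one of the two steps and pass through untouched.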
  16. Negation Handling

    - en: "without", "no", "with no"
    - es: "sin"
    - id: "tanpa", "no"
    - vi: "không cần dùng", "không cần", "không dùng", "không"
    - ar: "دون", "بدون", "من دون", "من غير", "بلا"
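One way to use these word lists is to split a query into a part to match and a part to exclude. The sketch below is an assumption about how that could work, not the talk's implementation: `split_negation` is a hypothetical helper, it handles only the first negation it finds, and it tries longer phrases first so that "không cần dùng" wins over "không".

```python
import re

# Negation phrases per language, taken from the slide.
NEGATIONS = {
    "en": ["without", "no", "with no"],
    "es": ["sin"],
    "id": ["tanpa", "no"],
    "vi": ["không cần dùng", "không cần", "không dùng", "không"],
    "ar": ["دون", "بدون", "من دون", "من غير", "بلا"],
}


def split_negation(query: str, lang: str):
    """Return (include, exclude) parts of a query for the given language."""
    # Longest phrases first so multi-word negations are not cut short.
    phrases = sorted(NEGATIONS.get(lang, []), key=len, reverse=True)
    if not phrases:
        return query.strip(), ""
    alternatives = "|".join(re.escape(p) for p in phrases)
    # A negation must be a whole word: start/whitespace before, whitespace after.
    pattern = r"(?:^|\s)(?:" + alternatives + r")\s+"
    parts = re.split(pattern, query.strip(), maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return parts[0], ""
    return parts[0].strip(), parts[1].strip()


print(split_negation("curry without onion", "en"))  # ('curry', 'onion')
print(split_negation("tarta sin azúcar", "es"))     # ('tarta', 'azúcar')
```

The excluded part could then feed a `must_not` clause like the one in the Elasticsearch query shown earlier.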
  17. Normalization: Misspellings

    Which is the correct spelling?
    1. brocoli
    2. brocolli
    3. broccoli
    4. broccolli
    5. brroccolli
  18. Normalization: Spelling Correction

    Affix File + Custom Dictionary File

    TRY esiaénrtolcdugmfphbyvkw-'.zqjxSNRTLCGDMFPHBEAUYOIVKWóöâôZQJXÅçèîêàïüäñ
    NOSUGGEST !
    COMPOUNDMIN 1
    ONLYINCOMPOUND _
    COMPOUNDRULE 2
    COMPOUNDRULE #*0{
    COMPOUNDRULE #*@}
    WORDCHARS 0123456789’
    REP 27
    REP f ph
    REP ph f
    PFX A Y 2
    PFX A 0 re [^e]
    ...
  19. Normalization: Spelling Correction

    Affix File + Custom Dictionary File

    Aachen/M
    aardvark/MS
    aardwolf
    aardwolves
    aargh
    Aarhus/M
    Aaron/M
    Aaronvitch/M
    ab
    Ababa/M
    aback
    abacus/SM
    ...
  21. Normalization: Spelling Correction

    Autoencoder: Query + Noise which simulates human misspellings
    - protyein => protein
    - chorrilltana => chorrillana
    - obusuuma => obusuma
    - noirdic => nordic
    - midwlestern => midwestern
    - fklower => flower
    - gtoast => toast
    - byat => bat
    - yoogurt => yogurt
    - lgrill => grill
    - romsanian => romanian
    - bulgaroia => bulgaria
    - bzarbecue => barbecue
    - krombu => kombu
    - wwine => wine
    - icelanldic => icelandic
    - ...
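The noise side of this setup can be sketched as a function that turns clean queries into (noisy, clean) training pairs. Every example on the slide differs from its target by one inserted character, so this sketch only inserts a single random lowercase letter; that restriction, the helper names, and the fixed seed are assumptions, and the autoencoder model itself is not shown.

```python
import random
import string


def add_insertion_noise(word: str, rng: random.Random) -> str:
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    return word[:pos] + rng.choice(string.ascii_lowercase) + word[pos:]


def make_training_pairs(words, rng: random.Random):
    """Build (noisy input, clean target) pairs for a denoising model."""
    return [(add_insertion_noise(w, rng), w) for w in words]


rng = random.Random(0)  # fixed seed so runs are repeatable
pairs = make_training_pairs(["protein", "yogurt", "barbecue"], rng)
for noisy, clean in pairs:
    print(f"{noisy} => {clean}")
```

A fuller noise model would likely also delete, swap, and substitute characters, for example using keyboard-neighbor keys.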
  22. Normalization

    "cheesecakes" => "cheesecake"
    "cheescake" => "cheesecake"
    "tarta de queso" => "cheesecake"
    "tarta de quesos" => "cheesecake"
    "tartas de queso" => "cheesecake"
    "tartas de quesos" => "cheesecake"
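The end result of normalization can be pictured as a lookup from surface forms to one canonical key. The sketch below hard-codes the slide's examples in a table; in practice the mapping would presumably come from stemming, spelling correction, and a dictionary rather than a literal dict, so treat `CANONICAL` and `normalize` as illustrative names only.

```python
# Surface form -> canonical form, using the examples from the slide.
CANONICAL = {
    "cheesecakes": "cheesecake",
    "cheescake": "cheesecake",
    "tarta de queso": "cheesecake",
    "tarta de quesos": "cheesecake",
    "tartas de queso": "cheesecake",
    "tartas de quesos": "cheesecake",
}


def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, then map to the canonical form."""
    key = " ".join(query.lower().split())
    return CANONICAL.get(key, key)


print(normalize("Tartas  de Queso"))  # cheesecake
```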
  24. 3. Dictionary = Stored Domain Knowledge

    - Categories (Cuisine, Event, Occasion, ...)
    - Types (Dish, Ingredient, Tool, ...)
    - Parent-child relationships
    - Synonyms
    - Negatives
  25. Knowledge... depends on the individual

    - What if the dictionary maintainer is good at cooking seafood but not good at making sweets?
  26. Synonyms

    Dictionary Tool => SPARQL => Ontology

    SELECT ?redirectsTo WHERE {
      FILTER (?uri = dbr:#{word})
      ?redirectsTo dbo:wikiPageRedirects ?uri .
    }

    => Chili, Chili Pepper, Red Chili, Hot Pepper, ...
  27. Synonyms

    Needs to be specialized in cooking

    | issue                 | word    | example of retrieved result |
    |-----------------------|---------|-----------------------------|
    | irrelevant to cooking | dinner  | the dinner (cinema)         |
    | not useful for search | mexican | united states of mexico     |
    | child included 1      | chicken | chicken leg                 |
    | child included 2      | donburi | ikuradon                    |
    | a bit (?) different   | donburi | tenshin-han                 |
  28. Synonyms

    How about word embeddings? "tortilla" is similar to...

    ('tortillas', 0.9796085953712463)
    ('enchiladas', 0.977012574672699)
    ('enchilada', 0.9445205926895142)
    ('taco', 0.9363764524459839)
    ('pepperoni', 0.9355688095092773)
    ('doritos', 0.9267387390136719)
    ('burritos', 0.9246858954429626)
    ('jack', 0.9222062230110168)
    ('burrito', 0.919107973575592)
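Scores like these are typically cosine similarities between word vectors. The sketch below shows the ranking step with tiny made-up 3-dimensional vectors; the `TOY_VECTORS` values are invented for illustration and are not the embeddings behind the slide.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "tortilla": [0.9, 0.1, 0.3],
    "tortillas": [0.88, 0.12, 0.31],
    "taco": [0.8, 0.3, 0.2],
    "cheesecake": [0.1, 0.9, 0.4],
}

for neighbor, score in most_similar("tortilla", TOY_VECTORS):
    print(neighbor, round(score, 4))
```

As the slide's list suggests, a high similarity only means two words appear in similar contexts ("pepperoni", "doritos", "jack"), so neighbors still need filtering before they can be used as synonyms.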
  29. Recommendation

    - Provide recipes without an explicit query (Query + Context)
    - What the user tends to like/cook
    - Taste preferences
    - Available ingredients
    - For what? Breakfast? Lunch? Dinner? Party?
  30. Each market has different features

    - Number of recipes
      - Mature Markets >>>>> New Markets
    - Priority
    - Available data/logs
    - Quality of product
  31. A lot of challenges

    - Knowledge of Food
    - Locality (languages, traditions, religions, …)
    - Linguistic Issues
    - How to build/store knowledge
    - We ❤ Machine Learning