Slide 1

Slide 1 text

There's no way you wouldn't enjoy the search system which supports 20+ languages
Kentaro Takiguchi / @rejasupotaro

Slide 2

Slide 2 text

Share challenges in search

Slide 3

Slide 3 text

I worked in...
- US: 6 months
- Spain: 6 months
- Indonesia: 3 months
- UAE: 2 weeks
- Vietnam, Taiwan, Germany, Italy: 1 week
- UK: 1 year and 3 months (now)

Slide 4

Slide 4 text

Experienced different cultures in different countries

Slide 5

Slide 5 text

https://sourcediving.com/localization-at-cookpad-aa4e11498564

Slide 6

Slide 6 text

Discovery Team
Providing users with the right content at the right time
- Search
- Recommendation
The difference is whether an explicit query is given or not.

Slide 7

Slide 7 text


Slide 8

Slide 8 text

Search
1. Index Construction
2. Query Understanding
3. Dictionary
4. Scoring

Slide 9

Slide 9 text

1. Index Construction
- Elasticsearch = the database of search
- "Focusing on search" ≈ "Focusing on the index"

Slide 10

Slide 10 text

Index = Schema + Preprocessing
- Schema: title, country, cuisine, ingredients, ...
- Preprocessing: https://www.codeschool.com/blog/2016/03/25/machine-learning-working-with-stop-words-stemming-and-spam/

Slide 11

Slide 11 text

Challenge
The number of indices = Types * Languages = n * m
- n = the number of types
- m = the number of supported languages
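To make the n * m growth concrete, here is a minimal sketch that enumerates one index per (type, language) pair. The type names, the language codes, and the `type_language` naming scheme are assumptions for illustration, not the actual index layout.

```python
from itertools import product


def index_names(types, languages):
    """Return one index name per (type, language) pair."""
    return [f"{t}_{lang}" for t, lang in product(types, languages)]


# Hypothetical values: n = 2 types, m = 3 languages.
types = ["recipes", "users"]
languages = ["en", "es", "ja"]

names = index_names(types, languages)
print(len(names))  # n * m indices to build, tune, and monitor
print(names)
```

Adding one language adds n indices, and adding one type adds m indices, which is why the count grows quickly at 20+ languages.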

Slide 12

Slide 12 text

2. Query Understanding
What is a query?
- Text
- Voice
- Image
- ...

Slide 13

Slide 13 text

Intention = Query (+ Hidden Context)
[User] => (Query) => [Server] => (Query + Context) => [Search Engine]
- Locale (=> Index, Dictionary, ...)
- Location (=> Scoring, ...)
- Synonyms (=> Dictionary, Query, ...)
- Misspellings (=> Language Model, Query, ...)
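The flow above (the server adds hidden context before the query reaches the search engine) can be sketched as follows. The `Context` fields, the `recipes_<locale>` index name, and the request shape are assumptions for illustration; only the `region_id` filter mirrors a field that appears in the talk's own query.

```python
from dataclasses import dataclass


@dataclass
class Context:
    locale: str      # hypothetical: selects the index and dictionaries
    region_id: int   # hypothetical: used as a scoring signal


def build_search_request(query: str, ctx: Context) -> dict:
    """Combine the user's explicit query with server-side context."""
    return {
        "index": f"recipes_{ctx.locale}",
        "body": {
            "query": {
                "function_score": {
                    "query": {"match": {"title": query}},
                    "functions": [
                        # Recipes from the user's region get a score bonus.
                        {"filter": {"term": {"region_id": ctx.region_id}}, "weight": 1000}
                    ],
                    "boost_mode": "sum",
                }
            }
        },
    }


request = build_search_request("desayuno", Context(locale="es", region_id=1))
print(request["index"])
```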

Slide 14

Slide 14 text

"Desayuno" ! (Breakfast ") { "_source": false, "query": { "function_score": { "query": { "bool": { "must": [ { "term": { "forbidden": { "value": false, "boost": 0 } } }, { "term": { "approved": { "value": true, "boost": 0 } } }, { "function_score": { "boost_mode": "replace", "score_mode": "sum", "functions": [ { "filter": { "span_or": { "clauses": [ { "span_term": { "title.raw": "desayuno" } }, { "span_term": { "title.raw": "desayunos" } } ] } }, "weight": 300 }, { "filter": { "simple_query_string": { "fields": [ "title", "dish_name" ], "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")", "default_operator": "AND" } }, "weight": 100 } ], "query": { "bool": { "must": [ { "simple_query_string": { "fields": [ "title", "dish_name" ], "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | 
\"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")", "default_operator": "AND" } } ], "must_not": [ { "simple_query_string": { "fields": [ "title", "ingredients", "dish_name" ], "query": "(flan | pollo | salsa | tarta | helado | pastel | postre | salmón | ensalada | hamburguesa)", "default_operator": "AND" } } ] } } } } ] } }, "functions": [ { "filter": { "term": { "with_image": true } }, "weight": 100 }, { "exp": { "published_at": { "origin": "2018-02-12T06:06:42.266Z", "scale": "30d", "decay": 0.99 } }, "weight": 1 }, { "filter": { "term": { "region_id": 1 } }, "weight": 1000.0 } ], "score_mode": "sum", "boost_mode": "sum" } } } ” It's the actual Elasticsearch query. How come does it become so long? 14

Slide 15

Slide 15 text

Do you understand what "Desayuno" means?
It doesn't mean "the user wants to see recipes that have Desayuno in the title"
=> Query expansion
- Desayuno
- tostada, avena, galleta, crêpe, churros con chocolate, ...
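As a rough sketch of query expansion, the helper below turns a list of related dishes into the `(a | b | "c d")` syntax that the `simple_query_string` clauses in the Desayuno query use. The `EXPANSIONS` dictionary and both function names are hypothetical; the real dictionary is maintained elsewhere and is much larger.

```python
def to_query_string(terms):
    """Join expansion terms into simple_query_string syntax: (a | b | "c d")."""
    parts = []
    for term in terms:
        # Multi-word terms are quoted so they match as phrases.
        parts.append(f'"{term}"' if " " in term else term)
    return "(" + " | ".join(parts) + ")"


# Hypothetical dictionary entry: concept -> dishes that belong to it.
EXPANSIONS = {
    "desayuno": ["tostada", "avena", "galleta", "crêpe", "churros con chocolate"],
}


def expand(query):
    """Return an expanded clause if the query is a known concept, else a plain match."""
    terms = EXPANSIONS.get(query.lower())
    if terms is None:
        return {"match": {"title": query}}
    return {
        "simple_query_string": {
            "fields": ["title", "dish_name"],
            "query": to_query_string(terms),
            "default_operator": "AND",
        }
    }


print(expand("Desayuno")["simple_query_string"]["query"])
```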

Slide 16

Slide 16 text

Challenge
- Knowledge of food culture is required
- Knowledge of language/grammar is required
e.g. Spanish breakfast vs. Japanese breakfast

Slide 17

Slide 17 text

"Query Preprocess" 4 Cleaning 4 Tokenization 4 Normalization 4 Removing stopwords 4 ... 17

Slide 18

Slide 18 text

Cleaning
- Force encode Latin-1 to UTF-8 ('tortas de cumpleaÃ±os' => 'tortas de cumpleaños')
- "の" in Taiwan

mysql> select title from recipes where title like "%の%" and provider_id = 12 order by id desc limit 3;
+-----------------------------------------------------------+
| title                                                     |
+-----------------------------------------------------------+
| ຑ᫤∍໦ࣖྋ፩ࢁᠴਯɽᐬᇯᇯの২෺ԏᜰ |
| ؔ੢෩のᆹتᗑ |
| ὑѼのਗ਼ྋҰՆ ♥ ྉཧ ✿ ᫊ᱷ༻ిುࣽ“→౾౬” |
+-----------------------------------------------------------+
3 rows in set (0.24 sec)
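The Latin-1 cleaning step can be sketched with the standard encode/decode round trip: UTF-8 bytes that were wrongly read as Latin-1 are turned back into bytes and decoded as UTF-8. The function name is hypothetical, and falling back to the original text on failure is an assumption about the desired behaviour.

```python
def fix_mojibake(text):
    """Repair UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (e.g. Japanese),
        # or its bytes are not valid UTF-8: assume it was already correct.
        return text


# Simulate the bug: UTF-8 bytes read back as Latin-1.
broken = "tortas de cumpleaños".encode("utf-8").decode("latin-1")
print(broken)
print(fix_mojibake(broken))
```

Text that is already clean passes through unchanged, because a lone Latin-1 `ñ` byte is not valid UTF-8 and triggers the fallback.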

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Diacritical Marks in Vietnamese
https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%88%E3%83%8A%E3%83%A0%E8%AA%9E
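One common way to handle queries typed without diacritical marks is to fold them away on both the query and the index side. The sketch below uses Unicode NFD decomposition from the standard library. Whether folding is appropriate for Vietnamese is an assumption here, since the marks distinguish different words, and the function name is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ" has no canonical decomposition, so it is mapped explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


print(strip_diacritics("phở bò"))
print(strip_diacritics("bánh mì"))
```

A softer design would index both the original and the folded form, so exact matches can still be ranked higher.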

Slide 21

Slide 21 text


Slide 22

Slide 22 text

Tokenization
1. Simple Tokenizer
2. Ngram Tokenizer
3. Language Dependent Tokenizer

Slide 23

Slide 23 text

1. Simple Tokenizer
- "Cheese Cake" => "Cheese", "Cake"
- "Water/Salt" => "Water", "Salt"
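A simple tokenizer of this kind can be approximated with one regular expression that keeps runs of word characters and drops everything else. This is a sketch, not the exact tokenizer configured in the search engine.

```python
import re


def simple_tokenize(text):
    """Split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"\w+", text)


print(simple_tokenize("Cheese Cake"))
print(simple_tokenize("Water/Salt"))
```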

Slide 24

Slide 24 text

2. Ngram Tokenizer

Slide 25

Slide 25 text

2. Ngram Tokenizer
"Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft"
ドナウ汽船電気事業本工場工事部門下級官吏組合
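For long compounds with no spaces, character n-grams let a short query match inside a longer word. A minimal sketch follows; the choice of n = 3 and the example words are assumptions.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


indexed = set(char_ngrams("cheesecake", 3))
query = char_ngrams("cake", 3)

print(query)
# Every trigram of the query also occurs in the indexed compound.
print(all(gram in indexed for gram in query))
```

The trade-off is a larger index and more false positives, since unrelated words can share n-grams.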

Slide 26

Slide 26 text


Slide 27

Slide 27 text

3. Language Dependent Tokenizer
- "私は人参を1本だけ買った" (I bought just one carrot)
- "私", "は", "人参", "を", "1", "本", "だけ", "買っ", "た"
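Real language-dependent tokenizers rely on morphological analysis with large dictionaries. As a toy illustration only, the sketch below uses greedy longest-match against a tiny hand-written vocabulary (an assumption made up for this example) to show why a dictionary is needed when the text has no spaces.

```python
def longest_match_tokenize(text, vocabulary):
    """Greedy forward longest-match; unknown characters become single tokens."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary, just large enough for the example sentence.
vocabulary = {"私", "は", "人参", "を", "本", "だけ", "買っ", "た"}
tokens = longest_match_tokenize("私は人参を1本だけ買った", vocabulary)
print(tokens)
```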

Slide 28

Slide 28 text

Negation Handling
- en: "without", "no", "with no"
- es: "sin"
- id: "tanpa", "no"
- vi: "không cần dùng", "không cần", "không dùng", "không"
- ar: "دون", "بدون", "من دون", "من غير", "بلا"
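A minimal sketch of negation handling: split the query at the first negation marker found, so the part after it can be sent to a `must_not` clause. The marker lists are copied from the slide (Arabic is left out of this sketch); the function name and the "split at one marker" rule are assumptions.

```python
NEGATION_MARKERS = {
    "en": ["without", "no", "with no"],
    "es": ["sin"],
    "id": ["tanpa", "no"],
    "vi": ["không cần dùng", "không cần", "không dùng", "không"],
}


def split_negation(query, lang):
    """Return (wanted, unwanted) parts of a query for the given language."""
    # Longer markers are tried first so "with no" wins over "no".
    markers = sorted(NEGATION_MARKERS.get(lang, []), key=len, reverse=True)
    padded = f" {query.lower()} "
    for marker in markers:
        needle = f" {marker} "
        idx = padded.find(needle)
        if idx != -1:
            wanted = padded[:idx].strip()
            unwanted = padded[idx + len(needle):].strip()
            return wanted, unwanted
    return query.lower().strip(), ""


print(split_negation("pasta sin gluten", "es"))
print(split_negation("cake with no eggs", "en"))
```

A rule this simple misreads queries such as "no bake cake", where the negation word is part of the dish name, so a dictionary of exceptions would be needed in practice.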

Slide 29

Slide 29 text

Normalization: Misspellings
Which is the correct spelling?
1. brocoli
2. brocolli
3. broccoli
4. broccolli
5. brroccolli
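One simple baseline for mapping such variants to a known word is edit distance against a vocabulary. The sketch below implements Levenshtein distance directly; the vocabulary is hypothetical, and this is not the Hunspell or autoencoder approach described on the following slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def closest(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance."""
    return min(vocabulary, key=lambda known: edit_distance(word, known))


vocabulary = ["broccoli", "brioche", "brownie"]  # hypothetical
for variant in ["brocoli", "brocolli", "broccolli", "brroccolli"]:
    print(variant, "=>", closest(variant, vocabulary))
```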

Slide 30

Slide 30 text

Normalization: Spelling Correction
Affix File + Custom Dictionary File
TRY esiaénrtolcdugmfphbyvkw-'.zqjxSNRTLCGDMFPHBEAUYOIVKWóöâôZQJXÅçèîêàïüäñ
NOSUGGEST !
COMPOUNDMIN 1
ONLYINCOMPOUND _
COMPOUNDRULE 2
COMPOUNDRULE #*0{
COMPOUNDRULE #*@}
WORDCHARS 0123456789’
REP 27
REP f ph
REP ph f
PFX A Y 2
PFX A 0 re [^e]
...

Slide 31

Slide 31 text

Normalization: Spelling Correction
Affix File + Custom Dictionary File
Aachen/M
aardvark/MS
aardwolf
aardwolves
aargh
Aarhus/M
Aaron/M
Aaronvitch/M
ab
Ababa/M
aback
abacus/SM
...

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Normalization: Spelling Correction
Autoencoder: Query + noise that simulates human misspellings
☑ protyein => protein
☑ chorrilltana => chorrillana
☑ obusuuma => obusuma
☑ noirdic => nordic
☑ midwlestern => midwestern
☑ fklower => flower
☑ gtoast => toast
☑ byat => bat
☑ yoogurt => yogurt
☑ lgrill => grill
☑ romsanian => romanian
☑ bulgaroia => bulgaria
☑ bzarbecue => barbecue
☑ krombu => kombu
☑ wwine => wine
☑ icelanldic => icelandic
...
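The training data for such a model pairs a noisy query with its clean form. The examples on the slide all look like a single inserted character, so the sketch below generates that kind of noise. The function names, the lowercase ASCII alphabet, and the number of pairs per word are assumptions; other noise types (deletion, substitution, transposition) would be added the same way.

```python
import random
import string


def insert_noise(word, rng):
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    letter = rng.choice(string.ascii_lowercase)
    return word[:pos] + letter + word[pos:]


def training_pairs(words, rng, per_word=3):
    """Build (noisy input, clean target) pairs for a denoising model."""
    return [(insert_noise(word, rng), word) for word in words for _ in range(per_word)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = training_pairs(["protein", "yogurt", "kombu"], rng)
for noisy, clean in pairs:
    print(noisy, "=>", clean)
```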

Slide 34

Slide 34 text

Normalization
"cheesecakes" => "cheesecake"
"cheescake" => "cheesecake"
"tarta de queso" => "cheesecake"
"tarta de quesos" => "cheesecake"
"tartas de queso" => "cheesecake"
"tartas de quesos" => "cheesecake"

Slide 35

Slide 35 text


Slide 36

Slide 36 text

3. Dictionary = Stored Domain Knowledge
- Categories (Cuisine, Event, Occasion, ...)
- Types (Dish, Ingredient, Tool, ...)
- Parent-child relationships
- Synonyms
- Negatives

Slide 37

Slide 37 text

Challenge
Dictionaries are prepared not only by language but also by region

Slide 38

Slide 38 text

Synonyms
e.g. weenie

Slide 39

Slide 39 text

Knowledge... depends on the individual
- What if the dictionary maintainer is good at cooking seafood but not good at making sweets?

Slide 40

Slide 40 text

Ontology (Knowledge Graph)

Slide 41

Slide 41 text

Synonyms
Dictionary Tool => SPARQL => Ontology
SELECT ?redirectsTo WHERE {
  FILTER (?uri = dbr:#{word})
  ?redirectsTo dbo:wikiPageRedirects ?uri .
}
=> Chili, Chili Pepper, Red Chili, Hot Pepper, ...
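The `#{word}` placeholder in the query above is filled in by the dictionary tool. A hedged sketch of that templating step follows. It writes the resource as a full DBpedia IRI instead of the `dbr:` prefix, which asks for the same redirects; the function name is hypothetical, and words containing characters that need IRI escaping are not handled.

```python
def redirects_query(word):
    """Build a SPARQL query for pages that redirect to the given DBpedia resource."""
    # DBpedia resource names use underscores instead of spaces.
    resource = word.strip().replace(" ", "_")
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?redirectsTo WHERE {\n"
        f"  ?redirectsTo dbo:wikiPageRedirects <http://dbpedia.org/resource/{resource}> .\n"
        "}"
    )


print(redirects_query("Chili pepper"))
```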

Slide 42

Slide 42 text

Synonyms
Needs to be specialized in cooking

issue                   word      example of retrieved result
irrelevant to cooking   dinner    the dinner (cinema)
not useful for search   mexican   united states of mexico
child included 1        chicken   chicken leg
child included 2        donburi   ikuradon
a bit (?) different     donburi   tenshin-han

Slide 43

Slide 43 text

Misspellin... Synonyms?
siomay, somay, sio may, shomay, shumai, shumay, syomay, siomai, ...

Slide 44

Slide 44 text

Synonyms
How about word embeddings?
"tortilla" is similar to...
('tortillas', 0.9796085953712463)
('enchiladas', 0.977012574672699)
('enchilada', 0.9445205926895142)
('taco', 0.9363764524459839)
('pepperoni', 0.9355688095092773)
('doritos', 0.9267387390136719)
('burritos', 0.9246858954429626)
('jack', 0.9222062230110168)
('burrito', 0.919107973575592)
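Scores like the ones above are typically cosine similarities between word vectors. The sketch below computes them for a handful of made-up three-dimensional vectors (pure assumptions, real embeddings have hundreds of dimensions) to show how "similar" words are ranked.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors.
vectors = {
    "tortilla": [0.9, 0.1, 0.0],
    "tortillas": [0.88, 0.12, 0.01],
    "taco": [0.6, 0.5, 0.1],
    "pepperoni": [0.3, 0.8, 0.2],
}
ranked = most_similar("tortilla", vectors)
print(ranked)
```

Embeddings capture relatedness, so neighbours such as "taco" or "pepperoni" rank highly without being synonyms, and the list still needs filtering before it can feed a synonym dictionary.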

Slide 45

Slide 45 text

Negatives
e.g. $\text{Picnic Shoulder} \notin \text{Picnic}$

Slide 46

Slide 46 text

4. Scoring

Slide 47

Slide 47 text

What users want to see varies by region
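The outer `functions` block of the Desayuno query shown earlier combines three signals with `score_mode: sum`: an image bonus (weight 100), a recency decay (weight 1), and a region bonus (weight 1000). The sketch below reproduces that arithmetic. The decay follows Elasticsearch's documented `exp` formula with an offset of 0; the helper names are hypothetical, and the text-match score that `boost_mode: sum` adds on top is left out.

```python
import math


def exp_decay(age_days, scale_days=30.0, decay=0.99):
    """Elasticsearch-style exponential decay: equals `decay` at `scale_days`."""
    lam = math.log(decay) / scale_days
    return math.exp(lam * max(0.0, age_days))


def context_score(has_image, age_days, same_region):
    """Sum of the three weighted functions from the outer function_score."""
    score = 0.0
    if has_image:
        score += 100.0
    score += 1.0 * exp_decay(age_days)
    if same_region:
        score += 1000.0
    return score


print(exp_decay(30))                   # close to 0.99 by definition of scale/decay
print(context_score(True, 0, True))    # fresh local recipe with an image
print(context_score(True, 365, False)) # year-old recipe from another region
```

With these weights the region bonus dominates, the image bonus comes second, and recency (at most 1 point) appears to act mainly as a tie-breaker.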

Slide 48

Slide 48 text

Recommendation
- Provide recipes without an explicit query (Query + Context)
- What the user tends to like/cook
- Taste preferences
- Available ingredients
- For what? Breakfast? Lunch? Dinner? Party?

Slide 49

Slide 49 text

Each market has different features
- Number of recipes
  - Mature Markets >>>>> New Markets
- Priority
- Available data/logs
- Quality of product

Slide 50

Slide 50 text

Understand Users ❤ Recipes

Slide 51

Slide 51 text

A lot of challenges
- Knowledge of Food
- Locality (languages, traditions, religions, …)
- Linguistic Issues
- How to build/store knowledge
- We ❤ Machine Learning

Slide 52

Slide 52 text

We are hiring!
- https://info.cookpad.com/en/careers
- https://info.cookpad.com/careers