
Challenges in Search in Global Services

rejasupotaro
February 18, 2018


Transcript

  1. There's no way you wouldn't enjoy a search system that supports 20+ languages
    Kentaro Takiguchi / @rejasupotaro

  2. Share challenges in search

  3. I worked in...
    - US: 6 months
    - Spain: 6 months
    - Indonesia: 3 months
    - UAE: 2 weeks
    - Vietnam, Taiwan, Germany, Italy: 1 week
    - UK: 1 year and 3 months (now)

  4. Experienced different cultures in different countries

  5. https://sourcediving.com/localization-at-cookpad-aa4e11498564

  6. Discovery Team
    Providing users with the right content at the right time
    - Search
    - Recommendation
    The difference is whether an explicit query is given

  7. (no text on this slide)

  8. Search
    1. Index Construction
    2. Query Understanding
    3. Dictionary
    4. Scoring

  9. 1. Index Construction
    - Elasticsearch = DB in search
    - => "Focusing on search" ≈ "Focusing on index"

  10. Index = Schema + Preprocessing
    - Schema: title, country, cuisine, ingredients, ...
    - Preprocessing:
    https://www.codeschool.com/blog/2016/03/25/machine-learning-working-with-stop-words-stemming-and-spam/

  11. Challenge
    The number of indices = Types * Languages = n * m
    n = the number of types
    m = the number of supported languages
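
To make the multiplication concrete, here is a minimal Python sketch that enumerates one index per (type, language) pair. The type names, locale codes, and the `type_lang` naming scheme are hypothetical and only illustrate how quickly n * m grows; they are not Cookpad's actual layout.

```python
from itertools import product

# Hypothetical document types and locales, chosen only for illustration.
TYPES = ["recipes", "users"]
LANGUAGES = ["en", "es", "id", "vi", "ar"]


def index_names(types, languages):
    """Return one index name per (type, language) pair."""
    return [f"{t}_{lang}" for t, lang in product(types, languages)]


names = index_names(TYPES, LANGUAGES)
print(len(names))  # n * m = 2 * 5 = 10
```

Each added language costs n new indices, and each added type costs m, which is why schema and preprocessing changes become expensive to roll out.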

  12. 2. Query Understanding
    What is a query?
    - Text
    - Voice
    - Image
    - ...

  13. Intention = Query (+ Hidden Context)
    [User] => (Query) => [Server] => (Query + Context) => [Search Engine]
    - Locale (=> Index, Dictionary, ...)
    - Location (=> Scoring, ...)
    - Synonyms (=> Dictionary, Query, ...)
    - Misspellings (=> Language Model, Query, ...)

  14. "Desayuno" (Spanish: breakfast)
    {
      "_source": false,
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": [
                {
                  "term": {
                    "forbidden": {
                      "value": false,
                      "boost": 0
                    }
                  }
                },
                {
                  "term": {
                    "approved": {
                      "value": true,
                      "boost": 0
                    }
                  }
                },
                {
                  "function_score": {
                    "boost_mode": "replace",
                    "score_mode": "sum",
                    "functions": [
                      {
                        "filter": {
                          "span_or": {
                            "clauses": [
                              {
                                "span_term": {
                                  "title.raw": "desayuno"
                                }
                              },
                              {
                                "span_term": {
                                  "title.raw": "desayunos"
                                }
                              }
                            ]
                          }
                        },
                        "weight": 300
                      },
                      {
                        "filter": {
                          "simple_query_string": {
                            "fields": [
                              "title",
                              "dish_name"
                            ],
                            "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                            "default_operator": "AND"
                          }
                        },
                        "weight": 100
                      }
                    ],
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "simple_query_string": {
                              "fields": [
                                "title",
                                "dish_name"
                              ],
                              "query": "(crep | avena | crepa | crepe | crêpe | donut | fakas | isler | lassi | mufin | yogur | churro | crepes | donuts | mafins | muffin | yogurd | yogurt | brioche | cookies | curasan | galleta | granola | maffins | muffins | mugcake | pascade | smothie | tostada | yoghurt | yogourt | yogurth | bollería | croisant | croissan | flapjack | larpeira | \"mug cake\" | porridge | smoothie | croissant | ensaimada | fritatten | frittaten | smoothies | tejeringo | frittatten | shortbread | gingerbread | \"short bread\" | \"batido verde\" | \"pan de leche\" | \"pastas de té\" | pricomigdale | \"bizcocho taza\" | \"medias noches\" | \"masa para creps\" | \"bizcocho en taza\" | \"pastas navideñas\" | \"bizcocho a la taza\" | \"hombre de jengibre\" | \"muñeco de jengibre\" | \"decoraditas vintahe\")",
                              "default_operator": "AND"
                            }
                          }
                        ],
                        "must_not": [
                          {
                            "simple_query_string": {
                              "fields": [
                                "title",
                                "ingredients",
                                "dish_name"
                              ],
                              "query": "(flan | pollo | salsa | tarta | helado | pastel | postre | salmón | ensalada | hamburguesa)",
                              "default_operator": "AND"
                            }
                          }
                        ]
                      }
                    }
                  }
                }
              ]
            }
          },
          "functions": [
            {
              "filter": {
                "term": {
                  "with_image": true
                }
              },
              "weight": 100
            },
            {
              "exp": {
                "published_at": {
                  "origin": "2018-02-12T06:06:42.266Z",
                  "scale": "30d",
                  "decay": 0.99
                }
              },
              "weight": 1
            },
            {
              "filter": {
                "term": {
                  "region_id": 1
                }
              },
              "weight": 1000.0
            }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      }
    }
    This is the actual Elasticsearch query. How did it become so long?
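
One part of the query that is easy to misread is the `exp` decay on `published_at`. As a rough sketch, Elasticsearch documents the `exp` decay as exp(λ · distance) with λ = ln(decay) / scale; the helper below assumes the default offset of 0, and its name and day-based units are made up for illustration.

```python
import math


def exp_decay(distance_days, scale_days, decay):
    """Exponential decay as documented for Elasticsearch's `exp` function.

    Assumes offset = 0: score = exp(lambda * distance),
    where lambda = ln(decay) / scale.
    """
    lam = math.log(decay) / scale_days
    return math.exp(lam * max(0.0, distance_days))


# A recipe published exactly one `scale` (30 days) away from the origin
# keeps `decay` (0.99) of this function's score.
print(exp_decay(30, 30, 0.99))
```

With `decay` at 0.99 and `scale` at 30d, recency changes the score only slightly, so it acts as a gentle tie-breaker next to the weights of 100 to 1000 used by the other functions in the query.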

  15. Do you understand what "Desayuno" is?
    It doesn't mean "the user wants to see recipes that have Desayuno in the title" => Query expansion
    - Desayuno
    - tostada, avena, galleta, crêpe, churros con chocolate, ...
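
A minimal sketch of query expansion, assuming a hand-maintained dictionary that maps a concept to dish names. The `EXPANSIONS` table and both function names are hypothetical; the output follows the `simple_query_string` syntax seen in the query above, where `|` means OR and double quotes mark a phrase.

```python
# Hypothetical expansion dictionary; the terms are the ones listed on the slide.
EXPANSIONS = {
    "desayuno": ["tostada", "avena", "galleta", "crêpe", "churros con chocolate"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they match as phrases."""
    return f'"{term}"' if " " in term else term


def expand(query):
    """Return a simple_query_string expression for the query's expansions."""
    terms = EXPANSIONS.get(query.strip().lower())
    if not terms:
        return quote(query.strip())
    return "(" + " | ".join(quote(t) for t in terms) + ")"


print(expand("Desayuno"))
```

The hard part is not this lookup but filling the dictionary, since what counts as breakfast differs by country.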

  16. Challenge
    - Knowledge of food culture is required
    - Knowledge of language/grammar is required
    e.g. Spanish breakfast, Japanese breakfast

  17. "Query Preprocessing"
    - Cleaning
    - Tokenization
    - Normalization
    - Removing stopwords
    - ...

  18. Cleaning
    - Force encode Latin-1 to UTF-8 ('tortas de cumpleaÃ±os' => 'tortas de cumpleaños')
    - "の" in Taiwan
    mysql> select title from recipes where title like "%の%" and provider_id = 12 order by id desc limit 3;
    +-----------------------------------------------------------+
    | title |
    +-----------------------------------------------------------+
    | ຑ᫤∍໦ࣖྋ፩ࢁᠴਯɽᐬᇯᇯの২෺ԏᜰ |
    | ؔ੢෩のᆹتᗑ |
    | ὑѼのਗ਼ྋҰՆ ♥ ྉཧ ✿ ᫊ᱷ༻ిುࣽ“→౾౬” |
    +-----------------------------------------------------------+
    3 rows in set (0.24 sec)
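
The Latin-1 cleaning step can be sketched in a few lines of Python. This assumes the garbling came from UTF-8 bytes that were decoded as Latin-1 somewhere upstream, which is the usual cause of text like "Ã±"; the function name is made up for illustration.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    If the text cannot be round-tripped (it was already correct, or it
    contains characters outside Latin-1), return it unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("tortas de cumpleaÃ±os"))
```

Falling back to the input on any error keeps already-clean queries, and queries in scripts that Latin-1 cannot represent, untouched.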

  19. (no text on this slide)

  20. Diacritical Marks in Vietnamese
    https://ja.wikipedia.org/wiki/%E3%83%99%E3%83%88%E3%83%8A%E3%83%A0%E8%AA%9E
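
One common way to make matching tolerant of missing marks is to fold them away at index and query time. The sketch below is an assumption about how that could be done with Python's standard `unicodedata` module; the talk does not say whether Cookpad folds Vietnamese diacritics, and doing so loses meaning, because the marks distinguish different words.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD).

    The Vietnamese letter "đ" has no decomposition, so it is mapped by hand.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("đ", "d").replace("Đ", "D")


print(strip_diacritics("phở bò"))
```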

  21. (no text on this slide)

  22. Tokenization
    1. Simple Tokenizer
    2. Ngram Tokenizer
    3. Language Dependent Tokenizer

  23. 1. Simple Tokenizer
    - "Cheese Cake" => "Cheese", "Cake"
    - "Water/Salt" => "Water", "Salt"
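
A simple tokenizer of this kind can be approximated with one regular expression. This is a sketch, not the analyzer used in the talk: it treats any run of non-word characters (spaces, slashes, punctuation) as a separator.

```python
import re


def simple_tokenize(text):
    """Return the runs of word characters, splitting on everything else."""
    return re.findall(r"\w+", text)


print(simple_tokenize("Cheese Cake"))
print(simple_tokenize("Water/Salt"))
```

This works for languages that put separators between words; text written without spaces comes back as a single token, which is what motivates the other two tokenizers.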

  24. 2. Ngram Tokenizer

  25. 2. Ngram Tokenizer
    "Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft"
    ドナウ汽船電気事業本工場工事部門下級官吏組合
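
For compound words and scripts without spaces, character n-grams give the index something to match on without any language knowledge. A minimal sketch follows; the function name and the 2 to 3 character window are arbitrary choices for illustration, and the emission order is not meant to mirror any particular search engine.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


print(char_ngrams("cake"))
```

The trade-off is index size and precision: every substring becomes a term, so unrelated words that share fragments will also match.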

  26. (no text on this slide)

  27. 3. Language Dependent Tokenizer
    - "私は人参を1本だけ買った"
    - "私", "は", "人参", "を", "1", "本", "だけ", "買っ", "た"

  28. Negation Handling
    - en: "without", "no", "with no"
    - es: "sin"
    - id: "tanpa", "no"
    - vi: "không cần dùng", "không cần", "không dùng", "không"
    - ar: "دون", "بدون", "من دون", "من غير", "بلا"
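
A minimal sketch of how such per-locale markers could be used to split a query into a wanted part and an unwanted part. The marker lists are copied from the slide (Arabic is left out for brevity); the function name and the split-at-marker rule are assumptions, not the talk's implementation.

```python
# Negation markers per locale, copied from the slide (Arabic omitted).
NEGATIONS = {
    "en": ["without", "no", "with no"],
    "es": ["sin"],
    "id": ["tanpa", "no"],
    "vi": ["không cần dùng", "không cần", "không dùng", "không"],
}


def split_negation(query, locale):
    """Return (wanted, unwanted) by cutting the query at a negation marker."""
    padded = f" {' '.join(query.lower().split())} "
    # Try longer markers first so "không cần dùng" wins over "không".
    for marker in sorted(NEGATIONS.get(locale, []), key=len, reverse=True):
        needle = f" {marker} "
        pos = padded.find(needle)
        if pos != -1:
            return padded[:pos].strip(), padded[pos + len(needle):].strip()
    return padded.strip(), ""


print(split_negation("pancakes without eggs", "en"))
print(split_negation("bánh không cần dùng lò", "vi"))
```

The unwanted part would then feed a `must_not` clause, while the wanted part goes through the normal query pipeline.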

  29. Normalization: Misspellings
    Which is the correct spelling?
    1. brocoli
    2. brocolli
    3. broccoli
    4. broccolli
    5. brroccolli

  30. Normalization: Spelling Correction
    Affix File + Custom Dictionary File for English
    TRY esiaénrtolcdugmfphbyvkw-'.zqjxSNRTLCGDMFPHBEAUYOIVKWóöâôZQJXÅçèîêàïüäñ
    NOSUGGEST !
    COMPOUNDMIN 1
    ONLYINCOMPOUND _
    COMPOUNDRULE 2
    COMPOUNDRULE #*0{
    COMPOUNDRULE #*@}
    WORDCHARS 0123456789’
    REP 27
    REP f ph
    REP ph f
    PFX A Y 2
    PFX A 0 re [^e]
    ...

  31. Normalization: Spelling Correction
    Affix File + Custom Dictionary File for English
    Aachen/M
    aardvark/MS
    aardwolf
    aardwolves
    aargh
    Aarhus/M
    Aaron/M
    Aaronvitch/M
    ab
    Ababa/M
    aback
    abacus/SM
    ...

  32. (no text on this slide)

  33. Normalization: Spelling Correction
    Autoencoder: Query + Noise which simulates human misspellings
    ☑ protyein => protein
    ☑ chorrilltana => chorrillana
    ☑ obusuuma => obusuma
    ☑ noirdic => nordic
    ☑ midwlestern => midwestern
    ☑ fklower => flower
    ☑ gtoast => toast
    ☑ byat => bat
    ☑ yoogurt => yogurt
    ☑ lgrill => grill
    ☑ romsanian => romanian
    ☑ bulgaroia => bulgaria
    ☑ bzarbecue => barbecue
    ☑ krombu => kombu
    ☑ wwine => wine
    ☑ icelanldic => icelandic
    ...
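
Training such a model needs (misspelled, correct) pairs. The sketch below shows one way to synthesize them by inserting a single random letter, which is the kind of error visible in the examples above. The actual noise model used in the talk is not given, so the function names and the insertion-only noise are assumptions.

```python
import random
import string


def add_insertion_noise(word, rng):
    """Insert one random lowercase letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    return word[:pos] + rng.choice(string.ascii_lowercase) + word[pos:]


def make_pairs(words, seed=0):
    """Return (noisy, clean) training pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [(add_insertion_noise(w, rng), w) for w in words]


pairs = make_pairs(["protein", "yogurt", "toast"])
for noisy, clean in pairs:
    print(noisy, "=>", clean)
```

A fuller noise model would also delete, swap, and substitute characters, ideally weighted by keyboard layout for each language.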

  34. Normalization
    "cheesecakes" => "cheesecake"
    "cheescake" => "cheesecake"
    "tarta de queso" => "cheesecake"
    "tarta de quesos" => "cheesecake"
    "tartas de queso" => "cheesecake"
    "tartas de quesos" => "cheesecake"

  35. (no text on this slide)

  36. 3. Dictionary
    = Stored Domain Knowledge
    - Categories (Cuisine, Event, Occasion, ...)
    - Types (Dish, Ingredient, Tool, ...)
    - Parent-child relationships
    - Synonyms
    - Negatives

  37. Challenge
    Dictionaries are prepared not only per language but also per region

  38. Synonyms
    e.g. weenie

  39. Knowledge... depends on the individual
    - What if the dictionary maintainer is good at cooking seafood but not good at making sweets?

  40. Ontology (Knowledge Graph)

  41. Synonyms
    Dictionary Tool => SPARQL => Ontology
    SELECT ?redirectsTo
    WHERE {
      FILTER (?uri = dbr:#{word})
      ?redirectsTo dbo:wikiPageRedirects ?uri .
    }
    => Chili, Chili Pepper, Red Chili, Hot Pepper, ...

  42. Synonyms
    Needs to be specialized for cooking
    issue                 | word    | example of retrieved result
    irrelevant to cooking | dinner  | the dinner (cinema)
    not useful for search | mexican | united states of mexico
    child included 1      | chicken | chicken leg
    child included 2      | donburi | ikuradon
    a bit (?) different   | donburi | tenshin-han

  43. Misspellin... Synonyms?
    siomay, somay, sio may, shomay, shumai, shumay,
    syomay, siomai, ...

  44. Synonyms
    How about word embeddings?
    "tortilla" is similar to...
    ('tortillas', 0.9796085953712463)
    ('enchiladas', 0.977012574672699)
    ('enchilada', 0.9445205926895142)
    ('taco', 0.9363764524459839)
    ('pepperoni', 0.9355688095092773)
    ('doritos', 0.9267387390136719)
    ('burritos', 0.9246858954429626)
    ('jack', 0.9222062230110168)
    ('burrito', 0.919107973575592)
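
Scores like the ones above are typically cosine similarities between word vectors. The sketch below uses tiny made-up 3-dimensional vectors purely to show the computation; real embeddings are learned from a corpus and have hundreds of dimensions, and the talk does not say which model produced its numbers.

```python
import math

# Toy vectors invented for illustration only.
VECTORS = {
    "tortilla": [0.9, 0.1, 0.3],
    "tortillas": [0.88, 0.12, 0.31],
    "taco": [0.8, 0.3, 0.2],
    "flan": [0.1, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(word, k=2):
    """Return the k nearest other words by cosine similarity."""
    target = VECTORS[word]
    scored = [(w, cosine(target, vec)) for w, vec in VECTORS.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


print(most_similar("tortilla"))
```

As the list on the slide shows, nearest neighbours are related words (taco, burrito) rather than true synonyms, so embeddings alone cannot fill a synonym dictionary.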

  45. Negatives
    e.g. $Picnic\ Shoulder \notin Picnic$
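
A negative entry can be read as "this longer phrase must not count as a match for the shorter term". The sketch below applies that rule to plain title strings; the `NEGATIVES` table, the function name, and the substring matching are simplifications for illustration, and in Elasticsearch the same idea would be expressed with a `must_not` clause.

```python
# Hypothetical negatives dictionary: term -> phrases that must not match it.
NEGATIVES = {
    "picnic": ["picnic shoulder"],
}


def matches(query, title):
    """Return True if the title matches the query and none of its negatives."""
    q = query.lower()
    t = title.lower()
    if q not in t:
        return False
    return not any(neg in t for neg in NEGATIVES.get(q, []))


print(matches("picnic", "Picnic sandwiches"))
print(matches("picnic", "Roasted picnic shoulder"))
```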

  46. 4. Scoring

  47. What users want to see varies by region

  48. Recommendation
    - Provide recipes without an explicit query (Query + Context)
    - What the user tends to like/cook
    - Taste preferences
    - Available ingredients
    - For what? Breakfast? Lunch? Dinner? Party?

  49. Each market has different features
    - Number of recipes
    - Mature Markets >>>>> New Markets
    - Priority
    - Available data/logs
    - Quality of product

  50. Understand
    Users ❤ Recipes

  51. A lot of challenges
    - Knowledge of Food
    - Locality (languages, traditions, religions, …)
    - Linguistic Issues
    - How to build/store knowledge
    - We ❤ Machine Learning

  52. We are hiring!
    - https://info.cookpad.com/en/careers
    - https://info.cookpad.com/careers