Research and Development at Cookpad / HCG2020

j.harashima
December 16, 2020

Transcript

  1. Research and Development at Cookpad
    Jun Harashima, Cookpad Inc.

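  2. Jun Harashima
    Timeline
    - [year/month] Assigned to the Kurohashi Lab, Kyoto University
    - [year/month] Joined Cookpad
    - [year/month] Received a Ph.D. (Informatics)
    - [year/month] R&D division established
    Roles
    - Student (natural language processing, information retrieval)
    - Engineer (Ruby on Rails)
    - Manager (HR, PR, ...)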

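  3. Cookpad
    The world's largest cooking recipe service, where users can post and search for recipes online
    - Monthly users: over [?] × 100 million ([?] countries, [?] languages)
    - Premium members: about [?] × 10,000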

  4. Cookpad Mart

  5. たのしいキッチン不動産 (Tanoshii Kitchen Fudosan)

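  6. History
    - [year/month] Coin Ltd. (now Cookpad Inc.) founded
    - [year/month] Recipe posting and search service kitchen@coin launched
    - [year/month] Service renamed Cookpad
    - [year/month] Premium service launched
    - [year/month] Listed on the Mothers market
    - [year/month] Moved to the First Section of the Tokyo Stock Exchange
    - [year/month] Premium service membership passed [?] × 10,000
    - [year/month] Overseas expansion began in earnest
    - [year/month] R&D division launched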

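  7. Related services
    おまかせ整理 (Omakase Seiri)
    - Quickly find seasonal vegetable recipes and recipes that work as a main dish
    Alexa skill
    - Suggests popular recipes you can make right away using only the ingredients you want to use
    料理きろく (Ryori Kiroku)
    - Just take a photo to keep a record! Your cooking photos are organized automatically
    OiCy
    - Beyond the recipe: a delicious smart kitchen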

  8. Recipes are text plus images
    - Feedback is also text plus images
    - Media processing (especially NLP and CV) is important

  9. NLP

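  10. Recipe classification
    Classify recipes by type, cooking method, genre, ...
    - Type (e.g. meat dishes, fish dishes), cooking method (e.g. simmered dishes, grilled dishes), genre (e.g. Japanese, Western)
    - The classification results are used for things like narrowing down search results
    The models are SVMs and RFs
    - Performance depends on the tag, but is sufficient
    Training data
    - Recipes tagged by annotators and recipes tagged by users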

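  11. Ingredient name normalization
    Cookpad has several million distinct ingredient names
    - This is a problem when computing things like a recipe's calories
    Normalize with an Encoder-Decoder
    - Character-based stacked unidirectional LSTM ([?] layers)
    - Bidirectional LSTM and attention had no effect
    Figure: the encoder reads the characters ★, 醤, 油; the decoder takes EOW, し, ょ, う, ゆ as input and outputs し, ょ, う, ゆ, EOW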
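The character sequences in the normalization figure can be sketched as plain data preparation. This is only an illustration of how the three sequences line up, assuming the figure shows teacher forcing with a single EOW marker; the `<EOW>` token string and the function name are hypothetical, and no model is trained here.

```python
from typing import List, Tuple

EOW = "<EOW>"  # hypothetical spelling of the end-of-word marker


def make_training_pair(raw_name: str, normalized_name: str) -> Tuple[List[str], List[str], List[str]]:
    """Split an ingredient name pair into character-level sequences.

    Returns (encoder_input, decoder_input, decoder_output). The decoder
    input is the target shifted right by one position, starting from EOW.
    """
    encoder_input = list(raw_name)
    decoder_input = [EOW] + list(normalized_name)
    decoder_output = list(normalized_name) + [EOW]
    return encoder_input, decoder_input, decoder_output


enc, dec_in, dec_out = make_training_pair("★醤油", "しょうゆ")
print(enc, dec_in, dec_out)
```

A character-level LSTM would then be trained to predict each element of `dec_out` from `enc` and the corresponding prefix of `dec_in`.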

  12. Calorie Estimation (Harashima et al.)
    Example recipe: Asari Clam Rice
    Ingredient lines and the ingredients they are mapped to:
    - アサリ → asari clam
    - 米 → rice
    - 塩 → salt
    - 酒 → sake
    - しょうゆ → soy sauce
    - みりん → sweet sake
    Other labels in the figure: "Ingredient", "go"
    $$c(r) = \frac{\sum_{i \in I_r} c(i) \cdot q(i) / 100}{s(r)} = 306.6$$
    - We have estimated the number of calories in over [?] recipes and actually use them in our recipe service
    - Use the single-source model for serving estimation
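The per-serving calorie formula on this slide can be worked through in a few lines of Python. The slide does not define its symbols, so the reading here is an assumption: c(i) is taken as kcal per 100 g of ingredient i, q(i) as its quantity in grams, and s(r) as the number of servings. The numbers are made-up placeholders, not Cookpad data.

```python
from typing import List, Tuple


def calories_per_serving(ingredients: List[Tuple[float, float]], servings: float) -> float:
    """Compute c(r) = (sum over i of c(i) * q(i) / 100) / s(r).

    Each ingredient is a (kcal_per_100g, grams) pair; this interpretation
    of c(i) and q(i) is an assumption, not stated on the slide.
    """
    if servings <= 0:
        raise ValueError("servings must be positive")
    total = sum(kcal_per_100g * grams / 100.0 for kcal_per_100g, grams in ingredients)
    return total / servings


# Hypothetical values for illustration only.
example_ingredients = [(200.0, 300.0), (50.0, 100.0)]
print(calories_per_serving(example_ingredients, servings=2))
```

In practice the hard parts are the ones the slide names: mapping free-text ingredient names to entries with known c(i), converting quantities to grams, and estimating s(r).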

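  13. Ingredient name recommendation
    Embed the title entered by the user
    - Embed each word and take the average
    Retrieve recipes
    - Search Cookpad's existing recipes for recipes whose titles are similar in the embedding space
    Recommend ingredient names
    - Recommend the ingredient names shared by the retrieved recipes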
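The three steps on this slide (embed the title, retrieve similar recipes, recommend shared ingredients) can be sketched in pure Python. This is a toy sketch under stated assumptions: the two-dimensional word vectors, the recipes, cosine similarity as the distance, and counting ingredients over the top-k neighbours are all hypothetical choices, not details given in the talk.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Recipe = Tuple[Sequence[str], Sequence[str]]  # (title words, ingredient names)


def embed_title(words: Sequence[str], word_vectors: Dict[str, Vector]) -> Vector:
    """Average the vectors of the words that have an embedding."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no known words in title")
    return [sum(v[d] for v in known) / len(known) for d in range(len(known[0]))]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_ingredients(title_words: Sequence[str], recipes: List[Recipe],
                          word_vectors: Dict[str, Vector], k: int = 2, top_n: int = 2) -> List[str]:
    """Return the ingredients most common among the k nearest recipes."""
    query = embed_title(title_words, word_vectors)
    ranked = sorted(recipes,
                    key=lambda r: cosine(query, embed_title(r[0], word_vectors)),
                    reverse=True)
    counts: Counter = Counter()
    for _, ingredients in ranked[:k]:
        counts.update(list(dict.fromkeys(ingredients)))  # count each name once per recipe
    return [name for name, _ in counts.most_common(top_n)]


# Hypothetical toy data.
WORD_VECTORS = {"tomato": [1.0, 0.0], "soup": [0.0, 1.0], "stew": [0.1, 0.9], "cake": [-1.0, 0.0]}
RECIPES = [
    (["tomato", "stew"], ["tomato", "onion", "salt"]),
    (["tomato", "soup"], ["tomato", "onion", "pepper"]),
    (["cake"], ["flour", "sugar"]),
]
print(recommend_ingredients(["tomato", "soup"], RECIPES, WORD_VECTORS))
```

With a real vocabulary the linear scan over recipes would be replaced by an approximate nearest-neighbour index, but the flow stays the same.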

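  14. Step classification
    Classify each cooking step of a recipe as a True Step or a Fake Step
    - Tagged about [?] steps and trained an LSTM (accuracy: [?])
    - Used in services such as reading recipes aloud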

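  15. Recipe translation
    A parallel corpus (happened to) exist in-house
    - A relic of a past service
    - Japanese-English translations of about [?] dishes
    Tested NMT [Bahdanau et al.]
    - Result: translation good enough to turn into a service is difficult
    - Occasionally a reasonable translation is generated
      - Ja: 蓋またはアルミホイルで落とし蓋をして中火で蒸し焼きにします。
      - En: cover with a lid or aluminum foil, and steam the chicken on medium heat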

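  16. Feedback classification
    Example: "It turned out delicious!" → Positive
    Automatically classify feedback from users
    - Supports [?] tags (trained [?] SVMs)
    - Suggests high-probability tags, halving the staff's classification work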
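The "suggest high-probability tags" step can be sketched as a simple ranking over per-tag scores. The sketch assumes each per-tag classifier already outputs a probability; the tag names, the 0.5 threshold, and the cap of three suggestions are hypothetical, and the SVMs themselves are not shown.

```python
from typing import Dict, List


def suggest_tags(probabilities: Dict[str, float], threshold: float = 0.5, max_tags: int = 3) -> List[str]:
    """Return up to max_tags tags whose probability is at least threshold,
    ordered from most to least likely."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, p in ranked if p >= threshold][:max_tags]


# Hypothetical per-tag probabilities for one piece of feedback.
scores = {"positive": 0.92, "bug report": 0.10, "request": 0.55}
print(suggest_tags(scores))
```

Staff then confirm or correct the suggested tags instead of choosing from the full tag list.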

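  17. CV

  18. Food photo detection
    Detect food photos on the user's smartphone with deep learning
    - Example labels in the figure: food, non-food, non-food, non-food
    - Precision: [?], recall: [?]
    - 料理きろく (Ryori Kiroku) was used by [?] × 10,000 people within [?] year(s) of release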

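  19. Food photo detection
    ver. [?]
    - CaffeNet
    - Binary classification: food or non-food
    ver. [?]
    - Inception v[?]
    - Multi-class classification: food, plants, ..., or non-food
    ver. [?]
    - Inception v[?], patched classification
    - Binary classification: food or non-food
    Training data
    - Positive examples: images from Cookpad recipes
    - Negative examples: assorted license-free images
    Figure: recognition results of ver. [?]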

  20. Food photo detection
    https://twitter.com/ohmycorgi/status/867745923719364609
    https://twitter.com/teenybiscuit/status/707727863571582978
    A check mark indicates a photo that the model judged to be food

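  21. Non-food photo detection
    Detect non-food photos among the photos posted by users
    - Uses the same model as food photo detection
    - Notifies the user of the guidelines
    - Note: non-food photos often include family members or pets
    - Example labels in the figure: non-food, food, food, food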

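  22. Ingredient photo classification
    Classify ingredient photos into about [?] classes; to be used in service development (planned)
    Example: recipe titles shown for the class "tomato"
    - Beyond imagination? Three-color soup with tomato, lettuce, and egg
    - Easy tomato soup (minestrone)
    - New onion and tomato salad
    - Aim for deli style! Lettuce and tomato salad
    - Super simple! Japanese-style cold pasta with tomato and daikon
    - Refreshing marinade of tomato and new onion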

  23. Other topics

  24. Release of recipe data

  25. Usage
    - Before release (until [year/month]): [?] universities, [?] labs
    - After release ([year/month]): [?] universities, [?] labs
    Figure label: related topics

  26. The [?]th AI Challenge Contest
    Food detection track
    - Detect food regions in photos
    Food classification track
    - Classify food photos into [?] classes (e.g. somen, udon)

  27. JSAI Cup [?]
    Ingredient photo classification
    - [?] categories (e.g. onion)
    - [?] images in total

  28. WAT

  29. Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
    Jun Harashima and Makoto Hiramatsu (Cookpad Inc.)
    The 14th Linguistic Annotation Workshop

    Background
    - The number of cooking recipes on the Internet has grown
    - Recipe-related studies and datasets are also increasing
    - However, there are still few datasets that provide linguistic annotations for recipe-related studies, even though such annotations should form the basis of the studies

    Table 1. Existing recipe-related datasets and our corpus
    Name                                  Year  Main content
    CURD                                  2008  Machine-readable language representations
    Flow Graph Corpus                     2014  Graph representations and named entities
    SIMMR Recipe Dataset                  2015  Graph representations
    Cookpad Recipe Dataset                2016  Reviews and meals
    Cookpad Image Dataset                 2017  Food images and cooking images
    Recipe1M                              2017  Food images
    RecipeQA                              2018  Question-answer pairs
    Storyboarding Data                    2019  Cooking images
    r-FG BB dataset                       2019  Bounding boxes for cooking images
    English Recipe Flow Graph Corpus      2020  Graph representations and named entities
    Multimodal Aligned Recipe Corpus      2020  URLs to YouTube videos
    Multi-modal Recipe Structure dataset  2020  Graph representations and cooking images
    Cookpad Parsed Corpus                 2020  Linguistic annotations

    Table 2. Existing Japanese parsed corpora and our corpus
    Name                          Year  Target documents
    KU Text Corpus                2002  Newspaper articles
    GDA Corpus                    2005  Newspaper articles and dictionary entries
    NAIST Text Corpus             2007  Newspaper articles
    KU and NTT Blog Corpus        2011  Blogs
    KU Web Document Leads Corpus  2012  Web documents
    BCCWJ                         2014  Newspaper articles, books, magazines, etc.
    Cookpad Parsed Corpus         2020  Cooking recipes

    Contributions of this study
    - Construction of a novel corpus, which contains linguistic annotations of 500 Japanese recipes
    - Benchmark results on the corpus for Japanese morphological analysis (MA), named entity recognition (NER), and dependency parsing (DP)

    Cookpad Parsed Corpus
    - We randomly selected 500 recipes from the Cookpad Recipe Dataset
    - 4,738 sentences in the 500 recipes were annotated with morphemes, named entities, and dependency relations
    Morphemes
    - We decided the boundaries and part of speech of each morpheme based on the IPA dictionary, which is commonly used for MA
    Named entities
    - Morphemes were annotated with 17 tags such as Fi (food ingredient) and Sf (state of food) in IOB2 format
    Dependency relations
    - Bunsetsus were annotated with relations such as D (normal dependency) and P (coordination dependency)
    - A bunsetsu is a unit of Japanese that consists of one or more content words and zero or more function words
    - Bunsetsus were also annotated with 7 types such as (Topic)
    Other
    - Content in the Cookpad Recipe and Image Datasets, which include the same 500 recipes, can also be used

    Figure 1. Linguistic annotations for an example sentence (Cut the raw salmon into bite-size chunks and sprinkle them with salt.) in our corpus.
    - Format: "# Step-ID:1" and "# Sentence-ID:1-1" comment lines, one header line per bunsetsu ("* 0 4D", "* 1 2D", "* 2 4P", "* 3 4D", "* 4 -1O"), one line per morpheme ending in an IOB2 tag (B-Fi, I-Fi, B-Sf, I-Sf, B-Ap, O), and a closing EOS
    - Glosses of the morphemes: raw, salmon, (topic marker), a bite, size, (dative), cut, salt, (accusative), sprinkle, "."

    - We divided our corpus into training (400 recipes), validation (100 recipes), and test sets (100 recipes) and tested popular tools or methods for Japanese MA, NER, and DP
    - We also tested the tools with domain adaptation

    MA
    - MeCab, the most popular morphological analyzer for Japanese, was tested
    - All metrics were in the range 89–91%, although the tool achieved over 98% on newspaper articles in Kudo et al. (2004)

    Table 3. Benchmark results for MA
    Method                      Precision  Recall  F1
    MeCab                       88.91      88.95   88.93
    MeCab w/ domain adaptation  91.12      91.04   91.08

    NER
    - We trained/tested two recognizers using our training/test sets
    - Many errors were caused by domain-specific unknown words

    Table 4. Benchmark results for NER
    Method                Accuracy  Precision  Recall  F1
    Sasada et al. (2015)  88.30     74.65      82.77   78.50
    Lample et al. (2016)  91.41     88.17      87.18   87.67

    DP
    - We tested CaboCha, the most popular dependency parser for Japanese
    - Accuracy was 92–95% (over 20% of the sentences in our test set had at least one parsing error)

    Table 5. Benchmark results for DP
    Method                        Accuracy
    CaboCha                       92.21
    CaboCha w/ domain adaptation  94.68

    - There is still room for improvement in Japanese MA, NER, and DP of cooking recipes
    - By improving the analyses using our corpus, a variety of recipe-related studies based on them can also be improved
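The IOB2 tags and the precision/recall/F1 columns on the poster can be connected with a small sketch. It assumes the common convention of scoring NER by exact span match, which the poster does not spell out; the helper names are hypothetical, and the tag sequence is the one suggested by the Figure 1 glosses (raw, salmon, topic marker, and so on).

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def iob2_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a sequence of IOB2 tags such as B-Fi, I-Fi, O."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span scores (an assumed convention)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


gold_tags = ["B-Fi", "I-Fi", "O", "B-Sf", "I-Sf", "O", "B-Ap", "B-Fi", "O", "B-Ap", "O"]
pred_tags = ["B-Fi", "I-Fi", "O", "B-Sf", "O", "O", "B-Ap", "B-Fi", "O", "B-Ap", "O"]
print(precision_recall_f1(iob2_spans(gold_tags), iob2_spans(pred_tags)))
```

The same `f1_score` helper reproduces the F1 column of Table 3 from its precision and recall: 88.91 and 88.95 give 88.93 after rounding to two decimals.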

  30. (slide without text)

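  31. Summary
    Research and development at Cookpad
    - NLP: ingredient normalization, ingredient recommendation, step classification, ...
    - CV: food photo detection, non-food photo detection, ingredient photo classification, ...
    For details, see also
    - research.cookpad.com
    - techlife.cookpad.com
    - www.ai-gakkai.or.jp/resource/ai_comics
      No. [?], "Recipe Services and AI"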

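  32. Announcement
    Next spring we plan to publish a book on cooking and NLP/CV
    If you are struggling to find a research topic for yourself or your students, please give it a read
    キッチン・インフォマティクス 〜料理を支える自然言語処理と画像処理〜 (Kitchen Informatics: Natural Language Processing and Image Processing That Support Cooking)
    Jun Harashima and Atsushi Hashimoto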

  33. Thank you for your attention.