
Research and Development at Cookpad / HCG2020

j.harashima
December 16, 2020


Transcript

  1. Research and Development at Cookpad. Jun Harashima, Cookpad Inc.

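  2. Jun Harashima
    - Joined the Kurohashi Lab at Kyoto University: student (natural language processing, information retrieval)
    - Received a Ph.D. (Informatics)
    - Joined Cookpad: engineer (Ruby on Rails), manager (HR, PR, …)
    - Established the R&D division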
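  3. Cookpad: the world's largest recipe service, where users can post and search for recipes online. Monthly users: over … × 100 million (… countries, … languages). Premium members: about … × 10,000.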

  4. Cookpad Mart

  5. Tanoshii Kitchen Fudosan (Fun Kitchen Real Estate)

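  6. History
    - Coin Ltd. (now Cookpad Inc.) founded
    - Recipe posting and search service kitchen@coin launched
    - Service renamed Cookpad
    - Premium service launched
    - Listed on the Mothers market
    - Moved to the First Section of the Tokyo Stock Exchange
    - Premium service membership exceeded … × 10,000
    - Full-scale international expansion began
    - R&D division launched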
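  7. Related services
    - Omakase Seiri (automatic organizing): quickly find seasonal vegetable recipes and recipes for main dishes
    - Alexa skill: suggests popular recipes you can make right away with just the ingredients you want to use
    - Ryori Kiroku (cooking log): just take a photo to keep a record; automatically organizes your food photos
    - OiCy: beyond recipes, toward a delicious smart kitchen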
  8. Recipes consist of text and images, and so does feedback. Media processing (especially NLP and CV) is important.

  9. NLP

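  10. Recipe classification
    Classify recipes by type, cooking method, genre, …
    - Types (e.g., meat dishes, fish dishes), cooking methods (e.g., simmered dishes, grilled dishes), genres (e.g., Japanese, Western)
    - Classification results are used for narrowing down search results, etc.
    Models: SVM and RF
    - Performance depends on the tag but is sufficient
    Training data
    - Recipes tagged by annotators and recipes tagged by users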
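  11. Ingredient name normalization
    Cookpad has millions of distinct ingredient names
    - This is a problem when computing, for example, the calories of a recipe
    Normalize with an encoder-decoder
    - Character-based stacked unidirectional LSTM (… layers)
    - Note: bidirectional LSTM and attention had no effect
    - Diagram: the input characters ★醤油 are read one at a time, and the model emits しょうゆ (soy sauce) followed by EOW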
  12. Calorie Estimation (Harashima et al.)
    - We have estimated the number of calories in over … recipes and actually use them in our recipe service
    - Use the single source model for serving estimation
    - Example: Asari Clam Rice (… go). Ingredients: アサリ (asari clam), 米 (rice), 塩 (salt), 酒 (sake), しょうゆ (soy sauce), みりん (sweet sake), each marked ✓
    - $c(r) = \dfrac{\sum_{i \in I_r} c(i) \cdot q(i) / 100}{s(r)} = 306.6$
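The formula on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not Cookpad's implementation. It assumes that c(i) is the calorie content per 100 g of ingredient i, q(i) is the quantity in grams, and s(r) is the number of servings, so the result is calories per serving. The `Ingredient` class, the function name, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ingredient:
    name: str
    kcal_per_100g: float  # c(i): assumed to be calories per 100 g
    grams: float          # q(i): assumed to be the quantity in grams


def calories_per_serving(ingredients, servings):
    """Sum c(i) * q(i) / 100 over the ingredients, then divide by s(r)."""
    if servings <= 0:
        raise ValueError("servings must be positive")
    total = sum(i.kcal_per_100g * i.grams / 100 for i in ingredients)
    return total / servings


# Hypothetical values for illustration only (not taken from the slide).
recipe = [
    Ingredient("rice", 350.0, 300.0),
    Ingredient("asari clam", 30.0, 200.0),
    Ingredient("soy sauce", 70.0, 20.0),
]
per_serving = calories_per_serving(recipe, servings=2)
print(round(per_serving, 1))  # 562.0
```

In practice the hard parts are upstream of this arithmetic: mapping free-text ingredient names to a nutrition table, converting quantities to grams, and estimating the number of servings.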
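  13. Ingredient name recommendation
    Embed the title entered by the user
    - Embed each word and take the average
    Retrieve recipes
    - Retrieve recipes with similar titles in the embedding space from Cookpad's existing recipes
    Recommend ingredient names
    - Recommend ingredient names shared by the retrieved recipes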
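The three steps on this slide (embed the title by averaging word vectors, retrieve recipes with similar titles, recommend the ingredients they share) can be sketched as follows. This is a toy illustration under stated assumptions: the two-dimensional word vectors, the recipe list, cosine similarity as the similarity measure, and the `k` and `min_count` parameters are all hypothetical, and a real system would use trained embeddings and an approximate nearest-neighbor index.

```python
import math
from collections import Counter

# Toy word vectors (hypothetical; stand-ins for trained embeddings).
WORD_VECTORS = {
    "tomato": [1.0, 0.0],
    "soup": [0.8, 0.2],
    "salad": [0.6, 0.4],
    "chocolate": [0.0, 1.0],
    "cake": [0.1, 0.9],
}

# Hypothetical existing recipes: (title, ingredient names).
RECIPES = [
    ("tomato soup", ["tomato", "onion", "salt"]),
    ("tomato salad", ["tomato", "lettuce", "salt"]),
    ("chocolate cake", ["chocolate", "flour", "sugar"]),
]


def embed_title(title):
    """Embed each known word and return the average vector (None if no word is known)."""
    vectors = [WORD_VECTORS[w] for w in title.split() if w in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_ingredients(title, k=2, min_count=2):
    """Retrieve the k most similar recipes; return ingredients found in at least min_count of them."""
    query = embed_title(title)
    if query is None:
        return []
    scored = []
    for recipe_title, ingredients in RECIPES:
        vec = embed_title(recipe_title)
        if vec is not None:
            scored.append((cosine(query, vec), ingredients))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    counts = Counter(name for _, ingredients in scored[:k] for name in set(ingredients))
    return sorted(name for name, count in counts.items() if count >= min_count)


print(recommend_ingredients("tomato"))  # ['salt', 'tomato']
```

Averaging word vectors ignores word order, which is usually acceptable for short recipe titles and keeps the lookup cheap.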

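  14. Step classification
    Classify each cooking step in a recipe as a True Step or a Fake Step
    - Tagged about … steps and trained an LSTM (accuracy: …)
    - Used in services such as reading recipes aloud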

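  15. Recipe translation
    We (happened to) have a parallel corpus in-house
    - A relic of a past service
    - About … Japanese-English recipe pairs
    Tested NMT [Bahdanau et al.]
    - Result: translation good enough to offer as a service is difficult
    - Occasionally a reasonable translation is generated
      - Ja: 蓋またはアルミホイルで落とし蓋をして中火で蒸し焼きにします。
      - En: cover with a lid or aluminum foil, and steam the chicken on medium heat.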
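  16. Feedback classification
    Automatically classify feedback from users (e.g., "It turned out delicious!" is tagged Positive)
    - Supports … tags (… SVMs trained)
    - Suggests high-probability tags, halving the staff's classification work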

  17. CV

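  18. Food photo detection
    Detect food photos on users' smartphones with deep learning
    - Example labels: food, non-food, non-food, non-food
    - Precision: …, recall: …
    - Ryori Kiroku reached … × 10,000 users within … year(s) of release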

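  19. Food photo detection
    - ver. …: CaffeNet; binary classification (food or non-food)
    - ver. …: Inception v…; multi-class classification (food, plants, …, or non-food)
    - ver. …: Inception v… with patched classification; binary classification (food or non-food)
    Training data
    - Positive examples: images from Cookpad recipes
    - Negative examples: various license-free images
    Recognition results of ver. …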
  20. Food photo detection
    https://twitter.com/ohmycorgi/status/867745923719364609
    https://twitter.com/teenybiscuit/status/707727863571582978
    ✓ marks the photos that the model judged to be food

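  21. Non-food photo detection
    Detect non-food photos among the photos posted by users
    - Uses the same model as food photo detection
    - Notifies the user of the guidelines
    - Example labels: non-food, food, food, food
    Note: non-food photos often have family members or pets in the frame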

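  22. Ingredient photo classification
    Classify ingredient photos into about … classes, for use in service development (planned)
    - Example label: tomato
    - Example recipe titles: "Beyond imagination? Three-color soup with tomato, lettuce, and egg"; "Easy tomato soup (minestrone)"; "New onion and tomato salad"; "Aim for deli style! Lettuce and tomato salad"; "Super simple! Japanese-style cold pasta with tomato and daikon"; "Refreshing marinated tomato and new onion"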
  23. Other topics

  24. Release of recipe data

  25. Usage (related topic)
    - Before release (until …): … universities, … labs
    - After release (…): … universities, … labs
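  26. AI Challenge Contest (round …)
    - Food detection track: detect food regions in photos
    - Food classification track: classify food photos into … classes (e.g., somen, udon)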

  27. JSAI Cup: ingredient photo classification
    - … categories (e.g., onion)
    - … images in total

  28. WAT

  29. Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
    Jun Harashima and Makoto Hiramatsu (Cookpad Inc.)
    The 14th Linguistic Annotation Workshop

    Background
    - The number of cooking recipes on the Internet has grown
    - Recipe-related studies and datasets are also increasing
    - However, there are still few datasets that provide linguistic annotations for recipe-related studies, even though such annotations should form the basis of the studies

    Table 1. Existing recipe-related datasets and our corpus
      Name                                   Year  Main content
      CURD                                   2008  Machine-readable language representations
      Flow Graph Corpus                      2014  Graph representations and named entities
      SIMMR Recipe Dataset                   2015  Graph representations
      Cookpad Recipe Dataset                 2016  Reviews and meals
      Cookpad Image Dataset                  2017  Food images and cooking images
      Recipe1M                               2017  Food images
      RecipeQA                               2018  Question-answer pairs
      Storyboarding Data                     2019  Cooking images
      r-FG BB dataset                        2019  Bounding boxes for cooking images
      English Recipe Flow Graph Corpus       2020  Graph representations and named entities
      Multimodal Aligned Recipe Corpus       2020  URLs to YouTube videos
      Multi-modal Recipe Structure dataset   2020  Graph representations and cooking images
      Cookpad Parsed Corpus                  2020  Linguistic annotations

    Contributions of this study
    - Construction of a novel corpus, which contains linguistic annotations of 500 Japanese recipes
    - Benchmark results on the corpus for Japanese morphological analysis (MA), named entity recognition (NER), and dependency parsing (DP)

    Cookpad Parsed Corpus
    - We randomly selected 500 recipes from the Cookpad Recipe Dataset
    - 4,738 sentences in the 500 recipes were annotated with morphemes, named entities, and dependency relations
    Morphemes
    - We decided the boundaries and part of speech of each morpheme based on the IPA dictionary, which is commonly used for MA
    Named entities
    - Morphemes were annotated with 17 tags, such as Fi (food ingredient) and Sf (state of food), in IOB2 format
    Dependency relations
    - Bunsetsus were annotated with relations such as D (normal dependency) and P (coordination dependency)
    - A bunsetsu is a unit of Japanese that consists of one or more content words and zero or more function words
    - Bunsetsus were also annotated with 7 types, such as Topic
    Other
    - Content in the Cookpad Recipe and Image Datasets, which include the same 500 recipes, can also be used

    Table 2. Existing Japanese parsed corpora and our corpus
      Name                           Year  Target documents
      KU Text Corpus                 2002  Newspaper articles
      GDA Corpus                     2005  Newspaper articles and dictionary entries
      NAIST Text Corpus              2007  Newspaper articles
      KU and NTT Blog Corpus         2011  Blogs
      KU Web Document Leads Corpus   2012  Web documents
      BCCWJ                          2014  Newspaper articles, books, magazines, etc.
      Cookpad Parsed Corpus          2020  Cooking recipes

    Figure 1. Linguistic annotations for an example sentence (Cut the raw salmon into bite-size chunks and sprinkle them with salt.) in our corpus. Morpheme glosses: raw, salmon, (topic marker), a bite, size, (dative), cut, salt, (accusative), sprinkle, "." Named entity tags in the same order: B-Fi, I-Fi, O, B-Sf, I-Sf, O, B-Ap, B-Fi, O, B-Ap, O. Bunsetsu dependencies: 0 → 4 (D), 1 → 2 (D), 2 → 4 (P), 3 → 4 (D), 4 → -1 (O).

    Benchmarks
    - We divided our corpus into training (400 recipes), validation (100 recipes), and test sets (100 recipes) and tested popular tools or methods for Japanese MA, NER, and DP
    - We also tested the tools with domain adaptation
    MA
    - MeCab, the most popular morphological analyzer of Japanese, was tested
    - All metrics were 89–91%, although the tool achieved over 98% on newspaper articles in Kudo et al. (2004)
    NER
    - We trained and tested two recognizers using our training and test sets
    - Many errors were caused by domain-specific unknown words
    DP
    - We tested CaboCha, the most popular dependency parser for Japanese
    - Accuracy was 92–95% (over 20% of the sentences in our test set had at least one parsing error)

    Table 3. Benchmark results for MA
      Method                       Precision  Recall  F1
      MeCab                        88.91      88.95   88.93
      MeCab w/ domain adaptation   91.12      91.04   91.08

    Table 4. Benchmark results for NER
      Method                 Accuracy  Precision  Recall  F1
      Sasada et al. (2015)   88.30     74.65      82.77   78.50
      Lample et al. (2016)   91.41     88.17      87.18   87.67

    Table 5. Benchmark results for DP
      Method                         Accuracy
      CaboCha                        92.21
      CaboCha w/ domain adaptation   94.68

    - There is still room for improvement in Japanese MA, NER, and DP of cooking recipes
    - By improving the analyses using our corpus, a variety of recipe-related studies based on them can also be improved
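To make the IOB2 format mentioned on this poster concrete, here is a small sketch that groups IOB2-tagged morphemes into entity spans. The tag sequence follows Figure 1 on the poster. The tokens are English glosses used as stand-ins, since the Japanese morphemes did not survive in the transcript, and the function name is hypothetical rather than part of any corpus tooling.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (label, [tokens]) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    current = None  # (label, tokens) of the entity being built, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new entity, closing any open one.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # I- continues the open entity of the same label.
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


# English glosses stand in for the Japanese morphemes of Figure 1.
tokens = ["raw", "salmon", "TOPIC", "a bite", "size", "DATIVE",
          "cut", "salt", "ACCUSATIVE", "sprinkle", "."]
tags = ["B-Fi", "I-Fi", "O", "B-Sf", "I-Sf", "O",
        "B-Ap", "B-Fi", "O", "B-Ap", "O"]

for label, words in iob2_to_spans(tokens, tags):
    print(label, words)
```

In IOB2 every entity starts with a B- tag, so adjacent entities of the same label stay separable without extra markers.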
  30. None
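  31. Summary: research and development at Cookpad
    - NLP: ingredient normalization, ingredient recommendation, step classification, …
    - CV: food photo detection, non-food photo detection, ingredient photo classification, …
    For details, see also:
    - research.cookpad.com
    - techlife.cookpad.com
    - www.ai-gakkai.or.jp/resource/ai_comics (No. …, "Recipe Services and AI")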

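  32. Announcement
    Next spring, we plan to publish a book on cooking and NLP/CV. If you are struggling to choose a research topic for yourself or your students, please give it a read.
    キッチン・インフォマティクス 〜料理を支える自然言語処理と画像処理〜 (Kitchen Informatics: Natural Language Processing and Image Processing That Support Cooking), by Jun Harashima and Atsushi Hashimoto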

  33. Thank you for your attention.