
Continuous Construction and Combined Use of Japanese Recipe Datasets / JED2022

j.harashima
March 19, 2022


Transcript

  1. Continuous Construction and Combined Use of Japanese Recipe Datasets
     Jun Harashima, Makoto Hiramatsu, Yusuke Fukasawa, Yasuhiro Yamaguchi (Cookpad Inc.)
     NLP2022 Workshop on Japanese Evaluation Dataset (JED2022)
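  2. Background
     The spread of the internet and smartphones has increased the number of recipes available online.
     • In Japanese: 700,000 recipes (2010) → 5 million recipes (2020) *1
     Research and datasets on recipes have increased as well.
     • Research: language understanding [Kiddon+ 15], text generation [Kiddon+ 16], information retrieval [Salvador+ 17], question answering [Yagcioglu+ 18], …
     • Datasets: Recipe1M+ [Marin+ 19], RISeC [Jiang+ 20], ARA [Donatelli+ 21], …
     For research and datasets alike, the main language is still English (especially at top conferences) → Japanese should not fall behind!
     *1 Total number of recipes posted to Cookpad and Rakuten Recipe (counted by the presenter)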
  3. Outline
     • Continuous construction of Japanese recipe datasets
     • Combined use of Japanese recipe datasets
     • Summary and future outlook

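  4. Cookpad
     Japan's largest *1 recipe service for posting and searching recipes online
     • Recipes posted: 3.65 million
     • Monthly users in Japan: 56 million
     • Premium members: 1.83 million
     • Countries and regions served: 74
     • Languages supported: 32
     *1 All figures are as of December 31, 2021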
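  5. Recipe
     A document that describes the ingredients of a dish and how to prepare it. In most cases it consists of the following elements:
     • Title
     • Author's comment
     • Author's name
     • Ingredients
     • Steps
     • Photo of the finished dish (in some cases, a video)
     • Photos taken during cooking (in some cases, videos)
     • …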
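
The elements listed on this slide map naturally onto a small record type. The following is a minimal sketch only: the class name `Recipe`, its field names, and the `is_complete` helper are hypothetical and are not taken from any Cookpad dataset schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Recipe:
    """Hypothetical container mirroring the recipe elements listed on the slide."""

    title: str
    author_name: str
    author_comment: str = ""
    ingredients: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    # Photo locations are assumed to be paths or URLs; a recipe may have none.
    finished_photo: Optional[str] = None
    step_photos: List[Optional[str]] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Return True when title, ingredients, and steps are all present."""
        return bool(self.title and self.ingredients and self.steps)


recipe = Recipe(
    title="Clear Broth with Egg Tofu",
    author_name="example_author",  # placeholder, not a real user
    ingredients=["Egg tofu"],
    steps=["Take the soft part of the top of the bamboo shoot and thinly slice."],
)
print(recipe.is_complete())
```

Optional fields default to empty values because, as the slide notes, not every recipe carries every element.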
  6. Cookpad Dataset
     Datasets that Cookpad Inc. continuously builds and releases
     • Cookpad Recipe Dataset (released in 2015)
     • Cookpad Image Dataset (released in 2017)
     • Cookpad Comparable Corpus (released in 2017)
     • Cookpad Parsed Corpus (released in 2020)
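  7. Cookpad Recipe Dataset
     Contains the text (title, author's comment, ingredients, steps, …) of about 1.72 million recipes posted by the end of September 2014 [Harashima+ 16]
     Some recipes also carry category and meal (menu) information (conversely, not all recipes do)
     Released in 2015; the world's largest text dataset related to recipes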
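  8. Cookpad Recipe Dataset
     Unlike the other datasets described later, it is distributed through NII *1
     As of March 2022, it is used by 212 laboratories *2 at 110 universities across Japan
     *1 https://www.nii.ac.jp/dsc/idr/cookpad/
     *2 Including many laboratories outside NLP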
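  9. Cookpad Image Dataset
     Contains the images (photos of the finished dish and photos taken during cooking) of the same 1.72 million recipes as the Recipe Dataset [Harashima+ 17]
     Released in 2017; the world's largest image dataset related to recipes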
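  10. Cookpad Image Dataset
     • World's largest by number of finished-dish photos
     • Also world's largest by number of photos taken during cooking
     • Can be linked to the Recipe Dataset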

  11. Cookpad Comparable Corpus
     Contains translation data (Japanese → English) for 16,000 recipes
     • Data that was used in a service developed in the past (now closed)
     Translation process
     • 1. One native Japanese speaker *1 *2 translated
     • 2. Two native English speakers *2 revised
     Provided as a subtask of WAT 2017 and 2018 *3
     *1 Recruited for proficiency in English
     *2 Recruited for familiarity with cooking
     *3 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT{2017,2018}/index.html
     Example record (the actual data is modified and abridged for simplicity):
       ja: {
         title: 卵豆腐のすまし汁,
         ingredients: [
           卵豆腐,
           …
         ],
         steps: [
           たけのこは上のやわらかい部分だけを薄く切る。,
           …
         ],
       },
       en: {
         title: Clear Broth with Egg Tofu,
         ingredients: [
           Egg tofu,
           …
         ],
         steps: [
           Take the soft part of the top of the bamboo shoot and thinly slice.,
           …
         ],
       }
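
To show how a record of this shape could feed a machine translation experiment, here is a minimal sketch that turns one record into Japanese-English text pairs. It assumes the simplified `ja`/`en` layout from the slide's example (the slide itself says the real data differs) and assumes list items align one-to-one; the function name `parallel_pairs` is hypothetical.

```python
from typing import Dict, List, Tuple

# Record shaped like the simplified example on the slide (not the real file format).
record: Dict[str, dict] = {
    "ja": {
        "title": "卵豆腐のすまし汁",
        "ingredients": ["卵豆腐"],
        "steps": ["たけのこは上のやわらかい部分だけを薄く切る。"],
    },
    "en": {
        "title": "Clear Broth with Egg Tofu",
        "ingredients": ["Egg tofu"],
        "steps": ["Take the soft part of the top of the bamboo shoot and thinly slice."],
    },
}


def parallel_pairs(rec: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Pair Japanese and English texts field by field."""
    pairs = [(rec["ja"]["title"], rec["en"]["title"])]
    for key in ("ingredients", "steps"):
        ja_items, en_items = rec["ja"][key], rec["en"][key]
        # Assumption: items correspond by position; skip fields that do not line up.
        if len(ja_items) != len(en_items):
            continue
        pairs.extend(zip(ja_items, en_items))
    return pairs


for ja, en in parallel_pairs(record):
    print(ja, "->", en)
```

Skipping fields whose lengths differ is a conservative choice for a comparable (not strictly parallel) corpus; a real pipeline might align them instead.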
  12. Cookpad Comparable Corpus
     Benchmark results and experiment scripts can be viewed and downloaded

  13. Cookpad Parsed Corpus
     Contains gold-standard annotations for morphological analysis, syntactic parsing, and named entity recognition on 500 recipes (titles and steps) [Harashima&Hiramatsu 20]
     • Morphological analysis: MeCab (ipadic) output, corrected by hand
     • Syntactic parsing: CaboCha output, corrected by hand
     • Named entity recognition: an original set of 17 tags, assigned by hand
     Possibly the first parsed Japanese corpus released by a company?
     [Annotation sample: comment lines "# Step-ID:1" and "# Sentence-ID:1-1", chunk header lines such as "* 0 4D 1/2", token lines of comma-separated features ending in a named entity tag (B-Fi, I-Fi, B-Sf, I-Sf, B-Ap, O), and a closing "EOS"; the Japanese token text is not recoverable]
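
The sample on this slide appears to follow a CaboCha-style layout with a named entity tag as the last feature of each token. The sketch below reads such text and collects entity spans from the BIO tags. The exact file layout is an assumption (tab between surface and features, tag as the final comma-separated field), `SAMPLE` is an invented illustration rather than corpus data, and `read_tokens` and `bio_spans` are hypothetical helper names.

```python
from typing import List, Tuple

# Invented example in the assumed layout; not taken from the corpus.
SAMPLE = """\
# Step-ID:1
# Sentence-ID:1-1
* 0 1D 0/1
たけのこ\t名詞,一般,*,*,*,*,たけのこ,タケノコ,タケノコ,B-Fi
を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ,O
* 1 -1D 0/0
切る\t動詞,自立,*,*,五段・ラ行,基本形,切る,キル,キル,B-Ap
EOS
"""


def read_tokens(text: str) -> List[Tuple[str, str]]:
    """Return (surface, tag) pairs, skipping comments, chunk headers, and EOS."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#") or line.startswith("* ") or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        tokens.append((surface, features.rsplit(",", 1)[-1]))
    return tokens


def bio_spans(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (text, label) spans."""
    spans: List[Tuple[str, str]] = []
    label, text = None, ""
    for surface, tag in tokens:
        if tag.startswith("I-") and label == tag[2:]:
            text += surface  # continue the open span
            continue
        if label is not None:
            spans.append((text, label))  # close the open span
        if tag.startswith("B-"):
            label, text = tag[2:], surface
        else:
            label, text = None, ""
    if label is not None:
        spans.append((text, label))
    return spans


print(bio_spans(read_tokens(SAMPLE)))
```

The tag names `Fi` and `Ap` are copied from the slide's sample; their definitions are given in [Harashima&Hiramatsu 20], not here.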
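  14. Cookpad Parsed Corpus
     Compared with the analysis of newspaper articles …
     • Morphological analysis is harder (there are many unknown words)
     • Syntactic parsing is easier (sentences are short)
     • Named entity recognition is unclear (the same tags are not used)
     Performance of the morphological analyzer (MeCab) *1: precision, recall, and F-score, without and with retraining, for word segmentation only and for word segmentation + POS tagging
     Performance of the syntactic parser (CaboCha) *1: accuracy at the bunsetsu (chunk) level and at the sentence level, without and with retraining
     Performance of the named entity recognizers *1: accuracy, precision, recall, and F-score for [Sasada+ 15] and [Lample+ 16]
     [The numeric values in these three tables are not recoverable]
     *1 Experiment scripts are available at https://github.com/cookpad/cpc1.0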
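
For readers unfamiliar with how precision, recall, and F-score apply to word segmentation, the following sketch scores a predicted segmentation against a gold one by comparing character spans. This is the common span-matching formulation and is an assumption here; the slide does not say how the published scripts compute the scores. The function names and example sentences are illustrative only.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F-score over exactly matching word spans."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Illustrative example: the prediction wrongly splits an unknown word in two.
gold = ["たけのこ", "は", "薄く", "切る"]
pred = ["たけ", "のこ", "は", "薄く", "切る"]
print(segmentation_prf(gold, pred))
```

In this example three of the five predicted words match the four gold words, which gives precision 3/5, recall 3/4, and an F-score of 2/3. Over-splitting an unknown word lowers both precision and recall, consistent with the slide's remark that unknown words make morphological analysis harder.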
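  15. Tabemiru (an aside)
     An analytics tool, offered to businesses, that accumulates Cookpad's search data
     Opened up in 2016
     • Not released as a dataset; instead, accounts are provided free of charge (researchers only)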

  16. Outline
     • Continuous construction of Japanese recipe datasets
     • Combined use of Japanese recipe datasets
     • Summary and future outlook

  17. Combined use?
     Each dataset can be used on its own (naturally)
     However, some tasks and methods only become possible when the datasets are used in combination

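  18. Individual use
     Recipe Dataset
     • Document recommendation (main dish recommendation, side dish recommendation)
     • Text generation (title generation, steps generation)
     • Keyword recommendation (ingredient recommendation)
     • …
     Image Dataset
     • Super-resolution
     • …
     Comparable Corpus
     • Machine translation (Japanese-English)
     Parsed Corpus
     • Morphological analysis
     • Syntactic parsing
     • Named entity recognition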
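  19. Combined use
     Tasks that become possible by combining the Recipe Dataset, Image Dataset, Comparable Corpus, and Parsed Corpus
     • Visual question answering
     • Caption generation
     • Multimodal retrieval
     • Multimodal translation
     • Image recognition (dish recognition, ingredient recognition)
     Individual uses, shown alongside for comparison
     • Recipe Dataset: document recommendation (main dish, side dish), text generation (title, steps), keyword recommendation (ingredients), …
     • Image Dataset: super-resolution, …
     • Comparable Corpus: machine translation (Japanese-English)
     • Parsed Corpus: morphological analysis, syntactic parsing, named entity recognition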
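  20. Combined use (from the perspective of methods)
     Recipe Dataset: pre-training
     • Masked Language Model
     • Next Sentence Prediction
     • …
     Comparable Corpus: fine-tuning
     • Machine translation (Japanese-English)
     Parsed Corpus: fine-tuning
     • Morphological analysis
     • Syntactic parsing
     • Named entity recognition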
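
As a reminder of what a Masked Language Model objective needs as input, here is a minimal sketch of token masking. The slide only names the objective; the 15% selection rate and the 80/10/10 replacement split are the settings from the original BERT paper and are an assumption here, as are the function name `mask_tokens`, the toy vocabulary, and the example tokens.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); labels are None where nothing is predicted."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)  # 80%: replace with the mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)  # 10%: keep the token unchanged
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


# Toy example: pre-tokenized recipe step and a tiny illustrative vocabulary.
step = ["たけのこ", "は", "薄く", "切る", "。"]
vocab = ["卵豆腐", "たけのこ", "切る", "煮る"]
inputs, labels = mask_tokens(step, vocab, random.Random(0))
print(inputs, labels)
```

A model pre-trained this way on recipe text would then be fine-tuned on the smaller annotated corpora, as the slide outlines.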
  21. Building pre-trained models
     Some people have already started working on this
     • HCG Symposium 2021

  22. Further combinations?
     • Rakuten dataset
     • Flow graph corpus [Mori+ 14]
     • Cooking ontology [Nanba+ 14]
     • Basic cooking knowledge base [Kiyomaru+ 18]
     • r-FG-BB dataset [Nishimura+ 20]
     • …
     All of these are Japanese datasets about recipes and cooking
  23. Outline
     • Continuous construction of Japanese recipe datasets
     • Combined use of Japanese recipe datasets
     • Summary and future outlook

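  24. Summary
     Continuous construction of Japanese recipe datasets
     • Cookpad Recipe Dataset (released in 2015)
     • Cookpad Image Dataset (released in 2017)
     • Cookpad Comparable Corpus (released in 2017)
     • Cookpad Parsed Corpus (released in 2020)
     Combined use of Japanese recipe datasets
     • Tasks: visual question answering, multimodal retrieval, caption generation, …
     • Methods: pre-training + fine-tuning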
  25. Future outlook
     Cookpad Video Dataset with OMRON SINIC X: under active development!
     • Links parsed recipes (Parsed Corpus) with cooking videos (Video Dataset)
  26. References
     • [Donatelli+ 21] Aligning Actions Across Recipe Graphs
     • [Harashima+ 16] A Large-Scale Recipe and Meal Data Collection as Infrastructure for Food Research
     • [Harashima+ 17] Cookpad Image Dataset: An Image Collection as Infrastructure for Food Research
     • [Harashima&Hiramatsu 20] Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
     • [Jiang+ 20] Recipe Instruction Semantics Corpus (RISeC): Resolving Semantic Structure and Zero Anaphora in Recipes
     • [Kiddon+ 15] Mise en Place: Unsupervised Interpretation of Instructional Recipes
     • [Kiddon+ 16] Globally Coherent Text Generation with Neural Checklist Models
     • [Lample+ 16] Neural Architectures for Named Entity Recognition
     • [Marin+ 19] Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
     • [Mori+ 14] Flow Graph Corpus from Recipe Texts
     • [Nanba+ 14] Construction of a Cooking Ontology from Cooking Recipes and Patents
     • [Nishimura+ 20] Visual Grounding Annotation of Recipe Flow Graph
     • [Salvador+ 17] Learning Cross-modal Embeddings for Cooking Recipes and Food Images
     • [Sasada+ 15] Named Entity Recognizer Trainable from Partially Annotated Data
     • [Yagcioglu+ 18] RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
     • [Kagawa+ 21] How to Use BERT and GPT-2 Trained on the Cookpad Dataset (in Japanese)
     • [Kiyomaru+ 18] Construction of a Basic Cooking Knowledge Base Based on Cooking Recipes and Crowdsourcing (in Japanese)