
Continuous Construction and Combined Use of Japanese Recipe Datasets (日本語レシピデータセットの継続的な構築と複合的な利用) / JED2022

j.harashima
March 19, 2022


Transcript

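  1. Background
     With the spread of the internet and smartphones, the number of recipes online has grown.
     ・In Japanese: 700,000 recipes (2010) → 5 million recipes (2020) *1
     Research and datasets on recipes have grown as well.
     ・Research: language understanding [Kiddon+ 15], text generation [Kiddon+ 16], information retrieval [Salvador+ 17], question answering [Yagcioglu+ 18], …
     ・Datasets: Recipe1M+ [Marin+ 19], RISeC [Jiang+ 20], ARA [Donatelli+ 21], …
     For research and datasets alike, the focus is still English (especially at top conferences) → Japanese can't afford to fall behind!
     *1 Total number of recipes posted to Cookpad and Rakuten Recipe (counted by the presenter)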
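  2. Cookpad Recipe Dataset
     Contains the text (title, author's comment, ingredients, steps, …) of about 1.72 million recipes posted by the end of September 2014 [Harashima+ 16].
     Some recipes also carry category and meal-plan information (in other words, not every recipe does).
     Released in 2015; the largest recipe-related text dataset in the world.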
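  3. Cookpad Recipe Dataset
     Unlike the other datasets described later, it is distributed through NII *1.
     As of March 2022, it is used by 212 laboratories at 110 universities across Japan *2.
     *1 https://www.nii.ac.jp/dsc/idr/cookpad/
     *2 Including many laboratories outside NLP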
  4. Cookpad Comparable Corpus
     Contains translation data (Japanese → English) for 16,000 recipes.
     ・The data was used in a service developed in the past (now closed)
     Translation process
     ・1. Translated by one native Japanese speaker *1 *2
     ・2. Revised by two native English speakers *2
     Provided as a subtask of WAT 2017 and 2018 *3.
     *1 Hired for proficiency in English
     *2 Hired for proficiency in cooking
     *3 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT{2017,2018}/index.html
     Example (the actual data has been modified and abridged for simplicity):
     ja: { title: 卵豆腐のすまし汁, ingredients: [ 卵豆腐, … ], steps: [ たけのこは上のやわらかい部分だけを薄く切る。, … ] }
     en: { title: Clear Broth with Egg Tofu, ingredients: [ Egg tofu, … ], steps: [ Take the soft part of the top of the bamboo shoot and thinly slice., … ] }
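
A minimal sketch of how one record with the layout in the example above could be turned into Japanese-English segment pairs for machine translation. The field names `title`, `ingredients`, and `steps` come from the slide. The function name `aligned_pairs` is hypothetical, and so is the assumption that both sides list their items in the same order and number; the slide does not confirm either.

```python
from typing import Dict, List, Tuple

Record = Dict[str, Dict[str, object]]


def aligned_pairs(record: Record) -> List[Tuple[str, str]]:
    """Pair up the Japanese and English segments of one recipe record.

    Assumption (not confirmed by the slide): `ja` and `en` list their
    ingredients and steps in the same order and with the same length.
    """
    ja, en = record["ja"], record["en"]
    pairs = [(ja["title"], en["title"])]
    for field in ("ingredients", "steps"):
        ja_items, en_items = ja[field], en[field]
        if len(ja_items) != len(en_items):
            raise ValueError(f"length mismatch in {field!r}")
        pairs.extend(zip(ja_items, en_items))
    return pairs


# Abridged record based on the slide's example (elided items omitted).
record = {
    "ja": {
        "title": "卵豆腐のすまし汁",
        "ingredients": ["卵豆腐"],
        "steps": ["たけのこは上のやわらかい部分だけを薄く切る。"],
    },
    "en": {
        "title": "Clear Broth with Egg Tofu",
        "ingredients": ["Egg tofu"],
        "steps": ["Take the soft part of the top of the bamboo shoot and thinly slice."],
    },
}

for ja_text, en_text in aligned_pairs(record):
    print(ja_text, "=>", en_text)
```

Raising on a length mismatch, rather than silently truncating with `zip`, keeps misaligned records from producing wrong training pairs.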
  5. Cookpad Parsed Corpus
     Contains gold-standard data for morphological analysis, syntactic parsing, and named entity recognition on 500 recipes (titles and steps) [Harashima&Hiramatsu 20].
     ・Morphological analysis: MeCab (ipadic) output, corrected by hand
     ・Syntactic parsing: CaboCha output, corrected by hand
     ・Named entity recognition: 17 original tags, assigned by hand
     Possibly the first parsed Japanese corpus released by a company?
     [Annotation sample for Step-ID 1, Sentence-ID 1-1: five chunks with headers * 0 4D, * 1 2D, * 2 4P, * 3 4D, and * 4 -1O, whose tokens carry the tags B-Fi, I-Fi, B-Sf, I-Sf, B-Ap, and O. The surface forms were lost in extraction.]
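
A minimal sketch of how the B-/I-/O tags in the annotation sample could be collapsed into entity spans. The tag sequence mirrors the sample on the slide (B-Fi, I-Fi, O, …). The function name `bio_spans` is hypothetical, and the code makes no claim about what the labels Fi, Sf, and Ap stand for or about the corpus's actual file layout.

```python
from typing import List, Tuple


def bio_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        # A "B-" tag always opens a span; a stray "I-" tag opens one too,
        # so that ill-formed sequences are not silently dropped.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return spans


# Tag sequence in the order it appears in the slide's sample.
tags = ["B-Fi", "I-Fi", "O", "B-Sf", "I-Sf", "O", "B-Ap", "B-Fi", "O", "B-Ap", "O"]
for label, start, end in bio_spans(tags):
    print(label, start, end)
```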
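  6. Cookpad Parsed Corpus
     Compared with the analysis of newspaper articles…
     ・Morphological analysis is harder (there are many unknown words)
     ・Syntactic parsing is easier (the sentences are short)
     ・Named entity recognition is unclear (the same tags are not used)
     Performance tables *1 (the numeric values were lost in extraction):
     ・Morphological analyzer (MeCab): precision, recall, and F-measure for word segmentation only and for word segmentation + POS tagging, each without and with retraining
     ・Named entity recognizers: accuracy, precision, recall, and F-measure for [Sasada+ 15] and [Lample+ 16]
     ・Syntactic parser (CaboCha): accuracy at the bunsetsu level and at the sentence level, without and with retraining
     *1 The scripts for the experiments are available at https://github.com/cookpad/cpc1.0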
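
For readers unfamiliar with the metrics in the morphological analysis table, here is a generic sketch of precision, recall, and F-measure for word segmentation, computed by matching character-offset spans. It is not the evaluation script from the cpc1.0 repository. The function names and the example segmentations are hypothetical, and the predicted segmentation is not actual MeCab output.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of character-offset spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of `pred` against `gold`.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# Hypothetical example: the prediction over-splits the first word.
gold = ["卵豆腐", "の", "すまし汁"]
pred = ["卵", "豆腐", "の", "すまし汁"]
precision, recall, f = segmentation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f:.3f}")
```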
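  7. Individual use
     ・Recipe Dataset: document recommendation (main-dish and side-dish recommendation), document generation (title and step generation), keyword recommendation (ingredient recommendation), …
     ・Image Dataset: super-resolution, …
     ・Comparable Corpus: machine translation (Japanese-English)
     ・Parsed Corpus: morphological analysis, syntactic parsing, named entity recognition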
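  8. Combined use
     Tasks that combine the Recipe Dataset, Image Dataset, Comparable Corpus, and Parsed Corpus:
     ・Visual question answering
     ・Caption generation
     ・Multimodal retrieval
     ・Multimodal translation
     ・Image recognition (dish recognition, ingredient recognition)
     Shown alongside the individual uses: document recommendation, document generation, and keyword recommendation (Recipe Dataset); super-resolution (Image Dataset); machine translation (Comparable Corpus); morphological analysis, syntactic parsing, and named entity recognition (Parsed Corpus).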
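  9. Combined use (from the perspective of methods)
     ・Recipe Dataset: pre-training (Masked Language Model, Next Sentence Prediction, …)
     ・Comparable Corpus: fine-tuning for machine translation (Japanese-English)
     ・Parsed Corpus: fine-tuning for morphological analysis, syntactic parsing, and named entity recognition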
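
To make the pre-training step concrete, here is a minimal sketch of input corruption for a Masked Language Model. The slide only names the objective. The BERT-style recipe below (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged) is an assumption, and the function name, vocabulary, and tokenization are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    mask_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels) for masked language modelling.

    labels[i] holds the original token where a prediction is required
    and None elsewhere.
    """
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # replace with a random token
            else:
                inputs.append(token)              # keep the original token
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


# Whitespace tokens of the English step from the Comparable Corpus example.
tokens = "Take the soft part of the top of the bamboo shoot and thinly slice .".split()
vocab = sorted(set(tokens))  # hypothetical toy vocabulary
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
print(inputs)
print(labels)
```

A model pre-trained this way on the Recipe Dataset would then be fine-tuned on the smaller annotated corpora, as the slide outlines.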
  10. Combining with even more datasets?
     ・Rakuten Dataset
     ・Flow Graph Corpus [Mori+ 14]
     ・Cooking Ontology [Nanba+ 14]
     ・Basic Cooking Knowledge Base [清丸+ 18]
     ・r-FG-BB Dataset [Nishimura+ 20]
     ・…
     All of these are Japanese datasets on recipes and cooking.
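  11. Summary
     Continuous construction of Japanese recipe datasets
     ・Cookpad Recipe Dataset (released in 2015)
     ・Cookpad Image Dataset (released in 2017)
     ・Cookpad Comparable Corpus (released in 2017)
     ・Cookpad Parsed Corpus (released in 2020)
     Combined use of Japanese recipe datasets
     ・Tasks: visual question answering, multimodal retrieval, caption generation, …
     ・Methods: pre-training + fine-tuning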
  12. Future outlook
     Cookpad Video Dataset with OMRON SINIC X: development in full swing!
     ・Parsed Corpus: [the annotation sample from slide 5; surface forms lost in extraction] …
     ・Video Dataset: links parsed recipes to cooking videos
  13. References
     • [Donatelli+ 21] Aligning Actions Across Recipe Graphs
     • [Harashima+ 16] A Large-Scale Recipe and Meal Data Collection as Infrastructure for Food Research
     • [Harashima+ 17] Cookpad Image Dataset: An Image Collection as Infrastructure for Food Research
     • [Harashima&Hiramatsu 20] Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
     • [Jiang+ 20] Recipe Instruction Semantics Corpus (RISeC): Resolving Semantic Structure and Zero Anaphora in Recipes
     • [Kiddon+ 15] Mise en Place: Unsupervised Interpretation of Instructional Recipes
     • [Kiddon+ 16] Globally Coherent Text Generation with Neural Checklist Models
     • [Lample+ 16] Neural Architectures for Named Entity Recognition
     • [Marin+ 19] Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
     • [Mori+ 14] Flow Graph Corpus from Recipe Texts
     • [Nanba+ 14] Construction of a Cooking Ontology from Cooking Recipes and Patents
     • [Nishimura+ 20] Visual Grounding Annotation of Recipe Flow Graph
     • [Salvador+ 17] Learning Cross-modal Embeddings for Cooking Recipes and Food Images
     • [Sasada+ 15] Named Entity Recognizer Trainable from Partially Annotated Data
     • [Yagcioglu+ 18] RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
     • [香川+ 21] クックパッドデータセットで学習した BERT 及び GPT-2 の活用法 (How to Use BERT and GPT-2 Trained on the Cookpad Dataset)
     • [清丸+ 18] 料理レシピとクラウドソーシングに基づく基本料理知識ベースの構築 (Construction of a Basic Cooking Knowledge Base Based on Cooking Recipes and Crowdsourcing)