Continuous Construction and Combined Use of Japanese Recipe Datasets / JED2022

j.harashima
March 19, 2022

Transcript

  1. Continuous Construction and Combined Use of Japanese Recipe Datasets
    Jun Harashima, Makoto Hiramatsu, Yusuke Fukasawa, Yasuhiro Yamaguchi (Cookpad Inc.)
    NLP2022 Workshop on Japanese Evaluation Dataset (JED2022)

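  2. Background
    With the spread of the internet and smartphones, the number of recipes on the internet has grown
    ・In Japanese: 700,000 recipes (2010) → 5 million recipes (2020) *1
    Research and datasets on recipes have grown as well
    ・Research: language understanding [Kiddon+ 15], text generation [Kiddon+ 16], information retrieval [Salvador+ 17], question answering [Yagcioglu+ 18], …
    ・Datasets: Recipe1M+ [Marin+ 19], RISeC [Jiang+ 20], ARA [Donatelli+ 21], …
    In both research and datasets, English still dominates (especially at top conferences) → Japanese cannot afford to fall behind!
    *1 Total number of recipes posted to Cookpad and Rakuten Recipe (counted by the presenter)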

  3. Contents
    Continuous construction of Japanese recipe datasets
    Combined use of Japanese recipe datasets
    Summary and future prospects

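  4. Cookpad
    Japan's largest *1 recipe service, where users can post and search for recipes online
    ・Recipes posted: 3.65 million
    ・Monthly users in Japan: 56 million
    ・Premium members: 1.83 million
    ・Countries and regions served: 74
    ・Languages supported: 32
    *1 All figures are as of December 31, 2021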

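  5. Recipe
    A document that describes the ingredients and preparation steps of a dish
    In most cases it consists of the following elements
    ・Title
    ・Author's comment
    ・Author's name
    ・Ingredients
    ・Steps
    ・Photo of the finished dish (in some cases a video)
    ・Photos taken during cooking (in some cases videos)
    ・…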

  6. Cookpad Dataset
    Datasets that Cookpad Inc. continuously builds and releases
    ・Cookpad Recipe Dataset (released in 2015)
    ・Cookpad Image Dataset (released in 2017)
    ・Cookpad Comparable Corpus (released in 2017)
    ・Cookpad Parsed Corpus (released in 2020)

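  7. Cookpad Recipe Dataset
    Contains the text (title, author's comment, ingredients, steps, …) of about 1.72 million recipes posted by the end of September 2014 [Harashima+ 16]
    Some recipes also carry category and meal information (in other words, not every recipe has it)
    Released in 2015; the world's largest text dataset of recipes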

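  8. Cookpad Recipe Dataset
    Unlike the other datasets described later, it is distributed through NII *1
    As of March 2022, it is used by 212 laboratories *2 at 110 universities across Japan
    *1 https://www.nii.ac.jp/dsc/idr/cookpad/
    *2 Including many laboratories outside NLP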

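  9. Cookpad Image Dataset
    Contains the images (photos of finished dishes and photos taken during cooking) for the same 1.72 million recipes as the Recipe Dataset [Harashima+ 17]
    Released in 2017; the world's largest image dataset of recipes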

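  10. Cookpad Image Dataset
    World's largest by number of finished-dish photos
    Also the world's largest by number of photos taken during cooking
    Can be linked to the Recipe Dataset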
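    Linking the two datasets amounts to a join on a shared recipe identifier. The following is a minimal sketch only: the field names (recipe_id, kind, path), the ID values, the second recipe title, and the helper attach_images are all hypothetical and are not taken from the datasets' documentation.

```python
from collections import defaultdict

# Hypothetical rows; the real column names and ID format are assumptions.
recipes = [
    {"recipe_id": "r1", "title": "Clear Broth with Egg Tofu"},
    {"recipe_id": "r2", "title": "Simmered Bamboo Shoots"},
]
images = [
    {"recipe_id": "r1", "kind": "finished", "path": "r1/final.jpg"},
    {"recipe_id": "r1", "kind": "step", "path": "r1/step1.jpg"},
]


def attach_images(recipes, images):
    """Return copies of the recipe rows with an added "images" list."""
    by_recipe = defaultdict(list)
    for img in images:
        by_recipe[img["recipe_id"]].append(img)
    # Recipes without any image get an empty list rather than being dropped.
    return [{**r, "images": by_recipe.get(r["recipe_id"], [])} for r in recipes]


linked = attach_images(recipes, images)
```

    Keeping recipes that have no image makes the coverage gap between the two sources visible instead of hiding it.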

  11. Cookpad Comparable Corpus
    Contains translation data (Japanese → English) for 16,000 recipes
    ・Data that was used in a service developed in the past (now closed)
    Translation process
    ・1. One native Japanese speaker *1 *2 translated
    ・2. Two native English speakers *2 revised
    Provided as a subtask of WAT 2017 and 2018 *3
    *1 Recruited for proficiency in English
    *2 Recruited for familiarity with cooking
    *3 http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT{2017,2018}/index.html
    ja: {
      title: 卵豆腐のすまし汁,
      ingredients: [
        卵豆腐,
        …
      ],
      steps: [
        たけのこは上のやわらかい部分だけを薄く切る。,
        …
      ],
    },
    en: {
      title: Clear Broth with Egg Tofu,
      ingredients: [
        Egg tofu,
        …
      ],
      steps: [
        Take the soft part of the top of the bamboo shoot and thinly slice.,
        …
      ],
    }
    The actual data has been modified and abbreviated for simplicity
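    The record layout on this slide can be turned into sentence pairs for machine translation. This is a minimal sketch, assuming each record is JSON with ja and en objects sharing the keys title, ingredients, and steps, and assuming list fields align one-to-one. Neither assumption is confirmed by the slide, which shows modified data, and parallel_pairs is a hypothetical helper.

```python
import json

# Hypothetical record mirroring the slide's simplified layout; the real
# corpus files may use a different schema.
RECORD = json.loads("""
{
  "ja": {
    "title": "卵豆腐のすまし汁",
    "ingredients": ["卵豆腐"],
    "steps": ["たけのこは上のやわらかい部分だけを薄く切る。"]
  },
  "en": {
    "title": "Clear Broth with Egg Tofu",
    "ingredients": ["Egg tofu"],
    "steps": ["Take the soft part of the top of the bamboo shoot and thinly slice."]
  }
}
""")


def parallel_pairs(record):
    """Yield (field, japanese, english) triples for one recipe record.

    Assumes list fields have the same length on both sides and raises
    ValueError otherwise instead of silently truncating.
    """
    ja, en = record["ja"], record["en"]
    yield ("title", ja["title"], en["title"])
    for field in ("ingredients", "steps"):
        if len(ja[field]) != len(en[field]):
            raise ValueError(f"{field}: length mismatch")
        for src, tgt in zip(ja[field], en[field]):
            yield (field, src, tgt)


pairs = list(parallel_pairs(RECORD))
```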

  12. Cookpad Comparable Corpus
    Benchmark results and experiment scripts can be viewed and downloaded


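  13. Cookpad Parsed Corpus
    Contains gold-standard annotations for morphological analysis, dependency parsing, and named entity recognition on 500 recipes (titles and steps) [Harashima&Hiramatsu 20]
    ・Morphological analysis: MeCab (ipadic) output corrected by hand
    ・Dependency parsing: CaboCha output corrected by hand
    ・Named entity recognition: 17 original tags assigned by hand
    Possibly the first release of a parsed Japanese corpus by a company?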
    [Sample annotation in CaboCha format: "# Step-ID" and "# Sentence-ID" comment lines, chunk lines beginning with "*", and one token per line with its morphological features and a named-entity tag (B-Fi, I-Fi, B-Sf, I-Sf, B-Ap, O), terminated by "EOS"]
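    An annotation sample of this kind can be read with a small parser. The sketch below assumes the usual CaboCha lattice layout (chunk lines "* id head+type ...", token lines "surface<TAB>features<TAB>tag", "EOS" at the end of a sentence); the exact column layout of the corpus files is an assumption here. The sample sentence and its tag assignments are invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    surface: str
    features: list
    ne_tag: str


@dataclass
class Chunk:
    chunk_id: int
    head_id: int      # -1 marks the root chunk
    link_type: str    # e.g. "D" or "P"
    tokens: list = field(default_factory=list)


def parse_sentence(lines):
    """Parse one sentence and stop at EOS. Returns a list of Chunk."""
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "EOS":
            break
        if not line or line.startswith("# "):
            continue  # blank line or comment such as "# Step-ID:1"
        if line.startswith("* "):
            _, cid, link, *_ = line.split(" ")
            chunks.append(Chunk(int(cid), int(link[:-1]), link[-1]))
        else:
            surface, features, ne_tag = line.split("\t")
            chunks[-1].tokens.append(Token(surface, features.split(","), ne_tag))
    return chunks


def entities(chunks):
    """Collect (text, label) spans from B-/I-/O tags across all chunks."""
    spans, current = [], None
    for chunk in chunks:
        for tok in chunk.tokens:
            if tok.ne_tag.startswith("B-"):
                current = [tok.surface, tok.ne_tag[2:]]
                spans.append(current)
            elif tok.ne_tag.startswith("I-") and current is not None:
                current[0] += tok.surface
            else:
                current = None
    return [tuple(s) for s in spans]


# Invented sample (assumption): tokens, features, and tags are illustrative.
SAMPLE = (
    "# Step-ID:1\n"
    "# Sentence-ID:1-1\n"
    "* 0 1D 0/1 0.0\n"
    "玉ねぎ\t名詞,一般,*,*,*,*,玉ねぎ\tB-Fi\n"
    "を\t助詞,格助詞,一般,*,*,*,を\tO\n"
    "* 1 -1D 0/0 0.0\n"
    "切る\t動詞,自立,*,*,五段・ラ行,基本形,切る\tB-Ap\n"
    "EOS\n"
)

chunks = parse_sentence(SAMPLE.splitlines())
found = entities(chunks)
```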

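  14. Cookpad Parsed Corpus
    Compared with parsing newspaper articles…
    ・Morphological analysis is harder (many unknown words)
    ・Dependency parsing is easier (sentences are short)
    ・Named entity recognition cannot be compared (the tag sets differ)
    Performance of the morphological analyzer (MeCab) *1
    ・Settings: word segmentation only; word segmentation + part-of-speech tagging (each without and with retraining)
    ・Metrics: precision, recall, F-measure
    Performance of the dependency parser (CaboCha) *1
    ・Settings: without and with retraining
    ・Metrics: accuracy at the bunsetsu (phrase) level and at the sentence level
    Performance of named entity recognizers *1
    ・Systems: [Sasada+ 15], [Lample+ 16]
    ・Metrics: accuracy, precision, recall, F-measure
    *1 The experiment scripts are available at https://github.com/cookpad/cpc1.0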
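    For readers unfamiliar with the metrics, word-segmentation precision, recall, and F-measure are commonly computed over character spans. The sketch below follows that common convention; it is not the evaluation code from the repository above, and the function names and the example segmentation are illustrative assumptions.

```python
def spans(words):
    """Convert a word sequence into a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) of the predicted segmentation.

    A predicted word counts as correct only if both of its boundaries
    match a gold word.
    """
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative example: an unknown word split into three pieces.
gold = ["たけのこ", "は", "薄く", "切る"]
pred = ["たけ", "の", "こ", "は", "薄く", "切る"]
p, r, f = segmentation_prf(gold, pred)
```

    In this example three of the six predicted words match gold words, so precision is 3/6 and recall is 3/4; over-segmenting an unknown word lowers precision more than recall.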

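  15. たべみる (Tabemiru) (an aside)
    An analytics tool for corporate customers, built on accumulated Cookpad search data
    Released in 2016
    ・Not released as a dataset; instead, accounts are provided free of charge (researchers only)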

  16. Contents
    Continuous construction of Japanese recipe datasets
    Combined use of Japanese recipe datasets
    Summary and future prospects

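  17. Combined use?
    Each dataset can be used on its own (naturally)
    On the other hand, some tasks and methods become feasible only when the datasets are used in combination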

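  18. Individual use
    Recipe Dataset
    ・Document recommendation (main dish and side dish recommendation)
    ・Text generation (title and step generation)
    ・Keyword recommendation (ingredient recommendation)
    ・…
    Image Dataset
    ・Super-resolution
    ・…
    Comparable Corpus
    ・Machine translation (Japanese-English)
    Parsed Corpus
    ・Morphological analysis
    ・Dependency parsing
    ・Named entity recognition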


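  19. Combined use
    Tasks that combine datasets
    ・Visual question answering
    ・Caption generation
    ・Multimodal retrieval
    ・Multimodal translation
    ・Image recognition (dish recognition, ingredient recognition)
    Recipe Dataset
    ・Document recommendation (main dish and side dish recommendation)
    ・Text generation (title and step generation)
    ・Keyword recommendation (ingredient recommendation)
    ・…
    Image Dataset
    ・Super-resolution
    ・…
    Comparable Corpus
    ・Machine translation (Japanese-English)
    Parsed Corpus
    ・Morphological analysis
    ・Dependency parsing
    ・Named entity recognition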

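  20. Combined use (from the perspective of methods)
    Recipe Dataset
    Pre-training
    ・Masked Language Model
    ・Next Sentence Prediction
    ・…
    Fine-tuning → Comparable Corpus
    ・Machine translation (Japanese-English)
    Fine-tuning → Parsed Corpus
    ・Morphological analysis
    ・Dependency parsing
    ・Named entity recognition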
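    To make the pre-training objective concrete, here is a minimal sketch of input corruption for a masked language model. It assumes the widely used BERT-style scheme (select about 15% of tokens; of those, 80% become a mask symbol, 10% a random token, 10% stay unchanged). The slide names only the objective, so these rates, the "[MASK]" symbol, and the function mask_tokens are assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required
    and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # replace with a random token
        # otherwise keep the original token unchanged
    return inputs, labels


tokens = ["たけのこ", "を", "薄く", "切る"]
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
```

    The model is then trained to recover labels from inputs on recipe text, and the resulting weights are fine-tuned on the smaller annotated corpora.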

  21. Building pre-trained models
    Some people have already begun working on this
    HCG Symposium 2021


  22. Further combinations?
    ・Rakuten Dataset
    ・Flow graph corpus [Mori+ 14]
    ・Cooking ontology [Nanba+ 14]
    ・Basic cooking knowledge base [清丸+ 18]
    ・r-FG-BB dataset [Nishimura+ 20]
    ・…
    All of these are Japanese datasets on recipes and cooking

  23. Contents
    Continuous construction of Japanese recipe datasets
    Combined use of Japanese recipe datasets
    Summary and future prospects

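  24. Summary
    Continuous construction of Japanese recipe datasets
    ・Cookpad Recipe Dataset (released in 2015)
    ・Cookpad Image Dataset (released in 2017)
    ・Cookpad Comparable Corpus (released in 2017)
    ・Cookpad Parsed Corpus (released in 2020)
    Combined use of Japanese recipe datasets
    ・Tasks: visual question answering, multimodal retrieval, caption generation, …
    ・Methods: pre-training + fine-tuning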

  25. Future prospects
    Cookpad Video Dataset with OMRON SINIC X: under active development!
    Parsed Corpus
    [Sample annotation in CaboCha format, the same sample as on slide 13]
    Video Dataset
    Links parsed recipes to cooking videos


  26. References
    • [Donatelli+ 21] Aligning Actions Across Recipe Graphs
    • [Harashima+ 16] A Large-Scale Recipe and Meal Data Collection as Infrastructure for Food Research
    • [Harashima+ 17] Cookpad Image Dataset: An Image Collection as Infrastructure for Food Research
    • [Harashima&Hiramatsu 20] Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
    • [Jiang+ 20] Recipe Instruction Semantics Corpus (RISeC): Resolving Semantic Structure and Zero Anaphora in Recipes
    • [Kiddon+ 15] Mise en Place: Unsupervised Interpretation of Instructional Recipes
    • [Kiddon+ 16] Globally Coherent Text Generation with Neural Checklist Models
    • [Lample+ 16] Neural Architectures for Named Entity Recognition
    • [Marin+ 19] Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
    • [Mori+ 14] Flow Graph Corpus from Recipe Texts
    • [Nanba+ 14] Construction of a Cooking Ontology from Cooking Recipes and Patents
    • [Nishimura+ 20] Visual Grounding Annotation of Recipe Flow Graph
    • [Salvador+ 17] Learning Cross-modal Embeddings for Cooking Recipes and Food Images
    • [Sasada+ 15] Named Entity Recognizer Trainable from Partially Annotated Data
    • [Yagcioglu+ 18] RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
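    • [香川+ 21] クックパッドデータセットで学習した BERT 及び GPT-2 の活用法 (How to Use BERT and GPT-2 Trained on the Cookpad Datasets)
    • [清丸+ 18] 料理レシピとクラウドソーシングに基づく基本料理知識ベースの構築 (Construction of a Basic Cooking Knowledge Base Based on Cooking Recipes and Crowdsourcing)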