
Structure-Aware Procedural Text Generation from an Image Sequence


Taichi Nishimura

February 02, 2021

  1. Structure-Aware Procedural Text Generation from an Image Sequence Taichi Nishimura

    Supervisor: Prof. Shinsuke Mori. Master's thesis, final defense.
  2. (1.1) Background 2 Procedural texts are the accumulated wisdom of

    human beings, describing the steps to accomplish tasks. Generating a procedural text from visual observation (e.g., an image or a video) has various real-world applications, such as voice support systems [Nouri+ SIGIR20] and robotic assistants [Bollini+ 12]. (Figure: high-level applications; a multimedia archive in which procedural text generation turns image sequences such as "Cut chicken" and "Stir-fry chicken" into database entries "Cut Chicken.. Stir-fry it..".)
  3. (1.2) Procedural Text Generation from an Image Sequence Input: image

    sequence (taken by users). Output: procedural text (each step corresponds to an image). Example: Step 1: Cut the chicken. Add seasoning. Step 2: Put oil into the pan. Add the chicken. Step 3: Add the cabbage to the pan. Steam it. 3
  4. (1.3) Understanding of Context Dependency Models must understand context

    dependency. In natural language understanding, some studies have represented it as a tree or graph structure: • Recipe Flow Graph [Mori+ LREC14] • SIMMR [Jermsurawong+ EMNLP15]. (Figure: Step 1 "Boil pasta" and Step 2 "Cut vegetables" merge into Step 3 "Stir-fry it", followed by Step 4 "Serve on a plate".) 4
  5. (1.4) SIMMR [Jermsurawong and Habash EMNLP15] SIMMR is a merging

    tree: • All leaves are materials. • Intermediate nodes are instructions. 5 (Predicting the structure of cooking recipes, Figure 1)
  6. (1.5) Research Objective and Contributions Research

    Objective: We investigate the impact of explicitly introducing such a structure on the task of procedural text generation from an image sequence. Contributions: • Propose a new dataset: visual SIMMR (vSIMMR). • Propose a novel structure-aware procedural text generation model. 6 (Figure: inputs, a merging tree over tomatoes, pumpkin, ketchup, and mayonnaise, and the generated procedural text: Step 1 "Cut the ...", Step 2 "Stir-fry the ...", Step 3 "Put them ...", Step 4 "Cover them ...".)
  7. (2.1) Multimedia Research in How-to Domains Cooking is one of

    the most popular domains among how-to domains. 7 (Figure: related work arranged by modality: Language (Flow Graph, SIMMR), Vision (ingredient recognition, kitchen action recognition), and Vision and Language (Im2recipe, RecipeQA).)
  8. (2.2) A Comparison of vSIMMR with Other How-to Datasets 8

    Datasets               Recipes?  Ingredients?  Structure?  Visual Data     #Recipes
    Breakfast              -         -             -           Video           N/A
    EPIC-Kitchen           -         -             -           Video           N/A
    YouCook2               ✔         -             -           Video           89
    RecipeQA               ✔         -             -           Image sequence  19,779
    Story boarding         ✔         -             -           Image sequence  16,405
    Cookpad Image Dataset  ✔         ✔             -           Image sequence  1,715,595
    Recipe Flow Graph      ✔         ✔             ✔           N/A             266
    Action Graph           ✔         ✔             ✔           N/A             133
    SIMMR                  ✔         ✔             ✔           N/A             260
    vSIMMR                 ✔         ✔             ✔           Image sequence  2,103
  9. (2.3) Procedural Text Generation from Various Inputs Inputs are

    divided into two resources: language and vision, and generated procedural texts must be coherent. Language information: title and ingredients, with an attention mechanism [Kiddon+ EMNLP16] or reinforcement learning [Bosselut+ NAACL16]. Visual information: finished dish image [Salvador+ CVPR19], cooking video [Ushiku+ IJCNLP17], [Shi+ ACL19], image sequence [Nishimura+ INLG19], [Chandu+ ACL19]; this work is a structure-aware version of the image-sequence setting. 9
  10. (3.) visual SIMMR (vSIMMR) Extension points: 1. vSIMMR is a

    vision and language version of SIMMR. 2. vSIMMR is about 8 times larger than SIMMR (SIMMR: 260 recipe-tree pairs / vSIMMR: 2,103 pairs). 10 (Figure: (a) SIMMR pairs a merging tree over the materials (tomatoes, pumpkin, ketchup, mayonnaise) with a procedural text; (b) vSIMMR (ours) additionally aligns an image sequence (Images 1-4) with the steps: "Cut the tomatoes. Stir-fry the pumpkin. Put them on the tomatoes. Cover them with ketchup and mayonnaise.")
  11. (4.1) An Overview of Our Method Two parts:

    1. Structure-aware procedural text generation model (Section 4.1): (i) Link Probability Matrix (LPM) calculation, (ii) Gumbel softmax resampling, (iii) structured decoder. 2. Tree re-prediction module (Section 4.2). 11 (Figure: the materials and the image sequence feed the material-to-step LPM P_mat→v and the order-constrained step-to-step LPM P_v→v; Gumbel softmax resampling turns them into a merging tree; the structured decoder generates the procedural text (Steps 1-4); tree re-prediction then produces the re-predicted LPMs P̂_mat→v and P̂_v→v, also under the order constraint.)
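As a rough illustration of step (ii), Gumbel softmax resampling draws an (almost) discrete link choice per LPM row while staying differentiable. This is a minimal numpy sketch of the standard Gumbel-softmax trick, not the thesis implementation; the function name `gumbel_softmax` and the temperature `tau` are illustrative.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Temperature-scaled softmax over logits perturbed by Gumbel(0, 1) noise.

    Each row can be read as a resampled, near-one-hot link choice for one
    node of the merging tree; lower tau pushes the rows closer to one-hot.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(1e-9, 1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Resample parent links for 3 steps over 4 candidate parents.
probs = gumbel_softmax(np.log(np.array([[0.7, 0.1, 0.1, 0.1]] * 3)), tau=0.5)
```

Because the sampling is reparameterized through the softmax, gradients can flow from the decoder back into the LPMs.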
  12. (4.2) Link Probability Matrix (LPM) Calculation 12 A merging tree can

    be predicted from a probability matrix. Because a processed material cannot be merged back into a previous step, the step-to-step links carry an order constraint. (Figure: material-to-step and step-to-step LPMs for tomatoes, pumpkin, ketchup, and mayonnaise, with the order constraint masking the step-to-step matrix.)
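One way such an order constraint can be enforced is by masking out backward links before the row-wise softmax. The helper `step_to_step_lpm` below is hypothetical, not code from the thesis; it assumes step i may only link to a later step j > i, with the final step acting as the tree root.

```python
import numpy as np

def step_to_step_lpm(scores):
    """Normalize raw pairwise scores into a step-to-step link probability matrix.

    Entry (i, j) is the probability that the output of step i is merged into
    step j. The order constraint (processed materials only flow forward) is
    enforced by masking out j <= i before the row-wise softmax.
    """
    n = scores.shape[0]
    allowed = np.triu(np.ones((n, n), dtype=bool), k=1)  # only later steps j > i
    masked = np.where(allowed, scores, -np.inf)
    lpm = np.zeros_like(scores, dtype=float)
    for i in range(n - 1):            # the final step is the root: no outgoing link
        row = masked[i] - masked[i][allowed[i]].max()
        e = np.exp(row)               # exp(-inf) = 0 removes forbidden links
        lpm[i] = e / e.sum()
    return lpm
```

Each non-root row is a proper distribution over strictly later steps, so any tree decoded from it automatically respects the order constraint.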
  13. (4.4) Structured Decoder: Tree-LSTM [Tai+ ACL15] Child-sum Tree-LSTM, a variant

    of Tree-LSTM, encodes the context dependency via the sum of the children's states. 14 (Figure: hidden and cell states (h, c) of child nodes 1-4 flow into their parent's input (h_in, c_in) through the forget gates f.)
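The Child-sum Tree-LSTM node update of [Tai+ ACL15] can be sketched in plain numpy as follows. This is a simplified illustration with the gate weights stacked into single matrices, not the model's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_tree_lstm(x, child_h, child_c, W, U, b):
    """One Child-sum Tree-LSTM node update.

    x: (d,) input for this node; child_h, child_c: (k, d) hidden/cell states
    of its k children; W, U: (4d, d) stacked weights for the i, o, u, f
    blocks; b: (4d,) bias. Children are aggregated by an unordered sum,
    which fits merging trees (child order is irrelevant), while each child
    still gets its own forget gate.
    """
    h_sum = child_h.sum(axis=0)                # unordered sum over children
    zi, zo, zu, zf = np.split(W @ x + b, 4)    # input contribution per gate
    Ui, Uo, Uu, Uf = np.split(U, 4)            # recurrent weights per gate
    i = sigmoid(zi + Ui @ h_sum)               # input gate
    o = sigmoid(zo + Uo @ h_sum)               # output gate
    u = np.tanh(zu + Uu @ h_sum)               # candidate cell update
    f = sigmoid(zf[None, :] + child_h @ Uf.T)  # one forget gate per child, (k, d)
    c = i * u + (f * child_c).sum(axis=0)      # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

# A node with two children, hidden size 4, random weights.
rng = np.random.default_rng(0)
d = 4
h, c = child_sum_tree_lstm(rng.normal(size=d), rng.normal(size=(2, d)),
                           rng.normal(size=(2, d)), rng.normal(size=(4 * d, d)),
                           rng.normal(size=(4 * d, d)), np.zeros(4 * d))
```

The per-child forget gates let the node weight how much of each branch's cell state survives the merge.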
  14. (4.5) Structured Decoder: Procedural Text Generation (Figure: the merging

    tree is fed to the procedural text generator, which outputs the procedural text Step 1 "Cut the tomatoes.", Step 2 "Stir-fry the pumpkin.", Step 3 "Put them on the tomatoes.", Step 4 "Cover them with ketchup and mayonnaise.") 15
  15. (4.6) Merging Tree Re-prediction 16 (Figure: from the generated

    procedural text (Step 1 "Cut the tomatoes.", Step 2 "Stir-fry the pumpkin.", Step 3 "Put them on the tomatoes.", Step 4 "Cover them with ketchup and mayonnaise."), the material-to-step and step-to-step matrices are re-predicted.)
  16. (4.7) Loss Function Merging tree / procedural text losses:

    • Procedural text generation loss L_text. • Merging tree prediction loss L_tp: compared with the ground truth (labeled data only). • Merging tree re-prediction loss L_tc: compared with the ground truth (labeled) or with the argmax of the predicted merging tree (unlabeled). β-VAE losses for semi-supervised learning: • Tree2image VAE loss L_t2i. • Step2image VAE loss L_s2i. Total loss: L = L_text + L_tp + λ_tc L_tc + λ_t2i L_t2i + λ_s2i L_s2i, with each λ_* ∈ {0.001, 0.01, 0.1}. 17
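The total loss above is a plain weighted sum; a trivial sketch with illustrative names (the slide fixes the weight of L_tp to 1 and searches each λ over {0.001, 0.01, 0.1}):

```python
def total_loss(l_text, l_tp, l_tc, l_t2i, l_s2i,
               lam_tc=0.1, lam_t2i=0.1, lam_s2i=0.1):
    """L = L_text + L_tp + lam_tc * L_tc + lam_t2i * L_t2i + lam_s2i * L_s2i.

    Each lam_* is tuned over {0.001, 0.01, 0.1}; l_tp and l_tc are only
    meaningful for labeled examples, matching the semi-supervised setup.
    """
    return l_text + l_tp + lam_tc * l_tc + lam_t2i * l_t2i + lam_s2i * l_s2i
```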
  17. (5.1) Datasets (a) Cookpad Image Dataset [Harashima+ SIGIR17]: image

    sequence, recipe (Japanese). (b) vSIMMR: image sequence, ingredient tree, recipe (Japanese). 18

    (a) Cookpad Image Dataset
                  train    val     test
    #recipes      163,525  18,051  20,193
    #steps        6.24     6.15    6.26
    #words        148.52   147.02  148.50
    #ingredients  7.85     7.79    7.86

    (b) vSIMMR
                  train    val     test
    #recipes      1,603    250     250
    #steps        6.78     6.74    6.85
    #words        118.23   113.91  114.68
    #ingredients  6.58     6.37    6.64
  18. (5.2) Procedural Text Generators Procedural text generation models: • Images2seq

    [Huang+ NAACL16]: baseline for visual storytelling (ViST). • GLAC Net [Kim+ arXiv18]: SoTA model at the ViST Workshop. • SSiD [Chandu+ ACL19]. • SSiL [Chandu+ ACL19]. • RetAttn [Nishimura+ INLG19]. Ablations: • Half model: does not compute the tree re-prediction loss. • Full model: computes the tree re-prediction loss. Note: for a fair comparison, we add a module that incorporates ingredient vectors into all the models. 19
  19. (5.3) Commonly Used Word-Overlap Metrics The proposed method boosts the

    performance in a versatile manner. 20

                BLEU1  BLEU4  ROUGE-L  Distinct-1  Distinct-2
    Images2seq  27.5   5.1    18.4     38.3        54.7
      Half      27.8*  5.8*   20.6*    51.1*       75.0*
      Full      29.6*  6.3*   21.7*    47.6*       71.0*
    GLAC Net    28.5   5.9    21.4     46.6        69.0
      Half      28.5   5.9    21.8*    46.7        68.8
      Full      28.9   6.1    21.3     47.2*       69.9*
    SSiD        28.7   6.0    20.9     45.5        66.6
      Half      30.3*  6.2*   20.8     43.9        65.1
      Full      31.1*  6.4*   21.6*    48.3*       71.0*
    SSiL        31.0   6.3    21.4     45.5        66.8
      Half      28.1   5.4    21.4     46.7*       68.2*
      Full      30.4   6.4    21.9*    47.3*       70.9*
    RetAttn     32.2   6.5    21.6     40.2        60.3
      Half      32.2   6.5    21.8     52.4*       77.8*
      Full      33.2   7.1    22.1     52.7*       78.6*
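Distinct-1 and Distinct-2 in the table measure lexical diversity. Assuming the standard definition (unique n-grams over total n-grams across the generated texts), a sketch:

```python
def distinct_n(texts, n):
    """Distinct-n: ratio of unique n-grams to total n-grams over a corpus
    of generated texts; higher means less repetitive output."""
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# "cut" and "the" repeat, so Distinct-1 is 4 unique unigrams / 6 total.
score = distinct_n(["cut the chicken", "cut the cabbage"], 1)
```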
  20. (5.4) Word-Overlap Metrics for Coherency Evaluation I = Ingredient

    words, Ac = Action words, as defined by [Mori et al. LREC2014]. The proposed method generates coherent procedural texts. 21

                I-BLEU1  I-BLEU2  I-ROUGE  Ac-BLEU1  Ac-BLEU2  Ac-ROUGE
    Images2seq  7.0      1.8      9.7      18.2      8.1       18.4
      Half      7.9*     2.2*     12.7*    18.8*     8.4*      21.2*
      Full      8.8*     2.7*     13.8*    21.4*     9.4*      22.9*
    GLAC Net    9.5      2.9      13.2     21.2      9.3       22.8
      Half      10.5*    3.4*     15.6*    21.0      9.4       23.1
      Full      11.8*    3.8*     16.3*    21.0      9.1       22.5
    SSiD        8.6      2.7      13.1     20.1      8.7       21.6
      Half      11.4*    3.7*     16.4*    19.8      8.7       22.1*
      Full      12.9*    4.1*     16.7*    21.5*     9.5*      23.5*
    SSiL        8.3      2.4      12.5     19.8      8.7       22.1
      Half      11.3*    3.7*     16.9*    18.7      8.1       22.2
      Full      12.4*    4.2*     17.0*    21.4*     9.5*      23.0*
    RetAttn     11.2     3.3      14.5     22.1      9.0       21.2
      Half      11.9*    3.7*     14.8     21.8      9.4*      22.9*
      Full      12.1*    3.8*     14.8     22.3      9.5*      23.1*
  21. (5.6) Human Evaluation We evaluated 120 generated recipes from three

    perspectives: • Fluency (F) • Ingredient use (IU) • Image fitting (IF). The proposed method clearly outperforms the baseline on all metrics. 22

    All 120 recipes
         Baseline  Full  Tie
    F    26.7      62.5  10.8
    IU   30.8      66.7  2.5
    IF   31.7      45.0  23.3

    Longest 30 recipes
         Baseline  Full  Tie
    F    16.7      70.0  13.3
    IU   26.7      73.3  0.0
    IF   36.7      50.0  13.3
  22. (5.7) Generated Procedural Text Example 23 (Table, in Japanese: for

    one input image sequence and ingredient list, the six generated steps of RetAttn (baseline), RetAttn (half model), and RetAttn (full model) are compared with the ground truth, together with the merging trees generated by the half and full models.)
  23. (5.8) Referred Expression 24 (Table, in Japanese: the same

    example as slide 22, here used to discuss referred expressions in the generated steps.)
  24. (5.9) Restraining Duplicated Ingredients and Actions 25 (Table, in

    Japanese: the same example as slide 22, highlighting duplicated ingredients and actions in the baseline output.)
  25. (5.10) Limitation 26 (Table, in Japanese: the same example

    as slide 22; only a slight difference remains between the half and full model outputs.)
  26. (5.11) Satisfactory Number of Labeled Data Items How much labeled

    data is required to outperform the baselines? Answer: Images2seq: 100%; SSiD, SSiL, GLAC Net: 25%-50%; RetAttn: 0%. In general, more labeled data improved performance. 27 (Figure: I-B1, I-B2, I-RL, Ac-B1, Ac-B2, and Ac-RL scores for each model as the labeled fraction grows from 0 through 1/8, 1/4, and 1/2 to 1.)
  27. (6.) Conclusion 28 Research Objective: We investigated the impact of

    explicitly introducing such a structure on the task of procedural text generation from an image sequence. Proposed Dataset / Method: • Proposed a new dataset: visual SIMMR (vSIMMR). • Proposed a novel structure-aware procedural text generation model. Experiments / Results: • Evaluated from three perspectives: automatic evaluation, human evaluation, and qualitative analysis. • The proposed models generate coherent procedural texts in a versatile manner.
  28. VAE Modules for Semi-Supervised Learning Two VAEs: tree2image VAE and

    step2image VAE. 30 (Figure: the tree2image VAE and step2image VAE attached to processes (i)-(iii), linking the merging-tree nodes 1-4 and the generated steps back to the images.)
  29. Applications for Text Generation 31 Our approach can be used

    for various text generation tasks. Machine translation: tree2seq [Eriguchi+ ACL16] shows the effectiveness of encoding a syntax tree. Source code summarization: [Ahmad+ ACL20] generates source code summaries using a Transformer.
  30. Annotation Process 32 We annotated merging trees with the following

    web tool. Rules: 1. Annotate merging trees based on the images. 2. If some materials cannot be seen in an image (e.g., seasoning), annotators may refer to the procedural text. (Figure: the annotation tool showing the image sequence, merging tree, and procedural text.)
  31. Merging Tree Accuracy 33 Baseline: tree prediction only. Generating a

    procedural text contributes to merging tree accuracy.

                          Material-to-step  Step-to-step  Total
    Tree prediction only  70.1              90.0          80.3
    Images2seq            72.9              90.0          81.0
    GLAC Net              73.7              90.5          81.7
    SSiD                  72.8              90.7          81.3
    SSiL                  73.8              90.5          81.7
    RetAttn               73.0              91.0          81.5
  32. List of Publications 34 Journals 1. Taichi Nishimura, Atsushi

    Hashimoto, Shinsuke Mori, "Procedural Text Generation from a Photo Sequence Focusing on Important Words" (in Japanese), Journal of Natural Language Processing, Vol. 27 (2). International Conferences 1. Taichi Nishimura, Suzushi Tomori, Hayato Hashimoto, Atsushi Hashimoto, Yoko Yamakata, Jun Harashima, Yoshitaka Ushiku, Shinsuke Mori, "Visual Grounding Annotation of Recipe Flow Graph", In LREC 2020. 2. Taichi Nishimura, Atsushi Hashimoto, Shinsuke Mori, "Procedural Text Generation from an Image Sequence", In INLG 2019. 3. Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, Shinsuke Mori, "Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe", In CEA 2019 [Best Paper Award].