Slide 1


Structure-Aware Procedural Text Generation from an Image Sequence
Taichi Nishimura
Supervisor: Prof. Shinsuke Mori
Master's thesis, final defense

Slide 2


(1.1) Background
Procedural texts are the accumulated wisdom of human beings, describing the steps to accomplish tasks. Generating a procedural text from visual observation (e.g., an image or a video) has various real-world applications:
• Voice support systems [Nouri+ SIGIR20]
• Robotic assistants [Bollini+ 12]
• Multimedia archives
(Figure: procedural text generation turns images such as "Cut Chicken" and "Stir-fry Chicken" into database entries like "Cut chicken.. Stir-fry it..", which feed high-level applications.)

Slide 3


(1.2) Procedural Text Generation from an Image Sequence
Input: an image sequence (taken by users)
Output: a procedural text (each step corresponds to an image)
(Figure: an image sequence over time t mapped to steps such as "Cut the chicken. Add seasoning.", "Put oil into the pan. Add the chicken.", and "Add the cabbage to the pan. Steam it.")

Slide 4


(1.3) Understanding Context Dependency
Models must understand the context dependency among steps. In natural language understanding, several studies have represented it as a tree or graph structure:
• Recipe Flow Graph [Mori+ LREC14]
• SIMMR [Jermsurawong+ EMNLP15]
(Figure: steps over time t, e.g., "Boil pasta", "Cut vegetables", "Stir-fry it", "Serve on a plate", linked by their dependencies.)

Slide 5


(1.4) SIMMR [Jermsurawong and Habash EMNLP15]
SIMMR is a merging tree:
• All leaves are materials.
• Intermediate nodes are instructions.
(Figure from "Predicting the structure of cooking recipes", Figure 1.)

Slide 6


(1.5) Research Objective and Contributions
Research objective: investigate the impact of explicitly introducing such a structure on the task of procedural text generation from an image sequence.
Contributions:
• Propose a new dataset: visual SIMMR (vSIMMR).
• Propose a novel structure-aware procedural text generation model.
(Figure: inputs — materials (Tomatoes, Pumpkin, Ketchup, Mayonnaise), a merging tree, and an image sequence — and the generated procedural text, Steps 1–4: "Cut the …", "Stir-fry the …", "Put them …", "Cover them ….")

Slide 7


(2.1) Multimedia Research in How-to Domains
Cooking is one of the most popular how-to domains.
(Figure: related work grouped by modality — Language: SIMMR, Flow Graph, Ingredient Recognition; Vision: Kitchen Action Recognition; Vision and Language: Im2recipe, RecipeQA.)

Slide 8


(2.2) A Comparison of vSIMMR with Other How-to Datasets

Datasets              | Recipes? | Ingredients? | Structure? | Visual Data    | #Recipes
Breakfast             |          |              |            | Video          | N/A
EPIC-Kitchen          |          |              |            | Video          | N/A
YouCook2              | ✔        |              |            | Video          | 89
RecipeQA              | ✔        |              |            | Image sequence | 19,779
Storyboarding         | ✔        |              |            | Image sequence | 16,405
Cookpad Image Dataset | ✔        | ✔            |            | Image sequence | 1,715,595
Recipe Flow Graph     | ✔        | ✔            | ✔          | N/A            | 266
Action Graph          | ✔        | ✔            | ✔          | N/A            | 133
SIMMR                 | ✔        | ✔            | ✔          | N/A            | 260
vSIMMR                | ✔        | ✔            | ✔          | Image sequence | 2,103

Slide 9


(2.3) Procedural Text Generation from Various Inputs
Generated procedural texts must be coherent. Inputs divide into two resources: language and vision.
Language information:
• Title and ingredients: attention mechanism [Kiddon+ EMNLP16], reinforcement learning [Bosselut+ NAACL16]
Visual information:
• Finished dish image [Salvador+ CVPR19]
• Cooking video [Ushiku+ IJCNLP17], [Shi+ ACL19]
• Image sequence [Nishimura+ INLG19], [Chandu+ ACL19] — this work is a structure-aware version

Slide 10


(3.) visual SIMMR (vSIMMR)
Extension points:
1. vSIMMR is a vision-and-language version of SIMMR.
2. vSIMMR is about 8 times larger than SIMMR (SIMMR: 260 recipe–tree pairs / vSIMMR: 2106 pairs).
(Figure: (a) SIMMR — materials (Tomatoes, Pumpkin, Ketchup, Mayonnaise) merged by Steps 1–4; (b) vSIMMR (ours) — the same merging tree paired with an image sequence (Images 1–4) and a procedural text: "Cut the tomatoes. Stir-fry the pumpkin. Put them on the tomatoes. Cover them with ketchup and mayonnaise.")

Slide 11


(4.1) An Overview of Our Method
Two parts:
1. Structure-aware procedural text generation model (Section 4.1)
   (i) Link Probability Matrix (LPM) Calculation
   (ii) Gumbel Softmax Resampling
   (iii) Structured Decoder
2. Tree re-prediction module (Section 4.2)
(Figure: the materials (Tomatoes, Pumpkin, Ketchup, Mayonnaise) and the image sequence feed the material-to-step and step-to-step LPMs, each under an order constraint; Gumbel Softmax resampling yields the merging tree; the structured decoder generates the procedural text (Steps 1–4); the tree re-prediction module then re-predicts both LPMs from the generated text.)

Slide 12


(4.2) Link Probability Matrix (LPM) Calculation
A merging tree can be predicted from a probability matrix. Because a processed material cannot be merged by returning to a previous step, step-to-step links obey an order constraint: a step may only link to a later step.
(Figure: material-to-step and step-to-step LPMs for Tomatoes, Pumpkin, Ketchup, and Mayonnaise, with the order constraint masking backward links.)
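The order constraint above can be sketched as a boolean mask applied to the step-to-step scores before a row-wise softmax. A minimal NumPy sketch (function and variable names are illustrative, not from the thesis):

```python
import numpy as np

def step_to_step_lpm(scores):
    """Row-wise softmax over pairwise step scores, with backward
    links masked out: step i may only merge into a later step j > i."""
    n = scores.shape[0]
    allowed = np.triu(np.ones((n, n), dtype=bool), k=1)  # order constraint
    lpm = np.zeros_like(scores, dtype=float)
    for i in range(n - 1):  # the final step has no later step to link to
        row = np.where(allowed[i], scores[i], -np.inf)
        e = np.exp(row - row[allowed[i]].max())  # exp(-inf) -> 0
        lpm[i] = e / e.sum()
    return lpm

lpm = step_to_step_lpm(np.zeros((4, 4)))
# each earlier step spreads its probability only over later steps
```

With uniform scores, each row distributes probability evenly over the allowed later steps, and the lower triangle stays exactly zero.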

Slide 13


(4.3) Gumbel Softmax Resampling [Jang+ ICLR17]
(Figure: the material-to-step and step-to-step LPMs are each resampled with Gumbel Softmax to obtain a discrete merging tree while keeping the model differentiable.)
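The resampling step amounts to adding Gumbel(0, 1) noise to the logits and applying a temperature-scaled softmax. A minimal NumPy sketch of [Jang+ ICLR17] (names are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """softmax((logits + Gumbel(0,1) noise) / tau): a differentiable,
    near-one-hot sample from the categorical given by `logits`."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = (logits + rng.gumbel(size=logits.shape)) / tau
    e = np.exp(y - y.max())  # max-shift for numerical stability
    return e / e.sum()

sample = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])))
```

As tau approaches 0 the sample approaches a one-hot vector; here each row of both LPMs would be resampled this way to pick discrete tree links.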

Slide 14


(4.4) Structured Decoder: Tree-LSTM [Tai+ ACL15]
Child-Sum Tree-LSTM, a variant of Tree-LSTM, encodes the context dependency by summing the hidden states of child nodes.
(Figure: a Tree-LSTM cell with input gate i, per-child forget gates f, and (h, c) states flowing up from child nodes 1–4.)
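A single node update of the Child-Sum Tree-LSTM can be sketched as follows (a minimal NumPy rendering of the [Tai+ ACL15] equations; the parameter layout is my own, and the sketch assumes at least one child):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x, children, W, U, b):
    """children: list of (h_k, c_k) pairs from child nodes.
    W[g], U[g], b[g] hold the parameters of gate g in {i, f, o, u}."""
    h_tilde = np.sum([h for h, _ in children], axis=0)   # sum of children
    i = sigmoid(W['i'] @ x + U['i'] @ h_tilde + b['i'])  # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_tilde + b['o'])  # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_tilde + b['u'])  # candidate cell
    # one forget gate per child, conditioned on that child's hidden state
    c = i * u + sum(sigmoid(W['f'] @ x + U['f'] @ h_k + b['f']) * c_k
                    for h_k, c_k in children)
    h = o * np.tanh(c)
    return h, c
```

In the structured decoder, each step node would combine its children in the predicted merging tree this way before the step's sentence is generated.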

Slide 15


(4.5) Structured Decoder: Procedural Text Generation
(Figure: the merging tree is fed to the procedural text generator, which outputs the procedural text, Steps 1–4: "Cut the tomatoes. Stir-fry the pumpkin. Put them on the tomatoes. Cover them with ketchup and mayonnaise.")

Slide 16


(4.6) Merging Tree Re-prediction
(Figure: from the generated procedural text (Steps 1–4), the material-to-step and step-to-step matrices are re-predicted.)

Slide 17


(4.7) Loss Function
Merging tree / procedural text losses:
• Procedural text generation loss L_text
• Merging tree prediction loss L_tp (labeled data only, compared with the ground truth)
• Merging tree re-prediction loss L_tc:
  – (labeled) compared with the ground truth
  – (unlabeled) compared with the argmax of the predicted merging tree
VAE losses (β-VAE) for semi-supervised learning:
• Tree2image VAE loss L_t2i
• Step2image VAE loss L_s2i
Total loss: L = L_text + L_tp + λ_tc·L_tc + λ_t2i·L_t2i + λ_s2i·L_s2i, with each λ_* ∈ {0.001, 0.01, 0.1}
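The total objective is a plain weighted sum of the five terms. A sanity-check sketch (argument names are illustrative):

```python
def total_loss(l_text, l_tp, l_tc, l_t2i, l_s2i,
               lam_tc=0.01, lam_t2i=0.01, lam_s2i=0.01):
    """L = L_text + L_tp + lam_tc*L_tc + lam_t2i*L_t2i + lam_s2i*L_s2i,
    where each lambda is chosen from {0.001, 0.01, 0.1}."""
    return l_text + l_tp + lam_tc * l_tc + lam_t2i * l_t2i + lam_s2i * l_s2i
```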

Slide 18


(5.1) Datasets
(a) Cookpad Image Dataset [Harashima+ SIGIR17]: image sequences and recipes (Japanese)
(b) vSIMMR: image sequences, ingredient trees, and recipes (Japanese)

(a) Cookpad Image Dataset
             | train   | val    | test
#recipes     | 163,525 | 18,051 | 20,193
#steps       | 6.24    | 6.15   | 6.26
#words       | 148.52  | 147.02 | 148.50
#ingredients | 7.85    | 7.79   | 7.86

(b) vSIMMR
             | train   | val    | test
#recipes     | 1,603   | 250    | 250
#steps       | 6.78    | 6.74   | 6.85
#words       | 118.23  | 113.91 | 114.68
#ingredients | 6.58    | 6.37   | 6.64

Slide 19


(5.2) Procedural Text Generators
Procedural text generation models:
• Images2seq [Huang+ NAACL16]: a baseline for visual storytelling (ViST)
• GLAC Net [Kim+ arXiv18]: the SoTA model at the ViST Workshop
• SSiD [Chandu+ ACL19]
• SSiL [Chandu+ ACL19]
• RetAttn [Nishimura+ INLG19]
Ablations:
• Half model: does not use the tree re-prediction loss
• Full model: uses the tree re-prediction loss
Note: for a fair comparison, we add a module that incorporates ingredient vectors into the models.

Slide 20


(5.3) Commonly Used Word-Overlap Metrics

Model      | BLEU1 | BLEU4 | ROUGE-L | Distinct-1 | Distinct-2
Images2seq | 27.5  | 5.1   | 18.4    | 38.3       | 54.7
 + Half    | 27.8* | 5.8*  | 20.6*   | 51.1*      | 75.0*
 + Full    | 29.6* | 6.3*  | 21.7*   | 47.6*      | 71.0*
GLAC Net   | 28.5  | 5.9   | 21.4    | 46.6       | 69.0
 + Half    | 28.5  | 5.9   | 21.8*   | 46.7       | 68.8
 + Full    | 28.9  | 6.1   | 21.3    | 47.2*      | 69.9*
SSiD       | 28.7  | 6.0   | 20.9    | 45.5       | 66.6
 + Half    | 30.3* | 6.2*  | 20.8    | 43.9       | 65.1
 + Full    | 31.1* | 6.4*  | 21.6*   | 48.3*      | 71.0*
SSiL       | 31.0  | 6.3   | 21.4    | 45.5       | 66.8
 + Half    | 28.1  | 5.4   | 21.4    | 46.7*      | 68.2*
 + Full    | 30.4  | 6.4   | 21.9*   | 47.3*      | 70.9*
RetAttn    | 32.2  | 6.5   | 21.6    | 40.2       | 60.3
 + Half    | 32.2  | 6.5   | 21.8    | 52.4*      | 77.8*
 + Full    | 33.2  | 7.1   | 22.1    | 52.7*      | 78.6*

The proposed method boosts performance in a versatile manner.

Slide 21


(5.4) Word-Overlap Metrics for Coherency Evaluation
I = ingredient words, Ac = action words, as defined by [Mori et al. LREC2014].

Model      | I-BLEU1 | I-BLEU2 | I-ROUGE | Ac-BLEU1 | Ac-BLEU2 | Ac-ROUGE
Images2seq | 7.0     | 1.8     | 9.7     | 18.2     | 8.1      | 18.4
 + Half    | 7.9*    | 2.2*    | 12.7*   | 18.8*    | 8.4*     | 21.2*
 + Full    | 8.8*    | 2.7*    | 13.8*   | 21.4*    | 9.4*     | 22.9*
GLAC Net   | 9.5     | 2.9     | 13.2    | 21.2     | 9.3      | 22.8
 + Half    | 10.5*   | 3.4*    | 15.6*   | 21.0     | 9.4      | 23.1
 + Full    | 11.8*   | 3.8*    | 16.3*   | 21.0     | 9.1      | 22.5
SSiD       | 8.6     | 2.7     | 13.1    | 20.1     | 8.7      | 21.6
 + Half    | 11.4*   | 3.7*    | 16.4*   | 19.8     | 8.7      | 22.1*
 + Full    | 12.9*   | 4.1*    | 16.7*   | 21.5*    | 9.5*     | 23.5*
SSiL       | 8.3     | 2.4     | 12.5    | 19.8     | 8.7      | 22.1
 + Half    | 11.3*   | 3.7*    | 16.9*   | 18.7     | 8.1      | 22.2
 + Full    | 12.4*   | 4.2*    | 17.0*   | 21.4*    | 9.5*     | 23.0*
RetAttn    | 11.2    | 3.3     | 14.5    | 22.1     | 9.0      | 21.2
 + Half    | 11.9*   | 3.7*    | 14.8    | 21.8     | 9.4*     | 22.9*
 + Full    | 12.1*   | 3.8*    | 14.8    | 22.3     | 9.5*     | 23.1*

The proposed method generates coherent procedural texts.

Slide 22


(5.6) Human Evaluation
We evaluated 120 generated recipes from three perspectives: Fluency (F), Ingredient use (IU), and Image fitting (IF). The proposed method clearly outperforms the baseline on all metrics.

All 120 recipes:
   | Baseline | Full | Tie
F  | 26.7     | 62.5 | 10.8
IU | 30.8     | 66.7 | 2.5
IF | 31.7     | 45.0 | 23.3

Longest 30 recipes:
   | Baseline | Full | Tie
F  | 16.7     | 70.0 | 13.3
IU | 26.7     | 73.3 | 0.0
IF | 36.7     | 50.0 | 13.3

Slide 23


(5.7) Generated Procedural Text Example
(Table: given the input image sequence and ingredient list of a Japanese recipe, the step-by-step outputs (Steps 1–6) of RetAttn (baseline), RetAttn (half model), and RetAttn (full model) alongside the ground truth, together with the merging trees generated by the half and full models.)

Slide 24


(5.8) Referred Expression
(Table: the same generated example as in (5.7), highlighting referred expressions in the generated steps.)

Slide 25


(5.9) Restraining Duplicated Ingredients and Actions
(Table: the same generated example as in (5.7), highlighting duplicated ingredients and actions.)

Slide 26


(5.10) Limitation
(Table: the same generated example as in (5.7), highlighting a remaining slight difference from the ground truth.)

Slide 27


(5.11) Satisfactory Number of Labeled Data Items
How much labeled data is required to outperform the baselines?
Answer: Images2seq: 100%; SSiD, SSiL, GLAC Net: 25%–50%; RetAttn: 0%.
In general, more labeled data increased performance.
(Figure: I-B1, I-B2, I-RL, Ac-B1, Ac-B2, and Ac-RL scores for Images2seq, GLAC Net, SSiD, SSiL, and RetAttn as the labeled fraction varies over 0, 1/8, 1/4, 1/2, and 1.)

Slide 28


(6.) Conclusion
Research objective: I investigated the impact of explicitly introducing such a structure on the task of procedural text generation from an image sequence.
Proposed dataset / method:
• Proposed a new dataset: visual SIMMR (vSIMMR).
• Proposed a novel structure-aware procedural text generation model.
Experiments / results:
• Evaluated on three criteria: automatic evaluation, human evaluation, and qualitative analysis.
• The proposed models generate coherent procedural texts in a versatile manner.

Slide 29


Appendix  

Slide 30


VAE Modules for Semi-Supervised Learning
Two VAEs: a tree2image VAE and a step2image VAE.
(Figure: after processes (i)–(iii), the tree2image VAE reconstructs the images (1–4) from the merging tree, and the step2image VAE from the generated steps, e.g., "Cut the …", "Stir-fry …", "Put them…", "Cover them….")

Slide 31


Applications for Text Generation
Our approach can be used for various text generation tasks:
• Machine translation: tree2seq [Eriguchi+ ACL16] showed the effectiveness of encoding a syntax tree.
• Source code summarization: [Ahmad+ ACL20] generates source code summaries with a Transformer.
(Figure: the tree2seq architecture [Eriguchi+ ACL16].)

Slide 32


Annotation Process
We annotate merging trees with a web tool, following two rules:
1. Annotate merging trees based on the images.
2. If some materials cannot be seen in an image (e.g., seasoning), annotators may refer to the procedural text.
(Figure: the annotation tool showing the image sequence, the merging tree, and the procedural text.)

Slide 33


Merging Tree Accuracy
Baseline: tree prediction only. Generating a procedural text contributes to merging tree accuracy.

Model                | Material-to-step | Step-to-step | Total
Tree prediction only | 70.1             | 90.0         | 80.3
Images2seq           | 72.9             | 90.0         | 81.0
GLAC Net             | 73.7             | 90.5         | 81.7
SSiD                 | 72.8             | 90.7         | 81.3
SSiL                 | 73.8             | 90.5         | 81.7
RetAttn              | 73.0             | 91.0         | 81.5

Slide 34


List of Publications
Journals
1. Taichi Nishimura, Atsushi Hashimoto, Shinsuke Mori. A Japanese-language article on procedural text generation from a photo sequence. Journal of Natural Language Processing, Vol. 27 (2).
International Conferences
1. Taichi Nishimura, Suzushi Tomori, Hayato Hashimoto, Atsushi Hashimoto, Yoko Yamakata, Jun Harashima, Yoshitaka Ushiku, Shinsuke Mori. "Visual Grounding Annotation of Recipe Flow Graph". In LREC2020.
2. Taichi Nishimura, Atsushi Hashimoto, Shinsuke Mori. "Procedural Text Generation from an Image Sequence". In INLG2019.
3. Taichi Nishimura, Atsushi Hashimoto, Yoko Yamakata, Shinsuke Mori. "Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe". In CEA2019 [Best Paper Award].