
Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe

Taichi Nishimura
February 02, 2021


Transcript

  1. Frame Selection for Producing Recipe with Pictures from an Execution Video of a Recipe

    Taichi Nishimura♣, Atsushi Hashimoto♦, Yoko Yamakata♠, Shinsuke Mori♣
    ♣ Kyoto University, Japan
    ♦ Omron Sinic X Corporation, Japan
    ♠ The University of Tokyo, Japan
  2. Background: Multimedia recipes assist people in cooking

    https://cookpad.com/recipe/5479663
    • Multimedia types:
      • text + pictures
      • text + video clips
    • Producing a multimedia recipe is costly
  3. Research Objective

    Generation of multimedia recipes from a text recipe and its execution video
    • Text recipe (executed to give the execution video):
      Step 1. Cut the chicken into bite-size pieces, then cover it with flour.
      Step 2. …
    • Image + instruction:
      1.1 Cut the chicken into bite-size pieces,
      1.2 then cover it with flour.
  4. Research Objective

    Generation of multimedia recipes from a text recipe and its execution video
    • Text recipe (executed to give the execution video):
      Step 1. Cut the chicken into bite-size pieces, then cover it with flour.
      Step 2. …
    • Image + instruction:
      1.1 Cut the chicken into bite-size pieces,
      1.2 then cover it with flour.
  5. Related Work

    • Malmaud et al. [NAACL 2015]
      • Input: text, edited video, and narration
    • Bojanowski et al. [ICCV 2014]
      • Input: text, non-edited video, and annotation
    • Ours
      • Input: text and non-edited video
    Figure: Automatic illustration of a recipe [Malmaud et al., NAACL 2015]
  6. Summary

    • Our goal: producing multimedia recipes that assist people in cooking
    • Proposed method: co-occurrence of key objects, which is related to the procedure
      • Instruction "Cut a carrot": matched frame >> less-matched frame
    • We confirmed a steady improvement over the baselines.
  7. Overview

    (A) Conversion into feature vectors
    (B) Two matching criteria (cross-modal embedding, scene importance)
    (C) Instruction-frame alignment with the Viterbi algorithm
    Example: for "Cut the chicken", the instructions "1. Cut the chicken", "2. then, cover it with flour.", …, "N. …" are aligned with video frames to form the multimedia recipe.
  8. Recipe Named Entity (r-NE) [Mori et al., LREC 2014]

    NE tag   Meaning
    F        Food
    T        Tool
    Ac       Action by the chef
    Af       Action by the foods
    Sf       State of foods
    St       State of tools
    D        Duration
    Q        Quantity
    • We use F, T, and Ac in the proposed method
  9. (A-1) Preprocess

    • Recipe: divided at the actions (annotation)
      • Step 1. Cut the chicken into bite-size pieces, then cover it with flour.
      • → Step 1. Cut(Ac) the chicken into bite-size pieces, then cover(Ac) it with flour.
      • (The original language is Japanese.)
    • Video: key frame extraction based on touch and release
  10. (A-2) Conversion into feature vector

    • Recipe: extracting the object categories of foods and tools
      • Key objects = Food, Tool (r-NE)
      • "Cut the chicken(Food) into bite-size pieces." → [0,0,⋯,1,⋯,0] (1 at chicken)
    • Video: extracting object categories, object locations, and probabilities
      • r-NE and probability: knife 0.87, chicken 0.56, board 0.43
      • → [0,0.87,⋯,0.56,⋯,0.43,⋯,0] (entries at knife, chicken, board)
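The conversion on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vocabulary `VOCAB`, its ordering, and the function names `instruction_vector` and `frame_vector` are assumptions made for the example; only the object names and probabilities come from the slide.

```python
# Hypothetical vocabulary of key objects (foods and tools); the real one is not given on the slide.
VOCAB = ["bowl", "knife", "chicken", "board", "flour"]


def instruction_vector(key_objects, vocab=VOCAB):
    """Binary vector: 1.0 at each vocabulary entry named in the instruction."""
    names = set(key_objects)
    return [1.0 if word in names else 0.0 for word in vocab]


def frame_vector(detections, vocab=VOCAB):
    """Vector of detection probabilities; undetected objects get 0.0.

    If an object is detected several times, the highest probability is kept
    (an assumption of this sketch).
    """
    best = {}
    for name, prob in detections:
        best[name] = max(prob, best.get(name, 0.0))
    return [best.get(word, 0.0) for word in vocab]


# Instruction "Cut the chicken(Food) ...": the only key object is "chicken".
recipe_vec = instruction_vector(["chicken"])
# Detections for one key frame, using the probabilities shown on the slide.
video_vec = frame_vector([("knife", 0.87), ("chicken", 0.56), ("board", 0.43)])
```

Both vectors share the same index space, which is what lets the next step compare them directly.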
  11. (B-1) Matching Score Calculation: cosine similarity

    • Key objects appear at the same time: matched frame
    • Recipe: "Cut the chicken(Food) into bite-size pieces." → [0,0,⋯,1,⋯,0] (1 at chicken)
    • Video: knife 0.87, chicken 0.56, board 0.43 (r-NE and probability) → [0,0.87,⋯,0.56,⋯,0.43,⋯,0]
    • The matching score is the cosine similarity between the recipe vector and the video vector
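As a worked example of this score, the sketch below computes the cosine similarity for the slide's numbers. The five-dimensional layout of the vectors is an assumption for illustration (the slide elides the other entries), and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Assumed index order: [bowl, knife, chicken, board, flour].
recipe_vec = [0.0, 0.0, 1.0, 0.0, 0.0]     # instruction mentions only "chicken"
video_vec = [0.0, 0.87, 0.56, 0.43, 0.0]   # detection probabilities in one key frame
score = cosine_similarity(recipe_vec, video_vec)
```

With these toy vectors the score is about 0.50: the frame does contain the chicken, but the detector is more confident about the knife and the board, which lowers the similarity.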
  12. (B-2) Matching Score Calculation: scene importance

    • Changes in visual appearance imply progress in the instruction
    • Instruction: Cut a chicken(Food)
    • Appearance features of key frames 1, 2, 3 are extracted with ResNet50 (trained on ImageNet)
      • small difference: frame 2 is not important
      • large difference: frame 3 may be important
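One way to read this slide is that a key frame is scored by how much its appearance feature differs from that of the previous key frame. The sketch below assumes exactly that, with Euclidean distance as the difference measure; the slide does not state the formula, so both the measure and the name `scene_importance` are assumptions, and short toy vectors stand in for ResNet50 features.

```python
import math


def scene_importance(features):
    """Score each key frame by the distance to the previous frame's feature.

    The first frame has no predecessor and gets 0.0 (an assumption of this sketch).
    """
    if not features:
        return []
    scores = [0.0]
    for previous, current in zip(features, features[1:]):
        scores.append(math.dist(previous, current))
    return scores


# Toy 2-D features standing in for the appearance features of key frames 1, 2, 3.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
importance = scene_importance(toy_features)
```

Here frame 2 barely differs from frame 1 and gets a low score, while frame 3 differs strongly from frame 2 and gets a high one, matching the "small difference" and "large difference" cases on the slide.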
  13. (C) Producing a multimedia recipe

    • Frame selection is based on the Viterbi algorithm.
    • Running the Viterbi algorithm on the matching scores gives an alignment, which serves as the multimedia recipe.
    Figure: alignment grid of video frames against instructions (1. Cut the .., 2. Cover it …, 3. Add it …, 4. Boil it …), with three kinds of moves:
    • skip the video frame
    • skip the instruction
    • match between video frame and instruction
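The three moves on this slide (skip a frame, skip an instruction, match) can be sketched as a dynamic program over the grid of instructions and key frames. This is an illustrative sketch in the spirit of the Viterbi search, not the authors' exact formulation: it assumes skips cost nothing, a match adds the matching score, and ties are resolved in favour of skipping. The name `align` and the toy score matrix are hypothetical.

```python
def align(score):
    """Order-preserving alignment of instructions to key frames.

    score[i][j] is the matching score of instruction i and key frame j.
    Returns one frame index (or None for "no image") per instruction.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j]: best total score using the first i instructions and first j frames.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if j > 0:
                candidates.append((best[i][j - 1], "skip_frame"))
            if i > 0:
                candidates.append((best[i - 1][j], "skip_instruction"))
            if i > 0 and j > 0:
                candidates.append((best[i - 1][j - 1] + score[i - 1][j - 1], "match"))
            # max() keeps the first maximal candidate, so ties prefer a skip.
            best[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
    # Trace the chosen moves back from the final cell.
    alignment = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "match":
            alignment[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_frame":
            j -= 1
        else:
            i -= 1
    return alignment


# Toy scores: 3 instructions x 4 key frames.
toy_scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.1],
    [0.7, 0.0, 0.0, 0.3],
]
result = align(toy_scores)  # [0, 2, 3]
```

In the toy example the third instruction scores highest on frame 0, but that frame precedes the frames already matched to instructions 1 and 2, so the order-preserving search picks frame 3 instead. Picking the per-instruction maximum independently, as in the cosine-only baseline, would not enforce this ordering.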
  14. Dataset

    • KUSK Dataset [Hashimoto et al., CEA 2014]
      • 12 recipes/videos
      • no annotation
      • Videos taken by three fixed cameras
    • We preprocess the dataset: extraction of key frames and phrases
    Figure: Sensors in the smart kitchen environment [Hashimoto et al., CEA 2014]
  15. Manual Annotation

    • Two annotators attach suitable frames to each instruction (key frames, instructions).
    Rules
    • Select the frames suitable for the phrase from among the key frames.
    • You may select multiple frames, or no frame at all.
    • Select frames in which the action happens at the center.
  16. Annotation Results

    • The two annotators attached enough frames to evaluate the proposed method
    • 72% (=143/198) of the instructions have at least one frame attached
    • AND: intersection between the two annotators
    • OR: union between the two annotators

                                    Annotator 1   Annotator 2   AND   OR
    # of instructions w/ frames     130           125           112   143
    # of instructions w/o frames    68            73            86    55
    Total: 198 instructions
  17. Results

    • Comparison of the three methods: Average intersection
      a) cos (baseline 1): the frame with the highest cosine similarity for each instruction
      b) +Viterbi (baseline 2)
      c) +scene (our full model)
      d) manual: two annotators' intersection

                             Recall   Precision   F-measure
      a) cos                 0.046    0.059       0.052
      b) cos+Viterbi         0.099    0.134       0.114
      c) cos+Viterbi+scene   0.135    0.176       0.153
      d) Manual              0.749    0.875       0.805

    • Steady increase compared with the baselines.
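For reference, recall, precision, and F-measure for one recipe can be computed from the selected and the annotated (instruction, frame) pairs as sketched below. The pair representation, the helper name `precision_recall_f`, and the toy data are assumptions; the slide does not specify how the per-recipe values were averaged, so this sketch is not expected to reproduce the table.

```python
def precision_recall_f(predicted, reference):
    """Precision, recall, and F-measure of predicted pairs against reference pairs."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy (instruction index, key-frame index) pairs, not taken from the dataset.
selected = {(0, 2), (1, 5), (2, 9)}
annotated = {(0, 2), (1, 4), (2, 9), (3, 11)}
p, r, f = precision_recall_f(selected, annotated)
```

In this toy case two of the three selected pairs are correct and two of the four annotated pairs are found, giving a precision of 2/3, a recall of 1/2, and an F-measure of 4/7.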
  18. Impact of the scene importance

    • Scene importance helps to select frames as a human would.
    • Instruction: Chop an onion(Food), carrot(Food), and green pepper(Food)
    • Frames shown: cos+Viterbi, cos+Viterbi+scene (manual)
  19. Error Analysis: cos + Viterbi

    • An example of a score drop in the cosine similarity.
    • Instruction: Chop an onion(Food), carrot(Food), and green pepper(Food)
    • Frames shown: low detection probability, high detection probability, because of bias in the object detection model.
  20. Limitation of the method: Zero anaphora and omission

    • A typical case in which the feature vector fails to be extracted.
    • Zero anaphora / omissions
      • These problems occur in recipes [Malmaud+ ACL2014].
    • Instruction: (Peel a potato(F) and,) cut it into bite-size pieces.
      • Japanese original: (じゃがいもの皮を剥いて,) 一口大の大きさに切る
      • The parenthesized part is the preceding instruction; the object is unmentioned in Japanese.
    • Frames shown: cos+Viterbi+scene (no image), manual
  21. Limitation of the method: Detection error

    • Typical mistakes in object detection.
    • They depend on the training dataset.
    • Instruction: Cover it with flour(F)
    • Frames shown: cos+Viterbi+scene, manual
  22. Conclusion

    Summary
    • Producing a multimedia recipe to assist people in cooking.
    • Task definition: fine-grained alignment between instructions and frames
    • Scene importance yields a steady improvement over the baseline
    Future Work
    • Incorporating action information into the model
    • Training a joint embedding network.