Slide 1

Slide 1 text

Connecting Vision and Language with Video Localized Narratives, shade-tree (Twitter: @shade_tree2112) [PDF] [Project Page of the paper], 59th Computer Vision Study Group @ Kanto, CVPR 2023 Reading Session, 2023/07/23. This slide deck

Slide 2

Slide 2 text

Preamble

Slide 3

Slide 3 text

Thoughts while deciding which paper to present this time
• "The award candidate 'Ego-Body Pose Estimation via Ego-Head Pose Estimation'... I already presented it last time..." (a plug)
  (previous deck: "Ego-Body Pose Estimation via Ego-Head Pose Estimation", shade-tree, Twitter: @shade_tree2112, [Project Page], 58th Computer Vision Study Group @ Kanto, "Deep Learning + 3D Paper Reading Session", 2023/04/30)

Slide 4

Slide 4 text

Thoughts while deciding which paper to present this time
• CVPR 2023 Accepted Papers
  • https://cvpr2023.thecvf.com/Conferences/2023/AcceptedPapers
• A list with extra information such as award candidates and highlights is also public
  • https://docs.google.com/spreadsheets/d/1OAUf7sQfJ6cSU4BiOtyl-t4dMm4iFqdEDHCSs7R2jZo
• Search this list
  • Filter by "Highlight"; search for "Story", "Stories", "Narrative", and so on
  • "Vision and Language" with "Narratives"! Let's go with this one!

Slide 5

Slide 5 text

The presenter's background and perspective
• At the CV study group, I have had the opportunity to present the following papers:
  • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR 2017]
  • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy et al., ICLR 2021]
  • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
  • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
  • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al., ICLR 2022]
  • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection" [Mohamed et al., CVPR 2022]
  • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]
• What I want to say on this page: (my own talks are modest, but) a wide variety of papers get presented at the CV study group, and the archive of past talks is a great resource!

Slide 6

Slide 6 text

Papers especially relevant to today's paper
• At the CV study group, I have had the opportunity to present the following papers:
  • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR 2017]
  • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy et al., ICLR 2021]
  • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
  • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
  • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al., ICLR 2022]
  • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection" [Mohamed et al., CVPR 2022]
  • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]

Slide 7

Slide 7 text

The paper I presented previously [PDF] [Project Page]

Slide 8

Slide 8 text

The lineage of Localized Narratives
• Localized Narratives [Pont-Tuset et al., ECCV 2020]
• Panoptic Narrative Grounding [González et al., ICCV 2021]
  • Presented previously
• Video Localized Narratives [Voigtlaender et al., CVPR 2023]
  • Presented today

Slide 9

Slide 9 text

Localized Narratives [Pont-Tuset et al., ECCV 2020]
• Proposed a multimodal format for image annotation
• While describing the image by voice, the annotator simultaneously hovers the mouse over the region being attended to
• Because the voice and the mouse movement are synchronized, each word can be associated with a location in the image
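The word-to-location association can be sketched as follows. This is not the authors' code; the data shapes (word timestamps, timestamped trace points) are assumptions made for illustration.

```python
# Sketch: associate each spoken word with the mouse-trace points recorded
# during the time interval in which the word was uttered.

def align_words_to_trace(words, trace):
    """words: list of (word, start_s, end_s); trace: list of (t_s, x, y)."""
    grounding = {}
    for word, start, end in words:
        # Keep the trace samples whose timestamp falls inside the word's span.
        points = [(x, y) for t, x, y in trace if start <= t <= end]
        grounding[word] = points
    return grounding

words = [("parrot", 0.0, 0.5), ("flying", 0.6, 1.0)]
trace = [(0.1, 120, 80), (0.4, 125, 82), (0.8, 300, 150)]
result = align_words_to_trace(words, trace)
```

Synchronization is what makes this trivial: once words carry timestamps, grounding reduces to an interval lookup over the trace.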

Slide 10

Slide 10 text

[Review] Panoptic Narrative Grounding (1)
• The name of the task proposed in that paper
• Let's look at it word by word in a bit more detail
• What does "panoptic" mean?
  • Panoramic; taking in the whole at a single glance
  • pan (= all) + optic
  • Uses Panoptic Segmentation [Kirillov et al., CVPR 2019]
• What about "narrative"?
  • Sometimes used as a synonym of "story", but when the two are distinguished, it is the broader concept; if translated separately, a narrative is a "telling"
  • For example, the preamble earlier may not be a Story, but it can be called a Narrative
• Localized Narratives [Pont-Tuset et al., ECCV 2020] is the key prior work

Slide 11

Slide 11 text

[Review] Panoptic Narrative Grounding (2)
• What does "grounding" mean?
  • Here it is used in the same sense as in symbol grounding
  • Since the grounding is done through images, the term "visual grounding" appears frequently in the paper
• Symbol grounding was proposed by Harnad [1990] [link]
  • This paper describes the "symbol grounding problem": How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads? How can the meanings of the meaningless symbol tokens, manipulated solely on the basis of their (arbitrary) shapes, be grounded in anything but other meaningless symbols?

Slide 12

Slide 12 text

[Review] Panoptic Narrative Grounding (3)
• Roughly speaking, it can be understood as:
  • grounding (anchoring, associating) a narrative, not just individual words,
  • onto an image whose entirety can be surveyed at a glance

Slide 13

Slide 13 text

[Review] Panoptic Narrative Grounding (4)
• Frameworks for grounding language through images (visual grounding) already existed, but they have a granularity problem
• Richly annotated language data exists, but the corresponding image annotations are still sparse and coarse

Slide 14

Slide 14 text

[Review] Panoptic Narrative Grounding (5)
• The proposal of Panoptic Narrative Grounding
  • Uses panoptic segmentation regions as the visual grounding, newly formalizing the natural-language visual grounding problem
  • Proposed a task, a dataset, metrics, and a method: the full package!

Slide 15

Slide 15 text

Main topic. Unless otherwise noted, figures are quoted from the presented paper or from the related work discussed on the respective page

Slide 16

Slide 16 text

Overview of the presented paper (1)
• Proposes Video Localized Narratives (VidLN)
  • Localized Narratives on videos rather than still images
• Annotating a video lets the whole story be captured
  • There is a flow of events
  • Multiple people and objects appear and interact with each other
• However, annotation is harder than for still images
  • For annotators, it becomes a race against time

Slide 17

Slide 17 text

Overview of the presented paper (2)
• Proposes a new annotation protocol for VidLN
  • Makes annotation feasible even for videos in which multiple people appear, objects come and go, and complex events unfold
• A high-quality dataset usable for a variety of tasks
  • Examples: Video Narrative Grounding, Video Question Answering

Slide 18

Slide 18 text

An example of a VidLN annotation

Slide 19

Slide 19 text

These are annotations on "video", so watch them as video
• Visualizations of the annotated videos
• https://google.github.io/video-localized-narratives/

Slide 20

Slide 20 text

Structure of the paper
• Abstract
• 1. Introduction
• 2. Related Work
  • The paragraph "Tasks Related to VNG." discusses the differences from Panoptic Narrative Grounding
• 3. Video Localized Narrative Annotations
  • How the annotation is done
• 4. Video Narrative Grounding (VNG)
  • Application as a benchmark: example 1
• 5. Video Question Answering (VideoQA)
  • Application as a benchmark: example 2

Slide 21

Slide 21 text

Differences from Panoptic Narrative Grounding
• Panoptic Narrative Grounding
  • Performs segmentation that covers the entire scene
  • Grounds the nouns appearing in captions of still images
• Video Narrative Grounding (VNG, one of the tasks treated today)
  • Targets videos, and grounds specific objects

Slide 22

Slide 22 text

The annotation method
• Five-step annotation
• Instructions given to the annotators

Slide 23

Slide 23 text

Five-step annotation
① Understand the Video
② Actor Selection
③ Key-frame Selection
④ A Story for each Actor
⑤ Transcription and Time Alignment

Slide 24

Slide 24 text

Five-step annotation
① Understand the Video
First watch the video and understand its content

Slide 25

Slide 25 text

Five-step annotation
② Actor Selection
Choose which actors (people and objects) to annotate
③ Key-frame Selection
From key-frame candidates sampled uniformly along the time axis, choose the ones containing the main actions
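The uniform candidate sampling in step ③ can be sketched as below. The number of candidates and the rounding scheme are illustrative assumptions, not details taken from the paper.

```python
# Sketch: sample key-frame candidate indices evenly along the time axis.

def uniform_keyframe_candidates(num_frames, num_candidates):
    """Return frame indices spread evenly over [0, num_frames - 1]."""
    if num_candidates == 1:
        return [0]
    step = (num_frames - 1) / (num_candidates - 1)
    return [round(i * step) for i in range(num_candidates)]

candidates = uniform_keyframe_candidates(num_frames=120, num_candidates=5)
# The annotator then keeps only the candidates showing the main actions.
```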

Slide 26

Slide 26 text

Five-step annotation
④ Speak and move mouse
The main part of the annotation
• For each individual actor, verbally describe the selected key frames
  • The actor's name, attributes, actions, and the other objects it interacts with
• While speaking, move the mouse pointer to point at the objects and actions

Slide 27

Slide 27 text

Five-step annotation
⑤ Transcription and Time Alignment
The annotators manually transcribe their own speech
For localization, the words are aligned with the audio (each word receives a timestamp)

Slide 28

Slide 28 text

Refinements in the work instructions
• Instructions on how to speak
  • Speak slowly
  • Stop speaking while moving the mouse between objects
  • The tool is set up so that clicking the mouse pauses the trace

Slide 29

Slide 29 text

Statistics
• Annotations were added to three datasets
  • OVIS [Qi et al., IJCV 2022], UVO [Wang et al., ICCV 2021], Oops [Epstein et al., CVPR 2020]
• Total
  • 20K videos
  • 1.65M words
  • 3.54 actors / video

Slide 30

Slide 30 text

Statistics: comparison with related datasets
• A comparison with related datasets (table)

Slide 31

Slide 31 text

Statistics: comparison with related datasets (a question)
• For Actors / narr, the maximum over the three datasets is 3.45, yet "All" shows 3.54?
  • 71,976 / 22,091 = 3.26, so the numbers do not add up
  • Am I overlooking a clarification somewhere in the paper?
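The arithmetic behind this sanity check, using the totals quoted on the slide:

```python
# Reproducing the slide's check: the table totals do not yield the
# reported 3.54 actors per narrative for "All".
num_actors = 71_976
num_narratives = 22_091
ratio = num_actors / num_narratives
print(round(ratio, 2))  # 3.26, not 3.54
```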

Slide 32

Slide 32 text

Statistics: checking how rich the captions are
• Compared with ActivityNet-Entities [Zhou et al., CVPR 2019], the captions are rich
• Long (75.1 words on average), with many words in every part-of-speech category
  • 23.0 nouns
  • 9.5 verbs
  • 8.5 adjectives
  • 7.2 adpositions
  • 2.4 pronouns

Slide 33

Slide 33 text

Statistics: accuracy of the captions
• Checked both Semantic Accuracy and Localization Accuracy
• Semantic Accuracy (upper table)
  • 70 videos were sampled at random and their noun phrases and verbs checked → nearly perfect
• Localization Accuracy
  • Higher even than the still-image Localized Narratives (ImLN) dataset (+25%) → the proposed protocol is effective

Slide 34

Slide 34 text

Using the proposed dataset (VidLN)
• It can be applied in many ways; as concrete examples, two benchmarks are proposed
  • Video Narrative Grounding (VNG) → Sec. 4
  • Video Question Answering (VideoQA) → Sec. 5

Slide 35

Slide 35 text

VNG: task definition
• Input:
  • A video with its narrative given as text, with specific nouns marked
• Output:
  • A segmentation mask for each of the marked nouns

Slide 36

Slide 36 text

VNG: the key challenge
• The same noun may refer to several different objects
  • In the example above, "parrot" alone cannot identify the object
  • The information "a red-black neckline" must be used as well

Slide 37

Slide 37 text

VNG: method
• ReferFormer-VNG, a modification of ReferFormer [Wu et al., CVPR 2022]
• The original method
  • Built for the Referring Video Object Segmentation (R-VOS) task
    • Short descriptions, a single object
  • A visual encoder extracts features from the video
  • A text encoder extracts features from the text
  • Both sets of features go into a decoder, which generates a mask in every frame

Slide 38

Slide 38 text

VNG: method
• In VNG, several objects can be described by the same noun
• The text-encoder stage is therefore changed
  • ReferFormer: per-token features + a feature for the whole text
    • Useful because the whole text refers to one and the same object
  • ReferFormer-VNG: the average-pool of the features of the tokens corresponding to the noun to be segmented
    • (Unverified) whether both the per-token features and the average-pool are used, or only the latter
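The average-pool change can be sketched as follows. The tensor shapes and the function name are assumptions for illustration, not the authors' code.

```python
# Sketch: average-pool the text encoder's features over the token span of
# the noun to be segmented, yielding one query vector per marked noun
# (in place of ReferFormer's single whole-sentence feature).
import numpy as np

def noun_query_feature(token_features, noun_span):
    """token_features: (num_tokens, dim); noun_span: (start, end) token indices."""
    start, end = noun_span
    return token_features[start:end].mean(axis=0)

feats = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 tokens, dim 3
query = noun_query_feature(feats, noun_span=(1, 3))    # pool tokens 1 and 2
```

Pooling per noun is what lets one narrative produce several distinct queries, one per marked noun, which is exactly what the ambiguity problem above requires.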

Slide 39

Slide 39 text

VNG: evaluation metric and baseline results
• Metric: 𝒥&ℱ [Pont-Tuset et al., 2018]
  • From "The 2017 DAVIS Challenge on Video Object Segmentation"
• Baseline: ReferFormer
• Given the nature of the VNG task, feeding only the noun works slightly better than feeding the whole narrative
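For reference, the region half of the DAVIS 𝒥&ℱ metric (𝒥, the region Jaccard index) can be sketched as below; the boundary half (ℱ) requires contour matching and is omitted here. This is an illustrative sketch, not the official evaluation code.

```python
# Sketch: region Jaccard (J) between a predicted and a ground-truth mask.
import numpy as np

def region_jaccard(pred_mask, gt_mask):
    """Binary masks of the same shape -> intersection-over-union."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
j = region_jaccard(pred, gt)  # intersection 1, union 2 -> 0.5
```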

Slide 40

Slide 40 text

VNG: results with the proposed method
• Better than the baseline; the ambiguity is resolved
• The top row of the table gives the best results
• Because the model is already well trained, fine-tuning on OVIS-VNG does not improve accuracy that much

Slide 41

Slide 41 text

[Recap] Using the proposed dataset (VidLN)
• Many applications are possible; two are given as concrete examples
  • Video Narrative Grounding (VNG) → Sec. 4
  • Video Question Answering (VideoQA) → Sec. 5

Slide 42

Slide 42 text

VideoQA
• Text-output Questions
  • Free-form text answers
  ※ Given the time available, I will skip the details (the slides are in the Appendix)
• Location-output Questions
  • Questions starting with "where is" are answered by localizing in time and space within the video

Slide 43

Slide 43 text

VideoQA: creating the Location-output Q&A
• Automatic generation
  • spaCy assigns part-of-speech tags and a parse tree
  • Sentences are converted into "where is the subject" questions
    • 3.7 questions / video
  • The ground truth is built from the mouse-trace information
• Human checking
  • Two people check each question, and only those approved by both are kept
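The question-generation idea can be sketched as below. The (token, POS tag, dependency label) triples stand in for what spaCy's parser would emit; the hand-written triples and the exact rewrite rule are assumptions, and the paper's real pipeline may differ.

```python
# Sketch: turn a narrative sentence's subject into a "Where is ...?" question,
# using spaCy-style POS tags and dependency labels.

def make_where_question(tagged_tokens):
    """tagged_tokens: list of (text, pos, dep) as a dependency parser emits."""
    subject = []
    for text, pos, dep in tagged_tokens:
        # Collect the determiner/adjectives leading up to the nominal subject.
        if dep in ("det", "amod", "nsubj"):
            subject.append(text)
        if dep == "nsubj":
            break
    if not subject:
        return None
    return "Where is " + " ".join(subject) + "?"

tokens = [("the", "DET", "det"), ("brown", "ADJ", "amod"),
          ("dog", "NOUN", "nsubj"), ("jumps", "VERB", "ROOT")]
question = make_where_question(tokens)  # "Where is the brown dog?"
```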

Slide 44

Slide 44 text

VideoQA: an example of a Location-output Q&A

Slide 45

Slide 45 text

VideoQA: Location-output results
• ReferFormer-VNG is used as the baseline
  • The words "where is" are stripped from the question, and ReferFormer-VNG outputs a segmentation mask for the first noun in the remaining sentence
  • The mask is converted into a bounding box
• Results
  • Recall: 66.7%
  • Precision: 53.9%
  • Good, but far from perfect; the proposed benchmark is therefore worthwhile
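The mask-to-box conversion step can be sketched as follows; the coordinate convention (x_min, y_min, x_max, y_max) is an assumption for illustration.

```python
# Sketch: convert a binary segmentation mask into a bounding box
# (x_min, y_min, x_max, y_max) before scoring the location answer.
import numpy as np

def mask_to_bbox(mask):
    """mask: 2D array of 0/1; returns None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:3, 2:5] = 1           # rows 1-2, columns 2-4
bbox = mask_to_bbox(mask)    # (2, 1, 4, 2)
```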

Slide 46

Slide 46 text

Conclusion
• Proposed Video Localized Narratives
• Proposed an annotation protocol
• Proposed benchmarks on the constructed dataset

Slide 47

Slide 47 text

Appendix

Slide 48

Slide 48 text

A bit more on "Story" vs. "Narrative"
• According to the Oxford Learner's Dictionary of Academic English [2014]:
  • Story
    1. a description, often spoken, of what happened to sb or of how sth happened
  • Narrative
    1. [C] a description of events, especially in a novel [synonym] story (1)
    2. [U] the act, process or skill of telling a story
• What is the narrative element of Localized Narratives?
  • It uses the temporal sequence (of where the annotator's attention goes)
  • This time the data is video to begin with, so the narrative can be handled in full

Slide 49

Slide 49 text

VideoQA: creating the Text-output Q&A
• Automatic generation
  • Done with VQ2A [Changpinyo et al., NAACL 2022]
  • Automatic filtering, e.g. removing questions whose answers are duplicated
• Human checking
  • "What color is the sky?"
    • Answerable without watching the video
  • "What color is the cat?"
    • Ambiguous if several cats appear in the video

Slide 50

Slide 50 text

VideoQA: an example of a Text-output Q&A
Q: "What falls out of the man's hand?", A: "dog"

Slide 51

Slide 51 text

VideoQA: Text-output results
• PaLI [Chen et al., ICLR 2023] is used as the baseline
  • PaLI-1.5B
  • Evaluated by exact match with the ground truth
• Results with a single frame
  • Zero-shot: 24.1%
  • Fine-tuned (on Oops-QA): 44.9%
• The proposed benchmark is a challenging one, with room for improving the methods
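Exact-match scoring can be sketched as below. The normalization shown (stripping and lowercasing) is an assumption; the slide only states that exact match against the ground truth is used.

```python
# Sketch: exact-match accuracy of free-form answers against the ground truth.

def exact_match_accuracy(predictions, ground_truths):
    matches = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, ground_truths))
    return matches / len(ground_truths)

acc = exact_match_accuracy(["dog", "a ball", "cat"],
                           ["dog", "ball", "Cat"])  # 2 of 3 match
```

Exact match is strict: "a ball" vs. "ball" counts as wrong, which is part of why even a strong baseline scores well below 100% here.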