
Paper Introduction / Connecting Vision and Language with Video Localized Narratives

At the 59th Computer Vision Study Group @ Kanto, CVPR 2023 Reading Session (Part 1), I presented
"Connecting Vision and Language with Video Localized Narratives" [Voigtlaender et al., CVPR 2023].

◆Event page:
https://kantocv.connpass.com/event/288899/
◆Presentation date:
2023/07/23
◆Project page of the paper:
https://google.github.io/video-localized-narratives/
◆PDF of the paper:
https://openaccess.thecvf.com/content/CVPR2023/papers/Voigtlaender_Connecting_Vision_and_Language_With_Video_Localized_Narratives_CVPR_2023_paper.pdf

(2023/09/02: replaced the slides with a revised version, fixing typos and similar issues.)

Yusuke Mori

July 23, 2023
Transcript

  1. Connecting Vision and Language
    with Video Localized Narratives
    shade-tree
    Twitter: @shade_tree2112
    [PDF] [Project Page of the paper]
    59th Computer Vision Study Group @ Kanto
    CVPR 2023 Reading Session
    2023/07/23
    (this slide deck)

  2. Preamble

  3. Thoughts while choosing which paper to present this time
    • "'Ego-Body Pose Estimation via Ego-Head Pose Estimation' is an
    Award Candidate, but I already presented it last time..." (a plug)
    Ego-Body Pose Estimation via
    Ego-Head Pose Estimation
    shade-tree
    Twitter: @shade_tree2112
    [Project Page]
    58th Computer Vision Study Group @ Kanto
    "Deep Learning + 3D Paper Reading Session"
    2023/04/30

  4. Thoughts while choosing which paper to present this time
    • CVPR 2023 Accepted Papers
    • https://cvpr2023.thecvf.com/Conferences/2023/AcceptedPapers
    • A list with information such as award candidates and highlights was published here
    • https://docs.google.com/spreadsheets/d/1OAUf7sQfJ6cSU4BiOtyl-
    t4dMm4iFqdEDHCSs7R2jZo
    • Search this list
    • Search for "Highlight"; search for "Story", "Stories", "Narrative", and so on
    • "Vision and Language" with "Narratives"! Let's go with this one!

  5. The presenter's background and perspective
    • At the CV study group, I have presented the following papers:
    • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR
    2017]
    • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy
    et al., ICLR 2021]
    • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
    • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
    • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al.,
    ICLR 2022]
    • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by
    Contrastive Data Collection" [Mohamed et al., CVPR 2022]
    • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]
    • The point of this page:
    (my own contributions are modest, but) a wide variety of papers get presented at the CV study group, and the logs are a useful reference!

  6. Papers particularly related to today's paper
    • At the CV study group, I have presented the following papers:
    • "A Hierarchical Approach for Generating Descriptive Image Paragraphs" [Krause et al., CVPR
    2017]
    • "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy
    et al., ICLR 2021]
    • "Transitional Adaptation of Pretrained Models for Visual Storytelling" [Yu et al., CVPR 2021]
    • "Panoptic Narrative Grounding" [González et al., ICCV 2021]
    • "GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation" [Xu et al.,
    ICLR 2022]
    • "It is Okay to Not be Okay: Overcoming Emotional Bias in Affective Image Captioning by
    Contrastive Data Collection" [Mohamed et al., CVPR 2022]
    • "Ego-Body Pose Estimation via Ego-Head Pose Estimation" [Li et al., CVPR 2023]

  7. A paper I presented previously
    [PDF] [Project Page]

  8. The lineage of Localized Narratives
    • Localized Narratives
    [Pont-Tuset et al., ECCV 2020]
    • Panoptic Narrative Grounding
    [González et al., ICCV 2021]
    • the one I presented previously
    • Video Localized Narratives
    [Voigtlaender et al., CVPR 2023]
    • the one presented today

  9. Localized Narratives
    [Pont-Tuset et al., ECCV 2020]
    • Proposed a multimodal format for image annotation
    • While annotating by voice, the annotator simultaneously hovers the mouse over the region being described
    • Since the speech and the mouse movement are synchronized, each word can be associated with a location in the image
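
The synchronization idea can be sketched as follows. This is a hypothetical illustration (not the authors' code): assume each transcribed word carries a (start, end) timestamp and the mouse trace is a list of (t, x, y) samples; a word is then localized by the trace points falling inside its spoken interval.

```python
def localize_words(words, trace):
    """Attach to each timestamped word the mouse-trace points
    that fall within its spoken interval.

    words: list of (word, t_start, t_end)
    trace: list of (t, x, y) mouse samples, sorted by t
    Returns: list of (word, [(x, y), ...]).
    """
    localized = []
    for word, t_start, t_end in words:
        points = [(x, y) for (t, x, y) in trace if t_start <= t <= t_end]
        localized.append((word, points))
    return localized

# Toy example: "parrot" is spoken while the mouse hovers around (120, 80).
words = [("a", 0.0, 0.2), ("parrot", 0.2, 0.8)]
trace = [(0.1, 10, 10), (0.3, 118, 79), (0.5, 120, 80), (0.7, 122, 81)]
result = localize_words(words, trace)
```

This is why the instructions to annotators (speak slowly, pause while moving the mouse) matter: the cleaner the temporal separation, the less ambiguous the word-to-trace assignment.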

  10. [Review] Panoptic Narrative Grounding (1)
    • The name of the task proposed in that paper
    • Let's look at it in a bit more detail, word by word
    • "Panoptic"?
    • Panoramic; taking in the whole at a glance
    • pan (= all) + optic
    • Uses Panoptic Segmentation [Kirillov et al., CVPR 2019]
    • "Narrative"?
    • Sometimes used as a synonym of "story", but when the two are distinguished, it is the broader concept;
    if translated separately, "narrative" is closer to "telling"
    • For example, this talk's preamble may not be a story, but it can be called a narrative
    • Localized Narratives [Pont-Tuset et al., ECCV 2020] is the key prior work

  11. [Review] Panoptic Narrative Grounding (2)
    • "Grounding"?
    • Used here in the same sense as in symbol grounding
    • Since the grounding is done with images, the term "visual grounding" appears frequently in the paper
    • Symbol grounding was proposed by Harnad [1990] [link]
    • This paper describes the "symbol grounding problem": How can the semantic
    interpretation of a formal symbol system be made intrinsic to the system, rather
    than just parasitic on the meanings in our heads? How can the meanings of the
    meaningless symbol tokens, manipulated solely on the basis of their (arbitrary)
    shapes, be grounded in anything but other meaningless symbols?

  12. [Review] Panoptic Narrative Grounding (3)
    • Roughly, it can be understood as:
    • taking an image whose whole content can be surveyed at a glance, and
    • grounding (anchoring, associating) to it
    • not single words but a narrative

  13. [Review] Panoptic Narrative Grounding (4)
    • Frameworks for grounding language in images (visual
    grounding) already existed, but granularity was
    a problem
    • Datasets with detailed language annotations exist,
    but their image annotations are still sparse and coarse

  14. [Review] Panoptic Narrative Grounding (5)
    • The proposal of Panoptic Narrative Grounding
    • Uses panoptic segmentation regions for visual grounding,
    giving a new formulation of the natural language visual grounding problem
    • Proposed the task, a dataset, metrics, and a method
    The full package!

  15. Main topic
    Unless otherwise noted, the figures are taken from
    the presented paper or from the related work discussed on the corresponding page

  16. Overview of the paper (1)
    • Proposes Video Localized Narratives (VidLN)
    • Localized Narratives for videos rather than still images
    • Annotating video lets the whole story be captured
    • there is a flow of events
    • multiple people and objects appear and interact
    • However, annotation is harder than for still images
    • For annotators, it is a race against time

  17. Overview of the paper (2)
    • Proposes a new annotation protocol for VidLN
    • Annotation becomes possible even for videos in which multiple people appear,
    objects come and go, and complex events unfold
    • A high-quality dataset usable for a variety of tasks
    • Examples: Video Narrative Grounding, Video Question Answering

  18. An example of a VidLN annotation

  19. The annotations target "video", so watch them in motion
    • Video visualizations
    • https://google.github.io/video-localized-narratives/

  20. Structure of the paper
    • Abstract
    • 1. Introduction
    • 2. Related Work
    • The paragraph "Tasks Related to VNG." discusses the differences from Panoptic Narrative Grounding
    • 3. Video Localized Narrative Annotations
    • How the annotation is done
    • 4. Video Narrative Grounding (VNG)
    • Application as a benchmark: example 1
    • 5. Video Question Answering (VideoQA)
    • Application as a benchmark: example 2

  21. Differences from Panoptic Narrative Grounding
    • Panoptic Narrative Grounding
    • performs segmentation that covers the whole image
    • grounds the nouns appearing in still-image captions
    • Video Narrative Grounding (VNG, one of the tasks addressed in this paper)
    • targets videos, and grounds specific objects

  22. The annotation method
    • The five annotation steps
    • Points to note when annotating

  23. The five annotation steps
    (1) Understand the Video
    (2) Actor Selection
    (3) Key-frame Selection
    (4) A Story for each Actor
    (5) Transcription and Time Alignment

  24. The five annotation steps
    (1) Understand the Video
    First watch the video and understand its content

  25. The five annotation steps
    (2) Actor Selection
    Choose which actors (people and objects) to annotate
    (3) Key-frame Selection
    From key-frame candidates sampled uniformly along the time axis,
    choose the ones that contain the main actions

  26. The five annotation steps
    (4) Speak and move mouse
    The main part of the annotation
    • For each individual actor, verbally describe the selected key-frames
    • actor name, attributes, actions, and the other objects it interacts with
    • While speaking, move the mouse pointer to indicate the objects and actions

  27. The five annotation steps
    (5) Transcription and Time Alignment
    Annotators manually transcribe their own speech
    For localization, the text is aligned with the audio (timestamps are assigned to each word)

  28. Tips in the work instructions
    • Instructions on how to speak
    • speak slowly
    • stop speaking while moving the mouse between objects
    • Clicking the mouse is set up to pause the trace

  29. Statistics
    • Annotations were added to three datasets
    • OVIS [Qi et al., IJCV 2022],
    UVO [Sadhu et al., NAACL 2021],
    Oops [Epstein et al., CVPR 2020]
    • Total
    • 20K videos
    • 1.65M words
    • 3.54 actors / video

  30. Statistics: comparison with related datasets
    • Comparison with related datasets

  31. Statistics: comparison with related datasets (a question)
    • For actors / narr, the maximum over the three datasets is 3.45, yet "All" says 3.54?
    • 71,976 / 22,091 = 3.26, so the numbers do not add up
    • Am I overlooking some caveat in the paper?

  32. Statistics: checking the richness of the captions
    • Compared with ActivityNet-Entities [Zhou et al., CVPR 2019],
    the captions are rich
    • Long (75.1 words on average), with many words
    in every part-of-speech category
    • 23.0 nouns
    • 9.5 verbs
    • 8.5 adjectives
    • 7.2 adpositions
    • 2.4 pronouns

  33. Statistics: caption accuracy
    • Checking semantic accuracy and localization accuracy
    • Semantic accuracy (table above)
    • 70 videos were sampled at random and their noun phrases and verbs were checked → nearly perfect
    • Localization accuracy
    • High even compared with the still-image Localized Narratives (ImLN) dataset (+25%)
    → the proposed protocol is effective

  34. Using the proposed dataset (VidLN)
    • It can be applied in many ways; as concrete examples, two benchmarks are proposed
    • Video Narrative Grounding (VNG) → Sec. 4
    • Video Question Answering (VideoQA) → Sec. 5

  35. VNG: task definition
    • Input:
    • a video with its narrative given as text, with specific nouns marked
    • Output:
    • a segmentation mask for each of the marked nouns

  36. VNG: the key challenge
    • Several objects may be referred to by the same noun
    • In the example above, "parrot" alone cannot identify the object
    • The information "a red-black neckline" must also be used

  37. VNG: method
    • ReferFormer-VNG, an adaptation of ReferFormer [Wu et al., CVPR 2022]
    • The original method
    • was designed for the Referring Video Object Segmentation (R-VOS) task
    • short descriptions, a single object
    • a visual encoder extracts features from the video
    • a text encoder extracts features from the text
    • these features go into a decoder, which produces a mask for each frame

  38. VNG: method
    • In VNG, multiple objects can be denoted by the same noun
    • The text-encoder processing is changed
    • ReferFormer:
    per-token features + a feature for the whole text
    • useful, because the whole text refers to one and the same object
    • ReferFormer-VNG:
    average-pool of the token features corresponding to the noun to be segmented
    • (unverified) whether it uses both the per-token features and the average-pool, or only the latter
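
The text-encoder change can be illustrated roughly as below. This is a minimal sketch with made-up feature vectors, not the actual ReferFormer-VNG code: instead of one embedding for the whole sentence, only the token features belonging to the target noun are average-pooled into a query for that specific object.

```python
import numpy as np

def noun_query(token_feats, noun_span):
    """Average-pool the token features of the target noun span.

    token_feats: (num_tokens, dim) array of per-token features
    noun_span: (start, end) token indices of the noun, end exclusive
    Returns a (dim,) query vector specific to that noun.
    """
    start, end = noun_span
    return token_feats[start:end].mean(axis=0)

# "the parrot with a red-black neckline" -> pool only "parrot" (token index 1)
feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, feature dim 2
q = noun_query(feats, (1, 2))
```

Distinct nouns in the same narrative thus get distinct queries, which is what lets the model separate two objects described by the same word.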

  39. VNG: evaluation metric and baseline results
    • Metric: 𝒥&ℱ [Pont-Tuset et al., 2018]
    • from The 2017 DAVIS Challenge on Video Object Segmentation
    • Baseline: ReferFormer
    • Given the nature of the VNG task,
    feeding only the noun works slightly better than feeding the whole narrative
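
As a rough sketch of what 𝒥&ℱ measures: 𝒥 is the region intersection-over-union and ℱ is a boundary F-measure, averaged. The version below is a simplification assuming single binary masks and exact boundary matching; the official DAVIS evaluation adds a distance tolerance for boundary matches, which is omitted here.

```python
import numpy as np

def region_j(pred, gt):
    """J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    """Boundary pixels: mask pixels with at least one non-mask 4-neighbor."""
    padded = np.pad(mask, 1)
    core = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
            padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~core

def boundary_f(pred, gt):
    """F: F-measure between the two boundary pixel sets (no tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    match = (bp & bg).sum()
    prec = match / bp.sum() if bp.sum() else 1.0
    rec = match / bg.sum() if bg.sum() else 1.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def j_and_f(pred, gt):
    """J&F: mean of region similarity and boundary accuracy."""
    return (region_j(pred, gt) + boundary_f(pred, gt)) / 2

# Toy masks: prediction slightly over-segments the ground truth.
gt = np.zeros((5, 5), dtype=bool)
gt[1:4, 1:3] = True
pred = np.zeros((5, 5), dtype=bool)
pred[1:4, 1:4] = True
score = j_and_f(pred, gt)
```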

  40. VNG: results with the proposed method
    • Better than the baselines; the ambiguity is being resolved
    • the top row of the table gives the best results
    • Since the model is already well trained, fine-tuning on OVIS-VNG
    does not improve accuracy that much

  41. [Recap] Using the proposed dataset (VidLN)
    • Many applications are possible; two concrete examples
    • Video Narrative Grounding (VNG) → Sec. 4
    • Video Question Answering (VideoQA) → Sec. 5

  42. VideoQA
    • Text-output Questions
    • free-form text answers
    * Due to the time limit of the talk, I skip the details (the slides are in the Appendix)
    • Location-output Questions
    • questions starting with "where is", answered by localizing in time and space within the video

  43. VideoQA: creating the location-output Q&A
    • Automatic generation
    • spaCy assigns part-of-speech tags and a parse tree
    • sentences are converted into "where the subject is" questions
    • 3.7 questions / video
    • the mouse-trace information is used to build the ground-truth answers
    • Checking by humans
    • two people check each question; only those approved by both are kept
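
The generation step can be sketched like this. The paper uses spaCy's POS tags and parse tree; to keep the example self-contained (no spaCy model download), the tagging below is faked with a hypothetical pre-tagged sentence, so treat the tags and the `make_where_question` helper as illustrative only.

```python
def make_where_question(tagged_tokens):
    """Turn a declarative sentence into a 'Where is <subject>?' question.

    tagged_tokens: list of (token, pos, dep) triples, in the style spaCy
    produces (POS tags like 'NOUN', dependency labels like 'nsubj').
    Returns the question string, or None if no subject noun is found.
    """
    for token, pos, dep in tagged_tokens:
        if pos in ("NOUN", "PROPN") and dep == "nsubj":
            return f"Where is the {token}?"
    return None

# Hypothetical tagging of "the parrot climbs on the cage"
sent = [("the", "DET", "det"), ("parrot", "NOUN", "nsubj"),
        ("climbs", "VERB", "ROOT"), ("on", "ADP", "prep"),
        ("the", "DET", "det"), ("cage", "NOUN", "pobj")]
q = make_where_question(sent)
```

The ground-truth answer for such a question would then come from the mouse-trace points recorded while the subject noun was being spoken.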

  44. VideoQA: examples of location-output Q&A

  45. VideoQA: location-output results
    • ReferFormer-VNG is used as the baseline
    • the words "where is" are removed from the question, and ReferFormer-VNG
    outputs a segmentation mask for the first noun in the remaining text
    • the mask is converted into a bounding box
    • Results
    • Recall: 66.7%
    • Precision: 53.9%
    • Good results, but far from perfect; the proposed benchmark has value
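
The mask-to-bounding-box conversion used in this evaluation can be sketched as follows (a generic implementation, not the authors' code):

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert a binary mask to a tight (x_min, y_min, x_max, y_max) box.

    Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:8] = True  # object occupies rows 2-4, columns 3-7
box = mask_to_bbox(mask)
```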

  46. Conclusion
    • Proposed Video Localized Narratives
    • Proposed an annotation protocol
    • Proposed benchmarks on the constructed dataset

  47. Appendix

  48. A bit more on "story" vs. "narrative"
    • According to the Oxford Learner's Dictionary of Academic English [2014]:
    • Story
    1. a description, often spoken, of what happened to sb or of how sth happened
    • Narrative
    1. [C] a description of events, especially in a novel
    [synonym] story (1)
    2. [U] the act, process or skill of telling a story
    • What is the "narrative" element of Localized Narratives?
    • it uses the temporal order (of what the annotator attends to)
    • this time the data is video to begin with, so the full narrative can be handled

  49. VideoQA: creating the text-output Q&A
    • Automatic generation
    • based on VQ2A [Changpinyo et al., NAACL 2022]
    • automatic post-processing, e.g. removing questions with duplicate answers
    • Checking by humans, filtering out cases such as:
    • "What color is the sky?"
    • answerable without watching the video
    • "What color is the cat?"
    • the answer is ambiguous if several cats appear in the video

  50. VideoQA: an example of text-output Q&A
    Q: "What falls out of the man's hand?", A: "dog"

  51. VideoQA: text-output results
    • PaLI [Chen et al., ICLR 2023] is used as the baseline
    • PaLI-1.5B
    • evaluated by exact match with the ground truth
    • Results with a single frame
    • 24.1% zero-shot
    • 44.9% fine-tuned (on Oops-QA)
    • The proposed benchmark is challenging; there is room for improving methods
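
Exact-match evaluation is simply string equality between the prediction and the ground truth, typically after light normalization; a minimal sketch (the exact normalization used in the paper is an assumption here):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predictions that exactly match the ground truth
    after lowercasing and stripping surrounding whitespace."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return hits / len(golds)

acc = exact_match_accuracy(["dog", "a ball"], ["Dog", "hat"])
```

Exact match is a strict criterion for free-form answers (paraphrases count as errors), which is one reason the absolute numbers above look low.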