[Journal club] End-to-End Generative Pretraining for Multimodal Video Captioning

End-to-End Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid (Google Research), CVPR 2022
Motonari Kambara, Komei Sugiura Laboratory
Seo, P. H., Nagrani, A., Arnab, A., & Schmid, C. (2022). End-to-end generative pretraining for multimodal video captioning. In CVPR (pp. 17959-17968).

Background: manually captioning videos is costly
Example captions from the YouCook2 dataset [Zho+, AAAI18]:
• “Grill the tomatoes in a pan and then put them on a plate”
• “Add oil to a pan and spread it well so as to fry the bacon”
• “Cook bacon until crispy, then drain on paper towel”
Attaching captions to videos is
• costly
• subjective
Open problem in the Vision & Language field: cross-modal caption generation using unlabeled (caption-free) videos

Cross-modal video caption generation
• Task overview
  • Input: video + speech
  • Output: natural-language description, e.g. “Add oil to a pan and spread it well so as to fry the bacon”
• VideoBERT [Sun+, ICCV19]
  • During pretraining, speech is transcribed with the YouTube Data API and used as input
  • Caption generation improves over the video-only variant
Method                    BLEU4   METEOR   ROUGE   CIDEr
VideoBERT (video only)     3.81    10.81   27.14    0.47
VideoBERT                  4.04    11.01   27.50    0.49

Related work: many methods stop at representation learning
• HERO [Li+, EMNLP20]
  • Four pretraining tasks
    • Masked Language Modeling
    • Masked Frame Modeling
    • Video-Subtitle Matching
    • Frame Order Modeling
• ActBERT [Zhu+, CVPR20]
  • Computes source-target attention between modalities
  • Four pretraining tasks

Multimodal Video Generative Pretraining (MV-GPT) framework
Features
• Generates captions from speech and video
• Encoder-decoder model
• Speech is converted to text by automatic speech recognition (ASR), so the model can be trained without captions

Novelty: Bi-directional Utterance Generation
Forward Generation: (Present utterance, Input frames) → Future utterance
• Learns the alignment (encoding) of multiple modalities (speech, video)
• Loss function: $\mathcal{L}_{FG} = -\sum_{i=1}^{N_w} \log P(w_i \mid w_1, \dots, w_{i-1}, F, U)$
  • $F = \{f_1, \dots, f_{N_f}\}$: video frames
  • $U = \{u_1, \dots, u_{N_u}\}$: present utterance
  • $W = \{w_1, \dots, w_{N_w}\}$: future utterance
• Figure annotation: aligned

Novelty: Bi-directional Utterance Generation
Backward Generation: (Future utterance, Input frames) → Present utterance
• Learns caption generation (decoding) that makes use of visual information
• Loss function: $\mathcal{L}_{BG} = -\sum_{i=1}^{N_u} \log P(u_i \mid u_1, \dots, u_{i-1}, F, W)$
• Figure annotation: not aligned
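The two losses above are ordinary autoregressive negative log-likelihoods that differ only in which utterance is the target and which is the context. The following is a minimal sketch of that computation, not the authors' implementation: the function name `sequence_nll` and the per-token probabilities are hypothetical, and the model that would produce those probabilities is left out.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one target sequence.

    step_probs[i] is the probability the model assigns to the i-th
    ground-truth token given the previous tokens, the frames F, and
    the context utterance (U for forward, W for backward).
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-token probabilities, for illustration only.
fg_probs = [0.5, 0.25, 0.8]  # P(w_i | w_1..w_{i-1}, F, U): target is the future utterance
bg_probs = [0.9, 0.6]        # P(u_i | u_1..u_{i-1}, F, W): target is the present utterance

loss_fg = sequence_nll(fg_probs)  # forward generation loss
loss_bg = sequence_nll(bg_probs)  # backward generation loss
```

How the two terms are weighted or combined during pretraining is not stated on these slides, so the sketch keeps them separate.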

Training details
• Forward: (Present utterance, Input frames) → Future utterance
• Backward: (Future utterance, Input frames) → Present utterance
• Both directions map (text, video) → text, so which utterance should the decoder output, the future one or the present one?
• The direction (forward or backward) is controlled by the types of the inserted CLS and BOS tokens
• Fine-tuning uses CLS1 + BOS2
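A small sketch of how special tokens could select the generation direction. The slide only states that the CLS/BOS token types control the direction and that fine-tuning uses CLS1 + BOS2; the pairings for forward and backward, the token spellings, the token positions, and the helper `build_inputs` are assumptions made for illustration.

```python
# Token spellings are placeholders; only the fine-tuning pairing is stated on the slide.
MODES = {
    "forward": ("[CLS1]", "[BOS1]"),   # assumed pairing
    "backward": ("[CLS2]", "[BOS2]"),  # assumed pairing
    "finetune": ("[CLS1]", "[BOS2]"),  # stated on the slide
}

def build_inputs(mode, input_tokens):
    """Return (text-encoder input, decoder start prefix) for the given mode.

    Assumption: the CLS token is prepended to the input utterance and the
    BOS token starts the decoder output.
    """
    cls_token, bos_token = MODES[mode]
    encoder_input = [cls_token] + list(input_tokens)
    decoder_prefix = [bos_token]
    return encoder_input, decoder_prefix

encoder_input, decoder_prefix = build_inputs("finetune", ["add", "oil"])
```

With this layout the same encoder and decoder weights serve both directions, and only the two marker tokens change between forward generation, backward generation, and fine-tuning.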

Encoder: embedding linguistic and visual information
Components: Textual Encoder, Visual Encoder, Multimodal Encoder

Textual Encoder ($X \to E$)
• Uses the BERT embedder [Devlin+, NAACL-HLT19]
• Converts the input sentence $X = \{x_1, \dots, x_{N_x}\}$ into embeddings $E = \{e_i\}$
• Forward: $X = U$; Backward: $X = W$

Visual Encoder ($F \to V$)
• Uses ViViT [Arnab+, ICCV21]
• Converts the video frames $F$ into embeddings $V = \{v_i\}$
• $V \in \mathbb{R}^{S \times (K+1)}$
• A CLS token is prepended

Multimodal Encoder ($E, V \to \hat{E}, \hat{V}$)
• Co-attentional transformer mechanism [Lu+, NeurIPS19], etc.
• Figure: a stack of $R$ co-attentional transformer blocks, each drawn with four MHA layers, maps $E$ and $V$ to $\hat{E}$ and $\hat{V}$
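To make the cross-modal exchange concrete, here is a heavily simplified sketch of the idea behind a co-attentional block: text embeddings attend to visual embeddings and vice versa, and the block is repeated R times. It is an illustration under strong assumptions, not the architecture used in the paper: it uses a single head, no learned projections, no feed-forward layers, and no layer normalization, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out

def co_attention_block(E, V):
    """Text attends to video, video attends to text; residuals are added."""
    E_att = cross_attention(E, V)
    V_att = cross_attention(V, E)
    E_hat = [[a + b for a, b in zip(e, ea)] for e, ea in zip(E, E_att)]
    V_hat = [[a + b for a, b in zip(v, va)] for v, va in zip(V, V_att)]
    return E_hat, V_hat

def multimodal_encoder(E, V, num_blocks):
    """Apply the block num_blocks times (R on the slide)."""
    for _ in range(num_blocks):
        E, V = co_attention_block(E, V)
    return E, V

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy text embeddings, dimension 2
V = [[0.5, 0.5], [1.0, -1.0]]             # 2 toy visual embeddings, dimension 2
E_hat, V_hat = multimodal_encoder(E, V, num_blocks=2)
```

The number of tokens in each stream is preserved, so the outputs can be concatenated into a single context for the decoder.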

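Decoder: Masked Language Modeling
Sentence Decoder
• Input: the encoder output $C = \hat{E} \cup \hat{V}$ and the generated word sequence $\hat{Y}_{i-1} = \{\hat{y}_1, \dots, \hat{y}_{i-1}\}$
• Output: the features $\tilde{C}$ and $\tilde{H}_{i-1} = \{\tilde{h}_1, \dots, \tilde{h}_{i-1}\}$
  • $\tilde{C}$: used for Masked Language Modeling during pretraining
  • $\tilde{h}_{i-1}$: used to generate $\hat{y}_i$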
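The decoder described above produces one word at a time, each conditioned on the encoder output and the words generated so far. The loop below sketches that autoregressive process with greedy selection. Greedy decoding, the token spellings, and the lookup-table "model" `toy_next_token_probs` are assumptions for illustration; the slide does not specify the decoding strategy.

```python
def greedy_decode(next_token_probs, context, bos_token, eos_token, max_len=20):
    """Generate tokens one by one, always taking the most probable next token."""
    generated = [bos_token]
    while len(generated) < max_len:
        # The callback plays the role of the decoder: it sees the encoder
        # context C and the words generated so far.
        probs = next_token_probs(context, generated)
        token = max(probs, key=probs.get)
        if token == eos_token:
            break
        generated.append(token)
    return generated[1:]  # drop the BOS token

# Toy stand-in for a trained decoder: the next-token distribution depends
# only on the last generated token (hypothetical values).
TOY_TABLE = {
    "[BOS]": {"add": 0.7, "flip": 0.3},
    "add": {"oil": 0.9, "[EOS]": 0.1},
    "oil": {"[EOS]": 0.8, "add": 0.2},
}

def toy_next_token_probs(context, generated):
    return TOY_TABLE[generated[-1]]

caption = greedy_decode(toy_next_token_probs, context=None,
                        bos_token="[BOS]", eos_token="[EOS]")
```

In a real system the callback would run the sentence decoder on $C$ and $\hat{Y}_{i-1}$ and turn $\tilde{h}_{i-1}$ into a distribution over the vocabulary.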

Experimental setup
• Pretraining: HowTo100M dataset [Miech+, ICCV19]
  • 1.2M videos
  • 53M samples extracted and used
• Fine-tuning and comparison experiments: four datasets
  • YouCook2 dataset: cooking videos
  • Video Timeline Tags dataset [Huang+, AACL20]: a short tag for each video segment
  • MSR-VTT dataset [Xu+, CVPR16]: 20 categories
  • ActivityNet-Captions dataset [Krishna+, CVPR17]: detailed captions

Quantitative results: MV-GPT outperforms existing methods on each dataset
• YouCook2 dataset
Method                        BLEU-4   CIDEr   METEOR   ROUGE
UniVL [Luo+, 20]               17.35    1.81    22.35   46.52
MV-GPT                         21.88    2.21    27.09   49.38
• MSR-VTT dataset (figure annotation on this table: representation learning model)
Method                        BLEU-4   CIDEr   METEOR   ROUGE
DECEMBERT [Tang+, NAACL21]     45.20    0.52    29.70   64.70
UniVL [Luo+, 20]               41.79    0.50    28.94   60.78
MV-GPT                         48.92    0.60    38.66   64.00
• Outperforms existing video captioning models on each standard metric
• Similar results on all four datasets
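As a quick check of the size of the differences in the two tables above, the following sketch computes per-metric absolute differences between two methods. The numbers are copied from the slide; the helper `absolute_gain` and the dictionary layout are hypothetical.

```python
METRICS = ("BLEU-4", "CIDEr", "METEOR", "ROUGE")

# Scores copied from the slide, in the order given by METRICS.
YOUCOOK2 = {
    "UniVL": (17.35, 1.81, 22.35, 46.52),
    "MV-GPT": (21.88, 2.21, 27.09, 49.38),
}
MSRVTT = {
    "DECEMBERT": (45.20, 0.52, 29.70, 64.70),
    "UniVL": (41.79, 0.50, 28.94, 60.78),
    "MV-GPT": (48.92, 0.60, 38.66, 64.00),
}

def absolute_gain(table, method, baseline):
    """Per-metric difference method minus baseline, rounded to 2 decimals."""
    return {
        metric: round(a - b, 2)
        for metric, a, b in zip(METRICS, table[method], table[baseline])
    }

gain_youcook2 = absolute_gain(YOUCOOK2, "MV-GPT", "UniVL")
gain_msrvtt_univl = absolute_gain(MSRVTT, "MV-GPT", "UniVL")
gain_msrvtt_decembert = absolute_gain(MSRVTT, "MV-GPT", "DECEMBERT")
```

On these numbers every difference against UniVL is positive on both datasets, while the ROUGE difference against DECEMBERT on MSR-VTT is negative (64.00 vs. 64.70).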

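Ablation Study
Ablation of the pretraining losses (YouCook2); check-mark columns on the slide: FG, BG, MLM, WD
Losses checked     BLEU4   CIDEr   METEOR   ROUGE
(none)             13.25    1.03    17.56   35.48
✓ ✓                20.65    2.05    25.81   47.22
✓ ✓ ✓              20.89    2.11    26.42   48.30
✓ ✓ ✓ ✓            21.26    2.14    26.36   48.58
FG: Forward generation; BG: Backward generation; MLM: Masked Language Modeling; WD: weight decay
Ablation of the modalities used in pretraining (YouCook2); check-mark columns on the slide: Video, Text
Modalities checked   BLEU4   CIDEr   METEOR   ROUGE
✓                    16.71    1.53    21.43   41.56
✓                    16.71    1.56    20.88   40.19
✓ ✓                  21.88    2.21    27.09   49.38
(The column positions of the check marks are not recoverable from this transcript.)
• Backward generation is confirmed to improve performance
• Weight decay gives a slight further improvement
• Using multiple modalities is confirmed to be better than using one
• Video information is the more important of the two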

Qualitative results
Qualitative results on MSR-VTT
• Utterance: “So by considering the whole host of nature and nurture influences, we can take a broader view of mental health …”
• Ground truth: A man in a brown blazer discussing mental health
• MV-GPT w/o pretrain: A man in a blue shirt is talking
• MV-GPT: A man in a suit is talking about mental health
Qualitative results on YouCook2
• Ground truth: Spread mustard on the bread
• MV-GPT w/o pretrain: Flip the sandwiches
• MV-GPT: Spread the sauce on the bread

Summary
• Background: in video caption generation, costly human annotation should be avoided
• Method: MV-GPT, a video captioning model pretrained on video and speech
• Results: outperforms existing methods on each dataset

Semantic Machine Intelligence Lab., Keio Univ.