Paper Introduction / Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

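Paper introduction
Presenter: Kyosuke Nishida (NTT Human Informatics Laboratories)
2022/09/26 @ the 14th 最先端NLP勉強会 (cutting-edge NLP study group)
Paper venue: CVPR 2022 (arXiv 2022/04)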

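Summary
• What is this research about?
  – It evaluates the ability to understand vision and language jointly.
• What are the contributions?
  – It releases Winoground, a dataset for a matching task over two image-text pairs (400 examples in total).
  – It evaluates and analyzes recent models such as CLIP and shows a large gap from human performance (especially in terms of language understanding).
• Why does it matter?
  – The dataset makes the weaknesses of current vision-and-language models explicit and is expected to contribute to the development of future models and training algorithms.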

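Proposed task: Winoground
• The input is two image-caption pairs; the task is to find the correct pairings among the four possible image-caption combinations.
• The two captions contain the same words/morphemes, but in a different order.
Example captions: "some plants surrounding a lightbulb" / "a lightbulb surrounding some plants"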

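Reference: Winograd Scheme Challenge
• A coreference-resolution task: resolve a pronoun in two sentences that differ by only one or two words. Solving it correctly requires commonsense knowledge.
• The recent language model PaLM (540B) reaches 90% accuracy zero-shot.
• Winoground is modeled on Winograd, but its characteristics differ.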

Outline
• Task and dataset
  – Evaluation metrics
  – Data creation process
  – Data categories
• Evaluation experiments
  – Models evaluated
  – Results
  – Discussion

Evaluation metric 1: Image-Score (C→I)
• Evaluates whether the correct image is selected for each caption.
Example captions: "some plants surrounding a lightbulb" / "a lightbulb surrounding some plants"
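The Image-Score for a single example can be sketched as follows. This is a minimal illustration, assuming the model exposes a similarity score s(C, I) and that the four scores are stored in a 2×2 table; the function name `image_score`, the table layout, and the toy numbers are assumptions made for this sketch, not taken from the slides.

```python
def image_score(s):
    """Return 1 if each caption prefers its own image, else 0.

    s[i][j] is the similarity between caption C_i and image I_j
    (hypothetical layout chosen for this sketch).
    """
    caption0_ok = s[0][0] > s[0][1]  # C0 ranks I0 above I1
    caption1_ok = s[1][1] > s[1][0]  # C1 ranks I1 above I0
    return int(caption0_ok and caption1_ok)


# Toy similarity tables (made-up numbers).
good = [[0.9, 0.2], [0.3, 0.8]]
bad = [[0.9, 0.2], [0.7, 0.4]]  # C1 also prefers I0, so the example fails
print(image_score(good), image_score(bad))
```

Both captions must pick their own image for the example to count, so a model that ranks one image highest regardless of the caption always scores 0 here.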

Evaluation metric 2: Text-Score (I→C)
• Evaluates whether the correct caption is selected for each image.
Example caption (figure): "some plants surrounding a …"
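A matching sketch for the Text-Score, under the same assumptions (a 2×2 table where `s[i][j]` is the similarity between caption C_i and image I_j; the name `text_score` and the numbers are illustrative only). The comparison now runs down each column, that is, per image.

```python
def text_score(s):
    """Return 1 if each image prefers its own caption, else 0.

    s[i][j] is the similarity between caption C_i and image I_j
    (hypothetical layout chosen for this sketch).
    """
    image0_ok = s[0][0] > s[1][0]  # I0 ranks C0 above C1
    image1_ok = s[1][1] > s[0][1]  # I1 ranks C1 above C0
    return int(image0_ok and image1_ok)


# Toy similarity tables (made-up numbers).
passes = [[0.9, 0.2], [0.7, 0.4]]
fails = [[0.5, 0.6], [0.7, 0.4]]  # I0 prefers C1
print(text_score(passes), text_score(fails))
```

Note that `passes` satisfies the Text-Score even though caption C1 scores higher with image I0 than with I1, which is why the two directions are measured separately.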

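Evaluation metric 3: Group-Score
• Evaluates whether all four combinations are judged correctly.
• It is 1 only when both the Image-Score and the Text-Score are 1.
Example caption (figure): "some plants surrounding a …"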
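The Group-Score, and its average over a set of examples, can be sketched like this. The sketch assumes each example is a 2×2 table with `s[i][j]` the similarity between caption C_i and image I_j; `group_score`, `mean_group_score`, and the toy data are hypothetical names and values for illustration.

```python
def group_score(s):
    """Return 1 only if both the image-side and text-side checks pass."""
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]  # per caption
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]   # per image
    return int(image_ok and text_ok)


def mean_group_score(examples):
    """Average the per-example Group-Score over a list of 2x2 tables."""
    return sum(group_score(s) for s in examples) / len(examples)


# Two toy examples (made-up numbers): the first passes, the second does not.
examples = [
    [[0.9, 0.2], [0.3, 0.8]],
    [[0.9, 0.2], [0.7, 0.4]],
]
print(mean_group_score(examples))
```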

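Data creation process
• Created by hand by four experts (knowledgeable in linguistics and V&L research).
• For each example, two captions satisfying the Winoground schema are written and two images are collected at the same time.
  – Images were collected from the stock photo site Getty Images.
• 400 examples were created in total.
  – 800 correct image-caption pairs
  – 800 incorrect image-caption pairs
  – The data can be browsed at https://huggingface.co/spaces/CVPR/winoground-explorer and https://huggingface.co/datasets/facebook/winoground, among others.
• The created data was tagged by experts.
  – From a linguistic perspective
  – From a visual-reasoning perspective (about 10% of the whole)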

Linguistic Tag: Object (141/400), 9 types in total
• Swaps noun phrases and similar elements that refer to real-world objects.

Linguistic Tag: Relation (233/400), 44 types in total (only 9 shown)
• Swaps verbs, adjectives, prepositions, adverbs, and the like.

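Linguistic Tag: Both (26/400), 9 types in total
• Swaps both a Relation and an Object. Includes examples where a single swap changes the part of speech.
• The number of examples is quite small.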

Linguistic Tag: 1 or 2 Main Preds (292 and 108 out of 400)
• Classification by the number of main predicates (one or two).
• Captions with two predicates tend to be longer and more complex.

Examples with 1 Main Pred
– there are more [humans] than [balls]
– there's a [phone] on a [map]
– the [plant] is eating the [bug]
– [out]1[swam]2 the person in the red swimcap []2[]1
– looking from [above] at a collection of similar objects [below]
– the [sail] rests below the [water]
– [gold] for [pan]
– there are more [hats] than [people]
– [circular] food on [heart-shaped] wood
– the [water] is filled with [plastic]

Examples with 2 Main Preds
– [it] ran away while [they] pursued
– the person in a [brown] coat looks back and the person in a [black] coat looks forward
– the melting white food is [cold] while the brown is [warm]
– a kid [jumped] then [threw] a basketball
– the person is [jumping] while the cat is [sitting]
– a person wearing [yellow] with their feet in the air and a person wearing [stripes]
– the [computer's] screen is on and the [phone's] screen is off
– the person with facial hair [cycles] and the other person [runs]
– the person with green legs is running quite [slowly] and the red legged one runs [faster]
– a [] person wearing yellow and a person wearing stripes [jumping]

Visual Tag: Pragmatics (41/400)
• "comprises examples where the images need to be interpreted non-literally" (for example, the prepositional phrase attaches in a different place, "idiomatic use", and so on)
Example captions: It starts with ["A"] and ends with ["Z"] / It starts with ["Z"] and ends with ["A"]

Visual Tag: Series (31/400)
• Both images come from the same series on the source stock photo site (similar people, background, and so on).
Example captions: the [masked] wrestler hits the [unmasked] wrestler / the [unmasked] wrestler hits the [masked] wrestler

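Visual Tag: Symbolic (24/400)
• Examples that require understanding a symbolic depiction (note: models that rely on an object detector are relatively weak on illustration-style images).
Example captions: astronauts in [blue] suits with a [red] planet in the background / astronauts in [red] suits with a [blue] planet in the background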


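Models evaluated (1/2)
• CLIP [1], FLAVA Contrastive [2]: contrastive learning with a dual encoder.
• FLAVA ITM [2]: adds a cross-encoder on top of the above and also performs Image-Text Matching.
Figure (schematic): CLIP and FLAVA Contrastive use separate Vision and Text encoders; FLAVA ITM feeds the Vision and Text encoders into a Joint encoder. Example input caption: "some plants surrounding a lightbulb".
Note: this is only a rough picture. The details of each model differ.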
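To make the dual-encoder idea concrete, here is a minimal sketch of how such a model could produce the score s(C, I): the caption and the image are embedded independently, and the score is a similarity between the two vectors (cosine similarity is assumed here). The embeddings below are made-up stand-ins for encoder outputs, and the function names are hypothetical. A cross-encoder would instead take the caption and image together and output the score directly, which this sketch does not cover.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_encoder_scores(text_embs, image_embs):
    """Return s[i][j] = similarity of caption i and image j."""
    return [[cosine(t, v) for v in image_embs] for t in text_embs]


# Made-up embeddings standing in for text-encoder and image-encoder outputs.
text_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2]]
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
s = dual_encoder_scores(text_embs, image_embs)
print(s)
```

Because the two sides only meet at the final similarity, any caption difference that the text encoder fails to encode cannot be recovered later.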

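Models evaluated (2/2)
• UNITER [3], VILLA [4], VinVL [5], ViLT [6], VisualBERT [7]: cross-encoder models that use object detection (OD) or patch embeddings.
• LXMERT [8], UniT [9], ViLBERT [10]: dual + cross-encoder models that use object detection.
• VSRN, VSE++: RNN-based models (explanation omitted).
Figure (schematic): LXMERT, UniT, and ViLBERT use separate Vision (OD) and Text encoders followed by a Joint encoder; UNITER, VILLA, VinVL, ViLT, and VisualBERT feed OD/patch features and the text into a single Joint encoder. Example input caption: "some plants surrounding a lightbulb".
Note: this is only a rough picture. The details of each model differ.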

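Reference: model quality on another task
• Example: accuracy of image↔text retrieval on Flickr30k (1000 images × 5 captions).
• CLIP and UNITER are highly accurate even without fine-tuning.

Model     Image→Text   Text→Image
UNITER    83.6         68.7
CLIP      88.0         68.7

Values are the percentage of queries in the 1000-image test set whose top-ranked result is correct (zero-shot).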
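The "top-ranked result is correct" measure above (commonly called Recall@1) can be sketched as follows. The sketch assumes a similarity table `sim[i][j]` between image i and caption j, with captions stored in blocks so that caption j belongs to image `j // captions_per_image`; in the image→text direction a hit is counted when the top caption is any of the image's own captions. Function names and numbers are illustrative assumptions.

```python
def recall_at_1_image_to_text(sim, captions_per_image):
    """Fraction of images whose top-ranked caption is one of their own."""
    hits = 0
    for i, row in enumerate(sim):
        best_caption = max(range(len(row)), key=row.__getitem__)
        hits += best_caption // captions_per_image == i
    return hits / len(sim)


def recall_at_1_text_to_image(sim, captions_per_image):
    """Fraction of captions whose top-ranked image is their own image."""
    n_captions = len(sim[0])
    hits = 0
    for j in range(n_captions):
        best_image = max(range(len(sim)), key=lambda i: sim[i][j])
        hits += best_image == j // captions_per_image
    return hits / n_captions


# Toy table: 2 images, 2 captions per image (made-up numbers).
sim = [
    [0.9, 0.8, 0.1, 0.85],  # image 0 vs captions 0..3
    [0.2, 0.3, 0.7, 0.60],  # image 1 vs captions 0..3
]
print(recall_at_1_image_to_text(sim, 2), recall_at_1_text_to_image(sim, 2))
```

In the toy table, caption 3 belongs to image 1 but scores higher with image 0, so text→image retrieval misses one of four captions.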

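Overall results: human scores and chance level
• Humans answer about 90% correctly on both Text (Image→Caption) and Image (Caption→Image).
• The chance level is 25% for each individual score and 16.67% for the combined (Group) score.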
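The chance levels above can be checked by brute force. The sketch below assumes a model that assigns random scores with no ties, so that all 24 orderings of the four scores s(C0, I0), s(C0, I1), s(C1, I0), s(C1, I1) are equally likely; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def text_ok(s00, s01, s10, s11):
    """Each image prefers its own caption (sXY = score of caption X, image Y)."""
    return s00 > s10 and s11 > s01


def image_ok(s00, s01, s10, s11):
    """Each caption prefers its own image."""
    return s00 > s01 and s11 > s10


# Every possible ranking of the four scores, each equally likely by assumption.
orderings = list(permutations(range(4)))
n = len(orderings)

text_chance = Fraction(sum(text_ok(*o) for o in orderings), n)
image_chance = Fraction(sum(image_ok(*o) for o in orderings), n)
group_chance = Fraction(sum(text_ok(*o) and image_ok(*o) for o in orderings), n)
print(text_chance, image_chance, group_chance)  # 1/4 1/4 1/6
```

The Group-Score chance level is 1/6 rather than 1/4 × 1/4 because the two conditions share scores: both hold exactly when s(C0, I0) and s(C1, I1) are the two largest of the four values.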

Overall results: Text-Score (Image→Caption)
• Several models score above chance level.
• However, they remain far below the human score.

Overall results: Image-Score (Caption→Image)
• All models score below chance level.
• The same holds for the Group-Score.

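Why did every model fall below chance level on Image-Score (even though Text-Score is passable)?
• The paper does not give a clear reason.
• "More investigation is required to pinpoint the reasons: perhaps textual encoders are stronger, …" (… that seems wrong?)
Figure (schematic): Vision and Text encoders scoring the captions "some plants surrounding a lightbulb" and "a lightbulb surrounding some plants" for the Image-Score.
Presenter's hypothesis: if the text encoder is "weak" and its representations of captions C0 and C1 barely differ, then one of the following holds:
• s(C0, I0) > s(C0, I1) ⇒ s(C1, I0) > s(C1, I1)
• s(C1, I1) > s(C1, I0) ⇒ s(C0, I1) > s(C0, I0)
In either case the Image-Score is 0: whichever image one caption prefers, the other caption prefers it too. This is likely why the models fall short even of random guessing.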
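The hypothesis can be illustrated with a small simulation. This is a toy model, not the paper's experiment: embeddings are random Gaussian vectors unrelated to any real image or caption, scores are dot products, and the "weak" text encoder is modeled by making the embedding of C1 equal to that of C0 plus a tiny perturbation `eps`. All names and parameters are assumptions made for this sketch.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def simulate(n_trials=20000, dim=8, eps=1e-3, seed=0):
    """Return (text_rate, image_rate) for a toy model with near-identical captions."""
    rng = random.Random(seed)
    text_hits = image_hits = 0
    for _ in range(n_trials):
        v0 = [rng.gauss(0, 1) for _ in range(dim)]  # image I0
        v1 = [rng.gauss(0, 1) for _ in range(dim)]  # image I1
        t0 = [rng.gauss(0, 1) for _ in range(dim)]  # caption C0
        # "Weak" text encoder: C1 is almost the same vector as C0.
        t1 = [x + eps * rng.gauss(0, 1) for x in t0]
        s00, s01 = dot(t0, v0), dot(t0, v1)
        s10, s11 = dot(t1, v0), dot(t1, v1)
        image_hits += s00 > s01 and s11 > s10  # each caption picks its image
        text_hits += s00 > s10 and s11 > s01   # each image picks its caption
    return text_hits / n_trials, image_hits / n_trials


text_rate, image_rate = simulate()
print(text_rate, image_rate)
```

In this toy setting the Text-Score stays near the 25% chance level, because it depends only on the small difference between the two caption vectors, while the Image-Score collapses toward 0, because both captions almost always prefer the same image.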

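Results by tag (Linguistic / swap type)
• For "Object" (noun-phrase swaps) and "Relation" (swaps of verbs, adjectives, and so on), the Image-Score is low for every model → the text encoder is weak and fails to pick up fine-grained differences between the texts.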

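Results by tag (Linguistic / swap type)
• CLIP does quite well on "Both". When the swap substantially changes what is depicted, as in [fire] [truck] / [truck] [fire], the text encoder can tell the captions apart.
• However, "Both" has few examples, so the score should be treated as indicative only.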

Results by tag (Linguistic / number of predicates)
• Captions with 2 Main Preds are more complex than those with 1 Main Pred, so the scores drop clearly.

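Results by tag (Visual)
• The counts are small, so the results are indicative only, but for Series in particular the images are hard to distinguish even on the image-encoder side, so the Text-Score is also low.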

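Other results
• The longer the text, the worse the score.
  – This suggests that the text encoder is weak.
• The correlation between caption perplexity (measured with GPT-2) and score is low.
• Differences between model architectures are small.
• More training data is better.
  – However, CLIP does not improve as much as that would suggest.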

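Discussion: somewhat unnatural captions
• Perhaps because it is hard to collect images and texts that satisfy the constraints, some of the positive image-text pairs feel slightly odd (one would not normally caption the image that way).
Example (from https://huggingface.co/spaces/CVPR/winoground-explorer):
– a person spraying water on someone else and a person on a bike
– a person spraying water on a person on a bike and someone else
This is not wrong, but treating one of the people riding a bike as "someone else" is unnatural.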


References (V&L models)
[1] Alec Radford et al.: Learning Transferable Visual Models From Natural Language Supervision. ICML 2021: 8748-8763
[2] Amanpreet Singh et al.: FLAVA: A Foundational Language And Vision Alignment Model. CoRR abs/2112.04482 (2021)
[3] Yen-Chun Chen et al.: UNITER: UNiversal Image-TExt Representation Learning. ECCV (30) 2020: 104-120
[4] Zhe Gan et al.: Large-Scale Adversarial Training for Vision-and-Language Representation Learning. NeurIPS 2020
[5] Pengchuan Zhang et al.: VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR 2021: 5579-5588
[6] Wonjae Kim et al.: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021: 5583-5594
[7] Liunian Harold Li et al.: VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR abs/1908.03557 (2019)
[8] Hao Tan et al.: LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP/IJCNLP (1) 2019: 5099-5110
[9] Ronghang Hu et al.: UniT: Multimodal Multitask Learning with a Unified Transformer. ICCV 2021: 1419-1429
[10] Jiasen Lu et al.: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019: 13-23
