[Journal club] CLIP4IDC: CLIP for Image Difference Captioning

CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen
Department of Computer Science, Aalto University, Finland
AACL 2022
後神美結 (Komei Sugiura Laboratory, Keio University)
Guo, Z., Wang, T. J., & Laaksonen, J. (2022, November). CLIP4IDC: CLIP for Image Difference Captioning. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 33-42).

Background: the image difference captioning (IDC) task
• Generate text that describes the change between two images
Example captions:
"The person walking is no longer there"
"There is a smaller group of people in the lot"

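Related work (methods; characteristics and issues)
DUDA [Park+, ICCV 2019], VACC [Kim+, ICCV 2021], IFDC [Huang+, IEEE Transactions on Multimedia 2021], DUDA+Aux [Hosseinzadeh+, CVPR 2021]
• Extract visual features with a pretrained model
• The extracted features cannot bridge the domain gap between pretraining and the IDC task
• The visual representations extracted from each image separately are not correlated with the text features
IDC-PCL [Yao+, AAAI 2022]
• Fine-tunes on IDC task datasets
• Does not make full use of pretraining on large-scale datasets
[Figures: DUDA, IDC-PCL]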

Proposed method (1/5): CLIP4IDC
• Inputs (during training)
  o textual caption
  o image 1
  o image 2
• Adopts an “adapt-and-fine-tune” approach

Proposed method (2/5): Adaptation
• Components
  o Language Encoder
  o Vision Encoder
    ▪ Intra Encoder
    ▪ Inter Encoder
  o Image-Text Retrieval
    ▪ Image-Pair-to-Text (IP-T)
    ▪ Text-to-Image-Pair (T-IP)

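Proposed method (3/5): Adaptation: Image-Text Retrieval
• Image-Text Retrieval
  o IP-T and T-IP retrieval adapt the visual features to the domain of the IDC task
  o Contrastive approach
    ▪ Pulls the visual features toward the features of the image-change captions
  o Figure annotations: combined visual representation; mean-pooling operation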
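The retrieval setup above can be illustrated with a minimal NumPy sketch. It assumes that the combined visual representation of an image pair is a mean over the vision encoder's output tokens and that pairs and captions are matched by cosine similarity. Which tokens are pooled is not stated on the slide, and all names, shapes, and the random features below are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def mean_pool(token_features: np.ndarray) -> np.ndarray:
    """Average token features of shape (batch, tokens, dim) into (batch, dim)."""
    return token_features.mean(axis=1)

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_similarity_matrix(pair_repr: np.ndarray, text_repr: np.ndarray) -> np.ndarray:
    """Entry [i, j] is the cosine similarity of image pair i and caption j."""
    return l2_normalize(pair_repr) @ l2_normalize(text_repr).T

rng = np.random.default_rng(0)
batch, tokens, dim = 4, 10, 8

# Hypothetical vision-encoder outputs for 4 image pairs (assumption: random data).
vision_tokens = rng.normal(size=(batch, tokens, dim))
pair_repr = mean_pool(vision_tokens)

# Hypothetical caption features, placed close to their matching pair.
text_repr = pair_repr + 0.01 * rng.normal(size=(batch, dim))

sim = cosine_similarity_matrix(pair_repr, text_repr)
ip_to_t = sim.argmax(axis=1)  # IP-T: best caption for each image pair
t_to_ip = sim.argmax(axis=0)  # T-IP: best image pair for each caption
```

Both retrieval directions read the same similarity matrix, once along rows and once along columns, which is why a single contrastive objective can train them together.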

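Proposed method (4/5): Adaptation: Image-Text Retrieval
• Image-Text Retrieval
  o Loss function for IP-T retrieval: [equation not preserved in the transcript]
  o Loss function for T-IP retrieval: [equation not preserved in the transcript]
  o Loss function for adaptation: [equation not preserved in the transcript]
  o Equation annotations: cosine similarity function; learnable temperature parameter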
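Since the formulas themselves are not in the transcript, the following is only a sketch of a common symmetric contrastive (InfoNCE-style) loss that is consistent with the annotations on this slide: cosine similarities divided by a temperature, one softmax per retrieval direction. Summing the two directional terms to obtain the adaptation loss is an assumption, the function names are hypothetical, and the temperature is a fixed float here rather than a learned parameter.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def adaptation_loss(sim: np.ndarray, temperature: float):
    """Return (IP-T loss, T-IP loss, their sum) for a square similarity matrix.

    sim[i, j] is the cosine similarity between image pair i and caption j;
    matching pairs are assumed to lie on the diagonal.
    """
    logits = sim / temperature
    diag = np.arange(sim.shape[0])
    # IP-T: each image pair (row) should select its own caption.
    loss_ip_t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # T-IP: each caption (column) should select its own image pair.
    loss_t_ip = -log_softmax(logits, axis=0)[diag, diag].mean()
    # Assumption: the adaptation loss is the plain sum of both directions.
    return loss_ip_t, loss_t_ip, loss_ip_t + loss_t_ip

# Uninformative similarities give the chance-level loss log(batch) per direction.
chance = adaptation_loss(np.zeros((3, 3)), temperature=0.1)
# Perfectly separated similarities give a loss close to zero.
separated = adaptation_loss(np.eye(3), temperature=0.1)
```

A smaller temperature sharpens the softmax, so the same similarity gap between matching and non-matching items yields a lower loss.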

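Proposed method (5/5): Fine-tuning
• Fine-tuning
  o The Vision Encoder is initialized with the weights obtained during adaptation
  o Components
    ▪ Multi-layer Transformer Encoders
      • Input: the output of the Vision Encoder
    ▪ Multi-layer Transformer Decoders
      • Input: the caption and the output of the Multi-layer Transformer Encoders
      • Predicts the next token from the preceding ground-truth (GT) tokens and the visual difference
  o Loss function: cross entropy loss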
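The decoder objective described above, predicting each next token from the preceding GT tokens under a cross entropy loss, is usually implemented with teacher forcing. The sketch below shows only that input/target shifting and the loss computation. The token ids, vocabulary size, and all-zero logits are hypothetical stand-ins for a real tokenizer and decoder.

```python
import numpy as np

def teacher_forcing_pairs(token_ids: np.ndarray):
    """Split a caption into decoder inputs and next-token targets."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross entropy of logits (steps, vocab) against target ids (steps,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Hypothetical token ids: 0 = <bos>, 1 = <eos>, others = caption words.
caption = np.array([0, 5, 7, 2, 1])
decoder_inputs, targets = teacher_forcing_pairs(caption)

# Stand-in for decoder outputs: uniform logits over a 10-word vocabulary.
vocab_size = 10
logits = np.zeros((len(targets), vocab_size))
loss = cross_entropy(logits, targets)  # equals log(vocab_size) for uniform logits
```

At step t the decoder sees `decoder_inputs[: t + 1]` and is scored on `targets[t]`, so GT tokens, not the model's own predictions, condition every step during training.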

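Experimental setup
• Datasets
  o CLEVR-Change
    ▪ Built for the IDC task from the synthetic VQA dataset CLEVR
  o Spot-the-Diff
    ▪ Real-world image pairs with captions describing the change between the images
  o Image-Editing-Request
    ▪ Image pairs from before and after an edit, with the corresponding editing instruction
• Training environment
  o adaptation: V100 GPU × 2
  o captioning: V100 GPU × 1
Example caption: "The person walking is no longer there"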

Quantitative results: Spot-the-Diff
• Achieves SOTA on BLEU, METEOR, CIDEr, and ROUGE
  o +5.9 on CIDEr, the primary metric, over the best-scoring baseline

Qualitative results: Spot-the-Diff
GT: the person walking is no longer there
CLIP4IDC: the person walking in the parking lot is gone
⇒ The content matches, and the generated caption even adds detail
GT: there is a smaller group of people in the lot
CLIP4IDC: there are two people in the right image
⇒ The phrasing does not express a change, but the content and the focus are correct
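BLEU, one of the metrics reported for this dataset, builds on clipped n-gram precision between a generated caption and the GT. As a rough illustration of why a caption with matching content can still score modestly, the sketch below computes that precision for the first example pair. It is a simplified building block written for this note, with whitespace tokenization as an assumption, and is not the evaluation script used in the paper.

```python
from collections import Counter

def clipped_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

gt = "the person walking is no longer there"
generated = "the person walking in the parking lot is gone"

p1 = clipped_ngram_precision(generated, gt, n=1)  # 4 of 9 unigrams match
p2 = clipped_ngram_precision(generated, gt, n=2)  # 2 of 8 bigrams match
```

The extra detail "in the parking lot" and the paraphrase "is gone" both lower the overlap, even though a human reader would judge the caption correct.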

Ablation Study
• CLIP4IDC (without adaptation) generally outperformed CLIP-FT
• CLIP4IDC (with adaptation) outperformed CLIP4IDC (without adaptation)
⇒ Adaptation, which trains the model to capture fine-grained visual changes, is effective

Replication and error analysis
Many failure cases were found
• The phrasing does not express a change
  ⇒ Change the model that is fine-tuned
  ⇒ Change how the loss function is computed
• The focus is wrong
  ⇒ Use multi-level visual representations
GT: the person walking is no longer there
CLIP4IDC: the person in the parking lot is gone
GT 1: the people by the building have moved and joined others
GT 2: the people in the parking lot have left
CLIP4IDC: the people are in the parking lot

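Remarks
• Strengths
  o Achieves SOTA on three datasets
  o Demonstrates the usefulness of adaptation through an ablation study
• Weaknesses
  o Captions generated by the baselines are not included in the qualitative results
  o There is no error analysis, and no examples of major failures are shown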

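Summary
• Background
  o Domain gap between pretraining and the IDC task, in both objectives and datasets
  o Image features extracted from each image separately are not suited to extracting changes between images
• Proposed method: CLIP4IDC
  o Adaptation (Image-Text Retrieval)
  o Fine-tuning
• Results
  o Achieves SOTA on BLEU, METEOR, CIDEr, and ROUGE on CLEVR-Change, Spot-the-Diff, and Image-Editing-Request
  o An extension of this method has been proposed and is under review at CoRL 2024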

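Appendix: Models that use existing fine-tuning
The objectives of pretraining and the IDC task are misaligned
There is a domain gap between the pretraining datasets and the IDC task datasets
When features are extracted from each image separately, the differences are not extracted well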

Appendix: Adaptation: Language Encoder
• Language Encoder G
  o textual caption
  o Equation annotations: linear projection of each token; positional embedding; output

Appendix: Adaptation: Vision Encoder
• Vision Encoder F
  o Captures fine-grained changes between images
  o Equation annotations: image class embedding; embedding of image patch; positional embedding
  o Equation annotations: token embedding; positional embedding

Appendix: Quantitative results (CLEVR-Change)
• Achieves SOTA on BLEU, METEOR, CIDEr, and ROUGE
  o +21.8 on CIDEr, the primary metric, over the best-scoring baseline

Appendix: Qualitative results (CLEVR-Change)
GT: the blue ball changed to yellow
CLIP4IDC: the blue ball became yellow
⇒ The content matches
GT: the big purple metal block behind the green thing changed to rubber
CLIP4IDC: the large purple metal block that is behind the big purple metal sphere became rubber
⇒ The part describing the position differs from the GT, but the content is correct

Appendix: Quantitative results (Image-Editing-Request)
• Achieves SOTA on BLEU, METEOR, CIDEr, and ROUGE
  o +4.5 on CIDEr, the primary metric, over the best-scoring baseline

Appendix: Qualitative results (Image-Editing-Request)
GT: color the sky blue
CLIP4IDC: make the image more blue
⇒ Differs from the GT, but may not be wrong as an editing instruction
GT: brighten the entire photo
CLIP4IDC: brighten the photo
⇒ The content matches

Appendix: CLEVR-Change by change type
• CIDEr scores for each change type in CLEVR-Change
• Change types
  o C: Color
  o T: Texture
  o M: Move
  o A: Add
  o D: Drop
  o DI: Distractor

Appendix: Adaptation results

Appendix: Effect of the number of layers (CLEVR-Change)

Semantic Machine Intelligence Lab., Keio Univ.
