Paper walkthrough: CoCa: Contrastive Captioners are Image-Text Foundation Models

Takehiro Matsuda

2 Paper information
Title: CoCa: Contrastive Captioners are Image-Text Foundation Models
• Paper: https://arxiv.org/abs/2205.01917
• Code: https://github.com/lucidrains/CoCa-pytorch (unofficial PyTorch implementation)
• Venue: Transactions on Machine Learning Research
• Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
• Affiliation: Google Research
Why I chose this paper:
• I saw a demo built on Google Cloud Vertex AI and became interested in the vision-language foundation model that generates its feature vectors.

3 Introduction
Demo built on Google Cloud Vertex AI: https://ai-demos.dev/
It provides text-to-image and image-to-image search over items listed on Mercari US.
Articles introducing the demo:
https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search?hl=en
https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings?hl=en
Feature-space map: https://atlas.nomic.ai/map/vertexAI-mercari

4 High score in multiple tasks
As an image-text foundation model, CoCa achieves high performance across a wide range of tasks.

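5 Background: Single-Encoder model
Example: ViT
$$\mathcal{L}_{\mathrm{Cls}} = -p(y) \log q_{\theta}(x)$$
These models are typically trained with a cross-entropy loss on datasets of images paired with annotated labels, such as ImageNet.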
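A minimal sketch of the classification loss on this slide, in plain Python. With a one-hot label distribution p(y), the cross-entropy reduces to the negative log-probability of the correct class. The logits and class indices are made-up illustration values, not numbers from the paper.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # With a one-hot p(y), -sum_c p(c) log q(c) equals -log q(label).
    q = softmax(logits)
    return -math.log(q[label])

logits = [2.0, 0.5, -1.0]                 # hypothetical model outputs for 3 classes
loss_if_class0 = cross_entropy(logits, 0)  # label agrees with the largest logit
loss_if_class2 = cross_entropy(logits, 2)  # label disagrees with the prediction
print(loss_if_class0, loss_if_class2)
```

The loss is small when the label matches the highest-scoring class and large otherwise.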

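6 Background: Dual-Encoder model
Example: CLIP
An image encoder and a text encoder are trained jointly with a contrastive loss on a large volume of image/description pairs collected from the web (the descriptions are not necessarily accurate).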
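The contrastive objective can be sketched as follows. This is an illustrative, pure-Python version of a symmetric image-to-text and text-to-image softmax loss over a batch, assuming L2-normalized embeddings and a fixed temperature of 0.07 (an assumed value for illustration; real models usually learn it). The toy embeddings are invented.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Pair i of images and texts is the positive; all other pairs are negatives.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # transpose for text-to-image
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return image_to_text + text_to_image

images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(images, matched_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_matched, loss_swapped)
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily.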

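7 Background: Encoder-Decoder model
Example: SimVLM
$$\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}(y_t \mid y_{<t}, x)$$
The model is trained autoregressively to maximize the conditional probability of the paired text.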
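To make the captioning loss concrete, here is a small sketch that sums the negative log-probabilities of the target tokens, one per decoding step. The per-step distributions stand in for a decoder's outputs conditioned on the image and the previous tokens; the probabilities and the three-word vocabulary are hypothetical.

```python
import math

def captioning_loss(step_probs, target_ids):
    # step_probs[t] is the predicted distribution over the vocabulary at step t,
    # i.e. P(y_t | y_<t, x). The loss is the summed negative log-likelihood.
    return -sum(math.log(step_probs[t][y]) for t, y in enumerate(target_ids))

step_probs = [
    [0.7, 0.2, 0.1],    # step 1: target token 0
    [0.1, 0.8, 0.1],    # step 2: target token 1
    [0.25, 0.25, 0.5],  # step 3: target token 2
]
target_ids = [0, 1, 2]
loss = captioning_loss(step_probs, target_ids)
print(loss)  # equals -log(0.7 * 0.8 * 0.5)
```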

8 Purpose of CoCa
Unify single-encoder, dual-encoder, encoder-decoder paradigms
Training a single image-text foundation model makes all three approaches available.

9 Overview of CoCa
See https://blog.research.google/2022/05/image-text-pre-training-with.html?m=1
In a standard encoder-decoder Transformer, every decoder layer attends to the encoder outputs.
CoCa splits the decoder into two parts: the first half drops cross-attention and acts as a unimodal text decoder, while the second half is a multimodal text decoder.
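The decoder split can be pictured with a small layout sketch. It only records which layers cross-attend to the image features and is not a model implementation. The function name, the dictionary fields, and the 12-layer, 6/6 split are illustrative assumptions.

```python
def decoder_layout(n_layers=12, n_unimodal=6):
    # The first n_unimodal layers see text only (causal self-attention);
    # the remaining layers additionally cross-attend to the image encoder output.
    if not 0 <= n_unimodal <= n_layers:
        raise ValueError("n_unimodal must be between 0 and n_layers")
    layout = []
    for i in range(n_layers):
        is_multimodal = i >= n_unimodal
        layout.append({
            "layer": i,
            "kind": "multimodal" if is_multimodal else "unimodal",
            "causal_self_attention": True,
            "cross_attention_to_image": is_multimodal,
        })
    return layout

layout = decoder_layout()
for layer in layout:
    print(layer["layer"], layer["kind"], layer["cross_attention_to_image"])
```

The output of the last unimodal layer can serve as a text-only representation for the contrastive loss, while the output of the last multimodal layer feeds the captioning loss.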

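10 Overview of CoCa
$$\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}$$
$$\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T} \log P_{\theta}(y_t \mid y_{<t}, x)$$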
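As a worked example of the combined objective, the sketch below weights the two losses with the values given on the training-settings slide (contrastive weight 1.0, captioning weight 2.0). The individual loss values are hypothetical placeholders.

```python
def coca_loss(l_con, l_cap, lambda_con=1.0, lambda_cap=2.0):
    # Weighted sum of the contrastive and captioning losses.
    return lambda_con * l_con + lambda_cap * l_cap

total = coca_loss(l_con=0.5, l_cap=1.5)  # hypothetical per-batch loss values
print(total)  # 1.0 * 0.5 + 2.0 * 1.5 = 3.5
```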

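11 CoCa setting
• A 288x288 image is split into 18x18 patches, giving 256 image tokens. (Apparently one epoch is run at a higher resolution of 576 x 576.)
• The largest CoCa model follows the ViT-giant setup, with a 1B-parameter image encoder and 2.1B parameters in total including the text decoder.
• Attentional pooling is a single multi-head attention layer with n_query learnable queries that adapts the features to the task (generative: 256, contrastive: 1).
• Annotated images and web data can be handled together in a single path. Labels can be treated as text such as "a photo of the cat, animal".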
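The token count and the role of the pooler can be sketched as below. The arithmetic follows the slide (288 / 18 = 16 patches per side, 16 x 16 = 256 tokens). The pooler is a simplified single-head dot-product attention without learned projections, which is an assumption made to keep the example short; the embedding dimension of 4 and the random values are illustrative only.

```python
import math
import random

def num_image_tokens(image_size, patch_size):
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

def attention_pool(queries, tokens):
    # Each query produces one output: a softmax-weighted average of the tokens.
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        pooled.append([sum(w / z * t[j] for w, t in zip(weights, tokens))
                       for j in range(dim)])
    return pooled

random.seed(0)
dim = 4
n_tokens = num_image_tokens(288, 18)  # 256
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

contrastive_queries = [[random.gauss(0, 1) for _ in range(dim)]]  # n_query = 1
generative_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

contrastive_out = attention_pool(contrastive_queries, tokens)  # one global embedding
generative_out = attention_pool(generative_queries, tokens)    # 256 pooled tokens
print(n_tokens, len(contrastive_out), len(generative_out))
```

Changing only the number of queries switches between a single global embedding and a sequence of pooled tokens, which is why one encoder can serve both objectives.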

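12 Video data in CoCa
The encoder trained on still images can be reused as is.
For video, frames are fed to the encoder one at a time, and an attentional pooler with a single query token pools the resulting features.
Differences between downstream tasks are handled by switching the attentional pooler.
The authors argue that this is more practical than using a different head for each task.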

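13 Experiments training data
• JFT-3B: a Google internal dataset (not public) of images with annotated labels. The labels are converted into text such as "a photo of the cat, animal".
• ALIGN dataset: a Google internal dataset (not public) of 1.8B image-text pairs scraped from the web.
Both datasets are used together to train from scratch, with no separate pretraining stage.
(Figures: examples from JFT-300M; examples from the ALIGN dataset)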
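A possible way to turn label annotations into text, following the prompt quoted on the slide, is sketched below. The function name and the exact template handling are assumptions; only the example string comes from the slide.

```python
def labels_to_text(labels, template="a photo of the {}"):
    # Join all label names of an image into one comma-separated caption.
    if not labels:
        raise ValueError("at least one label is required")
    return template.format(", ".join(labels))

caption = labels_to_text(["cat", "animal"])
print(caption)  # a photo of the cat, animal
```

Once labels are text, annotated images and web image-text pairs can pass through the same text decoder and the same two losses.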

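14 Setting for training
• Each batch contains 65,536 image-text pairs.
• The objective being optimized is
$$\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}}$$
with $\lambda_{\mathrm{Con}} = 1.0$ and $\lambda_{\mathrm{Cap}} = 2.0$.
• 500k steps ≈ 5 epochs on JFT and 10 epochs on ALIGN.
• Training takes 5 days on 2,048 Cloud TPU v4 chips.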
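The step and epoch figures can be cross-checked with simple arithmetic. The sketch assumes JFT-3B holds roughly 3.0 billion images (inferred from its name) and uses the 1.8 billion pairs stated for ALIGN; how each batch is divided between the two datasets is not stated here, so only the totals are compared.

```python
batch_size = 65_536
steps = 500_000
examples_seen = batch_size * steps  # total image-text pairs processed

jft_size = 3.0e9    # assumption: about 3B images in JFT-3B
align_size = 1.8e9  # 1.8B pairs, as stated for ALIGN
examples_from_epochs = 5 * jft_size + 10 * align_size

ratio = examples_seen / examples_from_epochs
print(f"{examples_seen:,} pairs seen; epoch estimate {examples_from_epochs:,.0f}; ratio {ratio:.3f}")
```

The two totals agree to within about one percent (roughly 32.8 billion versus 33 billion), which is consistent with the "≈" on the slide.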

15 Evaluation
CoCa's encoder is trained on still images only, yet it achieves good scores on both tasks.
(Figures: Image recognition; Video action recognition)

16 Evaluation of Image classification and video action recognition
Two CoCa variants are compared: frozen encoder (only the attentional pooling layer is trained) and fine-tuned encoder (the encoder is trained as well).
Image classification and video action recognition are evaluated as recognition tasks (single-encoder).
Performance is high even without tuning the encoder.
(Tables: Image classification; Video action recognition)

17 Comparison of model size
CoCa achieves high performance with fewer parameters than other foundation models.

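18 Evaluation of Image-Text Retrieval
Following the CLIP setup, every image and text in the test set is passed through its respective encoder to obtain embeddings.
Cosine similarity is then used to search the test set for the description that matches an image, or for the image that matches a description.
MSCOCO and Flickr30K are evaluated as image-text retrieval tasks (dual-encoder).
Scores are high even without fine-tuning the encoders.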
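The retrieval procedure can be sketched as follows: rank every gallery item by cosine similarity to the query and check whether the true match is in the top k. The toy two-dimensional embeddings are invented, and the assumption that query i matches gallery item i is made only for this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(query_embs, gallery_embs, k):
    # Assumes the correct match for query i is gallery item i.
    hits = 0
    for i, q in enumerate(query_embs):
        ranked = sorted(range(len(gallery_embs)),
                        key=lambda j: cosine(q, gallery_embs[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Hypothetical embeddings standing in for encoder outputs.
image_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
text_embs = [[0.9, 0.2], [0.2, 0.8], [-0.8, 0.1]]

text_to_image_r1 = recall_at_k(text_embs, image_embs, k=1)
image_to_text_r1 = recall_at_k(image_embs, text_embs, k=1)
print(text_to_image_r1, image_to_text_r1)
```

Because the encoders are independent, gallery embeddings can be computed once and reused for every query.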

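19 Evaluation of Video-Text Retrieval
MSR-VTT is used to evaluate video-text retrieval.
Because MSR-VTT videos come from YouTube, videos that are no longer available are excluded, so the evaluation runs on a subset.
The encoder performs well even though it was not trained on video data.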

20 Evaluation of Multimodal understanding
Beyond classification and retrieval (matching), CoCa also handles a variety of multimodal image-understanding tasks and performs well on them.
• Visual Entailment (SNLI-VE)
• Visual Reasoning (NLVR2)
• VQA (Visual Question Answering)

21 Evaluation of Image Captioning
Beyond classification and retrieval (matching), CoCa can also handle text generation tasks such as captioning.
The model is fine-tuned on MSCOCO and evaluated on MSCOCO and NoCaps.
CoCa does not use CIDEr-specific optimization, which exploits biases in the data.
• MSCOCO captions: over one and a half million captions describing over 330,000 images.
• NoCaps: 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets.
Reference on the evaluation metrics: https://qiita.com/amtsyh/items/a926b79b90dfabe895e9

22 Generated caption examples
Examples of image captions generated by CoCa.

23 Ablation study
Questions: Is the captioning loss useful even for non-generative tasks? What is the right balance between the contrastive and captioning loss weights?
Adding the captioning loss on top of the contrastive loss improves performance.
A weighting of Cap:Con = 2:1 performed well, and the added compute cost is not large.
"We hypothesize that generative objectives learn fine-grained text representations that further improve text understanding"

24 Ablation study
Question: the decoder is split into a unimodal decoder and a multimodal decoder, but in what proportion?
The total number of unimodal and multimodal decoder layers is held fixed (12 layers) while the ratio is varied.
With too few unimodal layers the zero-shot classification score drops, and with too few multimodal layers the VQA score drops. The midpoint of 6 layers gives a good balance.
"One possibility is that global text representation for retrieval doesn't require deep modules [33] while early fusion for shallow layers may also be unnecessary for multimodal understanding."

25 Compare with CLIP
https://laion.ai/blog/coca/
Comparison with CLIP and other models using the LAION dataset:
• Text-to-image and image-to-text retrieval scores are good.
• Image captioning scores are not as good as those in the original paper.
• The parameter count is much smaller than in the original paper.
Captioning sample code: https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb
