Paper Review: CoCa: Contrastive Captioners are Image-Text Foundation Models

Slide 1

Slide 1 text

Paper Review: CoCa: Contrastive Captioners are Image-Text Foundation Models
Takehiro Matsuda

Slide 2

Slide 2 text

Paper information
• Title: CoCa: Contrastive Captioners are Image-Text Foundation Models
• Paper: https://arxiv.org/abs/2205.01917
• Code: https://github.com/lucidrains/CoCa-pytorch
• Venue: Transactions on Machine Learning Research
• Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
• Affiliation: Google Research
Why I chose this paper:
• After seeing a demo built with Google Cloud Vertex AI, I became interested in the vision-language foundation model that generates its feature vectors.

Slide 3

Slide 3 text

Introduction
• Demo built with Google Cloud Vertex AI: https://ai-demos.dev/
  Text-to-image and image-to-image search over items listed on Mercari USA.
• Introductory articles:
  https://cloud.google.com/blog/products/ai-machine-learning/multimodal-generative-ai-search?hl=en
  https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings?hl=en
• Feature-space map: https://atlas.nomic.ai/map/vertexAI-mercari

Slide 4

Slide 4 text

High score in multiple tasks
As an image-text foundation model, CoCa achieves high performance across a wide range of tasks.

Slide 5

Slide 5 text

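Background: Single-Encoder model
Example: ViT
$\mathcal{L}_{\text{Cls}} = -p(y) \log q_\theta(x)$
These models are usually trained with a cross-entropy loss on datasets such as ImageNet, which pair each image with an annotated label.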
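The classification loss above can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming one-hot targets for p(y) and raw class scores (logits) as the model output; the function name and the toy numbers are illustrative and not from the paper.

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Mean of -p(y) * log q_theta(x) over a batch, with one-hot p(y)."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With one-hot targets only the log-probability of the true class remains.
    return -log_q[np.arange(len(labels)), labels].mean()

# Toy batch: 2 images, 3 classes (values are made up).
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = cross_entropy_loss(logits, labels)
```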

Slide 6

Slide 6 text

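Background: Dual-Encoder model
Example: CLIP
An image encoder and a text encoder are trained jointly with a contrastive loss on a large collection of image / description pairs crawled from the web (the descriptions are not necessarily accurate).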
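A contrastive loss of this kind can be sketched as follows. This is an illustration under stated assumptions, not CLIP's or CoCa's actual code: embeddings are L2-normalized, the i-th image and i-th text in the batch form the positive pair, the image-to-text and text-to-image terms are summed, and the temperature value 0.07 is a placeholder.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss for a batch of N paired embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (N, N)
    diag = np.arange(len(logits))
    # Image-to-text: each row should pick out its own text.
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should pick out its own image.
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return i2t + t2i

emb = np.eye(4)
aligned = contrastive_loss(emb, emb)                  # pairs match
shuffled = contrastive_loss(emb, emb[[1, 2, 3, 0]])   # pairs deliberately mismatched
```

The mismatched batch gives a much larger loss than the aligned one, which is the signal that pulls paired embeddings together during training.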

Slide 7

Slide 7 text

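Background: Encoder-Decoder model
Example: SimVLM
$\mathcal{L}_{\text{Cap}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, x)$
The model is trained autoregressively to maximize the conditional probability of the paired text given the image.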
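The captioning loss is a sum of per-token negative log-likelihoods. A minimal sketch, assuming the decoder has already produced one row of vocabulary logits per position t (conditioned on the earlier tokens and the image); the function name and toy sizes are illustrative.

```python
import numpy as np

def captioning_loss(logits, tokens):
    """-sum_t log P(y_t | y_<t, x).

    logits: (T, V) array, row t holds the scores for position t.
    tokens: (T,) array of target token ids y_1..y_T.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(tokens)), tokens].sum()

# With uniform logits every token has probability 1/V.
T, V = 5, 10
uniform_loss = captioning_loss(np.zeros((T, V)), np.arange(T))
```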

Slide 8

Slide 8 text

Purpose of CoCa
Unify the single-encoder, dual-encoder, and encoder-decoder paradigms.
By training a single image-text foundation model, all three approaches become available from one model.

Slide 9

Slide 9 text

Overview of CoCa
See https://blog.research.google/2022/05/image-text-pre-training-with.html?m=1
In a standard encoder-decoder Transformer, every decoder layer attends to the encoder outputs.
CoCa splits the decoder into two parts. The first half has its cross-attention removed and acts as a unimodal text decoder; the second half keeps cross-attention and acts as a multimodal text decoder.
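The split can be written down as a layer-by-layer specification. This is only a structural sketch: `DecoderLayerSpec` and `build_decoder_specs` are hypothetical names, and the 6 + 6 split is the midpoint setting examined in the ablation slide, not necessarily the configuration of every CoCa model.

```python
from dataclasses import dataclass

@dataclass
class DecoderLayerSpec:
    index: int
    causal_self_attention: bool  # every text decoder layer is causal
    cross_attention: bool        # True only if the layer attends to image encoder outputs

def build_decoder_specs(n_unimodal, n_multimodal):
    """First n_unimodal layers have no cross-attention; the rest do."""
    return [
        DecoderLayerSpec(i, True, i >= n_unimodal)
        for i in range(n_unimodal + n_multimodal)
    ]

specs = build_decoder_specs(6, 6)
n_cross = sum(s.cross_attention for s in specs)
```

The unimodal half produces a text-only representation for the contrastive loss, while the multimodal half produces image-conditioned outputs for the captioning loss.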

Slide 10

Slide 10 text

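Overview of CoCa
$\mathcal{L}_{\text{CoCa}} = \lambda_{\text{Con}} \cdot \mathcal{L}_{\text{Con}} + \lambda_{\text{Cap}} \cdot \mathcal{L}_{\text{Cap}}$
$\mathcal{L}_{\text{Cap}} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, x)$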
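The combined objective is a plain weighted sum. A minimal sketch; the function name is hypothetical, and the default weights are the values given on the training-settings slide (λ_Con = 1.0, λ_Cap = 2.0).

```python
def coca_loss(con_loss, cap_loss, lambda_con=1.0, lambda_cap=2.0):
    """L_CoCa = lambda_con * L_Con + lambda_cap * L_Cap."""
    return lambda_con * con_loss + lambda_cap * cap_loss

# Toy values: a contrastive loss of 0.5 and a captioning loss of 1.5.
total = coca_loss(0.5, 1.5)
```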

Slide 11

Slide 11 text

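CoCa setting
• A 288x288 image is split into 18x18 patches, giving 256 image tokens. (Apparently one epoch is run at a higher resolution of 576x576.)
• The largest CoCa model follows the ViT-giant setup, with a 1B-parameter image encoder and 2.1B parameters in total including the text decoder.
• Attentional pooling is a single multi-head attention layer with n_query learnable queries that adapt to the task (generative = 256, contrastive = 1).
• Annotated images and web data are handled together in a single path. A label can be treated as text such as “a photo of the cat, animal”.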
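The token count and the two pooler settings can be checked with a small NumPy sketch. Assumptions are labeled: the learned query, key, and value projections are omitted (identity), the queries are random rather than learned, and the feature width 64 and head count 4 are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, n_heads=4):
    """Multi-head attention of n_query queries over image tokens.

    tokens: (n_tokens, d), queries: (n_query, d) -> output (n_query, d).
    Projections are omitted in this sketch; tokens serve as keys and values.
    """
    n_query, d = queries.shape
    d_head = d // n_heads
    q = queries.reshape(n_query, n_heads, d_head).transpose(1, 0, 2)  # (h, n_query, d_head)
    kv = tokens.reshape(-1, n_heads, d_head).transpose(1, 0, 2)       # (h, n_tokens, d_head)
    attn = softmax(q @ kv.transpose(0, 2, 1) / np.sqrt(d_head))       # (h, n_query, n_tokens)
    out = attn @ kv                                                   # (h, n_query, d_head)
    return out.transpose(1, 0, 2).reshape(n_query, d)

image_size, patch_size = 288, 18
n_tokens = (image_size // patch_size) ** 2  # 16 * 16 patches

rng = np.random.default_rng(0)
d = 64
tokens = rng.normal(size=(n_tokens, d))
cap_feats = attentional_pool(tokens, rng.normal(size=(256, d)))  # generative pooler
con_feat = attentional_pool(tokens, rng.normal(size=(1, d)))     # contrastive pooler
```

The generative pooler keeps 256 fine-grained tokens for the multimodal decoder to attend to, while the contrastive pooler collapses the image to a single global embedding.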

Slide 12

Slide 12 text

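Video data in CoCa
The encoder trained on still images is shared as-is.
Each video frame is fed into the encoder one at a time, and an attentional pooler reduces the result to a single query token.
Differences between downstream tasks are handled by swapping the attentional pooler.
The authors argue this is more practical than using a different head for each task.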
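The frame-by-frame flow can be sketched as below. Everything here is a stand-in: `encode_frame` is a hypothetical placeholder for the shared image encoder (a fixed random projection of 18x18 patches), the pooler is a single-head attention with one random query, and pooling over the concatenated tokens of all frames is an assumption about how the single token is formed.

```python
import numpy as np

def encode_frame(frame, d=32, patch=18):
    """Hypothetical stand-in for the shared image encoder: one token per patch."""
    h, w = frame.shape[:2]
    patches = frame.reshape(h // patch, patch, w // patch, patch, -1)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape((h // patch) * (w // patch), -1)
    # Fixed random projection so that every frame goes through the same "encoder".
    proj = np.random.default_rng(0).normal(size=(patches.shape[1], d))
    return patches @ proj / np.sqrt(patches.shape[1])

def pool_single_query(tokens, query):
    """Single-query attention pooling over all tokens -> one embedding."""
    scores = tokens @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens

video = np.random.default_rng(1).random((8, 288, 288, 3))         # 8 frames
frame_tokens = np.concatenate([encode_frame(f) for f in video])   # tokens of all frames
query = np.random.default_rng(2).normal(size=32)                  # learned in the real model
video_emb = pool_single_query(frame_tokens, query)
```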

Slide 13

Slide 13 text

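Experiments: training data
• JFT-3B: a Google internal dataset (not public) of images with annotated labels.
• ALIGN dataset: a Google internal dataset (not public) of 1.8B image-text pairs scraped from the web.
Labels are converted into text such as “a photo of the cat, animal”.
The two datasets are used simultaneously to train from scratch, with no separate pretraining stage.
(Figures: examples from JFT-300M; examples from the ALIGN dataset)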

Slide 14

Slide 14 text

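Setting for training
• Each batch contains 65,536 image-text pairs.
• The objective is $\mathcal{L}_{\text{CoCa}} = \lambda_{\text{Con}} \cdot \mathcal{L}_{\text{Con}} + \lambda_{\text{Cap}} \cdot \mathcal{L}_{\text{Cap}}$ with $\lambda_{\text{Con}} = 1.0$ and $\lambda_{\text{Cap}} = 2.0$.
• 500k steps ≈ 5 epochs on JFT and 10 epochs on ALIGN.
• Training takes 5 days on 2,048 Cloud TPU v4 chips.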
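The step count and the epoch counts are consistent with each other, which a quick calculation shows. The dataset sizes are assumptions read off the names on the training-data slide: "3B" in JFT-3B is taken as 3e9 images and ALIGN as 1.8e9 pairs.

```python
batch_size = 65_536
steps = 500_000
examples_seen = batch_size * steps  # total image-text pairs processed

jft_size = 3_000_000_000    # assumption: JFT-3B taken as 3e9 images
align_size = 1_800_000_000  # 1.8B pairs
epoch_budget = 5 * jft_size + 10 * align_size

ratio = examples_seen / epoch_budget
print(f"{examples_seen:,} examples seen vs {epoch_budget:,} from the epoch counts")
```

Under these assumptions the two totals agree to within about 1%, so "500k steps ≈ 5 epochs on JFT, 10 epochs on ALIGN" is a rounded statement of the same training budget.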

Slide 15

Slide 15 text

Evaluation
(Figure panels: image recognition; video action recognition)
Although the CoCa encoder is trained only on still images, it achieves good scores, including on video action recognition.

Slide 16

Slide 16 text

Evaluation of Image classification and video action recognition
Two CoCa variants are compared: frozen encoder (only the attentional pooling layer is trained) and finetuned encoder (the encoder is trained as well).
Image classification and video action recognition are evaluated as recognition (single-encoder) tasks.
CoCa performs well even without tuning the encoder.
(Tables: image classification; video action recognition)

Slide 17

Slide 17 text

Comparison of model size
CoCa achieves high performance with fewer parameters than other foundation models.

Slide 18

Slide 18 text

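Evaluation of Image-Text Retrieval
MSCOCO and Flickr30K are evaluated as image-text retrieval (dual-encoder) tasks.
Following the CLIP setup, every image and text in the test set is fed into its respective encoder to obtain embeddings.
Cosine similarity is then used to search the test set for the description that matches an image, or the image that matches a description.
CoCa scores well even without fine-tuning the encoders.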
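The retrieval procedure amounts to ranking by cosine similarity and checking whether the correct item is near the top. A minimal sketch of image-to-text recall@k, assuming for simplicity that each image has exactly one matching text at the same index (the real benchmarks have several captions per image); the function name is illustrative.

```python
import numpy as np

def recall_at_k(image_emb, text_emb, k):
    """Fraction of images whose matching text (same index) is in the top-k by cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T                       # (n_images, n_texts) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar texts
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

emb = np.eye(5)
perfect = recall_at_k(emb, emb, k=1)                    # every image finds its own text
mismatched = recall_at_k(emb, emb[[1, 2, 3, 4, 0]], k=1)  # texts rotated by one position
```

Text-to-image retrieval is the same computation on the transposed similarity matrix.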

Slide 19

Slide 19 text

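Evaluation of Video-Text Retrieval
MSR-VTT is used to evaluate video-text retrieval.
Because MSR-VTT videos come from YouTube, videos that can no longer be viewed are excluded, so the evaluation runs on a subset.
CoCa performs well even though its encoder was never trained on video data.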

Slide 20

Slide 20 text

Evaluation of Multimodal understanding
Beyond classification and retrieval (matching), CoCa also handles a variety of multimodal image understanding tasks and performs well on them:
• Visual Entailment (SNLI-VE)
• Visual Reasoning (NLVR2)
• VQA (Visual Question Answering)

Slide 21

Slide 21 text

Evaluation of Image Captioning
Beyond classification and retrieval (matching), CoCa can also perform text generation tasks such as captioning.
The model is trained on MSCOCO and evaluated on MSCOCO and NoCaps.
• MSCOCO captions: over one and a half million captions describing over 330,000 images.
• NoCaps: 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets.
CoCa does not use CIDEr-specific optimization, which exploits biases in the data.
Reference on the evaluation metrics: https://qiita.com/amtsyh/items/a926b79b90dfabe895e9

Slide 22

Slide 22 text

Generated caption examples
Examples of image captions generated by CoCa.

Slide 23

Slide 23 text

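Ablation study
Questions: Is adding the captioning loss useful even for non-generative tasks? What is the right balance between the contrastive loss weight and the captioning loss weight?
• Adding the captioning loss on top of the contrastive loss improves performance.
• A weight ratio of Cap:Con = 2:1 performed well, and the increase in compute cost is not large.
"We hypothesize that generative objectives learn fine-grained text representations that further improve text understanding"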

Slide 24

Slide 24 text

Ablation study
Question: the decoder is split into a unimodal decoder and a multimodal decoder, but what should the ratio be?
• The total number of layers is fixed at 12 and the split between unimodal and multimodal layers is varied.
• With fewer unimodal decoder layers, the zero-shot classification score drops; with fewer multimodal decoder layers, the VQA score drops. The midpoint of 6 layers gives a good balance.
"One possibility is that global text representation for retrieval doesn't require deep modules [33] while early fusion for shallow layers may also be unnecessary for multimodal understanding."

Slide 25

Slide 25 text

Compare with CLIP
https://laion.ai/blog/coca/
Comparison with CLIP and other models trained on the LAION dataset:
• Text-to-image retrieval and image-to-text retrieval scores are good.
• The image captioning score is not as good as that of the original paper.
• The parameter count is considerably smaller than that of the original paper's model.
Captioning sample code: https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb