
【EMNLP 2023】Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs (paper reading group presentation slides)

mori yuichiro

January 25, 2024

Transcript

  1. Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
     Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson
     Tel-Aviv University, UC Berkeley, IBM Research, MIT-IBM Watson AI Lab
     EMNLP 2023
     https://arxiv.org/abs/2305.06343
  2. Summary
     • Vision & Language Models (VLMs) struggle to understand complex scenes, particularly object attributes and relations.
     • This study incorporates a small number of structured scene graphs into VLMs, enhancing visual and textual comprehension.
     • The method improves VLM performance across multiple datasets, effectively addressing these scene-understanding limitations.
  3. Introduction
     • VLM
       • image-text encoders (e.g., CLIP, BLIP, BLIP2)
       • remarkable zero-shot performance thanks to pretraining on massive-scale image-text pairs
     • Scene Graph (see the sketch below)
       • Node: object -> (class label, bounding box, attributes)
       • Edge: relation -> (obj_1, relation category, obj_2)
       • Dataset: Visual Genome (Krishna et al., International Journal of Computer Vision 2017), etc.
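To make the node/edge structure above concrete, here is a minimal Python sketch of an image-level scene graph as plain dataclasses; the class and field names are my own illustration and are not taken from the paper or from the Visual Genome API.

```python
from dataclasses import dataclass, field

@dataclass
class SGObject:
    """Node: an object with a class label, a bounding box, and attributes."""
    class_label: str
    bbox: tuple[float, float, float, float]        # (x, y, w, h)
    attributes: list[str] = field(default_factory=list)

@dataclass
class SGRelation:
    """Edge: (obj_1, relation category, obj_2), referencing objects by index."""
    subject: int
    predicate: str
    object: int

@dataclass
class SceneGraph:
    objects: list[SGObject]
    relations: list[SGRelation]

# Example graph for "a brown dog lying on a red sofa"
sg = SceneGraph(
    objects=[
        SGObject("dog", (10, 40, 120, 80), ["brown"]),
        SGObject("sofa", (0, 60, 300, 140), ["red"]),
    ],
    relations=[SGRelation(subject=0, predicate="lying on", object=1)],
)
```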
  4. Introduction
     • Background
       • Large-scale pretrained VLMs still struggle with compositional scene recognition
         • especially with recognizing object attributes and relationships, as well as the state of actions
       • Scene Graphs (SGs) are effective for compositional recognition but have a high annotation cost
         • making them impractical to prepare at large scale
  5. Introduction
     • Purpose
       • They aim to enhance the compositional recognition capabilities of pretrained VLMs using only a small amount of SG data
     • Method
       • They propose a fine-tuning method for pretrained VLMs, named Scene Graphs for Vision-Language Models (SGVL), that leverages scene graphs to enhance these models
  6. Methodology
     • Text encoder: contrastive learning with SG-to-text captions
       • positive/negative captions generated from the SG
     • Image encoder: integration of an "SG Component"
     (Figure: enhancing compositionality by contrasting hard-negative captions that highlight structural aspects)
  7. Methodology
     • Positive/negative captions from the SG (a sketch follows below)
       • negative captions are generated by
         • swapping asymmetric relations
         • binding attributes to different objects
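A minimal sketch of how such graph-based hard negatives could be produced from a single (subject, predicate, object) triple with attributes; the caption template and function names are hypothetical, not the paper's exact generation rules.

```python
def caption_from_triple(subj, pred, obj, subj_attr=None, obj_attr=None):
    """Render a (subject, predicate, object) triple as a simple caption."""
    s = f"{subj_attr} {subj}" if subj_attr else subj
    o = f"{obj_attr} {obj}" if obj_attr else obj
    return f"a {s} {pred} a {o}"

def negative_captions(subj, pred, obj, subj_attr, obj_attr):
    """Two hard-negative strategies from the slide:
    1) swap an asymmetric relation (subject and object trade places)
    2) bind the attributes to the wrong objects
    """
    return [
        # 1) relation swap: "dog lying on sofa" -> "sofa lying on dog"
        caption_from_triple(obj, pred, subj, obj_attr, subj_attr),
        # 2) attribute swap: "brown dog ... red sofa" -> "red dog ... brown sofa"
        caption_from_triple(subj, pred, obj, obj_attr, subj_attr),
    ]

positive = caption_from_triple("dog", "lying on", "sofa", "brown", "red")
print(positive)   # a brown dog lying on a red sofa
print(negative_captions("dog", "lying on", "sofa", "brown", "red"))
```

The positive caption and its hard negatives then feed the contrastive objective on the text side.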
  8. Methodology
     • Text encoder: contrastive learning with SG-to-text captions
     • Image encoder: integration of an "SG Component"
       • adaptive SG tokens
       • partitioned image encoder
     (Figure: enhancing compositionality through predicting SG elements (objects, relationships))
  9. Methodology
     • Adaptive SG tokens
       • learnable soft prompts (a sketch follows below)
       • allow the image encoder to be trained effectively on the SG prediction task
     (Figure: the image encoder processes the adaptive SG tokens; each object token is projected to a bounding box and an object-name embedding, and each relation token to a bounding box and a relation-name embedding)
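Below is a PyTorch-style sketch of the general idea: learnable SG tokens are appended to the patch sequence and projected to a bounding box and a name embedding. Dimensions, token counts, and module names are assumptions for illustration; the actual SGVL encoder processes these tokens jointly with the patches through the partitioned blocks described on the next slide.

```python
import torch
import torch.nn as nn

class AdaptiveSGTokens(nn.Module):
    """Learnable SG tokens plus prediction heads (a sketch, not the paper's code)."""

    def __init__(self, dim=768, n_obj_tokens=8, n_rel_tokens=8, name_dim=512):
        super().__init__()
        self.obj_tokens = nn.Parameter(torch.randn(n_obj_tokens, dim) * 0.02)
        self.rel_tokens = nn.Parameter(torch.randn(n_rel_tokens, dim) * 0.02)
        self.box_head = nn.Linear(dim, 4)           # bounding box (cx, cy, w, h)
        self.name_head = nn.Linear(dim, name_dim)   # matched against text/name embeddings

    def forward(self, patch_tokens):                # patch_tokens: (B, N_patches, dim)
        b = patch_tokens.size(0)
        sg = torch.cat([self.obj_tokens, self.rel_tokens], dim=0)
        sg = sg.unsqueeze(0).expand(b, -1, -1)
        # In the full model the concatenated sequence would go through the
        # (partitioned) transformer; here we only show the token layout.
        tokens = torch.cat([patch_tokens, sg], dim=1)
        sg_out = tokens[:, -sg.size(1):]
        boxes = self.box_head(sg_out).sigmoid()     # normalized box predictions
        names = self.name_head(sg_out)              # object/relation name embeddings
        return boxes, names
```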
  10. Methodology
     • Partitioning image-patch tokens and SG tokens inside the image encoder (a sketch follows below)
       • allows better learning of the graph prediction task
       • although the Q, K, V projections and the MLP are partitioned, attention is performed over all tokens (patch and SG)
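A minimal sketch of one such partitioned self-attention layer, assuming two token groups (image patches and SG tokens) that get separate projection weights but attend jointly, as the slide describes; the layer names and hyperparameters are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartitionedSelfAttention(nn.Module):
    """Separate Q/K/V and output projections per token group, joint attention."""

    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv_patch = nn.Linear(dim, 3 * dim)   # weights for patch tokens
        self.qkv_sg = nn.Linear(dim, 3 * dim)      # separate weights for SG tokens
        self.out_patch = nn.Linear(dim, dim)
        self.out_sg = nn.Linear(dim, dim)

    def _heads(self, qkv):                         # (B, N, 3*dim) -> 3 x (B, H, N, hd)
        b, n, _ = qkv.shape
        return qkv.view(b, n, 3, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)

    def forward(self, patch_tokens, sg_tokens):
        n_patch = patch_tokens.size(1)
        qp, kp, vp = self._heads(self.qkv_patch(patch_tokens))
        qs, ks, vs = self._heads(self.qkv_sg(sg_tokens))
        # Attention is computed over ALL tokens (patch + SG) at once.
        q, k, v = (torch.cat(t, dim=2) for t in ((qp, qs), (kp, ks), (vp, vs)))
        out = F.scaled_dot_product_attention(q, k, v)            # (B, H, N, hd)
        out = out.transpose(1, 2).reshape(out.size(0), -1, self.n_heads * self.head_dim)
        return self.out_patch(out[:, :n_patch]), self.out_sg(out[:, n_patch:])
```

The slide notes that the MLP is partitioned in the same way, so a per-group feed-forward layer would follow this attention layer.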
  11. Methodology
     • Objective function (for image-SG pairs; see the sketch below)
       1. Image-text contrastive loss, as in CLIP
       2. Object/relation matching loss following DETR (Carion et al., ECCV 2020)
          • lets the SG tokens learn object/relation representations
          • combines a classification term (estimated probability of the teacher label) with a loss based on the bounding box
     • For plain image-text pairs, the objective is just the contrastive loss L_Cont
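A hedged LaTeX reconstruction of the objective sketched on this slide; the weights λ and the exact form of the matching loss (negative log-probability of the matched label plus a bounding-box term, with a DETR-style Hungarian assignment σ̂) are assumptions based on the cited reference, not values taken from the paper.

```latex
% For image-SG pairs: contrastive loss plus DETR-style matching loss.
% For plain image-text pairs: contrastive loss only.
\mathcal{L}_{\text{image-SG}}   = \mathcal{L}_{\mathrm{Cont}} + \lambda\,\mathcal{L}_{\mathrm{Match}},
\qquad
\mathcal{L}_{\text{image-text}} = \mathcal{L}_{\mathrm{Cont}}

% Matching loss over ground-truth SG elements i (objects and relations),
% with \hat{\sigma} the Hungarian assignment of SG tokens to elements:
\mathcal{L}_{\mathrm{Match}} =
  \sum_{i} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  \;+\; \lambda_{\mathrm{box}}\,\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]
```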
  12. Experiments
     • Experiment settings
       • Training data: image-SG pairs (10K from the Visual Genome dataset: VG) and standard image-text pairs (less than 1% of LAION-400M)
       • Pretrained models: {CLIP (ViT-B/32), BLIP (ViT-B/32) / BLIP2 (ViT-g)}
       • {32, 8} epochs
       • each batch comprises {256, 32} image-text pairs and 8 image-SG pairs
       • 4 {V100, A100} GPUs
     • Evaluation baselines
       • CLIP, BLIP/BLIP2, NegCLIP, LLaVA, miniGPT4, etc.
  13. Experiments
     • Evaluation benchmarks
       • VL-Checklist (VLC) (Zhao et al., arXiv)
         • one positive and one negative caption per image (C_pos, C_neg, I)
       • Winoground (Thrush et al., CVPR 2022)
         • two image-text pairs (C_0, I_0, C_1, I_1) whose captions swap words
       • Attribution, Relation and Order (ARO) (Yuksekgonul et al., ICLR 2023)
         • select the most suitable caption for an image out of 5 captions that perturb relationships, objects, and attributes
       • Visual Spatial Reasoning (VSR) (Liu et al., TACL 2023)
         • judge whether the spatial relation stated in the caption holds between the objects in the image
       • ZS (various zero-shot tasks)
         • 21 classification tasks from ELEVATER (Li et al., NeurIPS 2022)
     (Figure: sample items from Winoground and VSR)
  14. Experiments
     • Results
       • CLIP/BLIP/BLIP2-SGVL outperform their pretrained base models across several datasets
       • these improvements come at the price of a slight degradation in zero-shot performance
     • Winoground scoring (per example; see the sketch below):
       $\mathrm{TextScore} = \mathbb{1}[\,\mathrm{sim}(I_0, C_0) > \mathrm{sim}(I_0, C_1)\,]$
       $\mathrm{ImageScore} = \mathbb{1}[\,\mathrm{sim}(C_0, I_0) > \mathrm{sim}(C_0, I_1)\,]$
       $\mathrm{GroupScore} = \mathrm{TextScore} \land \mathrm{ImageScore}$
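A small sketch of how the per-example scores above could be computed, assuming a CLIP-style similarity function sim(image, caption); the function and variable names are illustrative.

```python
def winoground_scores(sim, i0, c0, i1, c1):
    """Per-example scores as defined on the slide: pick the right caption
    for an image (text), the right image for a caption (image), and both (group)."""
    text_score = int(sim(i0, c0) > sim(i0, c1))    # correct caption for I_0
    image_score = int(sim(i0, c0) > sim(i1, c0))   # correct image for C_0
    group_score = text_score & image_score
    return text_score, image_score, group_score
```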
  15. Experiments
     • Fine-grained results
       • SGVL shows improved compositionality in almost all categories of Winoground and VLC
         • especially for swaps of image objects/relations (in Winoground)
  16. Experiments
     • Ablation study
       A) Graph-based {captions / negative captions} and SG tokens are effective
       B) Adding adaptive SG tokens and partitioning image-patch and SG tokens are effective
       C) SG annotations need to be dense to improve the compositionality of VLMs
  17. Conclusion
     • Vision & Language Models (VLMs) struggle to understand complex scenes, particularly object attributes and relations.
     • This study incorporates a small number of structured scene graphs into VLMs, enhancing visual and textual comprehension.
     • The method improves VLM performance across multiple datasets, effectively addressing these scene-understanding limitations.