Slide 1

Slide 1 text

Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson
Tel-Aviv University, UC Berkeley, IBM Research, MIT-IBM Watson AI Lab
EMNLP 2023
https://arxiv.org/abs/2305.06343

Slide 2

Slide 2 text

Summary
• Visual Language Models (VLMs) face challenges in understanding complex scenes, particularly attributes and relations.
• This study incorporates a small number of structured scene graphs into VLM training, enhancing both visual and textual comprehension.
• The method improves VLM performance across multiple datasets, effectively addressing these scene-understanding limitations.

Slide 3

Slide 3 text

Introduction
• VLM
  • Image-text encoders (e.g., CLIP, BLIP, BLIP2)
  • Remarkable zero-shot performance thanks to massive-scale image-text pairs
• Scene Graph (a toy sketch follows below)
  • Node: object -> (class label, bounding box, attributes)
  • Edge: relation -> (obj_1, relation category, obj_2)
  • Dataset: Visual Genome (Krishna et al., International Journal of Computer Vision, 2017), etc.
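To make the node/edge structure above concrete, here is a minimal Python sketch of a scene graph in the spirit of Visual Genome annotations; the class and field names are illustrative, not taken from the paper or from any specific library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SGObject:
    """A scene-graph node: an object with a class label, a bounding box, and attributes."""
    class_label: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    attributes: List[str] = field(default_factory=list)

@dataclass
class SGRelation:
    """A scene-graph edge: (obj_1, relation category, obj_2), given as indices into the object list."""
    subj: int
    relation: str
    obj: int

@dataclass
class SceneGraph:
    objects: List[SGObject]
    relations: List[SGRelation]

# Toy example: "a brown dog chasing a red ball"
sg = SceneGraph(
    objects=[
        SGObject("dog", (10, 20, 120, 200), ["brown"]),
        SGObject("ball", (130, 150, 170, 190), ["red"]),
    ],
    relations=[SGRelation(subj=0, relation="chasing", obj=1)],
)
```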

Slide 4

Slide 4 text

Introduction
• Background
  • Large-scale pretrained VLMs still struggle with compositional scene recognition
  • They especially struggle to recognize object attributes and relationships, as well as the state of actions
  • Scene Graphs (SGs) are effective for compositional recognition but have a high annotation cost
  • This makes them impractical to prepare at large scale

Slide 5

Slide 5 text

Introduction
• Purpose
  • They aim to enhance the compositional recognition capabilities of pretrained VLMs using a small number of SGs
• Method
  • They propose a fine-tuning method for pretrained VLMs, named Scene Graphs for Vision-Language Models (SGVL), which leverages scene graphs to enhance these models

Slide 6

Slide 6 text

Methodology
• Text encoder: contrastive learning with SG-to-text
• Image encoder: integrating an "SG component"

Slide 7

Slide 7 text

Methodology
• Text encoder: contrastive learning with SG-to-text
  • Positive/negative captions derived from the SG
• Image encoder: integrating an "SG component"
(Figure: enhancing compositionality by contrasting hard-negative captions that highlight structural aspects)
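One way to picture this text-side objective is a CLIP-style contrastive loss in which each image's SG-derived positive caption competes against the other captions in the batch plus its own hard-negative caption. The sketch below is a simplified illustration under that assumption (the function and argument names are mine), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    """Image-to-text contrastive loss with one SG-derived hard-negative caption per image.

    img_emb:     (B, D) L2-normalized image embeddings
    pos_txt_emb: (B, D) embeddings of the positive (SG-derived) captions
    neg_txt_emb: (B, D) embeddings of the corresponding hard-negative captions
    """
    # Each image is compared against all positive captions in the batch
    # plus its own hard-negative caption (one extra logit column).
    logits_pos = img_emb @ pos_txt_emb.t() / temperature                           # (B, B)
    logits_neg = (img_emb * neg_txt_emb).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    logits = torch.cat([logits_pos, logits_neg], dim=1)                            # (B, B + 1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)                 # index of the matching caption
    return F.cross_entropy(logits, targets)
```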

Slide 8

Slide 8 text

Methodology
• Positive/negative captions from the SG
• Negative captions are generated by (see the sketch below)
  • Swapping asymmetric relations
  • Binding attributes to several objects
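A rough sketch of how such negatives could be produced, building on the toy SceneGraph structure from the Introduction slide. The relation list and helper names are hypothetical, and "binding attributes to several objects" is read here as re-assigning each object's attributes to the other object.

```python
import random

# Relations whose direction matters: swapping subject and object yields a
# plausible-looking but incorrect caption, i.e. a hard negative.
ASYMMETRIC_RELATIONS = {"chasing", "on top of", "holding", "behind"}

def phrase(attributes, name):
    return "a " + " ".join(list(attributes) + [name])

def positive_and_negative(sg):
    """Return one (positive, negative) caption pair from a SceneGraph."""
    r = random.choice(sg.relations)
    s, o = sg.objects[r.subj], sg.objects[r.obj]
    pos = f"{phrase(s.attributes, s.class_label)} {r.relation} {phrase(o.attributes, o.class_label)}"

    if r.relation in ASYMMETRIC_RELATIONS:
        # Negative type 1: swap the two sides of an asymmetric relation.
        neg = f"{phrase(o.attributes, o.class_label)} {r.relation} {phrase(s.attributes, s.class_label)}"
    else:
        # Negative type 2: bind each object's attributes to the other object.
        neg = f"{phrase(o.attributes, s.class_label)} {r.relation} {phrase(s.attributes, o.class_label)}"
    return pos, neg
```

Because only a relation direction or an attribute binding changes, the negative caption shares almost all of its words with the positive one, which is what makes it a structurally hard negative.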

Slide 9

Slide 9 text

Methodology
• Text encoder: contrastive learning with SG-to-text
• Image encoder: integrating an "SG component"
  • Adaptive SG tokens
  • Partitioning the image encoder
(Figure: enhancing compositionality by predicting SG elements (objects, relationships))

Slide 10

Slide 10 text

Methodology
• Adaptive SG tokens
  • Learnable soft prompts
  • This allows effective training of the image encoder on the task of predicting the SG
(Figure: the image encoder's SG tokens are projected into object and relation representations; each object representation predicts a bounding box and an object-name embedding, and each relation representation predicts a bounding box and a relation-name embedding)
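A minimal PyTorch sketch of the adaptive-SG-token idea: learnable tokens are appended to the patch sequence, processed by the image encoder, and then decoded into bounding boxes and name embeddings. The token counts, dimensions, and head names here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdaptiveSGTokens(nn.Module):
    """Learnable SG tokens plus the prediction heads that decode them into
    bounding boxes and object/relation name embeddings."""
    def __init__(self, dim=768, n_obj_tokens=16, n_rel_tokens=8, name_dim=512):
        super().__init__()
        self.obj_tokens = nn.Parameter(torch.randn(n_obj_tokens, dim) * 0.02)
        self.rel_tokens = nn.Parameter(torch.randn(n_rel_tokens, dim) * 0.02)
        self.box_head = nn.Linear(dim, 4)          # (cx, cy, w, h), normalized via sigmoid
        self.name_head = nn.Linear(dim, name_dim)  # compared against text embeddings of names

    def prepend(self, patch_tokens):
        """patch_tokens: (B, N, D) -> (B, N + n_obj + n_rel, D) with SG tokens appended."""
        B = patch_tokens.size(0)
        sg = torch.cat([self.obj_tokens, self.rel_tokens], dim=0)
        return torch.cat([patch_tokens, sg.unsqueeze(0).expand(B, -1, -1)], dim=1)

    def decode(self, sg_token_out):
        """sg_token_out: (B, n_obj + n_rel, D) encoder outputs at the SG-token positions."""
        boxes = self.box_head(sg_token_out).sigmoid()
        names = self.name_head(sg_token_out)
        return boxes, names
```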

Slide 11

Slide 11 text

Methodology
• Partitioning image patches and SG tokens in the image encoder
  • This allows better learning of the graph prediction task
  • Although the Q, K, V and MLP weights are partitioned, attention is performed over all tokens (patch and SG), as in the sketch below
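The partitioning described above might look roughly like the block below: separate Q/K/V and MLP weights for patch tokens versus SG tokens, while attention itself runs over the concatenation of both. This is a sketch under those assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartitionedBlock(nn.Module):
    """Transformer block with per-token-type Q/K/V and MLP weights but joint attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.heads = heads
        self.qkv_patch = nn.Linear(dim, 3 * dim)
        self.qkv_sg = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp_patch = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_sg = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, n_patch):
        """x: (B, N, D); the first n_patch tokens are image patches, the rest are SG tokens."""
        B, N, D = x.shape
        h = self.norm1(x)
        # Separate projections per token type, then one attention over all tokens.
        qkv = torch.cat([self.qkv_patch(h[:, :n_patch]), self.qkv_sg(h[:, n_patch:])], dim=1)
        q, k, v = qkv.reshape(B, N, 3, self.heads, D // self.heads).permute(2, 0, 3, 1, 4)
        attn = F.scaled_dot_product_attention(q, k, v)  # attention over patch + SG tokens jointly
        x = x + self.proj(attn.transpose(1, 2).reshape(B, N, D))
        h = self.norm2(x)
        x = x + torch.cat([self.mlp_patch(h[:, :n_patch]), self.mlp_sg(h[:, n_patch:])], dim=1)
        return x
```

Intuitively, the joint attention lets SG tokens gather information from image patches, while the separate weights keep the graph-prediction pathway from disturbing the pretrained patch pathway.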

Slide 12

Slide 12 text

Methodology
• Objective function (for image-SG pairs), sketched below
  1. Image-text contrastive loss, as in CLIP
  2. Object and relation matching loss following DETR (Carion et al., ECCV 2020)
     • Allows the SG tokens to learn object/relation representations
• For image-text pairs, the objective is just ℒ_Cont
(Figure: the matching loss combines a term on the estimated probability of the teacher label and a loss based on the bounding box)
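As a rough illustration of the matching term, a DETR-style loss pairs each SG-token prediction with a ground-truth scene-graph element via Hungarian matching and then penalizes the name prediction and the bounding box. The cost and loss terms below are simplified assumptions (a cosine-similarity name cost plus an L1 box cost), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sg_matching_loss(pred_boxes, pred_names, gt_boxes, gt_name_emb, box_weight=5.0):
    """Simplified DETR-style matching loss for one image.

    pred_boxes:  (Q, 4) boxes decoded from the SG tokens
    pred_names:  (Q, D) predicted name embeddings (L2-normalized)
    gt_boxes:    (G, 4) ground-truth boxes from the scene graph
    gt_name_emb: (G, D) text embeddings of the ground-truth names (L2-normalized)
    """
    # Matching cost: how well each SG token explains each ground-truth element.
    cost_name = -pred_names @ gt_name_emb.t()           # (Q, G), lower = more similar
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G) L1 box distance
    cost = (cost_name + box_weight * cost_box).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # Hungarian matching

    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.long)
    loss_name = (1 - (pred_names[pred_idx] * gt_name_emb[gt_idx]).sum(dim=-1)).mean()
    loss_box = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return loss_name + box_weight * loss_box
```

For image-SG pairs the total objective would then be ℒ_Cont plus this matching term; for plain image-text pairs only ℒ_Cont applies, as noted above.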

Slide 13

Slide 13 text

Experiments
• Experiment settings
  • Training data: image-SG pairs (10K from the Visual Genome dataset, VG) and standard image-text pairs (less than 1% of LAION-400M)
  • Pretrained models: {CLIP (ViT-B/32), BLIP (ViT-B/32) / BLIP2 (ViT-g)}
  • {32, 8} epochs
  • Each batch comprises {256, 32} image-text pairs and 8 image-SG pairs
  • 4 {V100, A100} GPUs
• Evaluation baselines
  • CLIP, BLIP/BLIP2, NegCLIP/LLaVA/miniGPT4, etc.
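For concreteness, the CLIP-variant training setup could be written down roughly as the configuration below; the field names are illustrative, while the values follow the slide.

```python
# Hypothetical configuration for the CLIP-based SGVL variant
# (field names are illustrative; values are taken from the slide).
sgvl_clip_config = {
    "base_model": "CLIP ViT-B/32",
    "epochs": 32,
    "batch_image_text_pairs": 256,  # standard image-text pairs per batch (< 1% of LAION-400M used overall)
    "batch_image_sg_pairs": 8,      # image-scene-graph pairs per batch (10K pairs from Visual Genome)
    "gpus": "4x V100",
}
```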

Slide 14

Slide 14 text

Experiments
• Evaluation benchmarks
  • VL-Checklist (VLC) (Zhao et al., arXiv)
    • Positive/negative captions per image (C_pos, C_neg, I)
  • Winoground (Thrush et al., CVPR 2022)
    • 2 image-text pairs (C_0, I_0, C_1, I_1) with swapped words
  • Attribution, Relation and Order (ARO) (Yuksekgonul et al., ICLR 2023)
    • Select the most suitable caption for an image from 5 captions that vary the relationship, object, and attributes
  • Visual Spatial Reasoning (VSR) (Liu et al., TACL 2023)
    • Judge whether the spatial relation stated in the text holds in the image
  • ZS (various zero-shot tasks)
    • 21 classification tasks from ELEVATER (Li et al., NeurIPS 2022)
(Figure: Winoground and VSR samples)

Slide 15

Slide 15 text

Experiments
• Results
  • CLIP/BLIP/BLIP2-SGVL outperform their pretrained base models across several datasets
  • These improvements come at the price of a slight degradation in zero-shot performance
• Winoground scoring (see the sketch below)
  • TextScore = 1 if sim(I_0, C_0) > sim(I_0, C_1), else 0
  • ImageScore = 1 if sim(C_0, I_0) > sim(C_0, I_1), else 0
  • GroupScore = TextScore AND ImageScore
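The Winoground-style scores above can be computed per example as in this small sketch, where sim stands for any image-text similarity produced by the VLM. Note that the official Winoground metric additionally checks the second pair (I_1, C_1); this follows the simplified definition shown on the slide.

```python
def winoground_scores(sim, I0, C0, I1, C1):
    """Per-example scores following the simplified definitions on the slide.

    sim(a, b) -> float similarity between an image and a caption.
    """
    text_score = int(sim(I0, C0) > sim(I0, C1))    # did the image pick its own caption?
    image_score = int(sim(C0, I0) > sim(C0, I1))   # did the caption pick its own image?
    group_score = text_score & image_score         # 1 only if both directions are correct
    return text_score, image_score, group_score
```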

Slide 16

Slide 16 text

Experiments
• Fine-grained results
  • SGVL shows improved compositionality for almost all categories in Winoground and VLC
  • Especially for swapping image objects/relations (in Winoground)

Slide 17

Slide 17 text

Experiments
• Ablation study
  A) Graph-based {captions / negative captions} and SG tokens are effective
  B) Adding adaptive SG tokens and partitioning image patches and SG tokens are effective
  C) SG annotations need to be dense to improve the compositionality of the VLM

Slide 18

Slide 18 text

Experiments
• Case study

Slide 19

Slide 19 text

Conclusion
• Visual Language Models (VLMs) face challenges in understanding complex scenes, particularly attributes and relations.
• This study incorporates a small number of structured scene graphs into VLMs, enhancing visual and textual comprehension.
• The method improves VLM performance across multiple datasets, effectively addressing the initial scene-understanding limitations.

Slide 20

Slide 20 text
