mori yuichiro
January 25, 2024

【EMNLP 2023】Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs: slides for a paper reading group

Transcript

  1. Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

    Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson
    Tel-Aviv University, UC Berkeley, IBM Research, MIT-IBM Watson AI Lab
    EMNLP 2023
    https://arxiv.org/abs/2305.06343
  2. Summary

    • Visual Language Models (VLMs) face challenges in understanding complex scenes, particularly attributes and relations.
    • This study introduces a small number of structured scene graphs into VLMs, enhancing visual and textual comprehension.
    • The method improves VLM performance across multiple datasets, addressing the initial scene understanding limitations.
  3. Introduction

    • VLM
      • Image-text encoder (e.g., CLIP, BLIP, BLIP2)
      • Remarkable zero-shot performance thanks to massive-scale image-text pairs
    • Scene Graph
      • Node: object -> (class label, bounding box, attributes)
      • Edge: relation -> (obj_1, relation category, obj_2)
      • Dataset: Visual Genome (Krishna et al., International Journal of Computer Vision, 2017), etc.
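
    The node/edge structure above can be sketched as a small data structure. This is only an illustration: the class names, field names, and the (x, y, width, height) box convention are assumptions, not the Visual Genome schema or the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention for this sketch: (x, y, width, height).
BBox = Tuple[float, float, float, float]


@dataclass
class SGObject:
    """A node: class label, bounding box, and attributes."""
    label: str
    bbox: BBox
    attributes: List[str] = field(default_factory=list)


@dataclass
class SGRelation:
    """An edge: (obj_1, relation category, obj_2), with objects given by index."""
    subject: int
    predicate: str
    object: int


@dataclass
class SceneGraph:
    objects: List[SGObject]
    relations: List[SGRelation]

    def triplets(self) -> List[Tuple[str, str, str]]:
        """Return each edge as (subject label, relation category, object label)."""
        return [
            (self.objects[r.subject].label, r.predicate, self.objects[r.object].label)
            for r in self.relations
        ]


graph = SceneGraph(
    objects=[
        SGObject("dog", (10, 20, 50, 40), ["brown"]),
        SGObject("frisbee", (60, 15, 20, 10), ["red"]),
    ],
    relations=[SGRelation(0, "catching", 1)],
)
print(graph.triplets())
```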
  4. Introduction

    • Background
      • Large-scale pretrained VLMs still struggle with compositional scene recognition
        • Especially recognizing attributes and relationships of objects, as well as the state of actions
      • Scene Graphs (SG) are effective for compositional recognition but have a high annotation cost
        • This makes them impractical to prepare on a large scale
  5. Introduction

    • Purpose
      • Enhance the compositional recognition capabilities of pretrained VLMs using a small amount of SG data
    • Method
      • A fine-tuning method for pretrained VLMs, named Scene Graphs for Vision-Language Models (SGVL), that leverages scene graphs to improve these models
  6. Methodology

    • Text-encoder: contrastive learning with SG-to-text
      • Positive/negative captions from SG
      • Enhancing compositionality by contrasting hard-negative captions that highlight structural aspects
    • Image-encoder: integrating “SG Component”
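
    A minimal sketch of contrasting against hard-negative captions, assuming the hard negatives are simply appended to the caption candidates of a CLIP-style image-to-text loss. The function name, the temperature value, and the one-directional loss are assumptions for illustration; the slides do not give the exact formulation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def image_to_text_loss(image_embs, caption_embs, negative_embs, temperature=0.07):
    """Average cross-entropy over images.

    Image i's positive is caption i. The candidate set is every caption in the
    batch plus every hard-negative caption, so hard negatives compete directly
    with the positive.
    """
    candidates = caption_embs + negative_embs
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [dot(img, c) / temperature for c in candidates]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(image_embs)


images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.9, 0.1], [0.1, 0.9]]  # close to the positives by construction

loss_plain = image_to_text_loss(images, captions, [])
loss_with_negatives = image_to_text_loss(images, captions, hard_negatives)
print(loss_plain, loss_with_negatives)
```

    Adding candidates that sit close to the positive raises the loss, which is what forces the encoders to separate captions that differ only in structure.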
  7. Methodology

    • Positive/negative captions from SG
      • Negative captions are generated by
        • Swapping asymmetrical relations
        • Re-binding attributes across several objects
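
    The two negative-generation rules can be sketched on a single triplet. The caption template and helper names here are hypothetical; the actual SG-to-text procedure in the paper may differ.

```python
def describe(obj):
    """Render an object as 'attr1 attr2 label'."""
    return " ".join(list(obj["attributes"]) + [obj["label"]])


def caption(subj, predicate, obj):
    """Positive caption for one (subject, relation, object) triplet."""
    return f"{describe(subj)} {predicate} {describe(obj)}"


def swap_relation_negative(subj, predicate, obj):
    """Swap subject and object; only meaningful for asymmetric relations."""
    return caption(obj, predicate, subj)


def swap_attribute_negative(subj, predicate, obj):
    """Keep the relation but bind each attribute list to the other object."""
    s = {"label": subj["label"], "attributes": obj["attributes"]}
    o = {"label": obj["label"], "attributes": subj["attributes"]}
    return caption(s, predicate, o)


dog = {"label": "dog", "attributes": ["brown"]}
cat = {"label": "cat", "attributes": ["white"]}

positive = caption(dog, "chasing", cat)
neg_relation = swap_relation_negative(dog, "chasing", cat)
neg_attribute = swap_attribute_negative(dog, "chasing", cat)
print(positive, "|", neg_relation, "|", neg_attribute)
```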
  8. Methodology

    • Text-encoder: contrastive learning with SG-to-text
    • Image-encoder: integrating “SG Component”
      • Adaptive SG Token
      • Partitioning image-encoder
      • Enhancing compositionality through predicting SG elements (objects, relationships)
  9. Methodology

    • Adaptive SG Token
      • Learnable soft prompt
      • This allows for effective training of the image-encoder on the task of predicting the SG
    • [Figure: the image-encoder outputs object and relation tokens; a projection maps each object token to an object representation (bounding box, object name embedding) and each relation token to a relation representation (bounding box, relation name embedding)]
  10. Methodology

    • Partitioning image patch and SG token in image-encoder
      • This allows better learning of the graph prediction task
      • Although the Q, K, V and MLP are partitioned, the attention is performed over all tokens (patch and SG)
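
    A single-head sketch of the partitioning idea: patch tokens and SG tokens get their own Q/K/V projections, but the attention itself runs over the concatenation of both groups. The function and parameter names are assumptions, and the MLP partition, multi-head logic, and normalization are left out.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def partitioned_attention(patch_tokens, sg_tokens, patch_params, sg_params):
    """Each group uses its own (Wq, Wk, Wv); attention spans all tokens."""
    q, k, v = [], [], []
    for tokens, (Wq, Wk, Wv) in ((patch_tokens, patch_params), (sg_tokens, sg_params)):
        for x in tokens:
            q.append(matvec(Wq, x))
            k.append(matvec(Wk, x))
            v.append(matvec(Wv, x))
    scale = math.sqrt(len(q[0]))
    outputs = []
    for qi in q:
        # Every query attends to the keys of patch AND SG tokens.
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        outputs.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
patch_tokens = [[1.0, 0.0], [0.0, 1.0]]
sg_tokens = [[1.0, 1.0]]

out = partitioned_attention(patch_tokens, sg_tokens, (identity,) * 3, (doubled,) * 3)
out_shared = partitioned_attention(patch_tokens, sg_tokens, (identity,) * 3, (identity,) * 3)
print(out)
```

    Changing only the SG-side projections changes the patch outputs as well, because the patch queries still attend to the SG keys and values.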
  11. Methodology

    • Objective function (for image-SG pairs)
      1. Image-text contrastive loss, as in CLIP
      2. Object and relation matching loss, following DETR (Carion et al., ECCV 2020)
        • Lets the SG tokens learn object/relation representations
        • Combines the estimated probability of the ground-truth label with a loss based on the bounding box
    • For image-text pairs, the objective is just $\mathcal{L}_{Cont}$
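
    A sketch of DETR-style matching between predicted SG tokens and ground-truth objects, assuming a cost of negative label probability plus an L1 box distance. DETR uses the Hungarian algorithm; the brute-force search over permutations below is a stand-in that only suits tiny examples, and the cost weights and dictionary layout are assumptions.

```python
from itertools import permutations


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target, box_weight=1.0):
    """Low cost = high probability for the ground-truth label and a close box."""
    return -pred["probs"][target["label"]] + box_weight * l1(pred["bbox"], target["bbox"])


def best_matching(preds, targets):
    """Return (assignment, cost); assignment[j] is the prediction index for target j."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


preds = [
    {"probs": {"dog": 0.1, "cat": 0.8}, "bbox": (0.5, 0.5, 0.2, 0.2)},
    {"probs": {"dog": 0.9, "cat": 0.05}, "bbox": (0.1, 0.1, 0.3, 0.3)},
    {"probs": {"dog": 0.3, "cat": 0.3}, "bbox": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": "dog", "bbox": (0.1, 0.1, 0.3, 0.3)},
    {"label": "cat", "bbox": (0.5, 0.5, 0.2, 0.2)},
]

assignment, cost = best_matching(preds, targets)
print(assignment, cost)
```

    Once the assignment is fixed, the label and box losses are computed only between matched pairs; unmatched predictions would be trained toward a "no object" class in DETR.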
  12. Experiments

    • Experiment settings (values in braces are for {CLIP, BLIP/BLIP2}, respectively)
      • Training data: image-SG pairs (10K from the Visual Genome dataset, VG) and standard image-text pairs (less than 1% of LAION 400M)
      • Pretrained model: {CLIP (ViT-B/32), BLIP (ViT-B/32) / BLIP2 (ViT-g)}
      • {32, 8} epochs
      • One batch comprises {256, 32} image-text pairs and 8 image-SG pairs
      • 4 {V100, A100} GPUs
    • Evaluation baselines
      • CLIP, BLIP/BLIP2, NegCLIP/LLaVA/miniGPT4, etc.
  13. Experiments

    • Evaluation benchmarks
      • VL-Checklist (VLC) (Zhao et al., arXiv)
        • One positive and one negative caption per image: (C_pos, C_neg, I)
      • Winoground (Thrush et al., CVPR 2022)
        • 2 image-text pairs (C_0, I_0, C_1, I_1) whose captions swap words
      • Attribution, Relation and Order (ARO) (Yuksekgonul et al., ICLR 2023)
        • Select the most suitable caption for an image from 5 captions that differ in relationships, objects, and attributes
      • Visual Spatial Reasoning (VSR) (Liu et al., TACL 2023)
        • Judge whether the caption correctly describes the spatial relationship shown in the image
      • ZS (various zero-shot tasks)
        • 21 classification tasks from ELEVATER (Li et al., NeurIPS 2022)
    • [Figures: Winoground sample; VSR sample]
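
    For the positive/negative caption setup, a natural way to score a model is the fraction of triples where the image is closer to the positive caption. This is a sketch under that assumption; the function name and the tie handling are illustrative, not the benchmark's official evaluation code.

```python
def pos_neg_accuracy(score_pairs):
    """score_pairs: list of (sim(I, C_pos), sim(I, C_neg)) tuples.

    A triple counts as correct only when the positive caption scores
    strictly higher than the negative one.
    """
    correct = sum(1 for pos, neg in score_pairs if pos > neg)
    return correct / len(score_pairs)


scores = [(0.8, 0.3), (0.4, 0.6), (0.7, 0.1), (0.5, 0.5)]
accuracy = pos_neg_accuracy(scores)
print(accuracy)
```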
  14. Experiments

    • Results
      • CLIP/BLIP/BLIP2-SGVL outperform the pretrained base models across several datasets
      • These improvements come at the price of a slight degradation in zero-shot performance
    • Winoground metrics
      $\mathrm{TextScore} = \begin{cases} 1 & \text{if } sim(I_0, C_0) > sim(I_0, C_1) \\ 0 & \text{otherwise} \end{cases}$
      $\mathrm{ImageScore} = \begin{cases} 1 & \text{if } sim(C_0, I_0) > sim(C_0, I_1) \\ 0 & \text{otherwise} \end{cases}$
      $\mathrm{GroupScore} = \mathrm{TextScore} \land \mathrm{ImageScore}$
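
    A sketch of the three Winoground scores for one example. The slide writes the condition for the pair (C_0, I_0) only; the Winoground definition also requires the symmetric condition for (C_1, I_1), so this sketch checks both. The function name and the `sim[c][i]` layout are assumptions.

```python
def winoground_scores(sim):
    """sim[c][i] is the similarity between caption c and image i, for c, i in {0, 1}."""
    # Text score: for each image, the matching caption must win.
    text = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Image score: for each caption, the matching image must win.
    image = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    return {"text": int(text), "image": int(image), "group": int(text and image)}


all_correct = winoground_scores([[0.9, 0.4], [0.5, 0.7]])
text_fails = winoground_scores([[0.9, 0.8], [0.5, 0.7]])
print(all_correct, text_fails)
```

    In the second example image 1 prefers the wrong caption (0.8 > 0.7), so the text score and therefore the group score drop to 0 even though the image score is 1.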
  15. Experiments

    • Fine-grained results
      • SGVL improves compositionality in almost all categories of Winoground and VLC
        • Especially for swapped image objects/relations (in Winoground)
  16. Experiments

    • Ablation study
      A) Graph-based captions, graph-based negative captions, and SG tokens are all effective
      B) Adding adaptive SG tokens and partitioning image patches and SG tokens are both effective
      C) SG annotations need to be dense to improve the compositionality of the VLM
  17. Conclusion

    • Visual Language Models (VLMs) face challenges in understanding complex scenes, particularly attributes and relations.
    • This study introduces a small number of structured scene graphs into VLMs, enhancing visual and textual comprehension.
    • The method improves VLM performance across multiple datasets, addressing the initial scene understanding limitations.