Paper Introduction / Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Presentation slides for the 14th Cutting-Edge NLP Study Group (最先端NLP勉強会).
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5238-5248
• Classification by the number of main predicates (one or two).
• Captions with two main predicates tend to be longer and more complex.

Examples with 1 main predicate:
• there are more [humans] than [balls]
• there's a [phone] on a [map]
• the [plant] is eating the [bug]
• [out]1 [swam]2 the person in the red swimcap []2 []1
• looking from [above] at a collection of similar objects [below]
• the [sail] rests below the [water]
• [gold] for [pan]
• there are more [hats] than [people]
• [circular] food on [heart-shaped] wood
• the [water] is filled with [plastic]

Examples with 2 main predicates:
• [it] ran away while [they] pursued
• the person in a [brown] coat looks back and the person in a [black] coat looks forward
• the melting white food is [cold] while the brown is [warm]
• a kid [jumped] then [threw] a basketball
• the person is [jumping] while the cat is [sitting]
• a person wearing [yellow] with their feet in the air and a person wearing [stripes]
• the [computer's] screen is on and the [phone's] screen is off
• the person with facial hair [cycles] and the other person [runs]
• the person with green legs is running quite [slowly] and the red legged one runs [faster]
• a [] person wearing yellow and a person wearing stripes [jumping]
Visual Tag: Pragmatics (41/400) — captions that must be interpreted non-literally (e.g. different prepositional-phrase attachment sites, "idiomatic use").
• It starts with ["A"] and ends with ["Z"]
• It starts with ["Z"] and ends with ["A"]
Models that compute vision-text matching:
• CLIP, FLAVA — contrastive: the image and the caption (e.g. "some plants surrounding a lightbulb") are encoded separately, and the pair is scored by embedding similarity.
• FLAVA — ITM (image-text matching): a joint encoder takes the image and the text together and predicts whether they match.
※ Rough sketches; the details of each model differ.
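The difference between the two scoring styles can be sketched as follows. This is a minimal NumPy sketch with random stand-in embeddings: the array names, dimensions, and the sigmoid "match head" are illustrative assumptions, not the actual CLIP/FLAVA implementations.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings (in a real model these come from
# the image/text encoders; shapes here are illustrative only).
img_emb = l2_normalize(rng.normal(size=(2, 512)))   # 2 images
txt_emb = l2_normalize(rng.normal(size=(2, 512)))   # 2 captions

# Contrastive scoring (CLIP-style): each image-caption pair is scored by
# the cosine similarity of independently encoded embeddings.
contrastive_scores = img_emb @ txt_emb.T            # shape (2, 2)

# ITM-style scoring: a joint representation of image and text is fed to a
# classification head that outputs a match probability. Concatenation is a
# stand-in for the real cross/joint encoder.
def itm_score(img, txt, w):
    joint = np.concatenate([img, txt])
    return 1.0 / (1.0 + np.exp(-joint @ w))         # sigmoid "match" head

w = rng.normal(size=1024)
itm_scores = np.array([[itm_score(i, t, w) for t in txt_emb] for i in img_emb])
```

The practical consequence: contrastive scores can be computed from cached per-modality embeddings, while ITM scores require a forward pass per image-text pair.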
• UNITER [3], ViLLA [4], VinVL [5], ViLT [6], VisualBERT [7] — single-stream joint encoders over object-detection (OD) region features or image patches together with the caption (e.g. "some plants surrounding a lightbulb").
• LXMERT [8], UniT [9], ViLBERT [10] — dual + cross encoder models using object detection.
• VSRN, VSE++ — RNN-based models (explanation omitted).
※ Rough sketches; the details of each model differ.
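The two encoder layouts above can be sketched in NumPy as follows. All shapes, projection matrices, and the single-head attention are illustrative assumptions; real models add positional/segment embeddings, many layers, and multi-head attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # shared hidden size (illustrative)

# Hypothetical inputs: subword embeddings for the caption and pooled
# object-detector region features for the image (Faster R-CNN-style).
text_tokens = rng.normal(size=(7, d))        # 7 caption tokens
region_feats = rng.normal(size=(36, 2048))   # 36 detected regions

# Single-stream (UNITER/VinVL-style): project the regions into the text
# embedding space and concatenate everything into ONE transformer input.
W_proj = rng.normal(size=(2048, d)) * 0.02
cls = rng.normal(size=(1, d))
single_stream_input = np.concatenate([cls, text_tokens, region_feats @ W_proj])

# Dual + cross (LXMERT/ViLBERT-style): each modality is first encoded
# separately; cross-attention then lets text states attend to regions.
def cross_attention(q, kv):
    attn = np.exp(q @ kv.T / np.sqrt(d))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv

vision_states = region_feats @ W_proj   # stand-in for the vision encoder
text_states = text_tokens               # stand-in for the text encoder
fused_text = cross_attention(text_states, vision_states)
```

In the single-stream case self-attention over the concatenated sequence mixes the modalities; in the dual + cross case the mixing happens only in the explicit cross-attention step.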
References (V&L models)
[1] Alec Radford et al.: Learning Transferable Visual Models From Natural Language Supervision. ICML 2021: 8748-8763
[2] Amanpreet Singh et al.: FLAVA: A Foundational Language And Vision Alignment Model. CoRR abs/2112.04482 (2021)
[3] Yen-Chun Chen et al.: UNITER: UNiversal Image-TExt Representation Learning. ECCV (30) 2020: 104-120
[4] Zhe Gan et al.: Large-Scale Adversarial Training for Vision-and-Language Representation Learning. NeurIPS 2020
[5] Pengchuan Zhang et al.: VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR 2021: 5579-5588
[6] Wonjae Kim et al.: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. ICML 2021: 5583-5594
[7] Liunian Harold Li et al.: VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR abs/1908.03557 (2019)
[8] Hao Tan et al.: LXMERT: Learning Cross-Modality Encoder Representations from Transformers. EMNLP/IJCNLP (1) 2019: 5099-5110
[9] Ronghang Hu et al.: UniT: Multimodal Multitask Learning with a Unified Transformer. ICCV 2021: 1419-1429
[10] Jiasen Lu et al.: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019: 13-23