• Positive and negative captions per image (C_pos, C_neg, I)
• Winoground (Thrush et al., CVPR 2022)
  • 2 image-text pairs (C_0, I_0, C_1, I_1) whose captions differ only by swapped words
• Attribution, Relation and Order (ARO) (Yuksekgonul et al., ICLR 2023)
  • Select the most suitable caption for an image from 5 candidates that vary relations, objects, and attributes
• Visual Spatial Reasoning (VSR) (Liu et al., TACL 2023)
  • Judge whether the spatial relation stated in the text holds in the image
• ZS (various zero-shot tasks)
  • 21 classification tasks from ELEVATER (Li et al., NeurIPS 2022)
[Figures: Winoground sample; VSR sample]
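The caption-selection and paired image-text setups listed above can be sketched as follows. This is a minimal illustration, not the evaluation code of any of these benchmarks: it assumes similarity scores from some image-text model are already computed, and the function names `pick_caption` and `winoground_scores` are hypothetical. The text/image/group criteria follow the commonly used Winoground definitions, which the slide itself does not spell out.

```python
from typing import Dict, Sequence


def pick_caption(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring caption for one image.

    Covers the (C_pos, C_neg, I) setup and the 5-caption ARO-style setup:
    the prediction is correct when the returned index is the true caption.
    """
    return max(range(len(scores)), key=lambda i: scores[i])


def winoground_scores(s: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one Winoground-style example.

    s[i][j] is the (assumed precomputed) similarity of caption C_i and image I_j,
    where (C_0, I_0) and (C_1, I_1) are the matching pairs.
    """
    # Text: for each image, the matching caption must score higher.
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image: for each caption, the matching image must score higher.
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Group: both criteria must hold.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


if __name__ == "__main__":
    # Made-up similarity values, for illustration only.
    print(pick_caption([0.10, 0.40, 0.30, 0.20, 0.05]))
    print(winoground_scores([[0.31, 0.25], [0.28, 0.27]]))
```

Passing a plain score matrix keeps the sketch independent of any particular model; in practice the scores would come from whichever image-text encoder is under evaluation.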