[Figure: timeline of V&L pre-training models, '19–'21 — UNITER, VILLA, Pixel-BERT, VinVL, OSCAR, ERNIE-ViL, VL-T5, ViLT, VideoBERT, E2E-VLP — annotated with keywords such as word-region alignment, image captioning, adversarial training, object labels, improved object detection, scene graphs, patch-based / grid-based features, whole word masking, dialogue, and Dodecathlon. Source: コンピュータビジョン最前線 Winter 2021, ニュウモンVision & Language] 5/85
ViLT [Kim+,2021]
• UNITER [Chen+,2020]: a region-based V&L model (computationally heavy)
• Pixel-BERT [Huang+,2020]: a grid-based V&L model (moderately fast)
• ViLT: modified from UNITER (fast because it has no feature-extraction stage; see the sketch below)
15/85
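To make the speed difference concrete, here is a minimal PyTorch sketch of a ViLT-style input embedding (my own illustration, not the official ViLT code; positional and modality-type embeddings are omitted and all dimensions are assumed): image patches are linearly projected and concatenated with text-token embeddings, so no object detector (region features) or CNN backbone (grid features) needs to run.

```python
import torch
import torch.nn as nn

class MinimalViLTEmbedding(nn.Module):
    """Illustrative ViLT-style input embedding: linear patch projection + text tokens."""
    def __init__(self, vocab_size=30522, dim=768, patch=32, channels=3):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        # A single linear layer replaces the Faster R-CNN (region) or ResNet (grid) extractor.
        self.patch_proj = nn.Linear(patch * patch * channels, dim)
        self.patch = patch

    def forward(self, token_ids, image):
        # image: (B, C, H, W) -> non-overlapping patches -> (B, N, patch*patch*C)
        B, C, H, W = image.shape
        p = self.patch
        patches = image.unfold(2, p, p).unfold(3, p, p)           # (B, C, H//p, W//p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        visual = self.patch_proj(patches)                         # (B, N, dim)
        textual = self.text_emb(token_ids)                        # (B, L, dim)
        # A single-stream transformer then consumes the concatenated sequence.
        return torch.cat([textual, visual], dim=1)

emb = MinimalViLTEmbedding()
tokens = torch.randint(0, 30522, (1, 16))
img = torch.randn(1, 3, 224, 224)
print(emb(tokens, img).shape)  # torch.Size([1, 16 + 49, 768])
```

A region-based model such as UNITER would instead run an object detector before the transformer, and that detection stage dominates inference time.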
on the table" • "There is a group of orange foods on the table" • "There is a group of yellow fish eggs on the table" 0.627 0.181 0.192 probability (fish eggsを改悪) 手作りテンプレ: "There is a group of [color] [food] on the table" (色を改悪) CLIP 入力画像 手作り説明文を使った画像からの説明文検索 予測確率の高い文のラベルを予 測結果とする 上の例は2種類の分類に対応 30/85
on the table" "There is a group of yellow fish eggs on the table" "There is a group of blue fish eggs on the table" 0.005 0.833 0.162 probability CLIPは画像中のテキストに敏感(画像中にテキストが 映っている画像が多い?) 利用する時は注意する必要がある 31/85
2021 Winner talk" https://drive.google.com/file/d/1KjVjz9cG0KFbEzQwckyDXwrh_63-dbBn/view VQA2021 Winner Accuracy: 79.78% bottom-up attention VinVL Big ensemble with SoTA models region and grid feature 40/85
[Figure: reinforcement-learning loop for caption generation — 1. Exploration (sentence generation): the policy samples captions such as "There is a girl by the table.", "A man stands on the floor.", "A man is standing by a dog."; the environment scores them and returns rewards (0.1, 0.8, 0.6 — "I see. The second one is great!"); 2. Update policy (training) using those rewards.] 44/85
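The loop in this figure (explore by generating sentences, have the environment score them, then update the policy) is essentially REINFORCE with a baseline. Below is a minimal self-contained PyTorch sketch of that loop: the toy "policy" merely scores the three example captions, and the hand-set rewards stand in for a real scorer such as CIDEr or human feedback.

```python
import torch
import torch.nn as nn

# Candidate captions from the figure; a real system would sample from a captioning decoder.
captions = ["There is a girl by the table .",
            "A man stands on the floor .",
            "A man is standing by a dog ."]

class ToyPolicy(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n))    # learnable preference over candidates

    def sample(self, k):
        dist = torch.distributions.Categorical(logits=self.logits)
        idx = dist.sample((k,))
        return idx, dist.log_prob(idx)                # sampled captions + their log-probabilities

policy = ToyPolicy(len(captions))
optimizer = torch.optim.SGD(policy.parameters(), lr=0.5)
reward_fn = lambda i: [0.1, 0.8, 0.6][i]              # "environment" scoring (stand-in for CIDEr / human feedback)

for step in range(50):
    idx, log_probs = policy.sample(k=3)                           # 1. Exploration (sentence generation)
    rewards = torch.tensor([reward_fn(int(i)) for i in idx])      # scoring by the environment
    baseline = rewards.mean()                                     # variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()             # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()      # 2. Update policy (training)

print(torch.softmax(policy.logits, -1))  # probability mass shifts toward the highest-reward caption
```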
[Figure: timeline of V&L pre-training models, '19–'21 — Unicoder-VL, LXMERT, VL-BERT, Unified VLP, UNITER, VILLA, Pixel-BERT, VinVL, OSCAR, ERNIE-ViL, VL-T5, ViLT, VideoBERT, E2E-VLP — annotated with keywords such as word-region alignment, image captioning, adversarial training, object labels, improved object detection, scene graphs, patch-based / grid-based features, whole word masking, dialogue, and Dodecathlon. Source: コンピュータビジョン最前線 Winter 2021, ニュウモンVision & Language] 70/85
• Add a Word Region Alignment (WRA) loss for training
• WRA is based on the Inexact Proximal point method for Optimal Transport (IPOT) [Xie+,2018]
• It makes it possible to align similar word and region embeddings in an unsupervised manner (see the sketch below)
73/85
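As a rough illustration of how such an alignment loss can be computed, here is a minimal NumPy sketch of IPOT [Xie+,2018] under assumed shapes: given a cost matrix between word and region embeddings (e.g. cosine distance), it iterates toward an optimal-transport plan, and the transported cost serves as the WRA loss. This follows the published algorithm as a sketch, not UNITER's actual implementation.

```python
import numpy as np

def ipot(cost, mu, nu, beta=0.5, n_iter=50, inner=1):
    """Inexact Proximal point method for Optimal Transport (IPOT) [Xie+,2018].

    cost: (n, m) cost matrix; mu: (n,) and nu: (m,) marginals summing to 1.
    Returns the transport plan T; (T * cost).sum() is the OT cost used as the loss.
    """
    n, m = cost.shape
    T = np.ones((n, m)) / (n * m)
    G = np.exp(-cost / beta)              # proximal kernel
    b = np.ones(m) / m
    for _ in range(n_iter):
        Q = G * T                         # elementwise product
        for _ in range(inner):            # one Sinkhorn-style inner step is typical
            a = mu / (Q @ b)
            b = nu / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return T

# Toy word/region embeddings (hypothetical) with a cosine-distance cost matrix.
words = np.random.randn(4, 8); regions = np.random.randn(6, 8)
words /= np.linalg.norm(words, axis=1, keepdims=True)
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
C = 1.0 - words @ regions.T
T = ipot(C, np.ones(4) / 4, np.ones(6) / 6)
wra_loss = (T * C).sum()
print(wra_loss)
```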
error (the agent notices that it has gotten lost), then ask the user for help (the user provides a new instruction) to recover from the error [Nguyen+, 2019] — the HANNA (Help, ANNA!) task 81/85
Risks of Foundation Models. arXiv preprint, http://arxiv.org/abs/2108.07258, 2021.
P.6 [Wang, P+, 2022] Wang, P., et al. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv preprint, http://arxiv.org/abs/2202.03052, 2022.
P.7 [Xie+,2022] Xie, T., et al. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. arXiv [cs.CL], 2022.
P.10 [Agrawal+, 2016] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In Proceedings of ICCV, 2015.
P.11 [Ren+,2017] Shaoqing Ren, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 39, No. 6, pp. 1137–1149, 2017.
P.12 [Anderson+,2018] Peter Anderson, et al. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR, 2018.
P.13 [Jiang+,2020] Huaizu Jiang, et al. In defense of grid features for visual question answering. In Proceedings of CVPR, 2020.
References 86/85
matter in vision-language models. In Proceedings of CVPR, 2021.
P.15 [Kim+,2021] Wonjae Kim, et al. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of ICML, 2021.
[Chen+,2020] Yen-Chun Chen, et al. UNITER: Universal image-text representation learning. In Proceedings of ECCV, Vol. 12375 of Lecture Notes in Computer Science, pp. 104–120, 2020.
[Huang+,2020] Zhicheng Huang, et al. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
P.18 [Johnson+,2015] Justin Johnson, et al. Image retrieval using scene graphs. In Proceedings of CVPR, 2015.
References 87/85
caption evaluation. In Proceedings of ECCV, 2016.
[Wang+,2021] Sijin Wang, et al. FAIEr: Fidelity and adequacy ensured image caption evaluation. In Proceedings of CVPR, pp. 14050–14059, 2021.
[Yu+,2021] Fei Yu, et al. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of AAAI, pp. 3208–3216, 2021.
[Johnson+,2018] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of CVPR, 2018.
P.22 [Frome+,2013] Andrea Frome, et al. DeViSE: A deep visual-semantic embedding model. In Proceedings of NeurIPS, 2013.
[Kiros+,2014] Ryan Kiros, et al. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[Wu+,2019] Hao Wu, et al. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of CVPR, 2019.
References 88/85
Y. & Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv [cs.LG] (2018) P.26 [Radford+,2021] Alec Radford, et al. Learning transferable visual models from natural language supervision. In Proceedings of ICML, Vol. 139, pp. 8748–8763, 2021. P.30 [Ramesh+,2021] Aditya Ramesh, et al. Zero-Shot Text-to-Image generation. arXiv preprint arXiv2102.12092, 2021. P.32 [Galatolo+,2021] Galatolo, F. A., et al. Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search. arXiv [cs.NE] (2021) 参考文献 89/85
Ross, J., & Goel, V. (2017). Self- critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7008-7024). [Liu+,2017] Liu, S., Zhu, Z., Ye, N., Guadarrama, S., & Murphy, K. (2017). Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE international conference on computer vision (pp. 873-881). [Anderson+,2018] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6077-6086). [Zhao+,2018] Zhao, W., Wang, B., Ye, J., Yang, M., Zhao, Z., Luo, R., & Qiao, Y. (2018, July). A Multi-task Learning Approach for Image Captioning. In IJCAI (pp. 1205-1211). [Gu+,2018] Gu, J., Cai, J., Wang, G., & Chen, T. (2018, April). Stack-captioning: Coarse- to-fine learning for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1). 参考文献 90/85
J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. Fine-Tuning Language Models from Human Preferences. arXiv preprint, http://arxiv.org/abs/1909.08593, 2019.
[Stiennon+, 2020] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. Learning to summarize from human feedback. NeurIPS 2020.
[Ouyang+, 2022] Ouyang, L., et al. Training language models to follow instructions with human feedback. https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
P.43 [森村哲郎, 強化学習] 森村哲郎, 強化学習 (機械学習プロフェッショナルシリーズ) [Tetsuro Morimura, Reinforcement Learning (Machine Learning Professional Series), in Japanese]
References 91/85
Ross, J., & Goel, V. (2017, July). Self-critical sequence training for image captioning. CVPR2017. [Li+,2017] Li, J., Monroe, W., & Jurafsky, D. (2017). Learning to Decode for Future Success. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1701.06549 [Khandelwal+,2021] Khandelwal, A. (2021). WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue. INLG2021. P.52 [Ziegler+,2019] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1909.08593 P.53 [Choshen+,2020] Choshen, L., Fox, L., Aizenbud, Z., & Abend, O. (2020). On the weaknesses of reinforcement learning for neural machine translation. ICLR2020. P.54 [Stiennon+, 2020] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. Learning to summarize from human feedback. NeurIPS2020. P.57 [Xie+,2018] Yujia Xie, et al. A fast proximal point method for computing exact Wasserstein distance. arXiv preprint arXiv 1802.04307, 2018. 参考文献 92/85
Lior. Transformer Interpretability Beyond Attention Visualization. CVPR 2021.
[Chefer+, ICCV2021] Chefer, Hila, Gur, Shir, and Wolf, Lior. Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. ICCV 2021.
P.64 [Devlin+,2019] Jacob Devlin, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186, Minneapolis, Minnesota, 2019.
References 93/85
Pre-Training. In International Conference on Learning Representations (ICLR), 2022.
[Lu+,2019] Jiasen Lu, et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of NeurIPS, Vol. 32, 2019.
[Li+, 2019] Liunian Harold Li, et al. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[Li+, 2020] Gen Li, et al. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of AAAI, Vol. 34, pp. 11336–11344, 2020.
[Tan+,2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP, pp. 5100–5111, 2019.
P.66 [Yu+,2021] Fei Yu, et al. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of AAAI, pp. 3208–3216, 2021.
References 94/85
image captioning and VQA. In Proceedings of AAAI, Vol. 34, pp. 13041–13049, 2020.
P.68 [Xu+, 2021] Haiyang Xu, et al. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of ACL, pp. 503–513, 2021.
P.69 [Rothe+,2019] Rothe, S., Narayan, S., & Severyn, A. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks. arXiv [cs.CL], 2019.
P.73 [Chen+,2020] Yen-Chun Chen, et al. UNITER: Universal image-text representation learning. In Proceedings of ECCV, Vol. 12375, pp. 104–120, 2020.
[Xie+,2018] Yujia Xie, et al. A fast proximal point method for computing exact Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
P.74 [Goyal+,2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR 2017.
P.75 [Dancette,2021] Corentin Dancette, et al. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. ICCV 2021.
References 95/85
Look at Language Bias. arXiv [cs.CV], 2020.
P.77 [Johnson+,2017] Justin Johnson, et al. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of CVPR, 2017.
P.79 [Benotti+,2021] Benotti, L., & Blackburn, P. Grounding as a Collaborative Process. EACL 2021, pp. 515–531.
P.80 [Das+, 2017] Abhishek Das, et al. Visual dialog. In Proceedings of CVPR, pp. 1080–1089, 2017.
P.81 [Nguyen+, 2019] Khanh Nguyen and Hal Daumé III. Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning. EMNLP 2019.
P.85 [Qiu+, 2021] Qiu, Y., et al. Describing and Localizing Multiple Changes with Transformers. arXiv [cs.CV], 2021.
References 96/85