Motivations:
• For image understanding, we would like to exploit text, a symbolic representation more flexible than class labels.
• We want to realize robots that can communicate in the real world.
• For language understanding, other sources of information, such as images, would make useful context.
Representative tasks: multimodal machine translation, Vision and Language Navigation, object manipulation via natural language, robot dialogue.
ViLT [Kim+, 2021]
• UNITER [Chen+, 2020]: a region-based V&L model (computationally heavy)
• Pixel-BERT [Huang+, 2020]: a grid-based V&L model (reasonably fast)
• ViLT: modified from UNITER; fast because it has no feature-extraction stage
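A minimal inference sketch for ViLT, using the Hugging Face port and its VQA checkpoint (the checkpoint name and example inputs are assumptions, not from the slides):

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Load the ViLT VQA checkpoint. No Faster R-CNN region extractor is
# involved: the raw image is split into patches and fed, together with
# the question tokens, to a single transformer, which is why it is fast.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
outputs = model(**inputs)
print(model.config.id2label[outputs.logits.argmax(-1).item()])
```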
VQA 2021 winner talk: https://drive.google.com/file/d/1KjVjz9cG0KFbEzQwckyDXwrh_63-dbBn/view
VQA 2021 winner accuracy: 79.78%
• bottom-up attention, VinVL
• a large ensemble of SoTA models using both region and grid features
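The talk's exact ensembling recipe is not given here; below is a hypothetical soft-voting sketch of the general idea of combining answer distributions from several VQA models (e.g., region-feature and grid-feature variants). The function name and shapes are assumptions.

```python
import torch

def ensemble_vqa(logits_list: list[torch.Tensor]) -> torch.Tensor:
    """Soft-voting over the answer vocabulary.

    logits_list: one (batch, num_answers) logits tensor per ensemble
    member, e.g. from region-feature and grid-feature VQA models.
    """
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list])  # (M, B, A)
    return probs.mean(dim=0).argmax(dim=-1)  # predicted answer ids
```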
Caption retrieval from an image using handcrafted captions (CLIP).
Handcrafted template: "There is a group of [color] [food] on the table"
Input image → predicted probabilities:
• "… on the table" (original caption, truncated in the source): 0.627
• "There is a group of orange foods on the table" ("fish eggs" corrupted): 0.181
• "There is a group of yellow fish eggs on the table" (color corrupted): 0.192
The label of the caption with the highest predicted probability is taken as the prediction; the example above corresponds to two kinds of classification (food type and color).
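A sketch of this template-based retrieval with the Hugging Face CLIP API. The openai/clip-vit-base-patch32 checkpoint and the input.jpg path are assumptions, and since the slide's first caption is truncated in the source, the (color, food) pairs below are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Fill the handcrafted template with illustrative (color, food) pairs.
template = "There is a group of {color} {food} on the table"
captions = [template.format(color=c, food=f)
            for c, f in [("orange", "fish eggs"),
                         ("orange", "foods"),
                         ("yellow", "fish eggs")]]

image = Image.open("input.jpg")  # hypothetical path to the input image
inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# The caption with the highest image-text similarity is the prediction.
probs = outputs.logits_per_image.softmax(dim=1)[0]
print(captions[probs.argmax().item()], probs.tolist())
```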
Image containing rendered text → predicted probabilities:
• "… on the table": 0.005
• "There is a group of yellow fish eggs on the table": 0.833
• "There is a group of blue fish eggs on the table": 0.162
CLIP is sensitive to text that appears inside the image (perhaps because many of its training images contain visible text), so this needs care in practice.
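A hypothetical way to probe this behavior: overlay a misleading word on the image and compare the caption probabilities before and after. This reuses `model`, `processor`, and `captions` from the sketch above; the overlay word and position are arbitrary choices.

```python
import torch
from PIL import Image, ImageDraw

def caption_probs(image: Image.Image) -> torch.Tensor:
    # Reuses `model`, `processor`, and `captions` from the sketch above.
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.softmax(dim=1)[0]

image = Image.open("input.jpg")
print("clean image:", caption_probs(image).tolist())

# Overlay a misleading color word; a large probability shift toward the
# caption containing that word would reproduce the behavior noted above.
attacked = image.copy()
ImageDraw.Draw(attacked).text((10, 10), "blue", fill="white")
print("with rendered text:", caption_probs(attacked).tolist())
```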
[Vinyals+, 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CVPR 2015.
[Agrawal+, 2016] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. ICCV 2015.
[Das+, 2018] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. CVPR 2018.
[Xu+, 2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. CVPR 2018.
[Bisk+, 2016] Yonatan Bisk, Deniz Yuret, and Daniel Marcu. Natural language communication with robots. NAACL 2016.
P.6 [Okada, 1980] Naoyuki Okada. Conceptual taxonomy of Japanese verbs for understanding natural language and picture patterns. COLING 1980.
[Hiyoshi+, 1994] Mayumi Hiyoshi and Hideo Shimazu. Drawing pictures with natural language and direct manipulation. COLING 1994.
[Bommasani+, 2021] Rishi Bommasani, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. http://arxiv.org/abs/2108.07258
P.19 [Dosovitskiy+, 2021] Alexey Dosovitskiy, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021.
P.20 [Ramesh+, 2021] Aditya Ramesh, et al. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
P.24 [Oord+, 2017] Aaron van den Oord, et al. Neural discrete representation learning. NIPS 2017.
P.27 [Ren+, 2017] Shaoqing Ren, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 39, No. 6, pp. 1137–1149, 2017.
P.28 [Anderson+, 2018] Peter Anderson, et al. Bottom-up and top-down attention for image captioning and visual question answering. CVPR 2018.
P.29 [Jiang+, 2020] Huaizu Jiang, et al. In defense of grid features for visual question answering. CVPR 2020.
P.30 [Zhang+, 2021] Pengchuan Zhang, et al. VinVL: Making visual representations matter in vision-language models. CVPR 2021.
[Kim+, 2021] Wonjae Kim, et al. ViLT: Vision-and-language transformer without convolution or region supervision. ICML 2021.
[Chen+, 2020] Yen-Chun Chen, et al. UNITER: Universal image-text representation learning. ECCV 2020.
[Huang+, 2020] Zhicheng Huang, et al. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
P.34 [Frome+, 2013] Andrea Frome, et al. DeViSE: A deep visual-semantic embedding model. NIPS 2013.
[Kiros+, 2014] Ryan Kiros, et al. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[Wu+, 2019] Hao Wu, et al. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. CVPR 2019.
P.36 [Oord+, 2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[Radford+, 2021] Alec Radford, et al. Learning transferable visual models from natural language supervision. ICML 2021.
P.43 [Galatolo+, 2021] F. A. Galatolo, et al. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv [cs.NE], 2021.