Context.", ECCV14
[Plummer+, ICCV15]: Plummer, Bryan A., et al. "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models.", ICCV15
[Karan+, CVPR21]: Desai, Karan, and Justin Johnson. "VirTex: Learning Visual Representations from Textual Annotations.", CVPR21
[Zhang+, CVPR20]: Zhang, Qi, et al. "Context-Aware Attention Network for Image-Text Retrieval.", CVPR20
[Chen+, CVPR21]: Chen, Jiacheng, et al. "Learning the Best Pooling Strategy for Visual Semantic Embedding.", CVPR21
[Frome+, NIPS13]: Frome, Andrea, et al. "DeViSE: A Deep Visual-Semantic Embedding Model.", NIPS13
[Song+, CVPR19]: Song, Yale, and Mohammad Soleymani. "Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.", CVPR19
[Chun+, CVPR21]: Chun, Sanghyuk, et al. "Probabilistic Embeddings for Cross-Modal Retrieval.", CVPR21
[Locatello+, NeurIPS20]: Locatello, Francesco, et al. "Object-Centric Learning with Slot Attention.", NeurIPS20
[DL輪読会 (DL Reading Group)]: Object-Centric Learning with Slot Attention, https://www.slideshare.net/DeepLearningJP2016/dlobjectcentric-learning-with-slot-attention
References