Vision Transformer [Dosovitskiy+, 2021]: a model that applies the Transformer, originally used in natural language processing, to image recognition.
[Dosovitskiy+, 2021] Dosovitskiy et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR. 2021. https://openreview.net/forum?id=YicbFdNTTy
[Dosovitskiy+, 2021] Dosovitskiy et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR. 2021. https://openreview.net/forum?id=YicbFdNTTy
P.15
[Radford+, 2021] Radford et al. Learning Transferable Visual Models From Natural Language Supervision. ICML. 2021.
[Rombach+, 2022] Rombach et al. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR. 2022.
P.16
[He+, 2022] He et al. Masked Autoencoders Are Scalable Vision Learners. CVPR. 2022.
P.28
[Ba+, 2016] Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer Normalization. arXiv. 2016.
P.29
[He+, 2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. CVPR. 2016.
P.36
[Tolstikhin+, 2021] Tolstikhin, I. O. et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS. 2021.
M. Do Vision Transformers See Like Convolutional Neural Networks? NeurIPS. 2021.
P.38
[Dong+, 2023] Dong, Xiaoyi et al. PeCo: Perceptual Codebook for BERT Pre-Training of Vision Transformers. AAAI. 2023.
P.39
[Kornblith+, 2019] Kornblith, S. et al. Similarity of Neural Network Representations Revisited. ICML. 2019.
P.40
[Zhao+, 2021] Zhao, Y. et al. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP. arXiv. 2021.
[Park+, 2022] Park, N. & Kim, S. How Do Vision Transformers Work? ICLR. 2022.
P.41
[Yu+, 2022] Yu, W. et al. MetaFormer is Actually What You Need for Vision. CVPR. 2022.
Wenhai et al. InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions. CVPR. 2023.
[Tuli+, 2021] Tuli, Shikhar. Are Convolutional Neural Networks or Transformers More like Human Vision? arXiv. 2021.
P.43
[Cordonnier+, 2020] Cordonnier, J.-B et al. On the Relationship between Self-Attention and Convolutional Layers. ICLR. 2020.
[Ramachandran+, 2019] Ramachandran, P. et al. Stand-Alone Self-Attention in Vision Models. NeurIPS. 2019.
P.46
[Nakashima+, 2022] Nakashima, K. et al. Can Vision Transformers Learn without Natural Images? AAAI. 2022.
[Kataoka+, 2022] Kataoka, H. et al. Replacing Labeled Real-Image Datasets With Auto-Generated Contours. CVPR. 2022.
P.48
[Chen+, 2021] Chen, Xinlei et al. An Empirical Study of Training Self-Supervised Vision Transformers. ICCV. 2021.
P.49
[Zhai+, 2021] X. Zhai et al. Scaling Vision Transformers. CVPR. 2022.
Hassani et al. Escaping the Big Data Paradigm with Compact Transformers. arXiv. 2021.
[Zhang+, 2022] Z. Zhang et al. Aggregating Nested Transformers. AAAI. 2022.
P.52
[Mao+, 2021] Mao, X. et al. Towards Robust Vision Transformer. CVPR. 2022.
P.53
[Jia+, 2021] Chao Jia et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. ICML. 2021.
[Yu+, 2022] Yu, Jiahui et al. CoCa: Contrastive Captioners Are Image-Text Foundation Models. arXiv. 2022.
P.54
[Zhai+, 2022] Xiaohua Zhai et al. LiT: Zero-Shot Transfer With Locked-Image Text Tuning. CVPR. 2022.
[Hu+, 2021] Hu, Edward J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR. 2022.
[Jia+, 2022] Menglin Jia et al. Visual Prompt Tuning. ECCV. 2022.