Slide 2: Background
▪ The rise of ViT [1]
▪ Transformers for images too! But it needs massive data (JFT-300M)
▪ DeiT [2]
▪ Established a training recipe for ViT; on par with CNNs using ImageNet alone
▪ MLP-Mixer [3]
▪ MLPs instead of attention work just fine! (see the sketch after this list)
▪ A flood of ViT refinements and attention substitutes (MLP, pool, shift, LSTM)
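A minimal PyTorch sketch of the MLP-Mixer-style token-mixing idea from [3] (layer names and dimensions below are illustrative assumptions, not the authors' code): an MLP applied along the patch/token axis takes the place of self-attention.

import torch
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    # Mixes information across tokens with an MLP instead of self-attention.
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_tokens),
        )

    def forward(self, x):
        # x: (batch, num_tokens, dim); transpose so the MLP acts along the token axis
        y = self.norm(x).transpose(1, 2)   # (batch, dim, num_tokens)
        y = self.mlp(y).transpose(1, 2)    # back to (batch, num_tokens, dim)
        return x + y                       # residual connection

x = torch.randn(2, 196, 768)               # e.g. 14x14 = 196 patches, channel dim 768
print(TokenMixingMLP(num_tokens=196, dim=768)(x).shape)   # torch.Size([2, 196, 768])

The pool-, shift-, and LSTM-based variants mentioned above swap in their own operation at the same token-mixing spot in the block.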
[1] A. Dosovitskiy, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. of ICLR, 2021.
[2] H. Touvron, et al., "Training Data-efficient Image Transformers & Distillation through Attention," in Proc. of ICML, 2021.
[3] I. Tolstikhin, et al., "MLP-Mixer: An All-MLP Architecture for Vision," in Proc. of NeurIPS, 2021.