▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV’21. Thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, arXiv’21. Thanks to @Ocha_Cocoa
Positional Encoding (aside)
Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
Merge multiple patches, then flatten, linear, norm
Same as Patch Merging, except that the order of linear and norm is reversed
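The ordering difference noted above can be illustrated with a minimal pure-Python sketch. It is not taken from either paper's code. The helper names (`merge_patches`, `linear`, `layer_norm`, `linear_then_norm`, `norm_then_linear`) are hypothetical, the layer norm has no learnable scale or shift, the linear layer has no bias, and the feature map height and width are assumed to be divisible by the patch size.

```python
import math

def merge_patches(feat, p=2):
    """Group each p x p neighbourhood of an H x W x C map into one flat vector."""
    h, w = len(feat), len(feat[0])
    merged = []
    for i in range(0, h, p):
        row = []
        for j in range(0, w, p):
            vec = []
            for di in range(p):
                for dj in range(p):
                    vec.extend(feat[i + di][j + dj])  # append the C channels
            row.append(vec)
        merged.append(row)
    return merged

def linear(vec, weight):
    """Bias-free projection: out[k] = sum_i weight[k][i] * vec[i]."""
    return [sum(wi * vi for wi, vi in zip(row, vec)) for row in weight]

def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no affine parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear_then_norm(feat, weight, p=2):
    """flatten -> linear -> norm (the ordering listed on the slide)."""
    return [[layer_norm(linear(v, weight)) for v in row]
            for row in merge_patches(feat, p)]

def norm_then_linear(feat, weight, p=2):
    """flatten -> norm -> linear (the reversed ordering, as in Patch Merging)."""
    return [[linear(layer_norm(v), weight) for v in row]
            for row in merge_patches(feat, p)]
```

Both functions share the same merge step and differ only in whether normalization is applied before or after the projection, which is the point the slide makes.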
▪ Proposes PoolFormer, whose token mixer is nothing more than pooling
Related method: MetaFormer
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in arXiv:2111.11418.
Conv3x3 stride=2, Avg pool3x3
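As a rough illustration of a pooling-only token mixer, the following pure-Python sketch applies 3x3 average pooling with stride 1 to a single-channel grid of tokens. The function names are hypothetical. Excluding the padded border from the average and subtracting the input (because the surrounding block already adds a residual connection) are assumptions based on my reading of the PoolFormer paper, not details stated on the slide.

```python
def avg_pool3x3(grid):
    """3x3 average pooling, stride 1; out-of-bounds cells are excluded from the mean."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [grid[a][b]
                    for a in range(max(0, i - 1), min(h, i + 2))
                    for b in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = sum(vals) / len(vals)
    return out

def pool_token_mixer(grid):
    """Token mixing by pooling only: no attention and no learned weights.

    Assumption: the input is subtracted so that the residual connection
    around the mixer does not count the identity twice.
    """
    pooled = avg_pool3x3(grid)
    return [[pooled[i][j] - grid[i][j] for j in range(len(grid[0]))]
            for i in range(len(grid))]
```

On a constant grid the mixer returns all zeros, since each token already equals the average of its neighbours.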