A Guide to Fully Understanding Swin Transformer (ICCV'21 Best Paper)

Slide 1

Slide 1 text

Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
Yusuke Uchida (@yu4u), Mobility Technologies Co., Ltd.
This material is an expanded version of slides prepared for a DeNA + MoT paper-reading group.

Slide 2

Slide 2 text

The official implementation is a useful reference, so read it alongside these slides
▪ Official implementation
  ▪ https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py
▪ timm version (essentially a port of the official code)
  ▪ https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/swin_transformer.py
▪ If you want to use it as a backbone, use this one
  ▪ https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/master/mmdet/models/backbones/swin_transformer.py

Slide 3

Slide 3 text

Starting with something trivial
▪ Far too many "equal contribution" authors

Slide 4

Slide 4 text

User testimonials
(These are personal opinions.)

Slide 5

Slide 5 text

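Overview
▪ Transformer has become the de facto backbone in NLP
▪ Can Transformer serve as a general-purpose backbone for vision, the way CNNs do? → Swin Transformer!
▪ Proposes extensions that address the domain differences between NLP and vision
  ▪ The scale problem
    ▪ In NLP the word token is the smallest unit of processing, whereas in images some tasks (e.g. detection) depend on multi-scale processing
    → Hierarchical feature maps generated by patch merging
  ▪ The resolution problem
    ▪ Some tasks require processing at a resolution finer than the patch level
    → Shifted windows reduce the computational cost and make high-resolution feature maps feasible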

Slide 6

Slide 6 text

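Architecture
▪ Can output C2-C5 feature maps, so it is compatible with CNN backbones (in theory)
▪ The channel count doubling from stage to stage is also the same as in CNNs
[Figure labels: C2, C3, C4, C5]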
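To make the C2-C5 correspondence concrete, the following is a minimal sketch of the per-stage output shapes. It assumes the default values listed later in this deck (input size 224, patch size 4, embedding dimension 96) and assumes that every stage after the first halves the resolution and doubles the channels. The function name `swin_feature_shapes` is hypothetical.

```python
def swin_feature_shapes(img_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (name, channels, height, width) for each stage output.

    Assumption: stage i has stride patch_size * 2**i and embed_dim * 2**i channels.
    """
    shapes = []
    for i in range(num_stages):
        stride = patch_size * 2 ** i       # 4, 8, 16, 32
        channels = embed_dim * 2 ** i      # C, 2C, 4C, 8C
        size = img_size // stride
        shapes.append((f"C{i + 2}", channels, size, size))
    return shapes


shapes = swin_feature_shapes()
for name, channels, height, width in shapes:
    print(name, (channels, height, width))
```

Under these assumptions the strides (4, 8, 16, 32) match the usual C2-C5 levels of a ResNet-style backbone, which is what makes a drop-in replacement plausible.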

Slide 7

Slide 7 text

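The timm version is hard to use as a backbone for anything other than classification
▪ timm: the features have already been average-pooled at this stage
▪ Swin-Transformer-Object-Detection: the features of each level are properly returned as a list of tensors with shape BCHW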

Slide 8

Slide 8 text

The timm version is hard to use as a backbone for anything other than classification
https://github.com/rwightman/pytorch-image-models/issues/614

Slide 9

Slide 9 text

Architecture
▪ Main building blocks
  ▪ Patch Partition & Linear Embedding
  ▪ Patch Merging
  ▪ Swin Transformer Block

Slide 10

Slide 10 text

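Patch Partition & Linear Embedding
▪ Patch Partition
  ▪ As in ViT, the image is split into fixed-size patches
  ▪ The default patch size is 4x4 → for an RGB image each token has 4x4x3 = 48 dimensions
▪ Linear Embedding
  ▪ Projects each patch (token) to C dimensions
▪ In practice both steps are performed by a single conv2d with kernel_size = stride = patch size
▪ By default this is followed by Layer Normalization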
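As an illustration of the patch partition step only, here is a small standard-library sketch that splits an image stored as nested lists into flattened patch tokens. It is not the conv2d-based implementation mentioned on the slide; the linear embedding and Layer Normalization are left out, and the helper name `patch_partition` is hypothetical.

```python
def patch_partition(image, patch_size=4):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns a grid [H / p][W / p] of tokens, each of length p * p * C.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    grid = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            token = [
                value
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
                for value in image[y][x]   # channel values of one pixel
            ]
            row.append(token)
        grid.append(row)
    return grid


# Toy 8x8 "RGB" image; each pixel stores (y, x, 0) so patches are easy to inspect.
image = [[[y, x, 0] for x in range(8)] for y in range(8)]
tokens = patch_partition(image, patch_size=4)
print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # grid height, grid width, token dim
```

A convolution with kernel_size = stride = patch size computes the same partition and a linear projection of each 48-dimensional token in one operation, which is why the two steps collapse into a single layer in practice.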

Slide 11

Slide 11 text

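Patch Merging
▪ Merges each neighborhood of 2x2 C-dimensional patches
  ▪ concat → 4C dimensions
  ▪ Layer Normalization
  ▪ Linear → 2C dimensions
Since the tensor is kept as (B, HW, C), pixel_unshuffle is probably awkward to use here?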
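The concatenation step can be sketched with nested lists as below. This covers only the 2x2 grouping and the resulting 4C token size; Layer Normalization and the 4C → 2C linear layer are omitted. The helper name `merge_2x2` and the ordering of the four neighbors inside the merged token are assumptions made for illustration.

```python
def merge_2x2(grid):
    """Concatenate each 2x2 neighborhood of C-dim tokens into one 4C-dim token."""
    height, width = len(grid), len(grid[0])
    assert height % 2 == 0 and width % 2 == 0
    merged = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            # Assumed order: top-left, bottom-left, top-right, bottom-right.
            row.append(grid[y][x] + grid[y + 1][x] + grid[y][x + 1] + grid[y + 1][x + 1])
        merged.append(row)
    return merged


C = 3
# 4x4 grid of tokens; the token at (y, x) is the value 10*y + x repeated C times.
grid = [[[10 * y + x] * C for x in range(4)] for y in range(4)]
merged = merge_2x2(grid)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 x 2 grid of 4C-dim tokens
```

The spatial resolution halves in each direction while the token length grows to 4C; the subsequent linear layer then brings it down to 2C, giving the usual "half resolution, double channels" step.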

Slide 12

Slide 12 text

Swin Transformer Block
▪ Almost identical to a Transformer encoder layer
▪ The difference is the Shifted Window-based Multi-head Self-attention (this is the key point)
[Figure: Two Successive Swin Transformer Blocks]

Slide 13

Slide 13 text

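Swin Transformer Block
▪ Almost identical to a Transformer encoder layer
▪ The difference is the Shifted Window-based Multi-head Self-attention (this is the key point)
[Figure: Two Successive Swin Transformer Blocks, labeled Pre-norm, next to the original Transformer encoder layer, labeled Post-norm]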

Slide 14

Slide 14 text

Post-norm vs. Pre-norm
▪ Learning Deep Transformer Models for Machine Translation, ACL'19.
▪ On Layer Normalization in the Transformer Architecture, ICML'20.
Reminiscent of post-activation vs. pre-activation in ResNet, isn't it?
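The difference between the two orderings can be shown with a small sketch. It assumes a LayerNorm without learnable scale and shift, and treats the sublayer (attention or MLP) as an arbitrary function; the names `post_norm_block` and `pre_norm_block` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm over one token, without learnable affine parameters (assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    # Post-norm: LayerNorm(x + sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    # Pre-norm: x + sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(v):
    """A sublayer that outputs zeros, to expose what happens to the skip path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # the input passes through unchanged
print(post_norm_block(x, zero_sublayer))  # the input is normalized even on the skip path
```

With pre-norm the residual path is an identity, the same property that pre-activation ResNet has; with post-norm every block renormalizes the accumulated signal.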

Slide 16

Slide 16 text

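Window-based Multi-head Self-attention (W-MSA)
▪ The feature map is partitioned into windows of size M x M, and self-attention is computed only within each window
▪ For a feature map with h x w patches, the cost drops from (hw) x (hw) (the square of the number of patches) to (M^2 x M^2) x ((h/M) x (w/M)) = M^2 hw (the per-window cost times the number of windows)
▪ M = 7 (for input size 224)
▪ At C2 (stride = 4, a 56x56 feature map) this gives 8x8 windows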
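The cost formula can be checked numerically for the C2 case on the slide (h = w = 56, M = 7). The sketch counts only query-key pairs, ignoring the channel dimension and the number of heads; the function names are hypothetical.

```python
def global_attention_pairs(h, w):
    """Query-key pairs for global self-attention: (hw) x (hw)."""
    return (h * w) ** 2


def window_attention_pairs(h, w, m):
    """Query-key pairs for window attention: (M^2 x M^2) per window, times the window count."""
    assert h % m == 0 and w % m == 0
    num_windows = (h // m) * (w // m)
    per_window = (m * m) ** 2
    return per_window * num_windows


h = w = 56
m = 7
global_pairs = global_attention_pairs(h, w)
window_pairs = window_attention_pairs(h, w, m)
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

The ratio equals hw / M^2, that is, the number of windows, so the saving grows linearly with the size of the feature map while the window cost itself stays fixed.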

Slide 17

Slide 17 text

Shifted Window-based Multi-head Self-attention (SW-MSA)
▪ W-MSA with the windows shifted by (M/2, M/2)
▪ Alternating it with the regular window-based attention creates connections between adjacent windows
[Figure: example with h = w = 8, M = 4]

Slide 18

Slide 18 text

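Efficient implementation of SW-MSA
▪ In the example below the shift produces 9 windows, but the feature map is shifted cyclically so that attention is computed over the same 2x2 windows as in the unshifted case
▪ In practice, a window that mixes tokens from several original windows uses a mask so that the attention between tokens from different windows becomes 0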
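The masking idea can be sketched for the h = w = 8, M = 4 example, with a shift of M/2 = 2. Each position of the already shifted map is labeled by the region it belongs to (3 bands per axis, 9 regions), the labels are partitioned into the regular 2x2 windows, and tokens with different labels are masked. The additive value of -100 applied before softmax is an assumption about how "attention 0" is realized, and all function names are hypothetical.

```python
def region_labels(size, window, shift):
    """Label each position of the shifted size x size map with one of 9 region ids."""
    def band(i):
        if i < size - window:
            return 0
        if i < size - shift:
            return 1
        return 2
    return [[3 * band(y) + band(x) for x in range(size)] for y in range(size)]


def window_partition(grid, window):
    """Split a 2-D grid into non-overlapping window x window blocks, each flattened."""
    size = len(grid)
    return [
        [grid[y][x] for y in range(top, top + window) for x in range(left, left + window)]
        for top in range(0, size, window)
        for left in range(0, size, window)
    ]


def attention_mask(labels, neg=-100.0):
    """Additive mask: 0 where two tokens share a region, a large negative value otherwise."""
    return [[0.0 if a == b else neg for b in labels] for a in labels]


size, window, shift = 8, 4, 2
labels = region_labels(size, window, shift)
windows = window_partition(labels, window)
masks = [attention_mask(w) for w in windows]
print([len(set(w)) for w in windows])  # number of distinct regions in each of the 4 windows
```

Only the top-left window is made up of a single region and needs no masking; the other three contain 2, 2, and 4 regions respectively, which together account for the 9 shifted windows.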

Slide 19

Slide 19 text

Implementation
[Code figure annotations: shift, reverse shift, (S)W-MSA body]
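The shift and reverse shift annotated in the code figure can be illustrated with a cyclic roll over the two spatial dimensions. The helper `roll2d` below is a hypothetical standard-library stand-in that follows the semantics of `torch.roll` (a positive shift moves elements toward higher indices); the attention step in between is only indicated by a comment.

```python
def roll2d(grid, shift_y, shift_x):
    """Cyclically shift a 2-D grid: out[y][x] = grid[(y - shift_y) % h][(x - shift_x) % w]."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - shift_y) % h][(x - shift_x) % w] for x in range(w)] for y in range(h)]


size, shift = 8, 2
feature = [[(y, x) for x in range(size)] for y in range(size)]

shifted = roll2d(feature, -shift, -shift)   # shift before (S)W-MSA
# ... window partition, masked window attention, window merge would go here ...
restored = roll2d(shifted, shift, shift)    # reverse shift afterwards

print(shifted[0][0], shifted[size - 1][size - 1])
```

Because the roll is cyclic, the rows and columns that fall off the top-left edge reappear at the bottom-right, which is exactly where the mixed windows that need masking end up.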

Slide 20

Slide 20 text

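Relative Position Bias
▪ Self-attention by itself is just an encoder over a set
▪ Positional encoding is what tells it that the input is sequential data
▪ Swin uses a relative position bias
▪ Making it relative expresses translation invariance
The attention strength is adjusted (learnably) according to the relative positions within a window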

Slide 21

Slide 21 text

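▪ Relative offsets lie in the range [-M + 1, M - 1] both vertically and horizontally, giving (2M-1)^2 patterns
▪ The implementation stores the mapping between index and bias and looks the bias up when it is needed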
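The index lookup can be sketched as follows: for every pair of tokens in an M x M window, the vertical and horizontal offsets are shifted into [0, 2M-2] and combined into a single index into a bias table with (2M-1)^2 entries. The function name `relative_position_index` and the row-major flattening of the offset pair are assumptions for illustration.

```python
def relative_position_index(m):
    """Return an (m*m) x (m*m) table of indices into a bias table of size (2m-1)**2."""
    coords = [(y, x) for y in range(m) for x in range(m)]
    span = 2 * m - 1
    index = []
    for yi, xi in coords:
        row = []
        for yj, xj in coords:
            dy = yi - yj + (m - 1)   # shift from [-(m-1), m-1] to [0, 2m-2]
            dx = xi - xj + (m - 1)
            row.append(dy * span + dx)
        index.append(row)
    return index


m = 7
index = relative_position_index(m)
num_bias_entries = (2 * m - 1) ** 2
print(len(index), len(index[0]), num_bias_entries)
```

At attention time the bias added to the logit between tokens i and j would be `bias_table[index[i][j]]`, with one learnable table per head; the index itself is fixed and can be computed once.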

Slide 22

Slide 22 text

Positional Encoding (an aside)
▪ On Position Embeddings in BERT, ICLR'21
  ▪ https://openreview.net/forum?id=onxoVA9FxMw
  ▪ https://twitter.com/akivajp/status/1442241252204814336
▪ Rethinking and Improving Relative Position Encoding for Vision Transformer, ICCV'21. thanks to @sasaki_ts
▪ CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, arXiv'21. thanks to @Ocha_Cocoa

Slide 23

Slide 23 text

Parameters and so on
img_size (int | tuple(int)): Input image size. Default 224
patch_size (int | tuple(int)): Patch size. Default: 4
in_chans (int): Number of input image channels. Default: 3
num_classes (int): Number of classes for classification head. Default: 1000
embed_dim (int): Patch embedding dimension. Default: 96
depths (tuple(int)): Depth of each Swin Transformer layer. [2, 2, 6, 2]
num_heads (tuple(int)): Number of attention heads in different layers. [3, 6, 12, 24]
window_size (int): Window size. Default: 7
mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True
qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None
drop_rate (float): Dropout rate. Default: 0
attn_drop_rate (float): Attention dropout rate. Default: 0
drop_path_rate (float): Stochastic depth rate. Default: 0.1
norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
ape (bool): If True, add absolute position embedding to the patch embedding. Default: False
patch_norm (bool): If True, add normalization after patch embedding. Default: True
use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False
▪ num_heads: the number of heads increases along with the dimension
▪ drop_path_rate: stochastic depth is used heavily
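The note about the head count following the dimension can be checked from the listed defaults. The sketch below assumes that the channel dimension doubles at every stage, starting from embed_dim; the variable names are hypothetical.

```python
embed_dim = 96
depths = [2, 2, 6, 2]
num_heads = [3, 6, 12, 24]

# Assumption: the channel dimension doubles at every stage.
stage_dims = [embed_dim * 2 ** i for i in range(len(depths))]
head_dims = [dim // heads for dim, heads in zip(stage_dims, num_heads)]
qk_scales = [head_dim ** -0.5 for head_dim in head_dims]

for stage, (dim, heads, head_dim) in enumerate(zip(stage_dims, num_heads, head_dims), start=1):
    print(f"stage {stage}: dim={dim}, heads={heads}, head_dim={head_dim}")
```

Under this assumption the per-head dimension stays constant across stages, so the default qk scale of head_dim ** -0.5 is also the same in every stage.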

Slide 24

Slide 24 text

Model Configuration
▪ Stochastic depth drop probability for classification training: T: 0.2, S: 0.3, B: 0.5
▪ For detection and segmentation it is 0.2 for all models

Slide 25

Slide 25 text

Stochastic Depth
▪ Applied to both the MSA and the MLP (FF) branches
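A minimal sketch of stochastic depth applied to both residual branches is shown below. It operates on a batch stored as nested lists, drops a whole branch per sample with probability drop_prob during training, and rescales the kept samples by 1 / (1 - drop_prob). The normalization layers are omitted, all function names are hypothetical, and the linearly increasing per-block rate is stated as an assumption about how the maximum rate is distributed over blocks.

```python
import random


def drop_path(branch_outputs, drop_prob, training, rng=random):
    """Stochastic depth on a batch of residual-branch outputs (one list per sample)."""
    if not training or drop_prob == 0.0:
        return branch_outputs
    keep_prob = 1.0 - drop_prob
    result = []
    for sample in branch_outputs:
        if rng.random() < keep_prob:
            result.append([v / keep_prob for v in sample])  # rescale kept samples
        else:
            result.append([0.0 for _ in sample])            # drop the whole branch
    return result


def residual(x, branch, drop_prob, training, rng):
    """x + DropPath(branch(x)) for every sample in the batch."""
    out = drop_path([branch(s) for s in x], drop_prob, training, rng)
    return [[a + b for a, b in zip(s, o)] for s, o in zip(x, out)]


def block_forward(x, attn, mlp, drop_prob, training, rng=random):
    x = residual(x, attn, drop_prob, training, rng)  # MSA branch
    x = residual(x, mlp, drop_prob, training, rng)   # MLP (FF) branch
    return x


def linear_drop_rates(max_rate, depths):
    """Assumed schedule: the rate grows linearly from 0 to max_rate over all blocks."""
    total = sum(depths)
    return [max_rate * i / (total - 1) for i in range(total)]


rates = linear_drop_rates(0.2, [2, 2, 6, 2])
print([round(r, 3) for r in rates])
```

At inference time `training=False` makes both branches deterministic, so the block reduces to the plain residual computation.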

Slide 26

Slide 26 text

Experimental results
▪ SOTA! Amazing!

Slide 27

Slide 27 text

Ablation Study
▪ Shifted windows and the relative position bias are both important

Slide 28

Slide 28 text

Sliding window vs. shifted window
▪ Shifted windows are faster at comparable accuracy

Slide 29

Slide 29 text

Related work: CSWin Transformer
▪ The channels are split into two halves, which perform self-attention within vertical and horizontal stripes respectively
X. Dong, et al., "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows," in arXiv:2107.00652.

Slide 30

Slide 30 text

Related work: Pyramid Vision Transformer 🤔
W. Wang, et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions," in Proc. of ICCV, 2021.
https://github.com/whai362/PVT

Slide 31

Slide 31 text

Related work: Pyramid Vision Transformer

Slide 32

Slide 32 text

Related work: Pyramid Vision Transformer
▪ Multiple patches are merged, followed by flatten, linear, norm
▪ Apart from linear and norm being applied in the opposite order, this is the same as Patch Merging

Slide 33

Slide 33 text

Related work: Pyramid Vision Transformer
▪ The position embedding is the ordinary learned kind

Slide 34

Slide 34 text

Related work: Pyramid Vision Transformer
▪ Spatial-Reduction Attention (SRA) is the key point

Slide 35

Slide 35 text

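Spatial-Reduction Attention (SRA)
▪ Only K and V (the dictionary side) are reduced in spatial size
▪ Implemented as Conv2D -> LayerNorm
▪ Q is left as is, so the output size does not change
▪ The reduction ratios are 8, 4, 2, 1, matched to the stride of each stage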
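The effect of matching the reduction ratio to the stride can be seen by counting tokens per stage. The sketch assumes an input size of 224 and stage strides of 4, 8, 16, 32, which are not stated on the slide; the function name `sra_token_counts` is hypothetical.

```python
def sra_token_counts(img_size=224, strides=(4, 8, 16, 32), reductions=(8, 4, 2, 1)):
    """Return (stride, query tokens, key/value tokens, query-key pairs) per stage."""
    rows = []
    for stride, reduction in zip(strides, reductions):
        side = img_size // stride
        q_tokens = side * side                  # Q keeps the full resolution
        kv_tokens = (side // reduction) ** 2    # K, V are spatially reduced
        rows.append((stride, q_tokens, kv_tokens, q_tokens * kv_tokens))
    return rows


rows = sra_token_counts()
for stride, q_tokens, kv_tokens, pairs in rows:
    print(f"stride {stride}: Q={q_tokens}, K/V={kv_tokens}, pairs={pairs}")
```

Under these assumptions K and V always shrink to a 7x7 grid, so the attention cost per stage grows only linearly with the number of query tokens.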

Slide 36

Slide 36 text

Related work: Pyramid Vision Transformer
▪ There is a V2 too!
▪ The year should be 2021, not 2020, so could someone please send them a PR?
https://github.com/whai362/PVT

Slide 37

Slide 37 text

Related work: Swin Transformer V2
▪ They somehow managed to squeeze a huge model onto GPUs!
▪ It has switched to post-norm...
Ze Liu, et al., "Swin Transformer V2: Scaling Up Capacity and Resolution," in arXiv:2111.09883.

Slide 38

Slide 38 text

Related work: MetaFormer
▪ The general structure of the Transformer itself matters more than the token mixer
  ▪ Token mixer = self-attention, MLP
▪ Proposes PoolFormer, whose token mixer is plain pooling
W. Yu, et al., "MetaFormer is Actually What You Need for Vision," in arXiv:2111.11418.
[Figure labels: Conv3x3 stride=2, Ave pool3x3]
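To show how little a pooling token mixer does, here is a standard-library sketch of a 3x3 average pooling mixer on a single-channel map. Stride 1, borders averaged over existing neighbors only, and subtracting the input (because the surrounding block adds it back through its residual connection) are assumptions about the PoolFormer design that are not spelled out on the slide; the function names are hypothetical.

```python
def avg_pool3x3(grid):
    """3x3 average pooling with stride 1 and same output size.

    Border cells average only over the neighbors that exist (padding is not counted).
    """
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                grid[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def pooling_token_mixer(grid):
    """Pooling as a token mixer: pooled value minus the input (assumed design)."""
    pooled = avg_pool3x3(grid)
    return [[p - v for p, v in zip(prow, vrow)] for prow, vrow in zip(pooled, grid)]


impulse = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
print(avg_pool3x3(impulse))
print(pooling_token_mixer([[1.0] * 3 for _ in range(3)]))  # a constant map mixes to zeros
```

The mixer has no learnable parameters at all, which is the point of the slide: the surrounding block structure (norm, residual connections, channel MLP) carries most of the weight.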