Paper Review: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Slide 1

Slide 1 text

Paper Review: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Takehiro Matsuda

Slide 2

Slide 2 text

Paper information
Title: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
• Paper: https://arxiv.org/abs/2305.07027
• Code: https://github.com/microsoft/Cream/tree/main/EfficientViT
• Venue: CVPR2023
• Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
• Affiliations: The Chinese University of Hong Kong, Microsoft Research
Why I chose this paper:
• It looked like a good way to learn how a Vision Transformer should be structured to run fast on edge devices and similar hardware.

Slide 3

Slide 3 text

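Vision Transformer
Transformers are increasingly used in the image domain as well. However, most of the models that report high accuracy are large. Lightweight Vision Transformer models have been proposed recently, but measurements based on parameter count or FLOPs often diverge from the model's actual inference throughput.
e.g., MobileViT-XS needs 700M FLOPs and DeiT-T needs 1,220M FLOPs, yet DeiT-T runs faster on an Nvidia V100 GPU.
This paper presents a ViT structure that achieves high throughput when inference is actually run.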
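
Since the point of this slide is that FLOPs do not predict speed, throughput has to be measured directly. The following is a minimal sketch of such a measurement, not the paper's benchmark script: `measure_throughput` and `dummy_batch` are hypothetical names, and the workload is a stand-in for a real model forward pass.

```python
import time

def measure_throughput(run_batch, batch_size, warmup=3, iters=10):
    """Return images per second for a callable that processes one batch."""
    for _ in range(warmup):
        run_batch()  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

def dummy_batch():
    # Hypothetical stand-in workload; replace with a real forward pass.
    return sum(i * i for i in range(10_000))

images_per_second = measure_throughput(dummy_batch, batch_size=32)
print(f"{images_per_second:.1f} images/s")
```

On a GPU, kernels run asynchronously, so the device would also need to be synchronized before each clock reading for the numbers to be meaningful.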

Slide 4

Slide 4 text

Performance
Measured on ImageNet with an Nvidia V100 GPU. Models closer to the upper right of the plot perform better.

Slide 5

Slide 5 text

Good throughput with good recognition
• Memory efficiency: Multi-head self-attention (MHSA) is strongly limited by memory-bound speed constraints. Use a sandwich layout that places the MHSA between FFNs.
• Parameter efficiency: parameter reallocation (pruning).
• Computation redundancy: reduce the similarity between heads (cascaded group attention [CGA]).

Slide 6

Slide 6 text

Memory efficiency
In Transformers, the time spent on memory access has a large impact on speed. Many operations are memory-bound: reshaping, element-wise addition, and normalization.
[Figure: runtime profiling]

Slide 7

Slide 7 text

Transformer
Attention is All You Need https://arxiv.org/abs/1706.03762
[Figure: Transformer architecture and Transformer block. Q, K, V feed Multi-Head Attention, followed by Add & Norm, Feed Forward, and Add & Norm.]

Slide 8

Slide 8 text

Single Head Self Attention
Attention is All You Need https://arxiv.org/abs/1706.03762
[Figure: Q, K, and V each pass through a Linear layer and then into Scaled Dot-Product Attention.]
A recap of single-head attention before moving on to multi-head attention.
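
As a recap of the figure, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V. The sketch below uses plain Python lists and omits the three Linear projections from the figure, so Q, K, and V are passed in directly; the helper names and input values are illustrative.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # a is n x m, b is m x p; zip(*b) iterates over the columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)
```

Each row of `weights` sums to 1, and each output token is the corresponding weighted average of the rows of `v`.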

Slide 9

Slide 9 text

Multi Head Self Attention
Attention is All You Need https://arxiv.org/abs/1706.03762

Slide 10

Slide 10 text

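MHSA proportions
Consider a configuration that reduces the number of memory-inefficient layers. Conventional ViTs often use as many MHSA (Multi Head Self Attention) layers as FFN (Feed Forward Network) layers. However, MHSA involves more memory-inefficient operations than FFN.
When the proportion of MHSA was varied, performance was best at 20 to 40%.
With 20% MHSA in Swin-T-1.25x, memory-bound operations decreased and accounted for 44.26% of the total runtime.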

Slide 11

Slide 11 text

Computation Efficiency
Making self-attention multi-head is said to yield diverse representations, but the measured similarity between heads is high.
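
One simple way to quantify how redundant heads are is the average pairwise cosine similarity of their outputs. This is only an illustrative sketch with made-up vectors; the exact similarity measure used in the paper may differ, and `mean_pairwise_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(heads):
    """Average cosine similarity over all distinct pairs of head vectors."""
    sims = [
        cosine_similarity(heads[i], heads[j])
        for i in range(len(heads))
        for j in range(i + 1, len(heads))
    ]
    return sum(sims) / len(sims)

# Made-up flattened head outputs (assumption, for illustration only).
redundant_heads = [[1.0, 2.0, 3.0], [1.1, 2.0, 2.9], [0.9, 2.1, 3.0]]
diverse_heads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

redundant_score = mean_pairwise_similarity(redundant_heads)
diverse_score = mean_pairwise_similarity(diverse_heads)
```

A score close to 1 means the heads compute nearly the same thing, which is the redundancy this slide points out.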

Slide 12

Slide 12 text

Parameter Efficiency
Typical ViTs mainly inherit the design strategies from NLP transformer [71], e.g., using an equivalent width for Q,K,V projections, increasing heads over stages, and setting the expansion ratio to 4 in FFN.
Taylor structured pruning* is used to reduce the number of parameters by removing unimportant channels.
* Importance Estimation for Neural Network Pruning https://arxiv.org/abs/1906.10771
[Figure: Swin-T]

Slide 13

Slide 13 text

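Taylor structured pruning: Overview
Importance Estimation for Neural Network Pruning https://arxiv.org/abs/1906.10771
Layers with low importance are removed. Ideally we would measure the error when a given layer is removed ($w_m = 0$), but computing this for every layer is expensive, so it is approximated with a Taylor expansion.
The first-order approximation uses the gradient $g_m = \frac{\partial E}{\partial w_m}$; the second-order approximation also uses the Hessian $H_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j}$.
The importance of a group ($S$) of several layers is computed with the formula shown on the slide.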
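
To make the first-order case concrete: in the cited pruning paper, the first-order importance of a parameter is (g_m w_m)^2, and one of its group variants sums g_s w_s over the group S before squaring. The sketch below applies those two formulas to made-up weights and gradients; the function names and values are illustrative only.

```python
def first_order_importance(weights, grads):
    """Per-parameter first-order Taylor score: (g_m * w_m)^2."""
    return [(g * w) ** 2 for w, g in zip(weights, grads)]

def group_importance(weights, grads, group):
    """Score of a group S: (sum over s in S of g_s * w_s)^2."""
    return sum(grads[s] * weights[s] for s in group) ** 2

# Made-up values (assumption): weights and their loss gradients.
weights = [0.5, -1.0, 0.01, 2.0]
grads = [0.2, 0.1, 0.3, 0.05]

scores = first_order_importance(weights, grads)
# The parameter with the lowest score is the first pruning candidate.
least_important = min(range(len(scores)), key=scores.__getitem__)
group_score = group_importance(weights, grads, group=[0, 3])
```

Only one backward pass is needed to obtain the gradients, which is why this is much cheaper than retraining or re-evaluating the network once per removed unit.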

Slide 14

Slide 14 text

Proposed EfficientViT
Architecture based on the introduced viewpoints

Slide 15

Slide 15 text

EfficientViT architecture

Slide 16

Slide 16 text

Sandwich layout
To build up a memory-efficient block, we propose a sandwich layout that employs less memory-bound self-attention layers and more memory-efficient FFN layers for channel communication.
Single attention layer: one attention layer is sandwiched between N FFNs.
DWConv (depthwise convolution) is used for token interaction.
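
The layer order of one block can be sketched as follows. This assumes N FFNs on each side of the single attention layer and a DWConv placed before every FFN; the placement of the DWConv is an assumption here, since the slide only says it is used for token interaction. `sandwich_layout` is a hypothetical helper that returns layer names, not real modules.

```python
def sandwich_layout(num_ffn):
    """Layer order of one block: N FFNs, one attention layer, N FFNs."""
    # Assumption: a depthwise convolution precedes each FFN.
    ffn_group = ["DWConv", "FFN"] * num_ffn
    return ffn_group + ["Attention"] + ffn_group

def attention_ratio(layout):
    """Share of attention layers among attention and FFN layers."""
    n_attn = layout.count("Attention")
    n_ffn = layout.count("FFN")
    return n_attn / (n_attn + n_ffn)

layout = sandwich_layout(num_ffn=2)
ratio = attention_ratio(layout)
```

With `num_ffn=2` the block has one attention layer and four FFNs, so attention makes up 20% of those layers, which falls inside the 20 to 40% range reported as working well on the MHSA proportions slide.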

Slide 17

Slide 17 text

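Cascaded Group Attention
The channels are split across heads (the j-th head receives the j-th split), inspired by group convolutions.
Cascaded: each later head refines its result based on the output of the preceding head.
This lets the model capture local and global relations jointly, which improves its representational capacity.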
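
The two ideas on this slide, channel splitting and cascading, can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the Q/K/V projections are replaced by the identity, the final output projection is omitted, and all function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections (simplification)."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

def cascaded_group_attention(x, num_heads):
    """x is a list of tokens; each token's channels are split evenly across heads."""
    width = len(x[0]) // num_heads
    head_outputs = []
    previous = None
    for j in range(num_heads):
        # The j-th head only sees the j-th slice of the channels.
        split = [token[j * width:(j + 1) * width] for token in x]
        if previous is not None:
            # Cascade: add the previous head's output to this head's input.
            split = [[a + b for a, b in zip(s, p)] for s, p in zip(split, previous)]
        previous = self_attention(split)
        head_outputs.append(previous)
    # Concatenate the head outputs channel-wise for every token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]

tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [1.0, 1.0, 0.0, 0.0]]
out = cascaded_group_attention(tokens, num_heads=2)
```

Because each head attends over only a slice of the channels, the per-head cost is lower than attending over the full feature, while the cascade lets later heads build on what earlier heads produced.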

Slide 18

Slide 18 text

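Other layout, implementation
• BatchNorm is used instead of LayerNorm (it can be fused with the preceding convolution or linear layer).
• ReLU is used as the activation function (it is faster than GELU or HardSwish and is supported on deployment platforms).
Channel, block, and head counts: the earlier stages, which have higher resolution, use fewer blocks (the same idea as MobileNet V3 and LeViT).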
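
The reason BatchNorm can be fused is that, at inference time, it is a fixed per-channel affine transform, so it folds into the weights and bias of the preceding linear layer. The sketch below shows this for a linear layer with made-up values; `fold_batchnorm` and the other helpers are hypothetical names, and the same idea applies per output channel of a convolution.

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BatchNorm statistics into a linear layer."""
    folded_w, folded_b = [], []
    for w_row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(w_row, x)) + b for w_row, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm using fixed running statistics."""
    return [(yi - m) / math.sqrt(v + eps) * g + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

# Made-up parameters (assumption, for illustration only).
weight = [[0.5, -1.0], [2.0, 0.3]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.1]
mean, var = [0.2, -0.1], [1.0, 4.0]
x = [1.0, 2.0]

reference = batchnorm(linear(x, weight, bias), gamma, beta, mean, var)
fused = linear(x, *fold_batchnorm(weight, bias, gamma, beta, mean, var))
```

`fused` matches `reference` up to floating-point error, but needs only one layer at inference. LayerNorm cannot be folded this way because its statistics depend on each input.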

Slide 19

Slide 19 text

Experiments
ImageNet-1K
Training:
• PyTorch 1.11.0, Timm 0.5.4
• Trained from scratch for 300 epochs on 8 Nvidia V100 GPUs, batch size 2,048
• AdamW optimizer, cosine learning rate scheduler
Inference:
• GPU: Nvidia V100
• CPU: Intel Xeon E5-2690 v4 @ 2.6 GHz
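
For reference, a cosine learning rate schedule decays the rate from its base value to a minimum along half a cosine period. The sketch below uses the 300 epochs from the slide; the base learning rate of 1e-3 and the minimum of 0 are assumed values for illustration, and any warm-up phase is left out.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given step."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# base_lr=1e-3 is a hypothetical value, not taken from the slides.
schedule = [cosine_lr(epoch, 300, base_lr=1e-3) for epoch in range(301)]
```

The schedule starts at the base rate, reaches half of it at epoch 150, and ends at the minimum at epoch 300.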

Slide 20

Slide 20 text

Results of ImageNet

Slide 21

Slide 21 text

Comparison with the large-scale ViTs

Slide 22

Slide 22 text

Performance on mobile devices
Run on the A13 Bionic chip of an Apple iPhone 11 (using CoreML).
The value in parentheses for EfficientViT-M4 is from training for 1,000 epochs with distillation.

Slide 23

Slide 23 text

Transfer Learning Results
Transfer learning to other datasets also gives good results.
MobileNet's higher accuracy on Stanford Cars may be because the subtle differences between classes lie in local details, which convolutions are well suited to capture.

Slide 24

Slide 24 text

COCO Object Detection
High accuracy was also obtained on the object detection task on COCO val2017.

Slide 25

Slide 25 text

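Ablation Study
• At comparable speed, accuracy drops by 3%.
• Increasing the number of FFNs also lowers accuracy.
• Cascaded Group Attention also contributes to accuracy.
• Making the parameters effective raises the accuracy achievable at comparable speed.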

Slide 26

Slide 26 text

Ablation Study
A study of the MHSA dimensions.