Paper Review: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Takehiro Matsuda

2 Paper information
• Title: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
• Paper: https://arxiv.org/abs/2305.07027
• Code: https://github.com/microsoft/Cream/tree/main/EfficientViT
• Venue: CVPR2023
• Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
• Affiliations: The Chinese University of Hong Kong, Microsoft Research
Why I chose this paper:
• It looked like a good way to learn how a Vision Transformer should be structured to run fast on edge devices and similar hardware.

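3 Vision Transformer
Transformers are increasingly used in the image domain as well. However, the models that show high accuracy tend to be large.
Lightweight Vision Transformer models have recently been proposed, but measurements based on model parameter count or FLOPs often diverge from the actual inference throughput of the model.
e.g., MobileViT-XS needs 700M FLOPs and DeiT-T needs 1,220M FLOPs, yet DeiT-T runs faster on an Nvidia V100 GPU.
This method presents a ViT structure that achieves high throughput when inference is actually run.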
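Since the argument rests on measured throughput rather than FLOPs, a minimal timing sketch may help. The helper name `measure_throughput` and the `dummy_forward` stand-in are hypothetical and are not taken from the paper or its repository.

```python
import time

def measure_throughput(run_batch, batch_size, warmup=3, iters=10):
    """Return images per second for a callable that processes one batch."""
    for _ in range(warmup):            # warm-up runs are not timed
        run_batch()
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Hypothetical stand-in for one forward pass of a model on a batch.
def dummy_forward():
    return sum(i * i for i in range(10_000))

throughput = measure_throughput(dummy_forward, batch_size=32)
print(f"{throughput:.1f} images/s")
```

On a GPU the device would also need to be synchronized before each timer read (for example with `torch.cuda.synchronize()` in PyTorch), otherwise only the kernel launch time is measured.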

4 Performance
Measured on ImageNet with an Nvidia V100 GPU. Points toward the upper right are better.

5 Good throughput with good recognition
• Memory efficiency: Multi-head self-attention (MHSA) is strongly limited in speed by memory access. Use a sandwich layout that places the MHSA between FFNs.
• Parameter efficiency: parameter reallocation (pruning).
• Computation redundancy: reduce the similarity between heads (cascaded group attention [CGA]).

6 Memory efficiency
In Transformers, the time spent on memory access has a large impact. They contain many memory-bound operations such as reshaping, element-wise addition, and normalization.
Figure: runtime profiling.

7 Transformer
Attention is All You Need https://arxiv.org/abs/1706.03762
Figure: Transformer architecture and Transformer block (Q, K, V, Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm).

8 Single Head Self Attention
Attention is All You Need https://arxiv.org/abs/1706.03762
A review of single-head attention before moving on to multi-head attention.
Figure: Q, K, and V each pass through a Linear layer and then Scaled Dot-Product Attention.
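The single-head computation reviewed on this slide, softmax(QK^T / sqrt(d_k))V after three linear projections, can be sketched in NumPy. The sizes and random weights are arbitrary illustration values, and bias terms are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    """x: (tokens, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # the three Linear layers
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # scaled dot product, (tokens, tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8               # arbitrary sizes for illustration
x = rng.standard_normal((n_tokens, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, weights = single_head_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```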

9 Multi Head Self Attention Attention is All You Need
https://arxiv.org/abs/1706.03762
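For reference, the multi-head formulation from the cited paper can be written as below, where h is the number of heads and the W matrices are learned projections:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```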

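10 MHSA proportions
Consider a configuration that reduces memory-inefficient layers.
Conventional ViTs often use as many MHSA (Multi Head Self Attention) layers as FFN (Feed Forward Network) layers. However, MHSA involves more memory-inefficient operations than FFN.
When the proportion of MHSA layers was varied, accuracy was best at a proportion of 20~40%.
With Swin-T-1.25x using 20% MHSA layers, memory-bound operations decreased to 44.26% of the total runtime.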

11 Computation Efficiency
Making self-attention multi-head is said to yield diverse representations, but the measured similarity between heads is high.
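One way to quantify head similarity is the average pairwise cosine similarity between head outputs. The sketch below assumes each head's output is flattened into a single vector, which may differ from the exact protocol used in the paper; the function name and the toy data are hypothetical.

```python
import numpy as np

def mean_head_similarity(head_outputs):
    """head_outputs: array of shape (heads, tokens, dim)."""
    h = head_outputs.shape[0]
    flat = head_outputs.reshape(h, -1)                       # one vector per head
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T                                      # (heads, heads) cosine similarities
    off_diag = sim[~np.eye(h, dtype=bool)]                   # drop each head's similarity with itself
    return off_diag.mean()

rng = np.random.default_rng(0)
base = rng.standard_normal((1, 16, 8))
similar_heads = base + 0.1 * rng.standard_normal((4, 16, 8))  # heads that nearly repeat each other
diverse_heads = rng.standard_normal((4, 16, 8))               # unrelated heads

print(mean_head_similarity(similar_heads))
print(mean_head_similarity(diverse_heads))
```

A value close to 1 means the heads are computing nearly the same thing, which is the redundancy this slide points at.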

12 Parameter Efficiency
Typical ViTs mainly inherit the design strategies from NLP transformer [71], e.g., using an equivalent width for Q,K,V projections, increasing heads over stages, and setting the expansion ratio to 4 in FFN.
Taylor structured pruning* is used to shrink the parameters by removing unimportant channels.
* Importance Estimation for Neural Network Pruning https://arxiv.org/abs/1906.10771
Figure: Swin-T.

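13 Taylor structured pruning Overview
Importance Estimation for Neural Network Pruning https://arxiv.org/abs/1906.10771
Remove the layers with low importance. Ideally we would measure the error when a given layer is removed ($w_m = 0$), but computing this for every removal pattern across all layers is too expensive, so it is approximated with a Taylor expansion.
The 1st-order expansion uses the gradient $g_m = \frac{\partial E}{\partial w_m}$; the 2nd-order expansion also uses the Hessian $H_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j}$.
The same approximation gives the importance of a group (S) of several layers.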
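The importance formulas did not survive in the transcript. As a reference sketch, my reading of the cited pruning paper is the following, where $E$ is the error, $D$ the data, $W$ the parameters, and $H_m$ the row of the Hessian for $w_m$; this should be checked against the paper itself.

```latex
% Exact importance: squared change in error when w_m is removed
I_m = \bigl(E(D, W) - E(D, W \mid w_m = 0)\bigr)^2

% 1st-order Taylor approximation
I_m^{(1)}(W) = (g_m w_m)^2

% 2nd-order Taylor approximation
I_m^{(2)}(W) = \Bigl(g_m w_m - \tfrac{1}{2}\, w_m H_m W\Bigr)^2

% 1st-order importance of a group S
I_S^{(1)}(W) = \Bigl(\sum_{s \in S} g_s w_s\Bigr)^2
```

The first-order form needs only the gradients that backpropagation already produces, which is why it is cheap enough to evaluate for every candidate.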

14 Proposed EfficientViT Architecture based on the introduced viewpoints

15 EfficientViT architecture

16 Sandwich layout
To build up a memory-efficient block, we propose a sandwich layout that employs less memory-bound self-attention layers and more memory-efficient FFN layers for channel communication.
• Single attention layer: one attention layer is sandwiched between FFNs (N before it and N after it).
• DWConv (Depthwise Convolution) is used for token interaction.
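The ordering of the sandwich layout (N FFNs, one attention layer, N FFNs) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the DWConv token interaction and normalization are left out, and `toy_attention` is a hypothetical placeholder without learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ffn(x, w1, w2):
    """Residual FFN with ReLU: x + ReLU(x W1) W2."""
    return x + np.maximum(x @ w1, 0.0) @ w2

def toy_attention(x):
    """Placeholder self-attention without learned projections."""
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return weights @ x

def sandwich_block(x, pre_ffns, post_ffns, attention=toy_attention):
    for w1, w2 in pre_ffns:        # N FFN layers before the attention layer
        x = ffn(x, w1, w2)
    x = x + attention(x)           # the single attention layer, with a residual connection
    for w1, w2 in post_ffns:       # N FFN layers after the attention layer
        x = ffn(x, w1, w2)
    return x

rng = np.random.default_rng(0)
tokens, dim, hidden, n = 5, 8, 16, 2   # arbitrary sizes for illustration

def make_ffn():
    return (0.1 * rng.standard_normal((dim, hidden)),
            0.1 * rng.standard_normal((hidden, dim)))

x = rng.standard_normal((tokens, dim))
y = sandwich_block(x, [make_ffn() for _ in range(n)], [make_ffn() for _ in range(n)])
print(y.shape)  # (5, 8)
```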

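17 Cascaded Group Attention
• Split the channels and give each head its own split (the j-th split goes to the j-th head). This is inspired by group convolutions.
• Cascaded: each later head refines its features using the output of the preceding head.
Local and global relations can be captured jointly, which improves representational capacity.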
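The two ideas on this slide, channel splitting and cascading, can be sketched in NumPy. This is a simplified illustration under assumptions: weights are random, the value width equals the split width so the previous head's output can be added to the next split, and any extra layers used in the official code are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, w_q, w_k, w_v, w_p):
    """x: (tokens, dim). w_q, w_k: per-head (dim/h, d_qk). w_v: per-head (dim/h, dim/h). w_p: (dim, dim)."""
    h = len(w_q)
    splits = np.split(x, h, axis=-1)          # the j-th head only sees the j-th channel split
    outputs, prev = [], None
    for j in range(h):
        feat = splits[j] if prev is None else splits[j] + prev   # cascade: add the previous head's output
        q, k, v = feat @ w_q[j], feat @ w_k[j], feat @ w_v[j]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        prev = attn @ v
        outputs.append(prev)
    return np.concatenate(outputs, axis=-1) @ w_p                # concatenate heads, then project

rng = np.random.default_rng(0)
tokens, dim, heads, d_qk = 6, 8, 4, 2          # arbitrary sizes for illustration
d_head = dim // heads
x = rng.standard_normal((tokens, dim))
w_q = [rng.standard_normal((d_head, d_qk)) for _ in range(heads)]
w_k = [rng.standard_normal((d_head, d_qk)) for _ in range(heads)]
w_v = [rng.standard_normal((d_head, d_head)) for _ in range(heads)]
w_p = rng.standard_normal((dim, dim))
out = cascaded_group_attention(x, w_q, w_k, w_v, w_p)
print(out.shape)  # (6, 8)
```

Because each head projects only dim/h input channels instead of the full width, the Q, K, V projections are cheaper than in standard MHSA, while the cascade still lets later heads see information from earlier ones.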

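18 Other layout, implementation
• BatchNorm is used instead of LayerNorm (it can be folded into the preceding convolution or linear layer).
• ReLU is used as the activation function (it is faster than GELU or HardSwish and is supported on deployment platforms).
• Number of channels, blocks, and heads: the early stages, which have high resolution, use fewer blocks (the same idea as MobileNet V3 and LeViT).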
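Folding BatchNorm into the preceding layer works because, at inference time, BatchNorm is a fixed per-channel affine transform. A minimal NumPy sketch for a linear layer follows; the function name and the random values are hypothetical.

```python
import numpy as np

def fold_bn_into_linear(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BatchNorm into y = x @ w + b, with w of shape (in, out)."""
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    return w * scale, (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w, b = rng.standard_normal((3, 5)), rng.standard_normal(5)
gamma, beta = rng.standard_normal(5), rng.standard_normal(5)
mean, var = rng.standard_normal(5), rng.random(5) + 0.5   # running statistics (made up here)

# Reference: linear layer followed by BatchNorm using the running statistics.
reference = ((x @ w + b) - mean) / np.sqrt(var + 1e-5) * gamma + beta

# Fused: a single linear layer with the folded weights.
w_f, b_f = fold_bn_into_linear(w, b, gamma, beta, mean, var)
fused = x @ w_f + b_f
print(np.allclose(reference, fused))  # True
```

LayerNorm cannot be folded this way because its statistics are computed from each input at inference time.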

19 Experiments
ImageNet-1K
• PyTorch 1.11.0, Timm 0.5.4
Training
• 8 Nvidia V100 GPUs, trained from scratch for 300 epochs, batch size 2,048
• AdamW optimizer, cosine learning rate scheduler
Inference
• GPU: Nvidia V100
• CPU: Intel Xeon E5-2690 v4 @ 2.6-GHz
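As an illustration of what a cosine learning rate schedule does over the 300 epochs, here is a minimal sketch of standard cosine annealing. The base learning rate below is a placeholder rather than the paper's setting, and warm-up is not modeled.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine annealing from base_lr down to min_lr."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 1e-3                      # placeholder value for illustration
for epoch in (0, 75, 150, 225, 300):
    print(epoch, cosine_lr(epoch, 300, base_lr))
```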

20 Results of ImageNet

21 Comparison with the large-scale ViTs

22 Performance on mobile devices
Run on the A13 Bionic chip of an Apple iPhone 11 (using CoreML).
For EfficientViT-M4, the values in parentheses are from training for 1,000 epochs with distillation.

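23 Transfer Learning Results
Results for transfer learning to other datasets are also good.
On Stanford Cars, MobileNet achieves higher accuracy. A possible reason is that the subtle differences between classes lie in local details, which convolutions are well suited to capture.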

24 COCO Object Detection
High accuracy was also obtained on the object detection task on COCO val2017.

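25 Ablation Study
• At comparable speed, accuracy drops by 3%.
• Increasing the number of FFNs also lowers accuracy.
• Cascaded Group Attention also contributes to accuracy.
• Allocating parameters effectively raises the accuracy obtained at comparable speed.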

26 Ablation Study
A study of the MHSA dimensions.
