Paper explanation: EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Presentation explaining the paper "EfficientViT", presented by Microsoft Research and The Chinese University of Hong Kong.
EfficientViT is designed to deliver high recognition performance with high throughput on real devices.
• Paper: https://arxiv.org/abs/2305.07027
• Code: https://github.com/microsoft/Cream/tree/main/EfficientViT
• Venue: CVPR 2023
• Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
• Affiliations: The Chinese University of Hong Kong, Microsoft Research
Reason for choosing this paper:
• It looked like a good way to learn how a Vision Transformer should be structured so that it can run fast on edge devices and similar hardware.
Prevailing ViT designs inherit conventions from the NLP transformer [71], e.g., using an equivalent width for Q, K, V projections, increasing heads over stages, and setting the expansion ratio to 4 in the FFN.
Taylor structured pruning* is used to shrink the parameters by removing unimportant channels (the paper's analysis prunes Swin-T, among others).
* Importance Estimation for Neural Network Pruning: https://arxiv.org/abs/1906.10771
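As a rough illustration of how channel importance is estimated in first-order Taylor structured pruning, here is a minimal PyTorch sketch. It is my own simplification of the idea in Molchanov et al. (2019), not the slide's or the paper's code; the function name `taylor_channel_importance` and the layer access `model.some_conv` are hypothetical.

```python
import torch
import torch.nn as nn


def taylor_channel_importance(layer: nn.Conv2d) -> torch.Tensor:
    """First-order Taylor importance per output channel (sketch).

    One common variant scores a group of parameters (here: one output
    channel) by the squared sum of weight * gradient over that group.
    Assumes loss.backward() has already populated .grad.
    """
    w, g = layer.weight, layer.weight.grad          # shape: (out_ch, in_ch, kH, kW)
    scores = (w * g).sum(dim=(1, 2, 3)).pow(2)      # one score per output channel
    return scores


# Usage sketch (hypothetical model and data):
# loss = criterion(model(x), y)
# loss.backward()
# conv = model.some_conv
# keep_idx = taylor_channel_importance(conv).argsort(descending=True)[:num_keep]
```

Channels with the lowest scores are the "unimportant channels" that pruning removes.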
Propose a sandwich layout that employs fewer memory-bound self-attention layers and more memory-efficient FFN layers for channel communication.
• A single attention layer is sandwiched between N FFN layers (N before and N after).
• DWConv (Depthwise Convolution) is used for token interaction; a rough sketch follows below.
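To make the layout concrete, here is a simplified PyTorch sketch of a sandwich block. The class name `SandwichBlock`, the use of `nn.MultiheadAttention` in place of the paper's Cascaded Group Attention, and the norm-free residual structure are my own assumptions for illustration, not the EfficientViT implementation.

```python
import torch
import torch.nn as nn


class SandwichBlock(nn.Module):
    """Sketch of the sandwich layout: one self-attention layer between
    N FFN layers, with a depthwise conv (DWConv) before each FFN for
    token interaction. Details are simplified assumptions."""

    def __init__(self, dim: int, num_heads: int = 4, n_ffn: int = 2, ffn_ratio: int = 2):
        super().__init__()

        def ffn_group():
            return nn.ModuleList([
                nn.ModuleDict({
                    # depthwise conv: cheap local token interaction
                    "dwconv": nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
                    # 1x1-conv FFN: channel communication
                    "ffn": nn.Sequential(
                        nn.Conv2d(dim, dim * ffn_ratio, 1),
                        nn.ReLU(),
                        nn.Conv2d(dim * ffn_ratio, dim, 1),
                    ),
                })
                for _ in range(n_ffn)
            ])

        self.pre_ffns = ffn_group()    # N FFN layers before attention
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in attention
        self.post_ffns = ffn_group()   # N FFN layers after attention

    def _run_ffns(self, x, ffns):
        for blk in ffns:
            x = x + blk["dwconv"](x)   # residual token interaction
            x = x + blk["ffn"](x)      # residual channel mixing
        return x

    def forward(self, x):              # x: (B, C, H, W)
        x = self._run_ffns(x, self.pre_ffns)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # the single attention layer
        x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self._run_ffns(x, self.post_ffns)


# Usage sketch:
# block = SandwichBlock(dim=64)
# y = block(torch.randn(1, 64, 14, 14))   # output shape: (1, 64, 14, 14)
```

The point of the layout is that only one of the (memory-bound) attention layers appears per block, while the cheaper FFN + DWConv pairs handle most of the channel and local token mixing.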