Efficient Training of Visual Transformers with Small Datasets Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai COPENHAGEN – NEURIPS MEETUP
Transformers
• MULTI-HEAD ATTENTION: O(N²) complexity in the sequence length
• MLP: a simple fully connected network
• LAYER NORMALIZATION: to stabilize gradients
• GO DEEP L TIMES: stack multiple blocks
From Vaswani et al.: Attention Is All You Need
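The block structure above can be sketched in a few lines of NumPy. This is a toy, single-head, pre-norm encoder block (not the talk's actual model); all sizes and weight names are illustrative, and the N x N score matrix makes the O(N²) cost explicit.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features (stabilizes gradients).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single head for brevity; the (N, N) score matrix is the O(N^2) cost.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N)
    return scores @ v

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # Pre-norm residual block: x + Attn(LN(x)), then x + MLP(LN(x)).
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)
    x = x + np.maximum(0, layer_norm(x) @ W1) @ W2  # two-layer ReLU MLP
    return x

rng = np.random.default_rng(0)
N, d = 8, 16  # toy sizes: 8 tokens, 16-dim embeddings
x = rng.normal(size=(N, d))
weights = [rng.normal(size=s) * 0.1
           for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
for _ in range(3):  # "go deep L times": stack L = 3 blocks
    x = encoder_block(x, *weights)
print(x.shape)  # (8, 16)
```

Stacking the block L times leaves the token shape unchanged, which is what lets the same block be repeated freely.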
Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An (ImageNet) image is a sequence of pixels (224 x 224 x 3)
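The "16x16 words" idea reduces a 224 x 224 x 3 image to a short token sequence: split it into non-overlapping 16 x 16 patches and flatten each patch into a vector. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def patchify(img, p=16):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattening each patch into one token vector of length p*p*C.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))  # an ImageNet-sized image
tokens = patchify(img)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of 16*16*3 = 768-dim tokens
```

Instead of 224 x 224 = 50,176 pixel tokens, ViT attends over only 196 patch tokens, which keeps the quadratic attention cost manageable.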
Transformer in Vision: ViT (2020)
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT: the Good (Zhai et al.: Scaling Vision Transformers)
• ViT captures global relations in the image (global attention)
• Transformers are a general-purpose architecture
• The limit is now the computation, not the architecture
ViT: the Bad & Ugly
• Requires more computation than CNNs
• Vision Transformers are data hungry
[Figure: dataset scales. ImageNet-1K: 1.3M images (most of the Computer Vision / CNN community); ImageNet-21K: 14M images; JFT: 303M images (ViT). We focus here, on small datasets.]
Second-Generation Vision Transformers (VTs)
• Not tested against each other with the same pipeline (e.g. data augmentation)
• Not tested on small datasets
• Better than ResNets
• It is not clear which is the next Vision Transformer
-> We are going to compare and use second-generation VTs
How can we use Vision Transformers with small datasets?
• USE OUR NEW REGULARIZATION: it improved the performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT
• USE A 2nd-GENERATION VT: performance varies largely. CvT is very promising with small datasets!
• READ OUR PAPER FOR DETAILS