Slide 1

Slide 1 text

Efficient Training of Visual Transformers with Small Datasets
Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
COPENHAGEN – NEURIPS MEETUP

Slide 2

Slide 2 text

Transformers
• MULTI-HEAD ATTENTION – Θ(N²) complexity
• MLP – A simple fully connected network
• LAYER NORMALIZATION – To stabilize gradients
• GO DEEP L TIMES – Stack multiple blocks
From Vaswani et al.: Attention Is All You Need
[Block diagram: Sequential Input → Embeddings → L× {Norm → Multi-Head Attention → (+) → Norm → MLP → (+)}]
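To make the block structure concrete, here is a minimal PyTorch sketch of one encoder block and an L-times stack (a pre-norm variant; the dimensions, head count, and class names are illustrative choices, not the exact configuration from the talk):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: multi-head attention and an MLP, each wrapped with
    layer normalization and a residual connection (pre-norm variant)."""
    def __init__(self, dim=384, n_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # Θ(N²) in sequence length
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                        # a simple fully connected network
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                # x: (batch, N tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

# "Go deep L times": stack multiple blocks.
encoder = nn.Sequential(*[TransformerBlock() for _ in range(12)])
tokens = torch.randn(2, 197, 384)                        # e.g. 196 patch tokens + 1 class token
out = encoder(tokens)                                    # (2, 197, 384)
```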

Slide 3

Slide 3 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An (ImageNet) image is a sequence of pixels (224 × 224 × 3)
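Treated as a raw pixel sequence, a 224 × 224 image would yield roughly 50,000 tokens, which is impractical for Θ(N²) attention; ViT instead tokenizes the image into 16×16 patches. A minimal sketch of that patch embedding, with illustrative `patch_size` and `dim` values:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping 16x16 patches and project each patch
    to an embedding, turning the image into a short token sequence."""
    def __init__(self, patch_size=16, in_channels=3, dim=384):
        super().__init__()
        # A strided convolution is equivalent to "cut into patches + linear projection".
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                       # (batch, 3, 224, 224)
        x = self.proj(images)                        # (batch, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)          # (batch, 196, dim): 196 tokens instead of 50,176 pixels

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                  # torch.Size([2, 196, 384])
```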

Slide 4

Slide 4 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)

Slide 5

Slide 5 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)

Slide 6

Slide 6 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)

Slide 7

Slide 7 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)

Slide 8

Slide 8 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)

Slide 9

Slide 9 text

Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT (2020)
[Block diagram: Sequential Input → Embeddings → L× {Norm → Multi-Head Attention → (+) → Norm → MLP → (+)}]

Slide 10

Slide 10 text

ViT: the Good
Zhai et al.: "Scaling Vision Transformers"
• ViT captures global relations in the image (global attention)
• Transformers are a general-purpose architecture
• The limit is now the computation, not the architecture

Slide 11

Slide 11 text

ViT: the Bad & Ugly
• Requires more computation than CNNs
• Vision Transformers are data hungry

Slide 12

Slide 12 text

ViT: the Bad & Ugly
• Requires more computation than CNNs
• Vision Transformers are data hungry

Slide 13

Slide 13 text

ViT: the Bad & Ugly
• Requires more computation than CNNs
• Vision Transformers are data hungry
[Figure: dataset scales – ImageNet-1K: 1.3M images; ImageNet-21K: 14M images; JFT: 303M images – annotated with "ViT", "Most Computer Vision / CNN community", and "We focus here"]

Slide 14

Slide 14 text

How can we use Vision Transformers with small datasets?
1. REGULARIZE
2. SECOND-GENERATION VTs

Slide 15

Slide 15 text

1. Regularization technique

Slide 16

Slide 16 text

The regularization

Slide 17

Slide 17 text

The regularization
1. Sample two embeddings e_{i,j}, e_{i',j'} from the k×k grid
2. Compute the target translation offsets, e.g.: t_u = |i − i'| / k, t_v = |j − j'| / k
3. Dense relative localization loss: L_drloc = E[ ||(t_u, t_v)ᵀ − (d_u, d_v)ᵀ||₁ ]
4. Total loss: L_tot = L_ce + λ L_drloc
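A minimal PyTorch sketch of the dense relative localization loss described above. The small MLP that maps a concatenated pair of embeddings to the predicted offset (d_u, d_v), the embedding dimension, and the number of sampled pairs are illustrative assumptions; see the official code (linked on the last slide) for the exact implementation.

```python
import torch
import torch.nn as nn

class DenseRelativeLocLoss(nn.Module):
    """Sketch of the dense relative localization loss: sample pairs of embeddings
    from the k x k grid, predict their normalized 2D offset with a small MLP, and
    penalize the L1 distance to the true offset."""
    def __init__(self, dim=384, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(                       # predicts (d_u, d_v) from a pair of embeddings
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, grid, n_pairs=32):
        # grid: (batch, k, k, dim) -- the final k x k embeddings of the VT
        b, k, _, dim = grid.shape
        i  = torch.randint(0, k, (b, n_pairs)); j  = torch.randint(0, k, (b, n_pairs))
        ip = torch.randint(0, k, (b, n_pairs)); jp = torch.randint(0, k, (b, n_pairs))
        batch = torch.arange(b).unsqueeze(1)
        e1 = grid[batch, i, j]                          # (b, n_pairs, dim)
        e2 = grid[batch, ip, jp]
        target = torch.stack(((i - ip).abs(), (j - jp).abs()), dim=-1).float() / k  # (t_u, t_v)
        pred = self.mlp(torch.cat((e1, e2), dim=-1))    # (d_u, d_v)
        return (pred - target).abs().sum(-1).mean()     # L_drloc

loss_fn = DenseRelativeLocLoss()
grid = torch.randn(2, 7, 7, 384)                        # e.g. a 7 x 7 grid of final embeddings
l_drloc = loss_fn(grid)
# Total loss: L_tot = L_ce + lambda * L_drloc  (lambda is a weighting hyper-parameter)
```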

Slide 18

Slide 18 text

2. Second-generation VTs

Slide 19

Slide 19 text

Second-Generation Vision Transformers (VTs)
CvT (2021), Swin (2021), T2T (2021)

Slide 20

Slide 20 text

Second-Generation Vision Transformers (VTs)
• Not tested against each other with the same pipeline (e.g., data augmentation)
• Not tested on small datasets
• Better than ResNets
• Not clear which is the next Vision Transformer
→ We are going to compare and use second-generation VTs

Slide 21

Slide 21 text

Datasets and Models

Model        Params (M)
ResNet-50    25
Swin-T       29
T2T-ViT-14   22
CvT-13       20

Slide 22

Slide 22 text

Experiments

Slide 23

Slide 23 text

Training from scratch: ImageNet-100

Slide 24

Slide 24 text

Training from scratch: ImageNet-100

Slide 25

Slide 25 text

Training from scratch: ImageNet-100

Slide 26

Slide 26 text

Training from scratch: ImageNet-100

Slide 27

Slide 27 text

Training from scratch: smaller datasets

Slide 28

Slide 28 text

Training from scratch: smaller datasets

Slide 29

Slide 29 text

Training from scratch: smaller datasets

Slide 30

Slide 30 text

Training from scratch: smaller datasets

Slide 31

Slide 31 text

Fine-tuning from ImageNet-1K
Pre-train on ImageNet-1K → fine-tune on a smaller dataset

Slide 32

Slide 32 text

Downstream tasks
Pre-train on ImageNet-100 / 1K → freeze → downstream task
• OBJECT DETECTION
• SEMANTIC SEGMENTATION
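A minimal sketch of this freeze-then-train protocol, with a hypothetical stand-in backbone and a toy segmentation head (the actual detection and segmentation heads used in the experiments are more involved):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained VT backbone that outputs a feature map.
class DummyBackbone(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=4, stride=4)
    def forward(self, x):
        return self.stem(x)                            # (B, dim, H/4, W/4)

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters so only the task head is trained."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()                               # also fix norm/dropout behaviour

backbone = freeze(DummyBackbone())                     # would be loaded from ImageNet pre-training
head = nn.Conv2d(96, 21, kernel_size=1)                # e.g. a 21-class segmentation head
optimizer = torch.optim.AdamW(head.parameters())       # optimize the head only

x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    feats = backbone(x)                                # frozen features
logits = head(feats)                                   # (2, 21, 56, 56)
```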

Slide 33

Slide 33 text

What about ViT-B (86.4M params)?
• I just want to use ViT, just bigger!
• ViT-B is 4× bigger than any tested configuration

Slide 34

Slide 34 text

What about speed?

Slide 35

Slide 35 text

How can we use Vision Transformers with small datasets?
• USE OUR NEW REGULARIZATION: It improved the performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT.
• USE A 2nd-GENERATION VT: Performance varies widely. CvT is very promising with small datasets!
• READ OUR PAPER FOR DETAILS

Slide 36

Slide 36 text

Thank you!
Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
Paper: https://bit.ly/efficient-VTs
Code: https://bit.ly/efficient-VTs-code
Email: [email protected]
COPENHAGEN – NEURIPS MEETUP