Efficient Training of Visual Transformers with Small Datasets Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai COPENHAGEN – NEURIPS MEETUP
Transformers
• MULTI-HEAD ATTENTION: O(N²) complexity in the sequence length
• MLP: a simple fully connected network
• LAYER NORMALIZATION: to stabilize gradients
• GO DEEP L TIMES: stack multiple blocks
From Vaswani et al.: Attention Is All You Need
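The block structure above can be sketched in a few lines of NumPy. This is a toy, single-head, pre-norm encoder block (not the talk's actual model); all sizes and weight names are illustrative, and the N x N score matrix makes the O(N²) cost explicit.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features (stabilizes gradients).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single head for brevity; the (N, N) score matrix is the O(N^2) cost.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N)
    return scores @ v

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # Pre-norm residual block: x + Attn(LN(x)), then x + MLP(LN(x)).
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)
    x = x + np.maximum(0, layer_norm(x) @ W1) @ W2  # two-layer ReLU MLP
    return x

rng = np.random.default_rng(0)
N, d = 8, 16  # toy sizes: 8 tokens, 16-dim embeddings
x = rng.normal(size=(N, d))
weights = [rng.normal(size=s) * 0.1
           for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
for _ in range(3):  # "go deep L times": stack L = 3 blocks
    x = encoder_block(x, *weights)
print(x.shape)  # (8, 16)
```

Stacking the block L times leaves the token shape unchanged, which is what lets the same block be repeated freely.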
Transformer in Vision
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An (ImageNet) image is a sequence of pixels (224 x 224 x 3)
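The "16x16 words" idea reduces a 224 x 224 x 3 image to a short token sequence: split it into non-overlapping 16 x 16 patches and flatten each patch into a vector. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def patchify(img, p=16):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattening each patch into one token vector of length p*p*C.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two grid axes
    return patches.reshape(-1, p * p * C)

img = np.zeros((224, 224, 3))  # an ImageNet-sized image
tokens = patchify(img)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of 16*16*3 = 768-dim tokens
```

Instead of 224 x 224 = 50,176 pixel tokens, ViT attends over only 196 patch tokens, which keeps the quadratic attention cost manageable.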
Transformer in Vision: ViT (2020)
From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
ViT: the Good (Zhai et al.: Scaling Vision Transformers)
• ViT captures global relations in the image (global attention)
• Transformers are a general-purpose architecture
• The limit is now the computation, not the architecture
ViT: the Bad & Ugly
• Requires more computation than CNNs
• Vision Transformers are data hungry
[Figure: dataset scales. ImageNet-1K: 1.3M images (most of the Computer Vision / CNN community); ImageNet-21K: 14M images; JFT: 303M images (ViT). We focus here, on small datasets.]
Second-Generation Vision Transformers (VTs)
• Not tested against each other with the same pipeline (e.g. data augmentation)
• Not tested on small datasets
• Better than ResNets
• It is not clear which is the next Vision Transformer
-> We are going to compare and use second-generation VTs
How can we use Vision Transformers with small datasets?
• USE OUR NEW REGULARIZATION: it improved the performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT
• USE A 2nd-GENERATION VT: performance varies largely. CvT is very promising with small datasets!
• READ OUR PAPER FOR DETAILS