Efficient Training of Visual Transformers with Small Datasets

Marco De Nadai
December 24, 2021

Transcript

  1. Efficient Training of Visual Transformers with Small Datasets
     Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
     COPENHAGEN – NEURIPS MEETUP
  2. Transformers
     • MULTI-HEAD ATTENTION: Θ(N²) complexity in the number of tokens N
     • MLP: a simple fully connected network
     • LAYER NORMALIZATION: to stabilize gradients
     • GO DEEP L TIMES: stack multiple blocks
     [Figure: one Transformer block. Sequential Input → Embeddings → L× (Norm → Multi-Head Attention → add, Norm → MLP → add)]
     From Vaswani et al.: Attention Is All You Need
     (A minimal sketch of one block follows.)
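A minimal PyTorch sketch of the block described in slide 2, assuming a pre-norm ViT-style layout; module names and dimensions are illustrative, not the talk's code:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm block: Norm -> Multi-Head Attention -> add, then Norm -> MLP -> add."""

    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)    # layer normalization stabilizes gradients
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(         # a simple fully connected network
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                 # x: (batch, N tokens, dim)
        h = self.norm1(x)
        # every token attends to every other token: Θ(N²) in the sequence length
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

# "go deep L times": stack L blocks
encoder = nn.Sequential(*[TransformerBlock() for _ in range(12)])
out = encoder(torch.randn(2, 196, 384))   # (batch, tokens, dim) in, same shape out
```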
  3. Transformer in Vision
     An (ImageNet) image is a sequence of pixels (224 × 224 × 3)
     From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  4.–9. Transformer in Vision: ViT (2020)
     From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
     [Figure: the image is split into patches and fed as a token sequence to the Transformer. Sequential Input → Embeddings → L× (Norm → Multi-Head Attention → add, Norm → MLP → add)]
     (A minimal patch-embedding sketch follows.)
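A minimal sketch of the patch-embedding step that turns an image into the token sequence above ("an image is worth 16x16 words"); the Conv2d formulation and all dimensions are illustrative assumptions, not the talk's code:

```python
import torch
import torch.nn as nn

# A 224×224×3 ImageNet image becomes (224/16)² = 196 tokens, each the linear
# projection of one 16×16×3 patch.
class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, dim=384):
        super().__init__()
        # a conv with kernel = stride = patch size is exactly a per-patch linear layer
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (batch, 196, dim): a token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 384])
```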
  10. ViT: the Good
     • ViT captures global relations in the image (global attention)
     • Transformers are a general-purpose architecture
     • The limit is now the computation, not the architecture
     Zhai et al.: “Scaling Vision Transformers”
  11.–13. ViT: the Bad & Ugly
     • Require more computation than CNNs
     • Vision Transformers are data hungry
     [Figure: dataset scale. ImageNet-1K (1.3M images), where most of the computer-vision/CNN community works and where we focus; ImageNet-21K (14M images); JFT (303M images), used to train ViT]
  14. The regularization
     1. Sample two embeddings $e_{i,j}$, $e_{i',j'}$ from the $k \times k$ grid
     2. Compute the translation offset, e.g. $t_u = \frac{|i - i'|}{k}$, $t_v = \frac{|j - j'|}{k}$
     3. Dense relative localization loss, where $(d_u, d_v)$ is the offset predicted from the embedding pair by a small MLP head: $\mathcal{L}_{drloc} = \mathbb{E}\left[\left\| (t_u, t_v)^\top - (d_u, d_v)^\top \right\|_1\right]$
     4. Total loss: $\mathcal{L}_{tot} = \mathcal{L}_{ce} + \lambda \, \mathcal{L}_{drloc}$
     (A minimal sketch of this loss follows.)
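A minimal PyTorch sketch of the dense relative localization loss, assuming the final k×k grid of token embeddings is available as a tensor; the head architecture, sampling count, and all names are illustrative assumptions (the released code is linked on the last slide):

```python
import torch
import torch.nn as nn

def drloc_loss(grid, head, num_pairs=64):
    """Dense relative localization loss over a (batch, k, k, dim) grid of final embeddings."""
    b, k, _, dim = grid.shape
    # 1. sample two random grid positions (i, j) and (i', j') per pair
    i, j = torch.randint(0, k, (2, b, num_pairs))
    i2, j2 = torch.randint(0, k, (2, b, num_pairs))
    e1 = grid[torch.arange(b)[:, None], i, j]      # (batch, num_pairs, dim)
    e2 = grid[torch.arange(b)[:, None], i2, j2]
    # 2. target translation offsets, normalized by the grid size k
    t = torch.stack([(i - i2).abs(), (j - j2).abs()], dim=-1).float() / k
    # 3. predict (d_u, d_v) from the embedding pair and take the L1 error
    d = head(torch.cat([e1, e2], dim=-1))          # (batch, num_pairs, 2)
    return (t - d).abs().mean()

dim = 384
# illustrative localization head: concatenated pair of embeddings -> 2D offset
head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))
grid = torch.randn(4, 7, 7, dim)                   # e.g. k = 7
loss_drloc = drloc_loss(grid, head)
# 4. total loss: classification cross-entropy plus the weighted regularizer
# loss_tot = loss_ce + lam * loss_drloc
```

The targets come from the grid coordinates themselves, so no extra labels are needed and the term can be added to any VT's training loss.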
  15. Second-Generation Vision Transformers (VTs)
     • Not tested against each other with the same pipeline (e.g. data augmentation)
     • Not tested on small datasets
     • Better than ResNets
     • Not clear which Vision Transformer comes next
     → We are going to compare and use second-generation VTs
  16. Datasets and Models

     Model       Params (M)
     ResNet-50   25
     Swin-T      29
     T2T-ViT-14  22
     CvT-13      20
  17. Downstream tasks
     Pre-training on ImageNet-100 / ImageNet-1K → freeze → task
     • OBJECT DETECTION
     • SEMANTIC SEGMENTATION
     (A minimal sketch of the freeze step follows.)
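A minimal sketch of the "freeze the pre-trained backbone, train only the task head" pattern; a torchvision ResNet-50 stands in for the pre-trained VT, and the head is hypothetical, not the talk's detection/segmentation setup:

```python
import torch.nn as nn
from torchvision.models import resnet50

# pre-trained backbone (stands in here for the pre-trained VT); torchvision >= 0.13
backbone = resnet50(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False               # frozen: pre-trained features are reused as-is

# hypothetical downstream head; only its parameters receive gradients
backbone.fc = nn.Linear(2048, 20)         # e.g. a 20-class downstream task
trainable = [p for p in backbone.parameters() if p.requires_grad]
print(len(trainable))                     # 2: the head's weight and bias
```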
  18. What about ViT-B (86.4M params)?
     • “I just want to use ViT, just bigger!”
     • ViT-B is 4× bigger than any tested configuration
  19. How can we use Vision Transformers with small datasets?
     1. USE OUR NEW REGULARIZATION: improved the performance on all 11 datasets and all scenarios, sometimes dramatically (+45 points); it is simple and easily pluggable into any VT
     2. USE A 2nd-GENERATION VT: performance varies widely; CvT is very promising with small datasets!
     3. READ OUR PAPER FOR DETAILS
  20. Thank you!
     Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
     Paper: https://bit.ly/efficient-VTs
     Code: https://bit.ly/efficient-VTs-code
     Email: [email protected]
     COPENHAGEN – NEURIPS MEETUP