Efficient Training of Visual Transformers with Small Datasets

Marco De Nadai

December 24, 2021

Transcript

  1. Efficient Training of Visual Transformers with Small Datasets Yahui Liu,

    Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai COPENHAGEN – NEURIPS MEETUP
  2. Transformers • MULTI-HEAD ATTENTION O(N²) complexity in the sequence length • MLP A
     simple fully connected network • LAYER NORMALIZATION To stabilize gradients • GO DEEP L TIMES Stack multiple blocks 2 From Vaswani et al.: Attention Is All You Need [Diagram: Sequential Input → Embeddings → L× (Norm → Multi-Head Attention → + → Norm → MLP → +)]
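The quadratic cost of multi-head attention comes from scoring every token against every other token. A minimal single-head sketch in pure Python makes the N×N score matrix explicit (an illustration of the mechanism, not the paper's implementation):

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.
    Each query is scored against every key, so N input tokens produce
    an N x N score matrix -- the source of the O(N^2) cost."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # one row of the N x N score matrix
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        # softmax over the row (numerically stabilized)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # weighted sum of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

With a single token the output equals its value vector; with two identical scores the attention weights are uniform (0.5 each).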
  3. Transformer in Vision 3 From Dosovitskiy et al.: An Image
     is Worth 16x16 Words: Transformers for Image Recognition at Scale An (ImageNet) image is a sequence of pixels (224 x 224 x 3)
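As a concrete check of the numbers above (a toy helper, not part of the paper's code): ViT splits the image into non-overlapping 16×16 patches and flattens each patch into one token.

```python
def patch_sequence_shape(h, w, c, p):
    """Token count and per-token dimension when an h x w x c image is
    split into non-overlapping p x p patches, as in ViT."""
    assert h % p == 0 and w % p == 0, "image sides must be divisible by patch size"
    num_tokens = (h // p) * (w // p)   # grid of patches
    token_dim = p * p * c              # each patch flattened into one vector
    return num_tokens, token_dim

# A 224 x 224 x 3 ImageNet image with 16 x 16 patches:
print(patch_sequence_shape(224, 224, 3, 16))  # -> (196, 768)
```

So a 224×224 ImageNet image becomes a sequence of 196 tokens, each a 768-dimensional vector, before the learned linear embedding.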
  4. Transformer in Vision 4 From Dosovitskiy et al.: An Image
     is Worth 16x16 Words: Transformers for Image Recognition at Scale ViT (2020)
  9. Transformer in Vision 9 From Dosovitskiy et al.: An Image
     is Worth 16x16 Words: Transformers for Image Recognition at Scale [Diagram: Sequential Input → Embeddings → L× (Norm → Multi-Head Attention → + → Norm → MLP → +)] ViT (2020)
  10. 10 ViT: the Good Zhai et al. “Scaling Vision Transformers”
     • ViT captures global relations in the image (global attention) • Transformers are a general-purpose architecture • The limit is now the computation, not the architecture
  11. 11 ViT: the Bad & Ugly • Require more computation
     than CNNs • Vision Transformers are data hungry
  13. 13 ViT: the Bad & Ugly • Require more computation
     than CNNs • Vision Transformers are data hungry [Figure: dataset scale — ImageNet 1K (1.3M images; most of the computer-vision/CNN community, and our focus) · ImageNet 21K (14M images) · JFT (303M images; ViT)]
  14. How can we use Vision Transformers with small datasets?
      1 REGULARIZE 2 SECOND-GENERATION VTs
  15. Regularization technique 1

  16. 16 The regularization

  17. 17 The regularization 1. Sample two embeddings e_(x,y), e_(x',y')
      from the k×k grid 2. Compute the normalized translation offsets, e.g.: t_x = |x − x'| / k, t_y = |y − y'| / k 3. Dense relative localization loss: L_drloc = E[ ||(t_x, t_y) − (d_x, d_y)||_1 ], where (d_x, d_y) is the predicted offset 4. Total loss: L_tot = L_ce + λ · L_drloc
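The four steps above can be sketched in plain Python. This is a minimal illustration of the idea, not the authors' actual implementation: `predictor` stands in for the small MLP head that regresses the offset from a pair of embeddings, and here embeddings are just whatever vectors the backbone produced.

```python
import random

def drloc_loss(grid_embeddings, predictor, k, num_pairs=64):
    """Dense relative localization loss (sketch).

    grid_embeddings: flat list of k*k token embeddings (row-major grid)
    predictor:       callable (e1, e2) -> predicted offsets (d_x, d_y);
                     stands in for the small MLP head
    """
    total = 0.0
    for _ in range(num_pairs):
        # 1. sample two grid positions (x, y) and (x', y')
        x, y = random.randrange(k), random.randrange(k)
        x2, y2 = random.randrange(k), random.randrange(k)
        # 2. normalized target offsets t_x, t_y
        t_x, t_y = abs(x - x2) / k, abs(y - y2) / k
        # 3. L1 distance between target and predicted offsets
        d_x, d_y = predictor(grid_embeddings[x * k + y],
                             grid_embeddings[x2 * k + y2])
        total += abs(t_x - d_x) + abs(t_y - d_y)
    return total / num_pairs  # estimates E[ ||(t_x,t_y) - (d_x,d_y)||_1 ]
```

During training this term is added to the classification loss, L_tot = L_ce + λ · L_drloc, so the network is pushed to keep spatial information in its token embeddings.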
  18. Second-generation VTs 2

  19. 19 Second Generation Vision Transformers (VT) CvT (2021) Swin (2021)
     T2T (2021)
  20. 20 Second Generation Vision Transformers (VT) • Not tested against
     each other with the same pipeline (e.g. data augmentation) • Not tested on small datasets • Better than ResNets • Not clear which Vision Transformer comes next -> We are going to compare and use second-generation VTs
  21. 21 Datasets and Models
      Model       Params (M)
      ResNet-50   25
      Swin-T      29
      T2T-ViT-14  22
      CvT-13      20
  22. Experiments

  23. 23 Training from scratch: ImageNet-100

  27. 27 Training from scratch: smaller datasets
  31. 31 Fine-tuning ImageNet-1K Pre-training on ImageNet 1K -> fine-tune on
     a smaller dataset
  32. 32 Downstream tasks Pre-training on ImageNet 100 / 1K ->
     freeze -> Task OBJECT DETECTION SEMANTIC SEGMENTATION
  33. 33 What about ViT-B (86.4M params)? • I just want
     to use ViT, just bigger! • ViT-B is 4x bigger than any tested configuration
  34. 34 What about speed?

  35. How can we use Vision Transformers with small datasets? •
     USE OUR NEW REGULARIZATION Improves performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT • USE A 2nd GENERATION VT Performance varies widely. CvT is very promising with small datasets! • READ OUR PAPER FOR DETAILS 35
  36. Thank you! Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe,
     Bruno Lepri, and Marco De Nadai Paper: https://bit.ly/efficient-VTs Code: https://bit.ly/efficient-VTs-code Email: work@marcodena.it