
Efficient Training of Visual Transformers with Small Datasets

Marco De Nadai
December 24, 2021


Transcript

  1. Efficient Training of Visual
    Transformers with Small Datasets
    Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
    COPENHAGEN – NEURIPS MEETUP


  2. Transformers
    • MULTI-HEAD ATTENTION
    Θ(N²) complexity in the sequence length N
    • MLP
    A simple fully connected network
    • LAYER NORMALIZATION
    To stabilize gradients
    • GO DEEP L-TIMES
    Stack multiple blocks
    From Vaswani et al.: Attention Is All You Need
    [Figure: Transformer encoder block. Sequential Input → Embeddings, then L× (Norm → Multi-Head Attention → add, Norm → MLP → add)]
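    For reference, a minimal sketch of the pre-norm encoder block shown in the figure, assuming PyTorch; the class name EncoderBlock and the dimensions are illustrative assumptions, not code from the talk.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """Pre-norm Transformer encoder block: Norm -> Attention -> add, Norm -> MLP -> add."""

        def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, int(dim * mlp_ratio)),
                nn.GELU(),
                nn.Linear(int(dim * mlp_ratio), dim),
            )

        def forward(self, x):  # x: (batch, N tokens, dim)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention: Theta(N^2) in N
            x = x + self.mlp(self.norm2(x))
            return x

    # "Go deep L times": stack L identical blocks
    encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
    tokens = torch.randn(2, 197, 384)  # e.g. 196 patch tokens + 1 class token
    out = encoder(tokens)              # (2, 197, 384)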

  3. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    An (ImageNet) image, treated as a sequence of pixels, is 224 × 224 × 3 values: far too long a sequence for Θ(N²) attention.

  4.–8. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    ViT (2020)
    [Figure, built up step by step: the image is split into 16×16-pixel patches, each patch is linearly embedded, and the resulting token sequence is processed by a standard Transformer encoder]

  9. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    ViT (2020)
    [Figure: the same encoder block as before (Sequential Input → Embeddings, then L× (Norm → Multi-Head Attention → add, Norm → MLP → add)), applied to the patch-token sequence]
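    To make the "image as a token sequence" idea concrete, here is a minimal sketch of ViT-style patch embedding, assuming PyTorch; the class name PatchEmbed, the strided-convolution patchify trick, and the dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Turn a 224x224x3 image into a sequence of embedded 16x16 patches."""

        def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=384):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196 tokens
            # A strided convolution is equivalent to "split into patches + linear projection"
            self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

        def forward(self, x):                                # x: (B, 3, 224, 224)
            x = self.proj(x)                                 # (B, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)                 # (B, 196, dim)
            cls = self.cls_token.expand(x.shape[0], -1, -1)  # prepend the class token
            x = torch.cat([cls, x], dim=1)                   # (B, 197, dim)
            return x + self.pos_embed                        # add positional embeddings

    imgs = torch.randn(2, 3, 224, 224)
    tokens = PatchEmbed()(imgs)  # (2, 197, 384): ready for the encoder blocks above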

  10. ViT: the Good
    Zhai et al.: "Scaling Vision Transformers"
    • ViT captures global relations in the image (global attention)
    • Transformers are a general-purpose architecture
    • The limit is now the computation, not the architecture

  11.–12. ViT: the Bad & Ugly
    • Require more computation than CNNs
    • Vision Transformers are data hungry

  13. ViT: the Bad & Ugly
    • Require more computation than CNNs
    • Vision Transformers are data hungry
    [Figure: dataset scale. ImageNet 1K, 1.3M images: most computer vision / the CNN community, and where we focus here. ImageNet 21K, 14M images. JFT, 303M images: where ViT is trained.]

  14. How can we use Vision Transformers with small datasets?
    1. REGULARIZE
    2. SECOND-GENERATION VTs

  15. Regularization technique (1)

  16. The regularization

  17. The regularization
    1. Sample two embeddings e_(u,v) and e_(u',v') from the k×k grid of final token embeddings
    2. Compute the target translation offsets, e.g.:
       t_u = |u − u'| / k,   t_v = |v − v'| / k
    3. Dense relative localization loss (a small MLP predicts the offsets (d_u, d_v) from the pair of embeddings):
       L_drloc = E[ ‖(t_u, t_v) − (d_u, d_v)‖₁ ]
    4. Total loss:
       L_tot = L_ce + λ · L_drloc
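    A minimal sketch of this dense relative localization loss, assuming PyTorch. The function name drloc_loss, the random pair sampling, and the MLP head in the usage comment are illustrative assumptions, not the authors' released implementation (see the Code link on the last slide).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def drloc_loss(grid_embeds, mlp, num_pairs=64):
        """Dense relative localization loss (sketch).

        grid_embeds: (B, k, k, dim) final token embeddings on the k x k grid (class token excluded).
        mlp: predicts a 2D offset from the concatenation of two embeddings.
        """
        B, k, _, dim = grid_embeds.shape
        device = grid_embeds.device
        # 1. Sample two random grid positions (u, v) and (u', v') per pair
        u, v = torch.randint(0, k, (2, B, num_pairs), device=device)
        u2, v2 = torch.randint(0, k, (2, B, num_pairs), device=device)
        b = torch.arange(B, device=device).unsqueeze(1)
        e1 = grid_embeds[b, u, v]    # (B, num_pairs, dim)
        e2 = grid_embeds[b, u2, v2]  # (B, num_pairs, dim)
        # 2. Target offsets, normalised by the grid size k
        t = torch.stack([(u - u2).abs(), (v - v2).abs()], dim=-1).float() / k
        # 3. Predict the offsets and penalise the L1 error
        d = mlp(torch.cat([e1, e2], dim=-1))  # (B, num_pairs, 2)
        return F.l1_loss(d, t)

    # 4. Example usage (illustrative sizes): total loss = cross-entropy + lambda * drloc
    # mlp = nn.Sequential(nn.Linear(2 * 384, 512), nn.ReLU(), nn.Linear(512, 2))
    # loss = F.cross_entropy(logits, labels) + lam * drloc_loss(grid_embeds, mlp)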

  18. Second-generation VTs (2)

  19. Second-Generation Vision Transformers (VTs)
    CvT (2021)
    Swin (2021)
    T2T (2021)

  20. Second-Generation Vision Transformers (VTs)
    • Not tested against each other with the same pipeline (e.g. data augmentation)
    • Not tested on small datasets
    • Better than ResNets
    • Not clear which is the next Vision Transformer
    → We are going to compare and use second-generation VTs

  21. Datasets and Models
    Model        Params (M)
    ResNet-50    25
    Swin-T       29
    T2T-ViT-14   22
    CvT-13       20

  22. Experiments


  23.–26. Training from scratch on ImageNet-100

  27.–30. Training from scratch on smaller datasets

  31. Fine-tuning from ImageNet-1K
    Pre-training on ImageNet-1K → fine-tuning on a smaller dataset

  32. Downstream tasks
    Pre-training on ImageNet-100 / 1K → freeze the backbone → downstream task
    OBJECT DETECTION, SEMANTIC SEGMENTATION

  33. What about ViT-B (86.4M params)?
    • "I just want to use ViT, just bigger!"
    • ViT-B is about 4× bigger than any tested configuration

  34. What about speed?

  35. How can we use Vision Transformers with small datasets?
    • USE OUR NEW REGULARIZATION
    It improved performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT.
    • USE A 2nd-GENERATION VT
    Performance varies widely. CvT is very promising with small datasets!
    • READ OUR PAPER FOR DETAILS

  36. Thank you!
    Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
    Paper: https://bit.ly/efficient-VTs
    Code: https://bit.ly/efficient-VTs-code
    Email: [email protected]
    COPENHAGEN – NEURIPS MEETUP
