
Efficient Training of Visual Transformers with Small Datasets

Marco De Nadai
December 24, 2021


Transcript

  1. Efficient Training of Visual
    Transformers with Small Datasets
    Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
    COPENHAGEN – NEURIPS MEETUP


  2. Transformers
    • MULTI-HEAD ATTENTION
    Θ(N²) complexity in the number of tokens
    • MLP
    A simple fully connected network
    • LAYER NORMALIZATION
    To stabilize gradients
    • GO DEEP L-TIMES
    Stack multiple blocks
    From Vaswani et al.: Attention Is All You Need
    [Figure: one encoder block. Sequential Input → Embeddings → (Norm → Multi-Head Attention → +) → (Norm → MLP → +), repeated L times]
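    A minimal PyTorch sketch of one such block, for reference (pre-norm ordering as in ViT-style models; the class name and dimensions below are illustrative assumptions, not taken from the deck):

    ```python
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One encoder block: Norm -> Multi-Head Attention -> residual add,
        then Norm -> MLP -> residual add. The full model stacks L of these."""
        def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )

        def forward(self, x):  # x: (batch, tokens, dim)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # attention cost is quadratic in the number of tokens
            x = x + self.mlp(self.norm2(x))
            return x
    ```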


  3. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    An (ImageNet) image is a sequence of pixels (224 x 224 x 3)


  4. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    ViT (2020)



  9. Transformer in Vision
    From Dosovitskiy et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    [Figure: ViT. Sequential Input → Embeddings → (Norm → Multi-Head Attention → +) → (Norm → MLP → +), repeated L times]
    ViT (2020)
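    To make the picture concrete: ViT splits the 224 x 224 x 3 image into non-overlapping 16x16 patches (14 x 14 = 196 tokens), linearly embeds each patch, prepends a class token and adds positional embeddings before the stacked blocks. A minimal sketch (hyper-parameters and the class name are illustrative assumptions):

    ```python
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """224x224x3 image -> sequence of (224/16)^2 = 196 patch tokens plus [CLS]."""
        def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=384):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
            num_patches = (img_size // patch_size) ** 2
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        def forward(self, x):  # x: (B, 3, 224, 224)
            x = self.proj(x)                  # (B, dim, 14, 14), one embedding per patch
            x = x.flatten(2).transpose(1, 2)  # (B, 196, dim), flattened to a sequence
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1)    # (B, 197, dim)
            return x + self.pos_embed         # the "Sequential Input" to the L blocks
    ```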


  10. ViT: the Good
    Zhai et al. “Scaling Vision Transformers”
    • ViT captures global relations in the image (global attention)
    • Transformers are a general-use architecture
    • The limit is now computation, not the architecture


  11. ViT: the Bad & Ugly
    • Require more computation than CNNs
    • Vision Transformers are data hungry



  13. ViT: the Bad & Ugly
    • Require more computation than CNNs
    • Vision Transformers are data hungry
    • ImageNet-1K: 1.3M images (most computer vision and the CNN community work here; we focus here)
    • ImageNet-21K: 14M images
    • JFT: 303M images (used to pre-train ViT)


  14. How can we use Vision Transformers with small datasets?
    1. REGULARIZE
    2. SECOND-GENERATION VTs


  15. Regularization technique


  16. The regularization


  17. The regularization
    1. Sample two embeddings $e_{i,j}$ and $e_{i',j'}$ from the $k \times k$ grid
    2. Compute the normalized target translation offsets, e.g.:
       $t_x = \frac{|i - i'|}{k}$, $t_y = \frac{|j - j'|}{k}$
    3. Dense relative localization loss, where $(d_x, d_y)$ is the offset predicted from the pair of embeddings:
       $\mathcal{L}_{drloc} = \mathbb{E}\left[ \left\| (t_x, t_y)^\top - (d_x, d_y)^\top \right\|_1 \right]$
    4. Total loss:
       $\mathcal{L}_{tot} = \mathcal{L}_{ce} + \lambda \, \mathcal{L}_{drloc}$
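    A minimal PyTorch sketch of this regularization (an illustration, not the authors' released code; the head architecture, the number of sampled pairs and the grid layout are assumptions, see the official repository linked on the last slide for the real implementation):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseRelativeLocHead(nn.Module):
        """Small MLP that predicts the normalized 2D offset between two embeddings."""
        def __init__(self, dim, hidden=512):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * dim, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, 2),
            )

        def forward(self, e1, e2):
            return self.mlp(torch.cat([e1, e2], dim=-1))  # predicted (d_x, d_y)

    def drloc_loss(grid, head, num_pairs=64):
        """grid: (B, k, k, D) final VT embeddings arranged on a k x k spatial grid."""
        B, k, _, _ = grid.shape
        # 1. Sample random pairs of grid positions (i, j) and (i', j') per image.
        i  = torch.randint(0, k, (B, num_pairs), device=grid.device)
        j  = torch.randint(0, k, (B, num_pairs), device=grid.device)
        ip = torch.randint(0, k, (B, num_pairs), device=grid.device)
        jp = torch.randint(0, k, (B, num_pairs), device=grid.device)
        b  = torch.arange(B, device=grid.device).unsqueeze(1).expand(B, num_pairs)
        e1, e2 = grid[b, i, j], grid[b, ip, jp]           # (B, num_pairs, D)
        # 2. Normalized target offsets t_x = |i - i'| / k, t_y = |j - j'| / k.
        target = torch.stack([(i - ip).abs(), (j - jp).abs()], dim=-1).float() / k
        # 3. L1 loss between predicted and target offsets.
        return F.l1_loss(head(e1, e2), target)

    # 4. Total loss: loss = ce_loss + lam * drloc_loss(grid, head),
    #    where lam is the weighting hyper-parameter from the slide.
    ```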


  18. Second-generation VTs


  19. Second Generation Vision Transformers (VT)
    CvT (2021)
    Swin (2021)
    T2T (2021)


  20. Second Generation Vision Transformers (VT)
    • Not tested against each other with the same pipeline (e.g. data augmentation)
    • Not tested on small datasets
    • Better than ResNets
    • Not clear which Vision Transformer comes next
    -> We are going to compare and use second-generation VTs


  21. Datasets and Models
    Model        Params (M)
    ResNet-50    25
    Swin-T       29
    T2T-ViT-14   22
    CvT-13       20


  22. Training from scratch: ImageNet-100



  26. Training from scratch: smaller datasets



  30. Fine-tuning ImageNet-1K
    Pre-training on ImageNet-1K -> fine-tune on a smaller dataset
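    A sketch of this transfer recipe (illustrative only: torchvision's ResNet-50 stands in for the backbones actually benchmarked in the deck, and the target dataset is an assumption):

    ```python
    import torch
    import torch.nn as nn
    import torchvision

    num_classes = 100  # e.g. a small target dataset such as CIFAR-100 (assumption)

    # Load a backbone pre-trained on ImageNet-1K, swap the classification head,
    # then fine-tune the whole network on the small dataset.
    model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    # ... standard supervised training loop on the target dataset follows ...
    ```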


  31. Downstream tasks
    Pre-training on ImageNet-100 / 1K -> freeze -> task
    OBJECT DETECTION, SEMANTIC SEGMENTATION


  32. What about ViT-B (86.4M params)?
    • I just want to use ViT, just bigger!
    • ViT-B is 4x bigger than any tested configuration


  33. What about speed?


  34. How can we use Vision
    Transformers with Small datasets?
    • USE OUR NEW REGULARIZATION
    It improved performance on all 11 datasets and in all scenarios, sometimes dramatically (+45 points). It is simple and easily pluggable into any VT
    • USE A 2nd GENERATION VT
    Performance varies widely. CvT is very promising with small datasets!
    • READ OUR PAPER FOR DETAILS


  35. Thank you!
    Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, and Marco De Nadai
    Paper: https://bit.ly/efficient-VTs
    Code: https://bit.ly/efficient-VTs-code
    Email: [email protected]
    COPENHAGEN – NEURIPS MEETUP
