
Paper Review of ConvNeXt - A new CNN architecture

Senthilkumar Gopal

July 29, 2022

  1. A ConvNet for the 2020s
    paper review
    https://arxiv.org/abs/2201.03545

  2. A bit of History
    1. Transformers displaced RNNs as the backbone architecture for NLP.
    2. Introduction of Vision Transformers (ViT)
    a. Except for the initial “patchify” layer, no image-specific inductive biases
    3. Larger model and dataset sizes → significant improvements on classification tasks.
    4. But ViT struggled with general computer vision tasks
    a. Which depend on a sliding-window, fully convolutional paradigm.
    5. ViT’s global attention design → quadratic complexity with respect to the input size.

  3. A bit of History
    1. ResNet
    2. Vision Transformer (ViT)
    a. Faces difficulties when applied to general computer vision tasks such as object
    detection and semantic segmentation.
    3. Swin Transformers
    a. Hierarchical design
    b. Reintroduced several ConvNet priors

  4. Note on Swin Transformers

  5. Swin Transformers
    ● Expensive sliding-window technique
    ● Cyclic shifting
    ● Replicates standard convolution techniques to produce convolution-like properties.

  6. Swin Transformers
    ● Primary difference - Attention vs. Convolution
    ● Architecture choices
    ● Training methods
    ● Belief #1: Multi-head Self-Attention (MSA) is all you need
    ● Belief #2: Transformer processing is superior & more scalable

  7. Why do Convolutions work for Images?
    1. The “sliding window” strategy is intrinsic to visual processing
    2. Particularly for high-resolution images.
    3. ConvNets have built-in inductive biases
    4. Well suited to a wide variety of computer vision applications.
    5. Translation equivariance
    Source: https://chriswolfvision.medium.com/what-is-translation-equivariance-and-why-do-we-use-convolutions-to-get-it-6f18139d4c59

  8. Swin Transformers - use Convolution ideas
    ● “Sliding window” strategy (e.g. attention within local windows), similar to ConvNets.
    ● Swin Transformer shows that:
    ○ Transformers can be adopted as a generic vision backbone
    ○ Used for a range of computer vision tasks
    But the essence of convolution is not becoming irrelevant; rather, it remains much desired and
    has never faded.

  9. Driving Factors
    Motivation
    1. Effectiveness is often credited to the inherent superiority of Transformers
    2. This ignores the inherent inductive biases of convolutions evident in Swin Transformers
    a. Rather, it depends on data augmentation and training techniques
    Experiment
    1. Gradually “modernize” a standard ResNet towards a vision Transformer
    2. Identify the confounding variables when comparing network performance
    3. Bridge the gap between the pre-ViT and post-ViT eras for ConvNets
    4. Test the limits of what a pure ConvNet can achieve
    How do design decisions in Swin Transformers impact ConvNeXt performance?

  10. Spoiler Alert *
    ● Constructed entirely from standard ConvNet modules,
    ● ConvNeXts compete favorably with Transformers
    ● 87.8% ImageNet top-1 accuracy
    ● Outperforming Swin Transformers on COCO detection and ADE20K segmentation

  11. Baseline
    Two model sizes in terms of FLOPs
    ● ResNet-50 / Swin-T regime - ~4.5 GFLOPs
    ● ResNet-200 / Swin-B regime - ~15.0 GFLOPs
    ● For simplicity, use the results with the ResNet-50 / Swin-T complexity models.
    ● Use similar training techniques as vision Transformers.
    ● Evaluated on ImageNet-1K for accuracy.
    Note
    Network complexity is closely correlated with the final performance, so the FLOPs are roughly controlled
    over the course of the exploration, though at intermediate steps the FLOPs might be higher or lower than
    the reference models.

  12. Improved Baseline - Hyperparameters
    Not used: the “ResNet strikes back” modern training recipe for ResNet-50.
    Training recipe is close to DeiT (Data-efficient Image Transformers) and Swin Transformer:
    1. 300 epochs, up from the original 90 epochs for ResNets
    2. AdamW optimizer (see the sketch below)
    3. Data augmentation techniques - Mixup, CutMix, RandAugment, Random Erasing
    4. Regularization schemes including Stochastic Depth and Label Smoothing
    5. Other hyper-parameters
    ** A significant portion of the performance difference between traditional ConvNets
    and vision Transformers may be due to the training techniques.
    Ross Wightman, Hugo Touvron, and Hervé Jégou. ResNet strikes back: An improved training procedure in timm. arXiv:2110.00476, 2021.
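    A minimal, hedged sketch of the optimizer setup mentioned above. The learning rate and weight decay values here are placeholders for illustration, not the paper's exact settings; see the paper and timm configs for the real recipe.

    ```python
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)  # stand-in model, only to make the snippet runnable

    # AdamW with hypothetical hyperparameter values (illustration only).
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
    ```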

  13. Macro design - Stage Ratio
    ● Swin-T follows the same multi-stage principle, but with a different stage compute ratio of 1:1:3:1.
    ● For larger Swin Transformers, the ratio is 1:1:9:1.
    ● Change the number of blocks per stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), as in the sketch below.
    ● Aligns the FLOPs with Swin-T.
    ● * A more optimal design is likely to exist.
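    A minimal configuration sketch of this stage-depth change (illustrative only; the variable names are my own, not from the paper or the official code):

    ```python
    # Number of blocks per stage in a four-stage backbone.
    resnet50_stage_depths = (3, 4, 6, 3)    # original ResNet-50
    convnext_t_stage_depths = (3, 3, 9, 3)  # adjusted to roughly match Swin-T's 1:1:3:1 compute ratio
    ```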

  14. Macro design - Patchify - Stem Cell Structure
    1. The stem cell design determines how the input image is processed.
    2. Common stem cell
    a. Aggressively down-samples the input image to the feature map size.
    b. ResNet - 7x7 convolution with stride 2 + max pool → 4x downsampling
    3. In Vision Transformers, a more aggressive “patchify” strategy is used
    a. Stem cell - large kernel size (14 or 16) and non-overlapping convolution.
    4. Swin Transformer uses a similar “patchify” layer - patch size of 4, to suit its multi-stage design
    5. Replace the ResNet-style stem cell with a patchify layer - a 4x4, stride-4 convolutional layer (sketched below).
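    A hedged sketch contrasting the two stems (simplified: the real ResNet stem also includes BatchNorm and ReLU, and ConvNeXt follows its patchify conv with a LayerNorm):

    ```python
    import torch
    import torch.nn as nn

    # ResNet-style stem: 7x7 stride-2 conv + 3x3 stride-2 max pool -> 4x downsampling.
    resnet_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

    # Patchify stem: one non-overlapping 4x4, stride-4 conv -> 4x downsampling in a single step.
    patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

    x = torch.randn(1, 3, 224, 224)
    print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
    print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
    ```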

  15. Macro - ResNeXtify - Depthwise Convolution
    What is ResNeXtify?
    1. Grouped convolution - convolutional filters are separated into different groups.
    2. Guiding principle: “use more groups, expand width”.
    Depthwise convolution
    1. Special case of grouped convolution - the number of groups equals the number of input channels.
    2. Each convolution kernel processes one channel.
    3. Only mixes information in the spatial dimension - similar to the self-attention mechanism (see the sketch below).
    Side effects
    1. Depthwise convolution reduces the network FLOPs and, as expected, the accuracy.
    2. Increase the network width to the same number of channels as Swin-T (64 to 96) [as proposed in ResNeXt].
    3. This brings the network performance to 80.5% with increased FLOPs (5.3G).
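    A minimal sketch (my own illustration, not the official implementation) contrasting depthwise and pointwise convolutions:

    ```python
    import torch.nn as nn

    dim = 96  # channel width, increased from ResNet-50's 64 to match Swin-T

    # Depthwise conv: groups == number of channels, so each filter sees exactly one channel
    # and only mixes information spatially (per channel).
    depthwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    # Channel mixing is then handled separately by cheap 1x1 ("pointwise") convolutions.
    pointwise = nn.Conv2d(dim, dim, kernel_size=1)
    ```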

  16. Macro - ResNeXtify - Depthwise Convolution
    Depthwise convolution is similar to the weighted sum operation in self-attention, which operates on
    a per-channel basis, i.e., only mixing information in the spatial dimension.

  17. Macro - ResNeXtify - Depthwise Convolution
    ResNeXt employs grouped convolution for the 3x3 conv layer in a bottleneck block. As this
    significantly reduces the FLOPs, the network width is expanded to compensate for the capacity loss.

  18. Inverted Bottleneck
    ● Every Transformer block uses an inverted bottleneck, comparable to MobileNetV2 **
    ● The MLP block’s hidden dimension is 4x the input dimension (see the sketch below).
    ● The FLOPs of the depthwise convolutional layer increase after this reversal
    ● But the FLOPs of the overall network fall, due to the downsampling residual blocks
    ● Offsetting the increase from the depthwise convolution
    ● ResNet-200 / Swin-B regime - 81.9% to 82.6%, with reduced FLOPs.
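    A hedged sketch of the inverted bottleneck (hidden width 4x the block width); normalization layers are omitted here for brevity:

    ```python
    import torch.nn as nn

    dim = 96

    # Inverted bottleneck: expand 96 -> 384, then project 384 -> 96
    # (wide in the middle, narrow at the ends - the opposite of a ResNet bottleneck).
    inverted_bottleneck = nn.Sequential(
        nn.Conv2d(dim, 4 * dim, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(4 * dim, dim, kernel_size=1),
    )
    ```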

  19. Inverted Bottleneck

  20. Inverted Bottleneck

  21. Large Kernel Sizes
    Vision Transformers
    1. Non-local self-attention
    2. Enables each layer to have a global receptive field.
    ConvNets
    1. Once used large kernel sizes
    2. Gold standard (popularized by VGGNet) - stacking small (3x3) conv layers
    3. Efficient hardware implementations on modern GPUs
    Swin Transformers
    1. Reintroduced the local window to the self-attention block
    2. Window size is at least 7x7, significantly larger than the 3x3 kernels of ResNe(X)t.

  22. Increased kernel size
    1. Experimented with kernel sizes from 3 to 11
    2. Saturation point at 7x7.
    3. The network’s FLOPs stay roughly the same.
    Moving up the depthwise conv layer for large kernels
    ● Move the depthwise conv layer up in the block (see the sketch below)
    ● Transformers: the MSA block is placed prior to the MLP layers.
    ● A natural design choice, given the inverted bottleneck block
    ● Complex/inefficient modules (MSA, large-kernel conv) operate on fewer channels
    ● Efficient, dense 1x1 layers do the heavy lifting.
    ● Moving the layer up first reduces the FLOPs to 4.1G, with accuracy temporarily dropping to 79.9% *
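    A hedged sketch of the reordered block with a 7x7 depthwise kernel (normalization and activation placement simplified; not the official code):

    ```python
    import torch.nn as nn

    dim = 96

    # Large-kernel depthwise conv first (analogous to MSA before the MLP in a Transformer);
    # the dense 1x1 layers then handle channel mixing on the wider hidden dimension.
    block = nn.Sequential(
        nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise, per-channel
        nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # dense 1x1 expansion
        nn.ReLU(),
        nn.Conv2d(4 * dim, dim, kernel_size=1),                     # dense 1x1 projection
    )
    ```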

  23. Micro Design
    ● At the layer level
    ● Specific choices of activation functions and normalization layers

  24. Replacing ReLU with GELU
    The Gaussian Error Linear Unit (GELU), a smoother variant of ReLU, is utilized in the most
    advanced Transformers, e.g. BERT, GPT-2, and ViTs.
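    A quick illustrative comparison of the two activations (my own example, not from the slides):

    ```python
    import torch
    import torch.nn as nn

    x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(nn.ReLU()(x))  # hard cutoff: everything below zero becomes exactly zero
    print(nn.GELU()(x))  # smooth curve: small negative inputs map to small negative outputs
    ```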

  25. Fewer activation functions

  26. Micro Design
    Transformers have fewer activation functions.
    ● Transformer block
    ○ Key/query/value linear embedding layers
    ○ Projection layer
    ○ Linear layers in an MLP
    ○ Only one activation function, in the MLP block.
    In comparison, ResNet appends an activation function to each convolutional layer, including the 1x1 convs.
    ● Eliminate all GELU layers from the residual block except for one between the two 1x1 layers, replicating
    the style of a Transformer block.
    ● Accuracy 81.3%, matching the performance of Swin-T *

  27. Fewer normalization layers
    ● Transformer blocks also have fewer normalization layers
    ● Remove two BatchNorm (BN) layers, leaving only one BN layer before the 1x1 conv layers.
    ● This is even fewer normalization layers per block than in Transformers
    Adding one additional BN layer at the beginning of the block
    does not improve the performance.

  28. Substituting BN with LN
    ● Alternative normalization techniques exist
    ● BN has remained the preferred option in most vision tasks
    ● The simpler Layer Normalization (LN) is used in Transformers
    ● Good performance across different application scenarios
    ● Directly substituting LN for BN in the original ResNet results in suboptimal performance
    ● But with all the modifications in network architecture and training techniques, ConvNeXt
    performance improves to 81.5%
    ● One LayerNorm is used as the normalization in each residual block (see the sketch below).
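    Putting the micro-design changes together, here is a minimal sketch of a ConvNeXt-style block with one 7x7 depthwise conv, one LayerNorm, and one GELU. This is my own simplification of the public implementation linked at the end of the deck; it omits details such as layer scale and stochastic depth.

    ```python
    import torch
    import torch.nn as nn

    class ConvNeXtBlockSketch(nn.Module):
        """7x7 depthwise conv -> LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project -> residual."""
        def __init__(self, dim: int = 96):
            super().__init__()
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
            self.norm = nn.LayerNorm(dim)           # the only normalization layer in the block
            self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv expressed as a Linear layer
            self.act = nn.GELU()                    # the only activation in the block
            self.pwconv2 = nn.Linear(4 * dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            shortcut = x
            x = self.dwconv(x)
            x = x.permute(0, 2, 3, 1)               # (N, C, H, W) -> (N, H, W, C) for LN / Linear
            x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
            x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
            return shortcut + x
    ```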

  29. Fewer normalization layers

  30. Separate downsampling layers
    1. ResNet - spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv
    with stride 2 (and a 1x1 conv with stride 2 at the shortcut connection).
    2. In Swin Transformers, a separate downsampling layer is added between stages.
    3. Using a similar strategy - a 2x2 conv layer with stride 2 for spatial downsampling → leads to diverged training.
    4. Adding normalization layers wherever the spatial resolution is changed helps stabilize training.
    5. Several LN layers are also used in Swin Transformers
    a. One before each downsampling layer
    b. One after the stem
    c. One after the final global average pooling.
    6. Improves the accuracy to 82.0%, significantly exceeding Swin-T’s 81.3%.
    7. ConvNeXt uses separate downsampling layers (sketched below).
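    A hedged sketch of such a downsampling layer between stages (a LayerNorm followed by a 2x2, stride-2 conv). The class name and the channels-last permute are my own simplification, not the official code:

    ```python
    import torch
    import torch.nn as nn

    class DownsampleSketch(nn.Module):
        """LayerNorm followed by a 2x2, stride-2 conv: halves the spatial resolution
        and typically widens the channels between stages (e.g. 96 -> 192)."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(in_dim)
            self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = x.permute(0, 2, 3, 1)   # channels-last so nn.LayerNorm normalizes over channels
            x = self.norm(x)
            x = x.permute(0, 3, 1, 2)   # back to channels-first for the conv
            return self.reduce(x)
    ```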

  31. Speed
    ● ConvNeXts achieve inference throughput comparable to or better than Swin Transformers
    ● On A100 GPUs, ConvNeXt’s advantage becomes significantly greater, sometimes up to 49% faster.
    ● Could be practically more efficient models on modern hardware.

  32. Closing remarks
    ● ConvNeXt, a pure ConvNet model, can perform as well as a hierarchical vision
    Transformer
    ● On image classification, object detection, instance and semantic segmentation tasks.
    ● ConvNeXt may be more suited for certain tasks, while Transformers may be more flexible
    for others.
    ● Multi-modal learning → a cross-attention module may be preferable for modeling feature
    interactions across many modalities.
    ● Transformers - more flexible when used for tasks requiring discretized, sparse, or
    structured outputs.
    ● The architecture choice should meet the needs of the task at hand while striving for simplicity.

  33. Points to ponder
    1. None of the design options discussed thus far is novel; they have all been
    researched separately, but not collectively.
    2. Are other architectures modernizable?
    3. Twitter is already ablaze about the baseline used for ViT :)

  34. https://github.com/Raghvender1205/ConvNeXt/blob/master/PyTorch/models/convnext.py
