Slide 45
Closing remarks
● ConvNeXt, a pure ConvNet model, can perform as well as a hierarchical vision
Transformer on image classification, object detection, and instance and semantic
segmentation tasks.
● ConvNeXt may be better suited for certain tasks, while Transformers may be more flexible
for others.
● Multi-modal learning → a cross-attention module may be preferable for modeling feature
interactions across multiple modalities.
● Transformers - more flexible when used for tasks requiring discretized, sparse, or
structured outputs.
● Architecture choice should meet the needs of the task at hand while striving for simplicity.
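The cross-attention module mentioned above can be illustrated with a minimal sketch of scaled dot-product attention, where queries from one modality attend over features from another. All names, shapes, and the NumPy implementation are illustrative, not the specific module from any paper discussed here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # queries: (n_q, d) features from one modality (e.g. text tokens)
    # context: (n_c, d) features from another modality (e.g. image patches)
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_q, n_c) similarity
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context                    # mix context features per query

# Example: 4 text tokens attending over 9 image patches, dim 16.
text = np.random.randn(4, 16)
patches = np.random.randn(9, 16)
fused = cross_attention(text, patches)          # shape (4, 16)
```

Each output row is a convex combination of the other modality's features, which is why cross-attention is a natural fit for modeling interactions across modalities.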