NFNet: High-Performance Large-Scale Image Recognition Without Normalization
This is a talk about SAM, image augmentation, the evolution of ResNet-like architectures (from ResNet to NFNet), NAS (neural architecture search), and other techniques that help build modern SOTA models in computer vision and object detection.
Bad parts
● Batch normalization is computationally expensive (extra compute and memory overhead)
● Batch normalization breaks the assumption of independence between training examples in a minibatch
● Introduces extra hyper-parameters that need further fine-tuning
● Causes subtle implementation errors, especially in distributed training
● Requires distinct “training” and “inference” modes in frameworks (see the sketch below for this and the data-independence point)
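A minimal sketch of the data-independence and training/inference-mode points (PyTorch is assumed here purely for illustration; the talk itself is framework-agnostic):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(8, 4)       # a batch of 8 samples with 4 features each
other = torch.randn(8, 4)   # a different batch that shares only the first sample
other[0] = x[0]

bn.train()                  # "training" mode: normalize with the current batch statistics
print(torch.allclose(bn(x)[0], bn(other)[0]))   # False: the same sample gets a different output

bn.eval()                   # "inference" mode: normalize with running statistics instead
print(torch.allclose(bn(x)[0], bn(other)[0]))   # True: the output no longer depends on the batch
```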
Good parts
● Batch normalization downscales the residual branch, biasing deep residual networks towards the identity at initialization (see the variance sketch below)
● Batch normalization eliminates mean-shift (in ReLU networks)
● Batch normalization has a regularizing effect
● Batch normalization allows efficient large-batch training
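The first point can be made concrete with a toy variance calculation. This is a rough NumPy sketch with random linear layers standing in for real residual blocks (an illustration of the effect, not the paper's derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 256, 20, 1024

def branch(h):
    # a variance-preserving random linear layer standing in for the residual function f
    w = rng.normal(0, 1 / np.sqrt(width), (width, width))
    return h @ w

def batchnorm(h):
    # stand-in for batch normalization placed at the start of the residual branch
    return (h - h.mean(0)) / (h.std(0) + 1e-5)

x = rng.normal(0, 1, (batch, width))
for use_bn in (False, True):
    h = x.copy()
    for _ in range(depth):
        h = h + (branch(batchnorm(h)) if use_bn else branch(h))
    print(f"BN on the branch: {use_bn}, final activation variance ~ {h.var():.1f}")
# Without BN the variance roughly doubles per block (exponential growth), so the
# residual branches dominate; with BN each branch adds only ~unit variance, the
# variance grows linearly with depth, and deep blocks stay close to the identity.
```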
Not enough Top-1 accuracy! Even with Adaptive Gradient Clipping (sketched below), the modified residual branches, and the modified convolutions, normalizer-free networks still could not surpass the accuracy of EfficientNets, which motivated the architecture changes described next.
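For reference, a minimal PyTorch sketch of Adaptive Gradient Clipping as described in the paper: each gradient is rescaled unit-wise so that ||G_i|| / max(||W_i||, eps) never exceeds a threshold λ. Function and argument names here are illustrative rather than the authors' API, and the paper additionally skips clipping for the final linear layer.

```python
import torch

def unitwise_norm(t: torch.Tensor) -> torch.Tensor:
    # norm per output unit: whole tensor for biases/gains, per-row for linear
    # weights, per-output-channel for conv weights
    if t.ndim <= 1:
        return t.norm()
    return t.norm(dim=tuple(range(1, t.ndim)), keepdim=True)

@torch.no_grad()
def adaptive_grad_clip(parameters, clip_factor: float = 0.01, eps: float = 1e-3):
    """Rescale gradients so that ||G_i|| <= clip_factor * max(||W_i||, eps) unit-wise."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p).clamp(min=eps)
        g_norm = unitwise_norm(p.grad)
        max_norm = clip_factor * w_norm
        scale = max_norm / g_norm.clamp(min=1e-6)
        p.grad.mul_(torch.where(g_norm > max_norm, scale, torch.ones_like(scale)))

# usage inside a training step (model/optimizer are placeholders):
#   loss.backward()
#   adaptive_grad_clip(model.parameters(), clip_factor=0.01)
#   optimizer.step()
```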
Architecture optimization for improved accuracy and training speed
● SE-ResNeXt-D as the baseline
● Fixed group width (a parameter specific to the ResNeXt architecture)
● Depth scaling pattern was changed from a very specific one to a very simple one
● Width pattern was changed too (see the configuration sketch after this list)
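The resulting pattern fits in a few lines. The numbers below are the stage depths, widths, and group width reported in the paper for the NFNet-F family; the helper function name is made up for illustration.

```python
BASE_DEPTHS = (1, 2, 6, 3)             # stage depths of NFNet-F0
STAGE_WIDTHS = (256, 512, 1536, 1536)  # the simplified width pattern, fixed across variants
GROUP_WIDTH = 128                      # fixed group width of the grouped 3x3 convolutions

def nfnet_stage_config(variant: int):
    """Return (stage depths, stage widths) for NFNet-F{variant}."""
    scale = variant + 1                # deeper variants simply multiply the F0 depths
    return tuple(d * scale for d in BASE_DEPTHS), STAGE_WIDTHS

for v in range(3):
    print(f"NFNet-F{v}:", nfnet_stage_config(v))
# NFNet-F0: ((1, 2, 6, 3), (256, 512, 1536, 1536))
# NFNet-F1: ((2, 4, 12, 6), (256, 512, 1536, 1536))
# NFNet-F2: ((3, 6, 18, 9), (256, 512, 1536, 1536))
```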
Implementation
● Official JAX implementation from the authors
● Colab to play with
● PyTorch implementation with pretrained weights
● Yet another PyTorch implementation
● Very good PyTorch implementation (clear code)
● Adaptive Gradient Clipping example
● Broken Keras example (could still be a good entry point)
● Raw TF implementation
Summary
What batch normalization provides → what NFNets replace it with:
1. Downscales the residual branch → NF strategy (h_{i+1} = h_i + α·f_i(h_i/β_i))
2. Enables large-batch training → Adaptive Gradient Clipping
3. Implicit regularization → explicit regularization (dropout and stochastic depth)
4. Prevents mean-shift → Scaled Weight Standardization (sketched below)
● NFNets were the new SOTA on ImageNet for a few months
● NF-ResNets match or beat their batch-normalized counterparts (NFResNet >= BNResNet)
● NFNets are much faster to train than EfficientNets at comparable accuracy (NFNet training >>> EfficientNet training): NFNet-F1 matches EfficientNet-B7 while training up to 8.7× faster
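For completeness, a minimal PyTorch sketch of the two remaining replacement ingredients named above: Scaled Weight Standardization (which takes over mean-shift prevention from BN) and the normalizer-free residual form h_{i+1} = h_i + α·f_i(h_i/β_i). Class and argument names are illustrative, not the authors' implementation; real implementations typically also add a learnable per-channel gain.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledStdConv2d(nn.Conv2d):
    """Conv2d with Scaled Weight Standardization: weights are standardized over the
    fan-in and rescaled by a nonlinearity-dependent constant gamma."""
    def __init__(self, *args, gamma: float = math.sqrt(2 / (1 - 1 / math.pi)), **kwargs):
        super().__init__(*args, **kwargs)
        self.gamma = gamma  # value shown is the gamma for ReLU; other activations use other constants

    def forward(self, x):
        w = self.weight
        fan_in = w[0].numel()
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
        w = self.gamma * (w - mean) / torch.sqrt(var * fan_in + 1e-8)  # gamma * (W - mu) / (sigma * sqrt(fan_in))
        return F.conv2d(x, w, self.bias, self.stride, self.padding, self.dilation, self.groups)

def nf_residual(h, branch, alpha: float = 0.2, beta: float = 1.0):
    """Normalizer-free residual step: the branch input is divided by beta (the expected
    std of h at this block, known analytically at initialization) and its output scaled by alpha."""
    return h + alpha * branch(h / beta)
```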