NFNet: High-Performance Large-Scale Image Recognition Without Normalization

This talk covers SAM, image augmentation, the evolution of ResNet-like architectures (from ResNet to NFNet), NAS (neural architecture search), and other techniques that help build modern SOTA models in Computer Vision and Object Detection.

Alexey Zinoviev

May 12, 2021

Transcript

  1. NFNet: High-Performance
    Large-Scale Image Recognition
    Without Normalization
    Alexey Zinoviev, JetBrains

  2. Bio
    ● Java & Kotlin developer
    ● Distributed ML enthusiast
    ● Apache Ignite PMC
    ● TensorFlow Contributor
    ● ML engineer at JetBrains
    ● Happy father and husband
    ● https://github.com/zaleslaw
    ● https://twitter.com/zaleslaw

  3. NFNets: top-1 accuracy vs training latency

  4. Evolution of Image Recognition models

  5. The BatchNorm + Skip Connections era
    NFNet-F4+

  6. Top-1 Accuracy 2021

  7. Architecture innovations: 2015-2019

  8. Residual Block

  9. Bottleneck Residual Block

  10. Bottleneck Residual Block

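As an aside, the residual and bottleneck blocks above boil down to a few lines of code. Below is a minimal PyTorch sketch of a bottleneck residual block with illustrative channel sizes; BatchNorm and the exact activation placement are omitted, so it is not the precise block from the slides.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of a bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand + skip."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):
        # Skip connection: the block computes x + F(x).
        return torch.relu(x + self.branch(x))

x = torch.randn(2, 64, 32, 32)
print(BottleneckBlock(64, 16)(x).shape)  # torch.Size([2, 64, 32, 32])
```
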
  11. Depthwise Separable Convolution

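A depthwise separable convolution factorizes a regular convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, which cuts parameters and FLOPs. A minimal PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, kernel: int = 3) -> nn.Module:
    """Depthwise conv (groups == in_channels) followed by a pointwise 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                             # pointwise
    )

x = torch.randn(1, 32, 56, 56)
print(depthwise_separable(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```
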
  12. Inception Module

  13. Multi-residual block: ResNeXt

  14. Shared source skip connections

  15. Dense blocks in DenseNet (DenseNet != MLP)

  16. AutoML Era

  17. A new principle in the AutoML era
    Give us your computational resource budget and
    we will scale our baseline model for you

  18. EfficientNet: scale

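EfficientNet scales depth, width and input resolution together with a single budget exponent phi, using the coefficients alpha=1.2, beta=1.1, gamma=1.15 reported in the EfficientNet paper (chosen so that alpha * beta^2 * gamma^2 is roughly 2, i.e. each +1 in phi roughly doubles the FLOPs). A rough sketch with made-up baseline numbers:

```python
def compound_scale(base_depth, base_width, base_res, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling: depth *= alpha**phi, width *= beta**phi, resolution *= gamma**phi."""
    depth = round(base_depth * alpha ** phi)   # number of layers
    width = round(base_width * beta ** phi)    # base channel count
    res = round(base_res * gamma ** phi)       # input image resolution
    return depth, width, res

for phi in range(4):
    print(phi, compound_scale(base_depth=18, base_width=32, base_res=224, phi=phi))
```
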
  19. 3D Pareto front: FLOPs, number of parameters, and Top-1 accuracy

  20. HPO task

  21. Bilevel optimization problem

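The HPO/NAS setting on slides 20-22 is commonly written as a bilevel problem: the outer level picks the hyperparameters or architecture alpha on validation data, while the inner level trains the weights w on training data. A standard formulation (notation is mine):

```latex
\min_{\alpha}\; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) \in \arg\min_{w}\; \mathcal{L}_{\mathrm{train}}(w, \alpha)
```
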
  22. NAS (neural architecture search)

  23. Search Space skeleton

  24. Controller implementation (RNN, isn’t it?)

  25. NASNet cells designed by AutoML algorithm

  26. Evolutionary algorithms

  27. Evolutionary AutoML

  28. Evolution vs RL vs Random Search

  29. AmoebaNet: pinnacle of evolution

  30. (image-only slide)

  31. Batch Normalization

  32. Internal Covariate Shift

  33. Batch Norm algorithm

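For reference, the training-time BatchNorm computation is just per-feature normalization over the current mini-batch followed by a learnable scale and shift; the running statistics used at inference are omitted in this small NumPy sketch:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then scale/shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(8, 4)
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```
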
  34. Bad parts
    ● Batch normalization is computationally expensive
    ● Batch normalization breaks the independence between training examples in a mini-batch
    ● It introduces extra hyper-parameters that need further fine-tuning
    ● It causes many subtle implementation errors in distributed training
    ● It requires separate “training” and “inference” modes in frameworks

  35. The philosophy of this paper
    Identify the origin of BatchNorm’s benefits and replicate these benefits in
    BatchNorm-free Neural Networks

  36. Good parts
    ● Batch normalization downscales the residual branch
    ● Batch normalization eliminates mean-shift (in ReLU networks)
    ● Batch normalization has a regularizing effect
    ● Batch normalization allows efficient large-batch training

  37. Early Batch Norm-free architectures

  38. Residual Branch Downscaling effect: SkipInit
    Batch Normalization Biases Deep Residual Networks Towards Shallow Paths

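SkipInit recovers the downscaling effect without BatchNorm by multiplying the residual branch by a single learnable scalar initialized to zero, so every block starts out as the identity. A minimal PyTorch sketch (the branch itself is an arbitrary placeholder):

```python
import torch
import torch.nn as nn

class SkipInitBlock(nn.Module):
    """Residual block with a learnable scalar on the residual branch (SkipInit idea)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(                      # placeholder residual branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # alpha starts at 0, so at initialization the block is the identity,
        # biasing the deep network towards shallow, well-behaved paths.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.branch(x)
```
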
  39. Removing Mean Shift: Scaled Weight Standardization
    Characterizing signal propagation to close the performance gap in
    unnormalized ResNets

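Scaled Weight Standardization fights the mean shift by standardizing each convolution's weights over their fan-in and rescaling them with a nonlinearity-dependent gain (about 1.71 for ReLU in the NF-ResNet paper). The sketch below is a simplified PyTorch version and only approximates the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weights are (approximately) Scaled-Weight-Standardized on the fly."""
    def __init__(self, *args, gain: float = 1.7139, eps: float = 1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        self.gain = gain   # nonlinearity-dependent constant (default here assumes ReLU)
        self.eps = eps

    def forward(self, x):
        w = self.weight
        fan_in = w[0].numel()                              # in_channels/groups * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)         # per output channel
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # W_hat = gain * (W - mean) / (std * sqrt(fan_in))
        w = self.gain * (w - mean) / torch.sqrt(var * fan_in + self.eps)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)
```
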
  40. NFNet improvements

  41. Improved SkipInit

  42. Modified Scaled Weight Standardization

  43. Regularization: Dropout
    (meme slide: Dropout / CNNs / BatchNorm / me)

  44. Regularization: Stochastic Depth

  45. Regularization: Stochastic Depth

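Stochastic depth is a block-level Dropout: during training a whole residual branch is skipped with some probability, while at inference the branch is kept and scaled by its survival probability. A minimal sketch of the original per-batch formulation:

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose branch is randomly dropped during training."""
    def __init__(self, branch: nn.Module, survival_prob: float = 0.9):
        super().__init__()
        self.branch = branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)
            return x                                  # branch skipped for this batch
        # Inference: keep the branch, but scale it by the survival probability.
        return x + self.survival_prob * self.branch(x)
```
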
  46. How to train on larger batches?

  47. Gradient Clipping

  48. Intuition about Gradient Clipping
    Parameter updates should be small relative to the magnitude of the weight

  49. Adaptive Gradient Clipping

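Adaptive Gradient Clipping implements the intuition from slide 48 directly: rescale a gradient whenever its norm exceeds a fraction lambda of the norm of the corresponding weights. The sketch below clips per parameter tensor for brevity; the NFNet paper clips unit-wise (per output row of each weight matrix), and the clipping threshold 0.01 is just an illustrative default:

```python
import torch

def adaptive_gradient_clip_(parameters, clipping: float = 0.01, eps: float = 1e-3):
    """Simplified, in-place AGC: enforce ||grad|| <= clipping * max(||weight||, eps)."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp_(min=eps)    # guard against zero-init weights
        g_norm = p.grad.detach().norm()
        max_norm = clipping * w_norm
        if g_norm > max_norm:
            p.grad.detach().mul_(max_norm / (g_norm + 1e-6))

# Usage in a training loop (hypothetical model/optimizer):
#   loss.backward()
#   adaptive_gradient_clip_(model.parameters(), clipping=0.01)
#   optimizer.step()
```
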
  50. Not enough Top-1 accuracy!
    Even with Adaptive Gradient Clipping and the modified residual branches and
    convolutions, normalizer-free networks still could not surpass the accuracy of
    EfficientNets.

  51. New SOTA model family

  52. Architecture optimization for improved accuracy and
    training speed
    ● SE-ResNeXt-D as the baseline
    ● Fixed group width (specific to the ResNeXt architecture)
    ● The depth-scaling pattern was changed (from very specific to very simple)
    ● The width pattern was changed as well

  53. Bottleneck block design

  54. (image-only slide)

  55. (image-only slide)

  56. The whole NFNet family

  57. Implementation
    ● JAX implementation from the authors
    ● A Colab notebook to play with
    ● PyTorch implementation with pretrained weights
    ● Yet another PyTorch implementation
    ● A very good PyTorch implementation (clear code)
    ● Adaptive Gradient Clipping example
    ● A broken Keras example (could still be a good entry point)
    ● Raw TF implementation

  58. SAM: Sharpness Aware Minimization

  59. SAM: Sharpness Aware Minimization

  60. SAM: Two backprops and approximate grads

  61. Accelerating Sharpness-Aware Minimization
    The idea is to run the extra SAM step on only 20% of the batch.

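SAM needs two backward passes per step: it first climbs to the worst-case point w + rho * g / ||g|| in a small neighbourhood, then applies the gradient computed there to the original weights. A simplified PyTorch training-step sketch (model, loss_fn, optimizer and the batch are assumed to exist; the official implementations handle more details):

```python
import torch

def sam_step(model, loss_fn, optimizer, x, y, rho: float = 0.05):
    # First pass: gradient of the loss at the current weights w.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [None if p.grad is None else p.grad.detach().clone()
             for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    # Climb: w <- w + rho * g / ||g|| (the locally "sharpest" direction).
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = torch.zeros_like(p) if g is None else rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # Second pass: gradient at w + eps, then undo the climb and take the step.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()
    return loss.item()
```
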
  62. Modern Augmentation

  63. Different augmentations applied to Baseline

  64. RandAugment

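RandAugment reduces the augmentation search space to two numbers: N transformations are sampled per image from a fixed pool and all are applied at a single global magnitude M. A toy PIL-based sketch with a reduced op pool and made-up magnitude mappings:

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Toy op pool; the global magnitude m (0..30 here) is mapped into each op's own range.
OPS = [
    lambda img, m: ImageOps.autocontrast(img),
    lambda img, m: ImageOps.equalize(img),
    lambda img, m: ImageOps.posterize(img, max(1, 8 - int(m / 4))),
    lambda img, m: ImageOps.solarize(img, 256 - int(m * 8)),
    lambda img, m: ImageEnhance.Contrast(img).enhance(1.0 + m / 30.0),
    lambda img, m: ImageEnhance.Sharpness(img).enhance(1.0 + m / 30.0),
    lambda img, m: img.rotate(m),
]

def rand_augment(img: Image.Image, n: int = 2, m: int = 9) -> Image.Image:
    """Apply n randomly chosen ops, all at the same global magnitude m."""
    for op in random.choices(OPS, k=n):
        img = op(img, m)
    return img
```
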
  65. CutMix, MixUp, Cutout

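Of the three, MixUp is the easiest to show in code: each image and its one-hot label are blended with a partner using a Beta-distributed coefficient; Cutout instead erases a random patch, and CutMix pastes a patch from the partner image and mixes the labels by area. A minimal batch-level MixUp sketch:

```python
import torch

def mixup(x, y_onehot, alpha: float = 0.2):
    """Blend each sample with a randomly permuted partner: x' = lam*x + (1-lam)*x[perm]."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 3, 32, 32)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
xm, ym = mixup(x, y)
```
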
  66. Which model is the best?

  67. (image-only slide)

  68. BN vs NF for different depths and resolutions

  69-73. Summary
    BatchNorm benefit → normalizer-free replacement:
    1. Downscales the residual branch → NF strategy
    2. Enables large-batch training → Adaptive gradient clipping
    3. Implicit regularization → Explicit regularization
    4. Prevents mean-shift → Scaled Weight Standardization
    ● NFNets were the new SOTA for a few months
    ● NF-ResNets match or beat BN-ResNets (NFResNet >= BNResNet)
    ● NFNet training is much faster than EfficientNet training

  74. (image-only slide)