NFNet: High-Performance Large-Scale Image Recognition Without Normalization

This is a talk about SAM, image augmentation, the evolution of ResNet-like architectures (from ResNet to NFNet), NAS (neural architecture search), and other techniques that help build modern SOTA models in Computer Vision and Object Detection.

Alexey Zinoviev

May 12, 2021

Transcript

  1. Bio
     • Java & Kotlin developer
     • Distributed ML enthusiast
     • Apache Ignite PMC
     • TensorFlow Contributor
     • ML engineer at JetBrains
     • Happy father and husband
     • https://github.com/zaleslaw
     • https://twitter.com/zaleslaw

  2. New principle in the AutoML era: give us your computational resource limit and we scale our baseline model for you.

  3. Bad parts
     • Batch normalization is expensive
     • Batch normalization breaks the assumption of data independence
     • Introduces a lot of extra hyper-parameters that need further fine-tuning
     • Causes a lot of implementation errors in distributed training
     • Requires a specific “training” and “inference” mode in frameworks (see the sketch after this list)

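A minimal sketch of the last point, assuming PyTorch (the deck links several framework implementations; any of them shows the same behaviour): the very same BatchNorm layer produces different outputs depending on whether the model is in training or inference mode, because training mode normalizes with per-batch statistics while inference mode uses the accumulated running statistics.

    import torch
    import torch.nn as nn

    # One BatchNorm layer, one batch with a non-trivial mean and std.
    bn = nn.BatchNorm1d(num_features=4)
    x = torch.randn(8, 4) * 3 + 1

    bn.train()
    y_train = bn(x)   # normalized with the statistics of this particular batch

    bn.eval()
    y_eval = bn(x)    # normalized with the stored running mean/variance

    # The two modes generally disagree, which is exactly the special
    # "training" vs "inference" switch the slide complains about.
    print(torch.allclose(y_train, y_eval))  # usually False
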
  4. The philosophy of this paper: identify the origin of BatchNorm’s benefits and replicate these benefits in BatchNorm-free neural networks.

  5. Good parts
     • Batch normalization downscales the residual branch (see the sketch after this list for how normalizer-free networks replicate this)
     • Batch normalization eliminates mean-shift (in ReLU networks)
     • Batch normalization has a regularizing effect
     • Batch normalization allows efficient large-batch training

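For context, the normalizer-free strategy (the "NF-strategy" from the summary slides) replicates the first benefit analytically: residual blocks are rewritten as h_{i+1} = h_i + alpha * f_i(h_i / beta_i), where alpha is a small fixed scalar and beta_i is the predicted standard deviation of the block input, so the signal variance grows in a slow, known way instead of being reset by BatchNorm. A toy sketch under these assumptions (the branch f here is just a random variance-preserving linear map, not a real block):

    import torch

    def nf_residual(h, f, alpha=0.2, beta=1.0):
        # Normalizer-free residual update: h + alpha * f(h / beta).
        # alpha bounds how fast the variance grows; beta is the predicted
        # standard deviation of the incoming signal.
        return h + alpha * f(h / beta)

    h = torch.randn(4096, 256)
    expected_var = 1.0
    for _ in range(4):
        w = torch.randn(256, 256) / 256 ** 0.5                  # variance-preserving branch
        h = nf_residual(h, f=lambda x: x @ w, alpha=0.2, beta=expected_var ** 0.5)
        expected_var += 0.2 ** 2                                 # Var grows by roughly alpha^2 per block
        print(round(h.var().item(), 2), round(expected_var, 2))  # measured vs predicted variance
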
  6. Not enough Top-1 accuracy! Even with Adaptive Gradient Clipping (sketched after this slide) and the modified residual branch and convolutions, normalizer-free networks still could not surpass the accuracies of EfficientNet.

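Adaptive Gradient Clipping (AGC) clips gradients unit-wise based on the ratio of gradient norm to parameter norm: if ||G_i|| / max(||W_i||, eps) exceeds a threshold lambda, the gradient of that unit is rescaled down to lambda * max(||W_i||, eps) / ||G_i||. A rough PyTorch-style sketch assuming per-output-unit norms (the official code is in JAX; per the paper, the final classifier layer is excluded from clipping):

    import torch

    def adaptive_gradient_clipping(parameters, clip_factor=0.01, eps=1e-3):
        # Unit-wise AGC: rescale G_i when ||G_i|| / max(||W_i||, eps) > clip_factor.
        for p in parameters:
            if p.grad is None:
                continue
            # One norm per output unit (dim 0); 1-D params get a single norm.
            dims = tuple(range(1, p.dim())) if p.dim() > 1 else (0,)
            w_norm = p.detach().norm(dim=dims, keepdim=True).clamp_min(eps)
            g_norm = p.grad.detach().norm(dim=dims, keepdim=True).clamp_min(1e-6)
            max_norm = clip_factor * w_norm
            # Shrink only the units whose gradients are too large relative to
            # their weights; leave the rest untouched.
            scale = torch.where(g_norm > max_norm, max_norm / g_norm,
                                torch.ones_like(g_norm))
            p.grad.mul_(scale)

    # Usage: between the backward pass and the optimizer step, e.g.
    #   loss.backward()
    #   adaptive_gradient_clipping(model.parameters(), clip_factor=0.01)
    #   optimizer.step()

Here clip_factor plays the role of lambda; the paper tunes it per batch size.
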
  7. Architecture optimization for improved accuracy and training speed
     • SE-ResNeXt-D as a baseline
     • Fixed group width (specific to the ResNeXt architecture)
     • Depth scaling pattern was changed (from very specific to very simple; see the sketch after this list)
     • Width pattern was changed too

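As far as I remember the paper, the simplified scaling looks roughly like this; the concrete numbers below are reproduced from memory and should be checked against the paper or the official code before relying on them:

    # Sketch of the NFNet family scaling rule (numbers from memory, verify
    # against the paper / official repository).
    BASE_DEPTHS = [1, 2, 6, 3]             # stage depths of the smallest variant, NFNet-F0
    STAGE_WIDTHS = [256, 512, 1536, 1536]  # width pattern, shared by all variants
    GROUP_WIDTH = 128                      # fixed ResNeXt group width

    def nfnet_stage_depths(variant: int) -> list[int]:
        # NFNet-F<variant>: the F0 depth pattern is simply multiplied by (variant + 1).
        return [d * (variant + 1) for d in BASE_DEPTHS]

    print(nfnet_stage_depths(0))  # [1, 2, 6, 3]
    print(nfnet_stage_depths(1))  # [2, 4, 12, 6]
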
  8. Implementation
     • On JAX, from the authors
     • Colab to play with
     • PyTorch with weights
     • Yet another PyTorch
     • Very good PyTorch (clear code)
     • Adaptive Gradient Clipping example
     • Broken Keras example (could be a good entry point)
     • Raw TF implementation

  9. Summary (what BatchNorm provides → how NFNets replace it)
     1. Downscales residual branch → NF-strategy
     2. Enables large batch training → Adaptive gradient clipping

  10. Summary (continued)
      1. Downscales residual branch → NF-strategy
      2. Enables large batch training → Adaptive gradient clipping
      3. Implicit regularization → Explicit regularization

  11. Summary (continued)
      1. Downscales residual branch → NF-strategy
      2. Enables large batch training → Adaptive gradient clipping
      3. Implicit regularization → Explicit regularization
      4. Prevents mean-shift → Scaled Weight Standardization

  12. Summary
      1. Downscales residual branch → NF-strategy
      2. Enables large batch training → Adaptive gradient clipping
      3. Implicit regularization → Explicit regularization
      4. Prevents mean-shift → Scaled Weight Standardization (sketched below)
      • NFNets were the new SOTA for a few months
      • NFResNet >= BNResNet
      • NFNet training >>> EfficientNet training

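Scaled Weight Standardization replaces BatchNorm's mean-shift prevention by re-parameterizing the weights themselves: the fan-in of every output unit is standardized to zero mean, scaled by 1/sqrt(fan-in), and multiplied by a fixed nonlinearity-dependent gain (roughly 1.714 for ReLU). A minimal sketch, assuming a PyTorch-style conv weight of shape (out_channels, in_channels, kh, kw); an illustration, not the authors' implementation:

    import torch

    def scaled_weight_standardization(weight, gain=1.7139, eps=1e-4):
        # W_hat = gain * (W - mean) / (std * sqrt(fan_in)), with mean/std taken
        # over the fan-in of each output unit. The gain compensates for the
        # variance lost in the following ReLU (sqrt(2 / (1 - 1/pi)) ~= 1.7139).
        out_channels = weight.shape[0]
        fan_in = weight[0].numel()
        w = weight.reshape(out_channels, -1)
        mean = w.mean(dim=1, keepdim=True)
        var = w.var(dim=1, keepdim=True)
        w_hat = gain * (w - mean) / (var * fan_in + eps).sqrt()
        return w_hat.reshape(weight.shape)

    # The standardized weight is used in place of the raw weight inside the
    # convolution, e.g.
    #   torch.nn.functional.conv2d(x, scaled_weight_standardization(conv.weight), conv.bias)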