NFNet: High-Performance Large-Scale Image Recognition Without Normalization

This talk covers SAM, image augmentation, the evolution of ResNet-like architectures (from ResNet to NFNet), NAS (neural architecture search), and other techniques that help build modern SOTA models in Computer Vision and Object Detection.

Alexey Zinoviev

May 12, 2021

Transcript

  1. NFNet: High-Performance
    Large-Scale Image Recognition
    Without Normalization
    Alexey Zinoviev, JetBrains

  2. Bio
    ● Java & Kotlin developer
    ● Distributed ML enthusiast
    ● Apache Ignite PMC
    ● TensorFlow Contributor
    ● ML engineer at JetBrains
    ● Happy father and husband
    ● https://github.com/zaleslaw
    ● https://twitter.com/zaleslaw

  3. NFNets: top-1 accuracy vs training latency

  4. Evolution of Image Recognition models

  5. The BatchNorm + Skip Connections era
    NFNet-F4+

  6. Top-1 Accuracy 2021

  7. Architecture innovations: 2015-2019

  8. Residual Block

  9. Bottleneck Residual Block

  10. Bottleneck Residual Block

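As an aside, the residual and bottleneck blocks above boil down to a few lines of code. Below is a minimal PyTorch sketch of a bottleneck residual block with illustrative channel sizes; BatchNorm and the exact activation placement are omitted, so it is not the precise block from the slides.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of a bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand + skip."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):
        # Skip connection: the block computes x + F(x).
        return torch.relu(x + self.branch(x))

x = torch.randn(2, 64, 32, 32)
print(BottleneckBlock(64, 16)(x).shape)  # torch.Size([2, 64, 32, 32])
```
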
  11. Depthwise Separable Convolution

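A depthwise separable convolution factorizes a regular convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, which cuts parameters and FLOPs. A minimal PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, kernel: int = 3) -> nn.Module:
    """Depthwise conv (groups == in_channels) followed by a pointwise 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                             # pointwise
    )

x = torch.randn(1, 32, 56, 56)
print(depthwise_separable(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```
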
  12. Inception Module

  13. Multi-residual block: ResNeXt

  14. Shared source skip connections

  15. Dense blocks in DenseNet (DenseNet != MLP)

  16. AutoML Era

  17. A new principle in the AutoML era
    Give us your computational resource budget and
    we will scale our baseline model for you

  18. EfficientNet: scale

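EfficientNet scales depth, width and input resolution together with a single budget exponent phi, using the coefficients alpha=1.2, beta=1.1, gamma=1.15 reported in the EfficientNet paper (chosen so that alpha * beta^2 * gamma^2 is roughly 2, i.e. each +1 in phi roughly doubles the FLOPs). A rough sketch with made-up baseline numbers:

```python
def compound_scale(base_depth, base_width, base_res, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling: depth *= alpha**phi, width *= beta**phi, resolution *= gamma**phi."""
    depth = round(base_depth * alpha ** phi)   # number of layers
    width = round(base_width * beta ** phi)    # base channel count
    res = round(base_res * gamma ** phi)       # input image resolution
    return depth, width, res

for phi in range(4):
    print(phi, compound_scale(base_depth=18, base_width=32, base_res=224, phi=phi))
```
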
  19. 3D Pareto front: FLOPs, number of parameters, and Top-1 accuracy

  20. HPO task

  21. Bilevel optimization problem

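The HPO/NAS setting on slides 20-22 is commonly written as a bilevel problem: the outer level picks the hyperparameters or architecture alpha on validation data, while the inner level trains the weights w on training data. A standard formulation (notation is mine):

```latex
\min_{\alpha}\; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) \in \arg\min_{w}\; \mathcal{L}_{\mathrm{train}}(w, \alpha)
```
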
  22. NAS (neural architecture search)

  23. Search Space skeleton

  24. Controller implementation (RNN, isn’t it?)

  25. NASNet cells designed by AutoML algorithm

  26. Evolutionary algorithms

  27. Evolutionary AutoML

  28. Evolution vs RL vs Random Search

  29. AmoebaNet: pinnacle of evolution

  30. (image-only slide)

  31. Batch Normalization

  32. Internal Covariate Shift

  33. Batch Norm algorithm

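For reference, the training-time BatchNorm computation is just per-feature normalization over the current mini-batch followed by a learnable scale and shift; the running statistics used at inference are omitted in this small NumPy sketch:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then scale/shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(8, 4)
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```
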
  34. Bad parts
    ● Batch normalization is computationally expensive
    ● Batch normalization breaks the independence between training examples in a mini-batch
    ● It introduces extra hyper-parameters that need further fine-tuning
    ● It causes many subtle implementation errors in distributed training
    ● It requires separate “training” and “inference” modes in frameworks

  35. The philosophy of this paper
    Identify the origin of BatchNorm’s benefits and replicate these benefits in
    BatchNorm-free Neural Networks

  36. Good parts
    ● Batch normalization downscales the residual branch
    ● Batch normalization eliminates mean-shift (in ReLU networks)
    ● Batch normalization has a regularizing effect
    ● Batch normalization allows efficient large-batch training

  37. Early Batch Norm-free architectures

  38. Residual Branch Downscaling effect: SkipInit
    Batch Normalization Biases Deep Residual Networks Towards Shallow Paths

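SkipInit recovers the downscaling effect without BatchNorm by multiplying the residual branch by a single learnable scalar initialized to zero, so every block starts out as the identity. A minimal PyTorch sketch (the branch itself is an arbitrary placeholder):

```python
import torch
import torch.nn as nn

class SkipInitBlock(nn.Module):
    """Residual block with a learnable scalar on the residual branch (SkipInit idea)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch = nn.Sequential(                      # placeholder residual branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # alpha starts at 0, so at initialization the block is the identity,
        # biasing the deep network towards shallow, well-behaved paths.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.branch(x)
```
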
  39. Removing Mean Shift: Scaled Weight Standardization
    Characterizing signal propagation to close the performance gap in
    unnormalized ResNets

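Scaled Weight Standardization fights the mean shift by standardizing each convolution's weights over their fan-in and rescaling them with a nonlinearity-dependent gain (about 1.71 for ReLU in the NF-ResNet paper). The sketch below is a simplified PyTorch version and only approximates the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weights are (approximately) Scaled-Weight-Standardized on the fly."""
    def __init__(self, *args, gain: float = 1.7139, eps: float = 1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        self.gain = gain   # nonlinearity-dependent constant (default here assumes ReLU)
        self.eps = eps

    def forward(self, x):
        w = self.weight
        fan_in = w[0].numel()                              # in_channels/groups * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)         # per output channel
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # W_hat = gain * (W - mean) / (std * sqrt(fan_in))
        w = self.gain * (w - mean) / torch.sqrt(var * fan_in + self.eps)
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)
```
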
  40. NFNet improvements

  41. Improved SkipInit

  42. Modified Scaled Weight Standardization

  43. Regularization: Dropout
    (meme slide: Dropout / CNNs / BatchNorm / me)

  44. Regularization: Stochastic Depth

  45. Regularization: Stochastic Depth

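Stochastic depth is a block-level Dropout: during training a whole residual branch is skipped with some probability, while at inference the branch is kept and scaled by its survival probability. A minimal sketch of the original per-batch formulation:

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose branch is randomly dropped during training."""
    def __init__(self, branch: nn.Module, survival_prob: float = 0.9):
        super().__init__()
        self.branch = branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)
            return x                                  # branch skipped for this batch
        # Inference: keep the branch, but scale it by the survival probability.
        return x + self.survival_prob * self.branch(x)
```
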
  46. How to train on larger batches?

  47. Gradient Clipping

  48. Intuition about Gradient Clipping
    Parameter updates should be small relative to the magnitude of the weight

  49. Adaptive Gradient Clipping

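Adaptive Gradient Clipping implements the intuition from slide 48 directly: rescale a gradient whenever its norm exceeds a fraction lambda of the norm of the corresponding weights. The sketch below clips per parameter tensor for brevity; the NFNet paper clips unit-wise (per output row of each weight matrix), and the clipping threshold 0.01 is just an illustrative default:

```python
import torch

def adaptive_gradient_clip_(parameters, clipping: float = 0.01, eps: float = 1e-3):
    """Simplified, in-place AGC: enforce ||grad|| <= clipping * max(||weight||, eps)."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp_(min=eps)    # guard against zero-init weights
        g_norm = p.grad.detach().norm()
        max_norm = clipping * w_norm
        if g_norm > max_norm:
            p.grad.detach().mul_(max_norm / (g_norm + 1e-6))

# Usage in a training loop (hypothetical model/optimizer):
#   loss.backward()
#   adaptive_gradient_clip_(model.parameters(), clipping=0.01)
#   optimizer.step()
```
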
  50. Not enough Top-1 accuracy!
    Even with Adaptive Gradient Clipping and the modified residual branches and
    convolutions, normalizer-free networks still could not surpass the accuracy of
    EfficientNets.

  51. New SOTA model family

  52. Architecture optimization for improved accuracy and
    training speed
    ● SE-ResNeXt-D as the baseline
    ● Fixed group width (specific to the ResNeXt architecture)
    ● The depth-scaling pattern was changed (from very specific to very simple)
    ● The width pattern was changed as well

  53. Bottleneck block design

  54. (image-only slide)

  55. (image-only slide)

  56. The whole NFNet family

  57. Implementation
    ● JAX implementation from the authors
    ● A Colab notebook to play with
    ● PyTorch implementation with pretrained weights
    ● Yet another PyTorch implementation
    ● A very good PyTorch implementation (clear code)
    ● Adaptive Gradient Clipping example
    ● A broken Keras example (could still be a good entry point)
    ● Raw TF implementation

  58. SAM: Sharpness Aware Minimization

  59. SAM: Sharpness Aware Minimization

  60. SAM: Two backprops and approximate grads

  61. Accelerating Sharpness-Aware Minimization
    The idea is to run the extra SAM step on only 20% of the batch.

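SAM needs two backward passes per step: it first climbs to the worst-case point w + rho * g / ||g|| in a small neighbourhood, then applies the gradient computed there to the original weights. A simplified PyTorch training-step sketch (model, loss_fn, optimizer and the batch are assumed to exist; the official implementations handle more details):

```python
import torch

def sam_step(model, loss_fn, optimizer, x, y, rho: float = 0.05):
    # First pass: gradient of the loss at the current weights w.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [None if p.grad is None else p.grad.detach().clone()
             for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

    # Climb: w <- w + rho * g / ||g|| (the locally "sharpest" direction).
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = torch.zeros_like(p) if g is None else rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)

    # Second pass: gradient at w + eps, then undo the climb and take the step.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()
    return loss.item()
```
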
  62. Modern Augmentation

  63. Different augmentations applied to Baseline

  64. RandAugment

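RandAugment reduces the augmentation search space to two numbers: N transformations are sampled per image from a fixed pool and all are applied at a single global magnitude M. A toy PIL-based sketch with a reduced op pool and made-up magnitude mappings:

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Toy op pool; the global magnitude m (0..30 here) is mapped into each op's own range.
OPS = [
    lambda img, m: ImageOps.autocontrast(img),
    lambda img, m: ImageOps.equalize(img),
    lambda img, m: ImageOps.posterize(img, max(1, 8 - int(m / 4))),
    lambda img, m: ImageOps.solarize(img, 256 - int(m * 8)),
    lambda img, m: ImageEnhance.Contrast(img).enhance(1.0 + m / 30.0),
    lambda img, m: ImageEnhance.Sharpness(img).enhance(1.0 + m / 30.0),
    lambda img, m: img.rotate(m),
]

def rand_augment(img: Image.Image, n: int = 2, m: int = 9) -> Image.Image:
    """Apply n randomly chosen ops, all at the same global magnitude m."""
    for op in random.choices(OPS, k=n):
        img = op(img, m)
    return img
```
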
  65. CutMix, MixUp, Cutout

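Of the three, MixUp is the easiest to show in code: each image and its one-hot label are blended with a partner using a Beta-distributed coefficient; Cutout instead erases a random patch, and CutMix pastes a patch from the partner image and mixes the labels by area. A minimal batch-level MixUp sketch:

```python
import torch

def mixup(x, y_onehot, alpha: float = 0.2):
    """Blend each sample with a randomly permuted partner: x' = lam*x + (1-lam)*x[perm]."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 3, 32, 32)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
xm, ym = mixup(x, y)
```
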
  66. Which model is the best?

  67. (image-only slide)

  68. BN vs NF for different depths and resolutions

  69-73. Summary
    BatchNorm benefit → normalizer-free replacement:
    1. Downscales the residual branch → NF strategy
    2. Enables large-batch training → Adaptive gradient clipping
    3. Implicit regularization → Explicit regularization
    4. Prevents mean-shift → Scaled Weight Standardization
    ● NFNets were the new SOTA for a few months
    ● NF-ResNets match or beat BN-ResNets (NFResNet >= BNResNet)
    ● NFNet training is much faster than EfficientNet training

  74. (image-only slide)