
OpenTalks.AI - Dmitry Vetrov, Fractality of the loss function, the double-descent effect, and power laws in deep learning: fragments of one mosaic

OpenTalks.AI
February 04, 2021


Transcript

1. Surprising properties of scale-invariant neural networks
Dmitry P. Vetrov, research professor at HSE, lab lead at SAIC-Moscow, head of BayesGroup, http://bayesgroup.ru
2. Outline
• Learning rate and its influence on the width of loss minima
• Fractal hypothesis
• Scale-invariance, effective learning rate and curvature
• Attractors of the learning trajectory in phase diagrams
3. Different learning rates
Y. Li, C. Wei, T. Ma. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NeurIPS 2019.
4. Minefields in the loss landscape
It appears that there are lots of poor global minima with almost zero train loss and arbitrarily poor validation loss. Moreover, many of them lie in close vicinity to the optimization path. SGD does not «see» them unless we start directly looking for poor minima.
Huang et al. Understanding Generalization through Visualizations. https://arxiv.org/abs/1906.03291
5. Implicit regularization
Modern DNNs are trained with stochastic optimization techniques. The noise in the stochastic gradient prevents us from seeing small details of the loss: we simply cannot see narrow minima. There is a common (yet unproven) belief in the community that wide minima generalize better.
6. Hypothesis
[Figure: train loss landscape with the starting point, the zero-train-loss region, and trajectories for a large learning rate, a small learning rate, and annealing to a small learning rate.]
7. Hypothesis
[Same figure, second build: large learning rate vs. small learning rate vs. annealing to a small learning rate over the train loss landscape.]
8. Fisher information matrix
• The Fisher information matrix shows the average curvature of the model's prediction function
• It shows how robust the model is with respect to its own predictions
• It is positive semi-definite
• It correlates well with the covariance matrix of the stochastic gradients and with the Hessian of the train loss
• We can efficiently estimate its trace in a stochastic manner
F(θ) = −𝔼x 𝔼y [∂² log p(y|x, θ) / ∂θ²] = 𝔼x 𝔼y [(∂ log p(y|x, θ)/∂θ)(∂ log p(y|x, θ)/∂θ)ᵀ] ≽ 0
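To make the last bullet concrete, here is a minimal sketch (not the speaker's code) of stochastic trace estimation using the outer-product form above: tr F(θ) = 𝔼x 𝔼y ∥∂ log p(y|x, θ)/∂θ∥², with labels sampled from the model's own predictive distribution. The names `model` and `loader` are assumed placeholders.

```python
# Sketch: stochastic estimate of tr(F) via per-example squared gradient norms,
# with labels sampled from the model's own predictions (true Fisher, not empirical).
import torch
import torch.nn.functional as F_nn

def fisher_trace(model, loader, n_examples=256, device="cpu"):
    model.eval()  # avoid batch-size-1 issues with BatchNorm statistics
    params = [p for p in model.parameters() if p.requires_grad]
    total, count = 0.0, 0
    for x, _ in loader:                                # true labels are not used
        for xi in x.to(device):
            logits = model(xi.unsqueeze(0))
            y = torch.distributions.Categorical(logits=logits).sample()
            nll = F_nn.cross_entropy(logits, y)        # -log p(y|x, theta)
            grads = torch.autograd.grad(nll, params)
            total += sum((g ** 2).sum().item() for g in grads)
            count += 1
            if count >= n_examples:
                return total / count
    return total / count
```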
9. Experiment
We run SGD with a constant learning rate η ∝ 1/scale². Then we estimate the spectrum of the Fisher information matrix for different values of scale. If our hypothesis is right, we should observe self-similarity: the spectra should differ only by a constant factor.
10. Experiment (continued)
We run SGD with a constant learning rate η ∝ 1/scale² and estimate the spectrum of the Fisher information matrix for different values of scale. If our hypothesis is right, the spectra should differ only by a constant factor. Let us plot the histograms with a logarithmic X-scale, where a constant factor becomes a horizontal shift.
11. Experiment (continued)
[Histograms of the Fisher spectra for different values of scale, plotted with a logarithmic X-scale.]
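A hedged sketch of this experimental protocol, assuming hypothetical helpers `make_model`, `train_sgd`, and `fisher_eigenvalues` (e.g., a Lanczos-based spectrum estimator); none of these names come from the talk. The initial weights are rescaled by `scale`, the learning rate is set proportional to 1/scale², and the resulting Fisher spectra are compared on a log X-scale.

```python
# Sketch of the self-similarity experiment; make_model, train_sgd and
# fisher_eigenvalues are hypothetical helpers, not code from the talk.
import numpy as np
import torch
import matplotlib.pyplot as plt

base_lr = 0.1
for scale in (0.5, 1.0, 2.0):
    model = make_model()                          # hypothetical model constructor
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(scale)                         # rescale the initialization
    train_sgd(model, lr=base_lr / scale ** 2)     # constant learning rate ~ 1/scale^2
    eigvals = fisher_eigenvalues(model)           # hypothetical spectrum estimator
    plt.hist(np.log10(eigvals), bins=50, alpha=0.5, label=f"scale={scale}")

plt.xlabel("log10(eigenvalue of F)")              # on a log X-scale a constant
plt.ylabel("count")                               # factor is a horizontal shift
plt.legend()
plt.show()
```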
12. The charm of batch normalization
• Injects stochasticity by using mini-batch statistics in the normalization procedure
• Makes the train loss landscape smoother
• Ensures scale-invariance of the model: log p(y|x, θ) = log p(y|x, Cθ), ∀C > 0
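A minimal numerical check of the last bullet (a sketch, not from the talk): for a network whose convolutional weights feed into BatchNorm, rescaling those weights by any C > 0 leaves the predictions unchanged when mini-batch statistics are used.

```python
# Sketch: BatchNorm makes the pre-BN weights scale-invariant,
# i.e. f(x; C * theta) == f(x; theta) for any C > 0.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),   # scale-invariant weights (normalized away by BN)
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),                 # the final layer itself is not scale-invariant
)
net.train()                           # BN uses mini-batch statistics

x = torch.randn(16, 3, 32, 32)
out_before = net(x).detach()

with torch.no_grad():
    net[0].weight.mul_(7.3)           # arbitrary C > 0 applied to the pre-BN weights
out_after = net(x).detach()

print(torch.allclose(out_before, out_after, atol=1e-3))  # True (up to BN's eps)
```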
13. Scale-invariance
• There is a one-to-one correspondence between optimization over the whole weight space and optimization on the unit sphere
• Changing the weight norm changes the curvature of the loss
• Let us define the effective learning rate and the effective Fisher trace that correspond to optimization on the sphere:
η_eff = η / ∥θ∥²,   tr_eff(F) = tr(F) · ∥θ∥²
T. van Laarhoven. L2 regularization versus batch and weight normalization. In NIPS 2017.
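As a small illustration of the definitions above (assumptions: `scale_invariant_params` collects the pre-BN weights, and `fisher_tr` comes from an estimator like the one sketched for slide 8):

```python
# Sketch: effective learning rate and effective Fisher trace for the
# scale-invariant part of the parameters, following the definitions above.
def effective_quantities(lr, scale_invariant_params, fisher_tr):
    """Return (eta_eff, tr_eff) with eta_eff = lr / ||theta||^2
    and tr_eff = tr(F) * ||theta||^2."""
    sq_norm = sum(float((p ** 2).sum()) for p in scale_invariant_params)
    return lr / sq_norm, fisher_tr * sq_norm

# Example (hypothetical values): eta_eff, tr_eff = effective_quantities(0.1, [net[0].weight], 3.2)
```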
14. The evolution of the weight norm
• The (stochastic) gradient from the training data is always orthogonal to the ray connecting the origin with the current weights
• The gradient from weight decay points toward the origin
• The balance between the data gradient and weight decay determines how the weight norm changes
R. Wan et al. Spherical Motion Dynamics: Learning Dynamics of Neural Network with Normalization, Weight Decay and SGD. https://arxiv.org/pdf/2006.08419.pdf
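A self-contained numerical check of the first bullet (a sketch under the same BatchNorm setup as above, not the authors' code): the data gradient with respect to a pre-BN weight tensor is orthogonal to that tensor, so an orthogonal step alone can only grow the norm, while weight decay shrinks it.

```python
# Sketch: for a weight tensor made scale-invariant by BatchNorm, the data gradient
# is (numerically) orthogonal to the weights themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

torch.manual_seed(0)
net = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
net.train()                                        # BN uses mini-batch statistics

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
loss = F_nn.cross_entropy(net(x), y)
g = torch.autograd.grad(loss, net[0].weight)[0]    # data gradient w.r.t. pre-BN weight
w = net[0].weight.detach()

cos = (g * w).sum() / (g.norm() * w.norm())
print(f"cos(angle between gradient and weights) = {cos.item():.1e}")  # ~0: orthogonal
# Consequence: without weight decay an orthogonal step can only grow the norm,
# ||theta - eta*g||^2 = ||theta||^2 + eta^2 * ||g||^2, while weight decay shrinks it;
# the balance of the two governs how ||theta|| evolves during training.
```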
15. The evolution of the effective rate
• BN stabilizes the effective learning rate regardless of the nominal learning rate
• At some value of the effective rate the network becomes unstable and is knocked out of the current minimum; the effective rate drops and the network quickly converges to a new minimum
Setup: convolutional NN, constant learning rate, SGD, no data augmentation.
16. The evolution of the effective rate (continued)
[Same text as slide 15; setup: convolutional NN, constant learning rate, SGD, no data augmentation.]
17. Phase diagram
• After a few knock-outs the network converges to a stable orbit in phase space
[Diagram annotations: the network learns on hard training objects; double descent of the test loss.]
A. Achille, M. Rovere, S. Soatto. Critical learning periods in deep learning. In ICLR 2019.
18. Phase diagram (continued)
[Diagram annotations: train error at zero; weight norm drops; effective rate grows.]
19. Phase diagram (continued)
[Diagram annotations: the minimum becomes unstable; the DNN is knocked out; weight norm increases; effective rate drops.]
20. Phase diagram (continued)
[Diagram annotations: a stable orbit is reached; periodical knock-outs.]
21. Conclusion
• The value of the learning rate determines the sharpness of the train loss minimum
• There are multiple global minima with different curvature
• For scale-invariant networks it is the effective learning rate that matters
• Batch normalization automatically adapts the effective learning rate to values that ensure good generalization