
OpenTalks.AI - Dmitry Vetrov, Fractality of the loss function, the double-descent effect, and power laws in deep learning: fragments of one mosaic

OpenTalks.AI
February 04, 2021


Transcript

1. Surprising properties of scale-invariant neural networks
Dmitry P. Vetrov, research professor at HSE, lab lead at SAIC-Moscow, head of BayesGroup, http://bayesgroup.ru
2. Outline
• Learning rate and its influence on the width of loss minima
• Fractal hypothesis
• Scale-invariance, effective learning rate and curvature
• Attractors of the learning trajectory in phase diagrams
3. Different learning rates
Y. Li, C. Wei, T. Ma. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NeurIPS 2019.
4. Minefields in the loss landscape
It appears that there are lots of poor global minima with almost zero train loss and arbitrarily poor validation loss. Moreover, many of them lie in close vicinity to the optimization path. SGD does not «see» them unless we start directly looking for poor minima.
Huang et al. Understanding Generalization through Visualizations. https://arxiv.org/abs/1906.03291
5. Implicit regularization
Modern DNNs are trained with stochastic optimization techniques. The noise in the stochastic gradient prevents us from seeing small details of the loss: we simply cannot see narrow minima. There is a common (yet unproven) belief in the community that wide minima generalize better.
6. Hypothesis
[Figure: train loss landscape with the starting point, the zero-train-loss region, and trajectories for a large learning rate, a small learning rate, and annealing to a small learning rate.]
7. Hypothesis
[Same figure, second build: large learning rate vs. small learning rate vs. annealing to a small learning rate over the train loss landscape.]
8. Fisher information matrix
• The Fisher information matrix shows the average curvature of the model's prediction function
• It shows how robust the model is with respect to its own predictions
• It is positive semi-definite
• It correlates well with the covariance matrix of the stochastic gradients and with the Hessian of the train loss
• We can efficiently estimate its trace in a stochastic manner
F(θ) = −𝔼x 𝔼y [∂² log p(y|x, θ) / ∂θ²] = 𝔼x 𝔼y [(∂ log p(y|x, θ)/∂θ)(∂ log p(y|x, θ)/∂θ)ᵀ] ≽ 0
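To make the last bullet concrete, here is a minimal sketch (not the speaker's code) of stochastic trace estimation using the outer-product form above: tr F(θ) = 𝔼x 𝔼y ∥∂ log p(y|x, θ)/∂θ∥², with labels sampled from the model's own predictive distribution. The names `model` and `loader` are assumed placeholders.

```python
# Sketch: stochastic estimate of tr(F) via per-example squared gradient norms,
# with labels sampled from the model's own predictions (true Fisher, not empirical).
import torch
import torch.nn.functional as F_nn

def fisher_trace(model, loader, n_examples=256, device="cpu"):
    model.eval()  # avoid batch-size-1 issues with BatchNorm statistics
    params = [p for p in model.parameters() if p.requires_grad]
    total, count = 0.0, 0
    for x, _ in loader:                                # true labels are not used
        for xi in x.to(device):
            logits = model(xi.unsqueeze(0))
            y = torch.distributions.Categorical(logits=logits).sample()
            nll = F_nn.cross_entropy(logits, y)        # -log p(y|x, theta)
            grads = torch.autograd.grad(nll, params)
            total += sum((g ** 2).sum().item() for g in grads)
            count += 1
            if count >= n_examples:
                return total / count
    return total / count
```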
9. Experiment
We run SGD with a constant learning rate η ∝ 1/scale². Then we estimate the spectrum of the Fisher information matrix for different values of scale. If our hypothesis is right, we should observe self-similarity: the spectra should differ only by a constant factor.
10. Experiment (continued)
We run SGD with a constant learning rate η ∝ 1/scale² and estimate the spectrum of the Fisher information matrix for different values of scale. If our hypothesis is right, the spectra should differ only by a constant factor. Let us plot the histograms with a logarithmic X-scale, where a constant factor becomes a horizontal shift.
11. Experiment (continued)
[Histograms of the Fisher spectra for different values of scale, plotted with a logarithmic X-scale.]
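A hedged sketch of this experimental protocol, assuming hypothetical helpers `make_model`, `train_sgd`, and `fisher_eigenvalues` (e.g., a Lanczos-based spectrum estimator); none of these names come from the talk. The initial weights are rescaled by `scale`, the learning rate is set proportional to 1/scale², and the resulting Fisher spectra are compared on a log X-scale.

```python
# Sketch of the self-similarity experiment; make_model, train_sgd and
# fisher_eigenvalues are hypothetical helpers, not code from the talk.
import numpy as np
import torch
import matplotlib.pyplot as plt

base_lr = 0.1
for scale in (0.5, 1.0, 2.0):
    model = make_model()                          # hypothetical model constructor
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(scale)                         # rescale the initialization
    train_sgd(model, lr=base_lr / scale ** 2)     # constant learning rate ~ 1/scale^2
    eigvals = fisher_eigenvalues(model)           # hypothetical spectrum estimator
    plt.hist(np.log10(eigvals), bins=50, alpha=0.5, label=f"scale={scale}")

plt.xlabel("log10(eigenvalue of F)")              # on a log X-scale a constant
plt.ylabel("count")                               # factor is a horizontal shift
plt.legend()
plt.show()
```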
12. The charm of batch normalization
• Injects stochasticity by using mini-batch statistics in the normalization procedure
• Makes the train loss landscape smoother
• Ensures scale-invariance of the model: log p(y|x, θ) = log p(y|x, Cθ), ∀C > 0
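A minimal numerical check of the last bullet (a sketch, not from the talk): for a network whose convolutional weights feed into BatchNorm, rescaling those weights by any C > 0 leaves the predictions unchanged when mini-batch statistics are used.

```python
# Sketch: BatchNorm makes the pre-BN weights scale-invariant,
# i.e. f(x; C * theta) == f(x; theta) for any C > 0.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, bias=False),   # scale-invariant weights (normalized away by BN)
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),                 # the final layer itself is not scale-invariant
)
net.train()                           # BN uses mini-batch statistics

x = torch.randn(16, 3, 32, 32)
out_before = net(x).detach()

with torch.no_grad():
    net[0].weight.mul_(7.3)           # arbitrary C > 0 applied to the pre-BN weights
out_after = net(x).detach()

print(torch.allclose(out_before, out_after, atol=1e-3))  # True (up to BN's eps)
```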
13. Scale-invariance
• There is a one-to-one correspondence between optimization over the whole weight space and optimization on the unit sphere
• Changing the weight norm changes the curvature of the loss
• Let us define the effective learning rate and the effective Fisher trace that correspond to optimization on the sphere:
η_eff = η / ∥θ∥²,   tr_eff(F) = tr(F) · ∥θ∥²
T. van Laarhoven. L2 regularization versus batch and weight normalization. In NIPS 2017.
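As a small illustration of the definitions above (assumptions: `scale_invariant_params` collects the pre-BN weights, and `fisher_tr` comes from an estimator like the one sketched for slide 8):

```python
# Sketch: effective learning rate and effective Fisher trace for the
# scale-invariant part of the parameters, following the definitions above.
def effective_quantities(lr, scale_invariant_params, fisher_tr):
    """Return (eta_eff, tr_eff) with eta_eff = lr / ||theta||^2
    and tr_eff = tr(F) * ||theta||^2."""
    sq_norm = sum(float((p ** 2).sum()) for p in scale_invariant_params)
    return lr / sq_norm, fisher_tr * sq_norm

# Example (hypothetical values): eta_eff, tr_eff = effective_quantities(0.1, [net[0].weight], 3.2)
```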
14. The evolution of the weight norm
• The (stochastic) gradient from the training data is always orthogonal to the ray connecting the origin with the current weights
• The gradient from weight decay points toward the origin
• The balance between the data gradient and weight decay determines how the weight norm changes
R. Wan et al. Spherical Motion Dynamics: Learning Dynamics of Neural Network with Normalization, Weight Decay and SGD. https://arxiv.org/pdf/2006.08419.pdf
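A self-contained numerical check of the first bullet (a sketch under the same BatchNorm setup as above, not the authors' code): the data gradient with respect to a pre-BN weight tensor is orthogonal to that tensor, so an orthogonal step alone can only grow the norm, while weight decay shrinks it.

```python
# Sketch: for a weight tensor made scale-invariant by BatchNorm, the data gradient
# is (numerically) orthogonal to the weights themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

torch.manual_seed(0)
net = nn.Sequential(nn.Conv2d(3, 8, 3, bias=False), nn.BatchNorm2d(8), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
net.train()                                        # BN uses mini-batch statistics

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
loss = F_nn.cross_entropy(net(x), y)
g = torch.autograd.grad(loss, net[0].weight)[0]    # data gradient w.r.t. pre-BN weight
w = net[0].weight.detach()

cos = (g * w).sum() / (g.norm() * w.norm())
print(f"cos(angle between gradient and weights) = {cos.item():.1e}")  # ~0: orthogonal
# Consequence: without weight decay an orthogonal step can only grow the norm,
# ||theta - eta*g||^2 = ||theta||^2 + eta^2 * ||g||^2, while weight decay shrinks it;
# the balance of the two governs how ||theta|| evolves during training.
```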
15. The evolution of the effective rate
• BN stabilizes the effective learning rate regardless of the nominal learning rate
• At some value of the effective rate the network becomes unstable and is knocked out of the current minimum; the effective rate drops and the network quickly converges to a new minimum
Setup: convolutional NN, constant learning rate, SGD, no data augmentation.
16. The evolution of the effective rate (continued)
[Same text as slide 15; setup: convolutional NN, constant learning rate, SGD, no data augmentation.]
17. Phase diagram
• After a few knock-outs the network converges to a stable orbit in phase space
[Diagram annotations: the network learns on hard training objects; double descent of the test loss.]
A. Achille, M. Rovere, S. Soatto. Critical learning periods in deep learning. In ICLR 2019.
18. Phase diagram (continued)
[Diagram annotations: train error at zero; weight norm drops; effective rate grows.]
19. Phase diagram (continued)
[Diagram annotations: the minimum becomes unstable; the DNN is knocked out; weight norm increases; effective rate drops.]
20. Phase diagram (continued)
[Diagram annotations: a stable orbit is reached; periodical knock-outs.]
21. Conclusion
• The value of the learning rate determines the sharpness of the train loss minimum
• There are multiple global minima with different curvature
• For scale-invariant networks it is the effective learning rate that matters
• Batch normalization automatically adapts the effective learning rate to values that ensure good generalization