OpenTalks.AI - Dmitry Vetrov, Deep learning & Bayesian approach. Optimizing the network's loss function

OpenTalks.AI
February 19, 2020


Transcript

  1. Deep learning: An Ensemble perspective. Dmitry P. Vetrov, research professor at HSE, lab lead at SAIC-Moscow, head of BayesGroup, http://bayesgroup.ru
  2. Outline • Introduction to Bayesian framework • MCMC and adversarial learning • Understanding loss landscape • Uncertainty estimation study • Deep Ensemble perspective
  3. Intro to Bayesian framework

  4. Conditional and marginal distributions

  5. Bayesian Framework

  6. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  7. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  8. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  9. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
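
For reference, Bayes' theorem as the bullets use it, written for network weights $\theta$ and training data $D$ (the notation is mine, not copied from the slides):

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
% Prediction for a new input x^* averages over the posterior:
p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid D)\, d\theta
```
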
  10. Frequentist vs. Bayesian frameworks

  11. What is machine learning?

  12. Machine learning from Bayesian point of view

  13. Poor man’s Bayes

  14. MCMC and adversarial learning

  15. Modeling of probabilistic distributions

  16. Modeling of probabilistic distributions

  17. Main approximate inference tools. Variational inference: • Approximates the intractable true posterior with a tractable variational distribution • Typically the KL-divergence is minimized • Can be scaled up by stochastic optimization. Markov Chain Monte Carlo: • Generates samples from the true posterior • No bias even if the true distribution is intractable • Quite slow in practice • Problematic scaling to large data
  18. Metropolis-Hastings algorithm

  19. Metropolis-Hastings algorithm

  20. Metropolis-Hastings algorithm
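
Slides 18-20 carry only the algorithm's name, so here is a minimal random-walk Metropolis-Hastings sketch (the textbook algorithm, not the specific variant developed in the talk); the Gaussian proposal and step size are illustrative assumptions:

```python
import numpy as np

def metropolis_hastings(log_prob, x0, n_steps=10_000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for an unnormalized log density."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal(x.shape)  # symmetric proposal
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p(x') / p(x)); the symmetric proposal cancels out
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        samples.append(x.copy())
    return np.array(samples), accepted / n_steps

# Example: sampling a 2-D standard Gaussian
samples, acc_rate = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
```

The acceptance rate returned here is the quantity that slides 21-25 propose to maximize, presumably by learning the proposal distribution.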

  21. Acceptance rate maximization

  22. Acceptance rate maximization

  23. Acceptance rate maximization

  24. Acceptance rate maximization

  25. Acceptance rate maximization

  26. Results on toy problem

  27. Implicit setting

  28. Implicit setting

  29. Reduction to adversarial training

  30. Reduction to adversarial training

  31. Reduction to adversarial training

  32. Reduction to adversarial training

  33. Reduction to adversarial training

  34. MH GAN

  35. MH GAN

  36. MH GAN. Acceptance rate is about 10%. Only unique objects were used for evaluating the metrics.
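
Slides 34-36 refer to MH GAN; below is a sketch of the discriminator-driven acceptance rule from Turner et al. (2019), "Metropolis-Hastings Generative Adversarial Networks", which is my best guess at the rule behind the ~10% acceptance rate quoted above (details in the talk may differ):

```python
import numpy as np

def mh_gan_select(d_scores, rng=None):
    """Run a Metropolis-Hastings chain over generator samples using (calibrated)
    discriminator scores d(x) in (0, 1); returns the index of the selected sample.
    Acceptance ratio as in Turner et al. (2019): (1/d(x_k) - 1) / (1/d(x') - 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    k = 0  # the paper initializes the chain from a real sample; here we start at sample 0
    for i in range(1, len(d_scores)):
        ratio = (1.0 / d_scores[k] - 1.0) / (1.0 / d_scores[i] - 1.0)
        if rng.uniform() < min(1.0, ratio):
            k = i  # accept the proposed generator sample
    return k
```
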
  37. Understanding loss landscape

  38. Different learning rates. Y. Li, C. Wei, T. Ma. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NeurIPS 2019.
  39. Two kinds of patterns. Easy-to-fit, hard-to-generalize: noisy regularities, easy patterns. Hard-to-fit, easy-to-generalize: low noise, complicated patterns.
  40. Two kinds of patterns. Easy-to-fit, hard-to-generalize: noisy regularities, easy patterns. Hard-to-fit, easy-to-generalize: low noise, complicated patterns. Main claim: - a smaller LR learns (memorizes) hard-to-fit patterns - a larger LR learns easy-to-fit patterns - a larger LR with annealing learns both.
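
To make the three regimes concrete, here is a minimal PyTorch-style sketch of the schedules being compared (the learning-rate values and the annealing milestone are illustrative assumptions, not the ones used in the talk):

```python
import torch

def make_optimizer(model, regime: str):
    """Illustrative versions of the three learning-rate regimes discussed on the slides."""
    if regime == "small":
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        sched = None
    elif regime == "large":
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
        sched = None
    elif regime == "large+anneal":
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
        # anneal to a small learning rate partway through training
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50], gamma=0.02)
    else:
        raise ValueError(regime)
    return opt, sched
```
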
  41. Toy experiment

  42. Toy experiment

  43. Discussion. Small LR: first memorizes fixed patterns, then learns noisy patterns using less data. Large LR: learns noisy flexible patterns using the full data; unable to memorize objects/patterns. Large LR + annealing: first learns noisy flexible patterns using the full data, then memorizes fixed patterns using less data.
  44. Discussion. Small LR: first memorizes fixed patterns, then learns noisy patterns using less data. Large LR: learns noisy flexible patterns using the full data; unable to memorize objects/patterns. Large LR + annealing: first learns noisy flexible patterns using the full data, then memorizes fixed patterns using less data. Both noisy and fixed patterns are present in real data. A larger LR corresponds to wider local optima (see Khan 2019). E. Khan. Deep learning with Bayesian principles. NeurIPS 2019 tutorial.
  45. Hypothesis (figure): train loss landscape with the zero-train-loss region, the starting point, and trajectories for a large learning rate, a small learning rate, and annealing to a small learning rate.
  46. 2-dimensional slice. https://www.youtube.com/watch?v=dqX2LBcp5Hs T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In NeurIPS 2018.
  47. Intriguing properties of the loss landscape. S. Fort, S. Jastrzebski. Large Scale Structure of Neural Network Loss Landscapes. In NeurIPS 2019.
  48. Phenomenological model: N-dimensional manifolds in the D-dimensional weight space are used to emulate the loss landscape. The toy loss equals the minimal distance to one of the N-dimensional manifolds.
  49. Discussion. The phenomenological model reproduces: - the mode connectivity effect - circular-cut isotropy - the existence of an intrinsic dimension (should be larger than D-N) - injection of noise increases the width of local optima.
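
A sketch of such a toy loss under one concrete reading of the slides: the low-loss set is a union of random N-dimensional affine subspaces embedded in the D-dimensional weight space, and the loss at a point is its distance to the nearest subspace (the construction in Fort & Jastrzebski may differ in details):

```python
import numpy as np

def toy_loss(w, centers, bases):
    """Minimal distance from a weight vector w (shape [D]) to a union of
    N-dimensional affine subspaces {c + B @ z}; each basis B has shape [D, N]
    with orthonormal columns."""
    dists = []
    for c, B in zip(centers, bases):
        r = w - c
        r_inside = B @ (B.T @ r)          # component lying inside the subspace
        dists.append(np.linalg.norm(r - r_inside))
    return min(dists)

# Illustrative setup: 3 random subspaces of dimension N = 5 in D = 100
rng = np.random.default_rng(0)
D, N = 100, 5
centers = [rng.standard_normal(D) for _ in range(3)]
bases = [np.linalg.qr(rng.standard_normal((D, N)))[0] for _ in range(3)]
print(toy_loss(rng.standard_normal(D), centers, bases))
```

With this construction a random search subspace generically needs dimension at least D-N to hit an N-dimensional solution manifold, which matches the intrinsic-dimension remark on the slide.
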
  50. Weight averaging Weight averaging helps Weight averaging does not help
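
Weight averaging here presumably means averaging the weights of checkpoints that lie in one low-loss region (in the spirit of stochastic weight averaging); a minimal sketch, assuming all checkpoints share the same architecture:

```python
import copy
import torch

def average_weights(models):
    """Average the parameters of several checkpoints of the same architecture.
    Note: batch-norm running statistics should be recomputed on training data
    after averaging."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for p_avg, *ps in zip(avg.parameters(), *(m.parameters() for m in models)):
            p_avg.copy_(torch.stack([p.detach() for p in ps]).mean(dim=0))
    return avg
```

Averaging helps when the checkpoints sit in the same basin, so the average stays in a low-loss region, and fails when they come from different modes; that is one way to read the two panels on the slide.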

  51. Fighting the overconfidence of DNNs

  52. Uncertainty estimation

  53. Non-Bayesian way: deep ensembles
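
The non-Bayesian recipe named here is the standard one: train several networks from different random initializations and average their predictive distributions; a minimal sketch:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x):
    """Deep-ensemble prediction: average the class probabilities of
    independently trained networks on a batch x."""
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```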

  54. Bayesian way: inferring posterior

  55. Exponential number of symmetries

  56. Estimation metrics

  57. Estimation metrics

  58. Temperature scaling
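
Temperature scaling (Guo et al., 2017) calibrates a trained classifier by fitting a single scalar T on held-out logits and predicting with softmax(logits / T); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, n_iters=200):
    """Fit the temperature T by minimizing NLL on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=n_iters)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```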

  59. Experiment design Cover multiple modes Memory-efficient Cover single mode

  60. Deep ensemble equivalent

  61. Deep ensemble equivalent
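
The deep ensemble equivalent (DEE) of Ashukha et al. reports, for any ensembling method, how many independently trained networks a plain deep ensemble would need to reach the same score. A sketch of the interpolation, assuming a "higher is better" metric (such as calibrated log-likelihood) that grows with ensemble size:

```python
import numpy as np

def deep_ensemble_equivalent(method_score, de_sizes, de_scores):
    """Deep Ensemble Equivalent: number of independently trained networks whose
    ensemble matches method_score, found by piecewise-linear interpolation of
    the deep-ensemble score curve (de_scores must increase with de_sizes)."""
    de_sizes, de_scores = np.asarray(de_sizes, float), np.asarray(de_scores, float)
    if method_score <= de_scores[0]:
        return 1.0                       # no better than a single network
    if method_score >= de_scores[-1]:
        return float(de_sizes[-1])       # saturated at the largest measured DE
    return float(np.interp(method_score, de_scores, de_sizes))

# Example with made-up log-likelihoods: the method lands between the 2- and 3-network DE
print(deep_ensemble_equivalent(-0.31, [1, 2, 3, 5], [-0.35, -0.32, -0.305, -0.30]))
```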

  62. Test-time data augmentation. Data augmentation surprisingly helps almost all ensembling tools.
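
Test-time augmentation averages a model's predictions over several randomly augmented copies of each input; a minimal sketch, where `augment` stands for any random augmentation transform (an assumption for illustration, not an API from the talk):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tta_predict(model, x, augment, n_samples=10):
    """Average class probabilities over random test-time augmentations of a batch x."""
    probs = [F.softmax(model(augment(x)), dim=-1) for _ in range(n_samples)]
    return torch.stack(probs).mean(dim=0)
```
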
  63. Results

  64. Results Explore different modes Explore single mode Memory-efficient

  65. Deep Ensemble perspective

  66. Diversity of ensembles. Extensive experiments show: - ensembling really improves accuracy and uncertainty - existing variational methods are far behind deep ensembles - the less memory an ensemble requires, the worse its accuracy. S. Fort, H. Hu, B. Lakshminarayanan. Deep Ensembles: A Loss Landscape Perspective. In the BDL workshop at NeurIPS 2019. Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In NeurIPS 2019. A. Lyzhov, D. Molchanov, A. Ashukha, D. Vetrov. Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning. In ICLR 2020.
  67. Cooling the posterior (figure compares the cooled posterior with the true posterior). The tempered ("cooled") posterior is $p_T(\theta \mid D) \propto \exp\!\left(\tfrac{1}{T}\log p(\theta \mid D)\right)$. F. Wenzel et al. How Good is the Bayes Posterior in Deep Neural Networks Really? https://arxiv.org/abs/2002.02405
  68. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  69. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  70. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  71. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  72. Deep ensembles vs wide DNNs

  73. Conclusion • Too many different topics to draw all conclusions… • Deep MCMC techniques can become a new probabilistic tool • Loss landscapes (and the corresponding posteriors) are extremely complicated and require further study • Ensembles are highly underestimated by the community • These and many other topics at Deep|Bayes 2020
  74. KL-divergence

  75. KL-divergence Mode Collapsing!

  76. KL-divergence Low-density covering!
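
The two failure modes on slides 75-76 follow from the asymmetry of the KL-divergence; in the standard reading, the reverse KL used when fitting q to p is mode-seeking, while the forward KL is mass-covering:

```latex
% Reverse KL (mode-seeking): q avoids regions where p is near zero -> mode collapsing
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q(x)}\!\left[\log \frac{q(x)}{p(x)}\right]
% Forward KL (mass-covering): q must put mass wherever p does -> low-density covering
\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]
```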

  77. GAN

  78. GAN

  79. GAN

  80. VAE

  81. Pros and cons. VAE: • Reconstruction term • Learned latent representations • Unrealistic explicit likelihood of the decoder. GAN: • More realistic implicit likelihood • No coverage of the training data.
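
The "reconstruction term" and the explicit decoder likelihood in the VAE column correspond to the negative ELBO; a minimal sketch for a Gaussian-latent VAE with a Bernoulli decoder (the specific likelihood choice is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def vae_negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction term (Bernoulli likelihood, inputs in [0, 1])
    plus KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
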
  82. Taking the best of the two worlds

  83. Taking the best of the two worlds GAN objective –

    ensures realistic quality of generated samples Implicit reconstruction term – ensures coverage of the whole dataset
  84. Taking the best of the two worlds GAN objective –

    ensures realistic quality of generated samples Implicit reconstruction term – ensures coverage of the whole dataset
  85. Results