OpenTalks.AI - Dmitry Vetrov, Deep learning & Bayesian approach. Optimizing the network's loss function

OpenTalks.AI
February 19, 2020

Transcript

  1. Deep learning: An Ensemble perspective. Dmitry P. Vetrov, Research professor at HSE, Lab lead in SAIC-Moscow, Head of BayesGroup. http://bayesgroup.ru
  2. Outline • Introduction to Bayesian framework • MCMC and adversarial learning • Understanding loss landscape • Uncertainty estimation study • Deep Ensemble perspective
  3. Bayesian framework • Treats everything as random variables • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
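For reference, the Bayes theorem the framework relies on, stated for weights θ and data D (a standard identity, not copied from the slides):

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta .
$$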
  7. Main approximate inference tools. Variational inference: approximates the intractable true posterior with a tractable variational distribution; typically the KL-divergence is minimized; can be scaled up by stochastic optimization. Markov Chain Monte Carlo: generates samples from the true posterior; no bias even if the true distribution is intractable; quite slow in practice; problematic scaling to large data.
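To make the variational-inference side concrete, here is a minimal hedged sketch (PyTorch; the toy 1-D Gaussian model and all numbers are my own illustrative assumptions, not from the talk) of fitting q(theta) by maximizing the ELBO, which is equivalent to minimizing KL(q || true posterior):

```python
# Sketch: mean-field VI for a toy model  p(theta) = N(0, 1),  x_i | theta ~ N(theta, 1).
# Everything here (model, sizes, learning rate) is an illustrative assumption.
import torch

torch.manual_seed(0)
data = torch.randn(50) + 2.0                      # synthetic observations around 2

# Variational distribution q(theta) = N(mu, sigma^2)
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    opt.zero_grad()
    sigma = log_sigma.exp()
    theta = mu + sigma * torch.randn(64, 1)       # reparameterization trick: samples from q
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).squeeze(-1)
    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(data).sum(-1)
    entropy = torch.distributions.Normal(mu, sigma).entropy().sum()
    elbo = (log_lik + log_prior).mean() + entropy # maximizing ELBO == minimizing KL(q || posterior)
    (-elbo).backward()
    opt.step()

# mu, sigma should approach the analytic posterior (mean ~ 2, std ~ 0.14)
print(mu.item(), log_sigma.exp().item())
```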
  8. MH GAN. The acceptance rate is about 10%. Only unique objects were used for evaluating the metrics.
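The MH GAN slide refers to using the discriminator to run a Metropolis-Hastings chain over generator samples (Turner et al., 2019). A hedged sketch of that acceptance step, with `generator` and `discriminator` as hypothetical callables (the concrete networks are not given in the talk):

```python
# Sketch of an MH-GAN acceptance step. The discriminator output D(x) is read as an
# estimate of p_data(x) / (p_data(x) + p_gen(x)); a proposal x' from the generator
# is accepted with probability min(1, (1/D(x) - 1) / (1/D(x') - 1)).
import numpy as np

rng = np.random.default_rng(0)

def mh_gan_chain(generator, discriminator, n_steps=1000):
    """Independence MH chain over generator samples, scored by the discriminator."""
    x_cur = generator()
    d_cur = float(discriminator(x_cur))
    accepted = 0
    for _ in range(n_steps):
        x_prop = generator()                          # independent proposal from the generator
        d_prop = float(discriminator(x_prop))
        ratio = (1.0 / d_cur - 1.0) / (1.0 / d_prop - 1.0 + 1e-12)
        if rng.uniform() < min(1.0, ratio):
            x_cur, d_cur = x_prop, d_prop
            accepted += 1
    return x_cur, accepted / n_steps                  # the slide reports roughly 10% acceptance
```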
  9. Different learning rates. Li, Wei, Ma. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NeurIPS 2019.
  11. Two kinds of patterns. Easy-to-fit, hard-to-generalize: noisy regularities, easy patterns. Hard-to-fit, easy-to-generalize: low noise, complicated patterns. Main claim: a smaller LR learns (memorizes) hard-to-fit patterns; a larger LR learns easy-to-fit patterns; a larger LR with annealing learns both.
  13. Discussion. Small LR: first memorizes fixed patterns, then learns noisy patterns using less data. Large LR: learns noisy flexible patterns using the full data, unable to memorize objects/patterns. Large LR + annealing: first learns noisy flexible patterns using the full data, then memorizes fixed patterns using less data. Both noisy and fixed patterns are present in real data. A larger LR corresponds to wider local optima (see Khan 2019). E. Khan. Deep learning with Bayesian principles. NeurIPS 2019 tutorial.
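As a concrete, hedged illustration of the "large LR, then anneal" regime described above: a standard step-decay schedule in PyTorch; the stand-in model, the initial LR of 0.1, and the decay epoch are illustrative assumptions, not numbers from the talk.

```python
# Sketch: large initial learning rate, annealed to a small one partway through training.
import torch

model = torch.nn.Linear(10, 2)                                   # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)                # large LR: learns easy/noisy patterns
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30], gamma=0.1)

for epoch in range(60):
    # ... one full training epoch over the data would go here ...
    opt.step()          # placeholder optimizer step so the scheduler call order is valid
    sched.step()        # after epoch 30 the LR drops to 0.01: now memorizes hard-to-fit patterns
```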
  14. Hypothesis (train-loss landscape sketch): starting point; trajectories for a large learning rate, a small learning rate, and annealing to a small learning rate; region of zero train loss.
  15. Intriguing properties of the loss landscape. S. Fort, S. Jastrzebski. Large Scale Structure of Neural Network Loss Landscapes. In NeurIPS 2019.
  16. A phenomenological model to emulate the loss landscape. The toy loss equals the minimal distance from one of the N-dimen…
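A hedged reading of the toy loss described on this slide: in a D-dimensional parameter space, the loss is the minimal distance from the point to one of several N-dimensional affine subspaces. The random-subspace construction below is my own illustrative assumption of that idea:

```python
# Sketch: toy loss = distance from theta to the nearest N-dimensional affine subspace in R^D.
import numpy as np

rng = np.random.default_rng(0)
D, N, n_modes = 100, 20, 5

origins = rng.normal(size=(n_modes, D))                                      # one anchor point per "mode"
bases = [np.linalg.qr(rng.normal(size=(D, N)))[0] for _ in range(n_modes)]   # orthonormal N-dim bases

def toy_loss(theta):
    dists = []
    for o, B in zip(origins, bases):
        r = theta - o
        r_perp = r - B @ (B.T @ r)            # component of r orthogonal to the subspace
        dists.append(np.linalg.norm(r_perp))
    return min(dists)                          # distance to the nearest low-dimensional "mode"

print(toy_loss(rng.normal(size=D)))
```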
  17. Discussion. The phenomenological model reproduces: the mode connectivity effect; circular-cut isotropy; the existence of an intrinsic dimension (should be larger than D-N); injection of noise increases the width of local optima.
  18. Diversity of ensembles. Extensive experiments show: ensembling really improves accuracy and uncertainty; existing variational methods are far behind deep ensembles; the less memory an ensemble requires, the worse its accuracy. S. Fort, H. Hu, B. Lakshminarayanan. Deep Ensembles: A Loss Landscape Perspective. In the BDL workshop at NeurIPS 2019. Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. In NeurIPS 2019. A. Lyzhov, D. Molchanov, A. Ashukha, D. Vetrov. Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning. In ICLR 2020.
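A minimal sketch of the deep-ensemble recipe the cited papers evaluate: train several copies of the same architecture from independent random initializations and average their predictive distributions. `make_model` and `train_one_model` are hypothetical helpers standing in for a full training loop:

```python
# Sketch: deep ensemble = independently trained networks, averaged softmax predictions.
import torch

def ensemble_predict(make_model, train_one_model, x, n_members=5):
    probs = []
    for seed in range(n_members):
        torch.manual_seed(seed)                      # independent initialization per member
        model = make_model()
        train_one_model(model)                       # ordinary SGD training, different seed each time
        model.eval()
        with torch.no_grad():
            probs.append(torch.softmax(model(x), dim=-1))
    return torch.stack(probs).mean(dim=0)            # averaged predictive distribution
```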
  19. Cooling the posterior (plot: cooled posterior vs. true posterior). F. Wenzel, et al. How Good is the Bayes Posterior in Deep Neural Networks Really? https://arxiv.org/abs/200… The cooled posterior is defined by $p_T(\theta \mid D) \propto \exp\left(\frac{1}{T}\log p(\theta \mid D)\right)$, where $T$ is the temperature.
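In code, cooling the (unnormalized) log-posterior just rescales it by 1/T; the `log_prior` and `log_likelihood` callables below are illustrative assumptions:

```python
# Sketch: unnormalized log-density of the cooled posterior, log p_T = (1/T) * log p(theta | D) + const.
def cooled_log_posterior(theta, data, log_prior, log_likelihood, T=0.1):
    """T = 1 recovers the true Bayes posterior; T < 1 sharpens ("cools") it around its modes."""
    return (log_prior(theta) + log_likelihood(theta, data)) / T
```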
  20. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  24. Conclusion • Too many various topics to draw all conclusions… • Deep MCMC techniques can become a new probabilistic tool • Loss landscapes (and the corresponding posteriors) are extremely complicated and require further study • Ensembles are highly underestimated by the community • These and many other topics at Deep|Bayes 2020
  25. GAN

  28. VAE

  29. Pros and cons. VAE: reconstruction term; learned latent representations; unrealistic explicit likelihood of the decoder. GAN: more realistic implicit likelihood; no covering of the training data.
  30. Taking the best of the two worlds. GAN objective: ensures realistic quality of generated samples. Implicit reconstruction term: ensures coverage of the whole dataset.
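A hedged sketch of how the two terms named on this slide could be combined for the generator/decoder update: an adversarial loss for realism plus a reconstruction loss for dataset coverage. This is a generic VAE-GAN-style combination under assumed `encoder`, `decoder`, and `discriminator` modules, not the specific model from the talk:

```python
# Sketch: generator/decoder loss = GAN realism term + weighted reconstruction (coverage) term.
import torch
import torch.nn.functional as F

def combined_generator_loss(encoder, decoder, discriminator, x, lambda_rec=10.0):
    z = encoder(x)                                       # latent code for a real batch
    x_rec = decoder(z)                                   # reconstruction of the input
    rec_loss = F.l1_loss(x_rec, x)                       # coverage: every training point gets reconstructed
    fake_logits = discriminator(x_rec)
    gan_loss = F.binary_cross_entropy_with_logits(       # non-saturating adversarial term: realism
        fake_logits, torch.ones_like(fake_logits))
    return gan_loss + lambda_rec * rec_loss
```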