OpenTalks.AI - Dmitry Vetrov, Deep learning & Bayesian approach. Optimizing the network's loss function

OpenTalks.AI
February 19, 2020


Transcript

  1. Deep learning: An Ensemble perspective. Dmitry P. Vetrov, research professor at HSE, lab lead at SAIC-Moscow, head of BayesGroup, http://bayesgroup.ru
  2. Outline • Introduction to Bayesian framework • MCMC and adversarial learning • Understanding loss landscape • Uncertainty estimation study • Deep Ensemble perspective
  3. Intro to Bayesian framework

  4. Conditional and marginal distributions

  5. Bayesian Framework

  6. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  7. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  8. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
  9. Bayesian framework • Treats everything as a random variable • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem
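
For reference, Bayes' theorem as the bullets use it, written for network weights $\theta$ and training data $D$ (the notation is mine, not copied from the slides):

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
% Prediction for a new input x^* averages over the posterior:
p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid D)\, d\theta
```
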
  10. Frequentist vs. Bayesian frameworks

  11. What is machine learning?

  12. Machine learning from Bayesian point of view

  13. Poor man’s Bayes

  14. MCMC and adversarial learning

  15. Modeling of probabilistic distributions

  16. Modeling of probabilistic distributions

  17. Main approximate inference tools. Variational inference: • Approximates the intractable true posterior with a tractable variational distribution • Typically the KL-divergence is minimized • Can be scaled up by stochastic optimization. Markov Chain Monte Carlo: • Generates samples from the true posterior • No bias even if the true distribution is intractable • Quite slow in practice • Problematic scaling to large data
  18. Metropolis-Hastings algorithm

  19. Metropolis-Hastings algorithm

  20. Metropolis-Hastings algorithm
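
Slides 18-20 carry only the algorithm's name, so here is a minimal random-walk Metropolis-Hastings sketch (the textbook algorithm, not the specific variant developed in the talk); the Gaussian proposal and step size are illustrative assumptions:

```python
import numpy as np

def metropolis_hastings(log_prob, x0, n_steps=10_000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for an unnormalized log density."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal(x.shape)  # symmetric proposal
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p(x') / p(x)); the symmetric proposal cancels out
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        samples.append(x.copy())
    return np.array(samples), accepted / n_steps

# Example: sampling a 2-D standard Gaussian
samples, acc_rate = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), x0=np.zeros(2))
```

The acceptance rate returned here is the quantity that slides 21-25 propose to maximize, presumably by learning the proposal distribution.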

  21. Acceptance rate maximization

  22. Acceptance rate maximization

  23. Acceptance rate maximization

  24. Acceptance rate maximization

  25. Acceptance rate maximization

  26. Results on toy problem

  27. Implicit setting

  28. Implicit setting

  29. Reduction to adversarial training

  30. Reduction to adversarial training

  31. Reduction to adversarial training

  32. Reduction to adversarial training

  33. Reduction to adversarial training

  34. MH GAN

  35. MH GAN

  36. MH GAN. Acceptance rate is about 10%. Only unique objects were used for evaluating the metrics.
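
Slides 34-36 refer to MH GAN; below is a sketch of the discriminator-driven acceptance rule from Turner et al. (2019), "Metropolis-Hastings Generative Adversarial Networks", which is my best guess at the rule behind the ~10% acceptance rate quoted above (details in the talk may differ):

```python
import numpy as np

def mh_gan_select(d_scores, rng=None):
    """Run a Metropolis-Hastings chain over generator samples using (calibrated)
    discriminator scores d(x) in (0, 1); returns the index of the selected sample.
    Acceptance ratio as in Turner et al. (2019): (1/d(x_k) - 1) / (1/d(x') - 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    k = 0  # the paper initializes the chain from a real sample; here we start at sample 0
    for i in range(1, len(d_scores)):
        ratio = (1.0 / d_scores[k] - 1.0) / (1.0 / d_scores[i] - 1.0)
        if rng.uniform() < min(1.0, ratio):
            k = i  # accept the proposed generator sample
    return k
```
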
  37. Understanding loss landscape

  38. Different learning rates. Y. Li, C. Wei, T. Ma. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NeurIPS 2019.
  39. Two kinds of patterns. Easy-to-fit, hard-to-generalize: noisy regularities, easy patterns. Hard-to-fit, easy-to-generalize: low noise, complicated patterns.
  40. Two kinds of patterns. Easy-to-fit, hard-to-generalize: noisy regularities, easy patterns. Hard-to-fit, easy-to-generalize: low noise, complicated patterns. Main claim: - a smaller LR learns (memorizes) hard-to-fit patterns - a larger LR learns easy-to-fit patterns - a larger LR with annealing learns both.
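
To make the three regimes concrete, here is a minimal PyTorch-style sketch of the schedules being compared (the learning-rate values and the annealing milestone are illustrative assumptions, not the ones used in the talk):

```python
import torch

def make_optimizer(model, regime: str):
    """Illustrative versions of the three learning-rate regimes discussed on the slides."""
    if regime == "small":
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        sched = None
    elif regime == "large":
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
        sched = None
    elif regime == "large+anneal":
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
        # anneal to a small learning rate partway through training
        sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50], gamma=0.02)
    else:
        raise ValueError(regime)
    return opt, sched
```
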
  41. Toy experiment

  42. Toy experiment

  43. Discussion. Small LR: first memorizes fixed patterns, then learns noisy patterns using less data. Large LR: learns noisy flexible patterns using the full data; unable to memorize objects/patterns. Large LR + annealing: first learns noisy flexible patterns using the full data, then memorizes fixed patterns using less data.
  44. Discussion. Small LR: first memorizes fixed patterns, then learns noisy patterns using less data. Large LR: learns noisy flexible patterns using the full data; unable to memorize objects/patterns. Large LR + annealing: first learns noisy flexible patterns using the full data, then memorizes fixed patterns using less data. Both noisy and fixed patterns are present in real data. A larger LR corresponds to wider local optima (see Khan 2019). E. Khan. Deep learning with Bayesian principles. NeurIPS 2019 tutorial.
  45. Hypothesis (figure): train loss landscape with the zero-train-loss region, the starting point, and trajectories for a large learning rate, a small learning rate, and annealing to a small learning rate.
  46. 2-dimensional slice. https://www.youtube.com/watch?v=dqX2LBcp5Hs T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A. Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. In NeurIPS 2018.
  47. Intriguing properties of the loss landscape. S. Fort, S. Jastrzebski. Large Scale Structure of Neural Network Loss Landscapes. In NeurIPS 2019.
  48. Phenomenological model: N-dimensional manifolds in the D-dimensional weight space are used to emulate the loss landscape. The toy loss equals the minimal distance to one of the N-dimensional manifolds.
  49. Discussion. The phenomenological model reproduces: - the mode connectivity effect - circular-cut isotropy - the existence of an intrinsic dimension (should be larger than D-N) - injection of noise increases the width of local optima.
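
A sketch of such a toy loss under one concrete reading of the slides: the low-loss set is a union of random N-dimensional affine subspaces embedded in the D-dimensional weight space, and the loss at a point is its distance to the nearest subspace (the construction in Fort & Jastrzebski may differ in details):

```python
import numpy as np

def toy_loss(w, centers, bases):
    """Minimal distance from a weight vector w (shape [D]) to a union of
    N-dimensional affine subspaces {c + B @ z}; each basis B has shape [D, N]
    with orthonormal columns."""
    dists = []
    for c, B in zip(centers, bases):
        r = w - c
        r_inside = B @ (B.T @ r)          # component lying inside the subspace
        dists.append(np.linalg.norm(r - r_inside))
    return min(dists)

# Illustrative setup: 3 random subspaces of dimension N = 5 in D = 100
rng = np.random.default_rng(0)
D, N = 100, 5
centers = [rng.standard_normal(D) for _ in range(3)]
bases = [np.linalg.qr(rng.standard_normal((D, N)))[0] for _ in range(3)]
print(toy_loss(rng.standard_normal(D), centers, bases))
```

With this construction a random search subspace generically needs dimension at least D-N to hit an N-dimensional solution manifold, which matches the intrinsic-dimension remark on the slide.
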
  50. Weight averaging Weight averaging helps Weight averaging does not help
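
Weight averaging here presumably means averaging the weights of checkpoints that lie in one low-loss region (in the spirit of stochastic weight averaging); a minimal sketch, assuming all checkpoints share the same architecture:

```python
import copy
import torch

def average_weights(models):
    """Average the parameters of several checkpoints of the same architecture.
    Note: batch-norm running statistics should be recomputed on training data
    after averaging."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for p_avg, *ps in zip(avg.parameters(), *(m.parameters() for m in models)):
            p_avg.copy_(torch.stack([p.detach() for p in ps]).mean(dim=0))
    return avg
```

Averaging helps when the checkpoints sit in the same basin, so the average stays in a low-loss region, and fails when they come from different modes; that is one way to read the two panels on the slide.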

  51. Fighting the overconfidence of DNNs

  52. Uncertainty estimation

  53. Non-Bayesian way: deep ensembles
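
The non-Bayesian recipe named here is the standard one: train several networks from different random initializations and average their predictive distributions; a minimal sketch:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x):
    """Deep-ensemble prediction: average the class probabilities of
    independently trained networks on a batch x."""
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```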

  54. Bayesian way: inferring posterior

  55. Exponential number of symmetries

  56. Estimation metrics

  57. Estimation metrics

  58. Temperature scaling
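
Temperature scaling (Guo et al., 2017) calibrates a trained classifier by fitting a single scalar T on held-out logits and predicting with softmax(logits / T); a minimal sketch:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, n_iters=200):
    """Fit the temperature T by minimizing NLL on a validation set."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=n_iters)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```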

  59. Experiment design Cover multiple modes Memory-efficient Cover single mode

  60. Deep ensemble equivalent

  61. Deep ensemble equivalent
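
The deep ensemble equivalent (DEE) of Ashukha et al. reports, for any ensembling method, how many independently trained networks a plain deep ensemble would need to reach the same score. A sketch of the interpolation, assuming a "higher is better" metric (such as calibrated log-likelihood) that grows with ensemble size:

```python
import numpy as np

def deep_ensemble_equivalent(method_score, de_sizes, de_scores):
    """Deep Ensemble Equivalent: number of independently trained networks whose
    ensemble matches method_score, found by piecewise-linear interpolation of
    the deep-ensemble score curve (de_scores must increase with de_sizes)."""
    de_sizes, de_scores = np.asarray(de_sizes, float), np.asarray(de_scores, float)
    if method_score <= de_scores[0]:
        return 1.0                       # no better than a single network
    if method_score >= de_scores[-1]:
        return float(de_sizes[-1])       # saturated at the largest measured DE
    return float(np.interp(method_score, de_scores, de_sizes))

# Example with made-up log-likelihoods: the method lands between the 2- and 3-network DE
print(deep_ensemble_equivalent(-0.31, [1, 2, 3, 5], [-0.35, -0.32, -0.305, -0.30]))
```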

  62. Test-time data augmentation. Data augmentation surprisingly helps almost all ensembling tools.
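
Test-time augmentation averages a model's predictions over several randomly augmented copies of each input; a minimal sketch, where `augment` stands for any random augmentation transform (an assumption for illustration, not an API from the talk):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tta_predict(model, x, augment, n_samples=10):
    """Average class probabilities over random test-time augmentations of a batch x."""
    probs = [F.softmax(model(augment(x)), dim=-1) for _ in range(n_samples)]
    return torch.stack(probs).mean(dim=0)
```
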
  63. Results

  64. Results Explore different modes Explore single mode Memory-efficient

  65. Deep Ensemble perspective

  66. Diversity of ensembles. Extensive experiments show: - ensembling really improves accuracy and uncertainty - existing variational methods are far behind deep ensembles - the less memory an ensemble requires, the worse its accuracy. S. Fort, H. Hu, B. Lakshminarayanan. Deep Ensembles: A Loss Landscape Perspective. In the BDL workshop at NeurIPS 2019. Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In NeurIPS 2019. A. Lyzhov, D. Molchanov, A. Ashukha, D. Vetrov. Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning. In ICLR 2020.
  67. Cooling the posterior (figure compares the cooled posterior with the true posterior). The tempered ("cooled") posterior is $p_T(\theta \mid D) \propto \exp\!\left(\tfrac{1}{T}\log p(\theta \mid D)\right)$. F. Wenzel et al. How Good is the Bayes Posterior in Deep Neural Networks Really? https://arxiv.org/abs/2002.02405
  68. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  69. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  70. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  71. Deep Ensembles. It may appear that DE approximate the cooled posterior rather than the true one. What is the maximal possible performance that can be achieved by using an infinitely large DE of …
  72. Deep ensembles vs wide DNNs

  73. Conclusion • Too many different topics to draw all conclusions… • Deep MCMC techniques can become a new probabilistic tool • Loss landscapes (and the corresponding posteriors) are extremely complicated and require further study • Ensembles are highly underestimated by the community • These and many other topics at Deep|Bayes 2020
  74. KL-divergence

  75. KL-divergence Mode Collapsing!

  76. KL-divergence Low-density covering!
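
The two failure modes on slides 75-76 follow from the asymmetry of the KL-divergence; in the standard reading, the reverse KL used when fitting q to p is mode-seeking, while the forward KL is mass-covering:

```latex
% Reverse KL (mode-seeking): q avoids regions where p is near zero -> mode collapsing
\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q(x)}\!\left[\log \frac{q(x)}{p(x)}\right]
% Forward KL (mass-covering): q must put mass wherever p does -> low-density covering
\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]
```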

  77. GAN

  78. GAN

  79. GAN

  80. VAE

  81. Pros and cons. VAE: • Reconstruction term • Learned latent representations • Unrealistic explicit likelihood of the decoder. GAN: • More realistic implicit likelihood • No coverage of the training data.
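
The "reconstruction term" and the explicit decoder likelihood in the VAE column correspond to the negative ELBO; a minimal sketch for a Gaussian-latent VAE with a Bernoulli decoder (the specific likelihood choice is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def vae_negative_elbo(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction term (Bernoulli likelihood, inputs in [0, 1])
    plus KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```
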
  82. Taking the best of the two worlds

  83. Taking the best of the two worlds GAN objective –

    ensures realistic quality of generated samples Implicit reconstruction term – ensures coverage of the whole dataset
  84. Taking the best of the two worlds GAN objective –

    ensures realistic quality of generated samples Implicit reconstruction term – ensures coverage of the whole dataset
  85. Results