
ICML 2021 Paper Reading Group (ICML2021論文読み会)

Sean Saito
August 18, 2021


Transcript

  1. Sean Saito ・ ⿑藤初雲
     Data Scientist, BCG GAMMA ・ ICML 2021 Paper Reading Group
     Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
  2. Summary

  3. Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
     Authors: Will Grathwohl¹ ², Kevin Swersky², Milad Hashemi², David Duvenaud¹ ², Chris J. Maddison¹
     Affiliations: ¹ University of Toronto, Vector Institute; ² Google Brain
     Awards: ICML 2021 Outstanding Paper Honorable Mention
     Keywords: generative modelling, EBM, MCMC, Gibbs sampling, Metropolis-Hastings, restricted Boltzmann machines
  4. Key takeaways
     1. Energy-based models are making a comeback in generative modelling.
     2. This paper proposes Gibbs-with-Gradients, a general and scalable sampling method for discrete distributions.
  5. Background

  6. Energy-based models are making a comeback
     Energy-based models (EBMs): a class of generative models, rooted in statistical physics, that model data by assigning a scalar "energy" value to each data point.
     • 1980s: Gibbs sampling, Hamiltonian Monte Carlo, Restricted Boltzmann Machines
     • Early-to-mid 2000s: autoencoders, deep belief networks (Hinton et al., 2006); afterwards GANs, VAEs, and normalizing flows dominated generative modelling
     • 2020~: deep EBMs with gradient-based MCMC (Du et al., 2021), score matching (Song et al., 2021)
     https://towardsdatascience.com/gibbs-sampling-8e4844560ae5
     https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
     http://www.cs.toronto.edu/~hinton/science.pdf
  7. Sampling is key in training EBMs
     Probability model of some distribution, parameterized by θ:
     p_θ(x) = e^{−E_θ(x)} / Z(θ)
     • E_θ: X → ℝ is the energy function, which can be defined flexibly (e.g. a neural network). In this paper, x ∈ {0, 1}^D (binary/one-hot).
     • Z(θ) is the partition function, typically expressed as ∫ e^{−E_θ(x)} dx.
     Optimization is formulated by maximizing the log-likelihood (take the log):
     log p_θ(x) = −E_θ(x) − log ∫ e^{−E_θ(x')} dx'
     The partition term is intractable, so we take the gradient and train with an unbiased gradient estimator:
     ∇_θ log p_θ(x) = −∇_θ E_θ(x) + E_{x'∼p_θ}[∇_θ E_θ(x')]
     The expectation is estimated with samples drawn via Markov chain Monte Carlo (MCMC), the main focus of the paper. Training can use stochastic gradient ascent, Langevin dynamics, or persistent contrastive divergence.
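As a sanity check on this estimator, here is a minimal sketch with a made-up energy E_θ(x) = −θ·x over independent binary bits (a toy model, not one from the paper). For this choice the negative phase E_{x∼p_θ}[∇_θ E_θ(x)] is available in closed form as −sigmoid(θ), so the gradient can be computed exactly without MCMC:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik_grad(theta, data):
    # Toy energy E_theta(x) = -theta . x over independent bits, so
    # grad_theta E = -x and the estimator becomes
    #   grad log p = E_data[x] - E_model[x] = E_data[x] - sigmoid(theta)
    # (the negative phase is exact here; real EBMs estimate it with MCMC samples)
    return data.mean(axis=0) - sigmoid(theta)

data = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [1, 1, 1],
                 [1, 0, 1]], dtype=float)
print(loglik_grad(np.zeros(3), data))  # [0.5 0.  0. ]
```

At θ = 0 the model mean is 0.5 per bit, so the gradient is positive only for the first bit, which is always on in the data.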
  8. Gibbs-with-Gradients (GWG)

  9. Motivation
     Ø We want to apply advanced EBM techniques, such as gradient-based MCMC, to discrete data (tabular, text, proteins, etc.).
     Ø But gradient-based MCMC assumes continuous data, and other MCMC methods for discrete data are inefficient and do not scale.
     Ø Moreover, current "efficient" discrete sampling methods assume some structure in the data, e.g. block independence (block Gibbs), and hence are not general.
     We want a general yet scalable way of training EBMs on discrete data.
  10. Core intuition
     Common discrete distributions have differentiable energy functions; this can be exploited to speed up sampling.
     Discrete distributions → continuous energy functions
  11. Gibbs-With-Gradients
     EBM training alternates with MCMC sampling; one GWG step:
     1. Compute the energy of the current sample x (e.g. the output of a neural network).
     2. Compute gradients and approximate the energies of the neighbours x' with a first-order Taylor series to obtain likelihood ratios (in other words, very fast).
     3. Apply a softmax to get the proposal distribution and sample the dimension i ∈ D at which to propose a change.
     4. (Metropolis-Hastings step) Accept x' as the next sample with probability min(e^{E_θ(x) − E_θ(x')} · q(x | x') / q(x' | x), 1).
     Repeat from 1.
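The four steps above can be sketched for binary data as follows. The Ising-like energy and its parameters (W, b) are made-up toy choices, not the paper's models, and the gradient is computed analytically rather than by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x, W, b):
    # toy Ising-like energy on binary x: E(x) = -x^T W x - b^T x
    return -x @ W @ x - b @ x

def grad_energy(x, W, b):
    # analytic gradient of E with respect to x
    return -(W + W.T) @ x - b

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gwg_step(x, W, b, tau=2.0):
    # steps 1-2: score every single-bit flip with a first-order Taylor
    # estimate of E(x) - E(x'), from a single gradient evaluation
    d = (2 * x - 1) * grad_energy(x, W, b)
    # step 3: tempered softmax proposal over which dimension to flip
    q_fwd = softmax(d / tau)
    i = rng.choice(len(x), p=q_fwd)
    x_new = x.copy()
    x_new[i] = 1 - x_new[i]
    # reverse proposal q(x | x') for the Metropolis-Hastings correction
    d_rev = (2 * x_new - 1) * grad_energy(x_new, W, b)
    q_rev = softmax(d_rev / tau)
    # step 4: accept with probability min(e^{E(x)-E(x')} q(x|x')/q(x'|x), 1)
    log_alpha = (energy(x, W, b) - energy(x_new, W, b)
                 + np.log(q_rev[i]) - np.log(q_fwd[i]))
    return x_new if np.log(rng.random()) < log_alpha else x

D = 8
W = np.triu(rng.normal(size=(D, D)), 1)
W = W + W.T  # symmetric coupling matrix, zero diagonal
b = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)
for _ in range(200):
    x = gwg_step(x, W, b)
```

Each step proposes exactly one bit flip and applies the MH correction, so the chain leaves p_θ invariant regardless of how rough the Taylor estimate is.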
  12. Finding balance in Metropolis-Hastings proposals
     Metropolis-Hastings proposal: decide whether to take the generated sample x' as the next state of the MCMC chain with probability
     min(e^{E_θ(x) − E_θ(x')} · q(x | x') / q(x' | x), 1)
     i.e. the likelihood ratio times the proposal-distribution ratio (backward vs. forward), where q(x' | x) = Σ_i q(x' | x, i) q(i) and i indexes the D dimensions.
     We want this probability to be as close to 1 as possible for efficient sampling (small probability → rejected proposal → wasted computation). A good proposal balances the forward proposal q(x' | x) against the reverse proposal q(x | x').
     Strategy: use locally-informed proposals
     q_τ(x' | x) ∝ e^{(E_θ(x) − E_θ(x')) / τ} · 1{x' ∈ H(x)}
     • H(x): Hamming ball of size 1 around x (a single bit flip, since the data are binary)
     • τ: temperature parameter of the tempered softmax; τ = 2 gives the most balanced proposals (see the paper for details!)
     Problem: poor scalability, because evaluating q_τ(x' | x) is O(D), where D is the number of dimensions; e.g. MNIST → O(28 × 28) = O(784). Can we do this in O(1)?
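To make the O(D) bottleneck concrete, the locally-informed proposal can be computed by brute force, flipping each bit and re-evaluating the energy. The toy energy below is a hypothetical example; note the loop costs D + 1 energy evaluations per step, which is exactly what GWG avoids:

```python
import numpy as np

def energy(x, W, b):
    # hypothetical Ising-like toy energy on binary x
    return -x @ W @ x - b @ x

def exact_proposal(x, W, b, tau=2.0):
    # brute-force locally-informed proposal q_tau(x' | x):
    # score every single-bit-flip neighbour by (E(x) - E(x')) / tau,
    # then softmax -> D + 1 energy evaluations, i.e. O(D)
    D = len(x)
    e_x = energy(x, W, b)
    scores = np.empty(D)
    for i in range(D):
        x_flip = x.copy()
        x_flip[i] = 1 - x_flip[i]
        scores[i] = (e_x - energy(x_flip, W, b)) / tau
    scores -= scores.max()  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

D = 5
W = np.zeros((D, D))
W[0, 1] = W[1, 0] = 0.7  # one toy pairwise coupling
b = np.linspace(-1.0, 1.0, D)
p = exact_proposal(np.array([1.0, 0.0, 1.0, 0.0, 1.0]), W, b)
print(round(float(p.sum()), 6))  # 1.0, a distribution over which bit to flip
```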
  13. Exploit gradients of energy functions to speed up proposals
     Energy functions are continuous functions over discrete input spaces, and they are differentiable with respect to x, i.e. we can obtain ∇_x E_θ(x). Gibbs-with-Gradients exploits this structure to estimate the likelihood ratios with Taylor-series approximations:
     q_GWG(x' | x) ∝ e^{d̃(x) / 2} · 1{x' ∈ H(x)}
     • Binary data: d̃(x) = −(2x − 1) ⊙ ∇_x f(x), where f = −E_θ is the unnormalized log-probability
     • Categorical data: d̃(x)_{ij} = ∇_x f(x)_{ij} − x_i^⊤ ∇_x f(x)_i
     Proposing a move means choosing which dimension i to change, so q_GWG(x' | x) = q(i | x) = Categorical(Softmax(d̃(x) / 2)).
     d̃(x) is computed for all dimensions in parallel and requires only two function calls (one forward and one backward pass), allowing for a complexity of O(1).
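The Taylor estimate can be sanity-checked against brute-force bit flips. The linear energy below is a toy case chosen so the check is exact; for general energies the estimate is only approximate (which the Metropolis-Hastings step corrects):

```python
import numpy as np

def energy(x, theta):
    # toy linear energy: E(x) = -theta . x
    return -theta @ x

def grad_energy(x, theta):
    # gradient of E with respect to x
    return -theta

theta = np.array([0.5, -1.0, 2.0, 0.0])
x = np.array([1.0, 0.0, 1.0, 1.0])

# gradient-based estimate of E(x) - E(flip_i(x)) for every i at once
d_tilde = (2 * x - 1) * grad_energy(x, theta)

# brute force: flip each bit and take the exact energy difference
d_exact = np.empty_like(x)
for i in range(len(x)):
    x_flip = x.copy()
    x_flip[i] = 1 - x_flip[i]
    d_exact[i] = energy(x, theta) - energy(x_flip, theta)

print(np.allclose(d_tilde, d_exact))  # True: exact for a linear energy
```

The gradient version needs one forward and one backward pass regardless of D, while the brute-force version needs D + 1 energy evaluations.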
  14. Experiments

  15. How much does GWG help train EBMs?
     GWG is more scalable with increasing dimensionality; it converges faster and achieves a higher effective sample size.
  16. Conclusion

  17. Conclusion
     Additional details (paper):
     • Theoretical analysis: proof that GWG is near-optimal among local proposals; analysis of efficiency; relationship to continuous relaxations
     • Other experiments: factorial hidden Markov models (time series), lattice Ising models, protein-coupling prediction
     Next steps:
     • Expansion: widen the proposal window of the sampler (i.e. modify more than one variable at a time); apply to larger categorical spaces (e.g. text)
     • Generalization: apply recent deep EBM training methods, such as score matching and Stein discrepancies, to discrete data
  18. Thanks, and let's connect ・ feel free to get in touch! (気軽にご連絡ください)
     linkedin.com/in/seansaito ・ @saitonian
  19. Appendix

  20. How fast is GWG in MCMC?
     Comparing samples from GWG with those from standard Gibbs sampling (based on a ground-truth connectivity matrix):
     • GWG finds better samples (about 100× better in terms of log-RMSE)
     • GWG is much faster (at least 20× fewer sampling steps)
     GWG provides better MCMC performance, which leads to better parameter inference for EBMs/generative models.
  21. How does GWG compare with other deep generative models?
     GWG is competitive with, or outperforms, them on both binary and categorical data.
  22. References
     • Du, Yilun, et al. "Improved contrastive divergence training of energy-based models." arXiv preprint arXiv:2012.01316 (2020).
     • Grathwohl, Will, et al. "Oops I Took A Gradient: Scalable Sampling for Discrete Distributions." arXiv preprint arXiv:2102.04509 (2021).
     • Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
     • Larochelle, Hugo, and Yoshua Bengio. "Classification using discriminative restricted Boltzmann machines." Proceedings of the 25th International Conference on Machine Learning (2008).
     • Song, Yang, and Diederik P. Kingma. "How to train your energy-based models." arXiv preprint arXiv:2101.03288 (2021).