ICML 2021 Paper Reading Group

Sean Saito (齊藤初雲), Data Scientist, BCG GAMMA. ICML 2021 Paper Reading Group
Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

Summary

Paper: Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

Authors
• Will Grathwohl (1, 2)
• Kevin Swersky (2)
• Milad Hashemi (2)
• David Duvenaud (1, 2)
• Chris J. Maddison (1)

Affiliation
• (1) University of Toronto, Vector Institute
• (2) Google Brain

Awards
• ICML 2021 Outstanding Paper Honorable Mention

Keywords
• Generative modelling, EBM, MCMC, Gibbs Sampling, Metropolis-Hastings, Restricted Boltzmann Machines

Key takeaways
1. Energy-based models are making a comeback in generative modelling.
2. This paper proposes Gibbs-with-Gradients, a general and scalable sampling method for discrete distributions.

Background

Energy-based models are making a comeback

Energy-based models (EBM): a class of generative models, rooted in statistical physics, that model data by assigning a scalar "energy" value to each data point.

Timeline
• 1980s onward: Gibbs sampling, Hamiltonian Monte Carlo, Restricted Boltzmann Machines
• Early to mid 2000s: Autoencoders, Deep Belief Networks
• 2020 onward: Deep EBM with gradient-based MCMC, Score Matching
• Other generative models shown on the timeline: GAN, VAE, NF

Works cited on the slide: Hinton et al. (2006), Du et al. (2021), Song et al. (2021)

Image sources
• https://towardsdatascience.com/gibbs-sampling-8e4844560ae5
• https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
• http://www.cs.toronto.edu/~hinton/science.pdf

Sampling is key in training EBMs

Probability model of some distribution, parameterized by $\theta$:

$$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}$$

• Energy function $E_\theta : X \to \mathbb{R}$ (this can be defined flexibly, e.g. a neural network). In this paper, $x$ is binary or one-hot, i.e. its entries are in $\{0, 1\}$.
• Partition function, typically expressed as $Z(\theta) = \int e^{-E_\theta(x)} \, dx$ (a sum over all states when $x$ is discrete).

Take the log. Optimization is formulated as maximizing the log likelihood:

$$\log p_\theta(x) = -E_\theta(x) - \log \int e^{-E_\theta(x)} \, dx$$

The second term is intractable.

Take the gradient. The model is trained with gradients from an unbiased gradient estimator:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) - \mathbb{E}_{x' \sim p_\theta(x')}\left[ -\nabla_\theta E_\theta(x') \right]$$

The samples $x'$ in the expectation are drawn via Markov chain Monte Carlo (MCMC). This sampling step is the main focus of the paper.

Can be trained via:
• Stochastic Gradient Ascent
• Langevin dynamics
• Persistent CD
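The gradient identity on this slide can be checked numerically on a model small enough to enumerate. The sketch below is only an illustration: the linear energy `E_theta(x) = -sum_i theta_i * x_i`, the parameter values, and all function names are assumptions, not something from the paper or the talk. It computes the expectation exactly by summing over all 8 binary states and compares the result with a finite-difference gradient of the exact log likelihood.

```python
import itertools
import math

def energy(theta, x):
    """Toy energy E_theta(x) = -sum_i theta_i * x_i (assumed form, for illustration)."""
    return -sum(t * xi for t, xi in zip(theta, x))

def grad_energy(theta, x):
    """Gradient of the toy energy with respect to theta: dE/dtheta_i = -x_i."""
    return [-xi for xi in x]

def log_prob(theta, x):
    """Exact log p_theta(x) = -E_theta(x) - log Z(theta), with Z summed over all states."""
    states = itertools.product([0, 1], repeat=len(theta))
    log_z = math.log(sum(math.exp(-energy(theta, s)) for s in states))
    return -energy(theta, x) - log_z

def grad_log_prob(theta, x):
    """-grad E(x) - E_{x' ~ p_theta}[-grad E(x')], with the expectation computed exactly."""
    states = list(itertools.product([0, 1], repeat=len(theta)))
    probs = [math.exp(log_prob(theta, s)) for s in states]
    expected_neg_grad = [
        sum(p * -grad_energy(theta, s)[i] for p, s in zip(probs, states))
        for i in range(len(theta))
    ]
    neg_grad_data = [-g for g in grad_energy(theta, x)]
    return [a - b for a, b in zip(neg_grad_data, expected_neg_grad)]

def finite_difference(theta, x, eps=1e-6):
    """Central finite-difference gradient of the exact log likelihood."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((log_prob(up, x) - log_prob(down, x)) / (2 * eps))
    return grads

theta = [0.5, -1.0, 2.0]
x = (1, 0, 1)
analytic = grad_log_prob(theta, x)
numeric = finite_difference(theta, x)
print(analytic)
print(numeric)
```

With realistic dimensions the sum over all states is not available, which is why the expectation has to be estimated from MCMC samples instead.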

Gibbs-with-Gradients (GWG)

Motivation
• We want to apply advanced EBM techniques, such as gradient-based MCMC, to discrete data (tabular, text, proteins, etc.).
• But gradient-based MCMC assumes continuous data. Other MCMC methods for discrete data are inefficient and do not scale.
• Moreover, current "efficient" discrete sampling methods assume some structure in the data, e.g. block independence (Block Gibbs), and hence are not general.

We want a general yet scalable way of training EBMs on discrete data.

Core intuition
Common discrete distributions have differentiable energy functions. This can be exploited to speed up sampling.
[Figure: discrete distributions shown next to their continuous energy functions]

Gibbs-With-Gradients
1. Calculate the energy of the current sample $x$ (e.g. the output of a neural network).
2. Compute gradients and approximate the energies of the neighbors $x'$ with a Taylor series to get likelihood ratios. No separate energy evaluation is needed for each neighbor, so this is very fast.
3. Apply a softmax to get the proposal distribution, and sample the dimension $i \in \{1, \dots, D\}$ for which to propose a change (in the slide's example the proposal probabilities are 80%, 10%, 5%, 5%).
4. (Metropolis-Hastings step) Accept $x'$ as the next sample with probability

$$\min\left( e^{E_\theta(x) - E_\theta(x')} \, \frac{q(x \mid x')}{q(x' \mid x)},\ 1 \right)$$

where $e^{E_\theta(x) - E_\theta(x')} = p_\theta(x') / p_\theta(x)$ for $p_\theta(x) \propto e^{-E_\theta(x)}$.

Repeat from 1.

[Diagram: steps 1 to 4 form the MCMC sampling loop inside EBM training]
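The four steps above can be sketched for binary data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic energy, the values of `W` and `b`, and every function name are hypothetical, and the gradient is written by hand rather than obtained from automatic differentiation. The sign convention is $p(x) \propto e^{-E(x)}$. The demo runs the chain on a 3-bit model and prints the exact and empirical state probabilities side by side.

```python
import itertools
import math
import random

def energy(x, W, b):
    """Toy quadratic energy E(x) = -(0.5 * x^T W x + b^T x), W symmetric (assumed)."""
    D = len(x)
    quad = sum(x[i] * W[i][j] * x[j] for i in range(D) for j in range(D))
    return -(0.5 * quad + sum(b[i] * x[i] for i in range(D)))

def grad_energy(x, W, b):
    """Gradient of the toy energy with respect to x: -(W x + b)."""
    D = len(x)
    return [-(sum(W[i][j] * x[j] for j in range(D)) + b[i]) for i in range(D)]

def flip_probs(x, W, b):
    """Steps 2-3: first-order estimates of E(x) - E(x') per bit flip, then softmax(d / 2)."""
    g = grad_energy(x, W, b)
    d = [(2 * x[i] - 1) * g[i] for i in range(len(x))]
    m = max(d)  # subtract the max for numerical stability
    w = [math.exp((v - m) / 2.0) for v in d]
    z = sum(w)
    return [v / z for v in w]

def gwg_step(x, W, b, rng):
    """One proposal plus the Metropolis-Hastings accept/reject (step 4)."""
    q_fwd = flip_probs(x, W, b)
    i = rng.choices(range(len(x)), weights=q_fwd)[0]
    x_new = list(x)
    x_new[i] = 1 - x_new[i]
    q_bwd = flip_probs(x_new, W, b)  # reverse move flips the same bit back
    log_ratio = (energy(x, W, b) - energy(x_new, W, b)
                 + math.log(q_bwd[i]) - math.log(q_fwd[i]))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x_new
    return list(x)

W = [[0.0, 1.0, -0.5], [1.0, 0.0, 0.8], [-0.5, 0.8, 0.0]]
b = [0.2, -0.3, 0.1]
rng = random.Random(0)
x = [0, 0, 0]
counts = {}
n_steps = 50000
for _ in range(n_steps):
    x = gwg_step(x, W, b, rng)
    counts[tuple(x)] = counts.get(tuple(x), 0) + 1

# Compare with the exact distribution, which is available because D = 3.
states = list(itertools.product([0, 1], repeat=3))
weights = [math.exp(-energy(s, W, b)) for s in states]
z = sum(weights)
for s, w in zip(states, weights):
    print(s, round(w / z, 3), round(counts.get(s, 0) / n_steps, 3))
```

Each step uses one gradient at the current state and one at the proposed state, regardless of the number of dimensions.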

Finding balance in Metropolis-Hastings proposals

Metropolis-Hastings proposal
Decide whether to take the next generated sample $x'$ in the MCMC chain, with probability

$$\min\left( e^{E_\theta(x) - E_\theta(x')} \, \frac{q(x \mid x')}{q(x' \mid x)},\ 1 \right)$$

• $e^{E_\theta(x) - E_\theta(x')}$: likelihood ratio $p_\theta(x') / p_\theta(x)$, given $p_\theta(x) \propto e^{-E_\theta(x)}$
• $q(x \mid x') / q(x' \mid x)$: proposal distribution ratio (backward vs. forward)

where $q(x' \mid x) = \sum_i q(x' \mid x, i) \, q(i)$ and $i$ indexes the $D$ dimensions.

We want this probability to be as close to 1 as possible for efficient sampling (small probability → rejected proposal → wasted computation). A good proposal balances the likelihood of the proposal $q(x' \mid x)$ and the reverse proposal $q(x \mid x')$.

Strategy: use locally-informed proposals

$$q_\tau(x' \mid x) \propto e^{\frac{1}{\tau}\left( E_\theta(x) - E_\theta(x') \right)} \, \mathbf{1}\left( x' \in H(x) \right)$$

• $H(x)$: Hamming ball of size 1 around $x$ (a bit flip of one dimension, since the data is binary)
• $\tau$: temperature parameter for the tempered softmax. $\tau = 2$ gives the most balanced proposals (see the paper for details).

Problem: bad scalability, because evaluating $q_\tau(x' \mid x)$ takes $O(D)$ energy evaluations, where $D$ is the number of dimensions (e.g. MNIST: $D = 28 \times 28 = 784$). Can we do this in $O(1)$?
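A small sketch makes both points on this slide concrete: the exact locally-informed proposal costs one energy evaluation per neighbor, and $\tau = 2$ is special. With $p(x) \propto e^{-E(x)}$ and $q_\tau$ as defined above, the Metropolis-Hastings ratio for $\tau = 2$ reduces algebraically to the ratio of the two proposal normalizers, $Z(x) / Z(x')$, because the energy terms cancel. The toy energy, parameter values, and function names below are assumptions made for illustration only.

```python
import math

def energy(x, w, b):
    """Toy non-quadratic energy (an assumed form, for illustration only)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return -(sum(bi * xi for bi, xi in zip(b, x)) + math.tanh(s))

def flip(x, i):
    """Return a copy of x with bit i flipped."""
    y = list(x)
    y[i] = 1 - y[i]
    return y

def local_proposal(x, w, b, tau):
    """Exact q_tau over single-bit flips: (probabilities, normalizer, energy calls)."""
    e_x = energy(x, w, b)
    calls = 1
    weights = []
    for i in range(len(x)):
        # One extra energy evaluation per neighbor: this is the O(D) cost.
        weights.append(math.exp((e_x - energy(flip(x, i), w, b)) / tau))
        calls += 1
    z = sum(weights)
    return [v / z for v in weights], z, calls

def acceptance_ratio(x, i, w, b, tau):
    """Unclipped Metropolis-Hastings ratio for flipping bit i."""
    x_new = flip(x, i)
    q_fwd, _, _ = local_proposal(x, w, b, tau)
    q_bwd, _, _ = local_proposal(x_new, w, b, tau)
    likelihood_ratio = math.exp(energy(x, w, b) - energy(x_new, w, b))
    return likelihood_ratio * q_bwd[i] / q_fwd[i]

w = [0.7, -0.4, 0.3, 1.1]
b = [0.2, -0.3, 0.1, -0.6]
x = [1, 0, 1, 0]

probs, z_x, calls = local_proposal(x, w, b, tau=2.0)
_, z_new, _ = local_proposal(flip(x, 1), w, b, tau=2.0)
ratio = acceptance_ratio(x, 1, w, b, tau=2.0)

print(calls)               # D + 1 = 5 energy evaluations for D = 4
print(ratio, z_x / z_new)  # the two values agree when tau = 2
```

When neighboring states have similar normalizers, this ratio stays close to 1, which is the sense in which the proposal is balanced.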

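Exploit gradients of energy functions to speed up proposals

Common discrete distributions have energies that are continuous functions with discrete input spaces. They are differentiable with respect to $x$, i.e. we can obtain $\nabla_x E_\theta(x)$.

Gibbs-with-Gradients: exploit these structures/gradients to estimate likelihood ratios in the form of Taylor-series approximations:

$$q_{\mathrm{GWG}}(x' \mid x) \propto e^{\tilde{d}(x) / 2} \, \mathbf{1}\left( x' \in H(x) \right)$$

Here $\tilde{d}(x)$ is a first-order estimate of $E_\theta(x) - E_\theta(x')$ for each neighbor $x'$, with signs chosen to match $p_\theta(x) \propto e^{-E_\theta(x)}$:

• Binary data: $\tilde{d}(x) = (2x - 1) \odot \nabla_x E_\theta(x)$
• Categorical data: $\tilde{d}(x)_{ij} = x_i^\top \nabla_x E_\theta(x)_i - \nabla_x E_\theta(x)_{ij}$

Proposing a move = choosing which dimension $i$ to change, so

$$q_{\mathrm{GWG}}(x' \mid x) = q(i \mid x) = \mathrm{Categorical}\left( \mathrm{Softmax}\left( \frac{\tilde{d}(x)}{2} \right) \right)$$

This is computed in parallel and only requires two function calls (forward and backward pass), i.e. $O(1)$ function evaluations per proposal instead of $O(D)$.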
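To see how good the first-order estimate is, the sketch below compares $\tilde{d}(x)$ for binary data with the exact differences $E(x) - E(x')$ over all single-bit flips. Everything here is an assumption made for illustration: the toy energy (a pairwise term, a linear term, and an optional tanh term), its hand-written gradient, and the parameter values. When the tanh weights are zero and `W` has a zero diagonal, the energy is affine in each single coordinate, so the estimate is exact. With the tanh term it is only approximate.

```python
import math

def energy(x, W, b, w):
    """Toy energy: -(0.5 * x^T W x + b^T x + tanh(w^T x)); an assumed form."""
    D = len(x)
    quad = 0.5 * sum(x[i] * W[i][j] * x[j] for i in range(D) for j in range(D))
    lin = sum(b[i] * x[i] for i in range(D))
    s = sum(w[i] * x[i] for i in range(D))
    return -(quad + lin + math.tanh(s))

def grad_energy(x, W, b, w):
    """Hand-written gradient of the toy energy with respect to x (W symmetric)."""
    D = len(x)
    s = sum(w[i] * x[i] for i in range(D))
    sech2 = 1.0 - math.tanh(s) ** 2
    return [-(sum(W[i][j] * x[j] for j in range(D)) + b[i] + sech2 * w[i])
            for i in range(D)]

def d_tilde(x, W, b, w):
    """First-order estimate of E(x) - E(x') for each single-bit flip (one gradient)."""
    g = grad_energy(x, W, b, w)
    return [(2 * x[i] - 1) * g[i] for i in range(len(x))]

def exact_differences(x, W, b, w):
    """Exact E(x) - E(x') for each single-bit flip (D extra energy evaluations)."""
    e_x = energy(x, W, b, w)
    out = []
    for i in range(len(x)):
        x_flip = list(x)
        x_flip[i] = 1 - x_flip[i]
        out.append(e_x - energy(x_flip, W, b, w))
    return out

def softmax_half(d):
    """Softmax of d / 2, i.e. the proposal over which dimension to change."""
    m = max(d)
    e = [math.exp((v - m) / 2.0) for v in d]
    z = sum(e)
    return [v / z for v in e]

W = [[0.0, 1.0, -0.5], [1.0, 0.0, 0.8], [-0.5, 0.8, 0.0]]
b = [0.2, -0.3, 0.1]
x = [1, 0, 1]

# Pairwise model only (tanh weights set to zero): the estimate is exact.
w_zero = [0.0, 0.0, 0.0]
approx_pairwise = d_tilde(x, W, b, w_zero)
exact_pairwise = exact_differences(x, W, b, w_zero)

# With the nonlinear term: the estimate is only approximate.
w_tanh = [0.7, -0.4, 0.3]
approx_tanh = d_tilde(x, W, b, w_tanh)
exact_tanh = exact_differences(x, W, b, w_tanh)

proposal = softmax_half(approx_tanh)
print(approx_pairwise, exact_pairwise)
print(approx_tanh, exact_tanh)
print(proposal)
```

Any mismatch between the estimate and the exact differences affects only the proposal. The Metropolis-Hastings step still uses exact energies, so the chain targets the correct distribution either way.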

Experiments

How much does GWG help train EBMs?
• GWG is more scalable with increasing dimensions.
• GWG converges faster and achieves a higher effective sample size.

Conclusion

Additional details (paper)

Theoretical analysis
• Proof that GWG is near-optimal for local proposals
• Analysis on efficiency
• Relationship to continuous relaxations

Other experiments
• Factorial Hidden Markov Models (time-series)
• Lattice Ising Models
• Protein coupling prediction

Next steps

Expansion
• Expand the window size of the proposals for the sampler (i.e. modify more than one variable at a time)
• Apply to larger categorical spaces (e.g. text)

Generalization
• Apply to recent deep EBM training methods such as Score Matching and Stein Discrepancies, to generalize them to discrete data

Thanks and let's connect
Feel free to get in touch!
• linkedin.com/in/seansaito
• @saitonian

Appendix

How fast is GWG in MCMC?
This experiment compares samples from GWG with those from standard Gibbs sampling, evaluated against a ground-truth connectivity matrix.
• GWG finds better samples (about 100x better, i.e. two orders of magnitude, as measured by log(RMSE)).
• GWG is much faster (at least 20x faster in terms of sampling steps).
GWG provides better MCMC performance, which leads to better parameter inference for EBMs/generative models.

How does GWG compare with other deep generative models?
GWG is competitive with, or outperforms, other models on both binary and categorical data.

References
• Du, Yilun, et al. "Improved contrastive divergence training of energy based models." arXiv preprint arXiv:2012.01316 (2020).
• Grathwohl, Will, et al. "Oops I Took A Gradient: Scalable Sampling for Discrete Distributions." arXiv preprint arXiv:2102.04509 (2021).
• Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
• Larochelle, Hugo, and Yoshua Bengio. "Classification using discriminative restricted Boltzmann machines." Proceedings of the 25th International Conference on Machine Learning. 2008.
• Song, Yang, and Diederik P. Kingma. "How to train your energy-based models." arXiv preprint arXiv:2101.03288 (2021).
