Oops I Took A Gradient: Scalable Sampling for Discrete Distributions ・ Sean Saito

Authors
• Will Grathwohl 1 2
• Kevin Swersky 2
• Milad Hashemi 2
• David Duvenaud 1 2
• Chris J. Maddison 1

Affiliations
• 1 University of Toronto, Vector Institute
• 2 Google Brain

Awards
• ICML 2021 Outstanding Paper Honorable Mention

Keywords
• Generative modelling, EBM, MCMC, Gibbs Sampling, Metropolis-Hastings, Restricted Boltzmann Machines
Summary: Key takeaways
This paper proposes Gibbs-With-Gradients (GWG), a general and scalable sampling method for discrete distributions.
Background: Energy-based models are making a comeback
Energy-based models (EBMs) model data using ideas from statistical physics, i.e. assigning a scalar "energy" value to each data point.
• 1980s: Gibbs sampling; Hamiltonian Monte Carlo; Restricted Boltzmann Machines
• Early to mid 2000s: Autoencoders; Deep Belief Networks (Hinton et al., 2006)
• 2010s: GANs, VAEs, and Normalizing Flows dominate generative modelling
• ~2020: Deep EBMs with gradient-based MCMC (Du et al., 2021); Score Matching (Song et al., 2021)
Further reading:
https://towardsdatascience.com/gibbs-sampling-8e4844560ae5
https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
http://www.cs.toronto.edu/~hinton/science.pdf
Background: Sampling is key in training EBMs
An EBM defines a distribution p_θ(x) = e^{−E_θ(x)} / Z(θ), parameterized by θ, where:
• E_θ: X → ℝ is the energy function (this can be defined flexibly, e.g. a neural network)
• In this paper, x ∈ {0, 1}^D (binary/one-hot)
• Z(θ) is the partition function, typically expressed as ∫ e^{−E_θ(x)} dx
Optimization is formulated as maximizing the log-likelihood. Take the log:
log p_θ(x) = −E_θ(x) − log ∫ e^{−E_θ(x')} dx'
The log-partition term is intractable, unfortunately.
Take the gradient to obtain an unbiased gradient estimator:
∇_θ log p_θ(x) = −∇_θ E_θ(x) + E_{x'∼p_θ}[∇_θ E_θ(x')]
The expectation is estimated with samples drawn via Markov chain Monte Carlo (MCMC), which is the main focus of the paper.
Training can proceed via:
• Stochastic Gradient Ascent
• Langevin dynamics
• Persistent Contrastive Divergence
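As a concrete illustration (not from the paper), the gradient estimator above can be sketched for an assumed toy EBM with energy E_θ(x) = −θᵀx, where ∇_θ E_θ(x) = −x, so the estimator reduces to a difference between data and model averages. In practice the model samples would come from MCMC rather than being drawn uniformly as here.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(theta, x):
    # Toy energy E_theta(x) = -theta . x (independent binary dimensions).
    return -(x @ theta)

def grad_log_likelihood(x_data, x_model):
    # grad_theta log p(x) = -grad_theta E(x_data) + E_{x'~p}[grad_theta E(x')];
    # with grad_theta E(x) = -x this is mean(x_data) - mean(x_model).
    return x_data.mean(axis=0) - x_model.mean(axis=0)

x_data = rng.integers(0, 2, size=(256, 4)).astype(float)   # observed samples
x_model = rng.integers(0, 2, size=(256, 4)).astype(float)  # stand-in for MCMC samples
g = grad_log_likelihood(x_data, x_model)                   # stochastic gradient ascent direction
```

The "positive phase" pulls energy down on data, while the "negative phase" pushes it up on model samples; this is why sampling quality directly determines training quality.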
Motivation: We want a general yet scalable way of training EBMs on discrete data
• Much important data is discrete (tabular, text, proteins, etc.), yet scalable EBM training relies on methods such as gradient-based MCMC.
• But gradient-based MCMC assumes continuous data, and other MCMC methods for discrete data are not efficient and don't scale.
• Moreover, current "efficient" discrete sampling methods assume some structure in the data, e.g. block independence (Block Gibbs), and hence are not general.
GWG: the Gibbs-With-Gradients sampler
1. Evaluate the energy E_θ(x) of the current sample x (e.g. with a neural network)
2. Compute gradients and approximate the energies of the neighbors x' via a Taylor series to get likelihood ratios (in other words, very fast)
3. Apply a softmax to get the proposal distribution and sample a dimension i ∈ {1, …, D} for which to propose a change
4. (Metropolis-Hastings step) Accept x' as the next sample with probability
   min( e^{E_θ(x) − E_θ(x')} · q(x | x') / q(x' | x), 1 )
Repeat from 1…
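The four steps above can be sketched end-to-end. This is a minimal NumPy sketch, not the authors' code: it assumes a toy quadratic log-probability f(x) = −E_θ(x) whose gradient is available in closed form, standing in for a neural network with autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W = rng.normal(size=(D, D))
W = 0.5 * (W + W.T)            # symmetric pairwise couplings (assumed toy model)
b = rng.normal(size=D)

def f(x):
    # Unnormalized log-probability f(x) = -E(x) of the toy quadratic EBM.
    return 0.5 * x @ W @ x + b @ x

def grad_f(x):
    return W @ x + b

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gwg_step(x):
    # Steps 1-2: one gradient evaluation gives the Taylor estimate of
    # f(flip_i(x)) - f(x) for all D dimensions at once.
    d = (1.0 - 2.0 * x) * grad_f(x)
    # Step 3: tempered softmax over dimensions -> proposal; sample a bit to flip.
    q_fwd = softmax(d / 2.0)
    i = rng.choice(D, p=q_fwd)
    x_new = x.copy()
    x_new[i] = 1.0 - x_new[i]
    # Step 4: Metropolis-Hastings accept/reject using the reverse proposal.
    q_rev = softmax((1.0 - 2.0 * x_new) * grad_f(x_new) / 2.0)
    accept = min(1.0, np.exp(f(x_new) - f(x)) * q_rev[i] / q_fwd[i])
    return x_new if rng.random() < accept else x

x = rng.integers(0, 2, D).astype(float)
for _ in range(200):
    x = gwg_step(x)
```

Note that the reverse proposal q(x | x') is just the probability that the sampler at x' would pick the same dimension i, since flipping it again returns to x.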
GWG: Finding balance in Metropolis-Hastings proposals
A proposed sample x' is accepted as the next sample in the MCMC chain with probability
min( e^{E_θ(x) − E_θ(x')} · q(x | x') / q(x' | x), 1 )
i.e. the likelihood ratio times the proposal distribution ratio (backward vs. forward), where q(x' | x) = Σ_i q(x' | x, i) q(i) and i indexes the D dimensions.
We want this probability to be as close to 1 as possible for efficient sampling (small probability → rejected proposal → wasted computation). A good proposal balances the likelihood of the proposal q(x' | x) and the reverse proposal q(x | x').
Strategy: use locally-informed proposals:
q_τ(x' | x) ∝ e^{(E_θ(x) − E_θ(x')) / τ} · 1(x' ∈ H(x))
• H(x): Hamming ball of size 1 around x (a single bit flip, since the data is binary)
• τ: temperature parameter for the tempered softmax; τ = 2 gives the most balanced proposals (see the paper for details!)
Problem: bad scalability, because evaluating q_τ(x' | x) requires O(D) energy evaluations, where D is the number of dimensions, e.g. MNIST → O(28×28) = O(784). Can we do this in O(1)?
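A minimal sketch (not from the paper) of the exact locally-informed proposal makes the O(D) cost explicit: one energy evaluation per candidate bit flip. The linear log-probability f(x) = −E_θ(x) = θᵀx is an assumption for illustration only.

```python
import numpy as np

def exact_proposal(f, x, tau=2.0):
    # One call to f per dimension: O(D) energy evaluations in total.
    D = x.shape[0]
    deltas = np.empty(D)
    for i in range(D):
        x_flip = x.copy()
        x_flip[i] = 1.0 - x_flip[i]
        deltas[i] = f(x_flip) - f(x)   # f(x') - f(x) = E(x) - E(x')
    z = deltas / tau                   # tempered with tau = 2
    z -= z.max()                       # numerical stability
    q = np.exp(z)
    return q / q.sum()                 # distribution over which bit to flip

theta = np.array([1.0, -2.0, 0.5, 0.0])
f = lambda x: theta @ x                # assumed toy log-probability
x = np.array([0.0, 1.0, 1.0, 0.0])
q = exact_proposal(f, x)               # puts most mass on the best flip (dim 1)
```

For a 784-dimensional MNIST image this loop would call the energy network 784 times per sampling step, which is exactly the cost GWG removes.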
GWG: Exploit gradients of energy functions to speed up proposals
Most deep EBMs are differentiable with respect to x, i.e. we can obtain ∇_x E_θ(x). Gibbs-With-Gradients exploits these gradients to estimate the likelihood ratios via a Taylor-series approximation. Writing f_θ(x) = −E_θ(x) for the unnormalized log-probability:
q_GWG(x' | x) ∝ e^{d̃(x)_i / 2} · 1(x' ∈ H(x))
• Binary data: d̃(x) = −(2x − 1) ⊙ ∇_x f_θ(x)
• Categorical data: d̃(x)_ij = ∇_x f_θ(x)_ij − x_iᵀ ∇_x f_θ(x)_i
Proposing a move = choosing which dimension i to change:
q_GWG(x' | x) = q(i | x) = Categorical(Softmax(d̃(x) / 2))
d̃(x) is computed in parallel for all dimensions and requires only two function calls (one forward and one backward pass), i.e. O(1) energy evaluations instead of O(D).
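To see the approximation at work, here is a small illustrative check (with an assumed linear f_θ, for which the first-order Taylor estimate happens to be exact): a single gradient evaluation reproduces all D flip differences that the exact proposal would need D function calls to compute.

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5, 0.0])
f = lambda x: theta @ x          # assumed toy log-probability f(x) = -E(x)
grad_f = lambda x: theta         # its gradient with respect to x

x = np.array([0.0, 1.0, 1.0, 0.0])

# Taylor estimate of f(flip_i(x)) - f(x) for every i, from ONE gradient:
d_tilde = -(2.0 * x - 1.0) * grad_f(x)

# Exact flip differences, needing D separate evaluations of f:
exact = np.empty(4)
for i in range(4):
    x_flip = x.copy()
    x_flip[i] = 1.0 - x_flip[i]
    exact[i] = f(x_flip) - f(x)
```

For a nonlinear energy network the estimate is no longer exact, but the Metropolis-Hastings step in GWG corrects any approximation error, so the chain still targets the true distribution.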
Experiments: How much does GWG help train EBMs?
GWG is faster and achieves a higher effective sample size.
Conclusion
• This paper proposes Gibbs-With-Gradients, a general and scalable sampler for discrete distributions, and shows that GWG is near-optimal for local proposals
• Analysis of efficiency
• Relationship to continuous relaxations
Other experiments
• Factorial Hidden Markov Models (time-series)
• Lattice Ising Models
• Protein coupling prediction
Expansion
• Expand the window size of the proposals for the sampler (i.e. modify more than one variable at a time)
• Apply to larger categorical spaces (e.g. text)
Generalization
• Apply recent deep EBM training methods such as Score Matching and Stein Discrepancies to generalize to discrete data
Experiments: How fast is GWG in MCMC?
Compared against baseline sampling (evaluated using a ground-truth connectivity matrix):
• GWG finds better samples (roughly two orders of magnitude, ~100×, lower RMSE)
• GWG is much faster (at least 20× faster in terms of sampling steps)
• GWG provides better MCMC performance, which leads to better parameter inference for EBMs/generative models
References
• Du, Yilun, et al. "Improved contrastive divergence training of energy based models." arXiv preprint arXiv:2012.01316 (2020).
• Grathwohl, Will, et al. "Oops I Took A Gradient: Scalable Sampling for Discrete Distributions." arXiv preprint arXiv:2102.04509 (2021).
• Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
• Larochelle, Hugo, and Yoshua Bengio. "Classification using discriminative restricted Boltzmann machines." Proceedings of the 25th International Conference on Machine Learning. 2008.
• Song, Yang, and Diederik P. Kingma. "How to train your energy-based models." arXiv preprint arXiv:2101.03288 (2021).