Slide 1

Slide 1 text

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
Sean Saito ・ 齊藤初雲 | Data Scientist, BCG GAMMA | ICML 2021 Paper Reading Session (論文読み会)

Slide 2

Slide 2 text

Summary

Slide 3

Slide 3 text

Authors
• Will Grathwohl (1, 2)
• Kevin Swersky (2)
• Milad Hashemi (2)
• David Duvenaud (1, 2)
• Chris J. Maddison (1)

Affiliations
• (1) University of Toronto, Vector Institute
• (2) Google Brain

Awards
• ICML 2021 Outstanding Paper Honorable Mention

Keywords
• Generative modelling, EBM, MCMC, Gibbs sampling, Metropolis-Hastings, restricted Boltzmann machines

Slide 4

Slide 4 text

Key takeaways
1. Energy-based models are making a comeback in generative modelling.
2. This paper proposes Gibbs-with-Gradients, a general and scalable sampling method for discrete distributions.

Slide 5

Slide 5 text

Background

Slide 6

Slide 6 text

Energy-based models are making a comeback

Energy-based models (EBMs): a class of generative models, rooted in statistical physics, that model data by assigning a scalar "energy" value to each data point.

• 1980s: Gibbs sampling, Hamiltonian Monte Carlo, restricted Boltzmann machines
• Early to mid 2000s: autoencoders, deep belief networks (Hinton et al., 2006)
• 2010s: GANs, VAEs, normalizing flows
• 2020 onward: deep EBMs with gradient-based MCMC, score matching (Du et al., 2021; Song et al., 2021)

https://towardsdatascience.com/gibbs-sampling-8e4844560ae5
https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine
http://www.cs.toronto.edu/~hinton/science.pdf

Slide 7

Slide 7 text

Sampling is key in training EBMs

p_θ(x) = exp(−E_θ(x)) / Z(θ)

• A probability model of some distribution, parameterized by θ.
• Energy function E_θ: X → ℝ; this can be defined flexibly, e.g. as a neural network. In this paper, x ∈ {0, 1}^D (binary/one-hot).
• Partition function Z(θ), typically expressed as ∫ exp(−E_θ(x)) dx (a sum over X in the discrete case).

Training maximizes the log-likelihood. Take the log:

log p_θ(x) = −E_θ(x) − log ∫ exp(−E_θ(x′)) dx′

The second term is intractable, so take the gradient and train with an unbiased estimator:

∇_θ log p_θ(x) = −∇_θ E_θ(x) + E_{x′ ∼ p_θ}[∇_θ E_θ(x′)]

The expectation is approximated with samples drawn via Markov chain Monte Carlo (MCMC), which is the main focus of this paper. Training can then proceed via stochastic gradient ascent, Langevin dynamics, or persistent contrastive divergence (persistent CD).
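As a minimal sketch of the estimator above, assuming a toy linear energy E_θ(x) = −θᵀx so the gradients can be written by hand (all function names here are illustrative, not from the paper; a real EBM would use a neural network and autograd):

```python
import numpy as np

def energy(theta, x):
    # Toy linear energy E_theta(x) = -theta . x (a neural network in practice)
    return -float(theta @ x)

def grad_energy_wrt_theta(theta, x):
    # dE_theta/dtheta = -x for the linear energy above (autograd in practice)
    return -x

def log_likelihood_grad(theta, x_data, model_samples):
    # Unbiased estimator of grad_theta log p_theta(x_data):
    #   -grad_theta E(x_data) + E_{x ~ p_theta}[grad_theta E(x)]
    # where the expectation is approximated with MCMC samples from the model.
    positive = -grad_energy_wrt_theta(theta, x_data)
    negative = np.mean([grad_energy_wrt_theta(theta, x) for x in model_samples], axis=0)
    return positive + negative
```

For this linear energy the estimator reduces to x_data minus the mean of the model samples, the familiar positive/negative-phase split from RBM training.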

Slide 8

Slide 8 text

Gibbs-with-Gradients (GWG)

Slide 9

Slide 9 text

Motivation: we want a general yet scalable way of training EBMs on discrete data

• We want to apply advanced EBM techniques, such as gradient-based MCMC, to discrete data (tabular, text, proteins, etc.).
• But gradient-based MCMC assumes continuous data, and other MCMC methods for discrete data are inefficient and do not scale.
• Moreover, current "efficient" discrete sampling methods assume some structure in the data, e.g. block independence (block Gibbs sampling), and are hence not general.

Slide 10

Slide 10 text

Core intuition: common discrete distributions have differentiable energy functions, and this can be exploited to speed up sampling.

Discrete distributions → continuous energy functions

Slide 11

Slide 11 text

Gibbs-With-Gradients: the MCMC sampling loop inside EBM training

1. Calculate the energy of the current sample x (e.g. the output of a neural network).
2. Compute gradients and approximate the energies of the neighbors x′ with a Taylor series to obtain likelihood ratios (in other words, very fast).
3. Apply a softmax to get the proposal distribution, and sample the dimension i ∈ {1, …, D} for which to propose a change.
4. (Metropolis-Hastings step) Accept x′ as the next sample with probability

   min( exp(f(x′) − f(x)) · q(x | x′) / q(x′ | x), 1 ),   where f(x) = −E_θ(x)

   then repeat from step 1.
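The four steps above can be sketched end-to-end for binary data, writing f(x) = −E_θ(x) for the log unnormalized probability. A toy Ising-style f (with symmetric W) stands in for the neural-network energy; this is an illustrative sketch, not the authors' reference code:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, W, b):
    # Log unnormalized probability f(x) = -E(x) for a toy Ising-like model.
    # (In practice E is a neural network and gradients come from autograd.)
    return 0.5 * x @ W @ x + b @ x

def grad_f(x, W, b):
    # Exact gradient of the toy f w.r.t. x (W assumed symmetric).
    return W @ x + b

def gwg_step(x, W, b):
    # Steps 2-3: Taylor-approximate the change in f from flipping each bit,
    # then a tempered softmax (temperature 2) gives the proposal q(i | x).
    d = -(2 * x - 1) * grad_f(x, W, b)
    logits = d / 2.0
    q = np.exp(logits - logits.max())
    q /= q.sum()
    i = rng.choice(len(x), p=q)
    x_new = x.copy()
    x_new[i] = 1 - x_new[i]  # flip dimension i
    # Reverse proposal q(i | x_new), needed for detailed balance.
    d_new = -(2 * x_new - 1) * grad_f(x_new, W, b)
    logits_new = d_new / 2.0
    q_new = np.exp(logits_new - logits_new.max())
    q_new /= q_new.sum()
    # Step 4: Metropolis-Hastings acceptance.
    accept = min(1.0, np.exp(f(x_new, W, b) - f(x, W, b)) * q_new[i] / q[i])
    return x_new if rng.random() < accept else x
```

Note that one GWG step needs only two evaluations of f and two gradients, regardless of the dimensionality D.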

Slide 12

Slide 12 text

Finding balance in Metropolis-Hastings proposals

Decide whether to accept the next generated sample x′ in the MCMC chain with probability

min( exp(f(x′) − f(x)) · q(x | x′) / q(x′ | x), 1 ),   where f(x) = −E_θ(x)

i.e. the likelihood ratio times the proposal-distribution ratio (backward vs. forward), with q(x′ | x) = Σ_i q(x′ | x, i) q(i), where i indexes the D dimensions.

We want this acceptance probability to be as close to 1 as possible for efficient sampling (a small probability means a rejected proposal, i.e. wasted computation). A good proposal balances the likelihood of the forward proposal q(x′ | x) and the reverse proposal q(x | x′).

Strategy: use locally-informed proposals

q_τ(x′ | x) ∝ exp( (f(x′) − f(x)) / τ ) · 𝟙[x′ ∈ H(x)]

• H(x): Hamming ball of size 1 around x (a flip of one dimension, since the data is binary)
• τ: temperature parameter for the tempered softmax; τ = 2 gives the most balanced proposals (see the paper for details!)

Problem: bad scalability, because evaluating q_τ(x′ | x) requires O(D) energy evaluations, where D is the number of dimensions, e.g. MNIST → O(28 × 28) = O(784). Can we do this in O(1)?
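To make the O(D) cost concrete, here is a minimal sketch of the exact locally-informed proposal: it must evaluate f once per dimension, since every one-bit neighbor in the Hamming ball needs its own energy (all names are illustrative assumptions):

```python
import numpy as np

def locally_informed_proposal(x, f, tau=2.0):
    # Exact locally-informed proposal over the Hamming ball H(x) of size 1:
    #   q_tau(i | x) proportional to exp((f(x with bit i flipped) - f(x)) / tau)
    # The loop below calls f once per dimension, hence O(D) energy evaluations.
    D = len(x)
    deltas = np.empty(D)
    fx = f(x)
    for i in range(D):
        x_flip = x.copy()
        x_flip[i] = 1 - x_flip[i]  # one-bit flip
        deltas[i] = f(x_flip) - fx
    logits = deltas / tau
    q = np.exp(logits - logits.max())
    return q / q.sum()  # q(i | x): which dimension to flip
```

For a neural-network f, each of those D calls is a full forward pass, which is exactly the cost GWG's gradient approximation removes.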

Slide 13

Slide 13 text

Exploit gradients of energy functions to speed up proposals

Energy functions are continuous functions over discrete input spaces, and they are differentiable with respect to x, i.e. we can obtain ∇_x E_θ(x).

Gibbs-with-Gradients exploits this structure to estimate the likelihood ratios with a Taylor-series approximation d̃(x) (writing f(x) = −E_θ(x)):

q_GWG(x′ | x) ∝ exp( d̃(x) / 2 ) · 𝟙[x′ ∈ H(x)]

• Binary data: d̃(x) = −(2x − 1) ⊙ ∇_x f(x)
• Categorical data: d̃(x)_{ij} = ∇_x f(x)_{ij} − x_i^⊤ ∇_x f(x)_i

Proposing a move = choosing which dimension i to change, so

q_GWG(x′ | x) = q(i | x) = Categorical( Softmax( d̃(x) / 2 ) )

d̃(x) is computed in parallel for all dimensions and requires only two function calls (one forward and one backward pass), allowing for O(1) complexity in the number of network evaluations.
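A minimal sketch of the binary-case proposal (writing f(x) = −E_θ(x); names are illustrative). For a linear f(x) = wᵀx the first-order Taylor estimate is exact, which makes it easy to sanity-check: d̃(x)_i equals the true change f(x with bit i flipped) − f(x).

```python
import numpy as np

def d_tilde_binary(x, grad_f_x):
    # GWG's Taylor estimate of f(x with bit i flipped) - f(x), for all i at once:
    #   d_tilde(x) = -(2x - 1) * grad_x f(x)
    # grad_f_x would come from a single backward pass in practice.
    return -(2 * x - 1) * grad_f_x

def gwg_proposal(x, grad_f_x):
    # q(i | x) = Softmax(d_tilde(x) / 2): which dimension to propose flipping.
    logits = d_tilde_binary(x, grad_f_x) / 2.0
    q = np.exp(logits - logits.max())
    return q / q.sum()
```

For example, with f(x) = wᵀx, w = (1, 2, −1) and x = (0, 1, 0), d̃(x) = (1, −2, −1), matching the exact per-flip changes in f.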

Slide 14

Slide 14 text

Experiments

Slide 15

Slide 15 text

How much does GWG help train EBMs?

• GWG is more scalable with increasing dimensions.
• GWG converges faster and achieves a higher effective sample size.

Slide 16

Slide 16 text

Conclusion

Slide 17

Slide 17 text

Conclusion

Additional details (paper)
• Theoretical analysis: proof that GWG is near-optimal among local proposals; analysis of efficiency; relationship to continuous relaxations
• Other experiments: factorial hidden Markov models (time series), lattice Ising models, protein coupling prediction

Next steps
• Expansion: widen the window of the proposals (i.e. modify more than one variable at a time); apply to larger categorical spaces (e.g. text)
• Generalization: extend recent deep EBM training methods, such as score matching and Stein discrepancies, to discrete data

Slide 18

Slide 18 text

Thanks and let's connect. Feel free to reach out!

linkedin.com/in/seansaito ・ @saitonian

Slide 19

Slide 19 text

Appendix

Slide 20

Slide 20 text

How fast is GWG in MCMC?

Compares samples from GWG with those from standard Gibbs sampling, against a ground-truth connectivity matrix.

• GWG finds better samples (about 100×, i.e. two orders of magnitude, better in terms of log(RMSE)).
• GWG is much faster (at least 20× fewer sampling steps).
• Better MCMC performance leads to better parameter inference for EBMs/generative models.

Slide 21

Slide 21 text

How does GWG compare with other deep generative models?

GWG-trained EBMs are competitive with, or outperform, other deep generative models on both binary and categorical data.

Slide 22

Slide 22 text

References

• Du, Yilun, et al. "Improved contrastive divergence training of energy based models." arXiv preprint arXiv:2012.01316 (2020).
• Grathwohl, Will, et al. "Oops I Took A Gradient: Scalable Sampling for Discrete Distributions." arXiv preprint arXiv:2102.04509 (2021).
• Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
• Larochelle, Hugo, and Yoshua Bengio. "Classification using discriminative restricted Boltzmann machines." Proceedings of the 25th International Conference on Machine Learning. 2008.
• Song, Yang, and Diederik P. Kingma. "How to train your energy-based models." arXiv preprint arXiv:2101.03288 (2021).