Slide 1

Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
Presented by: Mehdi Cherti (LAL/CNRS)
9th May 2015

Slide 2


Slide 3

What is a generative model?

- A model of how the data X was generated.
- Typically, the purpose is to find a model for p(x) or p(x, y).
- y can be a set of latent (hidden) variables, or a set of output variables for discriminative problems.

Slide 4

Training generative models

- Typically, we assume a parametric form of the probability density: p(x|Θ).
- Given an i.i.d. dataset X = (x_1, x_2, ..., x_N), we typically do one of:
  - Maximum likelihood (ML): argmax_Θ p(X|Θ)
  - Maximum a posteriori (MAP): argmax_Θ p(X|Θ) p(Θ)
  - Bayesian inference: p(Θ|X) = p(X|Θ) p(Θ) / ∫ p(X|Θ) p(Θ) dΘ
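
As a concrete, hypothetical illustration of the ML case (not from the slides): for a univariate Gaussian model p(x|Θ) = N(x; µ, σ²), the maximum-likelihood estimate has a closed form, shown in this short Python sketch.

import numpy as np

# Hypothetical example: maximum-likelihood fit of a univariate Gaussian.
# argmax_Θ p(X|Θ) is attained at the sample mean and the (biased) sample variance.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. dataset

mu_ml = X.mean()                        # ML estimate of the mean
sigma2_ml = ((X - mu_ml) ** 2).mean()   # ML estimate of the variance
print(mu_ml, sigma2_ml)                 # close to 2.0 and 1.5**2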

Slide 5

The problem

- Let x be the observed variables.
- We assume a latent representation z.
- We define p_Θ(z) and p_Θ(x|z).
- We want to design a generative model where:
  - the marginal p_Θ(x) = ∫ p_Θ(x|z) p_Θ(z) dz is intractable,
  - the posterior p_Θ(z|x) = p_Θ(x|z) p_Θ(z) / p_Θ(x) is intractable,
  - we have large datasets: we want to avoid sampling-based training procedures (e.g. MCMC).

Slide 6

The proposed solution

They propose:
- a fast training procedure that estimates the parameters Θ: for data generation,
- an approximation of the posterior p_Θ(z|x): for data representation,
- an approximation of the marginal p_Θ(x): for model evaluation and as a prior for other tasks.

Slide 7

Formulation of the problem

- The process of generation consists of sampling z from p_Θ(z), then x from p_Θ(x|z).
- Let's define: a prior over the latent representation, p_Θ(z), and a decoder, p_Θ(x|z).
- We want to maximize the log-likelihood of the data (x^(1), x^(2), ..., x^(N)):
  log p_Θ(x^(1), x^(2), ..., x^(N)) = Σ_i log p_Θ(x^(i))
  and be able to do inference: p_Θ(z|x).
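
As a toy illustration of this two-step generative process (the specific prior and decoder below are assumptions made for the example, not the paper's model):

import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D generative process: first sample the latent z, then the observation x.
def sample_x(n):
    z = rng.standard_normal(n)              # z ~ p_Θ(z), here N(0, 1)
    x = rng.normal(loc=2.0 * z, scale=0.5)  # x ~ p_Θ(x|z), here N(2z, 0.25)
    return z, x

z, x = sample_x(5)
print(z)
print(x)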

Slide 8

The variational lower bound

- We will learn an approximation of p_Θ(z|x), denoted q_Φ(z|x), by maximizing a lower bound of the log-likelihood of the data.
- We can write:
  log p_Θ(x) = D_KL(q_Φ(z|x) || p_Θ(z|x)) + L(Θ, Φ, x)
  where:
  L(Θ, Φ, x) = E_{q_Φ(z|x)} [log p_Θ(x, z) − log q_Φ(z|x)]
- L(Θ, Φ, x) is called the variational lower bound; since the KL term is non-negative, L(Θ, Φ, x) ≤ log p_Θ(x). The goal is to maximize it w.r.t. all the parameters (Θ, Φ).
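
For completeness, a short derivation of this decomposition (standard, and consistent with the slide's definitions; it is not spelled out on the slide):

\begin{align*}
\log p_\Theta(x)
  &= \mathbb{E}_{q_\Phi(z|x)}\!\left[\log p_\Theta(x)\right] \\
  &= \mathbb{E}_{q_\Phi(z|x)}\!\left[\log \frac{p_\Theta(x,z)}{p_\Theta(z|x)}\right] \\
  &= \mathbb{E}_{q_\Phi(z|x)}\!\left[\log \frac{p_\Theta(x,z)}{q_\Phi(z|x)} + \log \frac{q_\Phi(z|x)}{p_\Theta(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q_\Phi(z|x)}\!\left[\log p_\Theta(x,z) - \log q_\Phi(z|x)\right]}_{\mathcal{L}(\Theta,\Phi,x)}
   + \underbrace{D_{KL}\!\left(q_\Phi(z|x)\,\|\,p_\Theta(z|x)\right)}_{\ge 0}
\end{align*}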

Slide 9

Estimating the lower bound gradients

- We need to compute ∂L(Θ,Φ,x)/∂Θ and ∂L(Θ,Φ,x)/∂Φ to apply gradient-based optimization.
- For that, we use the reparametrisation trick: we sample from a noise variable ε ∼ p(ε) and apply a deterministic function to it so that we obtain correct samples from q_Φ(z|x). Meaning: if ε ∼ p(ε), we find g such that if z = g(x, Φ, ε) then z ∼ q_Φ(z|x).
- g can be the inverse CDF of q_Φ(z|x) if ε is uniform.
- With the reparametrisation trick we can rewrite L:
  L(Θ, Φ, x) = E_{ε∼p(ε)} [log p_Θ(x, g(x, Φ, ε)) − log q_Φ(g(x, Φ, ε)|x)]
- We then estimate the gradients with Monte Carlo.
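
A minimal numerical sketch of the trick for a Gaussian q_Φ(z|x) = N(µ, σ²); the concrete values of mu and sigma are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Assume the encoder produced these parameters of q_Φ(z|x) for some x.
mu, sigma = 0.7, 0.3

# Reparametrisation: sample ε from the fixed noise distribution p(ε) = N(0, 1)
# and push it through the deterministic map g(x, Φ, ε) = µ + σ * ε.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# Sanity check: the transformed samples follow q_Φ(z|x) = N(0.7, 0.3²).
print(z.mean(), z.std())

# Because z depends on (mu, sigma) through a deterministic, differentiable map,
# a Monte Carlo estimate of E_{ε∼p(ε)}[f(g(x, Φ, ε))] can be differentiated
# w.r.t. Φ, which is what makes gradient-based training of L possible.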

Slide 10

A connection with auto-encoders

- Note that L can also be written in this form:
  L(Θ, Φ, x) = −D_KL(q_Φ(z|x) || p_Θ(z)) + E_{q_Φ(z|x)} [log p_Θ(x|z)]
- We can interpret the first term as a regularizer: it forces q_Φ(z|x) not to diverge too much from the prior p_Θ(z).
- We can interpret the negative of the second term as the reconstruction error.
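
When q_Φ(z|x) is a diagonal Gaussian and the prior is N(0, I), as in the variational auto-encoder described next, the KL term has a closed form (given in the paper's appendix). A small numpy sketch; the array names and example values are assumptions:

import numpy as np

def neg_kl_gaussian(mu, log_var):
    """-D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), computed per example.

    Closed form: 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    mu and log_var are arrays of shape (batch, latent_dim).
    """
    return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

# Illustrative batch of 2 examples with a 3-dimensional latent space.
mu = np.array([[0.0, 0.1, -0.2], [1.0, 0.0, 0.5]])
log_var = np.array([[0.0, -0.1, 0.2], [-0.5, 0.0, 0.1]])
print(neg_kl_gaussian(mu, log_var))  # closer to 0 means closer to the prior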

Slide 11

The algorithm

[Figure: the minibatch training algorithm from the paper]

Slide 12

Variational auto-encoders

- A model example that uses the procedure described above to maximize the lower bound.
- In variational auto-encoders, we choose:
  - p_Θ(z) = N(0, I)
  - p_Θ(x|z):
    - a normal distribution for real-valued data: a neural network decoder computes µ and σ of this distribution from z,
    - a multivariate Bernoulli for binary data: a neural network decoder computes the probability of 1 from z.
  - q_Φ(z|x) = N(µ(x), σ²(x)I): a neural network encoder computes µ and σ of q_Φ(z|x) from x,
  - ε ∼ N(0, I) and z = g(x, Φ, ε) = µ(x) + σ(x) ∗ ε (element-wise product).
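
A minimal, untrained numpy sketch of one forward pass of such a model on binary data, computing a single-sample Monte Carlo estimate of the lower bound. Layer sizes, initialisation and helper names are illustrative assumptions, not the paper's code:

import numpy as np

rng = np.random.default_rng(0)

x_dim, h_dim, z_dim = 784, 256, 20   # assumed sizes (MNIST-like binary data)

def layer(n_in, n_out):
    return rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out)

# Encoder q_Φ(z|x): x -> hidden -> (µ, log σ²)
W_h, b_h = layer(x_dim, h_dim)
W_mu, b_mu = layer(h_dim, z_dim)
W_lv, b_lv = layer(h_dim, z_dim)
# Decoder p_Θ(x|z): z -> hidden -> Bernoulli probabilities
W_hd, b_hd = layer(z_dim, h_dim)
W_p, b_p = layer(h_dim, x_dim)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lower_bound(x):
    """Single-sample Monte Carlo estimate of L(Θ, Φ, x) for a binary x."""
    # Encoder
    h = np.tanh(x @ W_h + b_h)
    mu, log_var = h @ W_mu + b_mu, h @ W_lv + b_lv
    # Reparametrisation trick: z = µ + σ * ε, with ε ~ N(0, I)
    eps = rng.standard_normal(z_dim)
    z = mu + np.exp(0.5 * log_var) * eps
    # Decoder (multivariate Bernoulli)
    hd = np.tanh(z @ W_hd + b_hd)
    p = sigmoid(hd @ W_p + b_p)
    # Reconstruction term E_q[log p_Θ(x|z)], estimated with one sample
    rec = np.sum(x * np.log(p + 1e-8) + (1 - x) * np.log(1 - p + 1e-8))
    # -D_KL(q_Φ(z|x) || p_Θ(z)) in closed form (see the previous sketch)
    neg_kl = 0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return rec + neg_kl

x = (rng.random(x_dim) < 0.5).astype(float)   # a fake binary "image"
print(lower_bound(x))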

Slide 13

Experiments (1)

Samples from MNIST:

Slide 14

Experiments (2)

2D latent-space manifolds learned from the MNIST and Frey Face datasets:

Slide 15

Experiments (3)

Comparison of the lower bound with the Wake-Sleep algorithm:

Slide 16

Experiments (4)

Comparison of the marginal log-likelihood with Wake-Sleep and Monte Carlo EM (MCEM):

Slide 17

Implementation: https://github.com/mehdidc/lasagnekit