Slide 1

Slide 1 text

DEEP LEARNING, Ian Goodfellow et al.
Chapter 18: Confronting the Partition Function
nzw
Note: all figures are cited from the original book.

Slide 2

Slide 2 text

Introduction: the partition function

For many models we face a probability distribution of the form

    p(x; \theta) = \frac{1}{Z(\theta)} \tilde{p}(x; \theta)

where \tilde{p}(x; \theta) is an unnormalized distribution and Z(\theta) is the normalizing constant (partition function),

    Z(\theta) = \int \tilde{p}(x)\, dx   (continuous x)   or   Z(\theta) = \sum_{x} \tilde{p}(x)   (discrete x).

For many interesting models, this integral or sum is intractable.
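As a toy illustration (not from the book), the sketch below normalizes a small discrete log-linear model by brute force; the model, parameter values, and names are made up for this example. For realistic models, the sum over all joint configurations is exactly what becomes intractable.

import numpy as np

# Toy log-linear model over 3 binary variables (illustrative only).
def p_tilde(x, theta):
    return np.exp(theta @ x)            # unnormalized probability ~p(x; theta)

theta = np.array([0.5, -1.0, 2.0])
states = [np.array(s) for s in np.ndindex(2, 2, 2)]   # all 2^3 joint configurations

Z = sum(p_tilde(x, theta) for x in states)            # partition function: sum of ~p over all states
p = {tuple(x): p_tilde(x, theta) / Z for x in states} # normalized distribution
assert abs(sum(p.values()) - 1.0) < 1e-12             # p now sums to one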

Slide 3

Slide 3 text

How to deal with it?
1. Design a model whose normalizing constant is tractable.
2. Design a model that does not involve computing p(x) at all.
3. Confront the intractable normalizing constant directly.
   • This chapter!

Slide 4

Slide 4 text

18.1 The Log-Likelihood Gradient

    \nabla_\theta \log p(x; \theta) = \underbrace{\nabla_\theta \log \tilde{p}(x; \theta)}_{\text{positive phase}} - \underbrace{\nabla_\theta \log Z(\theta)}_{\text{negative phase}}

This chapter deals with models, such as the RBM, whose negative phase is difficult.
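The decomposition follows directly from taking the logarithm of p(x; \theta) = \tilde{p}(x; \theta) / Z(\theta):

    \log p(x; \theta) = \log \tilde{p}(x; \theta) - \log Z(\theta)
    \;\Rightarrow\;
    \nabla_\theta \log p(x; \theta) = \nabla_\theta \log \tilde{p}(x; \theta) - \nabla_\theta \log Z(\theta).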

Slide 5

Slide 5 text

18.1 The Gradient of log Z: the Negative Phase

For discrete x:

    \nabla_\theta \log Z = \sum_{x} p(x) \nabla_\theta \log \tilde{p}(x) = \mathbb{E}_{x \sim p(x)}\!\left[\nabla_\theta \log \tilde{p}(x)\right]

For continuous x:

    \nabla_\theta \log Z = \frac{\nabla_\theta \int \tilde{p}(x)\, dx}{Z} = \frac{\int \nabla_\theta \tilde{p}(x)\, dx}{Z} = \mathbb{E}_{x \sim p(x)}\!\left[\nabla_\theta \log \tilde{p}(x)\right]

NOTE: exchanging the gradient and the integral requires certain regularity conditions (Leibniz's rule).
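The final equality in each chain uses the standard identity \nabla_\theta \tilde{p}(x) = \tilde{p}(x) \nabla_\theta \log \tilde{p}(x), so that

    \frac{\int \nabla_\theta \tilde{p}(x)\, dx}{Z}
    = \int \frac{\tilde{p}(x)}{Z} \nabla_\theta \log \tilde{p}(x)\, dx
    = \int p(x) \nabla_\theta \log \tilde{p}(x)\, dx
    = \mathbb{E}_{x \sim p(x)}\!\left[\nabla_\theta \log \tilde{p}(x)\right].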

Slide 6

Slide 6 text

18.1 Interpretation of Both Phases via Monte Carlo Sampling

    \nabla_\theta \log Z = \mathbb{E}_{x \sim p(x)}\!\left[\nabla_\theta \log \tilde{p}(x)\right]
    \nabla_\theta \log p(x; \theta) = \nabla_\theta \log \tilde{p}(x; \theta) - \nabla_\theta \log Z(\theta)

In terms of log p̃(x):
• Positive phase: increase log p̃(x) at points x drawn from the data.
• Negative phase: decrease log Z(θ) by decreasing log p̃(x) at points x drawn from the model distribution.

In terms of the energy function E(x):
• Positive phase: push down E(x) at data points.
• Negative phase: push up E(x) at model samples.
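As a concrete illustration (not from the book) of how the two phases combine into a single stochastic gradient estimate, here is a minimal sketch; grad_log_p_tilde, data_batch, and model_samples are placeholder names for whatever model and sampler are in use.

import numpy as np

def log_likelihood_gradient(grad_log_p_tilde, data_batch, model_samples):
    """Two-phase Monte Carlo estimate of the log-likelihood gradient.

    Positive phase: average gradient of log ~p at data points.
    Negative phase: average gradient of log ~p at samples from the model.
    """
    positive = np.mean([grad_log_p_tilde(x) for x in data_batch], axis=0)
    negative = np.mean([grad_log_p_tilde(x) for x in model_samples], axis=0)
    return positive - negative   # ascend this: raise ~p at data, lower it at model samples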

Slide 7

Slide 7 text

18.2 Stochastic Maximum Likelihood and Contrastive Divergence

The naive way to follow the maximum-likelihood gradient (positive phase minus negative phase, as above) is to estimate the negative phase with MCMC. However, computing the gradient this way is very costly, because the Markov chains must be burned in at every step.

Slide 8

Slide 8 text

18.2 Stochastic Maximum Likelihood and Contrastive Divergence (cont.)

The dominant cost of the naive approach is burn-in: at every gradient step, the Markov chains for the negative phase must be burned in again from a random initialization.
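To make the cost concrete, here is a rough sketch of naive MCMC maximum likelihood, in the spirit of the book's Algorithm 18.1 but not a faithful reproduction of it; gibbs_step, grad_log_p_tilde, and all hyperparameter values are placeholders.

import numpy as np

def naive_mcmc_training(theta, data_batches, grad_log_p_tilde, gibbs_step,
                        n_chains=100, burn_in=1000, lr=1e-3):
    """Maximum likelihood with a freshly burned-in MCMC negative phase at every step."""
    for batch in data_batches:
        # Positive phase: gradient of log ~p at the data.
        positive = np.mean([grad_log_p_tilde(theta, x) for x in batch], axis=0)

        # Negative phase: burn in chains from random initializations (the expensive part).
        chains = [np.random.rand(batch[0].shape[0]) for _ in range(n_chains)]
        for _ in range(burn_in):
            chains = [gibbs_step(theta, x) for x in chains]
        negative = np.mean([grad_log_p_tilde(theta, x) for x in chains], axis=0)

        theta = theta + lr * (positive - negative)   # gradient ascent on the log-likelihood
    return theta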

Slide 9

Slide 9 text

18.2 A View of the MCMC Negative Phase: Hallucinations / Fantasy
• Data points x are drawn from the model distribution to obtain the negative-phase gradient.
• This can be viewed as searching for points to which the model assigns high probability.
• The model is then trained to reduce the gap between the model distribution and the true data distribution.

Slide 10

Slide 10 text

18.2 Analogy between REM Sleep and the Two Phases
• Positive phase: the gradient is obtained from real events (being awake).
• Negative phase: the gradient is obtained from samples drawn from the model distribution (dreaming).
Note:
• In ML, the two phases are usually performed simultaneously, rather than in separate periods of wakefulness and REM sleep.

Slide 11

Slide 11 text

18.2 A More Efficient Algorithm: Contrastive Divergence
• The bottleneck of naive MCMC is burning in the Markov chains from a random initialization.
• To reduce the burn-in cost, one solution is to initialize the Markov chain from a distribution that is already close to the model distribution.
• Contrastive divergence does this by initializing the chain at samples from the data distribution (see the sketch below).
• Short forms: CD, CD-k (CD with k Gibbs steps).
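A minimal sketch of CD-k for a binary RBM, assuming the standard sigmoid Gibbs updates; the function names, array layout, and learning-rate handling are illustrative, not the book's reference pseudocode. The key point is that the negative-phase chain starts at the data batch rather than being burned in from a random initialization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v_data, k=1, lr=1e-3, rng=None):
    """One CD-k update for a binary RBM: weights W (nv x nh), visible bias b, hidden bias c."""
    rng = rng or np.random.default_rng()

    # Positive phase statistics: hidden activations driven by the data.
    ph_data = sigmoid(v_data @ W + c)

    # Negative phase: k Gibbs steps starting from the data (the CD shortcut),
    # instead of burning in a chain from a random initialization.
    v = v_data
    for _ in range(k):
        h = (rng.random(ph_data.shape) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v_data.shape) < sigmoid(h @ W.T + b)).astype(float)
    ph_model = sigmoid(v @ W + c)

    # Gradient ascent on the approximate log-likelihood (positive minus negative statistics).
    W += lr * (v_data.T @ ph_data - v.T @ ph_model) / len(v_data)
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c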

Slide 12

Slide 12 text

18.2 CD negative phase

Slide 13

Slide 13 text

Under construction :cry:

Slide 14

Slide 14 text

18.3 Pseudolikelihood

Slide 15

Slide 15 text

18.4 Score Matching and Ratio Matching

Slide 16

Slide 16 text

18.5 Denoising Score Matching

Slide 17

Slide 17 text

18.6 Noise-Contrastive Estimation

Slide 18

Slide 18 text

18.7 Estimating the Partition Function

Slide 19

Slide 19 text

18.7.1 Annealed Importance Sampling

Slide 20

Slide 20 text

18.7.2 Bridge Sampling