Reading Gibbs Sampling by Gelfand & Smith

Xi'an
November 04, 2013

Slides by Guillaume Reveillon presenting the historical paper on Gibbs sampling.


Transcript

  1. Sampling-Based Approaches to Calculating Marginal Densities. Alan E. Gelfand and Adrian F. M. Smith, Journal of the American Statistical Association, Vol. 85, No. 410 (Jun. 1990), pp. 398-409. (Presented November 4, 2013.)
  2. Outline: I. Introduction, II. Sampling Approaches, III. Examples and Numerical Illustrations, IV. Conclusion
  4. I. Introduction 1. Purposes 2. Context

  6. I. 1. Purposes
     - Exploitation of structural information to obtain numerical estimates of marginal densities that are not available analytically
     - Use of sampling methods instead of sophisticated numerical analytic ones
     - More attractive because of their simplicity and ease of implementation
  9. I. Introduction 1. Purposes 2. Context

  10. I. 2. Context: two cases
     - A) For i = 1, ..., k the conditional distributions [U_i | U_j, j ≠ i] are available. We can also consider the reduced forms [U_i | U_j, j ∈ S_i ⊂ {1, ..., k}]
     - B) The functional form of the joint density of U_1, U_2, ..., U_k is known and at least one [U_i | U_j, j ≠ i] is available
  13. Outline: I. Introduction, II. Sampling Approaches, III. Examples and Numerical Illustrations, IV. Conclusion
  14. II. Sampling Approaches: Assumptions
     - We deal with real random variables having a joint distribution whose density function is strictly positive over the sample space
     - The full set of conditional specifications uniquely defines the full joint density
     - Densities exist with respect to either Lebesgue or counting measure for all marginal and conditional distributions
  17. II. Sampling Approaches: 1. Substitution Algorithm, 2. Substitution Sampling, 3. Gibbs Sampling, 4. Rubin Importance-Sampling Algorithm
  19. II. 1. Substitution Algorithm: Introduction
     - A standard mathematical tool used in finding fixed-point solutions to certain classes of integral equations
     - Its utility in statistical problems was developed by Tanner and Wong (1987), who called it the data-augmentation algorithm
  21. II. 1. Substitution Algorithm: two-variable case (Tanner and Wong's development)
     (1) [X] = ∫ [X | Y]·[Y] dY
     (2) [Y] = ∫ [Y | X]·[X] dX
     (3) [X] = ∫ [X | Y] (∫ [Y | X′]·[X′] dX′) dY = ∫ h(X, X′)·[X′] dX′
     where h(X, X′) = ∫ [X | Y]·[Y | X′] dY, with X′ a dummy argument and [X′] ≡ [X]
  23. II. 1. Substitution Algorithm: two-variable case (algorithm)
     - [X′] is replaced by [X]_i
     - [X]_{i+1} = ∫ h(X, X′)·[X]_i dX′ = I_h [X]_i
     - I_h is the integral operator associated with h
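The fixed-point iteration can be made concrete on a toy discrete example (my own illustration, not from the paper or the slides): for finite X and Y the integral operator I_h is just a matrix, and iterating [X]_{i+1} = I_h [X]_i converges to the true marginal. All numbers below are made up.

```python
# A made-up joint distribution over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.20, (1, 1): 0.40}

px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}          # true [X]
py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}          # true [Y]
p_x_given_y = {(x, y): joint[(x, y)] / py[y] for (x, y) in joint}
p_y_given_x = {(x, y): joint[(x, y)] / px[x] for (x, y) in joint}

def h(x, xp):
    # h(x, x') = sum over y of [x | y] * [y | x']
    return sum(p_x_given_y[(x, y)] * p_y_given_x[(xp, y)] for y in (0, 1))

# Iterate [X]_{i+1} = I_h [X]_i from an arbitrary (here degenerate) start.
pi = {0: 1.0, 1: 0.0}
for _ in range(50):
    pi = {x: sum(h(x, xp) * pi[xp] for xp in (0, 1)) for x in (0, 1)}

print(pi[0], px[0])   # the iterates settle on the true marginal [X]
```

Because the convergence is geometric (TW3), 50 iterations are far more than enough here.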
  25. II. 1. Substitution Algorithm: main properties. Theorem TW1 (uniqueness): the true marginal density [X] is the unique solution to (3)
  26. II. 1. Substitution Algorithm: main properties. Theorem TW2 (convergence): for almost any [X]_0, the sequence [X]_1, [X]_2, ... defined by [X]_{i+1} = I_h [X]_i converges monotonically in L1 to [X]
  27. II. 1. Substitution Algorithm: main properties. Theorem TW3 (rate): ∫ |[X]_i - [X]| → 0 geometrically in i
  28. II. 1. Substitution Algorithm: extension. With X, Y and Z three random variables:
     (4) [X] = ∫ [X, Z | Y]·[Y]
     (5) [Y] = ∫ [Y, X | Z]·[Z]
     (6) [Z] = ∫ [Z, Y | X]·[X]
     By substitution:
     - [X] = ∫ h(X, X′)·[X′]
     - where h(X, X′) = ∫ [X, Z | Y]·[Y, X | Z]·[Z, Y | X′]
     - with X′ a dummy argument and [X′] ≡ [X]
     Extension to k variables is straightforward.
  31. II. Sampling Approaches: 1. Substitution Algorithm, 2. Substitution Sampling, 3. Gibbs Sampling, 4. Rubin Importance-Sampling Algorithm
  32. II. 2. Substitution Sampling: two-variable case (assumptions)
     - [X | Y] and [Y | X] are available
     - [X]_0 is an arbitrary (possibly degenerate) initial density
  34. II. 2. Substitution Sampling: two-variable case (algorithm). Cycle:
     (i) Draw a single X^(0) from [X]_0
     (ii) Draw Y^(1) ~ [Y | X^(0)]
     (iii) Complete the cycle by drawing X^(1) ~ [X | Y^(1)]
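As a hedged illustration of one such cycle (my own toy example, not the slides'): for a standard bivariate normal with correlation rho, both conditionals are available, [X | Y] = N(rho·Y, 1 - rho²) and [Y | X] = N(rho·X, 1 - rho²), so each cycle is just two Gaussian draws.

```python
import random

random.seed(1)
rho = 0.7
sd = (1 - rho ** 2) ** 0.5       # conditional standard deviation

x = 5.0                          # X^(0) from an arbitrary (degenerate) [X]_0
draws = []
for i in range(20000):
    y = random.gauss(rho * x, sd)   # Y^(i+1) ~ [Y | X^(i)]
    x = random.gauss(rho * y, sd)   # X^(i+1) ~ [X | Y^(i+1)]
    if i >= 1000:                   # discard early, non-stationary cycles
        draws.append(x)

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# mean and var approach 0 and 1, the moments of the true marginal N(0, 1)
```

This runs one long chain for simplicity; the paper's scheme instead restarts m independent replications of the cycle.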
  35. II. 2. Substitution Sampling: two-variable case (convergence). Repetition of this cycle produces, after i iterations, the pair (X^(i), Y^(i)) such that X^(i) →d X ~ [X] and Y^(i) →d Y ~ [Y]
  37. II. 2. Substitution Sampling: two-variable case (estimator). If at each iteration i we generate m iid pairs (X_j^(i), Y_j^(i)), j ∈ {1, ..., m}, we can estimate [X] by the Monte Carlo integration [X̂]_i = (1/m) Σ_{j=1}^m [X | Y_j^(i)]
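A sketch of this mixture estimator on a toy bivariate-normal target (my example; the target, m and the number of cycles are arbitrary choices): run m independent replications of the cycle, then average the conditional densities [X | Y_j^(i)] at a point. The true marginal is N(0, 1).

```python
import math
import random

random.seed(2)
rho, m, n_cycles = 0.7, 500, 20
sd = (1 - rho ** 2) ** 0.5

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

ys = []
for _ in range(m):                   # m independent replications of the cycle
    x = random.gauss(0, 3)           # arbitrary spread-out [X]_0
    y = 0.0
    for _ in range(n_cycles):
        y = random.gauss(rho * x, sd)
        x = random.gauss(rho * y, sd)
    ys.append(y)                     # keep Y_j^(i)

# density estimate at x0: (1/m) * sum_j [X = x0 | Y_j^(i)]
x0 = 0.0
est = sum(normal_pdf(x0, rho * yj, sd) for yj in ys) / m
true_val = normal_pdf(x0, 0.0, 1.0)  # true marginal density at x0
```

Averaging conditional densities rather than binning the X draws is exactly what makes the estimator smooth.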
  39. II. 2. Substitution Sampling: two-variable case (L1 convergence). [X̂]_i converges in L1 to [X] since ∫ |[X̂]_i - [X]| ≤ ∫ |[X̂]_i - [X]_i| + ∫ |[X]_i - [X]|, and
     - [X̂]_i →P [X]_i as m → ∞ (Glick 1974)
     - [X]_i →L1 [X] as i → ∞ (TW2)
  40. II. 2. Substitution Sampling: extension
     - Extension to more than two variables is straightforward
     - In the three-variable case, with an arbitrary starting marginal density [X]_0 for X:
       X^(0) ~ [X]_0
       (Z^(0)′, Y^(0)′) ~ [Z, Y | X^(0)]
       (Y^(1), X^(0)′) ~ [Y, X | Z^(0)′]
       (X^(1), Z^(1)) ~ [X, Z | Y^(1)]
     - Six generated variables are required per cycle
  41. II. 2. Substitution Sampling: extension
     - Repeating this cycle i times produces (X^(i), Y^(i), Z^(i)) such that X^(i) →d X ~ [X], Y^(i) →d Y ~ [Y] and Z^(i) →d Z ~ [Z]
     - At the i-th iteration, with m generations of iid samples (X_j^(i), Y_j^(i), Z_j^(i)): [X̂]_i = (1/m) Σ_{j=1}^m [X | Y_j^(i), Z_j^(i)]
     - The L1 convergence still follows
  42. II. 2. Substitution Sampling: extension. For k variables U_1, ..., U_k:
     - k(k-1) random variate generations to complete one cycle
     - m·i·k(k-1) random generations for m sequences and i iterations
     - [Û_s]_i = (1/m) Σ_{j=1}^m [U_s | U_t = U_tj^(i), t ≠ s]
     - The L1 convergence still follows
  43. II. Sampling Approaches: 1. Substitution Algorithm, 2. Substitution Sampling, 3. Gibbs Sampling, 4. Rubin Importance-Sampling Algorithm
  44. II. 3. Gibbs Sampling: Introduction
     - The Gibbs sampler was introduced by Geman and Geman (1984) to simulate marginal densities without using all conditional distributions, just the full ones
     - i.e., [X | Y, Z], [Y | X, Z] and [Z | X, Y] in the three-variable case
  46. II. 3. Gibbs Sampling: the Gibbs scheme
     - A Markovian updating scheme
     - With an arbitrary starting set of values U_1^(0), ..., U_k^(0):
       U_1^(1) ~ [U_1 | U_2^(0), ..., U_k^(0)]
       U_2^(1) ~ [U_2 | U_1^(1), U_3^(0), ..., U_k^(0)]
       U_3^(1) ~ [U_3 | U_1^(1), U_2^(1), U_4^(0), ..., U_k^(0)]
       ...
       U_k^(1) ~ [U_k | U_1^(1), ..., U_{k-1}^(1)]
     - k random variate generations are required in a cycle
     - After i iterations ⇒ (U_1^(i), ..., U_k^(i))
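The scheme above can be sketched on a small example of my own (not from the paper): for a trivariate normal with unit variances and common correlation r, each full conditional is univariate normal, U_s | rest ~ N(r/(1+r) · (sum of the other two), (1-r)(1+2r)/(1+r)), so one Gibbs cycle is three draws.

```python
import random

random.seed(3)
r = 0.5
coef = r / (1 + r)                            # conditional-mean coefficient
csd = ((1 - r) * (1 + 2 * r) / (1 + r)) ** 0.5  # conditional std deviation

u = [10.0, -10.0, 10.0]                       # arbitrary starting values U^(0)
draws = []
for i in range(30000):
    for s in range(3):                        # update U_1, U_2, U_3 in turn,
        others = sum(u) - u[s]                # conditioning on current values
        u[s] = random.gauss(coef * others, csd)
    if i >= 500:                              # drop early cycles
        draws.append(u[0])

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# mean and var approach 0 and 1, the true marginal moments of U_1
```

The long-run averages illustrate both GG1 (marginal convergence) and GG3 (ergodic averages) on this toy target.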
  49. II. 3. Gibbs Sampling: main properties. Theorem GG1 (convergence): (U_1^(i), ..., U_k^(i)) →d [U_1, ..., U_k], and hence for each s, U_s^(i) →d U_s ~ [U_s] as i → ∞
  50. II. 3. Gibbs Sampling: main properties. Theorem GG2 (rate): using the sup norm, rather than the L1 norm, the joint density of (U_1^(i), ..., U_k^(i)) converges to the true joint density at a geometric rate in i
  51. II. 3. Gibbs Sampling: main properties. Theorem GG3 (ergodic theorem): for any measurable function T of U_1, ..., U_k whose expectation exists, (1/i) Σ_{l=1}^i T(U_1^(l), ..., U_k^(l)) → E[T(U_1, ..., U_k)] almost surely as i → ∞
  52. II. 3. Gibbs Sampling: estimator. The density estimate for [U_s] is [Û_s]_i = (1/m) Σ_{j=1}^m [U_s | U_t = U_tj^(i), t ≠ s]
  53. II. 3. Gibbs Sampling: substitution versus Gibbs. Differences between the two samplers:
                                      Substitution    Gibbs
     Conditional distributions        all             full ones
     Variables generated per cycle    k(k-1)          k
  54. II. 3. Gibbs Sampling: substitution versus Gibbs
     - The substitution and Gibbs samplers are equivalent when only the set of full conditionals is available
     - If reduced conditional distributions are available, substitution sampling offers the possibility of acceleration relative to Gibbs sampling
  55. II. 3. Gibbs Sampling: substitution versus Gibbs (example)
     a) Y^(0)′ ~ [Y | X^(0), Z^(0)]
     b) Z^(0)′ ~ [Z | Y^(0)′, X^(0)]
     c) X^(0)′ ~ [X | Z^(0)′, Y^(0)′]
     d) Y^(1) ~ [Y | X^(0)′, Z^(0)′]
     e) Z^(1) ~ [Z | Y^(1), X^(0)′]
     f) X^(1) ~ [X | Z^(1), Y^(1)]
     If [Z | Y] is available, e) becomes Z^(1) ~ [Z | Y^(1)]
  57. II. Sampling Approaches: 1. Substitution Algorithm, 2. Substitution Sampling, 3. Gibbs Sampling, 4. Rubin Importance-Sampling Algorithm
  58. II. 4. Rubin Importance-Sampling Algorithm: Introduction. A noniterative Monte Carlo method for generating marginal distributions using importance-sampling ideas
  59. II. 4. Rubin Importance-Sampling Algorithm: Monte Carlo integration
     Problem: J_h = E_f[h(X)] = ∫_H h(x) f(x) dx
     MC solution: h̄_m = (1/m) Σ_{i=1}^m h(x_i) with (x_1, ..., x_m) ~ f
  60. II. 4. Rubin Importance-Sampling Algorithm: the importance idea
     Problem: J_h = E_g[h(X) f(X)/g(X)] = ∫_H [h(x) f(x)/g(x)] g(x) dx
     MC solution: h̄_m = (1/m) Σ_{i=1}^m h(x_i) f(x_i)/g(x_i) with (x_1, ..., x_m) ~ g
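The importance idea can be sketched as follows (all choices below are mine, made up for illustration): estimate J_h = E_f[h(X)] for h(x) = x², f = N(0, 1), while sampling from the wider g = N(0, 2²). The true value is Var_f(X) = 1.

```python
import math
import random

random.seed(4)

def normal_pdf(x, s):
    # density of N(0, s^2)
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

m = 100000
total = 0.0
for _ in range(m):
    x = random.gauss(0, 2)                       # x_i ~ g
    total += (x ** 2) * normal_pdf(x, 1) / normal_pdf(x, 2)  # h(x_i) f(x_i)/g(x_i)
est = total / m
# est approaches 1, the value of E_f[X^2]
```

The reweighting by f/g is what corrects for sampling from the wrong distribution.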
  61. II. 4. Rubin Importance-Sampling Algorithm: two-variable case (assumptions)
     - The functional form (modulo the normalizing constant) of the joint density [X, Y] is known
     - The conditional distribution [X | Y] is available
  63. II. 4. Rubin Importance-Sampling Algorithm: two-variable case (Rubin's idea)
     - Choose an importance-sampling distribution [Y]_s for Y
     - Use [X | Y]·[Y]_s as an importance-sampling distribution for (X, Y)
     - (X_l, Y_l) is created by drawing Y_l ~ [Y]_s and X_l ~ [X | Y_l], l = 1, ..., N
     - Calculate r_l = [X_l, Y_l] / ([X_l | Y_l]·[Y_l]_s)
  64. II. 4. Rubin Importance-Sampling Algorithm: two-variable case. The estimator of the marginal density [X] is [X̂] = Σ_{l=1}^N [X | Y_l]·r_l / Σ_{l=1}^N r_l
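A hedged sketch of Rubin's algorithm on a made-up discrete toy problem: the joint q below is known only up to a normalizing constant, [X | Y] is computed exactly, and [Y]_s is a deliberately crude uniform choice.

```python
import random

random.seed(5)

# Unnormalized joint over X in {0, 1}, Y in {0, 1} (made-up numbers).
q = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 6.0}
qy = {y: q[(0, y)] + q[(1, y)] for y in (0, 1)}
p_x_given_y = {(x, y): q[(x, y)] / qy[y] for (x, y) in q}   # exact [X | Y]
ys_density = 0.5                                            # [Y]_s uniform on {0, 1}

N = 50000
num = {0: 0.0, 1: 0.0}
den = 0.0
for _ in range(N):
    yl = random.randrange(2)                                 # Y_l ~ [Y]_s
    xl = 0 if random.random() < p_x_given_y[(0, yl)] else 1  # X_l ~ [X | Y_l]
    rl = q[(xl, yl)] / (p_x_given_y[(xl, yl)] * ys_density)  # r_l
    den += rl
    for x in (0, 1):
        num[x] += p_x_given_y[(x, yl)] * rl                  # [x | Y_l] * r_l

est = {x: num[x] / den for x in (0, 1)}                      # Rubin's [X-hat]
total = sum(q.values())
true_marginal = {x: (q[(x, 0)] + q[(x, 1)]) / total for x in (0, 1)}
```

Note the unknown normalizing constant of q cancels in the ratio, which is the point of self-normalizing the weights.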
  65. II. 4. Rubin Importance-Sampling Algorithm: main property. Dividing the numerator and the denominator by N and using the law of large numbers, we obtain Theorem R1 (convergence): [X̂] → [X] with probability 1 as N → ∞, for almost every X
  66. II. 4. Rubin Importance-Sampling Algorithm: extension (assumptions). In the three-variable case, we need:
     - The functional form of [X, Y, Z]
     - The availability of [X | Y, Z]
     - An importance-sampling distribution [Y, Z]_s
  67. II. 4. Rubin Importance-Sampling Algorithm: extension (estimator). Then:
     - r_l = [X_l, Y_l, Z_l] / ([X_l | Y_l, Z_l]·[Y_l, Z_l]_s)
     - [X̂] = Σ_{l=1}^N [X | Y_l, Z_l]·r_l / Σ_{l=1}^N r_l
  68. II. 4. Rubin Importance-Sampling Algorithm: extension. In the k-variable case:
     - N·k variables are generated
     - The extension is again straightforward
  69. Outline: I. Introduction, II. Sampling Approaches, III. Examples and Numerical Illustrations, IV. Conclusion
  70. III. Examples and Numerical Illustrations: 1. A Multinomial Model, 2. A Conjugate Hierarchical Model
  72. III. 1. A Multinomial Model: motivations
     - A two-parameter version of a one-parameter genetic-linkage example
     - Some observations are not assigned to individual cells but to aggregates of cells ⇒ multinomial sampling
  74. III. 1. A Multinomial Model: data and prior information
     - Y = (Y_1, ..., Y_5) ~ mult(n; a_1·θ + b_1, a_2·θ + b_2, a_3·η + b_3, a_4·η + b_4, c(1 - θ - η))
     - where a_i, b_i ≥ 0 are known
     - 0 < c = 1 - Σ b_i = a_1 + a_2 = a_3 + a_4 < 1
     - θ, η ≥ 0, θ + η ≤ 1
     - (θ, η) ~ Dirichlet(α_1, α_2, α_3)
  76. III. 1. A Multinomial Model: unobservable data and posterior information
     - X = (X_1, ..., X_9) ~ mult(n; a_1·θ, b_1, a_2·θ, b_2, a_3·η, b_3, a_4·η, b_4, c(1 - θ - η))
     - (θ, η | X) ~ Dirichlet(X_1 + X_3 + α_1, X_5 + X_7 + α_2, X_9 + α_3)
     - [θ | X, η] and [η | X, θ] are available as scaled beta distributions on [0, 1 - η] and [0, 1 - θ]
  77. III. 1. A Multinomial Model: the authors' trick. If we let
     Y_1 = X_1 + X_2, Y_2 = X_3 + X_4, Y_3 = X_5 + X_6, Y_4 = X_7 + X_8, Y_5 = X_9 and Z = (X_1, X_3, X_5, X_7),
     then studying X is equivalent to studying (Y, Z)
  79. III. 1. A Multinomial Model: studied case
     - We thus have a three-variable case (θ, η, Z), with interest in the marginal distributions [θ | Y], [η | Y] and [Z | Y]
     - Note that [Z | Y, θ, η] is the product of four independent binomials in X_1, X_3, X_5 and X_7 ⇒ [X_i | Y, θ, η] = binomial(Y_i, a_i·θ/(a_i·θ + b_i))
  80. III. 1. A Multinomial Model: data used
     Y = (14, 1, 1, 1, 5) ~ mult(22; θ/4 + 1/8, θ/4, η/4, η/4 + 3/8, (1 - θ - η)/2)
     X = (X_1, ..., X_7) ~ mult(22; θ/4, 1/8, θ/4, η/4, η/4, 3/8, (1 - θ - η)/2)
     Z = (X_1, X_5)
  81. III. 1. A Multinomial Model: substitution and Gibbs sampling. To compare the two forms of iterative sampling, the authors:
     - obtained numerical estimates of [θ | Y] and [η | Y]
     - repeated the following scheme 5,000 times: initialize θ ~ U(0, 1), η ~ U(0, 1) with 0 ≤ θ + η ≤ 1, then run 4 cycles of the two samplers with m = 10
     - compared the average cumulative posterior probabilities
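A Gibbs-style sketch of this example, with two loud assumptions: the slides do not state the Dirichlet hyperparameters, so I take α_1 = α_2 = α_3 = 1, and I use the cell probabilities (θ/4 + 1/8, θ/4, η/4, η/4 + 3/8, (1 - θ - η)/2) with Y = (14, 1, 1, 1, 5).

```python
import random

random.seed(6)
Y = (14, 1, 1, 1, 5)
a1 = a2 = a3 = 1.0              # assumed Dirichlet hyperparameters (not in the slides)

def binomial(n, p):
    # binomial(n, p) draw from standard-library uniforms
    return sum(random.random() < p for _ in range(n))

theta, eta = 0.3, 0.3           # arbitrary start with theta + eta <= 1
th_draws, et_draws = [], []
for i in range(2000):
    # [Z | Y, theta, eta]: split the two aggregated cells
    x1 = binomial(Y[0], (theta / 4) / (theta / 4 + 1 / 8))   # latent part of Y_1
    x5 = binomial(Y[3], (eta / 4) / (eta / 4 + 3 / 8))       # latent part of Y_4
    # [theta | eta, Z, Y]: scaled beta on [0, 1 - eta]
    theta = (1 - eta) * random.betavariate(x1 + Y[1] + a1, Y[4] + a3)
    # [eta | theta, Z, Y]: scaled beta on [0, 1 - theta]
    eta = (1 - theta) * random.betavariate(x5 + Y[2] + a2, Y[4] + a3)
    if i >= 200:
        th_draws.append(theta)
        et_draws.append(eta)

mean_theta = sum(th_draws) / len(th_draws)
mean_eta = sum(et_draws) / len(et_draws)
```

This runs one long chain rather than the authors' 5,000 restarted replications, but it uses the same full conditionals.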
  82. III. 1. A Multinomial Model Substitution and Gibbs Sampling

  83. III. 1. A Multinomial Model: substitution and Gibbs sampling. We note that:
     - The substitution sampler adapts more quickly than the Gibbs sampler
     - In the long run the two samplers have the same performance
     - Few random variate generations are required to obtain convergence (m = 10)
  84. III. 1. A Multinomial Model: Rubin importance-sampling. The Rubin importance-sampling algorithm requires:
     - [Z | Y]_s to draw Z_l
     - [η | Y, Z] to draw η_l
     - [θ | η, Z, Y] to draw θ_l
     The ratio r_l is given by r_l = [Y, Z_l | θ_l, η_l]·[θ_l, η_l] / ([θ_l | η_l, Z_l, Y]·[η_l | Y, Z_l]·[Z_l | Y]_s)
  85. III. 1. A Multinomial Model: Rubin importance-sampling. The authors obtained the following average cumulative posterior probabilities with:
     - 2,500 simulations
     - [Z | Y]_s the product of X_1 ~ binomial(Y_1, 1/2) and X_5 ~ binomial(Y_4, 1/2)
  86. III. 1. A Multinomial Model: Rubin importance-sampling. We note that:
     - The estimation is rather poor
     - The result may be sensitive to the choice of the importance distribution
  87. III. Examples and Numerical Illustrations: 1. A Multinomial Model, 2. A Poisson Model
  88. III. 2. A Poisson Model: introduction. We consider an exchangeable Poisson model for pump failures, with:
     - s_i the number of failures
     - t_i the length of time (in thousands of hours)
     - ρ_i = s_i/t_i the failure rate
  89. III. 2. A Poisson Model: prior information. We assume that:
     - Y = (s_1, ..., s_p)
     - [s_i | λ_i] = Poisson(λ_i t_i)
     - the λ_i are iid from Gamma(α, β), with α known and β ~ IG(γ, δ)
  90. III. 2. A Poisson Model: full conditional distributions. We have:
     - [λ_j | Y, β, λ_i (i ≠ j)] = Gamma(α + s_j, (t_j + 1/β)^(-1)), j = 1, ..., p
     - [β | Y, λ_1, ..., λ_p] ~ IG(γ + pα, Σ λ_i + δ)
  91. III. 2. A Poisson Model: Gibbs cycle. Given (λ_1^(0), ..., λ_p^(0), β^(0)), we draw:
     - λ_j^(1) ~ Gamma(α + s_j, (t_j + 1/β^(0))^(-1))
     - β^(1) ~ IG(γ + pα, Σ λ_i^(1) + δ)
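This cycle can be sketched in code, with loud assumptions: the failure counts and times below are made up (the actual pump-failure data is not listed on the slides), and α is fixed arbitrarily rather than by a moment estimate, with γ = 1 and δ = 0.1.

```python
import random

random.seed(7)
s = [3, 1, 6, 2, 9]                     # hypothetical failure counts
t = [20.0, 10.0, 35.0, 15.0, 40.0]      # hypothetical times, thousands of hours
p = len(s)
alpha, gam, delta = 1.8, 1.0, 0.1       # alpha assumed; gamma, delta hyperparameters

def inverse_gamma(shape, scale):
    # beta ~ IG(shape, scale)  <=>  1/beta ~ Gamma(shape, 1/scale)
    return 1.0 / random.gammavariate(shape, 1.0 / scale)

lam = [1.0] * p                         # arbitrary lambda^(0)
beta = 1.0                              # arbitrary beta^(0)
lam0_draws = []
for i in range(5000):
    # [lambda_j | Y, beta] = Gamma(alpha + s_j, (t_j + 1/beta)^(-1))
    lam = [random.gammavariate(alpha + s[j], 1.0 / (t[j] + 1.0 / beta))
           for j in range(p)]
    # [beta | Y, lambda] = IG(gamma + p*alpha, sum(lambda) + delta)
    beta = inverse_gamma(gam + p * alpha, sum(lam) + delta)
    if i >= 500:
        lam0_draws.append(lam[0])

mean_lam0 = sum(lam0_draws) / len(lam0_draws)   # posterior mean of lambda_1
```

Here `random.gammavariate(shape, scale)` matches the slide's Gamma parameterization, whose second argument is a scale.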
  92. III. 2. A Poisson Model: marginal densities estimated by Gibbs sampling. After i iterations with m repetitions (λ_1l^(i), ..., λ_pl^(i), β_l^(i)), l = 1, ..., m:
     - [λ̂_j | Y] = (1/m) Σ_{l=1}^m Gamma(α + s_j, (t_j + 1/β_l^(i))^(-1))
     - [β̂ | Y] = (1/m) Σ_{l=1}^m IG(γ + pα, Σ_j λ_jl^(i) + δ)
  93. III. 2. A Poisson Model: data
     - Pump-failure data analyzed by Gaver and O'Muircheartaigh (1987)
     - p = 10, γ = 1, δ = 0.1
     - α = (ρ̄)² / (S² - p^(-1) ρ̄ Σ t_i^(-1))
  94. III. 2. A Poisson Model: results after 10 cycles of the algorithm:
  96. III. 2. A Poisson Model: conclusion
     - Great fit of the Gibbs estimator
     - Convergence from a small number of draws (m = 10, 100)
  97. Outline: I. Introduction, II. Sampling Approaches, III. Examples and Numerical Illustrations, IV. Conclusion
  98. IV. Conclusion
     - These sampling algorithms are straightforward to implement
     - Substitution and Gibbs sampling (iterative methods) provide better results in terms of convergence than Rubin importance-sampling (a noniterative method)
     - The performance of Rubin importance-sampling depends on the choice of the importance distribution
     - If some reduced conditional distributions are available, substitution sampling becomes more efficient than Gibbs sampling
  99. Thanks for your attention

  100. References
     - Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
     - Glick, N. (1974), "Consistency Conditions for Probability Estimators and Integrals of Density Estimators," Utilitas Mathematica, 6, 61-74.
     - Rubin, D. B. (1987), Comment on "The Calculation of Posterior Distributions by Data Augmentation," by M. A. Tanner and W. H. Wong, Journal of the American Statistical Association, 82, 543-546.
     - Tanner, M., and Wong, W. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-550.