Slide 1

Sampling-Based Approaches to Calculating Marginal Densities
Alan E. Gelfand; Adrian F. Smith
Journal of the American Statistical Association, Vol. 85, No. 410 (Jun., 1990), pp. 398-409
November 4, 2013

Slide 2

Outline
I. Introduction
II. Sampling Approaches
III. Examples and Numerical Illustrations
IV. Conclusion

Slide 4

I. Introduction 1. Purposes 2. Context

Slide 6

I. 1. Purposes
- Exploitation of structural information to obtain numerical estimates of marginal densities that are not available analytically
- Use of sampling methods instead of implementing sophisticated numerical analytic ones
- More attractive because of their simplicity and ease of implementation

Slide 9

I. Introduction 1. Purposes 2. Context

Slide 10

I. 2. Context
Two cases:
- A) For i = 1, ..., k the conditional distributions [U_i | U_j, j ≠ i] are available. We can also consider the reduced forms [U_i | U_j, j ∈ S_i ⊂ {1, ..., k}].
- B) The functional form of the joint density of U_1, U_2, ..., U_k is known and at least one [U_i | U_j, j ≠ i] is available.

Slide 13

Outline
I. Introduction
II. Sampling Approaches
III. Examples and Numerical Illustrations
IV. Conclusion

Slide 14

II. Sampling Approaches
Assumptions:
- We deal with real random variables having a joint distribution whose density function is strictly positive over the sample space
- The full set of conditional specifications uniquely defines the full joint density
- Densities exist, with respect to either Lebesgue or counting measure, for all marginal and conditional distributions

Slide 17

II. Sampling Approaches
1. Substitution Algorithm
2. Substitution Sampling
3. Gibbs Sampling
4. Rubin Importance-Sampling Algorithm

Slide 19

II. 1. Substitution Algorithm
Introduction
- A standard mathematical tool used to find fixed-point solutions to certain classes of integral equations
- Its utility in statistical problems was developed by Tanner and Wong (1987), who called it the data-augmentation algorithm

Slide 21

II. 1. Substitution Algorithm
Two-variable case / Tanner and Wong's development
(1) [X] = ∫ [X | Y] [Y]
(2) [Y] = ∫ [Y | X] [X]
(3) [X] = ∫ [X | Y] ∫ [Y | X'] [X'] = ∫ h(X, X') [X']
where h(X, X') = ∫ [X | Y] [Y | X'], with X' a dummy argument and [X'] ≡ [X]

Slide 23

II. 1. Substitution Algorithm
Two-variable case / Algorithm
- [X'] is replaced by [X]_i
- [X]_{i+1} = ∫ h(X, X') [X]_i = I_h [X]_i
- I_h = the integral operator associated with h

Slide 25

II. 1. Substitution Algorithm
Main properties
Theorem TW1 (uniqueness). The true marginal density [X] is the unique solution to (3).

Slide 26

II. 1. Substitution Algorithm
Main properties
Theorem TW2 (convergence). For almost any [X]_0, the sequence [X]_1, [X]_2, ... defined by [X]_{i+1} = I_h [X]_i converges monotonically in L_1 to [X].

Slide 27

II. 1. Substitution Algorithm
Main properties
Theorem TW3 (rate). ∫ |[X]_i − [X]| → 0 geometrically in i.

Slide 28

II. 1. Substitution Algorithm
Extension case
X, Y and Z, three random variables:
(4) [X] = ∫ [X, Z | Y] [Y]
(5) [Y] = ∫ [Y, X | Z] [Z]
(6) [Z] = ∫ [Z, Y | X] [X]
By substitution:
- [X] = ∫ h(X, X') [X']
- where h(X, X') = ∫ [X, Z | Y] [Y, X | Z] [Z, Y | X']
- with X' a dummy argument and [X'] ≡ [X]
Extension to k variables is straightforward.

Slide 31

II. Sampling Approaches
1. Substitution Algorithm
2. Substitution Sampling
3. Gibbs Sampling
4. Rubin Importance-Sampling Algorithm

Slide 32

II. 2. Substitution Sampling
Two-variable case / Assumptions
- [X | Y] and [Y | X] are available
- [X]_0 is an arbitrary (possibly degenerate) initial density

Slide 34

II. 2. Substitution Sampling
Two-variable case / Algorithm
Cycle:
(i) Draw a single X^(0) from [X]_0
(ii) Draw Y^(1) ~ [Y | X^(0)]
(iii) Complete this cycle by drawing X^(1) ~ [X | Y^(1)]
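
A minimal sketch of this two-variable cycle in Python, using an assumed toy target (a standard bivariate normal with correlation rho = 0.8, so both conditionals [X | Y] and [Y | X] are available as normal densities); the model and names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                   # assumed correlation of the toy bivariate normal target
sd = np.sqrt(1.0 - rho**2)                  # conditional standard deviation

def substitution_cycle(x_prev):
    """One cycle: Y(i+1) ~ [Y | X(i)], then X(i+1) ~ [X | Y(i+1)]."""
    y_new = rng.normal(rho * x_prev, sd)    # [Y | X = x] is N(rho*x, 1 - rho^2)
    x_new = rng.normal(rho * y_new, sd)     # [X | Y = y] is N(rho*y, 1 - rho^2)
    return x_new, y_new

x = rng.normal(5.0, 1.0)                    # X(0) from an arbitrary initial density [X]_0
for i in range(50):                         # i iterations of the cycle
    x, y = substitution_cycle(x)
print(x, y)                                 # (X(i), Y(i)), approximately distributed as [X] and [Y]
```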

Slide 35

II. 2. Substitution Sampling
Two-variable case / Convergence
Repetition of this cycle produces, after i iterations, the pair (X^(i), Y^(i)) such that:
- X^(i) →_d X ~ [X] and Y^(i) →_d Y ~ [Y]

Slide 37

II. 2. Substitution Sampling
Two-variable case / Estimator
If at each iteration i we generate m iid pairs (X_j^(i), Y_j^(i)), j = 1, ..., m, we can estimate [X] with the Monte Carlo integration:
[X̂]_i = (1/m) Σ_{j=1}^{m} [X | Y_j^(i)]
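
Continuing the same assumed bivariate-normal toy example, a sketch of this estimator: m independent replications of the cycle give Y_1^(i), ..., Y_m^(i), and the marginal estimate is the average of the conditional densities, which can be checked against the known N(0, 1) marginal of X.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
rho, m, n_iter = 0.8, 500, 50
sd = np.sqrt(1.0 - rho**2)

ys = np.empty(m)                            # final Y_j^(i) from m independent replications
for j in range(m):
    x = rng.normal(5.0, 1.0)                # X_j^(0) ~ [X]_0
    for _ in range(n_iter):
        y = rng.normal(rho * x, sd)         # Y ~ [Y | X]
        x = rng.normal(rho * y, sd)         # X ~ [X | Y]
    ys[j] = y

def marginal_estimate(x_grid):
    # [X-hat]_i(x) = (1/m) * sum_j [x | Y_j^(i)], a mixture of conditional densities
    return np.mean(norm.pdf(x_grid[:, None], loc=rho * ys, scale=sd), axis=1)

grid = np.linspace(-3, 3, 7)
print(marginal_estimate(grid))              # close to the true N(0, 1) marginal density of X
```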

Slide 39

II. 2. Substitution Sampling
Two-variable case / L_1 Convergence
L_1 convergence of [X̂]_i to [X], since
∫ |[X̂]_i − [X]| ≤ ∫ |[X̂]_i − [X]_i| + ∫ |[X]_i − [X]|
and
- [X̂]_i →_P [X]_i when m → ∞ (Glick 1974)
- [X]_i →_{L_1} [X] when i → ∞ (TW2)

Slide 40

II. 2. Substitution Sampling
Extension case
- Extension to more than two variables is straightforward.
- In the three-variable case, with an arbitrary starting marginal density [X]_0 for X:
  X^(0) ~ [X]_0
  (Z^(0)', Y^(0)') ~ [Z, Y | X^(0)]
  (Y^(1), X^(0)') ~ [Y, X | Z^(0)']
  (X^(1), Z^(1)) ~ [X, Z | Y^(1)]
- Six generated variables are required

Slide 41

II. 2. Substitution Sampling
Extension case
- Repeating this cycle i times produces (X^(i), Y^(i), Z^(i)) such that:
  X^(i) →_d X ~ [X], Y^(i) →_d Y ~ [Y] and Z^(i) →_d Z ~ [Z]
- At the i-th iteration, for m generations:
  (X_j^(i), Y_j^(i), Z_j^(i)) are iid samples
  [X̂]_i = (1/m) Σ_{j=1}^{m} [X | Y_j^(i), Z_j^(i)]
- The L_1 convergence still follows

Slide 42

II. 2. Substitution Sampling
Extension case
For k variables U_1, ..., U_k:
- k(k−1) random variate generations to complete one cycle
- m·i·k(k−1) random generations for m sequences and i iterations
- [Û_s]_i = (1/m) Σ_{j=1}^{m} [U_s | U_t = U_{tj}^(i), t ≠ s]
- The L_1 convergence still follows

Slide 43

II. Sampling Approaches
1. Substitution Algorithm
2. Substitution Sampling
3. Gibbs Sampling
4. Rubin Importance-Sampling Algorithm

Slide 45

II. 3. Gibbs Sampling
Introduction
- The Gibbs sampler was introduced by Geman and Geman (1984) to simulate marginal densities without using all conditional distributions, just the full ones:
- [X | Y, Z], [Y | X, Z] and [Z | X, Y] in the three-variable case

Slide 46

II. 3. Gibbs Sampling
Gibbs scheme
- A Markovian updating scheme
- With an arbitrary starting set of values U_1^(0), ..., U_k^(0):
  U_1^(1) ~ [U_1 | U_2^(0), ..., U_k^(0)]
  U_2^(1) ~ [U_2 | U_1^(1), U_3^(0), ..., U_k^(0)]
  U_3^(1) ~ [U_3 | U_1^(1), U_2^(1), U_4^(0), ..., U_k^(0)]
  ...
  U_k^(1) ~ [U_k | U_1^(1), ..., U_{k−1}^(1)]
- k random variate generations are required in a cycle
- After i iterations ⇒ (U_1^(i), ..., U_k^(i))
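
A schematic sketch of this updating scheme, with the k full conditionals supplied as sampling functions; the callable interface and the bivariate-normal illustration are assumptions made for this sketch, not part of the paper.

```python
import numpy as np

def gibbs_cycle(u, full_conditionals, rng):
    """One Gibbs cycle: visit U_1, ..., U_k in turn, drawing each component
    from its full conditional given the current values of all the others."""
    u = list(u)
    for s, draw_s in enumerate(full_conditionals):
        u[s] = draw_s(u, rng)               # U_s(new) ~ [U_s | U_t, t != s]
    return u

# illustration with k = 2 and a standard bivariate normal, correlation 0.8
rho = 0.8
sd = np.sqrt(1.0 - rho**2)
conds = [
    lambda u, rng: rng.normal(rho * u[1], sd),   # [U_1 | U_2]
    lambda u, rng: rng.normal(rho * u[0], sd),   # [U_2 | U_1]
]
rng = np.random.default_rng(2)
u = [5.0, -5.0]                             # arbitrary starting values U^(0)
for i in range(100):                        # i iterations -> (U_1^(i), ..., U_k^(i))
    u = gibbs_cycle(u, conds, rng)
print(u)
```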

Slide 49

II. 3. Gibbs Sampling
Main properties
Theorem GG1 (convergence). (U_1^(i), ..., U_k^(i)) →_d [U_1, ..., U_k], and hence for each s, U_s^(i) →_d U_s ~ [U_s] as i → ∞.

Slide 50

II. 3. Gibbs Sampling
Main properties
Theorem GG2 (rate). Using the sup norm, rather than the L_1 norm, the joint density of (U_1^(i), ..., U_k^(i)) converges to the true joint density at a geometric rate in i.

Slide 51

II. 3. Gibbs Sampling
Main properties
Theorem GG3 (ergodic theorem). For any measurable function T of U_1, ..., U_k whose expectation exists,
lim_{i→∞} (1/i) Σ_{l=1}^{i} T(U_1^(l), ..., U_k^(l)) = E[T(U_1, ..., U_k)] almost surely.

Slide 52

II. 3. Gibbs Sampling
Estimator
The density estimate for [U_s] is given by:
[Û_s]_i = (1/m) Σ_{j=1}^{m} [U_s | U_t = U_{tj}^(i), t ≠ s]

Slide 53

II. 3. Gibbs Sampling
Substitution versus Gibbs
Differences between these two samplers:
- Conditional distributions used: all of them (substitution) vs. only the full conditionals (Gibbs)
- Random variates generated per cycle: k(k−1) (substitution) vs. k (Gibbs)

Slide 54

II. 3. Gibbs Sampling
Substitution versus Gibbs
- The substitution and Gibbs samplers are equivalent when only the set of full conditionals is available.
- If reduced conditional distributions are available, substitution sampling offers the possibility of acceleration relative to Gibbs sampling.

Slide 55

II. 3. Gibbs Sampling
Substitution versus Gibbs
Example
a) Y^(0)' ~ [Y | X^(0), Z^(0)]
b) Z^(0)' ~ [Z | Y^(0)', X^(0)]
c) X^(0)' ~ [X | Z^(0)', Y^(0)']
d) Y^(1) ~ [Y | X^(0)', Z^(0)']
e) Z^(1) ~ [Z | Y^(1), X^(0)']
f) X^(1) ~ [X | Z^(1), Y^(1)]
If [Z | Y] is available, e) becomes Z^(1) ~ [Z | Y^(1)]

Slide 57

II. Sampling Approaches
1. Substitution Algorithm
2. Substitution Sampling
3. Gibbs Sampling
4. Rubin Importance-Sampling Algorithm

Slide 58

II. 4. Rubin Importance-Sampling Algorithm
Introduction
- A noniterative Monte Carlo method for generating marginal distributions using importance-sampling ideas

Slide 59

II. 4. Rubin Importance-Sampling Algorithm
Introduction / Monte Carlo integration
Problem: J_h = E_f[h(X)] = ∫_H h(x) f(x) dx
MC solution: h̄_m = (1/m) Σ_{i=1}^{m} h(x_i), with (x_1, ..., x_m) ~ f

Slide 60

II. 4. Rubin Importance-Sampling Algorithm
Introduction / Importance-sampling idea
Problem: J_h = E_g[h(X) f(X) / g(X)] = ∫_H [h(x) f(x) / g(x)] g(x) dx
MC solution: h̄_m = (1/m) Σ_{i=1}^{m} h(x_i) f(x_i) / g(x_i), with (x_1, ..., x_m) ~ g
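
A small numerical illustration of the two estimators above, with an assumed integrand h(x) = x^2, target f = N(0, 1) (so J_h = 1), and a heavier-tailed t_3 importance density g; all of these choices are for illustration only.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(3)
m = 100_000
h = lambda x: x**2                          # J_h = E_f[h(X)] = 1 for f = N(0, 1)

# plain Monte Carlo: x_1, ..., x_m ~ f
x_f = rng.standard_normal(m)
mc_estimate = np.mean(h(x_f))

# importance sampling: x_1, ..., x_m ~ g (Student t_3), weighted by f(x)/g(x)
x_g = rng.standard_t(3, size=m)
weights = norm.pdf(x_g) / t.pdf(x_g, df=3)
is_estimate = np.mean(h(x_g) * weights)

print(mc_estimate, is_estimate)             # both close to the true value 1
```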

Slide 61

II. 4. Rubin Importance-Sampling Algorithm
Two-variable case / Assumptions
- The functional form (modulo the normalizing constant) of the joint density [X, Y] is known
- The conditional distribution [X | Y] is available

Slide 63

II. 4. Rubin Importance-Sampling Algorithm
Two-variable case / Rubin's idea
- Choose an importance-sampling distribution [Y]_s for Y
- Use [X | Y] [Y]_s as an importance-sampling distribution for (X, Y)
- (X_l, Y_l) is created by drawing Y_l ~ [Y]_s and X_l ~ [X | Y_l] (l = 1, ..., N)
- Calculate r_l = [X_l, Y_l] / ([X_l | Y_l] [Y_l]_s)

Slide 64

II. 4. Rubin Importance-Sampling Algorithm
Two-variable case
The estimator of the marginal density [X] is:
[X̂] = Σ_{l=1}^{N} [X | Y_l] r_l / Σ_{l=1}^{N} r_l
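
A sketch of this weighted-mixture estimator on the same assumed bivariate-normal toy model used earlier, with an arbitrary wide normal as the importance density [Y]_s; only the functional form of the joint is used, and its normalizing constant cancels between numerator and denominator.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
rho, N = 0.8, 5000
sd = np.sqrt(1.0 - rho**2)

def joint_form(x, y):
    # functional form of [X, Y] (standard bivariate normal, normalizing constant omitted)
    return np.exp(-(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2)))

y_l = rng.normal(0.0, 3.0, size=N)          # Y_l ~ [Y]_s = N(0, 3^2), the importance density
x_l = rng.normal(rho * y_l, sd)             # X_l ~ [X | Y_l]

# r_l = [X_l, Y_l] / ([X_l | Y_l] * [Y_l]_s)
r_l = joint_form(x_l, y_l) / (norm.pdf(x_l, rho * y_l, sd) * norm.pdf(y_l, 0.0, 3.0))

def marginal_estimate(x_grid):
    # [X-hat](x) = sum_l [x | Y_l] r_l / sum_l r_l
    cond = norm.pdf(x_grid[:, None], loc=rho * y_l, scale=sd)
    return cond @ r_l / r_l.sum()

grid = np.linspace(-3, 3, 7)
print(marginal_estimate(grid))              # approximates the true N(0, 1) marginal of X
```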

Slide 65

II. 4. Rubin Importance-Sampling Algorithm
Main property
By dividing the numerator and the denominator by N and using the law of large numbers, we obtain:
Theorem R1 (convergence). [X̂] → [X] with probability 1 as N → ∞, for almost every X.

Slide 66

II. 4. Rubin Importance-Sampling Algorithm
Extension case / Assumptions
In the three-variable case, we need:
- The functional form of [X, Y, Z]
- The availability of [X | Y, Z]
- An importance-sampling distribution [Y, Z]_s

Slide 67

II. 4. Rubin Importance-Sampling Algorithm
Extension case / Estimator
Then we have:
- r_l = [X_l, Y_l, Z_l] / ([X_l | Y_l, Z_l] [Y_l, Z_l]_s)
- [X̂] = Σ_{l=1}^{N} [X | Y_l, Z_l] r_l / Σ_{l=1}^{N} r_l

Slide 68

II. 4. Rubin Importance-Sampling Algorithm
Extension case
In the k-variable case:
- Nk variables are generated
- Extension is also straightforward

Slide 69

Outline
I. Introduction
II. Sampling Approaches
III. Examples and Numerical Illustrations
IV. Conclusion

Slide 70

III. Examples and Numerical Illustrations 1. A Multinomial Model 2. A Conjugate Hierarchical Model

Slide 72

III. 1. A Multinomial Model
Motivations
- Two-parameter version of a one-parameter genetic-linkage example
- Some observations are not assigned to individual cells but to aggregates of cells ⇒ we use multinomial sampling

Slide 74

III. 1. A Multinomial Model
Data and Prior Information
- Y = (Y_1, ..., Y_5) ~ mult(n; a_1 θ + b_1, a_2 θ + b_2, a_3 η + b_3, a_4 η + b_4, c(1 − θ − η))
- where a_i, b_i ≥ 0 are known
- 0 < c = 1 − Σ b_i = a_1 + a_2 = a_3 + a_4 < 1
- θ, η ≥ 0, θ + η ≤ 1
- (θ, η) ~ Dirichlet(α_1, α_2, α_3)

Slide 76

III. 1. A Multinomial Model
Unobservable Data and Posterior Information
- X = (X_1, ..., X_9) ~ mult(n; a_1 θ, b_1, a_2 θ, b_2, a_3 η, b_3, a_4 η, b_4, c(1 − θ − η))
- (θ, η | X) ~ Dirichlet(X_1 + X_3 + α_1, X_5 + X_7 + α_2, X_9 + α_3)
- [θ | X, η] and [η | X, θ] are available as scaled beta distributions on [0, 1 − η] and [0, 1 − θ]

Slide 77

III. 1. A Multinomial Model
Authors' trick
If we let:
- Y_1 = X_1 + X_2, Y_2 = X_3 + X_4, Y_3 = X_5 + X_6, Y_4 = X_7 + X_8 and Y_5 = X_9
- Z = (X_1, X_3, X_5, X_7)
⇒ studying X is equivalent to studying (Y, Z)

Slide 79

III. 1. A Multinomial Model
Studied case
- So we have a three-variable case (θ, η, Z), with interest in the marginal distributions [θ | Y], [η | Y] and [Z | Y]
- We can remark that [Z | Y, θ, η] is the product of four independent binomials for X_1, X_3, X_5, X_7, e.g. [X_1 | Y, θ, η] = binomial(Y_1, a_1 θ / (a_1 θ + b_1)), and likewise with η in place of θ for the η cells

Slide 80

III. 1. A Multinomial Model
Data Used
Y = (14, 1, 1, 1, 5) ~ mult(22; θ/4 + 1/8, θ/4, η/4, η/4 + 3/8, (1 − θ − η)/2)
X = (X_1, ..., X_7) ~ mult(22; θ/4, 1/8, θ/4, η/4, η/4, 3/8, (1 − θ − η)/2)
Z = (X_1, X_5)

Slide 81

III. 1. A Multinomial Model
Substitution and Gibbs Sampling
To compare the two forms of iterative sampling, the authors:
- obtained numerical estimates of [θ | Y] and [η | Y]
- repeated the following scheme 5,000 times:
  initialize θ ~ U(0, 1), η ~ U(0, 1), with 0 ≤ θ + η ≤ 1
  run 4 cycles of the two samplers with m = 10
- compared the average cumulative posterior probabilities
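
A sketch of one such run for this model, written as a Gibbs-style cycle over (Z, θ, η); the Dirichlet prior is taken as Dirichlet(1, 1, 1) because the deck does not give α_1, α_2, α_3, so that choice (and the variable names) is an assumption of this illustration. The full conditionals are the binomial splits and scaled betas listed on the previous slides.

```python
import numpy as np

rng = np.random.default_rng(5)
Y = np.array([14, 1, 1, 1, 5])              # observed counts, n = 22
a1, a2, a3 = 1.0, 1.0, 1.0                  # Dirichlet prior parameters (assumed here)

def gibbs_cycle(theta, eta):
    # [Z | Y, theta, eta]: independent binomial splits of the aggregated cells
    x1 = rng.binomial(Y[0], (theta / 4) / (theta / 4 + 1 / 8))   # X_1 out of Y_1
    x5 = rng.binomial(Y[3], (eta / 4) / (eta / 4 + 3 / 8))       # X_5 out of Y_4
    # Dirichlet cell totals: theta cells X_1 + X_3, eta cells X_4 + X_5, last cell X_7
    c_theta = x1 + Y[1] + a1
    c_eta = Y[2] + x5 + a2
    c_rest = Y[4] + a3
    # [theta | X, eta] and [eta | X, theta]: scaled betas on [0, 1 - eta] and [0, 1 - theta]
    theta = (1 - eta) * rng.beta(c_theta, c_rest)
    eta = (1 - theta) * rng.beta(c_eta, c_rest)
    return theta, eta

# initialize as in the scheme above: uniform draws with 0 <= theta + eta <= 1
theta, eta = rng.uniform(), rng.uniform()
while theta + eta > 1:
    theta, eta = rng.uniform(), rng.uniform()
for _ in range(4):                          # 4 cycles, as in the comparison
    theta, eta = gibbs_cycle(theta, eta)
print(theta, eta)
```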

Slide 82

III. 1. A Multinomial Model Substitution and Gibbs Sampling

Slide 83

III. 1. A Multinomial Model
Substitution and Gibbs Sampling
We note that:
- The substitution sampler adapts more quickly than the Gibbs sampler
- As iterations proceed, the two samplers reach the same performance
- Few random variate generations are required to obtain convergence (m = 10)

Slide 84

III. 1. A Multinomial Model
Rubin Importance-Sampling
The Rubin importance-sampling algorithm requires:
- [Z | Y]_s to draw Z_l
- [η | Y, Z] to draw η_l
- [θ | η, Z, Y] to draw θ_l
The r_l ratio is given by:
r_l = [Y, Z_l | θ_l, η_l] [θ_l, η_l] / ([θ_l | η_l, Z_l, Y] [η_l | Y, Z_l] [Z_l | Y]_s)

Slide 85

III. 1. A Multinomial Model
Rubin Importance-Sampling
The authors obtained the following average cumulative posterior probabilities with:
- 2,500 simulations
- [Z | Y]_s taken as the product of X_1 ~ binomial(Y_1, 1/2) and X_5 ~ binomial(Y_4, 1/2)

Slide 86

III. 1. A Multinomial Model
Rubin Importance-Sampling
We note that:
- The estimation is rather poor
- The result may be sensitive to the choice of the importance distribution

Slide 87

III. Examples and Numerical Illustrations 1. A Multinomial Model 2. A Poisson Model

Slide 88

III. 2. A Poisson Model
Introduction
We consider an exchangeable Poisson model for pump failures, with:
- s_i the number of failures
- t_i the length of time in thousands of hours
- ρ_i = s_i / t_i the rate

Slide 89

III. 2. A Poisson Model
Prior Information
We assume that:
- Y = (s_1, ..., s_p)
- [s_i | λ_i] = Poisson(λ_i t_i)
- the λ_i are iid from Gamma(α, β), with α known and β ~ IG(γ, δ)

Slide 90

III. 2. A Poisson Model
Full Conditional Distributions
We have that:
- [λ_j | Y, β, λ_i (i ≠ j)] = Gamma(α + s_j, (t_j + 1/β)^(−1)), j = 1, ..., p
- [β | Y, λ_1, ..., λ_p] = IG(γ + pα, Σ_i λ_i + δ)

Slide 91

III. 2. A Poisson Model
Gibbs Cycle
Given (λ_1^(0), ..., λ_p^(0), β^(0)), we draw:
- λ_j^(1) ~ Gamma(α + s_j, (t_j + 1/β^(0))^(−1))
- β^(1) ~ IG(γ + pα, Σ_j λ_j^(1) + δ)

Slide 92

III. 2. A Poisson Model
Marginal Densities Estimated by Gibbs Sampling
After i iterations with m repetitions:
- (λ_{1l}^(i), ..., λ_{pl}^(i), β_l^(i)), l = 1, ..., m
- [λ̂_j | Y] = (1/m) Σ_{l=1}^{m} Gamma(α + s_j, (t_j + 1/β_l^(i))^(−1))
- [β̂ | Y] = (1/m) Σ_{l=1}^{m} IG(γ + pα, Σ_j λ_{jl}^(i) + δ)
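
A sketch tying the last three slides together, assuming the Gamma(shape, scale) convention implied by the formulas and drawing an IG(a, b) variate as the reciprocal of a Gamma(a, 1/b) variate; the counts s and times t below are synthetic placeholders (the deck does not reproduce the pump data), and the hyperparameter values are illustrative.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(6)
# synthetic placeholder data for p = 10 pumps (not the published pump-failure data)
s = np.array([1, 3, 0, 2, 5, 1, 0, 4, 2, 6])
t = np.array([10.0, 15.0, 5.0, 20.0, 30.0, 8.0, 4.0, 25.0, 12.0, 40.0])
p, alpha, gam, delta = len(s), 1.0, 1.0, 0.1        # alpha fixed here for illustration

def gibbs_run(n_cycles):
    beta = 1.0                                      # arbitrary starting value beta^(0)
    for _ in range(n_cycles):
        # lambda_j ~ Gamma(alpha + s_j, scale = (t_j + 1/beta)^(-1))
        lam = rng.gamma(alpha + s, 1.0 / (t + 1.0 / beta))
        # beta ~ IG(gam + p*alpha, sum(lam) + delta)
        beta = 1.0 / rng.gamma(gam + p * alpha, 1.0 / (lam.sum() + delta))
    return lam, beta

# m independent replications of i = 10 cycles, keeping the final beta of each
m = 100
betas = np.array([gibbs_run(10)[1] for _ in range(m)])

def lambda_j_density(j, grid):
    # [lambda_j | Y] estimate: (1/m) sum_l Gamma(alpha + s_j, (t_j + 1/beta_l)^(-1)) pdf on grid
    scales = 1.0 / (t[j] + 1.0 / betas)
    return np.mean(gamma_dist.pdf(grid[:, None], a=alpha + s[j], scale=scales), axis=1)

print(lambda_j_density(0, np.linspace(0.01, 0.5, 5)))
```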

Slide 93

III. 2. A Poisson Model
Data
- Pump-failure data analyzed by Gaver and O'Muircheartaigh (1987):
- p = 10, γ = 1, δ = 0.1
- α = (ρ̄)² / (S² − p^(−1) ρ̄ Σ_i t_i^(−1))

Slide 94

III. 2. A Poisson Model
Results
After 10 cycles of the algorithm:

Slide 96

III. 2. A Poisson Model
Conclusion
- Great fit of the Gibbs estimator
- Convergence from a small number of draws (m = 10, 100)

Slide 97

Outline
I. Introduction
II. Sampling Approaches
III. Examples and Numerical Illustrations
IV. Conclusion

Slide 98

IV. Conclusion
- These sampling algorithms are straightforward to implement
- Substitution and Gibbs sampling (iterative methods) provide better results in terms of convergence than the Rubin importance-sampling algorithm (a noniterative method)
- The performance of the Rubin importance-sampling algorithm depends on the choice of the importance distribution
- If some reduced conditional distributions are available, substitution sampling becomes more efficient than Gibbs sampling.

Slide 99

Thanks for your attention

Slide 100

References
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Glick, N. (1974), "Consistency Conditions for Probability Estimators and Integrals of Density Estimators," Utilitas Mathematica, 6, 61-74.
Rubin, D. B. (1987), Comment on "The Calculation of Posterior Distributions by Data Augmentation," by M. A. Tanner and W. H. Wong, Journal of the American Statistical Association, 82, 543-546.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-550.