Xi'an
January 23, 2014

# Statistical frontiers, Warwick, Jan. 30, 2014

Course on the approximation of Bayes factors by an array of numerical methods


## Transcript

1. ### On some computational methods for Bayesian model choice

Christian P. Robert, University of Warwick and Université Paris Dauphine, http://www.ceremade.dauphine.fr/~xian — Statistical Frontiers, January 30, 2014
2. ### Outline

1 Evidence; 2 Importance sampling solutions; 3 Cross-model solutions; 4 Nested sampling; 5 ABC model choice; 6 The Savage–Dickey ratio
3. ### Formal construction of Bayes tests

Definition (Test): given a hypothesis $H_0 : \theta \in \Theta_0$ on the parameter $\theta$ of a statistical model, a test is a statistical procedure that takes its values in $\{0, 1\}$. Theorem (Bayes test): the Bayes estimator associated with $\pi$ and with the $0-1$ loss is $$\delta^\pi(x) = \begin{cases} 1 & \text{if } \pi(\theta \in \Theta_0 \mid x) > \pi(\theta \notin \Theta_0 \mid x), \\ 0 & \text{otherwise.} \end{cases}$$
5. ### Bayes factor

Definition (Bayes factor): for testing the hypothesis $H_0 : \theta \in \Theta_0$ vs. $H_a : \theta \notin \Theta_0$, under the prior $\pi(\Theta_0)\,\pi_0(\theta) + \pi(\Theta_0^c)\,\pi_1(\theta)$, the central quantity is $$B_{01}(x) = \frac{\pi(\Theta_0\mid x)}{\pi(\Theta_0^c\mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x\mid\theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x\mid\theta)\,\pi_1(\theta)\,d\theta} = \frac{m_0(x)}{m_1(x)}$$ [Jeffreys, 1939]
6. ### Model choice and model comparison

Choice between models: several models are available for the same observation $x$, $$\mathfrak{M}_i : x \sim f_i(x\mid\theta_i), \qquad i \in I,$$ where the family $I$ can be finite or infinite. Identical setup: replace hypotheses with models but keep marginal likelihoods and Bayes factors.
8. ### Bayesian model choice

Probabilise the entire model/parameter space: allocate probabilities $p_i$ to all models $\mathfrak{M}_i$; define priors $\pi_i(\theta_i)$ for each parameter space $\Theta_i$; compute $$\pi(\mathfrak{M}_i\mid x) = \frac{p_i \int_{\Theta_i} f_i(x\mid\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x\mid\theta_j)\,\pi_j(\theta_j)\,d\theta_j};$$ take the largest $\pi(\mathfrak{M}_i\mid x)$ to determine the "best" model, or use the averaged predictive $$\sum_j \pi(\mathfrak{M}_j\mid x) \int_{\Theta_j} f_j(x'\mid\theta_j)\,\pi_j(\theta_j\mid x)\,d\theta_j$$
12. ### Evidence

All these problems end up with a similar quantity, the evidence $$Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k,$$ aka the marginal likelihood.
13. ### Importance sampling 1 Evidence 2 Importance sampling solutions Regular importance

Harmonic means Chib’s solution Auxiliary variables 3 Cross-model solutions 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio
14. ### Bayes factor approximation

When approximating the Bayes factor $$B_{01} = \frac{\int_{\Theta_0} f_0(x\mid\theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int_{\Theta_1} f_1(x\mid\theta_1)\,\pi_1(\theta_1)\,d\theta_1}$$ use of importance functions $\varrho_0$ and $\varrho_1$ and $$\hat B_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x\mid\theta_0^i)\,\pi_0(\theta_0^i)/\varrho_0(\theta_0^i)}{n_1^{-1}\sum_{i=1}^{n_1} f_1(x\mid\theta_1^i)\,\pi_1(\theta_1^i)/\varrho_1(\theta_1^i)}$$ on the corresponding parameter spaces (which may be of different dimensions!)
15. ### Default importance

A typical choice for $\varrho$ in Bayesian settings is to take $$\theta \sim N(\hat\theta_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}})$$ [Marin & Robert, 2009; Wood, 2010]
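As a concrete illustration of this default importance sampler, here is a minimal Python sketch on a toy conjugate model (the model, $x \sim N(\theta, 1)$ with $\theta \sim N(0,1)$, is my own assumption rather than an example from the slides; it keeps the evidence available in closed form for checking):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy conjugate model (assumption for illustration): x ~ N(theta, 1),
# prior theta ~ N(0, 1), so Z = N(x; 0, sqrt(2)) in closed form.
x = 1.5
Z_true = stats.norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))

# Importance proposal centred at the MLE (here theta_hat = x) with the MLE variance.
theta_hat, sigma_hat = x, 1.0
theta = rng.normal(theta_hat, sigma_hat, size=100_000)
log_w = (stats.norm.logpdf(x, theta, 1.0)                   # likelihood f(x | theta)
         + stats.norm.logpdf(theta, 0.0, 1.0)               # prior pi(theta)
         - stats.norm.logpdf(theta, theta_hat, sigma_hat))  # proposal density
Z_hat = np.exp(log_w).mean()
print(Z_hat, Z_true)
```

With a proposal this close to the posterior, the importance weights are well behaved and the estimate is accurate with modest sample sizes.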
16. ### Bridge sampling

Special case: if $\pi_1(\theta_1\mid x) \propto \tilde\pi_1(\theta_1\mid x)$ and $\pi_2(\theta_2\mid x) \propto \tilde\pi_2(\theta_2\mid x)$ live on the same space ($\Theta_1 = \Theta_2$), then $$B_{12} \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\tilde\pi_1(\theta_i\mid x)}{\tilde\pi_2(\theta_i\mid x)}, \qquad \theta_i \sim \pi_2(\theta\mid x)$$ [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
17. ### Bridge sampling variance

The bridge sampling estimator does poorly if $$\frac{\operatorname{var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}\left[\left(\frac{\pi_1(\theta) - \pi_2(\theta)}{\pi_2(\theta)}\right)^2\right]$$ is large, i.e. if $\pi_1$ and $\pi_2$ have little overlap...
19. ### (Further) bridge sampling

A remarkable property: $$B_{12} = \frac{\int \tilde\pi_2(\theta\mid x)\,\alpha(\theta)\,\pi_1(\theta\mid x)\,d\theta}{\int \tilde\pi_1(\theta\mid x)\,\alpha(\theta)\,\pi_2(\theta\mid x)\,d\theta} \quad \forall\,\alpha(\cdot) \qquad \approx \frac{n_1^{-1}\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}\mid x)\,\alpha(\theta_{1i})}{n_2^{-1}\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}\mid x)\,\alpha(\theta_{2i})}, \qquad \theta_{ji} \sim \pi_j(\theta\mid x)$$
20. ### An infamous example

When $\alpha(\theta) = 1\big/\{\tilde\pi_1(\theta)\,\tilde\pi_2(\theta)\}$, harmonic mean approximation to $B_{12}$: $$\hat B_{21} = \frac{n_1^{-1}\sum_{i=1}^{n_1} 1/\tilde\pi_1(\theta_{1i}\mid x)}{n_2^{-1}\sum_{i=1}^{n_2} 1/\tilde\pi_2(\theta_{2i}\mid x)}, \qquad \theta_{ji} \sim \pi_j(\theta\mid x)$$ [Newton & Raftery, 1994] Infamous: most often leads to an infinite variance! [Radford Neal's blog, 2008]
22. ### “The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal's blog, Aug. 23, 2008]
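Neal's point is easy to reproduce numerically. Below is a minimal sketch on a toy model of my own choosing (not from the slides), in which the posterior variance ($0.9$) exceeds the $1/2$ threshold that makes the harmonic mean estimator's variance infinite, so replications scatter instead of concentrating:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy model (assumption for illustration): x ~ N(theta, 1) with prior
# theta ~ N(0, 9), so Z = N(x; 0, sqrt(1 + 9)) in closed form.
x = 2.0
Z_true = stats.norm.pdf(x, 0.0, np.sqrt(10.0))

# Exact posterior by conjugacy: theta | x ~ N(m, v) with v = 9/10 = 0.9.
v = 9.0 / 10.0
m = v * x

def harmonic_mean_Z(n):
    """Newton-Raftery estimator: 1 / mean(1 / L(theta)), theta ~ posterior."""
    theta = rng.normal(m, np.sqrt(v), size=n)
    return 1.0 / np.mean(1.0 / stats.norm.pdf(x, theta, 1.0))

# Because v = 0.9 > 1/2 here, the estimator has infinite variance:
# replications scatter wildly instead of concentrating around Z_true.
reps = np.array([harmonic_mean_Z(10_000) for _ in range(50)])
print("truth:", Z_true, "HM range:", reps.min(), reps.max())
```

Each individual run returns a finite, plausible-looking number, which is exactly the trap Neal describes.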
24. ### Optimal bridge sampling

The optimal choice of auxiliary function is $$\alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta\mid x) + n_2\,\pi_2(\theta\mid x)}$$ leading to $$B_{12} \approx \frac{n_1^{-1}\sum_{i=1}^{n_1} \dfrac{\tilde\pi_2(\theta_{1i}\mid x)}{n_1\,\pi_1(\theta_{1i}\mid x) + n_2\,\pi_2(\theta_{1i}\mid x)}}{n_2^{-1}\sum_{i=1}^{n_2} \dfrac{\tilde\pi_1(\theta_{2i}\mid x)}{n_1\,\pi_1(\theta_{2i}\mid x) + n_2\,\pi_2(\theta_{2i}\mid x)}}$$ Back later! [Gelfand & Dey, 1994]
25. ### Optimal bridge sampling (2)

Reason: $$\frac{\operatorname{Var}(B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2}\left[\frac{\int \pi_1(\theta)\,\pi_2(\theta)\,[n_1\pi_1(\theta) + n_2\pi_2(\theta)]\,\alpha(\theta)^2\,d\theta}{\left(\int \pi_1(\theta)\,\pi_2(\theta)\,\alpha(\theta)\,d\theta\right)^2} - 1\right]$$ [$\delta$ method] Drag: dependence on the unknown normalising constants, solved iteratively
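The iterative resolution mentioned above can be sketched with Meng and Wong's fixed-point update, here on two illustrative unnormalised Gaussian densities of my own choosing (not from the slides) whose normalising-constant ratio is $1/\sqrt{2}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unnormalised densities on the same space (illustrative choice):
# q1 = exp(-x^2/2)        with Z1 = sqrt(2*pi)
# q2 = exp(-(x-1)^2/4)    with Z2 = sqrt(4*pi)
log_q1 = lambda x: -0.5 * x**2
log_q2 = lambda x: -0.25 * (x - 1.0) ** 2

n1 = n2 = 50_000
x1 = rng.normal(0.0, 1.0, n1)            # draws from normalised q1
x2 = rng.normal(1.0, np.sqrt(2.0), n2)   # draws from normalised q2

# ratios l = q1/q2 evaluated on each sample
l1 = np.exp(log_q1(x1) - log_q2(x1))
l2 = np.exp(log_q1(x2) - log_q2(x2))

# Iterative optimal-bridge update for r = Z1/Z2: the unknown constant in
# alpha* is replaced by the current estimate and the update is repeated.
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
r = 1.0
for _ in range(50):
    num = np.mean(l2 / (s1 * l2 + s2 * r))
    den = np.mean(1.0 / (s1 * l1 + s2 * r))
    r = num / den

r_true = np.sqrt(2 * np.pi) / np.sqrt(4 * np.pi)  # = 1/sqrt(2)
print(r, r_true)
```

The fixed point is reached in a handful of iterations when the two densities overlap as they do here.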
27. ### Ratio importance sampling

Another fundamental identity: $$B_{12} = \frac{\mathbb{E}_\varphi[\tilde\pi_1(\theta)/\varphi(\theta)]}{\mathbb{E}_\varphi[\tilde\pi_2(\theta)/\varphi(\theta)]}$$ for any density $\varphi$ with sufficiently large support (when $\Theta_1 = \Theta_2$) [Torrie & Valleau, 1977] Use of a single sample $\theta_1, \ldots, \theta_n$ from $\varphi$: $$\hat B_{12} = \frac{\sum_{i=1}^n \tilde\pi_1(\theta_i)/\varphi(\theta_i)}{\sum_{i=1}^n \tilde\pi_2(\theta_i)/\varphi(\theta_i)}$$
29. ### Ratio importance sampling (2)

Approximate variance: $$\frac{\operatorname{var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}_\varphi\!\left[\frac{\left(\pi_1(\theta) - \pi_2(\theta)\right)^2}{\varphi(\theta)^2}\right]$$ Optimal choice: $$\varphi^\star(\theta) = \frac{|\,\pi_1(\theta) - \pi_2(\theta)\,|}{\int |\,\pi_1(\eta) - \pi_2(\eta)\,|\,d\eta}$$ [Chen, Shao & Ibrahim, 2000] Formally better than bridge sampling
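A minimal sketch of the single-sample ratio estimator, on the same kind of illustrative pair of unnormalised Gaussian targets (my choice of targets and of the auxiliary density $\varphi$, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Unnormalised targets on a common space (illustrative choice):
# q1 = exp(-x^2/2) (Z1 = sqrt(2*pi)),  q2 = exp(-(x-1)^2/4) (Z2 = sqrt(4*pi))
log_q1 = lambda x: -0.5 * x**2
log_q2 = lambda x: -0.25 * (x - 1.0) ** 2

# A single sample from one auxiliary density phi covering both targets
phi_loc, phi_scale = 0.5, 2.0
theta = rng.normal(phi_loc, phi_scale, size=100_000)
log_phi = stats.norm.logpdf(theta, phi_loc, phi_scale)

# B12 = sum(q1/phi) / sum(q2/phi), both sums over the same sample
B12_hat = np.exp(log_q1(theta) - log_phi).sum() / np.exp(log_q2(theta) - log_phi).sum()
B12_true = np.sqrt(2 * np.pi) / np.sqrt(4 * np.pi)  # = 1/sqrt(2)
print(B12_hat, B12_true)
```

Because $\varphi$ here has heavier tails than both targets, the weights are bounded and the estimator is stable; the optimal $\varphi^\star$ would concentrate where the two targets differ most.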
32. ### Approximating $Z_k$ from a posterior sample

Use of the [harmonic mean] identity $$\mathbb{E}^{\pi_k}\!\left[\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)} \,\middle|\, x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\;\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}$$ no matter what the proposal $\varphi(\cdot)$ is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output
34. ### Comparison with regular importance sampling

Harmonic mean: constraint opposed to usual importance sampling constraints: $\varphi(\theta)$ must have lighter (rather than fatter) tails than $\pi_k(\theta_k)\,L_k(\theta_k)$ for the approximation $$\hat Z_{1k} = 1\bigg/\frac{1}{T}\sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}$$ to have a finite variance. E.g., use finite-support kernels (like Epanechnikov's kernel) for $\varphi$
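A sketch of this Gelfand and Dey estimator with a finite-support $\varphi$, on an assumed toy conjugate model (my choice, not from the slides) where the answer is known; a uniform $\varphi$ centred on the posterior mean trivially has lighter tails than $\pi_k L_k$, so the variance is finite:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy conjugate model (assumption for illustration): x ~ N(theta, 1),
# theta ~ N(0, 1), so theta | x ~ N(x/2, 1/2) and Z = N(x; 0, sqrt(2)).
x = 1.5
Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))
post = stats.norm(x / 2.0, np.sqrt(0.5))
theta = post.rvs(size=100_000, random_state=rng)  # "MCMC output"

target = stats.norm.pdf(theta, 0.0, 1.0) * stats.norm.pdf(x, theta, 1.0)

# phi with *lighter* tails than pi * L: a finite-support uniform
# centred on the posterior mean (a legal choice per the slide).
lo, hi = x / 2.0 - 1.0, x / 2.0 + 1.0
phi = stats.uniform(lo, hi - lo).pdf(theta)  # zero outside [lo, hi]

Z_hat = 1.0 / np.mean(phi / target)
print(Z_hat, Z_true)
```

The ratio $\varphi/(\pi_k L_k)$ is bounded on the support of $\varphi$, which is exactly what keeps this stabilised harmonic mean well behaved.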
36. ### Comparison with regular importance sampling (cont'd)

Compare $\hat Z_{1k}$ with a standard importance sampling approximation $$\hat Z_{2k} = \frac{1}{T}\sum_{t=1}^T \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}$$ where the $\theta_k^{(t)}$'s are generated from the density $\varphi(\cdot)$ (with fatter tails, like a $t$ distribution)
37. ### Approximating $Z_k$ using a mixture representation

Bridge sampling recall: design a specific mixture for simulation [importance sampling] purposes, with density $$\tilde\varphi_k(\theta_k) \propto \omega_1\,\pi_k(\theta_k)\,L_k(\theta_k) + \varphi(\theta_k),$$ where $\varphi(\cdot)$ is arbitrary (but normalised). Note: $\omega_1$ is not a probability weight
39. ### Approximating Z using a mixture representation (cont'd)

Corresponding MCMC (=Gibbs) sampler. At iteration $t$: 1 take $\delta^{(t)} = 1$ with probability $$\frac{\omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)})}{\omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)})}$$ and $\delta^{(t)} = 2$ otherwise; 2 if $\delta^{(t)} = 1$, generate $\theta_k^{(t)} \sim \mathrm{MCMC}(\theta_k^{(t-1)}, \theta_k)$, where $\mathrm{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated with the posterior $\pi_k(\theta_k\mid x) \propto \pi_k(\theta_k)\,L_k(\theta_k)$; 3 if $\delta^{(t)} = 2$, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently
42. ### Evidence approximation by mixtures

Rao-Blackwellised estimate $$\hat\xi = \frac{1}{T}\sum_{t=1}^T \frac{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})}$$ converges to $\omega_1 Z_k/\{\omega_1 Z_k + 1\}$. Deduce $\hat Z_{3k}$ from $\omega_1\hat Z_{3k}/\{\omega_1\hat Z_{3k} + 1\} = \hat\xi$, i.e. $$\hat Z_{3k} = \frac{\sum_{t=1}^T \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})\Big/\left\{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}{\omega_1\sum_{t=1}^T \varphi(\theta_k^{(t)})\Big/\left\{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}$$ [Bridge sampler redux]
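The mixture scheme of the last few slides can be sketched end to end on an assumed toy conjugate model (my choice, not from the slides); exact posterior draws stand in for the generic MCMC kernel, which conjugacy makes possible here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Toy conjugate model (assumption): x ~ N(theta, 1), theta ~ N(0, 1),
# so Z = N(x; 0, sqrt(2)) and theta | x ~ N(x/2, 1/2) exactly.
x = 1.5
Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))
post = stats.norm(x / 2.0, np.sqrt(0.5))

def target(theta):
    """pi_k(theta) * L_k(theta): the unnormalised posterior."""
    return stats.norm.pdf(theta, 0.0, 1.0) * stats.norm.pdf(x, theta, 1.0)

phi = stats.norm(0.0, 2.0)  # arbitrary normalised instrumental density
omega1 = 10.0               # NOT a probability weight

# Gibbs sampler targeting the mixture omega1 * pi * L + phi
T, theta = 20_000, 0.0
p1 = np.empty(T)
for t in range(T):
    num = omega1 * target(theta)
    p1[t] = num / (num + phi.pdf(theta))
    if rng.random() < p1[t]:
        theta = post.rvs(random_state=rng)  # exact draw plays the MCMC kernel
    else:
        theta = phi.rvs(random_state=rng)

# Rao-Blackwellised evidence estimate ("bridge sampler redux")
Z_hat = p1.sum() / (omega1 * (1.0 - p1).sum())
print(Z_hat, Z_true)
```

The weight $\omega_1$ tunes how often the chain visits the posterior versus the instrumental component; very small or very large values degrade one of the two sums.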

45. ### Consequence

If $\hat I$ is an unbiased estimator of $I = \mathbb{E}_f[h(X)]$, with $X$ simulated from a joint density $\tilde f(x, y)$, where $\int \tilde f(x, y)\,dy = f(x)$, the estimator $$\hat I^* = \mathbb{E}_{\tilde f}[\hat I \mid Y_1, \ldots, Y_n]$$ dominates $\hat I(X_1, \ldots, X_n)$ variance-wise (and is unbiased)
46. ### Chib's representation

Direct application of Bayes' theorem: given $x \sim f_k(x\mid\theta_k)$ and $\theta_k \sim \pi_k(\theta_k)$, $$Z_k = m_k(x) = \frac{f_k(x\mid\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k\mid x)}$$ Use of an approximation to the posterior: $$\hat Z_k = \hat m_k(x) = \frac{f_k(x\mid\theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^*\mid x)}.$$
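Chib's identity can be checked exactly on an assumed toy conjugate model (my choice, not from the slides), where the posterior density is available in closed form; in realistic uses the denominator is replaced by an estimate $\hat\pi_k(\theta^*\mid x)$, e.g. a Rao-Blackwell average:

```python
import numpy as np
from scipy import stats

# Toy conjugate model (assumption): x ~ N(theta, 1), theta ~ N(0, 1),
# so theta | x ~ N(x/2, 1/2) and the identity holds at any theta*.
x = 1.5
theta_star = 0.3  # any point with positive posterior density works

lik = stats.norm.pdf(x, theta_star, 1.0)                  # f(x | theta*)
prior = stats.norm.pdf(theta_star, 0.0, 1.0)              # pi(theta*)
post = stats.norm.pdf(theta_star, x / 2.0, np.sqrt(0.5))  # pi(theta* | x)

Z_chib = lik * prior / post
Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))
print(Z_chib, Z_true)
```

In practice $\theta^*$ is taken at a high-posterior-density point (e.g. the MAP) so that the density estimate in the denominator is as accurate as possible.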
48. ### Case of latent variables

For a missing variable $z$ as in mixture models, natural Rao-Blackwell estimate $$\hat\pi_k(\theta_k^*\mid x) = \frac{1}{T}\sum_{t=1}^T \pi_k(\theta_k^*\mid x, z_k^{(t)}),$$ where the $z_k^{(t)}$'s are Gibbs-sampled latent variables
49. ### Label switching impact

A mixture model [special case of missing variable model] is invariant under permutations of the indices of its components. E.g., the mixtures $0.3\,N(0,1) + 0.7\,N(2.3,1)$ and $0.7\,N(2.3,1) + 0.3\,N(0,1)$ are exactly the same! Caution: the component parameters $\theta_i$ are not identifiable marginally since they are exchangeable
51. ### Connected difficulties

1 Number of modes of the likelihood of order $O(k!)$: caution, maximization and even [MCMC] exploration of the posterior surface become harder. 2 Under exchangeable priors on $(\theta, p)$ [prior invariant under permutation of the indices], all posterior marginals are identical: the posterior expectation of $\theta_1$ equals the posterior expectation of $\theta_2$.
53. ### License

Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch enough component allocations simultaneously. [Trace plots and pairwise scatter plots of the component means $\mu_i$, weights $p_i$ and scales $\sigma_i$ over 500 Gibbs iterations, showing the chain stuck in a single labelling.]
54. ### Label switching paradox

We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not observe it, then we are uncertain about the convergence!
57. ### Compensation for label switching

For mixture models, $z_k^{(t)}$ usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory $$\pi_k(\theta_k\mid x) = \pi_k(\sigma(\theta_k)\mid x) = \frac{1}{k!}\sum_{\sigma\in\mathfrak{S}_k} \pi_k(\sigma(\theta_k)\mid x)$$ for all $\sigma$'s in $\mathfrak{S}_k$, the set of all permutations of $\{1, \ldots, k\}$. Consequences on the numerical approximation, biased by an order $k!$. Recover the theoretical symmetry by using $$\hat\pi_k(\theta_k^*\mid x) = \frac{1}{T\,k!}\sum_{\sigma\in\mathfrak{S}_k}\sum_{t=1}^T \pi_k(\sigma(\theta_k^*)\mid x, z_k^{(t)}).$$ [Berkhof, Mechelen & Gelman, 2003]
59. ### Galaxy dataset

$n = 82$ galaxies as a mixture of $k$ normal distributions with both mean and variance unknown. [Roeder, 1992] [Histogram of the galaxy data with the average density estimate overlaid.]
60. ### Galaxy dataset (k)

Using only the original estimate, with $\theta_k^*$ as the MAP estimator, $\log \hat m_k(x) = -105.1396$ for $k = 3$ (based on $10^3$ simulations), while introducing the permutations leads to $\log \hat m_k(x) = -103.3479$. Note that $-105.1396 + \log(3!) = -103.3479$.

| $k$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| $\log \hat m_k(x)$ | -115.68 | -103.35 | -102.66 | -101.93 | -102.88 | -105.48 | -108.44 |

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on $10^5$ Gibbs iterations and, for $k > 5$, 100 permutations selected at random in $\mathfrak{S}_k$). [Lee, Marin, Mengersen & Robert, 2008]
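The $\log k!$ relation quoted above checks out directly:

```python
import math

# An unsymmetrised Chib estimate sits in one of the k! labellings, so its log
# misses log(k!) relative to the symmetrised value when the chain never switches.
k = 3
log_m_unsym = -105.1396  # value reported on the slide for k = 3
log_m_sym = log_m_unsym + math.log(math.factorial(k))
print(log_m_sym)  # ≈ -103.348
```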
63. ### Auxiliary variables

Auxiliary variable method avoids computation of the intractable constant(s) in the likelihood function $$f(y\mid\theta) = Z_\theta\,\tilde f(y\mid\theta)$$ Introduce pseudo-data $y'$ with artificial target $g(y'\mid\theta, y)$; generate $\theta' \sim K(\theta, \theta')$ and $y' \sim f(y'\mid\theta')$ [Møller, Pettitt, Berthelsen & Reeves, 2006]
64. ### Auxiliary variables

Auxiliary variable method avoids computation of the intractable constant(s) in the likelihood function $$f(y\mid\theta) = Z_\theta\,\tilde f(y\mid\theta)$$ Introduce pseudo-data $y'$ with artificial target $g(y'\mid\theta, y)$; generate $\theta' \sim K(\theta, \theta')$ and $y' \sim f(y'\mid\theta')$. Accept with probability $$\frac{\pi(\theta')\,f(y\mid\theta')\,g(y'\mid\theta', y)}{\pi(\theta)\,f(y\mid\theta)\,g(y'\mid\theta, y)}\;\frac{K(\theta', \theta)\,f(y'\mid\theta)}{K(\theta, \theta')\,f(y'\mid\theta')} \wedge 1$$ [Møller, Pettitt, Berthelsen & Reeves, 2006]
65. ### Auxiliary variables

Auxiliary variable method avoids computation of the intractable constant(s) in the likelihood function $$f(y\mid\theta) = Z_\theta\,\tilde f(y\mid\theta)$$ Introduce pseudo-data $y'$ with artificial target $g(y'\mid\theta, y)$; generate $\theta' \sim K(\theta, \theta')$ and $y' \sim f(y'\mid\theta')$. Accept with probability $$\frac{\pi(\theta')\,\tilde f(y\mid\theta')\,g(y'\mid\theta', y)}{\pi(\theta)\,\tilde f(y\mid\theta)\,g(y'\mid\theta, y)}\;\frac{K(\theta', \theta)\,\tilde f(y'\mid\theta)}{K(\theta, \theta')\,\tilde f(y'\mid\theta')} \wedge 1$$ in which the intractable constants $Z_\theta$ cancel [Møller, Pettitt, Berthelsen & Reeves, 2006]
66. ### Auxiliary variables

Auxiliary variable method avoids computation of the intractable constant(s) in the likelihood function $$f(y\mid\theta) = Z_\theta\,\tilde f(y\mid\theta)$$ Introduce pseudo-data $y'$ with artificial target $g(y'\mid\theta, y)$; generate $\theta' \sim K(\theta, \theta')$ and $y' \sim f(y'\mid\theta')$. For Gibbs random fields, existence of a genuine sufficient statistic $\eta(y)$. [Møller, Pettitt, Berthelsen & Reeves, 2006]
67. ### Cross-model solutions 1 Evidence 2 Importance sampling solutions 3 Cross-model

solutions Reversible jump Saturation schemes Implementation error 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio
68. ### Reversible jump

Idea: set up a proper measure-theoretic framework for designing moves between models $\mathfrak{M}_k$ [Green, 1995]. Create a reversible kernel $K$ on $\mathcal{H} = \bigcup_k \{k\}\times\Theta_k$ such that $$\int_A\!\int_B K(x, dy)\,\pi(x)\,dx = \int_B\!\int_A K(y, dx)\,\pi(y)\,dy$$ for the invariant density $\pi$ [$x$ is of the form $(k, \theta^{(k)})$]
70. ### Local moves

For a move between two models, $\mathfrak{M}_1$ and $\mathfrak{M}_2$, with the Markov chain in state $\theta_1 \in \mathfrak{M}_1$, denote by $K_{1\to2}(\theta_1, d\theta)$ and $K_{2\to1}(\theta_2, d\theta)$ the corresponding kernels, under the detailed balance condition $$\pi(d\theta_1)\,K_{1\to2}(\theta_1, d\theta) = \pi(d\theta_2)\,K_{2\to1}(\theta_2, d\theta),$$ and take, wlog, $\dim(\mathfrak{M}_2) > \dim(\mathfrak{M}_1)$. The proposal is expressed as $$\theta_2 = \Psi_{1\to2}(\theta_1, v_{1\to2})$$ where $v_{1\to2}$ is a random variable of dimension $\dim(\mathfrak{M}_2) - \dim(\mathfrak{M}_1)$, generated as $v_{1\to2} \sim \varphi_{1\to2}(v_{1\to2})$.
72. ### Local moves (2)

In this case, $q_{1\to2}(\theta_1, d\theta_2)$ has density $$\varphi_{1\to2}(v_{1\to2})\left|\frac{\partial\Psi_{1\to2}(\theta_1, v_{1\to2})}{\partial(\theta_1, v_{1\to2})}\right|^{-1},$$ by the Jacobian rule. If $\varpi_{1\to2}$ is the probability of choosing a move to $\mathfrak{M}_2$ while in $\mathfrak{M}_1$, the acceptance probability reduces to $$\alpha(\theta_1, v_{1\to2}) = 1 \wedge \frac{\pi(\mathfrak{M}_2, \theta_2)\,\varpi_{2\to1}}{\pi(\mathfrak{M}_1, \theta_1)\,\varpi_{1\to2}\,\varphi_{1\to2}(v_{1\to2})}\left|\frac{\partial\Psi_{1\to2}(\theta_1, v_{1\to2})}{\partial(\theta_1, v_{1\to2})}\right|.$$ Several models may be considered simultaneously, as in $$K(x, B) = \sum_{m=1}^\infty \int \rho_m(x, y)\,q_m(x, dy) + \omega(x)\,\mathbb{I}_B(x)$$
75. ### Generic reversible jump acceptance probability

Acceptance probability of $\theta_2 = \Psi_{1\to2}(\theta_1, v_{1\to2})$ is $$\alpha(\theta_1, v_{1\to2}) = 1 \wedge \frac{\pi(\mathfrak{M}_2, \theta_2)\,\varpi_{2\to1}}{\pi(\mathfrak{M}_1, \theta_1)\,\varpi_{1\to2}\,\varphi_{1\to2}(v_{1\to2})}\left|\frac{\partial\Psi_{1\to2}(\theta_1, v_{1\to2})}{\partial(\theta_1, v_{1\to2})}\right|$$ while the acceptance probability of $\theta_1$ with $(\theta_1, v_{1\to2}) = \Psi_{1\to2}^{-1}(\theta_2)$ is $$\alpha(\theta_1, v_{1\to2}) = 1 \wedge \frac{\pi(\mathfrak{M}_1, \theta_1)\,\varpi_{1\to2}\,\varphi_{1\to2}(v_{1\to2})}{\pi(\mathfrak{M}_2, \theta_2)\,\varpi_{2\to1}}\left|\frac{\partial\Psi_{1\to2}(\theta_1, v_{1\to2})}{\partial(\theta_1, v_{1\to2})}\right|^{-1}$$ Caution: difficult calibration
78. ### Green's sampler

Algorithm, iteration $t$ ($t \ge 1$): if $x^{(t)} = (m, \theta^{(m)})$, 1 select model $\mathfrak{M}_n$ with probability $\pi_{mn}$; 2 generate $u_{mn} \sim \varphi_{mn}(u)$ and set $(\theta^{(n)}, v_{nm}) = \Psi_{m\to n}(\theta^{(m)}, u_{mn})$; 3 take $x^{(t+1)} = (n, \theta^{(n)})$ with probability $$\min\left\{\frac{\pi(n, \theta^{(n)})}{\pi(m, \theta^{(m)})}\,\frac{\pi_{nm}\,\varphi_{nm}(v_{nm})}{\pi_{mn}\,\varphi_{mn}(u_{mn})}\left|\frac{\partial\Psi_{m\to n}(\theta^{(m)}, u_{mn})}{\partial(\theta^{(m)}, u_{mn})}\right|,\ 1\right\}$$ and take $x^{(t+1)} = x^{(t)}$ otherwise.
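A minimal reversible jump sketch between two toy models of my own choosing (not from the slides): a fixed $N(0,1)$ model against a $N(\theta,1)$ model with one extra parameter, so the dimension-matching variable is $\theta$ itself, $\Psi$ is the identity and the Jacobian is 1:

```python
import numpy as np

rng = np.random.default_rng(6)
LOG2PI = np.log(2.0 * np.pi)

def norm_logpdf(z, mu, sd):
    return -0.5 * ((z - mu) / sd) ** 2 - np.log(sd) - 0.5 * LOG2PI

# Two toy models for one observation x (assumption for illustration):
#   M1: x ~ N(0, 1), no parameter;  M2: x ~ N(theta, 1), theta ~ N(0, 1).
# Equal prior weights, so P(M2 | x) = m2(x) / (m1(x) + m2(x)) in closed form.
x = 2.0
log_m1 = norm_logpdf(x, 0.0, 1.0)
log_m2 = norm_logpdf(x, 0.0, np.sqrt(2.0))
p2_true = np.exp(log_m2) / (np.exp(log_m1) + np.exp(log_m2))

mu_q, sd_q = x / 2.0, 1.0  # proposal density for the extra dimension

def log_target2(theta):
    """Unnormalised pi(M2, theta) = f2(x | theta) * pi2(theta)."""
    return norm_logpdf(x, theta, 1.0) + norm_logpdf(theta, 0.0, 1.0)

model, theta, in_m2, T = 1, 0.0, 0, 100_000
for _ in range(T):
    if model == 1:  # birth move M1 -> M2: theta = u ~ q, Jacobian 1
        u = rng.normal(mu_q, sd_q)
        if np.log(rng.random()) < log_target2(u) - log_m1 - norm_logpdf(u, mu_q, sd_q):
            model, theta = 2, u
    else:           # death move M2 -> M1: drop theta
        if np.log(rng.random()) < log_m1 + norm_logpdf(theta, mu_q, sd_q) - log_target2(theta):
            model = 1
    in_m2 += (model == 2)

p2_hat = in_m2 / T
print(p2_hat, p2_true)
```

The posterior probability of $\mathfrak{M}_2$ is read off as the fraction of time the chain spends in that model, which is what makes reversible jump a cross-model solution.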
79. ### Interpretation

The representation puts us back in a fixed-dimension setting: $\mathfrak{M}_1 \times V_{1\to2}$ and $\mathfrak{M}_2$ are in one-to-one relation; reversibility imposes that $\theta_1$ is derived as $(\theta_1, v_{1\to2}) = \Psi_{1\to2}^{-1}(\theta_2)$; the move appears like a regular Metropolis-Hastings move from the couple $(\theta_1, v_{1\to2})$ to $\theta_2$ when the stationary distributions are $\pi(\mathfrak{M}_1, \theta_1)\times\varphi_{1\to2}(v_{1\to2})$ and $\pi(\mathfrak{M}_2, \theta_2)$, and the proposal distribution is deterministic
81. ### Alternative

Saturation of the parameter space $\mathcal{H} = \bigcup_k \{k\}\times\Theta_k$ by creating $\theta = (\theta_1, \ldots, \theta_D)$, a model index $M$, and pseudo-priors $\pi_j(\theta_j\mid M = k)$ for $j \ne k$ [Carlin & Chib, 1995]. Validation by $$P(M = k\mid x) = \int P(M = k\mid x, \theta)\,\pi(\theta\mid x)\,d\theta \propto p_k Z_k$$ where the (marginal) posterior is [not $\pi_k$] $$\pi(\theta\mid x) = \sum_{k=1}^D P(\theta, M = k\mid x) \propto \sum_{k=1}^D p_k\,Z_k\,\pi_k(\theta_k\mid x)\prod_{j\ne k}\pi_j(\theta_j\mid M = k).$$
83. ### MCMC implementation

Run a Markov chain $(M^{(t)}, \theta_1^{(t)}, \ldots, \theta_D^{(t)})$ with stationary distribution $\pi(\theta, M\mid x)$ by: 1 pick $M^{(t)} = k$ with probability $\pi(\theta^{(t-1)}, k\mid x)$; 2 generate $\theta_k^{(t)}$ from the posterior $\pi_k(\theta_k\mid x)$ [or use one MCMC step]; 3 generate $\theta_j^{(t)}$ ($j \ne k$) from the pseudo-prior $\pi_j(\theta_j\mid M = k)$. Approximate $P(M = k\mid x)$ by $$\check p_k(x) \propto p_k \sum_{t=1}^T \frac{f_k(x\mid\theta_k^{(t)})\,\pi_k(\theta_k^{(t)})\prod_{j\ne k}\pi_j(\theta_j^{(t)}\mid M = k)}{\sum_{\ell=1}^D p_\ell\,f_\ell(x\mid\theta_\ell^{(t)})\,\pi_\ell(\theta_\ell^{(t)})\prod_{j\ne \ell}\pi_j(\theta_j^{(t)}\mid M = \ell)}$$
86. ### Scott's (2002) proposal

Suggest estimating $P(M = k\mid x)$ by $$\hat Z_k \propto p_k \sum_{t=1}^T \frac{f_k(x\mid\theta_k^{(t)})}{\sum_{j=1}^D p_j\,f_j(x\mid\theta_j^{(t)})},$$ based on $D$ simultaneous and independent MCMC chains $(\theta_k^{(t)})_t$, $1 \le k \le D$, with stationary distributions $\pi_k(\theta_k\mid x)$ [instead of the joint above]
88. ### Congdon's (2006) extension

Selecting flat [prohibited] pseudo-priors, uses instead $$\hat Z_k \propto p_k \sum_{t=1}^T \frac{f_k(x\mid\theta_k^{(t)})\,\pi_k(\theta_k^{(t)})}{\sum_{j=1}^D p_j\,f_j(x\mid\theta_j^{(t)})\,\pi_j(\theta_j^{(t)})},$$ where again the $\theta_k^{(t)}$'s are MCMC chains with stationary distributions $\pi_k(\theta_k\mid x)$
89. ### Examples

Example (Model choice): model $\mathfrak{M}_1 : x\mid\theta \sim U(0, \theta)$ with prior $\theta \sim \mathcal{E}xp(1)$ versus model $\mathfrak{M}_2 : x\mid\theta \sim \mathcal{E}xp(\theta)$ with prior $\theta \sim \mathcal{E}xp(1)$. Equal prior weights on both models: $p_1 = p_2 = 0.5$. [Plot of the approximations of $P(M = 1\mid x)$: Scott's (2002) in blue and Congdon's (2006) in red; $N = 10^6$ simulations.]
91. ### Examples (2)

Example (Model choice (2)): normal model $\mathfrak{M}_1 : x \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$ vs. normal model $\mathfrak{M}_2 : x \sim N(\theta, 1)$ with $\theta \sim N(5, 1)$. [Plot comparing both approximations with $P(M = 1\mid x)$: Scott's (2002) in green and mixed dashes, Congdon's (2006) in brown and long dashes; $N = 10^4$ simulations.]
92. ### Examples (3)

Example (Model choice (3)): model $\mathfrak{M}_1 : x \sim N(0, 1/\omega)$ with $\omega \sim \mathcal{E}xp(a)$ vs. $\mathfrak{M}_2 : \exp(x) \sim \mathcal{E}xp(\lambda)$ with $\lambda \sim \mathcal{E}xp(b)$. [Plots comparing Congdon's (2006) approximation (brown, dashed) with $P(M = 1\mid x)$ for $(a, b)$ equal to $(.24, 8.9)$, $(.56, .7)$, $(4.1, .46)$ and $(.98, .081)$, resp.; $N = 10^4$ simulations.]
93. ### Nested sampling 1 Evidence 2 Importance sampling solutions 3 Cross-model

solutions 4 Nested sampling Purpose Implementation Error rates Constraints Importance variant A mixture comparison 5 ABC model choice
94. ### Nested sampling: Goal

Skilling's (2007) technique using the one-dimensional representation: $$Z = \mathbb{E}^\pi[L(\theta)] = \int_0^1 \varphi(x)\,dx$$ with $\varphi^{-1}(l) = P^\pi(L(\theta) > l)$. Note: $\varphi(\cdot)$ is intractable in most cases.
95. ### Nested sampling: First approximation

Approximate $Z$ by a Riemann sum: $$\hat Z = \sum_{i=1}^N (x_{i-1} - x_i)\,\varphi(x_i)$$ where the $x_i$'s are either deterministic, $x_i = e^{-i/N}$, or random, $x_0 = 1$, $x_{i+1} = t_i x_i$, $t_i \sim \mathcal{B}e(N, 1)$, so that $\mathbb{E}[\log x_i] = -i/N$.
96. ### Extraneous white noise

Take $$Z = \int e^{-\theta}\,d\theta = \int \frac{1}{\delta}\,e^{-(1-\delta)\theta}\;\delta e^{-\delta\theta}\,d\theta = \mathbb{E}_\delta\!\left[\frac{1}{\delta}\,e^{-(1-\delta)\theta}\right]$$ $$\hat Z = \frac{1}{N}\sum_{i=1}^N \delta^{-1}\,e^{-(1-\delta)\theta_i}\,(x_{i-1} - x_i), \qquad \theta_i \sim \mathcal{E}(\delta)\,\mathbb{I}(\theta_i \le \theta_{i-1})$$

| $N$ | deterministic | random |
|---|---|---|
| 50 | 4.64, 10.5 | 4.65, 10.5 |
| 100 | 2.47, 4.9 | 2.48, 5.02 |
| 500 | .549, 1.01 | .550, 1.14 |

Comparison of variances and MSEs
98. ### Nested sampling: Alternative representation Another representation is

Ẑ = Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i, which is a special case of Ẑ = Σ_{i=0}^{N−1} {L(θ^{(i+1)}) − L(θ^{(i)})} π({θ; L(θ) > L(θ^{(i)})}) where · · · L(θ^{(i+1)}) < L(θ^{(i)}) · · · [Lebesgue version of Riemann's sum]
99. ### Nested sampling: Alternative representation Another representation is

Ẑ = Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i, which is a special case of Ẑ = Σ_{i=0}^{N−1} {L(θ^{(i+1)}) − L(θ^{(i)})} π({θ; L(θ) > L(θ^{(i)})}) where · · · L(θ^{(i+1)}) < L(θ^{(i)}) · · · [Lebesgue version of Riemann's sum]
100. ### Nested sampling: Second approximation Estimate (intractable) ϕ(x_i) by ϕ_i:

Nested sampling: start with N values θ_1, . . . , θ_N sampled from π. At iteration i, 1 take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the current pool; 2 replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then all sampled from the prior constrained to L(θ) > ϕ_i. Note that π({θ; L(θ) > L(θ^{(i+1)})}) / π({θ; L(θ) > L(θ^{(i)})}) ≈ 1 − 1/N
101. ### Nested sampling: Second approximation Estimate (intractable) ϕ(x_i) by ϕ_i:

Nested sampling: start with N values θ_1, . . . , θ_N sampled from π. At iteration i, 1 take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the current pool; 2 replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then all sampled from the prior constrained to L(θ) > ϕ_i. Note that π({θ; L(θ) > L(θ^{(i+1)})}) / π({θ; L(θ) > L(θ^{(i)})}) ≈ 1 − 1/N
102. ### Nested sampling: Second approximation Estimate (intractable) ϕ(x_i) by ϕ_i:

Nested sampling: start with N values θ_1, . . . , θ_N sampled from π. At iteration i, 1 take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the current pool; 2 replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then all sampled from the prior constrained to L(θ) > ϕ_i. Note that π({θ; L(θ) > L(θ^{(i+1)})}) / π({θ; L(θ) > L(θ^{(i)})}) ≈ 1 − 1/N
103. ### Nested sampling: Second approximation Estimate (intractable) ϕ(x_i) by ϕ_i:

Nested sampling: start with N values θ_1, . . . , θ_N sampled from π. At iteration i, 1 take ϕ_i = L(θ_k), where θ_k is the point with smallest likelihood in the current pool; 2 replace θ_k with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then all sampled from the prior constrained to L(θ) > ϕ_i. Note that π({θ; L(θ) > L(θ^{(i+1)})}) / π({θ; L(θ) > L(θ^{(i)})}) ≈ 1 − 1/N
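The two approximations can be combined into a minimal sketch on a toy problem where the evidence is known in closed form (the uniform prior and exponential likelihood are illustrative choices; here the constrained-prior step is exact, so no MCMC is needed):

```python
import math
import random

random.seed(2)

def L(theta):
    """Toy likelihood; with a U(0,1) prior, Z = 1 - exp(-1)."""
    return math.exp(-theta)

N = 500                                      # live points
live = [random.random() for _ in range(N)]   # sampled from the U(0,1) prior

Z, x_prev = 0.0, 1.0
for i in range(1, 6 * N):                    # run well past the bulk of the mass
    k = max(range(N), key=lambda j: live[j])  # smallest likelihood = largest theta
    phi_i = L(live[k])
    x_i = math.exp(-i / N)                   # deterministic shrinkage
    Z += (x_prev - x_i) * phi_i
    x_prev = x_i
    # exact draw from the prior constrained to L(theta) > phi_i,
    # i.e. theta uniform on (0, live[k])
    live[k] = random.uniform(0.0, live[k])

print(Z, 1 - math.exp(-1))  # estimate vs exact evidence 0.6321...
```

With a general target the last step would be replaced by an MCMC move within the constrained region, which is where the bias discussed later enters.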
104. ### Nested sampling: Third approximation Iterate the above steps until a

given stopping iteration j is reached: e.g., observe very small changes in the approximation Ẑ; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; or truncate the integral Z at level ε, i.e. replace ∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx
105. ### Approximation error Error = Ẑ − Z

= Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i − ∫_0^1 ϕ(x) dx = − ∫_0^ε ϕ(x) dx + [ Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx ] (Quadrature Error) + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ_i − ϕ(x_i)} (Stochastic Error) [Dominated by Monte Carlo]
106. ### A CLT for the Stochastic Error The (dominating) stochastic error

is O_P(N^{−1/2}): N^{1/2} (Ẑ − Z) →_D N(0, V) with V = − ∫∫_{s,t∈[ε,1]} s ϕ′(s) t ϕ′(t) log(s ∨ t) ds dt. [Proof based on Donsker's theorem]
107. ### What of log Z? If the interest lies within log

Z, Slutsky's transform of the CLT: N^{1/2} (log Ẑ − log Z) →_D N(0, V/Z²). Note: the number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then j = N ⌈− log ε⌉.
108. ### What of log Z? If the interest lies within log

Z, Slutsky's transform of the CLT: N^{1/2} (log Ẑ − log Z) →_D N(0, V/Z²). Note: the number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then j = N ⌈− log ε⌉.
109. ### What of log Z? If the interest lies within log

Z, Slutsky's transform of the CLT: N^{1/2} (log Ẑ − log Z) →_D N(0, V/Z²). Note: the number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then j = N ⌈− log ε⌉.
110. ### Sampling from constr’d priors Exact simulation from the constrained prior

is intractable in most cases. Skilling (2007) proposes to use MCMC, but this introduces a bias (stopping rule). If implementable, a slice sampler can be devised at the same cost [or less]
111. ### Sampling from constr’d priors Exact simulation from the constrained prior

is intractable in most cases. Skilling (2007) proposes to use MCMC, but this introduces a bias (stopping rule). If implementable, a slice sampler can be devised at the same cost [or less]
112. ### Sampling from constr’d priors Exact simulation from the constrained prior

is intractable in most cases. Skilling (2007) proposes to use MCMC, but this introduces a bias (stopping rule). If implementable, a slice sampler can be devised at the same cost [or less]
113. ### Banana illustration Case of a banana target made of a

twisted 2D normal: x̃_2 = x_2 + β x_1² − 100β [Haario, Saksman, Tamminen, 1999] β = .03, σ = (100, 1)
114. ### Banana illustration (2) Use of nested sampling with N =

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and a final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D] Evidence estimation
115. ### Banana illustration (2) Use of nested sampling with N =

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and a final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D] E[X1] estimation
116. ### Banana illustration (2) Use of nested sampling with N =

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and a final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D] E[X2] estimation
117. ### An IS variant of nested sampling Consider an instrumental prior π̃

and likelihood L̃, weight function w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ) and weighted NS estimator Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i w(θ_i). Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, Id) constrained to ‖c − θ‖ < r.
118. ### An IS variant of nested sampling Consider an instrumental prior π̃

and likelihood L̃, weight function w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ) and weighted NS estimator Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i w(θ_i). Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, Id) constrained to ‖c − θ‖ < r.
119. ### Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior Posterior

on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known. Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ². Occurrences of posterior bursts for µ = x_i
120. ### Experiment MCMC sample for n = 16 observations from the

mixture. Nested sampling sequence with M = 1000 starting points.
121. ### Experiment MCMC sample for n = 50 observations from the

mixture. Nested sampling sequence with M = 1000 starting points.
122. ### Comparison 1 Nested sampling: M = 1000 points, with 10

random walk moves at each step, simulations from the constrained prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 10⁴ MCMC (= Gibbs) simulations producing non-parametric estimates ϕ̂ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Ẑ1, Ẑ2, Ẑ3 using a product of two Gaussian kernels 4 numerical integration based on an 850 × 950 grid [reference value, confirmed by Chib’s]
123. ### Comparison (cont’d) Graph based on a sample of 10 observations

for µ = 2 and σ = 3/2 (150 replicas).
124. ### Comparison (cont’d) Graph based on a sample of 50 observations

for µ = 2 and σ = 3/2 (150 replicas).
125. ### Comparison (cont’d) Graph based on a sample of 100 observations

for µ = 2 and σ = 3/2 (150 replicas).
126. ### Approximate Bayesian computation 1 Evidence 2 Importance sampling solutions 3

Cross-model solutions 4 Nested sampling 5 ABC model choice ABC method Formalised framework Consistency results Summary statistics 6 The Savage–Dickey ratio
127. ### Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ) , x ∼ f(x|θ′) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
128. ### Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ) , x ∼ f(x|θ′) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
129. ### Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ) , x ∼ f(x|θ′) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
130. ### A as approximative When y is a continuous random variable,

equality x = y is replaced with a tolerance condition ρ(x, y) ≤ ε, where ρ is a distance between summary statistics. Output distributed from π(θ) P_θ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
131. ### A as approximative When y is a continuous random variable,

equality x = y is replaced with a tolerance condition ρ(x, y) ≤ ε, where ρ is a distance between summary statistics. Output distributed from π(θ) P_θ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
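A toy sketch of this tolerance version (not from the slides) on a normal-mean model, where the exact posterior is available for comparison; the tolerance ε = 0.05, the sample size and the priors are arbitrary illustrative choices:

```python
import random
import statistics

random.seed(3)
n = 20
y = [random.gauss(1.0, 1.0) for _ in range(n)]   # observed data, true theta = 1
s_obs = statistics.fmean(y)                      # summary: sample mean (sufficient here)

eps, accepted = 0.05, []
while len(accepted) < 2000:
    theta = random.gauss(0.0, 1.0)               # prior N(0, 1)
    x = [random.gauss(theta, 1.0) for _ in range(n)]
    if abs(statistics.fmean(x) - s_obs) <= eps:  # tolerance condition
        accepted.append(theta)

abc_mean = statistics.fmean(accepted)
exact_mean = n * s_obs / (n + 1)                 # exact posterior mean, N(0,1) prior
print(abc_mean, exact_mean)
```

As ε → 0 the accepted draws converge to the exact posterior; for fixed ε > 0 they target π(θ | ρ(x, y) < ε).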
132. ### The starting point Central question to the validation of ABC

for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent? Note/warning: model choice drawn on T(y) through B^T_12(y) necessarily differs from model choice drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]
133. ### The starting point Central question to the validation of ABC

for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent? Note/warning: model choice drawn on T(y) through B^T_12(y) necessarily differs from model choice drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]
134. ### A benchmark, if toy, example Comparison suggested by referee of

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one).
135. ### A benchmark, if toy, example Comparison suggested by referee of

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). Four possible statistics 1 sample mean ȳ (sufficient for M1 if not M2); 2 sample median med(y) (insufficient); 3 sample variance var(y) (ancillary); 4 median absolute deviation mad(y) = med(|y − med(y)|);
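The contrast between these four statistics can be checked by simulation (an illustrative sketch; the sample size is arbitrary): with both models centred at zero and scaled to variance one, the mean, median and variance cannot separate them, while mad does.

```python
import math
import random
import statistics

random.seed(4)
n = 10000

def mad(z):
    m = statistics.median(z)
    return statistics.median([abs(v - m) for v in z])

def rlaplace(b=1 / math.sqrt(2)):
    # Laplace(0, b) as a difference of two exponentials; 2*b**2 = 1 gives variance one
    return b * (random.expovariate(1.0) - random.expovariate(1.0))

gauss = [random.gauss(0.0, 1.0) for _ in range(n)]
lap = [rlaplace() for _ in range(n)]

for name, z in (("Gauss", gauss), ("Laplace", lap)):
    print(name, statistics.fmean(z), statistics.median(z),
          statistics.variance(z), mad(z))
# mean, median and variance match across models (0, 0, 1),
# but mad differs: about 0.674 (Gaussian) vs 0.490 (Laplace)
```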
136. ### A benchmark, if toy, example Comparison suggested by referee of

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Boxplots of posterior probabilities under the Gauss and Laplace models, n = 100]
137. ### A benchmark, if toy, example Comparison suggested by referee of

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Boxplots of posterior probabilities under both models, n = 100, two panels]
138. ### Framework Starting from the observed sample y = (y1, . . .

, yn), not necessarily iid, with true distribution y ∼ P^n. Summary statistics T(y) = T^n = (T1(y), T2(y), · · · , Td(y)) ∈ R^d with true distribution T^n ∼ G_n.
139. ### Framework Comparison of – under M1, y ∼ F_{1,n}(·|θ1)

where θ1 ∈ Θ1 ⊂ R^{p1} – under M2, y ∼ F_{2,n}(·|θ2) where θ2 ∈ Θ2 ⊂ R^{p2} turned into – under M1, T(y) ∼ G_{1,n}(·|θ1), and θ1|T(y) ∼ π1(·|T^n) – under M2, T(y) ∼ G_{2,n}(·|θ2), and θ2|T(y) ∼ π2(·|T^n)
140. ### Assumptions A collection of asymptotic “standard” assumptions: [A1] is a

standard central limit theorem under the true model [A2] controls the large deviations of the estimator T^n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (d_i effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]
141. ### Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A1]

There exist a sequence {v_n} converging to +∞, a distribution Q, a symmetric, d × d positive definite matrix V0 and a vector µ0 ∈ R^d such that v_n V0^{−1/2} (T^n − µ0) ⇝ Q as n → ∞, under G_n
142. ### Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A2]

For i = 1, 2, there exist sets F_{n,i} ⊂ Θi, functions µi(θi) and constants εi, τi, αi > 0 such that for all τ > 0, sup_{θi ∈ F_{n,i}} G_{i,n}[ |T^n − µi(θi)| > τ |µi(θi) − µ0| ∧ εi | θi ] ≲ (τ |µi(θi) − µ0| ∧ εi)^{−αi} v_n^{−αi} with πi(F^c_{n,i}) = o(v_n^{−τi}).
143. ### Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A3]

If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0) S_{n,i}(u) = {θi ∈ F_{n,i}; |µi(θi) − µ0| ≤ u v_n^{−1}}, there exist constants di < τi ∧ αi − 1 such that πi(S_{n,i}(u)) ∼ u^{di} v_n^{−di}, ∀ u ≤ v_n
144. ### Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A4]

If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any ε > 0, there exist U, δ > 0 and (E_n)_n such that, if θi ∈ S_{n,i}(U), E_n ⊂ {t; g_i(t|θi) < δ g_n(t)} and G_n(E^c_n) < ε.
145. ### Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] Again

(summary): [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T^n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (d_i effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density
146. ### Effective dimension Understanding d_i in [A3]: defined only when µ0

∈ {µi(θi), θi ∈ Θi}, πi(θi : |µi(θi) − µ0| < n^{−1/2}) = O(n^{−di/2}) is the effective dimension of the model Θi around µ0
147. ### Effective dimension Understanding d_i in [A3]: when inf{|µi(θi) − µ0|;

θi ∈ Θi} = 0, m_i(T^n) / g_n(T^n) ∼ v_n^{−di}, thus log(m_i(T^n)/g_n(T^n)) ∼ −di log v_n and v_n^{−di} is the penalization factor resulting from integrating θi out (see effective number of parameters in DIC)
148. ### Effective dimension Understanding d_i in [A3]: In regular models, d_i

equals the dimension of T(Θi), leading to BIC. In non-regular models, d_i can be smaller
149. ### Asymptotic marginals Asymptotically, under [A1]–[A4] m_i(t) = ∫_{Θi} g_i(t|θi) πi(θi)

dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, C_l v_n^{d−di} ≤ m_i(T^n) ≤ C_u v_n^{d−di} and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0, m_i(T^n) = o_{P^n}[v_n^{d−τi} + v_n^{d−αi}].
150. ### Within-model consistency Under the same assumptions, if inf{|µi(θi) − µ0|; θi

∈ Θi} = 0, the posterior distribution of µi(θi) given T^n is consistent at rate 1/v_n provided αi ∧ τi > di.
151. ### Between-model consistency A consequence of the above is that the asymptotic behaviour of

the Bayes factor is driven by the asymptotic mean value of T^n under both models. And only by this mean value!
152. ### Between-model consistency A consequence of the above is that the asymptotic behaviour of

the Bayes factor is driven by the asymptotic mean value of T^n under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then C_l v_n^{−(d1−d2)} ≤ m1(T^n)/m2(T^n) ≤ C_u v_n^{−(d1−d2)}, where C_l, C_u = O_{P^n}(1), irrespective of the true model. Only depends on the difference d1 − d2: no consistency
153. ### Between-model consistency A consequence of the above is that the asymptotic behaviour of

the Bayes factor is driven by the asymptotic mean value of T^n under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(T^n)/m2(T^n) ≥ C_u min( v_n^{−(d1−α2)}, v_n^{−(d1−τ2)} )
154. ### Consistency theorem If inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} =

inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, the Bayes factor B^T_12 = O(v_n^{−(d1−d2)}) irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension
155. ### Consistency theorem If P^n belongs to one of the two

models and if µ0 cannot be attained by the other one: 0 = min( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ) < max( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ), then the Bayes factor B^T_12 is consistent
156. ### Consequences on summary statistics Bayes factor driven by the means

µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T^n are ancillary statistics with different mean values under both models. Else, if T^n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified
157. ### Consequences on summary statistics Bayes factor driven by the means

µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T^n are ancillary statistics with different mean values under both models. Else, if T^n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified
158. ### Toy example: Laplace versus Gauss [1] If T^n =

n^{−1} Σ_{i=1}^n X_i^4, µ1(θ) = 3 + θ^4 + 6θ², µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ* = 2√3 − 3 also leads to µ0 = µ(θ*) [here d1 = d2 = d = 1]
159. ### Toy example: Laplace versus Gauss [1] If T^n =

n^{−1} Σ_{i=1}^n X_i^4, µ1(θ) = 3 + θ^4 + 6θ², µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ* = 2√3 − 3 also leads to µ0 = µ(θ*) [here d1 = d2 = d = 1] A Bayes factor associated with such a statistic is inconsistent
160. ### Toy example: Laplace versus Gauss [1] If T^n =

n^{−1} Σ_{i=1}^n X_i^4, µ1(θ) = 3 + θ^4 + 6θ², µ2(θ) = 6 + · · ·
161. ### Toy example: Laplace versus Gauss [1] If T^n =

n^{−1} Σ_{i=1}^n X_i^4, µ1(θ) = 3 + θ^4 + 6θ², µ2(θ) = 6 + · · · [Boxplots of the fourth-moment statistic under M1 and M2, n = 1000]
162. ### Toy example: Laplace versus Gauss [0] When T(y) =

(ȳ_n^{(4)}, ȳ_n^{(6)}) and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ*_1) = 6 with θ*_1 = 2√3 − 3 [d1 = 1 and d2 = 1/2], thus B12 ∼ n^{−1/4} → 0 : consistent. Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2, B12 → +∞ : consistent
163. ### Toy example: Laplace versus Gauss [0] When T(y) =

(ȳ_n^{(4)}, ȳ_n^{(6)}) and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ*_1) = 6 with θ*_1 = 2√3 − 3 [d1 = 1 and d2 = 1/2], thus B12 ∼ n^{−1/4} → 0 : consistent. Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2, B12 → +∞ : consistent
164. ### Toy example: Laplace versus Gauss [0] When T(y) =

(ȳ_n^{(4)}, ȳ_n^{(6)}) [Density of the posterior probabilities, fourth and sixth moments]
165. ### Checking for adequate statistics After running ABC, i.e. creating reference

tables of (θi, xi) from both joints, straightforward derivation of ABC estimates θ̂1 and θ̂2. Evaluation of E^1_{θ̂1}[T(X)] and E^2_{θ̂2}[T(X)] allows for the detection of different means under both models via Monte Carlo simulations
166. ### Toy example: Laplace versus Gauss

[Boxplots of the normalised χ² statistic without and with mad, Gauss vs. Laplace]
167. ### Embedded models When M1 is a submodel of M2, and if the

true distribution belongs to the smaller model M1, the Bayes factor is of order v_n^{−(d1−d2)}
168. ### Embedded models When M1 is a submodel of M2, and if the

true distribution belongs to the smaller model M1, the Bayes factor is of order v_n^{−(d1−d2)}. If the summary statistic is only informative on a parameter that is the same under both models, i.e. if d1 = d2, then the Bayes factor is not consistent
169. ### Embedded models When M1 is a submodel of M2, and if the

true distribution belongs to the smaller model M1, the Bayes factor is of order v_n^{−(d1−d2)}. Else, d1 < d2 and the Bayes factor goes to ∞ under M1. If the true distribution is not in M1, then the Bayes factor is consistent only if µ1 = µ2 = µ0
170. ### Summary Model selection feasible with ABC Choice of summary statistics

is paramount. At best, ABC → π(· | T(y)), which concentrates around µ0. For estimation: {θ; µ(θ) = µ0} = {θ0}. For testing: {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅
171. ### Summary Model selection feasible with ABC Choice of summary statistics

is paramount. At best, ABC → π(· | T(y)), which concentrates around µ0. For estimation: {θ; µ(θ) = µ0} = {θ0}. For testing: {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅
172. ### 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4

Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio Measure-theoretic aspects Computational implications
173. ### The Savage–Dickey ratio representation Special representation of the Bayes factor

used for simulation Original version (Dickey, AoMS, 1971)
174. ### The Savage–Dickey ratio representation Special representation of the Bayes factor

used for simulation Original version (Dickey, AoMS, 1971)
175. ### Savage’s density ratio theorem Given a test H0 : θ

= θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ), then B01 = π1(θ0|x) / π1(θ0), with the obvious notations π1(θ) = ∫ π1(θ, ψ) dψ, π1(θ|x) = ∫ π1(θ, ψ|x) dψ [Dickey, 1971; Verdinelli & Wasserman, 1995]
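In the simplest case, without a nuisance parameter, the posterior-to-prior ratio representation can be verified numerically (a sketch with arbitrary data x = 1.3): take x | θ ∼ N(θ, 1), H0 : θ = θ0 = 0, and θ ∼ N(0, 1) under the alternative.

```python
import math

def norm_pdf(z, mu, var):
    return math.exp(-(z - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x, theta0 = 1.3, 0.0

# direct Bayes factor: m0(x)/m1(x), with m1 the N(0, 1 + 1) marginal
m0 = norm_pdf(x, theta0, 1.0)
m1 = norm_pdf(x, 0.0, 2.0)
B01_direct = m0 / m1

# Savage-Dickey: posterior-to-prior ratio at theta0; the posterior is N(x/2, 1/2)
B01_sd = norm_pdf(theta0, x / 2, 0.5) / norm_pdf(theta0, 0.0, 1.0)

print(B01_direct, B01_sd)  # the two expressions agree
```

With a nuisance parameter ψ, the equality only holds under the compatibility condition π1(ψ|θ0) = π0(ψ) stated above, and the measure-theoretic caveats discussed next.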
176. ### Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ)

= f1(x|θ, φ = φ0), f0(x) = ∫ f1(x|θ, φ = φ0) f1(θ|φ = φ0) dθ = f1(x|φ = φ0), i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x) f1(x) / f1(φ0) and hence the Bayes factor is given by B = f0(x)/f1(x) = f1(φ0|x)/f1(φ0), the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]
177. ### Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ)

= f1(x|θ, φ = φ0), f0(x) = ∫ f1(x|θ, φ = φ0) f1(θ|φ = φ0) dθ = f1(x|φ = φ0), i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x) f1(x) / f1(φ0) and hence the Bayes factor is given by B = f0(x)/f1(x) = f1(φ0|x)/f1(φ0), the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]
178. ### Measure-theoretic diﬃculty Representation depends on the choice of versions of

conditional densities: B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ [by definition] = ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0) / { ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0) } [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)] = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / { m1(x) π1(θ0) } [specific version of π1(θ0, ψ)] = π1(θ0|x) / π1(θ0) [version dependent]
179. ### Measure-theoretic diﬃculty Representation depends on the choice of versions of

conditional densities: B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ [by definition] = ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0) / { ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0) } [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)] = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / { m1(x) π1(θ0) } [specific version of π1(θ0, ψ)] = π1(θ0|x) / π1(θ0) [version dependent]
180. ### Choice of density version Dickey’s (1971) condition is not

a condition: If π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x) is chosen as a version, then Savage–Dickey’s representation holds
181. ### Choice of density version Dickey’s (1971) condition is not

a condition: If π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x) is chosen as a version, then Savage–Dickey’s representation holds
182. ### Savage–Dickey paradox Verdinelli–Wasserman extension: B01 = [π1(θ0|x) / π1(θ0)] E^{π1(ψ|x,θ0)}[π0(ψ)

/ π1(ψ|θ0)] similarly depends on choices of versions... ...but the Monte Carlo implementation relies on specific versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]
183. ### Savage–Dickey paradox Verdinelli–Wasserman extension: B01 = [π1(θ0|x) / π1(θ0)] E^{π1(ψ|x,θ0)}[π0(ψ)

/ π1(ψ|θ0)] similarly depends on choices of versions... ...but the Monte Carlo implementation relies on specific versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]
184. ### A computational exploitation Starting from the (instrumental) prior π̃1(θ,

ψ) = π1(θ) π0(ψ), define the associated posterior π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x) and impose the choice of version π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x). Then B01 = [π̃1(θ0|x) / π1(θ0)] × m̃1(x) / m1(x)
185. ### A computational exploitation Starting from the (instrumental) prior π̃1(θ,

ψ) = π1(θ) π0(ψ), define the associated posterior π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x) and impose the choice of version π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x). Then B01 = [π̃1(θ0|x) / π1(θ0)] × m̃1(x) / m1(x)
186. ### First ratio If (θ^{(1)}, ψ^{(1)}), . . . , (θ^{(T)},

ψ^{(T)}) ∼ π̃(θ, ψ|x), then (1/T) Σ_t π̃1(θ0|x, ψ^{(t)}) converges to π̃1(θ0|x), provided the right version is used in θ0: π̃1(θ0|x, ψ) = π1(θ0) f(x|θ0, ψ) / ∫ π1(θ) f(x|θ, ψ) dθ
187. ### Rao–Blackwellisation with latent variables When π̃1(θ0|x, ψ) is unavailable, replace it

with (1/T) Σ_{t=1}^T π̃1(θ0|x, z^{(t)}, ψ^{(t)}) via data completion by a latent variable z such that f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz and such that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) is available in closed form, including the normalising constant, based on the version π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / ∫ f̃(x, z|θ, ψ) π1(θ) dθ.
188. ### Rao–Blackwellisation with latent variables When π̃1(θ0|x, ψ) is unavailable, replace it

with (1/T) Σ_{t=1}^T π̃1(θ0|x, z^{(t)}, ψ^{(t)}) via data completion by a latent variable z such that f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz and such that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) is available in closed form, including the normalising constant, based on the version π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / ∫ f̃(x, z|θ, ψ) π1(θ) dθ.
189. ### Bridge revival (1) Since m̃1(x)/m1(x) is unknown, apparent failure!

Use of the bridge identity E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)} ] = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x)/m̃1(x) to (biasedly) estimate m̃1(x)/m1(x) by T / Σ_{t=1}^T [ π1(ψ^{(t)}|θ^{(t)}) / π0(ψ^{(t)}) ], based on the same sample from π̃1.
190. ### Bridge revival (1) Since m̃1(x)/m1(x) is unknown, apparent failure!

Use of the bridge identity E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)} ] = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x)/m̃1(x) to (biasedly) estimate m̃1(x)/m1(x) by T / Σ_{t=1}^T [ π1(ψ^{(t)}|θ^{(t)}) / π0(ψ^{(t)}) ], based on the same sample from π̃1.
191. ### Bridge revival (2) Alternative identity E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ,

ψ)} ] = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x)/m1(x) suggests using a second sample (θ̄^{(t)}, ψ̄^{(t)}, z^{(t)}) ∼ π1(θ, ψ|x) and the ratio estimate (1/T) Σ_{t=1}^T π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)}). Resulting unbiased estimate: B̂01 = [ (1/T) Σ_t π̃1(θ0|x, z^{(t)}, ψ^{(t)}) / π1(θ0) ] × (1/T) Σ_{t=1}^T π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)})
192. ### Bridge revival (2) Alternative identity E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ,

ψ)} ] = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x)/m1(x) suggests using a second sample (θ̄^{(t)}, ψ̄^{(t)}, z^{(t)}) ∼ π1(θ, ψ|x) and the ratio estimate (1/T) Σ_{t=1}^T π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)}). Resulting unbiased estimate: B̂01 = [ (1/T) Σ_t π̃1(θ0|x, z^{(t)}, ψ^{(t)}) / π1(θ0) ] × (1/T) Σ_{t=1}^T π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)})
193. ### Diﬀerence with Verdinelli–Wasserman representation The above leads to the representation

B01 = [π̃1(θ0|x) / π1(θ0)] E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ], showing how our approach differs from Verdinelli and Wasserman’s B01 = [π1(θ0|x) / π1(θ0)] E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ] [for referees only!!]
194. ### Diabetes in Pima Indian women (cont’d) Comparison of the variation

of the Bayes factor approximations based on 100 replicas of 20,000 simulations each, for a simulation from the above importance, Chib’s, Savage–Dickey’s and bridge samplers [Boxplots: IS, Chib, Savage–Dickey, Bridge]
195. ### Other things I could have mentioned use of particle ﬁlters

[Del Moral et al., 2006]; path sampling: d/dθ log Z(θ) = E_θ[ d/dθ log q(ω|θ) ] [Ogata, 1989; Gelman & Meng, 1998]; multicanonical algorithms [Marinari & Parisi, 1992; Geyer & Thompson, 1995]