On some computational methods for Bayesian model choice Christian P. Robert University of Warwick and Universit´ e Paris Dauphine http://www.ceremade.dauphine.fr/~xian Statistical Frontiers, January 30, 2014

Formal construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,

Formal construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,

Model choice and model comparison Choice between models Several models available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be ﬁnite or inﬁnite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors

Model choice and model comparison Choice between models Several models available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be ﬁnite or inﬁnite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors

Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

Bridge sampling Special case: If π1(θ1|x) ∝ ˜ π1(θ1|x) π2(θ2|x) ∝ ˜ π2(θ2|x) live on the same space (Θ1 = Θ2), then B12 ≈ 1 n n i=1 ˜ π1(θi|x) ˜ π2(θi|x) θi ∼ π2(θ|x) [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

Bridge sampling variance The bridge sampling estimator does poorly if var(B12) B2 12 ≈ 1 n E π1(θ) − π2(θ) π2(θ) 2 is large, i.e. if π1 and π2 have little overlap...

Bridge sampling variance The bridge sampling estimator does poorly if var(B12) B2 12 ≈ 1 n E π1(θ) − π2(θ) π2(θ) 2 is large, i.e. if π1 and π2 have little overlap...

“The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

“The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

Ratio importance sampling Another fundamental identity: B12 = Eϕ [˜ π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with suﬃciently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)

Ratio importance sampling Another fundamental identity: B12 = Eϕ [˜ π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with suﬃciently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)

Approximating Zk from a posterior sample Use of the [harmonic mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

Approximating Zk from a posterior sample Use of the [harmonic mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

Comparison with regular importance sampling (cont’d) Compare Z1k with a standard importance sampling approximation Z2k = 1 T T t=1 πk(θ(t) k )Lk(θ(t) k ) ϕ(θ(t) k ) where the θ(t) k ’s are generated from the density ϕ(·) (with fatter tails like t’s)

Approximating Zk using a mixture representation Bridge sampling recall Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

Approximating Zk using a mixture representation Bridge sampling recall Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

Evidence approximation by mixtures Rao-Blackwellised estimate ˆ ξ = 1 T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]

Evidence approximation by mixtures Rao-Blackwellised estimate ˆ ξ = 1 T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]

Consequence If I unbiased estimator of I = Ef [h(X)], with X simulated from a joint density ˜ f(x, y), where ˜ f(x, y)dy = f(x), the estimator I∗ = E ˜ f [I|Y1, . . . , Yn] dominate I(X1, . . . , Xn) variance-wise (and is unbiased)

Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .

Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .

Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate πk(θ∗ k |x) = 1 T T t=1 πk(θ∗ k |x, z(t) k ) , where the z(t) k ’s are Gibbs sampled latent variables

Label switching impact A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

Label switching impact A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

Connected diﬃculties 1 Number of modes of the likelihood of order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

Connected diﬃculties 1 Number of modes of the likelihood of order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

License Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once 0 100 200 300 400 500 −1 0 1 2 3 n µi −1 0 1 2 3 0.2 0.3 0.4 0.5 µ i pi 0 100 200 300 400 500 0.2 0.3 0.4 0.5 n pi 0.2 0.3 0.4 0.5 0.4 0.6 0.8 1.0 p i σi 0 100 200 300 400 500 0.4 0.6 0.8 1.0 n σi 0.4 0.6 0.8 1.0 −1 0 1 2 3 σ i µi

Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

Compensation for label switching For mixture models, z(t) k usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]

Compensation for label switching For mixture models, z(t) k usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]

Galaxy dataset n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown. [Roeder, 1992] Average density data Relative Frequency −2 −1 0 1 2 3 0.0 0.2 0.4 0.6 0.8

Galaxy dataset (k) Using only the original estimate, with θ∗ k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

Galaxy dataset (k) Using only the original estimate, with θ∗ k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

Galaxy dataset (k) Using only the original estimate, with θ∗ k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s) in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artiﬁcial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) For Gibbs random ﬁelds , existence of a genuine suﬃcient statistic η(y). [Møller, Pettitt, Berthelsen, & Reeves, 2006]

Reversible jump irreversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]

Reversible jump irreversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]

Local moves For a move between two models, M1 and M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .

Local moves For a move between two models, M1 and M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .

Local moves (2) In this case, q1→2(θ1, dθ2) has density ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

Local moves (2) In this case, q1→2(θ1, dθ2) has density ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

Local moves (2) In this case, q1→2(θ1, dθ2) has density ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic

Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic

Alternative Saturation of the parameter space H = k {k} × Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .

Alternative Saturation of the parameter space H = k {k} × Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .

MCMC implementation Run a Markov chain (M(t), θ(t) 1 , . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

MCMC implementation Run a Markov chain (M(t), θ(t) 1 , . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

MCMC implementation Run a Markov chain (M(t), θ(t) 1 , . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

Scott’s (2002) proposal Suggest estimating P(M = k|x) by Zk ∝ pk T t=1 fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j ) , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]

Scott’s (2002) proposal Suggest estimating P(M = k|x) by Zk ∝ pk T t=1 fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j ) , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]

Congdon’s (2006) extension Selecting ﬂat [prohibited] pseudo-priors, uses instead Zk ∝ pk T t=1 fk(x|θ(t) k )πk(θ(t) k ) D j=1 pj fj(x|θ(t) j )πj(θ(t) j ) , where again the θ(t) k ’s are MCMC chains with stationary distributions πk(θk|x) to next section

Examples (2) Example (Model choice (2)) Normal model M1 : x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1) Comparison of both approximations with P(M = 1|x): Scott’s (2002) (green and mixed dashes) and Congdon’s (2006) (brown and long dashes) [N = 104 simulations]. −1 0 1 2 3 4 5 6 0.0 0.2 0.4 0.6 0.8 1.0 y

Nested sampling: Goal Skilling’s (2007) technique using the one-dimensional representation: Z = Eπ[L(θ)] = 1 0 ϕ(x) dx with ϕ−1(l) = Pπ(L(θ) > l). Note; ϕ(·) is intractable in most cases.

Nested sampling: First approximation Approximate Z by a Riemann sum: Z = N i=1 (xi−1 − xi)ϕ(xi) where the xi’s are either: deterministic: xi = e−i/N or random: x0 = 0, xi+1 = tixi, ti ∼ Be(N, 1) so that E[log xi] = −i/N.

Nested sampling: Alternative representation Another representation is Z = N−1 i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]

Nested sampling: Alternative representation Another representation is Z = N−1 i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]

Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

Nested sampling: Third approximation Iterate the above steps until a given stopping iteration j is reached: e.g., observe very small changes in the approximation Z; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level , i.e. replace 1 0 ϕ(x) dx with 1 ϕ(x) dx

A CLT for the Stochastic Error The (dominating) stochastic error is OP (N−1/2): N1/2 Z − Z D → N (0, V ) with V = − s,t∈[ ,1] sϕ (s)tϕ (t) log(s ∨ t) ds dt. [Proof based on Donsker’s theorem]

What of log Z? If the interest lies within log Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

What of log Z? If the interest lies within log Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

What of log Z? If the interest lies within log Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] Evidence estimation

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X1] estimation

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X2] estimation

A IS variant of nested sampling Consider instrumental prior π and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.

A IS variant of nested sampling Consider instrumental prior π and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.

Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior Posterior on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ2. occurrences of posterior bursts for µ = xi

Comparison 1 Nested sampling: M = 1000 points, with 10 random walk moves at each step, simulations from the constr’d prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 104 MCMC (=Gibbs) simulations producing non-parametric estimates ϕ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Z1, Z2, Z3 using product of two Gaussian kernels 4 numerical integration based on 850 × 950 grid [reference value, conﬁrmed by Chib’s]

Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

A as approximative When y is a continuous random variable, equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )

A as approximative When y is a continuous random variable, equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )

The starting point Central question to the validation of ABC for model choice: When is a Bayes factor based on an insuﬃcient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily diﬀers from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]

The starting point Central question to the validation of ABC for model choice: When is a Bayes factor based on an insuﬃcient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily diﬀers from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one).

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). Four possible statistics 1 sample mean y (suﬃcient for M1 if not M2); 2 sample median med(y) (insuﬃcient); 3 sample variance var(y) (ancillary); 4 median absolute deviation mad(y) = med(|y − med(y)|);

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). q q q q q q q q q q q Gauss Laplace 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 n=100

Framework Starting from sample y = (y1, . . . , yn) the observed sample, not necessarily iid with true distribution y ∼ Pn Summary statistics T (y) = T n = (T1(y), T2(y), · · · , Td(y)) ∈ Rd with true distribution T n ∼ Gn.

Framework c Comparison of – under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1 – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2 turned into – under M1, T (y) ∼ G1,n(·|θ1), and θ1|T (y) ∼ π1(·|T n) – under M2, T (y) ∼ G2,n(·|θ2), and θ2|T (y) ∼ π2(·|T n)

Assumptions A collection of asymptotic “standard” assumptions: [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di eﬀective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A1] There exist a sequence {vn} converging to +∞, a distribution Q, a symmetric, d × d positive deﬁnite matrix V0 and a vector µ0 ∈ Rd such that vnV −1/2 0 (T n − µ0) n→∞ ; Q, under Gn

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants i, τi, αi > 0 such that for all τ > 0, sup θi∈Fn,i Gi,n |T n − µi(θi)| > τ|µi(θi) − µ0| ∧ i |θi (|τµi(θi) − µ0| ∧ i)−αi v−αi n with πi(Fc n,i ) = o(v−τi n ).

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U) En ⊂ {t; gi(t|θi) < δgn(t)} and Gn (Ec n ) < .

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] Again (sumup) [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di eﬀective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density

Eﬀective dimension Understanding di in [A3]: deﬁned only when µ0 ∈ {µi(θi), θi ∈ Θi}, πi(θi : |µi(θi) − µ0| < n−1/2) = O(n−di/2) is the eﬀective dimension of the model Θi around µ0

Eﬀective dimension Understanding di in [A3]: when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, mi(T n) gn(T n) ∼ v−di n , thus log(mi(T n)/gn(T n)) ∼ −di log vn and v−di n penalization factor resulting from integrating θi out (see eﬀective number of parameters in DIC)

Asymptotic marginals Asymptotically, under [A1]–[A4] mi(t) = Θi gi(t|θi) πi(θi) dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, Clvd−di n ≤ mi(T n) ≤ Cuvd−di n and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0 mi(T n) = o Pn [vd−τi n + vd−αi n ].

Within-model consistency Under same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the posterior distribution of µi(θi) given T n is consistent at rate 1/vn provided αi ∧ τi > di.

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value!

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then Clv−(d1−d2) n ≤ m1(T n) m2(T n) ≤ Cuv−(d1−d2) n , where Cl, Cu = O Pn (1), irrespective of the true model. c Only depends on the diﬀerence d1 − d2: no consistency

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(T n) m2(T n) ≥ Cu min v−(d1−α2) n , v−(d1−τ2) n

Consistency theorem If inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, Bayes factor BT 12 = O(v−(d1−d2) n ) irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension

Consistency theorem If Pn belongs to one of the two models and if µ0 cannot be attained by the other one : 0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) , then the Bayes factor BT 12 is consistent

Consequences on summary statistics Bayes factor driven by the means µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with diﬀerent mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspeciﬁed

Consequences on summary statistics Bayes factor driven by the means µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with diﬀerent mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspeciﬁed

Toy example: Laplace versus Gauss [1] If T n = n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1]

Toy example: Laplace versus Gauss [1] If T n = n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1] c A Bayes factor associated with such a statistic is inconsistent

Toy example: Laplace versus Gauss [0] When T (y) = ¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent

Toy example: Laplace versus Gauss [0] When T (y) = ¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent

Toy example: Laplace versus Gauss [0] When T (y) = ¯ y(4) n , ¯ y(6) n 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 posterior probabilities Density Fourth and sixth moments

Checking for adequate statistics After running ABC, i.e. creating reference tables of (θi, xi) from both joints, straightforward derivation of ABC estimates ˆ θ1 and ˆ θ2. Evaluation of E1 ˆ θ1 [T(X)] and E2 ˆ θ2 [T(X)] allows for detection of diﬀerent means under both models via Monte Carlo simulations

Embedded models When M1 submodel of M2, and if the true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n If summary statistic only informative on a parameter that is the same under both models, i.e if d1 = d2, then c the Bayes factor is not consistent

Embedded models When M1 submodel of M2, and if the true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true distribution not in M1, then c Bayes factor is consistent only if µ1 = µ2 = µ0

Summary Model selection feasible with ABC Choice of summary statistics is paramount At best, ABC → π(. | T (y)) which concentrates around µ0 For estimation : {θ; µ(θ) = µ0} = θ0 For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅

Summary Model selection feasible with ABC Choice of summary statistics is paramount At best, ABC → π(. | T (y)) which concentrates around µ0 For estimation : {θ; µ(θ) = µ0} = θ0 For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅

Savage’s density ratio theorem Given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ) then B01 = π1(θ0|x) π1(θ0) , with the obvious notations π1(θ) = π1(θ, ψ)dψ , π1(θ|x) = π1(θ, ψ|x)dψ , [Dickey, 1971; Verdinelli & Wasserman, 1995]

Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]

Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]

Choice of density version c Dickey’s (1971) condition is not a condition: If π1(θ0|x) π1(θ0) = π0(ψ)f(x|θ0, ψ) dψ m1(x) is chosen as a version, then Savage–Dickey’s representation holds

Choice of density version c Dickey’s (1971) condition is not a condition: If π1(θ0|x) π1(θ0) = π0(ψ)f(x|θ0, ψ) dψ m1(x) is chosen as a version, then Savage–Dickey’s representation holds

Savage–Dickey paradox Verdinelli-Wasserman extension: B01 = π1(θ0|x) π1(θ0) Eπ1(ψ|x,θ0,x) π0(ψ) π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on speciﬁc versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]

Savage–Dickey paradox Verdinelli-Wasserman extension: B01 = π1(θ0|x) π1(θ0) Eπ1(ψ|x,θ0,x) π0(ψ) π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on speciﬁc versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]

First ratio If (θ(1), ψ(1)), . . . , (θ(T), ψ(T)) ∼ ˜ π(θ, ψ|x), then 1 T t ˜ π1(θ0|x, ψ(t)) converges to ˜ π1(θ0|x) provided the right version is used in θ0 ˜ π1(θ0|x, ψ) = π1(θ0)f(x|θ0, ψ) π1(θ)f(x|θ, ψ) dθ

Rao–Blackwellisation with latent variables When ˜ π1(θ0|x, ψ) unavailable, replace with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .

Rao–Blackwellisation with latent variables When ˜ π1(θ0|x, ψ) unavailable, replace with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .

Bridge revival (1) Since ˜ m1(x)/m1(x) is unknown, apparent failure! Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.

Bridge revival (1) Since ˜ m1(x)/m1(x) is unknown, apparent failure! Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.

Diabetes in Pima Indian women (cont’d) Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20, 000 simulations for a simulation from the above importance, Chib’s, Savage–Dickey’s and bridge samplers q q IS Chib Savage−Dickey Bridge 2.8 3.0 3.2 3.4

Other things I could have mentioned use of particle ﬁlters [Del Moral et al., 2006] path sampling d dθ log Z(θ) = Eθ d dθ log q(ω|θ) [Ogata, 1989; Gelman & Meng, 1998] multicanonical algorithms [Marinari & Parisi, 1992; Geyer & Thompson, 1995]