H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,

H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,

available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be ﬁnite or inﬁnite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors

available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be ﬁnite or inﬁnite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors

pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

pi to all models Mi deﬁne priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj

∝ ˜ π2(θ2|x) live on the same space (Θ1 = Θ2), then B12 ≈ 1 n n i=1 ˜ π1(θi|x) ˜ π2(θi|x) θi ∼ π2(θ|x) [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with suﬃciently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)

π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with suﬃciently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)

mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

standard importance sampling approximation Z2k = 1 T T t=1 πk(θ(t) k )Lk(θ(t) k ) ϕ(θ(t) k ) where the θ(t) k ’s are generated from the density ϕ(·) (with fatter tails like t’s)

a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]

T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]

with X simulated from a joint density ˜ f(x, y), where ˜ f(x, y)dy = f(x), the estimator I∗ = E ˜ f [I|Y1, . . . , Yn] dominate I(X1, . . . , Xn) variance-wise (and is unbiased)

fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .

fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .

variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]

fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]

k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]

in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artiﬁcial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) For Gibbs random ﬁelds , existence of a genuine suﬃcient statistic η(y). [Møller, Pettitt, Berthelsen, & Reeves, 2006]

framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]

framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]

M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .

M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .

ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic

setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic

× Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .

× Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .

. . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

. . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

. . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

∝ pk T t=1 fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j ) , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]

∝ pk T t=1 fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j ) , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]

∝ pk T t=1 fk(x|θ(t) k )πk(θ(t) k ) D j=1 pj fj(x|θ(t) j )πj(θ(t) j ) , where again the θ(t) k ’s are MCMC chains with stationary distributions πk(θk|x) to next section

x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1) Comparison of both approximations with P(M = 1|x): Scott’s (2002) (green and mixed dashes) and Congdon’s (2006) (brown and long dashes) [N = 104 simulations]. −1 0 1 2 3 4 5 6 0.0 0.2 0.4 0.6 0.8 1.0 y

Z = N i=1 (xi−1 − xi)ϕ(xi) where the xi’s are either: deterministic: xi = e−i/N or random: x0 = 0, xi+1 = tixi, ti ∼ Be(N, 1) so that E[log xi] = −i/N.

i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]

i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]

sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N

given stopping iteration j is reached: e.g., observe very small changes in the approximation Z; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level , i.e. replace 1 0 ϕ(x) dx with 1 ϕ(x) dx

Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at ﬁrst iteration j such that e−j/N < , then: j = N − log .

is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] Evidence estimation

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X1] estimation

1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and ﬁnal sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X2] estimation

and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.

and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.

on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ2. occurrences of posterior bursts for µ = xi

random walk moves at each step, simulations from the constr’d prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 104 MCMC (=Gibbs) simulations producing non-parametric estimates ϕ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Z1, Z2, Z3 using product of two Gaussian kernels 4 numerical integration based on 850 × 950 grid [reference value, conﬁrmed by Chib’s]

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )

equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )

for model choice: When is a Bayes factor based on an insuﬃcient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily diﬀers from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]

for model choice: When is a Bayes factor based on an insuﬃcient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily diﬀers from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one).

PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). Four possible statistics 1 sample mean y (suﬃcient for M1 if not M2); 2 sample median med(y) (insuﬃcient); 3 sample variance var(y) (ancillary); 4 median absolute deviation mad(y) = med(|y − med(y)|);

, yn) the observed sample, not necessarily iid with true distribution y ∼ Pn Summary statistics T (y) = T n = (T1(y), T2(y), · · · , Td(y)) ∈ Rd with true distribution T n ∼ Gn.

where θ1 ∈ Θ1 ⊂ Rp1 – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2 turned into – under M1, T (y) ∼ G1,n(·|θ1), and θ1|T (y) ∼ π1(·|T n) – under M2, T (y) ∼ G2,n(·|θ2), and θ2|T (y) ∼ π2(·|T n)

standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di eﬀective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]

There exist a sequence {vn} converging to +∞, a distribution Q, a symmetric, d × d positive deﬁnite matrix V0 and a vector µ0 ∈ Rd such that vnV −1/2 0 (T n − µ0) n→∞ ; Q, under Gn

For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants i, τi, αi > 0 such that for all τ > 0, sup θi∈Fn,i Gi,n |T n − µi(θi)| > τ|µi(θi) − µ0| ∧ i |θi (|τµi(θi) − µ0| ∧ i)−αi v−αi n with πi(Fc n,i ) = o(v−τi n ).

If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U) En ⊂ {t; gi(t|θi) < δgn(t)} and Gn (Ec n ) < .

(sumup) [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di eﬀective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density

θi ∈ Θi} = 0, mi(T n) gn(T n) ∼ v−di n , thus log(mi(T n)/gn(T n)) ∼ −di log vn and v−di n penalization factor resulting from integrating θi out (see eﬀective number of parameters in DIC)

dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, Clvd−di n ≤ mi(T n) ≤ Cuvd−di n and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0 mi(T n) = o Pn [vd−τi n + vd−αi n ].

the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then Clv−(d1−d2) n ≤ m1(T n) m2(T n) ≤ Cuv−(d1−d2) n , where Cl, Cu = O Pn (1), irrespective of the true model. c Only depends on the diﬀerence d1 − d2: no consistency

the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(T n) m2(T n) ≥ Cu min v−(d1−α2) n , v−(d1−τ2) n

inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, Bayes factor BT 12 = O(v−(d1−d2) n ) irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension

models and if µ0 cannot be attained by the other one : 0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) , then the Bayes factor BT 12 is consistent

µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with diﬀerent mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspeciﬁed

µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with diﬀerent mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspeciﬁed

n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1]

n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1] c A Bayes factor associated with such a statistic is inconsistent

¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent

¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent

tables of (θi, xi) from both joints, straightforward derivation of ABC estimates ˆ θ1 and ˆ θ2. Evaluation of E1 ˆ θ1 [T(X)] and E2 ˆ θ2 [T(X)] allows for detection of diﬀerent means under both models via Monte Carlo simulations

true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n If summary statistic only informative on a parameter that is the same under both models, i.e if d1 = d2, then c the Bayes factor is not consistent

true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true distribution not in M1, then c Bayes factor is consistent only if µ1 = µ2 = µ0

= θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ) then B01 = π1(θ0|x) π1(θ0) , with the obvious notations π1(θ) = π1(θ, ψ)dψ , π1(θ|x) = π1(θ, ψ|x)dψ , [Dickey, 1971; Verdinelli & Wasserman, 1995]

= f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]

= f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]

π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on speciﬁc versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]

π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on speciﬁc versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]

ψ(T)) ∼ ˜ π(θ, ψ|x), then 1 T t ˜ π1(θ0|x, ψ(t)) converges to ˜ π1(θ0|x) provided the right version is used in θ0 ˜ π1(θ0|x, ψ) = π1(θ0)f(x|θ0, ψ) π1(θ)f(x|θ, ψ) dθ

with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .

with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .

Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.

Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.

of the Bayes factor approximations based on 100 replicas for 20, 000 simulations for a simulation from the above importance, Chib’s, Savage–Dickey’s and bridge samplers q q IS Chib Savage−Dickey Bridge 2.8 3.0 3.2 3.4