
Statistical Frontiers, Warwick, Jan. 30, 2014

Course on the approximation of Bayes factors by an array of numerical methods

Xi'an
January 23, 2014
Transcript

  1. On some computational methods for Bayesian
    model choice
    Christian P. Robert
    University of Warwick and Université Paris Dauphine
    http://www.ceremade.dauphine.fr/~xian
    Statistical Frontiers, January 30, 2014

  2. Outline
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  3. Formal construction of Bayes tests
    Definition (Test)
    Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
    statistical model, a test is a statistical procedure that takes its
    values in {0, 1}.
    Theorem (Bayes test)
    The Bayes estimator associated with π and with the 0–1 loss is
    δπ(x) = 1 if π(θ ∈ Θ0 | x) > π(θ ∈ Θ0ᶜ | x), and 0 otherwise.

  5. Bayes factor
    Definition (Bayes factors)
    For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior
    π(Θ0)π0(θ) + π(Θ0ᶜ)π1(θ) ,
    the central quantity is
    B01(x) = [π(Θ0|x)/π(Θ0ᶜ|x)] ÷ [π(Θ0)/π(Θ0ᶜ)]
           = ∫_Θ0 f(x|θ)π0(θ) dθ / ∫_Θ0ᶜ f(x|θ)π1(θ) dθ
           = m0(x)/m1(x)
    [Jeffreys, 1939]

  6. Model choice and model comparison
    Choice between models
    Several models available for the same observation x
    Mi : x ∼ fi(x|θi), i ∈ I
    where the family I can be finite or infinite
    Identical setup: Replace hypotheses with models but keep
    marginal likelihoods and Bayes factors

  8. Bayesian model choice
    Probabilise the entire model/parameter space
    allocate probabilities pi to all models Mi
    define priors πi(θi) for each parameter space Θi
    compute
    π(Mi|x) = pi ∫_Θi fi(x|θi)πi(θi) dθi / Σj pj ∫_Θj fj(x|θj)πj(θj) dθj
    take largest π(Mi|x) to determine the “best” model,
    or use the averaged predictive
    Σj π(Mj|x) ∫_Θj fj(x′|θj)πj(θj|x) dθj

  12. Evidence
    All these problems end up with a similar quantity, the evidence
    Zk = ∫_Θk πk(θk)Lk(θk) dθk ,
    aka the marginal likelihood.

  13. Importance sampling
    1 Evidence
    2 Importance sampling solutions
    Regular importance
    Harmonic means
    Chib’s solution
    Auxiliary variables
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  14. Bayes factor approximation
    When approximating the Bayes factor
    B01 = ∫_Θ0 f0(x|θ0)π0(θ0) dθ0 / ∫_Θ1 f1(x|θ1)π1(θ1) dθ1
    use of importance functions ϖ0 and ϖ1 and
    B̂01 = [n0⁻¹ Σ_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i)] / [n1⁻¹ Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i)]
    on the corresponding parameter spaces (may be of different
    dimensions!)

  15. Default importance
    A typical choice for ϖ in Bayesian settings is to take
    θ ∼ N(θ̂_MLE, σ̂²_MLE)
    [Marin & Robert, 2009; Wood, 2010]
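
As a concrete illustration of the two slides above, here is a minimal Python sketch (not part of the deck) of the importance-sampling evidence estimate with the N(θ̂_MLE, σ̂²_MLE) importance function; the model, prior and sample sizes are arbitrary choices made for the demo.

```python
import numpy as np
from scipy import stats

# Importance-sampling estimate of Z = ∫ f(x|θ)π(θ)dθ for a toy model
# x_i ~ N(θ, 1), θ ~ N(0, 10), using the default N(θ̂_MLE, σ̂²_MLE) proposal.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)                     # simulated "observed" data

def log_lik(theta):                                   # log f(x|θ) for a vector θ
    return stats.norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)

def log_prior(theta):                                 # θ ~ N(0, 10)
    return stats.norm.logpdf(theta, loc=0.0, scale=np.sqrt(10.0))

theta_mle, se_mle = x.mean(), 1.0 / np.sqrt(len(x))   # MLE and its standard error
theta_sim = rng.normal(theta_mle, se_mle, size=100_000)
log_w = (log_lik(theta_sim) + log_prior(theta_sim)
         - stats.norm.logpdf(theta_sim, theta_mle, se_mle))
log_Z = np.logaddexp.reduce(log_w) - np.log(len(log_w))   # log of the average weight
print("importance-sampling estimate of log Z:", log_Z)
```

Running the same recipe on two competing models and taking the ratio of the two estimates gives the B̂01 of the previous slide.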

  16. Bridge sampling
    Special case:
    If
    π1(θ1|x) ∝ π̃1(θ1|x) and π2(θ2|x) ∝ π̃2(θ2|x)
    live on the same space (Θ1 = Θ2), then
    B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x) ,  θi ∼ π2(θ|x)
    [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

  17. Bridge sampling variance
    The bridge sampling estimator does poorly if
    var(B̂12)/B12² ≈ (1/n) E[ { (π1(θ) − π2(θ)) / π2(θ) }² ]
    is large, i.e. if π1 and π2 have little overlap...

  19. (Further) bridge sampling
    A remarkable property
    B12 = ∫ π̃2(θ|x)α(θ)π1(θ|x) dθ / ∫ π̃1(θ|x)α(θ)π2(θ|x) dθ ,  ∀ α(·)
        ≈ [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x)α(θ1i) ] / [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x)α(θ2i) ] ,  θji ∼ πj(θ|x)

  20. An infamous example
    When
    α(θ) = 1 / π̃1(θ)π̃2(θ)
    harmonic mean approximation to B12
    B̂21 = [ (1/n1) Σ_{i=1}^{n1} 1/π̃1(θ1i|x) ] / [ (1/n2) Σ_{i=1}^{n2} 1/π̃2(θ2i|x) ] ,  θji ∼ πj(θ|x)
    [Newton & Raftery, 1994]
    Infamous: Most often leads to an infinite variance!!!
    [Radford Neal’s blog, 2008]

  22. “The Worst Monte Carlo Method Ever”
    “The good news is that the Law of Large Numbers guarantees that
    this estimator is consistent, i.e., it will very likely be very close to the
    correct answer if you use a sufficiently large number of points from
    the posterior distribution.
    The bad news is that the number of points required for this
    estimator to get close to the right answer will often be greater
    than the number of atoms in the observable universe. The even
    worse news is that it’s easy for people to not realize this, and to
    naively accept estimates that are nowhere close to the correct
    value of the marginal likelihood.”
    [Radford Neal’s blog, Aug. 23, 2008]
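
To make this warning concrete, here is a small numerical check (not in the deck) on a conjugate normal model where the evidence is known in closed form; the harmonic mean estimate computed from exact posterior draws can be compared with the truth, and it is typically much less reliable than the identity suggests. The model and sample sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

# Harmonic-mean estimate of the evidence for x_i ~ N(θ,1) with prior θ ~ N(0,1),
# where the exact marginal likelihood is available in closed form.
rng = np.random.default_rng(1)
n, T = 20, 100_000
x = rng.normal(0.5, 1.0, size=n)

# Exact evidence: under the prior predictive, x ~ N(0, I + 11')
exact_log_Z = stats.multivariate_normal.logpdf(x, mean=np.zeros(n),
                                               cov=np.eye(n) + np.ones((n, n)))

# Posterior is N(post_mean, post_var); sample from it directly
post_var = 1.0 / (n + 1.0)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)
log_lik = stats.norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)

# Harmonic mean: 1 / average of 1/L(θ_t) over posterior draws
log_Z_hm = -(np.logaddexp.reduce(-log_lik) - np.log(T))
print("exact log Z        :", exact_log_Z)
print("harmonic mean est. :", log_Z_hm)   # compare across seeds to see the instability
```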

  24. Optimal bridge sampling
    The optimal choice of auxiliary function is
    α = (n1 + n2) / {n1π1(θ|x) + n2π2(θ|x)}
    leading to
    B̂12 ≈ [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) / {n1π1(θ1i|x) + n2π2(θ1i|x)} ]
        / [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) / {n1π1(θ2i|x) + n2π2(θ2i|x)} ]
    Back later!
    [Gelfand & Dey, 1994]

  25. Optimal bridge sampling (2)
    Reason:
    Var(B̂12)/B12² ≈ (1/n1n2) [ ∫ π1(θ)π2(θ){n1π1(θ) + n2π2(θ)}α(θ)² dθ / {∫ π1(θ)π2(θ)α(θ) dθ}² − 1 ]
    (δ method)
    Drag: dependence on the unknown normalising constants, solved
    iteratively
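
A minimal sketch (not in the deck) of the resulting iterative scheme, in the spirit of Meng and Wong's fixed point, for two unnormalised densities on a common space; the two Gaussian targets and sample sizes are arbitrary choices whose normalising constants are known, so the output can be checked.

```python
import numpy as np
from scipy import stats

# Iterative "optimal bridge" estimate of r = Z1/Z2 for unnormalised q1, q2.
rng = np.random.default_rng(2)
n1 = n2 = 50_000
Z1_true, Z2_true = 2.0, 5.0                      # chosen only to check the answer
q1 = lambda t: Z1_true * stats.norm.pdf(t, 0.0, 1.0)
q2 = lambda t: Z2_true * stats.norm.pdf(t, 1.0, 1.5)
th1 = rng.normal(0.0, 1.0, n1)                   # draws from q1/Z1
th2 = rng.normal(1.0, 1.5, n2)                   # draws from q2/Z2

l1 = q1(th1) / q2(th1)                           # density ratios on sample 1
l2 = q1(th2) / q2(th2)                           # density ratios on sample 2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

r = 1.0                                          # initial guess for Z1/Z2
for _ in range(100):                             # fixed-point iterations
    num = np.mean(l2 / (s1 * l2 + s2 * r))
    den = np.mean(1.0 / (s1 * l1 + s2 * r))
    r = num / den
print("bridge estimate of Z1/Z2:", r, " (truth:", Z1_true / Z2_true, ")")
```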

  27. Ratio importance sampling
    Another fundamental identity:
    B12 = Eϕ[π̃1(θ)/ϕ(θ)] / Eϕ[π̃2(θ)/ϕ(θ)]
    for any density ϕ with sufficiently large support (when Θ1 = Θ2)
    [Torrie & Valleau, 1977]
    Use of a single sample θ1, . . . , θn from ϕ:
    B̂12 = Σ_{i=1}^n π̃1(θi)/ϕ(θi) / Σ_{i=1}^n π̃2(θi)/ϕ(θi)

  29. Ratio importance sampling (2)
    Approximate variance:
    var(B̂12)/B12² ≈ (1/n) Eϕ[ (π1(θ) − π2(θ))² / ϕ(θ)² ]
    Optimal choice:
    ϕ*(θ) = |π1(θ) − π2(θ)| / ∫ |π1(η) − π2(η)| dη
    [Chen, Shao & Ibrahim, 2000]
    Formally better than bridge sampling

  32. Approximating Zk from a posterior sample
    Use of the [harmonic mean] identity
    Eπk[ ϕ(θk) / {πk(θk)Lk(θk)} | x ] = ∫ [ ϕ(θk) / {πk(θk)Lk(θk)} ] [ πk(θk)Lk(θk) / Zk ] dθk = 1/Zk
    no matter what the proposal ϕ(·) is.
    [Gelfand & Dey, 1994; Bartolucci et al., 2006]
    Direct exploitation of the MCMC output

  34. Comparison with regular importance sampling
    Harmonic mean: Constraint opposed to usual importance sampling
    constraints: ϕ(θ) must have lighter (rather than fatter) tails than
    πk(θk)Lk(θk) for the approximation
    Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^(t)) / {πk(θk^(t))Lk(θk^(t))} ]
    to have a finite variance.
    E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
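
A sketch (not in the deck) of the Ẑ1k estimator above with a light-tailed, finite-support ϕ (a normal truncated to a high-posterior region), on the same conjugate normal toy model as before so the exact evidence is available; the truncation level is an arbitrary choice.

```python
import numpy as np
from scipy import stats

# Gelfand–Dey / reciprocal-importance estimate Ẑ = 1 / mean_t[ ϕ(θ_t)/(π(θ_t)L(θ_t)) ]
# with θ_t exact posterior draws and ϕ a truncated-normal proposal.
rng = np.random.default_rng(3)
n, T = 20, 100_000
x = rng.normal(0.5, 1.0, size=n)
exact_log_Z = stats.multivariate_normal.logpdf(
    x, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))

post_var = 1.0 / (n + 1.0)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)   # posterior draws

# ϕ: normal matched to the posterior sample, truncated at ±2.5 sd (finite support)
m, s = theta.mean(), theta.std()
keep = np.abs((theta - m) / s) < 2.5                        # ϕ(θ) = 0 outside
log_phi = stats.truncnorm.logpdf(theta[keep], -2.5, 2.5, loc=m, scale=s)
log_post_unnorm = (stats.norm.logpdf(x[:, None], theta[keep], 1.0).sum(axis=0)
                   + stats.norm.logpdf(theta[keep], 0.0, 1.0))
log_ratio = log_phi - log_post_unnorm                       # ϕ/(πL) at each kept draw
log_Z_gd = -(np.logaddexp.reduce(log_ratio) - np.log(T))    # divide by T, not len(keep)
print("exact log Z      :", exact_log_Z)
print("Gelfand–Dey est. :", log_Z_gd)
```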

  36. Comparison with regular importance sampling (cont’d)
    Compare Ẑ1k with a standard importance sampling approximation
    Ẑ2k = (1/T) Σ_{t=1}^T πk(θk^(t))Lk(θk^(t)) / ϕ(θk^(t))
    where the θk^(t)’s are generated from the density ϕ(·) (with fatter
    tails, like t’s)

  37. Approximating Zk
    using a mixture representation
    Bridge sampling recall
    Design a specific mixture for simulation [importance sampling]
    purposes, with density
    ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) ,
    where ϕ(·) is arbitrary (but normalised)
    Note: ω1 is not a probability weight


  39. Approximating Z using a mixture representation (cont’d)
    Corresponding MCMC (=Gibbs) sampler
    At iteration t
    1 Take δ(t) = 1 with probability
    ω1πk(θk^(t−1))Lk(θk^(t−1)) / {ω1πk(θk^(t−1))Lk(θk^(t−1)) + ϕ(θk^(t−1))}
    and δ(t) = 2 otherwise;
    2 If δ(t) = 1, generate θk^(t) ∼ MCMC(θk^(t−1), θk) where MCMC(θk, θk′)
    denotes an arbitrary MCMC kernel associated with the posterior
    πk(θk|x) ∝ πk(θk)Lk(θk);
    3 If δ(t) = 2, generate θk^(t) ∼ ϕ(θk) independently

  42. Evidence approximation by mixtures
    Rao-Blackwellised estimate
    ξ̂ = (1/T) Σ_{t=1}^T ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ,
    converges to ω1Zk/{ω1Zk + 1}
    Deduce Ẑ3k from ω1Ẑ3k/{ω1Ẑ3k + 1} = ξ̂, i.e.
    Ẑ3k = Σ_{t=1}^T [ ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ]
        / ω1 Σ_{t=1}^T [ ϕ(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ]
    [Bridge sampler redux]
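
A compact sketch (not in the deck) of this mixture scheme and the resulting Ẑ3 on a one-dimensional toy target whose normalising constant is known; ω1, ϕ, and the use of an exact draw in place of a generic MCMC(θ, θ′) kernel are simplifying choices made only for the illustration.

```python
import numpy as np
from scipy import stats

# Mixture-representation estimator Ẑ3 for q(θ) = Z_true · N(θ; 0, 1).
rng = np.random.default_rng(4)
Z_true, w1, T = 4.0, 1.0, 50_000
q = lambda th: Z_true * stats.norm.pdf(th, 0.0, 1.0)   # unnormalised target
phi = stats.norm(0.5, 2.0)                              # arbitrary normalised ϕ

theta = np.empty(T)
th = 0.0
for t in range(T):
    p1 = w1 * q(th)
    if rng.random() < p1 / (p1 + phi.pdf(th)):   # δ = 1: move within the target
        th = rng.normal(0.0, 1.0)                # stands in for an MCMC kernel
    else:                                        # δ = 2: independent draw from ϕ
        th = phi.rvs(random_state=rng)
    theta[t] = th

a = w1 * q(theta) / (w1 * q(theta) + phi.pdf(theta))    # Rao-Blackwellised weights
Z3 = a.sum() / (w1 * (1.0 - a).sum())                   # solves ω1Ẑ/(ω1Ẑ+1) = ξ̂
print("mixture estimate of Z:", Z3, " (truth:", Z_true, ")")
```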

  44. Integration by conditioning
    Use Rao-Blackwell Theorem
    var(E[δ(X)|Y]) ≤ var(δ(X))


  45. Consequence
    If Î(X1, . . . , Xn) is an unbiased estimator of I = Ef[h(X)], with X
    simulated from a joint density f̃(x, y), where
    ∫ f̃(x, y) dy = f(x),
    the estimator
    I* = E_f̃[Î|Y1, . . . , Yn]
    dominates Î(X1, . . . , Xn) variance-wise (and is unbiased)

  46. Chib’s representation
    Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and
    θk ∼ πk(θk),
    Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x)
    Use of an approximation to the posterior
    Zk = mk(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x) .

  48. Case of latent variables
    For missing variable z as in mixture models, natural Rao-Blackwell
    estimate
    π̂k(θk*|x) = (1/T) Σ_{t=1}^T πk(θk*|x, zk^(t)) ,
    where the zk^(t)’s are Gibbs-sampled latent variables

  49. Label switching impact
    A mixture model [special case of missing variable model] is
    invariant under permutations of the indices of the components.
    E.g., mixtures
    0.3N(0, 1) + 0.7N(2.3, 1)
    and
    0.7N(2.3, 1) + 0.3N(0, 1)
    are exactly the same!
    c The component parameters θi are not identifiable
    marginally since they are exchangeable


  51. Connected difficulties
    1 Number of modes of the likelihood of order O(k!):
    c Maximization and even [MCMC] exploration of the
    posterior surface harder
    2 Under exchangeable priors on (θ, p) [prior invariant under
    permutation of the indices], all posterior marginals are
    identical:
    c Posterior expectation of θ1 equal to posterior expectation
    of θ2


  53. License
    Since Gibbs output does not produce exchangeability, the Gibbs
    sampler has not explored the whole parameter space: it lacks the
    energy to switch enough component allocations at once
    [Figure: Gibbs output for the mixture — traces of µi, pi and σi against the
    number of iterations n, and the corresponding pairwise scatter plots]

  54. Label switching paradox
    We should observe the exchangeability of the components [label
    switching] to conclude about convergence of the Gibbs sampler.
    If we observe it, then we do not know how to estimate the
    parameters.
    If we do not, then we are uncertain about the convergence!!!


  57. Compensation for label switching
    For mixture models, zk^(t) usually fails to visit all configurations in a
    balanced way, despite the symmetry predicted by the theory
    πk(θk|x) = πk(σ(θk)|x) = (1/k!) Σ_{σ∈Sk} πk(σ(θk)|x)
    for all σ’s in Sk, set of all permutations of {1, . . . , k}.
    Consequences on numerical approximation, biased by an order k!
    Recover the theoretical symmetry by using
    π̂k(θk*|x) = (1/(T k!)) Σ_{σ∈Sk} Σ_{t=1}^T πk(σ(θk*)|x, zk^(t)) .
    [Berkhof, Mechelen, & Gelman, 2003]
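
A sketch (not in the deck) combining Chib's representation with the permutation averaging above, on a deliberately simple two-component normal mixture with equal known weights and unit variances so that the Gibbs sampler and the conditional density π(µ*|x, z) stay short; the prior, data and run lengths are arbitrary choices.

```python
import numpy as np
from scipy import stats
from itertools import permutations

# Chib's evidence approximation, with and without label-switching correction,
# for x_i ~ 0.5 N(mu_1, 1) + 0.5 N(mu_2, 1), mu_j ~ N(0, tau2).
rng = np.random.default_rng(5)
n, T, tau2 = 100, 5_000, 25.0
x = np.concatenate([rng.normal(0.0, 1.0, n // 2), rng.normal(2.3, 1.0, n // 2)])

def gibbs_step(mu):
    # allocations z_i | mu, x
    logp = stats.norm.logpdf(x[:, None], loc=mu, scale=1.0)      # shape (n, 2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    z = (rng.random(n) < p[:, 1] / p.sum(axis=1)).astype(int)
    # means mu_j | z, x (conjugate normal updates)
    new_mu = np.empty(2)
    for j in range(2):
        nj, sj = (z == j).sum(), x[z == j].sum()
        v = 1.0 / (nj + 1.0 / tau2)
        new_mu[j] = rng.normal(v * sj, np.sqrt(v))
    return new_mu, z

def cond_logpdf(mu_star, z):
    # log pi(mu* | x, z): product of the two conditional normal densities
    out = 0.0
    for j in range(2):
        nj, sj = (z == j).sum(), x[z == j].sum()
        v = 1.0 / (nj + 1.0 / tau2)
        out += stats.norm.logpdf(mu_star[j], v * sj, np.sqrt(v))
    return out

mu = np.array([-1.0, 1.0])
draws, zs = [], []
for t in range(T):
    mu, z = gibbs_step(mu)
    draws.append(mu.copy()); zs.append(z.copy())
mu_star = np.mean(draws[T // 2:], axis=0)                 # plug-in point theta*

def log_mean_exp(v): return np.logaddexp.reduce(v) - np.log(len(v))

# Chib: log Z = log f(x|mu*) + log pi(mu*) - log pihat(mu*|x)
log_lik = np.log(0.5 * stats.norm.pdf(x[:, None], mu_star, 1.0).sum(axis=1)).sum()
log_prior = stats.norm.logpdf(mu_star, 0.0, np.sqrt(tau2)).sum()
plain = log_mean_exp(np.array([cond_logpdf(mu_star, z) for z in zs]))
symm = log_mean_exp(np.array([cond_logpdf(np.array(p), z)
                              for z in zs for p in permutations(mu_star)]))
print("Chib, plain      :", log_lik + log_prior - plain)
print("Chib, symmetrised:", log_lik + log_prior - symm)
```

When the Gibbs sampler never switches labels, the two estimates typically differ by roughly log k!, which is exactly the effect quantified on the Galaxy dataset below.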

  59. Galaxy dataset
    n = 82 galaxies as a mixture of k normal distributions with both
    mean and variance unknown.
    [Roeder, 1992]
    [Figure: relative-frequency histogram of the data with the average density estimate overlaid]

  60. Galaxy dataset (k)
    Using only the original estimate, with θk* as the MAP estimator,
    log(m̂k(x)) = −105.1396
    for k = 3 (based on 10³ simulations), while introducing the
    permutations leads to
    log(m̂k(x)) = −103.3479
    Note that
    −105.1396 + log(3!) = −103.3479
    k        2        3        4        5        6        7        8
    mk(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44
    Estimations of the marginal likelihoods by the symmetrised Chib’s
    approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100
    permutations selected at random in Sk).
    [Lee, Marin, Mengersen & Robert, 2008]

  63. Auxiliary variables
    The auxiliary variable method avoids computation of the intractable
    constant(s) in the likelihood function
    f(y|θ) = Zθ f̃(y|θ)
    Introduce pseudo-data y′ with artificial target g(y′|θ, y)
    Generate θ′ ∼ K(θ, θ′) and y′ ∼ f(y|θ′)
    Accept with probability
    [π(θ′)f(y|θ′)g(y′|θ′, y) / π(θ)f(y|θ)g(y|θ, y)] × [K(θ′, θ)f(y|θ) / K(θ, θ′)f(y′|θ′)] ∧ 1
    which, the normalising constants cancelling, only involves f̃:
    [π(θ′)f̃(y|θ′)g(y′|θ′, y) / π(θ)f̃(y|θ)g(y|θ, y)] × [K(θ′, θ)f̃(y|θ) / K(θ, θ′)f̃(y′|θ′)] ∧ 1
    For Gibbs random fields, existence of a genuine sufficient statistic η(y).
    [Møller, Pettitt, Berthelsen, & Reeves, 2006]

  67. Cross-model solutions
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    Reversible jump
    Saturation schemes
    Implementation error
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  68. Reversible jump
    irreversible jump
    Idea: Set up a proper measure–theoretic framework for designing
    moves between models Mk
    [Green, 1995]
    Create a reversible kernel K on H = ∪k {k} × Θk such that
    ∫_A ∫_B K(x, dy)π(x) dx = ∫_B ∫_A K(y, dx)π(y) dy
    for the invariant density π [x is of the form (k, θ(k))]

  70. Local moves
    For a move between two models, M1 and M2, the Markov chain
    being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ)
    the corresponding kernels, under the detailed balance condition
    π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) ,
    and take, wlog, dim(M2) > dim(M1).
    Proposal expressed as
    θ2 = Ψ1→2(θ1, v1→2)
    where v1→2 is a random variable of dimension
    dim(M2) − dim(M1), generated as
    v1→2 ∼ ϕ1→2(v1→2) .


  72. Local moves (2)
    In this case, q1→2(θ1, dθ2) has density
    ϕ1→2(v1→2) |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹ ,
    by the Jacobian rule.
    Reverse importance link
    If ϖ1→2 is the probability of choosing a move to M2 while in M1,
    the acceptance probability reduces to
    α(θ1, v1→2) = 1 ∧ [π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)| .
    If several models are considered simultaneously, with probability
    ϖ1→2 of choosing a move to M2 while in M1, as in
    K(x, B) = Σm ∫_B ρm(x, y)qm(x, dy) + ω(x)I_B(x)

  75. Generic reversible jump acceptance probability
    Acceptance probability of θ2 = Ψ1→2(θ1, v1→2) is
    α(θ1, v1→2) = 1 ∧ [π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|
    while the acceptance probability of θ1 with (θ1, v1→2) = Ψ1→2⁻¹(θ2) is
    α(θ1, v1→2) = 1 ∧ [π(M1, θ1) ϖ1→2 ϕ1→2(v1→2) / {π(M2, θ2) ϖ2→1}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹
    c Difficult calibration

  78. Green’s sampler
    Algorithm
    Iteration t (t ≥ 1): if x(t) = (m, θ(m)),
    1 Select model Mn with probability πmn
    2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn)
    3 Take x(t+1) = (n, θ(n)) with probability
    min{ [π(n, θ(n)) / π(m, θ(m))] [πnm ϕnm(vnm) / πmn ϕmn(umn)] |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 }
    and take x(t+1) = x(t) otherwise.
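
A toy reversible-jump sketch (not in the deck), with one move type and the dimension-matching variable drawn from the prior of the larger model, so the Jacobian is 1 and the acceptance ratio simplifies; the two models, the data and the chain length are arbitrary choices, and the exact model probability is available for checking.

```python
import numpy as np
from scipy import stats

# M1: x_i ~ N(0,1), no parameter, vs M2: x_i ~ N(θ,1), θ ~ N(0,1); equal weights.
rng = np.random.default_rng(6)
n, T = 10, 100_000
x = rng.normal(0.7, 1.0, size=n)

log_f1 = stats.norm.logpdf(x, 0.0, 1.0).sum()                     # f1(x)
log_f2 = lambda th: stats.norm.logpdf(x, th, 1.0).sum()           # f2(x|θ)
post_var = 1.0 / (n + 1.0); post_mean = post_var * x.sum()         # θ|x under M2

model, theta, in_m2 = 1, 0.0, 0
for t in range(T):
    if rng.random() < 0.5:                      # attempt a between-model jump
        if model == 1:
            u = rng.normal(0.0, 1.0)            # u ~ prior of M2, θ2 = u, |Jacobian| = 1
            if np.log(rng.random()) < log_f2(u) - log_f1:
                model, theta = 2, u
        else:
            if np.log(rng.random()) < log_f1 - log_f2(theta):
                model = 1
    elif model == 2:                            # within-model move (exact draw here)
        theta = rng.normal(post_mean, np.sqrt(post_var))
    in_m2 += (model == 2)

log_m1 = log_f1                                                    # evidence of M1
log_m2 = stats.multivariate_normal.logpdf(x, np.zeros(n),
                                          np.eye(n) + np.ones((n, n)))
print("RJ estimate of P(M2|x):", in_m2 / T)
print("exact       P(M2|x)   :", 1.0 / (1.0 + np.exp(log_m1 - log_m2)))
```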

  79. Interpretation
    The representation puts us back in a fixed dimension setting:
    M1 × V1→2 and M2 in one-to-one relation.
    reversibility imposes that θ1 is derived as
    (θ1, v1→2) = Ψ1→2⁻¹(θ2)
    appears like a regular Metropolis–Hastings move from the
    couple (θ1, v1→2) to θ2 when stationary distributions are
    π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal
    distribution is deterministic

  81. Alternative
    Saturation of the parameter space H = ∪k {k} × Θk by creating
    θ = (θ1, . . . , θD)
    a model index M
    pseudo-priors πj(θj|M = k) for j ≠ k
    [Carlin & Chib, 1995]
    Validation by
    P(M = k|x) = ∫ P(M = k|x, θ)π(θ|x) dθ = Zk
    where the (marginal) posterior is [not πk]
    π(θ|x) = Σ_{k=1}^D P(θ, M = k|x)
           = Σ_{k=1}^D pk Zk πk(θk|x) Π_{j≠k} πj(θj|M = k) .

  83. MCMC implementation
    Run a Markov chain (M(t), θ1^(t), . . . , θD^(t)) with stationary
    distribution π(θ, M|x) by
    1 Pick M(t) = k with probability π(θ(t−1), k|x)
    2 Generate θk^(t) from the posterior πk(θk|x) [or MCMC step]
    3 Generate θj^(t) (j ≠ k) from the pseudo-prior πj(θj|M = k)
    Approximate P(M = k|x) = Zk by
    p̌k(x) ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) πk(θk^(t)) Π_{j≠k} πj(θj^(t)|M = k)
                          / Σ_{ℓ=1}^D pℓ fℓ(x|θℓ^(t)) πℓ(θℓ^(t)) Π_{j≠ℓ} πj(θj^(t)|M = ℓ) ]

  86. Scott’s (2002) proposal
    Suggest estimating P(M = k|x) by
    Zk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t)) ] ,
    based on D simultaneous and independent MCMC chains
    (θk^(t))t , 1 ≤ k ≤ D ,
    with stationary distributions πk(θk|x) [instead of above joint]

  88. Congdon’s (2006) extension
    Selecting flat [prohibited] pseudo-priors, uses instead
    Zk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t))πk(θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t))πj(θj^(t)) ] ,
    where again the θk^(t)’s are MCMC chains with stationary
    distributions πk(θk|x)

  89. Examples
    Example (Model choice)
    Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model
    M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on
    both models: p1 = p2 = 0.5.
    Approximations of P(M = 1|x): Scott’s (2002) (blue), and
    Congdon’s (2006) (red) [N = 10⁶ simulations].
    [Figure: both approximations of P(M = 1|x) plotted against y]
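
A sketch (not in the deck) of what the two approximations compute on this very example, for a single observation; the exact P(M = 1|x) is available in closed form here, so the three numbers can be compared directly.

```python
import numpy as np
from scipy import stats, special

# M1: x|θ ~ U(0,θ), M2: x|θ ~ Exp(θ), both with θ ~ Exp(1) priors, p1 = p2 = 0.5.
rng = np.random.default_rng(7)
x, T = 1.0, 100_000                               # a single observation

# Exact: m1(x) = ∫_x^∞ θ^{-1} e^{-θ} dθ = E1(x),  m2(x) = 1/(1+x)^2
m1, m2 = special.exp1(x), 1.0 / (1.0 + x) ** 2
print("exact P(M=1|x) :", m1 / (m1 + m2))

# Posterior draws under M2: θ|x ~ Gamma(2, rate = 1 + x)
th2 = rng.gamma(2.0, 1.0 / (1.0 + x), size=T)
# Posterior draws under M1: π1(θ|x) ∝ e^{-θ}/θ on (x, ∞), by rejection from a
# shifted Exp(1) proposal (accept with probability x/θ)
th1 = np.empty(T); filled = 0
while filled < T:
    prop = x + rng.exponential(1.0, size=2 * (T - filled))
    acc = prop[rng.random(len(prop)) < x / prop][: T - filled]
    th1[filled:filled + len(acc)] = acc; filled += len(acc)

f1 = 1.0 / th1                                    # f1(x|θ1^(t)) = 1/θ on θ > x
f2 = th2 * np.exp(-th2 * x)                       # f2(x|θ2^(t)) = θ e^{-θx}
pi1, pi2 = np.exp(-th1), np.exp(-th2)             # Exp(1) prior densities

# Scott (2002): Zk ∝ pk Σ_t fk(x|θk^(t)) / Σ_j pj fj(x|θj^(t))
s1 = 0.5 * np.sum(f1 / (0.5 * f1 + 0.5 * f2))
s2 = 0.5 * np.sum(f2 / (0.5 * f1 + 0.5 * f2))
print("Scott approx.  :", s1 / (s1 + s2))

# Congdon (2006): Zk ∝ pk Σ_t fk πk / Σ_j pj fj πj
c1 = 0.5 * np.sum(f1 * pi1 / (0.5 * f1 * pi1 + 0.5 * f2 * pi2))
c2 = 0.5 * np.sum(f2 * pi2 / (0.5 * f1 * pi1 + 0.5 * f2 * pi2))
print("Congdon approx.:", c1 / (c1 + c2))
```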

  91. Examples (2)
    Example (Model choice (2))
    Normal model M1 : x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal
    model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1)
    Comparison of both approximations with P(M = 1|x): Scott’s (2002)
    (green and mixed dashes) and Congdon’s (2006) (brown and long
    dashes) [N = 10⁴ simulations].
    [Figure: approximations of P(M = 1|x) against y]

  92. Examples (3)
    Example (Model choice (3))
    Model M1 : x ∼ N(0, 1/ω) with ω ∼ Exp(a) vs.
    M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b).
    Comparison of Congdon’s (2006) (brown and dashed lines) with
    P(M = 1|x) when (a, b) is equal to (.24, 8.9), (.56, .7), (4.1, .46)
    and (.98, .081), resp. [N = 10⁴ simulations].
    [Figure: four panels of P(M = 1|x) against y, one per (a, b) setting]

  93. Nested sampling
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    Purpose
    Implementation
    Error rates
    Constraints
    Importance variant
    A mixture comparison
    5 ABC model choice


  94. Nested sampling: Goal
    Skilling’s (2007) technique using the one-dimensional
    representation:
    Z = Eπ[L(θ)] = ∫_0^1 ϕ(x) dx
    with
    ϕ⁻¹(l) = Pπ(L(θ) > l).
    Note: ϕ(·) is intractable in most cases.

  95. Nested sampling: First approximation
    Approximate Z by a Riemann sum:
    Ẑ = Σ_{i=1}^N (xi−1 − xi) ϕ(xi)
    where the xi’s are either
    deterministic: xi = e^{−i/N}
    or random:
    x0 = 1, xi+1 = ti xi, ti ∼ Be(N, 1)
    so that E[log xi] = −i/N.

  96. Extraneous white noise
    Take
    Z = ∫ e^{−θ} dθ = ∫ δ⁻¹ e^{−(1−δ)θ} · δe^{−δθ} dθ = Eδ[ δ⁻¹ e^{−(1−δ)θ} ]
    Ẑ = (1/N) Σ_{i=1}^N δ⁻¹ e^{−(1−δ)θi} (xi−1 − xi) , θi ∼ E(δ) I(θi ≤ θi−1)
    N     deterministic   random
    50    4.64            10.5
          4.65            10.5
    100   2.47            4.9
          2.48            5.02
    500   .549            1.01
          .550            1.14
    Comparison of variances and MSEs

  98. Nested sampling: Alternative representation
    Another representation is
    Z = Σ_{i=0}^{N−1} {ϕ(xi+1) − ϕ(xi)} xi
    which is a special case of
    Z = Σ_{i=0}^{N−1} {L(θ(i+1)) − L(θ(i))} π({θ; L(θ) > L(θ(i))})
    where · · · L(θ(i+1)) < L(θ(i)) · · ·
    [Lebesgue version of Riemann’s sum]

  100. Nested sampling: Second approximation
    Estimate (intractable) ϕ(xi) by ϕi:
    Nested sampling
    Start with N values θ1, . . . , θN sampled from π
    At iteration i,
    1 Take ϕi = L(θk), where θk is the point with smallest
    likelihood in the pool of θi’s
    2 Replace θk with a sample from the prior constrained to
    L(θ) > ϕi: the current N points are sampled from the prior
    constrained to L(θ) > ϕi.
    Note that
    π({θ; L(θ) > L(θ(i+1))}) / π({θ; L(θ) > L(θ(i))}) ≈ 1 − 1/N

  104. Nested sampling: Third approximation
    Iterate the above steps until a given stopping iteration j is
    reached: e.g.,
    observe very small changes in the approximation Ẑ;
    reach the maximal value of L(θ) when the likelihood is
    bounded and its maximum is known;
    truncate the integral Z at level ε, i.e. replace
    ∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx
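
Putting the three approximations together, a minimal nested-sampling sketch (not in the deck) on a one-dimensional toy problem where the constrained-prior simulation can be done exactly and the evidence is known; N, the fixed iteration budget and the model are arbitrary choices.

```python
import numpy as np
from scipy import stats

# Prior θ ~ N(0,1), likelihood L(θ) = N(y; θ, 1) for one observation y,
# so Z = N(y; 0, sqrt(2)) exactly.
rng = np.random.default_rng(8)
y, N = 1.0, 500
L = lambda th: stats.norm.pdf(y, loc=th, scale=1.0)

def sample_prior_constrained(l):
    # L(θ) > l  ⇔  |θ - y| < c(l); draw N(0,1) truncated to that interval
    c = np.sqrt(-2.0 * np.log(l * np.sqrt(2.0 * np.pi)))
    lo, hi = stats.norm.cdf(y - c), stats.norm.cdf(y + c)
    return stats.norm.ppf(lo + rng.random() * (hi - lo))

theta = rng.normal(0.0, 1.0, size=N)         # initial prior sample
logZ_terms, x_prev = [], 1.0
for i in range(1, 6 * N):                    # fixed budget, i.e. ε ≈ e^{-6}
    k = np.argmin(L(theta))
    phi_i = L(theta[k])                      # smallest likelihood in the pool
    x_i = np.exp(-i / N)                     # deterministic prior-volume grid
    logZ_terms.append(np.log(x_prev - x_i) + np.log(phi_i))
    x_prev = x_i
    theta[k] = sample_prior_constrained(phi_i)   # replace the worst point

logZ = np.logaddexp.reduce(logZ_terms)
print("nested sampling log Z:", logZ)
print("exact           log Z:", stats.norm.logpdf(y, 0.0, np.sqrt(2.0)))
```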

  105. Approximation error
    Error = Ẑ − Z
          = Σ_{i=1}^j (xi−1 − xi)ϕi − ∫_0^1 ϕ(x) dx
          = − ∫_0^ε ϕ(x) dx
          + [ Σ_{i=1}^j (xi−1 − xi)ϕ(xi) − ∫_ε^1 ϕ(x) dx ]   (Quadrature Error)
          + Σ_{i=1}^j (xi−1 − xi){ϕi − ϕ(xi)}   (Stochastic Error)
    [Dominated by Monte Carlo]

  106. A CLT for the Stochastic Error
    The (dominating) stochastic error is OP(N^{−1/2}):
    N^{1/2} (Ẑ − Z) → N(0, V) in distribution
    with
    V = − ∫∫_{s,t∈[ε,1]} sϕ′(s) tϕ′(t) log(s ∨ t) ds dt.
    [Proof based on Donsker’s theorem]

  107. What of log Z?
    If the interest lies within log Z, Slutsky’s transform of the CLT:
    N^{1/2} (log Ẑ − log Z) → N(0, V/Z²) in distribution
    Note: The number of simulated points equals the number of
    iterations j, and is a multiple of N: if one stops at the first iteration j
    such that e^{−j/N} < ε, then
    j = N ⌈− log ε⌉.

  110. Sampling from constr’d priors
    Exact simulation from the constrained prior is intractable in most
    cases
    Skilling (2007) proposes to use MCMC but this introduces a bias
    (stopping rule)
    If implementable, then slice sampler can be devised at the same
    cost [or less]


  113. Banana illustration
    Case of a banana target made of a twisted 2D normal:
    x̃2 = x2 + βx1² − 100β
    [Haario, Saksman, Tamminen, 1999]
    β = .03, σ = (100, 1)

  114. Banana illustration (2)
    Use of nested sampling with N = 1000, 50 MCMC steps with size
    0.1, compared with a population Monte Carlo (PMC) based on 10
    iterations with 5000 points per iteration and a final sample of 50000
    points, using nine Student’s t components with 9 df
    [Wraith, Kilbinger, Benabed et al., 2009, Physical Review D]
    [Figures: evidence estimation, E[X1] estimation, and E[X2] estimation]

  117. An IS variant of nested sampling
    Consider an instrumental prior π̃ and likelihood L̃, weight function
    w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ)
    and weighted NS estimator
    Ẑ = Σ_{i=1}^j (xi−1 − xi) ϕi w(θi).
    Then choose (π̃, L̃) so that sampling from π̃ constrained to
    L̃(θ) > l is easy; e.g. N(c, Id) constrained to ‖c − θ‖ < r.

  119. Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior
    Posterior on (µ, σ) for n observations with µ = 2
    and σ = 3/2, when p is known
    Use of a uniform prior both on (−2, 6) for µ
    and on (.001, 16) for log σ²
    occurrences of posterior bursts for µ = xi

  120. Experiment
    MCMC sample for n = 16
    observations from the mixture.
    Nested sampling sequence
    with M = 1000 starting points.


  121. Experiment
    MCMC sample for n = 50
    observations from the mixture.
    Nested sampling sequence
    with M = 1000 starting points.


  122. Comparison
    1 Nested sampling: M = 1000 points, with 10 random walk
    moves at each step, simulations from the constr’d prior and a
    stopping rule at 95% of the observed maximum likelihood
    2 T = 10⁴ MCMC (=Gibbs) simulations producing
    non-parametric estimates ϕ̂
    [Diebolt & Robert, 1990]
    3 Monte Carlo estimates Ẑ1, Ẑ2, Ẑ3 using products of two
    Gaussian kernels
    4 numerical integration based on an 850 × 950 grid [reference
    value, confirmed by Chib’s]

  123. Comparison (cont’d)
    Graph based on a sample of 10 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  124. Comparison (cont’d)
    Graph based on a sample of 50 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  125. Comparison (cont’d)
    Graph based on a sample of 100 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  126. Approximate Bayesian computation
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    ABC method
    Formalised framework
    Consistency results
    Summary statistics
    6 The Savage–Dickey ratio


  127. Approximate Bayesian Computation
    Bayesian setting: target is π(θ)f(x|θ)
    When likelihood f(x|θ) not in closed form, likelihood-free rejection
    technique:
    ABC algorithm
    For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
    simulating
    θ′ ∼ π(θ) , x ∼ f(x|θ′) ,
    until the auxiliary variable x is equal to the observed value, x = y.
    [Pritchard et al., 1999]

  130. A as approximative
    When y is a continuous random variable, equality x = y is replaced
    with a tolerance condition,
    ρ(x, y) ≤ ε
    where ρ is a distance between summary statistics
    Output distributed from
    π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ|ρ(x, y) < ε)
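
A minimal sketch (not in the deck) of this tolerance version of ABC on a toy normal model, with the sample mean as summary statistic and ε chosen as a small quantile of the simulated distances; the exact posterior is available here for comparison, and simulating the summary directly (rather than the full dataset) is a shortcut valid for this particular model.

```python
import numpy as np

# ABC rejection for θ in x_i ~ N(θ, 1) with prior θ ~ N(0, 10).
rng = np.random.default_rng(9)
n, n_sim, keep_frac = 50, 200_000, 0.001
y = rng.normal(1.0, 1.0, size=n)               # "observed" data
s_obs = y.mean()                               # summary statistic

theta = rng.normal(0.0, np.sqrt(10.0), size=n_sim)        # θ' ~ π(θ)
x_bar = rng.normal(theta, 1.0 / np.sqrt(n))               # summary of x ~ f(x|θ')
dist = np.abs(x_bar - s_obs)                               # ρ(x, y)
eps = np.quantile(dist, keep_frac)                         # tolerance ε
abc_sample = theta[dist <= eps]                            # ABC posterior draws

post_var = 1.0 / (n + 1.0 / 10.0)                          # exact posterior N(m, v)
post_mean = post_var * y.sum()
print("ABC   mean/sd:", abc_sample.mean(), abc_sample.std())
print("exact mean/sd:", post_mean, np.sqrt(post_var))
```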

  132. The starting point
    Central question to the validation of ABC for model choice:
    When is a Bayes factor based on an insufficient statistic
    T(y) consistent?
    Note/warning: c drawn on T(y) through B12^T(y) necessarily differs
    from c drawn on y through B12(y)
    [Marin, Pillai, X, & Rousseau, arXiv, 2012]

  134. A benchmark if toy example
    Comparison suggested by referee of PNAS paper [thanks!]:
    [X, Cornuet, Marin, & Pillai, Aug. 2011]
    Model M1: y ∼ N(θ1, 1) opposed
    to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2
    and scale parameter 1/√2 (variance one).
    Four possible statistics:
    1 sample mean y̅ (sufficient for M1 if not M2);
    2 sample median med(y) (insufficient);
    3 sample variance var(y) (ancillary);
    4 median absolute deviation mad(y) = med(|y − med(y)|);
    [Figure: boxplots of ABC posterior probabilities under the Gauss and
    Laplace models, n = 100]
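
A sketch (not in the deck) of ABC model choice on this benchmark, comparing two of the four statistics above; the prior on the location parameter is an assumption made here for the demo, since the slides do not specify it.

```python
import numpy as np

# ABC model choice: data simulated under M1, prior θ ~ N(0, 4) in both models,
# sample mean vs median absolute deviation as summary statistics.
rng = np.random.default_rng(10)
n, n_sim, keep = 100, 50_000, 250
y = rng.normal(0.0, 1.0, size=n)                          # observed data (from M1)

def summaries(data):
    med = np.median(data, axis=-1, keepdims=True)
    return data.mean(axis=-1), np.median(np.abs(data - med), axis=-1)

mean_obs, mad_obs = summaries(y)

m = rng.integers(1, 3, size=n_sim)                        # model index, 1 or 2
theta = rng.normal(0.0, 2.0, size=n_sim)                  # θ ~ N(0, 4)
x = np.empty((n_sim, n))
g = m == 1
x[g] = rng.normal(theta[g, None], 1.0, size=(g.sum(), n))                     # Gaussian
x[~g] = rng.laplace(theta[~g, None], 1 / np.sqrt(2), size=((~g).sum(), n))    # Laplace
mean_sim, mad_sim = summaries(x)

for name, dist in (("mean", np.abs(mean_sim - mean_obs)),
                   ("mad ", np.abs(mad_sim - mad_obs))):
    accepted = m[np.argsort(dist)[:keep]]                 # keep the closest simulations
    print(f"ABC estimate of P(M1|y) with summary {name}:", np.mean(accepted == 1))
```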

  138. Framework
    Starting from sample
    y = (y1, . . . , yn)
    the observed sample, not necessarily iid with true distribution
    y ∼ Pn
    Summary statistics
    T(y) = T^n = (T1(y), T2(y), · · · , Td(y)) ∈ R^d
    with true distribution T^n ∼ Gn.

  139. Framework
    c Comparison of
    – under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ R^p1
    – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ R^p2
    turned into
    – under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|T^n)
    – under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|T^n)

  140. Assumptions
    A collection of asymptotic “standard” assumptions:
    [A1] is a standard central limit theorem under the true model
    [A2] controls the large deviations of the estimator T n from the
    estimand µ(θ)
    [A3] is the standard prior mass condition found in Bayesian
    asymptotics (di effective dimension of the parameter)
    [A4] restricts the behaviour of the model density against the true
    density
    [Think CLT!]

  141. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A1] There exist
    a sequence {vn} converging to +∞,
    a distribution Q,
    a symmetric, d × d positive definite matrix V0 and
    a vector µ0 ∈ Rd
    such that
    vn V0^{−1/2} (T n − µ0) ⇝ Q as n → ∞, under Gn

  142. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and
    constants εi, τi, αi > 0 such that for all τ > 0,
    sup_{θi∈Fn,i} Gi,n( |T n − µi(θi)| > τ |µi(θi) − µ0| ∧ εi | θi )
    ≲ v_n^{−αi} ( τ |µi(θi) − µ0| ∧ εi )^{−αi}
    with
    πi(F^c_{n,i}) = o(v_n^{−τi}).

  143. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A3] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0)
    Sn,i(u) = { θi ∈ Fn,i ; |µi(θi) − µ0| ≤ u v_n^{−1} } ,
    there exist constants di < τi ∧ αi − 1 such that
    πi(Sn,i(u)) ∼ u^{di} v_n^{−di} ,   ∀ u ≲ vn

  144. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any ε > 0, there exist
    U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U),
    En ⊂ {t; gi(t|θi) < δ gn(t)} and Gn(E^c_n) < ε.

  145. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    Again (sum-up)
    [A1] is a standard central limit theorem under the true model
    [A2] controls the large deviations of the estimator T n from the
    estimand µ(θ)
    [A3] is the standard prior mass condition found in Bayesian
    asymptotics (di effective dimension of the parameter)
    [A4] restricts the behaviour of the model density against the true
    density

  146. Effective dimension
    Understanding di in [A3]:
    defined only when µ0 ∈ {µi(θi), θi ∈ Θi},
    πi(θi : |µi(θi) − µ0| < n^{−1/2}) = O(n^{−di/2})
    is the effective dimension of the model Θi around µ0

  147. Effective dimension
    Understanding di in [A3]:
    when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    mi(T n) / gn(T n) ∼ v_n^{−di} ,
    thus
    log( mi(T n) / gn(T n) ) ∼ −di log vn
    and v_n^{−di} penalization factor resulting from integrating θi out (see
    effective number of parameters in DIC)

  148. Effective dimension
    Understanding di in [A3]:
    In regular models, di dimension of T (Θi), leading to BIC
    In non-regular models, di can be smaller
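    Side note, assuming the regular √n case, i.e. vn = √n in [A1]: the bound of the previous slide gives
    log mi(T n) ≈ log gn(T n) − di log vn = log gn(T n) − (di/2) log n ,
    i.e. a BIC-type (di/2) log n penalty, with di the effective dimension rather than the full parameter dimension.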

  149. Asymptotic marginals
    Asymptotically, under [A1]–[A4]
    mi(t) = ∫_{Θi} gi(t|θi) πi(θi) dθi
    is such that
    (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    Cl v_n^{d−di} ≤ mi(T n) ≤ Cu v_n^{d−di}
    and
    (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0,
    mi(T n) = o_{Pn}[ v_n^{d−τi} + v_n^{d−αi} ].

  150. Within-model consistency
    Under same assumptions,
    if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    the posterior distribution of µi(θi) given T n is consistent at rate
    1/vn provided αi ∧ τi > di.

  151. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!

  152. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!
    Indeed, if
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
    then
    Cl v_n^{−(d1−d2)} ≤ m1(T n) / m2(T n) ≤ Cu v_n^{−(d1−d2)} ,
    where Cl, Cu = O_{Pn}(1), irrespective of the true model.
    c Only depends on the difference d1 − d2: no consistency

  153. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!
    Else, if
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
    then
    m1(T n) / m2(T n) ≥ Cu min( v_n^{−(d1−α2)} , v_n^{−(d1−τ2)} )

  154. Consistency theorem
    If
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,
    Bayes factor
    B^T_12 = O( v_n^{−(d1−d2)} )
    irrespective of the true model. It is inconsistent since it always
    picks the model with the smallest dimension

  155. Consistency theorem
    If Pn belongs to one of the two models and if µ0 cannot be
    attained by the other one :
    0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2)
    < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) ,
    then the Bayes factor B^T_12 is consistent

  156. Consequences on summary statistics
    Bayes factor driven by the means µi(θi) and the relative position of
    µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2.
    For ABC, this implies the most likely statistics T n are ancillary
    statistics with different mean values under both models
    Else, if T n asymptotically depends on some of the parameters of
    the models, it is possible that there exists θi ∈ Θi such that
    µi(θi) = µ0 even though model M1 is misspecified

  158. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    and the true distribution is Laplace with mean θ0 = 1, under the
    Gaussian model the value θ∗ = √(2√3 − 3) also leads to µ0 = µ(θ∗)
    [here d1 = d2 = d = 1]

  159. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    and the true distribution is Laplace with mean θ0 = 1, under the
    Gaussian model the value θ∗ = √(2√3 − 3) also leads to µ0 = µ(θ∗)
    [here d1 = d2 = d = 1]
    c A Bayes factor associated with such a statistic is inconsistent
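    A quick numerical sanity check on the moment maps (a sketch assuming µ1(θ) = 3 + θ⁴ + 6θ² for the
    Gaussian model and µ2(θ) = 6 + θ⁴ + 6θ² for the unit-variance Laplace model): whatever the Laplace
    location θ0, the equation µ1(θ∗) = µ2(θ0) has a root, so the fourth moment cannot separate the models.
    import numpy as np

    def mu1(t):   # fourth moment of N(t, 1)
        return 3 + t**4 + 6 * t**2

    def mu2(t):   # fourth moment of Laplace(t, 1/sqrt(2)), i.e. unit variance (assumed map)
        return 6 + t**4 + 6 * t**2

    for theta0 in (0.0, 1.0):
        target = mu2(theta0)
        t2 = (-6 + np.sqrt(36 + 4 * (target - 3))) / 2   # solves t^4 + 6 t^2 = target - 3
        t_star = np.sqrt(t2)
        print(theta0, t_star, mu1(t_star), target)       # mu1(t_star) equals mu2(theta0)

    # Monte Carlo confirmation that the Laplace(0, 1/sqrt(2)) fourth moment is close to 6
    x = np.random.default_rng(1).laplace(0.0, 1 / np.sqrt(2), 10**6)
    print(np.mean(x**4))
    For θ0 = 0 this returns θ∗ = √(2√3 − 3) ≈ 0.68, matching the value quoted for the [0] case below.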

  160. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·

  161. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    [Boxplot for M1 and M2, fourth moment, n = 1000]

  162. Toy example: Laplace versus Gauss [0]
    When
    T (y) = ( ȳ_n^{(4)} , ȳ_n^{(6)} )
    and the true distribution is Laplace with mean θ0 = 0, then
    µ0 = 6, µ1(θ∗_1) = 6 with θ∗_1 = √(2√3 − 3)
    [d1 = 1 and d2 = 1/2]
    thus
    B12 ∼ n^{−1/4} → 0 : consistent
    Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2
    B12 → +∞ : consistent

  164. Toy example: Laplace versus Gauss [0]
    When
    T (y) = ( ȳ_n^{(4)} , ȳ_n^{(6)} )
    [Density of the posterior probabilities, based on the fourth and sixth moments]

  165. Checking for adequate statistics
    After running ABC, i.e. creating reference tables of (θi, xi) from
    both joints, straightforward derivation of ABC estimates θ̂1 and θ̂2.
    Evaluation of E^1_{θ̂1}[T(X)] and E^2_{θ̂2}[T(X)] allows for detection of
    different means under both models via Monte Carlo simulations
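    A minimal Monte Carlo version of this check on the toy example (locations fixed at 0 for both models,
    purely for illustration):
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 100, 5000

    def summaries(y):
        return np.array([y.mean(),
                         np.median(y),
                         y.var(ddof=1),
                         np.median(np.abs(y - np.median(y)))])   # mad

    gauss = np.array([summaries(rng.normal(0, 1, n)) for _ in range(reps)])
    laplace = np.array([summaries(rng.laplace(0, 1 / np.sqrt(2), n)) for _ in range(reps)])

    for name, g, l in zip(["mean", "median", "var", "mad"],
                          gauss.mean(axis=0), laplace.mean(axis=0)):
        print(f"{name:6s}  Gauss {g:6.3f}   Laplace {l:6.3f}")
    The first three summaries share (essentially) the same expectation under both models, while mad(y)
    concentrates near 0.67 under the Gaussian and near 0.49 under the Laplace model, which is what makes
    it an adequate statistic here.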

  166. Toy example: Laplace versus Gauss
    [Boxplots of the normalised χ2 without and with mad, for the Gauss and Laplace models]

  167. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}

  168. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}
    If summary statistic only informative on a parameter that is the
    same under both models, i.e. if d1 = d2, then
    c the Bayes factor is not consistent

  169. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}
    Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true
    distribution not in M1, then
    c Bayes factor is consistent only if µ1 = µ2 = µ0

  170. Summary
    Model selection feasible with ABC
    Choice of summary statistics is paramount
    At best, ABC → π(. | T (y)) which concentrates around µ0
    For estimation : {θ; µ(θ) = µ0} = θ0
    For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅

  172. 1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio
    Measure-theoretic aspects
    Computational implications

  173. The Savage–Dickey ratio representation
    Special representation of the Bayes factor used for simulation
    Original version (Dickey, AoMS, 1971)

  175. Savage’s density ratio theorem
    Given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance
    parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that
    π1(ψ|θ0) = π0(ψ)
    then
    B01 = π1(θ0|x) / π1(θ0) ,
    with the obvious notations
    π1(θ) = ∫ π1(θ, ψ) dψ ,   π1(θ|x) = ∫ π1(θ, ψ|x) dψ ,
    [Dickey, 1971; Verdinelli & Wasserman, 1995]
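    A toy illustration of the ratio (a sketch, not the deck’s example): for x1, . . . , xn ∼ N(θ, 1), prior
    θ ∼ N(0, τ²) and no nuisance parameter, the posterior is available in closed form and a Monte Carlo
    version of π1(θ0|x)/π1(θ0) can be checked against it:
    import numpy as np
    from scipy.stats import norm, gaussian_kde

    rng = np.random.default_rng(3)
    n, tau, theta0 = 50, 2.0, 0.0
    x = rng.normal(0.3, 1.0, n)                     # data, true mean 0.3

    # closed-form posterior N(m, v) for theta
    v = 1.0 / (n + 1.0 / tau**2)
    m = v * n * x.mean()

    B01_exact = norm.pdf(theta0, m, np.sqrt(v)) / norm.pdf(theta0, 0.0, tau)

    draws = rng.normal(m, np.sqrt(v), 10**5)        # posterior sample
    B01_mc = gaussian_kde(draws)(theta0)[0] / norm.pdf(theta0, 0.0, tau)

    print(B01_exact, B01_mc)
    The kernel estimate is only there to mimic the situation where the posterior density at θ0 has to be
    recovered from simulation output.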

  176. Rephrased
    “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0),
    f0(x) = ∫ f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) ,
    i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average
    of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model.
    Applying Bayes’ theorem to the right-hand side of [the above] we get
    f0(x) = f1(φ0|x)f1(x) / f1(φ0)
    and hence the Bayes factor is given by
    B = f0(x) / f1(x) = f1(φ0|x) / f1(φ0) ,
    the ratio of the posterior to prior densities at φ = φ0 under the augmented model.”
    [O’Hagan & Forster, 1996]

  178. Measure-theoretic difficulty
    Representation depends on the choice of versions of conditional
    densities:
    B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ  /  ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ
    [by definition]
    = { ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0) }  /  { ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0) }
    [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)]
    = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ  /  m1(x) π1(θ0)
    [specific version of π1(θ0, ψ)]
    = π1(θ0|x) / π1(θ0)
    [version dependent]

  180. Choice of density version
    c Dickey’s (1971) condition is not a condition:
    If
    π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x)
    is chosen as a version, then Savage–Dickey’s representation holds

  182. Savage–Dickey paradox
    Verdinelli-Wasserman extension:
    B01 = π1(θ0|x) / π1(θ0) × E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]
    similarly depends on choices of versions...
    ...but Monte Carlo implementation relies on specific versions of all
    densities without making mention of it
    [Chen, Shao & Ibrahim, 2000]

  184. A computational exploitation
    Starting from the (instrumental) prior
    π̃1(θ, ψ) = π1(θ) π0(ψ)
    define the associated posterior
    π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)
    and impose the choice of version
    π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)
    Then
    B01 = [ π̃1(θ0|x) / π1(θ0) ] × [ m̃1(x) / m1(x) ]
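    Indeed, writing m0(x) = ∫ π0(ψ) f(x|θ0, ψ) dψ for the evidence under H0, the imposed version gives
    π̃1(θ0|x) / π1(θ0) = m0(x) / m̃1(x) ,
    hence
    [ π̃1(θ0|x) / π1(θ0) ] × [ m̃1(x) / m1(x) ] = m0(x) / m1(x) = B01 .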

  186. First ratio
    If (θ(1), ψ(1)), . . . , (θ(T), ψ(T)) ∼ π̃(θ, ψ|x), then
    (1/T) Σ_t π̃1(θ0|x, ψ(t))
    converges to π̃1(θ0|x), provided the right version is used in θ0:
    π̃1(θ0|x, ψ) = π1(θ0) f(x|θ0, ψ) / ∫ π1(θ) f(x|θ, ψ) dθ
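    A sketch of this first ratio in the simplest conjugate case (an illustration with arbitrary
    hyperparameters, not the Pima Indian benchmark used later): xi ∼ N(θ, ψ), instrumental prior
    π1(θ) π0(ψ) = N(0, τ²) × IG(a, b), so that π̃1(θ0|x, ψ) is the conditional normal posterior density
    at θ0 and can be averaged over Gibbs draws of ψ; since the prior factorises here, m̃1 = m1 and the
    averaged ratio estimates B01 directly, which is checked by one-dimensional quadrature over ψ:
    import numpy as np
    from scipy.stats import norm, invgamma
    from scipy.integrate import quad

    rng = np.random.default_rng(4)
    n, tau, a, b, theta0 = 20, 1.0, 3.0, 3.0, 0.0
    x = rng.normal(0.2, 1.0, n)
    xbar, S = x.mean(), ((x - x.mean()) ** 2).sum()

    # Gibbs sampling from the posterior under the factorised prior N(0, tau^2) x IG(a, b)
    T, psi, rb = 10000, 1.0, []
    for t in range(T):
        v = 1.0 / (n / psi + 1.0 / tau ** 2)       # theta | psi, x ~ N(m, v)
        m = v * n * xbar / psi
        rb.append(norm.pdf(theta0, m, np.sqrt(v))) # = pi~_1(theta0 | x, psi)
        theta = rng.normal(m, np.sqrt(v))
        psi = invgamma.rvs(a + n / 2, scale=b + ((x - theta) ** 2).sum() / 2,
                           random_state=rng)
    B01_sd = np.mean(rb[1000:]) / norm.pdf(theta0, 0.0, tau)   # first-ratio estimate

    # reference value m0(x)/m1(x) by quadrature over psi (a common (2*pi)^(-n/2) factor is dropped)
    def f0(psi):   # proportional to f(x | theta0, psi) pi_0(psi)
        return psi ** (-n / 2) * np.exp(-(S + n * (xbar - theta0) ** 2) / (2 * psi)) \
               * invgamma.pdf(psi, a, scale=b)

    def f1(psi):   # proportional to [ int f(x | theta, psi) pi_1(theta) dtheta ] pi_0(psi)
        return psi ** (-n / 2) * np.exp(-S / (2 * psi)) * np.sqrt(2 * np.pi * psi / n) \
               * norm.pdf(xbar, 0.0, np.sqrt(psi / n + tau ** 2)) * invgamma.pdf(psi, a, scale=b)

    B01_quad = quad(f0, 0, np.inf)[0] / quad(f1, 0, np.inf)[0]
    print(B01_sd, B01_quad)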

  187. Rao–Blackwellisation with latent variables
    When π̃1(θ0|x, ψ) unavailable, replace with
    (1/T) Σ_{t=1}^{T} π̃1(θ0|x, z(t), ψ(t))
    via data completion by latent variable z such that
    f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz
    and that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) available in closed
    form, including the normalising constant, based on version
    π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / [ ∫ f̃(x, z|θ, ψ) π1(θ) dθ ] .

  189. Bridge revival (1)
    Since ˜
    m1(x)/m1(x) is unknown, apparent failure!
    Use of the bridge identity
    E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / { π0(ψ) π1(θ) f(x|θ, ψ) } ]
    = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x) / m̃1(x)
    to (biasedly) estimate m̃1(x)/m1(x) by
    T / Σ_{t=1}^{T} [ π1(ψ(t)|θ(t)) / π0(ψ(t)) ]
    based on the same sample from π̃1.

  191. Bridge revival (2)
    Alternative identity
    E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / { π1(θ, ψ) f(x|θ, ψ) } ]
    = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x) / m1(x)
    suggests using a second sample (θ̄(t), ψ̄(t), z(t)) ∼ π1(θ, ψ|x) and
    the ratio estimate
    (1/T) Σ_{t=1}^{T} π0(ψ̄(t)) / π1(ψ̄(t)|θ̄(t))
    Resulting unbiased estimate:
    B01 = [ (1/T) Σ_t π̃1(θ0|x, z(t), ψ(t)) / π1(θ0) ] × [ (1/T) Σ_{t=1}^{T} π0(ψ̄(t)) / π1(ψ̄(t)|θ̄(t)) ]

  193. Difference with Verdinelli–Wasserman representation
    The above leads to the representation
    B01 = π̃1(θ0|x) / π1(θ0) × E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ]
    which shows how our approach differs from Verdinelli and Wasserman’s
    B01 = π1(θ0|x) / π1(θ0) × E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]
    [for referees only!!]

  194. Diabetes in Pima Indian women (cont’d)
    Comparison of the variation of the Bayes factor approximations
    based on 100 replicas of 20,000 simulations each, from the above importance,
    Chib’s, Savage–Dickey’s and bridge samplers
    [Boxplots of the IS, Chib, Savage−Dickey and Bridge approximations]

  195. Other things I could have mentioned
    use of particle filters
    [Del Moral et al., 2006]
    path sampling
    d/dθ log Z(θ) = Eθ[ d/dθ log q(ω|θ) ]
    [Ogata, 1989; Gelman & Meng, 1998]
    multicanonical algorithms
    [Marinari & Parisi, 1992; Geyer & Thompson, 1995]
