Slide 1

Slide 1 text

On some computational methods for Bayesian model choice Christian P. Robert University of Warwick and Université Paris Dauphine http://www.ceremade.dauphine.fr/~xian Statistical Frontiers, January 30, 2014

Slide 2

Slide 2 text

Outline 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio

Slide 3

Slide 3 text

Formal construction of Bayes tests Definition (Test) Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0−1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∉ Θ0|x), and 0 otherwise.

Slide 5

Slide 5 text

Bayes factor Definition (Bayes factors) For testing the hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior π(Θ0)π0(θ) + π(Θ0^c)π1(θ), the central quantity is B01(x) = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ) dθ / ∫_{Θ0^c} f(x|θ)π1(θ) dθ = m0(x)/m1(x) [Jeffreys, 1939]

Slide 6

Slide 6 text

Model choice and model comparison Choice between models Several models available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be finite or infinite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors

Slide 8

Slide 8 text

Bayesian model choice Probabilise the entire model/parameter space: allocate probabilities pi to all models Mi, define priors πi(θi) for each parameter space Θi, compute π(Mi|x) = pi ∫_{Θi} fi(x|θi)πi(θi) dθi / Σ_j pj ∫_{Θj} fj(x|θj)πj(θj) dθj, take the largest π(Mi|x) to determine the “best” model, or use the averaged predictive Σ_j π(Mj|x) ∫_{Θj} fj(x′|θj)πj(θj|x) dθj

Slide 12

Slide 12 text

Evidence All these problems end up with a similar quantity, the evidence Zk = ∫_{Θk} πk(θk) Lk(θk) dθk, aka the marginal likelihood.

Slide 13

Slide 13 text

Importance sampling 1 Evidence 2 Importance sampling solutions Regular importance Harmonic means Chib’s solution Auxiliary variables 3 Cross-model solutions 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio

Slide 14

Slide 14 text

Bayes factor approximation When approximating the Bayes factor B01 = ∫_{Θ0} f0(x|θ0)π0(θ0) dθ0 / ∫_{Θ1} f1(x|θ1)π1(θ1) dθ1, use of importance functions ϖ0 and ϖ1 and of B̂01 = [n0⁻¹ Σ_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i)] / [n1⁻¹ Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i)] on the corresponding parameter spaces (which may be of different dimensions!)
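To make this concrete, here is a minimal Python sketch (not from the slides; the two models, priors and observation are illustrative assumptions) where both evidences are available in closed form, so the importance sampling ratio can be checked against the exact Bayes factor.

```python
# Importance sampling estimate of a Bayes factor on a toy pair of models
# with known marginal likelihoods (all numerical choices are assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.5                                   # single observation (illustrative)

# M0: x|th ~ N(th,1), th ~ N(0,1)  ->  m0(x) = N(x; 0, var=2)
# M1: x|th ~ N(th,1), th ~ N(0,9)  ->  m1(x) = N(x; 0, var=10)
def evidence_is(prior_sd, n=100_000):
    # importance function: a normal centred at the MLE (here theta_hat = x)
    theta = rng.normal(x, 1.0, n)
    w = norm.pdf(x, theta, 1.0) * norm.pdf(theta, 0.0, prior_sd) / norm.pdf(theta, x, 1.0)
    return w.mean()

B01_hat = evidence_is(1.0) / evidence_is(3.0)
B01_exact = norm.pdf(x, 0, np.sqrt(2)) / norm.pdf(x, 0, np.sqrt(10))
print(B01_hat, B01_exact)
```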

Slide 15

Slide 15 text

Default importance A typical choice for ϖ in Bayesian settings is to take θ ∼ N(θ̂_MLE, σ̂²_MLE) [Marin & Robert, 2009; Wood, 2010]

Slide 16

Slide 16 text

Bridge sampling Special case: If π1(θ1|x) ∝ π̃1(θ1|x) and π2(θ2|x) ∝ π̃2(θ2|x) live on the same space (Θ1 = Θ2), then B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x), θi ∼ π2(θ|x) [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

Slide 17

Slide 17 text

Bridge sampling variance The bridge sampling estimator does poorly if var(B̂12)/B12² ≈ (1/n) E[ {(π1(θ) − π2(θ))/π2(θ)}² ] is large, i.e. if π1 and π2 have little overlap...

Slide 19

Slide 19 text

(Further) bridge sampling A remarkable property: B12 = ∫ π̃2(θ|x)α(θ)π1(θ|x) dθ / ∫ π̃1(θ|x)α(θ)π2(θ|x) dθ for all α(·), approximated by [(1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x)α(θ1i)] / [(1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x)α(θ2i)], θji ∼ πj(θ|x)

Slide 20

Slide 20 text

An infamous example When α(θ) = 1/{π̃1(θ)π̃2(θ)}, harmonic mean approximation to B12: B̂21 = [(1/n1) Σ_{i=1}^{n1} 1/π̃1(θ1i|x)] / [(1/n2) Σ_{i=1}^{n2} 1/π̃2(θ2i|x)], θji ∼ πj(θ|x) [Newton & Raftery, 1994] Infamous: Most often leads to an infinite variance!!! [Radford Neal’s blog, 2008]

Slide 22

Slide 22 text

“The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

Slide 24

Slide 24 text

Optimal bridge sampling The optimal choice of auxiliary function is α(θ) = (n1 + n2) / {n1π1(θ|x) + n2π2(θ|x)}, leading to B12 ≈ [(1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x)/{n1π1(θ1i|x) + n2π2(θ1i|x)}] / [(1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x)/{n1π1(θ2i|x) + n2π2(θ2i|x)}] Back later! [Gelfand & Dey, 1994]
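Since the optimal α involves the normalised posteriors, and hence the unknown constants, it is usually computed iteratively. Below is a minimal sketch of this fixed-point (Meng–Wong type) recursion on a toy pair of unnormalised densities with known constants; all numerical choices are illustrative assumptions.

```python
# Iterative bridge sampling on two toy unnormalised densities with
# Z1 = sqrt(2*pi) and Z2 = 2*sqrt(2*pi), so the exact ratio Z1/Z2 = 0.5.
import numpy as np

rng = np.random.default_rng(1)
q1 = lambda t: np.exp(-0.5 * t**2)                 # unnormalised pi_1
q2 = lambda t: np.exp(-0.5 * (t - 1.0)**2 / 4.0)   # unnormalised pi_2
n1 = n2 = 20_000
th1 = rng.normal(0.0, 1.0, n1)                     # draws from pi_1
th2 = rng.normal(1.0, 2.0, n2)                     # draws from pi_2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

l1 = q1(th1) / q2(th1)                             # density ratios at both samples
l2 = q1(th2) / q2(th2)
r = 1.0                                            # initial guess for Z1/Z2
for _ in range(50):                                # fixed-point iteration
    num = np.mean(l2 / (s1 * l2 + s2 * r))
    den = np.mean(1.0 / (s1 * l1 + s2 * r))
    r = num / den
print("bridge estimate of Z1/Z2:", r, "exact:", 0.5)
```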

Slide 25

Slide 25 text

Optimal bridge sampling (2) Reason: Var(B̂12)/B12² ≈ (1/n1n2) [ ∫ π1(θ)π2(θ){n1π1(θ) + n2π2(θ)}α(θ)² dθ / { ∫ π1(θ)π2(θ)α(θ) dθ }² − 1 ] [δ method] Drag: Dependence on the unknown normalising constants, solved iteratively

Slide 27

Slide 27 text

Ratio importance sampling Another fundamental identity: B12 = Eϕ[π̃1(θ1)/ϕ(θ1)] / Eϕ[π̃2(θ2)/ϕ(θ2)] for any density ϕ with sufficiently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ: B̂12 = Σ_{i=1}^n π̃1(θi)/ϕ(θi) / Σ_{i=1}^n π̃2(θi)/ϕ(θi)

Slide 29

Slide 29 text

Ratio importance sampling (2) Approximate variance: var(B̂12)/B12² ≈ (1/n) Eϕ[ (π1(θ) − π2(θ))² / ϕ(θ)² ] Optimal choice: ϕ∗(θ) = |π1(θ) − π2(θ)| / ∫ |π1(η) − π2(η)| dη [Chen, Shao & Ibrahim, 2000] Formally better than bridge sampling

Slide 32

Slide 32 text

Approximating Zk from a posterior sample Use of the [harmonic mean] identity E^{πk}[ ϕ(θk) / {πk(θk)Lk(θk)} | x ] = ∫ [ϕ(θk)/{πk(θk)Lk(θk)}] [πk(θk)Lk(θk)/Zk] dθk = 1/Zk, no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output

Slide 34

Slide 34 text

Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^(t)) / {πk(θk^(t))Lk(θk^(t))} ] to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ

Slide 36

Slide 36 text

Comparison with regular importance sampling (cont’d) Compare Ẑ1k with a standard importance sampling approximation Ẑ2k = (1/T) Σ_{t=1}^T πk(θk^(t))Lk(θk^(t)) / ϕ(θk^(t)), where the θk^(t)’s are generated from the density ϕ(·) (with fatter tails, like Student’s t)
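A hedged side-by-side sketch of the two estimators on a toy conjugate model (all tuning choices are assumptions): Ẑ1k reweights posterior draws by a finite-support ϕ, Ẑ2k samples from a fat-tailed proposal, and both can be compared with the exact evidence.

```python
# Gelfand-Dey type estimator Z1 (posterior draws, light-tailed phi) versus
# standard importance sampling Z2 (fat-tailed proposal) on a conjugate toy.
import numpy as np
from scipy.stats import norm, t as student, uniform

rng = np.random.default_rng(2)
x = 1.0
target = lambda th: norm.pdf(th, 0, 1) * norm.pdf(x, th, 1)   # prior x likelihood
Z_exact = norm.pdf(x, 0, np.sqrt(2))

# Z1: posterior draws (exact here: N(x/2, 1/2)) reweighted by a finite-support phi
post = rng.normal(x / 2, np.sqrt(0.5), 50_000)
a, b = np.quantile(post, [0.025, 0.975])
phi = uniform(loc=a, scale=b - a)
Z1 = 1.0 / np.mean(phi.pdf(post) / target(post))

# Z2: fat-tailed proposal, a Student t with 3 df centred at the posterior mean
th = x / 2 + rng.standard_t(3, 50_000)
Z2 = np.mean(target(th) / student.pdf(th - x / 2, df=3))

print(Z1, Z2, Z_exact)
```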

Slide 37

Slide 37 text

Approximating Zk using a mixture representation Bridge sampling recall Design a specific mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

Slide 39

Slide 39 text

Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently

Slide 42

Slide 42 text

Evidence approximation by mixtures Rao-Blackwellised estimate ξ̂ = (1/T) Σ_{t=1}^T ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))}, converges to ω1Zk/{ω1Zk + 1} Deduce Ẑ3k from ω1Ẑ3k/{ω1Ẑ3k + 1} = ξ̂, i.e. Ẑ3k = [Σ_{t=1}^T ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))}] / [ω1 Σ_{t=1}^T ϕ(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))}] [Bridge sampler redux]
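A minimal sketch of this mixture scheme on a one-dimensional toy posterior with known evidence; the weight ω1, the instrumental density ϕ and the random-walk scale are illustrative tuning choices, and Ẑ3k is recovered by solving ω1Ẑ3k/(ω1Ẑ3k + 1) = ξ̂.

```python
# Gibbs sampling of the mixture omega1*pi(theta)L(theta) + phi(theta), then the
# Rao-Blackwellised ratio estimate of the evidence Z (toy conjugate model).
import numpy as np

rng = np.random.default_rng(3)
npdf = lambda z, m, s: np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
x = 1.0
tilde = lambda th: npdf(th, 0, 1) * npdf(x, th, 1)   # prior x likelihood
phi = lambda th: npdf(th, 0, 3)                      # normalised instrumental density
omega1, T = 5.0, 50_000

theta, num, den = 0.0, 0.0, 0.0
for _ in range(T):
    w = omega1 * tilde(theta)
    if rng.random() < w / (w + phi(theta)):
        prop = theta + rng.normal(0, 1.0)            # MH step targeting pi(theta|x)
        if rng.random() < tilde(prop) / tilde(theta):
            theta = prop
    else:
        theta = rng.normal(0, 3.0)                   # independent draw from phi
    w = omega1 * tilde(theta)
    num += w / (w + phi(theta))                      # terms of xi-hat
    den += phi(theta) / (w + phi(theta))             # terms of 1 - xi-hat

Z3 = num / (omega1 * den)                            # solves omega1*Z/(omega1*Z+1) = xi-hat
print("mixture estimate:", Z3, "exact:", npdf(x, 0, np.sqrt(2)))
```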

Slide 44

Slide 44 text

Integration by conditioning Use Rao-Blackwell Theorem var(E[δ(X)|Y]) ≤ var(δ(X))

Slide 45

Slide 45 text

Consequence If Î is an unbiased estimator of I = Ef[h(X)], with X simulated from a joint density f̃(x, y), where ∫ f̃(x, y) dy = f(x), the estimator Î∗ = E_f̃[Î | Y1, . . . , Yn] dominates Î(X1, . . . , Xn) variance-wise (and is unbiased)

Slide 46

Slide 46 text

Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x) Use of an approximation to the posterior: Zk = mk(x) = fk(x|θk∗) πk(θk∗) / π̂k(θk∗|x) .
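A toy illustration (conjugate normal model, so the answer is known; in a real problem π̂k(θ∗|x) would come from MCMC output, mimicked here by a kernel density estimate on exact posterior draws):

```python
# Chib's identity Z = f(x|theta*) pi(theta*) / pi_hat(theta*|x) on a toy model.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(4)
x = 1.0
# prior theta ~ N(0,1), likelihood x|theta ~ N(theta,1); posterior N(x/2, 1/2)
draws = rng.normal(x / 2, np.sqrt(0.5), 20_000)     # stand-in for MCMC output
theta_star = np.median(draws)                       # a high posterior density point
post_hat = gaussian_kde(draws)(theta_star)[0]       # approximate pi(theta*|x)
Z_chib = norm.pdf(x, theta_star, 1) * norm.pdf(theta_star, 0, 1) / post_hat
print(Z_chib, norm.pdf(x, 0, np.sqrt(2)))
```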

Slide 48

Slide 48 text

Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate π̂k(θk∗|x) = (1/T) Σ_{t=1}^T πk(θk∗|x, zk^(t)), where the zk^(t)’s are Gibbs sampled latent variables

Slide 49

Slide 49 text

Label switching impact A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identifiable marginally since they are exchangeable

Slide 51

Slide 51 text

Connected difficulties 1 Number of modes of the likelihood of order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

Slide 53

Slide 53 text

License Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once [Figure: Gibbs traces and pairwise scatter plots of the mixture parameters (µi, pi, σi)]

Slide 54

Slide 54 text

Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

Slide 57

Slide 57 text

Compensation for label switching For mixture models, zk^(t) usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory: πk(θk|x) = πk(σ(θk)|x) = (1/k!) Σ_{σ∈Sk} πk(σ(θk)|x) for all σ’s in Sk, the set of all permutations of {1, . . . , k}. Consequences on the numerical approximation, biased by an order k! Recover the theoretical symmetry by using π̂k(θk∗|x) = 1/(T k!) Σ_{σ∈Sk} Σ_{t=1}^T πk(σ(θk∗)|x, zk^(t)) . [Berkhof, Mechelen, & Gelman, 2003]
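A schematic sketch of this symmetrisation; `cond_density` is a hypothetical user-supplied function returning πk(θ∗|x, z) for a k-component mixture, and `allocations` stands for the Gibbs-sampled z^(t).

```python
# Berkhof-Mechelen-Gelman style correction: average the Rao-Blackwell
# conditional density over all k! relabellings of theta*.
from itertools import permutations
import numpy as np

def symmetrised_density(theta_star, allocations, cond_density, k):
    # theta_star: array of shape (k, p) of component-wise parameters
    total, count = 0.0, 0
    for sigma in permutations(range(k)):        # all k! label permutations
        permuted = theta_star[list(sigma)]
        for z in allocations:                   # Gibbs-sampled latent allocations
            total += cond_density(permuted, z)
            count += 1
    return total / count                        # (1/(T k!)) double sum
```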

Slide 59

Slide 59 text

Galaxy dataset n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown. [Roeder, 1992] [Figure: histogram (relative frequency) of the galaxy data with the average density estimate]

Slide 60

Slide 60 text

Galaxy dataset (k) Using only the original estimate, with θk∗ as the MAP estimator, log(m̂k(x)) = −105.1396 for k = 3 (based on 10³ simulations), while introducing the permutations leads to log(m̂k(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 Estimations of the marginal likelihoods log m̂k(x) by the symmetrised Chib’s approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk): k = 2: −115.68; k = 3: −103.35; k = 4: −102.66; k = 5: −101.93; k = 6: −102.88; k = 7: −105.48; k = 8: −108.44 [Lee, Marin, Mengersen & Robert, 2008]

Slide 63

Slide 63 text

Auxiliary variables Auxiliary variable method avoids computations of intractable constant(s) in the likelihood function f(y|θ) = f̃(y|θ)/Zθ Introduce pseudo-data y′ with artificial target g(y′|θ, y) Generate θ′ ∼ K(θ, θ′) and y′ ∼ f(y′|θ′) [Møller, Pettitt, Berthelsen, & Reeves, 2006]

Slide 64

Slide 64 text

Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s) in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) Accept with probability π(θ )f(y|θ )g(y |θ , y) π(θ)f(y|θ)g(y|θ, y) K(θ , θ)f(y|θ) K(θ, θ )f(y |θ ) ∧ 1 [Møller, Pettitt, Berthelsen, & Reeves, 2006]

Slide 65

Slide 65 text

Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s) in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) Accept with probability π(θ ) ˜ f(y|θ )g(y |θ , y) π(θ) ˜ f(y|θ)g(y|θ, y) K(θ , θ) ˜ f(y|θ) K(θ, θ ) ˜ f(y |θ ) ∧ 1 [Møller, Pettitt, Berthelsen, & Reeves, 2006]

Slide 66

Slide 66 text

Auxiliary variables Auxiliary variable method avoids computations of intractable constant(s) in the likelihood function f(y|θ) = f̃(y|θ)/Zθ Introduce pseudo-data y′ with artificial target g(y′|θ, y) Generate θ′ ∼ K(θ, θ′) and y′ ∼ f(y′|θ′) For Gibbs random fields, existence of a genuine sufficient statistic η(y). [Møller, Pettitt, Berthelsen, & Reeves, 2006]

Slide 67

Slide 67 text

Cross-model solutions 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions Reversible jump Saturation schemes Implementation error 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio

Slide 68

Slide 68 text

Reversible jump irreversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]

Slide 70

Slide 70 text

Local moves For a move between two models, M1 and M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .

Slide 72

Slide 72 text

Local moves (2) In this case, q1→2(θ1, dθ2) has density ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)

Slide 75

Slide 75 text

Generic reversible jump acceptance probability Acceptance probability of θ2 = Ψ1→2(θ1, v1→2) is α(θ1, v1→2) = 1 ∧ [π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|, while the acceptance probability of θ1 with (θ1, v1→2) = Ψ1→2⁻¹(θ2) is α(θ1, v1→2) = 1 ∧ [π(M1, θ1) ϖ1→2 ϕ1→2(v1→2) / {π(M2, θ2) ϖ2→1}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹ c Difficult calibration

Slide 78

Slide 78 text

Green’s sampler Algorithm Iteration t (t ≥ 1): if x(t) = (m, θ(m)), 1 Select model Mn with probability πmn 2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn) 3 Take x(t+1) = (n, θ(n)) with probability min{ [π(n, θ(n)) πnm ϕnm(vnm) / {π(m, θ(m)) πmn ϕmn(umn)}] |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 } and take x(t+1) = x(t) otherwise.
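A toy reversible jump sampler in this spirit (a sketch under illustrative choices, not the generic algorithm above): M1: y_i ∼ N(0,1) with no parameter versus M2: y_i ∼ N(θ,1), θ ∼ N(0,1), with equal prior weights. The 1→2 jump draws the missing coordinate from an N(ȳ, 1) proposal, so the Jacobian is 1, and the exact P(M2|y) is available for comparison.

```python
# Minimal reversible-jump MCMC between a 0-parameter and a 1-parameter model.
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(0.3, 1.0, 10)
n, S, ybar = len(y), y.sum(), y.mean()
T = 100_000

def logpost2(th):            # log of prior x likelihood for M2, common constants dropped
    return -0.5 * np.sum((y - th) ** 2) - 0.5 * th ** 2 - 0.5 * np.log(2 * np.pi)
logpost1 = -0.5 * np.sum(y ** 2)                                    # likelihood of M1
logq = lambda u: -0.5 * (u - ybar) ** 2 - 0.5 * np.log(2 * np.pi)   # jump proposal density

m, th, visits2 = 1, None, 0
for _ in range(T):
    if rng.random() < 0.5:                     # attempt a between-model jump
        if m == 1:                             # 1 -> 2: draw the extra coordinate u
            u = rng.normal(ybar, 1.0)
            if np.log(rng.random()) < logpost2(u) - logpost1 - logq(u):
                m, th = 2, u
        else:                                  # 2 -> 1: drop theta deterministically
            if np.log(rng.random()) < logpost1 + logq(th) - logpost2(th):
                m, th = 1, None
    elif m == 2:                               # within-model random-walk move
        prop = th + rng.normal(0, 0.5)
        if np.log(rng.random()) < logpost2(prop) - logpost2(th):
            th = prop
    visits2 += (m == 2)

B21 = np.exp(S ** 2 / (2 * (n + 1))) / np.sqrt(n + 1)
print("RJ estimate of P(M2|y):", visits2 / T, "exact:", B21 / (1 + B21))
```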

Slide 79

Slide 79 text

Interpretation The representation puts us back in a fixed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic

Slide 81

Slide 81 text

Alternative Saturation of the parameter space H = k {k} × Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .

Slide 83

Slide 83 text

MCMC implementation Run a Markov chain (M(t), θ(t) 1 , . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )

Slide 86

Slide 86 text

Scott’s (2002) proposal Suggest estimating P(M = k|x) by Zk ∝ pk T t=1    fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j )    , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]
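A minimal sketch of this estimator on a two-normal toy (in the spirit of the Examples (2) slide below), with exact posterior draws standing in for the independent MCMC chains, so that the bias can be seen against the exact P(M = 1|x):

```python
# Scott-type estimate of P(M=1|x) from two independent within-model samples.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x, T = 2.0, 100_000
# M1: x|th ~ N(th,1), th ~ N(0,1);  M2: x|th ~ N(th,1), th ~ N(5,1);  p1 = p2 = 1/2
th1 = rng.normal(x / 2, np.sqrt(0.5), T)            # exact posterior draws, model 1
th2 = rng.normal((x + 5) / 2, np.sqrt(0.5), T)      # exact posterior draws, model 2
f1, f2 = norm.pdf(x, th1, 1), norm.pdf(x, th2, 1)   # f_k(x | theta_k^(t))
w1 = 0.5 * f1 / (0.5 * f1 + 0.5 * f2)               # per-iteration weights, model 1
w2 = 0.5 * f2 / (0.5 * f1 + 0.5 * f2)
p1_scott = w1.sum() / (w1.sum() + w2.sum())
m1, m2 = norm.pdf(x, 0, np.sqrt(2)), norm.pdf(x, 5, np.sqrt(2))
print("Scott-type approximation:", p1_scott, "exact P(M=1|x):", m1 / (m1 + m2))
```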

Slide 88

Slide 88 text

Congdon’s (2006) extension Selecting flat [prohibited] pseudo-priors, uses instead Zk ∝ pk T t=1    fk(x|θ(t) k )πk(θ(t) k ) D j=1 pj fj(x|θ(t) j )πj(θ(t) j )    , where again the θ(t) k ’s are MCMC chains with stationary distributions πk(θk|x) to next section

Slide 89

Slide 89 text

Examples Example (Model choice) Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: p1 = p2 = 0.5. Approximations of P(M = 1|x): Scott’s (2002) (blue) and Congdon’s (2006) (red) [N = 10⁶ simulations]. [Figure: both approximations of P(M = 1|x) plotted against the observation y]

Slide 91

Slide 91 text

Examples (2) Example (Model choice (2)) Normal model M1 : x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1) Comparison of both approximations with P(M = 1|x): Scott’s (2002) (green and mixed dashes) and Congdon’s (2006) (brown and long dashes) [N = 10⁴ simulations]. [Figure: exact P(M = 1|x) and both approximations plotted against the observation y]

Slide 92

Slide 92 text

Examples (3) Example (Model choice (3)) Model M1 : x ∼ N(0, 1/ω) with ω ∼ Exp(a) vs. M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b). Comparison of Congdon’s (2006) (brown and dashed lines) with P(M = 1|x) when (a, b) is equal to (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081), resp. [N = 10⁴ simulations]. [Figure: four panels of these comparisons plotted against the observation y]

Slide 93

Slide 93 text

Nested sampling 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling Purpose Implementation Error rates Constraints Importance variant A mixture comparison 5 ABC model choice

Slide 94

Slide 94 text

Nested sampling: Goal Skilling’s (2007) technique using the one-dimensional representation: Z = Eπ[L(θ)] = ∫₀¹ ϕ(x) dx with ϕ⁻¹(l) = Pπ(L(θ) > l). Note: ϕ(·) is intractable in most cases.

Slide 95

Slide 95 text

Nested sampling: First approximation Approximate Z by a Riemann sum: Ẑ = Σ_{i=1}^N (x_{i−1} − x_i) ϕ(x_i) where the x_i’s are either deterministic: x_i = e^{−i/N}, or random: x_0 = 1, x_{i+1} = t_i x_i, t_i ∼ Be(N, 1), so that E[log x_i] = −i/N.

Slide 96

Slide 96 text

Extraneous white noise Take Z = ∫ e^{−θ} dθ = ∫ δ^{−1} e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[ δ^{−1} e^{−(1−δ)θ} ] and Ẑ = (1/N) Σ_{i=1}^N δ^{−1} e^{−(1−δ)θ_i} (x_{i−1} − x_i), θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1}) Comparison of variances and MSEs (deterministic vs. random x_i): N = 50: deterministic 4.64, 10.5; random 4.65, 10.5 / N = 100: deterministic 2.47, 4.9; random 2.48, 5.02 / N = 500: deterministic .549, 1.01; random .550, 1.14

Slide 98

Slide 98 text

Nested sampling: Alternative representation Another representation is Z = Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i, which is a special case of Z = Σ_{i=0}^{N−1} {L(θ^{(i+1)}) − L(θ^{(i)})} π({θ; L(θ) > L(θ^{(i)})}) where · · · L(θ^{(i+1)}) < L(θ^{(i)}) · · · [Lebesgue version of Riemann’s sum]

Slide 100

Slide 100 text

Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from the prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ^{(i+1)})}) / π({θ; L(θ) > L(θ^{(i)})}) ≈ 1 − 1/N

Slide 104

Slide 104 text

Nested sampling: Third approximation Iterate the above steps until a given stopping iteration j is reached: e.g., observe very small changes in the approximation Ẑ; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level ε, i.e. replace ∫₀¹ ϕ(x) dx with ∫_ε^1 ϕ(x) dx
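A compact nested sampling sketch on a one-dimensional toy (prior N(0,1), likelihood N(x_obs; θ, 1), known evidence); the constrained-prior draws are done by brute-force rejection, which is only sensible in this illustrative setting, and the deterministic x_i = e^{−i/N} scheme with a crude fixed stopping rule is used.

```python
# Basic nested sampling on a toy problem with a closed-form evidence.
import numpy as np

rng = np.random.default_rng(7)
x_obs, N = 1.0, 500
L = lambda th: np.exp(-0.5 * (x_obs - th) ** 2) / np.sqrt(2 * np.pi)  # likelihood

live = rng.normal(0, 1, N)               # N particles from the prior
Llive = L(live)
Z, x_prev = 0.0, 1.0
for i in range(1, 6 * N):                # stop once ~exp(-6) of the prior mass remains
    k = np.argmin(Llive)
    x_i = np.exp(-i / N)
    Z += (x_prev - x_i) * Llive[k]       # Riemann sum contribution
    x_prev = x_i
    while True:                          # rejection sampling from the constrained prior
        prop = rng.normal(0, 1)
        if L(prop) > Llive[k]:
            break
    live[k], Llive[k] = prop, L(prop)
Z += x_prev * Llive.mean()               # remaining prior mass, averaged over live points

exact = np.exp(-x_obs ** 2 / 4) / np.sqrt(4 * np.pi)
print("nested sampling:", Z, "exact:", exact)
```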

Slide 105

Slide 105 text

Approximation error Error = Z − Z = j i=1 (xi−1 − xi)ϕi − 1 0 ϕ(x) dx = − 0 ϕ(x) dx + j i=1 (xi−1 − xi)ϕ(xi) − 1 ϕ(x) dx (Quadrature Error) + j i=1 (xi−1 − xi) {ϕi − ϕ(xi)} (Stochastic Error) [Dominated by Monte Carlo]

Slide 106

Slide 106 text

A CLT for the Stochastic Error The (dominating) stochastic error is O_P(N^{−1/2}): N^{1/2} (Ẑ − Z) →_D N(0, V) with V = − ∫∫_{s,t∈[ε,1]} s ϕ′(s) t ϕ′(t) log(s ∨ t) ds dt. [Proof based on Donsker’s theorem]

Slide 107

Slide 107 text

What of log Z? If the interest lies within log Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at first iteration j such that e−j/N < , then: j = N − log .

Slide 110

Slide 110 text

Sampling from constr’d priors Exact simulation from the constrained prior is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]

Slide 113

Slide 113 text

Banana illustration Case of a banana target made of a twisted 2D normal: x̃2 = x2 + βx1² − 100β [Haario, Saksman, Tamminen, 1999] β = .03, σ = (100, 1)

Slide 114

Slide 114 text

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] Evidence estimation

Slide 115

Slide 115 text

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X1] estimation

Slide 116

Slide 116 text

Banana illustration (2) Use of nested sampling with N = 1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X2] estimation

Slide 117

Slide 117 text

A IS variant of nested sampling Consider instrumental prior π and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.

Slide 119

Slide 119 text

Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior Posterior on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ2. occurrences of posterior bursts for µ = xi

Slide 120

Slide 120 text

Experiment MCMC sample for n = 16 observations from the mixture. Nested sampling sequence with M = 1000 starting points.

Slide 121

Slide 121 text

Experiment MCMC sample for n = 50 observations from the mixture. Nested sampling sequence with M = 1000 starting points.

Slide 122

Slide 122 text

Comparison 1 Nested sampling: M = 1000 points, with 10 random walk moves at each step, simulations from the constr’d prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 104 MCMC (=Gibbs) simulations producing non-parametric estimates ϕ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Z1, Z2, Z3 using product of two Gaussian kernels 4 numerical integration based on 850 × 950 grid [reference value, confirmed by Chib’s]

Slide 123

Slide 123 text

Comparison (cont’d) Graph based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).

Slide 124

Slide 124 text

Comparison (cont’d) Graph based on a sample of 50 observations for µ = 2 and σ = 3/2 (150 replicas).

Slide 125

Slide 125 text

Comparison (cont’d) Graph based on a sample of 100 observations for µ = 2 and σ = 3/2 (150 replicas).

Slide 126

Slide 126 text

Approximate Bayesian computation 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling 5 ABC model choice ABC method Formalised framework Consistency results Summary statistics 6 The Savage–Dickey ratio

Slide 127

Slide 127 text

Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ), x ∼ f(x|θ′), until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]

Slide 130

Slide 130 text

A as approximative When y is a continuous random variable, the equality x = y is replaced with a tolerance condition, ρ(x, y) ≤ ε, where ρ is a distance between summary statistics Output distributed from π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
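A minimal ABC rejection sketch with this tolerance condition (summary statistic = sample mean; all numerical choices are illustrative):

```python
# ABC rejection: simulate from the prior, simulate pseudo-data, keep the
# parameters whose summary falls within epsilon of the observed one.
import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(2.0, 1.0, 30)                      # observed sample
sy, eps, M = y.mean(), 0.05, 200_000

theta = rng.normal(0, 10, M)                      # draws theta' from the prior
pseudo = rng.normal(theta[:, None], 1.0, (M, 30)) # pseudo-data sets x ~ f(x|theta')
keep = np.abs(pseudo.mean(axis=1) - sy) <= eps
print("acceptance rate:", keep.mean())
print("ABC posterior mean of theta:", theta[keep].mean())
```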

Slide 132

Slide 132 text

The starting point Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily differs from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]

Slide 134

Slide 134 text

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one).

Slide 135

Slide 135 text

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). Four possible statistics 1 sample mean y (sufficient for M1 if not M2); 2 sample median med(y) (insufficient); 3 sample variance var(y) (ancillary); 4 median absolute deviation mad(y) = med(|y − med(y)|);
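A hedged sketch of ABC model choice on this benchmark, with an assumed normal prior (standard deviation 2) on the location in both models since the slides do not specify the priors; the ABC posterior probability of each model is the frequency of that model among the retained simulations, and one can contrast the sample mean (same mean value under both models) with mad.

```python
# ABC model choice between N(theta,1) and Laplace(theta, 1/sqrt(2)).
import numpy as np

rng = np.random.default_rng(9)
n, M = 100, 100_000
y = rng.normal(0.7, 1.0, n)                        # observed data (truly Gaussian here)

def simulate(model, theta):
    if model == 1:
        return rng.normal(theta, 1.0, n)           # N(theta, 1)
    return rng.laplace(theta, 1.0 / np.sqrt(2), n) # Laplace(theta, 1/sqrt(2)), variance 1

mad = lambda z: np.median(np.abs(z - np.median(z)))
models = rng.integers(1, 3, M)                     # model index drawn uniformly
thetas = rng.normal(0, 2, M)                       # assumed prior on the location
sims = np.array([simulate(m, t) for m, t in zip(models, thetas)])

for stat, name in [(np.mean, "mean"), (mad, "mad")]:
    s_obs = stat(y)
    s_sim = np.array([stat(row) for row in sims])
    keep = np.argsort(np.abs(s_sim - s_obs))[: M // 100]   # 1% closest simulations
    print(name, "-> ABC estimate of P(Gauss|y):", np.mean(models[keep] == 1))
```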

Slide 136

Slide 136 text

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figure: boxplots of the approximations under the Gauss and Laplace models, n = 100]

Slide 137

Slide 137 text

A benchmark if toy example Comparison suggested by referee of PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figure: two panels of boxplots of the approximations under the Gauss and Laplace models, n = 100]

Slide 138

Slide 138 text

Framework Starting from sample y = (y1, . . . , yn) the observed sample, not necessarily iid with true distribution y ∼ Pn Summary statistics T (y) = T n = (T1(y), T2(y), · · · , Td(y)) ∈ Rd with true distribution T n ∼ Gn.

Slide 139

Slide 139 text

Framework c Comparison of – under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1 – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2 turned into – under M1, T (y) ∼ G1,n(·|θ1), and θ1|T (y) ∼ π1(·|T n) – under M2, T (y) ∼ G2,n(·|θ2), and θ2|T (y) ∼ π2(·|T n)

Slide 140

Slide 140 text

Assumptions A collection of asymptotic “standard” assumptions: [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]

Slide 141

Slide 141 text

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A1] There exist a sequence {vn} converging to +∞, a distribution Q, a symmetric, d × d positive definite matrix V0 and a vector µ0 ∈ Rd such that vnV −1/2 0 (T n − µ0) n→∞ ; Q, under Gn

Slide 142

Slide 142 text

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants i, τi, αi > 0 such that for all τ > 0, sup θi∈Fn,i Gi,n |T n − µi(θi)| > τ|µi(θi) − µ0| ∧ i |θi (|τµi(θi) − µ0| ∧ i)−αi v−αi n with πi(Fc n,i ) = o(v−τi n ).

Slide 143

Slide 143 text

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A3] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0) Sn,i(u) = θi ∈ Fn,i; |µi(θi) − µ0| ≤ u v−1 n , there exist constants di < τi ∧ αi − 1 such that πi(Sn,i(u)) ∼ udi v−di n , ∀u vn

Slide 144

Slide 144 text

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U) En ⊂ {t; gi(t|θi) < δgn(t)} and Gn (Ec n ) < .

Slide 145

Slide 145 text

Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] Again (sumup) [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density

Slide 146

Slide 146 text

Effective dimension Understanding di in [A3]: defined only when µ0 ∈ {µi(θi), θi ∈ Θi}, πi(θi : |µi(θi) − µ0| < n−1/2) = O(n−di/2) is the effective dimension of the model Θi around µ0

Slide 147

Slide 147 text

Effective dimension Understanding di in [A3]: when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, mi(T n) gn(T n) ∼ v−di n , thus log(mi(T n)/gn(T n)) ∼ −di log vn and v−di n penalization factor resulting from integrating θi out (see effective number of parameters in DIC)

Slide 148

Slide 148 text

Effective dimension Understanding di in [A3]: In regular models, di dimension of T (Θi), leading to BIC In non-regular models, di can be smaller

Slide 149

Slide 149 text

Asymptotic marginals Asymptotically, under [A1]–[A4] mi(t) = Θi gi(t|θi) πi(θi) dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, Clvd−di n ≤ mi(T n) ≤ Cuvd−di n and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0 mi(T n) = o Pn [vd−τi n + vd−αi n ].

Slide 150

Slide 150 text

Within-model consistency Under same assumptions, if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, the posterior distribution of µi(θi) given T n is consistent at rate 1/vn provided αi ∧ τi > di.

Slide 151

Slide 151 text

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value!

Slide 152

Slide 152 text

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then Clv−(d1−d2) n ≤ m1(T n) m2(T n) ≤ Cuv−(d1−d2) n , where Cl, Cu = O Pn (1), irrespective of the true model. c Only depends on the difference d1 − d2: no consistency

Slide 153

Slide 153 text

Between-model consistency Consequence of above is that asymptotic behaviour of the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(T n) m2(T n) ≥ Cu min v−(d1−α2) n , v−(d1−τ2) n

Slide 154

Slide 154 text

Consistency theorem If inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, Bayes factor BT 12 = O(v−(d1−d2) n ) irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension

Slide 155

Slide 155 text

Consistency theorem If Pn belongs to one of the two models and if µ0 cannot be attained by the other one : 0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) , then the Bayes factor BT 12 is consistent

Slide 156

Slide 156 text

Consequences on summary statistics Bayes factor driven by the means µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with different mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified

Slide 158

Slide 158 text

Toy example: Laplace versus Gauss [1] If T n = n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1]

Slide 159

Slide 159 text

Toy example: Laplace versus Gauss [1] If T n = n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1] c A Bayes factor associated with such a statistic is inconsistent

Slide 160

Slide 160 text

Toy example: Laplace versus Gauss [1] If T n = n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · ·

Slide 161

Slide 161 text

Toy example: Laplace versus Gauss [1] If T^n = n⁻¹ Σ_{i=1}^n X_i⁴, µ1(θ) = 3 + θ⁴ + 6θ², µ2(θ) = 6 + · · · [Figure: boxplots under M1 and M2 of the fourth-moment statistic, n = 1000]

Slide 162

Slide 162 text

Toy example: Laplace versus Gauss [0] When T (y) = ¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent

Slide 164

Slide 164 text

Toy example: Laplace versus Gauss [0] When T(y) = (ȳ_n^(4), ȳ_n^(6)) [Figure: density of the posterior probabilities based on the fourth and sixth moments]

Slide 165

Slide 165 text

Checking for adequate statistics After running ABC, i.e. creating reference tables of (θi, xi) from both joints, straightforward derivation of ABC estimates ˆ θ1 and ˆ θ2. Evaluation of E1 ˆ θ1 [T(X)] and E2 ˆ θ2 [T(X)] allows for detection of different means under both models via Monte Carlo simulations
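A short sketch of this check on the Gauss/Laplace toy: plug (ABC or other) estimates θ̂1, θ̂2 into each model, simulate repeatedly, and compare the Monte Carlo means of the candidate statistic; the numerical values are illustrative.

```python
# Monte Carlo comparison of E[T(X)] under the two fitted models.
import numpy as np

rng = np.random.default_rng(10)
theta1_hat, theta2_hat, n, R = 0.7, 0.7, 100, 5_000
mad = lambda z: np.median(np.abs(z - np.median(z)))

T_gauss = [mad(rng.normal(theta1_hat, 1.0, n)) for _ in range(R)]
T_laplace = [mad(rng.laplace(theta2_hat, 1.0 / np.sqrt(2), n)) for _ in range(R)]
print("E[T] under fitted Gauss:  ", np.mean(T_gauss))
print("E[T] under fitted Laplace:", np.mean(T_laplace))
# clearly separated means suggest T can discriminate between the models
```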

Slide 166

Slide 166 text

Toy example: Laplace versus Gauss [Figure: boxplots of the normalised χ² under the Gauss and Laplace models, without and with mad]

Slide 167

Slide 167 text

Embedded models When M1 submodel of M2, and if the true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n

Slide 168

Slide 168 text

Embedded models When M1 submodel of M2, and if the true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n If summary statistic only informative on a parameter that is the same under both models, i.e if d1 = d2, then c the Bayes factor is not consistent

Slide 169

Slide 169 text

Embedded models When M1 submodel of M2, and if the true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true distribution not in M1, then c Bayes factor is consistent only if µ1 = µ2 = µ0

Slide 170

Slide 170 text

Summary Model selection feasible with ABC Choice of summary statistics is paramount At best, ABC → π(. | T (y)) which concentrates around µ0 For estimation : {θ; µ(θ) = µ0} = θ0 For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅

Slide 172

Slide 172 text

1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio Measure-theoretic aspects Computational implications

Slide 173

Slide 173 text

The Savage–Dickey ratio representation Special representation of the Bayes factor used for simulation Original version (Dickey, AoMS, 1971)

Slide 175

Slide 175 text

Savage’s density ratio theorem Given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ), then B01 = π1(θ0|x) / π1(θ0), with the obvious notations π1(θ) = ∫ π1(θ, ψ) dψ, π1(θ|x) = ∫ π1(θ, ψ|x) dψ [Dickey, 1971; Verdinelli & Wasserman, 1995]
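A minimal sketch of the ratio in practice, for testing θ = 0 in y_i ∼ N(θ, ψ) with π1(θ, ψ) = N(θ; 0, 1) × IG(ψ; a, b), so that π1(ψ|θ0) = π0(ψ); the posterior density at θ0 is Rao-Blackwellised over a small Gibbs sampler, and a, b and the data are illustrative choices.

```python
# Savage-Dickey estimate of B01 = pi1(theta0|x) / pi1(theta0) for theta0 = 0.
import numpy as np

rng = np.random.default_rng(11)
y = rng.normal(0.3, 1.2, 25)
n, ybar, a, b, theta0 = len(y), np.mean(y), 2.0, 2.0, 0.0
npdf = lambda z, m, v: np.exp(-0.5 * (z - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

theta, psi, dens = 0.0, 1.0, []
for t in range(20_000):
    # theta | psi, y ~ N(m, v), with posterior precision n/psi + 1 (prior N(0,1))
    v = 1.0 / (n / psi + 1.0)
    m = v * n * ybar / psi
    theta = rng.normal(m, np.sqrt(v))
    # psi | theta, y ~ InvGamma(a + n/2, b + sum((y - theta)^2)/2)
    psi = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * np.sum((y - theta) ** 2)))
    if t >= 2_000:                          # discard burn-in
        dens.append(npdf(theta0, m, v))     # Rao-Blackwell estimate of pi1(theta0|y)

B01 = np.mean(dens) / npdf(theta0, 0.0, 1.0)   # posterior / prior density at theta0
print("Savage-Dickey estimate of B01:", B01)
```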

Slide 176

Slide 176 text

Rephrased
“Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0),
f0(x) = ∫ f1(x|θ, φ = φ0) f1(θ|φ = φ0) dθ = f1(x|φ = φ0),
i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get
f0(x) = f1(φ0|x) f1(x) / f1(φ0)
and hence the Bayes factor is given by
B = f0(x)/f1(x) = f1(φ0|x)/f1(φ0),
the ratio of the posterior to prior densities at φ = φ0 under the augmented model.”
[O’Hagan & Forster, 1996]

Slide 177

Slide 177 text

Rephrased
“Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0),
f0(x) = ∫ f1(x|θ, φ = φ0) f1(θ|φ = φ0) dθ = f1(x|φ = φ0),
i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get
f0(x) = f1(φ0|x) f1(x) / f1(φ0)
and hence the Bayes factor is given by
B = f0(x)/f1(x) = f1(φ0|x)/f1(φ0),
the ratio of the posterior to prior densities at φ = φ0 under the augmented model.”
[O’Hagan & Forster, 1996]

Slide 178

Slide 178 text

Measure-theoretic difficulty
The representation depends on the choice of versions of conditional densities:
B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ   [by definition]
    = [∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0)] / [∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0)]   [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)]
    = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / [m1(x) π1(θ0)]   [specific version of π1(θ0, ψ)]
    = π1(θ0|x) / π1(θ0)   [version dependent]

Slide 179

Slide 179 text

Measure-theoretic difficulty
The representation depends on the choice of versions of conditional densities:
B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ   [by definition]
    = [∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0)] / [∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0)]   [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)]
    = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / [m1(x) π1(θ0)]   [specific version of π1(θ0, ψ)]
    = π1(θ0|x) / π1(θ0)   [version dependent]

Slide 180

Slide 180 text

Choice of density version
Dickey’s (1971) condition is not a condition: if
π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x)
is chosen as a version, then the Savage–Dickey representation holds

Slide 181

Slide 181 text

Choice of density version
Dickey’s (1971) condition is not a condition: if
π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x)
is chosen as a version, then the Savage–Dickey representation holds

Slide 182

Slide 182 text

Savage–Dickey paradox
Verdinelli–Wasserman extension:
B01 = [π1(θ0|x) / π1(θ0)] · E_{π1(ψ|θ0,x)}[π0(ψ)/π1(ψ|θ0)]
similarly depends on choices of versions...
...but the Monte Carlo implementation relies on specific versions of all densities without making mention of it
[Chen, Shao & Ibrahim, 2000]

Slide 183

Slide 183 text

Savage–Dickey paradox
Verdinelli–Wasserman extension:
B01 = [π1(θ0|x) / π1(θ0)] · E_{π1(ψ|θ0,x)}[π0(ψ)/π1(ψ|θ0)]
similarly depends on choices of versions...
...but the Monte Carlo implementation relies on specific versions of all densities without making mention of it
[Chen, Shao & Ibrahim, 2000]

Slide 184

Slide 184 text

A computational exploitation
Starting from the (instrumental) prior
π̃1(θ, ψ) = π1(θ) π0(ψ)
define the associated posterior
π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)
and impose the choice of version
π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)
Then, since B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x),
B01 = [π̃1(θ0|x) / π1(θ0)] · [m̃1(x) / m1(x)]

Slide 185

Slide 185 text

A computational exploitation
Starting from the (instrumental) prior
π̃1(θ, ψ) = π1(θ) π0(ψ)
define the associated posterior
π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)
and impose the choice of version
π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)
Then, since B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x),
B01 = [π̃1(θ0|x) / π1(θ0)] · [m̃1(x) / m1(x)]

Slide 186

Slide 186 text

First ratio
If (θ^(1), ψ^(1)), . . . , (θ^(T), ψ^(T)) ∼ π̃1(θ, ψ|x), then
(1/T) Σ_t π̃1(θ0|x, ψ^(t))
converges to π̃1(θ0|x), provided the right version is used in θ0:
π̃1(θ0|x, ψ) = π1(θ0) f(x|θ0, ψ) / ∫ π1(θ) f(x|θ, ψ) dθ
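A minimal sketch of this first ratio in a toy model, assuming x1, . . . , xn ∼ N(θ, ψ) with unknown variance ψ, prior π1(θ) = N(0, τ²) and π0(ψ) an inverse gamma, so that π̃1(θ0|x, ψ) is the conjugate normal conditional named above and a standard Gibbs sampler targets π̃1(θ, ψ|x); all numerical values are placeholders and burn-in is omitted for brevity:

```python
import numpy as np
from scipy.stats import norm, invgamma

rng = np.random.default_rng(3)
n, tau2, a, b, theta0 = 30, 4.0, 2.0, 2.0, 0.0   # placeholder prior and test values
x = rng.normal(0.2, 1.0, size=n)

def cond_theta(psi):
    # theta | psi, x is N(m, v) under the N(0, tau2) prior; its density at theta0 is exactly
    # pi1(theta0) f(x|theta0, psi) / int pi1(theta) f(x|theta, psi) dtheta, the version above
    v = 1.0 / (n / psi + 1.0 / tau2)
    return v * x.sum() / psi, v

T, psi, rb = 5_000, 1.0, []
for t in range(T):
    m, v = cond_theta(psi)
    theta = rng.normal(m, np.sqrt(v))                             # Gibbs step: theta | psi, x
    psi = invgamma.rvs(a + n / 2,                                  # Gibbs step: psi | theta, x,
                       scale=b + 0.5 * np.sum((x - theta) ** 2),   # since pi0(psi) is IG(a, b)
                       random_state=rng)
    m, v = cond_theta(psi)
    rb.append(norm.pdf(theta0, loc=m, scale=np.sqrt(v)))           # tilde-pi1(theta0 | x, psi^(t))

print(np.mean(rb))   # Rao-Blackwellised estimate of tilde-pi1(theta0 | x)
```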

Slide 187

Slide 187 text

Rao–Blackwellisation with latent variables
When π̃1(θ0|x, ψ) is unavailable, replace it with
(1/T) Σ_{t=1}^T π̃1(θ0|x, z^(t), ψ^(t))
via data completion by a latent variable z such that
f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz
and such that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) is available in closed form, including the normalising constant, based on the version
π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / ∫ f̃(x, z|θ, ψ) π1(θ) dθ.

Slide 188

Slide 188 text

Rao–Blackwellisation with latent variables
When π̃1(θ0|x, ψ) is unavailable, replace it with
(1/T) Σ_{t=1}^T π̃1(θ0|x, z^(t), ψ^(t))
via data completion by a latent variable z such that
f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz
and such that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) is available in closed form, including the normalising constant, based on the version
π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / ∫ f̃(x, z|θ, ψ) π1(θ) dθ.

Slide 189

Slide 189 text

Bridge revival (1)
Since m̃1(x)/m1(x) is unknown, apparent failure!
Use of the bridge identity
E_{π̃1(θ,ψ|x)}[π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)}] = E_{π̃1(θ,ψ|x)}[π1(ψ|θ)/π0(ψ)] = m1(x)/m̃1(x)
to (biasedly) estimate m̃1(x)/m1(x) by
T / Σ_{t=1}^T {π1(ψ^(t)|θ^(t)) / π0(ψ^(t))}
based on the same sample from π̃1.

Slide 190

Slide 190 text

Bridge revival (1)
Since m̃1(x)/m1(x) is unknown, apparent failure!
Use of the bridge identity
E_{π̃1(θ,ψ|x)}[π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)}] = E_{π̃1(θ,ψ|x)}[π1(ψ|θ)/π0(ψ)] = m1(x)/m̃1(x)
to (biasedly) estimate m̃1(x)/m1(x) by
T / Σ_{t=1}^T {π1(ψ^(t)|θ^(t)) / π0(ψ^(t))}
based on the same sample from π̃1.

Slide 191

Slide 191 text

Bridge revival (2)
The alternative identity
E_{π1(θ,ψ|x)}[π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ, ψ)}] = E_{π1(θ,ψ|x)}[π0(ψ)/π1(ψ|θ)] = m̃1(x)/m1(x)
suggests using a second sample (θ̄^(t), ψ̄^(t), z^(t)) ∼ π1(θ, ψ|x) and the ratio estimate
(1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t))
Resulting unbiased estimate:
B̂01 = [(1/T) Σ_t π̃1(θ0|x, z^(t), ψ^(t)) / π1(θ0)] × [(1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t))]

Slide 192

Slide 192 text

Bridge revival (2)
The alternative identity
E_{π1(θ,ψ|x)}[π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ, ψ)}] = E_{π1(θ,ψ|x)}[π0(ψ)/π1(ψ|θ)] = m̃1(x)/m1(x)
suggests using a second sample (θ̄^(t), ψ̄^(t), z^(t)) ∼ π1(θ, ψ|x) and the ratio estimate
(1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t))
Resulting unbiased estimate:
B̂01 = [(1/T) Σ_t π̃1(θ0|x, z^(t), ψ^(t)) / π1(θ0)] × [(1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t))]
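A generic sketch of how the two Monte Carlo pieces combine into this final estimate; the function and argument names are hypothetical, and the caller is assumed to supply the density evaluations along the two chains:

```python
import numpy as np

def savage_dickey_bridge_B01(rb_cond_at_theta0, prior1_at_theta0,
                             pi0_psi_bar, pi1_psi_bar_given_theta_bar):
    # rb_cond_at_theta0: values of tilde-pi1(theta0 | x, z^(t), psi^(t)) along a chain
    #   targeting tilde-pi1(theta, psi | x)
    # prior1_at_theta0: the scalar pi1(theta0)
    # pi0_psi_bar, pi1_psi_bar_given_theta_bar: values of pi0(psi-bar^(t)) and
    #   pi1(psi-bar^(t) | theta-bar^(t)) along a second chain targeting pi1(theta, psi | x)
    first = np.mean(rb_cond_at_theta0) / prior1_at_theta0
    second = np.mean(np.asarray(pi0_psi_bar) / np.asarray(pi1_psi_bar_given_theta_bar))
    return first * second
```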

Slide 193

Slide 193 text

Difference with the Verdinelli–Wasserman representation
The above leads to the representation
B01 = [π̃1(θ0|x) / π1(θ0)] · E_{π1(θ,ψ|x)}[π0(ψ)/π1(ψ|θ)]
which shows how our approach differs from Verdinelli and Wasserman’s
B01 = [π1(θ0|x) / π1(θ0)] · E_{π1(ψ|θ0,x)}[π0(ψ)/π1(ψ|θ0)]
[for referees only!!]

Slide 194

Slide 194 text

Diabetes in Pima Indian women (cont’d)
Comparison of the variation of the Bayes factor approximations, based on 100 replicas of 20,000 simulations each from the above importance, Chib’s, Savage–Dickey’s and bridge samplers. [boxplots of the approximations: IS, Chib, Savage–Dickey, Bridge]

Slide 195

Slide 195 text

Other things I could have mentioned
use of particle filters [Del Moral et al., 2006]
path sampling: d/dθ log Z(θ) = E_θ[d/dθ log q(ω|θ)] [Ogata, 1989; Gelman & Meng, 1998]
multicanonical algorithms [Marinari & Parisi, 1992; Geyer & Thompson, 1995]
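For instance, a minimal path sampling sketch (not from the slides) in a case where the identity can be checked against a closed-form answer; the tempered Gaussian family below is an arbitrary choice made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# unnormalised family q(w|t) = exp(-(1 + t) * w**2 / 2), so Z(t) = sqrt(2*pi/(1 + t))
# and d/dt log q(w|t) = -w**2 / 2
ts = np.linspace(0.0, 1.0, 21)
grad = np.array([np.mean(-0.5 * rng.normal(0.0, 1.0 / np.sqrt(1.0 + t), 100_000) ** 2)
                 for t in ts])        # E_t[d/dt log q(w|t)] estimated from exact draws of q(.|t)

# trapezoidal rule over the temperature grid: path sampling estimate of log Z(1) - log Z(0)
est = np.sum(0.5 * (grad[1:] + grad[:-1]) * np.diff(ts))
exact = -0.5 * np.log(2.0)            # log sqrt(2*pi/2) - log sqrt(2*pi)
print(est, exact)
```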