Statistical frontiers, Warwick, Jan. 30, 2014

Dfbaebe5e96e827d993483f842c74fa2?s=47 Xi'an
January 23, 2014

Statistical frontiers, Warwick, Jan. 30, 2014

Course on the approximation of Bayes factors by an array of numerical methods

Dfbaebe5e96e827d993483f842c74fa2?s=128

Xi'an

January 23, 2014
Tweet

Transcript

  1. On some computational methods for Bayesian model choice Christian P.

    Robert University of Warwick and Universit´ e Paris Dauphine http://www.ceremade.dauphine.fr/~xian Statistical Frontiers, January 30, 2014
  2. Outline 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions

    4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio
  3. Formal construction of Bayes tests Definition (Test) Given an hypothesis

    H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,
  4. Formal construction of Bayes tests Definition (Test) Given an hypothesis

    H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is δπ(x) = 1 if π(θ ∈ Θ0|x) > π(θ ∈ Θ0|x), 0 otherwise,
  5. Bayes factor Definition (Bayes factors) For testing hypothesis H0 :

    θ ∈ Θ0 vs. Ha : θ ∈ Θ0, under prior π(Θ0)π0(θ) + π(Θc 0 )π1(θ) , central quantity B01(x) = π(Θ0|x) π(Θc 0 |x) π(Θ0) π(Θc 0 ) = Θ0 f(x|θ)π0(θ)dθ Θc 0 f(x|θ)π1(θ)dθ = m0(x) m1(x) [Jeffreys, 1939]
  6. Model choice and model comparison Choice between models Several models

    available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be finite or infinite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors
  7. Model choice and model comparison Choice between models Several models

    available for the same observation x Mi : x ∼ fi(x|θi), i ∈ I where the family I can be finite or infinite Identical setup: Replace hypotheses with models but keep marginal likelihoods and Bayes factors
  8. Bayesian model choice Probabilise the entire model/parameter space allocate probabilities

    pi to all models Mi define priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj
  9. Bayesian model choice Probabilise the entire model/parameter space allocate probabilities

    pi to all models Mi define priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj
  10. Bayesian model choice Probabilise the entire model/parameter space allocate probabilities

    pi to all models Mi define priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj
  11. Bayesian model choice Probabilise the entire model/parameter space allocate probabilities

    pi to all models Mi define priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi Θi fi(x|θi)πi(θi)dθi j pj Θj fj(x|θj)πj(θj)dθj take largest π(Mi|x) to determine “best” model, or use averaged predictive j π(Mj|x) Θj fj(x |θj)πj(θj|x)dθj
  12. Evidence All these problems end up with a similar quantity,

    the evidence Zk = Θk πk(θk)Lk(θk) dθk, aka the marginal likelihood.
  13. Importance sampling 1 Evidence 2 Importance sampling solutions Regular importance

    Harmonic means Chib’s solution Auxiliary variables 3 Cross-model solutions 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio
  14. Bayes factor approximation When approximating the Bayes factor B01 =

    Θ0 f0(x|θ0)π0(θ0)dθ0 Θ1 f1(x|θ1)π1(θ1)dθ1 use of importance functions 0 and 1 and B01 = n−1 0 n0 i=1 f0(x|θi 0 )π0(θi 0 )/ 0(θi 0 ) n−1 1 n1 i=1 f1(x|θi 1 )π1(θi 1 )/ 1(θi 1 ) on the corresponding parameter spaces (may be of different dimensions!)
  15. Default importance A typical choice for in Bayesian settings is

    to take θ ∼ N(ˆ θMLE, ˆ σ2 MLE ) [Marin & Robert, 2009; Wood, 2010]
  16. Bridge sampling Special case: If π1(θ1|x) ∝ ˜ π1(θ1|x) π2(θ2|x)

    ∝ ˜ π2(θ2|x) live on the same space (Θ1 = Θ2), then B12 ≈ 1 n n i=1 ˜ π1(θi|x) ˜ π2(θi|x) θi ∼ π2(θ|x) [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
  17. Bridge sampling variance The bridge sampling estimator does poorly if

    var(B12) B2 12 ≈ 1 n E π1(θ) − π2(θ) π2(θ) 2 is large, i.e. if π1 and π2 have little overlap...
  18. Bridge sampling variance The bridge sampling estimator does poorly if

    var(B12) B2 12 ≈ 1 n E π1(θ) − π2(θ) π2(θ) 2 is large, i.e. if π1 and π2 have little overlap...
  19. (Further) bridge sampling A remarkable property B12 = ˜ π2(θ|x)α(θ)π1(θ|x)dθ

    ˜ π1(θ|x)α(θ)π2(θ|x)dθ ∀ α(·) ≈ 1 n1 n1 i=1 ˜ π2(θ1i|x)α(θ1i) 1 n2 n2 i=1 ˜ π1(θ2i|x)α(θ2i) θji ∼ πj(θ|x)
  20. An infamous example When α(θ) = 1 ˜ π1(θ)˜ π2(θ)

    harmonic mean approximation to B12 B21 = 1 n1 n1 i=1 1/˜ π1(θ1i|x) 1 n2 n2 i=1 1/˜ π2(θ2i|x) θji ∼ πj(θ|x) [Newton & Raftery, 1994] Infamous: Most often leads to an infinite variance!!! [Radford Neal’s blog, 2008]
  21. An infamous example When α(θ) = 1 ˜ π1(θ)˜ π2(θ)

    harmonic mean approximation to B12 B21 = 1 n1 n1 i=1 1/˜ π1(θ1i|x) 1 n2 n2 i=1 1/˜ π2(θ2i|x) θji ∼ πj(θ|x) [Newton & Raftery, 1994] Infamous: Most often leads to an infinite variance!!! [Radford Neal’s blog, 2008]
  22. “The Worst Monte Carlo Method Ever” “The good news is

    that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
  23. “The Worst Monte Carlo Method Ever” “The good news is

    that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
  24. Optimal bridge sampling The optimal choice of auxiliary function is

    α = n1 + n2 n1π1(θ|x) + n2π2(θ|x) leading to B12 ≈ 1 n1 n1 i=1 ˜ π2(θ1i|x) n1π1(θ1i|x) + n2π2(θ1i|x) 1 n2 n2 i=1 ˜ π1(θ2i|x) n1π1(θ2i|x) + n2π2(θ2i|x) Back later! [Gelfand & Dey, 1994]
  25. Optimal bridge sampling (2) Reason: Var(B12) B2 12 ≈ 1

    n1n2 π1(θ)π2(θ)[n1π1(θ) + n2π2(θ)]α(θ)2 dθ π1(θ)π2(θ)α(θ) dθ 2 − 1 δ method Drag: Dependence on the unknown normalising constants solved iteratively
  26. Optimal bridge sampling (2) Reason: Var(B12) B2 12 ≈ 1

    n1n2 π1(θ)π2(θ)[n1π1(θ) + n2π2(θ)]α(θ)2 dθ π1(θ)π2(θ)α(θ) dθ 2 − 1 δ method Drag: Dependence on the unknown normalising constants solved iteratively
  27. Ratio importance sampling Another fundamental identity: B12 = Eϕ [˜

    π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with sufficiently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)
  28. Ratio importance sampling Another fundamental identity: B12 = Eϕ [˜

    π1(θ1)/ϕ(θ1)] Eϕ [˜ π2(θ2)/ϕ(θ2)] for any density ϕ with sufficiently large support (when Θ1 = Θ2) [Torrie & Valleau, 1977] Use of a single sample θ1, . . . , θn from ϕ B12 = i=1 ˜ π1(θi)/ϕ(θi) i=1 ˜ π2(θi)/ϕ(θi)
  29. Ratio importance sampling (2) Approximate variance: var(B12) B2 12 ≈

    1 n Eϕ (π1(θ) − π2(θ))2 ϕ(θ)2 2 Optimal choice: ϕ∗(θ) = | π1(θ) − π2(θ) | | π1(η) − π2(η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  30. Ratio importance sampling (2) Approximate variance: var(B12) B2 12 ≈

    1 n Eϕ (π1(θ) − π2(θ))2 ϕ(θ)2 2 Optimal choice: ϕ∗(θ) = | π1(θ) − π2(θ) | | π1(η) − π2(η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  31. Ratio importance sampling (2) Approximate variance: var(B12) B2 12 ≈

    1 n Eϕ (π1(θ) − π2(θ))2 ϕ(θ)2 2 Optimal choice: ϕ∗(θ) = | π1(θ) − π2(θ) | | π1(η) − π2(η) | dη [Chen, Shao & Ibrahim, 2000] Formaly better than bridge sampling
  32. Approximating Zk from a posterior sample Use of the [harmonic

    mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
  33. Approximating Zk from a posterior sample Use of the [harmonic

    mean] identity Eπk ϕ(θk) πk(θk)Lk(θk) x = ϕ(θk) πk(θk)Lk(θk) πk(θk)Lk(θk) Zk dθk = 1 Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ
  34. Comparison with regular importance sampling Harmonic mean: Constraint opposed to

    usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
  35. Comparison with regular importance sampling Harmonic mean: Constraint opposed to

    usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation Z1k = 1 1 T T t=1 ϕ(θ(t) k ) πk(θ(t) k )Lk(θ(t) k ) to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
  36. Comparison with regular importance sampling (cont’d) Compare Z1k with a

    standard importance sampling approximation Z2k = 1 T T t=1 πk(θ(t) k )Lk(θ(t) k ) ϕ(θ(t) k ) where the θ(t) k ’s are generated from the density ϕ(·) (with fatter tails like t’s)
  37. Approximating Zk using a mixture representation Bridge sampling recall Design

    a specific mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
  38. Approximating Zk using a mixture representation Bridge sampling recall Design

    a specific mixture for simulation [importance sampling] purposes, with density ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight
  39. Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs)

    sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently
  40. Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs)

    sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently
  41. Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs)

    sampler At iteration t 1 Take δ(t) = 1 with probability ω1πk(θ(t−1) k )Lk(θ(t−1) k ) ω1πk(θ(t−1) k )Lk(θ(t−1) k ) + ϕ(θ(t−1) k ) and δ(t) = 2 otherwise; 2 If δ(t) = 1, generate θ(t) k ∼ MCMC(θ(t−1) k , θk) where MCMC(θk, θk ) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk); 3 If δ(t) = 2, generate θ(t) k ∼ ϕ(θk) independently
  42. Evidence approximation by mixtures Rao-Blackwellised estimate ˆ ξ = 1

    T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]
  43. Evidence approximation by mixtures Rao-Blackwellised estimate ˆ ξ = 1

    T T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) , converges to ω1Zk/{ω1Zk + 1} Deduce ˆ Z3k from ω1 ˆ Z3k/{ω1 ˆ Z3k + 1} = ˆ ξ ie ˆ Z3k = T t=1 ω1πk(θ(t) k )Lk(θ(t) k ) ω1π(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) T t=1 ϕ(θ(t) k ) ω1πk(θ(t) k )Lk(θ(t) k ) + ϕ(θ(t) k ) [Bridge sampler redux]
  44. Integration by conditioning Use Rao-Blackwell Theorem var(E[δ(X)|Y]) ≤ var(δ(X))

  45. Consequence If I unbiased estimator of I = Ef [h(X)],

    with X simulated from a joint density ˜ f(x, y), where ˜ f(x, y)dy = f(x), the estimator I∗ = E ˜ f [I|Y1, . . . , Yn] dominate I(X1, . . . , Xn) variance-wise (and is unbiased)
  46. Chib’s representation Direct application of Bayes’ theorem: given x ∼

    fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .
  47. Chib’s representation Direct application of Bayes’ theorem: given x ∼

    fk(x|θk) and θk ∼ πk(θk), Zk = mk(x) = fk(x|θk) πk(θk) πk(θk|x) Use of an approximation to the posterior Zk = mk(x) = fk(x|θ∗ k ) πk(θ∗ k ) ˆ πk(θ∗ k |x) .
  48. Case of latent variables For missing variable z as in

    mixture models, natural Rao-Blackwell estimate πk(θ∗ k |x) = 1 T T t=1 πk(θ∗ k |x, z(t) k ) , where the z(t) k ’s are Gibbs sampled latent variables
  49. Label switching impact A mixture model [special case of missing

    variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identifiable marginally since they are exchangeable
  50. Label switching impact A mixture model [special case of missing

    variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N(0, 1) + 0.7N(2.3, 1) and 0.7N(2.3, 1) + 0.3N(0, 1) are exactly the same! c The component parameters θi are not identifiable marginally since they are exchangeable
  51. Connected difficulties 1 Number of modes of the likelihood of

    order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
  52. Connected difficulties 1 Number of modes of the likelihood of

    order O(k!): c Maximization and even [MCMC] exploration of the posterior surface harder 2 Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2
  53. License Since Gibbs output does not produce exchangeability, the Gibbs

    sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once 0 100 200 300 400 500 −1 0 1 2 3 n µi −1 0 1 2 3 0.2 0.3 0.4 0.5 µ i pi 0 100 200 300 400 500 0.2 0.3 0.4 0.5 n pi 0.2 0.3 0.4 0.5 0.4 0.6 0.8 1.0 p i σi 0 100 200 300 400 500 0.4 0.6 0.8 1.0 n σi 0.4 0.6 0.8 1.0 −1 0 1 2 3 σ i µi
  54. Label switching paradox We should observe the exchangeability of the

    components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  55. Label switching paradox We should observe the exchangeability of the

    components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  56. Label switching paradox We should observe the exchangeability of the

    components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  57. Compensation for label switching For mixture models, z(t) k usually

    fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]
  58. Compensation for label switching For mixture models, z(t) k usually

    fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory πk(θk|x) = πk(σ(θk)|x) = 1 k! σ∈S πk(σ(θk)|x) for all σ’s in Sk, set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using πk(θ∗ k |x) = 1 T k! σ∈Sk T t=1 πk(σ(θ∗ k )|x, z(t) k ) . [Berkhof, Mechelen, & Gelman, 2003]
  59. Galaxy dataset n = 82 galaxies as a mixture of

    k normal distributions with both mean and variance unknown. [Roeder, 1992] Average density data Relative Frequency −2 −1 0 1 2 3 0.0 0.2 0.4 0.6 0.8
  60. Galaxy dataset (k) Using only the original estimate, with θ∗

    k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]
  61. Galaxy dataset (k) Using only the original estimate, with θ∗

    k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]
  62. Galaxy dataset (k) Using only the original estimate, with θ∗

    k as the MAP estimator, log( ˆ mk(x)) = −105.1396 for k = 3 (based on 103 simulations), while introducing the permutations leads to log( ˆ mk(x)) = −103.3479 Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk). [Lee, Marin, Mengersen & Robert, 2008]
  63. Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s)

    in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) [Møller, Pettitt, Berthelsen, & Reeves, 2006]
  64. Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s)

    in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) Accept with probability π(θ )f(y|θ )g(y |θ , y) π(θ)f(y|θ)g(y|θ, y) K(θ , θ)f(y|θ) K(θ, θ )f(y |θ ) ∧ 1 [Møller, Pettitt, Berthelsen, & Reeves, 2006]
  65. Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s)

    in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) Accept with probability π(θ ) ˜ f(y|θ )g(y |θ , y) π(θ) ˜ f(y|θ)g(y|θ, y) K(θ , θ) ˜ f(y|θ) K(θ, θ ) ˜ f(y |θ ) ∧ 1 [Møller, Pettitt, Berthelsen, & Reeves, 2006]
  66. Auxiliary variables Auxiliary variable method avoids computations of untractable constant(s)

    in likelihood function f(y|θ) = Zθ ˜ f(y|θ) Introduce pseudo-data y with artificial target g(y|θ, y) Generate θ ∼ K(θ, θ ) and y ∼ f(y|θ ) For Gibbs random fields , existence of a genuine sufficient statistic η(y). [Møller, Pettitt, Berthelsen, & Reeves, 2006]
  67. Cross-model solutions 1 Evidence 2 Importance sampling solutions 3 Cross-model

    solutions Reversible jump Saturation schemes Implementation error 4 Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio
  68. Reversible jump irreversible jump Idea: Set up a proper measure–theoretic

    framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]
  69. Reversible jump irreversible jump Idea: Set up a proper measure–theoretic

    framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that A B K(x, dy)π(x)dx = B A K(y, dx)π(y)dy for the invariant density π [x is of the form (k, θ(k))]
  70. Local moves For a move between two models, M1 and

    M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .
  71. Local moves For a move between two models, M1 and

    M2, the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ) the corresponding kernels, under the detailed balance condition π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) , and take, wlog, dim(M2) > dim(M1). Proposal expressed as θ2 = Ψ1→2(θ1, v1→2) where v1→2 is a random variable of dimension dim(M2) − dim(M1), generated as v1→2 ∼ ϕ1→2(v1→2) .
  72. Local moves (2) In this case, q1→2(θ1, dθ2) has density

    ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)
  73. Local moves (2) In this case, q1→2(θ1, dθ2) has density

    ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)
  74. Local moves (2) In this case, q1→2(θ1, dθ2) has density

    ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) −1 , by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1, acceptance probability reduces to α(θ1, v1→2) = 1∧ π(M2, θ2) 2→1 π(M1, θ1) 1→2 ϕ1→2(v1→2) ∂Ψ1→2(θ1, v1→2) ∂(θ1, v1→2) . If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1, as in K(x, B) = ∞ m=1 ρm (x, y)qm (x, dy) + ω(x)IB (x)
  75. Generic reversible jump acceptance probability Acceptance probability of θ2 =

    Ψ1→2(θ1, v1→2) is α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1, v1→2) = Ψ−1 1→2 (θ2) is α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) −1 c Difficult calibration
  76. Generic reversible jump acceptance probability Acceptance probability of θ2 =

    Ψ1→2(θ1, v1→2) is α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1, v1→2) = Ψ−1 1→2 (θ2) is α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) −1 c Difficult calibration
  77. Generic reversible jump acceptance probability Acceptance probability of θ2 =

    Ψ1→2(θ1, v1→2) is α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1, v1→2) = Ψ−1 1→2 (θ2) is α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) ∂(θ1 , v1→2 ) −1 c Difficult calibration
  78. Green’s sampler Algorithm Iteration t (t ≥ 1): if x(t)

    = (m, θ(m)), 1 Select model Mn with probability πmn 2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn) 3 Take x(t+1) = (n, θ(n)) with probability min π(n, θ(n)) π(m, θ(m)) πnm ϕnm (vnm ) πmn ϕmn (umn ) ∂Ψm→n (θ(m), umn ) ∂(θ(m), umn ) , 1 and take x(t+1) = x(t) otherwise.
  79. Interpretation The representation puts us back in a fixed dimension

    setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic
  80. Interpretation The representation puts us back in a fixed dimension

    setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1, v1→2) = Ψ−1 1→2 (θ2) appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when proposal distribution is deterministic
  81. Alternative Saturation of the parameter space H = k {k}

    × Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .
  82. Alternative Saturation of the parameter space H = k {k}

    × Θk by creating θ = (θ1, . . . , θD) a model index M pseudo-priors πj(θj|M = k) for j = k [Carlin & Chib, 1995] Validation by P(M = k|x) = P(M = k|x, θ)π(θ|x)dθ = Zk where the (marginal) posterior is [not πk] π(θ|x) = D k=1 P(θ, M = k|x) = D k=1 pk Zk πk(θk|x) j=k πj(θj|M = k) .
  83. MCMC implementation Run a Markov chain (M(t), θ(t) 1 ,

    . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )
  84. MCMC implementation Run a Markov chain (M(t), θ(t) 1 ,

    . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )
  85. MCMC implementation Run a Markov chain (M(t), θ(t) 1 ,

    . . . , θ(t) D ) with stationary distribution π(θ, M|x) by 1 Pick M(t) = k with probability π(θ(t−1), k|x) 2 Generate θ(t−1) k from the posterior πk(θk|x) [or MCMC step] 3 Generate θ(t−1) j (j = k) from the pseudo-prior πj(θj|M = k) Approximate P(M = k|x) = Zk by ˇ pk(x) ∝ pk T t=1 fk(x|θ(t) k ) πk(θ(t) k ) j=k πj(θ(t) j |M = k) D =1 p f (x|θ(t)) π (θ(t)) j= πj(θ(t) j |M = )
  86. Scott’s (2002) proposal Suggest estimating P(M = k|x) by Zk

    ∝ pk T t=1    fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j )    , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]
  87. Scott’s (2002) proposal Suggest estimating P(M = k|x) by Zk

    ∝ pk T t=1    fk(x|θ(t) k ) D j=1 pj fj(x|θ(t) j )    , based on D simultaneous and independent MCMC chains (θ(t) k )t , 1 ≤ k ≤ D , with stationary distributions πk(θk|x) [instead of above joint]
  88. Congdon’s (2006) extension Selecting flat [prohibited] pseudo-priors, uses instead Zk

    ∝ pk T t=1    fk(x|θ(t) k )πk(θ(t) k ) D j=1 pj fj(x|θ(t) j )πj(θ(t) j )    , where again the θ(t) k ’s are MCMC chains with stationary distributions πk(θk|x) to next section
  89. Examples Example (Model choice) Model M1 : x|θ ∼ U(0,

    θ) with prior θ ∼ Exp(1) is versus model M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: 1 = 2 = 0.5. Approximations of P(M = 1|x): Scott’s (2002) (blue), and Congdon’s (2006) (red) [N = 106 simulations]. 1 2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0 y
  90. Examples Example (Model choice) Model M1 : x|θ ∼ U(0,

    θ) with prior θ ∼ Exp(1) is versus model M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: 1 = 2 = 0.5. Approximations of P(M = 1|x): Scott’s (2002) (blue), and Congdon’s (2006) (red) [N = 106 simulations]. 1 2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0 y
  91. Examples (2) Example (Model choice (2)) Normal model M1 :

    x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1) Comparison of both approximations with P(M = 1|x): Scott’s (2002) (green and mixed dashes) and Congdon’s (2006) (brown and long dashes) [N = 104 simulations]. −1 0 1 2 3 4 5 6 0.0 0.2 0.4 0.6 0.8 1.0 y
  92. Examples (3) Example (Model choice (3)) Model M1 : x

    ∼ N(0, 1/ω) with ω ∼ Exp(a) vs. M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b). Comparison of Congdon’s (2006) (brown and dashed lines) with P(M = 1|x) when (a, b) is equal to (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081), resp. [N = 104 simulations]. −10 −5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 y −10 −5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 y −10 −5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 y −10 −5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 y
  93. Nested sampling 1 Evidence 2 Importance sampling solutions 3 Cross-model

    solutions 4 Nested sampling Purpose Implementation Error rates Constraints Importance variant A mixture comparison 5 ABC model choice
  94. Nested sampling: Goal Skilling’s (2007) technique using the one-dimensional representation:

    Z = Eπ[L(θ)] = 1 0 ϕ(x) dx with ϕ−1(l) = Pπ(L(θ) > l). Note; ϕ(·) is intractable in most cases.
  95. Nested sampling: First approximation Approximate Z by a Riemann sum:

    Z = N i=1 (xi−1 − xi)ϕ(xi) where the xi’s are either: deterministic: xi = e−i/N or random: x0 = 0, xi+1 = tixi, ti ∼ Be(N, 1) so that E[log xi] = −i/N.
  96. Extraneous white noise Take Z = e−θ dθ = 1

    δ e−(1−δ)θ e−δθ = Eδ 1 δ e−(1−δ)θ ˆ Z = 1 N N i=1 δ−1 e−(1−δ)θi (xi−1 − xi) , θi ∼ E(δ) I(θi ≤ θi−1) N deterministic random 50 4.64 10.5 4.65 10.5 100 2.47 4.9 2.48 5.02 500 .549 1.01 .550 1.14 Comparison of variances and MSEs
  97. Extraneous white noise Take Z = e−θ dθ = 1

    δ e−(1−δ)θ e−δθ = Eδ 1 δ e−(1−δ)θ ˆ Z = 1 N N i=1 δ−1 e−(1−δ)θi (xi−1 − xi) , θi ∼ E(δ) I(θi ≤ θi−1) N deterministic random 50 4.64 10.5 4.65 10.5 100 2.47 4.9 2.48 5.02 500 .549 1.01 .550 1.14 Comparison of variances and MSEs
  98. Nested sampling: Alternative representation Another representation is Z = N−1

    i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]
  99. Nested sampling: Alternative representation Another representation is Z = N−1

    i=0 {ϕ(xi+1) − ϕ(xi)}xi which is a special case of Z = N−1 i=0 {L(θ(i+1) ) − L(θ(i) )}π({θ; L(θ) > L(θ(i) )}) where · · · L(θ(i+1) ) < L(θ(i) ) · · · [Lebesgue version of Riemann’s sum]
  100. Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested

    sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N
  101. Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested

    sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N
  102. Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested

    sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N
  103. Nested sampling: Second approximation Estimate (intractable) ϕ(xi) by ϕi: Nested

    sampling Start with N values θ1, . . . , θN sampled from π At iteration i, 1 Take ϕi = L(θk), where θk is the point with smallest likelihood in the pool of θi’s 2 Replace θk with a sample from the prior constrained to L(θ) > ϕi: the current N points are sampled from prior constrained to L(θ) > ϕi. Note that π({θ; L(θ) > L(θ(i+1) )}) π({θ; L(θ) > L(θ(i) )}) ≈ 1 − 1 N
  104. Nested sampling: Third approximation Iterate the above steps until a

    given stopping iteration j is reached: e.g., observe very small changes in the approximation Z; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; truncate the integral Z at level , i.e. replace 1 0 ϕ(x) dx with 1 ϕ(x) dx
  105. Approximation error Error = Z − Z = j i=1

    (xi−1 − xi)ϕi − 1 0 ϕ(x) dx = − 0 ϕ(x) dx + j i=1 (xi−1 − xi)ϕ(xi) − 1 ϕ(x) dx (Quadrature Error) + j i=1 (xi−1 − xi) {ϕi − ϕ(xi)} (Stochastic Error) [Dominated by Monte Carlo]
  106. A CLT for the Stochastic Error The (dominating) stochastic error

    is OP (N−1/2): N1/2 Z − Z D → N (0, V ) with V = − s,t∈[ ,1] sϕ (s)tϕ (t) log(s ∨ t) ds dt. [Proof based on Donsker’s theorem]
  107. What of log Z? If the interest lies within log

    Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at first iteration j such that e−j/N < , then: j = N − log .
  108. What of log Z? If the interest lies within log

    Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at first iteration j such that e−j/N < , then: j = N − log .
  109. What of log Z? If the interest lies within log

    Z, Slutsky’s transform of the CLT: N−1/2 log Z − log Z D → N 0, V/Z2 Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at first iteration j such that e−j/N < , then: j = N − log .
  110. Sampling from constr’d priors Exact simulation from the constrained prior

    is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]
  111. Sampling from constr’d priors Exact simulation from the constrained prior

    is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]
  112. Sampling from constr’d priors Exact simulation from the constrained prior

    is intractable in most cases Skilling (2007) proposes to use MCMC but this introduces a bias (stopping rule) If implementable, then slice sampler can be devised at the same cost [or less]
  113. Banana illustration Case of a banana target made of a

    twisted 2D normal: x2 = x2 + βx2 1 − 100β [Haario, Sacksman, Tamminen, 1999] β = .03, σ = (100, 1)
  114. Banana illustration (2) Use of nested sampling with N =

    1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] Evidence estimation
  115. Banana illustration (2) Use of nested sampling with N =

    1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X1] estimation
  116. Banana illustration (2) Use of nested sampling with N =

    1000, 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and final sample of 50000 points, using nine Student’s t components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Physica Rev. D] E[X2] estimation
  117. A IS variant of nested sampling Consider instrumental prior π

    and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.
  118. A IS variant of nested sampling Consider instrumental prior π

    and likelihood ˜ L, weight function w(θ) = π(θ)L(θ) π(θ)L(θ) and weighted NS estimator Z = j i=1 (xi−1 − xi)ϕiw(θi). Then choose (π, L) so that sampling from π constrained to L(θ) > l is easy; e.g. N(c, Id) constrained to c − θ < r.
  119. Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior Posterior

    on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ2. occurrences of posterior bursts for µ = xi
  120. Experiment MCMC sample for n = 16 observations from the

    mixture. Nested sampling sequence with M = 1000 starting points.
  121. Experiment MCMC sample for n = 50 observations from the

    mixture. Nested sampling sequence with M = 1000 starting points.
  122. Comparison 1 Nested sampling: M = 1000 points, with 10

    random walk moves at each step, simulations from the constr’d prior and a stopping rule at 95% of the observed maximum likelihood 2 T = 104 MCMC (=Gibbs) simulations producing non-parametric estimates ϕ [Diebolt & Robert, 1990] 3 Monte Carlo estimates Z1, Z2, Z3 using product of two Gaussian kernels 4 numerical integration based on 850 × 950 grid [reference value, confirmed by Chib’s]
  123. Comparison (cont’d) Graph based on a sample of 10 observations

    for µ = 2 and σ = 3/2 (150 replicas).
  124. Comparison (cont’d) Graph based on a sample of 50 observations

    for µ = 2 and σ = 3/2 (150 replicas).
  125. Comparison (cont’d) Graph based on a sample of 100 observations

    for µ = 2 and σ = 3/2 (150 replicas).
  126. Approximate Bayesian computation 1 Evidence 2 Importance sampling solutions 3

    Cross-model solutions 4 Nested sampling 5 ABC model choice ABC method Formalised framework Consistency results Summary statistics 6 The Savage–Dickey ratio
  127. Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

    f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
  128. Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

    f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
  129. Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood

    f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ ∼ π(θ) , x ∼ f(x|θ ) , until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
  130. A as approximative When y is a continuous random variable,

    equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )
  131. A as approximative When y is a continuous random variable,

    equality x = y is replaced with a tolerance condition, (x, y) ≤ where is a distance between summary statistics Output distributed from π(θ) Pθ{ (x, y) < } ∝ π(θ| (x, y) < )
  132. The starting point Central question to the validation of ABC

    for model choice: When is a Bayes factor based on an insufficient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily differs from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]
  133. The starting point Central question to the validation of ABC

    for model choice: When is a Bayes factor based on an insufficient statistic T (y) consistent? Note/warnin: c drawn on T (y) through BT 12 (y) necessarily differs from c drawn on y through B12(y) [Marin, Pillai, X, & Rousseau, arXiv, 2012]
  134. A benchmark if toy example Comparison suggested by referee of

    PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one).
  135. A benchmark if toy example Comparison suggested by referee of

    PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). Four possible statistics 1 sample mean y (sufficient for M1 if not M2); 2 sample median med(y) (insufficient); 3 sample variance var(y) (ancillary); 4 median absolute deviation mad(y) = med(|y − med(y)|);
  136. A benchmark if toy example Comparison suggested by referee of

    PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). q q q q q q q q q q q Gauss Laplace 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 n=100
  137. A benchmark if toy example Comparison suggested by referee of

    PNAS paper [thanks!]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/ √ 2), Laplace distribution with mean θ2 and scale parameter 1/ √ 2 (variance one). q q q q q q q q q q q Gauss Laplace 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 n=100 q q q q q q q q q q q q q q q q q q Gauss Laplace 0.0 0.2 0.4 0.6 0.8 1.0 n=100
  138. Framework Starting from sample y = (y1, . . .

    , yn) the observed sample, not necessarily iid with true distribution y ∼ Pn Summary statistics T (y) = T n = (T1(y), T2(y), · · · , Td(y)) ∈ Rd with true distribution T n ∼ Gn.
  139. Framework c Comparison of – under M1, y ∼ F1,n(·|θ1)

    where θ1 ∈ Θ1 ⊂ Rp1 – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2 turned into – under M1, T (y) ∼ G1,n(·|θ1), and θ1|T (y) ∼ π1(·|T n) – under M2, T (y) ∼ G2,n(·|θ2), and θ2|T (y) ∼ π2(·|T n)
  140. Assumptions A collection of asymptotic “standard” assumptions: [A1] is a

    standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density [Think CLT!]
  141. Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A1]

    There exist a sequence {vn} converging to +∞, a distribution Q, a symmetric, d × d positive definite matrix V0 and a vector µ0 ∈ Rd such that vnV −1/2 0 (T n − µ0) n→∞ ; Q, under Gn
  142. Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A2]

    For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and constants i, τi, αi > 0 such that for all τ > 0, sup θi∈Fn,i Gi,n |T n − µi(θi)| > τ|µi(θi) − µ0| ∧ i |θi (|τµi(θi) − µ0| ∧ i)−αi v−αi n with πi(Fc n,i ) = o(v−τi n ).
  143. Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A3]

    If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0) Sn,i(u) = θi ∈ Fn,i; |µi(θi) − µ0| ≤ u v−1 n , there exist constants di < τi ∧ αi − 1 such that πi(Sn,i(u)) ∼ udi v−di n , ∀u vn
  144. Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] [A4]

    If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any > 0, there exist U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U) En ⊂ {t; gi(t|θi) < δgn(t)} and Gn (Ec n ) < .
  145. Assumptions A collection of asymptotic “standard” assumptions: [Think CLT!] Again

    (sumup) [A1] is a standard central limit theorem under the true model [A2] controls the large deviations of the estimator T n from the estimand µ(θ) [A3] is the standard prior mass condition found in Bayesian asymptotics (di effective dimension of the parameter) [A4] restricts the behaviour of the model density against the true density
  146. Effective dimension Understanding di in [A3]: defined only when µ0

    ∈ {µi(θi), θi ∈ Θi}, πi(θi : |µi(θi) − µ0| < n−1/2) = O(n−di/2) is the effective dimension of the model Θi around µ0
  147. Effective dimension Understanding di in [A3]: when inf{|µi(θi) − µ0|;

    θi ∈ Θi} = 0, mi(T n) gn(T n) ∼ v−di n , thus log(mi(T n)/gn(T n)) ∼ −di log vn and v−di n penalization factor resulting from integrating θi out (see effective number of parameters in DIC)
  148. Effective dimension Understanding di in [A3]: In regular models, di

    dimension of T (Θi), leading to BIC In non-regular models, di can be smaller
  149. Asymptotic marginals Asymptotically, under [A1]–[A4] mi(t) = Θi gi(t|θi) πi(θi)

    dθi is such that (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, Clvd−di n ≤ mi(T n) ≤ Cuvd−di n and (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0 mi(T n) = o Pn [vd−τi n + vd−αi n ].
  150. Within-model consistency Under same assumptions, if inf{|µi(θi) − µ0|; θi

    ∈ Θi} = 0, the posterior distribution of µi(θi) given T n is consistent at rate 1/vn provided αi ∧ τi > di.
  151. Between-model consistency Consequence of above is that asymptotic behaviour of

    the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value!
  152. Between-model consistency Consequence of above is that asymptotic behaviour of

    the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Indeed, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then Clv−(d1−d2) n ≤ m1(T n) m2(T n) ≤ Cuv−(d1−d2) n , where Cl, Cu = O Pn (1), irrespective of the true model. c Only depends on the difference d1 − d2: no consistency
  153. Between-model consistency Consequence of above is that asymptotic behaviour of

    the Bayes factor is driven by the asymptotic mean value of T n under both models. And only by this mean value! Else, if inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0 then m1(T n) m2(T n) ≥ Cu min v−(d1−α2) n , v−(d1−τ2) n
  154. Consistency theorem If inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} =

    inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0, Bayes factor BT 12 = O(v−(d1−d2) n ) irrespective of the true model. It is inconsistent since it always picks the model with the smallest dimension
  155. Consistency theorem If Pn belongs to one of the two

    models and if µ0 cannot be attained by the other one : 0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) , then the Bayes factor BT 12 is consistent
  156. Consequences on summary statistics Bayes factor driven by the means

    µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with different mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified
  157. Consequences on summary statistics Bayes factor driven by the means

    µi(θi) and the relative position of µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2. For ABC, this implies the most likely statistics T n are ancillary statistics with different mean values under both models Else, if T n asymptotically depends on some of the parameters of the models, it is possible that there exists θi ∈ Θi such that µi(θi) = µ0 even though model M1 is misspecified
  158. Toy example: Laplace versus Gauss [1] If T n =

    n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1]
  159. Toy example: Laplace versus Gauss [1] If T n =

    n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · and the true distribution is Laplace with mean θ0 = 1, under the Gaussian model the value θ∗ = 2 √ 3 − 3 also leads to µ0 = µ(θ∗) [here d1 = d2 = d = 1] c A Bayes factor associated with such a statistic is inconsistent
  160. Toy example: Laplace versus Gauss [1] If T n =

    n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · ·
  161. Toy example: Laplace versus Gauss [1] If T n =

    n−1 n i=1 X4 i , µ1(θ) = 3 + θ4 + 6θ2, µ2(θ) = 6 + · · · q q q q q q q M1 M2 0.30 0.35 0.40 0.45 n=1000 Fourth moment
  162. Toy example: Laplace versus Gauss [0] When T (y) =

    ¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent
  163. Toy example: Laplace versus Gauss [0] When T (y) =

    ¯ y(4) n , ¯ y(6) n and the true distribution is Laplace with mean θ0 = 0, then µ0 = 6, µ1(θ∗ 1 ) = 6 with θ∗ 1 = 2 √ 3 − 3 [d1 = 1 and d2 = 1/2] thus B12 ∼ n−1/4 → 0 : consistent Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2 B12 → +∞ : consistent
  164. Toy example: Laplace versus Gauss [0] When T (y) =

    ¯ y(4) n , ¯ y(6) n 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 posterior probabilities Density Fourth and sixth moments
  165. Checking for adequate statistics After running ABC, i.e. creating reference

    tables of (θi, xi) from both joints, straightforward derivation of ABC estimates ˆ θ1 and ˆ θ2. Evaluation of E1 ˆ θ1 [T(X)] and E2 ˆ θ2 [T(X)] allows for detection of different means under both models via Monte Carlo simulations
  166. Toy example: Laplace versus Gauss q q q q q

    q q q q q q q q q q q q q q q q q q q q q q q q q Gauss Laplace Gauss Laplace 0 10 20 30 40 Normalised χ2 without and with mad
  167. Embedded models When M1 submodel of M2, and if the

    true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n
  168. Embedded models When M1 submodel of M2, and if the

    true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n If summary statistic only informative on a parameter that is the same under both models, i.e if d1 = d2, then c the Bayes factor is not consistent
  169. Embedded models When M1 submodel of M2, and if the

    true distribution belongs to the smaller model M1, Bayes factor is of order v−(d1−d2) n Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true distribution not in M1, then c Bayes factor is consistent only if µ1 = µ2 = µ0
  170. Summary Model selection feasible with ABC Choice of summary statistics

    is paramount At best, ABC → π(. | T (y)) which concentrates around µ0 For estimation : {θ; µ(θ) = µ0} = θ0 For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅
  171. Summary Model selection feasible with ABC Choice of summary statistics

    is paramount At best, ABC → π(. | T (y)) which concentrates around µ0 For estimation : {θ; µ(θ) = µ0} = θ0 For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅
  172. 1 Evidence 2 Importance sampling solutions 3 Cross-model solutions 4

    Nested sampling 5 ABC model choice 6 The Savage–Dickey ratio Measure-theoretic aspects Computational implications
  173. The Savage–Dickey ratio representation Special representation of the Bayes factor

    used for simulation Original version (Dickey, AoMS, 1971)
  174. The Savage–Dickey ratio representation Special representation of the Bayes factor

    used for simulation Original version (Dickey, AoMS, 1971)
  175. Savage’s density ratio theorem Given a test H0 : θ

    = θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ) then B01 = π1(θ0|x) π1(θ0) , with the obvious notations π1(θ) = π1(θ, ψ)dψ , π1(θ|x) = π1(θ, ψ|x)dψ , [Dickey, 1971; Verdinelli & Wasserman, 1995]
  176. Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ)

    = f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]
  177. Rephrased “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ)

    = f1(x|θ, φ = φ0), f0(x) = f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) , i.e., the denumerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average of the values of f1(x|φ) for φ = φ0, weighted by the prior distribution f1(φ) under the augmented model. Applying Bayes’ theorem to the right-hand side of [the above] we get f0(x) = f1(φ0|x)f1(x) f1(φ0) and hence the Bayes factor is given by B = f0(x) f1(x) = f1(φ0|x) f1(φ0) . the ratio of the posterior to prior densities at φ = φ0 under the augmented model.” [O’Hagan & Forster, 1996]
  178. Measure-theoretic difficulty Representation depends on the choice of versions of

    conditional densities: B01 = π0(ψ)f(x|θ0, ψ) dψ π1(θ, ψ)f(x|θ, ψ) dψdθ [by definition] = π1(ψ|θ0)f(x|θ0, ψ) dψ π1(θ0) π1(θ, ψ)f(x|θ, ψ) dψdθ π1(θ0) [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)] = π1(θ0, ψ)f(x|θ0, ψ) dψ m1(x)π1(θ0) [specific version of π1(θ0, ψ)] = π1(θ0|x) π1(θ0) [version dependent]
  179. Measure-theoretic difficulty Representation depends on the choice of versions of

    conditional densities: B01 = π0(ψ)f(x|θ0, ψ) dψ π1(θ, ψ)f(x|θ, ψ) dψdθ [by definition] = π1(ψ|θ0)f(x|θ0, ψ) dψ π1(θ0) π1(θ, ψ)f(x|θ, ψ) dψdθ π1(θ0) [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)] = π1(θ0, ψ)f(x|θ0, ψ) dψ m1(x)π1(θ0) [specific version of π1(θ0, ψ)] = π1(θ0|x) π1(θ0) [version dependent]
  180. Choice of density version c Dickey’s (1971) condition is not

    a condition: If π1(θ0|x) π1(θ0) = π0(ψ)f(x|θ0, ψ) dψ m1(x) is chosen as a version, then Savage–Dickey’s representation holds
  181. Choice of density version c Dickey’s (1971) condition is not

    a condition: If π1(θ0|x) π1(θ0) = π0(ψ)f(x|θ0, ψ) dψ m1(x) is chosen as a version, then Savage–Dickey’s representation holds
  182. Savage–Dickey paradox Verdinelli-Wasserman extension: B01 = π1(θ0|x) π1(θ0) Eπ1(ψ|x,θ0,x) π0(ψ)

    π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on specific versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]
  183. Savage–Dickey paradox Verdinelli-Wasserman extension: B01 = π1(θ0|x) π1(θ0) Eπ1(ψ|x,θ0,x) π0(ψ)

    π1(ψ|θ0) similarly depends on choices of versions... ...but Monte Carlo implementation relies on specific versions of all densities without making mention of it [Chen, Shao & Ibrahim, 2000]
  184. A computational exploitation Starting from the (instrumental) prior ˜ π1(θ,

    ψ) = π1(θ)π0(ψ) define the associated posterior ˜ π1(θ, ψ|x) = π0(ψ)π1(θ)f(x|θ, ψ) ˜ m1(x) and impose the choice of version ˜ π1(θ0|x) π0(θ0) = π0(ψ)f(x|θ0, ψ) dψ ˜ m1(x) Then B01 = ˜ π1(θ0|x) π1(θ0) ˜ m1(x) m1(x)
  185. A computational exploitation Starting from the (instrumental) prior ˜ π1(θ,

    ψ) = π1(θ)π0(ψ) define the associated posterior ˜ π1(θ, ψ|x) = π0(ψ)π1(θ)f(x|θ, ψ) ˜ m1(x) and impose the choice of version ˜ π1(θ0|x) π0(θ0) = π0(ψ)f(x|θ0, ψ) dψ ˜ m1(x) Then B01 = ˜ π1(θ0|x) π1(θ0) ˜ m1(x) m1(x)
  186. First ratio If (θ(1), ψ(1)), . . . , (θ(T),

    ψ(T)) ∼ ˜ π(θ, ψ|x), then 1 T t ˜ π1(θ0|x, ψ(t)) converges to ˜ π1(θ0|x) provided the right version is used in θ0 ˜ π1(θ0|x, ψ) = π1(θ0)f(x|θ0, ψ) π1(θ)f(x|θ, ψ) dθ
  187. Rao–Blackwellisation with latent variables When ˜ π1(θ0|x, ψ) unavailable, replace

    with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .
  188. Rao–Blackwellisation with latent variables When ˜ π1(θ0|x, ψ) unavailable, replace

    with 1 T T t=1 ˜ π1(θ0|x, z(t), ψ(t)) via data completion by latent variable z such that f(x|θ, ψ) = ˜ f(x, z|θ, ψ) dz and that ˜ π1(θ, ψ, z|x) ∝ π0(ψ)π1(θ) ˜ f(x, z|θ, ψ) available in closed form, including the normalising constant, based on version ˜ π1(θ0|x, z, ψ) π1(θ0) = ˜ f(x, z|θ0, ψ) ˜ f(x, z|θ, ψ)π1(θ) dθ .
  189. Bridge revival (1) Since ˜ m1(x)/m1(x) is unknown, apparent failure!

    Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.
  190. Bridge revival (1) Since ˜ m1(x)/m1(x) is unknown, apparent failure!

    Use of the bridge identity E˜ π1(θ,ψ|x) π1(θ, ψ)f(x|θ, ψ) π0(ψ)π1(θ)f(x|θ, ψ) = E˜ π1(θ,ψ|x) π1(ψ|θ) π0(ψ) = m1(x) ˜ m1(x) to (biasedly) estimate ˜ m1(x)/m1(x) by T T t=1 π1(ψ(t)|θ(t)) π0(ψ(t)) based on the same sample from ˜ π1.
  191. Bridge revival (2) Alternative identity Eπ1(θ,ψ|x) π0(ψ)π1(θ)f(x|θ, ψ) π1(θ, ψ)f(x|θ,

    ψ) = Eπ1(θ,ψ|x) π0(ψ) π1(ψ|θ) = ˜ m1(x) m1(x) suggests using a second sample (¯ θ(t), ¯ ψ(t), z(t)) ∼ π1(θ, ψ|x) and the ratio estimate 1 T T t=1 π0( ¯ ψ(t)) π1( ¯ ψ(t)|¯ θ(t)) Resulting unbiased estimate: B01 = 1 T t ˜ π1(θ0|x, z(t), ψ(t)) π1(θ0) 1 T T t=1 π0( ¯ ψ(t)) π1( ¯ ψ(t)|¯ θ(t))
  192. Bridge revival (2) Alternative identity Eπ1(θ,ψ|x) π0(ψ)π1(θ)f(x|θ, ψ) π1(θ, ψ)f(x|θ,

    ψ) = Eπ1(θ,ψ|x) π0(ψ) π1(ψ|θ) = ˜ m1(x) m1(x) suggests using a second sample (¯ θ(t), ¯ ψ(t), z(t)) ∼ π1(θ, ψ|x) and the ratio estimate 1 T T t=1 π0( ¯ ψ(t)) π1( ¯ ψ(t)|¯ θ(t)) Resulting unbiased estimate: B01 = 1 T t ˜ π1(θ0|x, z(t), ψ(t)) π1(θ0) 1 T T t=1 π0( ¯ ψ(t)) π1( ¯ ψ(t)|¯ θ(t))
  193. Difference with Verdinelli–Wasserman representation The above leads to the representation

    B01 = ˜ π1(θ0|x) π1(θ0) Eπ1(θ,ψ|x) π0(ψ) π1(ψ|θ) shows how our approach differs from Verdinelli and Wasserman’s B01 = π1(θ0|x) π1(θ0) Eπ1(ψ|x,θ0,x) π0(ψ) π1(ψ|θ0) [for referees only!!]
  194. Diabetes in Pima Indian women (cont’d) Comparison of the variation

    of the Bayes factor approximations based on 100 replicas for 20, 000 simulations for a simulation from the above importance, Chib’s, Savage–Dickey’s and bridge samplers q q IS Chib Savage−Dickey Bridge 2.8 3.0 3.2 3.4
  195. Other things I could have mentioned use of particle filters

    [Del Moral et al., 2006] path sampling d dθ log Z(θ) = Eθ d dθ log q(ω|θ) [Ogata, 1989; Gelman & Meng, 1998] multicanonical algorithms [Marinari & Parisi, 1992; Geyer & Thompson, 1995]