
Statistical Frontiers, Warwick, Jan. 30, 2014

Course on the approximation of Bayes factors by an array of numerical methods

Xi'an
January 23, 2014
Transcript

  1. On some computational methods for Bayesian
    model choice
    Christian P. Robert
    University of Warwick and Université Paris Dauphine
    http://www.ceremade.dauphine.fr/~xian
    Statistical Frontiers, January 30, 2014

  2. Outline
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  3. Formal construction of Bayes tests
    Definition (Test)
    Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
    statistical model, a test is a statistical procedure that takes its
    values in {0, 1}.
    Theorem (Bayes test)
    The Bayes estimator associated with π and with the 0–1 loss is
    δπ(x) = 1 if π(θ ∈ Θ0 | x) > π(θ ∈ Θ0ᶜ | x), and 0 otherwise.

  5. Bayes factor
    Definition (Bayes factors)
    For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior
    π(Θ0)π0(θ) + π(Θ0ᶜ)π1(θ) ,
    the central quantity is
    B01(x) = [π(Θ0|x)/π(Θ0ᶜ|x)] ÷ [π(Θ0)/π(Θ0ᶜ)]
           = ∫_Θ0 f(x|θ)π0(θ) dθ / ∫_Θ0ᶜ f(x|θ)π1(θ) dθ
           = m0(x)/m1(x)
    [Jeffreys, 1939]

  6. Model choice and model comparison
    Choice between models
    Several models available for the same observation x
    Mi : x ∼ fi(x|θi), i ∈ I
    where the family I can be finite or infinite
    Identical setup: Replace hypotheses with models but keep
    marginal likelihoods and Bayes factors

  8. Bayesian model choice
    Probabilise the entire model/parameter space
    allocate probabilities pi to all models Mi
    define priors πi(θi) for each parameter space Θi
    compute
    π(Mi|x) = pi ∫_Θi fi(x|θi)πi(θi) dθi / Σj pj ∫_Θj fj(x|θj)πj(θj) dθj
    take largest π(Mi|x) to determine the “best” model,
    or use the averaged predictive
    Σj π(Mj|x) ∫_Θj fj(x′|θj)πj(θj|x) dθj

  12. Evidence
    All these problems end up with a similar quantity, the evidence
    Zk = ∫_Θk πk(θk)Lk(θk) dθk ,
    aka the marginal likelihood.

  13. Importance sampling
    1 Evidence
    2 Importance sampling solutions
    Regular importance
    Harmonic means
    Chib’s solution
    Auxiliary variables
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  14. Bayes factor approximation
    When approximating the Bayes factor
    B01 = ∫_Θ0 f0(x|θ0)π0(θ0) dθ0 / ∫_Θ1 f1(x|θ1)π1(θ1) dθ1
    use of importance functions ϖ0 and ϖ1 and
    B̂01 = [n0⁻¹ Σ_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i)] / [n1⁻¹ Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i)]
    on the corresponding parameter spaces (may be of different
    dimensions!)

  15. Default importance
    A typical choice for ϖ in Bayesian settings is to take
    θ ∼ N(θ̂_MLE, σ̂²_MLE)
    [Marin & Robert, 2009; Wood, 2010]
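
As a concrete illustration of the two slides above, here is a minimal Python sketch (not part of the deck) of the importance-sampling evidence estimate with the N(θ̂_MLE, σ̂²_MLE) importance function; the model, prior and sample sizes are arbitrary choices made for the demo.

```python
import numpy as np
from scipy import stats

# Importance-sampling estimate of Z = ∫ f(x|θ)π(θ)dθ for a toy model
# x_i ~ N(θ, 1), θ ~ N(0, 10), using the default N(θ̂_MLE, σ̂²_MLE) proposal.
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)                     # simulated "observed" data

def log_lik(theta):                                   # log f(x|θ) for a vector θ
    return stats.norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)

def log_prior(theta):                                 # θ ~ N(0, 10)
    return stats.norm.logpdf(theta, loc=0.0, scale=np.sqrt(10.0))

theta_mle, se_mle = x.mean(), 1.0 / np.sqrt(len(x))   # MLE and its standard error
theta_sim = rng.normal(theta_mle, se_mle, size=100_000)
log_w = (log_lik(theta_sim) + log_prior(theta_sim)
         - stats.norm.logpdf(theta_sim, theta_mle, se_mle))
log_Z = np.logaddexp.reduce(log_w) - np.log(len(log_w))   # log of the average weight
print("importance-sampling estimate of log Z:", log_Z)
```

Running the same recipe on two competing models and taking the ratio of the two estimates gives the B̂01 of the previous slide.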

  16. Bridge sampling
    Special case:
    If
    π1(θ1|x) ∝ π̃1(θ1|x) and π2(θ2|x) ∝ π̃2(θ2|x)
    live on the same space (Θ1 = Θ2), then
    B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x) ,  θi ∼ π2(θ|x)
    [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

  17. Bridge sampling variance
    The bridge sampling estimator does poorly if
    var(B̂12)/B12² ≈ (1/n) E[ { (π1(θ) − π2(θ)) / π2(θ) }² ]
    is large, i.e. if π1 and π2 have little overlap...

  19. (Further) bridge sampling
    A remarkable property
    B12 = ∫ π̃2(θ|x)α(θ)π1(θ|x) dθ / ∫ π̃1(θ|x)α(θ)π2(θ|x) dθ ,  ∀ α(·)
        ≈ [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x)α(θ1i) ] / [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x)α(θ2i) ] ,  θji ∼ πj(θ|x)

  20. An infamous example
    When
    α(θ) = 1 / π̃1(θ)π̃2(θ)
    harmonic mean approximation to B12
    B̂21 = [ (1/n1) Σ_{i=1}^{n1} 1/π̃1(θ1i|x) ] / [ (1/n2) Σ_{i=1}^{n2} 1/π̃2(θ2i|x) ] ,  θji ∼ πj(θ|x)
    [Newton & Raftery, 1994]
    Infamous: Most often leads to an infinite variance!!!
    [Radford Neal’s blog, 2008]

  22. “The Worst Monte Carlo Method Ever”
    “The good news is that the Law of Large Numbers guarantees that
    this estimator is consistent, i.e., it will very likely be very close to the
    correct answer if you use a sufficiently large number of points from
    the posterior distribution.
    The bad news is that the number of points required for this
    estimator to get close to the right answer will often be greater
    than the number of atoms in the observable universe. The even
    worse news is that it’s easy for people to not realize this, and to
    naively accept estimates that are nowhere close to the correct
    value of the marginal likelihood.”
    [Radford Neal’s blog, Aug. 23, 2008]
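
To make this warning concrete, here is a small numerical check (not in the deck) on a conjugate normal model where the evidence is known in closed form; the harmonic mean estimate computed from exact posterior draws can be compared with the truth, and it is typically much less reliable than the identity suggests. The model and sample sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

# Harmonic-mean estimate of the evidence for x_i ~ N(θ,1) with prior θ ~ N(0,1),
# where the exact marginal likelihood is available in closed form.
rng = np.random.default_rng(1)
n, T = 20, 100_000
x = rng.normal(0.5, 1.0, size=n)

# Exact evidence: under the prior predictive, x ~ N(0, I + 11')
exact_log_Z = stats.multivariate_normal.logpdf(x, mean=np.zeros(n),
                                               cov=np.eye(n) + np.ones((n, n)))

# Posterior is N(post_mean, post_var); sample from it directly
post_var = 1.0 / (n + 1.0)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)
log_lik = stats.norm.logpdf(x[:, None], loc=theta, scale=1.0).sum(axis=0)

# Harmonic mean: 1 / average of 1/L(θ_t) over posterior draws
log_Z_hm = -(np.logaddexp.reduce(-log_lik) - np.log(T))
print("exact log Z        :", exact_log_Z)
print("harmonic mean est. :", log_Z_hm)   # compare across seeds to see the instability
```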

  24. Optimal bridge sampling
    The optimal choice of auxiliary function is
    α = (n1 + n2) / {n1π1(θ|x) + n2π2(θ|x)}
    leading to
    B̂12 ≈ [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) / {n1π1(θ1i|x) + n2π2(θ1i|x)} ]
        / [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) / {n1π1(θ2i|x) + n2π2(θ2i|x)} ]
    Back later!
    [Gelfand & Dey, 1994]

  25. Optimal bridge sampling (2)
    Reason:
    Var(B̂12)/B12² ≈ (1/n1n2) [ ∫ π1(θ)π2(θ){n1π1(θ) + n2π2(θ)}α(θ)² dθ / {∫ π1(θ)π2(θ)α(θ) dθ}² − 1 ]
    (δ method)
    Drag: dependence on the unknown normalising constants, solved
    iteratively
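
A minimal sketch (not in the deck) of the resulting iterative scheme, in the spirit of Meng and Wong's fixed point, for two unnormalised densities on a common space; the two Gaussian targets and sample sizes are arbitrary choices whose normalising constants are known, so the output can be checked.

```python
import numpy as np
from scipy import stats

# Iterative "optimal bridge" estimate of r = Z1/Z2 for unnormalised q1, q2.
rng = np.random.default_rng(2)
n1 = n2 = 50_000
Z1_true, Z2_true = 2.0, 5.0                      # chosen only to check the answer
q1 = lambda t: Z1_true * stats.norm.pdf(t, 0.0, 1.0)
q2 = lambda t: Z2_true * stats.norm.pdf(t, 1.0, 1.5)
th1 = rng.normal(0.0, 1.0, n1)                   # draws from q1/Z1
th2 = rng.normal(1.0, 1.5, n2)                   # draws from q2/Z2

l1 = q1(th1) / q2(th1)                           # density ratios on sample 1
l2 = q1(th2) / q2(th2)                           # density ratios on sample 2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

r = 1.0                                          # initial guess for Z1/Z2
for _ in range(100):                             # fixed-point iterations
    num = np.mean(l2 / (s1 * l2 + s2 * r))
    den = np.mean(1.0 / (s1 * l1 + s2 * r))
    r = num / den
print("bridge estimate of Z1/Z2:", r, " (truth:", Z1_true / Z2_true, ")")
```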

  27. Ratio importance sampling
    Another fundamental identity:
    B12 = Eϕ[π̃1(θ)/ϕ(θ)] / Eϕ[π̃2(θ)/ϕ(θ)]
    for any density ϕ with sufficiently large support (when Θ1 = Θ2)
    [Torrie & Valleau, 1977]
    Use of a single sample θ1, . . . , θn from ϕ:
    B̂12 = Σ_{i=1}^n π̃1(θi)/ϕ(θi) / Σ_{i=1}^n π̃2(θi)/ϕ(θi)

  29. Ratio importance sampling (2)
    Approximate variance:
    var(B̂12)/B12² ≈ (1/n) Eϕ[ (π1(θ) − π2(θ))² / ϕ(θ)² ]
    Optimal choice:
    ϕ*(θ) = |π1(θ) − π2(θ)| / ∫ |π1(η) − π2(η)| dη
    [Chen, Shao & Ibrahim, 2000]
    Formally better than bridge sampling

  32. Approximating Zk from a posterior sample
    Use of the [harmonic mean] identity
    Eπk[ ϕ(θk) / {πk(θk)Lk(θk)} | x ] = ∫ [ ϕ(θk) / {πk(θk)Lk(θk)} ] [ πk(θk)Lk(θk) / Zk ] dθk = 1/Zk
    no matter what the proposal ϕ(·) is.
    [Gelfand & Dey, 1994; Bartolucci et al., 2006]
    Direct exploitation of the MCMC output

  34. Comparison with regular importance sampling
    Harmonic mean: Constraint opposed to usual importance sampling
    constraints: ϕ(θ) must have lighter (rather than fatter) tails than
    πk(θk)Lk(θk) for the approximation
    Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^(t)) / {πk(θk^(t))Lk(θk^(t))} ]
    to have a finite variance.
    E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
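
A sketch (not in the deck) of the Ẑ1k estimator above with a light-tailed, finite-support ϕ (a normal truncated to a high-posterior region), on the same conjugate normal toy model as before so the exact evidence is available; the truncation level is an arbitrary choice.

```python
import numpy as np
from scipy import stats

# Gelfand–Dey / reciprocal-importance estimate Ẑ = 1 / mean_t[ ϕ(θ_t)/(π(θ_t)L(θ_t)) ]
# with θ_t exact posterior draws and ϕ a truncated-normal proposal.
rng = np.random.default_rng(3)
n, T = 20, 100_000
x = rng.normal(0.5, 1.0, size=n)
exact_log_Z = stats.multivariate_normal.logpdf(
    x, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n)))

post_var = 1.0 / (n + 1.0)
post_mean = post_var * x.sum()
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)   # posterior draws

# ϕ: normal matched to the posterior sample, truncated at ±2.5 sd (finite support)
m, s = theta.mean(), theta.std()
keep = np.abs((theta - m) / s) < 2.5                        # ϕ(θ) = 0 outside
log_phi = stats.truncnorm.logpdf(theta[keep], -2.5, 2.5, loc=m, scale=s)
log_post_unnorm = (stats.norm.logpdf(x[:, None], theta[keep], 1.0).sum(axis=0)
                   + stats.norm.logpdf(theta[keep], 0.0, 1.0))
log_ratio = log_phi - log_post_unnorm                       # ϕ/(πL) at each kept draw
log_Z_gd = -(np.logaddexp.reduce(log_ratio) - np.log(T))    # divide by T, not len(keep)
print("exact log Z      :", exact_log_Z)
print("Gelfand–Dey est. :", log_Z_gd)
```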

  36. Comparison with regular importance sampling (cont’d)
    Compare Ẑ1k with a standard importance sampling approximation
    Ẑ2k = (1/T) Σ_{t=1}^T πk(θk^(t))Lk(θk^(t)) / ϕ(θk^(t))
    where the θk^(t)’s are generated from the density ϕ(·) (with fatter
    tails, like t’s)

  37. Approximating Zk
    using a mixture representation
    Bridge sampling recall
    Design a specific mixture for simulation [importance sampling]
    purposes, with density
    ϕk(θk) ∝ ω1πk(θk)Lk(θk) + ϕ(θk) ,
    where ϕ(·) is arbitrary (but normalised)
    Note: ω1 is not a probability weight


  39. Approximating Z using a mixture representation (cont’d)
    Corresponding MCMC (=Gibbs) sampler
    At iteration t
    1 Take δ(t) = 1 with probability
    ω1πk(θk^(t−1))Lk(θk^(t−1)) / {ω1πk(θk^(t−1))Lk(θk^(t−1)) + ϕ(θk^(t−1))}
    and δ(t) = 2 otherwise;
    2 If δ(t) = 1, generate θk^(t) ∼ MCMC(θk^(t−1), θk) where MCMC(θk, θk′)
    denotes an arbitrary MCMC kernel associated with the posterior
    πk(θk|x) ∝ πk(θk)Lk(θk);
    3 If δ(t) = 2, generate θk^(t) ∼ ϕ(θk) independently

  42. Evidence approximation by mixtures
    Rao-Blackwellised estimate
    ξ̂ = (1/T) Σ_{t=1}^T ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ,
    converges to ω1Zk/{ω1Zk + 1}
    Deduce Ẑ3k from ω1Ẑ3k/{ω1Ẑ3k + 1} = ξ̂, i.e.
    Ẑ3k = Σ_{t=1}^T [ ω1πk(θk^(t))Lk(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ]
        / ω1 Σ_{t=1}^T [ ϕ(θk^(t)) / {ω1πk(θk^(t))Lk(θk^(t)) + ϕ(θk^(t))} ]
    [Bridge sampler redux]
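
A compact sketch (not in the deck) of this mixture scheme and the resulting Ẑ3 on a one-dimensional toy target whose normalising constant is known; ω1, ϕ, and the use of an exact draw in place of a generic MCMC(θ, θ′) kernel are simplifying choices made only for the illustration.

```python
import numpy as np
from scipy import stats

# Mixture-representation estimator Ẑ3 for q(θ) = Z_true · N(θ; 0, 1).
rng = np.random.default_rng(4)
Z_true, w1, T = 4.0, 1.0, 50_000
q = lambda th: Z_true * stats.norm.pdf(th, 0.0, 1.0)   # unnormalised target
phi = stats.norm(0.5, 2.0)                              # arbitrary normalised ϕ

theta = np.empty(T)
th = 0.0
for t in range(T):
    p1 = w1 * q(th)
    if rng.random() < p1 / (p1 + phi.pdf(th)):   # δ = 1: move within the target
        th = rng.normal(0.0, 1.0)                # stands in for an MCMC kernel
    else:                                        # δ = 2: independent draw from ϕ
        th = phi.rvs(random_state=rng)
    theta[t] = th

a = w1 * q(theta) / (w1 * q(theta) + phi.pdf(theta))    # Rao-Blackwellised weights
Z3 = a.sum() / (w1 * (1.0 - a).sum())                   # solves ω1Ẑ/(ω1Ẑ+1) = ξ̂
print("mixture estimate of Z:", Z3, " (truth:", Z_true, ")")
```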

  44. Integration by conditioning
    Use Rao-Blackwell Theorem
    var(E[δ(X)|Y]) ≤ var(δ(X))


  45. Consequence
    If Î(X1, . . . , Xn) is an unbiased estimator of I = Ef[h(X)], with X
    simulated from a joint density f̃(x, y), where
    ∫ f̃(x, y) dy = f(x),
    the estimator
    I* = E_f̃[Î|Y1, . . . , Yn]
    dominates Î(X1, . . . , Xn) variance-wise (and is unbiased)

  46. Chib’s representation
    Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and
    θk ∼ πk(θk),
    Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x)
    Use of an approximation to the posterior
    Zk = mk(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x) .

  48. Case of latent variables
    For missing variable z as in mixture models, natural Rao-Blackwell
    estimate
    π̂k(θk*|x) = (1/T) Σ_{t=1}^T πk(θk*|x, zk^(t)) ,
    where the zk^(t)’s are Gibbs-sampled latent variables

  49. Label switching impact
    A mixture model [special case of missing variable model] is
    invariant under permutations of the indices of the components.
    E.g., mixtures
    0.3N(0, 1) + 0.7N(2.3, 1)
    and
    0.7N(2.3, 1) + 0.3N(0, 1)
    are exactly the same!
    c The component parameters θi are not identifiable
    marginally since they are exchangeable


  51. Connected difficulties
    1 Number of modes of the likelihood of order O(k!):
    c Maximization and even [MCMC] exploration of the
    posterior surface harder
    2 Under exchangeable priors on (θ, p) [prior invariant under
    permutation of the indices], all posterior marginals are
    identical:
    c Posterior expectation of θ1 equal to posterior expectation
    of θ2


  53. License
    Since Gibbs output does not produce exchangeability, the Gibbs
    sampler has not explored the whole parameter space: it lacks the
    energy to switch enough component allocations at once
    [Figure: Gibbs output for the mixture — traces of µi, pi and σi against the
    number of iterations n, and the corresponding pairwise scatter plots]

  54. Label switching paradox
    We should observe the exchangeability of the components [label
    switching] to conclude about convergence of the Gibbs sampler.
    If we observe it, then we do not know how to estimate the
    parameters.
    If we do not, then we are uncertain about the convergence!!!


  57. Compensation for label switching
    For mixture models, zk^(t) usually fails to visit all configurations in a
    balanced way, despite the symmetry predicted by the theory
    πk(θk|x) = πk(σ(θk)|x) = (1/k!) Σ_{σ∈Sk} πk(σ(θk)|x)
    for all σ’s in Sk, set of all permutations of {1, . . . , k}.
    Consequences on numerical approximation, biased by an order k!
    Recover the theoretical symmetry by using
    π̂k(θk*|x) = (1/(T k!)) Σ_{σ∈Sk} Σ_{t=1}^T πk(σ(θk*)|x, zk^(t)) .
    [Berkhof, Mechelen, & Gelman, 2003]
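
A sketch (not in the deck) combining Chib's representation with the permutation averaging above, on a deliberately simple two-component normal mixture with equal known weights and unit variances so that the Gibbs sampler and the conditional density π(µ*|x, z) stay short; the prior, data and run lengths are arbitrary choices.

```python
import numpy as np
from scipy import stats
from itertools import permutations

# Chib's evidence approximation, with and without label-switching correction,
# for x_i ~ 0.5 N(mu_1, 1) + 0.5 N(mu_2, 1), mu_j ~ N(0, tau2).
rng = np.random.default_rng(5)
n, T, tau2 = 100, 5_000, 25.0
x = np.concatenate([rng.normal(0.0, 1.0, n // 2), rng.normal(2.3, 1.0, n // 2)])

def gibbs_step(mu):
    # allocations z_i | mu, x
    logp = stats.norm.logpdf(x[:, None], loc=mu, scale=1.0)      # shape (n, 2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    z = (rng.random(n) < p[:, 1] / p.sum(axis=1)).astype(int)
    # means mu_j | z, x (conjugate normal updates)
    new_mu = np.empty(2)
    for j in range(2):
        nj, sj = (z == j).sum(), x[z == j].sum()
        v = 1.0 / (nj + 1.0 / tau2)
        new_mu[j] = rng.normal(v * sj, np.sqrt(v))
    return new_mu, z

def cond_logpdf(mu_star, z):
    # log pi(mu* | x, z): product of the two conditional normal densities
    out = 0.0
    for j in range(2):
        nj, sj = (z == j).sum(), x[z == j].sum()
        v = 1.0 / (nj + 1.0 / tau2)
        out += stats.norm.logpdf(mu_star[j], v * sj, np.sqrt(v))
    return out

mu = np.array([-1.0, 1.0])
draws, zs = [], []
for t in range(T):
    mu, z = gibbs_step(mu)
    draws.append(mu.copy()); zs.append(z.copy())
mu_star = np.mean(draws[T // 2:], axis=0)                 # plug-in point theta*

def log_mean_exp(v): return np.logaddexp.reduce(v) - np.log(len(v))

# Chib: log Z = log f(x|mu*) + log pi(mu*) - log pihat(mu*|x)
log_lik = np.log(0.5 * stats.norm.pdf(x[:, None], mu_star, 1.0).sum(axis=1)).sum()
log_prior = stats.norm.logpdf(mu_star, 0.0, np.sqrt(tau2)).sum()
plain = log_mean_exp(np.array([cond_logpdf(mu_star, z) for z in zs]))
symm = log_mean_exp(np.array([cond_logpdf(np.array(p), z)
                              for z in zs for p in permutations(mu_star)]))
print("Chib, plain      :", log_lik + log_prior - plain)
print("Chib, symmetrised:", log_lik + log_prior - symm)
```

When the Gibbs sampler never switches labels, the two estimates typically differ by roughly log k!, which is exactly the effect quantified on the Galaxy dataset below.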

  59. Galaxy dataset
    n = 82 galaxies as a mixture of k normal distributions with both
    mean and variance unknown.
    [Roeder, 1992]
    [Figure: relative-frequency histogram of the data with the average density estimate overlaid]

  60. Galaxy dataset (k)
    Using only the original estimate, with θk* as the MAP estimator,
    log(m̂k(x)) = −105.1396
    for k = 3 (based on 10³ simulations), while introducing the
    permutations leads to
    log(m̂k(x)) = −103.3479
    Note that
    −105.1396 + log(3!) = −103.3479
    k        2        3        4        5        6        7        8
    mk(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44
    Estimations of the marginal likelihoods by the symmetrised Chib’s
    approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100
    permutations selected at random in Sk).
    [Lee, Marin, Mengersen & Robert, 2008]

  63. Auxiliary variables
    The auxiliary variable method avoids computation of the intractable
    constant(s) in the likelihood function
    f(y|θ) = Zθ f̃(y|θ)
    Introduce pseudo-data y′ with artificial target g(y′|θ, y)
    Generate θ′ ∼ K(θ, θ′) and y′ ∼ f(y|θ′)
    Accept with probability
    [π(θ′)f(y|θ′)g(y′|θ′, y) / π(θ)f(y|θ)g(y|θ, y)] × [K(θ′, θ)f(y|θ) / K(θ, θ′)f(y′|θ′)] ∧ 1
    which, the normalising constants cancelling, only involves f̃:
    [π(θ′)f̃(y|θ′)g(y′|θ′, y) / π(θ)f̃(y|θ)g(y|θ, y)] × [K(θ′, θ)f̃(y|θ) / K(θ, θ′)f̃(y′|θ′)] ∧ 1
    For Gibbs random fields, existence of a genuine sufficient statistic η(y).
    [Møller, Pettitt, Berthelsen, & Reeves, 2006]

  67. Cross-model solutions
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    Reversible jump
    Saturation schemes
    Implementation error
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio


  68. Reversible jump
    irreversible jump
    Idea: Set up a proper measure–theoretic framework for designing
    moves between models Mk
    [Green, 1995]
    Create a reversible kernel K on H = ∪k {k} × Θk such that
    ∫_A ∫_B K(x, dy)π(x) dx = ∫_B ∫_A K(y, dx)π(y) dy
    for the invariant density π [x is of the form (k, θ(k))]

  70. Local moves
    For a move between two models, M1 and M2, the Markov chain
    being in state θ1 ∈ M1, denote by K1→2(θ1, dθ) and K2→1(θ2, dθ)
    the corresponding kernels, under the detailed balance condition
    π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) ,
    and take, wlog, dim(M2) > dim(M1).
    Proposal expressed as
    θ2 = Ψ1→2(θ1, v1→2)
    where v1→2 is a random variable of dimension
    dim(M2) − dim(M1), generated as
    v1→2 ∼ ϕ1→2(v1→2) .


  72. Local moves (2)
    In this case, q1→2(θ1, dθ2) has density
    ϕ1→2(v1→2) |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹ ,
    by the Jacobian rule.
    Reverse importance link
    If ϖ1→2 is the probability of choosing a move to M2 while in M1,
    the acceptance probability reduces to
    α(θ1, v1→2) = 1 ∧ [π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)| .
    If several models are considered simultaneously, with probability
    ϖ1→2 of choosing a move to M2 while in M1, as in
    K(x, B) = Σm ∫_B ρm(x, y)qm(x, dy) + ω(x)I_B(x)

  75. Generic reversible jump acceptance probability
    Acceptance probability of θ2 = Ψ1→2(θ1, v1→2) is
    α(θ1, v1→2) = 1 ∧ [π(M2, θ2) ϖ2→1 / {π(M1, θ1) ϖ1→2 ϕ1→2(v1→2)}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|
    while the acceptance probability of θ1 with (θ1, v1→2) = Ψ1→2⁻¹(θ2) is
    α(θ1, v1→2) = 1 ∧ [π(M1, θ1) ϖ1→2 ϕ1→2(v1→2) / {π(M2, θ2) ϖ2→1}] |∂Ψ1→2(θ1, v1→2)/∂(θ1, v1→2)|⁻¹
    c Difficult calibration

  78. Green’s sampler
    Algorithm
    Iteration t (t ≥ 1): if x(t) = (m, θ(m)),
    1 Select model Mn with probability πmn
    2 Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn)
    3 Take x(t+1) = (n, θ(n)) with probability
    min{ [π(n, θ(n)) / π(m, θ(m))] [πnm ϕnm(vnm) / πmn ϕmn(umn)] |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 }
    and take x(t+1) = x(t) otherwise.
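
A toy reversible-jump sketch (not in the deck), with one move type and the dimension-matching variable drawn from the prior of the larger model, so the Jacobian is 1 and the acceptance ratio simplifies; the two models, the data and the chain length are arbitrary choices, and the exact model probability is available for checking.

```python
import numpy as np
from scipy import stats

# M1: x_i ~ N(0,1), no parameter, vs M2: x_i ~ N(θ,1), θ ~ N(0,1); equal weights.
rng = np.random.default_rng(6)
n, T = 10, 100_000
x = rng.normal(0.7, 1.0, size=n)

log_f1 = stats.norm.logpdf(x, 0.0, 1.0).sum()                     # f1(x)
log_f2 = lambda th: stats.norm.logpdf(x, th, 1.0).sum()           # f2(x|θ)
post_var = 1.0 / (n + 1.0); post_mean = post_var * x.sum()         # θ|x under M2

model, theta, in_m2 = 1, 0.0, 0
for t in range(T):
    if rng.random() < 0.5:                      # attempt a between-model jump
        if model == 1:
            u = rng.normal(0.0, 1.0)            # u ~ prior of M2, θ2 = u, |Jacobian| = 1
            if np.log(rng.random()) < log_f2(u) - log_f1:
                model, theta = 2, u
        else:
            if np.log(rng.random()) < log_f1 - log_f2(theta):
                model = 1
    elif model == 2:                            # within-model move (exact draw here)
        theta = rng.normal(post_mean, np.sqrt(post_var))
    in_m2 += (model == 2)

log_m1 = log_f1                                                    # evidence of M1
log_m2 = stats.multivariate_normal.logpdf(x, np.zeros(n),
                                          np.eye(n) + np.ones((n, n)))
print("RJ estimate of P(M2|x):", in_m2 / T)
print("exact       P(M2|x)   :", 1.0 / (1.0 + np.exp(log_m1 - log_m2)))
```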

  79. Interpretation
    The representation puts us back in a fixed dimension setting:
    M1 × V1→2 and M2 in one-to-one relation.
    reversibility imposes that θ1 is derived as
    (θ1, v1→2) = Ψ1→2⁻¹(θ2)
    appears like a regular Metropolis–Hastings move from the
    couple (θ1, v1→2) to θ2 when stationary distributions are
    π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal
    distribution is deterministic

  81. Alternative
    Saturation of the parameter space H = ∪k {k} × Θk by creating
    θ = (θ1, . . . , θD)
    a model index M
    pseudo-priors πj(θj|M = k) for j ≠ k
    [Carlin & Chib, 1995]
    Validation by
    P(M = k|x) = ∫ P(M = k|x, θ)π(θ|x) dθ = Zk
    where the (marginal) posterior is [not πk]
    π(θ|x) = Σ_{k=1}^D P(θ, M = k|x)
           = Σ_{k=1}^D pk Zk πk(θk|x) Π_{j≠k} πj(θj|M = k) .

  83. MCMC implementation
    Run a Markov chain (M(t), θ1^(t), . . . , θD^(t)) with stationary
    distribution π(θ, M|x) by
    1 Pick M(t) = k with probability π(θ(t−1), k|x)
    2 Generate θk^(t) from the posterior πk(θk|x) [or MCMC step]
    3 Generate θj^(t) (j ≠ k) from the pseudo-prior πj(θj|M = k)
    Approximate P(M = k|x) = Zk by
    p̌k(x) ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) πk(θk^(t)) Π_{j≠k} πj(θj^(t)|M = k)
                          / Σ_{ℓ=1}^D pℓ fℓ(x|θℓ^(t)) πℓ(θℓ^(t)) Π_{j≠ℓ} πj(θj^(t)|M = ℓ) ]

  86. Scott’s (2002) proposal
    Suggest estimating P(M = k|x) by
    Zk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t)) ] ,
    based on D simultaneous and independent MCMC chains
    (θk^(t))t , 1 ≤ k ≤ D ,
    with stationary distributions πk(θk|x) [instead of above joint]

  88. Congdon’s (2006) extension
    Selecting flat [prohibited] pseudo-priors, uses instead
    Zk ∝ pk Σ_{t=1}^T [ fk(x|θk^(t))πk(θk^(t)) / Σ_{j=1}^D pj fj(x|θj^(t))πj(θj^(t)) ] ,
    where again the θk^(t)’s are MCMC chains with stationary
    distributions πk(θk|x)

  89. Examples
    Example (Model choice)
    Model M1 : x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model
    M2 : x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on
    both models: p1 = p2 = 0.5.
    Approximations of P(M = 1|x): Scott’s (2002) (blue), and
    Congdon’s (2006) (red) [N = 10⁶ simulations].
    [Figure: both approximations of P(M = 1|x) plotted against y]
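
A sketch (not in the deck) of what the two approximations compute on this very example, for a single observation; the exact P(M = 1|x) is available in closed form here, so the three numbers can be compared directly.

```python
import numpy as np
from scipy import stats, special

# M1: x|θ ~ U(0,θ), M2: x|θ ~ Exp(θ), both with θ ~ Exp(1) priors, p1 = p2 = 0.5.
rng = np.random.default_rng(7)
x, T = 1.0, 100_000                               # a single observation

# Exact: m1(x) = ∫_x^∞ θ^{-1} e^{-θ} dθ = E1(x),  m2(x) = 1/(1+x)^2
m1, m2 = special.exp1(x), 1.0 / (1.0 + x) ** 2
print("exact P(M=1|x) :", m1 / (m1 + m2))

# Posterior draws under M2: θ|x ~ Gamma(2, rate = 1 + x)
th2 = rng.gamma(2.0, 1.0 / (1.0 + x), size=T)
# Posterior draws under M1: π1(θ|x) ∝ e^{-θ}/θ on (x, ∞), by rejection from a
# shifted Exp(1) proposal (accept with probability x/θ)
th1 = np.empty(T); filled = 0
while filled < T:
    prop = x + rng.exponential(1.0, size=2 * (T - filled))
    acc = prop[rng.random(len(prop)) < x / prop][: T - filled]
    th1[filled:filled + len(acc)] = acc; filled += len(acc)

f1 = 1.0 / th1                                    # f1(x|θ1^(t)) = 1/θ on θ > x
f2 = th2 * np.exp(-th2 * x)                       # f2(x|θ2^(t)) = θ e^{-θx}
pi1, pi2 = np.exp(-th1), np.exp(-th2)             # Exp(1) prior densities

# Scott (2002): Zk ∝ pk Σ_t fk(x|θk^(t)) / Σ_j pj fj(x|θj^(t))
s1 = 0.5 * np.sum(f1 / (0.5 * f1 + 0.5 * f2))
s2 = 0.5 * np.sum(f2 / (0.5 * f1 + 0.5 * f2))
print("Scott approx.  :", s1 / (s1 + s2))

# Congdon (2006): Zk ∝ pk Σ_t fk πk / Σ_j pj fj πj
c1 = 0.5 * np.sum(f1 * pi1 / (0.5 * f1 * pi1 + 0.5 * f2 * pi2))
c2 = 0.5 * np.sum(f2 * pi2 / (0.5 * f1 * pi1 + 0.5 * f2 * pi2))
print("Congdon approx.:", c1 / (c1 + c2))
```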

  91. Examples (2)
    Example (Model choice (2))
    Normal model M1 : x ∼ N(θ, 1) with θ ∼ N(0, 1) vs. normal
    model M2 : x ∼ N(θ, 1) with θ ∼ N(5, 1)
    Comparison of both approximations with P(M = 1|x): Scott’s (2002)
    (green and mixed dashes) and Congdon’s (2006) (brown and long
    dashes) [N = 10⁴ simulations].
    [Figure: approximations of P(M = 1|x) against y]

  92. Examples (3)
    Example (Model choice (3))
    Model M1 : x ∼ N(0, 1/ω) with ω ∼ Exp(a) vs.
    M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b).
    Comparison of Congdon’s (2006) (brown and dashed lines) with
    P(M = 1|x) when (a, b) is equal to (.24, 8.9), (.56, .7), (4.1, .46)
    and (.98, .081), resp. [N = 10⁴ simulations].
    [Figure: four panels of P(M = 1|x) against y, one per (a, b) setting]

  93. Nested sampling
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    Purpose
    Implementation
    Error rates
    Constraints
    Importance variant
    A mixture comparison
    5 ABC model choice


  94. Nested sampling: Goal
    Skilling’s (2007) technique using the one-dimensional
    representation:
    Z = Eπ[L(θ)] = ∫_0^1 ϕ(x) dx
    with
    ϕ⁻¹(l) = Pπ(L(θ) > l).
    Note: ϕ(·) is intractable in most cases.

  95. Nested sampling: First approximation
    Approximate Z by a Riemann sum:
    Ẑ = Σ_{i=1}^N (xi−1 − xi) ϕ(xi)
    where the xi’s are either
    deterministic: xi = e^{−i/N}
    or random:
    x0 = 1, xi+1 = ti xi, ti ∼ Be(N, 1)
    so that E[log xi] = −i/N.

  96. Extraneous white noise
    Take
    Z = ∫ e^{−θ} dθ = ∫ δ⁻¹ e^{−(1−δ)θ} · δe^{−δθ} dθ = Eδ[ δ⁻¹ e^{−(1−δ)θ} ]
    Ẑ = (1/N) Σ_{i=1}^N δ⁻¹ e^{−(1−δ)θi} (xi−1 − xi) , θi ∼ E(δ) I(θi ≤ θi−1)
    N     deterministic   random
    50    4.64            10.5
          4.65            10.5
    100   2.47            4.9
          2.48            5.02
    500   .549            1.01
          .550            1.14
    Comparison of variances and MSEs

  98. Nested sampling: Alternative representation
    Another representation is
    Z = Σ_{i=0}^{N−1} {ϕ(xi+1) − ϕ(xi)} xi
    which is a special case of
    Z = Σ_{i=0}^{N−1} {L(θ(i+1)) − L(θ(i))} π({θ; L(θ) > L(θ(i))})
    where · · · L(θ(i+1)) < L(θ(i)) · · ·
    [Lebesgue version of Riemann’s sum]

  100. Nested sampling: Second approximation
    Estimate (intractable) ϕ(xi) by ϕi:
    Nested sampling
    Start with N values θ1, . . . , θN sampled from π
    At iteration i,
    1 Take ϕi = L(θk), where θk is the point with smallest
    likelihood in the pool of θi’s
    2 Replace θk with a sample from the prior constrained to
    L(θ) > ϕi: the current N points are sampled from the prior
    constrained to L(θ) > ϕi.
    Note that
    π({θ; L(θ) > L(θ(i+1))}) / π({θ; L(θ) > L(θ(i))}) ≈ 1 − 1/N

  104. Nested sampling: Third approximation
    Iterate the above steps until a given stopping iteration j is
    reached: e.g.,
    observe very small changes in the approximation Ẑ;
    reach the maximal value of L(θ) when the likelihood is
    bounded and its maximum is known;
    truncate the integral Z at level ε, i.e. replace
    ∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx
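
Putting the three approximations together, a minimal nested-sampling sketch (not in the deck) on a one-dimensional toy problem where the constrained-prior simulation can be done exactly and the evidence is known; N, the fixed iteration budget and the model are arbitrary choices.

```python
import numpy as np
from scipy import stats

# Prior θ ~ N(0,1), likelihood L(θ) = N(y; θ, 1) for one observation y,
# so Z = N(y; 0, sqrt(2)) exactly.
rng = np.random.default_rng(8)
y, N = 1.0, 500
L = lambda th: stats.norm.pdf(y, loc=th, scale=1.0)

def sample_prior_constrained(l):
    # L(θ) > l  ⇔  |θ - y| < c(l); draw N(0,1) truncated to that interval
    c = np.sqrt(-2.0 * np.log(l * np.sqrt(2.0 * np.pi)))
    lo, hi = stats.norm.cdf(y - c), stats.norm.cdf(y + c)
    return stats.norm.ppf(lo + rng.random() * (hi - lo))

theta = rng.normal(0.0, 1.0, size=N)         # initial prior sample
logZ_terms, x_prev = [], 1.0
for i in range(1, 6 * N):                    # fixed budget, i.e. ε ≈ e^{-6}
    k = np.argmin(L(theta))
    phi_i = L(theta[k])                      # smallest likelihood in the pool
    x_i = np.exp(-i / N)                     # deterministic prior-volume grid
    logZ_terms.append(np.log(x_prev - x_i) + np.log(phi_i))
    x_prev = x_i
    theta[k] = sample_prior_constrained(phi_i)   # replace the worst point

logZ = np.logaddexp.reduce(logZ_terms)
print("nested sampling log Z:", logZ)
print("exact           log Z:", stats.norm.logpdf(y, 0.0, np.sqrt(2.0)))
```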

  105. Approximation error
    Error = Ẑ − Z
          = Σ_{i=1}^j (xi−1 − xi)ϕi − ∫_0^1 ϕ(x) dx
          = − ∫_0^ε ϕ(x) dx
          + [ Σ_{i=1}^j (xi−1 − xi)ϕ(xi) − ∫_ε^1 ϕ(x) dx ]   (Quadrature Error)
          + Σ_{i=1}^j (xi−1 − xi){ϕi − ϕ(xi)}   (Stochastic Error)
    [Dominated by Monte Carlo]

  106. A CLT for the Stochastic Error
    The (dominating) stochastic error is OP(N^{−1/2}):
    N^{1/2} (Ẑ − Z) → N(0, V) in distribution
    with
    V = − ∫∫_{s,t∈[ε,1]} sϕ′(s) tϕ′(t) log(s ∨ t) ds dt.
    [Proof based on Donsker’s theorem]

  107. What of log Z?
    If the interest lies within log Z, Slutsky’s transform of the CLT:
    N^{1/2} (log Ẑ − log Z) → N(0, V/Z²) in distribution
    Note: The number of simulated points equals the number of
    iterations j, and is a multiple of N: if one stops at the first iteration j
    such that e^{−j/N} < ε, then
    j = N ⌈− log ε⌉.

  110. Sampling from constr’d priors
    Exact simulation from the constrained prior is intractable in most
    cases
    Skilling (2007) proposes to use MCMC but this introduces a bias
    (stopping rule)
    If implementable, then slice sampler can be devised at the same
    cost [or less]


  113. Banana illustration
    Case of a banana target made of a twisted 2D normal:
    x̃2 = x2 + βx1² − 100β
    [Haario, Saksman, Tamminen, 1999]
    β = .03, σ = (100, 1)

  114. Banana illustration (2)
    Use of nested sampling with N = 1000, 50 MCMC steps with size
    0.1, compared with a population Monte Carlo (PMC) based on 10
    iterations with 5000 points per iteration and a final sample of 50000
    points, using nine Student’s t components with 9 df
    [Wraith, Kilbinger, Benabed et al., 2009, Physical Review D]
    [Figures: evidence estimation, E[X1] estimation, and E[X2] estimation]

  117. An IS variant of nested sampling
    Consider an instrumental prior π̃ and likelihood L̃, weight function
    w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ)
    and weighted NS estimator
    Ẑ = Σ_{i=1}^j (xi−1 − xi) ϕi w(θi).
    Then choose (π̃, L̃) so that sampling from π̃ constrained to
    L̃(θ) > l is easy; e.g. N(c, Id) constrained to ‖c − θ‖ < r.

  119. Mixture pN(0, 1) + (1 − p)N(µ, σ) posterior
    Posterior on (µ, σ) for n observations with µ = 2
    and σ = 3/2, when p is known
    Use of a uniform prior both on (−2, 6) for µ
    and on (.001, 16) for log σ²
    occurrences of posterior bursts for µ = xi

  120. Experiment
    MCMC sample for n = 16
    observations from the mixture.
    Nested sampling sequence
    with M = 1000 starting points.


  121. Experiment
    MCMC sample for n = 50
    observations from the mixture.
    Nested sampling sequence
    with M = 1000 starting points.


  122. Comparison
    1 Nested sampling: M = 1000 points, with 10 random walk
    moves at each step, simulations from the constr’d prior and a
    stopping rule at 95% of the observed maximum likelihood
    2 T = 10⁴ MCMC (=Gibbs) simulations producing
    non-parametric estimates ϕ̂
    [Diebolt & Robert, 1990]
    3 Monte Carlo estimates Ẑ1, Ẑ2, Ẑ3 using products of two
    Gaussian kernels
    4 numerical integration based on an 850 × 950 grid [reference
    value, confirmed by Chib’s]

  123. Comparison (cont’d)
    Graph based on a sample of 10 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  124. Comparison (cont’d)
    Graph based on a sample of 50 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  125. Comparison (cont’d)
    Graph based on a sample of 100 observations for µ = 2 and
    σ = 3/2 (150 replicas).


  126. Approximate Bayesian computation
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    ABC method
    Formalised framework
    Consistency results
    Summary statistics
    6 The Savage–Dickey ratio


  127. Approximate Bayesian Computation
    Bayesian setting: target is π(θ)f(x|θ)
    When likelihood f(x|θ) not in closed form, likelihood-free rejection
    technique:
    ABC algorithm
    For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly
    simulating
    θ′ ∼ π(θ) , x ∼ f(x|θ′) ,
    until the auxiliary variable x is equal to the observed value, x = y.
    [Pritchard et al., 1999]

  130. A as approximative
    When y is a continuous random variable, equality x = y is replaced
    with a tolerance condition,
    ρ(x, y) ≤ ε
    where ρ is a distance between summary statistics
    Output distributed from
    π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ|ρ(x, y) < ε)
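
A minimal sketch (not in the deck) of this tolerance version of ABC on a toy normal model, with the sample mean as summary statistic and ε chosen as a small quantile of the simulated distances; the exact posterior is available here for comparison, and simulating the summary directly (rather than the full dataset) is a shortcut valid for this particular model.

```python
import numpy as np

# ABC rejection for θ in x_i ~ N(θ, 1) with prior θ ~ N(0, 10).
rng = np.random.default_rng(9)
n, n_sim, keep_frac = 50, 200_000, 0.001
y = rng.normal(1.0, 1.0, size=n)               # "observed" data
s_obs = y.mean()                               # summary statistic

theta = rng.normal(0.0, np.sqrt(10.0), size=n_sim)        # θ' ~ π(θ)
x_bar = rng.normal(theta, 1.0 / np.sqrt(n))               # summary of x ~ f(x|θ')
dist = np.abs(x_bar - s_obs)                               # ρ(x, y)
eps = np.quantile(dist, keep_frac)                         # tolerance ε
abc_sample = theta[dist <= eps]                            # ABC posterior draws

post_var = 1.0 / (n + 1.0 / 10.0)                          # exact posterior N(m, v)
post_mean = post_var * y.sum()
print("ABC   mean/sd:", abc_sample.mean(), abc_sample.std())
print("exact mean/sd:", post_mean, np.sqrt(post_var))
```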

  132. The starting point
    Central question to the validation of ABC for model choice:
    When is a Bayes factor based on an insufficient statistic
    T(y) consistent?
    Note/warning: c drawn on T(y) through B12^T(y) necessarily differs
    from c drawn on y through B12(y)
    [Marin, Pillai, X, & Rousseau, arXiv, 2012]

  134. A benchmark if toy example
    Comparison suggested by referee of PNAS paper [thanks!]:
    [X, Cornuet, Marin, & Pillai, Aug. 2011]
    Model M1: y ∼ N(θ1, 1) opposed
    to model M2: y ∼ L(θ2, 1/√2), Laplace distribution with mean θ2
    and scale parameter 1/√2 (variance one).
    Four possible statistics:
    1 sample mean y̅ (sufficient for M1 if not M2);
    2 sample median med(y) (insufficient);
    3 sample variance var(y) (ancillary);
    4 median absolute deviation mad(y) = med(|y − med(y)|);
    [Figure: boxplots of ABC posterior probabilities under the Gauss and
    Laplace models, n = 100]
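
A sketch (not in the deck) of ABC model choice on this benchmark, comparing two of the four statistics above; the prior on the location parameter is an assumption made here for the demo, since the slides do not specify it.

```python
import numpy as np

# ABC model choice: data simulated under M1, prior θ ~ N(0, 4) in both models,
# sample mean vs median absolute deviation as summary statistics.
rng = np.random.default_rng(10)
n, n_sim, keep = 100, 50_000, 250
y = rng.normal(0.0, 1.0, size=n)                          # observed data (from M1)

def summaries(data):
    med = np.median(data, axis=-1, keepdims=True)
    return data.mean(axis=-1), np.median(np.abs(data - med), axis=-1)

mean_obs, mad_obs = summaries(y)

m = rng.integers(1, 3, size=n_sim)                        # model index, 1 or 2
theta = rng.normal(0.0, 2.0, size=n_sim)                  # θ ~ N(0, 4)
x = np.empty((n_sim, n))
g = m == 1
x[g] = rng.normal(theta[g, None], 1.0, size=(g.sum(), n))                     # Gaussian
x[~g] = rng.laplace(theta[~g, None], 1 / np.sqrt(2), size=((~g).sum(), n))    # Laplace
mean_sim, mad_sim = summaries(x)

for name, dist in (("mean", np.abs(mean_sim - mean_obs)),
                   ("mad ", np.abs(mad_sim - mad_obs))):
    accepted = m[np.argsort(dist)[:keep]]                 # keep the closest simulations
    print(f"ABC estimate of P(M1|y) with summary {name}:", np.mean(accepted == 1))
```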

  138. Framework
    Starting from sample
    y = (y1, . . . , yn)
    the observed sample, not necessarily iid with true distribution
    y ∼ Pn
    Summary statistics
    T(y) = T^n = (T1(y), T2(y), · · · , Td(y)) ∈ R^d
    with true distribution T^n ∼ Gn.

  139. Framework
    c Comparison of
    – under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ R^p1
    – under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ R^p2
    turned into
    – under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|T^n)
    – under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|T^n)

  140. Assumptions
    A collection of asymptotic “standard” assumptions:
    [A1] is a standard central limit theorem under the true model
    [A2] controls the large deviations of the estimator T n from the
    estimand µ(θ)
    [A3] is the standard prior mass condition found in Bayesian
    asymptotics (di effective dimension of the parameter)
    [A4] restricts the behaviour of the model density against the true
    density
    [Think CLT!]

  141. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A1] There exist
    a sequence {vn} converging to +∞,
    a distribution Q,
    a symmetric, d × d positive definite matrix V0 and
    a vector µ0 ∈ Rd
    such that
    vn V0^{−1/2} (T n − µ0) ⇝ Q as n → ∞, under Gn

  142. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A2] For i = 1, 2, there exist sets Fn,i ⊂ Θi, functions µi(θi) and
    constants εi, τi, αi > 0 such that for all τ > 0,
    sup_{θi∈Fn,i} Gi,n( |T n − µi(θi)| > τ |µi(θi) − µ0| ∧ εi | θi )
    ≲ v_n^{−αi} ( τ |µi(θi) − µ0| ∧ εi )^{−αi}
    with
    πi(F^c_{n,i}) = o(v_n^{−τi}).

  143. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A3] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, defining (u > 0)
    Sn,i(u) = { θi ∈ Fn,i ; |µi(θi) − µ0| ≤ u v_n^{−1} } ,
    there exist constants di < τi ∧ αi − 1 such that
    πi(Sn,i(u)) ∼ u^{di} v_n^{−di} ,   ∀ u ≲ vn

  144. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    [A4] If inf{|µi(θi) − µ0|; θi ∈ Θi} = 0, for any ε > 0, there exist
    U, δ > 0 and (En)n such that, if θi ∈ Sn,i(U),
    En ⊂ {t; gi(t|θi) < δ gn(t)} and Gn(E^c_n) < ε.

  145. Assumptions
    A collection of asymptotic “standard” assumptions:
    [Think CLT!]
    Again (sum-up)
    [A1] is a standard central limit theorem under the true model
    [A2] controls the large deviations of the estimator T n from the
    estimand µ(θ)
    [A3] is the standard prior mass condition found in Bayesian
    asymptotics (di effective dimension of the parameter)
    [A4] restricts the behaviour of the model density against the true
    density

  146. Effective dimension
    Understanding di in [A3]:
    defined only when µ0 ∈ {µi(θi), θi ∈ Θi},
    πi(θi : |µi(θi) − µ0| < n^{−1/2}) = O(n^{−di/2})
    is the effective dimension of the model Θi around µ0

  147. Effective dimension
    Understanding di in [A3]:
    when inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    mi(T n) / gn(T n) ∼ v_n^{−di} ,
    thus
    log( mi(T n) / gn(T n) ) ∼ −di log vn
    and v_n^{−di} penalization factor resulting from integrating θi out (see
    effective number of parameters in DIC)

  148. Effective dimension
    Understanding di in [A3]:
    In regular models, di dimension of T (Θi), leading to BIC
    In non-regular models, di can be smaller
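    Side note, assuming the regular √n case, i.e. vn = √n in [A1]: the bound of the previous slide gives
    log mi(T n) ≈ log gn(T n) − di log vn = log gn(T n) − (di/2) log n ,
    i.e. a BIC-type (di/2) log n penalty, with di the effective dimension rather than the full parameter dimension.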

  149. Asymptotic marginals
    Asymptotically, under [A1]–[A4]
    mi(t) = ∫_{Θi} gi(t|θi) πi(θi) dθi
    is such that
    (i) if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    Cl v_n^{d−di} ≤ mi(T n) ≤ Cu v_n^{d−di}
    and
    (ii) if inf{|µi(θi) − µ0|; θi ∈ Θi} > 0,
    mi(T n) = o_{Pn}[ v_n^{d−τi} + v_n^{d−αi} ].

  150. Within-model consistency
    Under same assumptions,
    if inf{|µi(θi) − µ0|; θi ∈ Θi} = 0,
    the posterior distribution of µi(θi) given T n is consistent at rate
    1/vn provided αi ∧ τi > di.

  151. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!

  152. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!
    Indeed, if
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
    then
    Cl v_n^{−(d1−d2)} ≤ m1(T n) / m2(T n) ≤ Cu v_n^{−(d1−d2)} ,
    where Cl, Cu = O_{Pn}(1), irrespective of the true model.
    c Only depends on the difference d1 − d2: no consistency

  153. Between-model consistency
    Consequence of above is that asymptotic behaviour of the Bayes
    factor is driven by the asymptotic mean value of T n under both
    models. And only by this mean value!
    Else, if
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
    then
    m1(T n) / m2(T n) ≥ Cu min( v_n^{−(d1−α2)} , v_n^{−(d1−τ2)} )

  154. Consistency theorem
    If
    inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0,
    Bayes factor
    B^T_12 = O( v_n^{−(d1−d2)} )
    irrespective of the true model. It is inconsistent since it always
    picks the model with the smallest dimension

  155. Consistency theorem
    If Pn belongs to one of the two models and if µ0 cannot be
    attained by the other one :
    0 = min (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2)
    < max (inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2) ,
    then the Bayes factor B^T_12 is consistent

  156. Consequences on summary statistics
    Bayes factor driven by the means µi(θi) and the relative position of
    µ0 wrt both sets {µi(θi); θi ∈ Θi}, i = 1, 2.
    For ABC, this implies the most likely statistics T n are ancillary
    statistics with different mean values under both models
    Else, if T n asymptotically depends on some of the parameters of
    the models, it is possible that there exists θi ∈ Θi such that
    µi(θi) = µ0 even though model M1 is misspecified

  158. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    and the true distribution is Laplace with mean θ0 = 1, under the
    Gaussian model the value θ∗ = √(2√3 − 3) also leads to µ0 = µ(θ∗)
    [here d1 = d2 = d = 1]

  159. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    and the true distribution is Laplace with mean θ0 = 1, under the
    Gaussian model the value θ∗ = √(2√3 − 3) also leads to µ0 = µ(θ∗)
    [here d1 = d2 = d = 1]
    c A Bayes factor associated with such a statistic is inconsistent
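    A quick numerical sanity check on the moment maps (a sketch assuming µ1(θ) = 3 + θ⁴ + 6θ² for the
    Gaussian model and µ2(θ) = 6 + θ⁴ + 6θ² for the unit-variance Laplace model): whatever the Laplace
    location θ0, the equation µ1(θ∗) = µ2(θ0) has a root, so the fourth moment cannot separate the models.
    import numpy as np

    def mu1(t):   # fourth moment of N(t, 1)
        return 3 + t**4 + 6 * t**2

    def mu2(t):   # fourth moment of Laplace(t, 1/sqrt(2)), i.e. unit variance (assumed map)
        return 6 + t**4 + 6 * t**2

    for theta0 in (0.0, 1.0):
        target = mu2(theta0)
        t2 = (-6 + np.sqrt(36 + 4 * (target - 3))) / 2   # solves t^4 + 6 t^2 = target - 3
        t_star = np.sqrt(t2)
        print(theta0, t_star, mu1(t_star), target)       # mu1(t_star) equals mu2(theta0)

    # Monte Carlo confirmation that the Laplace(0, 1/sqrt(2)) fourth moment is close to 6
    x = np.random.default_rng(1).laplace(0.0, 1 / np.sqrt(2), 10**6)
    print(np.mean(x**4))
    For θ0 = 0 this returns θ∗ = √(2√3 − 3) ≈ 0.68, matching the value quoted for the [0] case below.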

  160. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·

  161. Toy example: Laplace versus Gauss [1]
    If
    T n = n^{−1} Σ_{i=1}^{n} X_i^4 ,  µ1(θ) = 3 + θ^4 + 6θ^2 ,  µ2(θ) = 6 + · · ·
    [Boxplot for M1 and M2, fourth moment, n = 1000]

  162. Toy example: Laplace versus Gauss [0]
    When
    T (y) = ( ȳ_n^{(4)} , ȳ_n^{(6)} )
    and the true distribution is Laplace with mean θ0 = 0, then
    µ0 = 6, µ1(θ∗_1) = 6 with θ∗_1 = √(2√3 − 3)
    [d1 = 1 and d2 = 1/2]
    thus
    B12 ∼ n^{−1/4} → 0 : consistent
    Under the Gaussian model µ0 = 3, µ2(θ2) ≥ 6 > 3 ∀θ2
    B12 → +∞ : consistent

  164. Toy example: Laplace versus Gauss [0]
    When
    T (y) = ( ȳ_n^{(4)} , ȳ_n^{(6)} )
    [Density of the posterior probabilities, based on the fourth and sixth moments]

  165. Checking for adequate statistics
    After running ABC, i.e. creating reference tables of (θi, xi) from
    both joints, straightforward derivation of ABC estimates θ̂1 and θ̂2.
    Evaluation of E^1_{θ̂1}[T(X)] and E^2_{θ̂2}[T(X)] allows for detection of
    different means under both models via Monte Carlo simulations
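    A minimal Monte Carlo version of this check on the toy example (locations fixed at 0 for both models,
    purely for illustration):
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 100, 5000

    def summaries(y):
        return np.array([y.mean(),
                         np.median(y),
                         y.var(ddof=1),
                         np.median(np.abs(y - np.median(y)))])   # mad

    gauss = np.array([summaries(rng.normal(0, 1, n)) for _ in range(reps)])
    laplace = np.array([summaries(rng.laplace(0, 1 / np.sqrt(2), n)) for _ in range(reps)])

    for name, g, l in zip(["mean", "median", "var", "mad"],
                          gauss.mean(axis=0), laplace.mean(axis=0)):
        print(f"{name:6s}  Gauss {g:6.3f}   Laplace {l:6.3f}")
    The first three summaries share (essentially) the same expectation under both models, while mad(y)
    concentrates near 0.67 under the Gaussian and near 0.49 under the Laplace model, which is what makes
    it an adequate statistic here.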

  166. Toy example: Laplace versus Gauss
    [Boxplots of the normalised χ2 without and with mad, for the Gauss and Laplace models]

  167. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}

  168. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}
    If summary statistic only informative on a parameter that is the
    same under both models, i.e. if d1 = d2, then
    c the Bayes factor is not consistent

  169. Embedded models
    When M1 submodel of M2, and if the true distribution belongs to
    the smaller model M1, Bayes factor is of order
    v_n^{−(d1−d2)}
    Else, d1 < d2 and Bayes factor is going to ∞ under M1. If true
    distribution not in M1, then
    c Bayes factor is consistent only if µ1 = µ2 = µ0

  170. Summary
    Model selection feasible with ABC
    Choice of summary statistics is paramount
    At best, ABC → π(. | T (y)) which concentrates around µ0
    For estimation : {θ; µ(θ) = µ0} = θ0
    For testing {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅

  172. 1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
    6 The Savage–Dickey ratio
    Measure-theoretic aspects
    Computational implications

  173. The Savage–Dickey ratio representation
    Special representation of the Bayes factor used for simulation
    Original version (Dickey, AoMS, 1971)

  175. Savage’s density ratio theorem
    Given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance
    parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that
    π1(ψ|θ0) = π0(ψ)
    then
    B01 = π1(θ0|x) / π1(θ0) ,
    with the obvious notations
    π1(θ) = ∫ π1(θ, ψ) dψ ,   π1(θ|x) = ∫ π1(θ, ψ|x) dψ ,
    [Dickey, 1971; Verdinelli & Wasserman, 1995]
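    A toy illustration of the ratio (a sketch, not the deck’s example): for x1, . . . , xn ∼ N(θ, 1), prior
    θ ∼ N(0, τ²) and no nuisance parameter, the posterior is available in closed form and a Monte Carlo
    version of π1(θ0|x)/π1(θ0) can be checked against it:
    import numpy as np
    from scipy.stats import norm, gaussian_kde

    rng = np.random.default_rng(3)
    n, tau, theta0 = 50, 2.0, 0.0
    x = rng.normal(0.3, 1.0, n)                     # data, true mean 0.3

    # closed-form posterior N(m, v) for theta
    v = 1.0 / (n + 1.0 / tau**2)
    m = v * n * x.mean()

    B01_exact = norm.pdf(theta0, m, np.sqrt(v)) / norm.pdf(theta0, 0.0, tau)

    draws = rng.normal(m, np.sqrt(v), 10**5)        # posterior sample
    B01_mc = gaussian_kde(draws)(theta0)[0] / norm.pdf(theta0, 0.0, tau)

    print(B01_exact, B01_mc)
    The kernel estimate is only there to mimic the situation where the posterior density at θ0 has to be
    recovered from simulation output.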

  176. Rephrased
    “Suppose that f0(θ) = f1(θ|φ = φ0). As f0(x|θ) = f1(x|θ, φ = φ0),
    f0(x) = ∫ f1(x|θ, φ = φ0)f1(θ|φ = φ0) dθ = f1(x|φ = φ0) ,
    i.e., the numerator of the Bayes factor is the value of f1(x|φ) at φ = φ0, while the denominator is an average
    of the values of f1(x|φ) for φ ≠ φ0, weighted by the prior distribution f1(φ) under the augmented model.
    Applying Bayes’ theorem to the right-hand side of [the above] we get
    f0(x) = f1(φ0|x)f1(x) / f1(φ0)
    and hence the Bayes factor is given by
    B = f0(x) / f1(x) = f1(φ0|x) / f1(φ0) ,
    the ratio of the posterior to prior densities at φ = φ0 under the augmented model.”
    [O’Hagan & Forster, 1996]

  178. Measure-theoretic difficulty
    Representation depends on the choice of versions of conditional
    densities:
    B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ  /  ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ
    [by definition]
    = { ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ π1(θ0) }  /  { ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ π1(θ0) }
    [specific version of π1(ψ|θ0) and arbitrary version of π1(θ0)]
    = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ  /  m1(x) π1(θ0)
    [specific version of π1(θ0, ψ)]
    = π1(θ0|x) / π1(θ0)
    [version dependent]

  180. Choice of density version
    c Dickey’s (1971) condition is not a condition:
    If
    π1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m1(x)
    is chosen as a version, then Savage–Dickey’s representation holds

  182. Savage–Dickey paradox
    Verdinelli-Wasserman extension:
    B01 = π1(θ0|x) / π1(θ0) × E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]
    similarly depends on choices of versions...
    ...but Monte Carlo implementation relies on specific versions of all
    densities without making mention of it
    [Chen, Shao & Ibrahim, 2000]

  184. A computational exploitation
    Starting from the (instrumental) prior
    π̃1(θ, ψ) = π1(θ) π0(ψ)
    define the associated posterior
    π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)
    and impose the choice of version
    π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)
    Then
    B01 = [ π̃1(θ0|x) / π1(θ0) ] × [ m̃1(x) / m1(x) ]
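    Indeed, writing m0(x) = ∫ π0(ψ) f(x|θ0, ψ) dψ for the evidence under H0, the imposed version gives
    π̃1(θ0|x) / π1(θ0) = m0(x) / m̃1(x) ,
    hence
    [ π̃1(θ0|x) / π1(θ0) ] × [ m̃1(x) / m1(x) ] = m0(x) / m1(x) = B01 .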

  186. First ratio
    If (θ(1), ψ(1)), . . . , (θ(T), ψ(T)) ∼ π̃(θ, ψ|x), then
    (1/T) Σ_t π̃1(θ0|x, ψ(t))
    converges to π̃1(θ0|x), provided the right version is used in θ0:
    π̃1(θ0|x, ψ) = π1(θ0) f(x|θ0, ψ) / ∫ π1(θ) f(x|θ, ψ) dθ
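    A sketch of this first ratio in the simplest conjugate case (an illustration with arbitrary
    hyperparameters, not the Pima Indian benchmark used later): xi ∼ N(θ, ψ), instrumental prior
    π1(θ) π0(ψ) = N(0, τ²) × IG(a, b), so that π̃1(θ0|x, ψ) is the conditional normal posterior density
    at θ0 and can be averaged over Gibbs draws of ψ; since the prior factorises here, m̃1 = m1 and the
    averaged ratio estimates B01 directly, which is checked by one-dimensional quadrature over ψ:
    import numpy as np
    from scipy.stats import norm, invgamma
    from scipy.integrate import quad

    rng = np.random.default_rng(4)
    n, tau, a, b, theta0 = 20, 1.0, 3.0, 3.0, 0.0
    x = rng.normal(0.2, 1.0, n)
    xbar, S = x.mean(), ((x - x.mean()) ** 2).sum()

    # Gibbs sampling from the posterior under the factorised prior N(0, tau^2) x IG(a, b)
    T, psi, rb = 10000, 1.0, []
    for t in range(T):
        v = 1.0 / (n / psi + 1.0 / tau ** 2)       # theta | psi, x ~ N(m, v)
        m = v * n * xbar / psi
        rb.append(norm.pdf(theta0, m, np.sqrt(v))) # = pi~_1(theta0 | x, psi)
        theta = rng.normal(m, np.sqrt(v))
        psi = invgamma.rvs(a + n / 2, scale=b + ((x - theta) ** 2).sum() / 2,
                           random_state=rng)
    B01_sd = np.mean(rb[1000:]) / norm.pdf(theta0, 0.0, tau)   # first-ratio estimate

    # reference value m0(x)/m1(x) by quadrature over psi (a common (2*pi)^(-n/2) factor is dropped)
    def f0(psi):   # proportional to f(x | theta0, psi) pi_0(psi)
        return psi ** (-n / 2) * np.exp(-(S + n * (xbar - theta0) ** 2) / (2 * psi)) \
               * invgamma.pdf(psi, a, scale=b)

    def f1(psi):   # proportional to [ int f(x | theta, psi) pi_1(theta) dtheta ] pi_0(psi)
        return psi ** (-n / 2) * np.exp(-S / (2 * psi)) * np.sqrt(2 * np.pi * psi / n) \
               * norm.pdf(xbar, 0.0, np.sqrt(psi / n + tau ** 2)) * invgamma.pdf(psi, a, scale=b)

    B01_quad = quad(f0, 0, np.inf)[0] / quad(f1, 0, np.inf)[0]
    print(B01_sd, B01_quad)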

  187. Rao–Blackwellisation with latent variables
    When π̃1(θ0|x, ψ) unavailable, replace with
    (1/T) Σ_{t=1}^{T} π̃1(θ0|x, z(t), ψ(t))
    via data completion by latent variable z such that
    f(x|θ, ψ) = ∫ f̃(x, z|θ, ψ) dz
    and that π̃1(θ, ψ, z|x) ∝ π0(ψ) π1(θ) f̃(x, z|θ, ψ) available in closed
    form, including the normalising constant, based on version
    π̃1(θ0|x, z, ψ) / π1(θ0) = f̃(x, z|θ0, ψ) / [ ∫ f̃(x, z|θ, ψ) π1(θ) dθ ] .

  189. Bridge revival (1)
    Since ˜
    m1(x)/m1(x) is unknown, apparent failure!
    Use of the bridge identity
    E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / { π0(ψ) π1(θ) f(x|θ, ψ) } ]
    = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x) / m̃1(x)
    to (biasedly) estimate m̃1(x)/m1(x) by
    T / Σ_{t=1}^{T} [ π1(ψ(t)|θ(t)) / π0(ψ(t)) ]
    based on the same sample from π̃1.

  191. Bridge revival (2)
    Alternative identity
    E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / { π1(θ, ψ) f(x|θ, ψ) } ]
    = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x) / m1(x)
    suggests using a second sample (θ̄(t), ψ̄(t), z(t)) ∼ π1(θ, ψ|x) and
    the ratio estimate
    (1/T) Σ_{t=1}^{T} π0(ψ̄(t)) / π1(ψ̄(t)|θ̄(t))
    Resulting unbiased estimate:
    B01 = [ (1/T) Σ_t π̃1(θ0|x, z(t), ψ(t)) / π1(θ0) ] × [ (1/T) Σ_{t=1}^{T} π0(ψ̄(t)) / π1(ψ̄(t)|θ̄(t)) ]

  193. Difference with Verdinelli–Wasserman representation
    The above leads to the representation
    B01 = π̃1(θ0|x) / π1(θ0) × E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ]
    which shows how our approach differs from Verdinelli and Wasserman’s
    B01 = π1(θ0|x) / π1(θ0) × E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]
    [for referees only!!]

  194. Diabetes in Pima Indian women (cont’d)
    Comparison of the variation of the Bayes factor approximations
    based on 100 replicas of 20,000 simulations each, from the above importance,
    Chib’s, Savage–Dickey’s and bridge samplers
    [Boxplots of the IS, Chib, Savage−Dickey and Bridge approximations]

  195. Other things I could have mentioned
    use of particle filters
    [Del Moral et al., 2006]
    path sampling
    d/dθ log Z(θ) = Eθ[ d/dθ log q(ω|θ) ]
    [Ogata, 1989; Gelman & Meng, 1998]
    multicanonical algorithms
    [Marinari & Parisi, 1992; Geyer & Thompson, 1995]
