Slide 1

Slide 1 text

On alternative perspectives and solutions on Bayesian tests Christian P. Robert Université Paris-Dauphine, Paris & University of Warwick, Coventry bayesianstatistics@gmail.com

Slide 3

Slide 3 text

Outline Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information Posterior predictive checking

Slide 4

Slide 4 text

“Significance tests: one new parameter” Significance tests: one new parameter Bayesian tests Bayes factors Improper priors for tests Conclusion Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information

Slide 5

Slide 5 text

Fundamental setting Is the new parameter supported by the observations or is any variation expressible by it better interpreted as random? Thus we must set two hypotheses for comparison, the more complicated having the smaller initial probability (Jeffreys, ToP, V, §5.0) ...compare a specially suggested value of a new parameter, often 0 [q], with the aggregate of other possible values [q′]. We shall call q the null hypothesis and q′ the alternative hypothesis [and] we must take P(q|H) = P(q′|H) = 1/2.

Slide 6

Slide 6 text

Construction of Bayes tests Definition (Test) Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.

Slide 7

Slide 7 text

Type–one and type–two errors Associated with the risk
$$R(\theta, \delta) = \mathbb{E}_\theta[L(\theta, \delta(x))] = \begin{cases} P_\theta(\delta(x) = 0) & \text{if } \theta \in \Theta_0, \\ P_\theta(\delta(x) = 1) & \text{otherwise.} \end{cases}$$
Theorem (Bayes test) The Bayes estimator associated with π and with the 0–1 loss is
$$\delta^\pi(x) = \begin{cases} 1 & \text{if } P(\theta \in \Theta_0 \mid x) > P(\theta \notin \Theta_0 \mid x), \\ 0 & \text{otherwise.} \end{cases}$$

Slide 9

Slide 9 text

Jeffreys’ example (§5.0) Testing whether the mean α of a normal observation is zero:
$$P(q \mid aH) \propto \exp\left\{-\frac{a^2}{2s^2}\right\}$$
$$P(q'\,d\alpha \mid aH) \propto \exp\left\{-\frac{(a-\alpha)^2}{2s^2}\right\} f(\alpha)\,d\alpha$$
$$P(q' \mid aH) \propto \int \exp\left\{-\frac{(a-\alpha)^2}{2s^2}\right\} f(\alpha)\,d\alpha$$

Slide 10

Slide 10 text

A (small) point of contention Jeffreys asserts Suppose that there is one old parameter α; the new parameter is β and is 0 on q. In q′ we could replace α by α′, any function of α and β: but to make it explicit that q′ reduces to q when β = 0 we shall require that α′ = α when β = 0 (V, §5.0). This amounts to assuming identical parameters in both models, a controversial principle for model choice, or at the very best to making α and β dependent a priori, a choice contradicted by the next paragraph in ToP

Slide 12

Slide 12 text

Orthogonal parameters If
$$I(\alpha, \beta) = \begin{pmatrix} g_{\alpha\alpha} & 0 \\ 0 & g_{\beta\beta} \end{pmatrix},$$
α and β are orthogonal, but not [a posteriori] independent, contrary to ToP assertions ...the result will be nearly independent on previous information on old parameters (V, §5.01). and
$$K = \frac{1}{f(b, a)} \sqrt{\frac{n g_{\beta\beta}}{2\pi}} \exp\left\{-\frac{1}{2} n g_{\beta\beta} b^2\right\}$$
[where] h(α) is irrelevant (V, §5.01)

Slide 14

Slide 14 text

Acknowledgement in ToP In practice it is rather unusual for a set of parameters to arise in such a way that each can be treated as irrelevant to the presence of any other. More usual cases are (...) where some parameters are so closely associated that one could hardly occur without the others (V, §5.04).

Slide 15

Slide 15 text

Generalisation Theorem (Optimal Bayes decision) Under the 0–1 loss function
$$L(\theta, d) = \begin{cases} 0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta) \\ a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\ a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0 \end{cases}$$
the Bayes procedure is
$$\delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(\theta \in \Theta_0 \mid x) \geq a_0/(a_0 + a_1) \\ 0 & \text{otherwise} \end{cases}$$

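A minimal numerical sketch of this decision rule, for a conjugate normal model with a one-sided null; the model, prior scale, and losses below are illustrative assumptions, not part of the original example.

```python
# Sketch of the optimal Bayes decision under the weighted 0-1 loss above,
# for x | theta ~ N(theta, 1), theta ~ N(0, tau^2), and H0: theta <= 0.
from scipy.stats import norm

def bayes_test(x, a0=1.0, a1=1.0, tau=3.0):
    # posterior: theta | x ~ N(omega * x, omega), omega = tau^2 / (1 + tau^2)
    omega = tau**2 / (1 + tau**2)
    p0 = norm.cdf(0.0, loc=omega * x, scale=omega**0.5)  # P(theta in Theta_0 | x)
    return 1 if p0 >= a0 / (a0 + a1) else 0              # 1 = accept H0

for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, bayes_test(x))
```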
Slide 17

Slide 17 text

Bound comparison Determination of a0/a1 depends on the consequences of a “wrong decision” under both circumstances. Often difficult to assess in practice, hence replaced with “golden” default bounds like .05, biased towards H0

Slide 19

Slide 19 text

A function of posterior probabilities Definition (Bayes factors) For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
$$B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x\mid\theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x\mid\theta)\,\pi_1(\theta)\,d\theta}$$
[Good, 1958 & ToP, V, §5.01] Equivalent to Bayes rule: acceptance if
$$B_{01} > \{(1 - \pi(\Theta_0))/a_1\}\big/\{\pi(\Theta_0)/a_0\}$$

Slide 20

Slide 20 text

A major modification When the null hypothesis is supported by a set of measure 0 against Lebesgue measure, π(Θ0) = 0 for an absolutely continuous prior distribution [End of the story?!] Suppose we are considering whether a location parameter α is 0. The estimation prior probability for it is uniform and we should have to take f (α) = 0 and K[= B10] would always be infinite (V, §5.02)

Slide 22

Slide 22 text

Point null refurbishment Requirement Defined prior distributions under both assumptions, π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ), (under the standard dominating measures on Θ0 and Θ1) Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = ρ1, π(θ) = ρ0π0(θ) + ρ1π1(θ). Note If Θ0 = {θ0}, π0 is the Dirac mass in θ0

Slide 24

Slide 24 text

Point null hypotheses Particular case H0 : θ = θ0. Take ρ0 = Pr^π(θ = θ0) and g1 the prior density under Ha. Posterior probability of H0:
$$\pi(\Theta_0 \mid x) = \frac{f(x\mid\theta_0)\rho_0}{\int f(x\mid\theta)\pi(\theta)\,d\theta} = \frac{f(x\mid\theta_0)\rho_0}{f(x\mid\theta_0)\rho_0 + (1-\rho_0)m_1(x)}$$
and marginal under Ha:
$$m_1(x) = \int_{\Theta_1} f(x\mid\theta)\,g_1(\theta)\,d\theta.$$

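A short sketch of this posterior probability in the conjugate normal case, with x | θ ∼ N(θ, 1), H0: θ = 0, and an assumed N(0, τ²) density g1 under Ha; the prior weight and scale are illustrative.

```python
# Sketch: posterior probability of the point null H0: theta = 0 when
# x | theta ~ N(theta, 1), with prior mass rho0 on 0 and g1 = N(0, tau^2).
from scipy.stats import norm

def post_prob_null(x, rho0=0.5, tau=1.0):
    f0 = norm.pdf(x, loc=0.0, scale=1.0)                 # f(x | theta_0)
    m1 = norm.pdf(x, loc=0.0, scale=(1 + tau**2)**0.5)   # marginal m1(x) under Ha
    return rho0 * f0 / (rho0 * f0 + (1 - rho0) * m1)

for x in (0.0, 1.0, 1.96, 3.0):
    print(f"x = {x:.2f}  P(H0 | x) = {post_prob_null(x):.3f}")
```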
Slide 26

Slide 26 text

Point null hypotheses (cont’d) Dual representation
$$\pi(\Theta_0 \mid x) = \left[1 + \frac{1-\rho_0}{\rho_0}\,\frac{m_1(x)}{f(x\mid\theta_0)}\right]^{-1}$$
and
$$B_{01}^\pi(x) = \frac{f(x\mid\theta_0)\rho_0}{m_1(x)(1-\rho_0)} \bigg/ \frac{\rho_0}{1-\rho_0} = \frac{f(x\mid\theta_0)}{m_1(x)}$$
Connection
$$\pi(\Theta_0 \mid x) = \left[1 + \frac{1-\rho_0}{\rho_0}\,\frac{1}{B_{01}^\pi(x)}\right]^{-1}.$$

Slide 28

Slide 28 text

A further difficulty Improper priors are not allowed here If
$$\int_{\Theta_1} \pi_1(d\theta_1) = \infty \quad\text{or}\quad \int_{\Theta_2} \pi_2(d\theta_2) = \infty$$
then π1 or π2 cannot be coherently normalised, while the normalisation matters in the Bayes factor [remember the Bayes factor?]

Slide 30

Slide 30 text

ToP unaware of the problem? A. Not entirely, as improper priors keep being used on nuisance parameters. Example of testing for a zero normal mean: If σ is the standard error and λ the true value, λ is 0 on q. We want a suitable form for its prior on q′. (...) Then we should take
$$P(q\,d\sigma \mid H) \propto \frac{d\sigma}{\sigma} \qquad P(q'\,d\sigma\,d\lambda \mid H) \propto f\left(\frac{\lambda}{\sigma}\right)\frac{d\sigma}{\sigma}\,\frac{d\lambda}{\sigma}$$
where f [is a true density] (V, §5.2). Fallacy of the “same” σ!

Slide 32

Slide 32 text

Not enough information If s = 0 [!!!], then [for σ = |x̄|/τ, λ = σv]
$$P(q \mid \theta H) \propto \int_0^\infty \left(\frac{\tau}{|\bar x|}\right)^n \exp\left\{-\frac{1}{2}n\tau^2\right\}\frac{d\tau}{\tau},$$
$$P(q' \mid \theta H) \propto \int_0^\infty \frac{d\tau}{\tau}\int_{-\infty}^\infty \left(\frac{\tau}{|\bar x|}\right)^n f(v)\,\exp\left\{-\frac{1}{2}n(v-\tau)^2\right\}dv.$$
If n = 1 and f(v) is any even [density],
$$P(q' \mid \theta H) \propto \frac{\sqrt{2\pi}}{2\,|\bar x|} \quad\text{and}\quad P(q \mid \theta H) \propto \frac{\sqrt{2\pi}}{2\,|\bar x|}$$
and therefore K = 1 (V, §5.2).

Slide 33

Slide 33 text

Strange constraints If n ≥ 2, the condition that K = 0 for s = 0, x̄ ≠ 0 is equivalent to
$$\int_0^\infty f(v)\,v^{n-1}\,dv = \infty.$$
The function satisfying this condition for [all] n is
$$f(v) = \frac{1}{\pi(1+v^2)}$$
This is the prior recommended by Jeffreys hereafter. But, first, many other families of densities satisfy this constraint and a scale of 1 cannot be universal! Second, s = 0 is a zero probability event...

Slide 36

Slide 36 text

Comments
- ToP very imprecise about the choice of priors in the setting of tests (despite the existence of Susie’s Jeffreys’ conventional partly proper priors)
- ToP misses the difficulty of improper priors [coherent with earlier stance], but this problem still generates debates within the B community
- some degree of goodness-of-fit testing, but against fixed alternatives
- persistence of the form
$$K \approx \sqrt{\frac{\pi\nu}{2}}\left(1 + \frac{t^2}{\nu}\right)^{-\nu/2+1/2}$$
but ν not so clearly defined...

Slide 37

Slide 37 text

Jeffreys–Lindley paradox Significance tests: one new parameter Jeffreys-Lindley paradox Lindley’s paradox dual versions of the paradox “Who should be afraid of the Lindley–Jeffreys paradox?” Bayesian resolutions Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests

Slide 38

Slide 38 text

Lindley’s paradox In a normal mean testing problem,
$$\bar x_n \sim N(\theta, \sigma^2/n), \quad H_0: \theta = \theta_0,$$
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
$$B_{01}(t_n) = (1+n)^{1/2} \exp\left\{-\frac{n t_n^2}{2(1+n)}\right\},$$
where $t_n = \sqrt{n}\,|\bar x_n - \theta_0|/\sigma$, satisfies
$$B_{01}(t_n) \xrightarrow{n\to\infty} \infty$$
[assuming a fixed tn] [Lindley, 1957]

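A quick numerical sketch of this divergence, evaluating B01(tn) at a fixed tn = 1.96 (borderline significant at the 5% level) for increasing n; the values of tn and n are illustrative.

```python
# Sketch: B01(t_n) = sqrt(1 + n) * exp(-n t_n^2 / (2 (1 + n))) at fixed t_n,
# showing the Bayes factor diverging in favour of H0 as n grows.
import math

t = 1.96  # fixed, "5%-significant" value of the test statistic
for n in (10, 100, 1_000, 100_000, 10_000_000):
    b01 = math.sqrt(1 + n) * math.exp(-n * t**2 / (2 * (1 + n)))
    print(f"n = {n:>10,}  B01 = {b01:10.2f}")
```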
Slide 39

Slide 39 text

Lindley’s paradox Often dubbed Jeffreys–Lindley paradox... In terms of $t = \sqrt{n-1}\,\bar x/s$ and ν = n − 1,
$$K \sim \sqrt{\frac{\pi\nu}{2}}\left(1 + \frac{t^2}{\nu}\right)^{-\nu/2+1/2}.$$
(...) The variation of K with t is much more important than the variation with ν (Jeffreys, V, §5.2).

Slide 40

Slide 40 text

Two versions of the paradox “the weight of Lindley’s paradoxical result (...) burdens proponents of the Bayesian practice”. [Lad, 2003]
- official version, opposing frequentist and Bayesian assessments [Lindley, 1957]
- intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if π1(·|σ) depends on a scale parameter σ, it is often the case that
$$B_{01}(x) \xrightarrow{\sigma\to\infty} +\infty$$
for a given x, meaning H0 is always accepted [Robert, 1992, 2013]

Slide 41

Slide 41 text

where does it matter? In the normal case, Z ∼ N(θ, 1), θ ∼ N(0, α²), the Bayes factor is
$$B_{10}(z) = \frac{\exp\{z^2\alpha^2/2(1+\alpha^2)\}}{\sqrt{1+\alpha^2}} = \sqrt{1-\lambda}\,\exp\{\lambda z^2/2\},$$
with λ = α²/(1 + α²).

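The intra-Bayesian version can be checked directly on this formula: for a fixed observation z, B10 vanishes as the prior scale α grows. A small sketch, with an illustrative value of z:

```python
# Sketch: B10(z) = sqrt(1 - lam) * exp(lam * z^2 / 2), lam = alpha^2 / (1 + alpha^2);
# for fixed z, B10 -> 0 as alpha -> infinity, i.e. H0 is always accepted.
import math

z = 2.5  # a moderately "significant" observation
for alpha in (1, 10, 100, 1_000, 10_000):
    lam = alpha**2 / (1 + alpha**2)
    b10 = math.sqrt(1 - lam) * math.exp(lam * z**2 / 2)
    print(f"alpha = {alpha:>6}  B10 = {b10:8.4f}")
```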
Slide 42

Slide 42 text

Evacuation of the first version Two paradigms [(b) versus (f)]
- one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space
- one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1
- one (f) could reject “a hypothesis that may be true (...) because it has not predicted observable results that have not occurred” (Jeffreys, ToP, VII, §7.2), while the other (b) conditions upon the observed value xobs
- one (f) cannot agree with the likelihood principle, while the other (b) is almost uniformly in agreement with it
- one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the (default) boundary probability of 1/2

Slide 43

Slide 43 text

More arguments on the first version observing a constant tn as n increases is of limited interest: under H0, tn has a limiting N(0, 1) distribution, while, under H1, tn a.s. converges to ∞; this behaviour remains entirely compatible with the consistency of the Bayes factor, which a.s. converges either to 0 or ∞, depending on which hypothesis is true. The subsequent literature (e.g., Berger & Sellke, 1987; Bayarri & Berger, 2004) has since shown how divergent those two approaches can be (to the point of being asymptotically incompatible).

Slide 44

Slide 44 text

Nothing’s wrong with the second version
- n, the prior’s scale factor: the prior variance is n times larger than the observation variance, and when n goes to ∞, the Bayes factor goes to ∞ no matter what the observation is
- n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis”
- when prior diffuseness under H1 increases, the only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data
- the mass of the prior distribution in any fixed neighbourhood of the null hypothesis vanishes under H1
- deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it

Slide 46

Slide 46 text

“Who should be afraid of the Lindley–Jeffreys paradox?” Recent publication by A. Spanos with the above title:
- the paradox demonstrates against Bayesian and likelihood resolutions of the problem, for failing to account for the large sample size
- the failure of all three main paradigms (“fallacy of rejection” for (f) versus “fallacy of acceptance” for (b)) leads to advocating Mayo’s and Spanos’ (2004) “postdata severity evaluation” [Spanos, 2013]

Slide 47

Slide 47 text

“Who should be afraid of the Lindley–Jeffreys paradox?” Recent publication by A. Spanos with above title: “the postdata severity evaluation (...) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data”(p.88) [Spanos, 2013]

Slide 48

Slide 48 text

what is severity? An hypothesis H passes a severe test if the data agrees with H and if it is highly probable that data not produced under H agrees less with H
- departure from the null, rewritten as θ1 = θ0 + γ, “provide[s] the ‘magnitude’ of the warranted discrepancy from the null”, i.e. decides how close (in distance) to the null we can get and still be able to discriminate the null from the alternative hypotheses “with very high probability”
- requires setting the “severity threshold”, $P_{\theta_1}\{d(X) \leq d(x_0)\}$
- once γ is found, whether it is far enough from the null is a matter of informed opinion: whether it is “substantially significant (...) pertains to the substantive subject matter”

Slide 49

Slide 49 text

...should we be afraid? A. Not! In Spanos (2013)
- the purpose of a test and the nature of evidence are never spelled out
- the rejection of decisional aspects clashes with the later call to the magnitude of the severity
- no quantification of how to select significance thresholds γ against sample size n
- irrelevant attacks on the likelihood principle and dependence on the Euclidean distance [Robert, 2013]

Slide 50

Slide 50 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lacks complete proper Bayesian justification [Berger & Pericchi, 2001] use of identical improper priors on nuisance parameters, use of the posterior predictive distribution, matching priors, use of score functions extending the log score function

Slide 51

Slide 51 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013] use of the posterior predictive distribution, matching priors, use of score functions extending the log score function

Slide 52

Slide 52 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, Péché de jeunesse: equating the values of the prior densities at the point-null value θ0,
$$\rho_0 = (1 - \rho_0)\,\pi_1(\theta_0)$$
[Robert, 1993] use of the posterior predictive distribution, matching priors, use of score functions extending the log score function

Slide 53

Slide 53 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, use of the posterior predictive distribution, which uses the data twice matching priors, use of score functions extending the log score function

Slide 54

Slide 54 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, use of the posterior predictive distribution, matching priors, whose sole purpose is to bring frequentist and Bayesian coverages as close as possible [Datta & Mukerjee, 2004] use of score functions extending the log score function

Slide 55

Slide 55 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, use of the posterior predictive distribution, matching priors, use of score functions extending the log score function
$$\log B_{12}(x) = \log m_1(x) - \log m_2(x) = S_0(x, m_1) - S_0(x, m_2),$$
that are independent of the normalising constant [Dawid et al., 2013]

Slide 56

Slide 56 text

On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, use of identical improper priors on nuisance parameters, use of the posterior predictive distribution, matching priors, use of score functions extending the log score function non-local priors correcting default priors towards more balanced error rates [Johnson & Rossell, 2010; Consonni et al., 2013]

Slide 57

Slide 57 text

Deviance (information criterion) Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information Posterior predictive checking

Slide 58

Slide 58 text

Bayesian model comparison(s) use posterior probabilities/Bayes factors
$$B_{12}(y) = \frac{\int_{\Theta_1} f_1(y\mid\theta_1)\,d\pi_1(\theta_1)}{\int_{\Theta_2} f_2(y\mid\theta_2)\,d\pi_2(\theta_2)}$$
[Jeffreys, 1939] posterior predictive checks
$$P(m_i(Y) \geq m_i(y) \mid y)$$
[Gelman et al., 2013] comparisons of models based on prediction error and other loss-based measures DIC? BIC? integrated likelihood?

Slide 60

Slide 60 text

DIC as in Dayesian? Deviance defined by
$$D(\theta) = -2\log(p(y\mid\theta)),$$
Effective number of parameters computed as
$$p_D = \bar D - D(\bar\theta),$$
with $\bar D$ the posterior expectation of D and $\bar\theta$ an estimate of θ. Deviance information criterion (DIC) defined by
$$\mathrm{DIC} = p_D + \bar D = D(\bar\theta) + 2 p_D$$
Models with smaller DIC better supported by the data [Spiegelhalter et al., 2002]

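A minimal sketch of these quantities from a posterior sample, for a normal model with known variance; the simulated data, prior, and posterior sampling are illustrative assumptions.

```python
# Sketch: DIC from a posterior sample when y_i | theta ~ N(theta, 1) and
# theta ~ N(0, 100) (conjugate, so the posterior is available in closed form).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=50)                    # illustrative data

v = 1.0 / (len(y) + 1.0 / 100.0)                     # posterior variance
m = v * y.sum()                                      # posterior mean
thetas = rng.normal(m, np.sqrt(v), size=10_000)      # posterior sample

def deviance(theta):
    # D(theta) = -2 log p(y | theta), vectorised over a sample of thetas
    return -2.0 * norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)

D_bar = deviance(thetas).mean()                      # posterior expectation of D
D_hat = deviance(np.array([thetas.mean()]))[0]       # D(theta_bar)
p_D = D_bar - D_hat
print(f"p_D = {p_D:.2f}  DIC = {D_hat + 2 * p_D:.1f}")
```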
Slide 61

Slide 61 text

“thou shalt not use the data twice” The data is used twice in the DIC method: 1. y used once to produce the posterior π(θ|y) and the associated estimate $\tilde\theta(y)$ 2. y used a second time to compute the posterior expectation of the observed likelihood p(y|θ),
$$\int \log p(y\mid\theta)\,\pi(d\theta\mid y) \propto \int \log p(y\mid\theta)\,p(y\mid\theta)\,\pi(d\theta)$$

Slide 62

Slide 62 text

DIC for missing data models Framework of missing data models
$$f(y\mid\theta) = \int f(y, z\mid\theta)\,dz,$$
with observed data y = (y1, . . . , yn) and corresponding missing data z = (z1, . . . , zn). How do we define DIC in such settings?

Slide 64

Slide 64 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs
$$\mathrm{DIC}_1 = -4\mathbb{E}_\theta[\log f(y\mid\theta)\mid y] + 2\log f(y\mid\mathbb{E}_\theta[\theta\mid y])$$
often a poor choice in case of unidentifiability 2. complete DICs based on f(y, z|θ) 3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 65

Slide 65 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs
$$\mathrm{DIC}_2 = -4\mathbb{E}_\theta[\log f(y\mid\theta)\mid y] + 2\log f(y\mid\hat\theta(y)),$$
which uses the posterior mode instead 2. complete DICs based on f(y, z|θ) 3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 66

Slide 66 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs
$$\mathrm{DIC}_3 = -4\mathbb{E}_\theta[\log f(y\mid\theta)\mid y] + 2\log \hat f(y),$$
which instead relies on the MCMC density estimate 2. complete DICs based on f(y, z|θ) 3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 67

Slide 67 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs 2. complete DICs based on f(y, z|θ)
$$\mathrm{DIC}_4 = \mathbb{E}_Z[\mathrm{DIC}(y, Z)\mid y] = -4\mathbb{E}_{\theta,Z}[\log f(y, Z\mid\theta)\mid y] + 2\mathbb{E}_Z[\log f(y, Z\mid\mathbb{E}_\theta[\theta\mid y, Z])\mid y]$$
3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 68

Slide 68 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs 2. complete DICs based on f(y, z|θ)
$$\mathrm{DIC}_5 = -4\mathbb{E}_{\theta,Z}[\log f(y, Z\mid\theta)\mid y] + 2\log f(y, \hat z(y)\mid\hat\theta(y)),$$
using Z as an additional parameter 3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 69

Slide 69 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs 2. complete DICs based on f(y, z|θ)
$$\mathrm{DIC}_6 = -4\mathbb{E}_{\theta,Z}[\log f(y, Z\mid\theta)\mid y] + 2\mathbb{E}_Z[\log f(y, Z\mid\hat\theta(y))\mid y, \hat\theta(y)],$$
in analogy with EM, $\hat\theta$ being an EM fixed point 3. conditional DICs based on f(y|z, θ) [Celeux et al., BA, 2006]

Slide 70

Slide 70 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs 2. complete DICs based on f(y, z|θ) 3. conditional DICs based on f(y|z, θ)
$$\mathrm{DIC}_7 = -4\mathbb{E}_{\theta,Z}[\log f(y\mid Z, \theta)\mid y] + 2\log f(y\mid\hat z(y), \hat\theta(y)),$$
using MAP estimates [Celeux et al., BA, 2006]

Slide 71

Slide 71 text

how many DICs can you fit in a mixture? Q: How many giraffes can you fit in a VW bug? A: None, the elephants are in there. 1. observed DICs 2. complete DICs based on f(y, z|θ) 3. conditional DICs based on f(y|z, θ)
$$\mathrm{DIC}_8 = -4\mathbb{E}_{\theta,Z}[\log f(y\mid Z, \theta)\mid y] + 2\mathbb{E}_Z[\log f(y\mid Z, \hat\theta(y, Z))\mid y],$$
conditioning first on Z and then integrating over Z conditional on y [Celeux et al., BA, 2006]

Slide 72

Slide 72 text

Galactic DICs Example of the galaxy mixture dataset, DIC values (with pD values in parentheses):

K   DIC2 (pD2)    DIC3 (pD3)   DIC4 (pD4)   DIC5 (pD5)     DIC6 (pD6)    DIC7 (pD7)    DIC8 (pD8)
2   453 (5.56)    451 (3.66)   502 (5.50)   705 (207.88)   501 (4.48)    417 (11.07)   410 (4.09)
3   440 (9.23)    436 (4.94)   461 (6.40)   622 (167.28)   471 (15.80)   378 (13.59)   372 (7.43)
4   446 (11.58)   439 (5.41)   473 (7.52)   649 (183.48)   482 (16.51)   388 (17.47)   382 (11.37)
5   447 (10.80)   442 (5.48)   485 (7.58)   658 (180.73)   511 (33.29)   395 (20.00)   390 (15.15)
6   449 (11.26)   444 (5.49)   494 (8.49)   676 (191.10)   532 (46.83)   407 (28.23)   398 (19.34)
7   460 (19.26)   446 (5.83)   508 (8.93)   700 (200.35)   571 (71.26)   425 (40.51)   409 (24.57)

Slide 73

Slide 73 text

questions
- what is the behaviour of DIC under model misspecification?
- is there an absolute scale to the DIC values, i.e. when is a difference in DICs significant?
- how can DIC handle small n’s versus p’s?
- should pD be defined as var(D|y)/2 [Gelman’s suggestion]?
- is WAIC (Gelman and Vehtari, 2013) making a difference for being based on the expected posterior predictive?
- in an era of complex models, is DIC applicable? [Robert, 2013]

Slide 75

Slide 75 text

Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Integrated likelihood Criticisms A Bayesian version? Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information Posterior predictive checking

Slide 76

Slide 76 text

Integrated likelihood Statistical Inference: An Integrated Bayesian/Likelihood Approach was published by Murray Aitkin in 2009 Theme: comparisons of posterior distributions of likelihood functions under competing models or via the posterior distribution of likelihood ratios corresponding to those models...

Slide 78

Slide 78 text

Posterior likelihood “This quite small change to standard Bayesian analysis allows a very general approach to a wide range of apparently different inference problems; a particular advantage of the approach is that it can use the same noninformative priors.” Statistical Inference, page xiii Central tool: “posterior cdf” of the likelihood, F(z) = Prπ(L(θ, x) > z|x) . Arguments: general approach that resolves difficulties with the Bayesian processing of point null hypotheses includes use of generic noninformative and improper priors handles the “vexed question of model fit”

Slide 80

Slide 80 text

Using the data twice [again!] “A persistent criticism of the posterior likelihood approach (. . . ) has been based on the claim that these approaches are ‘using the data twice,’ or are ‘violating temporal coherence.” Statistical Inference, page 48 “posterior expectation” of the likelihood as the ratio of the marginal of the twice-replicated data over the marginal of the original data,
$$\mathbb{E}[L(\theta, x)\mid x] = \int L(\theta, x)\,\pi(\theta\mid x)\,d\theta = \frac{m(x, x)}{m(x)},$$
[Aitkin, 1991]
- the likelihood function does not exist a priori
- requires a joint distribution across models to be compared
- connection with pseudo-priors (Carlin & Chib, 1995), who defined prior distributions on the parameters that do not exist
- fails to include improper priors since (θ, x) has no joint distribution

Slide 82

Slide 82 text

Posterior probability on posterior probabilities “The p-value is equal to the posterior probability that the likelihood ratio, for null hypothesis to alternative, is greater than 1 (. . . ) The posterior probability is p that the posterior probability of H0 is greater than 0.5.” Statistical Inference, pages 42–43 A posterior probability being a number, how can its posterior probability be defined? While
$$m(x) = \int L(\theta, x)\,\pi(\theta)\,d\theta = \mathbb{E}^\pi[L(\theta, x)]$$
is well-defined, it does not mean the whole distribution of L(θ, x) makes sense!

Slide 84

Slide 84 text

Drifting apart fundamental theoretical argument: integrated likelihood leads to parallel and separate simulations from the posteriors under each model, considering the distribution of
$$L_i(\theta_i\mid x)\big/L_k(\theta_k\mid x),$$
when the θi’s and θk’s are drawn from their respective posteriors [Scott, 2002; Congdon, 2006]

Slide 85

Slide 85 text

Drifting apart fundamental theoretical argument: integrated likelihood leads to parallel and separate simulations from the posteriors under each model, considering the distribution of
$$L_i(\theta_i\mid x)\big/L_k(\theta_k\mid x),$$
when the θi’s and θk’s are drawn from their respective posteriors [Scott, 2002; Congdon, 2006] MCMC simulations run for each model separately and the resulting MCMC samples gathered together to produce the posterior distribution of
$$\rho_i L(\theta_i\mid x)\Big/\sum_k \rho_k L(\theta_k\mid x),$$
which does not correspond to a genuine Bayesian solution [Robert and Marin, 2008]

Slide 86

Slide 86 text

Drifting apart fundamental theoretical argument: integrated likelihood leads to parallel and separate simulations from the posteriors under each model, considering the distribution of
$$L_i(\theta_i\mid x)\big/L_k(\theta_k\mid x),$$
when the θi’s and θk’s are drawn from their respective posteriors [Scott, 2002; Congdon, 2006] the product of the posteriors π1(θ1|x)π2(θ2|x) is not the posterior of the product π(θ1, θ2|x), as in
$$p_1 m_1(x)\,\pi_1(\theta_1\mid x)\,\pi_2(\theta_2) + p_2 m_2(x)\,\pi_2(\theta_2\mid x)\,\pi_1(\theta_1).$$
[Carlin & Chib, 1995]

Slide 87

Slide 87 text

An illustration Comparison of the distribution of the likelihood ratio under (a) the true joint posterior and (b) the product of posteriors, when assessing the fit of a Poisson against a binomial model with m = 5 trials, for the observation x = 3. [Figure: histograms of the log likelihood ratio under marginal simulation and under joint simulation.]

Slide 88

Slide 88 text

Appropriate loss function Estimation loss for the model index j, the values of the parameters under both models, and the observation x:
$$L(\delta, (j, \theta_j, \theta_{-j})) = \mathbb{I}_{\delta=1}\,\mathbb{I}_{f_2(x\mid\theta_2) > f_1(x\mid\theta_1)} + \mathbb{I}_{\delta=2}\,\mathbb{I}_{f_2(x\mid\theta_2) < f_1(x\mid\theta_1)}$$
whose Bayes solution,
$$\delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(f_2(x\mid\theta_2) < f_1(x\mid\theta_1)\mid x) > \frac{1}{2} \\ 2 & \text{otherwise,} \end{cases}$$
depends on the joint posterior distribution on (θ1, θ2), thus differs from Aitkin’s solution.

Slide 90

Slide 90 text

Asymptotic properties If M1 is the “true” model, then π(M1|x^n) = 1 + o_p(1) and
$$\Pr^{\pi_1}(l_1(\theta_1) > l_2(\theta_2)\mid x^n, \theta_2) = \Pr(-\chi^2_{p_1} > l_2(\theta_2) - l_1(\hat\theta_1)) + O_p(1/\sqrt n) = F_{p_1}(l_1(\hat\theta_1) - l_2(\theta_2)) + O_p(1/\sqrt n),$$
with p1 the dimension of Θ1 and $\hat\theta_1$ the maximum likelihood estimator of θ1. Since $l_2(\theta_2) \leq l_2(\hat\theta_2)$,
$$l_1(\hat\theta_1) - l_2(\theta_2) \geq n\,\mathrm{KL}(f_0, f_{\theta_2^*}) + O_p(\sqrt n),$$
where KL(f, g) is the Kullback–Leibler divergence and $\theta_2^* = \arg\min_{\theta_2} \mathrm{KL}(f_0, f_{\theta_2})$, so that
$$\Pr^\pi(f(x^n\mid\theta_2) < f(x^n\mid\theta_1)\mid x^n) = 1 + o_p(1).$$
Aitkin’s approach leads to $\Pr[\chi^2_{p_2} - \chi^2_{p_1} > l_2(\hat\theta_2) - l_1(\hat\theta_1)]$, thus depends on the asymptotic behaviour of the likelihood ratio [Gelman, Robert & Rousseau, 2012]

Slide 92

Slide 92 text

uniformly most powerful “Bayesian” tests Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests AoS version PNAS version Testing under incomplete information

Slide 93

Slide 93 text

Uniformly most powerful tests “The difficulty in constructing a Bayesian hypothesis test arises from the requirement to specify an alternative hypothesis.” Johnson’s 2013 paper in the Annals of Statistics introduces so-called uniformly most powerful Bayesian tests, relating to the original Neyman and Pearson uniformly most powerful tests:
$$\arg\max_\delta\, P_\theta(\delta = 0), \quad \theta \in \Theta_1$$
under the constraint
$$P_\theta(\delta = 0) \leq \alpha, \quad \theta \in \Theta_0$$

Slide 94

Slide 94 text

definition “UMPBTs provide a new form of default, nonsubjective Bayesian tests in which the alternative hypothesis is determined so as to maximize the probability that a Bayes factor exceeds a specified threshold” i.e., find a prior π1 on Θ1 (the alternative parameter space) to maximise
$$P_\theta(B_{10}(X) \geq \gamma), \quad \text{for all } \theta \in \Theta_1$$
...assuming “the null hypothesis is rejected if the posterior probability of H1 exceeds a certain threshold” [Johnson, 2013]

Slide 96

Slide 96 text

Examples Example (normal mean one-sided H0 : µ = µ0) H1 point mass at
$$\mu_1 = \mu_0 + \sigma\sqrt{2\log\gamma/n}$$
and Bayes factor
$$B_{10}(z) = \exp\{z\sqrt{2\log\gamma} - \log\gamma\}$$
[Johnson, PNAS, 2013]

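A numerical check of this one-sided example: for a point alternative at distance d = √n(µa − µ0)/σ, B10 ≥ γ amounts to z ≥ (log γ + d²/2)/d, so the alternative maximising Pθ(B10 ≥ γ), for every θ, minimises this cutoff; the grid and γ below are illustrative.

```python
# Sketch: the cutoff c(d) = (log(gamma) + d^2 / 2) / d is minimised at
# d* = sqrt(2 log gamma), i.e. mu_1 = mu_0 + sigma * sqrt(2 log(gamma) / n).
import math

gamma = 10.0
cutoff = {d: (math.log(gamma) + d**2 / 2) / d
          for d in [0.5 + 0.01 * i for i in range(500)]}
d_best = min(cutoff, key=cutoff.get)
print("numerical argmin:", round(d_best, 3),
      " closed form sqrt(2 log gamma):", round(math.sqrt(2 * math.log(gamma)), 3))
```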
Slide 97

Slide 97 text

Examples “Up to a constant factor that arises from the uniform distribution on µ...” Example (normal mean two-sample two-sided H0 : δµ = 0) H1 point mass at
$$\delta\mu = \sigma\sqrt{2(n_1 + n_2)\log\gamma/n_1 n_2}$$
and Bayes factor
$$B_{10}(z) = \exp\{z\sqrt{2\log\gamma} - \log\gamma\}$$
[Johnson, PNAS, 2013]

Slide 98

Slide 98 text

Examples Example (non-central chi-square H0 : λ = 0) H1 point mass at λ∗, the minimum of
$$\frac{1}{\sqrt\lambda}\,\log\left(e^{\lambda/2}\gamma + \sqrt{e^\lambda\gamma^2 - 1}\right)$$
and Bayes factor
$$B_{10}(x) = \exp\{-\lambda^*/2\}\cosh(\sqrt{\lambda^* x})$$
[Johnson, PNAS, 2013]

Slide 99

Slide 99 text

Examples Example (binomial probability one-sided H0 : p = p0) H1 point mass at p∗, the minimum of
$$\frac{\log\gamma - n[\log(1-p) - \log(1-p_0)]}{\log[p/(1-p)] - \log[p_0/(1-p_0)]}$$
and Bayes factor
$$B_{10}(x) = (p^*/p_0)^x\,((1-p^*)/(1-p_0))^{n-x}$$
[Johnson, PNAS, 2013]

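The binomial p∗ has no closed form, so here is a sketch of the grid minimisation of the displayed ratio (the rejection threshold on x, for a one-sided alternative p > p0); n, p0 and γ are illustrative.

```python
# Sketch: UMPBT point alternative p* for H0: p = p0 vs p > p0 in Binomial(n, p),
# by minimising the threshold on x at which B10(x) >= gamma.
import math

def threshold(p, p0=0.5, n=20, gamma=10.0):
    num = math.log(gamma) - n * (math.log(1 - p) - math.log(1 - p0))
    den = math.log(p / (1 - p)) - math.log(p0 / (1 - p0))
    return num / den

grid = [0.501 + 0.001 * i for i in range(498)]   # p in (p0, 1)
p_star = min(grid, key=threshold)
print("p* =", round(p_star, 3), " threshold on x:", round(threshold(p_star), 2))
```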
Slide 100

Slide 100 text

Criticisms
- means selecting the least favourable prior under H1 so that the frequentist probability of exceeding a threshold is uniformly maximal, in a minimax perspective
- requires frequentist averaging over all possible values of the observation (violates the Likelihood Principle)
- compares probabilities for all values of the parameter θ rather than integrating against a prior or posterior
- selects a prior under H1 with the sole purpose of favouring the alternative, meaning it has no further use when H0 is rejected
- caters to non-Bayesian approaches: Bayesian tools as supplementing p-values
- argues the method is objective because it satisfies a frequentist coverage
- very rarely exists, apart from one-dimensional exponential families
- extensions lead to data-dependent local alternatives

Slide 101

Slide 101 text

An impossibility theorem? “Unfortunately, subjective Bayesian testing procedures have not been–and will likely never be–generally accepted by the scientific community. In most testing problems, the range of scientific opinion regarding the magnitude of violations from a standard theory is simply too large to make the report of a single, subjective Bayes factor worthwhile. Furthermore, scientific journals have demonstrated an unwillingness to replace the report of a single p-value with a range of subjectively determined Bayes factors or posterior model probabilities.” [Bye, everyone!]

Slide 102

Slide 102 text

Criticisms (2)
- use of the alien notion of a “true” prior density (p.6) that would be misspecified, corresponding to “a point mass concentrated on the true value” for frequentists and to the summary of prior information for Bayesians, “not available”
- why compare the probability of rejection of H0 in favour of H1 for every value of θ when (a) a prior on H1 is used to define the Bayes factor, (b) conditioning on the data is lost, (c) the boundary or threshold γ is fixed, and (d) the induced order is incomplete
- the prior behind UMPB tests is quite likely to be atomic, while the natural dominating measure is Lebesgue
- those tests are not [NP] uniformly most powerful unless one picks a new definition of UMP tests
- strange asymptotics: under the null, $\log(B_{10}(X_{1:n})) \approx N(-\log\gamma, 2\log\gamma)$

Slide 103

Slide 103 text

goodness-of-fit? “...the tangible consequence of a Bayesian hypothesis test is often the rejection of one hypothesis in favor of the second (...) It is therefore of some practical interest to determine alternative hypotheses that maximize the probability that the Bayes factor from a test exceeds a specified threshold”. The definition of the alternative hypothesis is paramount: replacing genuine alternative H1 with one spawned by the null H0 voids the appeal of B approach, turning testing into a goodness-of-fit assessment

Slide 104

Slide 104 text

goodness-of-fit? The definition of the alternative hypothesis is paramount: replacing genuine alternative H1 with one spawned by the null H0 voids the appeal of B approach, turning testing into a goodness-of-fit assessment why would we look for the alternative that is most against H0? See Spanos’ (2013) objection of many alternative values of θ more likely than the null. This does not make them of particular interest or bound to support an alternative prior...

Slide 105

Slide 105 text

which threshold? “The posterior probability of the null hypothesis does not converge to 1 as the sample size grows. The null hypothesis is never fully accepted–nor the alternative rejected–when the evidence threshold is held constant as n increases.”
- notion of an abstract and fixed threshold γ linked with the Jeffreys-Lindley paradox
- assuming a golden number like 3 (b) is no less arbitrary than using 0.05 or 5σ as a significance bound (f)
- even the NP perspective on tests relies on Type I errors decreasing with n
- in fine, γ is determined by inverting the classical bound 0.05 or 0.005

Slide 106

Slide 106 text

which threshold? The “behavior of UMPBTs with fixed evidence thresholds is similar to the Jeffreys-Lindley paradox”. This aspect jeopardises the whole construct of UMPB tests, which depend on an arbitrary γ, unconnected with a loss function and orthogonal to any prior information

Slide 107

Slide 107 text

O’Bayes, anyone? “...defining a Bayes factor requires the specification of both a null hypothesis and an alternative hypothesis, and in many circumstances there is no objective mechanism for defining an alternative hypothesis. The definition of the alternative hypothesis therefore involves an element of subjectivity, and it is for this reason that scientists generally eschew the Bayesian approach toward hypothesis testing.” A notion that is purely frequentist, using Bayes factors as the statistic instead of another divergence statistic, with no objective Bayes features and no added value

Slide 108

Slide 108 text

O’Bayes, anyone? “The simultaneous report of default Bayes factors and p-values may play a pivotal role in dispelling the perception held by many scientists that a p-value of 0.05 corresponds to “significant” evidence against the null hypothesis (...) the report of Bayes factors based upon [UMPBTs] may lead to more realistic interpretations of evidence obtained from scientific studies.” A notion that is purely frequentist, using Bayes factors as the statistic instead of another divergence statistic, with no objective Bayes features and no added value

Slide 109

Slide 109 text

PNAS paper “To correct this [lack of reproducibility] problem, evidence thresholds required for the declaration of a significant finding should be increased to 25–50:1, and to 100–200:1 for the declaration of a highly significant finding.” Johnson’s (2013b) recycling of UMPB tests received much attention from the media for its simplistic message: move from the 0.05 significance bound to the 0.005 bound and hence reduce the rate of non-reproducible research outcomes [Johnson, 2013b]

Slide 110

Slide 110 text

new arguments
- default Bayesian procedures
- rejection regions can be matched to classical rejection regions
- provide evidence in “favor of both true null and true alternative hypotheses”
- “provides insight into the amount of evidence required to reject a null hypothesis”
- adopt level 0.005 as “P values of 0.005 correspond to Bayes factors around 50”

Slide 111

Slide 111 text

new criticisms
- dodges the essential nature of any such automated rule, namely that it expresses a tradeoff between the risks of publishing misleading results and of important results being left unpublished; such decisions should depend on costs, benefits, and probabilities of all outcomes
- the minimax alternative prior is not intended to correspond to any distribution of effect sizes, solely a worst-case scenario not accounting for a balance between two different losses
- threshold chosen relative to a conventional value, e.g. Jeffreys’ target Bayes factor of 1/25 or 1/50, for which there is no particular justification
- had Fisher chosen p = 0.005, Johnson could have argued about its failure to correspond to 200:1 evidence against the null! This γ = 0.005 turns into $z = \sqrt{-2\log(0.005)} \approx 3.26$ and a (one-sided) tail probability of Φ(−3.26) ≈ 0.0005, with no better or worse justification [Gelman & Robert, 2013]

Slide 112

Slide 112 text

Testing under incomplete information Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information Posterior predictive checking

Slide 113

Slide 113 text

Likelihood-free settings Cases when the likelihood function f(y|θ) is unavailable (in analytic and numerical senses) and when the completion step
$$f(y\mid\theta) = \int_{\mathcal Z} f(y, z\mid\theta)\,dz$$
is impossible or too costly because of the dimension of z: MCMC cannot be implemented!

Slide 114

Slide 114 text

The ABC method Bayesian setting: target is π(θ)f(x|θ). When the likelihood f(x|θ) is not in closed form, a likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating
$$\theta' \sim \pi(\theta), \quad z \sim f(z\mid\theta'),$$
until the auxiliary variable z is equal to the observed value, z = y. [Tavaré et al., 1997]

Slide 117

Slide 117 text

A as A...pproximative When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone
$$\rho(y, z) \leq \epsilon$$
where ρ is a distance. Output distributed from
$$\pi(\theta)\,P_\theta\{\rho(y, z) < \epsilon\} \overset{\text{def}}{\propto} \pi(\theta \mid \rho(y, z) < \epsilon)$$
[Pritchard et al., 1999]

Slide 119

Slide 119 text

ABC algorithm In most implementations, a further degree of A...pproximation:
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ′ from the prior distribution π(·)
      generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
  end for
where η(y) defines a (not necessarily sufficient) statistic

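A minimal runnable sketch of Algorithm 1 on a toy model where the likelihood is in fact available; the model (y ∼ N(θ, 1) with a N(0, 10) prior), the sample-mean summary, and the tolerance are all illustrative assumptions.

```python
# Sketch of the likelihood-free rejection sampler: y_i | theta ~ N(theta, 1),
# theta ~ N(0, 10), eta = sample mean, rho = absolute difference.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=30)         # pseudo-observed data
eta_y, eps, N = y.mean(), 0.05, 500

samples = []
while len(samples) < N:
    theta = rng.normal(0.0, np.sqrt(10.0))        # theta' ~ pi(.)
    z = rng.normal(theta, 1.0, size=y.size)       # z ~ f(.|theta')
    if abs(z.mean() - eta_y) <= eps:              # rho{eta(z), eta(y)} <= eps
        samples.append(theta)

print("ABC posterior mean:", round(float(np.mean(samples)), 3))
```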
Slide 120

Slide 120 text

Which summary η(·)? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic
- loss of statistical information balanced against gain in data roughening
- approximation error and information loss remain unknown
- choice of statistics induces choice of distance function towards standardisation
- may be imposed for external/practical reasons (e.g., LDA)
- may gather several non-B point estimates
- can learn about efficient combination [Estoup et al., 2012, Genetics]

Slide 123

Slide 123 text

Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
  for t = 1 to T do
    repeat
      generate m from the prior π(M = m)
      generate θm from the prior πm(θm)
      generate z from the model fm(z|θm)
    until ρ{η(z), η(y)} < ε
    set m(t) = m and θ(t) = θm
  end for
[Grelaud et al., 2009]

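A sketch of Algorithm 2 on the normal-versus-Laplace toy comparison used later in the talk, with the (insufficient) sample mean as summary; priors, tolerance, and data are illustrative assumptions.

```python
# Sketch of ABC model choice: M1: y ~ N(theta, 1) vs M2: y ~ Laplace(theta, 1/sqrt(2)),
# theta ~ N(0, 4) under both models, eta = sample mean.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=50)         # pseudo-observed data, drawn from M1
eta_y, eps, T = y.mean(), 0.05, 1000

accepted = []
while len(accepted) < T:
    m = int(rng.integers(1, 3))                   # m ~ pi(M = m), uniform on {1, 2}
    theta = rng.normal(0.0, 2.0)                  # theta_m ~ pi_m(theta_m)
    if m == 1:
        z = rng.normal(theta, 1.0, size=y.size)
    else:
        z = rng.laplace(theta, 1 / np.sqrt(2), size=y.size)
    if abs(z.mean() - eta_y) < eps:
        accepted.append(m)

print("P(M = 1 | y) ~", accepted.count(1) / T)
```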
Slide 124

Slide 124 text

ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m,
$$\frac{1}{T}\sum_{t=1}^T \mathbb{I}_{m^{(t)}=m}.$$
Issues with implementation:
- should tolerances be the same for all models?
- should summary statistics vary across models (incl. their dimension)?
- should the distance measure ρ vary as well?

Slide 125

Slide 125 text

ABC estimates Posterior probability π(M = m|y) approximated by the frequency of acceptances from model m,
$$\frac{1}{T}\sum_{t=1}^T \mathbb{I}_{m^{(t)}=m}.$$
Extension to a weighted polychotomous logistic regression estimate of π(M = m|y), with non-parametric kernel weights [Cornuet et al., DIYABC, 2009]

Slide 126

Slide 126 text

ABCµ Idea Infer about the error as well as about the parameter: use of a joint density
$$f(\theta, \epsilon \mid y) \propto \xi(\epsilon \mid y, \theta) \times \pi_\theta(\theta) \times \pi_\epsilon(\epsilon)$$
where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ). Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]

Slide 129

Slide 129 text

ABCµ details Multidimensional distances ρk (k = 1, . . . , K) and errors
$$\epsilon_k = \rho_k(\eta_k(z), \eta_k(y)), \quad \epsilon_k \sim \xi_k(\epsilon \mid y, \theta) \approx \hat\xi_k(\epsilon \mid y, \theta) = \frac{1}{B h_k}\sum_b K[\{\epsilon_k - \rho_k(\eta_k(z_b), \eta_k(y))\}/h_k]$$
then used in replacing ξ(ε|y, θ) with $\min_k \hat\xi_k(\epsilon \mid y, \theta)$. ABCµ involves the acceptance probability
$$\frac{\pi(\theta', \epsilon')}{\pi(\theta, \epsilon)}\,\frac{q(\theta', \theta)\,q(\epsilon', \epsilon)}{q(\theta, \theta')\,q(\epsilon, \epsilon')}\,\frac{\min_k \hat\xi_k(\epsilon' \mid y, \theta')}{\min_k \hat\xi_k(\epsilon \mid y, \theta)}$$

Slide 131

Slide 131 text

ABCµ multiple errors [Figure © Ratmann et al., PNAS, 2009]

Slide 132

Slide 132 text

ABCµ for model choice [Figure © Ratmann et al., PNAS, 2009]

Slide 133

Slide 133 text

Questions about ABCµ [and model choice] For each model under comparison, the marginal posterior on ε is used to assess the fit of the model (whether or not the HPD region includes 0).
- is the data informative about ε? [identifiability]
- how much does the prior π(ε) impact the comparison?
- how is using both ξ(ε|x0, θ) and πε(ε) compatible with a standard probability model?
- where is the penalisation for complexity in the model comparison? [X, Mengersen & Chen, 2010, PNAS]

Slide 135

Slide 135 text

Formalised framework Central question to the validation of ABC for model choice: When is a Bayes factor based on an insufficient statistic T(y) consistent? Note: the inference drawn on T(y) through $B_{12}^T(y)$ necessarily differs from the inference drawn on y through $B_{12}(y)$

Slide 137

Slide 137 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one).

Slide 138

Slide 138 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). Four possible statistics:
1. sample mean ȳ (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|)

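A sketch computing the four candidate summaries on large simulated samples from each model, showing which statistics separate the two (sample sizes and seeds are illustrative):

```python
# Sketch: the four candidate summaries under N(0, 1) and Laplace(0, 1/sqrt(2))
# (both variance one); mean, median and variance roughly agree, mad does not.
import numpy as np

rng = np.random.default_rng(3)

def summaries(y):
    med = np.median(y)
    return {"mean": y.mean(), "median": med,
            "var": y.var(ddof=1), "mad": np.median(np.abs(y - med))}

n = 10_000
for name, y in [("Gauss", rng.normal(0, 1, n)),
                ("Laplace", rng.laplace(0, 1 / np.sqrt(2), n))]:
    print(name, {k: round(float(v), 3) for k, v in summaries(y).items()})
```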
Slide 139

Slide 139 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figure: density of the posterior probability of model M1.]

Slide 140

Slide 140 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figure: boxplots of the posterior probability of M1 under data from the Gauss and the Laplace models, n = 100.]

Slide 141

Slide 141 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figures: densities of the posterior probability of M1 (two panels).]

Slide 142

Slide 142 text

A benchmark of a toy example Comparison suggested by a referee of the PNAS paper [thanks]: [X, Cornuet, Marin, & Pillai, Aug. 2011] Model M1: y ∼ N(θ1, 1) opposed to model M2: y ∼ L(θ2, 1/√2), the Laplace distribution with mean θ2 and scale parameter 1/√2 (variance one). [Figures: boxplots of the posterior probability of M1 under Gauss and Laplace data, n = 100 (two panels).]

Slide 143

Slide 143 text

Consistency theorem If Pn belongs to one of the two models and if µ0 = E[T] cannot be attained by the other one:
$$0 = \min\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right) < \max\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right),$$
then the Bayes factor $B_{12}^T$ is consistent

Slide 144

Slide 144 text

Conclusion Model selection feasible with ABC:
- choice of summary statistics is paramount
- at best, ABC output → π(· | η(y)), which concentrates around µ0
- for estimation: {θ; µ(θ) = µ0} = {θ0}
- for testing: {µ1(θ1), θ1 ∈ Θ1} ∩ {µ2(θ2), θ2 ∈ Θ2} = ∅ [Marin et al., 2013]

Slide 146

Slide 146 text

Posterior predictive checking Significance tests: one new parameter Jeffreys-Lindley paradox Deviance (information criterion) Aitkin’s integrated likelihood Johnson’s uniformly most powerful Bayesian tests Testing under incomplete information Posterior predictive checking

Slide 147

Slide 147 text

Bayesian predictive “If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution. This is really a self-consistency check: an observed discrepancy can be due to model misfit or chance.” (BDA, p.143) Use of the posterior predictive
$$p(y^{\mathrm{rep}}\mid y) = \int p(y^{\mathrm{rep}}\mid\theta)\,\pi(\theta\mid y)\,d\theta$$
and a measure of discrepancy T(·, ·), replacing the p-value
$$p(y\mid\theta) = P(T(y^{\mathrm{rep}}, \theta) \geq T(y, \theta)\mid\theta)$$
with the Bayesian posterior p-value
$$P(T(y^{\mathrm{rep}}, \theta) \geq T(y, \theta)\mid y) = \int p(y\mid\theta)\,\pi(\theta\mid y)\,d\theta$$

Slide 149

Slide 149 text

Issues “the posterior predictive p-value is such a [Bayesian] probability statement, conditional on the model and data, about what might be expected in future replications.” (BDA, p.151)
- sounds too much like a p-value...!
- relies on the choice of T(·, ·)
- seems to favour overfitting
- (again) using the data twice (once for the posterior and twice in the p-value)
- needs to be calibrated (back to 0.05?)
- general difficulty in interpreting
- where is the penalty for model complexity?

Slide 150

Slide 150 text

Example Normal-normal mean model:
$$X \sim N(\theta, 1), \quad \theta \sim N(0, 10)$$
Bayesian posterior p-value for T(x) = x², m(x), $B_{10}(x)^{-1}$:
$$\int P(|X| \geq |x| \mid \theta, x)\,\pi(\theta\mid x)\,d\theta$$

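A sketch of this tail probability by simulation, reading N(0, 10) as prior variance 10 (an assumption on the notation), so that θ | x ∼ N(10x/11, 10/11):

```python
# Sketch: posterior predictive p-value P(|X_rep| >= |x| | x) for
# X | theta ~ N(theta, 1), theta ~ N(0, 10) (10 taken as the prior variance).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def ppp(x, nsim=100_000):
    theta = rng.normal(10.0 * x / 11.0, np.sqrt(10.0 / 11.0), size=nsim)
    # P(|X_rep| >= |x| | theta), averaged over the posterior sample
    return np.mean(norm.sf(abs(x) - theta) + norm.cdf(-abs(x) - theta))

for x in (0.5, 1.0, 2.0, 4.0):
    print(f"x = {x}:  posterior predictive p-value = {ppp(x):.3f}")
```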
Slide 151

Slide 151 text

Example Normal-normal mean model: X ∼ N(θ, 1), θ ∼ N(0, 10). Bayesian posterior p-value for T(x) = x², m(x), $B_{10}(x)^{-1}$:
$$\int P(|X| \geq |x| \mid \theta, x)\,\pi(\theta\mid x)\,d\theta$$
which interpretation? [Figure: P(|X| > |x|) as a function of x.]

Slide 152

Slide 152 text

Example Normal-normal mean model: X ∼ N(θ, 1), θ ∼ N(0, 10). The Bayesian posterior p-value
$$\int P(|X| \geq |x| \mid \theta, x)\,\pi(\theta\mid x)\,d\theta$$
goes down as x gets away from 0... while the discrepancy based on B10(x) increases mildly. [Figure: P(|X| > |x|) as a function of x.]

Slide 153

Slide 153 text

goodness-of-fit [only?] “A model is suspect if a discrepancy is of practical importance and its observed value has a tail-area probability near 0 or 1, indicating that the observed pattern would be unlikely to be seen in replications of the data if the model were true. An extreme p-value implies that the model cannot be expected to capture this aspect of the data. A p-value is a posterior probability and can therefore be interpreted directly—although not as Pr(model is true | data). Major failures of the model (...) can be addressed by expanding the model appropriately.” BDA, p.150
- not helpful in comparing models (both may be deficient)
- anti-Ockham? i.e., may favour larger dimensions (if the prior is concentrated enough)
- lingering worries about using the data twice and a favourable bias
- impact of the prior (only under the current model) but allows for improper priors
