Bayes 250 versus Bayes-2.5.0

Presentation celebrating Bayes' Essay's 250th anniversary at the European Meeting of Statisticians in Budapest, Hungary, July 2013

Xi'an
January 21, 2014

Transcript

  1. Bayes 250th versus Bayes 2.5.0
    Christian P. Robert
    Université Paris-Dauphine, University of Warwick, & CREST, Paris
    written for EMS 2013, Budapest


  2. Outline
    Bayes, Thomas (1702–1761)
    Jeffreys, Harold (1891–1989)
    Lindley, Dennis (1923– )
    Besag, Julian (1945–2010)
    de Finetti, Bruno (1906–1985)


  3. Bayes, Price and Laplace
    Bayes, Thomas (1702–1761)
    Bayes’ 1763 paper
    Bayes’ example
    Laplace’s 1774 derivation
    Jeffreys, Harold (1891–1989)
    Lindley, Dennis (1923– )
    Besag, Julian (1945–2010)
    de Finetti, Bruno (1906–1985)


  4. a first Bayes 250
    Took place in Edinburgh, Sept. 5–7, 2011:
    Sparse Nonparametric Bayesian Learning from
    Big Data David Dunson, Duke University
    Classification Models and Predictions for Ordered
    Data Chris Holmes, Oxford University
    Bayesian Variable Selection in Markov Mixture
    Models Luigi Spezia, Biomathematics
    & Statistics Scotland, Aberdeen
    Bayesian inference for partially observed Markov
    processes, with application to systems biology
    Darren Wilkinson, University of Newcastle
    Coherent Inference on Distributed Bayesian
    Expert Systems Jim Smith, University of Warwick
    Probabilistic Programming John Winn, Microsoft
    Research
    How To Gamble If You Must (courtesy of the
    Reverend Bayes) David Spiegelhalter, University
    of Cambridge
    Inference and computing with decomposable
    graphs Peter Green, University of Bristol
    Nonparametric Bayesian Models for Sparse
    Matrices and Covariances Zoubin Ghahramani,
    University of Cambridge
    Latent Force Models Neil Lawrence, University of
    Sheffield
    Does Bayes Theorem Work? Michael Goldstein,
    Durham University
    Bayesian Priors in the Brain Peggy Series,
    University of Edinburgh
    Approximate Bayesian Computation for model
    selection Christian Robert, Université Paris-Dauphine
    ABC-EP: Expectation Propagation for
    Likelihood-free Bayesian Computation Nicholas
    Chopin, CREST–ENSAE
    Bayes at Edinburgh University - a talk and tour
    Dr Andrew Fraser, Honorary Fellow, University of
    Edinburgh
    Intractable likelihoods and exact approximate
    MCMC algorithms Christophe Andrieu,
    University of Bristol
    Bayesian computational methods for intractable
    continuous-time non-Gaussian time series Simon
    Godsill, University of Cambridge
    Efficient MCMC for Continuous Time Discrete
    State Systems Yee Whye Teh, Gatsby
    Computational Neuroscience Unit, University
    College London
    Adaptive Control and Bayesian Inference Carl
    Rasmussen, University of Cambridge
    Bernstein - von Mises theorem for irregular
    statistical models Natalia Bochkina, University of
    Edinburgh


  5. Why Bayes 250?
    Publication on Dec. 23, 1763 of
    “An Essay towards solving a
    Problem in the Doctrine of
    Chances” by the late
    Rev. Mr. Bayes, communicated
    by Mr. Price in the Philosophical
    Transactions of the Royal Society
    of London.
    c 250th anniversary of the Essay


  9. Breaking news!!!
    An accepted paper by Stephen Stigler in Statistical Science
    uncovers the true title of the Essay:
    “A Method of Calculating the Exact Probability of All Conclusions founded on Induction”
    Intended as a reply to Hume’s (1748) evaluation of the probability of miracles


  10. Breaking news!!!
    may have been written as early as 1749: “we may hope to
    determine the Proportions, and, by degrees, the whole Nature
    of unknown Causes, by a sufficient Observation of their
    effects” (D. Hartley)
    in 1767, Richard Price used
    Bayes’ theorem as a tool to
    attack Hume’s argument,
    referring to the above title
    Bayes’ offprints available at
    Yale’s Beinecke Library (but
    missing the title page) and
    at the Library Company of
    Philadelphia (Franklin’s
    library)
    [Stigler, 2013]


  11. Bayes Theorem
    Bayes theorem = Inversion of causes and effects
    If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are
    related by
    P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|A^c)P(A^c)] = P(E|A)P(A) / P(E)


  13. Bayes Theorem
    Bayes theorem = Inversion of causes and effects
    Continuous version for random
    variables X and Y
    f_{X|Y}(x|y) = f_{Y|X}(y|x) × f_X(x) / f_Y(y)
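As a quick numerical sketch of the two statements above (the test characteristics and the 1% prevalence below are invented for illustration, not taken from the talk):

```python
# Toy illustration of Bayes' theorem, P(A|E) = P(E|A)P(A) / P(E),
# with hypothetical numbers for a diagnostic-test setting.
p_A = 0.01              # P(A): prior probability of the condition
p_E_given_A = 0.95      # P(E|A): probability of observing E when A holds
p_E_given_Ac = 0.10     # P(E|A^c): probability of observing E when A does not hold

p_E = p_E_given_A * p_A + p_E_given_Ac * (1 - p_A)   # total probability P(E)
p_A_given_E = p_E_given_A * p_A / p_E                # inversion of causes and effects

print(f"P(A|E) = {p_A_given_E:.3f}")                 # about 0.088
```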


  14. Who was Thomas Bayes?
    Reverend Thomas Bayes (ca. 1702–1761), educated in London
    then at the University of Edinburgh (1719-1721), presbyterian
    minister in Tunbridge Wells (Kent) from 1731, son of Joshua
    Bayes, nonconformist minister.
    “Election to the Royal Society based on
    a tract of 1736 where he defended the
    views and philosophy of Newton.
    A notebook of his includes a method of
    finding the time and place of
    conjunction of two planets, notes on
    weights and measures, a method of
    differentiation, and logarithms.”
    [Wikipedia]


  16. Bayes’ 1763 paper:
    Billiard ball W rolled on a line of length one, with a uniform
    probability of stopping anywhere: W stops at p.
    Second ball O then rolled n times under the same assumptions. X
    denotes the number of times the ball O stopped on the left of W .


  17. Bayes’ 1763 paper:
    Billiard ball W rolled on a line of length one, with a uniform
    probability of stopping anywhere: W stops at p.
    Second ball O then rolled n times under the same assumptions. X
    denotes the number of times the ball O stopped on the left of W .
    Bayes’ question:
    Given X, what inference can we make on p?


  18. Bayes’ 1763 paper:
    Billiard ball W rolled on a line of length one, with a uniform
    probability of stopping anywhere: W stops at p.
    Second ball O then rolled n times under the same assumptions. X
    denotes the number of times the ball O stopped on the left of W .
    Bayes’ wording:
    “Given the number of times in which an unknown event has
    happened and failed; Required the chance that the probability of
    its happening in a single trial lies somewhere between any two
    degrees of probability that can be named.”


  19. Bayes’ 1763 paper:
    Billiard ball W rolled on a line of length one, with a uniform
    probability of stopping anywhere: W stops at p.
    Second ball O then rolled n times under the same assumptions. X
    denotes the number of times the ball O stopped on the left of W .
    Modern translation:
    Derive the posterior distribution of p given X, when
    p ∼ U([0, 1]) and X|p ∼ B(n, p)


  20. Resolution
    Since
    P(X = x | p) = \binom{n}{x} p^x (1 − p)^{n−x} ,
    P(a < p < b and X = x) = ∫_a^b \binom{n}{x} p^x (1 − p)^{n−x} dp
    and
    P(X = x) = ∫_0^1 \binom{n}{x} p^x (1 − p)^{n−x} dp,


  21. Resolution (2)
    then
    P(a < p < b | X = x) = ∫_a^b \binom{n}{x} p^x (1 − p)^{n−x} dp / ∫_0^1 \binom{n}{x} p^x (1 − p)^{n−x} dp
    = ∫_a^b p^x (1 − p)^{n−x} dp / B(x + 1, n − x + 1) ,
    i.e.
    p|x ∼ Be(x + 1, n − x + 1)
    [Beta distribution]
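A minimal numerical sketch of this resolution (the values of n, x, a and b are arbitrary illustrative choices): P(a < p < b | X = x) is just a difference of Be(x + 1, n − x + 1) cdf values.

```python
from scipy.stats import beta

n, x = 10, 3         # e.g. 10 rolls of ball O, 3 stops to the left of W
a, b = 0.2, 0.5      # interval of interest for p

posterior = beta(x + 1, n - x + 1)           # p | x ~ Be(x+1, n-x+1)
prob = posterior.cdf(b) - posterior.cdf(a)   # P(a < p < b | X = x)
print(f"P({a} < p < {b} | X = {x}) = {prob:.3f}")
```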


  22. Resolution (2)
    then
    P(a < p < b | X = x) = ∫_a^b \binom{n}{x} p^x (1 − p)^{n−x} dp / ∫_0^1 \binom{n}{x} p^x (1 − p)^{n−x} dp
    = ∫_a^b p^x (1 − p)^{n−x} dp / B(x + 1, n − x + 1) ,
    i.e.
    p|x ∼ Be(x + 1, n − x + 1)
    In Bayes’ words:
    “The same things supposed, I guess that the probability
    of the event M lies somewhere between 0 and the ratio of
    Ab to AB, my chance to be in the right is the ratio of
    Abm to AiB.”


  23. Laplace’s version
    Pierre Simon (de) Laplace (1749–1827):
    “I propose to determine the probability of causes from events, a matter
    new in many respects and all the more deserving of being cultivated as
    it is chiefly from this point of view that the science of chance can be
    useful to civil life.”
    [Mémoire sur la probabilité des causes par les événemens, 1774]


  24. Laplace’s version
    “If an event can be produced by a number n of different causes, the
    probabilities of the existence of these causes given the event are to
    each other as the probabilities of the event given these causes, and the
    probability of the existence of each of them is equal to the probability
    of the event given that cause, divided by the sum of all the
    probabilities of the event given each of these causes.”
    [Mémoire sur la probabilité des causes par les événemens, 1774]


  25. Laplace’s version
    In modern terms: under a uniform prior,
    P(Ai |E) / P(Aj |E) = P(E|Ai ) / P(E|Aj )
    and
    f(x|y) = f(y|x) / ∫ f(y|x) dx
    [Mémoire sur la probabilité des causes par les événemens, 1774]


  26. Laplace’s version
    Laplace later acknowledges Bayes:
    “Bayes sought directly the probability that the possibilities indicated
    by experiments already made lie within given limits, and he reached it
    in a refined and very ingenious manner.”
    [Essai philosophique sur les probabilités, 1810]


  27. Another Bayes 250
    Meeting that took place at the Royal Statistical Society, June
    19-20, 2013, on the current state of Bayesian statistics
    G. Roberts (University of Warwick) “Bayes for
    differential equation models”
    N. Best (Imperial College London) “Bayesian
    space-time models for environmental
    epidemiology”
    D. Prangle (Lancaster University) “Approximate
    Bayesian Computation”
    P. Dawid (University of Cambridge), “Putting
    Bayes to the Test”
    M. Jordan (UC Berkeley) “Feature Allocations,
    Probability Functions, and Paintboxes”
    I. Murray (University of Edinburgh) “Flexible
    models for density estimation”
    M. Goldstein (Durham University) “Geometric
    Bayes”
    C. Andrieu (University of Bristol) “Inference with
    noisy likelihoods”
    A. Golightly (Newcastle University), “Auxiliary
    particle MCMC schemes for partially observed
    diffusion processes”
    S. Richardson (MRC Biostatistics Unit)
    “Biostatistics and Bayes”
    C. Yau (Imperial College London)
    “Understanding cancer through Bayesian
    approaches”
    S. Walker (University of Kent) “The Misspecified
    Bayesian”
    S. Wilson (Trinity College Dublin), “Linnaeus,
    Bayes and the number of species problem”
    B. Calderhead (UCL) “Probabilistic Integration
    for Differential Equation Models”
    P. Green (University of Bristol and UT Sydney)
    “Bayesian graphical model determination”


  28. The search for certain π
    Bayes, Thomas (1702–1761)
    Jeffreys, Harold (1891–1989)
    Keynes’ treatise
    Jeffreys’ prior distributions
    Jeffreys’ Bayes factor
    expected posterior priors
    Lindley, Dennis (1923– )
    Besag, Julian (1945–2010)
    de Finetti, Bruno (1906–1985)


  29. Keynes’ dead end
    In John Maynard Keynes’s A Treatise on Probability (1921):
    “I do not believe that there is
    any direct and simple method by
    which we can make the transition
    from an observed numerical
    frequency to a numerical measure
    of probability.”
    [Robert, 2011, ISR]


  30. Keynes’ dead end
    In John Maynard Keynes’s A Treatise on Probability (1921):
    “Bayes’ enunciation is strictly
    correct and its method of arriving
    at it shows its true logical
    connection with more
    fundamental principles, whereas
    Laplace’s enunciation gives it the
    appearance of a new principle
    specially introduced for the
    solution of causal problems.”
    [Robert, 2011, ISR]


  31. Who was Harold Jeffreys?
    Harold Jeffreys (1891–1989)
    mathematician, statistician,
    geophysicist, and astronomer.
    Knighted in 1953 and Gold
    Medal of the Royal Astronomical
    Society in 1937. Funder of
    modern British geophysics. Many
    of his contributions are
    summarised in his book The
    Earth.
    [Wikipedia]


  32. Theory of Probability
    The first modern and comprehensive treatise on (objective)
    Bayesian statistics
    Theory of Probability (1939)
    begins with probability, refining
    the treatment in Scientific
    Inference (1937), and proceeds to
    cover a range of applications
    comparable to that in Fisher’s
    book.
    [Robert, Chopin & Rousseau, 2009, Stat. Science]


  33. Jeffreys’ justifications
    All probability statements are conditional
    Actualisation of the information on θ by extracting the
    information on θ contained in the observation x
    The principle of inverse probability does correspond
    to ordinary processes of learning (I, §1.5)
    Allows incorporation of imperfect information in the decision
    process
    A probability number can be regarded as a
    generalization of the assertion sign (I, §1.51).


  34. Posterior distribution
    Operates conditional upon the observations
    Incorporates the requirement of the Likelihood Principle
    ...the whole of the information contained in the
    observations that is relevant to the posterior
    probabilities of different hypotheses is summed up in
    the values that they give the likelihood (II, §2.0).
    Avoids averaging over the unobserved values of x
    Coherent updating of the information available on θ,
    independent of the order in which i.i.d. observations are
    collected
    ...can be used as the prior probability in taking
    account of a further set of data, and the theory can
    therefore always take account of new information (I,
    §1.5).
    Provides a complete inferential scope


  38. Subjective priors
    Subjective nature of priors
    Critics (...) usually say that the prior probability is
    ‘subjective’ (...) or refer to the vagueness of previous
    knowledge as an indication that the prior probability
    cannot be assessed (VIII, §8.0).
    Long walk (from Laplace’s principle of insufficient reason) to a
    reference prior:
    A prior probability used to express ignorance is merely the
    formal statement of ignorance (VIII, §8.1).


  40. The fundamental prior
    ...if we took the prior probability density for the
    parameters to be proportional to ||gik||^{1/2} [= |I(θ)|^{1/2}], it
    could be stated for any law that is differentiable with respect
    to all parameters that the total probability in any region
    of the αi would be equal to the total probability in the
    corresponding region of the α′i; in other words, it satisfies
    the rule that equivalent propositions have the same
    probability (III, §3.10)
    Note: Jeffreys never mentions Fisher information in connection
    with (gik)


  41. The fundamental prior
    In modern terms:
    if I(θ) is the Fisher information matrix associated with the
    likelihood ℓ(θ|x),
    I(θ) = Eθ[ (∂ℓ/∂θ)(∂ℓ/∂θ)^T ]
    the reference prior distribution is
    π∗(θ) ∝ |I(θ)|^{1/2}
    Note: Jeffreys never mentions Fisher information in connection
    with (gik)
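As a small sketch of the rule π∗(θ) ∝ |I(θ)|^{1/2} on the binomial model (my own example, not from the slides): for X ∼ B(n, p), I(p) = n/[p(1 − p)], so the Jeffreys prior is proportional to [p(1 − p)]^{−1/2}, i.e. a Be(1/2, 1/2) distribution.

```python
import numpy as np
from scipy.stats import beta

def jeffreys_binomial(p, n=1):
    """Unnormalised Jeffreys prior |I(p)|^{1/2} for X ~ B(n, p), with I(p) = n / [p(1-p)]."""
    return np.sqrt(n / (p * (1.0 - p)))

p = np.linspace(0.05, 0.95, 19)
ratio = jeffreys_binomial(p) / beta(0.5, 0.5).pdf(p)
print(ratio)   # constant over p (equal to pi for n = 1): Jeffreys prior = Be(1/2, 1/2)
```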


  42. Objective prior distributions
    reference priors (Bayarri, Bernardo, Berger, ...)
    not supposed to represent complete ignorance (Kass
    & Wasserman, 1996)
    The prior probabilities needed to express ignorance
    of the value of a quantity to be estimated, where
    there is nothing to call special attention to a
    particular value are given by an invariance theory
    (Jeffreys, VIII, §8.6).
    often endowed with or seeking frequency-based properties
    Jeffreys also proposed another Jeffreys prior dedicated to
    testing (Bayarri & Garcia-Donato, 2007)


  43. Jeffreys’ Bayes factor
    Definition (Bayes factor, Jeffreys, V, §5.01)
    For testing the hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
    B01 = {π(Θ0|x) / π(Θ0^c|x)} / {π(Θ0) / π(Θ0^c)}
    = ∫_{Θ0} f(x|θ) π0(θ) dθ / ∫_{Θ0^c} f(x|θ) π1(θ) dθ
    Equivalent to Bayes rule: acceptance if
    B01 > {(1 − π(Θ0))/a1}/{π(Θ0)/a0}
    What if... π0 is improper?!
    [DeGroot, 1973; Berger, 1985; Marin & Robert, 2007]


  45. Expected posterior priors (example)
    Starting from reference priors π0^N and π1^N, substitute prior
    distributions π0 and π1 that solve the system of integral equations
    π0(θ0) = ∫_X π0^N(θ0 | x) m1(x) dx
    and
    π1(θ1) = ∫_X π1^N(θ1 | x) m0(x) dx,
    where x is an imaginary minimal training sample and m0, m1 are
    the marginals associated with π0 and π1 respectively:
    m0(x) = ∫ f0(x|θ0) π0(dθ0)   m1(x) = ∫ f1(x|θ1) π1(dθ1)
    [Perez & Berger, 2000]


  46. Existence/Uniqueness
    Recurrence condition
    When both the observations and the parameters in both models
    are continuous, if the Markov chain with transition kernel
    Q(θ0′ | θ0) = ∫ g(θ0, θ0′, θ1, x, x′) dx dx′ dθ1
    where
    g(θ0, θ0′, θ1, x, x′) = π0^N(θ0′ | x′) f1(x′ | θ1) π1^N(θ1 | x) f0(x | θ0) ,
    is recurrent, then there exists a solution to the integral equations,
    unique up to a multiplicative constant.
    [Cano, Salmerón, & Robert, 2008, 2013]


  47. Bayesian testing of hypotheses
    Bayes, Thomas (1702–1761)
    Jeffreys, Harold (1891–1989)
    Lindley, Dennis (1923– )
    Lindley’s paradox
    dual versions of the paradox
    “Who should be afraid of the
    Lindley–Jeffreys paradox?”
    Bayesian resolutions
    Besag, Julian (1945–2010)
    de Finetti, Bruno (1906–1985)


  48. Who is Dennis Lindley?
    British statistician, decision theorist and
    leading advocate of Bayesian statistics.
    Held positions at Cambridge,
    Aberystwyth, and UCL, retiring at the
    early age of 54 to become an itinerant
    scholar. Wrote four books and
    numerous papers on Bayesian statistics.
    c “Coherence is everything”


  49. Lindley’s paradox
    In a normal mean testing problem,
    x̄n ∼ N(θ, σ²/n) ,  H0 : θ = θ0 ,
    under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
    B01(tn) = (1 + n)^{1/2} exp(−n tn² / 2[1 + n]) ,
    where tn = √n |x̄n − θ0| / σ, satisfies
    B01(tn) −→ ∞ as n −→ ∞
    [assuming a fixed tn]
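A quick numerical check of this divergence (a sketch; the fixed value tn = 1.96 is an arbitrary choice, sitting exactly at the frequentist 5% boundary):

```python
import numpy as np

def B01(t, n):
    """Bayes factor of the slide: B01(t_n) = (1 + n)^{1/2} exp(-n t_n^2 / (2 (1 + n)))."""
    return np.sqrt(1.0 + n) * np.exp(-n * t**2 / (2.0 * (1.0 + n)))

t = 1.96   # held fixed while n grows
for n in (10, 100, 10_000, 1_000_000):
    print(n, B01(t, n))
# B01 behaves like sqrt(n) exp(-t^2/2), so the evidence in favour of H0 diverges
# even though t = 1.96 remains "significant at 5%" for the frequentist test.
```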


  50. Lindley’s paradox
    Often dubbed Jeffreys–Lindley paradox...
    In terms of
    t = √(n − 1) x̄ / s ,  ν = n − 1 ,
    K ∼ √(πν/2) (1 + t²/ν)^{−ν/2 + 1/2} .
    (...) The variation of K with t is much more important
    than the variation with ν (Jeffreys, V, §5.2).


  51. Two versions of the paradox
    “the weight of Lindley’s paradoxical result (...) burdens
    proponents of the Bayesian practice”.
    [Lad, 2003]
    official version, opposing frequentist and Bayesian assessments
    [Lindley, 1957]
    intra-Bayesian version, blaming vague and improper priors for
    the Bayes factor misbehaviour:
    if π1(·|σ) depends on a scale parameter σ, it is often the case
    that
    B01(x) −→ +∞ as σ −→ ∞
    for a given x, meaning H0 is always accepted
    [Robert, 1992, 2013]


  52. Evacuation of the first version
    Two paradigms [(b) versus (f)]
    one (b) operates on the parameter space Θ, while the other
    (f) is produced from the sample space
    one (f) relies solely on the point-null hypothesis H0 and the
    corresponding sampling distribution, while the other
    (b) opposes H0 to a (predictive) marginal version of H1
    one (f) could reject “a hypothesis that may be true (...)
    because it has not predicted observable results that have not
    occurred” (Jeffreys, VII, §7.2) while the other (b) conditions
    upon the observed value xobs
    one (f) resorts to an arbitrary fixed bound α on the p-value,
    while the other (b) refers to the boundary probability of 1/2


  53. More arguments on the first version
    observing a constant tn as n increases is of limited interest:
    under H0, tn has a limiting N(0, 1) distribution, while, under H1,
    tn a.s. converges to ∞
    this behaviour remains entirely compatible with the
    consistency of the Bayes factor, which a.s. converges either to
    0 or ∞, depending on which hypothesis is true.
    Consequent literature (e.g., Berger & Sellke,1987) has since then
    shown how divergent those two approaches could be (to the point
    of being asymptotically incompatible).
    [Robert, 2013]


  54. Nothing’s wrong with the second version
    n, prior’s scale factor: prior variance n times larger than the
    observation variance and when n goes to ∞, Bayes factor
    goes to ∞ no matter what the observation is
    n becomes what Lindley (1957) calls “a measure of lack of
    conviction about the null hypothesis”
    when prior diffuseness under H1 increases, only relevant
    information becomes that θ could be equal to θ0, and this
    overwhelms any evidence to the contrary contained in the data
    mass of the prior distribution in the vicinity of any fixed
    neighbourhood of the null hypothesis vanishes to zero under
    H1
    [Robert, 2013]
    c deep coherence in the outcome: being indecisive about the
    alternative hypothesis means we should not choose it


  56. “Who should be afraid of the Lindley–Jeffreys paradox?”
    Recent publication by A. Spanos with above title:
    the paradox argues against the
    Bayesian and likelihood resolutions of the
    problem, for failing to account for the
    large sample size.
    the failure of all three main paradigms
    leads Spanos to advocate Mayo’s and
    Spanos’“postdata severity evaluation”
    [Spanos, 2013]


  57. “Who should be afraid of the Lindley–Jeffreys paradox?”
    Recent publication by A. Spanos with above title:
    “the postdata severity evaluation
    (...) addresses the key problem with
    Fisherian p-values in the sense that
    the severity evaluation provides the
    “magnitude” of the warranted
    discrepancy from the null by taking
    into account the generic capacity of
    the test (that includes n) in question
    as it relates to the observed
    data”(p.88)
    [Spanos, 2013]


  58. On some resolutions of the second version
    use of pseudo-Bayes factors, fractional Bayes factors, &tc,
    which lack a proper Bayesian justification
    [Berger & Pericchi, 2001]
    use of identical improper priors on nuisance parameters, a
    notion already entertained by Jeffreys
    [Berger et al., 1998; Marin & Robert, 2013]
    use of the posterior predictive distribution, which uses the
    data twice (see also Aitkin’s (2010) integrated likelihood)
    [Gelman, Rousseau & Robert, 2013]
    use of score functions extending the log score function
    log B12(x) = log m1(x) − log m2(x) = S0(x, m1) − S0(x, m2) ,
    that are independent of the normalising constant
    [Dawid et al., 2013]


  59. Bayesian computing (R)evolution
    Bayes, Thomas (1702–1761)
    Jeffreys, Harold (1891–1989)
    Lindley, Dennis (1923– )
    Besag, Julian (1945–2010)
    Besag’s early contributions
    MCMC revolution and beyond
    de Finetti, Bruno (1906–1985)


  60. computational jam
    In the 1970’s and early 1980’s, theoretical foundations of Bayesian
    statistics were sound, but methodology was lagging for lack of
    computing tools.
    restriction to conjugate priors
    limited complexity of models
    small sample sizes
    The field was desperately in need of a new computing paradigm!
    [Robert & Casella, 2012]


  61. MCMC as in Markov Chain Monte Carlo
    Notion that i.i.d. simulation is definitely not necessary: all that
    matters is the ergodic theorem
    Realization that Markov chains could be used in a wide variety of
    situations only came to mainstream statisticians with Gelfand and
    Smith (1990) despite earlier publications in the statistical literature
    like Hastings (1970) and growing awareness in spatial statistics
    (Besag, 1986)
    Reasons:
    lack of computing machinery
    lack of background on Markov chains
    lack of trust in the practicality of the method


  62. Who was Julian Besag?
    British statistician known chiefly for his
    work in spatial statistics (including its
    applications to epidemiology, image
    analysis and agricultural science), and
    Bayesian inference (including Markov
    chain Monte Carlo algorithms).
    Lecturer in Liverpool and Durham, then
    professor in Durham and Seattle.
    [Wikipedia]


  63. pre-Gibbs/pre-Hastings era
    Early 1970’s, Hammersley, Clifford, and Besag were working on the
    specification of joint distributions from conditional distributions
    and on necessary and sufficient conditions for the conditional
    distributions to be compatible with a joint distribution.
    [Hammersley and Clifford, 1971]


  64. pre-Gibbs/pre-Hastings era
    Early 1970’s, Hammersley, Clifford, and Besag were working on the
    specification of joint distributions from conditional distributions
    and on necessary and sufficient conditions for the conditional
    distributions to be compatible with a joint distribution.
    “What is the most general form of the conditional
    probability functions that define a coherent joint
    function? And what will the joint look like?”
    [Besag, 1972]


  65. Hammersley-Clifford[-Besag] theorem
    Theorem (Hammersley-Clifford)
    The joint distribution of a vector associated with a dependence graph
    must be represented as a product of functions over the cliques of the
    graph, i.e., of functions depending only on the components
    indexed by the labels in the clique.
    [Cressie, 1993; Lauritzen, 1996]


  66. Hammersley-Clifford[-Besag] theorem
    Theorem (Hammersley-Clifford)
    A probability distribution P with positive and continuous density f
    satisfies the pairwise Markov property with respect to an
    undirected graph G if and only if it factorizes according to G, i.e.,
    (F) ≡ (G)
    [Cressie, 1993; Lauritzen, 1996]


  67. Hammersley-Clifford[-Besag] theorem
    Theorem (Hammersley-Clifford)
    Under the positivity condition, the joint distribution g satisfies
    g(y1, . . . , yp) ∝ ∏_{j=1}^{p} g_j(y_j | y_1, . . . , y_{j−1}, y′_{j+1}, . . . , y′_p) / g_j(y′_j | y_1, . . . , y_{j−1}, y′_{j+1}, . . . , y′_p)
    for every permutation on {1, 2, . . . , p} and every y′ ∈ Y.
    [Cressie, 1993; Lauritzen, 1996]


  68. To Gibbs or not to Gibbs?
    Julian Besag should certainly be credited, to a large extent, with the
    (re?)discovery of the Gibbs sampler.


  69. To Gibbs or not to Gibbs?
    Julian Besag should certainly be credited, to a large extent, with the
    (re?)discovery of the Gibbs sampler.
    “The simulation procedure is to consider the sites
    cyclically and, at each stage, to amend or leave unaltered
    the particular site value in question, according to a
    probability distribution whose elements depend upon the
    current value at neighboring sites (...) However, the
    technique is unlikely to be particularly helpful in many
    other than binary situations and the Markov chain itself
    has no practical interpretation.”
    [Besag, 1974]


  70. Clicking in
    After Peskun (1973), MCMC mostly dormant in mainstream
    statistical world for about 10 years, then several papers/books
    highlighted its usefulness in specific settings:
    Geman and Geman (1984)
    Besag (1986)
    Strauss (1986)
    Ripley (Stochastic Simulation, 1987)
    Tanner and Wong (1987)
    Younes (1988)


  71. Enters the Gibbs sampler
    Geman and Geman (1984), building on
    Metropolis et al. (1953), Hastings (1970), and
    Peskun (1973), constructed a Gibbs sampler
    for optimisation in a discrete image processing
    problem with a Gibbs random field without
    completion.
    Back to Metropolis et al., 1953: the Gibbs
    sampler is already in use therein and ergodicity
    is proven on the collection of global maxima
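For readers who have not met the Gibbs sampler before, here is a generic sketch of the idea (a bivariate normal toy target of my own choosing, not the discrete image-processing setting of Geman and Geman): each component is updated in turn from its full conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, n_iter=10_000):
    """Gibbs sampler for (X, Y) ~ N2(0, [[1, rho], [rho, 1]]):
    both full conditionals are N(rho * other, 1 - rho^2)."""
    x = y = 0.0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # draw X | Y = y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # draw Y | X = x
        draws[t] = x, y
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws.T)[0, 1])   # close to 0.8 once the chain has mixed
```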


  73. Besag (1986) integrates GS for SA...
    “...easy to construct the transition matrix Q, of a
    discrete time Markov chain, with state space Ω and limit
    distribution (4). Simulated annealing proceeds by
    running an associated time inhomogeneous Markov chain
    with transition matrices QT , where T is progressively
    decreased according to a prescribed “schedule” to a value
    close to zero.”
    [Besag, 1986]


  74. ...and links with Metropolis-Hastings...
    “There are various related methods of constructing a
    manageable QT (Hastings, 1970). Geman and Geman
    (1984) adopt the simplest, which they term the ”Gibbs
    sampler” (...) time reversibility, a common ingredient in
    this type of problem (see, for example, Besag, 1977a), is
    present at individual stages but not over complete cycles,
    though Peter Green has pointed out that it returns if QT
    is taken over a pair of cycles, the second of which visits
    pixels in reverse order”
    [Besag, 1986]


  75. The candidate’s formula
    Representation of the marginal likelihood as
    m(x) = π(θ) f(x|θ) / π(θ|x)
    or of the marginal predictive as
    pn(y′|y) = f(y′|θ) πn(θ|y) / πn+1(θ|y, y′)
    [Besag, 1989]
    Why candidate?
    “Equation (2) appeared without explanation in a Durham
    University undergraduate final examination script of
    1984. Regrettably, the student’s name is no longer
    known to me.”
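A small conjugate check of the candidate's formula (a sketch with a normal-normal model of my own choosing, where every term is available in closed form; the identity m(x) = π(θ)f(x|θ)/π(θ|x) holds for any value of θ):

```python
from scipy.stats import norm

x = 1.3        # observed value (arbitrary)
theta = 0.7    # any theta gives the same ratio

prior = norm(0, 1).pdf(theta)                        # pi(theta),   theta ~ N(0, 1)
likelihood = norm(theta, 1).pdf(x)                   # f(x|theta),  x | theta ~ N(theta, 1)
posterior = norm(x / 2, (1 / 2) ** 0.5).pdf(theta)   # pi(theta|x), theta | x ~ N(x/2, 1/2)

candidate = prior * likelihood / posterior           # m(x) via the candidate's formula
exact = norm(0, 2 ** 0.5).pdf(x)                     # marginal: x ~ N(0, 2)
print(candidate, exact)                              # identical up to rounding
```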


  77. Implications
    Newton and Raftery (1994) used this representation to derive
    the [infamous] harmonic mean approximation to the marginal
    likelihood
    Gelfand and Dey (1994)
    Geyer and Thompson (1995)
    Chib (1995)
    Marin and Robert (2010) and Robert and Wraith (2009)
    [Chen, Shao & Ibrahim, 2000]


  78. Implications
    Newton and Raftery (1994)
    Gelfand and Dey (1994) also relied on this formula for the
    same purpose in a more general perspective
    Geyer and Thompson (1995)
    Chib (1995)
    Marin and Robert (2010) and Robert and Wraith (2009)
    [Chen, Shao & Ibrahim, 2000]


  79. Implications
    Newton and Raftery (1994)
    Gelfand and Dey (1994)
    Geyer and Thompson (1995) derived MLEs by a Monte Carlo
    approximation to the normalising constant
    Chib (1995)
    Marin and Robert (2010) and Robert and Wraith (2009)
    [Chen, Shao & Ibrahim, 2000]


  80. Implications
    Newton and Raftery (1994)
    Gelfand and Dey (1994)
    Geyer and Thompson (1995)
    Chib (1995) uses this representation to build a MCMC
    approximation to the marginal likelihood
    Marin and Robert (2010) and Robert and Wraith (2009)
    [Chen, Shao & Ibrahim, 2000]


  81. Implications
    Newton and Raftery (1994)
    Gelfand and Dey (1994)
    Geyer and Thompson (1995)
    Chib (1995)
    Marin and Robert (2010) and Robert and Wraith (2009)
    corrected Newton and Raftery (1994) by restricting the
    importance function to an HPD region
    [Chen, Shao & Ibrahim, 2000]


  82. Removing the jam
    In the early 1990s, researchers found that Gibbs and then
    Metropolis–Hastings algorithms would crack almost any problem!
    Flood of papers followed applying MCMC:
    linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;
    Wang & al., 1993, 1994)
    generalized linear mixed models (Albert & Chib, 1993)
    mixture models (Tanner & Wong, 1987; Diebolt & Robert, 1990,
    1994; Escobar & West, 1993)
    changepoint analysis (Carlin & al., 1992)
    point processes (Grenander & Møller, 1994)
    &tc


  83. Removing the jam
    In the early 1990s, researchers found that Gibbs and then
    Metropolis–Hastings algorithms would crack almost any problem!
    Flood of papers followed applying MCMC:
    genomics (Stephens & Smith, 1993; Lawrence & al., 1993;
    Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,
    2000)
    ecology (George & Robert, 1992)
    variable selection in regression (George & McCulloch, 1993; Green,
    1995; Chen & al., 2000)
    spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
    longitudinal studies (Lange & al., 1992)
    &tc


  84. MCMC and beyond
    reversible jump MCMC, which considerably impacted Bayesian
    model choice (Green, 1995)
    adaptive MCMC algorithms (Haario & al., 1999; Roberts
    & Rosenthal, 2009)
    exact approximations to targets (Tanner & Wong, 1987;
    Beaumont, 2003; Andrieu & Roberts, 2009)
    particle filters with application to sequential statistics,
    state-space models, signal processing, &tc. (Gordon & al.,
    1993; Doucet & al., 2001; del Moral & al., 2006)


  85. MCMC and beyond beyond
    comp’al stats catching up with comp’al physics: free energy
    sampling (e.g., Wang–Landau), Hamiltonian Monte Carlo
    (Girolami & Calderhead, 2011)
    sequential Monte Carlo (SMC) for non-sequential problems
    (Chopin, 2002; Neal, 2001; Del Moral et al 2006)
    retrospective sampling
    intractability: EP – GIMH – PMCMC – SMC2 – INLA
    QMC[MC] (Owen, 2011)


  86. Particles
    Iterating/sequential importance sampling is about as old as Monte
    Carlo methods themselves!
    [Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]
    Found in the molecular simulation literature of the 50’s with
    self-avoiding random walks, and in signal processing
    [Marshall, 1965; Handschin and Mayne, 1969]
    Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
    et al. (1997) coined the term “particle filter”.


  88. pMC & pMCMC
    Recycling of past simulations legitimate to build better
    importance sampling functions as in population Monte Carlo
    [Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]
    synthesis by Andrieu, Doucet, and Holenstein (2010) using
    particles to build an evolving MCMC kernel p̂θ(y1:T) in state
    space models p(x1:T) p(y1:T |x1:T)
    importance sampling on discretely observed diffusions
    [Beskos et al., 2006; Fearnhead et al., 2008, 2010]


  89. towards ever more complexity
    Bayes, Thomas (1702–1761)
    Jeffreys, Harold (1891–1989)
    Lindley, Dennis (1923– )
    Besag, Julian (1945–2010)
    de Finetti, Bruno (1906–1985)
    de Finetti’s exchangeability theorem
    Bayesian nonparametrics
    Bayesian analysis in a Big Data era


  90. Who was Bruno de Finetti?
    “Italian probabilist, statistician and
    actuary, noted for the “operational
    subjective” conception of probability.
    The classic exposition of his distinctive
    theory is the 1937 “La prévision: ses
    lois logiques, ses sources subjectives,”
    which discussed probability founded on
    the coherence of betting odds and the
    consequences of exchangeability.”
    [Wikipedia]
    Chair in Financial Mathematics at Trieste University (1939) and
    Roma (1954) then in Calculus of Probabilities (1961). Most
    famous sentence:
    “Probability does not exist”


  92. Exchangeability
    Notion of exchangeable sequences:
    A random sequence (x1, . . . , xn, . . .) is exchangeable if for
    any n the distribution of (x1, . . . , xn) is equal to the
    distribution of any permutation of the sequence,
    (xσ(1), . . . , xσ(n))
    de Finetti’s theorem (1937):
    An exchangeable distribution is a mixture of iid
    distributions
    p(x1, . . . , xn) = ∫ ∏_{i=1}^{n} f(xi |G) dπ(G)
    where G can be infinite-dimensional
    Extension to Markov chains (Freedman, 1962; Diaconis
    & Freedman, 1980)


  95. Bayesian nonparametrics
    Based on de Finetti’s representation,
    use of priors on functional spaces (densities, regression, trees,
    partitions, clustering, &tc)
    production of Bayes estimates in those spaces
    convergence mileage may vary
    available efficient (MCMC) algorithms to conduct
    non-parametric inference
    [van der Vaart, 1998; Hjort et al., 2010; Müller & Rodriguez, 2013]


  96. Dirichlet processes
    One of the earliest examples of priors on distributions
    [Ferguson, 1973]
    stick-breaking construction of D(α0, G0):
    generate βk ∼ Be(1, α0)
    define π1 = β1 and πk = βk ∏_{j=1}^{k−1} (1 − βj)
    generate θk ∼ G0
    derive G = ∑_k πk δ_{θk} ∼ D(α0, G0)
    [Sethuraman, 1994]
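A truncated version of this stick-breaking construction (a sketch: the truncation level K and the choice G0 = N(0, 1) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_dp(alpha0, base_sampler, K=1_000):
    """Truncated draw of G ~ D(alpha0, G0) via Sethuraman's construction:
    beta_k ~ Be(1, alpha0), pi_k = beta_k * prod_{j<k}(1 - beta_j), theta_k ~ G0."""
    betas = rng.beta(1.0, alpha0, size=K)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    atoms = base_sampler(K)
    return weights, atoms      # G = sum_k weights[k] * delta_{atoms[k]}

weights, atoms = stick_breaking_dp(2.0, lambda k: rng.normal(size=k))
print(weights.sum())           # close to 1 for a large enough truncation K
```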


  97. Chinese restaurant process
    If we assume
    G ∼ D(α0, G0)
    θi ∼ G
    then the marginal distribution of (θ1, . . .) is a Chinese restaurant
    process (Pólya urn model), which is exchangeable. In particular,
    θi | θ1:i−1 ∼ (α0 G0 + ∑_{j=1}^{i−1} δ_{θj}) / (α0 + i − 1)
    Posterior distribution built by MCMC
    [Escobar and West, 1992]
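A direct simulation of this Pólya urn predictive (a sketch, again with G0 = N(0, 1) and an arbitrary α0): each new θi is a fresh draw from G0 with probability α0/(α0 + i − 1), and otherwise a copy of one of the previous θj, chosen uniformly.

```python
import numpy as np

rng = np.random.default_rng(2)

def chinese_restaurant_draws(alpha0, n, base_sampler):
    """Sample theta_1, ..., theta_n from the Polya urn marginal of D(alpha0, G0)."""
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < alpha0 / (alpha0 + i - 1):
            thetas.append(base_sampler())        # open a new "table": fresh draw from G0
        else:
            thetas.append(rng.choice(thetas))    # join an existing table (weighted by size)
    return np.array(thetas)

draws = chinese_restaurant_draws(alpha0=1.0, n=500, base_sampler=rng.normal)
print(len(np.unique(draws)))   # number of clusters, roughly alpha0 * log(n)
```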


  99. Many alternatives
    truncated Dirichlet processes
    Pitman Yor processes
    completely random measures
    normalized random measures with independent increments
    (NRMI)
    [Müller and Mitra, 2013]


  100. Theoretical advances
    posterior consistency: seminal work of Schwartz (1965) in the iid
    case and extension by Barron et al. (1999) for general
    consistency
    consistency rates: Ghosal & van der Vaart (2000) and Ghosal et
    al. (2008), with minimax (adaptive) Bayesian nonparametric
    estimators for nonparametric process mixtures (Gaussian,
    Beta) (Rousseau, 2008; Kruijer, Rousseau & van der Vaart,
    2010; Shen, Tokdar & Ghosal, 2013; Scricciolo, 2013)
    Bernstein-von Mises theorems: (Castillo, 2011; Rivoirard
    & Rousseau, 2012; Kleijn & Bickel, 2013; Castillo
    & Rousseau, 2013)
    recent extensions to semiparametric models


  101. Consistency and posterior concentration rates
    Posterior
    dπ(θ|Xn) = fθ(Xn) dπ(θ) / m(Xn) ,  m(Xn) = ∫_Θ fθ(Xn) dπ(θ)
    and posterior concentration: under Pθ0,
    Pπ[d(θ, θ0) ≤ ε |Xn] = 1 + op(1) ,  Pπ[d(θ, θ0) ≤ εn |Xn] = 1 + op(1)
    Given a fixed ε: consistency
    where d(θ, θ′) is a loss function, e.g. Hellinger, L1, L2, L∞


  102. Consistency and posterior concentration rates
    Posterior
    dπ(θ|Xn) = fθ(Xn) dπ(θ) / m(Xn) ,  m(Xn) = ∫_Θ fθ(Xn) dπ(θ)
    and posterior concentration: under Pθ0,
    Pπ[d(θ, θ0) ≤ ε |Xn] = 1 + op(1) ,  Pπ[d(θ, θ0) ≤ εn |Xn] = 1 + op(1)
    Setting εn ↓ 0: consistency rates
    where d(θ, θ′) is a loss function, e.g. Hellinger, L1, L2, L∞


  103. Bernstein–von Mises theorems
    Parameter of interest
    ψ = ψ(θ) ∈ R^d , d < +∞ , θ ∼ π
    (with dim(θ) = +∞)
    BVM:
    π[√n (ψ − ψ̂) ≤ z |Xn] = Φ(z/√V0) + op(1) under Pθ0 ,
    and √n (ψ̂ − ψ(θ0)) ≈ N(0, V0) under Pθ0
    [Doob, 1949; Le Cam, 1986; van der Vaart, 1998]


  104. New challenges
    Novel statistical issues that force a different Bayesian answer:
    very large datasets
    complex or unknown dependence structures, with maybe p ≫ n
    multiple and involved random effects
    missing data structures containing most of the information
    sequential structures involving most of the above


  105. New paradigm?
    “Surprisingly, the confident prediction of the previous
    generation that Bayesian methods would ultimately supplant
    frequentist methods has given way to a realization that Markov
    chain Monte Carlo (MCMC) may be too slow to handle
    modern data sets. Size matters because large data sets stress
    computer storage and processing power to the breaking point.
    The most successful compromises between Bayesian and
    frequentist methods now rely on penalization and
    optimization.”
    [Lange et al., ISR, 2013]


  106. New paradigm?
    Observe (Xi, Ri, Yi Ri) where
    Xi ∼ U([0, 1]^d), Ri |Xi ∼ B(π(Xi)) and Yi |Xi ∼ B(θ(Xi))
    (π(·) is known and θ(·) is unknown)
    Then any estimator of E[Y] that does not depend on π is
    inconsistent.
    c There is no genuine Bayesian answer producing a consistent
    estimator (without throwing away part of the data)
    [Robins & Wasserman, 2000, 2013]


  107. New paradigm?
    sad reality constraint that
    size does matter
    focus on much smaller
    dimensions and on sparse
    summaries
    many (fast if non-Bayesian)
    ways of producing those
    summaries
    Bayesian inference can kick
    in almost automatically at
    this stage


  108. Approximate Bayesian computation (ABC)
    Case of a well-defined statistical model where the likelihood
    function
    ℓ(θ|y) = f(y1, . . . , yn|θ)
    is out of reach!
    Empirical approximations to the original
    Bayesian inference problem
    Degrading the data precision down
    to a tolerance ε
    Replacing the likelihood with a
    non-parametric approximation
    Summarising/replacing the data
    with insufficient statistics


  112. ABC methodology
    Bayesian setting: target is π(θ)f (x|θ)
    When likelihood f (x|θ) not in closed form, likelihood-free rejection
    technique:
    Foundation
    For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps
    jointly simulating
    θ′ ∼ π(θ) ,  z ∼ f(z|θ′) ,
    until the auxiliary variable z is equal to the observed value, z = y,
    then the selected
    θ′ ∼ π(θ|y)
    [Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]


  115. ABC algorithm
    In most implementations, a degree of approximation:
    Algorithm 1 Likelihood-free rejection sampler
    for i = 1 to N do
    repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ′
    end for
    where η(y) defines a (not necessarily sufficient) statistic
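A runnable version of Algorithm 1 on a toy problem (a sketch: the normal-mean model, the prior, the summary statistic η = sample mean, and the tolerance ε are all illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)

def abc_rejection(y, prior_sampler, simulator, summary, eps, N=1_000):
    """Likelihood-free rejection sampler: keep theta' ~ prior whenever
    rho{eta(z), eta(y)} <= eps for z simulated from f(.|theta')."""
    eta_y = summary(y)
    accepted = []
    while len(accepted) < N:
        theta = prior_sampler()
        z = simulator(theta)
        if abs(summary(z) - eta_y) <= eps:        # rho taken as the absolute distance
            accepted.append(theta)
    return np.array(accepted)

y = rng.normal(1.5, 1.0, size=20)                 # toy data with true mean 1.5
post = abc_rejection(
    y,
    prior_sampler=lambda: rng.normal(0.0, 5.0),              # theta ~ N(0, 25)
    simulator=lambda th: rng.normal(th, 1.0, size=len(y)),   # z | theta
    summary=np.mean,
    eps=0.1,
)
print(post.mean(), post.std())   # close to the exact conjugate posterior for small eps
```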


  116. Comments
    role of the distance ρ paramount
    (because ε ≠ 0)
    scaling of the components of η(y) also
    capital
    ε matters little if “small enough”
    representative of the “curse of
    dimensionality”
    small is beautiful!, i.e. data as a
    whole may be weakly informative
    for ABC
    non-parametric method at core


  117. ABC simulation advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y (sketched below)...
[Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]
...or view the problem as conditional density estimation and develop techniques that allow for a larger ε
[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
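A sketch of the first idea, along the lines of the likelihood-free MCMC sampler of Marjoram et al. (2003): a random-walk proposal on θ replaces prior sampling, so the chain concentrates where simulated data fall near y. The Gaussian proposal, step size, and names are illustrative choices.

import numpy as np

def abc_mcmc(theta0, y_obs, model_sim, summary, distance, eps,
             prior_logpdf, n_iter, step=0.5, rng=None):
    # Propose theta* ~ N(theta, step^2), simulate z ~ f(.|theta*), and accept only
    # if rho(eta(z), eta(y)) <= eps, with a Metropolis ratio on the prior
    # (the proposal is symmetric, so it cancels).
    rng = np.random.default_rng() if rng is None else rng
    eta_y = summary(y_obs)
    chain = np.empty(n_iter)
    theta = theta0
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        z = model_sim(prop, rng)
        if distance(summary(z), eta_y) <= eps:
            log_ratio = prior_logpdf(prop) - prior_logpdf(theta)
            if np.log(rng.uniform()) < log_ratio:
                theta = prop
        chain[t] = theta                     # stay put on rejection
    return chain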

    View full-size slide

  121. ABC as an inference machine
    Starting point is summary statistic η(y), either chosen for
    computational realism or imposed by external constraints
    ABC can produce a distribution on the parameter of interest
    conditional on this summary statistic η(y)
    inference based on ABC may be consistent or not, so it needs
    to be validated on its own
    the choice of the tolerance level is dictated by both
    computational and convergence constraints

    View full-size slide

123. How Bayesian is aBc..?
    At best, ABC approximates π(θ|η(y)):
    approximation error unknown (w/o massive simulation)
    pragmatic or empirical Bayes (there is no other solution!)
    many calibration issues (tolerance, distance, statistics)
    the NP side should be incorporated into the whole Bayesian
    picture
    the approximation error should also be part of the Bayesian
    inference

    View full-size slide

  124. Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function
πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ ,
with Kε a kernel parameterised by the bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation y = ỹ + ξ, ξ ∼ Kε, and an acceptance probability of
Kε(y − z)/M
gives draws from the posterior distribution π(θ|y).
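A sketch of this noisy-ABC construction for a scalar summarised observation, taking Kε Gaussian so that the bound is M = Kε(0); the names and the Gaussian choice are assumptions made for illustration.

import numpy as np

def noisy_abc(y_tilde, prior_sim, model_sim, eps, N, rng=None):
    # Noisy ABC: jitter the observation, y = y_tilde + xi with xi ~ K_eps, then
    # accept each (theta, z) with probability K_eps(y - z) / M, where M = K_eps(0).
    rng = np.random.default_rng() if rng is None else rng
    y = y_tilde + eps * rng.normal()              # randomised observation
    kept = []
    while len(kept) < N:
        theta = prior_sim(rng)                    # theta ~ pi(.)
        z = model_sim(theta, rng)                 # scalar summary of simulated data
        accept_prob = np.exp(-0.5 * ((y - z) / eps) ** 2)   # K_eps(y - z) / K_eps(0)
        if rng.uniform() < accept_prob:
            kept.append(theta)
    return y, np.array(kept)                      # exact draws from pi(theta | y)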

    View full-size slide

  127. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
Loss of statistical information balanced against gain in data roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function, towards standardisation
borrowing tools from data analysis (LDA) and machine learning (see the sketch below)
[Estoup et al., ME, 2012]
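A sketch of the LDA idea of Estoup et al. (2012): raw summary statistics computed on simulations from the competing models are projected onto linear discriminant axes, and those projections serve as lower-dimensional summaries. The use of scikit-learn and the function names are assumptions for illustration only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_summaries(raw_summaries, model_labels, raw_observed):
    # Fit LDA on raw summaries simulated under each candidate model (one label per
    # simulated dataset) and return the discriminant projections for both the
    # simulations and the observed data.
    lda = LinearDiscriminantAnalysis()
    lda.fit(raw_summaries, model_labels)
    return lda.transform(raw_summaries), lda.transform(np.asarray(raw_observed).reshape(1, -1))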

    View full-size slide

  128. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
may be imposed for external/practical reasons
may gather several non-Bayesian point estimates
we can learn about efficient combination (see the sketch below)
distance can be provided by estimation techniques
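One way to learn an efficient combination is to regress, over a pilot set of simulations, the parameter on a collection of candidate point estimates and use the fitted linear combination as a single summary, in the spirit of semi-automatic ABC (Fearnhead & Prangle, 2012); this is a sketch with illustrative names, not a method prescribed by the slide.

import numpy as np

def learn_summary_combination(pilot_thetas, pilot_estimates):
    # Least-squares regression of theta on candidate point estimates
    # (e.g. moment, quantile-based, or pseudo-likelihood estimates);
    # the fitted coefficients define a single combined summary statistic.
    X = np.column_stack([np.ones(len(pilot_estimates)), pilot_estimates])
    beta, *_ = np.linalg.lstsq(X, pilot_thetas, rcond=None)
    return beta

def combined_summary(estimates, beta):
    # Apply the learned combination to the point estimates of a new dataset.
    return beta[0] + np.dot(estimates, beta[1:])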

    View full-size slide

  129. Which summary for model choice?
    ‘This is also why focus on model discrimination typically
    (...) proceeds by (...) accepting that the Bayes Factor
    that one obtains is only derived from the summary
    statistics and may in no way correspond to that of the
    full model.’
    [Scott Sisson, Jan. 31, 2011, xianblog]
    Depending on the choice of η(·), the Bayes factor based on this
    insufficient statistic,

B^η_12(y) = ∫ π1(θ1) f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2) f^η_2(η(y)|θ2) dθ2 ,
    is either consistent or inconsistent
    [Robert et al., PNAS, 2012]

    View full-size slide

  130. Which summary for model choice?
    Depending on the choice of η(·), the Bayes factor based on this
    insufficient statistic,

B^η_12(y) = ∫ π1(θ1) f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2) f^η_2(η(y)|θ2) dθ2 ,
    is either consistent or inconsistent
    [Robert et al., PNAS, 2012]
[Figure: two panels of boxplots comparing the Gauss and Laplace models, n = 100]

    View full-size slide

  131. Selecting proper summaries
    Consistency only depends on the range of
    µi (θ) = Ei [η(y)]
    under both models against the asymptotic mean µ0 of η(y)
    Theorem
If Pn belongs to one of the two models and if µ0 cannot be attained by the other one:
0 = min ( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 )
  < max ( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ) ,
then the Bayes factor B^η_12 is consistent
    [Marin et al., 2012]
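To connect the theorem with the Gauss–Laplace illustration above, a sketch of an ABC model-choice run where the summary is either the sample mean (whose limiting value is attainable under both models, hence an inconsistent Bayes factor) or the empirical median absolute deviation (whose limit differs between the two models). The priors, scales, and tolerance are illustrative choices, not those of the cited papers.

import numpy as np

def abc_model_choice(y_obs, summary, eps, n_sims, rng=None):
    # ABC model choice between M1: N(theta, 1) and M2: Laplace(theta, 1/sqrt(2)),
    # with theta ~ N(0, 2^2) under both models (illustrative prior).
    # Returns the estimated posterior probability of M1 given the chosen summary.
    rng = np.random.default_rng() if rng is None else rng
    n = len(y_obs)
    s_obs = summary(y_obs)
    kept = []
    for _ in range(n_sims):
        m = rng.integers(1, 3)                  # model index, 1 or 2
        theta = rng.normal(0.0, 2.0)
        if m == 1:
            z = rng.normal(theta, 1.0, size=n)
        else:
            z = rng.laplace(theta, 1.0 / np.sqrt(2.0), size=n)
        if abs(summary(z) - s_obs) <= eps:
            kept.append(m)
    kept = np.array(kept)
    return np.mean(kept == 1) if kept.size else np.nan

# Sample mean as summary: both models can match it, so the estimate does not settle
# at 0 or 1 as n grows; the median absolute deviation below separates the models.
mad = lambda x: np.median(np.abs(x - np.median(x)))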

    View full-size slide

  132. Selecting proper summaries
    Consistency only depends on the range of
    µi (θ) = Ei [η(y)]
    under both models against the asymptotic mean µ0 of η(y)
[Figure: boxplot panels comparing models M1 and M2 across several summary-statistic settings]
    [Marin et al., 2012]

    View full-size slide

  133. on some Bayesian open problems
    In 2011, Michael Jordan, then ISBA President, conducted a
    mini-survey on Bayesian open problems:
Nonparametrics and semiparametrics: assessing and validating priors on infinite-dimensional spaces with an infinite number of nuisance parameters
Priors: elicitation mechanisms and strategies to get the prior from the likelihood or even from the posterior distribution
Bayesian/frequentist relationships: how far should one reach for frequentist validation?
Computation and statistics: computational abilities should be part of the modelling, with some expressing doubts about INLA and ABC
Model selection and hypothesis testing: still unsettled opposition between model checking, model averaging and model selection
    [Jordan, ISBA Bulletin, March 2011]

    View full-size slide

  134. yet another Bayes 250
Meeting that will take place at Duke University, December 17:
    Stephen Fienberg, Carnegie-Mellon
    University
    Michael Jordan, University of
    California, Berkeley
    Christopher Sims, Princeton University
    Adrian Smith, University of London
    Stephen Stigler, University of Chicago
Sharon Bertsch McGrayne, author of
“The Theory That Would Not Die”

    View full-size slide