2011:
David Dunson (Duke University) “Sparse Nonparametric Bayesian Learning from Big Data”
Chris Holmes (Oxford University) “Classification Models and Predictions for Ordered Data”
Luigi Spezia (Biomathematics & Statistics Scotland, Aberdeen) “Bayesian Variable Selection in Markov Mixture Models”
Darren Wilkinson (University of Newcastle) “Bayesian inference for partially observed Markov processes, with application to systems biology”
Jim Smith (University of Warwick) “Coherent Inference on Distributed Bayesian Expert Systems”
John Winn (Microsoft Research) “Probabilistic Programming”
David Spiegelhalter (University of Cambridge) “How To Gamble If You Must (courtesy of the Reverend Bayes)”
Peter Green (University of Bristol) “Inference and computing with decomposable graphs”
Zoubin Ghahramani (University of Cambridge) “Nonparametric Bayesian Models for Sparse Matrices and Covariances”
Neil Lawrence (University of Sheffield) “Latent Force Models”
Michael Goldstein (Durham University) “Does Bayes Theorem Work?”
Peggy Series (University of Edinburgh) “Bayesian Priors in the Brain”
Christian Robert (Université Paris-Dauphine) “Approximate Bayesian Computation for model selection”
Nicolas Chopin (CREST-ENSAE) “ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation”
Dr Andrew Fraser (Honorary Fellow, University of Edinburgh) “Bayes at Edinburgh University - a talk and tour”
Christophe Andrieu (University of Bristol) “Intractable likelihoods and exact approximate MCMC algorithms”
Simon Godsill (University of Cambridge) “Bayesian computational methods for intractable continuous-time non-Gaussian time series”
Yee Whye Teh (Gatsby Computational Neuroscience Unit, University College London) “Efficient MCMC for Continuous Time Discrete State Systems”
Carl Rasmussen (University of Cambridge) “Adaptive Control and Bayesian Inference”
Natalia Bochkina (University of Edinburgh) “Bernstein-von Mises theorem for irregular statistical models”

“An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price, in the Philosophical Transactions of the Royal Society of London.
250th anniversary of the Essay

Science uncovers the true title of the Essay: A Method of Calculating the Exact Probability of All Conclusions founded on Induction
Intended as a reply to Hume’s (1748) evaluation of the probability of miracles

“we may hope to determine the Propositions, and, by degrees, the whole Nature of unknown Causes, by a sufficient Observation of their effects” (D. Hartley)
In 1767, Richard Price used Bayes’ theorem as a tool to attack Hume’s argument, referring to the above title
Bayes’ offprints available at Yale’s Beinecke Library (but missing the title page) and at the Library Company of Philadelphia (Franklin’s library) [Stigler, 2013]

in London then at the University of Edinburgh (1719-1721), Presbyterian minister in Tunbridge Wells (Kent) from 1731, son of Joshua Bayes, nonconformist minister.
“Election to the Royal Society based on a tract of 1736 where he defended the views and philosophy of Newton. A notebook of his includes a method of finding the time and place of conjunction of two planets, notes on weights and measures, a method of differentiation, and logarithms.” [Wikipedia]

of length one, with a uniform probability of stopping anywhere: W stops at p.
Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W.

Bayes’ question: Given X, what inference can we make on p?

Bayes’ wording: “Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.”

Modern translation: Derive the posterior distribution of p given X, when p ∼ U([0, 1]) and X|p ∼ B(n, p)

$$P(a < p < b \mid X = x) = \frac{\int_a^b \binom{n}{x}\, p^x (1-p)^{n-x}\,dp}{\int_0^1 \binom{n}{x}\, p^x (1-p)^{n-x}\,dp} = \frac{\int_a^b p^x (1-p)^{n-x}\,dp}{B(x+1,\, n-x+1)},$$
i.e. $p \mid x \sim \mathcal{B}e(x+1,\, n-x+1)$
In Bayes’ words: “The same things supposed, I guess that the probability of the event M lies somewhere between 0 and the ratio of Ab to AB, my chance to be in the right is the ratio of Abm to AiB.”
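The posterior probability of an interval can be checked numerically. The sketch below is only an illustration under the uniform prior and binomial sampling stated above; the function names `beta_pdf` and `posterior_interval_prob` and the midpoint-rule integration are assumptions of this example, not part of the original slides.

```python
import math


def beta_pdf(p, alpha, beta):
    """Density of the Be(alpha, beta) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p))


def posterior_interval_prob(x, n, lo, hi, grid=2000):
    """P(lo < p < hi | X = x) when p ~ U(0, 1) and X | p ~ B(n, p).

    The posterior is Be(x + 1, n - x + 1); the integral is approximated
    with the midpoint rule on `grid` sub-intervals (hypothetical choice).
    """
    alpha, beta = x + 1, n - x + 1
    h = (hi - lo) / grid
    return h * sum(beta_pdf(lo + (k + 0.5) * h, alpha, beta) for k in range(grid))
```

For instance, `posterior_interval_prob(3, 10, 0.1, 0.6)` approximates the chance that p lies between 0.1 and 0.6 after 3 stops on the left of W out of 10 rolls.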

of determining the probability of causes from events, a subject that is new in many respects and that deserves all the more to be cultivated since it is mainly from this point of view that the science of chance can be useful in civil life.” [Mémoire sur la probabilité des causes par les événemens, 1774]

produced by a number n of different causes, the probabilities of the existence of these causes given the event are to each other as the probabilities of the event given these causes, and the probability of the existence of each of them is equal to the probability of the event given that cause, divided by the sum of all the probabilities of the event given each of these causes.” [Mémoire sur la probabilité des causes par les événemens, 1774]
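Laplace's principle above amounts to normalising the probabilities of the event under each cause. A minimal sketch, assuming the causes are equally likely a priori as in the statement; the function name `cause_probabilities` is hypothetical.

```python
def cause_probabilities(event_probs_given_cause):
    """Probability of each cause given the event, for equally likely causes.

    Each entry of `event_probs_given_cause` is the probability of the
    observed event under one cause; the output divides each entry by
    the sum over all causes, as in Laplace's statement.
    """
    total = sum(event_probs_given_cause)
    if total <= 0:
        raise ValueError("the event must have positive probability under some cause")
    return [prob / total for prob in event_probs_given_cause]
```

With two causes giving the event probabilities 0.2 and 0.6, the causes receive probabilities in the ratio 1 to 3.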

[...] directly the probability that the possibilities indicated by experiments already made lie within given limits, and he arrived at it in a subtle and very ingenious manner” [Essai philosophique sur les probabilités, 1810]

Statistical Society, June 19-20, 2013, on the current state of Bayesian statistics
G. Roberts (University of Warwick) “Bayes for differential equation models”
N. Best (Imperial College London) “Bayesian space-time models for environmental epidemiology”
D. Prangle (Lancaster University) “Approximate Bayesian Computation”
P. Dawid (University of Cambridge) “Putting Bayes to the Test”
M. Jordan (UC Berkeley) “Feature Allocations, Probability Functions, and Paintboxes”
I. Murray (University of Edinburgh) “Flexible models for density estimation”
M. Goldstein (Durham University) “Geometric Bayes”
C. Andrieu (University of Bristol) “Inference with noisy likelihoods”
A. Golightly (Newcastle University) “Auxiliary particle MCMC schemes for partially observed diffusion processes”
S. Richardson (MRC Biostatistics Unit) “Biostatistics and Bayes”
C. Yau (Imperial College London) “Understanding cancer through Bayesian approaches”
S. Walker (University of Kent) “The Misspecified Bayesian”
S. Wilson (Trinity College Dublin) “Linnaeus, Bayes and the number of species problem”
B. Calderhead (UCL) “Probabilistic Integration for Differential Equation Models”
P. Green (University of Bristol and UT Sydney) “Bayesian graphical model determination”

Probability (1921): “I do not believe that there is any direct and simple method by which we can make the transition from an observed numerical frequency to a numerical measure of probability.” [Robert, 2011, ISR]

Probability (1921): “Bayes’ enunciation is strictly correct and its method of arriving at it shows its true logical connection with more fundamental principles, whereas Laplace’s enunciation gives it the appearance of a new principle specially introduced for the solution of causal problems.” [Robert, 2011, ISR]

and astronomer. Knighted in 1953 and Gold Medal of the Royal Astronomical Society in 1937. Founder of modern British geophysics. Many of his contributions are summarised in his book The Earth. [Wikipedia]

(objective) Bayesian statistics
Theory of Probability (1939) begins with probability, refining the treatment in Scientific Inference (1937), and proceeds to cover a range of applications comparable to that in Fisher’s book. [Robert, Chopin & Rousseau, 2009, Stat. Science]

information on θ by extracting the information on θ contained in the observation x
The principle of inverse probability does correspond to ordinary processes of learning (I, §1.5)
Allows incorporation of imperfect information in the decision process
A probability number can be regarded as a generalization of the assertion sign (I, §1.51).

of the Likelihood Principle
...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0).
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5).
Provides a complete inferential scope

that the prior probability is ‘subjective’ (...) or refer to the vagueness of previous knowledge as an indication that the prior probability cannot be assessed (VIII, §8.0).
Long walk (from Laplace’s principle of insufficient reason) to a reference prior:
A prior probability used to express ignorance is merely the formal statement of ignorance (VIII, §8.1).

for the parameters to be proportional to $\|g_{ik}\|^{1/2}$ [$= |I(\theta)|^{1/2}$], it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the $\alpha_i$ would be equal to the total probability in the corresponding region of the $\alpha_i'$; in other words, it satisfies the rule that equivalent propositions have the same probability (III, §3.10)
Note: Jeffreys never mentions Fisher information in connection with $(g_{ik})$

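Fisher information matrix associated with the likelihood $\ell(\theta \mid x)$,
$$I(\theta) = \mathbb{E}_\theta\left[\frac{\partial \log \ell}{\partial \theta^{T}}\, \frac{\partial \log \ell}{\partial \theta}\right]$$
the reference prior distribution is $\pi^*(\theta) \propto |I(\theta)|^{1/2}$
Note: Jeffreys never mentions Fisher information in connection with $(g_{ik})$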
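As a worked illustration (not in the original slides), the reference prior can be derived for the binomial model $X \mid p \sim \mathcal{B}(n, p)$ used in Bayes' problem. The derivation below only assumes the definition of $I(\theta)$ given above, written with the log-likelihood.

```latex
\begin{align*}
\log \ell(p \mid x) &= x \log p + (n - x) \log(1 - p) + \text{const}, \\
\frac{\partial \log \ell}{\partial p} &= \frac{x}{p} - \frac{n - x}{1 - p} = \frac{x - np}{p(1 - p)}, \\
I(p) &= \mathbb{E}_p\left[\left(\frac{X - np}{p(1 - p)}\right)^2\right]
      = \frac{np(1 - p)}{p^2 (1 - p)^2} = \frac{n}{p(1 - p)}, \\
\pi^*(p) &\propto |I(p)|^{1/2} \propto p^{-1/2} (1 - p)^{-1/2},
\end{align*}
```

that is, a $\mathcal{B}e(1/2, 1/2)$ prior, to be compared with the uniform $\mathcal{B}e(1, 1)$ prior of Bayes' Essay.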

supposed to represent complete ignorance (Kass & Wasserman, 1996)
The prior probabilities needed to express ignorance of the value of a quantity to be estimated, where there is nothing to call special attention to a particular value are given by an invariance theory (Jeffreys, VIII, §8.6).
often endowed with or seeking frequency-based properties
Jeffreys also proposed another Jeffreys prior dedicated to testing (Bayarri & Garcia-Donato, 2007)

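and $\pi_1^N$, substitute by prior distributions $\pi_0$ and $\pi_1$ that solve the system of integral equations
$$\pi_0(\theta_0) = \int_{\mathcal{X}} \pi_0^N(\theta_0 \mid x)\, m_1(x)\,dx \quad\text{and}\quad \pi_1(\theta_1) = \int_{\mathcal{X}} \pi_1^N(\theta_1 \mid x)\, m_0(x)\,dx,$$
where x is an imaginary minimal training sample and $m_0$, $m_1$ are the marginals associated with $\pi_0$ and $\pi_1$ respectively
$$m_0(x) = \int f_0(x \mid \theta_0)\,\pi_0(d\theta_0), \qquad m_1(x) = \int f_1(x \mid \theta_1)\,\pi_1(d\theta_1)$$
[Perez & Berger, 2000]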

in both models are continuous, if the Markov chain with transition
$$Q(\theta_0' \mid \theta_0) = \int g(\theta_0, \theta_0', \theta_1, x, x')\,dx\,dx'\,d\theta_1$$
where
$$g(\theta_0, \theta_0', \theta_1, x, x') = \pi_0^N(\theta_0' \mid x)\, f_1(x \mid \theta_1)\, \pi_1^N(\theta_1 \mid x')\, f_0(x' \mid \theta_0),$$
is recurrent, then there exists a solution to the integral equations, unique up to a multiplicative constant. [Cano, Salmerón, & Robert, 2008, 2013]

Lindley, Dennis (1923– )
Lindley’s paradox
dual versions of the paradox
“Who should be afraid of the Lindley–Jeffreys paradox?”
Bayesian resolutions
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)

advocate of Bayesian statistics. Held positions at Cambridge, Aberystwyth, and UCL, retiring at the early age of 54 to become an itinerant scholar. Wrote four books and numerous papers on Bayesian statistics.
“Coherence is everything”

$t = \sqrt{n-1}\,\bar{x}/s$, $\nu = n - 1$,
$$K \sim \sqrt{\frac{\pi\nu}{2}}\left(1 + \frac{t^2}{\nu}\right)^{-\nu/2 + 1/2}.$$
(...) The variation of K with t is much more important than the variation with ν (Jeffreys, V, §5.2).
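The approximation above is easy to evaluate, which makes the remark about the variation in t concrete. A minimal sketch, assuming that `xbar` and `s` are the sample mean and the scale statistic entering t as written above; the function name `jeffreys_K` is hypothetical.

```python
import math


def jeffreys_K(xbar, s, n):
    """Jeffreys' approximate K for the point null of a zero normal mean.

    Implements K ~ sqrt(pi * nu / 2) * (1 + t^2 / nu)^(-nu/2 + 1/2),
    with t = sqrt(n - 1) * xbar / s and nu = n - 1 (requires n >= 2).
    """
    nu = n - 1
    t = math.sqrt(n - 1) * xbar / s
    return math.sqrt(math.pi * nu / 2) * (1 + t ** 2 / nu) ** (-nu / 2 + 0.5)
```

At t = 0 the value reduces to sqrt(πν/2), and K decreases as |t| grows for a fixed ν > 1.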

result (...) burdens proponents of the Bayesian practice”. [Lad, 2003]
official version, opposing frequentist and Bayesian assessments [Lindley, 1957]
intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if $\pi_1(\cdot \mid \sigma)$ depends on a scale parameter σ, it is often the case that
$$B_{01}(x) \xrightarrow{\;\sigma \to \infty\;} +\infty$$
for a given x, meaning H0 is always accepted [Robert, 1992, 2013]

one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space
one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1
one (f) could reject “a hypothesis that may be true (...) because it has not predicted observable results that have not occurred” (Jeffreys, VII, §7.2) while the other (b) conditions upon the observed value $x_{obs}$
one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the boundary probability of 1/2

as n increases is of limited interest: under H0, $t_n$ has a limiting N(0, 1) distribution, while, under H1, $t_n$ a.s. converges to ∞
behaviour that remains entirely compatible with the consistency of the Bayes factor, which a.s. converges either to 0 or ∞, depending on which hypothesis is true.
Consequent literature (e.g., Berger & Sellke, 1987) has since then shown how divergent those two approaches could be (to the point of being asymptotically incompatible). [Robert, 2013]

prior variance n times larger than the observation variance and when n goes to ∞, Bayes factor goes to ∞ no matter what the observation is
n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis”
when prior diffuseness under H1 increases, only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data
mass of the prior distribution in the vicinity of any fixed neighbourhood of the null hypothesis vanishes to zero under H1 [Robert, 2013]
Deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it
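The divergence of the Bayes factor can be seen on the standard normal-mean example. The sketch below is an illustration under assumptions that are not on the slide: a sample mean distributed as N(θ, σ²/n), the point null H0: θ = 0, and a N(0, τ²) prior on θ under H1. In that setting B01 = sqrt(1 + r) exp{−z² r / (2(1 + r))} with r = nτ²/σ² and z the usual standardised statistic. The function name `bayes_factor_01` is hypothetical.

```python
import math


def bayes_factor_01(z, n, tau2=1.0, sigma2=1.0):
    """Bayes factor of H0: theta = 0 against H1: theta ~ N(0, tau2).

    z is the standardised statistic sqrt(n) * xbar / sigma, and
    r = n * tau2 / sigma2 is the prior-to-sampling variance ratio.
    """
    r = n * tau2 / sigma2
    return math.sqrt(1 + r) * math.exp(-0.5 * z * z * r / (1 + r))


if __name__ == "__main__":
    # z = 1.96 is on the boundary of rejection at the 5% level,
    # yet the Bayes factor in favour of H0 keeps growing with n.
    for n in (10, 100, 10_000, 1_000_000):
        print(n, bayes_factor_01(1.96, n))
```

The same growth occurs for fixed n when τ² increases, since only the ratio r matters in this example.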

by A. Spanos with above title:
the paradox demonstrates against Bayesian and likelihood resolutions of the problem for failing to account for the large sample size.
the failure of all three main paradigms leads Spanos to advocate Mayo’s and Spanos’ “postdata severity evaluation” [Spanos, 2013]

by A. Spanos with above title: “the postdata severity evaluation (...) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data”(p.88) [Spanos, 2013]

factors, fractional Bayes factors, etc., which lacks proper Bayesian justification [Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013]
use of the posterior predictive distribution, which uses the data twice (see also Aitkin’s (2010) integrated likelihood) [Gelman, Rousseau & Robert, 2013]
use of score functions extending the log score function
$$\log B_{12}(x) = \log m_1(x) - \log m_2(x) = S_0(x, m_1) - S_0(x, m_2),$$
that are independent of the normalising constant [Dawid et al., 2013]

of Bayesian statistics were sound, but methodology was lagging for lack of computing tools.
restriction to conjugate priors
limited complexity of models
small sample sizes
The field was desperately in need of a new computing paradigm! [Robert & Casella, 2012]

simulation is definitely not necessary, all that matters is the ergodic theorem
Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986)
Reasons:
lack of computing machinery
lack of background on Markov chains
lack of trust in the practicality of the method

work in spatial statistics (including its applications to epidemiology, image analysis and agricultural science), and Bayesian inference (including Markov chain Monte Carlo algorithms). Lecturer in Liverpool and Durham, then professor in Durham and Seattle. [Wikipedia]

on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. [Hammersley and Clifford, 1971]

“What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?” [Besag, 1972]

a dependence graph must be represented as product of functions over the cliques of the graphs, i.e., of functions depending only on the components indexed by the labels in the clique. [Cressie, 1993; Lauritzen, 1996]

and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e., (F) ≡ (G) [Cressie, 1993; Lauritzen, 1996]

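distribution g satisfies
$$g(y_1, \ldots, y_p) \propto \prod_{j=1}^{p} \frac{g_{\ell_j}(y_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}{g_{\ell_j}(y'_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}$$
for every permutation $\ell$ on {1, 2, . . . , p} and every $y' \in \mathcal{Y}$. [Cressie, 1993; Lauritzen, 1996]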

be credited to a large extent of the (re?-)discovery of the Gibbs sampler. “The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.” [Besag, 1974]
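Besag's description of the procedure (visit the sites cyclically, resample each from a distribution depending on its neighbours) can be sketched in a few lines. The binary model chosen here, an Ising-type field with density proportional to exp(β × sum of products over neighbouring pairs) on a rectangular grid with free boundary, is an assumption of this example, as are the names `gibbs_sweep`, `grid` and `beta`.

```python
import math
import random


def gibbs_sweep(grid, beta, rng):
    """One cyclic pass over all sites of a +1/-1 grid, updated in place.

    Each site is redrawn from its full conditional given the current
    values at its (up to four) neighbouring sites; moderate values of
    beta are assumed to avoid overflow in math.exp.
    """
    rows, cols = len(grid), len(grid[0])
    for i in range(rows):
        for j in range(cols):
            # Sum of the current values at the neighbouring sites.
            s = 0
            if i > 0:
                s += grid[i - 1][j]
            if i < rows - 1:
                s += grid[i + 1][j]
            if j > 0:
                s += grid[i][j - 1]
            if j < cols - 1:
                s += grid[i][j + 1]
            # P(site = +1 | neighbours) = e^{beta s} / (e^{beta s} + e^{-beta s}).
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
            grid[i][j] = 1 if rng.random() < p_plus else -1
    return grid
```

Repeated calls with the same `grid` produce the Markov chain; a site may be amended or left unaltered at each visit, as in the quotation.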

statistical world for about 10 years, then several papers/books highlighted its usefulness in specific settings:
Geman and Geman (1984)
Besag (1986)
Strauss (1986)
Ripley (Stochastic Simulation, 1987)
Tanner and Wong (1987)
Younes (1988)

Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field without completion.
Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima

transition matrix Q, of a discrete time Markov chain, with state space Ω and limit distribution (4). Simulated annealing proceeds by running an associated time inhomogeneous Markov chain with transition matrices QT , where T is progressively decreased according to a prescribed “schedule” to a value close to zero.” [Besag, 1986]

constructing a manageable $Q_T$ (Hastings, 1970). Geman and Geman (1984) adopt the simplest, which they term the “Gibbs sampler” (...) time reversibility, a common ingredient in this type of problem (see, for example, Besag, 1977a), is present at individual stages but not over complete cycles, though Peter Green has pointed out that it returns if $Q_T$ is taken over a pair of cycles, the second of which visits pixels in reverse order” [Besag, 1986]

$$m(x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\pi(\theta \mid x)}$$
or of the marginal predictive as
$$p_n(y' \mid y) = \frac{f(y' \mid \theta)\, \pi_n(\theta \mid y)}{\pi_{n+1}(\theta \mid y, y')}$$
[Besag, 1989]
Why candidate? “Equation (2) appeared without explanation in a Durham University undergraduate final examination script of 1984. Regrettably, the student’s name is no longer known to me.”
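The first identity holds for every value of θ, which can be checked in a conjugate case. The sketch below assumes Bayes' own setting (uniform prior on θ, binomial likelihood, hence a Be(x + 1, n − x + 1) posterior), where the marginal is known to equal 1/(n + 1); the function names are hypothetical.

```python
import math


def binomial_pmf(x, n, theta):
    """f(x | theta) for the B(n, theta) model."""
    return math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)


def beta_pdf(theta, alpha, beta):
    """Density of the Be(alpha, beta) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(
        log_norm + (alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta)
    )


def marginal_by_candidate(x, n, theta):
    """m(x) = prior(theta) * f(x | theta) / posterior(theta | x), uniform prior."""
    prior = 1.0  # U(0, 1) density
    return prior * binomial_pmf(x, n, theta) / beta_pdf(theta, x + 1, n - x + 1)
```

Whatever θ is plugged in, `marginal_by_candidate(3, 10, theta)` returns 1/11 up to rounding error.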

Newton and Raftery (1994): the [infamous] harmonic mean approximation to the marginal likelihood
Gelfand and Dey (1994) relied on this formula for the same purpose in a more general perspective
Geyer and Thompson (1995) derived MLEs by a Monte Carlo approximation to the normalising constant
Chib (1995) uses this representation to build an MCMC approximation to the marginal likelihood
Marin and Robert (2010) and Robert and Wraith (2009) corrected Newton and Raftery (1994) by restricting the importance function to an HPD region
[Chen, Shao & Ibrahim, 2000]

physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)
sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al., 2006)
retrospective sampling
intractability: EP, GIMH, PMCMC, SMC², INLA
QMC[MC] (Owen, 2011)

Carlo methods themselves! [Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing [Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.

better importance sampling functions as in population Monte Carlo [Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]
synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel $\hat{p}_\theta(y_{1:T})$ in state space models $p(x_{1:T})\, p(y_{1:T} \mid x_{1:T})$
importance sampling on discretely observed diffusions [Beskos et al., 2006; Fearnhead et al., 2008, 2010]

Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
de Finetti’s exchangeability theorem
Bayesian nonparametrics
Bayesian analysis in a Big Data era

noted for the “operational subjective” conception of probability. The classic exposition of his distinctive theory is the 1937 “La prévision: ses lois logiques, ses sources subjectives,” which discussed probability founded on the coherence of betting odds and the consequences of exchangeability.” [Wikipedia]
Chair in Financial Mathematics at Trieste University (1939) and Roma (1954) then in Calculus of Probabilities (1961).
Most famous sentence: “Probability does not exist”

. . , xn, . . .) is exchangeable if for any n the distribution of (x1, . . . , xn) is equal to the distribution of any permutation of the sequence (xσ1, . . . , xσn)
de Finetti’s theorem (1937): An exchangeable distribution is a mixture of iid distributions
$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} f(x_i \mid G)\, d\pi(G)$$
where G can be infinite-dimensional
Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)

on functional spaces (densities, regression, trees, partitions, clustering, etc.)
production of Bayes estimates in those spaces
convergence mileage may vary
available efficient (MCMC) algorithms to conduct non-parametric inference [van der Vaart, 1998; Hjort et al., 2010; Müller & Rodriguez, 2013]

θi ∼ G then the marginal distribution of (θ1, . . .) is a Chinese restaurant process (Pólya urn model), which is exchangeable. In particular,
$$\theta_i \mid \theta_{1:i-1} \sim \alpha_0 G_0 + \sum_{j=1}^{i-1} \delta_{\theta_j}$$
(up to the normalising constant $\alpha_0 + i - 1$)
Posterior distribution built by MCMC [Escobar and West, 1992]
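The urn scheme above can be simulated directly: the i-th draw is a new value from G0 with probability α0/(α0 + i − 1), and otherwise repeats one of the previous values chosen uniformly. The sketch below tracks only the induced clustering (table labels), not the values drawn from G0; the function name `chinese_restaurant_process` and its return format are assumptions of this example.

```python
import random


def chinese_restaurant_process(n, alpha0, rng):
    """Simulate the table labels of n customers under a CRP with mass alpha0 > 0.

    Returns (labels, counts): labels[i] is the table of customer i and
    counts[k] is the number of customers seated at table k.
    """
    labels, counts = [], []
    for i in range(n):
        # i customers are already seated; total weight is alpha0 + i.
        u = rng.random() * (alpha0 + i)
        if u < alpha0:
            # New table, i.e. a fresh draw from G0.
            counts.append(1)
            labels.append(len(counts) - 1)
            continue
        u -= alpha0
        for k, c in enumerate(counts):
            if u < c:
                counts[k] += 1
                labels.append(k)
                break
            u -= c
        else:
            # Guard against floating-point round-off at the upper edge.
            counts[-1] += 1
            labels.append(len(counts) - 1)
    return labels, counts
```

Larger values of `alpha0` yield more occupied tables on average, for the same number of customers.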

iid case and extension of Barron et al. (1999) for general consistency
consistency rates: Ghosal & van der Vaart (2000), Ghosal et al. (2008)
with minimax (adaptive) Bayesian nonparametric estimators for nonparametric process mixtures (Gaussian, Beta) (Rousseau, 2008; Kruijer, Rousseau & van der Vaart, 2010; Shen, Tokdar & Ghosal, 2013; Scricciolo, 2013)
Bernstein-von Mises theorems: (Castillo, 2011; Rivoirard & Rousseau, 2012; Kleijn & Bickel, 2013; Castillo & Rousseau, 2013)
recent extensions to semiparametric models

$m(X^n) = \int_\Theta f_\theta(X^n)\, d\pi(\theta)$ and posterior concentration: under $P_{\theta_0}$,
$$P^\pi\left[d(\theta, \theta_0) \le \epsilon \mid X^n\right] = 1 + o_p(1), \qquad P^\pi\left[d(\theta, \theta_0) \le \epsilon_n \mid X^n\right] = 1 + o_p(1)$$
Given $\epsilon$: consistency
Setting $\epsilon_n \downarrow 0$: consistency rates
where $d(\theta, \theta')$ is a loss function, e.g. Hellinger, L1, L2, L∞

answer:
very large datasets
complex or unknown dependence structures, with maybe p ≫ n
multiple and involved random effects
missing data structures containing most of the information
sequential structures involving most of the above

that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.” [Lange et al., ISR, 2013]

where $X_i \sim U(0, 1)^d$, $R_i \mid X_i \sim \mathcal{B}(\pi(X_i))$ and $Y_i \mid X_i \sim \mathcal{B}(\theta(X_i))$ (π(·) is known and θ(·) is unknown)
Then any estimator of E[Y] that does not depend on π is inconsistent.
There is no genuine Bayesian answer producing a consistent estimator (without throwing away part of the data) [Robins & Wasserman, 2000, 2013]

on much smaller dimensions and on sparse summaries many (fast if non-Bayesian) ways of producing those summaries Bayesian inference can kick in almost automatically at this stage

where the likelihood function ℓ(θ|y) = f(y1, . . . , yn|θ) is out of reach!
Empirical approximations to the original Bayesian inference problem
Degrading the data precision down to a tolerance ε
Replacing the likelihood with a non-parametric approximation
Summarising/replacing the data with insufficient statistics

f(x|θ) not in closed form, likelihood-free rejection technique:
Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating θ' ∼ π(θ), z ∼ f(z|θ'), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ' ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
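The exact-match rejection principle can be sketched on a toy discrete model, where the event z = y has positive probability. The Uniform(0, 1) prior, the binomial sampling model and the function name `exact_rejection_draw` are illustrative assumptions, not taken from the slides; under them the posterior is Beta(8, 4), which gives a value to compare against.

```python
import random

def exact_rejection_draw(y_obs, n_trials, rng):
    """Return one draw from pi(theta | y_obs) by simulating until z == y_obs."""
    while True:
        theta = rng.random()  # theta' ~ Uniform(0, 1): assumed prior
        # z ~ Binomial(n_trials, theta'): assumed sampling model
        z = sum(rng.random() < theta for _ in range(n_trials))
        if z == y_obs:  # keep theta' only on an exact match
            return theta

rng = random.Random(0)
draws = [exact_rejection_draw(y_obs=7, n_trials=10, rng=rng) for _ in range(5000)]
posterior_mean = sum(draws) / len(draws)
exact_mean = 8 / 12  # mean of the Beta(8, 4) posterior under the uniform prior
print(posterior_mean, exact_mean)
```

No likelihood value is ever evaluated; only simulation from the prior and from the model is required. The price is an acceptance rate equal to the marginal probability of the observed value, which collapses for continuous or high-dimensional data.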

Likelihood-free rejection sampler
for i = 1 to N do
    repeat
        generate θ' from the prior distribution π(·)
        generate z from the likelihood f(·|θ')
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ'
end for
where η(y) defines a (not necessarily sufficient) statistic
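A minimal Python sketch of this sampler follows, under assumptions that are not in the slides: unit-variance Gaussian data with unknown mean θ, a N(0, prior_sd²) prior, the sample mean as η, and the absolute difference as ρ. The name `abc_rejection` and all numerical settings are hypothetical.

```python
import random
import statistics

def abc_rejection(y, n_accept, eps, prior_sd, rng):
    """Likelihood-free rejection sampler with summary eta = sample mean."""
    eta_y = statistics.fmean(y)  # eta(y), computed once
    accepted = []
    for _ in range(n_accept):
        while True:
            theta = rng.gauss(0.0, prior_sd)         # theta' ~ pi(.)
            z = [rng.gauss(theta, 1.0) for _ in y]   # z ~ f(.|theta')
            if abs(statistics.fmean(z) - eta_y) <= eps:  # rho{eta(z), eta(y)} <= eps
                break
        accepted.append(theta)
    return accepted

rng = random.Random(1)
y = [rng.gauss(1.0, 1.0) for _ in range(20)]  # synthetic "observed" data
sample = abc_rejection(y, n_accept=1000, eps=0.1, prior_sd=2.0, rng=rng)
abc_mean = statistics.fmean(sample)
# Conjugate posterior mean for comparison (prior N(0, 2^2), unit-variance data)
exact_mean = len(y) * statistics.fmean(y) / (len(y) + 1 / 2.0**2)
print(abc_mean, exact_mean)
```

In this toy case the sample mean happens to be sufficient, so the only approximation comes from the tolerance; with an insufficient η the output would target the posterior given η(y) instead.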

components of η(y) also capital
ε matters little if “small enough”
representative of “curse of dimensionality”
small is beautiful!, i.e. data as a whole may be weakly informative for ABC
non-parametric method at core

in efficiency
Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]
...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger ε [Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
...or even by including ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
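As an illustration of the second route (conditional density estimation with a larger ε), here is a simplified sketch in the spirit of the local-linear regression adjustment of Beaumont et al. (2002): accept within a wide tolerance, weight by an Epanechnikov kernel, then shift each accepted θ along the fitted regression line towards the observed summary. The scalar summary, the Gaussian toy model, the value `s_obs = 1.0` and the name `regression_adjust` are assumptions made for this example only.

```python
import random
import statistics

def regression_adjust(thetas, summaries, s_obs, eps):
    """Keep draws with |s - s_obs| <= eps, then apply a weighted linear correction."""
    kept = [(t, s) for t, s in zip(thetas, summaries) if abs(s - s_obs) <= eps]
    w = [1.0 - ((s - s_obs) / eps) ** 2 for _, s in kept]  # Epanechnikov weights
    sw = sum(w)
    t_bar = sum(wi * t for wi, (t, _) in zip(w, kept)) / sw
    s_bar = sum(wi * s for wi, (_, s) in zip(w, kept)) / sw
    # Weighted least-squares slope of theta on the summary
    slope = (sum(wi * (s - s_bar) * (t - t_bar) for wi, (t, s) in zip(w, kept))
             / sum(wi * (s - s_bar) ** 2 for wi, (_, s) in zip(w, kept)))
    raw = [t for t, _ in kept]
    adjusted = [t - slope * (s - s_obs) for t, s in kept]
    return raw, adjusted

rng = random.Random(2)
n, prior_sd, s_obs = 20, 2.0, 1.0  # hypothetical sample size, prior scale, observed mean
thetas = [rng.gauss(0.0, prior_sd) for _ in range(20000)]
summaries = [statistics.fmean([rng.gauss(t, 1.0) for _ in range(n)]) for t in thetas]
raw, adjusted = regression_adjust(thetas, summaries, s_obs, eps=1.0)
exact_mean = n * s_obs / (n + 1 / prior_sd**2)  # conjugate posterior mean
print(statistics.pvariance(raw), statistics.pvariance(adjusted))
```

With a tolerance as large as 1.0 the raw accepted draws are much more dispersed than the posterior, while the adjusted draws recover most of the lost precision because the relation between θ and the summary is linear in this toy model. In non-linear problems the correction is only local and its quality has to be checked.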

η(y), either chosen for computational realism or imposed by external constraints
ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs to be validated on its own
the choice of the tolerance level is dictated by both computational and convergence constraints

error unknown (w/o massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian picture
the approximation error should also be part of the Bayesian inference

with exact simulation from a controlled approximation to the target, convolution of true posterior with kernel function
$$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)}{\int \pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)\,dz\,d\theta},$$
with $K_\epsilon$ kernel parameterised by bandwidth $\epsilon$. [Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation $y = \tilde{y} + \xi$, $\xi \sim K_\epsilon$, and an acceptance probability of $K_\epsilon(y - z)/M$ gives draws from the posterior distribution $\pi(\theta|y)$.
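The kernel acceptance step can be checked on a conjugate toy case where the convolved target is available in closed form. The assumptions here are mine, not the slides': prior θ ∼ N(0, 1), model z|θ ∼ N(θ, 1), a single observation y, and a Gaussian kernel with standard deviation ε, so that K(y − z)/M reduces to exp(−(y − z)²/(2ε²)). Under these choices the θ-marginal of the target is the posterior for y|θ ∼ N(θ, 1 + ε²).

```python
import math
import random
import statistics

def kernel_abc(y, eps, n_accept, rng):
    """ABC with acceptance probability K_eps(y - z) / M for a Gaussian kernel."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.gauss(0.0, 1.0)  # theta ~ N(0, 1): assumed prior
        z = rng.gauss(theta, 1.0)    # z | theta ~ N(theta, 1): assumed model
        # Gaussian kernel divided by its maximum M = K_eps(0)
        accept_prob = math.exp(-0.5 * ((y - z) / eps) ** 2)
        if rng.random() < accept_prob:
            accepted.append(theta)
    return accepted

rng = random.Random(3)
y, eps = 1.5, 0.5  # hypothetical observation and bandwidth
draws = kernel_abc(y, eps, n_accept=5000, rng=rng)
# Closed-form target: prior N(0, 1) combined with y | theta ~ N(theta, 1 + eps^2)
target_mean = y / (2 + eps**2)
target_var = (1 + eps**2) / (2 + eps**2)
print(statistics.fmean(draws), target_mean)
print(statistics.pvariance(draws), target_var)
```

The draws are exact for the kernel-smoothed model, which is the sense in which the approximation is controlled: the bandwidth acts as an explicit measurement-error scale rather than an unquantified tolerance.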

statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]
Loss of statistical information balanced against gain in data roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
borrowing tools from data analysis (LDA) machine learning [Estoup et al., ME, 2012]

statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]
may be imposed for external/practical reasons
may gather several non-B point estimates
we can learn about efficient combination
distance can be provided by estimation techniques

on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [Scott Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
$$B^\eta_{12}(y) = \frac{\int \pi_1(\theta_1)\, f^\eta_1(\eta(y)|\theta_1)\,d\theta_1}{\int \pi_2(\theta_2)\, f^\eta_2(\eta(y)|\theta_2)\,d\theta_2},$$
is either consistent or inconsistent [Robert et al., PNAS, 2012]
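To make the summary-based Bayes factor concrete, the sketch below estimates it by ABC model choice on a toy pair of models and compares it with the closed form. Everything specific is an assumption for illustration: model 1 is Poisson(λ) with λ ∼ Exp(1), model 2 is geometric on {0, 1, ...} with p ∼ U(0, 1), η is the sum of n observations, models have equal prior weights, and acceptance requires an exact match on η. Under these priors the marginal of the sum is n^s/(n+1)^(s+1) for model 1 and n/((n+s)(n+s+1)) for model 2.

```python
import math
import random

def poisson_draw(lam, rng):
    # Knuth's multiplication method; adequate for the small rates drawn here
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def geometric_draw(p, rng):
    # Inverse-cdf draw with pmf p * (1 - p)^k on k = 0, 1, 2, ...
    if p >= 1.0:
        return 0
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p))

def abc_bayes_factor(s_obs, n, n_sims, rng):
    hits = {1: 0, 2: 0}
    for _ in range(n_sims):
        model = rng.choice((1, 2))      # equal prior model weights
        if model == 1:
            lam = rng.expovariate(1.0)  # lambda ~ Exp(1)
            s = sum(poisson_draw(lam, rng) for _ in range(n))
        else:
            p = 1.0 - rng.random()      # p ~ U(0, 1]
            s = sum(geometric_draw(p, rng) for _ in range(n))
        if s == s_obs:                  # exact match on eta = sum
            hits[model] += 1
    return hits[1] / hits[2]

rng = random.Random(4)
n, s_obs = 5, 6  # hypothetical sample size and observed sum
bf_abc = abc_bayes_factor(s_obs, n, n_sims=100000, rng=rng)
bf_exact = (n**s_obs / (n + 1) ** (s_obs + 1)) / (n / ((n + s_obs) * (n + s_obs + 1)))
print(bf_abc, bf_exact)
```

The ratio of acceptance counts estimates the Bayes factor based on η only. Whether that quantity agrees with the Bayes factor based on the full data is a separate question, and it is the one the consistency discussion addresses.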

$\mu_i(\theta) = E_i[\eta(y)]$ under both models against the asymptotic mean $\mu_0$ of $\eta(y)$
Theorem
If $P^n$ belongs to one of the two models and if $\mu_0$ cannot be attained by the other one:
$$0 = \min\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right) < \max\left(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\right),$$
then the Bayes factor $B^\eta_{12}$ is consistent [Marin et al., 2012]

ISBA President, conducted a mini-survey on Bayesian open problems:
Nonparametrics and semiparametrics: assessing and validating priors on infinite dimension spaces with an infinite number of nuisance parameters
Priors: elicitation mechanisms and strategies to get the prior from the likelihood or even from the posterior distribution
Bayesian/frequentist relationships: how far should one reach for frequentist validation?
Computation and statistics: computational abilities should be part of the modelling, with some expressing doubts about INLA and ABC
Model selection and hypothesis testing: still unsettled opposition between model checking, model averaging and model selection
[Jordan, ISBA Bulletin, March 2011]

Duke University, December 17:
Stephen Fienberg, Carnegie-Mellon University
Michael Jordan, University of California, Berkeley
Christopher Sims, Princeton University
Adrian Smith, University of London
Stephen Stigler, University of Chicago
Sharon Bertsch McGrayne, author of “the theory that would not die”