Slide 1

Slide 1 text

Introduction to Bayesian Statistics Data Science Philipp Singer CC image courtesy of user mattbuck007 on Flickr Additional Jupyter notebook material: http://nbviewer.jupyter.org/github/psinger/notebooks/blob/master/bayesian_inference.ipynb

Slide 2

Slide 2 text

2 Conditional Probability

Slide 3

Slide 3 text

3 Conditional Probability ● Probability of event A given that B is true ● P(cough|cold) > P(cough) ● Fundamental in probability theory

Slide 4

Slide 4 text

4 Before we start with Bayes ... ● Another perspective on conditional probability ● Conditional probability via growing trimmed trees ● https://www.youtube.com/watch?v=Zxm4Xxvzohk

Slide 5

Slide 5 text

5 Bayes Theorem

Slide 6

Slide 6 text

6 Bayes Theorem ● P(A|B) is conditional probability of observing A given B is true ● P(B|A) is conditional probability of observing B given A is true ● P(A) and P(B) are probabilities of A and B without conditioning on each other
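
In standard notation, the theorem relating the four quantities listed on this slide reads as follows (the second line is the definition of conditional probability it follows from):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad\text{since}\qquad
P(A \mid B)\, P(B) = P(A \cap B) = P(B \mid A)\, P(A)
```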

Slide 7

Slide 7 text

7 Visualize Bayes Theorem Source: https://oscarbonilla.com/2009/05/visualizing-bayes-theorem/ All possible outcomes Some event

Slide 8

Slide 8 text

8 Visualize Bayes Theorem All people in study People having cancer

Slide 9

Slide 9 text

9 Visualize Bayes Theorem All people in study People where screening test is positive

Slide 10

Slide 10 text

10 Visualize Bayes Theorem People having positive screening test and cancer

Slide 11

Slide 11 text

11 Visualize Bayes Theorem ● Given the test is positive, what is the probability that said person has cancer?

Slide 12

Slide 12 text

12 Visualize Bayes Theorem ● Given the test is positive, what is the probability that said person has cancer?

Slide 13

Slide 13 text

13 Visualize Bayes Theorem ● Given that someone has cancer, what is the probability that said person had a positive test?
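
The set picture from the screening slides can be sketched numerically. The counts below are hypothetical and chosen only for illustration; they do not come from the slides or from any real study.

```python
# Hypothetical counts, for illustration only (not from the slides).
n_people = 1000    # all people in the study
n_cancer = 10      # people having cancer
n_positive = 100   # people whose screening test is positive
n_both = 8         # people with a positive test AND cancer

p_cancer = n_cancer / n_people
p_positive = n_positive / n_people

# "Given that someone has cancer, what is the probability of a positive test?"
p_positive_given_cancer = n_both / n_cancer

# "Given the test is positive, what is the probability of cancer?"
# Bayes' theorem: P(cancer | positive) = P(positive | cancer) * P(cancer) / P(positive)
p_cancer_given_positive = p_positive_given_cancer * p_cancer / p_positive

# The same number falls out directly from the overlap of the two sets.
assert abs(p_cancer_given_positive - n_both / n_positive) < 1e-12

print(p_positive_given_cancer, p_cancer_given_positive)
```

With these made-up counts the two conditional probabilities differ by a factor of ten, which is the point of the two questions on slides 11 to 13.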

Slide 14

Slide 14 text

14 Example: Fake coin ● Two coins – One fair – One unfair ● What is the probability of having the fair coin after flipping Heads? CC image courtesy of user pagedooley on Flickr
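
A minimal sketch of the fake coin question. The slide text does not say how the unfair coin behaves, so the code assumes it always lands Heads and that each coin is picked with probability 0.5; both values, and the function name, are assumptions for illustration.

```python
def posterior_fair(prior_fair, p_heads_fair=0.5, p_heads_unfair=1.0):
    """P(fair | Heads) via Bayes' theorem for a two-coin setup.

    p_heads_unfair=1.0 is an assumption (a coin with Heads on both sides).
    """
    # P(Heads) by the law of total probability over the two coins
    evidence = p_heads_fair * prior_fair + p_heads_unfair * (1 - prior_fair)
    return p_heads_fair * prior_fair / evidence


p_fair_after_heads = posterior_fair(prior_fair=0.5)
print(p_fair_after_heads)
```

Under these assumptions the belief in the fair coin drops from 1/2 to 1/3 after a single Heads.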

Slide 15

Slide 15 text

15 Example: Fake coin CC image courtesy of user pagedooley on Flickr

Slide 16

Slide 16 text

16 Example: Fake coin CC image courtesy of user pagedooley on Flickr

Slide 17

Slide 17 text

17 Update of beliefs ● Allows new evidence to update beliefs ● Prior can also be posterior of previous update

Slide 18

Slide 18 text

18 Example: Fake coin CC image courtesy of user pagedooley on Flickr ● Belief update ● What is the probability of having the fair coin after flipping Heads again, when we have already seen one Heads?
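
The belief update can be sketched as a loop in which each posterior becomes the prior of the next step. As an assumption for illustration, the unfair coin is taken to always land Heads and the initial prior is 0.5; neither value is stated on the slide.

```python
def update_on_heads(prior_fair, p_heads_fair=0.5, p_heads_unfair=1.0):
    """One Bayesian update of P(fair) after observing Heads.

    p_heads_unfair=1.0 is an assumed value, not given in the slides.
    """
    evidence = p_heads_fair * prior_fair + p_heads_unfair * (1 - prior_fair)
    return p_heads_fair * prior_fair / evidence


belief = 0.5      # assumed initial prior: both coins equally likely
history = []
for _ in range(2):                 # two Heads in a row
    belief = update_on_heads(belief)
    history.append(belief)         # posterior becomes the next prior

print(history)
```

Under these assumptions the belief in the fair coin goes from 1/2 to 1/3 after the first Heads and to 1/5 after the second.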

Slide 19

Slide 19 text

19 Bayesian Inference

Slide 20

Slide 20 text

20 Source: https://xkcd.com/1132/

Slide 21

Slide 21 text

21 Bayesian Inference ● Statistical inference of parameters Parameters Data Additional knowledge

Slide 22

Slide 22 text

22 Frequentist vs. Bayesian statistics
● Frequentist
– There is a true parameter that is fixed
– Data is random
– Repeated measurements (frequencies)
– Point estimates
● Bayesian
– True parameter drawn from probability distribution
– Data is fixed
– Degrees of certainty
– Probabilistic statements
Source: http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

Slide 23

Slide 23 text

23 Coin flip example ● Flip a coin several times ● Which model? ● How can we fit the model? ● Confidence vs. credible intervals ● Prediction ● Hypothesis testing Observed data: 80 Heads, 120 Tails

Slide 24

Slide 24 text

24 Binomial model: frequentist perspective ● Probability p of flipping heads ● Flipping tails: 1-p ● Binomial model
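
For reference, the standard binomial probability of observing k Heads in n independent flips, with p the probability of Heads, is (the symbols k and n are notation introduced here):

```latex
P(k \mid n, p) = \binom{n}{k}\, p^{k}\, (1 - p)^{\,n - k}
```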

Slide 25

Slide 25 text

25 Model fitting: frequentist approach ● Parameter is fixed, data is random ● Find estimate for parameter (point estimate) ● Maximum likelihood estimation (MLE) – Estimate parameter by maximizing likelihood – Covered in previous lectures
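
A small sketch of maximum likelihood estimation for the binomial model, using the 80 Heads / 120 Tails data from the later slides. The closed form k/n is the standard result; the grid search is only a sanity check, and the function name is illustrative.

```python
import math


def binomial_log_likelihood(p, k, n):
    """Log of the binomial probability of k Heads in n flips."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


k, n = 80, 200        # 80 Heads out of 200 flips

# Closed-form MLE for the binomial model
p_mle = k / n

# Sanity check: maximize the log-likelihood over a grid of candidate values
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: binomial_log_likelihood(p, k, n))

print(p_mle, p_grid)
```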

Slide 26

Slide 26 text

26 Binomial model: Bayesian perspective ● Probability p of flipping heads ● Flipping tails: 1-p ● Binomial model

Slide 27

Slide 27 text

27 Full Bayesian model Likelihood: Binomial distribution. Prior: Beta distribution. Posterior: Beta distribution. The posterior combines our prior belief about the parameters with the observed data. This allows us to make probabilistic statements about the parameters. Data is fixed, parameters are random!

Slide 28

Slide 28 text

28 Prior ● Prior belief about parameter(s) ● Conjugate prior – Posterior of same distribution as prior – Beta distribution conjugate to binomial ● Beta prior

Slide 29

Slide 29 text

29 Beta distribution ● Continuous probability distribution ● Interval [0,1] ● Two shape parameters: α and β – If >= 1, interpret as pseudo counts – α would refer to flipping heads
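
To get a feel for the two shape parameters, the Beta density can be evaluated directly. This is a minimal sketch using only the standard library; the parameter pairs in the loop are illustrative choices.

```python
import math


def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x, for 0 < x < 1."""
    # Normalizing constant 1 / B(alpha, beta), computed in log space
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x))


for alpha, beta in [(1, 1), (2, 2), (50, 10)]:
    densities = [round(beta_pdf(x, alpha, beta), 3) for x in (0.1, 0.5, 0.9)]
    print(f"Beta({alpha},{beta}) at x = 0.1, 0.5, 0.9: {densities}")
```

Beta(1,1) is flat on the interval, while larger pseudo counts concentrate the density around α / (α + β).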

Slide 30

Slide 30 text

30 Beta distribution

Slide 31

Slide 31 text

31 Beta distribution

Slide 32

Slide 32 text

32 Beta distribution

Slide 33

Slide 33 text

33 Beta distribution

Slide 34

Slide 34 text

34 Beta distribution

Slide 35

Slide 35 text

35 Model fitting: Bayesian approach ● Data fixed, parameter chosen from probability distribution ● “Learn” posterior ● Posterior also Beta distribution ● For exact derivation: http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf
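
Because the Beta prior is conjugate to the binomial likelihood, fitting reduces to adding the observed counts to the pseudo counts. A minimal sketch, with an illustrative function name:

```python
def update_beta(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing the given counts."""
    return alpha + heads, beta + tails


# All data at once
posterior_uniform = update_beta(1, 1, heads=80, tails=120)
posterior_biased = update_beta(50, 10, heads=80, tails=120)

# Stepwise: one flip at a time gives the same result
alpha, beta = 1, 1
for flip in ["H"] * 80 + ["T"] * 120:
    alpha, beta = update_beta(alpha, beta, heads=flip == "H", tails=flip == "T")

print(posterior_uniform, posterior_biased, (alpha, beta))
```

The order of the flips does not matter for the final posterior, only the counts do.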

Slide 36

Slide 36 text

36 Posterior ● Posterior – 80 Heads, 120 Tails – Uniform Beta prior: α=1 and β=1 – Update posterior (stepwise) This slide contains a video not working in the PDF version! The static output does not represent the final posterior which is α=81 and β=121.

Slide 37

Slide 37 text

37 Posterior ● Posterior – 80 Heads, 120 Tails – Biased Beta prior: α=50 and β=10 – Update posterior (stepwise) This slide contains a video not working in the PDF version! The static output does not represent the final posterior which is α=130 and β=130.

Slide 38

Slide 38 text

38 Posterior ● Convex combination of prior and data ● The stronger our prior belief, the more data we need to overrule the prior ● The less prior belief we have, the quicker the data overrules the prior
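
The convex combination can be checked on the posterior mean: it is a weighted average of the prior mean and the observed frequency, with the weight on the prior given by the share of pseudo counts. The sketch below uses the two priors from the previous slides; the function name is illustrative.

```python
def posterior_mean_as_mixture(alpha, beta, heads, tails):
    """Return (weight on prior, mixture value, direct posterior mean)."""
    n = heads + tails
    prior_mean = alpha / (alpha + beta)
    data_mean = heads / n                       # the MLE
    w = (alpha + beta) / (alpha + beta + n)     # weight of the prior
    mixture = w * prior_mean + (1 - w) * data_mean
    direct = (alpha + heads) / (alpha + beta + n)
    return w, mixture, direct


for alpha, beta in [(1, 1), (50, 10)]:
    w, mixture, direct = posterior_mean_as_mixture(alpha, beta, 80, 120)
    print(f"Beta({alpha},{beta}): prior weight {w:.3f}, posterior mean {mixture:.3f}")
```

The Beta(50,10) prior carries 60 pseudo counts against 200 observations, so it keeps far more weight than the 2 pseudo counts of Beta(1,1).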

Slide 39

Slide 39 text

39 Confidence vs. credible intervals
● Confidence interval (frequentist)
– There is a true (fixed) unknown population parameter h
– Derive confidence interval from sample
– Intervals constructed this way will contain h 95% of the time
– Again: parameter fixed, data random
● Credible interval (Bayesian)
– Parameter random, data fixed
– Probabilistic statements about parameter
– 95% credible interval: interval that contains 95% of the posterior probability mass

Slide 40

Slide 40 text

40 Confidence vs. credible intervals ● Confidence interval – Uncertainty about the interval we obtained ● Credible interval – Uncertainty about the parameter ● Sources: – http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/ – http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-interval-and-a-credible-interval – https://zenodo.org/record/16991 – http://freakonometrics.hypotheses.org/18117

Slide 41

Slide 41 text

41 Confidence vs. credible intervals ● Confidence interval – 95%: [0.33, 0.47] ● Credible interval – Directly from posterior – 95%: (0.33, 0.47)
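
Both intervals can be approximated with the standard library. The slides do not state which confidence interval construction was used, so the normal-approximation (Wald) interval below is an assumption, as is the equal-tailed credible interval estimated from samples of the Beta(81, 121) posterior.

```python
import math
import random

k, n = 80, 200
p_hat = k / n

# Assumed construction: Wald (normal approximation) 95% confidence interval
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
confidence_interval = (p_hat - half_width, p_hat + half_width)

# Equal-tailed 95% credible interval from posterior samples (uniform prior)
rng = random.Random(0)
samples = sorted(rng.betavariate(81, 121) for _ in range(100_000))
credible_interval = (samples[int(0.025 * len(samples))],
                     samples[int(0.975 * len(samples))])

print(confidence_interval, credible_interval)
```

Rounded to two decimals, both come out close to the (0.33, 0.47) shown on the slide, which is expected here because the prior is flat and the sample is fairly large.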

Slide 42

Slide 42 text

42 Confidence vs. credible intervals ● Confidence interval – 95%: [0.33, 0.47] ● Credible interval – Directly from posterior – 95%: (0.33, 0.47) Credible interval: given the observed data, there is a 95% probability that the true value of p falls within the credible interval. Confidence interval: there is a 95% probability that when I create confidence intervals of this sort, the CI will include the population parameter p.

Slide 43

Slide 43 text

43 Confidence vs. credible intervals ● Confidence interval – 95%: [0.33, 0.47] ● Credible interval – Directly from posterior – 95%: (0.33, 0.47) Confidence and credible intervals are not always equal! See: http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/ http://bayes.wustl.edu/etj/articles/confidence.pdf

Slide 44

Slide 44 text

44 So is the coin fair? Frequentist approach ● Null hypothesis test ● Binomial test with null p=0.5 – one-tailed – 0.0028 – → reject null hypothesis ● Alternative: Chi² test
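
A sketch of the one-tailed exact binomial test, computed as the probability of seeing 80 or fewer Heads in 200 flips under the null p = 0.5. That the slide's test was the lower tail on these counts is an assumption based on the data used elsewhere in the deck.

```python
import math


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Lower-tail p-value: probability of 80 or fewer Heads if the coin were fair
p_value = binomial_cdf(80, 200, 0.5)
reject_null = p_value < 0.05

print(p_value, reject_null)
```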

Slide 45

Slide 45 text

45 So is the coin fair? Bayesian approach ● Examine posterior – 95% highest density interval (HDI) of the posterior – ROPE [1]: Region of practical equivalence for null hypothesis – Fair coin: [0.45,0.55] ● 95% HDI: (0.33, 0.47) ● HDI overlaps the ROPE: cannot reject null ● With more samples we may be able to [1] Kruschke, John. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press, 2014.
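
The HDI and ROPE comparison can be sketched from posterior samples. The HDI is estimated here as the shortest interval containing 95% of the samples, which is one common sample-based approach and not necessarily how the slide's numbers were produced; the Beta(81, 121) posterior assumes the uniform prior.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples."""
    ordered = sorted(samples)
    n_inside = math.ceil(mass * len(ordered))
    starts = range(len(ordered) - n_inside + 1)
    # Slide a window of n_inside samples and keep the narrowest one
    best = min(starts, key=lambda i: ordered[i + n_inside - 1] - ordered[i])
    return ordered[best], ordered[best + n_inside - 1]


rng = random.Random(0)
posterior_samples = [rng.betavariate(81, 121) for _ in range(50_000)]
hdi_low, hdi_high = hdi(posterior_samples)

rope_low, rope_high = 0.45, 0.55           # fair-coin ROPE from the slide
overlaps_rope = hdi_low <= rope_high and hdi_high >= rope_low

print((hdi_low, hdi_high), overlaps_rope)
```

As long as the HDI and the ROPE overlap, the null value is not rejected under this decision rule.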

Slide 46

Slide 46 text

46 Bayesian Model Comparison ● Parameters marginalized out ● Average of likelihood weighted by prior Evidence
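
In standard notation, with D the data, M the model, and θ its parameters (symbols introduced here for illustration), the evidence or marginal likelihood is:

```latex
P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta
```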

Slide 47

Slide 47 text

47 Bayesian Model Comparison ● Bayes factors [1] ● Ratio of marginal likelihoods ● Interpretation table by Kass & Raftery [1] ● >100 → decisive evidence against M2 [1] Kass, Robert E., and Adrian E. Raftery. "Bayes factors." Journal of the American Statistical Association 90.430 (1995): 773-795.

Slide 48

Slide 48 text

48 So is the coin fair? ● Null hypothesis ● Alternative hypothesis – Anything is possible – Beta(1,1) ● Bayes factor

Slide 49

Slide 49 text

49 So is the coin fair? ● n = 200 ● k = 80 ● Bayes factor ● (Decent) preference for alt. hypothesis
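
A sketch of the Bayes factor for these numbers, comparing the alternative hypothesis with a Beta(1,1) prior against the point null p = 0.5. The marginal likelihood under a Beta prior uses the standard beta-binomial closed form; the function names are illustrative.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence_beta_prior(k, n, alpha, beta):
    """Log marginal likelihood of k Heads in n flips under a Beta(alpha, beta) prior."""
    return (math.log(math.comb(n, k))
            + log_beta_fn(k + alpha, n - k + beta)
            - log_beta_fn(alpha, beta))


def log_evidence_point(k, n, p):
    """Log likelihood of k Heads in n flips when p is fixed (point null)."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


n, k = 200, 80
log_bf = log_evidence_beta_prior(k, n, 1, 1) - log_evidence_point(k, n, 0.5)
bayes_factor_alt_vs_null = math.exp(log_bf)

print(bayes_factor_alt_vs_null)
```

A value above 1 favors the alternative hypothesis; how strongly can be read off an interpretation table such as the one by Kass & Raftery.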

Slide 50

Slide 50 text

50 Other priors ● Priors can encode hypotheses (theories) ● Biased hypothesis: Beta(101,11) ● Haldane prior: Beta(0.001, 0.001) – U-shaped – high probability on p=1 or (1-p)=1

Slide 51

Slide 51 text

51 Prediction: Frequentist approach ● Predict based on MLE ● Example: – Training: 80 Heads, 120 Tails – Test/Prediction: 2 Heads, 0 Tails
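
The frequentist prediction can be sketched by plugging the MLE from the training data into the likelihood of the test data:

```python
train_heads, train_tails = 80, 120
test_heads, test_tails = 2, 0

# Point estimate from the training data
p_mle = train_heads / (train_heads + train_tails)

# Probability of the test sequence under the fitted parameter
p_test = p_mle**test_heads * (1 - p_mle) ** test_tails

print(p_mle, p_test)
```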

Slide 52

Slide 52 text

52 Prediction: Bayesian approach ● Posterior mean ● If data large→converges to MLE ● MAP: Maximum a posteriori – Bayesian estimator – uses mode
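
For a Beta posterior both point estimates have standard closed forms. The sketch below assumes the Beta(81, 121) posterior obtained from the uniform prior; the mode formula holds when both parameters are greater than 1.

```python
alpha_post, beta_post = 81, 121     # posterior from Beta(1,1) prior, 80 H / 120 T

# Posterior mean of Beta(alpha, beta)
posterior_mean = alpha_post / (alpha_post + beta_post)

# MAP estimate: mode of Beta(alpha, beta), valid for alpha, beta > 1
posterior_mode = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(posterior_mean, posterior_mode)
```

With the uniform prior the MAP estimate coincides with the MLE of 0.4, and the posterior mean is only slightly pulled toward 0.5.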

Slide 53

Slide 53 text

53 Prediction: Bayesian approach ● Posterior predictive distribution ● Distribution of unobserved observations conditioned on observed data (train, test)
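
For the Beta-binomial model the posterior predictive has a closed form, because the parameter can be integrated out analytically. A minimal sketch for the test data of 2 Heads and 0 Tails, assuming the Beta(81, 121) posterior from the uniform prior; the function names are illustrative.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_predictive(heads, tails, alpha, beta):
    """Probability of a specific sequence with the given counts,
    averaged over a Beta(alpha, beta) distribution of p."""
    return math.exp(log_beta_fn(alpha + heads, beta + tails) - log_beta_fn(alpha, beta))


p_two_heads = posterior_predictive(heads=2, tails=0, alpha=81, beta=121)
print(p_two_heads)
```

The result is slightly larger than the plug-in value 0.4² = 0.16 because it averages p² over the whole posterior instead of using a single point estimate.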

Slide 54

Slide 54 text

54 Alternative Bayesian Inference
● Often marginal likelihood not easy to evaluate
– No analytical solution
– Numerical integration expensive
● Alternatives
– Monte Carlo integration
  ● Markov Chain Monte Carlo (MCMC)
  ● Gibbs sampling
  ● Metropolis-Hastings algorithm
– Laplace approximation
– Variational Bayes
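
As an illustration of sampling-based inference, here is a minimal random-walk Metropolis sketch (a special case of Metropolis-Hastings with a symmetric proposal) for the coin example. The coin posterior is available in closed form, so this is only a toy check; step size, chain length, burn-in, and the uniform prior are arbitrary choices made for illustration.

```python
import math
import random


def log_posterior(p, heads=80, tails=120):
    """Unnormalized log posterior under a uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return heads * math.log(p) + tails * math.log(1 - p)


def metropolis(n_samples, step=0.05, start=0.5, seed=0):
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)        # symmetric proposal
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


chain = metropolis(20_000)[2_000:]          # drop an assumed burn-in of 2000
mcmc_mean = sum(chain) / len(chain)
print(mcmc_mean)
```

The sampler only needs the posterior up to a constant, which is why it avoids evaluating the marginal likelihood.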

Slide 55

Slide 55 text

55 Generalized Linear Model ● Multiple linear regression ● Logistic regression ● Bayesian ANOVA

Slide 56

Slide 56 text

56 Bayesian Statistical Tests ● Alternatives to frequentist approaches ● Bayesian correlation ● Bayesian t-test

Slide 57

Slide 57 text

57 Resources ● Harvard Data Science Course Lectures 16/17 http://cs109.github.io/2015/pages/videos.html ● Doing Bayesian Data Analysis ● Bayesian Methods for Hackers https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers ● Google :)

Slide 58

Slide 58 text

58

Slide 59

Slide 59 text

59 Questions? Philipp Singer [email protected] Image credit: talk of Mike West: http://www2.stat.duke.edu/~mw/ABS04/Lecture_Slides/4.Stats_Regression.pdf