
# An overview of statistical inference

Guest lecture for LMU Munich.

June 22, 2023

## Transcript

1. an overview of
statistical inference
Dr. Mine Çetinkaya-Rundel
Duke University

2. slides at bit.ly/lmu-inference

3. hypothesis testing

4. ‣ Prediction of 2010 World Cup
winners:
‣ Presented with 2 clear plastic
boxes, each containing food and
marked with the flag of a team.
‣ Winner: Box which Paul opened
first to eat its contents.
‣ Accurately predicted the outcome
of 8 games!
example: Paul the octopus

5. Paul the Octopus predicted 8 World Cup games, and predicted
them all correctly.
Does this provide convincing evidence that Paul actually has
psychic powers, i.e. that he does better than just randomly
guessing?
example: Paul the octopus

6. null hypothesis
“There is nothing going on”
alternative hypothesis
“There is something going on”
two competing claims

7. In context of Paul’s predictions, which of the following does the
null hypothesis of “there is nothing going on” map to?
a. Paul does no better than random guessing.
b. Paul does better than random guessing.
c. Paul predicts all games correctly.
d. Paul predicts none of the games correctly.
e. Paul predicts 50% of the games correctly.
setting the null

8. In context of Paul’s predictions, which of the following does the
null hypothesis of “there is nothing going on” map to?
a. Paul does no better than random guessing.
b. Paul does better than random guessing.
c. Paul predicts all games correctly.
d. Paul predicts none of the games correctly.
e. Paul predicts 50% of the games correctly.
setting the null

9. null hypothesis
H0: Defendant is innocent
alternative hypothesis
HA: Defendant is guilty
collect data
present the evidence
“Could these data plausibly have
happened by chance if the null
hypothesis were true?”
judge the evidence
Fail to reject H0
yes
Reject H0
no
burden
of proof
Image source: http://en.wikipedia.org/wiki/File:Trial_by_Jury_Usher.jpg

10. Which of the following is not a component of the hypothesis
testing framework?
a. Start with a null hypothesis that represents the status quo
b. Set an alternative hypothesis that represents the research question, i.e.
what we’re testing for
c. Conduct a hypothesis test under the assumption that the alternative
hypothesis is true
d. If the test results suggest that the data do not provide convincing
evidence for the alternative hypothesis, stick with the null hypothesis
e. If the test results suggest that the data do provide convincing
evidence for the alternative hypothesis, then reject the null hypothesis
in favor of the alternative
hypothesis testing framework

11. a. Start with a null hypothesis that represents the status quo
b. Set an alternative hypothesis that represents the research question, i.e.
what we’re testing for
c. Conduct a hypothesis test under the assumption that
the alternative hypothesis is true
d. If the test results suggest that the data do not provide convincing
evidence for the alternative hypothesis, stick with the null hypothesis
e. If the test results suggest that the data do provide convincing
evidence for the alternative hypothesis, then reject the null hypothesis
in favor of the alternative
hypothesis testing framework
Which of the following is not a component of the hypothesis
testing framework?

12. Which of the following is the best set of hypotheses associated
with the following two claims: “Paul does no better than random
guessing” and “Paul does better than random guessing”?
a. H0: p = 0 ; HA: p > 0
b. H0: p = 1/8 ; HA: p > 1/8
c. H0: p < 0.5 ; HA: p = 0.5
d. H0: p = 0.5 ; HA: p > 0.5
e. H0: p = 0.5 ; HA: p = 1
hypothesis testing framework

13. a. H0: p = 0 ; HA: p > 0
b. H0: p = 1/8 ; HA: p > 1/8
c. H0: p < 0.5 ; HA: p = 0.5
d. H0: p = 0.5 ; HA: p > 0.5
e. H0: p = 0.5 ; HA: p = 1
hypothesis testing framework
Which of the following is the best set of hypotheses associated
with the following two claims: “Paul does no better than random
guessing” and “Paul does better than random guessing”?

14. null hypothesis
Paul does no better than
random guessing.
“There is nothing going on”
alternative hypothesis
Paul does better than random
guessing.
“There is something going on”
H0: p = 0.5 HA: p > 0.5
two competing claims

15. ‣ Use a fair coin, and label head as success (correct guess)
‣ One simulation: flip the coin 8 times and record the
proportion of heads
‣ Repeat the simulation many times, recording the
proportion of heads at each iteration
‣ Calculate the percentage of simulations where the
simulated proportion of heads is at least as extreme as
the observed proportion
Paul the Octopus predicted 8 World Cup games, and predicted them
all correctly. Does this provide convincing evidence that Paul actually
has psychic powers, i.e. that he does better than just randomly
guessing?
H0: p = 0.5
HA: p > 0.5
example: Paul the octopus
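
The coin-flipping scheme on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `simulate_null_proportions`, the number of simulations (10,000), and the seed are assumptions chosen for the example.

```python
import random

def simulate_null_proportions(n_games=8, n_sims=10_000, seed=1):
    """Simulate the proportion of correct guesses under H0: p = 0.5.

    Each simulation flips a fair coin n_games times (heads = correct guess).
    n_sims and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_sims):
        heads = sum(rng.random() < 0.5 for _ in range(n_games))
        proportions.append(heads / n_games)
    return proportions

observed = 8 / 8  # Paul predicted all 8 games correctly
null_props = simulate_null_proportions()

# Share of simulations at least as extreme as the observed proportion
p_value = sum(p >= observed for p in null_props) / len(null_props)
print(f"simulated p-value: {p_value:.4f}")
```

With more simulations the estimate settles closer to the value a fair coin would give in the long run; the seed only makes the run reproducible.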

16. simulation 1: H H H H H H H T → 7 / 8 = 0.875
simulation 2: T H H T H T T T → 3 / 8 = 0.375
simulation 3: T T H H H H T H → 5 / 8 = 0.625
…
simulation 10: T H T H H H H H → 6 / 8 = 0.75
[dot plot of simulated proportions of heads; axis from 0 to 1]
What proportion of simulations yielded a
proportion of success at least as extreme as Paul’s?
simulating Paul

17. Based on the probability that you just calculated, which of the
following is the best conclusion of this hypothesis test?
a. It is likely to predict 8 or more games correctly if randomly guessing, hence
the data suggest that Paul is doing no better than randomly guessing.
b. It is likely to predict 8 or more games correctly if randomly guessing, hence
the data suggest that Paul is doing better than randomly guessing.
c. It is very unlikely to predict 8 or more games correctly if randomly
guessing, hence the data suggest that Paul is doing no better than
randomly guessing.
d. It is very unlikely to predict 8 or more games correctly if randomly
guessing, hence the data suggest that Paul is doing better than randomly
guessing.
e. None of the above.
conclusion of the test

18. a. It is likely to predict 8 or more games correctly if randomly guessing, hence
the data suggest that Paul is doing no better than randomly guessing.
b. It is likely to predict 8 or more games correctly if randomly guessing, hence
the data suggest that Paul is doing better than randomly guessing.
c. It is very unlikely to predict 8 or more games correctly if randomly
guessing, hence the data suggest that Paul is doing no better than
randomly guessing.
d. It is very unlikely to predict 8 or more games correctly if
randomly guessing, hence the data suggest that Paul is
doing better than randomly guessing.
e. None of the above.
conclusion of the test
Based on the probability that you just calculated, which of the
following is the best conclusion of this hypothesis test?

19. ‣ Hypotheses:
‣ H0: p = 0.5 - Paul does no better than random guessing
‣ HA: p > 0.5 - Paul does better than random guessing
‣ Data: Paul predicted 8 out of 8 games correctly
‣ Results: Assuming H0 is true, the probability of obtaining results at least as extreme as
Paul’s is almost 0.
‣ Decision: Since this probability is low (lower than 5%), we reject H0 in favor of HA.
‣ This doesn’t mean we proved the alternative hypothesis, just that the data provide
convincing evidence for it.
making a decision
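
The "almost 0" probability can also be worked out exactly. As a sketch, assuming the 8 predictions are independent and each is correct with probability 0.5 under H0:

```latex
P(\text{8 correct out of 8} \mid p = 0.5) = 0.5^{8} = \frac{1}{256} \approx 0.0039
```

This is well below the 5% threshold used in the decision step.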

20. ‣ study considered sex roles, and only allowed for options of “male” and
“female.” We should note that the identities being considered are not gender
identities and that the study allowed only for a binary classification of sex.
‣ 48 male bank supervisors given the same personnel file and asked to judge
whether the person should be promoted
‣ identical files, except that half of them indicated the candidate identified as
male and the other half indicated the candidate identified as female
‣ files randomly assigned to managers
‣ 35 / 48 promoted
‣ are females unfairly discriminated against?
example: sex discrimination
“Are individuals who identify as female discriminated against in promotion
decisions made by their managers who identify as male?”

21. promotion decision by sex

| sex | promoted | not promoted | total |
|---|---|---|---|
| male | 21 | 3 | 24 |
| female | 14 | 10 | 24 |
| total | 35 | 13 | 48 |

% of males promoted = 21/24 ≈ 88%
% of females promoted = 14/24 ≈ 58%
example: sex discrimination

22. null hypothesis
promotion and gender are
independent, no gender
discrimination, observed
difference in proportions is
simply due to chance
“There is nothing going on”
alternative hypothesis
promotion and gender are
dependent, there is gender
discrimination, observed
difference in proportions is
not due to chance.
“There is something going on”
two competing claims

23. simulation scheme
1. face card: not promoted, non-face card: promoted
‣ set aside the jokers, consider aces as face cards
‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
[use a deck of playing cards to simulate this experiment]

24. Step 1:
Image source: http://www.jfitz.com/cards/

25. simulation scheme
1. face card: not promoted, non-face card: promoted
‣ set aside the jokers, consider aces as face cards
‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
2. shuffle the cards, deal into two groups of size 24, representing males and
females
[use a deck of playing cards to simulate this experiment]

26. Step 2:
Image source: http://www.jfitz.com/cards/

27. simulation scheme
1. face card: not promoted, non-face card: promoted
‣ set aside the jokers, consider aces as face cards
‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
2. shuffle the cards, deal into two groups of size 24, representing males and
females
3. count how many number cards are in each group (representing promoted
files)
4. calculate the proportion of promoted files in each group, take the difference
(male - female), and record this value
[use a deck of playing cards to simulate this experiment]

28. Steps 3&4:
Image source: http://www.jfitz.com/cards/

29. [plot: one point marked on an axis from −0.4 to 0.4]

30. simulation scheme
1. face card: not promoted, non-face card: promoted
‣ set aside the jokers, consider aces as face cards
‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
2. shuffle the cards, deal into two groups of size 24, representing males and
females
3. count how many number cards are in each group (representing promoted
files)
4. calculate the proportion of promoted files in each group, take the difference
(male - female), and record this value
5. repeat steps 2 - 4 many times
[use a deck of playing cards to simulate this experiment]
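
The card-shuffling scheme translates directly into a short randomization simulation. The sketch below is an illustration rather than the lecture's own code: `simulate_differences`, the 10,000 repetitions, and the seed are assumptions, and the "deck" is a list of 35 ones (promoted) and 13 zeros (not promoted).

```python
import random

def simulate_differences(n_sims=10_000, seed=1):
    """Shuffle 35 'promoted' and 13 'not promoted' files into two groups of 24.

    Returns the simulated differences in promotion rates (male - female),
    assuming promotion is independent of the sex shown on the file.
    n_sims and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    deck = [1] * 35 + [0] * 13  # 1 = promoted (number card), 0 = not promoted (face card)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(deck)                      # step 2: shuffle and deal
        male, female = deck[:24], deck[24:]
        diffs.append(sum(male) / 24 - sum(female) / 24)  # steps 3 and 4
    return diffs                               # step 5: repeated n_sims times

observed_diff = 21 / 24 - 14 / 24  # difference seen in the study
diffs = simulate_differences()

# Share of shuffles with a difference at least as large as the observed one
p_value = sum(d >= observed_diff - 1e-12 for d in diffs) / len(diffs)
print(f"observed difference: {observed_diff:.3f}, simulated p-value: {p_value:.3f}")
```

The small tolerance in the comparison only guards against floating-point rounding when a shuffle reproduces the observed split exactly.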

31. [dot plot of simulated differences in promotion rates; axis from −0.4 to 0.4]

32. ‣ Results from the simulations look like the data → the difference between the
proportions of promoted files between males and females was due to chance
(promotion and sex are independent)
‣ Results from the simulations do not look like the data → the difference
between the proportions of promoted files between males and females was
not due to chance, but due to an actual effect of gender (promotion and sex
are dependent)
making a decision

33. [dot plot of simulated differences in promotion rates; axis from −0.4 to 0.4]

34. ‣ set a null and an alternative hypothesis
‣ simulate the experiment assuming that the null hypothesis is true
‣ evaluate the probability of observing an outcome at least as extreme as the
one observed in the original data (the p-value)
‣ and if this probability is low, reject the null hypothesis in favor of the
alternative
summary

35. confidence intervals

36. A plausible range of values for the population parameter is called a
confidence interval.
Net: Photo by ozgurmulazimoglu on Flickr: http://www.fl…
Spearfishing: Photo by Chris Penny on Flickr: http://www.fl…
‣ If we report a point estimate, we probably won’t hit the exact
population parameter.
‣ If we report a range of plausible values we have a good shot at
capturing the parameter.

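37. Central Limit Theorem (CLT): $\bar{x}$ is approximately normally distributed around $\mu$
approximate 95% CI: $\bar{x} \pm 2\,SE$, where $2\,SE$ is the margin of error (ME)
[normal curve: 68% of values within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, 99.7% within $\mu \pm 3\sigma$]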

38. One of the earliest examples of behavioral asymmetry is a preference in
humans for turning the head to the right, rather than to the left, during
the final weeks of gestation and for the first 6 months after birth. This is
thought to influence subsequent development of perceptual and motor
preferences. A study of 124 couples found that 64.5% turned their heads
to the right when kissing. The standard error associated with this
estimate is roughly 4%. Which of the below is false?
(a) A higher sample size would yield a lower standard error. ✔︎
(b) The margin of error for a 95% CI for the percentage of
kissers who turn their heads to the right is roughly 8%. ✔︎
(c) The 95% CI for the percentage of kissers who turn their
heads to the right is roughly 64.5% ± 4%. ✗
(d) The 99.7% CI for the percentage of kissers who turn their
heads to the right is roughly 64.5% ± 12%. ✔︎
The Kiss: http://en.wikipedia.org/wiki/File:Gustav_Klimt_016.jpg
Study reference: Gunturkun, O. (2003) Adult persistence of head-turning asymmetry. Nature. Vol 421.
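
The numbers in this question can be checked with a short calculation. This sketch assumes the usual standard error formula for a sample proportion, sqrt(p(1 − p)/n), which the slide does not state explicitly; the variable names are illustrative.

```python
from math import sqrt

n = 124        # couples in the study
p_hat = 0.645  # proportion turning their heads to the right

# Assumed formula: standard error of a sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)

me_95 = 2 * se    # approximate 95% margin of error (about 2 SE)
me_997 = 3 * se   # approximate 99.7% margin of error (about 3 SE)

print(f"SE ≈ {se:.3f}")
print(f"95% CI ≈ {p_hat:.3f} ± {me_95:.3f}")
print(f"99.7% CI ≈ {p_hat:.3f} ± {me_997:.3f}")
```

The standard error comes out near 4%, so a 95% interval needs a margin of roughly twice that, which is why option (c) is the false statement.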

39. confidence level
‣ Suppose we took many samples and built a confidence interval from
each sample using the equation for an approximate 95% CI, $\bar{x} \pm 2\,SE$.
‣ Then about 95% of those intervals would contain the true population
mean (μ).
‣ Commonly used confidence levels in practice are 90%, 95%, 98%, and 99%.
[figure: 25 intervals from repeated samples, 24 of which contain µ = 94.52; 24 / 25 = 0.96]
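
The repeated-sampling reading of a confidence level can be checked by simulation. In the sketch below only µ = 94.52 comes from the slide; the normal population, σ = 10, the sample size of 50, the number of samples, and the name `coverage` are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage(mu=94.52, sigma=10.0, n=50, n_samples=2_000, seed=1):
    """Fraction of x_bar ± 2 SE intervals that contain the true mean mu.

    sigma, n, n_samples and seed are hypothetical values for this sketch.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = mean(sample)
        se = stdev(sample) / sqrt(n)
        if x_bar - 2 * se <= mu <= x_bar + 2 * se:
            hits += 1
    return hits / n_samples

rate = coverage()
print(f"share of intervals containing mu: {rate:.3f}")
```

The share should land close to 0.95, in the same spirit as the 24 out of 25 intervals shown on the slide.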

40. If we want to be very certain that we capture the population
parameter, should we use a wider interval or a narrower interval?
µ = 94.52

41. [standard normal curve; x-axis: standard deviations from the mean, −3 to 3]
95%, extends −1.96 to 1.96
99%, extends −2.58 to 2.58
CL ↑ width ↑

42. What drawbacks are associated with using a wider interval?
[weather forecast illustration] Low: -20F / -29C, High: 110F / 43C
CL ↑ width ↑ accuracy ↑ precision ↓
How can we get both higher precision and higher accuracy? Increase sample size.
Weather icon: Matthew Petroff, http://commons.wikimedia.org/wiki/File:Weather_Icons.png,
Creative Commons CC0 1.0 Universal Public Domain Dedication, http://creativecommons.org/about/cc0
43. The General Social Survey (GSS) is a sociological survey used to collect data on demographic
characteristics and attitudes of residents of the United States. In 2010, the survey collected responses
from 1,154 US residents. Based on the survey results, a 95% confidence interval for the average
number of hours Americans have to relax or pursue activities that they enjoy after an average work day
was found to be 3.53 to 3.83 hours. Determine if each of the following statements is true or false.
(a) 95% of Americans spend 3.53 to 3.83 hours relaxing after a work day. → F
(b) 95% of random samples of 1,154 Americans will yield confidence intervals that
contain the true average number of hours Americans spend relaxing after a work day. → T
(c) 95% of the time the true average number of hours Americans spend relaxing after a
work day is between 3.53 and 3.83 hours. → F
(d) We are 95% confident that Americans in this sample spend on average 3.53 to 3.83
hours relaxing after a work day. → F

44. The General Social Survey asks: “For how many days during the past 30 days was your mental health,
which includes stress, depression, and problems with emotions, not good?” Based on responses from
1,151 US residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.
Interpret this interval in context of the data.
We are 95% confident that Americans on average have 3.40 to 4.24 bad
mental health days per month.

45. The General Social Survey asks: “For how many days during the past 30 days was your mental health, which
includes stress, depression, and problems with emotions, not good?” Based on responses from 1,151 US
residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.
In this context, what does a 95% confidence level mean?
95% of random samples of 1,151 Americans will yield CIs that capture the
true population mean of number of bad mental health days per month.

46. The General Social Survey asks: “For how many days during the past 30 days was your mental health, which
includes stress, depression, and problems with emotions, not good?” Based on responses from 1,151 US
residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.
Suppose the researchers think a 99% confidence level would be more appropriate for this
interval. Will this new interval be narrower or wider than the 95% confidence interval?
Wider: as the confidence level increases, so does the width of the confidence interval.

47. A sample of 50 college students was asked how many exclusive relationships they’ve
been in so far. The students in the sample had an average of 3.2 exclusive
relationships, with a standard deviation of 1.74. In addition, the sample distribution
was only slightly skewed to the right. Estimate the true average number of exclusive
relationships based on this sample using a 95% confidence interval.
1. random sample & 50 < 10% of all college students
We can assume that the number of exclusive relationships
one student in the sample has been in is independent of another.
2. n > 30 & not so skewed sample
We can assume that the sampling distribution of average number of exclusive
relationships from samples of size 50 will be nearly normal.
n = 50
s = 1.74
$\bar{x}$ = 3.2

48. n = 50
s = 1.74
$\bar{x}$ = 3.2
$$SE = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} \approx 0.246$$
$$\bar{x} \pm z^{*}\,SE = 3.2 \pm 1.96 \times 0.246 = 3.2 \pm 0.48 = (2.72, 3.68)$$
We are 95% confident that college students on average have been in
2.72 to 3.68 exclusive relationships.
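
The same interval can be reproduced numerically. The sketch below uses only the summary statistics given on the slide; the variable names are illustrative, and the critical value is taken from the standard normal distribution as on the slide.

```python
from math import sqrt
from statistics import NormalDist

n = 50        # sample size
s = 1.74      # sample standard deviation
x_bar = 3.2   # sample mean

se = s / sqrt(n)                      # standard error of the mean
z = NormalDist().inv_cdf(0.975)       # z* for a 95% confidence level
margin = z * se                       # margin of error

lower, upper = x_bar - margin, x_bar + margin
print(f"SE ≈ {se:.3f}, ME ≈ {margin:.2f}")
print(f"95% CI ≈ ({lower:.2f}, {upper:.2f})")
```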

49. we just completed…
an overview of frequentist
statistical inference

50. a mini foray into
bayesian inference

51. $$P(E) = \lim_{n \to \infty} \frac{n_E}{n}$$
frequentist definition of probability

52. ‣ Indifferent between

‣ winning \$1 if event E occurs, or

‣ winning \$1 if you draw a blue chip from a box with 1,000 × p blue chips
+ 1,000 × (1-p) white chips

‣ Equating the probability of event E, P(E), to the probability of drawing a blue chip
from this box, p
P(E) = p
bayesian definition of probability

53. Example: Based on a 2022 Pew Research poll of 5,074 adults: “We
are 95% confident that 68% to 72% of Americans think inflation is the
biggest problem facing the country.”

‣ 95% of random samples of 5,074 adults will produce confidence
intervals that contain the true proportion of Americans who think inflation is the
biggest problem facing the country.

‣ Common misconceptions:

‣ There is a 95% chance that this confidence interval includes the
true population proportion.

‣ The true population proportion is in this interval 95% of the time.
Source: https://www.pewresearch.org/fact-tank/2022/05/12/by-a-wide-margin-americans-view-inflation-as-the-top-problem-facing-the-country-today/
confidence intervals

54. ‣ Allows us to describe the unknown true parameter not as a fixed
value but with a probability distribution

‣ This will let us construct something like a confidence interval, except
we can make probabilistic statements about the parameter falling
within that range.

‣ Example: “The posterior distribution yields a 95% credible interval
of 68% to 72% for the proportion of Americans who think inflation is
the biggest problem facing the country.”

‣ These are called credible intervals.
Source: http://www.pewsocialtrends.org/2016/02/04/most-americans-say-government-doesnt-do-enough-to-help-middle-class/
credible intervals
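
One way to see how a credible interval arises is a small Beta-Binomial sketch. Everything specific here is an assumption for illustration and not from the lecture: a uniform Beta(1, 1) prior, a simple random sample, and a hypothetical count of 3,552 "yes" answers (about 70% of 5,074). The interval is read off from Monte Carlo draws of the posterior.

```python
import random

def credible_interval(successes, failures, level=0.95, n_draws=20_000, seed=1):
    """Equal-tailed credible interval for a proportion.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + failures). The interval is estimated from
    sorted posterior draws; n_draws and seed are arbitrary choices.
    """
    rng = random.Random(seed)
    draws = sorted(
        rng.betavariate(1 + successes, 1 + failures) for _ in range(n_draws)
    )
    lo_idx = int((1 - level) / 2 * n_draws)
    hi_idx = int((1 + level) / 2 * n_draws) - 1
    return draws[lo_idx], draws[hi_idx]

# Hypothetical counts: roughly 70% of 5,074 respondents
successes = 3552
failures = 5074 - successes

lower, upper = credible_interval(successes, failures)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Under these simplified assumptions the interval is somewhat narrower than the 68% to 72% quoted in the example, since a real poll's margin of error also reflects its survey design. The point of the sketch is the interpretation: given the model and the data, the parameter lies in this range with 95% posterior probability.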

55. slides at bit.ly/lmu-inference
thank you!