An overview of statistical inference

Guest lecture for LMU Munich.

Mine Cetinkaya-Rundel

June 22, 2023

Transcript

  1. an overview of
    statistical inference
    Dr. Mine Çetinkaya-Rundel
    Duke University

  2. slides at bit.ly/lmu-inference

  3. hypothesis testing

  4. ‣ Prediction of 2010 World Cup
    winners:
    ‣ Presented with 2 clear plastic
    boxes, each containing food and
    marked with the flag of a team.
    ‣ Winner: Box which Paul opened
    first to eat its contents.
    ‣ Accurately predicted the outcome
    of 8 games!
    https://www.youtube.com/watch?v=Ya85knuDzp8
    example: Paul the octopus

  5. Paul the Octopus predicted 8 World Cup games, and predicted
    them all correctly.
    Does this provide convincing evidence that Paul actually has
    psychic powers, i.e. that he does better than just randomly
    guessing?
    example: Paul the octopus

  6. null hypothesis
    “There is nothing going on”
    alternative hypothesis
    “There is something going on”
    two competing claims

  7. In the context of Paul’s predictions, which of the following does the
    null hypothesis of “there is nothing going on” map to?
    a. Paul does no better than random guessing.
    b. Paul does better than random guessing.
    c. Paul predicts all games correctly.
    d. Paul predicts none of the games correctly.
    e. Paul predicts 50% of the games correctly.
    setting the null

  8. In the context of Paul’s predictions, which of the following does the
    null hypothesis of “there is nothing going on” map to?
    a. Paul does no better than random guessing.
    b. Paul does better than random guessing.
    c. Paul predicts all games correctly.
    d. Paul predicts none of the games correctly.
    e. Paul predicts 50% of the games correctly.
    setting the null

  9. null hypothesis
    H0: Defendant is innocent
    alternative hypothesis
    HA: Defendant is guilty
    collect data
    present the evidence
    “Could these data plausibly have
    happened by chance if the null
    hypothesis were true?”
    judge the evidence
    Fail to reject H0
    yes
    Reject H0
    no
    burden
    of proof
    Image source: http://en.wikipedia.org/wiki/File:Trial_by_Jury_Usher.jpg

  10. Which of the following is not a component of the hypothesis
    testing framework?
    a. Start with a null hypothesis that represents the status quo
    b. Set an alternative hypothesis that represents the research question, i.e.
    what we’re testing for
    c. Conduct a hypothesis test under the assumption that the alternative
    hypothesis is true
    d. If the test results suggest that the data do not provide convincing
    evidence for the alternative hypothesis, stick with the null hypothesis
    e. If the test results suggest that the data do provide convincing
    evidence for the alternative hypothesis, then reject the null hypothesis
    in favor of the alternative
    hypothesis testing framework

  11. a. Start with a null hypothesis that represents the status quo
    b. Set an alternative hypothesis that represents the research question, i.e.
    what we’re testing for
    c. Conduct a hypothesis test under the assumption that
    the alternative hypothesis is true
    d. If the test results suggest that the data do not provide convincing
    evidence for the alternative hypothesis, stick with the null hypothesis
    e. If the test results suggest that the data do provide convincing
    evidence for the alternative hypothesis, then reject the null hypothesis
    in favor of the alternative
    hypothesis testing framework
    Which of the following is not a component of the hypothesis
    testing framework?

  12. Which of the following is the best set of hypotheses associated
    with the following two claims: “Paul does no better than random
    guessing” and “Paul does better than random guessing”?
    a. H0: p = 0 ; HA: p > 0
    b. H0: p = 1/8 ; HA: p > 1/8
    c. H0: p < 0.5 ; HA: p = 0.5
    d. H0: p = 0.5 ; HA: p > 0.5
    e. H0: p = 0.5 ; HA: p = 1
    hypothesis testing framework

  13. a. H0: p = 0 ; HA: p > 0
    b. H0: p = 1/8 ; HA: p > 1/8
    c. H0: p < 0.5 ; HA: p = 0.5
    d. H0: p = 0.5 ; HA: p > 0.5
    e. H0: p = 0.5 ; HA: p = 1
    hypothesis testing framework
    Which of the following is the best set of hypotheses associated
    with the following two claims: “Paul does no better than random
    guessing” and “Paul does better than random guessing”?

  14. null hypothesis
    Paul does no better than
    random guessing.
    “There is nothing going on”
    alternative hypothesis
    Paul does better than random
    guessing.
    “There is something going on”
    H0: p = 0.5 HA: p > 0.5
    two competing claims

  15. ‣ Use a fair coin, and label head as success (correct guess)
    ‣ One simulation: flip the coin 8 times and record the
    proportion of heads (correct guesses)
    ‣ Repeat the simulation many times, recording the
    proportion of heads at each iteration
    ‣ Calculate the percentage of simulations where the
    simulated proportion of heads is at least as extreme as
    the observed proportion
    Paul the Octopus predicted 8 World Cup games, and predicted them
    all correctly. Does this provide convincing evidence that Paul actually
    has psychic powers, i.e. that he does better than just randomly
    guessing?
    H0: p = 0.5
    HA: p > 0.5
    example: Paul the octopus

  16. simulation 1: H H H H H H H T → 7 / 8 = 0.875
    simulation 2: T H H T H T T T → 3 / 8 = 0.375
    simulation 3: T T H H H H T H → 5 / 8 = 0.625
    …
    simulation 10: T H T H H H H H → 6 / 8 = 0.75
    [Figure: axis for the simulated proportions of heads, from 0 to 1 with ticks at 0.25, 0.5, and 0.75]
    What proportion of simulations yielded a
    proportion of success at least as extreme as Paul’s?
    simulating Paul
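
The coin-flip scheme on these slides can be sketched in Python. This is a minimal illustration, assuming a fair coin stands in for random guessing; the function name `simulate_paul`, the seed, and the number of repetitions are illustrative choices that do not come from the slides.

```python
import random


def simulate_paul(n_games=8, n_sims=100_000, seed=2023):
    """Estimate P(proportion of heads >= Paul's 8/8) under H0: p = 0.5."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is repeatable
    observed = 1.0             # Paul's observed proportion: 8 / 8
    at_least_as_extreme = 0
    for _ in range(n_sims):
        # One simulation: flip a fair coin n_games times, heads = correct guess.
        heads = sum(rng.random() < 0.5 for _ in range(n_games))
        if heads / n_games >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims


p_value = simulate_paul()
exact = 0.5 ** 8  # exact probability of 8 heads in 8 fair flips
print(f"simulated: {p_value:.4f}, exact: {exact:.4f}")
```

The simulated share should land close to the exact value of 1/256, which is why only a handful of simulations out of many thousands are as extreme as Paul's record.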

  17. Based on the probability that you just calculated, which of the
    following is the best conclusion of this hypothesis test?
    a. It is likely to predict 8 or more games correctly if randomly guessing, hence
    the data suggest that Paul is doing no better than randomly guessing.
    b. It is likely to predict 8 or more games correctly if randomly guessing, hence
    the data suggest that Paul is doing better than randomly guessing.
    c. It is very unlikely to predict 8 or more games correctly if randomly
    guessing, hence the data suggest that Paul is doing no better than
    randomly guessing.
    d. It is very unlikely to predict 8 or more games correctly if randomly
    guessing, hence the data suggest that Paul is doing better than randomly
    guessing.
    e. None of the above.
    conclusion of the test

  18. a. It is likely to predict 8 or more games correctly if randomly guessing, hence
    the data suggest that Paul is doing no better than randomly guessing.
    b. It is likely to predict 8 or more games correctly if randomly guessing, hence
    the data suggest that Paul is doing better than randomly guessing.
    c. It is very unlikely to predict 8 or more games correctly if randomly
    guessing, hence the data suggest that Paul is doing no better than
    randomly guessing.
    d. It is very unlikely to predict 8 or more games correctly if
    randomly guessing, hence the data suggest that Paul is
    doing better than randomly guessing.
    e. None of the above.
    conclusion of the test
    Based on the probability that you just calculated, which of the
    following is the best conclusion of this hypothesis test?

  19. ‣ Hypotheses:
    ‣ H0: p = 0.5 - Paul does no better than random guessing
    ‣ HA: p > 0.5 - Paul does better than random guessing
    ‣ Data: Paul predicted 8 out of 8 games correctly
    ‣ Results: Assuming H0 is true, the probability of obtaining results at least as extreme as
    Paul’s is almost 0.
    ‣ Decision: Since this probability is low (lower than 5%), we reject H0 in favor of HA.
    ‣ This doesn’t mean we proved the alternative hypothesis, just that the data provide
    convincing evidence for it.
    making a decision
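
For reference, the probability in the Results step can also be worked out directly, assuming the 8 predictions are independent and each is correct with probability 0.5 under H0:

```latex
P(\text{8 correct out of 8} \mid p = 0.5) = 0.5^{8} = \frac{1}{256} \approx 0.0039 < 0.05
```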

  20. ‣ study considered sex roles, and only allowed for options of “male” and
    “female.” We should note that the identities being considered are not gender
    identities and that the study allowed only for a binary classification of sex.
    ‣ 48 male bank supervisors given the same personnel file, asked to judge
    whether the person should be promoted
    ‣ identical files, except that half of them indicated the candidate identified as
    male and the other half indicated the candidate identified as female
    ‣ files randomly assigned to managers
    ‣ 35 / 48 promoted
    ‣ are females unfairly discriminated against?
    example: sex discrimination
    “Are individuals who identify as female discriminated against in promotion
    decisions made by their managers who identify as male?”

  21. sex \ promotion   promoted   not promoted   total
    male                   21            3          24
    female                 14           10          24
    total                  35           13          48
    % of males promoted = 21/24 ≈ 88%
    % of females promoted = 14/24 ≈ 58%
    example: sex discrimination
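
The observed statistic that the later card simulation is compared against is the difference between these two proportions (male minus female). The hat notation is an assumption, since the slides do not name the statistic:

```latex
\hat{p}_{\text{male}} - \hat{p}_{\text{female}} = \frac{21}{24} - \frac{14}{24} = \frac{7}{24} \approx 0.29
```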

  22. null hypothesis
    promotion and gender are
    independent, no gender
    discrimination, observed
    difference in proportions is
    simply due to chance
    “There is nothing going on”
    alternative hypothesis
    promotion and gender are
    dependent, there is gender
    discrimination, observed
    difference in proportions is
    not due to chance.
    “There is something going on”
    two competing claims

  23. simulation scheme
    1. face card: not promoted, non-face card: promoted
    ‣ set aside the jokers, consider aces as face cards
    ‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
    ‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
    [use a deck of playing cards to simulate this experiment]

  24. Step 1:
    Image source: http://www.jfitz.com/cards/

  25. simulation scheme
    1. face card: not promoted, non-face card: promoted
    ‣ set aside the jokers, consider aces as face cards
    ‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
    ‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
    2. shuffle the cards, deal into two groups of size 24, representing males and
    females
    [use a deck of playing cards to simulate this experiment]

  26. Step 2:
    Image source: http://www.jfitz.com/cards/

  27. simulation scheme
    1. face card: not promoted, non-face card: promoted
    ‣ set aside the jokers, consider aces as face cards
    ‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
    ‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
    2. shuffle the cards, deal into two groups of size 24, representing males and
    females
    3. count how many number cards are in each group (representing promoted files)
    4. calculate the proportion of promoted files in each group, take the difference
    (male - female), and record this value
    [use a deck of playing cards to simulate this experiment]

  28. Steps 3&4:
    Image source: http://www.jfitz.com/cards/

  29. [Figure: axis for the recorded difference in promotion rates, from −0.4 to 0.4, with one simulated value marked]

  30. simulation scheme
    1. face card: not promoted, non-face card: promoted
    ‣ set aside the jokers, consider aces as face cards
    ‣ take out 3 aces → 13 face cards left in the deck (face cards: A, K, Q, J)
    ‣ take out a number card → 35 number (non-face) cards left in the deck (number cards: 2-10)
    2. shuffle the cards, deal into two groups of size 24, representing males and
    females
    3. count how many number cards are in each group (representing promoted files)
    4. calculate the proportion of promoted files in each group, take the difference
    (male - female), and record this value
    5. repeat steps 2 - 4 many times
    [use a deck of playing cards to simulate this experiment]
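
The card scheme above can be sketched in Python, with a list of 1s and 0s standing in for the deck. This is a minimal illustration under stated assumptions: the function name `simulate_differences`, the seed, and the number of repetitions are hypothetical choices, and the counts (35 promoted, 13 not promoted, two groups of 24) are taken from the slides.

```python
import random


def simulate_differences(n_sims=10_000, seed=2023):
    """Shuffle promotion outcomes across two groups of 24, as with the cards."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    # Step 1: 35 "promoted" files (number cards), 13 "not promoted" (face cards).
    deck = [1] * 35 + [0] * 13
    diffs = []
    for _ in range(n_sims):  # Step 5: repeat steps 2 to 4 many times
        rng.shuffle(deck)                      # Step 2: shuffle and deal
        male, female = deck[:24], deck[24:]
        # Steps 3 and 4: proportion promoted in each group, male minus female.
        diffs.append(sum(male) / 24 - sum(female) / 24)
    return diffs


observed_diff = 21 / 24 - 14 / 24  # difference seen in the study data
diffs = simulate_differences()
# Share of shuffles with a difference at least as large as the observed one.
share_extreme = sum(d >= observed_diff - 1e-9 for d in diffs) / len(diffs)
print(f"observed difference: {observed_diff:.3f}")
print(f"share of simulations at least as extreme: {share_extreme:.4f}")
```

Because the shuffle ignores the sex label, every simulated difference is produced under the assumption that promotion and sex are independent.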

  31. [Figure: dot plot of the simulated differences in promotion rates; x-axis: difference in promotion rates, from −0.4 to 0.4]

  32. ‣ Results from the simulations look like the data → the difference between the
    proportions of promoted files between males and females was due to chance
    (promotion and sex are independent)
    ‣ Results from the simulations do not look like the data → the difference
    between the proportions of promoted files between males and females was
    not due to chance, but due to an actual effect of gender (promotion and sex
    are dependent)
    making a decision

  33. [Figure: dot plot of the simulated differences in promotion rates; x-axis: difference in promotion rates, from −0.4 to 0.4]

  34. ‣ set a null and an alternative hypothesis
    ‣ simulate the experiment assuming that the null hypothesis is true
    ‣ evaluate the probability of observing an outcome at least as extreme as the
    one observed in the original data
    ‣ and if this probability is low, reject the null hypothesis in favor of the
    alternative
    p-value
    summary
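
As a cross-check on the simulated p-value for the promotion example, the same probability can be computed exactly. This swaps the slides' simulation for a hypergeometric calculation (the one-sided form of Fisher's exact test), which is not something the deck itself presents; the function name `hypergeom_upper_tail` is a hypothetical helper.

```python
from math import comb


def hypergeom_upper_tail(k_obs, group_size=24, promoted=35, total=48):
    """P(at least k_obs promoted files land in one group of group_size)
    when `promoted` of `total` files are dealt at random (null hypothesis)."""
    not_promoted = total - promoted
    denominator = comb(total, group_size)
    upper = min(group_size, promoted)
    numerator = sum(
        comb(promoted, k) * comb(not_promoted, group_size - k)
        for k in range(k_obs, upper + 1)
    )
    return numerator / denominator


# 21 of the 24 "male" files were promoted in the study data.
p_exact = hypergeom_upper_tail(21)
print(f"exact one-sided p-value: {p_exact:.4f}")
```

A value below 0.05 leads to the same decision as the simulation: reject the null hypothesis in favor of the alternative.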

  35. confidence intervals

  36. A plausible range of values for the population parameter is called a
    confidence interval.
    ‣ If we report a point estimate, we probably won’t hit the exact
    population parameter.
    ‣ If we report a range of plausible values we have a good shot at
    capturing the parameter.
    Net: Photo by ozgurmulazimoglu on Flickr: http://www.flickr.com/photos/mulazimoglu/5195133899, CC-A 3.0 http://creativecommons.org/licenses/by/3.0/deed.en
    Spearfishing: Photo by Chris Penny on Flickr: http://www.flickr.com/photos/clearlydived/7029109617, CC-BY 2.0 http://creativecommons.org/licenses/by/2.0/

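  37. Central Limit Theorem (CLT): $\bar{x} \sim N\left(\text{mean} = \mu,\ SE = \frac{\sigma}{\sqrt{n}}\right)$
    approximate 95% CI: $\bar{x} \pm 2\,SE$, where $2\,SE$ is the margin of error (ME)
    [Figure: normal curve marked at µ − 3σ, µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ, µ + 3σ; 68% of values fall within 1σ of µ, 95% within 2σ, and 99.7% within 3σ]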

  38. One of the earliest examples of behavioral asymmetry is a preference in
    humans for turning the head to the right, rather than to the left, during
    the final weeks of gestation and for the first 6 months after birth. This is
    thought to influence subsequent development of perceptual and motor
    preferences. A study of 124 couples found that 64.5% turned their heads
    to the right when kissing. The standard error associated with this
    estimate is roughly 4%. Which of the below is false?
    (a) A larger sample size would yield a lower standard error. ✔︎
    (b) The margin of error for a 95% CI for the percentage of
    kissers who turn their heads to the right is roughly 8%. ✔︎
    (c) The 95% CI for the percentage of kissers who turn their
    heads to the right is roughly 64.5% ± 4%. ✗
    (d) The 99.7% CI for the percentage of kissers who turn their
    heads to the right is roughly 64.5% ± 12%. ✔︎
    The Kiss: http://en.wikipedia.org/wiki/File:Gustav_Klimt_016.jpg
    Study reference: Gunturkun, O. (2003) Adult persistence of head-turning asymmetry. Nature. Vol 421.
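
The arithmetic behind options (b) through (d) can be checked with a short Python sketch. It assumes the rule of thumb from the previous slide (estimate ± 2 SE for roughly 95%, ± 3 SE for roughly 99.7%); the helper name `approx_ci` is hypothetical.

```python
def approx_ci(estimate, se, multiplier=2):
    """Return (lower, upper) for estimate ± multiplier * SE."""
    margin_of_error = multiplier * se
    return estimate - margin_of_error, estimate + margin_of_error


# Values from the slide, in percentage points: estimate 64.5, SE roughly 4.
ci_95 = approx_ci(64.5, 4)                  # margin of error 2 * 4 = 8
ci_997 = approx_ci(64.5, 4, multiplier=3)   # margin of error 3 * 4 = 12
print("approx. 95% CI:", ci_95)
print("approx. 99.7% CI:", ci_997)
```

The 95% margin of error is 8 percentage points, not 4, which is why statement (c) is the false one.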

  39. confidence level
    ‣ Suppose we took many samples
    and built a confidence interval from
    each sample using the equation $\bar{x} \pm 2\,SE$.
    ‣ Then about 95% of those intervals
    would contain the true population
    mean (μ).
    ‣ Commonly used confidence levels in
    practice are 90%, 95%, 98%, and 99%.
    [Figure: confidence intervals from 25 samples plotted against µ = 94.52, annotated 24 / 25 = 0.96]

  40. If we want to be very certain that we capture the population
    parameter, should we use a wider interval or a narrower interval?
    [Figure: confidence intervals from repeated samples plotted against µ = 94.52]

  41. standard deviations from the mean
    −3 −2 −1 0 1 2 3
    95%, extends −1.96 to 1.96
    99%, extends −2.58 to 2.58
    CL ↑ width ↑
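
The link between confidence level, width, and coverage can be illustrated with a small Python simulation. This is a sketch under stated assumptions: the population is taken to be normal with mean 94.52 (the µ shown on the earlier slides), while the standard deviation of 10, the sample size of 100, the number of samples, and the function name `coverage_and_width` are all hypothetical choices.

```python
import random
import statistics


def coverage_and_width(z_star, mu=94.52, sigma=10.0, n=100,
                       n_samples=2000, seed=2023):
    """Draw many samples, build x_bar ± z_star * SE from each, and report
    the share of intervals containing mu and the average interval width."""
    rng = random.Random(seed)  # same seed -> same samples for every z_star
    hits = 0
    total_width = 0.0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lower, upper = x_bar - z_star * se, x_bar + z_star * se
        hits += lower <= mu <= upper
        total_width += upper - lower
    return hits / n_samples, total_width / n_samples


cov95, width95 = coverage_and_width(1.96)  # 95% confidence level
cov99, width99 = coverage_and_width(2.58)  # 99% confidence level
print(f"95%: coverage {cov95:.3f}, average width {width95:.2f}")
print(f"99%: coverage {cov99:.3f}, average width {width99:.2f}")
```

Both runs reuse the same simulated samples, so the comparison isolates the effect of the multiplier: the 99% intervals are wider and capture µ at least as often.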

  42. What drawbacks are associated with using a wider interval?
    Low: -20F / -29C
    High: 110F / 43C
    CL ↑ width ↑ accuracy ↑ precision ↓
    How can we get the best of both worlds: higher precision
    and higher accuracy?
    increase sample size
    Weather icon: Matthew Petroff, http://commons.wikimedia.org/wiki/File:Weather_Icons.png,
    Creative Commons CC0 1.0 Universal Public Domain Dedication, http://creativecommons.org/about/cc0

  43. The General Social Survey (GSS) is a sociological survey used to collect data on demographic
    characteristics and attitudes of residents of the United States. In 2010, the survey collected responses
    from 1,154 US residents. Based on the survey results, a 95% confidence interval for the average
    number of hours Americans have to relax or pursue activities that they enjoy after an average work day
    was found to be 3.53 to 3.83 hours. Determine if each of the following statements is true or false.
    (a) 95% of Americans spend 3.53 to 3.83 hours relaxing after a work day. [F]
    (b) 95% of random samples of 1,154 Americans will yield confidence intervals that
    contain the true average number of hours Americans spend relaxing after a work day. [T]
    (c) 95% of the time the true average number of hours Americans spend relaxing after a
    work day is between 3.53 and 3.83 hours. [F]
    (d) We are 95% confident that Americans in this sample spend on average 3.53 to 3.83
    hours relaxing after a work day. [F]

  44. The General Social Survey asks: “For how many days during the past 30 days was your mental health,
    which includes stress, depression, and problems with emotions, not good?” Based on responses from
    1,151 US residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.
    Interpret this interval in context of the data.
    We are 95% confident that Americans on average have 3.40 to 4.24 bad
    mental health days per month.

  45. The General Social Survey asks: “For how many days during the past 30 days was your mental health, which
    includes stress, depression, and problems with emotions, not good?” Based on responses from 1,151 US
    residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010. Interpret this interval
    in context of the data.
    In this context, what does a 95% confidence level mean?
    95% of random samples of 1,151 Americans will yield CIs that capture the
    true population mean of number of bad mental health days per month.

  46. The General Social Survey asks: “For how many days during the past 30 days was your mental health, which
    includes stress, depression, and problems with emotions, not good?” Based on responses from 1,151 US
    residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010. Interpret this interval
    in context of the data.
    Suppose the researchers think a 99% confidence level would be more appropriate for this
    interval. Will this new interval be narrower or wider than the 95% confidence interval?
    As CL increases so does the width of the confidence interval, so the new interval will be wider.

  47. A sample of 50 college students were asked how many exclusive relationships they’ve
    been in so far. The students in the sample had an average of 3.2 exclusive
    relationships, with a standard deviation of 1.74. In addition, the sample distribution
    was only slightly skewed to the right. Estimate the true average number of exclusive
    relationships based on this sample using a 95% confidence interval.
    1. random sample & 50 < 10% of all college students
    We can assume that the number of exclusive relationships
    one student in the sample has been in is independent of another.
    2. n > 30 & not so skewed sample
    We can assume that the sampling distribution of average number of exclusive
    relationships from samples of size 50 will be nearly normal.
    n = 50
    s = 1.74
    $\bar{x} = 3.2$
    Heart: http://commons.wikimedia.org/wiki/File:Heart-padlock.svg

  48. n = 50
    s = 1.74
    $\bar{x} = 3.2$
    $SE = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} \approx 0.246$
    $\bar{x} \pm z^{*}\,SE = 3.2 \pm 1.96\,(0.246) = 3.2 \pm 0.48 = (2.72, 3.68)$
    We are 95% confident that college students on average have been in
    2.72 to 3.68 exclusive relationships.
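
The calculation on this slide can be reproduced with a few lines of Python. This is a minimal sketch using the slide's values (n = 50, s = 1.74, sample mean 3.2, z* = 1.96); the variable names are illustrative.

```python
from math import sqrt

n, s, x_bar = 50, 1.74, 3.2   # sample size, sample SD, sample mean
z_star = 1.96                  # critical value for a 95% confidence level

se = s / sqrt(n)               # standard error of the sample mean
me = z_star * se               # margin of error
lower, upper = x_bar - me, x_bar + me
print(f"SE = {se:.3f}, ME = {me:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```

Swapping in a different `z_star` (for example 2.58 for 99%) shows how the interval widens as the confidence level goes up.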

  49. an overview of
    statistical inference
    frequentist
    we just completed…

  50. bayesian inference
    a mini foray into

  51. $P(E) = \lim_{n \to \infty} \frac{n_E}{n}$
    frequentist definition of probability

  52. ‣ Indifferent between
    ‣ winning $1 if event E occurs, or
    ‣ winning $1 if you draw a blue chip from a box with 1,000 × p blue chips
    + 1,000 × (1 − p) white chips
    ‣ Equating the probability of event E, P(E), to the probability of drawing a blue chip
    from this box, p
    P(E) = p
    bayesian definition of probability

  53. Example: Based on a 2022 Pew Research poll on 5,074 Adults: “We
    are 95% confident that 68% to 72% of Americans think inflation is the
    biggest problem facing the country.”
    ‣ 95% of random samples of 5,074 adults will produce confidence
    intervals that contain the true proportion of Americans who think inflation is the
    biggest problem facing the country.
    ‣ Common misconceptions:
    ‣ There is a 95% chance that this confidence interval includes the
    true population proportion.
    ‣ The true population proportion is in this interval 95% of the time.
    Source: https://www.pewresearch.org/fact-tank/2022/05/12/by-a-wide-margin-americans-view-inflation-as-the-top-problem-facing-the-country-today/
    confidence intervals

  54. ‣ Allows us to describe the unknown true parameter not as a fixed
    value but with a probability distribution
    ‣ This will let us construct something like a confidence interval, except
    we can make probabilistic statements about the parameter falling
    within that range.
    ‣ Example: “The posterior distribution yields a 95% credible interval
    of 68% to 72% for the proportion of Americans who think inflation is
    the biggest problem facing the country.”
    ‣ These are called credible intervals.
    Source: http://www.pewsocialtrends.org/2016/02/04/most-americans-say-government-doesnt-do-enough-to-help-middle-class/
    credible intervals
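
One way to see how a credible interval arises is a small Python sketch for a proportion. Everything specific here is an assumption made for illustration: a uniform Beta(1, 1) prior, hypothetical data of 70 "yes" answers out of 100 (not the Pew figures), and the helper name `credible_interval`. The slides do not specify a prior or a computation method.

```python
import random


def credible_interval(successes, n, level=0.95, draws=20_000, seed=2023):
    """Equal-tailed credible interval for a proportion from posterior draws."""
    rng = random.Random(seed)
    # Uniform Beta(1, 1) prior + binomial data -> Beta posterior (conjugacy).
    alpha = 1 + successes
    beta = 1 + (n - successes)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return lower, upper


# Hypothetical data: 70 of 100 respondents answer "yes".
lower, upper = credible_interval(successes=70, n=100)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Because the posterior is a probability distribution over the parameter, the statement "the proportion lies in this range with 95% probability" is meaningful here, which is the contrast with a confidence interval drawn on the slide.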

  55. slides at bit.ly/lmu-inference
    thank you!
