
hst953-pvalues

Patrick Kimes
November 08, 2019


MIT, HST.953: Collaborative Data Science in Medicine
p-values, multiple testing, and replicability in science
2019-11-08


Transcript

  1. p-values, multiple testing,
    and replicability in science
    HST 953, Fall 2019
    Patrick Kimes, PhD
    Data Sciences, Dana-Farber Cancer Institute
    Biostatistics, Harvard TH Chan School of Public Health


  2. “… this is an important topic.”
    - Leo Anthony Celi


  3. part I. the crisis


  4. https://doi.org/10.1371/journal.pmed.0020124


  5. https://doi.org/10.1371/journal.pmed.0020124
    “Simulations show that for most study designs and settings,
    it is more likely for a research claim to be false than true.”


  6. http://jtleek.com/talks.html


  7. http://jtleek.com/talks.html


  8. http://jtleek.com/talks.html


  9. http://jtleek.com/talks.html


  10. http://jtleek.com/talks.html


  11. http://jtleek.com/talks.html


  12. http://jtleek.com/talks.html


  13. https://www.nature.com/news/1-500-scientists-
    lift-the-lid-on-reproducibility-1.19970


  14. what’s going on?


  15. let’s clarify some language
    reproducibility
    replicability


  16. let’s clarify some language
    reproducibility
    replicability
    the ability to take the original data
    and the computer code used to
    analyze the data and reproduce all of
    the numerical findings from the study
    https://simplystatistics.org/2016/08/24/replication-crisis/


  17. let’s clarify some language
    reproducibility
    replicability
    the ability to take the original data
    and the computer code used to
    analyze the data and reproduce all of
    the numerical findings from the study
    https://simplystatistics.org/2016/08/24/replication-crisis/


  18. let’s clarify some language
    reproducibility
    replicability
    the ability to repeat an entire study,
    independent of the original investigator
    without the use of original data
    https://simplystatistics.org/2016/08/24/replication-crisis/


  19. let’s clarify some language
    reproducibility
    replicability
    the ability to repeat an entire study,
    independent of the original investigator
    without the use of original data
    https://simplystatistics.org/2016/08/24/replication-crisis/


  20. let’s clarify some language
    reproducibility
    replicability


  21. REPLICABILITY


  22. what’s going on?
    REPLICABILITY


  23. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.


  24. “… replications of 100 experimental
    and correlational studies …”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443


  25. “39% of effects were subjectively
    rated to have replicated the
    original results.”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443
    “… replications of 100 experimental
    and correlational studies …”


  26. 39%??
    https://www.ncbi.nlm.nih.gov/pubmed/26315443


  27. https://www.ncbi.nlm.nih.gov/pubmed/22460902
    https://www.ncbi.nlm.nih.gov/pubmed/22460905
    not just
    psychology


  28. https://www.ncbi.nlm.nih.gov/pubmed/22460902
    https://www.ncbi.nlm.nih.gov/pubmed/22460905
    drug sensitivity in
    cancer cell lines


  29. https://www.ncbi.nlm.nih.gov/pubmed/24284626


  30. https://www.ncbi.nlm.nih.gov/pubmed/24284626


  31. https://www.ncbi.nlm.nih.gov/pubmed/27905415


  32. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.


  33. REPLICABILITY
    what does it mean for
    results to “replicate”?


  34. “No single indicator sufficiently
    describes replication success, and
    the five indicators examined here
    are not the only ways to evaluate
    reproducibility.”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443


  35. “Upon careful analysis of the same data,
    we have come to quite different and
    much more positive conclusions.”
    https://www.ncbi.nlm.nih.gov/pubmed/27905415


  36. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things


  37. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things
    p-values


  38. part II. the p-value


  39. what do statisticians do?


  40. inference
    .. and some other things too
    what do statisticians do?


  41. inference
    •point estimation
    •interval estimation
    •hypothesis testing
    .. and some other things too
    what do statisticians do?


  42. •point estimation
    •interval estimation
    •hypothesis testing
    what do statisticians do?


  43. the statistical approach
    to testing hypotheses


  44. the statistical approach
    to testing hypotheses
pose uninteresting
baseline hypothesis
H0: no difference between groups


  45. the statistical approach
    to testing hypotheses
pose uninteresting
baseline hypothesis
H0: no difference between groups
    collect data


  46. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
H0: no difference between groups
    collect data
    how likely is this data
    if baseline were true?


  47. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
H0: no difference between groups
    collect data
    how likely is this data
    if baseline were true?
    p-value


H0: no difference between groups
    null hypothesis
    more formally …


H0: no difference between groups
H1: difference between groups
    null hypothesis
    alternative hypothesis
    more formally …


  50. p-value
H0: no difference between groups
H1: difference between groups
    more formally …


  51. p-value
H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]


  52. p-value
H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]


  53. p-value
H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]
    small p-value


  54. p-value
H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]
    larger p-value


H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]
    p-value
    (often use a 5% cutoff)


H0: no difference between groups
H1: difference between groups
[figure: null (H0) distribution of the difference between sampled groups, centered at 0]
    p-value
    (often use a 5% cutoff)


H0: no difference between groups
H1: difference between groups
[figure: H0 and H1 distributions of the difference between sampled groups, H0 centered at 0]
    p-value
    (often use a 5% cutoff)

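The question on these slides — "how likely is this data if the baseline were true?" — can be sketched with a small permutation test. This is an illustrative example, not from the deck: the groups, sizes, and seed are made up, and both groups are deliberately drawn under H0.

```python
import random
import statistics

# Hedged sketch (not from the slides): a permutation p-value for
# H0: no difference between groups. Under H0 the group labels are
# exchangeable, so we shuffle labels and ask how often the shuffled
# difference in means is at least as extreme as the observed one.
random.seed(0)
group_a = [random.gauss(0.0, 1.0) for _ in range(40)]  # illustrative data:
group_b = [random.gauss(0.0, 1.0) for _ in range(40)]  # both drawn under H0

observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
pooled = group_a + group_b
n_perm = 2000
more_extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:40]) - statistics.mean(pooled[40:]))
    if diff >= observed:
        more_extreme += 1
p_value = more_extreme / n_perm  # fraction of shuffles at least as extreme
print(p_value)
```

A small p-value would mean the observed difference is rare under the "no difference" baseline; since these data really were generated under H0, the p-value here is usually unremarkable.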

  58. “If he was cited every time a
    p-value was reported his
    paper would have, at the very
    least, 3 million citations* …”
    https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/


  59. https://jamanetwork.com/journals/jama/fullarticle/2503172
    p-values are
    everywhere


  60. and they’re
    significant
    https://jamanetwork.com/journals/jama/fullarticle/2503172


  61. https://simplystatistics.org/2017/07/26/
    announcing-the-tidypvals-package/


  62. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things
    p-values


  63. https://www.nature.com/news/psychology-
    journal-bans-p-values-1.17001


  64. https://www.sciencenews.org/blog/context/p-
    value-ban-small-step-journal-giant-leap-science


  65. “The fact that many
    misinterpret the p-value is
    not the p-value’s fault.”
    https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/


  66. https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
    “… the scientific community
    could benefit from a formal
    statement clarifying several
    widely agreed upon principles
    underlying the proper use and
    interpretation of the p-value.”


  67. https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913


  68. https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/
    “The fact that many
    misinterpret the p-value is
    not the p-value’s fault.”


  69. REPLICABILITY
    Common pitfalls
    p-values


  70. part III. multiple testing


  71. Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


  72. brain activity
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


  73. dead salmon
    from market
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


  74. zombie
    fish?
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


hypothesis testing in
salmon fMRI study
H0: no difference in signal at voxel 1
H1: difference in signal at voxel 1


hypothesis testing in
salmon fMRI study
H0: no difference in signal at voxel 1
H1: difference in signal at voxel 1
H0: no difference in signal at voxel 2
H1: difference in signal at voxel 2
H0: no difference in signal at voxel 3
H1: difference in signal at voxel 3
H0: no difference in signal at voxel 4
H1: difference in signal at voxel 4
H0: no difference in signal at voxel 5
H1: difference in signal at voxel 5
H0: no difference in signal at voxel 6


hypothesis testing in
salmon fMRI study
H0: no difference in signal at voxel 1
H1: difference in signal at voxel 1
H0: no difference in signal at voxel 2
H1: difference in signal at voxel 2
H0: no difference in signal at voxel 3
H1: difference in signal at voxel 3
H0: no difference in signal at voxel 4
H1: difference in signal at voxel 4
H0: no difference in signal at voxel 5
H1: difference in signal at voxel 5
H0: no difference in signal at voxel 6
~8,000 voxels considered


  78. the problem of
    multiple hypothesis testing
8,000 voxels


  79. the problem of
    multiple hypothesis testing
8,000 true null
0 true differential
8,000 voxels


  80. the problem of
    multiple hypothesis testing
400 significant
7,600 not sig.
8,000 true null
0 true differential
8,000 voxels
5% p-value cutoff


  81. the problem of
    multiple hypothesis testing
400 significant
7,600 not sig.
0 significant
8,000 true null
0 true differential
8,000 voxels
5% p-value cutoff


  82. the problem of
    multiple hypothesis testing
950 significant
7,600 not sig.
1,000 significant
8,000 true null
0 true differential
8,000 voxels
400 significant
5% p-value cutoff
100% of our hits are false!

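The slide's arithmetic can be simulated directly. This is an illustrative sketch, not the salmon study's data: when every one of the 8,000 voxel nulls is true, each p-value is (approximately) Uniform(0, 1), so a 5% cutoff is expected to flag about 0.05 × 8,000 = 400 voxels, all of them false positives.

```python
import random

# Hedged sketch: simulate 8,000 tests where the null is true for all of
# them. Null p-values are Uniform(0, 1), so about 5% fall below 0.05.
random.seed(1)
null_pvals = [random.random() for _ in range(8000)]  # all true nulls
n_significant = sum(p < 0.05 for p in null_pvals)
print(n_significant)  # close to the expected 400 -- all false positives
```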

  83. “… random noise in the EPI
    timeseries may yield spurious
    results if multiple comparisons
    are not controlled for …”
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


  84. 8,000
    true null
    0
    true differential
    8,000
    voxels
    what can we do?


  85. Family-wise Error Rate (FWER)
    •Bonferroni correction
    P( at least 1 false positive ) < ⍺
    what can we do?
8,000 true null
0 true differential
8,000 voxels


  86. Family-wise Error Rate (FWER)
    •Bonferroni correction
    P( at least 1 false positive ) < ⍺
    (5 / 8,000)% p-value cutoff
    what can we do?
8,000 true null
0 true differential
8,000 voxels

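The Bonferroni cutoff on the slide can be checked by simulation. A hedged sketch with made-up data: shrinking the per-test cutoff from 5% to (5 / 8,000)% keeps the chance of even one false positive across all 8,000 true nulls below α.

```python
import random

# Hedged sketch: compare the raw 5% cutoff with the Bonferroni cutoff
# alpha / m on 8,000 simulated true nulls (illustrative, not study data).
random.seed(2)
m = 8000
null_pvals = [random.random() for _ in range(m)]
raw_hits = sum(p < 0.05 for p in null_pvals)             # roughly 400, all false
bonferroni_hits = sum(p < 0.05 / m for p in null_pvals)  # almost always 0
print(raw_hits, bonferroni_hits)
```

The price of this guarantee is power: with such a small cutoff, real but modest signals can also fail to reach significance.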

  87. 19,000
    true null
    1,000
    true differential
    20,000
    genes
    False Discovery Rate (FDR)
    •Benjamini-Hochberg (BH) procedure
    •Storey’s q-value
    E( ) < ⍺
    # false positives
    # total positives
    what can we do?
    8,000
    true null
    0
    true differential
    8,000
    voxels


  88. 19,000
    true null
    1,000
    true differential
    20,000
    genes
    False Discovery Rate (FDR)
    •Benjamini-Hochberg (BH) procedure
    •Storey’s q-value
    E( ) < ⍺
    # false positives
    # total positives
    some cutoff
    estimate these
    what can we do?
    8,000
    true null
    0
    true differential
    8,000
    voxels

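The Benjamini-Hochberg procedure named above fits in a few lines. This is a hedged sketch of the standard step-up rule (the one behind `method = "BH"` / `"fdr_bh"` in the workshop code); the p-values in the example are made up.

```python
# Hedged sketch of the Benjamini-Hochberg step-up rule: sort the m
# p-values, find the largest rank k with p_(k) <= (k / m) * alpha, and
# reject the k smallest p-values.
def bh_reject(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank  # step-up: keep the largest passing rank
    rejected = set(order[:k_max])
    return [i in rejected for i in range(m)]

# made-up p-values: the two strong signals survive; the borderline ones
# fail once the (k/m) * alpha thresholds are applied
print(bh_reject([0.001, 0.008, 0.039, 0.041, 0.7]))
# -> [True, True, False, False, False]
```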

  89. “… controlling the false
    discovery rate (FDR) and
    familywise error rate (FWER) …
    indicated no active voxels …”
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.


  90. https://doi.org/10.1371/journal.pone.0124165


  91. https://doi.org/10.1371/journal.pone.0124165


  92. https://doi.org/10.1371/journal.pone.0124165


  93. WORKSHOP (in R)


## for FWER control (Bonferroni)
p.adjust(my_pvals, method = "bonferroni")
## for FDR control (Benjamini-Hochberg)
p.adjust(my_pvals, method = "BH")
    WORKSHOP (in R)


  95. WORKSHOP (in Python)


# use statsmodels package
import statsmodels.stats.multitest as mt
## for FWER control (Bonferroni)
mt.multipletests(my_pvals, method = "bonferroni")
## for FDR control (Benjamini-Hochberg)
mt.multipletests(my_pvals, method = "fdr_bh")
    WORKSHOP (in Python)

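As a sanity check on what these calls return, Bonferroni-adjusted p-values can be reasoned about directly: each raw p-value is multiplied by the number of tests m (capped at 1), so "adjusted p < α" is the same decision as "raw p < α / m". A small sketch with made-up p-values:

```python
# Hedged sketch of the Bonferroni adjustment: p * m, capped at 1.
# This mirrors what p.adjust(..., method = "bonferroni") computes.
def bonferroni_adjust(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

adjusted = bonferroni_adjust([0.004, 0.02, 0.5])
print(adjusted)  # only the first stays below a 0.05 cutoff
```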

  97. FWER correction,
    FDR correction,
    got it, done, finished, great


  98. FWER correction,
    FDR correction,
    got it, done, finished, great
    … almost


  99. https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412


  100. part IV. forking paths


  101. Goodhart’s law.
    When a measure becomes a target,
    it ceases to be a good measure.


  102. “… it’s easy to find a p < .05
    comparison even if nothing is
    going on, if you look hard
    enough—and good scientists are
    skilled at looking hard enough …”


  103. specify hypothesis


  104. specify hypothesis


  105. specify hypothesis


  106. specify hypothesis


  107. specify hypothesis
    trying a different
    set of statistics


  108. specify hypothesis
    revisiting filtering
    of dataset


  109. specify hypothesis


  110. specify hypothesis
    “… a sort of invisible multiplicity:
    data-dependent analysis choices
    that did not appear to be degrees of
    freedom because researchers
    analyze only one dataset at a time.”


  111. Goodhart’s law.
    When a measure becomes a target,
    it ceases to be a good measure.


  112. “… it’s easy to find a p < .05
    comparison even if nothing is
    going on, if you look hard
    enough—and good scientists are
    skilled at looking hard enough …”

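"Looking hard enough" can be quantified. If one null dataset is sliced k roughly independent ways, the chance that at least one comparison clears p < .05 is 1 − 0.95^k. A simulated sketch with made-up numbers (20 analysis paths per dataset, treated as independent tests for simplicity):

```python
import random

# Hedged sketch of the forking-paths arithmetic: with 20 ways to analyze
# pure-noise data, the chance of at least one p < .05 "finding" is
# 1 - 0.95**20, about 64%.
random.seed(3)
n_datasets = 2000
n_paths = 20  # analysis choices tried per dataset
lucky = 0
for _ in range(n_datasets):
    pvals = [random.random() for _ in range(n_paths)]  # all true nulls
    if min(pvals) < 0.05:
        lucky += 1
rate = lucky / n_datasets
print(rate)  # near 1 - 0.95**20, i.e. roughly 0.64
```

Real analysis choices are correlated rather than independent, so the exact rate differs in practice, but the qualitative point stands: enough forks all but guarantee a "significant" result.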

  113. specify hypothesis
    moving forward?
pre-registration
    external validation
    acceptance


  114. “While it is easy to lie with statistics,
    it is even easier to lie without them.”
    - Frederick Mosteller


  115. https://fivethirtyeight.com/features/science-isnt-broken/


  116. https://projects.fivethirtyeight.com/p-hacking/


  117. part V. onward/upward


  118. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.


  119. REPLICABILITY
    should we expect
    scientific results to
    always replicate?


  120. https://simplystatistics.org/2016/08/24/replication-crisis/
    “the replication crisis in science is largely attributable to
    a mismatch in our expectations of how often findings
    should replicate and how difficult it is to actually
    discover true findings in certain fields.”


  121. https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/
    “…I argue that the rate of
    discoveries is higher in
    biomedical research than in
    physics. But, to achieve this higher
    true positive rate, biomedical
    research has to tolerate a higher
    false positive rate.”


  122. REPLICABILITY
    should we expect
    scientific results to
    always replicate?
    not always


  123. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.
    maybe that’s part of science?


  124. “… this is an important topic.”
    - Leo Anthony Celi


  125. Done right, reproducibility should
    not be a crisis for digital medicine,
    but rather one of its strengths.
    “As an embryonic discipline, digital
    medicine has the chance to inculcate
    among its practitioners a healthier
    set of attitudes towards replication.”


  126. p-values, multiple testing,
    and replicability in science
    HST 953, Fall 2019
    Patrick Kimes, PhD
    Data Sciences, Dana-Farber Cancer Institute
    Biostatistics, Harvard TH Chan School of Public Health
