hst953-pvalues

Patrick Kimes
November 08, 2019

MIT, HST.953: Collaborative Data Science in Medicine
p-values, multiple testing, and replicability in science
2019-11-08


Transcript

  1. p-values, multiple testing,
    and replicability in science
    HST 953, Fall 2019
    Patrick Kimes, PhD
    Data Sciences, Dana-Farber Cancer Institute
    Biostatistics, Harvard TH Chan School of Public Health

  2. “… this is an important topic.”
    - Leo Anthony Celi

  3. part I. the crisis

  4. https://doi.org/10.1371/journal.pmed.0020124

  5. https://doi.org/10.1371/journal.pmed.0020124
    “Simulations show that for most study designs and settings,
    it is more likely for a research claim to be false than true.”

  6. http://jtleek.com/talks.html

  7. http://jtleek.com/talks.html

  8. http://jtleek.com/talks.html

  9. http://jtleek.com/talks.html

  10. http://jtleek.com/talks.html

  11. http://jtleek.com/talks.html

  12. http://jtleek.com/talks.html

  13. https://www.nature.com/news/1-500-scientists-
    lift-the-lid-on-reproducibility-1.19970

  14. what’s going on?

  15. let’s clarify some language
    reproducibility
    replicability

  16. let’s clarify some language
    reproducibility
    replicability
    the ability to take the original data
    and the computer code used to
    analyze the data and reproduce all of
    the numerical findings from the study
    https://simplystatistics.org/2016/08/24/replication-crisis/

  17. let’s clarify some language
    reproducibility
    replicability
    the ability to take the original data
    and the computer code used to
    analyze the data and reproduce all of
    the numerical findings from the study
    https://simplystatistics.org/2016/08/24/replication-crisis/

  18. let’s clarify some language
    reproducibility
    replicability
    the ability to repeat an entire study,
    independent of the original investigator
    without the use of original data
    https://simplystatistics.org/2016/08/24/replication-crisis/

  19. let’s clarify some language
    reproducibility
    replicability
    the ability to repeat an entire study,
    independent of the original investigator
    without the use of original data
    https://simplystatistics.org/2016/08/24/replication-crisis/

  20. let’s clarify some language
    reproducibility
    replicability

  21. REPLICABILITY

  22. what’s going on?
    REPLICABILITY

  23. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.

  24. “… replications of 100 experimental
    and correlational studies …”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443

  25. “39% of effects were subjectively
    rated to have replicated the
    original results.”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443
    “… replications of 100 experimental
    and correlational studies …”

  26. 39%??
    https://www.ncbi.nlm.nih.gov/pubmed/26315443

  27. https://www.ncbi.nlm.nih.gov/pubmed/22460902
    https://www.ncbi.nlm.nih.gov/pubmed/22460905
    not just
    psychology

  28. https://www.ncbi.nlm.nih.gov/pubmed/22460902
    https://www.ncbi.nlm.nih.gov/pubmed/22460905
    drug sensitivity in
    cancer cell lines

  29. https://www.ncbi.nlm.nih.gov/pubmed/24284626

  30. https://www.ncbi.nlm.nih.gov/pubmed/24284626

  31. https://www.ncbi.nlm.nih.gov/pubmed/27905415

  32. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.

  33. REPLICABILITY
    what does it mean for
    results to “replicate”?

  34. “No single indicator sufficiently
    describes replication success, and
    the five indicators examined here
    are not the only ways to evaluate
    reproducibility.”
    https://www.ncbi.nlm.nih.gov/pubmed/26315443

  35. “Upon careful analysis of the same data,
    we have come to quite different and
    much more positive conclusions.”
    https://www.ncbi.nlm.nih.gov/pubmed/27905415

  36. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things

  37. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things
    p-values

  38. part II. the p-value

  39. what do statisticians do?

  40. inference
    .. and some other things too
    what do statisticians do?

  41. inference
    •point estimation
    •interval estimation
    •hypothesis testing
    .. and some other things too
    what do statisticians do?

  42. •point estimation
    •interval estimation
    •hypothesis testing
    what do statisticians do?

  43. the statistical approach
    to testing hypotheses

  44. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
    H0: no difference between groups

  45. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
    H0: no difference between groups
    collect data

  46. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
    H0: no difference between groups
    collect data
    how likely is this data
    if baseline were true?

  47. the statistical approach
    to testing hypotheses
    pose uninteresting
    baseline hypothesis
    H0: no difference between groups
    collect data
    how likely is this data
    if baseline were true?
    p-value

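The "how likely is this data if baseline were true?" step can be made concrete with a small permutation sketch. The toy data, group sizes, and seed below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: two groups drawn from the same distribution, so the
# uninteresting baseline (no difference between groups) is true
group_a = rng.normal(size=50)
group_b = rng.normal(size=50)
observed = abs(group_a.mean() - group_b.mean())

# shuffle the group labels many times and ask: how often does a
# difference at least this large appear when the baseline holds?
pooled = np.concatenate([group_a, group_b])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    if abs(pooled[:50].mean() - pooled[50:].mean()) >= observed:
        count += 1

p_value = count / n_perm
print(p_value)  # under the baseline, p-values are roughly uniform on (0, 1)
```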
  48. more formally …
    H0: no difference between groups (null hypothesis)

  49. more formally …
    H0: no difference between groups (null hypothesis)
    H1: difference between groups (alternative hypothesis)

  50. more formally …
    p-value
    H0: no difference between groups
    H1: difference between groups

  51. p-value
    H0: no difference between groups
    H1: difference between groups
    (figure: distribution of the difference between sampled groups under H0, centered at 0)

  52. p-value
    H0: no difference between groups
    H1: difference between groups
    (figure: distribution of the difference between sampled groups under H0, centered at 0)

  53. p-value
    H0: no difference between groups
    H1: difference between groups
    (figure: under H0, an observed difference far from 0 gives a small p-value)

  54. p-value
    H0: no difference between groups
    H1: difference between groups
    (figure: under H0, an observed difference near 0 gives a larger p-value)

  55. p-value
    (often use a 5% cutoff)
    H0: no difference between groups
    H1: difference between groups
    (figure: distribution of the difference between sampled groups under H0)

  56. p-value
    (often use a 5% cutoff)
    H0: no difference between groups
    H1: difference between groups
    (figure: distribution of the difference between sampled groups under H0)

  57. p-value
    (often use a 5% cutoff)
    H0: no difference between groups
    H1: difference between groups
    (figure: distributions of the difference between sampled groups under H0 and H1)

  58. “If he was cited every time a
    p-value was reported his
    paper would have, at the very
    least, 3 million citations* …”
    https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/

  59. https://jamanetwork.com/journals/jama/fullarticle/2503172
    p-values are
    everywhere

  60. and they’re
    significant
    https://jamanetwork.com/journals/jama/fullarticle/2503172

  61. https://simplystatistics.org/2017/07/26/
    announcing-the-tidypvals-package/

  62. REPLICABILITY
    what does it mean for
    results to “replicate”?
    a lot of things
    p-values

  63. https://www.nature.com/news/psychology-
    journal-bans-p-values-1.17001

  64. https://www.sciencenews.org/blog/context/p-
    value-ban-small-step-journal-giant-leap-science

  65. “The fact that many
    misinterpret the p-value is
    not the p-value’s fault.”
    https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/

  66. https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
    “… the scientific community
    could benefit from a formal
    statement clarifying several
    widely agreed upon principles
    underlying the proper use and
    interpretation of the p-value.”

  67. https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

  68. https://simplystatistics.org/2012/01/06/p-values-
    and-hypothesis-testing-get-a-bad-rap-but-we/
    “The fact that many
    misinterpret the p-value is
    not the p-value’s fault.”

  69. REPLICABILITY
    Common pitfalls
    p-values

  70. part III. multiple testing

  71. Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  72. brain activity
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  73. dead salmon
    from market
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  74. zombie
    fish?
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  75. hypothesis testing in
    salmon fMRI study
    H0: no difference in signal at voxel 1
    H1: difference in signal at voxel 1

  76. hypothesis testing in
    salmon fMRI study
    H0: no difference in signal at voxel 1
    H1: difference in signal at voxel 1
    H0: no difference in signal at voxel 2
    H1: difference in signal at voxel 2
    H0: no difference in signal at voxel 3
    H1: difference in signal at voxel 3
    H0: no difference in signal at voxel 4
    H1: difference in signal at voxel 4
    H0: no difference in signal at voxel 5
    H1: difference in signal at voxel 5
    H0: no difference in signal at voxel 6

  77. hypothesis testing in
    salmon fMRI study
    H0: no difference in signal at voxel 1
    H1: difference in signal at voxel 1
    H0: no difference in signal at voxel 2
    H1: difference in signal at voxel 2
    H0: no difference in signal at voxel 3
    H1: difference in signal at voxel 3
    H0: no difference in signal at voxel 4
    H1: difference in signal at voxel 4
    H0: no difference in signal at voxel 5
    H1: difference in signal at voxel 5
    H0: no difference in signal at voxel 6
    ~8,000 voxels
    considered

  78. the problem of
    multiple hypothesis testing
    8,000
    voxels

  79. the problem of
    multiple hypothesis testing
    8,000 voxels: 8,000 true null, 0 true differential

  80. the problem of
    multiple hypothesis testing
    8,000 voxels: 8,000 true null, 0 true differential
    5% p-value cutoff: 400 significant, 7,600 not sig.

  81. the problem of
    multiple hypothesis testing
    8,000 voxels: 8,000 true null, 0 true differential
    5% p-value cutoff: 400 significant, 7,600 not sig.
    (of the 0 true differentials: 0 significant)

  82. the problem of
    multiple hypothesis testing
    8,000 voxels: 8,000 true null, 0 true differential
    5% p-value cutoff: 400 significant, 7,600 not sig.
    100% of our
    hits are false!

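The arithmetic above is easy to reproduce by simulation: when every voxel is truly null, p-values are uniform, so a 5% cutoff flags about 5% of 8,000 tests. A minimal sketch (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# 8,000 voxels, all truly null: their p-values are uniform on (0, 1)
pvals = rng.uniform(size=8_000)

# a 5% cutoff still calls about 0.05 * 8,000 = 400 of them significant
n_hits = (pvals < 0.05).sum()
print(n_hits)  # around 400, and every one of them is a false positive
```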
  83. “… random noise in the EPI
    timeseries may yield spurious
    results if multiple comparisons
    are not controlled for …”
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  84. what can we do?
    8,000 voxels: 8,000 true null, 0 true differential

  85. what can we do?
    Family-wise Error Rate (FWER)
    •Bonferroni correction
    P( at least 1 false positive ) < ⍺
    8,000 voxels: 8,000 true null, 0 true differential

  86. what can we do?
    Family-wise Error Rate (FWER)
    •Bonferroni correction
    P( at least 1 false positive ) < ⍺
    (5 / 8,000)% p-value cutoff
    8,000 voxels: 8,000 true null, 0 true differential

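The Bonferroni cutoff above can be checked numerically; this sketch reuses the running example of 8,000 truly null voxels (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = 0.05
n_tests = 8_000
pvals = rng.uniform(size=n_tests)  # all truly null

# Bonferroni: test each p-value at alpha / n_tests so that
# P( at least 1 false positive ) < alpha across all 8,000 tests
cutoff = alpha / n_tests  # 0.05 / 8,000 = 6.25e-6, i.e. (5 / 8,000)%
n_hits = (pvals < cutoff).sum()
print(n_hits)  # almost always 0 when every null is true
```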
  87. what can we do?
    False Discovery Rate (FDR)
    •Benjamini-Hochberg (BH) procedure
    •Storey’s q-value
    E( # false positives / # total positives ) < ⍺
    20,000 genes: 19,000 true null, 1,000 true differential
    8,000 voxels: 8,000 true null, 0 true differential

  88. what can we do?
    False Discovery Rate (FDR)
    •Benjamini-Hochberg (BH) procedure
    •Storey’s q-value
    E( # false positives / # total positives ) < ⍺
    (⍺ = some cutoff; the counts are estimated)
    20,000 genes: 19,000 true null, 1,000 true differential
    8,000 voxels: 8,000 true null, 0 true differential

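The BH procedure itself is short enough to sketch directly. This toy implementation, and the p-values fed to it, are illustrative only:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of which hypotheses BH rejects."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    # find the largest k with p_(k) <= (k / m) * alpha ...
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(pvals[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        # ... and reject the k smallest p-values
        reject[order[: below.max() + 1]] = True
    return reject

toy_pvals = [0.001, 0.004, 0.02, 0.2, 0.5, 0.8]
print(benjamini_hochberg(toy_pvals).sum())  # 3 discoveries
```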
  89. “… controlling the false
    discovery rate (FDR) and
    familywise error rate (FWER) …
    indicated no active voxels …”
    Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic
    Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.

  90. https://doi.org/10.1371/journal.pone.0124165

  91. https://doi.org/10.1371/journal.pone.0124165

  92. https://doi.org/10.1371/journal.pone.0124165

  93. WORKSHOP (in R)

  94. ## for FWER control (Bonferroni)
    p.adjust(my_pvals, method = "bonferroni")
    ## for FDR control (Benjamini-Hochberg)
    p.adjust(my_pvals, method = "BH")
    WORKSHOP (in R)

  95. WORKSHOP (in Python)

  96. # use statsmodels package
    import statsmodels.stats.multitest as mt
    ## for FWER control (Bonferroni)
    mt.multipletests(my_pvals, method = "bonferroni")
    ## for FDR control (Benjamini-Hochberg)
    mt.multipletests(my_pvals, method = "fdr_bh")
    WORKSHOP (in Python)

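As a usage note for the calls above (assuming the statsmodels package is installed): `multipletests` returns a boolean reject mask and adjusted p-values. The p-values below are toy numbers:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

my_pvals = np.array([0.001, 0.004, 0.02, 0.2, 0.5, 0.8])

# Bonferroni controls the FWER and is the most conservative
reject_bonf, p_bonf, _, _ = multipletests(my_pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the FDR and rejects at least as many
reject_bh, p_bh, _, _ = multipletests(my_pvals, alpha=0.05, method="fdr_bh")

print(reject_bonf.sum(), reject_bh.sum())
```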
  97. WORKSHOP

  98. FWER correction,
    FDR correction,
    got it, done, finished, great

  99. FWER correction,
    FDR correction,
    got it, done, finished, great
    … almost

  100. https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412

  101. part IV. forking paths

  102. Goodhart’s law.
    When a measure becomes a target,
    it ceases to be a good measure.

  103. (image slide)

  104. (image slide)

  105. (image slide)

  106. “… it’s easy to find a p < .05
    comparison even if nothing is
    going on, if you look hard
    enough—and good scientists are
    skilled at looking hard enough …”

  107. specify hypothesis

  108. specify hypothesis

  109. specify hypothesis

  110. specify hypothesis

  111. specify hypothesis
    trying a different
    set of statistics

  112. specify hypothesis
    revisiting filtering
    of dataset

  113. specify hypothesis

  114. specify hypothesis
    “… a sort of invisible multiplicity:
    data-dependent analysis choices
    that did not appear to be degrees of
    freedom because researchers
    analyze only one dataset at a time.”

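This "invisible multiplicity" can be demonstrated on pure noise: each data-dependent choice looks like a single analysis, but together they are many tests. Everything below (subgroup cutoffs, sample size, seed) is a made-up illustration, and it assumes scipy is installed:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# pure noise: the outcome has no real relationship to the grouping
n = 200
group = rng.integers(0, 2, size=n).astype(bool)
age = rng.integers(20, 80, size=n)
outcome = rng.normal(size=n)

# five innocent-looking analysis choices: five forks in the path
analyses = {
    "everyone": np.ones(n, dtype=bool),
    "under 40": age < 40,
    "40 to 60": (age >= 40) & (age < 60),
    "over 60": age >= 60,
    "drop outliers": np.abs(outcome) < 2,
}

pvals = {}
for name, keep in analyses.items():
    _, pvals[name] = ttest_ind(outcome[keep & group], outcome[keep & ~group])

# reporting only the "best" fork inflates the false positive rate
print(min(pvals, key=pvals.get), min(pvals.values()))
```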
  115. Goodhart’s law.
    When a measure becomes a target,
    it ceases to be a good measure.

  116. “… it’s easy to find a p < .05
    comparison even if nothing is
    going on, if you look hard
    enough—and good scientists are
    skilled at looking hard enough …”

  117. specify hypothesis
    moving forward?
    pre-registration
    external validation
    acceptance

  118. “While it is easy to lie with statistics,
    it is even easier to lie without them.”
    - Frederick Mosteller

  119. https://fivethirtyeight.com/features/science-isnt-broken/

  120. https://projects.fivethirtyeight.com/p-hacking/

  121. part V. onward/upward

  122. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.

  123. REPLICABILITY
    should we expect
    scientific results to
    always replicate?

  124. https://simplystatistics.org/2016/08/24/replication-crisis/
    “the replication crisis in science is largely attributable to
    a mismatch in our expectations of how often findings
    should replicate and how difficult it is to actually
    discover true findings in certain fields.”

  125. https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/
    “…I argue that the rate of
    discoveries is higher in
    biomedical research than in
    physics. But, to achieve this higher
    true positive rate, biomedical
    research has to tolerate a higher
    false positive rate.”

  126. REPLICABILITY
    should we expect
    scientific results to
    always replicate?
    not always

  127. REPLICABILITY
    crisis:
    experiments are replicated.
    results not so much.
    maybe that’s part of science?

  128. 39%??

  129. (image slide)

  130. “… this is an important topic.”
    - Leo Anthony Celi

  131. (image slide)

  132. (image slide)

  133. Done right, reproducibility should
    not be a crisis for digital medicine,
    but rather one of its strengths.
    “As an embryonic discipline, digital
    medicine has the chance to inculcate
    among its practitioners a healthier
    set of attitudes towards replication.”

  134. p-values, multiple testing,
    and replicability in science
    HST 953, Fall 2019
    Patrick Kimes, PhD
    Data Sciences, Dana-Farber Cancer Institute
    Biostatistics, Harvard TH Chan School of Public Health
