hst953-pvalues

MIT, HST.953: Collaborative Data Science in Medicine
p-values, multiple testing, and replicability in science
2019-11-08


Patrick Kimes


Transcript

  1. p-values, multiple testing, and replicability in science HST 953, Fall

    2019 Patrick Kimes, PhD Data Sciences, Dana-Farber Cancer Institute Biostatistics, Harvard TH Chan School of Public Health
  2. “… this is an important topic.” - Leo Anthony Celi

  3. part I. the crisis

  4. https://doi.org/10.1371/journal.pmed.0020124

  5. https://doi.org/10.1371/journal.pmed.0020124 “Simulations show that for most study designs and settings,

    it is more likely for a research claim to be false than true.”
  6. http://jtleek.com/talks.html

  7. http://jtleek.com/talks.html

  8. http://jtleek.com/talks.html

  9. http://jtleek.com/talks.html

  10. http://jtleek.com/talks.html

  11. http://jtleek.com/talks.html

  12. http://jtleek.com/talks.html

  13. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970

  14. what’s going on?

  15. let’s clarify some language reproducibility replicability

  16. let’s clarify some language reproducibility replicability the ability to take

    the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study https://simplystatistics.org/2016/08/24/replication-crisis/
  17. let’s clarify some language reproducibility replicability the ability to take

    the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study https://simplystatistics.org/2016/08/24/replication-crisis/
  18. let’s clarify some language reproducibility replicability the ability to repeat

    an entire study, independent of the original investigator without the use of original data https://simplystatistics.org/2016/08/24/replication-crisis/
  19. let’s clarify some language reproducibility replicability the ability to repeat

    an entire study, independent of the original investigator without the use of original data https://simplystatistics.org/2016/08/24/replication-crisis/
  20. let’s clarify some language reproducibility replicability

  21. REPLICABILITY

  22. what’s going on? REPLICABILITY

  23. REPLICABILITY crisis: experiments are replicated. results not so much.

  24. “… replications of 100 experimental and correlational studies …” https://www.ncbi.nlm.nih.gov/pubmed/26315443

  25. “39% of effects were subjectively rated to have replicated the

    original results.” https://www.ncbi.nlm.nih.gov/pubmed/26315443 “… replications of 100 experimental and correlational studies …”
  26. 39%?? https://www.ncbi.nlm.nih.gov/pubmed/26315443

  27. https://www.ncbi.nlm.nih.gov/pubmed/22460902 https://www.ncbi.nlm.nih.gov/pubmed/22460905 not just psychology

  28. https://www.ncbi.nlm.nih.gov/pubmed/22460902 https://www.ncbi.nlm.nih.gov/pubmed/22460905 drug sensitivity in cancer cell lines

  29. https://www.ncbi.nlm.nih.gov/pubmed/24284626

  30. https://www.ncbi.nlm.nih.gov/pubmed/24284626

  31. https://www.ncbi.nlm.nih.gov/pubmed/27905415

  32. REPLICABILITY crisis: experiments are replicated. results not so much.

  33. REPLICABILITY what does it mean for results to “replicate”?

  34. “No single indicator sufficiently describes replication success, and the five

    indicators examined here are not the only ways to evaluate reproducibility.” https://www.ncbi.nlm.nih.gov/pubmed/26315443
  35. “Upon careful analysis of the same data, we have come

    to quite different and much more positive conclusions.” https://www.ncbi.nlm.nih.gov/pubmed/27905415
  36. REPLICABILITY what does it mean for results to “replicate”? a

    lot of things
  37. REPLICABILITY what does it mean for results to “replicate”? a

    lot of things p-values
  38. part II. the p-value

  39. what do statisticians do?

  40. inference .. and some other things too what do statisticians

    do?
  41. inference •point estimation •interval estimation •hypothesis testing .. and some

    other things too what do statisticians do?
  42. •point estimation •interval estimation •hypothesis testing what do statisticians do?

  43. the statistical approach to testing hypotheses

  44. the statistical approach to testing hypotheses H0 : no difference

    between groups pose uninteresting baseline hypothesis
  45. the statistical approach to testing hypotheses H0 : no difference

    between groups pose uninteresting baseline hypothesis collect data
  46. the statistical approach to testing hypotheses pose uninteresting baseline hypothesis

    H0 : no difference between groups collect data how likely is this data if baseline were true?
  47. the statistical approach to testing hypotheses pose uninteresting baseline hypothesis

    H0 : no difference between groups collect data how likely is this data if baseline were true? p-value
  48. H0 : no difference between groups null hypothesis more formally

  49. H0 : no difference between groups (null hypothesis) H1 : difference

    between groups (alternative hypothesis) more formally …
  50. p-value H0 : no difference between groups H1 : difference

    between groups more formally …
  51. p-value H0 : no difference between groups H1 : difference

    between groups 0 difference between sampled groups H0
  52. p-value H0 : no difference between groups H1 : difference

    between groups 0 difference between sampled groups H0
  53. p-value H0 : no difference between groups H1 : difference

    between groups 0 difference between sampled groups H0 small p-value
  54. p-value H0 : no difference between groups H1 : difference

    between groups 0 difference between sampled groups H0 larger p-value
  55. H0 : no difference between groups H1 : difference between

    groups 0 H0 difference between sampled groups p-value (often use a 5% cutoff)
  56. H0 : no difference between groups H1 : difference between

    groups 0 H0 difference between sampled groups p-value (often use a 5% cutoff)
  57. H0 : no difference between groups H1 : difference between

    groups 0 H0 difference between sampled groups H1 p-value (often use a 5% cutoff)
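The testing recipe on these slides can be made concrete with a small simulation. Below is a minimal sketch (not from the talk) of a two-sided permutation test in plain Python: when the null hypothesis of no group difference is actually true, the resulting p-values are roughly uniform, so a 5% cutoff rejects in roughly 5% of experiments. The function name `perm_pvalue` and all parameter choices are illustrative.

```python
import random

random.seed(0)

def perm_pvalue(a, b, n_perm=200):
    """Two-sided permutation p-value for a difference in group means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# simulate many experiments in which H0 is true:
# both groups are drawn from the same distribution
n_experiments = 400
rejections = 0
for _ in range(n_experiments):
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    if perm_pvalue(a, b) < 0.05:
        rejections += 1

print(rejections / n_experiments)  # close to the 5% cutoff, by construction
```

The point of the sketch is the last line: even with no real effect anywhere, a 5% cutoff "finds" an effect about one time in twenty.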
  58. “If he was cited every time a p-value was reported

    his paper would have, at the very least, 3 million citations* …” https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/
  59. https://jamanetwork.com/journals/jama/fullarticle/2503172 p-values are everywhere

  60. and they’re significant https://jamanetwork.com/journals/jama/fullarticle/2503172

  61. https://simplystatistics.org/2017/07/26/announcing-the-tidypvals-package/

  62. REPLICABILITY what does it mean for results to “replicate”? a

    lot of things p-values
  63. https://www.nature.com/news/psychology-journal-bans-p-values-1.17001

  64. https://www.sciencenews.org/blog/context/p-value-ban-small-step-journal-giant-leap-science

  65. “The fact that many misinterpret the p-value is not the

    p-value’s fault.” https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/
  66. https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108 “… the scientific community could benefit from a formal

    statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.”
  67. https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

  68. https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/ “The fact that many misinterpret the p-value is

    not the p-value’s fault.”
  69. REPLICABILITY Common pitfalls p-values

  70. part III. multiple testing

  71. Bennett et al. (2010). Neural correlates of interspecies perspective taking

    in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  72. brain activity Bennett et al. (2010). Neural correlates of interspecies

    perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  73. dead salmon from market Bennett et al. (2010). Neural correlates

    of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  74. zombie fish? Bennett et al. (2010). Neural correlates of interspecies

    perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  75. hypothesis testing in salmon fMRI study H0 : no difference

    in signal at voxel 1 H1 : difference in signal at voxel 1
  76. hypothesis testing in salmon fMRI study H0 : no difference

    in signal at voxel 1 H1 : difference in signal at voxel 1 H0 : no difference in signal at voxel 2 H1 : difference in signal at voxel 2 H0 : no difference in signal at voxel 3 H1 : difference in signal at voxel 3 H0 : no difference in signal at voxel 4 H1 : difference in signal at voxel 4 H0 : no difference in signal at voxel 5 H1 : difference in signal at voxel 5 H0 : no difference in signal at voxel 6
  77. hypothesis testing in salmon fMRI study H0 : no difference

    in signal at voxel 1 H1 : difference in signal at voxel 1 H0 : no difference in signal at voxel 2 H1 : difference in signal at voxel 2 H0 : no difference in signal at voxel 3 H1 : difference in signal at voxel 3 H0 : no difference in signal at voxel 4 H1 : difference in signal at voxel 4 H0 : no difference in signal at voxel 5 H1 : difference in signal at voxel 5 H0 : no difference in signal at voxel 6 ~8,000 voxels considered
  78. the problem of multiple hypothesis testing 8,000 voxels

  79. the problem of multiple hypothesis testing 8,000 true null 0

    true differential 8,000 voxels
  80. the problem of multiple hypothesis testing 400 significant 7,600 not

    sig. 8,000 true null 0 true differential 8,000 voxels 5% p-value cutoff
  81. the problem of multiple hypothesis testing 400 significant 7,600 not

    sig. 0 significant 8,000 true null 0 true differential 8,000 voxels 5% p-value cutoff
  82. the problem of multiple hypothesis testing 400 significant 7,600 not

    sig. 8,000 true null 0 true differential 8,000 voxels 5% p-value cutoff 100% of our hits are false!
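The slide's arithmetic (8,000 truly null tests, 5% cutoff, about 400 spurious hits) can be checked with a quick simulation. This is an illustrative sketch, not code from the workshop; it relies only on the fact that null p-values are uniform on [0, 1].

```python
import random

random.seed(1)

n_tests = 8000   # e.g. voxels, all truly null
alpha = 0.05

# under the null, p-values are uniform on [0, 1]
pvals = [random.random() for _ in range(n_tests)]
significant = sum(p < alpha for p in pvals)

print(significant)  # about 8000 * 0.05 = 400 "hits", every one a false positive
```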
  83. “… random noise in the EPI timeseries may yield spurious

    results if multiple comparisons are not controlled for …” Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  84. 8,000 true null 0 true differential 8,000 voxels what can

    we do?
  85. Family-wise Error Rate (FWER) •Bonferroni correction P( at least 1

    false positive ) < ⍺ what can we do? 8,000 true null 0 true differential 8,000 voxels
  86. Family-wise Error Rate (FWER) •Bonferroni correction P( at least 1

    false positive ) < ⍺ (5 / 8,000)% p-value cutoff what can we do? 8,000 true null 0 true differential 8,000 voxels
  87. 19,000 true null 1,000 true differential 20,000 genes False Discovery

    Rate (FDR) •Benjamini-Hochberg (BH) procedure •Storey’s q-value E( # false positives / # total positives ) < ⍺ what can we do? 8,000 true null 0 true differential 8,000 voxels
  88. 19,000 true null 1,000 true differential 20,000 genes False Discovery

    Rate (FDR) •Benjamini-Hochberg (BH) procedure •Storey’s q-value E( # false positives / # total positives ) < ⍺ (some cutoff; estimate these quantities) what can we do? 8,000 true null 0 true differential 8,000 voxels
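As a rough illustration of the two correction strategies named above, here is a from-scratch sketch of Bonferroni (FWER) and the Benjamini-Hochberg step-up procedure (FDR). The workshop slides use library implementations instead; these hand-rolled versions are for intuition only, and the demo data at the bottom are synthetic.

```python
import random

def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i < alpha / m; controls FWER at level alpha."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k smallest p-values; controls FDR at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# synthetic mix: 95 null tests (uniform p-values) + 5 strong signals
random.seed(2)
pvals = [random.random() for _ in range(95)] + [1e-6] * 5

n_bonf = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonf, n_bh)
```

Both procedures recover the five strong signals here; in general BH rejects at least as many hypotheses as Bonferroni, trading stricter per-family error control for more discoveries.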
  89. “… controlling the false discovery rate (FDR) and familywise error

    rate (FWER) … indicated no active voxels …” Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  90. https://doi.org/10.1371/journal.pone.0124165

  91. https://doi.org/10.1371/journal.pone.0124165

  92. https://doi.org/10.1371/journal.pone.0124165

  93. WORKSHOP (in R)

  94. ## for FWER control (Bonferroni) p.adjust(my_pvals, method = "bonferroni") ##

    for FDR control (Benjamini-Hochberg) p.adjust(my_pvals, method = "BH") WORKSHOP (in R)
  95. WORKSHOP (in Python)

  96. # use statsmodels package import statsmodels.stats.multitest as mt ## for

    FWER control (Bonferroni) mt.multipletests(my_pvals, method = "bonferroni") ## for FDR control (Benjamini-Hochberg) mt.multipletests(my_pvals, method = "fdr_bh") WORKSHOP (in Python)
  97. WORKSHOP

  98. FWER correction, FDR correction, got it, done, finished, great

  99. FWER correction, FDR correction, got it, done, finished, great …

    almost
  100. https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412

  101. part IV. forking paths

  102. Goodhart’s law. When a measure becomes a target, it ceases

    to be a good measure.
  106. “… it’s easy to find a p < .05 comparison

    even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough …”
  107. specify hypothesis

  108. specify hypothesis

  109. specify hypothesis

  110. specify hypothesis

  111. specify hypothesis trying a different set of statistics

  112. specify hypothesis revisiting filtering of dataset

  113. specify hypothesis

  114. specify hypothesis “… a sort of invisible multiplicity: data-dependent analysis

    choices that did not appear to be degrees of freedom because researchers analyze only one dataset at a time.”
  115. Goodhart’s law. When a measure becomes a target, it ceases

    to be a good measure.
  116. “… it’s easy to find a p < .05 comparison

    even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough …”
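The "forking paths" effect described in the quotes above can be illustrated with a toy simulation: if a researcher tries several analysis choices on purely null data and reports only the best p-value, the effective false-positive rate climbs well above the nominal 5%. The model below (each choice yields an independent uniform p-value under the null) is a deliberate simplification for illustration, not a claim about any real study.

```python
import random

random.seed(3)

alpha = 0.05
n_datasets = 2000
n_choices = 5   # e.g. different filters, subgroups, test statistics

# toy model: under H0 each analysis choice yields an (approximately
# independent) uniform p-value; the researcher reports only the best one
false_positive = 0
for _ in range(n_datasets):
    best_p = min(random.random() for _ in range(n_choices))
    if best_p < alpha:
        false_positive += 1

rate = false_positive / n_datasets
print(rate)  # roughly 1 - 0.95**5, i.e. over 20%, far above the nominal 5%
```

No single analysis was "hacked" here; the inflation comes entirely from the unreported alternatives, which is exactly the invisible multiplicity the slide describes.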
  117. specify hypothesis moving forward? pre-registration external validation acceptance

  118. “While it is easy to lie with statistics, it is

    even easier to lie without them.” - Frederick Mosteller
  119. https://fivethirtyeight.com/features/science-isnt-broken/

  120. https://projects.fivethirtyeight.com/p-hacking/

  121. part V. onward/upward

  122. REPLICABILITY crisis: experiments are replicated. results not so much.

  123. REPLICABILITY should we expect scientific results to always replicate?

  124. https://simplystatistics.org/2016/08/24/replication-crisis/ “the replication crisis in science is largely attributable to

    a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields.”
  125. https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/ “…I argue that the rate of discoveries is higher

    in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.”
  126. REPLICABILITY should we expect scientific results to always replicate? not

    always
  127. REPLICABILITY crisis: experiments are replicated. results not so much. maybe

    that’s part of science?
  128. 39%??

  130. “… this is an important topic.” - Leo Anthony Celi

  133. Done right, reproducibility should not be a crisis for digital

    medicine, but rather one of its strengths. “As an embryonic discipline, digital medicine has the chance to inculcate among its practitioners a healthier set of attitudes towards replication.”
  134. p-values, multiple testing, and replicability in science HST 953, Fall

    2019 Patrick Kimes, PhD Data Sciences, Dana-Farber Cancer Institute Biostatistics, Harvard TH Chan School of Public Health