p-values, multiple testing, and replicability in science HST 953, Fall 2019 Patrick Kimes, PhD Data Sciences, Dana-Farber Cancer Institute Biostatistics, Harvard TH Chan School of Public Health
https://doi.org/10.1371/journal.pmed.0020124 “Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.”
let’s clarify some language
reproducibility: the ability to take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study
replicability: the ability to repeat an entire study, independent of the original investigator without the use of original data
https://simplystatistics.org/2016/08/24/replication-crisis/
“39% of effects were subjectively rated to have replicated the original results.” https://www.ncbi.nlm.nih.gov/pubmed/26315443 “… replications of 100 experimental and correlational studies …”
“No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility.” https://www.ncbi.nlm.nih.gov/pubmed/26315443
“Upon careful analysis of the same data, we have come to quite different and much more positive conclusions.” https://www.ncbi.nlm.nih.gov/pubmed/27905415
the statistical approach to testing hypotheses
1. pose uninteresting baseline hypothesis, H0: no difference between groups
2. collect data
3. how likely is this data if baseline were true? (the p-value)
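One way to make "how likely is this data if baseline were true?" concrete is a permutation test. The sketch below is only an illustration and not a method from the slides: if the baseline of "no difference between groups" holds, the group labels are interchangeable, so it reshuffles them many times and counts how often the shuffled difference in means is at least as large as the observed one. The function name `permutation_pvalue` and the measurements are made up for the example.

```python
import random


def permutation_pvalue(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    for _ in range(n_perm):
        # Under the baseline, labels are exchangeable: reshuffle them.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 4.8, 6.2, 5.9, 5.5]
control = [4.9, 5.0, 5.3, 4.7, 5.2]
print(permutation_pvalue(treated, control))
```

A small value says the observed difference would be unusual if the baseline were true. It does not give the probability that the baseline is true.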
“If he was cited every time a p-value was reported his paper would have, at the very least, 3 million citations* …” https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/
“The fact that many misinterpret the p-value is not the p-value’s fault.” https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/
https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108 “… the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.”
Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
brain activity
dead salmon from market
zombie fish?
hypothesis testing in salmon fMRI study
H0: no difference in signal at voxel 1 / H1: difference in signal at voxel 1
H0: no difference in signal at voxel 2 / H1: difference in signal at voxel 2
H0: no difference in signal at voxel 3 / H1: difference in signal at voxel 3
H0: no difference in signal at voxel 4 / H1: difference in signal at voxel 4
H0: no difference in signal at voxel 5 / H1: difference in signal at voxel 5
H0: no difference in signal at voxel 6
~8,000 voxels considered
the problem of multiple hypothesis testing
8,000 voxels: 8,000 true null, 0 true differential
5% p-value cutoff
true null: 400 significant (5% of 8,000, on average), 7,600 not sig.
true differential: 0 significant
400 significant in total, and 100% of our hits are false!
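The count of 400 false hits can be checked with a small simulation. This is a sketch and assumes each test yields a valid p-value, which is uniform on [0, 1] when its null hypothesis is true; the function name `count_false_hits` and the seed are arbitrary choices for the example.

```python
import random


def count_false_hits(n_tests=8000, alpha=0.05, seed=953):
    """Count p-values below alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under a true null, a valid p-value is uniform on [0, 1].
    pvals = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in pvals)


n_hits = count_false_hits()
print(f"significant at the 5% cutoff: {n_hits} of 8000")
print(f"expected by chance alone: {8000 * 0.05:.0f}")
```

Every one of these hits is a false positive by construction, since no voxel in the simulation carries a real signal.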
“… random noise in the EPI timeseries may yield spurious results if multiple comparisons are not controlled for …” Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
“… controlling the false discovery rate (FDR) and familywise error rate (FWER) … indicated no active voxels …” Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
WORKSHOP (in R)
## for FWER control (Bonferroni)
p.adjust(my_pvals, method = "bonferroni")
## for FDR control (Benjamini-Hochberg)
p.adjust(my_pvals, method = "BH")
WORKSHOP (in Python)
# use statsmodels package
import statsmodels.stats.multitest as mt
## for FWER control (Bonferroni)
mt.multipletests(my_pvals, method="bonferroni")
## for FDR control (Benjamini-Hochberg)
mt.multipletests(my_pvals, method="fdr_bh")
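To see what these two adjustments compute, here is a pure-Python sketch of the textbook definitions. It is not the R or statsmodels source, and the function names and the four example p-values are made up for illustration. Bonferroni multiplies each p-value by the number of tests and caps the result at 1. Benjamini-Hochberg scales the p-value of rank k (smallest first) by m / k and then makes the adjusted values non-decreasing in the original p-value.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values (FWER control)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # is no larger than the one belonging to the next-larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, invented purely for illustration.
my_pvals = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in bonferroni(my_pvals)])
print([round(p, 4) for p in benjamini_hochberg(my_pvals)])
```

Comparing the adjusted values against the chosen cutoff (for example 0.05) then decides which tests are reported as hits.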
“… it’s easy to find a p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough …”
specify hypothesis “… a sort of invisible multiplicity: data-dependent analysis choices that did not appear to be degrees of freedom because researchers analyze only one dataset at a time.”
https://simplystatistics.org/2016/08/24/replication-crisis/ “the replication crisis in science is largely attributable to a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields.”
https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/ “…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.”
Done right, reproducibility should not be a crisis for digital medicine, but rather one of its strengths. “As an embryonic discipline, digital medicine has the chance to inculcate among its practitioners a healthier set of attitudes towards replication.”