hst953-pvalues

p-values, multiple testing, and replicability in science
HST 953, Fall 2019
Patrick Kimes, PhD
Data Sciences, Dana-Farber Cancer Institute
Biostatistics, Harvard TH Chan School of Public Health

“… this is an important topic.” - Leo Anthony Celi

part I. the crisis

“Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.”
https://doi.org/10.1371/journal.pmed.0020124

http://jtleek.com/talks.html

https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970

what’s going on?

let’s clarify some language: reproducibility vs. replicability

reproducibility: the ability to take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study
https://simplystatistics.org/2016/08/24/replication-crisis/

replicability: the ability to repeat an entire study, independent of the original investigator without the use of original data
https://simplystatistics.org/2016/08/24/replication-crisis/

REPLICABILITY

what’s going on? REPLICABILITY

REPLICABILITY crisis: experiments are replicated. results not so much.

“… replications of 100 experimental and correlational studies …”
“39% of effects were subjectively rated to have replicated the original results.”
https://www.ncbi.nlm.nih.gov/pubmed/26315443

39%??

not just psychology: drug sensitivity in cancer cell lines
https://www.ncbi.nlm.nih.gov/pubmed/22460902
https://www.ncbi.nlm.nih.gov/pubmed/22460905

https://www.ncbi.nlm.nih.gov/pubmed/24284626

https://www.ncbi.nlm.nih.gov/pubmed/27905415

REPLICABILITY what does it mean for results to “replicate”?

“No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility.”
https://www.ncbi.nlm.nih.gov/pubmed/26315443

“Upon careful analysis of the same data, we have come to quite different and much more positive conclusions.”
https://www.ncbi.nlm.nih.gov/pubmed/27905415

REPLICABILITY what does it mean for results to “replicate”? a lot of things

a lot of things, p-values included

part II. the p-value

what do statisticians do?
inference
•point estimation
•interval estimation
•hypothesis testing
.. and some other things too

the statistical approach to testing hypotheses
•pose uninteresting baseline hypothesis (H0 : no difference between groups)
•collect data
•how likely is data at least this extreme if baseline were true? p-value

more formally …
H0 : no difference between groups (null hypothesis)
H1 : difference between groups (alternative hypothesis)
the p-value is computed assuming H0 is true

[figure: distribution of the difference between sampled groups under H0, centered at 0 difference between groups; an observed difference far out in the tail gives a small p-value, one near 0 gives a larger p-value; the final panel adds the distribution under H1]
p-value (often use a 5% cutoff)
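One way to make "how likely is this data if H0 were true?" concrete is a permutation test. The sketch below is only an illustration, not the method used in the slides: if there is truly no difference between groups, the group labels are arbitrary, so shuffling them shows what differences arise by chance. The function name and the measurements are hypothetical.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels carry no information, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements, made up for illustration.
group_a = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5]
group_b = [4.2, 4.8, 4.4, 4.9, 4.1, 4.6]
p = permutation_p_value(group_a, group_b)
print(f"p-value: {p:.4f}")
```

The returned value is the fraction of shuffles that produce a difference at least as large as the observed one, which is what the shaded tail in the figure represents.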

“If he was cited every time a p-value was reported his paper would have, at the very least, 3 million citations* …”
https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/

p-values are everywhere … and they’re significant
https://jamanetwork.com/journals/jama/fullarticle/2503172

https://simplystatistics.org/2017/07/26/announcing-the-tidypvals-package/

a lot of things, p-values included

https://www.nature.com/news/psychology-journal-bans-p-values-1.17001

https://www.sciencenews.org/blog/context/p-value-ban-small-step-journal-giant-leap-science

“The fact that many misinterpret the p-value is not the p-value’s fault.”
https://simplystatistics.org/2012/01/06/p-values-and-hypothesis-testing-get-a-bad-rap-but-we/

“… the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.”
https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108

https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

REPLICABILITY Common pitfalls p-values

part III. multiple testing

Bennett et al. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
brain activity
dead salmon from market
zombie fish?

hypothesis testing in salmon fMRI study
H0 : no difference in signal at voxel 1
H1 : difference in signal at voxel 1
H0 : no difference in signal at voxel 2
H1 : difference in signal at voxel 2
H0 : no difference in signal at voxel 3
H1 : difference in signal at voxel 3
H0 : no difference in signal at voxel 4
H1 : difference in signal at voxel 4
H0 : no difference in signal at voxel 5
H1 : difference in signal at voxel 5
H0 : no difference in signal at voxel 6
…
~8,000 voxels considered

the problem of multiple hypothesis testing
8,000 voxels: 8,000 true null, 0 true differential
5% p-value cutoff: about 400 significant (5% of 8,000), about 7,600 not sig.
all 400 significant voxels come from the 8,000 true nulls; 0 come from true differentials
100% of our hits are false!
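The 400 figure can be checked with a quick simulation. This is a minimal sketch, assuming each null voxel's p-value is uniform on (0, 1), which holds for a well-calibrated test of a true null. The function name is hypothetical.

```python
import random

def count_false_positives(n_tests=8_000, cutoff=0.05, seed=1):
    """Count tests called significant when every null hypothesis is true."""
    rng = random.Random(seed)
    # Assumption: under a true null, each p-value is Uniform(0, 1).
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < cutoff for p in p_values)

hits = count_false_positives()
print(f"{hits} of 8,000 null voxels pass the 5% cutoff (expected about 400)")
```

Changing the seed moves the count around 400 by a few dozen, but it never gets close to zero: with no correction, a dead salmon will always show some "active" voxels.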

“… random noise in the EPI timeseries may yield spurious results if multiple comparisons are not controlled for …”
Bennett et al. (2010)

what can we do?
8,000 voxels: 8,000 true null, 0 true differential

Family-wise Error Rate (FWER)
P( at least 1 false positive ) < α
•Bonferroni correction: (5 / 8,000)% p-value cutoff
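The Bonferroni arithmetic on this slide can be sketched in a few lines. The helper names are hypothetical; the adjustment itself (multiply each p-value by the number of tests, cap at 1) is the standard Bonferroni rule.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Equivalent view: scale each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

cutoff = bonferroni_cutoff(0.05, 8_000)
print(f"per-test cutoff: {cutoff:.2e}")  # 6.25e-06, i.e. (5 / 8,000)%
```

Comparing raw p-values to the smaller cutoff and comparing adjusted p-values to the original 5% give the same set of hits.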

what can we do?
20,000 genes: 19,000 true null, 1,000 true differential

False Discovery Rate (FDR)
E( # false positives / # total positives ) < α
α: some cutoff · # false positives / # total positives: estimate these
•Benjamini-Hochberg (BH) procedure
•Storey’s q-value
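For readers who want to see what the BH procedure does under the hood, here is a minimal pure-Python sketch of BH-adjusted p-values. It is an illustration, not a replacement for a tested library routine. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values, smallest p first.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, made up for illustration.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adjusted = benjamini_hochberg(p_values)
discoveries = [p for p, q in zip(p_values, adjusted) if q <= 0.05]
print(f"{len(discoveries)} discoveries at FDR 5%")
```

Each p-value is scaled by n / rank rather than by n, so BH is less strict than Bonferroni and typically keeps more true signals when many hypotheses are non-null.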

“… controlling the false discovery rate (FDR) and familywise error rate (FWER) … indicated no active voxels …”
Bennett et al. (2010)

https://doi.org/10.1371/journal.pone.0124165

WORKSHOP (in R)

## my_pvals: a numeric vector of raw p-values
## for FWER control (Bonferroni)
p.adjust(my_pvals, method = "bonferroni")
## for FDR control (Benjamini-Hochberg)
p.adjust(my_pvals, method = "BH")

WORKSHOP (in Python)

# use statsmodels package
import statsmodels.stats.multitest as mt
# my_pvals: a list or array of raw p-values
## for FWER control (Bonferroni)
mt.multipletests(my_pvals, method="bonferroni")
## for FDR control (Benjamini-Hochberg)
mt.multipletests(my_pvals, method="fdr_bh")

WORKSHOP

FWER correction, FDR correction, got it, done, finished, great … almost

https://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412

part IV. forking paths

Goodhart’s law. When a measure becomes a target, it ceases to be a good measure.

“… it’s easy to ﬁnd a p < .05 comparison
even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough …”

specify hypothesis
•trying a different set of statistics
•revisiting filtering of dataset

“… a sort of invisible multiplicity: data-dependent analysis choices that did not appear to be degrees of freedom because researchers analyze only one dataset at a time.”
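A small simulation can show how much this invisible multiplicity matters. The sketch below is a crude stand-in for forking paths, under stated assumptions: there is no true effect, the noise has known unit variance, and the analyst tries a planned comparison, two subgroup analyses, and one outlier filter, then reports whichever p-value is smallest. All function names and analysis choices are hypothetical.

```python
import math
import random

def z_test_p_value(a, b):
    """Two-sided z-test p-value for a difference in means, assuming unit variance."""
    diff = sum(a) / len(a) - sum(b) / len(b)
    se = math.sqrt(1 / len(a) + 1 / len(b))
    return math.erfc(abs(diff / se) / math.sqrt(2))

def trimmed(x):
    """Drop the smallest and largest value (a hypothetical outlier filter)."""
    return sorted(x)[1:-1]

def one_study(rng, n=20):
    """Simulate one null study; return (planned p-value, smallest p over all forks)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    planned = z_test_p_value(a, b)
    forks = [
        planned,
        z_test_p_value(a[: n // 2], b[: n // 2]),  # first subgroup
        z_test_p_value(a[n // 2 :], b[n // 2 :]),  # second subgroup
        z_test_p_value(trimmed(a), trimmed(b)),    # after filtering
    ]
    return planned, min(forks)

def false_positive_rates(n_studies=5_000, cutoff=0.05, seed=2):
    """Fraction of null studies that reach p < cutoff, planned vs. best fork."""
    rng = random.Random(seed)
    results = [one_study(rng) for _ in range(n_studies)]
    planned_rate = sum(p < cutoff for p, _ in results) / n_studies
    forked_rate = sum(p < cutoff for _, p in results) / n_studies
    return planned_rate, forked_rate

planned_rate, forked_rate = false_positive_rates()
print(f"planned analysis only: {planned_rate:.3f}")
print(f"best of four analyses: {forked_rate:.3f}")
```

The planned analysis stays near the nominal 5%, while picking the best of only four reasonable-looking analyses pushes the false positive rate well above it, even though each study reports a single p-value.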

specify hypothesis
moving forward?
•pre-registration
•external validation
•acceptance

“While it is easy to lie with statistics, it is even easier to lie without them.” - Frederick Mosteller

https://fivethirtyeight.com/features/science-isnt-broken/

https://projects.fivethirtyeight.com/p-hacking/

part V. onward/upward

REPLICABILITY should we expect scientific results to always replicate?

“the replication crisis in science is largely attributable to a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields.”
https://simplystatistics.org/2016/08/24/replication-crisis/

“…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.”
https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/

REPLICABILITY should we expect scientific results to always replicate? not always

REPLICABILITY crisis: experiments are replicated. results not so much. maybe that’s part of science?

“… this is an important topic.” - Leo Anthony Celi

Done right, reproducibility should not be a crisis for digital medicine, but rather one of its strengths.
“As an embryonic discipline, digital medicine has the chance to inculcate among its practitioners a healthier set of attitudes towards replication.”
