
hst953-pvalues

Patrick Kimes
November 08, 2019


MIT, HST.953: Collaborative Data Science in Medicine
p-values, multiple testing, and replicability in science
2019-11-08



Transcript

  1. p-values, multiple testing, and replicability in science HST 953, Fall

    2019 Patrick Kimes, PhD Data Sciences, Dana-Farber Cancer Institute Biostatistics, Harvard TH Chan School of Public Health
  2. let’s clarify some language: reproducibility vs. replicability. reproducibility: the ability to take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study. https://simplystatistics.org/2016/08/24/replication-crisis/
  4. let’s clarify some language: reproducibility vs. replicability. replicability: the ability to repeat an entire study, independent of the original investigator, without the use of the original data. https://simplystatistics.org/2016/08/24/replication-crisis/
  6. “… replications of 100 experimental and correlational studies …”: “39% of effects were subjectively rated to have replicated the original results.” https://www.ncbi.nlm.nih.gov/pubmed/26315443
  7. “No single indicator sufficiently describes replication success, and the five

    indicators examined here are not the only ways to evaluate reproducibility.” https://www.ncbi.nlm.nih.gov/pubmed/26315443
  8. “Upon careful analysis of the same data, we have come

    to quite different and much more positive conclusions.” https://www.ncbi.nlm.nih.gov/pubmed/27905415
  9. the statistical approach to testing hypotheses: pose an uninteresting baseline hypothesis (H0 : no difference between groups)
  10. the statistical approach to testing hypotheses: pose an uninteresting baseline hypothesis (H0 : no difference between groups); collect data
  11. the statistical approach to testing hypotheses: pose an uninteresting baseline hypothesis (H0 : no difference between groups); collect data; how likely is this data if the baseline were true?
  12. the statistical approach to testing hypotheses: pose an uninteresting baseline hypothesis (H0 : no difference between groups); collect data; how likely is this data if the baseline were true? the answer is the p-value
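The recipe on these slides (pose a null, collect data, ask how likely data this extreme would be under the null) can be sketched as a permutation test. A minimal sketch in Python; the measurements and seed are made up for illustration:

```python
import random

def permutation_p_value(a, b, n_perm=2000, seed=1):
    """Two-sided permutation test for a difference in group means.

    Under H0 (no difference between groups), group labels are
    exchangeable, so we shuffle them repeatedly and count how often
    the shuffled difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            hits += 1
    # add-one correction so the estimated p-value is never exactly 0
    return (hits + 1) / (n_perm + 1)

# toy measurements for two groups (hypothetical numbers)
group1 = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
group2 = [5.2, 5.0, 4.7, 5.1, 4.8, 5.4]
p = permutation_p_value(group1, group2)
```

The returned p is exactly the quantity the slide describes: the proportion of null-consistent shuffles producing data as extreme as what was observed.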
  13. more formally … H0 : no difference between groups (null hypothesis); H1 : difference between groups (alternative hypothesis)
  14. more formally … the p-value. H0 : no difference between groups; H1 : difference between groups
  15. H0 : no difference between groups; H1 : difference between groups. [figure: distribution of the difference between sampled groups under H0, centered at 0]
  17. [figure: distribution of the difference between sampled groups under H0; an observed difference far in the tail gives a small p-value]
  18. [figure: distribution of the difference between sampled groups under H0; an observed difference near 0 gives a larger p-value]
  19. [figure: distribution of the difference between sampled groups under H0, with the p-value compared to a cutoff (often use a 5% cutoff)]
  21. [figure: distribution of the difference between sampled groups under H0, with the region beyond the cutoff labeled H1 (often use a 5% cutoff)]
  22. “If he was cited every time a p-value was reported

    his paper would have, at the very least, 3 million citations* …” https://simplystatistics.org/2012/01/06/p-values- and-hypothesis-testing-get-a-bad-rap-but-we/
  23. “The fact that many misinterpret the p-value is not the

    p-value’s fault.” https://simplystatistics.org/2012/01/06/p-values- and-hypothesis-testing-get-a-bad-rap-but-we/
  24. https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108 “… the scientific community could benefit from a formal

    statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value.”
  25. Bennett et al. (2010). Neural correlates of interspecies perspective taking

    in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.
  26. brain activity (Bennett et al. 2010)
  27. dead salmon from market (Bennett et al. 2010)
  28. zombie fish? (Bennett et al. 2010)
  29. hypothesis testing in salmon fMRI study: H0 : no difference in signal at voxel 1; H1 : difference in signal at voxel 1
  31. hypothesis testing in salmon fMRI study:
    H0 : no difference in signal at voxel 1; H1 : difference in signal at voxel 1
    H0 : no difference in signal at voxel 2; H1 : difference in signal at voxel 2
    H0 : no difference in signal at voxel 3; H1 : difference in signal at voxel 3
    H0 : no difference in signal at voxel 4; H1 : difference in signal at voxel 4
    H0 : no difference in signal at voxel 5; H1 : difference in signal at voxel 5
    H0 : no difference in signal at voxel 6; …
    ~8,000 voxels considered
  32. the problem of multiple hypothesis testing: 8,000 voxels (8,000 true null, 0 true differential); at a 5% p-value cutoff, 400 significant, 7,600 not significant
  33. the problem of multiple hypothesis testing: 8,000 voxels (8,000 true null, 0 true differential); at a 5% p-value cutoff, 400 of the true nulls are called significant and 7,600 not significant; 0 of the true differentials are significant
  34. the problem of multiple hypothesis testing: 8,000 voxels (8,000 true null, 0 true differential); at a 5% p-value cutoff, 400 significant, 7,600 not significant: 100% of our hits are false!
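The arithmetic behind this slide is easy to reproduce: under a true null the p-value is uniformly distributed on [0, 1], so testing 8,000 truly null voxels at a 5% cutoff yields about 8,000 × 0.05 = 400 false positives. A minimal simulation (the seed is illustrative):

```python
import random

rng = random.Random(0)
n_voxels = 8000                 # all truly null, as in the salmon study
alpha = 0.05
# under H0 each p-value is Uniform(0, 1), so simulate them directly
pvals = [rng.random() for _ in range(n_voxels)]
false_positives = sum(p < alpha for p in pvals)
# roughly 400 voxels come out "significant",
# and every single one of them is a false positive
```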
  35. “… random noise in the EPI timeseries may yield spurious results if multiple comparisons are not controlled for …” (Bennett et al. 2010)
  36. what can we do? (8,000 voxels: 8,000 true null, 0 true differential) Family-wise Error Rate (FWER): control P( at least 1 false positive ) < ⍺ • Bonferroni correction
  37. what can we do? (8,000 voxels: 8,000 true null, 0 true differential) Family-wise Error Rate (FWER): control P( at least 1 false positive ) < ⍺ • Bonferroni correction: use a (5 / 8,000)% p-value cutoff
  38. what can we do? False Discovery Rate (FDR): control E( # false positives / # total positives ) < ⍺ • Benjamini-Hochberg (BH) procedure • Storey’s q-value (scenarios: 8,000 voxels with 8,000 true null, 0 true differential; 20,000 genes with 19,000 true null, 1,000 true differential)
  39. what can we do? False Discovery Rate (FDR): control E( # false positives / # total positives ) < ⍺, where ⍺ is some cutoff and the counts are estimated from the data • Benjamini-Hochberg (BH) procedure • Storey’s q-value (scenarios: 8,000 voxels with 8,000 true null, 0 true differential; 20,000 genes with 19,000 true null, 1,000 true differential)
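The BH procedure named on this slide can be written out in a few lines: sort the p-values, scale the i-th smallest by m/i, then enforce monotonicity from the largest down. A minimal sketch, not a replacement for R's p.adjust or statsmodels, with made-up input p-values:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for FDR control."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# toy p-values; call a test a discovery if its adjusted value is < alpha
adj = bh_adjust([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])
discoveries = [a for a in adj if a < 0.05]
```

With this input the two smallest p-values survive at ⍺ = 0.05; the cluster around 0.04 does not, because at those ranks the m/rank scaling pushes them above the cutoff.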
  40. “… controlling the false discovery rate (FDR) and familywise error rate (FWER) … indicated no active voxels …” (Bennett et al. 2010)
  41. WORKSHOP (in R)
    ## for FWER control (Bonferroni)
    p.adjust(my_pvals, method = "bonferroni")
    ## for FDR control (Benjamini-Hochberg)
    p.adjust(my_pvals, method = "BH")
  42. WORKSHOP (in Python)
    # use statsmodels package
    import statsmodels.stats.multitest as mt
    ## for FWER control (Bonferroni)
    mt.multipletests(my_pvals, method = "bonferroni")
    ## for FDR control (Benjamini-Hochberg)
    mt.multipletests(my_pvals, method = "fdr_bh")
  43. “… it’s easy to find a p < .05 comparison

    even if nothing is going on, if you look hard enough—and good scientists are skilled at looking hard enough …”
  44. [diagram label: specify hypothesis] “… a sort of invisible multiplicity: data-dependent analysis choices that did not appear to be degrees of freedom because researchers analyze only one dataset at a time.”
  46. “While it is easy to lie with statistics, it is

    even easier to lie without them.” - Frederick Mosteller
  47. https://simplystatistics.org/2016/08/24/replication-crisis/ “the replication crisis in science is largely attributable to

    a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields.”
  48. https://simplystatistics.org/2013/08/01/the-roc-curves-of-science/ “…I argue that the rate of discoveries is higher

    in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.”
  49. Done right, reproducibility should not be a crisis for digital

    medicine, but rather one of its strengths. “As an embryonic discipline, digital medicine has the chance to inculcate among its practitioners a healthier set of attitudes towards replication.”