Why Statistics Matters in the Analysis of Genomics Data

Why Statistics Matters: Analysis of Genomics Data
Stephanie Hicks, @stephaniehicks, stephaniehicks.com

Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
- Hal Varian, chief economist at Google (2 min YouTube video)

http://www.forbes.com/sites/danwoods/2012/03/08/hilary-mason-what-is-a-data-scientist/
Hilary Mason (@hmason)

@jtleek

Preparing students: high school
http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
Data source: The College Board

http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf
Data from the National Center for Education Statistics; analysis by the ASA

Preparing students: undergraduate and graduate
http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
Data source: NCES Digest of Education Statistics
•  Graduated from high school
•  Completed two REUs in Biostatistics
•  Graduated from LSU (BS, Mathematics); started a PhD in Statistics

Rapid change in technology
Mardis (2011) Nature 470: 198-203
“Next-generation sequencing”, “Sanger sequencing”

Hayden (2014) Nature 507: 294-295

Questions:
(1) How to process & normalize?
(2) How to find differences between two or more groups?
(3) How to find interesting genomic regions?

Discuss these three questions and illustrate how statistics can help
Focus on DNA methylation data (but these challenges are common to other areas of genomics)
Morgan et al. (1999). Nature Genetics 23: 314-8
http://epigenome.eu/en/2,48,873
Bradbury (2003). PLoS Biology 1: e82

DNA Methylation
ATCGCGTTACTGCGGAA
TAGCGCAATGTCGCCTT
[Figure: methyl groups (m) marked at three positions on each strand]

http://www.learnnc.org/lp/pages/7828

Measuring DNA Methylation
Bock (2012). Nature Reviews Genetics 13: 705-719
Problem: Which CpGs are differentially methylated between two groups?
Some proposed statistical solutions: At each CpG, test whether there is a difference using e.g. a t-test, F-test or linear regression
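The per-CpG test can be sketched as follows. This is a minimal illustration, assuming Welch's unequal-variance form of the two-sample t-test (the slide only says "t-test"); the CpG names and methylation values are hypothetical. A p-value would then be read from the t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, variance


def welch_t(case, control):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(case), len(control)
    v1 = variance(case) / n1        # squared standard error, case group
    v2 = variance(control) / n2     # squared standard error, control group
    t = (mean(case) - mean(control)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical methylation levels (between 0 and 1): (case, control) per CpG.
cpgs = {
    "cpg_1": ([0.80, 0.75, 0.85, 0.78], [0.40, 0.45, 0.38, 0.42]),
    "cpg_2": ([0.50, 0.55, 0.48, 0.52], [0.51, 0.49, 0.53, 0.50]),
}

# One independent test per CpG.
t_stats = {name: welch_t(case, control) for name, (case, control) in cpgs.items()}
```

A large absolute t statistic (as for the hypothetical cpg_1) points to a difference between groups; a value near zero (cpg_2) does not.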

t-test for Differential Methylation
CpG #1: p-value = 0.034 (< 0.05), so the CpG is differentially methylated
[Figure: methylation level (0 to 1) by status, Case vs Control]

t-test for Differential Methylation
CpG #2: p-value = 0.343 (> 0.05), so the CpG is not differentially methylated
[Figure: methylation level (0 to 1) by status, Case vs Control]

600 CpGs, 36 samples (6 cell types)
Jaffe and Irizarry (2014). Genome Biol 15: R31.

What about neighboring CpGs?
Problem: If one CpG is methylated, would a nearby CpG also be methylated?
Some proposed solutions:
(1) Can we find runs of two or more differentially methylated CpGs?
•  If p-value < 0.05 for CpG #1, #2, #3, etc…
•  Caution: multiple testing
(2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
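Proposal (1) and its multiple-testing caution can be sketched together. This is an illustration only: it assumes a Bonferroni correction (the slides do not name a correction method), and the function name and p-values are hypothetical.

```python
def significant_runs(p_values, alpha=0.05, min_length=2):
    """Find runs of consecutive CpGs that stay significant after correction.

    Returns (start, end) index pairs, inclusive, for every run of at least
    `min_length` neighboring CpGs whose p-value is below alpha / m, where m
    is the number of tests (Bonferroni, assumed here for simplicity).
    """
    m = len(p_values)
    threshold = alpha / m
    runs, start = [], None
    for i, p in enumerate(p_values):
        if p < threshold:
            if start is None:
                start = i                      # a new run begins
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))    # close a long-enough run
            start = None
    if start is not None and m - start >= min_length:
        runs.append((start, m - 1))            # run reaching the last CpG
    return runs


# Hypothetical p-values for ten neighboring CpGs, in genomic order.
p_values = [0.20, 0.001, 0.0004, 0.002, 0.60, 0.003, 0.45, 0.0001, 0.0002, 0.9]
runs = significant_runs(p_values)
```

With ten tests the per-CpG threshold drops from 0.05 to 0.005, so an isolated significant CpG (index 5 here) is not reported as a run.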

Smoothing Across Genomic Regions
Jaffe et al. (2012) Int J Epidemiology

Locally Weighted Scatterplot Smoothing (Loess)
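A bare-bones version of loess, for intuition. This sketch assumes local linear fits with tricube weights and distinct x positions, and skips the robustness iterations of full implementations. The positions and per-CpG differences are hypothetical.

```python
def loess(x, y, frac=0.5):
    """Smooth y over x with locally weighted linear regression.

    For each point, a weighted least-squares line is fitted to its nearest
    neighbors (a fraction `frac` of the data) using tricube weights, and
    the fitted value at that point is returned. Assumes distinct x values.
    """
    n = len(x)
    k = min(n, max(3, round(frac * n)))       # neighbors used in each local fit
    smoothed = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1]              # bandwidth: k-th nearest distance
        weights = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        total = sum(weights)
        x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
        y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
        s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
        s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
                   for w, xi, yi in zip(weights, x, y))
        slope = s_xy / s_xx if s_xx > 0 else 0.0
        smoothed.append(y_bar + slope * (x0 - x_bar))
    return smoothed


# Hypothetical genomic positions and per-CpG methylation differences.
positions = [100, 150, 220, 300, 340, 410, 480, 520, 600, 690]
differences = [0.02, -0.01, 0.05, 0.30, 0.35, 0.28, 0.33, 0.04, -0.02, 0.01]
smoothed = loess(positions, differences, frac=0.5)
```

Because each smoothed value borrows strength from neighboring CpGs, a region of consistent differences stands out more clearly than in the noisy per-CpG values.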

Irizarry (2009). Nature Genetics 41: 178-185

Correlation ≠ Causation
http://xkcd.com/552/

Technical vs Biological Variation
•  Raw genomics data contains biases and unwanted technical variation
   – e.g. sequencing technology, batch effects
   – Can cause perceived differences between samples, irrespective of the biological variation
•  Changes in experimental conditions can be confused with biological variability
   – Can lead to false discoveries (e.g. finding DMRs)

Bladder Normal and Cancer Samples

How do you pre-process and normalize noisy genomics data?

Quantile Normalization
•  Most widely used multi-sample normalization
•  Originally developed for gene expression microarrays
•  Now applied to
   – Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & brain imaging
Can be very helpful in eliminating unwanted variation, e.g. “batch effects” (good), but has the potential to wash out true biological variation (bad)

How does it work?
Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array

Raw data
  2    4    4    5
  5   14    4    7
  4    8    6    9
  3    8    5    8
  3    9    3    5

Order values within each sample (or column)
  2    4    3    5
  3    8    4    5
  3    8    4    7
  4    9    5    8
  5   14    6    9

Average across rows and substitute value with average
3.5  3.5  3.5  3.5
5.0  5.0  5.0  5.0
5.5  5.5  5.5  5.5
6.5  6.5  6.5  6.5
8.5  8.5  8.5  8.5

Re-order averaged values in original order
3.5  3.5  5.0  5.0
8.5  8.5  5.5  5.5
6.5  5.0  8.5  8.5
5.0  5.5  6.5  6.5
5.5  6.5  3.5  3.5
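The three steps can be written out in a few lines, using the raw data from the slide (rows are features, columns are samples). One assumption: tied values within a sample, such as the two 5s in the last column, are ranked in order of appearance. Other tie rules assign the tied entries differently, so those entries may not match the worked example.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows.

    Rows are features and columns are samples. Ties within a column are
    ranked in order of appearance (an assumption of this sketch).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: order values within each sample (column).
    sorted_cols = [sorted(col) for col in columns]

    # Step 2: average across rows of the sorted matrix, one mean per rank.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols
                  for r in range(n_rows)]

    # Step 3: put each rank's mean back at the position the value came from.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable sort
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


raw = [
    [2, 4, 4, 5],
    [5, 14, 4, 7],
    [4, 8, 6, 9],
    [3, 8, 5, 8],
    [3, 9, 3, 5],
]
normalized = quantile_normalize(raw)
```

After normalization every column holds exactly the same set of values (3.5, 5.0, 5.5, 6.5, 8.5), so all samples share one distribution. That is the property that removes batch effects, and also the one that can wash out global biological differences.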

Gene expression of brain and liver tissues
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

Back to example: 6 purified cell types
[Figure: density of beta values, DNA Methylation (450K arrays), for Bcell, CD4T, CD8T, Gran, Mono, NK]

Back to motivating example
Should we use quantile normalization? Will we remove important biological variation?
[Figure: beta values (0 to 1), DNA Methylation (450K arrays)]

Back to motivating example (quantile normalized)
Should we use quantile normalization? Will we remove important biological variation?
[Figure: beta values (0 to 1), DNA Methylation (450K arrays)]

quantro: Test for global changes between groups
quantro
•  R/Bioconductor package to test the assumptions of quantile normalization
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
Main idea:
•  Compare variability within groups to variability between groups
•  If variability between groups > variability within groups, then there may be global changes across groups
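The main idea can be illustrated with a one-way ANOVA F statistic on per-sample medians. This is a sketch of the between-versus-within comparison only, not the quantro package's API or its exact test. The function name, group names and values are hypothetical.

```python
from statistics import mean, median


def between_within_f(groups):
    """One-way ANOVA F statistic computed on per-sample medians.

    `groups` maps a group name to a list of samples, each sample being a
    list of measurements. A large F means the sample medians vary more
    between groups than within them.
    """
    medians = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_medians = [m for ms in medians.values() for m in ms]
    grand = mean(all_medians)
    k, n = len(medians), len(all_medians)
    ss_between = sum(len(ms) * (mean(ms) - grand) ** 2 for ms in medians.values())
    ss_within = sum((m - mean(ms)) ** 2 for ms in medians.values() for m in ms)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log2 intensities: two tissues whose distributions are shifted.
shifted = {
    "brain": [[7.9, 8.0, 8.3], [8.0, 8.1, 8.2], [7.8, 8.2, 8.4]],
    "liver": [[9.8, 10.0, 10.1], [9.9, 10.2, 10.3], [10.0, 10.1, 10.5]],
}
# Hypothetical groups with nearly identical distributions.
similar = {
    "a": [[7.9, 8.0, 8.3], [8.0, 8.1, 8.2], [7.8, 8.2, 8.4]],
    "b": [[8.0, 8.1, 8.2], [8.1, 8.2, 8.4], [8.0, 8.3, 8.5]],
}
f_shifted = between_within_f(shifted)
f_similar = between_within_f(similar)
```

A large F, as for the shifted groups, suggests global changes across groups, in which case quantile normalization may not be appropriate.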

Targeted vs Global changes
[Figure: three density panels. A: rlogTransformation counts for GG (n=18), AG (n=32), AA (n=15). B: log2 PM values for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15). C: log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

A: Targeted changes
•  Observed variation: small variability within groups, small variability across groups
•  Reason? Small technical variability; no global changes
•  What to do? Use quantile normalization (but not necessary)

B: Targeted changes
•  Observed variation: large variability within groups, small variability across groups
•  Reason? Large technical variability or batch effects within groups; no global changes
•  What to do? Use quantile normalization

C: Global changes
•  Observed variation: small variability within groups, large variability across groups
•  Reason? Either global technical variability or batch effects across groups, or global biological variability across groups. Raw data alone cannot detect the difference
•  What to do? Use quantile normalization if the variation is technical; do not use quantile normalization if it is biological

quantro will detect global differences due to both technical and biological variation

Final thoughts
•  Statistics matters in the analysis of any data!
•  Statistics can help identify relevant biological variation in genomics data
   – Differences in CpGs
   – Smoothing across genomic regions
•  Statistics can help eliminate unwanted technical variation in genomics data
   – “Batch effects”

Acknowledgements
Rafael Irizarry
Funding: NIH R01 grants GM083084 and RR021967/GM103552.

Feel free to send comments/questions:
Twitter: @stephaniehicks
Email: [email protected]
Questions?
[Figure labels: Normal distribution, Weibull distribution, Poisson distribution]
