Why Statistics Matters in the Analysis of Genomics Data

Slide 1

Slide 1 text

Why Statistics Matters: Analysis of Genomics Data. Stephanie Hicks, @stephaniehicks, stephaniehicks.com

Slide 2

Slide 2 text

Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” - Hal Varian, chief economist at Google (2 min YouTube video)

Slide 3

Slide 3 text

http://www.forbes.com/sites/danwoods/2012/03/08/hilary-mason-what-is-a-data-scientist/ Hilary Mason, @hmason

Slide 4

Slide 4 text

@jtleek

Slide 5

Slide 5 text

Preparing students: high school. http://magazine.amstat.org/blog/2013/05/01/stats-degrees/ Data source: The College Board

Slide 6

Slide 6 text

http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf Data from the National Center for Education Statistics; analysis by the ASA

Slide 7

Slide 7 text

Preparing students: undergraduate and graduate. http://magazine.amstat.org/blog/2013/05/01/stats-degrees/ Data source: NCES Digest of Education Statistics. Annotations: Graduated from high school; Completed two REUs in Biostatistics; Graduated from LSU (BS, Mathematics); Started a PhD in Statistics

Slide 8

Slide 8 text

Rapid change in technology: “Sanger sequencing” to “Next-generation sequencing”. Mardis (2011) Nature 470: 198-203

Slide 9

Slide 9 text

Hayden (2014) Nature 507: 294-295

Slide 10

Slide 10 text

Questions: (1) How to process & normalize? (2) How to find differences between two or more groups? (3) How to find interesting genomic regions?

Slide 11

Slide 11 text

Discuss these three questions and illustrate how statistics can help. Focus on DNA methylation data (but these challenges are common to other areas of genomics). Morgan et al. (1999). Nature Genetics 23: 314-8. http://epigenome.eu/en/2,48,873 Bradbury (2003). PLoS Biology 1: e82

Slide 12

Slide 12 text

DNA Methylation. [Figure: two DNA strands, ATCGCGTTACTGCGGAA and TAGCGCAATGTCGCCTT, with six positions marked "m" (methylated).]

Slide 13

Slide 13 text

http://www.learnnc.org/lp/pages/7828

Slide 14

Slide 14 text

Measuring DNA Methylation. Bock (2012). Nature Reviews Genetics 13, 705-719. Problem: Which CpGs are differentially methylated between two groups? Some proposed statistical solutions: At each CpG, test if there is a difference using e.g. a t-test, F-test or linear regression
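A minimal sketch of the "test at each CpG" idea, under stated assumptions. The methylation values below are hypothetical, not data from the talk. The test statistic is the two-sample (Welch) t statistic, but because the Python standard library has no t distribution, the p-value here comes from an exact permutation test of that statistic rather than from the parametric t-test the slide names.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic allowing unequal variances (Welch)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0:
        return 0.0 if mean(x) == mean(y) else float("inf")
    return (mean(x) - mean(y)) / se


def permutation_p_value(case, control):
    """Two-sided exact permutation p-value for |t| (swapped in for the t-test p-value)."""
    pooled = case + control
    observed = abs(welch_t(case, control))
    hits = total = 0
    # Enumerate every way of relabelling len(case) of the pooled samples as "case".
    for idx in combinations(range(len(pooled)), len(case)):
        chosen = set(idx)
        x = [pooled[i] for i in idx]
        y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(welch_t(x, y)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical methylation levels (between 0 and 1) at two CpGs, 5 cases and 5 controls.
cpgs = {
    "CpG A": ([0.81, 0.78, 0.85, 0.74, 0.80], [0.52, 0.60, 0.48, 0.57, 0.55]),
    "CpG B": ([0.62, 0.55, 0.70, 0.58, 0.66], [0.60, 0.57, 0.68, 0.54, 0.64]),
}
p_values = {name: permutation_p_value(case, ctrl) for name, (case, ctrl) in cpgs.items()}
for name, p in p_values.items():
    verdict = "differentially methylated" if p < 0.05 else "not differentially methylated"
    print(f"{name}: p = {p:.3f} -> {verdict} at the 0.05 level")
```

With thousands of CpGs the same function would simply be applied once per CpG, which is what makes multiple testing a concern.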

Slide 15

Slide 15 text

t-test for Differential Methylation, CpG #1: p-value = 0.034 < 0.05, so the CpG is differentially methylated. [Figure: methylation level (0.00 to 1.00) by status, Case vs Control.]

Slide 16

Slide 16 text

t-test for Differential Methylation, CpG #2: p-value = 0.343 > 0.05, so the CpG is not differentially methylated. [Figure: methylation level (0.00 to 1.00) by status, Case vs Control.]

Slide 17

Slide 17 text

600 CpGs, 36 samples (6 cell types). Jaffe and Irizarry (2014). Genome Biol 15: R31.

Slide 18

Slide 18 text

What about neighboring CpGs?
Problem: If one CpG is methylated, would a CpG nearby also be methylated?
Some proposed solutions:
(1) Can we find two or more runs of differentially methylated CpGs?
• If p-value < 0.05 for CpG #1, #2, #3, etc.
• Caution: multiple testing
(2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
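The multiple-testing caution can be made concrete with a short sketch. The slides do not name a correction method; Bonferroni and Benjamini-Hochberg are used here as two common choices, which is an assumption. The first two p-values are the ones shown for CpG #1 and CpG #2; the other three are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 0.034 and 0.343 are from the slides; the remaining values are hypothetical.
raw_p = [0.034, 0.343, 0.001, 0.020, 0.600]
bonf = bonferroni(raw_p)
bh = benjamini_hochberg(raw_p)
for p, b, q in zip(raw_p, bonf, bh):
    print(f"raw p = {p:.3f}   Bonferroni = {b:.3f}   BH = {q:.3f}")
```

In this small example the CpG with raw p-value 0.034 no longer falls below 0.05 after either adjustment, which is the kind of change the caution refers to.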

Slide 19

Slide 19 text

Smoothing Across Genomic Regions. Jaffe et al. (2012) Int J Epidemiology

Slide 20

Slide 20 text

Locally Weighted Scatterplot Smoothing (Loess)
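A minimal sketch of loess, assuming the textbook formulation: at each position, fit a weighted straight line to the nearest points, with tricube weights that shrink with distance. The span, the positions and the methylation differences below are hypothetical, and this is not the implementation used in the cited papers.

```python
def tricube(u):
    """Tricube kernel: weight falls smoothly from 1 at u = 0 to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_at(x0, xs, ys, span=0.5):
    """Local linear fit evaluated at x0, using the nearest span * n points."""
    n = len(xs)
    k = min(n, max(3, round(span * n)))
    # Bandwidth: distance from x0 to its k-th nearest data point.
    h = max(sorted(abs(x - x0) for x in xs)[k - 1], 1e-12)
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        dx = x - x0  # centre on x0 so the fitted intercept is the smoothed value
        w = tricube(dx / h)
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    if sw == 0:
        raise ValueError("no points with positive weight near x0")
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate window: fall back to a weighted mean
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw


def loess(xs, ys, span=0.5):
    """Smoothed value at every observed position."""
    return [loess_at(x, xs, ys, span) for x in xs]


# Hypothetical case-minus-control methylation differences at 20 CpGs:
# a 0.3 bump at CpGs 8 to 12 plus alternating noise of +/- 0.05.
positions = [100 * i for i in range(20)]
diffs = [(0.3 if 8 <= i <= 12 else 0.0) + (0.05 if i % 2 else -0.05) for i in range(20)]
smoothed = loess(positions, diffs, span=0.4)
for pos, d, s in zip(positions, diffs, smoothed):
    print(f"{pos:5d}  raw = {d:+.2f}  smoothed = {s:+.3f}")
```

The smoothed curve damps the CpG-to-CpG noise while keeping the bump, which is what allows a region, rather than a single CpG, to be flagged.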

Slide 21

Slide 21 text

Irizarry (2009). Nature Genetics 41: 178-185

Slide 22

Slide 22 text

Correlation ≠ Causation. http://xkcd.com/552/

Slide 23

Slide 23 text

Technical vs Biological Variation
• Raw genomics data contains biases and unwanted technical variation
– e.g. sequencing technology, batch effects
– Can cause perceived differences between samples, irrespective of the biological variation
• Changes in experimental conditions can be confused with biological variability
– Can lead to false discoveries (e.g. spurious differentially methylated regions, DMRs)

Slide 24

Slide 24 text

Bladder Normal and Cancer Samples

Slide 25

Slide 25 text

Bladder Normal and Cancer Samples

Slide 26

Slide 26 text

How do you pre-process and normalize noisy genomics data?

Slide 27

Slide 27 text

Quantile Normalization
• Most widely used multi-sample normalization
• Originally developed for gene expression microarrays
• Now applied to
– Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & Brain imaging
Can be very helpful in eliminating unwanted variation, e.g. "batch effects" (good), but has the potential to wash out true biological variation (bad)

Slide 28

Slide 28 text

How does it work?
Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array.

Worked example (rows are features, columns are samples):

(1) Raw data
    2    4    4    5
    5   14    4    7
    4    8    6    9
    3    8    5    8
    3    9    3    5

(2) Order values within each sample (or column)
    2    4    3    5
    3    8    4    5
    3    8    4    7
    4    9    5    8
    5   14    6    9

(3) Average across rows and substitute value with average
    3.5  3.5  3.5  3.5
    5.0  5.0  5.0  5.0
    5.5  5.5  5.5  5.5
    6.5  6.5  6.5  6.5
    8.5  8.5  8.5  8.5

(4) Re-order averaged values in original order
    3.5  3.5  5.0  5.0
    8.5  8.5  5.5  5.5
    6.5  5.0  8.5  8.5
    5.0  5.5  6.5  6.5
    5.5  6.5  3.5  3.5
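The steps of quantile normalization can be sketched in a few lines, using the 5 x 4 raw matrix from this slide. This is an illustrative sketch, not the code of any particular package. Tied values within a column are ranked by order of appearance here, which is an assumption, so tied entries (such as the two 5s in the last column) may receive their two averages in the opposite order from the slide.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a list of rows (features) whose columns are samples."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Order values within each sample (column).
    sorted_cols = [sorted(col) for col in columns]
    # Average across rows of the sorted matrix: one mean per rank.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Put each rank mean back at the position its raw value came from.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable sort: ties keep input order
        for rank, i in enumerate(order):
            normalized[i][j] = rank_means[rank]
    return normalized


# Raw data from the slide: 5 features (rows) by 4 samples (columns).
raw = [
    [2, 4, 4, 5],
    [5, 14, 4, 7],
    [4, 8, 6, 9],
    [3, 8, 5, 8],
    [3, 9, 3, 5],
]
normalized = quantile_normalize(raw)
for row in normalized:
    print("  ".join(f"{v:4.1f}" for v in row))
```

After normalization every column holds the same set of values (3.5, 5.0, 5.5, 6.5, 8.5), only in a different order, which is why the method forces all samples to share one distribution.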

Slide 29

Slide 29 text

Gene expression of brain and liver tissues. [Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10).]

Slide 30

Slide 30 text

Back to example: 6 purified cell types. [Figure: density of beta values, DNA Methylation (450K arrays), for Bcell, CD4T, CD8T, Gran, Mono, NK.]

Slide 31

Slide 31 text

Back to motivating example. Should we use quantile normalization? Will we remove important biological variation? [Figure: density of beta values, DNA Methylation (450K arrays).]

Slide 32

Slide 32 text

Back to motivating example (quantile normalized). Should we use quantile normalization? Will we remove important biological variation? [Figure: density of beta values after quantile normalization, DNA Methylation (450K arrays).]

Slide 33

Slide 33 text

quantro: Test for global changes between groups
• quantro: R/Bioconductor package to test for the assumptions of quantile normalization
Main idea:
• Compare variability within groups to variability between groups
• If variability between groups > variability within groups, then there may be global changes across groups
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10).]
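The main idea above, comparing between-group to within-group variability, can be illustrated with a one-way ANOVA F statistic computed on per-sample medians. This is a simplified sketch under that assumption, written in Python; it is not the quantro package's API or its exact test statistic, and all intensities below are hypothetical.

```python
from statistics import mean, median


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def median_f(groups_of_samples):
    """Summarise each sample by its median, then compare groups with F."""
    return f_statistic([[median(s) for s in samples] for samples in groups_of_samples])


# Hypothetical log2 intensities: three samples per group, five features per sample.
group_a = [[6.1, 7.9, 8.0, 8.2, 9.5], [6.3, 8.1, 8.2, 8.4, 9.9], [5.9, 7.8, 7.9, 8.3, 9.4]]
group_b = [[7.9, 9.8, 10.0, 10.3, 12.0], [8.2, 10.0, 10.2, 10.5, 12.1], [8.0, 9.7, 9.9, 10.1, 11.8]]
group_c = [[6.0, 7.8, 8.1, 8.3, 9.6], [6.2, 8.0, 8.0, 8.5, 9.7], [6.1, 7.9, 8.2, 8.4, 9.8]]

f_shifted = median_f([group_a, group_b])  # groups with clearly different distributions
f_similar = median_f([group_a, group_c])  # groups with similar distributions
print(f"A vs B: F = {f_shifted:.1f} (between-group variability dominates)")
print(f"A vs C: F = {f_similar:.2f} (within-group variability dominates)")
```

A large ratio suggests the distributions differ globally between groups, in which case forcing them to be identical with quantile normalization may not be appropriate.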

Slide 34

Slide 34 text

Targeted vs Global changes
[Figure, panels A, B, C: density plots of rlogTransformation counts for GG (n=18), AG (n=32), AA (n=15); of log2 PM values for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15); and of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10).]

A: Targeted changes
– Observed variation: Small variability within groups, small variability across groups
– Reason? Small technical variability; no global changes
– What to do? Use quantile normalization (but not necessary)

B: Targeted changes
– Observed variation: Large variability within groups, small variability across groups
– Reason? Large technical variability or batch effects within groups; no global changes
– What to do? Use quantile normalization

C: Global changes
– Observed variation: Small variability within groups, large variability across groups
– Reason? Global technical variability or batch effects across groups. What to do? Use quantile normalization
– Reason? Global biological variability across groups. What to do? Do not use quantile normalization
– Raw data alone cannot detect difference

quantro will detect global differences due to both technical and biological variation

Slide 35

Slide 35 text

Final thoughts
• Statistics matters in the analysis of any data!
• Statistics can help identify relevant biological variation in genomics data
– Differences in CpGs
– Smoothing across genomic regions
• Statistics can help eliminate unwanted technical variation in genomics data
– "Batch effects"

Slide 36

Slide 36 text

Acknowledgements Rafael Irizarry Funding: NIH R01 grants GM083084 and RR021967/GM103552.

Slide 37

Slide 37 text

Questions? Feel free to send comments/questions: Twitter: @stephaniehicks Email: [email protected] [Figure labels: Normal distribution, Weibull distribution, Poisson distribution]