Slide 1

Slide 1 text

Why Statistics Matters: Analysis of Genomics Data
Stephanie Hicks
@stephaniehicks
stephaniehicks.com

Slide 2

Slide 2 text

Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
- Hal Varian, chief economist at Google
2 min YouTube video

Slide 3

Slide 3 text

Hilary Mason
@hmason
http://www.forbes.com/sites/danwoods/2012/03/08/hilary-mason-what-is-a-data-scientist/

Slide 4

Slide 4 text

@jtleek  

Slide 5

Slide 5 text

Preparing students: high school
http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
Data source: The College Board

Slide 6

Slide 6 text

http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf
Data from the National Center for Education Statistics; analysis by the ASA

Slide 7

Slide 7 text

Preparing students: undergraduate and graduate
http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
Data source: NCES Digest of Education Statistics
Annotations: Graduated from high school; Completed two REUs in Biostatistics; Graduated from LSU (BS, Mathematics); Started a PhD in Statistics

Slide 8

Slide 8 text

Rapid change in technology
“Sanger sequencing”
“Next-generation sequencing”
Mardis (2011) Nature 470: 198-203

Slide 9

Slide 9 text

Hayden (2014) Nature 507: 294-295

Slide 10

Slide 10 text

Questions:
(1) How to process & normalize?
(2) How to find differences between two or more groups?
(3) How to find interesting genomic regions?

Slide 11

Slide 11 text

Discuss these three questions and illustrate how statistics can help
Focus on DNA methylation data
(But these challenges are common to other areas of genomics)
Morgan et al. (1999). Nature Genetics 23: 314-8
http://epigenome.eu/en/2,48,873
Bradbury (2003). PLoS Biology 1: e82

Slide 12

Slide 12 text

DNA Methylation
ATCGCGTTACTGCGGAA
TAGCGCAATGTCGCCTT
[Figure: six sites on the two strands are marked "m"]

Slide 13

Slide 13 text

http://www.learnnc.org/lp/pages/7828

Slide 14

Slide 14 text

Measuring DNA Methylation
Bock (2012). Nature Reviews Genetics 13: 705-719
Problem: Which CpGs are differentially methylated between two groups?
Some proposed statistical solutions: At each CpG, test whether there is a difference between the groups using e.g. a t-test, an F-test or linear regression
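
The per-CpG test described on this slide can be sketched in a few lines. This is a minimal illustration, not the analysis behind the slides: the methylation values are made up, the function names are hypothetical, and an exact permutation p-value stands in for the usual t-distribution lookup so that the sketch needs only the Python standard library.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of methylation levels."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def permutation_p_value(cases, controls):
    """Exact two-sided permutation p-value using |t| as the test statistic."""
    pooled = cases + controls
    observed = abs(welch_t(cases, controls))
    n_extreme = 0
    n_total = 0
    # Enumerate every way of relabelling the samples as cases or controls.
    for idx in combinations(range(len(pooled)), len(cases)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_total += 1
        if abs(welch_t(a, b)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total


# Hypothetical methylation levels (proportion methylated) at two CpGs.
cpg1_cases, cpg1_controls = [0.82, 0.78, 0.85, 0.80], [0.55, 0.60, 0.52, 0.58]
cpg2_cases, cpg2_controls = [0.50, 0.62, 0.45, 0.70], [0.48, 0.66, 0.55, 0.59]

p_cpg1 = permutation_p_value(cpg1_cases, cpg1_controls)  # 2/70, about 0.029
p_cpg2 = permutation_p_value(cpg2_cases, cpg2_controls)
print(p_cpg1, p_cpg2)
```

With four samples per group there are only 70 relabellings, so the smallest two-sided p-value this test can return is 2/70. A real analysis would use more samples and a library routine for the t-test.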

Slide 15

Slide 15 text

t-test for Differential Methylation
CpG #1: p-value = 0.034 < 0.05, so the CpG is differentially methylated
[Figure: methylation level (0.00 to 1.00) by status (Case, Control) for CpG #1]

Slide 16

Slide 16 text

t-test for Differential Methylation
CpG #2: p-value = 0.343 > 0.05, so the CpG is not differentially methylated
[Figure: methylation level (0.00 to 1.00) by status (Case, Control) for CpG #2]

Slide 17

Slide 17 text

600 CpGs
36 samples (6 cell types)
Jaffe and Irizarry (2014). Genome Biol 15: R31.

Slide 18

Slide 18 text

What about neighboring CpGs?
Problem: If one CpG is methylated, would a nearby CpG also be methylated?
Some proposed solutions:
(1) Can we find runs of two or more differentially methylated CpGs?
• If p-value < 0.05 for CpG #1, #2, #3, etc.
• Caution: multiple testing
(2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
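
Proposed solution (1) and its multiple-testing caution can be illustrated with a short sketch. The p-values below are invented and `find_runs` is a hypothetical helper; a Bonferroni threshold is used only because it is the simplest correction to show, not because the slides prescribe it. For scale: if 600 CpGs are each tested at 0.05 and none truly differ, about 30 are expected to pass by chance.

```python
def find_runs(p_values, threshold, min_length=2):
    """Return inclusive (start, end) index pairs of consecutive CpGs with p < threshold."""
    runs = []
    start = None
    for i, p in enumerate(p_values):
        if p < threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last CpG.
    if start is not None and len(p_values) - start >= min_length:
        runs.append((start, len(p_values) - 1))
    return runs


# Hypothetical per-CpG p-values, ordered by genomic position.
p_values = [0.20, 0.01, 0.03, 0.004, 0.60, 0.04, 0.30, 0.0001, 0.0002, 0.9]

naive_runs = find_runs(p_values, 0.05)
bonferroni_runs = find_runs(p_values, 0.05 / len(p_values))
print(naive_runs)       # [(1, 3), (7, 8)]
print(bonferroni_runs)  # [(7, 8)]
```

The run at positions 1 to 3 disappears once the threshold accounts for the ten tests, which is the kind of false discovery the caution is about.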

Slide 19

Slide 19 text

Smoothing Across Genomic Regions
Jaffe et al. (2012) Int J Epidemiology

Slide 20

Slide 20 text

Locally Weighted Scatterplot Smoothing (Loess)
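
A rough sketch of what loess does: at each position it fits a straight line to the nearby points, weighting closer points more heavily, and takes the fitted value at that position. The version below assumes tricube weights and a local linear fit, leaves out refinements found in library implementations, and uses made-up positions and methylation differences. The function name and the `span` parameter are illustrative.

```python
def loess(x, y, span=0.5):
    """Local linear fit at each x[i], with tricube weights over the nearest points."""
    n = len(x)
    k = min(n, max(2, int(round(span * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # weighted least-squares line at x0
    return fitted


# Hypothetical CpG positions and case-minus-control methylation differences.
positions = [100, 150, 220, 300, 340, 410, 480, 550, 600, 690]
differences = [0.02, -0.01, 0.03, 0.15, 0.22, 0.18, 0.20, 0.04, -0.02, 0.01]

smoothed = loess(positions, differences, span=0.5)
for pos, value in zip(positions, smoothed):
    print(pos, round(value, 3))
```

A larger `span` borrows information from more neighbouring CpGs and gives a smoother curve; a smaller one follows individual CpGs more closely.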

Slide 21

Slide 21 text

Irizarry (2009). Nature Genetics 41: 178-185

Slide 22

Slide 22 text

Correlation ≠ Causation
http://xkcd.com/552/

Slide 23

Slide 23 text

Technical vs Biological Variation
• Raw genomics data contains biases and unwanted technical variation
– e.g. sequencing technology, batch effects
– Can cause perceived differences between samples, irrespective of the biological variation
• Changes in experimental conditions can be confused with biological variability
– Can lead to false discoveries (e.g. finding DMRs)

Slide 24

Slide 24 text

Bladder Normal and Cancer Samples

Slide 25

Slide 25 text

Bladder Normal and Cancer Samples

Slide 26

Slide 26 text

How do you pre-process and normalize noisy genomics data?

Slide 27

Slide 27 text

Quantile Normalization
• Most widely used multi-sample normalization
• Originally developed for gene expression microarrays
• Now applied to
– Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & Brain imaging
Can be very helpful in eliminating unwanted variation, e.g. “batch effects” (good), but has the potential to wash out true biological variation (bad)

Slide 28

Slide 28 text

How does it work?
Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array

Raw data:
2   4   4   5
5  14   4   7
4   8   6   9
3   8   5   8
3   9   3   5

Order values within each sample (or column):
2   4   3   5
3   8   4   5
3   8   4   7
4   9   5   8
5  14   6   9

Average across rows and substitute value with average:
3.5  3.5  3.5  3.5
5.0  5.0  5.0  5.0
5.5  5.5  5.5  5.5
6.5  6.5  6.5  6.5
8.5  8.5  8.5  8.5

Re-order averaged values in original order:
3.5  3.5  5.0  5.0
8.5  8.5  5.5  5.5
6.5  5.0  8.5  8.5
5.0  5.5  6.5  6.5
5.5  6.5  3.5  3.5
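
The procedure on this slide can be written out as a short sketch, using the slide's raw data (rows as features, columns as samples). The function name is illustrative. One assumption to flag: tied values within a column are ranked in row order here, so an entry that ties with another may receive a neighbouring rank mean and differ from the slide's final matrix in that position.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a list of rows; columns are samples, rows are features."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Order values within each column, then average across columns at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Put the rank means back in each column's original order.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable: ties keep row order
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result, rank_means


raw = [
    [2, 4, 4, 5],
    [5, 14, 4, 7],
    [4, 8, 6, 9],
    [3, 8, 5, 8],
    [3, 9, 3, 5],
]

normalized, rank_means = quantile_normalize(raw)
print(rank_means)  # [3.5, 5.0, 5.5, 6.5, 8.5]
for row in normalized:
    print(row)
```

After normalization every column holds the same five values, which is the point of the method: all samples are forced onto one common distribution.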

Slide 29

Slide 29 text

Gene expression of brain and liver tissues
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

Slide 30

Slide 30 text

Back to example: 6 purified cell types
[Figure: density of beta values, DNA Methylation (450K arrays), for Bcell, CD4T, CD8T, Gran, Mono, NK]

Slide 31

Slide 31 text

Back to motivating example
Should we use quantile normalization? Will we remove important biological variation?
[Figure: DNA Methylation (450K arrays), beta values from 0.0 to 1.0]

Slide 32

Slide 32 text

Back to motivating example (quantile normalized)
Should we use quantile normalization? Will we remove important biological variation?
[Figure: DNA Methylation (450K arrays), beta values from 0.0 to 1.0]

Slide 33

Slide 33 text

quantro: Test for global changes between groups
quantro
• R/Bioconductor package to test for the assumptions of quantile normalization
[Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
Main idea:
• Compare variability within groups to variability between groups
• If variability between groups > variability within groups, then there may be global changes across groups
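
The main idea can be illustrated with an ANOVA-style ratio of between-group to within-group variability, computed on one summary per sample. This is a simplified stand-in and not quantro's implementation: the choice of the median as the per-sample summary, the function name, and all of the numbers are assumptions made for the example.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """Ratio of between-group to within-group mean squares.

    `groups` maps a group label to a list of per-sample summaries.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)
    n = len(all_values)
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def sample_medians(samples_by_group):
    """Summarize each sample (a list of values) by its median."""
    return {g: [median(s) for s in samples] for g, samples in samples_by_group.items()}


# Hypothetical log2 intensities: three samples per group, three features each.
shifted = {
    "brain": [[8.1, 9.0, 10.2], [7.9, 9.1, 10.0], [8.0, 8.9, 10.1]],
    "liver": [[10.0, 11.1, 12.2], [9.8, 11.0, 12.1], [10.1, 10.9, 12.0]],
}
similar = {
    "groupA": [[8.1, 9.0, 10.2], [7.9, 9.3, 10.0], [8.0, 8.7, 10.1]],
    "groupB": [[8.0, 9.1, 10.0], [7.8, 8.8, 10.3], [8.2, 9.2, 9.9]],
}

ratio_shifted = between_within_ratio(sample_medians(shifted))
ratio_similar = between_within_ratio(sample_medians(similar))
print(ratio_shifted, ratio_similar)
```

A ratio far above 1, as in the shifted example, suggests the groups differ globally; a ratio below 1 suggests that sample-to-sample variation dominates. Judging whether a given ratio is large enough to matter needs a reference distribution, which this sketch does not provide.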

Slide 34

Slide 34 text

Targeted vs Global changes
[Figure, three density panels:
A: rlogTransformation counts for GG (n=18), AG (n=32), AA (n=15)
B: log2 PM values for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15)
C: log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

Observed variation: Small variability within groups, small variability across groups (targeted changes)
Reason? Small technical variability; no global changes
What to do? Use quantile normalization (but not necessary)

Observed variation: Large variability within groups, small variability across groups (targeted changes)
Reason? Large technical variability or batch effects within groups; no global changes
What to do? Use quantile normalization

Observed variation: Small variability within groups, large variability across groups (global changes)
Reason? Global technical variability or batch effects across groups
What to do? Use quantile normalization
Reason? Global biological variability across groups
What to do? Do not use quantile normalization
Raw data alone cannot detect the difference

quantro will detect global differences due to both technical and biological variation

Slide 35

Slide 35 text

Final thoughts
• Statistics matters in the analysis of any data!
• Statistics can help identify relevant biological variation in genomics data
– Differences in CpGs
– Smoothing across genomic regions
• Statistics can help eliminate unwanted technical variation in genomics data
– “Batch effects”

Slide 36

Slide 36 text

Acknowledgements
Rafael Irizarry
Funding: NIH R01 grants GM083084 and RR021967/GM103552.

Slide 37

Slide 37 text

Questions?
Feel free to send comments/questions:
Twitter: @stephaniehicks
Email: shicks@jimmy.harvard.edu
[Figure labels: Normal distribution, Weibull distribution, Poisson distribution]