Why Statistics Matters in the Analysis of Genomics Data

Stephanie Hicks
February 12, 2015

Presented at the LSU Computational Biology seminar and the LSUConnect event.

Transcript

  1. Why Statistics Matters: Analysis of Genomics Data
     Stephanie Hicks | @stephaniehicks | stephaniehicks.com
  2. Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
     - Hal Varian, chief economist at Google (2 min YouTube video)
  3. Hilary Mason (@hmason): http://www.forbes.com/sites/danwoods/2012/03/08/hilary-mason-what-is-a-data-scientist/

  4. @jtleek  

  5. Preparing students: high school
     http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
     Data source: The College Board
  6. http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf
     Data from the National Center for Education Statistics; analysis by the ASA
  7. Preparing students: undergraduate and graduate
     http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
     Data source: NCES Digest of Education Statistics
     Timeline annotations:
     • Graduated from high school
     • Completed two REUs in Biostatistics
     • Graduated from LSU (BS, Mathematics); started a PhD in Statistics
  8. Rapid change in technology
     “Sanger sequencing” and “Next-generation sequencing”
     Mardis (2011) Nature 470: 198-203
  9. Hayden (2014) Nature 507: 294-295

  10. Questions:
      (1) How to process & normalize?
      (2) How to find differences between two or more groups?
      (3) How to find interesting genomic regions?
  11. Discuss these three questions and illustrate how statistics can help
      Focus on DNA methylation data
      (But these challenges are common to other areas of genomics)
      Morgan et al. (1999). Nature Genetics 23: 314-8
      http://epigenome.eu/en/2,48,873
      Bradbury (2003). PLoS Biology 1: e82
  12. DNA Methylation
      ATCGCGTTACTGCGGAA
      TAGCGCAATGTCGCCTT
      [Figure: three positions on each strand are marked “m” (methylated)]
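
As a small illustration of where methylation marks can sit, the sketch below locates the CpG sites (a C immediately followed by a G) in the top strand shown on slide 12. The function name `find_cpg_sites` is hypothetical and not from the talk.

```python
def find_cpg_sites(sequence):
    """Return the 0-based start position of every 'CG' dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


# Top strand from slide 12.
top_strand = "ATCGCGTTACTGCGGAA"
cpg_positions = find_cpg_sites(top_strand)
print(cpg_positions)  # three CpG sites, matching the three "m" marks per strand
```
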
  13. http://www.learnnc.org/lp/pages/7828

  14. Measuring DNA Methylation
      Bock (2012). Nature Reviews Genetics 13, 705-719
      Problem: Which CpGs are differentially methylated between two groups?
      Some proposed statistical solutions: At each CpG, test if there is a difference using e.g. t-test, F-test or linear regression
  15. t-test for Differential Methylation
      [Plot: CpG #1, methylation level (0.00 to 1.00) by status (Case vs. Control)]
      p-value = 0.034 < 0.05: CpG is differentially methylated
  16. t-test for Differential Methylation
      [Plot: CpG #2, methylation level (0.00 to 1.00) by status (Case vs. Control)]
      p-value = 0.343 > 0.05: CpG is not differentially methylated
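
The per-CpG test on slides 14 to 16 can be sketched as follows. This is a minimal, dependency-free illustration rather than the analysis behind the slides: it assumes the classic equal-variance two-sample t-test, obtains the p-value by numerically integrating the t density, and uses made-up methylation levels. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1 - central_mass)


def pooled_t_test(case, control):
    """Equal-variance two-sample t-test; returns (t statistic, p-value)."""
    n1, n2 = len(case), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(case) + (n2 - 1) * variance(control)) / df
    t_stat = (mean(case) - mean(control)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, two_sided_p(t_stat, df)


# Made-up methylation levels (values between 0 and 1) for two CpGs.
cpgs = {
    "CpG #1": ([0.82, 0.75, 0.88, 0.79, 0.85], [0.55, 0.62, 0.48, 0.60, 0.52]),
    "CpG #2": ([0.50, 0.62, 0.45, 0.58, 0.53], [0.52, 0.47, 0.60, 0.55, 0.49]),
}
results = {name: pooled_t_test(case, control) for name, (case, control) in cpgs.items()}
for name, (t_stat, p) in results.items():
    verdict = "differentially methylated" if p < 0.05 else "not differentially methylated"
    print(f"{name}: t = {t_stat:.2f}, p = {p:.3g} -> {verdict}")
```

Running one such test per CpG is what creates the multiple-testing problem raised on slide 18.
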
  17. 600 CpGs, 36 samples (6 cell types)
      Jaffe and Irizarry (2014). Genome Biol 15: R31.
  18. What about neighboring CpGs?
      Problem: If one CpG is methylated, would a CpG nearby also be methylated?
      Some proposed solutions:
      (1) Can we find two or more runs of differentially methylated CpGs?
          • If p-value < 0.05 for CpG #1, #2, #3, etc…
          • Caution: multiple testing
      (2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
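
The multiple-testing caution on slide 18 can be made concrete with a short sketch. The talk does not name a specific correction; the Benjamini-Hochberg false discovery rate adjustment used here is one common choice and is an assumption, as are the p-values, which are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, one per CpG.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
adjusted = benjamini_hochberg(p_values)

n_raw = sum(p < 0.05 for p in p_values)
n_bh = sum(q < 0.05 for q in adjusted)
print(f"{n_raw} CpGs pass p < 0.05 before adjustment, {n_bh} after")
```

With these made-up values, several CpGs that pass the unadjusted 0.05 cutoff no longer pass after adjustment, which is the risk the slide warns about.
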
  19. Smoothing Across Genomic Regions
      Jaffe et al. (2012) Int J Epidemiology
  20. Locally Weighted Scatterplot Smoothing (Loess)
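
A minimal sketch of the loess idea from slide 20: at each position, fit a straight line to nearby points, weighting closer points more heavily, and take the fitted value there. It assumes tricube weights and a local linear fit, which are common defaults but are not stated in the talk. The function name, the `frac` parameter, and the data are hypothetical.

```python
def loess(xs, ys, frac=0.5):
    """Locally weighted linear regression evaluated at each x (tricube weights)."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours that set the bandwidth
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        # Weighted least-squares line through the neighbourhood of x0.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Made-up case-minus-control methylation differences along 20 neighbouring CpGs:
# alternating noise everywhere, plus a real shift of 0.3 at positions 8 to 12.
positions = list(range(20))
differences = [(0.3 if 8 <= i <= 12 else 0.0) + (0.1 if i % 2 == 0 else -0.1)
               for i in positions]
smoothed = loess(positions, differences, frac=0.5)
print([round(v, 2) for v in smoothed])
```

The smoothed curve rises around positions 8 to 12 and stays near zero elsewhere, so a region stands out even though single CpGs are noisy.
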

  21. Irizarry (2009). Nature Genetics 41: 178-185

  22. Correlation ≠ Causation
      http://xkcd.com/552/

  23. Technical vs Biological Variation
      • Raw genomics data contains biases and unwanted technical variation
        – e.g. sequencing technology, batch effects
        – Can cause perceived differences between samples, irrespective of the biological variation
      • Changes in experimental conditions can be confused with biological variability
        – Can lead to false discoveries (e.g. finding DMRs)
  24. Bladder  Normal  and  Cancer  Samples  

  25. Bladder  Normal  and  Cancer  Samples  

  26. How do you pre-process and normalize noisy genomics data?
  27. Quantile Normalization
      • Most widely used multi-sample normalization
      • Originally developed for gene expression microarrays
      • Now applied to
        – Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & Brain imaging
      Can be very helpful in eliminating unwanted variation, e.g. “batch effects” (good), but has potential to wash out true biological variation (bad)
  28. How does it work?
      Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array

      Raw data:
           2     4     4     5
           5    14     4     7
           4     8     6     9
           3     8     5     8
           3     9     3     5

      Order values within each sample (or column):
           2     4     3     5
           3     8     4     5
           3     8     4     7
           4     9     5     8
           5    14     6     9

      Average across rows and substitute value with average:
         3.5   3.5   3.5   3.5
         5.0   5.0   5.0   5.0
         5.5   5.5   5.5   5.5
         6.5   6.5   6.5   6.5
         8.5   8.5   8.5   8.5

      Re-order averaged values in original order:
         3.5   3.5   5.0   5.0
         8.5   8.5   5.5   5.5
         6.5   5.0   8.5   8.5
         5.0   5.5   6.5   6.5
         5.5   6.5   3.5   3.5
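
The three steps on slide 28 can be sketched in a few lines. The raw values are the slide's 5 x 4 example (rows are features, columns are samples). The function name is hypothetical. Ties are broken here by order of appearance, which is an assumption, so the two tied 5s in the fourth sample come out in the opposite order from the slide.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a list of rows; each column is one sample (array)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Step 1: order values within each sample (column).
    sorted_cols = [sorted(col) for col in columns]
    # Step 2: average across rows of the sorted matrix, giving one mean per rank.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Step 3: place each rank mean back at the original position of that rank.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable, so ties keep input order
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


raw = [
    [2, 4, 4, 5],
    [5, 14, 4, 7],
    [4, 8, 6, 9],
    [3, 8, 5, 8],
    [3, 9, 3, 5],
]
normalized = quantile_normalize(raw)
for row in normalized:
    print(row)
```

After normalization every sample contains exactly the same set of values (3.5, 5.0, 5.5, 6.5, 8.5), only in a different order. That is why the method can remove real global differences between groups as well as technical ones.
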
  29. Gene expression of brain and liver tissues
      [Density plot of log2 PM values. Groups: Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
  30. Back to example: 6 purified cell types
      [Density plot of beta values, DNA Methylation (450K arrays). Cell types: Bcell, CD4T, CD8T, Gran, Mono, NK]
  31. Back to motivating example
      Should we use quantile normalization? Will we remove important biological variation?
      [Density plot of beta values, DNA Methylation (450K arrays)]
  32. Back to motivating example (quantile normalized)
      Should we use quantile normalization? Will we remove important biological variation?
      [Density plot of beta values, DNA Methylation (450K arrays)]
  33. quantro: Test for global changes between groups
      • R/Bioconductor package to test for the assumptions of quantile normalization
      Main idea:
      • Compare variability within groups to variability between groups
      • If variability between groups > variability within groups, then there may be global changes across groups
      [Density plot of log2 PM values. Groups: Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
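
The "main idea" on slide 33 can be illustrated with a generic between-versus-within comparison. The slides do not give quantro's actual test statistic, so this sketch is an assumption: it summarizes each sample by its median and computes a one-way ANOVA F ratio across groups. The function name and the data are made up.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """One-way ANOVA F ratio computed on per-sample medians.

    groups maps a group name to a list of samples, each a list of values.
    """
    medians = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_medians = [m for ms in medians.values() for m in ms]
    grand = mean(all_medians)
    k, n = len(medians), len(all_medians)
    between = sum(len(ms) * (mean(ms) - grand) ** 2 for ms in medians.values()) / (k - 1)
    within = sum((m - mean(ms)) ** 2 for ms in medians.values() for m in ms) / (n - k)
    return between / within


# Made-up log2 intensities: three samples per group, three features per sample.
global_shift = {
    "brain": [[7.9, 8.0, 8.1], [8.0, 8.1, 8.2], [7.8, 7.9, 8.0]],
    "liver": [[9.9, 10.0, 10.1], [10.0, 10.1, 10.2], [9.8, 9.9, 10.0]],
}
no_shift = {
    "group_a": [[7.9, 8.0, 8.1], [8.0, 8.1, 8.2], [7.8, 7.9, 8.0]],
    "group_b": [[7.9, 8.0, 8.1], [8.1, 8.2, 8.3], [7.7, 7.8, 7.9]],
}
ratio_shift = between_within_ratio(global_shift)
ratio_none = between_within_ratio(no_shift)
print(f"global shift: {ratio_shift:.1f}, no shift: {ratio_none:.3f}")
```

A ratio well above 1 means the groups differ more than samples within a group do. That is the situation in which quantile normalization could erase a real difference.
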
  34. Targeted vs Global changes
      [Three density plots:
       A: rlogTransformation counts; GG (n=18), AG (n=32), AA (n=15)
       B: log2 PM values; Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15)
       C: log2 PM values; Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

      Targeted changes:
      • Observed variation: Small variability within groups, Small variability across groups
        Reason? Small technical variability; no global changes
        What to do? Use quantile normalization (but not necessary)
      • Observed variation: Large variability within groups, Small variability across groups
        Reason? Large technical variability or batch effects within groups; no global changes
        What to do? Use quantile normalization

      Global changes:
      • Observed variation: Small variability within groups, Large variability across groups
        Reason? Global technical variability or batch effects across groups
        What to do? Use quantile normalization
        Reason? Global biological variability across groups
        What to do? Do not use quantile normalization

      Raw data alone cannot detect difference
      quantro will detect global differences due to both technical and biological variation
  35. Final thoughts
      • Statistics matters in the analysis of any data!
      • Statistics can help identify relevant biological variation in genomics data
        – Differences in CpGs
        – Smoothing across genomic regions
      • Statistics can help eliminate unwanted technical variation in genomics data
        – “Batch effects”
  36. Acknowledgements
      Rafael Irizarry
      Funding: NIH R01 grants GM083084 and RR021967/GM103552.
  37. Feel free to send comments/questions:
      Twitter: @stephaniehicks
      Email: shicks@jimmy.harvard.edu
      Questions?
      [Image labels: Normal distribution, Weibull distribution, Poisson distribution]