Stephanie Hicks
February 12, 2015

Why Statistics Matters in the Analysis of Genomics Data

Presented at the LSU Computational Biology seminar and the LSUConnect event.

Transcript

  1. Why Statistics Matters: Analysis of Genomics Data
     Stephanie Hicks
     @stephaniehicks
     stephaniehicks.com
  2. Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
     - Hal Varian, chief economist at Google
     (2 min YouTube video)
  3. Preparing students: undergraduate and graduate
     http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
     Data source: NCES Digest of Education Statistics
     Timeline annotations:
     • Graduated from high school
     • Completed two REUs in Biostatistics
     • Graduated from LSU (BS, Mathematics); started a PhD in Statistics
  4. Rapid change in technology
     Mardis (2011). Nature 470: 198-203
     “Sanger sequencing”
     “Next-generation sequencing”
  5. Questions:
     (1) How to process & normalize?
     (2) How to find differences between two or more groups?
     (3) How to find interesting genomic regions?
  6. Discuss these three questions and illustrate how statistics can help
     Focus on DNA methylation data
     (But these challenges are common to other areas of genomics)
     Morgan et al. (1999). Nature Genetics 23: 314-8
     http://epigenome.eu/en/2,48,873
     Bradbury (2003). PLoS Biology 1: e82
  7. Measuring DNA Methylation
     Bock (2012). Nature Reviews Genetics 13: 705-719
     Problem: Which CpGs are differentially methylated between two groups?
     Some proposed statistical solutions: At each CpG, test whether there is a difference using, e.g., a t-test, F-test or linear regression
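As a minimal sketch of the per-CpG approach above, the following computes a Welch two-sample t statistic at each CpG using only the Python standard library. The CpG names and methylation values are hypothetical, and the function name `welch_t` is introduced here for illustration. Turning the statistic into a p-value requires the t distribution, which a statistics library would supply (for example `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Return (t statistic, approximate degrees of freedom) for two groups."""
    # Squared standard error contributed by each group
    a = variance(case) / len(case)
    b = variance(control) / len(control)
    t = (mean(case) - mean(control)) / sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (a + b) ** 2 / (a ** 2 / (len(case) - 1) + b ** 2 / (len(control) - 1))
    return t, df


# Hypothetical methylation levels (beta values in [0, 1]) at two CpGs:
# each entry holds (case samples, control samples)
cpgs = {
    "cpg_1": ([0.62, 0.70, 0.66, 0.74], [0.50, 0.55, 0.48, 0.53]),
    "cpg_2": ([0.61, 0.58, 0.66, 0.60], [0.59, 0.63, 0.57, 0.64]),
}

# One test per CpG, exactly as the slide describes
results = {name: welch_t(case, control) for name, (case, control) in cpgs.items()}
for name, (t, df) in results.items():
    print(f"{name}: t = {t:.2f}, df = {df:.1f}")
```

A large absolute t statistic (as for the hypothetical `cpg_1`) points to a difference between groups at that CpG; a value near zero (as for `cpg_2`) does not.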
  8. t-test for Differential Methylation
     p-value = 0.034 (< 0.05): CpG is differentially methylated
     [Figure: CpG #1, methylation level (0.00 to 1.00) by status (Case, Control)]
  9. t-test for Differential Methylation
     p-value = 0.343 (> 0.05): CpG is not differentially methylated
     [Figure: CpG #2, methylation level (0.00 to 1.00) by status (Case, Control)]
  10. 600 CpGs
      36 samples (6 cell types)
      Jaffe and Irizarry (2014). Genome Biol 15: R31.
  11. What about neighboring CpGs?
      Problem: If one CpG is methylated, would a nearby CpG also be methylated?
      Some proposed solutions:
      (1) Can we find two or more runs of differentially methylated CpGs?
          • If p-value < 0.05 for CpG #1, #2, #3, etc.
          • Caution: multiple testing
      (2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
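The multiple-testing caution can be made concrete. With the 600 CpGs of the earlier example each tested at 0.05, roughly 600 × 0.05 = 30 would be expected to fall below the threshold by chance alone if no CpG truly differed. The sketch below applies the Benjamini-Hochberg adjustment, one common way to control the false discovery rate. The slides do not say which correction was used, so this choice is an assumption, as are the function name and the example p-values other than 0.034 and 0.343 (taken from the two t-test slides).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of chance "hits" when every null hypothesis is true
expected_by_chance = 600 * 0.05

# 0.034 and 0.343 come from the slides; the other two values are hypothetical
pvalues = [0.034, 0.343, 0.001, 0.020]
adjusted = benjamini_hochberg(pvalues)
for p, q in zip(pvalues, adjusted):
    print(f"raw p = {p:.3f}, adjusted p = {q:.3f}")
```

The adjusted values are never smaller than the raw ones, so a CpG that barely passes 0.05 on its own may no longer pass once the number of tests is taken into account.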
  12. Technical vs Biological Variation
      • Raw genomics data contains biases and unwanted technical variation
        – e.g. sequencing technology, batch effects
        – Can cause perceived differences between samples, irrespective of the biological variation
      • Changes in experimental conditions can be confused with biological variability
        – Can lead to false discoveries (e.g. finding DMRs)
  13. Quantile Normalization
      • Most widely used multi-sample normalization
      • Originally developed for gene expression microarrays
      • Now applied to
        – Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & brain imaging
      Can be very helpful in eliminating unwanted variation, e.g. "batch effects" (good), but has the potential to wash out true biological variation (bad)
  14. How does it work?
      Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array.

      Raw data (rows are features, columns are samples):
         2   4   4   5
         5  14   4   7
         4   8   6   9
         3   8   5   8
         3   9   3   5

      Order values within each sample (or column):
         2   4   3   5
         3   8   4   5
         3   8   4   7
         4   9   5   8
         5  14   6   9

      Average across rows and substitute value with average:
         3.5  3.5  3.5  3.5
         5.0  5.0  5.0  5.0
         5.5  5.5  5.5  5.5
         6.5  6.5  6.5  6.5
         8.5  8.5  8.5  8.5

      Re-order averaged values in original order:
         3.5  3.5  5.0  5.0
         8.5  8.5  5.5  5.5
         6.5  5.0  8.5  8.5
         5.0  5.5  6.5  6.5
         5.5  6.5  3.5  3.5
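The steps above can be sketched in a few lines of standard-library Python, using the raw values from the slide with each sample stored as one list. The function name `quantile_normalize` is introduced here for illustration. Ties are broken by position in this sketch, so tied entries may receive their averaged values in a different order than on the slide.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (equal-length lists of values)."""
    n_features = len(samples[0])
    # Step 1: order values within each sample
    sorted_samples = [sorted(sample) for sample in samples]
    # Step 2: average across samples at each rank
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_features)
    ]
    # Step 3: put the averaged values back in each sample's original order
    normalized = []
    for sample in samples:
        # sorted() is stable, so tied values keep their original relative order
        order = sorted(range(n_features), key=lambda i: sample[i])
        out = [0.0] * n_features
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Raw data from the slide, one list per sample (column)
raw = [
    [2, 5, 4, 3, 3],
    [4, 14, 8, 8, 9],
    [4, 4, 6, 5, 3],
    [5, 7, 9, 8, 5],
]

normalized = quantile_normalize(raw)
for sample in normalized:
    print(sample)
```

After normalization every sample contains exactly the same set of values, which is why the method can remove genuine global differences between groups along with technical ones.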
  15. Gene expression of brain and liver tissues
      [Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
  16. Back to example: 6 purified cell types
      [Figure: DNA Methylation (450K arrays), density of beta values for Bcell, CD4T, CD8T, Gran, Mono, NK]
  17. Back to motivating example
      Should we use quantile normalization? Will we remove important biological variation?
      [Figure: DNA Methylation (450K arrays), beta values]
  18. Back to motivating example (quantile normalized)
      Should we use quantile normalization? Will we remove important biological variation?
      [Figure: DNA Methylation (450K arrays), beta values]
  19. quantro: Test for global changes between groups
      quantro
      • R/Bioconductor package to test for the assumptions of quantile normalization
      [Figure: density of log2 PM values for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
      Main idea:
      • Compare variability within groups to variability between groups
      • If variability between groups > variability within groups, then there may be global changes across groups
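The main idea can be illustrated with a one-way ANOVA-style ratio of between-group to within-group variability, computed on per-sample medians. This is a simplified sketch, not the quantro implementation; the function name `between_within_ratio` and all sample values are hypothetical.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """Ratio of between-group to within-group mean squares of sample medians.

    groups maps a group name to a list of samples (each a list of values).
    """
    # Summarize every sample by its median
    summaries = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_medians = [m for meds in summaries.values() for m in meds]
    grand_mean = mean(all_medians)
    k = len(summaries)      # number of groups
    n = len(all_medians)    # total number of samples
    ss_between = sum(
        len(meds) * (mean(meds) - grand_mean) ** 2 for meds in summaries.values()
    )
    ss_within = sum(
        (m - mean(meds)) ** 2 for meds in summaries.values() for m in meds
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical data: two tissues whose distributions are shifted globally
shifted = {
    "brain": [[6, 7, 8], [6.5, 7.5, 8.5], [6, 7.2, 8]],
    "liver": [[9, 10, 11], [9, 10.4, 12], [9, 9.8, 11]],
}
# Hypothetical data: two groups with similar distributions
similar = {
    "group_a": [[6, 7, 8], [6.5, 7.5, 8.5], [6, 7.2, 8]],
    "group_b": [[6, 7.1, 8], [6, 7.4, 9], [6.5, 7.3, 8]],
}

ratio_shifted = between_within_ratio(shifted)
ratio_similar = between_within_ratio(similar)
print(f"shifted groups: {ratio_shifted:.1f}")
print(f"similar groups: {ratio_similar:.3f}")
```

A ratio well above 1 suggests that samples differ more between groups than within them, which is the situation in which quantile normalization may remove real differences.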
  20. Targeted vs Global changes
      [Figure, panels A, B, C:
       rlogTransformation counts, density for GG (n=18), AG (n=32), AA (n=15);
       log2 PM values, density for Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15);
       log2 PM values, density for Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
      Panel labels: Global changes, Targeted changes, Targeted changes

      Observed variation: Small variability within groups, small variability across groups
        Reason? Small technical variability; no global changes
        What to do? Use quantile normalization (but not necessary)

      Observed variation: Large variability within groups, small variability across groups
        Reason? Large technical variability or batch effects within groups; no global changes
        What to do? Use quantile normalization

      Observed variation: Small variability within groups, large variability across groups
        Reason? Global technical variability or batch effects across groups
        What to do? Use quantile normalization
        Reason? Global biological variability across groups
        What to do? Do not use quantile normalization

      quantro will detect global differences due to both technical and biological variation
      Raw data alone cannot detect difference
  21. Final thoughts
      • Statistics matters in the analysis of any data!
      • Statistics can help identify relevant biological variation in genomics data
        – Differences in CpGs
        – Smoothing across genomic regions
      • Statistics can help eliminate unwanted technical variation in genomics data
        – “Batch effects”
  22. Acknowledgements
      Rafael Irizarry
      Funding: NIH R01 grants GM083084 and RR021967/GM103552.
  23. Feel free to send comments/questions:
      Twitter: @stephaniehicks
      Email: [email protected]
      Questions?
      [Figure labels: Normal distribution, Weibull distribution, Poisson distribution]