Why Statistics Matters in the Analysis of Genomics Data

Stephanie Hicks
February 12, 2015

Presented at the LSU Computational Biology seminar and the LSUConnect event.

Transcript

  1. Why Statistics Matters:
    Analysis of Genomics Data
    Stephanie Hicks
    @stephaniehicks
    stephaniehicks.com

  2. Aug 2009: “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
    - Hal Varian, chief economist at Google
    2 min YouTube video

  3. http://www.forbes.com/sites/danwoods/2012/03/08/hilary-mason-what-is-a-data-scientist/
    @hmason
    Hilary Mason

  4. @jtleek

  5. Preparing students:
    high school
    http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
    Data source: The College Board

  6. http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf
    Data from the National Center for Education Statistics; Analysis by the ASA

  7. Preparing students:
    undergraduate and graduate
    http://magazine.amstat.org/blog/2013/05/01/stats-degrees/
    Data source: NCES Digest of Education Statistics
    Graduated from high school
    Completed two REUs in Biostatistics
    Graduated from LSU (BS, Mathematics); Started a PhD in Statistics

  8. Rapid change in technology
    Mardis (2011) Nature 470: 198-203
    “Next-generation sequencing”
    “Sanger sequencing”

  9. Hayden (2014) Nature 507: 294-295

  10. Questions:
    (1) How to process & normalize?
    (2) How to find differences between two or more groups?
    (3) How to find interesting genomic regions?

  11. Discuss these three questions and illustrate how statistics can help
    Focus on DNA methylation data
    (But these challenges are common to other areas of genomics)
    Morgan et al. (1999). Nature Genetics 23: 314-8
    http://epigenome.eu/en/2,48,873
    Bradbury (2003). PLoS Biology 1: e82

  12. DNA Methylation
    ATCGCGTTACTGCGGAA
    TAGCGCAATGTCGCCTT
    [Figure: six “m” labels on the two strands]

  13. http://www.learnnc.org/lp/pages/7828

  14. Measuring DNA Methylation
    Bock (2012). Nature Reviews Genetics 13, 705-719
    Problem: Which CpGs are differentially methylated between two groups?
    Some proposed statistical solutions: At each CpG, test if there is a difference using e.g. t-test, F-test or linear regression
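
A minimal sketch of the per-CpG test described above. The methylation values are hypothetical, and the function names `welch_t` and `permutation_p_value` are illustrative. To stay within the Python standard library, the p-value comes from an exact permutation test on the Welch t statistic rather than from the t distribution, so it is a stand-in for the t-test named on the slide and not the same procedure.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch two-sample t statistic (does not assume equal variances)."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    diff = mean(case) - mean(control)
    if se == 0:  # degenerate split: no spread within either group
        return 0.0 if diff == 0 else float("inf")
    return diff / se


def permutation_p_value(case, control):
    """Two-sided exact permutation p-value based on |t|."""
    pooled = list(case) + list(control)
    observed = abs(welch_t(case, control))
    hits = total = 0
    # Enumerate every way of relabelling len(case) of the samples as "case".
    for idx in combinations(range(len(pooled)), len(case)):
        chosen = set(idx)
        perm_case = [pooled[i] for i in idx]
        perm_control = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        # Small tolerance so the observed labelling always counts itself.
        if abs(welch_t(perm_case, perm_control)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical methylation levels (0 to 1) at two CpGs, five samples per group.
cpgs = {
    "CpG #1": ([0.82, 0.78, 0.85, 0.80, 0.88], [0.40, 0.45, 0.38, 0.50, 0.42]),
    "CpG #2": ([0.55, 0.62, 0.48, 0.70, 0.51], [0.50, 0.60, 0.47, 0.66, 0.58]),
}
for name, (case, control) in cpgs.items():
    p = permutation_p_value(case, control)
    call = "differentially methylated" if p < 0.05 else "not differentially methylated"
    print(f"{name}: p = {p:.3f} ({call})")
```

With five samples per group there are only 252 relabellings, so full enumeration is cheap; for larger groups a random subset of permutations would be the usual substitute.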

  15. t-test for Differential Methylation
    [Figure: methylation level (0 to 1) by status (Case vs. Control) for CpG #1]
    p-value = 0.034 < 0.05
    CpG is differentially methylated

  16. t-test for Differential Methylation
    [Figure: methylation level (0 to 1) by status (Case vs. Control) for CpG #2]
    p-value = 0.343 > 0.05
    CpG is not differentially methylated

  17. 600 CpGs
    36 samples (6 cell types)
    Jaffe and Irizarry (2014). Genome Biol 15: R31.

  18. What about neighboring CpGs?
    Problem: If one CpG is methylated, would a nearby CpG also be methylated?
    Some proposed solutions:
    (1) Can we find runs of two or more differentially methylated CpGs?
    • If p-value < 0.05 for CpG #1, #2, #3, etc.
    • Caution: multiple testing
    (2) Can we smooth across CpGs and find genomic regions that are differentially methylated?
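
To make the multiple-testing caution concrete: if 600 CpGs were each tested at the 0.05 level and none were truly different, about 30 would still be flagged by chance. The sketch below adjusts p-values and then looks for runs of neighboring significant CpGs. Bonferroni and Benjamini-Hochberg are standard corrections chosen here for illustration; the slide does not name a specific method, and the p-values and function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def find_runs(significant, min_length=2):
    """Return (start, end) index pairs of consecutive True flags."""
    runs, start = [], None
    for i, flag in enumerate(list(significant) + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs


# Hypothetical p-values for eight neighboring CpGs, in genomic order.
p_values = [0.001, 0.004, 0.003, 0.200, 0.034, 0.343, 0.002, 0.005]
adjusted = benjamini_hochberg(p_values)
flags = [q < 0.05 for q in adjusted]
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", adjusted)
print("Runs of significant CpGs:", find_runs(flags))
```

Bonferroni is the more conservative of the two; with many correlated neighboring CpGs, controlling the false discovery rate is often the more practical choice.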

  19. Smoothing Across Genomic Regions
    Jaffe et al. (2012) Int J Epidemiology

  20. Locally Weighted Scatterplot Smoothing (Loess)
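
A rough sketch of the loess idea, assuming the simplest variant: a local straight-line fit at each position with tricube weights and no robustness iterations. The `span` parameter, the function name, and the positions and methylation differences are all hypothetical; this is not the implementation used in the cited papers.

```python
def loess(xs, ys, span=0.5):
    """Locally weighted linear regression evaluated at each x in xs."""
    n = len(xs)
    k = min(n, max(3, round(span * n)))  # number of neighbors in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h == 0:
            h = 1e-12  # only coincident points get weight
        # Tricube weights: near points count most, points beyond h count zero.
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical CpG positions (bp) and case-minus-control methylation differences.
positions = [100, 150, 210, 260, 330, 400, 480, 520, 600, 680]
differences = [0.02, -0.01, 0.05, 0.21, 0.25, 0.19, 0.23, 0.03, -0.02, 0.01]
for pos, raw, smooth in zip(positions, differences, loess(positions, differences)):
    print(f"{pos:>4}  raw={raw:+.2f}  smoothed={smooth:+.2f}")
```

Smoothing borrows strength from neighboring CpGs, so a region where several adjacent sites shift together stands out more clearly than any single noisy site.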

  21. Irizarry (2009). Nature Genetics 41: 178-185

  22. Correlation ≠ Causation
    http://xkcd.com/552/

  23. Technical vs Biological Variation
    • Raw genomics data contains biases and unwanted technical variation
      – e.g. sequencing technology, batch effects
      – Can cause perceived differences between samples, irrespective of the biological variation
    • Changes in experimental conditions can be confused with biological variability
      – Can lead to false discoveries (e.g. finding DMRs)

  24. Bladder Normal and Cancer Samples

  25. Bladder Normal and Cancer Samples

  26. How do you pre-process and normalize noisy genomics data?

  27. Quantile Normalization
    • Most widely used multi-sample normalization
    • Originally developed for gene expression microarrays
    • Now applied to
      – Genotyping arrays, RNA-Sequencing, DNA methylation, ChIP-Sequencing & Brain imaging
    Can be very helpful in eliminating unwanted variation, e.g. “batch effects” (good), but has potential to wash out true biological variation (bad)

  28. How does it work?
    Quantile normalization is a non-linear transformation that replaces each intensity score with the mean of the features with the same rank from each array
    (1) Raw data
        2    4    4    5
        5   14    4    7
        4    8    6    9
        3    8    5    8
        3    9    3    5
    (2) Order values within each sample (or column)
        2    4    3    5
        3    8    4    5
        3    8    4    7
        4    9    5    8
        5   14    6    9
    (3) Average across rows and substitute value with average
        3.5  3.5  3.5  3.5
        5.0  5.0  5.0  5.0
        5.5  5.5  5.5  5.5
        6.5  6.5  6.5  6.5
        8.5  8.5  8.5  8.5
    (4) Re-order averaged values in original order
        3.5  3.5  5.0  5.0
        8.5  8.5  5.5  5.5
        6.5  5.0  8.5  8.5
        5.0  5.5  6.5  6.5
        5.5  6.5  3.5  3.5
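
The steps on this slide can be sketched in a few lines of Python, using the slide's own raw data with one list per sample (column). The function name is illustrative, and the tie-breaking rule is an assumption: tied values within a sample are ranked by order of appearance, so the two tied 5s in the last sample are assigned in the opposite order from the slide. Other implementations handle ties differently, for example by averaging.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of equal length)."""
    n = len(samples[0])
    # Step 2: order values within each sample.
    sorted_samples = [sorted(s) for s in samples]
    # Step 3: average across samples at each rank.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)]
    # Step 4: put each rank mean back at the position its value came from.
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # stable sort: ties keep input order
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Step 1: raw data from the slide, one list per sample (column).
raw = [
    [2, 5, 4, 3, 3],
    [4, 14, 8, 8, 9],
    [4, 4, 6, 5, 3],
    [5, 7, 9, 8, 5],
]
for sample in quantile_normalize(raw):
    print(sample)
# First sample prints [3.5, 8.5, 6.5, 5.0, 5.5]
```

After normalization every sample contains exactly the same set of values (3.5, 5.0, 5.5, 6.5, 8.5); only their order differs. That is the sense in which the method forces all samples onto a common distribution.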

  29. Gene expression of brain and liver tissues
    [Figure: density of log2 PM values; Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]

  30. Back to example: 6 purified cell types
    [Figure: density of beta values, DNA Methylation (450K arrays); Bcell, CD4T, CD8T, Gran, Mono, NK]

  31. Back to motivating example
    Should we use quantile normalization? Will we remove important biological variation?
    [Figure: density of beta values, DNA Methylation (450K arrays)]

  32. Back to motivating example (quantile normalized)
    Should we use quantile normalization? Will we remove important biological variation?
    [Figure: density of beta values, DNA Methylation (450K arrays)]

  33. quantro: Test for global changes between groups
    • quantro: R/Bioconductor package to test for the assumptions of quantile normalization
    [Figure: density of log2 PM values; Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10)]
    Main idea:
    • Compare variability within groups to variability between groups
    • If variability between groups > variability within groups, then there may be global changes across groups
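
The main idea can be illustrated with a small Python sketch. This is not the quantro package or its API (quantro is an R/Bioconductor package); it only assumes one plausible way to compare the two kinds of variability, namely a one-way ANOVA F ratio computed on per-sample medians. The function name, the choice of the median as the summary, and all data values are hypothetical.

```python
from statistics import mean, median


def between_within_ratio(groups):
    """F ratio of between-group to within-group variability of sample medians.

    groups maps a group name to a list of samples; each sample is a list of values.
    """
    summaries = {g: [median(s) for s in samples] for g, samples in groups.items()}
    all_values = [v for values in summaries.values() for v in values]
    grand_mean = mean(all_values)
    k, n = len(summaries), len(all_values)
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in summaries.values())
    ss_within = sum((x - mean(v)) ** 2 for v in summaries.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log2 intensities: groups whose distributions are globally shifted.
shifted = {
    "brain": [[7.9, 8.0, 8.1], [8.0, 8.1, 8.3], [7.8, 7.9, 8.2]],
    "liver": [[10.9, 11.0, 11.2], [11.0, 11.1, 11.3], [10.8, 10.9, 11.1]],
}
# Hypothetical groups that differ no more than samples within a group do.
similar = {
    "a": [[7.9, 8.0, 8.1], [8.0, 8.3, 8.4], [7.5, 7.7, 8.2]],
    "b": [[7.8, 8.1, 8.2], [7.9, 8.2, 8.5], [7.6, 7.8, 8.0]],
}
print("shifted groups:", between_within_ratio(shifted))
print("similar groups:", between_within_ratio(similar))
```

A ratio well above 1 suggests global differences between groups, in which case quantile normalization may remove real signal; a ratio near or below 1 suggests the groups share a common distribution.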

  34. Targeted vs Global changes
    [Figure: three density plots, panels A, B and C]
    A. Targeted changes (rlogTransformation counts; GG (n=18), AG (n=32), AA (n=15))
      Observed variation: Small variability within groups, small variability across groups
      Reason? Small technical variability; no global changes
      What to do? Use quantile normalization (but not necessary)
    B. Targeted changes (log2 PM values; Nonsmoker (n=15), Smoker (n=15), Asthmatic (n=15))
      Observed variation: Large variability within groups, small variability across groups
      Reason? Large technical variability or batch effects within groups; no global changes
      What to do? Use quantile normalization
    C. Global changes (log2 PM values; Brain (GSE17612, n=23), Brain (GSE21935, n=19), Liver (GSE29721, n=10), Liver (GSE14668, n=20), Liver (GSE39841, n=10))
      Observed variation: Small variability within groups, large variability across groups
      Reason? Global technical variability or batch effects across groups
      What to do? Use quantile normalization
      Reason? Global biological variability across groups
      What to do? Do not use quantile normalization
      Raw data alone cannot detect difference
    quantro will detect global differences due to both technical and biological variation

  35. Final thoughts
    • Statistics matters in the analysis of any data!
    • Statistics can help identify relevant biological variation in genomics data
      – Differences in CpGs
      – Smoothing across genomic regions
    • Statistics can help eliminate unwanted technical variation in genomics data
      – “Batch effects”

  36. Acknowledgements
    Rafael Irizarry
    Funding:
    NIH R01 grants GM083084 and RR021967/GM103552.

  37. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: [email protected]
    Questions?
    [Figure labels: Normal distribution, Weibull distribution, Poisson distribution]