GSBSE Seminar 12-11-2014

Steve Munger
December 11, 2014

This informal roundtable meeting discussed skills necessary for the next generation of geneticists to be successful. My focus was on Big genetics and Big data.




  1. BIG Genetics. Steve Munger. Slides are posted online.

  2. What is BIG Genetics?
     • BIG Experiments (100s to millions of samples)
       – Large experimental crosses
       – Population studies
     • BIG Multidimensional Data (Gb to Pb of data)
       – DNA/RNA/Methylation Sequencing
       – Shotgun Proteomics
       – Metabolomics
       – Large-scale phenotyping
     • BIG Complexity
       – Dealing with (and exploiting) high genetic diversity
       – Computational challenges (must use cloud or HPC resources)
       – Analytical/Statistical challenges
       – Multiple testing problem: What is significant?
  3. BIG Experiments

  4. More samples + more genetic diversity = more phenotypic diversity
     (Figure: mice labeled 18M, 18M, 4M, 4M, 4M, 4M, 7M; strains 129S1/SvImJ and C57BL/6J; Brynn Voy)
  5. The Collaborative Cross: A large panel of recombinant inbred lines derived from eight inbred founder strains.
     CC001 – 98% Homozygous
  6. The complementary Diversity Outbred heterogeneous stock.
     (Figure: Collaborative Cross funnel; G2:F4-F12 mice from 144 different funnels; random outbreeding; Diversity Outbred)
  7. Mouse Mapping/Reference Populations
     • Backcross/Intercross
     • Recombinant Inbred (RI) Strains – Collaborative Cross, BXD, AXB, others.
     • Consomic Strains – example A.B-C17 (Strain A/J with Chromosome 17 from C57BL/6J).
     • Advanced Intercross Lines – Diversity Outbred, LG/SM AIL, HS-CC, Northport Stock, others.
     • Commercial Outbred Stocks – CD1/ICR, many others.
  8. BIG Data

  9. Big Data: ENCyclopedia Of DNA Elements (ENCODE)
     Credit: Darryl Leja, Ian Dunham
  10. DNA/RNA  Sequencing  

  11. FASTQ-formatted short reads
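For orientation: a FASTQ record is four lines, namely an `@`-prefixed read ID, the base sequence, a `+` separator, and a quality string with one character per base. The sketch below is a minimal illustrative parser, not a production tool. The function name `read_fastq` and the example read are made up for this note, and the code assumes unwrapped four-line records with Phred+33 quality encoding.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from an unwrapped 4-line FASTQ stream.

    Assumes Phred+33 quality encoding. Stops at end of file or at the first blank line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for: " + header)
        # Convert each quality character to its integer Phred score.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical one-read example; 'I' encodes Phred 40 and '!' encodes Phred 0.
example = "@read1\nACGTN\n+\nIIII!\n"
records = list(read_fastq(io.StringIO(example)))
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

For a real file, the same generator can be fed an open file handle instead of the in-memory string.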

  12. SAM/BAM-formatted read alignments
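As a rough illustration: each SAM alignment line has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), and BAM is the compressed binary form of the same records. The sketch below parses one text line and checks two FLAG bits. The helper names and the example alignment are invented for this note, and optional tag fields are simply ignored.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one SAM alignment line (optional tags ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(rec):
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Hypothetical alignment: read1 maps to chr1 at 1-based position 100, reverse strand.
line = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTN\tIIII!"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, is_unmapped(rec), is_reverse_strand(rec))
```

Header lines in a SAM file start with `@` and would need to be skipped before calling the parser.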

  13. UCSC Genome Browser file formats

  14. BIG Challenges

  15. BIG Challenges
     Basic programming skills for BIG data
     • R statistical language
     • Python or Perl
     • Bash/Linux
     • Visualization
     Basic Statistics for BIG data
     • Distributions, variance, significance, normalization, transformation
     • Multiple testing problem
     • Linear regression, mixed models, residuals, principal components analysis/singular value decomposition
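To make "linear regression" and "residuals" concrete, here is a small sketch of an ordinary least-squares fit of a straight line, written in plain Python so that nothing is hidden inside a library call. The function names and the five data points are invented for illustration. In practice the same fit would usually be done with R's `lm` or a statistics package.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed value minus fitted value for each point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data: a roughly linear response with a little noise.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
res = residuals(x, y, slope, intercept)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("residuals:", [round(r, 3) for r in res])
```

With an intercept in the model, least-squares residuals sum to zero up to rounding, which makes a quick sanity check on any fit.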
  16. Ye better learn some R me mateys! http://www.r-

  17. RStudio (FREE)

  18. RStudio  

  19. Bourne Again Unix SHell (BASH)
     Mac OS: Terminal window
     PC: Download Cygwin
  20. Be not afraid. If I could learn this at age 35, you can too.
  21. Get to know Python (at least a little bit)…
  22. Get to know your High Performance Computing Cluster

  23. Know enough statistics to understand when you're reading (or trying to publish) BS.
     • Understand what a distribution is and how to plot/characterize one: Normal/Gaussian, Poisson, negative binomial (NB)
     • What assumptions about distribution variance are made by specific significance tests (e.g. two-tailed Student's t-test)?
       – Are you comparing two groups (treated/untreated) or a population?
     • How do you adjust significance thresholds to correct for multiple tests?
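One way to picture the multiple-testing question is to compare two standard corrections on the same list of p-values: Bonferroni, which controls the chance of any false positive, and Benjamini-Hochberg, which controls the false discovery rate. The sketch below is illustrative only. The function names and the ten p-values are made up, and real analyses would normally use R's `p.adjust` or an equivalent library routine.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values from ten hypothetical tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

uncorrected = [p <= 0.05 for p in pvals]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print("uncorrected:", sum(uncorrected), "Bonferroni:", sum(bonf), "BH:", sum(bh))
```

On these numbers, five tests pass an uncorrected 0.05 threshold, two survive Benjamini-Hochberg, and one survives Bonferroni, which is the usual ordering from most lenient to most strict.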
  24. Learn how to look at your BIG data
     • Plotting functionality in R
     • Genome browsers like UCSC, IGV, JBrowse, etc.