GSBSE Seminar 12-11-2014

Steve Munger
December 11, 2014

This informal roundtable meeting discussed skills necessary for the next generation of geneticists to be successful. My focus was on Big genetics and Big data.

Transcript

  1. BIG Genetics
     Steve Munger, steven.munger@jax.org
     Slides are posted on https://speakerdeck.com/stevemunger

  2. What is BIG Genetics?
     • BIG Experiments (100s to millions of samples)
       – Large experimental crosses
       – Population studies
     • BIG Multidimensional Data (Gb to Pb of data)
       – DNA/RNA/Methylation Sequencing
       – Shotgun Proteomics
       – Metabolomics
       – Large-scale phenotyping
     • BIG Complexity
       – Dealing with (and exploiting) high genetic diversity
       – Computational challenges (must use cloud or HPC resources)
       – Analytical/Statistical challenges
       – Multiple testing problem: what is significant?
  3. BIG Experiments

  4. More samples + more genetic diversity = more phenotypic diversity
     [Figure; surviving labels: 129S1/SvImJ, C57BL/6J; 18M, 18M, 4M, 4M, 4M, 4M, 7M; Brynn Voy]
  5. The Collaborative Cross: a large panel of recombinant inbred lines derived from eight inbred founder strains.
     CC001: 98% homozygous
  6. The complementary Diversity Outbred heterogeneous stock.
     [Figure; surviving labels: Collaborative Cross funnel; Diversity Outbred; G2:F4-F12 mice from 144 different funnels; random outbreeding]
  7. Mouse Mapping/Reference Populations
     • Backcross/Intercross
     • Recombinant Inbred (RI) Strains – Collaborative Cross, BXD, AXB, others.
     • Consomic Strains – example A.B-C17 (Strain A/J with Chromosome 17 from C57BL/6J).
     • Advanced Intercross Lines – Diversity Outcross, LG/SM AIL, HS-CC, Northport Stock, others.
     • Commercial Outbred Stocks – CD1/ICR, many others.
  8. BIG Data

  9. ENCyclopedia Of DNA Elements (ENCODE)
     Credit: Darryl Leja, Ian Dunham
     Big Data
  10. DNA/RNA Sequencing

  11. FASTQ-formatted short reads
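
For anyone who has not opened one of these files: a FASTQ record is four lines (an `@`-prefixed read ID, the base calls, a `+` separator line, and one quality character per base). Below is a minimal Python sketch of a reader, assuming unwrapped four-line records and Phred+33 quality encoding. The function name `parse_fastq` and the two example reads are made up for illustration and are not from the slides.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line is also treated as the end)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so ASCII code minus 33 is the quality score.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical two-read input, not real data.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n")
records = list(parse_fastq(example))
for read_id, seq, scores in records:
    print(read_id, seq, "lowest base quality:", min(scores))
```

Real FASTQ files are usually gzip-compressed and far too large to hold in a list, which is why the sketch is a generator that yields one record at a time.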

  12. SAM/BAM-formatted read alignments
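
A SAM alignment line is tab-separated text with 11 mandatory fields followed by optional tags; BAM is the compressed binary form of the same information and needs a dedicated tool or library to read. Below is a rough Python sketch for plain-text SAM lines only. The helper name `parse_sam_line` and the example alignment are hypothetical.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

def parse_sam_line(line):
    """Split one SAM alignment line (not a header line) into a dict of fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("SAM alignment lines need at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = parts[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    # FLAG is a bit field: 0x4 marks an unmapped read, 0x10 a reverse-strand alignment.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record

# Hypothetical alignment line, not real data.
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
aln = parse_sam_line(line)
print(aln["rname"], aln["pos"], "reverse strand:", aln["is_reverse"])
```

Header lines (those starting with `@`) would need to be skipped before calling this helper.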

  13. UCSC Genome Browser (GB) file formats
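
The slide does not say which formats were shown, so as an assumption this sketch uses BED, one of the simplest: tab-separated chromosome, start, and end columns, where the start is 0-based and the end is not included in the feature. The helper name `parse_bed_line` and the example feature are made up for illustration.

```python
def parse_bed_line(line):
    """Parse the three required BED columns plus an optional name column."""
    chrom, start, end, *rest = line.rstrip("\n").split("\t")
    start, end = int(start), int(end)
    if end < start:
        raise ValueError("BED end coordinate must not be smaller than start")
    return {
        "chrom": chrom,
        "start": start,              # 0-based, inclusive
        "end": end,                  # exclusive
        "length": end - start,       # half-open interval, so no +1 is needed
        "first_base_1based": start + 1,
        "name": rest[0] if rest else None,
    }

# Hypothetical feature, not real data.
feature = parse_bed_line("chr1\t999\t2000\tfeatureA")
print(feature["chrom"], feature["first_base_1based"], feature["end"], feature["length"])
```

The off-by-one difference between 0-based BED coordinates and 1-based formats is a common source of bugs when moving intervals between tools.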

  14. BIG Challenges

  15. BIG Challenges
      Basic programming skills for BIG data
      • R statistical language
      • Python or Perl
      • Bash/Linux
      • Visualization
      Basic statistics for BIG data
      • Distributions, variance, significance, normalization, transformation
      • Multiple testing problem
      • Linear regression, mixed models, residuals, principal components analysis/singular value decomposition
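
To make "linear regression" and "residuals" concrete, here is a small sketch in plain Python (no libraries) that fits a straight line by ordinary least squares and computes the residuals. The function name `fit_line` and the numbers are invented for illustration; in practice you would more likely use `lm()` in R or a statistics package.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented example values, not real measurements.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
# A residual is the observed value minus the value predicted by the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model, least-squares residuals sum to zero, so what is worth looking at is their pattern (for example, plotted against `x`), not their total.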
  16. Ye better learn some R me mateys! http://www.r-project.org

  17. RStudio (rstudio.com, FREE)

  18. RStudio  

  19. Bourne Again SHell (Bash)
      Mac OS: Terminal window
      PC: Download Cygwin
  20. http://linuxcommand.org/learning_the_shell.php
      Be not afraid. If I could learn this at age 35, you can too.
  21. Get to know Python (at least a little bit)…
      www.python.org
  22. Get to know your High Performance Computing Cluster

  23. Know enough statistics to understand when you’re reading (or trying to publish) BS.
      • Understand what a distribution is and how to plot/characterize one: Normal/Gaussian, Poisson, negative binomial (NB)
      • What assumptions about distribution variance are made by specific significance tests (e.g. two-tailed Student’s t-test)?
        – Are you comparing two groups (treated/untreated) or a population?
      • How do you adjust significance thresholds to correct for multiple tests?
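
On the last question above, two widely used answers are the Bonferroni correction (controls the chance of any false positive) and the Benjamini-Hochberg procedure (controls the false discovery rate). The slides do not name a method, so treat this Python sketch as one illustration under that assumption; the function names and p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Invented p-values from ten hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected (p <= 0.05):", sum(p <= 0.05 for p in p_values))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Five of these ten tests pass an uncorrected 0.05 threshold, but far fewer survive either correction. With thousands of genes or markers tested at once, the gap is much larger.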
  24. Learn how to look at your BIG data
      • Plotting functionality in R
      • Genome browsers like UCSC, IGV, JBrowse, etc.