GSBSE Seminar 12-11-2014

Steve Munger
December 11, 2014

This informal roundtable meeting discussed skills necessary for the next generation of geneticists to be successful. My focus was on Big genetics and Big data.

Transcript

  1. What is BIG Genetics?
     • BIG Experiments (100s to millions of samples)
       – Large experimental crosses
       – Population studies
     • BIG Multidimensional Data (Gb to Pb of data)
       – DNA/RNA/Methylation sequencing
       – Shotgun proteomics
       – Metabolomics
       – Large-scale phenotyping
     • BIG Complexity
       – Dealing with (and exploiting) high genetic diversity
       – Computational challenges (must use cloud or HPC resources)
       – Analytical/statistical challenges
       – Multiple testing problem – What is significant?
  2. More samples + more genetic diversity = more phenotypic diversity
     [Figure labels: 18M, 18M, 4M, 4M, 4M, 4M, 7M; 129S1/SvImJ; C57BL/6J; Brynn Voy]
  3. The Collaborative Cross: A large panel of recombinant inbred lines derived from eight inbred founder strains.
     [Figure label: CC001 – 98% homozygous]
  4. The complementary Diversity Outbred heterogeneous stock.
     [Figure labels: Collaborative Cross Funnel; Diversity Outbred; …; G2:F4-F12 mice from 144 different funnels; Random Outbreeding]
  5. Mouse Mapping/Reference Populations
     • Backcross/Intercross
     • Recombinant Inbred (RI) Strains – Collaborative Cross, BXD, AXB, others.
     • Consomic Strains – example A.B-C17 (strain A/J with Chromosome 17 from C57BL/6J).
     • Advanced Intercross Lines – Diversity Outbred, LG/SM AIL, HS-CC, Northport Stock, others.
     • Commercial Outbred Stocks – CD1/ICR, many others.
  6. BIG Challenges
     Basic programming skills for BIG data
     • R statistical language
     • Python or Perl
     • Bash/Linux
     • Visualization
     Basic Statistics for BIG data
     • Distributions, variance, significance, normalization, transformation
     • Multiple testing problem
     • Linear regression, mixed models, residuals, principal component analysis/singular value decomposition
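As an illustration of the "linear regression" and "residuals" items above, here is a minimal sketch of an ordinary least-squares line fit in plain Python. It is not taken from the slides: the function names `fit_line` and `residuals` and the data values are made up for illustration, and in practice R's `lm()` or a Python statistics library would do this work.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and cross-deviations of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed value minus fitted value for each data point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data, chosen only to demonstrate the calculation
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(x, y)
res = residuals(x, y, slope, intercept)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print("residuals:", [round(r, 2) for r in res])
```

With an intercept in the model, least-squares residuals sum to zero, so plotting them against `x` is a quick check for structure the straight line missed.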
  7. Know enough statistics to understand when you’re reading (or trying to publish) BS.
     • Understand what a distribution is and how to plot/characterize one: Normal/Gaussian, Poisson, negative binomial (NB)
     • What assumptions about distribution variance are made by specific significance tests (e.g. two-tailed Student’s t-test)?
       – Are you comparing two groups (treated/untreated) or a population?
     • How do you adjust significance thresholds to correct for multiple tests?
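One way to approach the last question above is sketched below in plain Python. The slides do not name a correction method; Bonferroni and Benjamini-Hochberg are shown here only as two common choices, and the p-values are made up for illustration. In R, `p.adjust()` covers both.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject a test only if its p-value is <= alpha / (number of tests)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest p
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from 10 tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p <= 0.05 for p in pvals))
print("Bonferroni: ", sum(bonferroni(pvals)))
print("BH (FDR):   ", sum(benjamini_hochberg(pvals)))
```

On these made-up values, five tests pass an uncorrected 0.05 threshold, one survives Bonferroni, and two survive Benjamini-Hochberg. Bonferroni controls the chance of any false positive, while Benjamini-Hochberg controls the expected fraction of false positives among the rejected tests, which is why it is less strict.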
  8. Learn how to look at your BIG data
     • Plotting functionality in R
     • Genome browsers like UCSC, IGV, JBrowse, etc.