Bio 599: Lecture 26 (NAU Fall 2012)

Bioinformatics provenance tracking: record keeping and reproducibility


Greg Caporaso

November 27, 2012


  1. 3.

    Reproducible versus replicable
    • Replicable: the exact same conditions lead to concordant results
    • Reproducible: some experimental variation is allowed, but results are concordant
  2. 4.

    Can accurate taxonomy assignments be achieved?
    [Figure: order-level taxonomy assignments for the mock communities (5’ Mock1-3 and 3’ Mock1-3 versus Expected). Panels (a) and (b) show relative abundance (% of 16S rRNA gene sequences) for the 5’ primer and 3’ primer reads (Even1-3 versus Expected). G-test (goodness of fit): ** p < 0.01, * p < 0.05.]
  3. 6.

    Ley et al., Nature Reviews Microbiology, 2008. Worlds within worlds: evolution of the vertebrate gut microbiota.
    Can known differences between microbial communities be recaptured on Illumina?
  4. 7.

    Methods: Number of samples, sample types
    28 samples
    • Human body habitats
      – skin (n=3)
      – dorsal tongue (n=3)
      – gut (human feces) (n=5)
    • Environmental samples
      – soil (n=3)
      – freshwater lake (n=2)
      – freshwater creek (n=3)
      – ocean (n=3)
      – marine sediment (n=3)
    • Mock communities
      – 67 bacterial strains pooled at even abundance (n=3)
    • Replication: 5’ reads and 3’ reads analyzed independently
  5. 8.

    Can known differences between microbial communities be recaptured on Illumina?

    [Figure labels: Gut; Palm, tongue; Lake, Creek, Soil; Ocean, Marine sediment]
  6. 9.

    What does it mean for a bioinformatics experiment to be replicable?
    • Our experimental methods are not as ‘noisy’ as most.
    • The same commands on the same system should give you the same results (if the algorithm is deterministic).
  7. 10.

    Deterministic versus non-deterministic
    • Deterministic algorithm: a given input produces the same series of internal states and results in the same output.
    • Non-deterministic algorithm: a given input may produce different internal states and/or result in a different output.
      – Commonly probabilistic algorithms in bioinformatics
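
The distinction above can be illustrated with random subsampling, a common probabilistic step. This is a minimal sketch, not taken from the lecture: the function name `subsample`, the sequence ids, and the seed value are all hypothetical. It assumes that recording the seed is enough to make the step replicable on the same Python version; the exact subset chosen for a given seed is not guaranteed to match across Python versions.

```python
import random


def subsample(sequence_ids, n, seed=None):
    """Randomly pick n sequence ids without replacement.

    With seed=None the generator is seeded from the operating system,
    so repeated calls can return different subsets (non-deterministic).
    With a fixed seed, the same input yields the same subset.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(sequence_ids, n)


# Hypothetical input: 1000 made-up sequence identifiers.
ids = [f"seq_{i}" for i in range(1000)]

# Fixed seed: the two runs agree, provided the seed is recorded with the results.
first = subsample(ids, 10, seed=42)
second = subsample(ids, 10, seed=42)
print(first == second)

# No seed: the two subsets will almost certainly differ.
print(subsample(ids, 10) == subsample(ids, 10))
```

A practical consequence is that the seed belongs in the log file alongside the command that was run.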
  8. 11.

    Deterministic
    • Smith-Waterman alignment: if properly implemented, aligning two sequences will always give the same result
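
As a rough illustration of the point above, here is a small Smith-Waterman scoring sketch. It is not the lecture's code: the scoring values (match=2, mismatch=-1, linear gap=-2) and the example sequences are assumptions chosen for the example, and only the best local alignment score is computed. A full aligner also needs a fixed tie-breaking rule in the traceback to return the same alignment every time.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Uses the standard Smith-Waterman recurrence with a linear gap
    penalty. No randomness is involved, so the same inputs and the
    same scoring parameters always give the same score.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                            # start a new local alignment
                prev[j - 1] + substitution,   # align a[i-1] with b[j-1]
                prev[j] + gap,                # gap in b
                curr[j - 1] + gap,            # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Repeating the same alignment yields a single distinct score.
scores = {smith_waterman_score("ACACACTA", "AGCACACA") for _ in range(5)}
print(len(scores))
```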
  9. 13.

    Reproducible
    • Some variation in experimental conditions is allowed
    • Tends to be what biologists are interested in, as it’s nearly impossible to truly replicate an experiment (e.g., microevolution) and it often would be too expensive (e.g., replicating the Human Microbiome Project)
    • More interesting than replicability: is your conclusion robust?
  10. 16.

    Fig. 1. The spectrum of reproducibility. R. D. Peng, Science 2011;334:1226-1227. Published by AAAS.
  11. 18.

    Keep code in public revision control systems
    • Git
    • Subversion (svn)
    • CVS (ancient history)
    • Allow for viewing history of changes, obtaining previous versions.
    • Example:
      – http://
  12. 19.

    Revision control
    • A repository of files where:
      – Modifications to files can be made and tracked (i.e., you know who made them and when)
      – You can revert to previous revisions (roll back changes or access something that was previously deleted)
      – The repository can be made public so others can view your history, access different versions, or collaborate with you.
  13. 20.

    Benefits of public revision control
    • For developers, it’s a resume builder. Showcases your development and communication skills.
    • Scientific integrity: providing others access to your source (including old versions) allows them to reproduce your analysis.
    • Others can contribute.
    • Less experienced developers can learn from your code.
  14. 21.

    Using GitHub for community-driven open source projects (e.g., QIIME*)
    • Only lead developers have push access.
    • All devs have pull access (even though it’s not technically necessary).
    • All pushes to master go in as pull requests, including pushes from the lead developers.
    • Code reviews are performed using the GitHub pull requests.
    • Discussion of new code should happen on the page associated with the pull request, not by email.
    • All feature requests and bug reports should happen via the GitHub issue tracker system.
    * https://
    As a development group, we’re fairly new to GitHub, so these strategies may change with time.
  15. 22.

    Perform analyses using virtual machines
    • A “guest” operating system running within a “host” operating system
    • A software implementation of a computer that operates like a physical computer.
    • A developer can create a virtual machine image which contains their tools pre-installed. Users can then instantiate that image to work with those tools.
    Browse this page: http://
  16. 23.

    Benefits that virtual machines offer bioinformatics
    • Reproducibility: can publish protocols with a virtual machine instance id.
    • Updates are the burden of the developer, not the user.
    • Coupled with cloud computing, it’s the perfect model for users with sporadic compute needs.
  17. 24.

    Write and use software that generates good log files
    • Ideally will supplement your lab notebook (for successful runs)
      – Version information
      – Exact command that was run
      – Any ‘subcommands’ that were run
      – Details on input files (path, md5)
      – System configuration details
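
One way the items listed above might be captured is sketched below. This is an illustration under stated assumptions, not the logging format of any particular tool: the function names `md5_of_file` and `build_run_log`, the JSON layout, the tool name `my_tool.py`, the version string, and the subcommand text are all hypothetical. A real script would normally pass `sys.argv` as the command.

```python
import hashlib
import json
import os
import platform
import shlex
import tempfile
from datetime import datetime, timezone


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_log(tool_version, argv, input_paths, subcommands=()):
    """Collect version, command, subcommands, inputs, and system details."""
    return {
        "started_utc": datetime.now(timezone.utc).isoformat(),
        "tool_version": tool_version,
        "command": shlex.join(argv),  # exact command, shell-quoted
        "subcommands": list(subcommands),
        "inputs": [{"path": p, "md5": md5_of_file(p)} for p in input_paths],
        "system": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "machine": platform.machine(),
        },
    }


# Demonstration on a throwaway input file (all names are hypothetical).
with tempfile.TemporaryDirectory() as workdir:
    fasta_path = os.path.join(workdir, "example.fasta")
    with open(fasta_path, "wb") as handle:
        handle.write(b">seq1\nACGT\n")

    log = build_run_log(
        tool_version="0.1.0-hypothetical",
        argv=["my_tool.py", "-i", fasta_path],
        input_paths=[fasta_path],
        subcommands=["hypothetical_subcommand --threads 1"],
    )

    log_path = os.path.join(workdir, "run_log.json")
    with open(log_path, "w") as handle:
        json.dump(log, handle, indent=2)
    with open(log_path) as handle:
        reloaded = json.load(handle)

print(json.dumps(log["system"], indent=2))
```

Writing the log as JSON keeps it both human-readable and easy to parse later, which helps when it is pasted into a computational lab notebook.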
  18. 25.

    Keep track of MD5 sums
    • A cryptographic hash function: a deterministic function which takes some input and returns a fixed-size string; changing the input should change the return value
    From Wikipedia:
    • it is easy (but not necessarily quick) to compute the hash value for any given message
    • it is infeasible to generate a message that has a given hash
    • it is infeasible to modify a message without changing the hash
    • it is infeasible to find two different messages with the same hash
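
The behavior described above can be checked with Python's standard `hashlib` module. This is a minimal sketch; the helper name `md5_hex` and the FASTA-style byte strings are made up for illustration. Note that the listed properties describe an ideal cryptographic hash: MD5 works well for detecting accidental changes to a file, but collisions for it can be constructed deliberately, so a function such as SHA-256 (`hashlib.sha256`) is the safer choice where tampering is a concern.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical inputs that differ by a single base.
original = b">seq1\nACGTACGT\n"
modified = b">seq1\nACGTACGA\n"

# Deterministic: hashing the same bytes twice gives the same digest.
print(md5_hex(original) == md5_hex(original))

# Changing the input changes the digest.
print(md5_hex(original) == md5_hex(modified))

# Fixed size: the digest length does not depend on the input length.
print(len(md5_hex(original)), len(md5_hex(b"")))
```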
  19. 26.

    Keep a computational lab notebook
    • What should these look like?
    • How should we use them?
    • In lieu of a good system, I use text files (but more recently am switching to IPython Notebooks and gist). What might a good system do?
  20. 27.

    Publish workflows with your paper
    • Allows others to easily reproduce your analyses by providing the exact list of commands that were run
    • Examples include the IPython Notebook and Galaxy, but there are others as well.
  21. 30.

    This work is licensed under the Creative Commons Attribution 3.0 United States License. To view a copy of this license, visit http:// or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
    Feel free to use or modify these slides, but please credit me by placing the following attribution information where you feel that it makes sense: Greg Caporaso,