geo-bio_presentation.pdf

John
June 22, 2014

Transcript

  1. Outline of Today's Schedule
     • Introductions
     • Microbial Ecology and Bioinformatics Background
     • Cloud Computing
     • Command line interface
     • QIIME
     • Mapping Files
     • Data analysis
       – Beta-Diversity
       – Alpha-Diversity
       – Summarize Taxa
       – Bar Charts
       – Joined Read Comparisons (Procrustes)
  2. … or is that we can only keep a tiny fraction of them in captivity?
  3. … as shown by our rapid expansion in our knowledge to uncultured
  4. What you do with the DNA is up to you …
     Credit: Rob Moser
  5. Isolate the small subunit ribosomal RNA gene to "fingerprint" different microbial organisms.
     Why this gene?
     • It's ubiquitous.
     • It contains regions that are identical across organisms, and regions that are variable across organisms.
  6. Sequence the rRNA from all samples on a "high-throughput" DNA sequencer
     Pool samples and sequence
     Micah Hamady, et al., Nature Methods, 2008. Error-correcting barcodes for pyrosequencing hundreds of samples in multiplex.
     Per-sample rRNA:
       >GCACCTGAGGACAGGCATGAGGAA…
       >GCACCTGAGGACAGGGGAGGAGGA…
       >TCACATGAACCTAGGCAGGACGAA…
       >CTACCGGAGGACAGGCATGAGGAT…
       >TCACATGAACCTAGGCAGGAGGAA…
       >GCACCTGAGGACACGCAGGACGAC…
       >CTACCGGAGGACAGGCAGGAGGAA…
       >CTACCGGAGGACACACAGGAGGAA…
       >GAACCTTCACATAGGCAGGAGGAT…
       >TCACATGAACCTAGGGGCAAGGAA…
       >GCACCTGAGGACAGGCAGGAGGAA…
  7. Which microbial organisms are represented by the rRNA gene sequences in each sample?
       >PC.634_1 FLP3FBN01ELBSX
       CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGG
       CTACGCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATG
       CGCCGCAGGTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATAC
       TGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGC
       AGTTATCCCGGACACATGGGCTAGG
       >PC.634_2 FLP3FBN01EG8AX
       TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCC
       TATCCATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGG
       AACGCATCCCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCG
       GAAGAACTATGCCATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCC
       CGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGT
       >PC.354_3 FLP3FBN01EEWKD
       TTGGGCCGTGTCTCAGTCCCAATGTGGCCGATCAGTCTCTTAACTCGG
       CTATGCATCATTGCCTTGGTAAGCCGTTACCTTACCAACTAGCTAATG
       CACCGCAGGTCCATCCAAGAGTGATAGCAGAACCATCTTTCAAACTCT
       AGACATGCGTCTAGTGTTGTTATCCGGTATTAGCATCTGTTTCCAGGT
       GTTATCCCAGTCTCTTGGG
     Search against reference sequences
     rRNA reference database: RefSeq 1, RefSeq 2, RefSeq 3, RefSeq 4, RefSeq 5, RefSeq 6, RefSeq 7, RefSeq 8, RefSeq 9, RefSeq 10
  8. Search against reference sequences
     (Figure: the three reads PC.634_1, PC.634_2 and PC.354_3 from slide 7 are searched against RefSeq 1 to RefSeq 10.)
     Which microbial organisms are represented by the rRNA gene sequences in each sample?
  9. Comparing microbial communities
     Who is there?
     How many "species" are there?
     How similar are pairs of samples?
  10. www.qiime.org
      • Assign reads to samples
      • Assign millions of sequences from thousands of samples to reference
      • Compare samples statistically and visually
      (Figure: the per-sample reads from slide 6 are matched against RefSeq 1 to RefSeq 10.)
  11. PCoA shows differences between human body sites
      Bacterial Community Variation in Human Body Habitats Across Space and Time, Costello et al., Science 2009
  12. The cloud
      What is the cloud?
      "Cloud computing refers to the delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients." Source: Wikipedia
      Why do we want to use it?
      • Local resources limited.
      • Economies of scale make it cheap and reliable.
      • No hardware maintenance.
  13. Cloud computing options
      • Amazon Elastic Compute Cloud (EC2)
      • GoGrid
      • Magellan – Argonne's DOE Cloud Computing
      • Data Intensive Academic Grid (DIAG) – Institute for Genome Sciences (IGS), University of Maryland School of Medicine (UMSOM)
  14. Cloud nomenclature 1/3
      Virtual Machine: Software implementation of a machine which executes programs like a physical machine.
      Instance: Amazon's name for a virtual machine.
      AWS: Amazon Web Services. Collection of services Amazon offers for cloud computing.
      ECU: CPU equivalent that Amazon uses to guarantee consistent performance across instances.
  15. Cloud nomenclature 2/3
      IAM: Amazon service for managing users, groups and permissions within our AWS account.
      EC2: Amazon service for launching virtual machines (instances) and storing EBS volumes.
      AMI: Amazon Machine Image. A clone of an instance. You can use them to launch several instances with the same configuration.
  16. Cloud nomenclature 3/3
      S3: Amazon service for storing certain data in the cloud.
      Volumes: cloud hard drives for your instances. They can only be accessed through an instance.
      EBS: Elastic Block Store. This is where you can store volumes that instances can access.
      Snapshots: a copy of a volume at a certain moment. Useful for backing up and sharing volumes.
  17. StarCluster
      What is it?
      StarCluster is a program that allows us to build our own cluster using Amazon EC2 instances.
      Why do we want to use it?
      • Supercomputer power without hardware maintenance
      • Command line manipulation
      • Flexibility
  18. StarCluster
      StarCluster allows us to emulate a cluster using only VMs/AMIs. StarCluster coordinates all of the interaction between the various nodes.
  19. Why people/computer-geeks love *nix based OS
      • Multitasking: multiprocessing & multiuser
      • Very efficient virtual memory (RAM & Swap)
      • Access control & security
      • A unified file system
      • Available for a wide range of machines
      • Optimized for development
  20. What's UNIX
      • An operating system (43 years old)
      • Commonly used for critical tasks
        ◦ Physics & Math
        ◦ SCIENCE!
      • Base for QIIME
      • Powerful
  21. Who cares about UNIX
      Just to mention a few ...
      • QIIME
      • VirtualBox
  22. Downside
      Taken from Biomedical Digital Signal Processing: C-Language Examples and Laboratory Experiments for the IBM PC, Willis J. Tompkins (Editor), p. 18, circa 1992.
  23. Talking to UNIX
      • Through a terminal emulator, more specifically using a shell ...
  24. Launching a terminal window
      • In Mac OS X, in a Finder window or the Desktop:
        ◦ command + shift + u
        ◦ Search for "Terminal"
      • Ubuntu: in the sidebar, search for the terminal icon
  25. How do you talk to UNIX?
      Open a terminal window and try them ...
      • Get the current date and time:
          date
  26. How do you talk to UNIX?
      • See what your user name is:
          whoami
  27. Home Sweet Home
      Can be referred to as: $HOME or ${HOME}
      Or also as: ~
      Mac OS X: /Users/username/
      Linux based systems: /home/username/
      (username stands for your user name on your machine)
      Try:
          echo ${HOME}
          ls $HOME
          ls ~/
  28. Paths (absolute)
          /Users/yoshiki/evident-data/hmp-v13_arare/alpha_div
          $HOME/evident-data/hmp-v13_arare/alpha_div
          ~/evident-data/hmp-v13/alpha_div
      A slash at the beginning of a path denotes it as an absolute path, i.e. one that starts from the base of your hard drive. $HOME and ~ expand to such a path.
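The expansion described above can be sketched in a few lines of Python. This is only an illustration of what the shell does with `~` and `$HOME`, not how the shell implements it; `describe_path` is a hypothetical helper, and the home directory is passed in explicitly so the example does not depend on your machine.

```python
import posixpath


def describe_path(path, home):
    """Expand a leading ~ or $HOME using `home`, then report whether the result is absolute."""
    home = home.rstrip("/")
    if path.startswith("~/"):
        expanded = home + path[1:]
    elif path.startswith("$HOME/"):
        expanded = home + path[len("$HOME"):]
    else:
        expanded = path
    # posixpath.isabs is True exactly when the path starts with a slash.
    return expanded, posixpath.isabs(expanded)


print(describe_path("~/evident-data/hmp-v13_arare/alpha_div", "/Users/yoshiki"))
print(describe_path("evident-data/hmp-v13_arare/alpha_div", "/Users/yoshiki"))
```

The second call shows a relative path: without a leading slash (or a `~`/`$HOME` prefix) it is resolved from your current working directory instead.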
  29. Folders, files and their information
      List files from your current working directory:
          ls
      List all the files in your current directory, including hidden files:
          ls -a
      List files in your Downloads folder, in the long format, sorted by time:
          ls -lt ~/Downloads
  30. Navigating your machine
      Change from your current directory to your home:
          cd
      Change from your current directory to the directory above it (its parent):
          cd ..
      Change from your current directory to your documents folder:
          cd ~/Documents
      Press Tab to autocomplete.
  31. Making, copying and moving stuff
      Make a new directory:
          mkdir AnExample
      Move a file/folder or change its name:
          mv oldname.txt newname.txt
          mv Files/ NewName/
      Copy a file:
          cp homework.txt backup_homework.txt
      Copy a directory:
          cp -r NewName Files
  32. Making, copying and moving stuff
      Make a new directory:
          mkdir AnExample
      Move a file/folder or change its name:
          mv oldname.txt newname.txt
          mv Documents/ NewName/
      Copy a file:
          cp homework.txt backup_homework.txt
      Copy a directory:
          cp -r physics1 physics2
      All these commands follow an old (source) -> new (destination) scheme.
  33. Star
      It's a wildcard.
      List anything that ends with .txt:
          ls *.txt
      List anything with the letter t:
          ls *t*
      Copy all text files to your desktop:
          cp *.txt ~/Desktop/
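To see which names a star pattern would select without touching any files, the matching can be tried out in Python. This is a minimal sketch using the standard-library `fnmatch` module; the file names are made up for illustration, and `match_names` is a hypothetical helper. One difference to keep in mind: in the shell, `*` skips hidden files (names starting with a dot), while `fnmatch` does not.

```python
from fnmatch import fnmatchcase


def match_names(names, pattern):
    """Return the names that a shell-style pattern such as *.txt would select."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical directory listing.
names = ["notes.txt", "map.csv", "todo.txt", "script.py"]

print(match_names(names, "*.txt"))  # names ending in .txt
print(match_names(names, "*t*"))    # names containing the letter t
```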
  34. Removing files
      Remember to be careful, be very careful: there is no undo for this command.
      Remove a file:
          rm some_file.txt
      Remove a folder with things inside it:
          rm -r UselessFolder/
      Force the removal of a folder:
          rm -rf UselessFolder/
  35. Compression and decompression
      Using zip to compress:
          zip compressed.zip bigfile.txt
          zip -r compressedFolder.zip BigFolder
      Using zip to decompress things:
          unzip compressed.zip
      Using tar to compress a folder or file:
          tar -czf output.tgz BigFolder
      Using tar to decompress things:
          tar -xzf output.tgz
  36. Permissions
      Allow all to have write permissions to a file:
          chmod a+w file.txt
      Allow all to have write permissions to a folder and its contents:
          chmod -R a+w folder/
      Remove the write permission on a file for everyone:
          chmod a-w file.txt
      Remove the write permission on a folder and its contents for everyone:
          chmod -R a-w folder/
      To see how permissions change, use:
  37. Inspecting a text file
      Print a file on the screen:
          cat file.txt
      Inspect the contents (to exit these commands type q):
          less file.txt
          more file.txt
      Count the lines, words and bytes of a file:
          wc file.txt
      Count the lines of a file:
          wc -l file.txt
      more is less :P
  38. Inspecting parts of a file
      Seeing the first few lines of a file:
          head file.txt
      Seeing the first N lines of a file:
          head -n 20 file.txt
      Seeing the last few lines of a file:
          tail file.txt
      Seeing the last N lines of a file:
          tail -n 20 file.txt
  39. Searching the contents of a file
      Searching for text in a file:
          grep "yet" file.txt
      Searching for text in a file and showing 2 lines after ("-A") or 2 lines before ("-B") each match:
          grep -A 2 "yet" file.txt
          grep -B 2 "yet" file.txt
      Searching for text and highlighting the matches:
          grep --color "yet" file.txt
  40. Operators
      | operator connects commands:
          cat file.txt | less
          cat file.txt | grep 'yet'
      > saves to a file:
          cat file.txt | grep 'yet' > result.txt
      >> appends to a file:
          cat file.txt | grep 'yet' >> result.txt
      & sends to the background:
          cat file.txt | grep 'yet' >> result.txt &
  41. Getting help
      Each command has its own way, e.g.
      As an argument:
          zip -h
          tar --help
      Using the manual reference command (to exit the manual just type q):
          man ls
          man grep
  42. Binaries, scripts, programs etc ...
      • Try the following:
          which ls
      • PATH has a lot of information, it's the "route":
          echo $PATH
      • To execute something from your current working directory, ensure it has the right permissions, then:
          ./program_test
        If it doesn't have permissions, try:
          chmod a+x program_test
  43. Special shortcuts
      • halt and kill a command
      • stop a command (doesn't kill it)
      • go to the beginning of the line in a terminal window
      • go to the end of the line in a terminal window
      This applies to any operating system.
  44. QIIME Structure
      • Set of scripts to perform certain functions.
      • Integrates other software.
      • Allows an easy workflow.
      • Keys, wallet, phone: print_qiime_config.py
  45. QIIME commands
      Get help with the -h option:
          pick_otus.py -h
      Command names are self-explanatory.
      Filtering:
          filter_fasta.py
          filter_otus_by_sample.py
          filter_distance_matrix.py
      Sorting:
          sort_otu_table.py
  46. Script types
      Single Task
      • One step.
      • Most of them.
      Workflows
      • Multiple scripts in one.
      • Uses a log file.
      • Indicated in the script description.
  47. (Workflow diagram, www.QIIME.org)
      • Sequencing output (454, Illumina, Sanger): fastq, fasta, qual, or sff/trace files
      • Metadata mapping file
      • Pre-processing: e.g., remove primer(s), demultiplex, quality filter
      • Denoise 454 Data: PyroNoise, Denoiser
      • Pick OTUs and representative sequences
        – Reference based: BLAST, UCLUST, USEARCH
        – De novo: e.g., UCLUST, CD-HIT, MOTHUR, USEARCH
      • Assign taxonomy: BLAST, RDP Classifier
      • Align sequences: e.g., PyNAST, INFERNAL, MUSCLE, MAFFT
      • Build 'OTU table', i.e., sample by observation matrix
      • Build phylogenetic tree: e.g., FastTree, RAxML, ClearCut
      • Database Submission (In development)
      • OTU (or other sample by observation) table
      • Phylogenetic Tree: evolutionary relationship between OTUs
      • α-diversity and rarefaction: e.g., Phylogenetic Diversity, Chao1, Observed Species
      • β-diversity and rarefaction: e.g., Weighted and unweighted UniFrac, Bray-Curtis, Jaccard
      • Interactive visualizations: e.g., PCoA plots, distance histograms, taxonomy charts, rarefaction plots, network visualization, jackknifed hierarchical clustering
      Legend:
      • Required step or input
      • Optional step or input
      • Currently supported for marker-gene data only (i.e., 'upstream' step)
      • Currently supported for general sample by observation data (i.e., 'downstream' step)
  48. QC and split libraries (workflow diagram as in slide 47)
  49. Building an OTU table (workflow diagram as in slide 47)
  50. Alpha and Beta diversity (workflow diagram as in slide 47)
  51. Visualizations (workflow diagram as in slide 47)
  52. QIIME
      • The code is tested (properly)
      • The documentation is updated constantly based on users' suggestions
      • The help in the QIIME-forum has a collaborative spirit (developers & users sharing their research experiences)
  53. Import mapping files from google docs
      • In google docs, publish the mapping file to the web
      • In a terminal, run load_remote_mapping_file.py:
          load_remote_mapping_file.py -k 0AnzomiBiZW0ddDVrdENlNG5lTWpBTm5kNjRGbjVpQmc -w Fasting_Map -o example2_map.txt
  54. Split libraries
          split_libraries_fastq.py -o illumina/slout/ \
            -i illumina/raw/subsampled_s_1_sequence.fastq,illumina/raw/subsampled_s_2_sequence.fastq,illumina/raw/subsampled_s_3_sequence.fastq,illumina/raw/subsampled_s_4_sequence.fastq,illumina/raw/subsampled_s_5_sequence.fastq,illumina/raw/subsampled_s_6_sequence.fastq \
            -b illumina/raw/subsampled_s_1_sequence_barcodes.fastq,illumina/raw/subsampled_s_2_sequence_barcodes.fastq,illumina/raw/subsampled_s_3_sequence_barcodes.fastq,illumina/raw/subsampled_s_4_sequence_barcodes.fastq,illumina/raw/subsampled_s_5_sequence_barcodes.fastq,illumina/raw/subsampled_s_6_sequence_barcodes.fastq \
            -m illumina/raw/filtered_mapping_l1.txt,illumina/raw/filtered_mapping_l2.txt,illumina/raw/filtered_mapping_l3.txt,illumina/raw/filtered_mapping_l4.txt,illumina/raw/filtered_mapping_l5.txt,illumina/raw/filtered_mapping_l6.txt
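The command above receives one barcode file per sequence file, plus mapping files that tie barcodes to samples. The core idea of assigning reads to samples can be sketched as follows. This is a toy illustration only, not QIIME's implementation: it does no quality filtering or barcode error correction, `demultiplex` is a hypothetical helper, and the barcodes and reads are invented (the sample IDs are borrowed from slide 7).

```python
def demultiplex(reads, barcodes, barcode_to_sample):
    """Group reads by sample, using the barcode read found at the same position."""
    if len(reads) != len(barcodes):
        raise ValueError("need exactly one barcode per read")
    by_sample = {}
    unassigned = []
    for read, barcode in zip(reads, barcodes):
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode is not listed in the mapping: the read cannot be assigned.
            unassigned.append(read)
        else:
            by_sample.setdefault(sample, []).append(read)
    return by_sample, unassigned


# Hypothetical barcode-to-sample mapping and reads.
mapping = {"AACG": "PC.634", "GGTA": "PC.354"}
reads = ["CTGGGCC", "TTGGACC", "TTGGGCC", "GATTACA"]
barcodes = ["AACG", "AACG", "GGTA", "TTTT"]

by_sample, unassigned = demultiplex(reads, barcodes, mapping)
print(by_sample)
print(unassigned)
```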
  55. Join paired ends
      Paired-end tag sequencing: sequence in from both ends of the amplicon. The goal is to have enough sequence overlap to stitch the two directions together.
      Example: V1V2 region of 16S with 250bp paired-end sequencing on Illumina MiSeq. Amplicon length is 320bp, leaving >50bp overlap after accounting for barcode/primers.
      New in QIIME 1.8: join_paired_ends.py
      Methods available:
      1. fastq-join
      2. SeqPrep
      Input files: forward reads FASTQ, reverse reads FASTQ, optional FASTQ of barcode reads.
      Important settings: --min_overlap sets the minimum number of base pairs in the overlapping region.
      Figure: Masella et al. BMC Bioinformatics 13, 31 (2012)
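The overlap in the example can be estimated with simple arithmetic: two reads of length L that start at opposite ends of an amplicon of length A share 2L - A bases. The sketch below is a rough estimate under that assumption; `expected_overlap` is a hypothetical helper, and the slide does not give the barcode/primer lengths, so the value of 30 bases per read is purely illustrative.

```python
def expected_overlap(read_length, amplicon_length, non_amplicon_bases_per_read=0):
    """Bases covered by both reads of a pair; zero or less means the reads do not meet."""
    usable = read_length - non_amplicon_bases_per_read
    return 2 * usable - amplicon_length


# Slide example: 250bp reads on a 320bp amplicon, ignoring barcode/primers.
print(expected_overlap(250, 320))
# Same example with a hypothetical 30 non-amplicon bases at the start of each read.
print(expected_overlap(250, 320, non_amplicon_bases_per_read=30))
# Reads that are too short to reach each other give a negative value.
print(expected_overlap(100, 320))
```

An estimate like this is the number to compare against --min_overlap before joining.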
  56. Denoising
      Errors are introduced in tag sequencing experiments during PCR amplification and sequencing. "Denoising" methods are designed to detect low-abundance variants with potential errors and merge them with higher-abundance variants (hopefully error-free).
      Methods available in QIIME (both are 454-specific):
      1. AmpliconNoise: Quince BMC Bioinfor. 12, 38 (2011)
      2. Denoiser: Reeder Nat. Meth. 7, 668 (2010)
      Benefits vs. disadvantages:
      • Good for eliminating spurious OTUs, thus gets you closer to the number of "true" OTUs.
      • Far from perfect, thus not a good option to accurately identify rare species. Gaspar PLOS One 8, e60458 (2014)
      • Learns from input data, so the answer will change as data is added/removed. Claesson Nature 488, 178 (2012)
      Figure: Bragg Nat. Meth. 9, 425 (2012)
      http://qiime.org/tutorials/denoising_454_data.html
  57. OTU table
      Feature X Sample table (contingency table): features (OTUs) are rows, samples are columns.

      Count table
               S1     S2     S3
        OTU1   100    0      0
        OTU2   100    40     600
        OTU3   0      10     0

      Relative abundance table
               S1     S2     S3
        OTU1   .5     0      0
        OTU2   .5     .8     1.0
        OTU3   0      .2     0
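The relative abundance table is obtained from the count table by dividing every count by the total of its sample (column). A minimal sketch, using the numbers from the slide; the nested-dictionary layout and the `relative_abundance` helper are choices made for this illustration, not QIIME's data format.

```python
def relative_abundance(counts):
    """Convert {otu: {sample: count}} into per-sample fractions."""
    totals = {}
    for per_sample in counts.values():
        for sample, count in per_sample.items():
            totals[sample] = totals.get(sample, 0) + count
    return {
        otu: {
            # A sample with no reads at all gets 0.0 instead of dividing by zero.
            sample: (count / totals[sample] if totals[sample] else 0.0)
            for sample, count in per_sample.items()
        }
        for otu, per_sample in counts.items()
    }


counts = {
    "OTU1": {"S1": 100, "S2": 0, "S3": 0},
    "OTU2": {"S1": 100, "S2": 40, "S3": 600},
    "OTU3": {"S1": 0, "S2": 10, "S3": 0},
}

for otu, row in relative_abundance(counts).items():
    print(otu, row)
```

Each column of the result sums to 1, which is what makes samples with very different sequencing depths (200 reads in S1, 600 in S3) comparable.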
  58. OTU Picking - "de-novo"
      • Pros
        – Vast majority of reads are clustered
        – No reference database bias
      • Cons
        – Speed; not easily parallelizable
        – Erroneous reads get clustered
      (Figure: experimental sequences pass through a clustering algorithm; the clustered sequences form OTU1, OTU2 and OTU3.)
  59. OTU Picking - "closed-reference"
      • Pros
        – Reference database is a quality filter
        – Speed; easily parallelizable
      • Cons
        – No new OTUs can be observed
        – Reference database bias
      (Figure: experimental sequences are compared with reference sequences; sequences that hit a reference are assigned to OTUs, sequences that failed to hit are not.)
  60. Percentage of reads that do not hit the reference collection, by environment type.
  61. OTU Picking - "open-reference"
      • Pros
        – Best of both worlds
      • Cons
        – Downsides of de-novo
      (Figure: experimental sequences are compared with reference sequences; sequences that hit a reference are assigned to OTUs, and sequences that failed to hit go through a clustering algorithm, giving OTU1 to OTU6.)
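The three picking strategies can be contrasted with a toy sketch. This is not what QIIME runs (the workflow slide lists tools such as UCLUST and USEARCH for this step); it assumes equal-length sequences, compares them position by position, and uses a 0.97 identity threshold as an assumed cutoff. All function names and the `New.N` naming are hypothetical; the sequences are short examples of the kind shown in the figures.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy comparison needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, references, threshold):
    """Assign each read to its best reference if it reaches the threshold."""
    otus, failures = {}, []
    for read in reads:
        best_id = max(references, key=lambda ref_id: identity(read, references[ref_id]))
        if identity(read, references[best_id]) >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            failures.append(read)  # closed-reference stops here for these reads
    return otus, failures


def de_novo(reads, threshold):
    """Greedy clustering: join the first matching centroid, else seed a new cluster."""
    clusters = []
    for read in reads:
        for centroid, members in clusters:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters


def open_reference(reads, references, threshold):
    """Closed-reference first, then de novo clustering of the failures."""
    otus, failures = closed_reference(reads, references, threshold)
    for number, (_centroid, members) in enumerate(de_novo(failures, threshold), start=1):
        otus["New.%d" % number] = members
    return otus


references = {"Ref1": "CTGGGCCGTGTCTCAGTCCCAA", "Ref2": "TTGGAAGATGTCTCAGTTCCAG"}
reads = [
    "CTGGGCCGTGTCTCAGTCCCAA",
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGGCCGTATGTCAGTCCCTA",
    "CTGGGCCGTGTCTCAGTCCCAA",
    "TTGGGCCGTATGTCAGTCCCTA",
]
print(closed_reference(reads, references, 0.97))
print(open_reference(reads, references, 0.97))
```

The de novo step depends on the order of the reads and compares each read against all existing centroids, which is one way to see why it is harder to parallelize than the reference search.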
  62. Sub-sampled picking
      • Step 0: Prefilter (parallel)
      • Step 1: Closed reference (parallel)
      • Step 2: De novo clustering of subsampled failures (serial)
      • Step 3: Closed reference, round 2 (parallel)
      • Step 4: De novo (serial)
  63. Things to remember
      • OTU picking is slow and memory intensive
        ◦ You often need a cluster for open reference picking
      • The orientation of your reads is important (with respect to the reference database)
      • A '.biom' or 'biom' table is just a (non-human readable) data format for storing OTU tables.
      • This is a good resource for more information on OTU picking: https://peerj.com/preprints/411.pdf
  64. QIIME parameters file
          # Parallel options
          parallel:jobs_to_start 20
          parallel:retain_temp_files False
          parallel:seconds_to_sleep 60

          # Beta diversity parameters
          beta_diversity:metrics unweighted_unifrac
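Each non-comment line of the file above has the shape `script:option value`. The sketch below shows one way such lines could be read into a nested dictionary; it is a hypothetical stand-alone parser written for illustration, not QIIME's own code, and it keeps every value as a string.

```python
def parse_parameters(lines):
    """Parse 'script:option value' lines into {script: {option: value}}."""
    params = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 1)
        key = fields[0]
        value = fields[1].strip() if len(fields) > 1 else None
        script, separator, option = key.partition(":")
        if not separator or not option:
            raise ValueError("expected script:option, got %r" % key)
        params.setdefault(script, {})[option] = value
    return params


example = """\
# Parallel options
parallel:jobs_to_start 20
parallel:retain_temp_files False
parallel:seconds_to_sleep 60

# Beta diversity parameters
beta_diversity:metrics unweighted_unifrac
"""

print(parse_parameters(example.splitlines()))
```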
  65. Where to Get Help
      • [email protected]
      • http://qiime.org/
      • https://groups.google.com/forum/#!forum/qiime-forum