Assessing Assemblies

PacBio
September 19, 2013

Transcript

  1. FIND MEANING IN COMPLEXITY
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
    Richard Hall
    Assessing and Finishing Bacterial Genomes
  2. Learning Objectives
    After the training, the participant will be able to:
    • Assess HGAP assembly results of bacterial genomes
    Scientists, Research Associates, Bioinformaticians:
    • Interested in finishing and closing bacterial genomes
    • Familiarity with UNIX commands
  3. Introduction
    • Basics of Assembly Metrics
    • Assembly QC via SMRT® Portal
      – Raw Read Coverage
      – Bridge Mapper
    • Other tools for QC / Finishing
      – BLAST® analysis
      – Dot plots – Gepard
      – Circularizing contigs – minimus2
      – Comparing with known references
    • Advanced analysis
      – Visualizing the overlap graph
    • Tertiary analysis
      – PHAST
      – RAST
      – BASys
    • Examples
    • Summary
  4. Basic Assembly Metrics
    • Commonly used metrics include:
      – Number of contigs
      – N50: the length of the contig you reach if you sort contigs from largest to smallest and walk down the list until the contigs passed account for 50% of the total sequence
        − Example contigs of 10, 4, 1, 1, 1 and 1 bp: N50 = 10 bp
        − Mean contig length = 3 bp
      – Max contig size
    • Limitations of these metrics:
      – They do not capture information about assembly accuracy!
        − Large scale mis-assemblies
        − Base level errors
      – There might be more than one chromosome (plasmid, phage, etc.)
      – Contaminants may contribute to the contig number (such as a cloning vector)
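The N50 definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the deck: the function name `n50` is made up here, and the input is the slide's own toy example of six contigs.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the largest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig sizes from the slide (bp).
example = [10, 4, 1, 1, 1, 1]
print("number of contigs:", len(example))
print("N50:", n50(example))
print("mean contig length:", sum(example) / len(example))
print("max contig size:", max(example))
```

The same number of contigs and the same N50 can come from a correct or a badly mis-assembled genome, which is why the slide treats these metrics only as a starting point.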
  5. Assembly QC via SMRT® Portal - Raw Read Coverage
    • Undulation in chromosome coverage is biological: there is more DNA close to the origin of replication (ori) when cells are harvested in log phase
    • Different levels of coverage between the chromosome and one of the plasmids lead to distinct coverage peaks in the histogram
  6. Assembly QC via SMRT® Portal - Raw Read Coverage
    Figure panels: Coverage Plot, SMRT View
    • Re-mapping the reads to the assembly may reveal discontinuities
    • Sharp dips in coverage (lacks read support)
    • Sharp spikes in coverage (collapsed repeat elements)
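As a rough sketch of how sharp dips and spikes could be flagged outside SMRT Portal, the Python below scans per-window coverage values and reports runs that fall well below or well above the median. The function name, the 0.5x and 1.5x thresholds, and the sample numbers are all assumptions for illustration; real cut-offs need to allow for the gradual ori-related undulation described on the previous slide.

```python
from statistics import median


def flag_coverage_anomalies(coverage, low_factor=0.5, high_factor=1.5):
    """Return (dips, spikes) as lists of half-open (start, end) window ranges.

    coverage: per-window mean read depth along one contig.
    low_factor / high_factor: hypothetical thresholds relative to the median.
    """
    typical = median(coverage)

    def runs(is_anomalous):
        found, start = [], None
        for i, depth in enumerate(coverage):
            if is_anomalous(depth):
                if start is None:
                    start = i
            elif start is not None:
                found.append((start, i))
                start = None
        if start is not None:
            found.append((start, len(coverage)))
        return found

    dips = runs(lambda d: d < low_factor * typical)      # possible lack of read support
    spikes = runs(lambda d: d > high_factor * typical)   # possible collapsed repeat
    return dips, spikes


# Invented per-window coverage values, for illustration only.
windows = [100, 98, 102, 30, 25, 101, 99, 210, 220, 100]
dips, spikes = flag_coverage_anomalies(windows)
print("dips:", dips)
print("spikes:", spikes)
```

Flagged ranges are only candidates; the slide's approach is to inspect them in SMRT View before deciding whether they reflect a mis-assembly.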
  7. Assembly QC via SMRT® Portal - Bridge Mapper
    • New for SMRT Analysis 2.1
    • Runs BLASR multiple times on input subreads
    • Split alignments are calculated
      – That is, the start, middle, and end of a read can align to different locations in the reference
    • Visualization of alignments in SMRT View allows:
      – Detection of mis-assemblies
      – Identification of structural variation
      – Characterization of chimerism
  8. Other tools for QC / Finishing - BLAST® analysis
    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  9. Other tools for QC / Finishing – Dot Plots
    • Gepard - http://www.helmholtz-muenchen.de/icb/gepard
    Figure: Dot plot for contig with a close match found via BLAST® analysis
    Figure: Self-self dot plot showing circularity
  10. Other tools for QC / Finishing - Circularization
    Diagram steps: Contig, Split Contig, Overlap, Consensus
    Manually introduce a break (a new “>” header line) in the FASTA sequence
    Minimus2 can be used as a simple overlapper.
    Minimus2 - http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus2
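The manual break can also be scripted. The sketch below splits one contig at its midpoint into two FASTA records, so that an overlapper sees the contig's original ends as an overlap between two sequences. The function name, the `_a` / `_b` record suffixes, and the choice of the midpoint are assumptions made for illustration; running Minimus2 on the result is not shown.

```python
def split_contig_fasta(name, sequence, width=60):
    """Split one contig at its midpoint and return two FASTA records as text.

    The `_a` / `_b` suffixes are hypothetical names chosen for this sketch.
    """
    mid = len(sequence) // 2
    halves = [(f"{name}_a", sequence[:mid]), (f"{name}_b", sequence[mid:])]
    lines = []
    for header, seq in halves:
        lines.append(f">{header}")  # the ">" line is the break between records
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Tiny made-up sequence; a real contig would be read from a FASTA file.
print(split_contig_fasta("ctg", "ACGTACGTAA", width=4), end="")
```

After the split, the end of the second record and the start of the first record carry the redundant sequence that the overlapper is expected to merge into a single consensus.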
  11. Other tools for QC / Finishing – Comparing assemblies
    • MUMmer – http://mummer.sourceforge.net/
      – Alignment of multi-contig data against reference
      – Alignment of two draft genomes
      – Repeat finding
      – Good examples and step-by-step guides:
        − http://mummer.sourceforge.net/examples/
  12. Other tools for QC / Finishing – Comparing assemblies
    • Mauve – Multiple Genome Alignment
      – Aaron E. Darling, Bob Mau, and Nicole T. Perna. 2010. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss, and Rearrangement. PLoS One. 5(6):e11147.
      – http://asap.ahabs.wisc.edu/mauve/
  13. Advanced analysis - Visualizing the overlap graph
    • CA_best_edge_to_GML.py - https://github.com/PacificBiosciences/HBAR-DTK
    • Gephi - https://gephi.org/
      – YifanHu's Multilevel algorithm
  14. Tertiary analysis – Bacterial annotation
    • Find phage insertions in genome or plasmid:
      – PHAST (PHAge Search Tool) - http://phast.wishartlab.com/
    • Automatic annotation:
      – RAST (Rapid Annotation using Subsystem Technology) - http://rast.nmpdr.org/
      – BASys (Bacterial Annotation System) - https://www.basys.ca/
  15. Tertiary analysis - PHAST
    • Several intact phage elements in the chromosome
    • Regions 2, 6 and 8 each have a single adenine-specific methyltransferase
    Identified using PHAST (http://phast.wishartlab.com)
    Figure labels: Region2, Region6, Region8
  16. Example 1: Bacterial Genome, 1 Circular Chromosome
    • E. coli 20 kb Size-Selected Library with P4 C2, 1 SMRT® Cell
    • https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-20kb-Size-Selected-Library-with-P4-C2
    Polished Contigs: 1
    Max Contig Length: 4653310
    N50 Contig Length: 4653310
    Sum of Contig Lengths: 4653310
  17. Example 1: Bacterial Genome, 1 Circular Chromosome
    • Overlapping, self-similar ends - chromosome is circular: Contig can be circularized and used as a reference for resequencing / Base modification analysis
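To make the idea of overlapping, self-similar ends concrete, here is a toy Python sketch that looks for the longest stretch at the end of a contig that exactly repeats its start, then trims the redundant copy. Exact string matching is a simplifying assumption: real contig ends usually differ by a few bases, so in practice the overlap is found with an aligner or a dot plot. The function names, parameters, and the short sequence are invented for illustration.

```python
def end_overlap_length(sequence, min_overlap=5, max_overlap=50_000):
    """Length of the longest suffix that exactly matches a prefix, or 0.

    min_overlap / max_overlap are hypothetical bounds for this sketch.
    """
    longest = min(len(sequence) // 2, max_overlap)
    for k in range(longest, min_overlap - 1, -1):
        if sequence[:k] == sequence[-k:]:
            return k
    return 0


def trim_circular_overlap(sequence, min_overlap=5):
    """Drop the redundant end copy so each base of the circle appears once."""
    k = end_overlap_length(sequence, min_overlap)
    return sequence[:-k] if k else sequence


# Made-up contig whose last 8 bases repeat its first 8 bases.
core = "GATTACAGGCTTAACCGT"
contig = "ACGTTGCA" + core + "ACGTTGCA"
print("overlap length:", end_overlap_length(contig))
print("trimmed contig:", trim_circular_overlap(contig))
```

A contig with no detectable end overlap is returned unchanged, which is a hint that it may be linear or incompletely assembled rather than proof of either.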
  18. Example 3: Plasmid with multiple repeat elements
    • What evidence do we have for the largest number of repeats on a single plasmid?
    • Largest mapped subread.
      – 10,386 bp read with even coverage
      – 3 units
  19. Example 4: Incomplete assembly
    # Contigs: 7
    # Bases: 5,130,210
    N50: 2,594,109
    Max contig length: 2,594,109
    Contigs > 10,000 bp: 5

    Contig id  Size       BLAST® hits                                            Coverage of raw reads
    0007       7,298      possibly rRNA with high repeats                        ~35x
    0008       95,929     pRSB107 like plasmid                                   70x **
    0009       2,594,109  Ends map to Enterobacteria phage DNA *                 117x
    0010       1,252,695  Ends map to Enterobacteria phage DNA *                 106x
    0011       1,157,888  Ends map to Enterobacteria phage DNA *                 107x
    0012       16,801     76% match to an Enterobacteria phage at high identity  ~100x
    0013       5,490      96% match to an Enterobacteria phage at high identity  ~10x
    ** plasmid DNA not 1:1 with genomic DNA
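The summary figures on this slide can be recomputed from the per-contig sizes, and the coverage column can be expressed relative to the chromosome to make the "not 1:1" note concrete. The sketch below copies the sizes and approximate coverages from the slide; treating contigs longer than 1 Mb as the chromosomal baseline is an assumption made here, not something the deck states.

```python
from statistics import median

# (contig id, size in bp, approximate raw-read coverage) copied from the slide.
contigs = [
    ("0007", 7298, 35),
    ("0008", 95929, 70),
    ("0009", 2594109, 117),
    ("0010", 1252695, 106),
    ("0011", 1157888, 107),
    ("0012", 16801, 100),
    ("0013", 5490, 10),
]

sizes = sorted((size for _, size, _ in contigs), reverse=True)
total_bases = sum(sizes)

# N50: walk from the largest contig until half of all bases are covered.
running = 0
for size in sizes:
    running += size
    if 2 * running >= total_bases:
        n50 = size
        break

summary = {
    "contigs": len(sizes),
    "bases": total_bases,
    "n50": n50,
    "max": sizes[0],
    "over_10kb": sum(1 for s in sizes if s > 10_000),
}

# Assumption: contigs longer than 1 Mb are chromosomal and define the 1x level.
baseline = median(cov for _, size, cov in contigs if size > 1_000_000)
relative_coverage = {cid: round(cov / baseline, 2) for cid, _, cov in contigs}

print(summary)
print(relative_coverage)
```

A relative coverage well away from 1 for a contig is a prompt to look more closely (for example with BLAST), not a classification by itself.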
  20. Example 4: Incomplete assembly - Ambiguity
    Figure (not to scale): contigs 0009, 0010 and 0011 shown in one arrangement or another. Legend: High similarity to Phage DNA; Mapping direction; Enterobacteria phage DE3.
  21. Summary
    • Manually validating assemblies becomes a viable option when contig numbers are low
    • SMRT® Portal can be used as a first pass
      – Coverage plot
      – SMRT View
        − Raw read coverage
        − Bridge Mapping
    • Third party tools can be used for QC / finishing
      – Dot plots
      – Aligning to known sequences
      – Circularization
    • Tertiary analysis for bacterial genomes can be done in an automated fashion and results visualized in SMRT View
      – RAST, PHAST, BASys
  22. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.