Assessing Assemblies

PacBio
September 19, 2013

Transcript

  1. FIND MEANING IN COMPLEXITY

    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
    Richard Hall
    Assessing and Finishing Bacterial Genomes
  2. Learning Objectives

    After the training, the participant will be able to:
    • Assess HGAP assembly results of bacterial genomes
    Scientists, Research Associates, Bioinformaticians:
    • Interested in finishing and closing bacterial genomes
    • Familiarity with UNIX commands
  3. Introduction

    • Basics of Assembly Metrics
    • Assembly QC via SMRT® Portal
      – Raw Read Coverage
      – Bridge Mapper
    • Other tools for QC / Finishing
      – BLAST® analysis
      – Dot plots – Gepard
      – Circularizing contigs – minimus2
      – Comparing with known references
    • Advanced analysis
      – Visualizing the overlap graph
    • Tertiary analysis
      – PHAST
      – RAST
      – BASys
    • Examples
    • Summary
  4. Basic Assembly Metrics

    • Commonly used metrics include:
      – Number of contigs
      – N50: the length of the contig reached when contigs are sorted from largest to smallest and their lengths are summed until the running total reaches 50% of the total sequence
        − Example: for contigs of 10, 4, 1, 1, 1 and 1 bp, N50 = 10 bp and mean contig length = 3 bp
      – Max contig size
    • Limitations of these metrics:
      – They do not capture information about assembly accuracy!
        − Large-scale mis-assemblies
        − Base-level errors
      – There might be more than one chromosome (plasmid, phage, etc.)
      – Contaminants (such as a cloning vector) may add to the contig count
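    The N50 definition on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original deck: the function names `n50` and `mean_length` are chosen here for illustration, and the input is the six-contig example from the slide.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the largest contig down until half the total is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


def mean_length(contig_lengths):
    """Return the mean contig length (in bp)."""
    return sum(contig_lengths) / len(contig_lengths)


contigs = [10, 4, 1, 1, 1, 1]   # example contig sizes from the slide
print(n50(contigs))             # 10
print(mean_length(contigs))     # 3.0
```

    The example shows why N50 alone can mislead: one large contig dominates the metric even though most contigs are 1 bp long.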
  5. Assembly QC via SMRT® Portal - Raw Read Coverage

    • Undulation in chromosome coverage is biological (more DNA close to the ori when cells are harvested in log phase)
    • Coverage differs between the chromosome and one of the plasmids, leading to distinct coverage peaks in the histogram
  6. Assembly QC via SMRT® Portal - Raw Read Coverage

    [Figures: Coverage Plot; SMRT View]
    • Re-mapping the reads to the assembly may reveal discontinuities
    • Sharp dips in coverage (lacks read support)
    • Sharp spikes in coverage (collapsed repeat elements)
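    As a rough illustration of what "sharp dips" and "sharp spikes" mean numerically, the sketch below flags positions whose coverage is far from the median. It is not how SMRT Portal or SMRT View works; the function name `flag_coverage_outliers` and the 0.5x / 1.5x thresholds are assumptions made for this example, and the coverage values are invented.

```python
from statistics import median


def flag_coverage_outliers(coverage, low_factor=0.5, high_factor=1.5):
    """Return (dips, spikes): positions with coverage far below or above the median.

    low_factor and high_factor are illustrative thresholds, not recommended values.
    """
    typical = median(coverage)
    dips = [i for i, c in enumerate(coverage) if c < low_factor * typical]
    spikes = [i for i, c in enumerate(coverage) if c > high_factor * typical]
    return dips, spikes


# Invented per-window coverage values for one contig.
coverage = [100, 102, 98, 20, 101, 99, 210, 100]
dips, spikes = flag_coverage_outliers(coverage)
print(dips)    # [3]  -> candidate region lacking read support
print(spikes)  # [6]  -> candidate collapsed repeat
```

    A global threshold like this would also flag the gradual, biological coverage change around the ori, so flagged positions are only candidates for inspection in a viewer.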
  7. Assembly QC via SMRT® Portal - Bridge Mapper

    • New for SMRT Analysis 2.1
    • Runs BLASR multiple times on input subreads
    • Split alignments are calculated
      – When the start, middle, and end of a read align to different locations in the reference
    • Visualization of alignments in SMRT View allows:
      – Detection of mis-assemblies
      – Identification of structural variation
      – Characterization of chimerism
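    The idea of a split alignment can be sketched as follows. This is not the Bridge Mapper or BLASR implementation: the tuple layout, the function name `is_split_read`, and the `max_gap` tolerance are all assumptions for illustration, and every segment is assumed to map to the forward strand.

```python
def is_split_read(segments, max_gap=1000):
    """Report whether a read's aligned segments land in inconsistent locations.

    segments: list of (read_start, contig_name, ref_start) tuples, one per
    aligned piece of the read (hypothetical layout, forward strand assumed).
    """
    ordered = sorted(segments)  # order the pieces along the read
    for (r1, c1, p1), (r2, c2, p2) in zip(ordered, ordered[1:]):
        if c1 != c2:
            return True  # neighbouring pieces hit different contigs
        # In a consistent alignment the reference offset grows by about
        # as much as the read offset between neighbouring pieces.
        if abs((p2 - p1) - (r2 - r1)) > max_gap:
            return True
    return False


consistent = [(0, "contig1", 5000), (3000, "contig1", 8000), (6000, "contig1", 11000)]
split = [(0, "contig1", 5000), (3000, "contig1", 250000), (6000, "contig2", 100)]
print(is_split_read(consistent))  # False
print(is_split_read(split))       # True
```

    Many reads that split at the same position would point to a mis-assembly or structural variation, while an isolated split read is more likely a chimeric molecule.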
  8. Other tools for QC / Finishing - BLAST® analysis

    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  9. Other tools for QC / Finishing – Dot Plots

    • Gepard - http://www.helmholtz-muenchen.de/icb/gepard
    [Figures: dot plot for a contig with a close match found via BLAST® analysis; self-self dot plot showing circularity]
  10. Other tools for QC / Finishing - Circularization

    [Figure labels: Contig, Split Contig, Overlap, Consensus]
    Manually introduce a break, “>”, in the FASTA sequence
    Minimus2 can be used as a simple overlapper.
    Minimus2 - http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus2
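    The overlap-and-trim idea behind circularization can be sketched in Python. This is a deliberate simplification and not what Minimus2 does: it looks only for an exact match between the two ends of a contig, whereas a real overlapper aligns the ends and tolerates mismatches. The function names and the toy sequence are invented for the example, and the quadratic search is too slow for a real multi-megabase contig.

```python
def find_end_overlap(seq, min_overlap=10):
    """Return the length of the longest suffix of seq that exactly equals
    a prefix of seq, or 0 if none is at least min_overlap long."""
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[-k:] == seq[:k]:
            return k
    return 0


def circularize(seq, min_overlap=10):
    """Trim the redundant end of a contig whose ends overlap exactly."""
    k = find_end_overlap(seq, min_overlap)
    return seq[:-k] if k else seq


# Toy circular "genome"; the contig repeats its first 12 bases at the end,
# mimicking an assembler that ran past the start of a circular molecule.
genome = "ACGTTGCAAGGCTTAACCGGTATC"
contig = genome + genome[:12]

print(find_end_overlap(contig))       # 12
print(circularize(contig) == genome)  # True
```

    A contig with no detectable end overlap is returned unchanged, which is the safe default when circularity has not been demonstrated.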
  11. Other tools for QC / Finishing – Comparing assemblies

    • MUMmer – http://mummer.sourceforge.net/
      – Alignment of multi-contig data against reference
      – Alignment of two draft genomes
      – Repeat finding
      – Good examples and step-by-step guides:
        − http://mummer.sourceforge.net/examples/
  12. Other tools for QC / Finishing – Comparing assemblies

    • Mauve – Multiple Genome Alignment
      – Aaron E. Darling, Bob Mau, and Nicole T. Perna. 2010. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss, and Rearrangement. PLoS One. 5(6):e11147.
      – http://asap.ahabs.wisc.edu/mauve/
  13. Advanced analysis - Visualizing the overlap graph

    • CA_best_edge_to_GML.py - https://github.com/PacificBiosciences/HBAR-DTK
    • Gephi - https://gephi.org/
      – YifanHu's Multilevel algorithm
  14. Tertiary analysis – Bacterial annotation

    • Find phage insertions in genome or plasmid:
      – PHAST (PHAge Search Tool) - http://phast.wishartlab.com/
    • Automatic annotation:
      – RAST (Rapid Annotation using Subsystem Technology) - http://rast.nmpdr.org/
      – BASys (Bacterial Annotation System) - https://www.basys.ca/
  15. Tertiary analysis - PHAST

    • Several intact phage elements in the chromosome
    • Regions 2, 6 and 8 each have a single adenine-specific methyltransferase
    [Figure: using PHAST (http://phast.wishartlab.com); labels Region2, Region6, Region8]
  16. Example 1: Bacterial Genome, 1 Circular Chromosome

    • E. coli 20 kb Size-Selected Library with P4 C2, 1 SMRT® Cell
    • https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-20kb-Size-Selected-Library-with-P4-C2
    Polished Contigs         1
    Max Contig Length        4653310
    N50 Contig Length        4653310
    Sum of Contig Lengths    4653310
  17. Example 1: Bacterial Genome, 1 Circular Chromosome

    • Overlapping, self-similar ends - chromosome is circular
    Contig can be circularized and used as a reference for resequencing / base modification analysis
  18. Example 3: Plasmid with multiple repeat elements

    • What evidence do we have for the largest number of repeats on a single plasmid?
    • Largest mapped subread
      – 10,386 bp read with even coverage
      – 3 units
  19. Example 4: Incomplete assembly

    # Contigs              7
    # Bases                5,130,210
    N50                    2,594,109
    Max contig length      2,594,109
    Contigs > 10,000 bp    5

    Contig id   Size        BLAST® hits                                             Coverage of raw reads
    0007        7,298       possibly rRNA with high repeats                         ~35x
    0008        95,929      pRSB107-like plasmid                                    70x **
    0009        2,594,109   Ends map to Enterobacteria phage DNA *                  117x
    0010        1,252,695   Ends map to Enterobacteria phage DNA *                  106x
    0011        1,157,888   Ends map to Enterobacteria phage DNA *                  107x
    0012        16,801      76% match to an Enterobacteria phage at high identity   ~100x
    0013        5,490       96% match to an Enterobacteria phage at high identity   ~10x
    ** plasmid DNA not 1:1 with genomic DNA
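  20. Example 4: Incomplete assembly - Ambiguity

    [Figure, not to scale: contigs 0009, 0010 and 0011, with contig ends showing high similarity to phage DNA (Enterobacteria phage DE3) and arrows for mapping direction; two alternative arrangements of 0009, 0010 and 0011 are shown, joined by "or"]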
  21. Summary

    • Manually validating assemblies becomes a viable option when contig numbers are low
    • SMRT® Portal can be used as a first pass
      – Coverage plot
      – SMRT View
        − Raw read coverage
        − Bridge Mapping
    • Third-party tools can be used for QC / finishing
      – Dot plots
      – Aligning to known sequences
      – Circularization
    • Tertiary analysis for bacterial genomes can be done in an automated fashion and results visualized in SMRT View
      – RAST, PHAST, BASys
  22. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.