Assessing Assemblies

PacBio
September 19, 2013

Transcript

  1. Assessing and Finishing Bacterial Genomes

    Richard Hall
    FIND MEANING IN COMPLEXITY
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Learning Objectives

    After the training, the participant will be able to:
    • Assess HGAP assembly results of bacterial genomes

    Scientists, Research Associates, Bioinformaticians:
    • Interested in finishing and closing bacterial genomes
    • Familiar with UNIX commands
  3. Introduction

    • Basics of Assembly Metrics
    • Assembly QC via SMRT® Portal
      – Raw Read Coverage
      – Bridge Mapper
    • Other tools for QC / Finishing
      – BLAST® analysis
      – Dot plots (Gepard)
      – Circularizing contigs (minimus2)
      – Comparing with known references
    • Advanced analysis
      – Visualizing the overlap graph
    • Tertiary analysis
      – PHAST
      – RAST
      – BASys
    • Examples
    • Summary
  4. Basic Assembly Metrics

    • Commonly used metrics include:
      – Number of contigs
      – N50: the length of the contig you reach when you sort contigs from largest to smallest and walk down the list until the contigs passed account for 50% of the total sequence
        [Figure: six contigs of 10, 4, 1, 1, 1 and 1 bp]
        − N50 = 10 bp
        − Mean contig length = 3 bp
      – Max contig size
    • Limitations of these metrics:
      – They do not capture information about assembly accuracy!
        − Large-scale mis-assemblies
        − Base-level errors
      – There might be more than one chromosome (plasmid, phage, etc.)
      – Contaminants (such as a cloning vector) may contribute to the contig number
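The N50 definition on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original deck: the function name `n50` is made up for illustration, and the input is the slide's six-contig example (10, 4, 1, 1, 1 and 1 bp).

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the largest contig down until half of the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


contigs = [10, 4, 1, 1, 1, 1]          # contig sizes from the slide's figure
print(n50(contigs))                     # 10
print(sum(contigs) / len(contigs))      # 3.0
```

The largest contig alone already covers more than half of the 18 bp total, so the N50 is 10 bp even though the mean contig length is only 3 bp.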
  5. Assembly QC via SMRT® Portal - Raw Read Coverage

    • Undulation in coverage across the chromosome is biological (more DNA close to the origin of replication, ori, when cells are harvested in log phase)
    • The chromosome and one of the plasmids have different levels of coverage, leading to distinct coverage peaks in the histogram
  6. Assembly QC via SMRT® Portal - Raw Read Coverage

    [Figure panels: Coverage Plot; SMRT View]
    • Re-mapping the reads to the assembly may reveal discontinuities
    • Sharp dips in coverage (lacks read support)
    • Sharp spikes in coverage (collapsed repeat elements)
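One rough way to screen for such dips and spikes outside SMRT View is to compare each window's coverage with the contig's median coverage. The sketch below is an illustration only: the function name, the thresholds (`low=0.5`, `high=2.0`) and the coverage values are all assumptions, not something taken from SMRT Portal.

```python
from statistics import median


def flag_coverage_outliers(window_coverage, low=0.5, high=2.0):
    """Return (dips, spikes): indices of windows far below or above the median.

    `low` and `high` are arbitrary illustrative thresholds, expressed as
    multiples of the median coverage.
    """
    typical = median(window_coverage)
    dips = [i for i, c in enumerate(window_coverage) if c < low * typical]
    spikes = [i for i, c in enumerate(window_coverage) if c > high * typical]
    return dips, spikes


# Made-up mean coverage per fixed-size window along one contig.
coverage = [98, 102, 100, 35, 101, 99, 240, 100]
dips, spikes = flag_coverage_outliers(coverage)
print(dips, spikes)  # [3] [6]
```

A single global median is crude: the gradual, biological coverage gradient towards ori described on the previous slide would need wider tolerances or a local baseline.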
  7. Assembly QC via SMRT® Portal - Bridge Mapper

    • New for SMRT Analysis 2.1
    • Run BLASR multiple times on input subreads
    • Split alignments are calculated
      – The start, middle, and end of a read can align to different locations in the reference
    • Visualization of alignments in SMRT View allows:
      – Detection of mis-assemblies
      – Identification of structural variation
      – Characterization of chimerism
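To make the idea of a split alignment concrete, the sketch below classifies the junction between two consecutive aligned pieces of one read. It is not Bridge Mapper's algorithm. The `Segment` record, the labels and the 100 bp tolerance are hypothetical, and reference coordinates are assumed to satisfy `ref_start < ref_end` on both strands.

```python
from collections import namedtuple

# Hypothetical record for one aligned piece of a subread.
Segment = namedtuple(
    "Segment", "read_start read_end contig ref_start ref_end strand"
)


def classify_junction(a, b, tolerance=100):
    """Label how two consecutive pieces of a read (a before b) relate."""
    if a.contig != b.contig:
        return "different_contigs"
    if a.strand != b.strand:
        return "strand_switch"
    read_gap = b.read_start - a.read_end
    # On the minus strand, later read positions map to lower coordinates.
    if a.strand == "+":
        ref_gap = b.ref_start - a.ref_end
    else:
        ref_gap = a.ref_start - b.ref_end
    if abs(ref_gap - read_gap) <= tolerance:
        return "contiguous"
    return "extra_in_reference" if ref_gap > read_gap else "extra_in_read"


def split_junctions(segments):
    """Classify every junction between pieces, ordered along the read."""
    ordered = sorted(segments, key=lambda s: s.read_start)
    return [classify_junction(a, b) for a, b in zip(ordered, ordered[1:])]


read = [
    Segment(0, 3000, "ctg1", 10000, 13000, "+"),
    Segment(3000, 6000, "ctg1", 18000, 21000, "+"),
]
print(split_junctions(read))  # ['extra_in_reference']
```

Under these assumptions, `strand_switch` might point to an inversion and `different_contigs` to a possible join. `extra_in_reference` means the read skips sequence that the assembly contains, while `extra_in_read` means the read carries sequence the assembly lacks, as a collapsed repeat would. Any such call still needs visual confirmation across many reads.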
  8. SMRT® Portal 2.1 - BridgeMapper

  9. SMRT® View Example - Inversion

  10. SMRT® View Example - Deletion

  11. SMRT® View Example – Collapsed repeat

  12. SMRT® View Example – Joining two contigs

  13. Other tools for QC / Finishing - BLAST® analysis

    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  14. Other tools for QC / Finishing – Dot Plots

    • Gepard - http://www.helmholtz-muenchen.de/icb/gepard
    [Figure: dot plot for a contig with a close match found via BLAST® analysis]
    [Figure: self–self dot plot showing circularity]
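The principle behind a dot plot can be sketched with exact k-mer matches: every shared k-mer becomes one dot at (position in X, position in Y). This is only a toy illustration and makes no claim about how Gepard itself computes its plots. The function name, `k=8` and the sequence are assumptions.

```python
from collections import defaultdict


def kmer_dots(seq_x, seq_y, k=8):
    """Return (i, j) pairs where seq_x[i:i+k] equals seq_y[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(seq_x) - k + 1)
        for j in index.get(seq_x[i:i + k], [])
    ]


# Toy "circular" contig: the first 12 bases are repeated at the end.
core = "ACGTTGCAAGGCTTACCGATAGCTTCAGGA"
contig = core + core[:12]

dots = kmer_dots(contig, contig)
off_diagonal = [(i, j) for i, j in dots if j - i == len(core)]
print(len(off_diagonal))  # 5
```

In a self-versus-self plot the main diagonal is always present. The short run of dots offset by the full length of `core`, in the corner of the plot, is the signature of overlapping ends that suggests a circular molecule.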
  15. Other tools for QC / Finishing - Circularization

    [Figure labels: Contig, Split Contig, Overlap, Consensus]
    Manually introduce a break, “>”, in the FASTA sequence
    Minimus2 can be used as a simple overlapper.
    Minimus2 - http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus2
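The manual break can also be scripted. The sketch below splits one sequence into two FASTA records, so that the original contig ends become the end of the second record and the start of the first, where an overlapper can join them. The function name, the `_a`/`_b` suffixes and the line width are illustrative assumptions. Running Minimus2 on the result is not shown.

```python
def split_fasta_record(name, sequence, break_at=None, width=60):
    """Return FASTA text with `sequence` split into two records at `break_at`."""
    if break_at is None:
        break_at = len(sequence) // 2      # default: break in the middle
    if not 0 < break_at < len(sequence):
        raise ValueError("break_at must fall inside the sequence")
    parts = [
        (f"{name}_a", sequence[:break_at]),
        (f"{name}_b", sequence[break_at:]),
    ]
    lines = []
    for header, seq in parts:
        lines.append(f">{header}")
        # Wrap the sequence at `width` characters per line.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(split_fasta_record("contig1", "ACGTACGTAC", break_at=4), end="")
```

For a real contig the break should sit well away from both ends, so that each half is long enough to be aligned unambiguously.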
  16. Other tools for QC / Finishing – Comparing assemblies

    • MUMmer – http://mummer.sourceforge.net/
      – Alignment of multi-contig data against a reference
      – Alignment of two draft genomes
      – Repeat finding
      – Good examples and step-by-step walkthroughs:
        − http://mummer.sourceforge.net/examples/
  17. Other tools for QC / Finishing – Comparing assemblies

    • Mauve – Multiple Genome Alignment
      – Aaron E. Darling, Bob Mau, and Nicole T. Perna. 2010. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss, and Rearrangement. PLoS One. 5(6):e11147.
      – http://asap.ahabs.wisc.edu/mauve/
  18. Advanced analysis - Visualizing the overlap graph

    • CA_best_edge_to_GML.py - https://github.com/PacificBiosciences/HBAR-DTK
    • Gephi - https://gephi.org/
      – YifanHu's Multilevel algorithm
  19. Tertiary analysis – Bacterial annotation

    • Find phage insertions in genome or plasmid:
      – PHAST (PHAge Search Tool) - http://phast.wishartlab.com/
    • Automatic annotation:
      – RAST (Rapid Annotation using Subsystem Technology) - http://rast.nmpdr.org/
      – BASys (Bacterial Annotation System) - https://www.basys.ca/
  20. Tertiary analysis - PHAST

    • Several intact phage elements in the chromosome
    • Regions 2, 6 and 8 each have a single adenine-specific methyltransferase
    [Figure: results obtained using PHAST (http://phast.wishartlab.com), with Region 2, Region 6 and Region 8 labeled]
  21. Tertiary analysis - RAST

    gff3 can be downloaded and input as a track in SMRT® View
  22. Tertiary analysis – RAST, track in SMRT® View

  23. Tertiary analysis - BASys

  24. Example 1: Bacterial Genome, 1 Circular Chromosome

    • E. coli 20 kb Size-Selected Library with P4 C2, 1 SMRT® Cell
    • https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-20kb-Size-Selected-Library-with-P4-C2

    Polished Contigs         1
    Max Contig Length        4653310
    N50 Contig Length        4653310
    Sum of Contig Lengths    4653310
  25. Example 1: Bacterial Genome, 1 Circular Chromosome

    • Overlapping, self-similar ends - chromosome is circular: the contig can be circularized and used as a reference for resequencing / base modification analysis
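As a simplified illustration of what "overlapping ends" means, the sketch below looks for the longest suffix of a contig that exactly matches its prefix and trims the duplicated copy. It is an idealized sketch. Real contig ends usually differ by a few bases, so an alignment-based tool is needed in practice. The function names and `min_overlap` are assumptions.

```python
def end_overlap_length(sequence, min_overlap=4):
    """Length of the longest suffix of `sequence` that exactly equals its prefix.

    Only overlaps up to half the sequence length are considered.
    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(len(sequence) // 2, min_overlap - 1, -1):
        if sequence[:size] == sequence[-size:]:
            return size
    return 0


def circularize(sequence, min_overlap=4):
    """Drop the duplicated end so the sequence represents one full circle."""
    overlap = end_overlap_length(sequence, min_overlap)
    return sequence[:len(sequence) - overlap] if overlap else sequence


genome = "ACGTTGCAAGGCTTACCGAT"      # toy 20 bp circular "chromosome"
contig = genome + genome[:6]         # assembler ran 6 bp past the start
print(end_overlap_length(contig))    # 6
print(circularize(contig) == genome) # True
```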
  26. Example 2: Bacterial Genome, multiple plasmids

    • Remove spurious contig with low coverage
  27. Example 3: Plasmid with multiple repeat elements

    [Figure: Subread mapping]

  28. Example 3: Plasmid with multiple repeat elements

    • What evidence do we have for the largest number of repeats on a single plasmid?
    • Largest mapped subread.
      – 10,386 bp read with even coverage
      – 3 units
  29. Example 4: Incomplete assembly

    # Contigs              7
    # Bases                5,130,210
    N50                    2,594,109
    Max contig length      2,594,109
    Contigs > 10,000 bp    5

    Contig id   Size        BLAST® hits                                               Coverage of raw reads
    0007        7,298       possibly rRNA with high repeats                           ~35x
    0008        95,929      pRSB107-like plasmid                                      70x **
    0009        2,594,109   Ends map to Enterobacteria phage DNA *                    117x
    0010        1,252,695   Ends map to Enterobacteria phage DNA *                    106x
    0011        1,157,888   Ends map to Enterobacteria phage DNA *                    107x
    0012        16,801      76% match to an Enterobacteria phage at high identity     ~100x
    0013        5,490       96% match to an Enterobacteria phage at high identity     ~10x

    ** plasmid DNA not 1:1 with genomic DNA
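The contig table on this slide lends itself to a quick consistency check. The sketch below re-derives the base total and the count of contigs over 10,000 bp, then lists contigs whose coverage departs from the length-weighted mean. The sizes and coverages are copied from the slide, with the "~" values treated as exact. The 25% tolerance and the function name are arbitrary assumptions.

```python
# Contig id -> (size in bp, approximate raw-read coverage) from the slide.
contigs = {
    "0007": (7_298, 35),
    "0008": (95_929, 70),
    "0009": (2_594_109, 117),
    "0010": (1_252_695, 106),
    "0011": (1_157_888, 107),
    "0012": (16_801, 100),
    "0013": (5_490, 10),
}

total_bases = sum(size for size, _ in contigs.values())
large = [cid for cid, (size, _) in contigs.items() if size > 10_000]

# The length-weighted mean is dominated by the three megabase-scale contigs.
weighted_cov = sum(size * cov for size, cov in contigs.values()) / total_bases


def coverage_outliers(contigs, reference_cov, tolerance=0.25):
    """Contigs whose coverage differs from reference_cov by more than tolerance."""
    return [
        cid
        for cid, (_, cov) in contigs.items()
        if abs(cov - reference_cov) > tolerance * reference_cov
    ]


print(total_bases)                               # 5130210
print(len(large))                                # 5
print(coverage_outliers(contigs, weighted_cov))  # ['0007', '0008', '0013']
```

The flagged contigs are candidates for closer inspection rather than automatic removal. As the slide's footnote says, plasmid DNA (0008) is not expected to be 1:1 with genomic DNA.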
  30. Example 4: Incomplete assembly – Contig 0009 vs. Phage

  31. Example 4: Incomplete assembly – Contig 0010 vs. Phage

  32. Example 4: Incomplete assembly – Contig 0011 vs. Phage

  33. Example 4: Incomplete assembly - Ambiguity

    [Figure, not to scale: contigs 0009, 0010 and 0011 shown in two alternative arrangements (separated by "or"). Legend: High similarity to Phage DNA; Mapping direction; Enterobacteria phage DE3]
  34. Example 5: Evidence for splitting a contig

  35. Example 6: Mis-assembly at the end of contigs?

  36. Example 6: Mis-assembly at the end of contigs?

  37. Summary

    • Manually validating assemblies becomes a viable option when contig numbers are low
    • SMRT® Portal can be used as a first pass
      – Coverage plot
      – SMRT View
        − Raw read coverage
        − Bridge Mapping
    • Third-party tools can be used for QC / finishing
      – Dot plots
      – Aligning to known sequences
      – Circularization
    • Tertiary analysis for bacterial genomes can be done in an automated fashion and results visualized in SMRT View
      – RAST, PHAST, BASys
  38. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.