Slide 1

Slide 1 text

Assessing and Finishing Bacterial Genomes
Richard Hall
FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

Learning Objectives
After the training, the participant will be able to:
• Assess HGAP assembly results of bacterial genomes
Scientists, Research Associates, Bioinformaticians:
• Interested in finishing and closing bacterial genomes
• Familiar with UNIX commands

Slide 3

Slide 3 text

Introduction
• Basics of Assembly Metrics
• Assembly QC via SMRT® Portal
  – Raw Read Coverage
  – Bridge Mapper
• Other tools for QC / Finishing
  – BLAST® analysis
  – Dot plots (Gepard)
  – Circularizing contigs (minimus2)
  – Comparing with known references
• Advanced analysis
  – Visualizing the overlap graph
• Tertiary analysis
  – PHAST
  – RAST
  – BASys
• Examples
• Summary

Slide 4

Slide 4 text

Basic Assembly Metrics
• Commonly used metrics include:
  – Number of contigs
  – N50: the length of the contig reached when contigs are sorted from largest to smallest and their lengths are summed until the running total reaches 50% of the total sequence
    − Example (contigs of 10, 4, 1, 1, 1, and 1 bp): N50 = 10 bp
    − Mean contig length = 3 bp
  – Max contig size
• Limitations of these metrics:
  – They do not capture information about assembly accuracy!
    − Large-scale mis-assemblies
    − Base-level errors
  – There might be more than one molecule (chromosome plus plasmid, phage, etc.)
  – Contaminants (such as a cloning vector) may add to the contig count
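The metrics above can be computed directly from a list of contig lengths. The following is a minimal sketch, assuming the slide's toy example consists of contigs of 10, 4, 1, 1, 1, and 1 bp (which is consistent with the stated N50 of 10 bp and mean of 3 bp); the function name `n50` is chosen here for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from largest to smallest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Contig sizes assumed from the slide's toy example.
contigs = [10, 4, 1, 1, 1, 1]
print("Number of contigs:", len(contigs))
print("N50:", n50(contigs))
print("Mean contig length:", sum(contigs) / len(contigs))
print("Max contig size:", max(contigs))
```

None of these numbers says anything about correctness: a mis-joined assembly can have a higher N50 than a correct one.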

Slide 5

Slide 5 text

Assembly QC via SMRT® Portal - Raw Read Coverage
• Undulation in coverage across the chromosome is biological (more DNA close to the origin of replication, ori, when cells are harvested in log phase)
• Coverage differs between the chromosome and one of the plasmids, leading to distinct coverage peaks in the histogram

Slide 6

Slide 6 text

Assembly QC via SMRT® Portal - Raw Read Coverage
[Figures: Coverage Plot; SMRT View]
• Re-mapping the reads to the assembly may reveal discontinuities
• Sharp dips in coverage (region lacks read support)
• Sharp spikes in coverage (collapsed repeat elements)
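Outside SMRT View, the same dips and spikes can be flagged programmatically from a per-position (or per-window) depth profile. This is a minimal sketch, not part of SMRT Portal: the function name and both thresholds (`low_frac`, `high_frac`) are hypothetical and would need tuning, since the chromosome and plasmids can legitimately differ in coverage.

```python
from statistics import median


def flag_coverage_anomalies(coverage, low_frac=0.25, high_frac=2.0):
    """Return (dips, spikes) as lists of half-open (start, end) intervals.

    coverage  -- per-position (or per-window) read depth values
    low_frac  -- hypothetical threshold: depth < low_frac * median is a dip
    high_frac -- hypothetical threshold: depth > high_frac * median is a spike
    """
    med = median(coverage)

    def runs(predicate):
        # Collect maximal runs of consecutive positions satisfying predicate.
        intervals, start = [], None
        for i, depth in enumerate(coverage):
            if predicate(depth):
                if start is None:
                    start = i
            elif start is not None:
                intervals.append((start, i))
                start = None
        if start is not None:
            intervals.append((start, len(coverage)))
        return intervals

    dips = runs(lambda d: d < low_frac * med)
    spikes = runs(lambda d: d > high_frac * med)
    return dips, spikes


# Toy depth profile with one dip (indices 3-4) and one spike (indices 8-9).
depth = [100, 98, 102, 5, 8, 101, 99, 100, 250, 240, 97]
dips, spikes = flag_coverage_anomalies(depth)
print("dips:", dips)
print("spikes:", spikes)
```

A dip suggests a join that few reads span; a spike suggests that several repeat copies were collapsed into one.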

Slide 7

Slide 7 text

Assembly QC via SMRT® Portal - Bridge Mapper
• New for SMRT Analysis 2.1
• Runs BLASR multiple times on input subreads
• Split alignments are calculated
  – That is, cases where the start, middle, and end of a read align to different locations in the reference
• Visualization of alignments in SMRT View allows:
  – Detection of mis-assemblies
  – Identification of structural variation
  – Characterization of chimerism
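The idea of a split alignment can be illustrated with a small sketch. This is not Bridge Mapper's implementation: the `Hit` record, the `find_split_reads` function, and the `max_gap` threshold are all hypothetical, and real BLASR output carries more fields. The sketch only shows how pieces of one read mapping to distant or oppositely oriented locations could be detected.

```python
from collections import defaultdict, namedtuple

# Hypothetical alignment record for one aligned piece of a read.
Hit = namedtuple("Hit", "read read_start read_end ref ref_start ref_end strand")


def find_split_reads(hits, max_gap=1000):
    """Return {read: [hits]} for reads whose consecutive pieces map discordantly.

    Consecutive pieces (ordered by position in the read) are discordant if
    they hit different references, different strands, or reference intervals
    separated by more than max_gap bp.
    """
    by_read = defaultdict(list)
    for hit in hits:
        by_read[hit.read].append(hit)

    split = {}
    for read, pieces in by_read.items():
        pieces.sort(key=lambda h: h.read_start)
        for a, b in zip(pieces, pieces[1:]):
            # Distance between the two reference intervals (<= 0 if they touch).
            gap = max(a.ref_start, b.ref_start) - min(a.ref_end, b.ref_end)
            if a.ref != b.ref or a.strand != b.strand or gap > max_gap:
                split[read] = pieces
                break
    return split


hits = [
    Hit("read1", 0, 4000, "contig1", 10000, 14000, "+"),
    Hit("read1", 4000, 8000, "contig1", 14000, 18000, "+"),  # contiguous
    Hit("read2", 0, 3000, "contig1", 50000, 53000, "+"),
    Hit("read2", 3000, 7000, "contig1", 90000, 94000, "+"),  # 37 kb jump
    Hit("read3", 0, 2500, "contig1", 20000, 22500, "+"),
    Hit("read3", 2500, 6000, "contig1", 22500, 26000, "-"),  # strand flip
]
print(sorted(find_split_reads(hits)))
```

In this toy data, a jump along the reference (read2) is the pattern expected around a deletion, and a strand flip (read3) is the pattern expected at an inversion breakpoint.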

Slide 8

Slide 8 text

SMRT® Portal 2.1 - BridgeMapper

Slide 9

Slide 9 text

SMRT® View Example - Inversion

Slide 10

Slide 10 text

SMRT® View Example - Deletion

Slide 11

Slide 11 text

SMRT® View Example – Collapsed repeat

Slide 12

Slide 12 text

SMRT® View Example – Joining two contigs

Slide 13

Slide 13 text

Other tools for QC / Finishing - BLAST® analysis
• http://blast.ncbi.nlm.nih.gov/Blast.cgi

Slide 14

Slide 14 text

Other tools for QC / Finishing – Dot Plots
• Gepard - http://www.helmholtz-muenchen.de/icb/gepard
[Figure: Dot plot for contig with a close match found via BLAST® analysis]
[Figure: Self-self dot plot showing circularity]
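To see why a self-self dot plot reveals circularity, the following minimal sketch computes the points of a dot plot from shared exact k-mers. It is an illustration only and not how Gepard works internally; the function name `kmer_matches`, the value of `k`, and the random toy contig are assumptions. Only the forward strand is considered.

```python
import random
from collections import defaultdict


def kmer_matches(seq_x, seq_y, k=8):
    """Return sorted (x, y) positions where seq_x and seq_y share an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(seq_x) - k + 1):
        index[seq_x[i:i + k]].append(i)
    matches = []
    for j in range(len(seq_y) - k + 1):
        for i in index.get(seq_y[j:j + k], ()):
            matches.append((i, j))
    return sorted(matches)


# Toy contig from an unclosed circular molecule: the last 20 bases repeat
# the first 20, as happens when the assembler runs past the start point.
random.seed(0)
unique = "".join(random.choice("ACGT") for _ in range(200))
contig = unique + unique[:20]

points = kmer_matches(contig, contig)
off_diagonal = [(x, y) for x, y in points if x != y]
print(len(points), "points,", len(off_diagonal), "off the main diagonal")
```

The main diagonal is the sequence matching itself. The short off-diagonal runs in the corners, starting at (0, 200) and (200, 0) in this toy case, are the overlapping ends that indicate a circular molecule.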

Slide 15

Slide 15 text

Other tools for QC / Finishing - Circularization
[Figure panels: Contig, Split Contig, Overlap, Consensus]
• Manually introduce a break (a new “>” header line) in the FASTA sequence
• Minimus2 can be used as a simple overlapper
• Minimus2 - http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus2
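The manual break can also be scripted. The sketch below splits one contig at its midpoint into two FASTA records, so that the original contig ends become the end of one record and the start of the other, which an overlapper can then join. The function name, the `_part1` / `_part2` record names, and the choice of the midpoint are assumptions made for illustration; running the overlapper itself is not shown.

```python
def split_contig(name, sequence, width=60):
    """Return FASTA text with the contig broken into two records at its midpoint."""
    mid = len(sequence) // 2
    halves = [
        (f"{name}_part1", sequence[:mid]),  # hypothetical record names
        (f"{name}_part2", sequence[mid:]),
    ]
    lines = []
    for header, seq in halves:
        lines.append(f">{header}")
        # Wrap sequence lines at a fixed width, as is conventional for FASTA.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(split_contig("contig1", "ACGTACGTAC", width=4), end="")
```

Any break point away from the ends should work; the midpoint is simply the easiest to compute.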

Slide 16

Slide 16 text

Other tools for QC / Finishing – Comparing assemblies
• MUMmer
  – http://mummer.sourceforge.net/
  – Alignment of multi-contig data against a reference
  – Alignment of two draft genomes
  – Repeat finding
  – Good examples and step-by-step guides:
    − http://mummer.sourceforge.net/examples/

Slide 17

Slide 17 text

Other tools for QC / Finishing – Comparing assemblies
• Mauve
  – Multiple Genome Alignment
  – Aaron E. Darling, Bob Mau, and Nicole T. Perna. 2010. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss, and Rearrangement. PLoS One. 5(6):e11147.
  – http://asap.ahabs.wisc.edu/mauve/

Slide 18

Slide 18 text

Advanced analysis - Visualizing the overlap graph
• CA_best_edge_to_GML.py - https://github.com/PacificBiosciences/HBAR-DTK
• Gephi - https://gephi.org/
  – Yifan Hu's Multilevel algorithm

Slide 19

Slide 19 text

Tertiary analysis – Bacterial annotation
• Find phage insertions in genome or plasmid:
  – PHAST (PHAge Search Tool) - http://phast.wishartlab.com/
• Automatic annotation:
  – RAST (Rapid Annotation using Subsystem Technology) - http://rast.nmpdr.org/
  – BASys (Bacterial Annotation System) - https://www.basys.ca/

Slide 20

Slide 20 text

Tertiary analysis - PHAST
• Analysis using PHAST (http://phast.wishartlab.com)
• Several intact phage elements in the chromosome
• Regions 2, 6 and 8 each have a single adenine-specific methyltransferase
[Figure panels: Region 2, Region 6, Region 8]

Slide 21

Slide 21 text

Tertiary analysis - RAST
• The GFF3 file can be downloaded and loaded as a track in SMRT® View

Slide 22

Slide 22 text

Tertiary analysis – RAST, track in SMRT® View

Slide 23

Slide 23 text

Tertiary analysis - BASys

Slide 24

Slide 24 text

Example 1: Bacterial Genome, 1 Circular Chromosome
• E. coli 20 kb Size-Selected Library with P4 C2, 1 SMRT® Cell
• https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-20kb-Size-Selected-Library-with-P4-C2
Metric                       Value
Polished Contigs             1
Max Contig Length (bp)       4653310
N50 Contig Length (bp)       4653310
Sum of Contig Lengths (bp)   4653310

Slide 25

Slide 25 text

Example 1: Bacterial Genome, 1 Circular Chromosome
• Overlapping, self-similar ends indicate that the chromosome is circular
• The contig can be circularized and used as a reference for resequencing / base modification analysis

Slide 26

Slide 26 text

Example 2: Bacterial Genome, multiple plasmids
• Remove spurious contig with low coverage
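One way to shortlist candidates for removal is to compare each contig's mean coverage with the median across contigs. This is a minimal sketch under stated assumptions: the coverage values below are hypothetical and not taken from the slide, and the `min_fraction` cutoff is arbitrary. Plasmids can legitimately differ in coverage from the chromosome, so flagged contigs should still be checked (for example with BLAST®) before they are dropped.

```python
from statistics import median


def low_coverage_contigs(mean_coverage, min_fraction=0.2):
    """Return ids of contigs whose mean coverage is below min_fraction * median."""
    cutoff = min_fraction * median(mean_coverage.values())
    return sorted(cid for cid, cov in mean_coverage.items() if cov < cutoff)


# Hypothetical mean coverage per contig (not from the slide).
coverage = {
    "chromosome": 110.0,
    "plasmid_A": 140.0,
    "plasmid_B": 95.0,
    "unknown_1": 6.0,
}
print(low_coverage_contigs(coverage))
```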

Slide 27

Slide 27 text

Example 3: Plasmid with multiple repeat elements
[Figure: Subread mapping]

Slide 28

Slide 28 text

Example 3: Plasmid with multiple repeat elements
• What evidence do we have for the largest number of repeats on a single plasmid?
• Largest mapped subread:
  – 10,386 bp read with even coverage
  – Spans 3 repeat units

Slide 29

Slide 29 text

Example 4: Incomplete assembly
# Contigs              7
# Bases                5,130,210
N50                    2,594,109
Max contig length      2,594,109
Contigs > 10,000 bp    5

Contig id   Size (bp)   BLAST® hits                                              Coverage of raw reads
0007        7,298       possibly rRNA with high repeats                          ~35x
0008        95,929      pRSB107-like plasmid                                     70x **
0009        2,594,109   Ends map to Enterobacteria phage DNA *                   117x
0010        1,252,695   Ends map to Enterobacteria phage DNA *                   106x
0011        1,157,888   Ends map to Enterobacteria phage DNA *                   107x
0012        16,801      76% match to an Enterobacteria phage at high identity    ~100x
0013        5,490       96% match to an Enterobacteria phage at high identity    ~10x
** plasmid DNA not 1:1 with genomic DNA

Slide 30

Slide 30 text

Example 4: Incomplete assembly – Contig 0009 vs. Phage

Slide 31

Slide 31 text

Example 4: Incomplete assembly – Contig 0010 vs. Phage

Slide 32

Slide 32 text

Example 4: Incomplete assembly – Contig 0011 vs. Phage

Slide 33

Slide 33 text

Example 4: Incomplete assembly - Ambiguity
[Figure, not to scale: contigs 0009, 0010, and 0011 drawn in two alternative arrangements ("or"), annotated with regions of high similarity to phage DNA (Enterobacteria phage DE3) and with mapping direction]

Slide 34

Slide 34 text

Example 5: Evidence for splitting a contig

Slide 35

Slide 35 text

Example 6: Mis-assembly at the end of contigs?

Slide 36

Slide 36 text

Example 6: Mis-assembly at the end of contigs?

Slide 37

Slide 37 text

Summary
• Manually validating assemblies becomes a viable option when contig numbers are low
• SMRT® Portal can be used as a first pass
  – Coverage plot
  – SMRT View
    − Raw read coverage
    − Bridge Mapping
• Third-party tools can be used for QC / finishing
  – Dot plots
  – Aligning to known sequences
  – Circularization
• Tertiary analysis for bacterial genomes can be done in an automated fashion and results visualized in SMRT View
  – RAST, PHAST, BASys

Slide 38

Slide 38 text

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
