Track 1: De Novo Assembly

FIND MEANING IN COMPLEXITY
De Novo Assembly
User Group Meeting – Menlo Park, October 2014
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
For Research Use Only. Not for use in diagnostic procedures.

Learning Objectives
Scientists and Bioinformaticians
• Interested in experimental design and de novo assembly using the PacBio® HGAP method
After the training, you will be able to
• Understand how the HGAP method works
• Understand the coverage targets for de novo assembly with PacBio data
• Import and parameterize an HGAP assembly job in SMRT® Portal
• SMRT® Technology
• PacBio® System Workflow

Agenda
• Seq Metrics and Assembly Definitions
• Hierarchical Genome Assembly Process - Microbial
• Large Genome Assembly Considerations
• HGAP Walk-Through - Hands On

Improve and Finish Genomes with the PacBio® System
• De novo assembly: complete genomes with PacBio reads alone
• Combine technologies for best of both worlds
• Scaffold: establish framework for genome and resolve ambiguities
• Span gaps: polish genomic regions with up to 10x improvement

Hybrid Solutions for De Novo Assemblies
• Combine long SMRT® Sequencing reads with short reads for error correction
• Requires multiple types of data, (at least) two library preps, different sequencing technologies…

New De Novo Assembly Algorithms
• Powerful assembly algorithms combining long reads with short reads for error correction
• Use just the long-insert-library reads for de novo assembly (best for assembly)

Seq Metrics and Assembly Definitions

From Polymerase Reads to Subreads or Read of Insert
• Subreads (purple and gold) are separated by adapter sequences (green)
• Read of Insert represents the highest-quality single sequence for an insert, regardless of number of passes
• ≥ 2 full passes required for CCS
• Both adapters must be detected for a read to be identified as “full pass”
• Individual subreads, Read of Insert, or CCS can be used for subsequent analysis, depending on application needs
[Figure labels: Polymerase Read; Subread; Read of Insert or CCS]
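The relationship between a polymerase read, its subreads, and the "full pass" rule can be sketched in a few lines of Python. This is only an illustration, not PacBio's implementation: the function names and the representation of detected adapters as (start, end) index pairs are assumptions made for the example.

```python
def split_subreads(polymerase_read, adapter_spans):
    """Split a polymerase read into subreads at adapter positions.

    adapter_spans is a hypothetical representation of detected adapters:
    sorted, non-overlapping (start, end) index pairs with end exclusive.
    Returns a list of (subread_sequence, is_full_pass) tuples, where a
    full pass is a subread with an adapter detected on both sides.
    """
    subreads = []
    prev_end = 0
    adapter_on_left = False  # nothing precedes the first subread
    for start, end in adapter_spans:
        seq = polymerase_read[prev_end:start]
        if seq:
            # An adapter follows this subread, so it is a full pass
            # only if an adapter also precedes it.
            subreads.append((seq, adapter_on_left))
        prev_end = end
        adapter_on_left = True
    tail = polymerase_read[prev_end:]
    if tail:
        # The final subread has no adapter on its right side.
        subreads.append((tail, False))
    return subreads


def full_pass_count(subreads):
    """Count subreads flanked by adapters on both sides."""
    return sum(1 for _, is_full in subreads if is_full)


# Toy read: lowercase "gg" stands in for an adapter sequence.
read = "AAAA" + "gg" + "CCCCCC" + "gg" + "TTTTTT" + "gg" + "AC"
subreads = split_subreads(read, [(4, 6), (12, 14), (20, 22)])
print(subreads)
print("CCS eligible:", full_pass_count(subreads) >= 2)
```

In this toy read, only the two middle subreads count as full passes, which is exactly enough to meet the slide's threshold of ≥ 2 full passes for CCS.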

Read Metrics Definitions
Polymerase Read
Definition:
• Sequence of nucleotides incorporated by polymerase while reading a template
• Includes adapters
• Often called “read”
• 1 molecule, 1 pol. read
Purpose:
• QC of instrument run
• Benchmarking
Subread
Definition:
• Single pass of template
• Adapters removed
• 1 molecule, ≥1 subreads
Unique data:
• Kinetic measurements
• Rich QVs
Purpose:
• For subsequent analysis
Read of Insert
Definition:
• Represents highest-quality single sequence for an insert, regardless of number of passes
• Generalizes CCS for <2 passes and RQ <0.9
• 1 or more passes
• 1 molecule, 1 read
Purpose:
• For library QC
• For subsequent analysis
[Figure: SMRTbell™ template]

Mapped Polymerase Read Length vs. Mapped Subread Length
Mapped Polymerase Read Length
• Measure of ZMW sequencing productivity and read length
• Upper bound set by the speed and fidelity of the polymerase and by movie time
Mapped Subread Length
• Measure of scientifically applicable sequence
• Upper bound set by insert size and loading effects
[Figure labels: Mapped Polymerase Read Length, Mapped Subread Length; 4 kb, 900 bp]

Basic Assembly Metrics
• Commonly used metrics include:
– Number of contigs
– N50: sort contigs by size, largest first, and add up their lengths until the running total reaches 50% of the total assembled sequence; N50 is the size of the contig at which that happens
− N50 = 10 bp
− Mean contig length = 3 bp
– Max contig size
• Limitations of these metrics:
– They do not capture information about assembly accuracy!
− Large-scale misassemblies
− Base-level errors
– There might be more than one chromosome (plasmid, phage, and so on)
– Contaminants may contribute to a contig (such as a cloning vector)
[Figure: example contigs of 10, 4, 1, 1, 1 and 1 bp]
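The N50 and mean contig length quoted on this slide can be reproduced with a short Python sketch. The contig sizes 10, 4, 1, 1, 1, 1 are taken from the slide's figure values; the function names are chosen for this example only.

```python
def n50(contig_lengths):
    """Return N50: walk contigs from largest to smallest and report the
    length of the contig at which the running sum reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig")


def mean_length(contig_lengths):
    """Arithmetic mean of the contig lengths."""
    return sum(contig_lengths) / len(contig_lengths)


contigs = [10, 4, 1, 1, 1, 1]   # contig sizes in bp, from the slide
print("N50:", n50(contigs))                 # 10
print("Mean length:", mean_length(contigs))  # 3.0
print("Max contig:", max(contigs))           # 10
```

The total is 18 bp, so the 10 bp contig alone already covers half of the assembly. That is why N50 (10 bp) can sit far above the mean (3 bp) when one contig dominates.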

Scaffolds vs. Contigs Defined
• Scaffolds have Ns in them, due to links from mate-pair data
• Contigs are contiguous sequences (no Ns). PacBio® sequencing generates contigs given our continuous reads
[Figure: a scaffold made of two contigs joined by a run of Ns]
ACACCACATCACGATCGATCGTGCATNNNNNNNNNNNNNNNNNNNCAGTAGTCAGCTAGCTACA
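The distinction can be made concrete by splitting the example scaffold above at its run of Ns. This is a minimal Python sketch; the function name is an assumption for illustration, and real scaffold files may encode gaps with additional conventions.

```python
import re


def scaffold_to_contigs(scaffold):
    """Split a scaffold sequence into its contigs at runs of N (gap) bases."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


scaffold = "ACACCACATCACGATCGATCGTGCATNNNNNNNNNNNNNNNNNNNCAGTAGTCAGCTAGCTACA"
contigs = scaffold_to_contigs(scaffold)
for i, contig in enumerate(contigs, start=1):
    print(f"contig {i}: {contig} ({len(contig)} bp)")
```

Each returned piece contains no Ns, matching the slide's definition of a contig.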

Hierarchical Genome Assembly Process

Hierarchical Genome Assembly Process
• Construct pre-assembled long reads (PLRs) from continuous long reads (CLRs)
• Assemble PLRs into contigs
[Figure labels: Short; Continuous Long Reads; PLRs; Pre-assembled Long Read; Contig]

Pre-Assembly of Single-Pass Long Reads (RS_PreAssembler)
• Input: single-pass long reads
• Select longest as seed reads
• Map all to seed reads
• Generate consensus of mapped reads
• Output: pre-assembled reads

Assembly of Pre-Assembled Reads into Contigs (Celera® Assembler)
• Input: pre-assembled reads
• Identify overlaps between reads
• Generate layout of overlapping reads (unitigs)
• Generate consensus
• Output: contigs of assembly
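The "identify overlaps between reads" step can be illustrated with a toy suffix-prefix overlap check in Python. This is a deliberately simplified sketch and not how Celera® Assembler works: it requires exact matches, whereas real overlappers tolerate sequencing errors, and the function name and `min_len` parameter are assumptions for the example.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Return the length of the longest suffix of read a that exactly
    matches a prefix of read b, or 0 if it is shorter than min_len."""
    longest_possible = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(longest_possible, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


read_a = "ACGTTGCA"
read_b = "TGCAGGT"
k = suffix_prefix_overlap(read_a, read_b)
print("overlap length:", k)
if k:
    # Merging the two reads across the overlap gives a longer sequence,
    # which is the basic idea behind laying reads out into contigs.
    print("merged:", read_a + read_b[k:])
```

Here the suffix TGCA of the first read matches the prefix of the second, so the two reads can be laid out end to end.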

Assembly Polishing via Quiver
• Input: single-pass long reads and contigs
• Map to de novo-assembled reference (RS_Resequencing)
• Base-quality-aware consensus of uniquely mapped reads (Quiver)
• Output: high-quality consensus

HGAP Example - Meiothermus ruber
10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
Steps: Pre-assembly; Celera® Assembler; Polish, Quiver
Intermediate results: long seed reads (>5 kb), pre-assembled long reads, 5 contigs, 1 contig
[Figure annotation: 250 Mb >5 kb]
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

HGAP Example - Meiothermus ruber
10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
Steps: Pre-assembly, Celera® Assembler, Minimus2, Quiver
Intermediate results: long seed reads (>5 kb), pre-assembled long reads, 5 contigs, 1 contig, 1 contig
Result:
• Single-contig assembly
• 99.99965% concordance with reference
• 99.3% genes predicted
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

Polish with Quiver for High Accuracy
Organism: Meiothermus ruber
• Assembly size (bases): 3,098,781
• Differences with Sanger reference: 11
• Concordance with Sanger reference: 99.99965%
• Nominal QV: 54.5
• SNPs validated as correct PacBio calls: 8
• Remaining differences: 1(3)
• QV: 60
[Figure labels: M. ruber Sanger reference; PacBio® reads; Targeted Sanger validation]
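The concordance and QV figures on this slide are consistent with the standard Phred scaling, QV = -10 · log10(error rate). The sketch below assumes that the slide's QVs are Phred-scaled and that the error rate is simply differences divided by assembly size. The function names are illustrative.

```python
import math


def concordance_percent(differences, assembly_size):
    """Percentage of bases that agree with the reference."""
    return 100 * (1 - differences / assembly_size)


def phred_qv(differences, assembly_size):
    """Phred-scaled quality value for a given number of differences."""
    return -10 * math.log10(differences / assembly_size)


assembly_size = 3_098_781  # bases, from the slide

# 11 differences against the Sanger reference before validation.
print(f"concordance: {concordance_percent(11, assembly_size):.5f}%")
print(f"nominal QV:  {phred_qv(11, assembly_size):.1f}")

# 3 remaining differences (the parenthesized value on the slide).
print(f"QV after validation: {phred_qv(3, assembly_size):.1f}")
```

With 11 differences this gives about QV 54.5, and with 3 remaining differences about QV 60, matching the slide's values under these assumptions.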

Large Genome Assembly Considerations

Experimental Design – Choosing an Analysis Method
Workflow: Experimental Design, Isolate DNA, Template Preparation, Sequencing, Analysis
• Analysis strategy depends on project objectives, genome size and complexity, as well as the quality and type of data available
Method: Hierarchical
– Project objectives: draft or finished assembly
– Available inputs: >70x PacBio® long-insert library
– Genome sizes: < 130 Mb in SMRT® Portal; larger at the command line
Method: Hybrid
– Project objectives: draft or finished assembly
– Available inputs: 20-50x PacBio long-insert library; 20-50x shorter-read library (CCS/454®/Illumina®)
– Genome sizes: any size
Method: Scaffolding
– Project objectives: improve existing assembly
– Available inputs: 10x PacBio long-insert library; high-confidence contigs from existing assembly
– Genome sizes: < 200 Mb
Method: Gap Filling
– Project objectives: fill gaps in existing assembly
– Available inputs: 5-10x PacBio long-insert library; scaffolds from mate-pair assembly
– Genome sizes: < 4 Gb
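The coverage targets above translate directly into a sequencing yield: required bases = genome size × coverage. The Python sketch below works through that arithmetic. The per-cell yield is a hypothetical placeholder supplied by the caller, not a PacBio specification, and the function names are assumptions for the example.

```python
import math


def required_bases(genome_size_bp, coverage):
    """Total sequenced bases needed to reach the given fold coverage."""
    return genome_size_bp * coverage


def cells_needed(genome_size_bp, coverage, yield_per_cell_bp):
    """Number of SMRT Cells needed, given an assumed yield per cell."""
    return math.ceil(required_bases(genome_size_bp, coverage) / yield_per_cell_bp)


# Hierarchical assembly at the slide's 70x target for a 130 Mb genome.
print(required_bases(130_000_000, 70) / 1e9, "Gb")

# A 5 Mb microbial genome at 70x, with a HYPOTHETICAL 300 Mb per cell.
print(cells_needed(5_000_000, 70, 300_000_000), "cells")
```

Because the hierarchical target is stated as more than 70x, the result is best read as a lower bound on the yield to plan for.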

Experimental Design – Repetitive Content
• Repetitive content
– One of the biggest challenges with de novo assembly.
– Solution: work with insert sizes that can span repeats and identify unique anchoring sequence on each side of the repeat.

Experimental Design – Ploidy
• Most assemblers were designed for haploid genomes.
– For a diploid with little structural variation between the chromosomes, a haploid approach can work.
– Structural heterozygosity appears as separate contigs.
• Select strains to minimize heterozygosity
– This helps facilitate assembly.
– Use inbred lines
– Use doubled-haploid strains
• Diploid or polyploid genomes
– Using a haploid assembler leads to fragmented assemblies.
– Consider FALCON (experimental code), or configure Celera Assembler to favor merging haplotypes.

Towards True Diploid Assemblies
[Figure: maternal and paternal alleles shown for three cases: Truth, Current assemblers, New diploid/polyploid assembler]
• Keep the long-range information while maintaining the relations of the alternative alleles.
• FALCON assembler: https://github.com/PacificBiosciences/falcon

Draft Genome Quality
• Gap filling of mate-pair-based scaffolded assemblies
– Sensitive to the quality of the starting assembly.
– Misassemblies in the scaffolds can result in improper alignments and incorrectly filled or unfillable gaps.
Gap Filling: using PacBio® CLR to fill gaps in existing mate-pair-based scaffolds
[Figure labels: Scaffolds with gaps; PacBio long reads; Filled-in scaffold]

Sample Preparation
• Key to a successful assembly is the generation of the longest reads possible.
• Sample quality is critical to maximize potential performance
• No amplification step during library preparation
• Recommendations:
– Take care during extraction to avoid gDNA damage and contaminants
– Use extraction methods or kits that produce very high molecular weight gDNA
– If contaminants are present, purify starting DNA material prior to library prep
– Accurately quantify and qualitatively evaluate gDNA
– Include DNA-damage-repair step in library prep

Template Preparation Recommendations
• Recommend at least 10 kb insert libraries to maximize subread length
– DNA Template Prep Kit 2.0 (3 kb to <10 kb)
– Procedure & Checklist: Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
– Recommend: final 0.4x AMPure® purification instead of 0.45x
– Minimum input: 1 µg
• 20 kb insert libraries combined with size selection are beneficial for increasing subread lengths
– Protocols available on Sample Net for 20 kb libraries
– Requires more starting sample (recommended >7.5 µg)
– Final AMPure® purification (0.4x or even 0.375x) can also remove shorter SMRTbell™ inserts

BluePippin™ System for Size Selection
[Figure: size-selected Mouse Lemur 20 kb library vs. 20 kb AMPure® Mouse Lemur library; traces labeled Input gDNA and Size-selected]

Size Selection Increases the Number of Long Subreads (from Lex Nederbragt's blog)
http://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longest-reads-pacbio-and-bluepippin/
“The plot shows that the BluePippin prep indeed had the desired effect: the reads are much longer.”
[Figure: N50 values shown: 4,041 bp and 8,820 bp]

PacBio® Bioinformatics Training Wiki on DevNet
Large Genome Assembly Guidance
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

Where to Find Additional Information
• Links to publications, videos of presentations, posters and other de novo assembly resources available through PacBio’s website (www.pacb.com/denovo)
• Protocols, Technical & Application Notes available through Customer Portal
• DevNet
– HGAP Reference Implementation: http://www.smrtcommunity.com/Share/Code?id=a1q70000000H2qRAAS
– Quiver: www.pacbiodevnet.com/quiver
– Bacterial Assembly and Epigenetic Analysis Training Web Video: http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
• Additional information on Assembly Tools
– Celera® Assembler: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA
– Allpaths-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– PBJelly: http://sourceforge.net/p/pb-jelly/wiki/Home/

For Research Use Only. Not for use in diagnostic procedures.
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.

HGAP Walk-Through

Import Data To SMRT® Portal

Create New HGAP Job in SMRT® Portal 2.3

HGAP Protocol and Parameters in SMRT® Portal 2.3
• Minimum Seed Read Length:
– 30x coverage of longest seed reads automatically calculated
– Uncheck to override “auto”
• Key parameter to set: Genome Size
– 130 Mb limit in SMRT Portal 2.3
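The idea behind an automatically calculated seed-read cutoff can be sketched as follows: sort reads from longest to shortest and keep adding them until they cover the genome 30 times; the length of the last read added is the cutoff. This is a hedged illustration of the concept only, not SMRT Portal's actual calculation, and the function name and inputs are assumptions.

```python
def seed_length_cutoff(read_lengths, genome_size_bp, seed_coverage=30):
    """Return the read length at which the longest reads, taken together,
    reach seed_coverage-fold coverage of the genome.

    If the data set is too small to reach the target, the shortest read
    length is returned (all reads would serve as seeds).
    """
    if not read_lengths:
        raise ValueError("no reads supplied")
    target_bases = genome_size_bp * seed_coverage
    total = 0
    cutoff = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        cutoff = length
        if total >= target_bases:
            break
    # Reads tied with the cutoff length would also pass a ">= cutoff"
    # filter, so the realized seed coverage can be slightly higher.
    return cutoff


# Toy example: a 500 bp "genome" and six reads.
reads = [9000, 8000, 7000, 3000, 2000, 1000]
print(seed_length_cutoff(reads, genome_size_bp=500))  # 8000
```

This also shows why Genome Size is the key parameter: the cutoff depends on it directly, so a wrong genome size shifts which reads are selected as seeds.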

HGAP Protocol and Parameters in SMRT® Portal 2.3
• For datasets with high coverage: Minimum Subread Length
– Total coverage 3-4x seed-read coverage
– (Usually not necessary)

HGAP Output - Overview

HGAP Output - Polished Assembly
