Slide 1

Slide 1 text

De Novo Assembly
User Group Meeting – Menlo Park, September 2013
FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

Learning Objectives
Scientists and Bioinformaticians
• Interested in Experimental Design and De Novo assembly using the PacBio® HGAP method
After the training, you will be able to
• Understand how the HGAP method works
• Understand the coverage targets for de novo assembly with PacBio® data
• Import and parameterize an HGAP assembly job in SMRT® Portal
• SMRT® Technology
• PacBio® System Workflow

Slide 3

Slide 3 text

Improve and Finish Genomes with the PacBio® System
De novo Assembly
• Complete genomes with PacBio reads alone
• Combine technologies for best of both worlds
Scaffold
• Establish framework for genome and resolve ambiguities
Span Gaps
• Polish genomic regions with up to 10x improvement

Slide 4

Slide 4 text

Hierarchical Genome Assembly Process

Slide 5

Slide 5 text

Hybrid Solutions for De Novo Assemblies
• Combine long SMRT® Sequencing reads with short reads for error correction
• Requires multiple types of data, (at least) two library preps, different sequencing technologies…

Slide 6

Slide 6 text

New De Novo Assembly Algorithms
• Powerful assembly algorithms combining long reads with short reads for error correction
• Use just the long-insert-library reads (best for assembly) for de novo assembly

Slide 7

Slide 7 text

Hierarchical Genome Assembly Process
• Construct Pre-assembled Long Reads (PLRs) from Continuous Long Reads (CLRs)
• Assemble PLRs into contigs
Figure labels: Short Continuous Long Reads; PLRs (Pre-assembled Long Reads); Contig

Slide 8

Slide 8 text

Pre-Assembly of Single-Pass Long Reads (RS_PreAssembler)
• Input: Single-pass long reads
• Select longest as seed reads
• Map all to seed reads
• Generate consensus of mapped reads
• Output: Pre-assembled reads

Slide 9

Slide 9 text

Assembly of Pre-Assembled Reads into Contigs (Celera® Assembler)
• Input: Pre-assembled reads
• Identify overlaps between reads
• Generate layout of overlapping reads
• Generate consensus
• Output: Contigs of assembly
Figure labels: Unitigs
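The overlap step listed above can be illustrated with a toy sketch. This is not how Celera® Assembler is implemented: real overlappers tolerate mismatches and indels, while this example uses exact suffix-prefix matching only. The function names, the example reads, and the `min_len` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def find_overlaps(reads: List[str], min_len: int = 3) -> Dict[Tuple[int, int], int]:
    """All-against-all exact overlaps, keyed by (index of left read, index of right read)."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    overlaps[(i, j)] = k
    return overlaps


# Hypothetical toy reads; each one overlaps the start of the next.
reads = ["ACGTACGGA", "CGGATTC", "TTCAGGT"]
overlaps = find_overlaps(reads)
```

The resulting dictionary is the raw material for a layout step: following the recorded (left, right) pairs orders the reads before a consensus is called.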

Slide 10

Slide 10 text

Assembly Polishing via Quiver (RS_Resequencing)
• Input: Single-pass long reads; Contigs
• Map to de novo-assembled reference
• Quiver: Base-quality-aware consensus of uniquely mapped reads
• Output: High-quality consensus

Slide 11

Slide 11 text

HGAP Example - Meiothermus ruber
• 10 kb SMRTbell™ library
• 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
• 250 Mb >5 kb
Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; Pre-assembly; Celera Assembler; Polish, Quiver
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

Slide 14

Slide 14 text

HGAP Example - Meiothermus ruber
• 10 kb SMRTbell™ library
• 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; 1 contig; Pre-assembly; Celera® Assembler; Minimus2; Quiver
• Single-contig assembly
• 99.99965% concordance with reference
• 99.3% genes predicted
Collaboration with A. Clum, A. Copeland (Joint Genome Institute)

Slide 15

Slide 15 text

Polish with Quiver for High Accuracy
Organism: Meiothermus ruber
• Assembly size (bases): 3,098,781
• Differences with Sanger reference: 11
• Concordance with Sanger reference: 99.99965%
• Nominal QV: 54.5
• SNPs validated as correct PacBio calls: 8
• Remaining differences: 1(3)
• QV: 60
Figure labels: M. ruber Sanger reference; PacBio® reads; Targeted Sanger validation
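The concordance and QV figures on this slide follow from the difference counts via the standard Phred relation QV = -10 · log10(error rate). The sketch below reproduces that arithmetic. The function names are hypothetical. Treating the final QV of 60 as the result of 11 - 8 = 3 remaining differences is an assumption about how the slide's numbers relate, not something the slide states.

```python
import math


def concordance_percent(differences: int, assembly_size: int) -> float:
    """Percentage of assembly bases that agree with the reference."""
    return 100.0 * (1.0 - differences / assembly_size)


def phred_qv(differences: int, assembly_size: int) -> float:
    """Phred-scaled quality: -10 * log10(per-base error rate)."""
    return -10.0 * math.log10(differences / assembly_size)


ASSEMBLY_SIZE = 3_098_781   # bases, from the slide
DIFFERENCES = 11            # differences with the Sanger reference, from the slide
VALIDATED_AS_PACBIO = 8     # SNPs validated as correct PacBio calls, from the slide

concordance = concordance_percent(DIFFERENCES, ASSEMBLY_SIZE)
nominal_qv = phred_qv(DIFFERENCES, ASSEMBLY_SIZE)

# Assumption: the differences left after validation are 11 - 8 = 3.
validated_qv = phred_qv(DIFFERENCES - VALIDATED_AS_PACBIO, ASSEMBLY_SIZE)

print(f"concordance {concordance:.5f}%, nominal QV {nominal_qv:.1f}, "
      f"QV after validation {validated_qv:.0f}")
```

Under these assumptions the computed values line up with the slide: roughly 99.99965% concordance, a nominal QV near 54.5, and a QV near 60 once the validated calls are excluded.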

Slide 16

Slide 16 text

De novo Assembly Experimental Design Recommendations

Slide 17

Slide 17 text

Experimental Design – Choosing an Analysis Method
Workflow: Experimental Design; Isolate DNA; Template Preparation; Sequencing; Analysis
• Analysis strategy depends on project objectives, genome size and complexity, as well as the quality and type of data available
Method: Hierarchical
– Project Objectives: Draft or finished assembly
– Available Inputs: >70x PacBio® long-insert library
– Genome Sizes: < 10 MB (SMRT® Portal); < 130 MB (Command Line)
Method: Hybrid
– Project Objectives: Draft or finished assembly
– Available Inputs: 20-50x of PacBio long-insert library; 20-50x shorter read library (CCS/454®/Illumina®)
– Genome Sizes: Any size
Method: Scaffolding
– Project Objectives: Improve existing assembly
– Available Inputs: 10X PacBio long-insert library; high-confidence contigs from existing assembly
– Genome Sizes: < 200 MB
Method: Gap Filling
– Project Objectives: Fill gaps in existing assembly
– Available Inputs: 5-10x PacBio long-insert library; scaffolds from mate-pair assembly
– Genome Sizes: < 4 GB
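The Hierarchical row of the table above can be read as a simple decision rule. The following is a minimal sketch of that rule only; the function name `hgap_route` and its return strings are hypothetical, and the thresholds are copied from the slide (more than 70x long-insert coverage, under 10 MB for SMRT® Portal, under 130 MB for the command line).

```python
from typing import Optional


def hgap_route(genome_size_mb: float, long_read_coverage: float) -> Optional[str]:
    """Suggest where a hierarchical (HGAP) assembly could be run.

    Returns "SMRT Portal", "Command Line", or None when the slide's
    coverage or genome-size limits for the hierarchical method are not met.
    """
    if long_read_coverage <= 70:   # slide asks for >70x long-insert coverage
        return None
    if genome_size_mb < 10:        # SMRT Portal limit from the slide
        return "SMRT Portal"
    if genome_size_mb < 130:       # command-line limit from the slide
        return "Command Line"
    return None


# Example: a ~3.1 MB bacterial genome sequenced to 100x.
route = hgap_route(3.1, 100)
```

A `None` result means one of the other methods in the table (Hybrid, Scaffolding, Gap Filling) may be a better fit for the available data.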

Slide 18

Slide 18 text

Sample Preparation
• Sample quality is critical to maximize potential performance
• No amplification step during library preparation
• Recommendations:
– Take care during extraction to avoid gDNA damage and contaminants
– Use extraction methods or kits that produce very high molecular weight gDNA
– If contaminants are present, purify starting DNA material prior to library prep
– Accurately quantify and qualitatively evaluate gDNA
– Include DNA-damage-repair step in library prep

Slide 19

Slide 19 text

Template Preparation Recommendations
• Recommend at least 10 kb insert libraries to maximize subread length
– DNA Template Prep Kit 2.0 (3 kb ‒ <10 kb)
– Procedure & Checklist ‒ Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
– Recommend: Final 0.4x AMPure® Purification instead of 0.45x
– Minimum input: 1 µg
• 20 kb insert libraries combined with size selection are beneficial for increasing subread lengths
– Optional protocols available on SampleNet for >10 kb libraries
– Requires more starting sample (recommended >7.5 µg)
– Final AMPure® Purification (0.4x or even 0.375x) can also remove shorter SMRTbell™ inserts

Slide 20

Slide 20 text

Sequencing Recommendations: Long Insert Libraries
• Instrument: PacBio® RS II
• DNA Polymerase/Binding Kit: DNA/Polymerase Binding Kit P4
• DNA Sequencing Kit: DNA Sequencing Kit 2.0 (C2)
• Loading: MagBead loading; follow protocol for insert size
• Stage Start: Stage Start = yes
• Movie Time: 1 x 120 minutes

Slide 21

Slide 21 text

De Novo Experimental Design Takeaways
• P4 enzyme
• MagBead loading
• Stage Start
• Movie Time: 1 x 120 min
• Do not overload
• Target 100X Coverage
• SMRT® Analysis 2.0.1 supports Hierarchical Assembly using RS_Preassembler and Celera® Assembler
• Quiver for assembly polishing to increase consensus accuracy
• Post-assembly QC
• See DevNet for additional recommendations
• Don’t forget base modification
• Good quality sample preparation is key!
• Limit DNA damage during sample extraction
• 10 kb library protocol for long read library
• Recommend size selection and large-insert protocols available through SampleNet
• Error correction (2 kb libraries) no longer needed for HGAP
Workflow stages: Sample Prep; Run Design; Sequencing on the PacBio® System and Primary Analysis; Secondary Analysis; Tertiary Analysis
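The 100X coverage target translates into a sequencing budget with simple arithmetic: required bases = genome size × target coverage. The sketch below works through that for the M. ruber assembly size quoted earlier. The function names are hypothetical, and the per-cell yield is a made-up placeholder, not a PacBio specification; substitute the yield observed on your own runs.

```python
import math


def bases_for_coverage(genome_size_bp: int, target_coverage: int) -> int:
    """Total sequenced bases needed to reach the target fold coverage."""
    return genome_size_bp * target_coverage


def cells_needed(genome_size_bp: int, target_coverage: int,
                 yield_per_cell_bp: int) -> int:
    """Number of SMRT Cells to run, rounded up to a whole cell."""
    required = bases_for_coverage(genome_size_bp, target_coverage)
    return math.ceil(required / yield_per_cell_bp)


GENOME_SIZE_BP = 3_098_781               # M. ruber assembly size from the earlier slide
TARGET_COVERAGE = 100                    # "Target 100X Coverage"
ASSUMED_YIELD_PER_CELL_BP = 150_000_000  # hypothetical placeholder, not a vendor figure

required_bases = bases_for_coverage(GENOME_SIZE_BP, TARGET_COVERAGE)
n_cells = cells_needed(GENOME_SIZE_BP, TARGET_COVERAGE, ASSUMED_YIELD_PER_CELL_BP)
```

Rounding up with `math.ceil` is deliberate: falling slightly short of the target is worse for assembly than overshooting it by part of a cell.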

Slide 22

Slide 22 text

Where to Find Additional Information
• Links to publications, videos of presentations, posters and other de novo assembly resources available through PacBio’s website (www.pacb.com/denovo)
• Protocols, Technical & Application Notes available through Customer Portal
• DevNet
– HGAP Reference Implementation: http://www.smrtcommunity.com/Share/Code?id=a1q70000000H2qRAAS
– Quiver: www.pacbiodevnet.com/quiver
– Bacterial Assembly and Epigenetic Analysis Training Web Video: http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
• Additional information on Assembly Tools
– Celera® Assembler: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA
– Allpaths-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
– PBJelly: http://sourceforge.net/p/pb-jelly/wiki/Home/

Slide 23

Slide 23 text

HGAP Walk-Through

Slide 24

Slide 24 text

Import Data To SMRT® Portal

Slide 28

Slide 28 text

Create New HGAP Job in SMRT® Portal 2.0.1

Slide 30

Slide 30 text

HGAP Protocol and Parameters in SMRT® Portal 2.0.1
• Minimum Seed Read Length
– 30X coverage of longest seed reads automatically calculated
– Uncheck to override “auto”
• Automatic FASTQ Trimming
– QV > 59.5 & Length > 500 bp
• Use CCS option
– Enable Hybrid Assembly
• Genome Size
– 10 MB limit in SMRT Portal 2.0.1
• Allow Partial Alignments
– Improves PreAssembly with P4-C2 & XL-C2
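One plausible reading of the automatic seed-length setting is: sort reads from longest to shortest and accumulate bases until the longest reads alone give 30X coverage of the genome; the length of the last read added becomes the cutoff. The sketch below implements that reading. It is an illustration under that assumption, not the SMRT® Portal implementation, and the function name and example numbers are hypothetical.

```python
from typing import Iterable, Optional


def seed_length_cutoff(read_lengths: Iterable[int], genome_size: int,
                       seed_coverage: int = 30) -> Optional[int]:
    """Smallest read length such that all reads at least that long
    together reach `seed_coverage` fold coverage of the genome.

    Returns None when the whole data set cannot reach the target.
    """
    target_bases = genome_size * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return None


# Hypothetical toy data: a 1,000-base "genome" and seven reads.
lengths = [3000, 9000, 5000, 7000, 4000, 8000, 6000]
cutoff = seed_length_cutoff(lengths, genome_size=1000)
```

Here the four longest reads (9000 + 8000 + 7000 + 6000 = 30,000 bases) reach 30X, so reads of 6000 bases or longer would serve as seeds and the shorter reads would be mapped onto them.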

Slide 31

Slide 31 text

HGAP Protocol and Parameters in SMRT® Portal 2.0.1
• *Minimum Subread Length
– Total coverage 3-4X seed-read coverage
– (Usually not necessary)

Slide 32

Slide 32 text

HGAP Output - Overview

Slide 33

Slide 33 text

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.