
De Novo Assembly Overview

PacBio
September 19, 2013


Transcript

  1. FIND MEANING IN COMPLEXITY
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
    User Group Meeting – Menlo Park, September 2013
    De Novo Assembly
  2. Learning Objectives
    Scientists and Bioinformaticians
    • Interested in Experimental Design and De Novo assembly using the PacBio® HGAP method
    After the training, you will be able to
    • Understand how the HGAP method works
    • Understand the coverage targets for de novo assembly with PacBio® data
    • Import and parameterize an HGAP assembly job in SMRT® Portal
    • SMRT® Technology
    • PacBio® System Workflow
  3. Improve and Finish Genomes with the PacBio® System
    • De novo Assembly: Complete genomes with PacBio reads alone; combine technologies for best of both worlds
    • Scaffold: Establish framework for genome and resolve ambiguities
    • Span Gaps: Polish genomic regions with up to 10x improvement
  4. Hybrid Solutions for De Novo Assemblies
    • Combine long SMRT® Sequencing reads with short reads for error correction
    • Requires multiple types of data, (at least) two library preps, different sequencing technologies…
  5. New De Novo Assembly Algorithms
    • Powerful assembly algorithms combining long reads with short reads for error correction
    • Use just the long-insert library reads for de novo assembly
    Figure label: best for assembly
  6. Hierarchical Genome Assembly Process
    • Construct Pre-assembled Long Reads (PLRs) from CLRs
    • Assemble PLRs into contigs
    Figure labels: Short; Continuous Long Reads; PLRs; Pre-assembled Long Read; Contig
  7. Pre-Assembly of Single-Pass Long Reads (RS_PreAssembler)
    Input: Single-pass long reads
    • Select longest as seed reads
    • Map all to seed reads
    • Generate consensus of mapped reads
    Output: Pre-assembled reads
  8. Assembly of Pre-Assembled Reads into Contigs (Celera® Assembler)
    Input: Pre-assembled reads
    • Identify overlaps between reads
    • Generate layout of overlapping reads (Unitigs)
    • Generate consensus
    Output: Contigs of assembly
  9. Assembly Polishing via Quiver (RS_Resequencing)
    Input: Single-pass long reads; Contigs
    • Map to de novo-assembled reference
    • Quiver: Base-quality-aware consensus of uniquely mapped reads
    Output: High-quality consensus
  10. HGAP Example - Meiothermus ruber
    10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    Data: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig
    Steps: Pre-assembly; Celera® Assembler; Polish, Quiver
    250 Mb >5 kb
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  11. HGAP Example - Meiothermus ruber
    10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    Data: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig
    Steps: Pre-assembly; Celera® Assembler; Polish, Quiver
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  12. HGAP Example - Meiothermus ruber
    10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    Data: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig
    Steps: Pre-assembly; Celera® Assembler; Polish, Quiver
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  13. HGAP Example - Meiothermus ruber
    10 kb SMRTbell™ library, 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    Data: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; 1 contig
    Steps: Pre-assembly; Celera® Assembler; Minimus2; Quiver
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
    • Single-contig assembly
    • 99.99965% concordance with reference
    • 99.3% genes predicted
  14. Polish with Quiver for High Accuracy
    Organism: Meiothermus ruber
    • Assembly size (bases): 3,098,781
    • Differences with Sanger reference: 11
    • Concordance with Sanger reference: 99.99965%
    • Nominal QV: 54.5
    • SNPs validated as correct PacBio calls: 8
    • Remaining differences: 1(3)
    • QV: 60
    Figure labels: M. ruber Sanger reference; PacBio® reads; Targeted Sanger validation
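
The concordance and nominal QV reported for M. ruber can be checked with a little arithmetic. The sketch below assumes the nominal QV is the Phred-scaled per-base difference rate (QV = -10 · log10(differences / assembly size)); the slide does not state the formula, and the function name is illustrative only.

```python
import math

def concordance_and_qv(differences, assembly_size):
    """Return (concordance in percent, Phred-scaled QV) for a per-base difference count.

    Assumption: QV = -10 * log10(differences / assembly_size).
    """
    error_rate = differences / assembly_size
    concordance_pct = 100.0 * (1.0 - error_rate)
    qv = -10.0 * math.log10(error_rate)
    return concordance_pct, qv

# Values taken from the M. ruber row of the slide.
pct, qv = concordance_and_qv(11, 3_098_781)
print(f"{pct:.5f}% concordance, QV {qv:.1f}")  # 99.99965% concordance, QV 54.5
```

Under the same assumption, 3 remaining differences over the same assembly size works out to roughly QV 60, which matches the final column if the parenthesized "(3)" is the count used.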
  15. Experimental Design – Choosing an Analysis Method
    Workflow bar: Experimental Design, Isolate DNA, Template Preparation, Sequencing, Analysis
    • Analysis strategy depends on project objectives, genome size and complexity, as well as the quality and type of data available
    Methods (project objectives, available inputs, genome sizes):
    • Hierarchical
      – Project objectives: Draft or finished assembly
      – Available inputs: >70x PacBio® long-insert library
      – Genome sizes: < 10 Mb (SMRT® Portal); < 130 Mb (command line)
    • Hybrid
      – Project objectives: Draft or finished assembly
      – Available inputs: 20-50x PacBio long-insert library; 20-50x shorter read library (CCS/454®/Illumina®)
      – Genome sizes: Any size
    • Scaffolding
      – Project objectives: Improve existing assembly
      – Available inputs: 10x PacBio long-insert library; high-confidence contigs from existing assembly
      – Genome sizes: < 200 Mb
    • Gap Filling
      – Project objectives: Fill gaps in existing assembly
      – Available inputs: 5-10x PacBio long-insert library; scaffolds from mate-pair assembly
      – Genome sizes: < 4 Gb
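
The method table above can be read as a simple lookup. The following is a minimal sketch of that lookup, not a PacBio tool: the `Project` fields and function name are hypothetical, genome sizes are taken to be megabases, and the lower end of each coverage range in the table is treated as a minimum requirement, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Project:
    genome_size_mb: float            # genome size in megabases
    pacbio_coverage: float           # fold coverage from the PacBio long-insert library
    short_read_coverage: float = 0.0 # fold coverage from CCS/454/Illumina reads
    has_contigs: bool = False        # high-confidence contigs from an existing assembly
    has_scaffolds: bool = False      # scaffolds from a mate-pair assembly

def candidate_methods(p):
    """List the analysis methods from the table whose thresholds the project meets."""
    methods = []
    if p.pacbio_coverage > 70 and p.genome_size_mb < 130:
        where = "SMRT Portal" if p.genome_size_mb < 10 else "command line"
        methods.append(f"Hierarchical ({where})")
    if p.pacbio_coverage >= 20 and p.short_read_coverage >= 20:
        methods.append("Hybrid")  # any genome size
    if p.has_contigs and p.pacbio_coverage >= 10 and p.genome_size_mb < 200:
        methods.append("Scaffolding")
    if p.has_scaffolds and p.pacbio_coverage >= 5 and p.genome_size_mb < 4000:
        methods.append("Gap Filling")
    return methods

# A 3.1 Mb bacterial genome at 80x PacBio coverage.
print(candidate_methods(Project(genome_size_mb=3.1, pacbio_coverage=80)))
```

Returning every qualifying method, rather than a single choice, reflects the slide's point that the strategy also depends on project objectives and data quality.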
  16. Sample Preparation
    Workflow bar: Experimental Design, Isolate DNA, Template Preparation, Sequencing, Analysis
    • Sample quality is critical to maximize potential performance
    • No amplification step during library preparation
    • Recommendations:
      – Take care during extraction to avoid gDNA damage & avoid contaminants
      – Use extraction methods or kits that produce very high molecular weight gDNA
      – If contaminants are present, purify starting DNA material prior to library prep
      – Accurately quantify and qualitatively evaluate gDNA
      – Include DNA-damage-repair step in library prep
  17. Template Preparation Recommendations
    Workflow bar: Experimental Design, Isolate DNA, Template Preparation, Sequencing, Analysis
    • Recommend at least 10 kb insert libraries to maximize subread length
      – DNA Template Prep Kit 2.0 (3 kb ‒ <10 kb)
      – Procedure & Checklist ‒ Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
      – Recommend: Final 0.4x AMPure® Purification instead of 0.45x
      – Minimum input: 1 µg
    • 20 kb insert libraries combined with size selection beneficial for increasing subread lengths
      – Optional protocols available on SampleNet for >10 kb libraries
      – Requires more starting sample (recommended >7.5 µg)
      – Final AMPure® Purification (0.4x or even 0.375x) can also remove shorter SMRTbell™ inserts
  18. Sequencing Recommendations
    Workflow bar: Experimental Design, Isolate DNA, Template Preparation, Sequencing, Analysis
    Long Insert Libraries:
    • Instrument: PacBio® RS II
    • DNA Polymerase/Binding Kit: DNA/Polymerase Binding Kit P4
    • DNA Sequencing Kit: DNA Sequencing Kit 2.0 (C2)
    • Loading: MagBead loading; follow protocol for insert size
    • Stage Start: Stage Start = yes
    • Movie Time: 1 x 120 minutes
  19. De Novo Experimental Design Takeaways
    • P4 enzyme
    • MagBead loading
    • Stage Start
    • Movie Time: 1 x 120 min
    • Do not overload
    • Target 100x Coverage
    • SMRT® Analysis 2.0.1 supports Hierarchical Assembly using RS_PreAssembler and Celera® Assembler
    • Quiver for assembly polishing to increase consensus accuracy
    • Post-assembly QC
    • See DevNet for additional recommendations
    • Don’t forget base modification
    • Good quality sample preparation is key!
    • Limit DNA damage during sample extraction
    • 10 kb library protocol for long read library
    • Recommend size selection and large-insert protocols available through SampleNet
    • Error correction (2 kb libraries) no longer needed for HGAP
    Workflow stages: Sample Prep; Run Design; Sequencing on the PacBio® System and Primary Analysis; Secondary Analysis; Tertiary Analysis
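
The 100x coverage target translates directly into a sequencing yield and a SMRT Cell count. A rough sketch follows; the per-cell yield is a hypothetical input that depends on chemistry, library, and loading, and is not a figure from this deck.

```python
import math

def smrt_cells_needed(genome_size_bp, yield_per_cell_bp, target_coverage=100):
    """Estimate how many SMRT Cells are needed to reach a target fold coverage.

    yield_per_cell_bp is a user-supplied, hypothetical per-cell yield in bases.
    """
    required_bases = genome_size_bp * target_coverage
    return math.ceil(required_bases / yield_per_cell_bp)

# M. ruber assembly size from the Quiver slide, with an assumed 100 Mb per cell.
print(smrt_cells_needed(3_098_781, yield_per_cell_bp=100_000_000))  # 4
```

Rounding up is deliberate: a partial cell cannot be run, and modest excess coverage is safer than falling short of the target.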
  20. Where to Find Additional Information
    • Links to publications, videos of presentations, posters and other de novo assembly resources available through PacBio’s website (www.pacb.com/denovo)
    • Protocols, Technical & Application Notes available through Customer Portal
    • DevNet
      – HGAP Reference Implementation: http://www.smrtcommunity.com/Share/Code?id=a1q70000000H2qRAAS
      – Quiver: www.pacbiodevnet.com/quiver
      – Bacterial Assembly and Epigenetic Analysis Training Web Video: http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
    • Additional information on Assembly Tools
      – Celera® Assembler: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA
      – Allpaths-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
      – PBJelly: http://sourceforge.net/p/pb-jelly/wiki/Home/
  21. HGAP Protocol and Parameters in SMRT® Portal 2.0.1
    • Minimum Seed Read Length
      – 30x coverage of longest seed reads automatically calculated
      – Uncheck to override “auto”
    • Automatic FASTQ Trimming
      – QV > 59.5 & Length > 500 bp
    • Use CCS option
      – Enable Hybrid Assembly
    • Genome Size
      – 10 Mb limit in SMRT® Portal 2.0.1
    • Allow Partial Alignments
      – Improves PreAssembly with P4-C2 & XL-C2
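
The automatic Minimum Seed Read Length can be pictured as a cutoff chosen so that the longest reads alone give about 30x coverage of the genome. The sketch below illustrates that idea only; it is not the SMRT Portal implementation, and the function name and toy numbers are assumptions.

```python
def seed_length_cutoff(read_lengths, genome_size_bp, seed_coverage=30):
    """Return the shortest read length among the longest reads that together
    reach seed_coverage x genome_size_bp, or None if the data falls short."""
    target_bases = genome_size_bp * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length  # reads at least this long become seed reads
    return None  # not enough bases to reach the requested seed coverage

# Toy example: 10 kb "genome", 2x seed coverage for readability.
reads = [9000, 7000, 6000, 5000, 3000, 2000, 1000]
print(seed_length_cutoff(reads, genome_size_bp=10_000, seed_coverage=2))  # 6000
```

Walking the reads from longest to shortest keeps the seed set as long as possible for a given coverage, which is the motivation for selecting the longest reads as seeds in the pre-assembly step.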
  22. HGAP Protocol and Parameters in SMRT® Portal 2.0.1
    • *Minimum Subread Length
      – Total coverage 3-4x seed-read coverage
      – (Usually not necessary)
  23. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.