Track 1: De Novo Assembly

PacBio
October 15, 2014

Transcript

  1. FIND MEANING IN COMPLEXITY

    De Novo Assembly
    User Group Meeting – Menlo Park, October 2014
    © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
  2. Learning Objectives

    Scientists and Bioinformaticians
    • Interested in Experimental Design and De Novo assembly using the PacBio® HGAP method
    After the training, you will be able to:
    • Understand how the HGAP method works
    • Understand the coverage targets for de novo assembly with PacBio data
    • Import and parameterize an HGAP assembly job in SMRT® Portal
    • SMRT® Technology
    • PacBio® System Workflow
  3. Agenda

    • Sequencing Metrics and Assembly Definitions
    • Hierarchical Genome Assembly Process: Microbial
    • Large Genome Assembly Considerations
    • HGAP Walk-Through: Hands-On
  4. Improve and Finish Genomes with the PacBio® System

    • De novo Assembly: Complete genomes with PacBio reads alone
    Combine technologies for best of both worlds:
    • Scaffold: Establish framework for genome and resolve ambiguities
    • Span Gaps: Polish genomic regions with up to 10x improvement
  5. Hybrid Solutions for De Novo Assemblies

    • Combine long SMRT® Sequencing reads with short reads for error correction
    • Requires multiple types of data, (at least) two library preps, different sequencing technologies…
  6. New De Novo Assembly Algorithms

    • Powerful assembly algorithms combining long reads with short reads for error correction
    • Use just the long-insert library reads for de novo assembly
    Figure label: best for assembly
  7. From Polymerase Reads to Subreads or Read of Insert

    • Subreads (purple and gold) are separated by adapter sequences (green)
    • Read of Insert represents the highest-quality single sequence for an insert, regardless of number of passes
    • ≥ 2 full passes required for CCS
    • Both adapters must be detected for a read to be identified as “full pass”
    • Individual subreads, Read of Insert, or CCS can be used for subsequent analysis, depending on application needs
    Figure labels: Polymerase Read; Subread; Read of Insert or CCS
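The full-pass rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Subread` record and its `adapter_before` / `adapter_after` flags are hypothetical names invented here, not fields of any PacBio file format or API.

```python
from collections import namedtuple

# Hypothetical record: coordinates of a subread within the polymerase read,
# plus whether an adapter was detected on each side of it.
Subread = namedtuple("Subread", "start end adapter_before adapter_after")

def count_full_passes(subreads):
    """Count subreads flanked by a detected adapter on both sides."""
    return sum(1 for s in subreads if s.adapter_before and s.adapter_after)

def ccs_eligible(subreads, min_full_passes=2):
    """Apply the slide's rule: at least 2 full passes are required for CCS."""
    return count_full_passes(subreads) >= min_full_passes

# One polymerase read: a partial first pass, two full passes, a partial last pass.
example = [
    Subread(0, 900, False, True),
    Subread(945, 2000, True, True),
    Subread(2045, 3100, True, True),
    Subread(3145, 3500, True, False),
]
print(count_full_passes(example))  # 2
print(ccs_eligible(example))       # True
```

A read with fewer than two full passes would still yield a Read of Insert, which is why the slide describes Read of Insert as applying regardless of the number of passes.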
  8. Read Metrics Definitions

    Polymerase Read
    Definition:
    • Sequence of nucleotides incorporated by polymerase while reading a template
    • Includes adapters
    • Often called “read”
    • 1 molecule, 1 pol. read
    Purpose:
    • QC of instrument run
    • Benchmarking

    Subread
    Definition:
    • Single pass of template
    • Adapters removed
    • 1 molecule, ≥1 subreads
    Unique data:
    • Kinetic measurements
    • Rich QVs
    Purpose:
    • For subsequent analysis

    Read of Insert
    Definition:
    • Represents highest-quality single sequence for an insert, regardless of number of passes
    • Generalizes CCS for <2 passes and RQ <0.9
    • 1 or more passes
    • 1 molecule, 1 read
    Purpose:
    • For Library QC
    • For subsequent analysis

    Figure label: SMRTbell™ Template
  9. Mapped Polymerase Read Length vs. Mapped Subread

    Figure labels: Mapped Polymerase Read Length; Mapped Subread Length; 4 kb; 900 bp

    Mapped Polymerase Read Length
    • Measure of ZMW sequencing productivity and read length
    • Bounded above by the speed and fidelity of the polymerase and by movie time

    Mapped Subread Length
    • Measure of scientifically applicable sequence
    • Bounded above by insert size and loading effects
  10. Basic Assembly Metrics

    • Commonly used metrics include:
      – Number of contigs
      – N50: the length of the contig reached when contigs are sorted from largest to smallest and their lengths are summed until the running total reaches 50% of the total sequence
        − Example (contig sizes 10, 4, 1, 1, 1, 1 bp): N50 = 10 bp
        − Mean contig length = 3 bp
      – Max contig size
    • Limitations of these metrics:
      – They do not capture information about assembly accuracy!
        − Large-scale misassemblies
        − Base-level errors
      – There might be more than one chromosome (plasmid, phage, and so on)
      – Contaminants may contribute to a contig (such as a cloning vector)
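The N50 definition is easier to follow as a short calculation. The sketch below is a minimal illustration, not code from any assembler; the function name `n50` is chosen here for convenience. It uses the slide's own example of six contigs of 10, 4, 1, 1, 1 and 1 bp.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are sorted from largest to smallest and their lengths are summed;
    the N50 is the length of the contig at which the running sum first
    reaches half of the total assembled sequence.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

contigs = [10, 4, 1, 1, 1, 1]
print(n50(contigs))                 # 10
print(sum(contigs) / len(contigs))  # 3.0
```

The gap between the two numbers (N50 of 10 bp against a mean of 3 bp) shows that N50 is dominated by the largest contigs, which is one reason it says nothing about accuracy.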
  11. Scaffolds vs. Contigs Defined

    • Scaffolds have Ns in them, due to links from mate pair data
    • Contigs are contiguous sequences (no Ns). PacBio® sequencing generates contigs given our continuous reads

    Scaffold (two contigs joined by a run of Ns):
    ACACCACATCACGATCGATCGTGCATNNNNNNNNNNNNNNNNNNNCAGTAGTCAGCTAGCTACA
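The relationship between a scaffold and its contigs can be shown by splitting the slide's example sequence at its run of Ns. This is a minimal sketch using only the Python standard library; `split_scaffold` is a name made up for this example.

```python
import re

def split_scaffold(scaffold):
    """Split a scaffold into its contigs at runs of N (gap) characters."""
    return [piece for piece in re.split(r"N+", scaffold.upper()) if piece]

# The example scaffold shown on the slide.
scaffold = "ACACCACATCACGATCGATCGTGCATNNNNNNNNNNNNNNNNNNNCAGTAGTCAGCTAGCTACA"

for contig in split_scaffold(scaffold):
    print(len(contig), contig)
```

Real scaffold files usually carry gap-length estimates as well, which this sketch discards.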
  12. Hierarchical Genome Assembly Process

    • Construct Pre-assembled Long Reads (PLRs) from Continuous Long Reads (CLRs)
    • Assemble PLRs into contigs
    Figure labels: Short; Continuous Long Reads; PLRs; Pre-assembled Long Read; Contig
  13. Pre-Assembly of Single-Pass Long Reads

    Module: RS_PreAssembler
    • Single-pass long reads
    • Select longest as seed reads
    • Map all to seed reads
    • Generate consensus of mapped reads
    • Pre-assembled reads
  14. Assembly of Pre-Assembled Reads into Contigs

    Tool: Celera® Assembler
    • Pre-assembled reads
    • Identify overlaps between reads
    • Generate layout of overlapping reads (unitigs)
    • Generate consensus
    • Contigs of assembly
  15. Assembly Polishing via Quiver

    Module: RS_Resequencing
    • Inputs: single-pass long reads and contigs
    • Map to de novo-assembled reference
    • Quiver: base-quality-aware consensus of uniquely mapped reads
    • High-quality consensus
  16. HGAP Example - Meiothermus ruber

    • 10 kb SMRTbell™ library
    • 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    • Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; Pre-assembly; Celera Assembler; Polish, Quiver; 250 Mb >5 kb
    • Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  17. HGAP Example - Meiothermus ruber

    • 10 kb SMRTbell™ library
    • 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    • Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; Pre-assembly; Celera Assembler; Polish, Quiver
    • Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  18. HGAP Example - Meiothermus ruber

    • 10 kb SMRTbell™ library
    • 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    • Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; Pre-assembly; Celera Assembler; Polish, Quiver
    • Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  19. HGAP Example - Meiothermus ruber

    • 10 kb SMRTbell™ library
    • 3 SMRT® Cells (C2-C2 Chemistry, PacBio® RS)
    • Figure labels: Long seed reads (>5 kb); Pre-assembled long reads; 5 contigs; 1 contig; Pre-assembly; Celera® Assembler; Minimus2; Quiver; 1 contig
    • Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
    Results:
    • Single-contig assembly
    • 99.99965% concordance with reference
    • 99.3% genes predicted
  20. Polish with Quiver for High Accuracy

    Organism: Meiothermus ruber
    • Assembly size (bases): 3,098,781
    • Differences with Sanger reference: 11
    • Concordance with Sanger reference: 99.99965%
    • Nominal QV: 54.5
    • SNPs validated as correct PacBio calls: 8
    • Remaining differences: 1(3)
    • QV: 60
    Figure labels: M. ruber Sanger reference; PacBio® reads; Targeted Sanger validation
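The QV figures on this slide can be checked against the standard Phred definition, QV = -10 · log10(error rate). The sketch below assumes the error rate is simply differences divided by assembly size; the slide does not state how its QVs were derived, so treat this as a consistency check rather than the authors' method.

```python
import math

def phred_qv(differences, assembly_size):
    """Phred-scaled quality: -10 * log10(differences / assembly_size)."""
    if differences <= 0 or assembly_size <= 0:
        raise ValueError("differences and assembly_size must be positive")
    return -10 * math.log10(differences / assembly_size)

assembly_size = 3_098_781  # M. ruber assembly size from the slide

print(round(phred_qv(11, assembly_size), 1))  # 54.5
print(round(phred_qv(3, assembly_size), 1))   # 60.1
```

Eleven differences reproduce the nominal QV of 54.5, and three remaining differences give roughly QV 60, which appears consistent with the "(3)" in the slide's "1(3)" entry.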
  21. Experimental Design – Choosing an Analysis Method

    Workflow: Experimental Design → Isolate DNA → Template Preparation → Sequencing → Analysis
    • Analysis strategy depends on project objectives, genome size and complexity, as well as the quality and type of data available

    Method | Project Objectives | Available Inputs | Genome Sizes
    Hierarchical | Draft or finished assembly | >70x PacBio® long-insert library | < 130 Mb in SMRT® Portal; larger at the command line
    Hybrid | Draft or finished assembly | 20-50x PacBio long-insert library; 20-50x shorter-read library (CCS/454®/Illumina®) | Any size
    Scaffolding | Improve existing assembly | 10x PacBio long-insert library; high-confidence contigs from existing assembly | < 200 Mb
    Gap Filling | Fill gaps in existing assembly | 5-10x PacBio long-insert library; scaffolds from mate-pair assembly | < 4 Gb
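The coverage targets above translate into sequencing effort with simple arithmetic: fold coverage is total sequenced bases divided by genome size. The sketch below is a planning aid only. The per-cell yield is a hypothetical placeholder, not a PacBio specification, and should be replaced with the yield observed for your own chemistry and library.

```python
import math

def fold_coverage(total_bases, genome_size):
    """Fold coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size

def cells_needed(genome_size, target_coverage, yield_per_cell):
    """Smallest whole number of SMRT Cells that reaches the target coverage."""
    return math.ceil(genome_size * target_coverage / yield_per_cell)

genome_size = 5_000_000       # hypothetical 5 Mb microbial genome
yield_per_cell = 250_000_000  # hypothetical yield per SMRT Cell, in bases

print(fold_coverage(400_000_000, genome_size))        # 80.0
print(cells_needed(genome_size, 70, yield_per_cell))  # 2
```

Rounding up matters here: 70x of a 5 Mb genome needs 350 Mb, which is 1.4 cells at the assumed yield, so two cells are required.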
  22. Experimental Design – Repetitive Content

    • Repetitive content is one of the biggest challenges in de novo assembly.
    • Solution: work with insert sizes that can span repeats and identify unique anchoring sequence on each side of the repeat.
  23. Experimental Design – Ploidy

    • Most assemblers were designed for haploid genomes.
      – If a diploid genome has little structural variation between the chromosomes, a haploid approach can work.
      – Structural heterozygosity appears as separate contigs.
    • Select strains to minimize heterozygosity
      – This helps facilitate assembly.
      – Use inbred lines
      – Doubled-haploid strains
    • Diploid or polyploid genomes
      – Using a haploid assembler leads to fragmented assemblies.
      – Consider FALCON (experimental code), or configure Celera Assembler to favor merging haplotypes.
  24. Towards True Diploid Assemblies

    • Truth:
    • Current assemblers:
    • New diploid/polyploid assembler:
    Figure labels: maternal allele; paternal allele
    Keep the long-range information while maintaining the relations of the alternative alleles.
    FALCON assembler: https://github.com/PacificBiosciences/falcon
  25. Draft Genome Quality

    • Gap filling of mate pair-based scaffolded assemblies
      – Sensitive to the quality of the starting assembly.
      – Misassemblies in the scaffolds can result in improper alignments and incorrectly filled or unfillable gaps.
    Figure labels: Scaffolds with gaps; PacBio long reads; Filled-in scaffold
    Gap Filling: Using PacBio® CLR to fill gaps in existing mate-pair-based scaffolds
  26. Sample Preparation

    • Key to a successful assembly is the generation of the longest reads possible.
    • Sample quality is critical to maximize potential performance
    • No amplification step during library preparation
    • Recommendations:
      – Take care during extraction to avoid gDNA damage & avoid contaminants
      – Use extraction methods or kits that produce very high molecular weight gDNA
      – If contaminants are present, purify starting DNA material prior to library prep
      – Accurately quantify and qualitatively evaluate gDNA
      – Include DNA-damage-repair step in library prep
  27. Template Preparation Recommendations

    • Recommend at least 10 kb insert libraries to maximize subread length
      – DNA Template Prep Kit 2.0 (3 kb ‒ <10 kb)
      – Procedure & Checklist ‒ Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
      – Recommend: Final 0.4x AMPure® Purification instead of 0.45x
      – Minimum input: 1 µg
    • 20 kb insert libraries combined with size selection are beneficial for increasing subread lengths
      – Protocols available on Sample Net for 20 kb libraries
      – Requires more starting sample (recommended >7.5 µg)
      – Final AMPure® Purification (0.4x or even 0.375x) can also remove shorter SMRTbell™ inserts
  28. Blue Pippin™ System for Size Selection

    Figure labels: Size-Selected Mouse Lemur 20 kb library; 20 kb AMPure® Mouse Lemur library; Input gDNA; Size-selected
  29. Size Selection Increases the Number of Long Subreads (from Lex Nederbragt Blog)

    http://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longest-reads-pacbio-and-bluepippin/
    “The plot shows that the BluePippin prep indeed had the desired effect: the reads are much longer.”
    N50 (figure labels): 4,041 bp; 8,820 bp
  30. PacBio® Bioinformatics Training Wiki on DevNet

    Large Genome Assembly Guidance
    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
  31. Where to Find Additional Information

    • Links to publications, videos of presentations, posters and other de novo assembly resources available through PacBio’s website (www.pacb.com/denovo)
    • Protocols, Technical & Application Notes available through Customer Portal
    • DevNet
      – HGAP Reference Implementation: http://www.smrtcommunity.com/Share/Code?id=a1q70000000H2qRAAS
      – Quiver: www.pacbiodevnet.com/quiver
      – Bacterial Assembly and Epigenetic Analysis Training Web Video: http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
    • Additional information on Assembly Tools
      – Celera® Assembler: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA
      – Allpaths-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
      – PBJelly: http://sourceforge.net/p/pb-jelly/wiki/Home/
  32. For Research Use Only. Not for use in diagnostic procedures.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
  33. HGAP Protocol and Parameters in SMRT® Portal 2.3

    • Minimum Seed Read Length:
      - 30X coverage of longest seed reads automatically calculated
      - Uncheck to override “auto”
    • Key parameter to set: Genome Size
      - 130 Mb limit in SMRT Portal 2.3
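One way to picture the automatic seed-length calculation is as a cutoff search: take reads from longest to shortest until they add up to 30x the genome size, and use the length of the last read taken as the minimum seed read length. The sketch below illustrates that idea under this assumption. It is not SMRT Portal's actual implementation, and `seed_length_cutoff` is a hypothetical name.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30):
    """Return the shortest read length among the longest reads that
    together reach target_coverage x genome_size, or None if the data
    set is too small to reach that coverage at all."""
    needed = target_coverage * genome_size
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None

# Toy numbers: a 500 bp "genome" needs 15,000 bases of seed reads at 30x.
reads = [10_000, 8_000, 6_000, 4_000, 2_000]
print(seed_length_cutoff(reads, genome_size=500))  # 8000
```

This also shows why Genome Size is the key parameter: an overestimate demands more seed bases and pushes the cutoff down toward shorter reads, while an underestimate does the opposite.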
  34. HGAP Protocol and Parameters in SMRT® Portal 2.3

    • For datasets with high coverage: Minimum Subread Length
      - Total coverage 3-4X seed-read coverage
      - (Usually not necessary)