De Novo Assembly

PacBio
August 01, 2013

Transcript

  1. De Novo Assembly

    FIND MEANING IN COMPLEXITY
    PACIFIC BIOSCIENCES® CONFIDENTIAL
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Overview of Release De Novo Assembly – Experimental Design Overview

    • Experimental-design guidelines by genome size
    • Sample-preparation and sequencing recommendations
    • Assembly options with PacBio data
    • PacBio® benefits for finishing genomes
    • Assembly recommendations
    • Where to find additional information
  3. Discover Biology with Extraordinary Read Lengths

    Complete microbial genomes and improve assemblies of larger organisms
    • Highest N50
    • Fewest fragments
    • Detect structural variation
    • 99.999% consensus accuracy
    Read lengths up to 20 kb, unbiased genome coverage, and high accuracy
    Figure: finished bacterial genome
    www.pacb.com/denovo
  4. Strategies for Improving and Finishing Genomes

    Columns: Genome Size | PacBio’s Benefits | Cost | How

    Genomes costed per 5 MB
    • Benefits: finish genomes with highest accuracy; cost effective & fast; resolve mobile elements and structural-variation events; full characterization of the epigenome
    • Cost: 1 library prep; 1-2 SMRT® Cells / 5 MB
    • How: large-insert library; size selection optional; >75X coverage, P4-C2; HGAP assembly

    Genomes costed per 100 MB
    • Benefits: genome assemblies with fewest contigs, largest N50s and highest accuracy; annotate more genes and improve resolution of gene order; resolve challenging genomic regions; detect structural variation
    • Option 1
      – Cost: 1 library prep; 30-40 SMRT Cells / 100 MB
      – How: large-insert library; size selection recommended; >75X coverage, P4-C2; HGAP assembly
    • Option 2
      – Cost: 1 library prep; 10-16 SMRT Cells / 100 MB; short-read library preps & sequencing runs
      – How: large-insert library; size selection recommended; >20X coverage, P4-C2; 50X coverage of short-read data; hybrid assembly

    Genomes costed per 3 GB
    • Benefits: improve connectivity & N50; sequence genes and regulatory elements in challenging genomic regions; identify structural variation, resolve palindromes and delineate tandem repeats; detect phasing information
    • Option 1
      – Cost: 1 library prep; 120 SMRT Cells / 3 GB; short-read library preps & sequencing runs
      – How: large-insert library; size selection recommended; 10X coverage, P4-C2; draft assembly; gap fill
    • Option 2
      – Cost: 1 library prep; 240 SMRT Cells / 3 GB; short-read library preps & sequencing runs
      – How: large-insert library; size selection recommended; >20X coverage, P4-C2; 50X coverage of short-read data; hybrid assembly
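
The SMRT Cell counts on this slide follow from coverage arithmetic: total bases needed (genome size times fold coverage) divided by yield per cell. The sketch below is illustrative only. The function name is hypothetical, and the 250 MB per-cell yield is an assumption chosen because it reproduces the 120 and 240 cell figures quoted for a 3 GB genome; actual yield depends on library and loading.

```python
import math


def smrt_cells_needed(genome_size_mb, target_coverage, yield_mb_per_cell):
    """Estimate how many SMRT Cells are needed to reach a fold coverage.

    yield_mb_per_cell is an assumed average yield, not a guaranteed figure.
    """
    if genome_size_mb <= 0 or target_coverage <= 0 or yield_mb_per_cell <= 0:
        raise ValueError("all inputs must be positive")
    total_mb = genome_size_mb * target_coverage
    return math.ceil(total_mb / yield_mb_per_cell)


ASSUMED_YIELD_MB = 250  # assumption, not a value stated on the slide

print(smrt_cells_needed(5, 75, ASSUMED_YIELD_MB))     # microbial, HGAP
print(smrt_cells_needed(100, 75, ASSUMED_YIELD_MB))   # 100 MB, HGAP
print(smrt_cells_needed(3000, 10, ASSUMED_YIELD_MB))  # 3 GB, gap fill
print(smrt_cells_needed(3000, 20, ASSUMED_YIELD_MB))  # 3 GB, hybrid
```

With a lower assumed yield the same function gives the upper end of the quoted ranges, which is why titrating loading matters for planning.
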
  5. Complete De Novo Assembly of Microbes with Repetitive Elements Is Difficult Without Long Read Lengths

    • Analyzed repeat complexity of 2,267 complete bacterial and archaeal genomes
    • Divided into three classes of complexity
    • With read lengths in excess of 7 kb, automated closure of Class I and II genomes is possible
    • All but the longest Class III repeats can be resolved
    Koren et al. (2013) http://arxiv.org/abs/1304.3752
    http://www.cbcb.umd.edu/software/PBcR/closure/report.log.krona.html
  6. Hybrid Solutions for De Novo Assemblies

    • Combine long SMRT® Sequencing reads with short reads for error correction
    • Requires multiple types of data, (at least) two library preps, different sequencing technologies…
  7. Finishing Genomes Using Only PacBio® Reads

    Hierarchical Genome Assembly Process (HGAP)
    • Utilizes all PacBio data from a single, long-insert library
      – Longest reads for continuity
      – All reads for high consensus accuracy
    • Now available through SMRT® Portal in SMRT Analysis v2.0.1
    Chin et al. (2013), “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data,” Nature Methods. doi:10.1038/nmeth.2474
  8. Hierarchical Genome Assembly Process (HGAP)

    1. Start with long ‘seed’ reads
    2. Align other reads
    3. Build consensus
    4. Construct accurate (>99%) pre-assembled reads
  9. HGAP Example - Meiothermus ruber

    Workflow diagram labels: 10 kb SMRTbell™ library; 3 SMRT® Cells (C2-C2 chemistry, PacBio® RS); 250 Mb >5 kb; long seed reads (>5 kb); pre-assembly; pre-assembled long reads; Celera Assembler; 5 contigs; polish, Quiver; 1 contig
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  10. HGAP Example - Meiothermus ruber

    Workflow diagram labels: 10 kb SMRTbell™ library; 3 SMRT® Cells (C2-C2 chemistry, PacBio® RS); long seed reads (>5 kb); pre-assembly; pre-assembled long reads; Celera Assembler; 5 contigs; polish, Quiver; 1 contig
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  11. HGAP Example - Meiothermus ruber

    Workflow diagram labels: 10 kb SMRTbell™ library; 3 SMRT® Cells (C2-C2 chemistry, PacBio® RS); long seed reads (>5 kb); pre-assembly; pre-assembled long reads; Celera Assembler; 5 contigs; polish, Quiver; 1 contig
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  12. HGAP Example - Meiothermus ruber

    Workflow diagram labels: 10 kb SMRTbell™ library; 3 SMRT® Cells (C2-C2 chemistry, PacBio® RS); long seed reads (>5 kb); pre-assembly; pre-assembled long reads; Celera® Assembler; 5 contigs; Minimus2; 1 contig; Quiver; 1 contig
    • Single-contig assembly
    • 99.99965% concordance with reference
    • 99.3% genes predicted
    Collaboration with A. Clum, A. Copeland (Joint Genome Institute)
  13. Polish with Quiver for High Accuracy

    Organism: Meiothermus ruber
    • Assembly size (bases): 3,098,781
    • Differences with Sanger reference: 11
    • Concordance with Sanger reference: 99.99965%
    • Nominal QV: 54.5
    • SNPs validated as correct PacBio calls: 8
    • Remaining differences: 1 (3)
    • QV: 60
    Figure labels: M. ruber Sanger reference; PacBio® reads; targeted Sanger validation
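
The QV and concordance columns are two views of the same error rate, related by the standard Phred formula QV = -10 log10(errors / bases). A minimal sketch using the figures from this slide (function names are hypothetical):

```python
import math


def qv_from_errors(num_errors, num_bases):
    """Phred-scaled quality from a count of differences over a sequence."""
    if num_errors <= 0 or num_bases <= 0:
        raise ValueError("counts must be positive")
    return -10.0 * math.log10(num_errors / num_bases)


def accuracy_from_qv(qv):
    """Fraction of correct bases implied by a Phred-scaled quality."""
    return 1.0 - 10.0 ** (-qv / 10.0)


assembly_size = 3_098_781  # M. ruber assembly size from the slide

# 11 differences against the Sanger reference (nominal QV column)
print(round(qv_from_errors(11, assembly_size), 1))
# 3 remaining differences (the parenthesized count)
print(round(qv_from_errors(3, assembly_size), 1))
# Q50 expressed as an accuracy fraction
print(accuracy_from_qv(50))
```

Eleven differences give a QV of about 54.5, matching the nominal value, and three remaining differences give a QV just above 60, which appears consistent with the final column.
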
  14. Quiver: A New Consensus Caller for PacBio® Data

    • Can achieve accuracy >Q50 (i.e. >99.999%) using only PacBio reads
    • How Quiver works
      – Takes multiple reads of a given DNA template, outputs best guess of template’s identity
      – QV-aware hidden Markov model to account for sequencing errors; a greedy algorithm to find the maximum-likelihood template
      – Similar underlying algorithm currently used for CCS generation
    • Links:
      – www.pacbiodevnet.com/quiver
      – https://github.com/PacificBiosciences/GenomicConsensus
    Diagram labels: aligned reads and reference as inputs to Quiver; outputs are consensus fasta/fastq and Variants.gff
  15. Improve and Finish Genomes with the PacBio® System

    Diagram panels:
    • De novo Assembly: complete genomes with PacBio reads alone; combine technologies for best of both worlds
    • Scaffold: establish framework for genome and resolve ambiguities
    • Span Gaps: polish genomic regions with up to 10x improvement
  16. Towards Gap-Free Reference Genomes

    English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS One.

    Genome | Gap Count (Original → PacBio) | Total Gap Size, Mb (Original → PacBio) | Contig N50, kb (Original → PacBio) | Contig N50 Improvement
    D. melanogaster (139.5 Mb) | 4651 → 311 | 3.19 → 0.54 | 64 → 723.6 | 1030.6% (11.3x)
    D. pseudoobscura (176.04 Mb) | 6026 → 1852 | 6.67 → 3.61 | 53 → 224.4 | 323.4% (4.2x)
    M. undulatus (1.23 Gb) | 49,376 → 39,204 | 154.9 → 134.6 | 134.4 → 233.27 | 73.6% (1.74x)
    C. atys (2.82 Gb) | 186,841 → 66,211 | 197.5 → 79.3 | 34.92 → 128.38 | 267.6% (3.68x)
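
The improvement column reports each N50 change two ways: as a fold change (new / old) and as a percent increase ((new - old) / old). A short sketch, with a hypothetical helper name, that recomputes both from the Contig N50 values on this slide:

```python
def n50_improvement(original_kb, improved_kb):
    """Return (fold change, percent increase) between two N50 values."""
    if original_kb <= 0:
        raise ValueError("original N50 must be positive")
    fold = improved_kb / original_kb
    percent = (fold - 1.0) * 100.0
    return fold, percent


# Contig N50 values (kb) as listed on the slide: original, after PacBio
n50_values = {
    "D. melanogaster": (64, 723.6),
    "D. pseudoobscura": (53, 224.4),
    "M. undulatus": (134.4, 233.27),
    "C. atys": (34.92, 128.38),
}

for genome, (before, after) in n50_values.items():
    fold, percent = n50_improvement(before, after)
    print(f"{genome}: {percent:.1f}% ({fold:.2f}x)")
```
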
  17. “Hybrid” HGAP Workflow

    • Modification to HGAP workflow:
      – Use contigs from previous draft assemblies as additional inputs into the HGAP preassembly step
      – Use HBAR-DTK (research implementation of HGAP available on DevNet)
    • Potential advantages over standard hybrid-assembly approach
      – Longer contigs provide better mappability compared to short reads
      – Reduced compute requirements
    Diagram label: Contigs
  18. Improving Larger Genome Assemblies Using Preliminary “Hybrid” HGAP Approach

    • 350 Mb genome
    • Draft assemblies generated from short-read data:
      – Illumina® System
        − 200 bp paired end, 100 bp, two libraries, ~80X
        − 10 kb mate pair, 36 bp, two libraries, ~15X
        − 3 kb mate pair, 36 bp, ~4X
      – 454® System
        − 15 runs, some mate pair, ~19X
    • Hybrid HGAP strategy
      – Contigs from Illumina assembly, 454 assembly, and 454/Illumina assembly (1X each)
      – 60 SMRT® Cells from size-selected 20 kb library, P4-C2 chemistry on PacBio® RS II (20X coverage)

    Metric | 454® and Illumina® Assembly | Hybrid HGAP Assembly with PacBio® reads | Improvement
    Number of contigs | 52,273 | 6,810 | 7.7X
    N50 contig size (bases) | 10,406 | 146,050 | 14.0X
    Largest contig (bases) | 113,585 | 1,666,490 | 14.7X
    Number of assembled bases (Mb) | 270 | 320 | Closer to predicted size

    Collaboration with UIUC (aphid)
  19. Better Gene Predictions from Larger Contigs

    Assembly | Contig N50 (kb) | GeneID exons / gene | Mean exon size (bp) | Mean protein size (aa)
    Initial Illumina®/454® Assemblies | 10.4 | 2.16 | 170 | 123
    Hybrid HGAP (PacBio® + Illumina®/454® contigs) | 146 | 3.24 | 217 | 284
    Pea aphid (Acyr 2.0) | 27 | 2.84 | 223 | 212

    • GeneID using an aphid model was run on assemblies to predict gene content
    • Increased continuity of hybrid HGAP assembly yields more exons / gene and higher average protein size
    • These statistics are more consistent with the published pea aphid assembly, and possibly even better
    Collaboration with UIUC
  20. Improve Assemblies with Low PacBio® Coverage

    Case Study: The Next Frontier in Assembly – Long Reads Offer Finished Genomes

    “With the RS, the contigs from our de novo assembly of the 400Mbp rice genome are several fold better than the state-of-the-art ALLPATHS-LG assembly using short reads”
    Michael C. Schatz, Ph.D., Assistant Professor of Quantitative Biology, Cold Spring Harbor Laboratory

    Rice Genome Assembly (Oryza sativa pv Nipponbare: 400 MB)
    Assembly | Data | Contig N50
    HiSeq® Fragments | 50x 2x100bp @ 180 | 3,925
    MiSeq® Fragments | 23x 459bp; 8x 2x251bp @ 450 | 6,332
    Illumina® Mates | 50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800 | 18,248
    PBeCR + Illumina® Mates | 7x 3500bp (MiSeq for correction) | 50,995
    PBeCR + Illumina® Mates | 19x (MiSeq for correction) | 155 kb

    http://schatzlab.cshl.edu/presentations/2013-04-10.UVA.De%20novo%20assembly%20of%20complex%20genomes.pdf
    http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf
  21. Experimental Design – Choosing an Analysis Method

    Workflow: Experimental Design; Isolate DNA; Template Preparation; Sequencing; Analysis
    • Analysis strategy depends on project objectives, genome size & complexity, as well as the quality and type of data available

    Method | Project Objectives | Available Inputs | Genome Sizes
    Hierarchical | Draft or finished assembly | >70x PacBio® long-insert library | <130 MB
    Hybrid | Draft or finished assembly | 20-50x of PacBio long-insert library; 20-50x shorter-read library (CCS/454®/Illumina®) | Any size
    Scaffolding | Improve existing assembly | 10X PacBio long-insert library; high-confidence contigs from existing assembly | <200 MB
    Gap Filling | Fill gaps in existing assembly | 5-10x PacBio long-insert library; scaffolds from mate-pair assembly | <4 GB
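
The table on this slide can be read as a set of eligibility rules. The sketch below encodes those rows as a lookup; it is an illustration, not a PacBio tool. The function name and arguments are hypothetical, and the coverage ranges are treated as minimums, which is an assumption.

```python
def candidate_methods(genome_mb, pacbio_cov, short_read_cov=0.0,
                      has_contigs=False, has_scaffolds=False):
    """List analysis methods whose input requirements appear to be met.

    Thresholds are taken from the slide's table; treating the lower end
    of each coverage range as a minimum is an assumption.
    """
    methods = []
    if pacbio_cov > 70 and genome_mb < 130:
        methods.append("Hierarchical")
    if pacbio_cov >= 20 and short_read_cov >= 20:
        methods.append("Hybrid")
    if pacbio_cov >= 10 and has_contigs and genome_mb < 200:
        methods.append("Scaffolding")
    if pacbio_cov >= 5 and has_scaffolds and genome_mb < 4000:
        methods.append("Gap Filling")
    return methods


# A 5 MB microbe sequenced to 100X with PacBio reads only
print(candidate_methods(5, 100))
# A 3 GB genome with 8X PacBio coverage and existing mate-pair scaffolds
print(candidate_methods(3000, 8, has_scaffolds=True))
```

Several methods can qualify at once; the choice among them still depends on project objectives, as the slide notes.
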
  22. Estimated Coverage Targets for Finishing Smaller Genomes

    Columns: Assembly Approach / Software Tool | Recommended PacBio® Coverage | Additional Data Sets | Genome Size Constraints

    Hierarchical
    • SMRT® Analysis implementation of HGAP (uses Celera® Assembler 7.0) | 75-100X PacBio CLR | None | <10 MB (SMRT Portal); <130 MB (command line)
    • Celera® Assembler via PacBiotoCA (recent compilation; see Koren et al. (2013) http://arxiv.org/abs/1304.3752) | 75-100X PacBio CLR | None | Similar to above
    Hybrid
    • Celera® Assembler 7.0 with PacBiotoCA (SMRT® Analysis) | 20-50X PacBio CLR | 50X short reads | (none listed)
    • ALLPATHS-LG | 50X PacBio 3 kb CLR | 50X Illumina® PE; 50X Illumina® jumping libraries | 20 MB
    • MIRA (with PacBiotoCA) | 20-50X PacBio CLR | 50X short reads | (none listed)
    Scaffolding
    • AHA (SMRT Analysis) | 10X PacBio CLR | High-confidence contigs | <200 MB; <20,000 contigs
  23. Estimated Coverage Targets for Improving Larger Genomes

    Columns: Assembly Approach / Software Tool | Recommended PacBio® Coverage | Additional Data Sets | Genome Size Constraint

    Hybrid
    • Celera® Assembler via PacBiotoCA (recent compilation; see Koren et al. (2013) http://arxiv.org/abs/1304.3752) | 20-50X PacBio CLR | 50X short reads | Compute power & time
    • MIRA (with PacBiotoCA) | 20-50X PacBio CLR | 50X short reads | Compute power & time
    Hierarchical
    • SMRT® Analysis implementation of HGAP (uses Celera® Assembler 7.0) | 75-100X PacBio CLR | None | <130 MB provided sufficient compute power
    Scaffolding
    • AHA (SMRT Analysis) | 10X PacBio CLR | High-confidence contigs | <200 MB; <20,000 contigs
    Gap Filling
    • PBJelly | 5-10X PacBio CLR | High-confidence scaffolds | <4 GB; compute power & time
  24. Sample Preparation

    • Sample quality is critical to maximize potential performance
    • No amplification step during library preparation
    • Recommendations:
      – Take care during extraction to avoid gDNA damage & avoid contaminants
      – Use extraction methods or kits that produce very high molecular weight gDNA
      – If contaminants are present, purify starting DNA material prior to library prep
      – Accurately quantify and qualitatively evaluate gDNA
      – Include DNA damage repair step in library prep
  25. Sample Conditions that Lead to Higher-Quality Libraries

    • Double-stranded DNA sample (dsDNA)
    • Minimized freeze-thaw cycles
    • No exposure to high temperature (>65 °C)
    • No exposure to pH extremes (<6 or >9)
    • Minimize gDNA vortexing and pipetting; pipette gently
    • OD260/280 between 1.8 and 2.0
    • OD260/230 between 2.0 and 2.2
    • No insoluble material
    • No RNA contamination
    • No exposure to UV or intercalating fluorescent dyes
    • No chelating agents, divalent metal cations, denaturants, or detergents
    • No carryover contamination (e.g. polysaccharides) from starting organism
  26. Template Preparation Recommendations

    • Recommend at least 10 kb insert libraries to maximize subread length
      – DNA Template Prep Kit 2.0 (3 kb ‒ <10 kb)
      – Procedure & Checklist ‒ Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
      – Recommend: final 0.4x AMPure® purification instead of 0.45x
      – Minimum input: 1 µg
    • 20 kb insert libraries combined with size selection are beneficial for increasing subread lengths
      – Optional protocols available on Sample Net for >10 kb libraries
      – Requires more starting sample (recommended >7.5 µg)
      – Final AMPure® purification (0.4x or even 0.375x) can also remove shorter SMRTbell™ inserts
  27. Alternatives for Size Selection

    Decision flowchart keyed on the amount of gDNA available and how much sample remains after library prep. Labels as extracted:
    • >5 µg: 20 kb library prep (Sample Net); BluePippin™ System
    • 1 – 5 µg: >10 kb library prep (Sample Net); 0.40x AMPure cutoff; 0.375x AMPure cutoff
    • 1.5 µg gDNA (into shearing): official 10 kb protocol; 10 kb library prep; 0.45x AMPure® purification; 50-90 SMRT® Cells per µg input; N50 of 2.5 kb
    • MagBead loading required: 1 µg into damage repair
    • Other yield labels: 6-40 SMRT Cells per µg input, N50 of 4 kb; 1-8 SMRT Cells per µg input, N50 of 5 kb; ~0.5-4 SMRT Cells per µg input, N50 of 6-7 kb
    • SampleNet: Size Selection + Larger-Insert Library
  28. Library Quality Tied to Sequencing Performance

    • Potential system performance is highly dependent on sample quality & library insert size
    • Potential sources of variability
      – Sample damage
      – Sample degradation
      – Contaminants
      – Shearing size & distribution
    • Important to QC sample if performing size selection
    • Use of XL DNA Sequencing Kit 1.0 (P4 pol + XL Seq Kit) with known low-quality or short-insert libraries is not recommended
      – Unlikely to see subread length gain compared to the P4 Binding Kit + C2 Sequencing Kit condition (P4-C2), but will see a drop in consensus accuracy & throughput
  29. Sequencing Recommendations

    Long Insert Libraries
    • Instrument: PacBio® RS II
    • DNA Polymerase / Binding Kit: DNA/Polymerase Binding Kit P4
    • DNA Sequencing Kit: DNA Sequencing Kit 2.0 (C2)
    • Loading: MagBead loading; follow protocol for insert size
    • Stage Start: Stage Start = yes
    • Movie Time: 1 x 120 minutes
  30. New P4 Enzyme for High Accuracy & Long Read Lengths

    Workflow: DNA Template Library Preparation; Polymerase Binding; On-Instrument DNA Sequencing
    • For most applications where consensus accuracy matters, combining the P4 Binding Kit with DNA Sequencing Kit 2.0 (P4-C2) is recommended
    • Optional XL DNA Sequencing Kit
      – Slightly longer read lengths, but at a cost of consensus accuracy
      – Better suited for scaffolding, spanning structural rearrangements, spanning long repeats, etc.
    DNA Template Prep Kits
    • DNA Template Prep Kit 2.0 (250 bp ‒ <3 kb)
    • DNA Template Prep Kit 2.0 (3 kb ‒ <10 kb)
    DNA/Polymerase Binding Kit
    • New: DNA/Polymerase Binding Kit P4
    DNA Sequencing Kit
    • DNA Sequencing Kit 2.0
    • XL DNA Sequencing Kit 1.0 (optional)
  31. Optimizing Loading

    • Overloading may increase output of MB per SMRT® Cell, but can increase multiply loaded ZMWs
    • High Quality (HQ) region filtering can “rescue” some multiply loaded ZMWs, increasing total number of reads / SMRT Cell
    • Reads that have undergone HQ filtering have
      – Shortened read lengths
      – Lower accuracy compared to single-loaded ZMWs
    • These are less useful reads for de novo assembly
    • Loading can be optimized through titration
    Figure labels: prod=0; prod=1; prod=2
  32. Typical Microbial Performance for P4-C2 Chemistry, 10 kb Library

    • Instrument: PacBio® RS II
    • Chemistry: P4-C2
    • Library: 10 kb
    • Size Selection: None
    • Collection Time: 1 x 120 min
    • Stage Start
    • MagBead Loading
    Figure: Cumulative Subread Length Distribution; 10 kb Library (x-axis: subread length >X; y-axis: # of subreads per SMRT Cell; series: B. subtilis, E. coli, R. palustris)
    Figure: Throughput per SMRT® Cell, 10 kb Library (mean mapped MB and mean mapped reads in thousands, P4-C2; series: B. subtilis, E. coli, R. palustris)
  33. Typical Microbial Performance for P4-C2 Chemistry, 20 kb Library

    • Instrument: PacBio® RS II
    • Chemistry: P4-C2
    • Library: 20 kb
    • Size Selection: BluePippin™ System
    • Collection Time: 1 x 120 min
    • Stage Start
    • MagBead Loading
    Figure: Cumulative Subread Length Distribution; 20 kb Size-Selected Library (x-axis: subread length >X; y-axis: # of subreads per SMRT Cell; series: B. subtilis, E. coli)
    Figure: Throughput per SMRT® Cell, 20 kb Size-Selected Library (mean mapped megabases and mean mapped reads in thousands, P4-C2; series: B. subtilis, E. coli)
  34. SMRT® Assembly Methods

    • Hierarchical: iterative pre-assembly and assembly of reads from a single long-insert PacBio® library (HGAP). Diagram: PacBio long reads; pre-assembled reads; finished genome
    • Gap Filling: using PacBio CLR to fill gaps in existing mate pair-based scaffolds. Diagram: scaffolds with gaps; PacBio long reads; filled-in scaffold
    • Scaffolding: using PacBio CLR to scaffold existing contigs generated from short-read data. Diagram: short-read contigs; scaffolding with PacBio long reads; improved assembly
    • Hybrid: assembly of error-corrected PacBio Continuous Long Reads (“CLR”) with a second data type with higher accuracy. Diagram: PacBio long reads; shorter reads (higher accuracy); corrected reads; finished genome
  35. SMRT® Assembly Tools

    Diagram columns: Inputs | Pre-processing | Assembly | Consensus Polishing
    Diagram legend: PacBio SMRT® Analysis; PacBio DevNet; 3rd Party or DevNet
    Tools shown: HBAR-DTK, P_PreAssembler, pacBioToCA, Celera® Assembler, Mira, ALLPATHS-LG, Quiver, AHA, PBJelly
    Inputs shown: CLR; CLR + CCS or other; CLR + ILMN PE + ILMN jumping libraries; CLR + contigs (AHA); CLR + scaffolds (PBJelly)
    • Hierarchical: iterative pre-assembly and assembly of reads from a single long-insert PacBio® library (HGAP)
    • Hybrid: assembly of error-corrected PacBio Continuous Long Reads (“CLR”) with a second data type with higher accuracy
    • Scaffolding: using PacBio CLR to scaffold existing contigs generated from short-read data
    • Gap Filling: using PacBio CLR to fill gaps in existing mate pair-based scaffolds
  36. HGAP 2.0.1 Recommendations with P4 Chemistry

    • Library insert size is critical to lower read coverage requirements
    • Total single-pass coverage for best results: >70X
    • SMRT® Portal imposes a 10 MB genome size limit (safely within our minimum compute requirements)
    • SMRT Pipe HGAP tested to 100 MB genome size (4 GB seed read limit)
      – Larger genomes can be processed with the HGAP Developer Kit (DevNet), but this has not been extensively validated
    • Additional/latest information can be found in the HGAP wiki or by contacting the Field BFX Group
      https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
    Watch Training Video: Bacterial Assembly and Epigenetic Analysis
  37. HGAP Protocol and Parameters in SMRT® Portal 2.0.1

    • Minimum Seed Read Length
      – 30X coverage of longest seed reads automatically calculated
      – Uncheck to override “auto”
    • Automatic FASTQ Trimming
      – QV > 59.5 & length > 500 bp
    • Use CCS option
      – Enable Hybrid Assembly
    • Genome Size
      – 10 MB limit in SMRT Portal 2.0.1
    • Allow Partial Alignments
      – Improves PreAssembly with P4-C2 & XL-C2
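
The automatic seed-length setting can be pictured as picking the length cutoff at which the longest reads together give 30X coverage of the genome. The following is a sketch of that idea only, not the SMRT Portal implementation; the function name and the toy read lengths are hypothetical.

```python
def seed_length_cutoff(read_lengths, genome_size, seed_coverage=30):
    """Return the shortest read length kept when the longest reads are
    accumulated until they cover the genome seed_coverage times.

    Returns None when the reads cannot reach the requested coverage.
    """
    target_bases = seed_coverage * genome_size
    accumulated = 0
    for length in sorted(read_lengths, reverse=True):
        accumulated += length
        if accumulated >= target_bases:
            return length
    return None


# Toy data: six reads against a 1,000 bp "genome"
reads = [9000, 8000, 7000, 6000, 5000, 4000]
print(seed_length_cutoff(reads, 1000))                    # 30X target
print(seed_length_cutoff(reads, 1000, seed_coverage=20))  # 20X target
```

A longer-insert library shifts more bases into long reads, so the same coverage target is met at a higher cutoff.
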
  38. The Command-Line Unlocks the Full Power of HGAP

    • HGAP_Assembly_Advanced.1.xml on BFX wiki site
    • Modularized for even more control:
      – Run P_PreAssembler with SMRT® Pipe on the command line
      – Run Celera® Assembler on the command line
      – Use Quiver option of P_GenomicConsensus to polish the assembly
      – Additional tweaks to filtering and trimming may improve assembly
    More details here:
    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP-2.0
  39. Basic Assembly Metrics

    • Commonly used metrics include:
      – Number of contigs
      – N50: sort contigs from largest to smallest and walk down the list until the running total reaches 50% of the total assembled sequence; N50 is the size of the contig at which that happens
      – Mean contig length
      – Max contig size
    • Example with contigs of 10, 4, 1, 1, 1 and 1 bp:
      − N50 = 10 bp
      − Mean contig length = 3 bp
    • Limitation of these metrics:
      – They do not capture information about assembly accuracy!
        − Large-scale mis-assemblies
        − Base-level errors
      – There might be more than one chromosome (plasmid, phage, etc.)
      – Contaminants may contribute to a contig (such as a cloning vector)
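
The N50 walk described on this slide can be sketched in a few lines. This is a minimal illustration using the slide's own example contigs (10, 4, 1, 1, 1, 1); the function name is hypothetical.

```python
def n50(contig_lengths):
    """Size of the contig at which the running total, walking from the
    largest contig down, first reaches half of the total sequence."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


contigs = [10, 4, 1, 1, 1, 1]
print(n50(contigs))                 # N50 in bp
print(sum(contigs) / len(contigs))  # mean contig length in bp
```

The example shows why N50 and mean length diverge: one 10 bp contig holds more than half of the 18 bp total, so N50 is 10 even though the mean is only 3.
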
  40. Things to Do After Your Assembly

    • Quality checks of assembly results
      – Ensure minimum coverage and subread length thresholds are met
      – Check for coverage uniformity
        − Spikes/valleys are evidence of mis-assemblies
        − Low-coverage, short contigs may be discarded
      – Look for evidence of plasmids in degenerates file
      – Ensure at least 90% of reads are mapping to the assembly
      – Evaluate circularity of chromosomes and plasmids (Gepard)
    • Additional ways to improve final assembly
      – Parameter optimization of HGAP
      – Manual trimming of ends may be needed for circular genomes
      – Minimus2 and AHA to join contigs
    • Post-assembly analysis
      – Methylation detection and motif analysis
      – Phage insertions (PHAST, http://phast.wishartlab.com)
    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Finishing-Bacterial-Genomes
  41. Detecting Misassemblies by Aligning Reads to Assembly

    Figures: coverage plot in SMRT® Portal; SMRT® View
    Re-mapping the reads to the assembly may reveal discontinuities
    • Sharp dips in coverage (lacks read support)
    • Sharp spikes in coverage (collapsed repeat elements, phage insertions)
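
One simple way to screen a coverage track for the dips and spikes described here is to compare each window against the median. The sketch below assumes per-window mean coverage has already been computed from the re-mapped reads; the function name and the 0.5x / 2x thresholds are illustrative assumptions, not values from the slides.

```python
from statistics import median


def flag_coverage_outliers(coverage, low_factor=0.5, high_factor=2.0):
    """Return (dips, spikes) as lists of window indices whose coverage
    falls below low_factor or above high_factor times the median."""
    if not coverage:
        return [], []
    typical = median(coverage)
    dips = [i for i, c in enumerate(coverage) if c < low_factor * typical]
    spikes = [i for i, c in enumerate(coverage) if c > high_factor * typical]
    return dips, spikes


# Toy per-window coverage along one contig
windows = [100, 98, 102, 20, 101, 250, 99]
dips, spikes = flag_coverage_outliers(windows)
print("possible unsupported joins at windows:", dips)
print("possible collapsed repeats at windows:", spikes)
```

Flagged windows are only candidates; they still need inspection in a viewer before concluding that the assembly is wrong.
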
  42. De Novo Experimental Design Takeaways – Microbial Assembly

    Workflow stages: Sample Prep; Run Design; Sequencing on the PacBio® RS and primary analysis; Secondary Analysis; Tertiary Analysis
    • Good-quality sample preparation is key!
    • Limit DNA damage during sample extraction
    • 10 kb library protocol for long-read library
    • Optional size selection and large-insert protocols available through SampleNet
    • Error correction (2 kb libraries) no longer needed for HGAP
    • P4 enzyme
    • MagBead loading
    • Stage Start
    • Movie time: 1 x 120 min
    • Do not overload
    • Target 100X coverage
    • SMRT® Analysis 2.0.1 supports Hierarchical Assembly using RS_Preassembler and Celera® Assembler
    • Quiver for assembly polishing to increase consensus accuracy
    • Post-assembly QC
    • See DevNet for additional recommendations
    • Don’t forget base modification
  43. De Novo Experimental Design – Improving Larger Genomes

    Workflow stages: Sample Prep; Run Design; Sequencing on the PacBio® RS and primary analysis; Secondary Analysis; Tertiary Analysis
    • Good-quality sample preparation is key!
    • Limit DNA damage during sample extraction
    • If sample is available, follow the optional 20 kb size-selection protocol available through SampleNet
    • If sample is limited, try the >10 kb stricter AMPure protocol available through SampleNet
    • Chemistry: P4-C2
    • MagBead loading; follow loading recommendation
    • Stage Start
    • Movie time: 1 x 120 min
    • Do not overload
    • Loading titrations useful
    • Third-party/DevNet options
    • Hybrid assembly
      – Ideally 25-50X of CLR; can get improvements with 15-20X
      – Short reads for error correction (50X)
      – Filter to longest 25X PBcR prior to assembly
    • Gap filling
      – PBJelly for gap filling
      – 5-10X coverage
    • Scaffolding
      – AHA supported up to 200 MB
      – 5-10X coverage
  44. Where to Find Additional Information

    • Links to publications, videos of presentations, posters and other de novo assembly resources available through PacBio’s website (www.pacb.com/denovo)
    • Protocols, Technical & Application Notes available through Customer Portal
    • DevNet
      – HGAP Reference Implementation: http://www.smrtcommunity.com/Share/Code?id=a1q70000000H2qRAAS
      – Quiver: www.pacbiodevnet.com/quiver
      – Bacterial Assembly and Epigenetic Analysis Training Web Video: http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
    • Additional information on assembly tools
      – Celera® Assembler: http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA
      – Allpaths-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
      – PBJelly: http://sourceforge.net/p/pb-jelly/wiki/Home/
  45. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.