Hierarchical Genome Assembly Process (HGAP)

PacBio
April 02, 2013

Transcript

  1. Hierarchical Genome Assembly Process

    FIND MEANING IN COMPLEXITY
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Objectives

    • Introduction to HGAP
    • Using HGAP in SMRT Analysis 1.4
    • Experimental Design and FAQ
  3. New De Novo Assembly Algorithms

    • Powerful assembly algorithms combining long reads with short reads for error correction
    • Can we use just the long-insert-library reads for de novo assembly?
    • Figure label: best for assembly
  4. Hierarchical Genome Assembly Process

    • Construct Pre-assembled Long Reads (PLRs) from CLRs
    • Assemble PLRs into contigs
    • Figure labels: Short Continuous Long Reads, PLRs, Pre-assembled Long Read, Contig
  5. Hierarchical Assembly: Assembles Genomes from Single PacBio® Long Insert Library Prep – No CCS or 2nd Gen

    • How HGAP (“hierarchical genome assembly process”) works:
      – Take reads from a long insert library (e.g., 4-8 SMRT® Cells)
      – Pre-assemble each of the really long reads (say, > 5 kb)
        − Align all short and long reads against it
        − Trim and filter as needed
        − Take the consensus of the result
      – Perform an assembly using the pre-assembled reads
      – Polish the assembled contigs using Quiver

    Organism      SMRT® Cells  Chromosomes  Contigs  Genome Size  N50     Estimated Accuracy
    E. coli K12   8            1            2        4.6 Mb       4.6 Mb  99.9995%
    M. ruber      4            1            3        3.1 Mb       3.1 Mb  99.9996%
    P. heparinus  7            1            1        5.2 Mb       5.2 Mb  99.9995%
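The N50 column in the table above is the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the contig lengths in the example are hypothetical and are not taken from the assemblies in this deck.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical assembly: one chromosome-sized contig plus one small contig.
example_contigs = [4_600_000, 40_000]
print(n50(example_contigs))
```

When one contig carries most of the genome, the N50 equals the length of that contig, which is why a two- or three-contig assembly can report an N50 close to the genome size.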
  6. Pre-Assembly of Single-Pass Long Reads

    • Tool: RS_PreAssembler
    • Workflow: single-pass long reads → select longest as seed reads → map all to seed reads → generate consensus of mapped reads → pre-assembled reads
  7. Assembly of Pre-Assembled Reads into Contigs

    • Tool: Celera® Assembler
    • Workflow: pre-assembled reads → identify overlaps between reads → generate layout of overlapping reads (unitigs) → generate consensus → contigs of assembly
  8. Assembly Polishing via Quiver

    • Inputs: single-pass long reads and contigs
    • Map to de novo-assembled reference (RS_Resequencing)
    • Base-quality-aware consensus of uniquely mapped reads (Quiver)
    • Result: high-quality consensus
  9. HGAP Example - Meiothermus ruber (JGI)

    • 10 kb SMRTbell™ library
    • 4 SMRT® Cells: 330 Mb (>100X)
    • Long seed reads (>5 kb): 92 Mb (30X)
    • Pre-assembled long reads: 61 Mb (20X)
    • Stages: pre-assembly → Celera® Assembler → clean-up (Minimus2)
    • Result: 5 contigs → 1 contig
    • Figure label: CLR
  10. HGAP Example - Meiothermus ruber (JGI)

    • 10 kb SMRTbell™ library
    • 4 SMRT® Cells: 330 Mb (>100X)
    • Long seed reads (>5 kb): 92 Mb (30X)
    • Pre-assembled long reads: 61 Mb (20X)
    • Stages: pre-assembly → Celera® Assembler → clean-up (Minimus2)
    • Result: 5 contigs → 1 contig
    • Figure label: PLR
  11. HGAP Example - Meiothermus ruber (JGI)

    • 10 kb SMRTbell™ library
    • 4 SMRT® Cells: 330 Mb (>100X)
    • Long seed reads (>5 kb): 92 Mb (30X)
    • Pre-assembled long reads: 61 Mb (20X)
    • Stages: pre-assembly → Celera® Assembler → clean-up (Minimus2)
    • Result: 5 contigs → 1 contig
    • Figure label: single contig spans the entire reference
  12. Initial Assessment of M. ruber Assembly from 4 SMRT® Cells

    • Used reference (Sanger) from the Joint Genome Institute to evaluate assembly concordance
    • Assessment of initial differences:
      – QV ~43.4 (99.9954%)
      – 141 differences between the assembly and the reference
    • Figure labels: actual, variant caller
  13. Initial Assessment of M. ruber Assembly from 4 SMRT® Cells

    • Used reference (Sanger) from the Joint Genome Institute to evaluate assembly concordance
    • Assessment of initial differences:
      – QV ~43.4 (99.9954%)
      – 141 differences between the assembly and the reference
    • Final accuracy post-Quiver:
      – QV ~54.5 (99.99964%)
      – 11 differences between the assembly and the reference
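The QV values on these slides are consistent with the Phred convention, QV = -10 · log10(error rate). The sketch below recomputes them from the difference counts. It assumes a genome length of 3.1 Mb for M. ruber (the rounded size listed earlier in the deck), and the function names are illustrative only.

```python
import math

def qv_from_differences(differences: int, genome_length: int) -> float:
    """Phred-scaled quality: QV = -10 * log10(differences / genome_length)."""
    if differences <= 0:
        raise ValueError("need at least one difference to compute a finite QV")
    return -10.0 * math.log10(differences / genome_length)

def accuracy_from_qv(qv: float) -> float:
    """Convert a Phred QV back to percent accuracy."""
    return 100.0 * (1.0 - 10.0 ** (-qv / 10.0))

GENOME_LENGTH = 3_100_000  # assumption: rounded M. ruber genome size (3.1 Mb)

initial_qv = qv_from_differences(141, GENOME_LENGTH)  # before Quiver
final_qv = qv_from_differences(11, GENOME_LENGTH)     # after Quiver
print(f"initial: QV {initial_qv:.1f}, accuracy {accuracy_from_qv(initial_qv):.4f}%")
print(f"final:   QV {final_qv:.1f}, accuracy {accuracy_from_qv(final_qv):.5f}%")
```

Under that assumed genome length, 141 and 11 differences give QVs that round to the 43.4 and 54.5 reported on the slides.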
  14. Three ways to use HGAP

    • Option 1: P_PreAssembler_Allora protocol in SMRT® Analysis 1.4 (via SMRT Portal)
      – Skill set required: General user
      – Genome size: BACs or viral assembly
      – Assembly performance: Fine for small genomes; slow and not recommended for larger genomes
      – Installation difficulty: Part of SMRT Analysis 1.4
      – Target user: General users who want to try the HGAP workflow and test on a small genome
    • Option 2: P_PreAssembler protocol in SMRT® Analysis 1.4 (via Command Line)
      – Skill set required: Command line skill
      – Genome size: Microbial size
      – Assembly performance: Good results, but may require parameter tweaking
      – Installation difficulty: Part of SMRT Analysis 1.4
      – Target user: Bioinformatics users new to HGAP
    • Option 3: Reference Implementation of HGAP available on DevNet
      – Skill set required: Savvy bioinformatician
      – Genome size: Up to 100 Mb tested
      – Assembly performance: Good results
      – Installation difficulty: High (requires compiling code, cluster configuration, etc.)
      – Target user: Customers already introduced to DevNet HGAP
    • Figure label: Our recommendation
  15. Using HGAP (Hierarchical Genome Assembly Process) in SMRT® Analysis v1.4

    • HGAP consists of 3 steps:
      – Pre-Assembly: Generate very long, high-accuracy reads
        − SMRT® Portal: RS_PreAssembler_ALLORA
        − Command line: RS_PreAssembler or P_PreAssembler
      – Assembly: Join reads into a near-perfect assembly
        − SMRT® Portal: RS_PreAssembler_ALLORA
        − Command line: Celera® Assembler
      – Assembly Polishing: Realign reads against the assembly for the highest final accuracy
        − SMRT® Portal: RS_Resequencing
        − Command line: RS_Resequencing or P_GenomicConsensus
  16. Pre-Assembly (Step 1)

    • Pre-Assembly: Generate very long, high-accuracy reads
      – SMRT® Portal: RS_PreAssembler_ALLORA
      – Command line: RS_PreAssembler or P_PreAssembler
  17. SMRT® Portal 1.4 Workflow for HGAP - Filtering

    • Change Min RQ to 0.8
    • Run “Filter Only” on your SMRT Cells to identify Seed Read Length from the subread length distribution
  18. SMRT® Portal 1.4 Workflow for HGAP – Seed Read Length

    • Target Genome Size: ~5 Mb
    • 20X coverage: ~100 Mb
    • Select a minimum seed read length to obtain >20X coverage of your genome
    • On this dataset, 5000 bp yields >20X coverage
    • Total coverage should exceed 60X
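The seed-length rule above can be worked through numerically: a ~5 Mb genome at 20X needs about 100 Mb of seed reads, so the cutoff is the read length at which the longest reads, summed from longest to shortest, first reach that total. The sketch below is one way to compute it from a list of subread lengths. The function names and the toy read lengths are hypothetical, and this is not the calculation SMRT Portal itself performs.

```python
def coverage(read_lengths, genome_size):
    """Fold coverage contributed by the given reads."""
    return sum(read_lengths) / genome_size

def choose_seed_length(subread_lengths, genome_size, target_coverage=20):
    """Return the largest length cutoff whose reads (length >= cutoff)
    together reach target_coverage * genome_size bases, or None if the
    dataset never reaches that many bases."""
    needed_bases = target_coverage * genome_size
    total = 0
    for length in sorted(subread_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    return None

# Toy dataset (hypothetical): a 1 kb "genome" with a few long and many short reads.
toy_genome_size = 1_000
toy_lengths = [9_000, 8_000, 6_000, 5_000] + [2_000] * 20

seed_cutoff = choose_seed_length(toy_lengths, toy_genome_size, target_coverage=20)
seed_reads = [n for n in toy_lengths if n >= seed_cutoff]
print("seed length cutoff:", seed_cutoff)
print("seed coverage:", coverage(seed_reads, toy_genome_size))
print("total coverage:", coverage(toy_lengths, toy_genome_size))
```

The total-coverage line is a quick check against the 60X guideline; if `choose_seed_length` returns `None`, the dataset cannot reach 20X seed coverage at any cutoff.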
  19. SMRT® Portal 1.4 Workflow for HGAP - PreAssembler

    • Set the Seed Read Length in PreAssembler based on coverage (goal: >20X of seed reads)
    • Suggest changing -maxLCPLength from 16 to 14 in BLASR options for XL-C2 data
  20. Assembly (Step 2)

    • Pre-Assembly: Generate very long, high-accuracy reads
      – SMRT® Portal: RS_PreAssembler_ALLORA
      – Command line: RS_PreAssembler or P_PreAssembler
    • Assembly: Join reads into a near-perfect assembly
      – SMRT® Portal: RS_PreAssembler_ALLORA
      – Command line: Celera® Assembler
  21. Exporting Data to Use for Celera® Assembler

    • Download pre-assembled reads in FASTQ format to use on a local computer
    • Note the Job ID for further processing on the local Linux system
    • Optional: QV and length filtering of the corrected.fastq on the Linux command line
    • Assemble the corrected.fastq file via Celera® Assembler on the command line
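The optional QV and length filtering step could be done with a short script such as the one below. It is only a sketch under stated assumptions: it expects four-line FASTQ records with Phred+33 quality encoding, and the default thresholds (500 bases, mean QV 20) are hypothetical placeholders, not PacBio recommendations.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality

def mean_qv(quality, offset=33):
    """Arithmetic mean of per-base Phred scores (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)

def filter_fastq(in_handle, out_handle, min_length=500, min_mean_qv=20.0):
    """Write records passing both thresholds; return (kept, total)."""
    kept = total = 0
    for header, sequence, quality in read_fastq(in_handle):
        total += 1
        if len(sequence) >= min_length and mean_qv(quality) >= min_mean_qv:
            out_handle.write(f"{header}\n{sequence}\n+\n{quality}\n")
            kept += 1
    return kept, total

# In-memory demonstration with three tiny, made-up records.
demo_in = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"   # long enough, high quality: kept
    "@r2\nAC\n+\nII\n"       # too short: dropped
    "@r3\nACGT\n+\n!!!!\n"   # low quality: dropped
)
demo_out = io.StringIO()
kept, total = filter_fastq(demo_in, demo_out, min_length=3, min_mean_qv=20.0)
print(f"kept {kept} of {total} reads")
```

For a real file, the same function can be called with opened file handles, for example reading corrected.fastq and writing to a new, separately named output file.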
  22. Assembly Polishing (Step 3)

    • Pre-Assembly: Generate very long, high-accuracy reads
      – SMRT® Portal: RS_PreAssembler_ALLORA
      – Command line: RS_PreAssembler or P_PreAssembler
    • Assembly: Join reads into a near-perfect assembly
      – SMRT® Portal: RS_PreAssembler_ALLORA
      – Command line: Celera® Assembler
    • Assembly Polishing: Realign reads against the assembly for the highest final accuracy
      – SMRT® Portal: RS_Resequencing
      – Command line: RS_Resequencing or P_GenomicConsensus
  23. Import Assembly into SMRT® Portal as Reference

    Points to consider:
    • Scientist-level SMRT Portal users can now delete their own single-use assembly references after finishing Quiver
    • Multiple FASTA files can be combined into one reference via <SHIFT><Select>
    • Depositing a FASTA file in the reference_dropbox requires write access to the directory
  24. RS_Resequencing in SMRT® Portal 1.4 - Quiver

    • Choose Reference on Design Job page
    • Random placement of reads into repeats – more uniform coverage
  25. RS_Resequencing in SMRT® Portal 1.4 - Quiver

    • Basecaller QV-aware consensus algorithm (Quiver) is the default in SMRT Analysis 1.4
    • Improved mapping selectivity to further increase accuracy of the de novo consensus
    • More accurate variant calls
  26. Highly Accurate Assembly Consensus and Variant Calls

    • Download consensus.fasta for functional annotation
    • Evaluate aligned reads for continuity of assembly via BAM and SAM files
    • Observe uniformity of coverage to evaluate assembly accuracy and identify possible misassemblies where coverage drops
    • Re-import polished assembly as reference to start base modification analysis
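One simple way to look for the coverage drops mentioned above is to flag windows whose mean depth falls well below the contig-wide median. The sketch below assumes per-base depth has already been extracted from the aligned reads into a list of integers. The function name, window size, and 50% threshold are hypothetical choices and are not part of SMRT Analysis.

```python
from statistics import median

def low_coverage_windows(depths, window=1000, min_fraction=0.5):
    """Return (start, end, mean_depth) for each non-overlapping window whose
    mean depth is below min_fraction * median depth of the whole contig."""
    if not depths:
        return []
    cutoff = min_fraction * median(depths)
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if mean_depth < cutoff:
            flagged.append((start, start + len(chunk), mean_depth))
    return flagged

# Toy example: uniform 100X depth with a dip to 10X between positions 20 and 30.
toy_depths = [100] * 20 + [10] * 10 + [100] * 20
dips = low_coverage_windows(toy_depths, window=10, min_fraction=0.5)
print(dips)
```

Flagged windows are only candidates for inspection; a dip can also come from sequence context or mapping ambiguity rather than a misassembly.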
  27. The Command Line Unlocks the Full Power of HGAP

    • Run P_PreAssembler with SMRT® Pipe on the command line
    • Run Celera® Assembler on the command line
    • Use the Quiver option of P_GenomicConsensus to polish the assembly
    • For advanced users:
      – Additional tweaks to filtering and trimming may improve assembly
      – A beta release of HGAP on DevNet may generate even better assemblies (separate installation required)
    • More details here: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
  28. Microbial Experimental Design Recommendations Using HGAP

    • Sample Prep
      – Limit DNA damage during sample extraction
      – 10 kb library protocol for long read library
      – Optional >10 kb protocol available through SampleNet
      – Good quality sample preparation is key!
    • Run Design
      – XL-C2 chemistry
      – MagBead loading
      – Stage start
      – Movie Time: 1 x 120 min
      – Alternative movie times can be explored to optimize throughput
    • Sequencing on the PacBio® RS and primary analysis
      – Do not overload; loading titrations may be useful
    • Secondary Analysis
      – SMRT® Analysis 1.4: RS_Preassembler + Celera® Assembler
      – Cov: 100X
      – Use XL parameters, custom trimming as necessary
      – Recommend Quiver for assembly polishing to increase consensus accuracy
    • Tertiary Analysis
      – Base modification caveats
  29. FAQ

    Q. How large a genome does HGAP support?
    PacBio has tested HGAP primarily on microbial-sized genomes. In principle, HGAP will work on genomes of 100 Mb or larger, but this has not yet been tested, and manual fine-tuning will likely be necessary to achieve the best assembly.

    Q. What if customers have been using the DevNet implementation of HGAP?
    For advanced users who are comfortable installing beta software, the DevNet implementation (called the “reference implementation” or “beta”) is also available.
    • Advantages: potentially more scalable for larger genomes (>500 Mb).
    • Disadvantages: separate installation, command-line only, and may not be better in all cases.

    Q. What are the future plans for HGAP in SMRT® Analysis?
    In the upcoming release of SMRT Analysis 2.0, HGAP will be an integrated protocol in SMRT Portal, combining the Pre-Assembler with Celera® Assembler.

    Q. What about Celera® Assembler? Will CA implement PacBio long-read-only assembly in the future?
    A pre-release version of pacBioToCA can perform the pre-assembly step. More information can be found at http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA. Celera Assembler can perform the assembly step. It is still necessary to run resequencing with Quiver to polish the final assembly. We do not know when the Celera Assembler update will be officially released. We will evaluate including the update in a future version of SMRT Analysis.

    Q. Where can I get more information about HGAP?
    See pacbiodevnet.com for more details; in particular: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP