Hierarchical Genome Assembly Process (HGAP)

Objectives
• Introduction to HGAP
• Using HGAP in SMRT Analysis 1.4
• Experimental Design and FAQ

New De Novo Assembly Algorithms
• Powerful assembly algorithms combining long reads with short reads for error correction
• Can we use just the long-insert-library reads for de novo assembly?
[Figure label: best for assembly]

Hierarchical Genome Assembly Process
• Construct Pre-assembled Long Reads (PLRs) from Continuous Long Reads (CLRs)
• Assemble PLRs into contigs
[Figure labels: Short; Continuous Long Reads; PLRs; Pre-assembled Long Read; Contig]

Hierarchical Assembly: Assembles Genomes from Single PacBio® Long Insert Library Prep – No CCS or 2nd Gen
• How HGAP (“hierarchical genome assembly process”) works:
– Take reads from a long insert library (e.g., 4-8 SMRT® Cells)
– Pre-assemble each of the really long reads (say, > 5 kb)
  − Align all short and long reads against it
  − Trim and filter as needed
  − Take the consensus of the result
– Perform an assembly using the pre-assembled reads
– Polish the assembled contigs using Quiver
• Results by organism:
– E. coli K12: 8 SMRT® Cells, 1 chromosome, 2 contigs, genome size 4.6 Mb, N50 4.6 Mb, estimated accuracy 99.9995%
– M. ruber: 4 SMRT® Cells, 1 chromosome, 3 contigs, genome size 3.1 Mb, N50 3.1 Mb, estimated accuracy 99.9996%
– P. heparinus: 7 SMRT® Cells, 1 chromosome, 1 contig, genome size 5.2 Mb, N50 5.2 Mb, estimated accuracy 99.9995%
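The N50 values reported above can be read as follows: sort the contigs from longest to shortest and walk down the list until at least half of the assembled bases are covered; the length of the contig reached at that point is the N50. A minimal Python sketch of that definition is below. The function name and the example contig lengths are illustrative assumptions; the slides do not list individual contig lengths.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bases).

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical two-contig assembly: one chromosome-scale contig plus a small one.
example_lengths = [4_600_000, 20_000]
example_n50 = n50(example_lengths)
```

When one contig holds most of the genome, as in the hypothetical example, the N50 equals the length of that contig, which is why an N50 close to the genome size indicates a nearly complete assembly.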

Pre-Assembly of Single-Pass Long Reads (RS_PreAssembler)
• Select the longest single-pass long reads as seed reads
• Map all reads to the seed reads
• Generate consensus of mapped reads
• Result: pre-assembled reads

Assembly of Pre-Assembled Reads into Contigs (Celera® Assembler)
• Identify overlaps between reads
• Generate layout of overlapping reads
• Generate consensus
[Figure labels: Pre-assembled reads; Unitigs; Contigs of assembly]

Assembly Polishing via Quiver (RS_Resequencing)
• Map single-pass long reads to the de novo-assembled reference (contigs)
• Quiver: base-quality-aware consensus of uniquely mapped reads
• Result: high-quality consensus

HGAP Example - Meiothermus ruber (JGI)
• 10 kb SMRTbell™ library
• 4 SMRT® Cells: 330 Mb (>100X)
• Long seed reads (>5 kb): 92 Mb (30X)
• Pre-assembly: pre-assembled long reads, 61 Mb (20X)
• Celera® Assembler: 5 contigs
• Clean-up (Minimus2): 1 contig
• The single contig spans the entire reference
[Figure panels: CLR; PLR]

Initial Assessment of M. ruber Assembly from 4 SMRT® Cells
• Used reference (Sanger) from the Joint Genome Institute to evaluate assembly concordance
• Assessment of initial differences:
– QV ~43.4 (99.9954%)
– 141 differences between the assembly and the reference
• Final accuracy post-Quiver:
– QV ~54.5 (99.99964%)
– 11 differences between the assembly and the reference
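The QV figures and the difference counts above are linked by the Phred relation QV = -10 · log10(error rate). The Python sketch below shows the arithmetic. It assumes a genome size of 3.1 Mb for M. ruber, taken from the results listed earlier in this deck, and the function names are illustrative rather than part of any PacBio tool.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to fractional accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def differences_to_qv(n_differences, genome_size):
    """Phred-scaled QV implied by a number of differences over a genome."""
    if n_differences <= 0 or genome_size <= 0:
        raise ValueError("both arguments must be positive")
    return -10.0 * math.log10(n_differences / genome_size)


GENOME_SIZE = 3_100_000  # assumed M. ruber genome size in bases (3.1 Mb)

initial_qv = differences_to_qv(141, GENOME_SIZE)   # before Quiver
polished_qv = differences_to_qv(11, GENOME_SIZE)   # after Quiver
print(f"initial:  QV {initial_qv:.1f}, accuracy {qv_to_accuracy(initial_qv):.5%}")
print(f"polished: QV {polished_qv:.1f}, accuracy {qv_to_accuracy(polished_qv):.5%}")
```

Under that genome-size assumption, 141 and 11 differences reproduce the reported QV values of about 43.4 and 54.5.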

Three ways to use HGAP
• P_PreAssembler_Allora protocol in SMRT® Analysis 1.4 (via SMRT Portal)
– Skill set required: General user
– Genome size: BACs or viral assembly
– Assembly performance: Fine for small genomes; not recommended and slow for larger genomes
– Installation difficulty: Part of SMRT Analysis 1.4
– Target user: General users who want to try the HGAP workflow and test on a small genome
• P_PreAssembler protocol in SMRT® Analysis 1.4 (via Command Line)
– Skill set required: Command line skill
– Genome size: Microbial size
– Assembly performance: Good results, but may require parameter tweaking
– Installation difficulty: Part of SMRT Analysis 1.4
– Target user: Bioinformatics users new to HGAP
• Reference Implementation of HGAP available on DevNet
– Skill set required: Savvy bioinformatician
– Genome size: Up to 100 Mb tested
– Assembly performance: Good results
– Installation difficulty: High (requires compiling code, cluster configuration, etc.)
– Target user: Customers already introduced to DevNet HGAP
[Figure label: Our recommendation (the option it marks is not preserved in the transcript)]

Using HGAP (Hierarchical Genome Assembly Process) in SMRT® Analysis v1.4
• HGAP consists of 3 steps:
– Pre-Assembly: Generate very long, high-accuracy reads
  − Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: RS_PreAssembler or P_PreAssembler
– Assembly: Join reads into a near-perfect assembly
  − Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: Celera® Assembler
– Assembly Polishing: Realign reads against the assembly for the highest final accuracy
  − Tools: SMRT® Portal: RS_Resequencing; Command-line: RS_Resequencing or P_GenomicConsensus

Pre-Assembly (Step 1)
• Generate very long, high-accuracy reads
• Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: RS_PreAssembler or P_PreAssembler

SMRT® Portal 1.4 Workflow for HGAP - Filtering
• Change Min RQ to 0.8
• Run “Filter Only” on your SMRT Cells to identify the Seed Read Length from the subread length distribution

SMRT® Portal 1.4 Workflow for HGAP – Seed Read Length
• Target Genome Size: ~5 Mb
• 20X coverage: ~100 Mb
• Select a minimum seed read length to obtain >20X coverage of your genome
• On this dataset, 5000 bp yields >20X coverage
• Total coverage should exceed 60X
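The seed read length selection above amounts to walking down the subread length distribution from the longest read until the accumulated bases reach the target coverage. A minimal Python sketch of that calculation follows. It assumes the subread lengths have already been exported as a plain list of integers; the function name and the toy numbers are illustrative and are not part of SMRT Portal.

```python
def min_seed_length(read_lengths, genome_size, target_coverage=20.0):
    """Largest length cutoff whose reads reach the target coverage.

    Returns the cutoff L such that reads of length >= L sum to at least
    target_coverage * genome_size bases, or None if the whole dataset
    falls short of that target.
    """
    target_bases = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return None


# Toy illustration (hypothetical numbers): a 1 kb "genome" at 20X needs 20 kb.
toy_lengths = [9000, 7000, 5000, 3000, 1000]
toy_cutoff = min_seed_length(toy_lengths, genome_size=1000, target_coverage=20.0)
```

For a ~5 Mb genome the target is 20 × 5 Mb = 100 Mb of seed bases, matching the figure on the slide. A return value of None signals that the dataset cannot reach the seed coverage goal at any cutoff.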

SMRT® Portal 1.4 Workflow for HGAP - PreAssembler
• Set the Seed Read Length in PreAssembler based on coverage (goal: >20X coverage from seed reads)
• Suggest changing -maxLCPLength from 16 to 14 in BLASR options for XL-C2 data

Assembly (Step 2)
• Pre-Assembly: Generate very long, high-accuracy reads
– Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: RS_PreAssembler or P_PreAssembler
• Assembly: Join reads into a near-perfect assembly
– Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: Celera® Assembler

RS_PreAssembler_ALLORA: Option to Combine the Pre-Assembly and Assembly Steps
[Screenshot: protocol settings; the only value preserved in the transcript is 500000]

Exporting Data to Use for Celera® Assembler
• Download pre-assembled reads in FASTQ format to use on a local computer
• Note the Job ID for further processing on the local Linux system
• Optional: QV and length filtering of the corrected.fastq on the Linux command line
• Assemble the corrected.fastq file via Celera® Assembler on the command line
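The optional QV and length filtering step can be done with any FASTQ-aware tool. As one illustration, the Python sketch below keeps only reads that meet a minimum length and a minimum mean quality. It assumes four-line FASTQ records with Phred+33 quality encoding; the thresholds, the output file name, and the function names are hypothetical choices rather than PacBio recommendations, and the arithmetic mean of Phred scores is only a rough quality summary.

```python
def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_qv(qual, offset=33):
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_fastq(in_handle, out_handle, min_length=500, min_mean_qv=20.0):
    """Write reads passing both thresholds; return the number kept."""
    kept = 0
    for header, seq, qual in read_fastq(in_handle):
        if len(seq) >= min_length and mean_qv(qual) >= min_mean_qv:
            out_handle.write(f"{header}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept
```

A typical call would open corrected.fastq for reading and a new file (for example, a hypothetical corrected.filtered.fastq) for writing, then pass both handles to filter_fastq before running the assembler on the filtered file.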

Assembly Polishing (Step 3)
• Pre-Assembly: Generate very long, high-accuracy reads
– Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: RS_PreAssembler or P_PreAssembler
• Assembly: Join reads into a near-perfect assembly
– Tools: SMRT® Portal: RS_PreAssembler_ALLORA; Command-line: Celera® Assembler
• Assembly Polishing: Realign reads against the assembly for the highest final accuracy
– Tools: SMRT® Portal: RS_Resequencing; Command-line: RS_Resequencing or P_GenomicConsensus

Import Assembly into SMRT® Portal as Reference
Points to consider:
• Scientist-level SMRT Portal users can now delete their own single-use assembly references after finishing Quiver
• Multiple FASTA files can be combined into one reference via <SHIFT><Select>
• Depositing a FASTA file in the reference_dropbox requires write access to the directory

RS_Resequencing in SMRT® Portal 1.4 - Quiver
• Choose Reference on the Design Job page
• Random placement of reads into repeats gives more uniform coverage
• The basecaller-QV-aware consensus algorithm, Quiver, is the default in SMRT Analysis 1.4
• Improved mapping selectivity to further increase accuracy of the de novo consensus
• More accurate variant calls

Highly Accurate Assembly Consensus and Variant Calls
• Download consensus.fasta for functional annotation
• Evaluate aligned reads for continuity of assembly via BAM and SAM files
• Observe uniformity of coverage to evaluate assembly accuracy and identify possible misassemblies where coverage drops
• Re-import the polished assembly as a reference to start base modification analysis
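One simple way to look for the coverage drops mentioned above is to compare windowed coverage against the contig-wide median. The Python sketch below assumes that per-base coverage for a contig has already been extracted from the aligned reads into a list of integers (that extraction is outside the sketch); the window size, the 50% threshold, and the function name are illustrative assumptions.

```python
from statistics import median


def low_coverage_windows(coverage, window=1000, fraction=0.5):
    """Flag windows whose mean coverage is below a fraction of the median.

    coverage is a list of per-base depths along one contig. Returns a list
    of (start, end, mean_coverage) tuples with 0-based, end-exclusive
    coordinates.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not coverage:
        return []
    cutoff = fraction * median(coverage)
    flagged = []
    for start in range(0, len(coverage), window):
        chunk = coverage[start:start + window]
        mean_cov = sum(chunk) / len(chunk)
        if mean_cov < cutoff:
            flagged.append((start, start + len(chunk), mean_cov))
    return flagged


# Toy contig: uniform 100X with a short dip to 10X in the middle.
toy_coverage = [100] * 10 + [10] * 5 + [100] * 10
toy_flags = low_coverage_windows(toy_coverage, window=5, fraction=0.5)
```

Flagged windows are candidates for inspection in a genome browser, not proof of misassembly; contig ends and genuinely hard-to-sequence regions can also show reduced coverage.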

The Command-Line Unlocks the Full Power of HGAP
• Run P_PreAssembler with SMRT® Pipe on the command line
• Run Celera® Assembler on the command line
• Use the Quiver option of P_GenomicConsensus to polish the assembly
• For advanced users:
– Additional tweaks to filtering and trimming may improve assembly
– A beta release of HGAP on DevNet may generate even better assemblies (separate installation required)
• More details here: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

Microbial Experimental Design Recommendations Using HGAP
• Sample Prep
– Limit DNA damage during sample extraction
– 10 kb library protocol for long read library
– Optional >10 kb protocol available through SampleNet
– Good quality sample preparation is key!
• Run Design
– XL-C2 chemistry
– MagBead loading
– Stage start
– Movie Time: 1 x 120 min
– Alternative movie times can be explored to optimize throughput
• Sequencing on the PacBio® RS and primary analysis
– Do not overload; loading titrations may be useful
• Secondary Analysis
– SMRT® Analysis 1.4: RS_Preassembler + Celera® Assembler
– Coverage: 100X
– Use XL parameters, custom trimming as necessary
– Recommend Quiver for assembly polishing to increase consensus accuracy
• Tertiary Analysis
– Base modification caveats

FAQ

Q. How large a genome does HGAP support?
PacBio has tested HGAP primarily on microbial-sized genomes. In principle, HGAP will work on genomes of 100 Mb or larger, but this has not yet been tested, and manual fine-tuning will likely be necessary to achieve the best assembly.

Q. What if customers have been using the DevNet implementation of HGAP?
For advanced users who are comfortable installing beta software, the DevNet implementation (called the “reference implementation” or “beta”) is also available.
• Advantages: potentially more scalable for larger genomes >500 Mb.
• Disadvantages: separate installation, command-line only, and may not be better in all cases.

Q. What are the future plans for HGAP in SMRT® Analysis?
In the upcoming release of SMRT Analysis 2.0, HGAP will be an integrated protocol in SMRT Portal, combining the Pre-Assembler with Celera® Assembler.

Q. What about Celera® Assembler? Will CA implement PacBio long-read-only assembly in the future?
A pre-release version of pacBioToCA can perform the pre-assembly step. More information can be found at http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA. Celera Assembler can perform the assembly step. It’s still necessary to run resequencing with Quiver to polish the final assembly. We do not know when the Celera Assembler update will officially be released. We will evaluate including the update in a future version of SMRT Analysis.

Q. Where can I get more information about HGAP?
See pacbiodevnet.com for more details; in particular: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

FIND MEANING IN COMPLEXITY © Copyright 2013 by Pacific Biosciences
