Planning, Running, and Understanding the FALCON Genome Assembly Pipeline

© Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.
For Research Use Only. Not for use in diagnostic procedures.

Matthew Seetin, Roberto Lleras, and Richard Hall / June 16 & 18, 2015

Planning to Run FALCON
• Developmental, diploid-aware genome assembler
• Employs Gene Myers’ cutting-edge Daligner, greatly reducing computational time
• Preserves alternate contigs in ambiguous assembly graphs
• Requires more understanding than bacterial assembly with HGAP

Planning to Run FALCON
• Understand command line usage, compiling software, and virtual environments
• Understand your cluster’s file system
• Understand your cluster’s queuing system
  – Syntax, time, and other resource limits
• Understand your organism
  – Size, ploidy, heterozygosity, repeat content

Planning to Run FALCON
• FALCON scales quadratically
  – Begins with an all-by-all comparison of raw subreads; matches are written to disk
  – Matches are then read, sorted, and merged into error-corrected, preassembled reads
  – These preassembled reads are nearly as valuable as the final result!
• FALCON is limited by file I/O capabilities
  – Lustre file system recommended
  – NFS can only handle 3-5 concurrent jobs during the preassembly step
  – Highly repetitive genomes require quadratically more storage space
  – In exchange, 20x reduction in CPU time for human assembly

FALCON Run Times
• PacBio cluster
  – Lustre 2.1.2 HPFS
  – 8 nodes, 48 cores/node, 256 GB RAM
  – Human assembly, 3 Gbase: ~20000 CPU-hours, ~9 days
  – Animal assembly, 1 Gbase: ~2100 CPU-hours, ~1 day
• Collaborator cluster
  – NFS, not limited by nodes or RAM
  – Eukaryotic assembly, 300 Mbase: ~1 week
• DNAnexus
  – Highly optimized, high-performance cluster
  – Human assembly in ~24 hours
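A back-of-the-envelope check on the PacBio cluster figures can be sketched in Python. The function below divides the quoted CPU-hours by the number of cores to get the wall-clock time that perfect parallelism would give. The perfect-parallelism model and the function name are assumptions for illustration, not something the slides state.

```python
def ideal_wall_hours(cpu_hours, nodes, cores_per_node):
    """Wall-clock hours if every core stayed busy for the whole run (a lower bound)."""
    return cpu_hours / (nodes * cores_per_node)


# Figures quoted for the PacBio cluster: 8 nodes with 48 cores each.
human = ideal_wall_hours(20000, nodes=8, cores_per_node=48)
animal = ideal_wall_hours(2100, nodes=8, cores_per_node=48)

print(f"human:  {human:.1f} h ideal vs ~{9 * 24} h reported")
print(f"animal: {animal:.1f} h ideal vs ~24 h reported")
```

The reported run times are several times longer than this lower bound. That is consistent with the point that file I/O rather than CPU limits FALCON, although queue waits and serial steps would contribute as well.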

Installing FALCON
• https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/FALCON-Walkthrough:-Assembling-Lambda-in-the-Amazon-Cloud

Preparing the FALCON Configuration File

[General]
job_type = local
• Alternative is “SGE,” for running on a Sun Grid Engine cluster
• Running locally is only practical for toy genomes

input_fofn = input.fofn
• File of file names listing all input files

input_type = raw
#input_type = preads
• Can restart a job using preassembled reads to quickly test alternative downstream parameters
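The file of file names is plain text with one input path per line. A minimal Python sketch for generating it follows. The helper name write_fofn and the "*.fasta" pattern are assumptions for illustration; adjust the pattern to however your subread files are named.

```python
from pathlib import Path


def write_fofn(read_dir, fofn_path="input.fofn", pattern="*.fasta"):
    """Write one absolute path per line for every file in read_dir matching pattern."""
    paths = sorted(p.resolve() for p in Path(read_dir).glob(pattern))
    Path(fofn_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example call (hypothetical directory name):
#   write_fofn("subreads/", "input.fofn")
```

Absolute paths are used here so that jobs started from other working directories on the cluster can still find the inputs.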

Preparing the FALCON Configuration File

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 1000
• Use your longest 30x of coverage. Length correlates strongly with assembly quality

# The length cutoff used for seed reads used for pre-assembled reads
length_cutoff_pr = 1000
• Usually 0-5000 shorter than length_cutoff. A lower value may increase contiguity at the risk of misassembly
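One way to turn "use your longest 30x of coverage" into a number is to sort the subread lengths from longest to shortest and find the length at which the running total reaches 30 times the genome size. The sketch below does this. The function name and the toy numbers are illustrative assumptions, not part of FALCON.

```python
def length_cutoff_for_coverage(read_lengths, genome_size, target_coverage=30):
    """Return the shortest length among the longest reads that together reach
    target_coverage x genome_size bases, or None if the data fall short."""
    needed = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed:
            return length
    return None


# Toy example: a 1000-base "genome" and a 3x target need 3000 bases of seed reads.
toy_reads = [900, 800, 700, 600, 500, 400]
print(length_cutoff_for_coverage(toy_reads, genome_size=1000, target_coverage=3))  # 600
```

A return value of None means the data set holds less than the target coverage in total, so no cutoff can satisfy the rule.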

Preparing the FALCON Configuration File

sge_option_da =
sge_option_la =
sge_option_pda =
sge_option_pla =
sge_option_fc =
sge_option_cns =
pa_concurrent_jobs = 1
ovlp_concurrent_jobs = 1
• These settings depend on your cluster setup; the text is inserted after the qsub command.
• sge_option_da, sge_option_la, and pa_concurrent_jobs govern the most expensive steps.

Preparing the FALCON Configuration File

pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l500 -s500
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s500
• Flags for Daligner
• -dal4 specifies 4 serial calls to the daligner command per job submitted to the queue
• -l and -s set, respectively, how many base pairs constitute the minimum local alignment and how frequently (in bases) these are recorded
• -t suppresses k-mers that occur more often than the given frequency (i.e., repeats) so as to limit memory usage
• For more information: https://dazzlerblog.wordpress.com/

Preparing the FALCON Configuration File

pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50
• Flags for how the read database is split up between jobs
• -s specifies the number of megabytes in each DB chunk: a larger number generates a smaller number of longer jobs.
• For more information: https://dazzlerblog.wordpress.com/
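The effect of the chunk size on job count, and the quadratic scaling of the all-by-all comparison, can be illustrated with a rough Python estimate. It assumes one megabyte of database holds roughly one million bases, ignores reads removed by -x, and counts every block compared against itself and against every other block. The function name and the 50x data set are hypothetical.

```python
import math


def block_stats(total_bases, block_size_mb):
    """Return (number of DB blocks, number of block-against-block comparisons).

    Comparisons count each unordered pair of blocks plus each block against itself.
    """
    n_blocks = math.ceil(total_bases / (block_size_mb * 1_000_000))
    n_comparisons = n_blocks * (n_blocks + 1) // 2
    return n_blocks, n_comparisons


# Hypothetical data set: a 1 Gbase genome at 50x coverage, split with -s50.
print(block_stats(50 * 1_000_000_000, block_size_mb=50))   # (1000, 500500)
# Doubling the amount of data roughly quadruples the comparisons.
print(block_stats(100 * 1_000_000_000, block_size_mb=50))  # (2000, 2001000)
```

A larger -s value gives fewer blocks and therefore fewer, longer comparisons, which matches the trade-off described above.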

Preparing the FALCON Configuration File

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 2
• Flags pertain to error correction of raw reads
• These settings work fine for many assemblies, but max_n_read may need to be lowered in highly repetitive genomes.

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2
• --min_cov specifies the minimum coverage in preassembled reads required to continue extending a contig. Low values promote contiguity at the risk of misassembly
• --max_diff specifies the maximum difference in coverage between the ends of a pread; a large difference usually indicates a repeat
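To make the three overlap-filtering thresholds concrete, here is a minimal Python sketch of how they could combine into a keep-or-discard decision for one pread. It is an illustration under stated assumptions, not FALCON's own filtering code: the function name is hypothetical, and the inputs are assumed to be the number of overlaps at each end of the pread.

```python
def keep_pread(cov_5p, cov_3p, max_diff=100, max_cov=100, min_cov=2):
    """Decide whether a pread passes, given overlap counts at its two ends."""
    if abs(cov_5p - cov_3p) > max_diff:
        return False  # unbalanced ends: one end probably sits in a repeat
    if max(cov_5p, cov_3p) > max_cov:
        return False  # too many overlaps at one end: likely a repeat
    if min(cov_5p, cov_3p) < min_cov:
        return False  # too few overlaps at one end to extend with confidence
    return True


print(keep_pread(30, 35))   # True: balanced, moderate coverage
print(keep_pread(1, 30))    # False: below min_cov at one end
print(keep_pread(150, 40))  # False: coverage difference exceeds max_diff
```

The defaults mirror the values in overlap_filtering_setting above.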

FALCON Documentation
• https://github.com/PacificBiosciences/FALCON/blob/master/doc/falcon_manual.md
• https://github.com/PacificBiosciences/FALCON/wiki

FALCON Output

FALCON Outputs

File              Description
p_ctg.fa          Unpolished primary contig sequences
a_ctg.fa          Unpolished associated contig sequences
a_ctg_base.fa     The sequence from the primary contigs to which the associated contigs correspond
sg_edges_list     Complete list of edges in the assembly string graph
utg_data          Unitigs and the edges used to assemble them
ctg_paths         List of contigs and the unitigs used to assemble them
*_tiling_path     How the preassembled reads align against the primary and associated contigs
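A quick first look at p_ctg.fa or a_ctg.fa is usually a contig count and an N50. The Python sketch below reads contig lengths from a FASTA file and computes N50 with the standard library only. Both helper names are illustrative assumptions rather than FALCON tools.

```python
def read_fasta_lengths(path):
    """Map each FASTA record name (first word of its header) to its sequence length."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Example use on the primary contigs:
#   contigs = read_fasta_lengths("p_ctg.fa")
#   print(len(contigs), n50(contigs.values()))
```

Running the same two calls on a_ctg.fa gives a rough sense of how much alternate sequence the assembly preserved.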

FALCON Assembler Principle
[Figure: maternal and paternal alleles shown in three panels: Truth, Current assemblers, FALCON]
Keep the long-range information while maintaining the relations of the alternative alleles.

String Graph Edges Connect Preassembled Reads
• 3 overlapping reads, 6 edges

Read1:End  Read2:End  OverhangRead  Start  End   Overlap  %Identity  EdgeType
0:E        1:B        1             2000   0     2000     99.8       G
1:B        2:E        2             2000   5500  3000     98.4       G
0:E        2:E        2             1000   5500  1000     99.6       TR
2:B        1:E        1             3000   4000  3000     99.1       G
1:E        0:B        0             3500   0     2000     99.5       G
2:B        0:B        0             4500   0     1000     99.5       TR
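Records in this eight-column layout are easy to load for inspection. The Python sketch below parses the six example edges into named tuples and counts them by edge type. The field names are my own labels for the slide's columns. The reading of G as an edge kept in the graph and TR as an edge removed by transitive reduction is an assumption about FALCON's convention that the slide does not spell out.

```python
from collections import Counter, namedtuple

Edge = namedtuple(
    "Edge", "source target overhang_read start end overlap identity edge_type"
)


def parse_edge(line):
    """Parse one whitespace-separated edge record with the eight columns above."""
    f = line.split()
    return Edge(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), float(f[6]), f[7])


# The six edges from the slide, copied verbatim.
EXAMPLE = """\
0:E 1:B 1 2000 0 2000 99.8 G
1:B 2:E 2 2000 5500 3000 98.4 G
0:E 2:E 2 1000 5500 1000 99.6 TR
2:B 1:E 1 3000 4000 3000 99.1 G
1:E 0:B 0 3500 0 2000 99.5 G
2:B 0:B 0 4500 0 1000 99.5 TR
"""

edges = [parse_edge(line) for line in EXAMPLE.splitlines()]
by_type = Counter(edge.edge_type for edge in edges)
print(by_type["G"], by_type["TR"])  # 4 2
```

The same parse_edge function can be applied line by line to sg_edges_list, provided the file follows this column layout.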

Unitigs are Collections of Connected Edges
• “Simple” unitigs are unambiguously resolved straight lines of edges
  – Consecutive edges usually documented as: start:NA:end
• “Compound” unitigs demarcate bubbles or other ambiguities in the graph
  – Edges grouped as: start:via:end
  – The via edge indicates which branch of a fork to take
• “Contained” unitigs are the paths within another compound unitig

Advanced: Visualizing the String Graph to Investigate Misassembly
• In the CHM13 assembly, Contig 39 (blue) mapped to two different human chromosomes. The first 3 Mb aligned with chr13 and the last 17 Mb with chr14
• FALCON includes Python code to identify neighbors in the string graph, which represent alternative hypotheses for the assembly
For the very adventurous: https://dl.dropboxusercontent.com/u/38943405/assembly_graph_notebook/asm_graph_exploration_notebook_CHM13.slides.html

The Future of FALCON
• V 0.2 will remain stable
  – Fix bugs
• V 0.3
  – New Daligner version
  – Support for 100 kb subreads
  – New consensus code for superior diploid handling
  – Other features under discussion on FALCON’s GitHub page
  – Release TBD

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.
