Planning, Running, and Understanding the FALCON Genome Assembly Pipeline

PacBio
June 18, 2015

Transcript

  1. Planning, Running, and Understanding the FALCON Genome Assembly Pipeline

    Matthew Seetin, Roberto Lleras, and Richard Hall / June 16 & 18, 2015
    FIND MEANING IN COMPLEXITY
    © Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
  2. Planning to Run FALCON

    • Developmental, diploid-aware genome assembler
    • Employs Gene Myers’ cutting-edge Daligner, greatly reducing computational time
    • Preserves alternate contigs in ambiguous assembly graphs
    • Requires more understanding than bacterial assembly with HGAP
  3. Planning to Run FALCON

    • Understand command line usage, compiling software, virtual environments
    • Understand your cluster’s file system
    • Understand your cluster’s queuing system
      – Syntax, time and other resource limits
    • Understand your organism
      – Size, ploidy, heterozygosity, repeat content
  4. Planning to Run FALCON

    • FALCON scales quadratically
      – Begins with all-by-all comparison of raw subreads, matches written to disk
      – Matches are then read, sorted, and merged into error-corrected, preassembled reads
      – These preassembled reads are nearly as valuable as the final result!
    • FALCON is limited by file I/O capabilities
      – Lustre file system recommended
      – NFS can only handle 3-5 concurrent jobs during the preassembly step
      – Highly repetitive genomes require quadratically more storage space
      – In exchange, 20x reduction in CPU time for human assembly
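To make the quadratic scaling concrete, the sketch below counts the unordered read pairs in an all-by-all comparison. It is an illustration only: `pairwise_comparisons` is a hypothetical helper, and the real pipeline compares blocks of reads with Daligner rather than enumerating pairs one at a time.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered pairs among n_reads reads: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


# Doubling the number of reads roughly quadruples the number of pairs.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} reads -> {pairwise_comparisons(n):>10,} pairs")
```

The same growth applies to the matches written to disk, which is why storage and file I/O become the limiting resources.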
  5. FALCON Run Times

    PacBio cluster
    • Lustre 2.1.2 HPFS
    • 8 nodes, 48 cores/node, 256 GB RAM
    • Human assembly, 3 Gbase: ~20000 CPU-hours, ~9 days
    • Animal assembly, 1 Gbase: ~2100 CPU-hours, ~1 day
    Collaborator cluster
    • NFS, not limited by nodes or RAM
    • Eukaryotic assembly, 300 Mbase: ~1 week
    DNAnexus
    • Highly optimized, high-performance cluster
    • Human assembly in ~24 hours
  6. Installing FALCON

    • https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/FALCON-Walkthrough:-Assembling-Lambda-in-the-Amazon-Cloud
  7. Preparing the FALCON Configuration File

    [General]
    job_type=local
    • Alternative is “SGE,” for running on a Sun Grid Engine cluster
    • Running locally is only practical for toy genomes
    input_fofn = input.fofn
    • File of file names listing all input files
    input_type = raw
    #input_type = preads
    • Can restart job using preassembled reads to quickly test alternative downstream parameters
  8. Preparing the FALCON Configuration File

    # The length cutoff used for seed reads used for initial mapping
    length_cutoff = 1000
    • Use your longest 30x of coverage. Length correlates strongly with assembly quality
    # The length cutoff used for seed reads used for pre-assembled reads
    length_cutoff_pr = 1000
    • Usually 0-5000 shorter than length_cutoff. A smaller value may increase contiguity at the risk of misassembly
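One way to turn "use your longest 30x of coverage" into a number is sketched below. This is not part of FALCON: `seed_length_cutoff` is a hypothetical helper, and the read lengths and genome size are made-up toy values. It walks the reads from longest to shortest and returns the length at which the accumulated bases reach the target coverage.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30):
    """Return the shortest read length still needed to reach the target
    coverage when reads are taken from longest to shortest.

    Returns None if all reads together fall short of the target.
    """
    target_bases = genome_size * target_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return None


# Toy data: a 1,000-base "genome" and five reads (hypothetical values).
toy_reads = [10_000, 8_000, 6_000, 4_000, 2_000]
print(seed_length_cutoff(toy_reads, genome_size=1_000, target_coverage=20))
```

The returned value is a candidate for length_cutoff; a result of None means the data set does not contain the requested coverage at all.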
  9. Preparing the FALCON Configuration File

    sge_option_da =
    sge_option_la =
    sge_option_pda =
    sge_option_pla =
    sge_option_fc =
    sge_option_cns =
    pa_concurrent_jobs = 1
    ovlp_concurrent_jobs = 1
    • These settings depend on your cluster setup; the text is inserted after the qsub command.
    • sge_option_da, sge_option_la, and pa_concurrent_jobs govern the most expensive steps.
  10. Preparing the FALCON Configuration File

    pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l500 -s500
    ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s500
    • Flags for Daligner
    • -dal4 specifies 4 serial calls to the daligner command per job submitted to the queue
    • -l sets the minimum length (in base pairs) of a local alignment, and -s sets how frequently (in bases) alignment points are recorded
    • -t suppresses k-mers that occur more often than the given count (that is, repeats), to limit memory usage
    • For more information: https://dazzlerblog.wordpress.com/
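The option strings pack each flag and its value into a single token. The sketch below splits them apart so the two settings can be compared side by side. `parse_daligner_flags` is a hypothetical helper written for this illustration; it only understands tokens of the form shown on the slide and is not how FALCON or Daligner read these options.

```python
import re


def parse_daligner_flags(option_string: str) -> dict:
    """Split tokens such as '-dal4' or '-e.70' into {name: value}.

    Flags without a value (for example '-v') map to True.
    """
    flags = {}
    for token in option_string.split():
        match = re.fullmatch(r"-([A-Za-z]+)([0-9.]*)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        name, value = match.groups()
        if value == "":
            flags[name] = True
        elif "." in value:
            flags[name] = float(value)
        else:
            flags[name] = int(value)
    return flags


pa = parse_daligner_flags("-v -dal4 -t16 -e.70 -l500 -s500")
ovlp = parse_daligner_flags("-v -dal4 -t32 -h60 -e.96 -l500 -s500")

# Report the flags whose values differ between the two stages.
for name in sorted(set(pa) | set(ovlp)):
    if pa.get(name) != ovlp.get(name):
        print(name, pa.get(name), ovlp.get(name))
```

Comparing the two shows that the overlap stage on preassembled reads uses a higher -e and -t and adds -h, while -l and -s are unchanged.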
  11. Preparing the FALCON Configuration File

    pa_DBsplit_option = -x500 -s50
    ovlp_DBsplit_option = -x500 -s50
    • Flags for how the read database is split up between jobs
    • -s specifies the number of megabytes in each DB chunk: a larger value produces fewer, longer jobs.
    • For more information: https://dazzlerblog.wordpress.com/
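The trade-off behind -s can be estimated with simple arithmetic. The sketch below is a rough illustration, assuming one megabyte of chunk size holds about one million bases (an assumption, not something the slide states); `estimate_blocks` and the 30 Gbase input size are hypothetical.

```python
def estimate_blocks(total_bases: int, chunk_size_mb: int) -> int:
    """Approximate number of DB chunks, assuming ~1 million bases per megabyte."""
    chunk_bases = chunk_size_mb * 1_000_000
    return -(-total_bases // chunk_bases)  # ceiling division


total_bases = 30_000_000_000  # hypothetical: 30 Gbase of raw reads
for size in (50, 200):
    print(f"-s{size}: about {estimate_blocks(total_bases, size)} chunks")
```

A larger chunk size gives fewer chunks and therefore fewer jobs, each of which has more data to work through.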
  12. Preparing the FALCON Configuration File

    falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 2
    • Flags pertain to error correction of raw reads
    • These settings work fine for many assemblies, but max_n_read may need to be lowered in highly repetitive genomes.
    overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2
    • --min_cov specifies the minimum coverage in preassembled reads required to continue extending a contig. Low values promote contiguity at the risk of misassembly
    • --max_diff specifies the maximum difference in coverage between the ends of a pread; a large difference usually indicates a repeat
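Putting the settings from the preceding slides together, the sketch below loads them with Python's standard configparser and applies a few sanity checks drawn from the slide notes. The checks and the `check_config` helper are hypothetical and are not performed by FALCON itself; the values are the ones shown on the slides, not tuned recommendations.

```python
import configparser

CONFIG_TEXT = """
[General]
job_type = local
input_fofn = input.fofn
input_type = raw
#input_type = preads
length_cutoff = 1000
length_cutoff_pr = 1000
sge_option_da =
sge_option_la =
pa_concurrent_jobs = 1
ovlp_concurrent_jobs = 1
pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l500 -s500
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s500
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 2
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2
"""


def check_config(text: str) -> list:
    """Return a list of warnings based on the rules of thumb in the slides."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["General"]
    problems = []
    if general.get("job_type", "").lower() not in ("local", "sge"):
        problems.append("job_type should be local or SGE")
    gap = general.getint("length_cutoff") - general.getint("length_cutoff_pr")
    if not 0 <= gap <= 5000:
        problems.append("length_cutoff_pr is usually 0-5000 shorter than length_cutoff")
    for key in ("pa_concurrent_jobs", "ovlp_concurrent_jobs"):
        if general.getint(key) < 1:
            problems.append(f"{key} must be at least 1")
    return problems


print(check_config(CONFIG_TEXT) or "no problems found")
```

Running a check like this before submitting jobs is cheap compared with discovering a typo after hours of queue time.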
  13. FALCON Documentation

    • https://github.com/PacificBiosciences/FALCON/blob/master/doc/falcon_manual.md
    • https://github.com/PacificBiosciences/FALCON/wiki
  14. FALCON Output
  15. FALCON Outputs

    File            Description
    p_ctg.fa        Unpolished primary contig sequences
    a_ctg.fa        Unpolished associated contig sequences
    a_ctg_base.fa   The sequence from the primary contigs to which the associated contigs correspond
    sg_edges_list   Complete list of edges in assembly string graph
    utg_data        Unitigs and the edges used to assemble them
    ctg_paths       List of contigs and the unitigs used to assemble them
    *_tiling_path   How the preassembled reads align against the primary and associated contigs
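A quick first look at the contig files is to tally sequence lengths. The sketch below assumes p_ctg.fa and a_ctg.fa are ordinary FASTA files; `contig_lengths` is a hypothetical helper, and the record names and sequences in the example are invented for illustration.

```python
def contig_lengths(fasta_text: str) -> dict:
    """Map each FASTA record name to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Invented example records; real input would come from open("p_ctg.fa").read().
example = """>ctg1 some description
ACGTACGT
ACGT
>ctg2
GGCC
"""
lengths = contig_lengths(example)
print(len(lengths), "contigs,", sum(lengths.values()), "bases in total")
```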
  16. FALCON Assembler Principle

    [Figure: three panels labeled "Truth", "Current assemblers", and "FALCON", showing the maternal allele and paternal allele]
    Keep the long-range information while maintaining the relations of the alternative alleles.
  17. String Graph Edges Connect Preassembled Reads

    • 3 overlapping reads, 6 edges
    [Figure: diagram of the overlapping reads (label "Read 0")]

    Read1:End  Read2:End  OverhangRead  Start  End   Overlap  %Identity  EdgeType
    0:E        1:B        1             2000   0     2000     99.8       G
    1:B        2:E        2             2000   5500  3000     98.4       G
    0:E        2:E        2             1000   5500  1000     99.6       TR
    2:B        1:E        1             3000   4000  3000     99.1       G
    1:E        0:B        0             3500   0     2000     99.5       G
    2:B        0:B        0             4500   0     1000     99.5       TR
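The six edge records on this slide can be read into typed objects with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields in the column order shown on the slide; the `Edge` class and its field names are hypothetical. The comment about edge types reflects my reading that G marks edges kept in the graph and TR marks edges removed by transitive reduction, which the slide does not state.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Edge:
    source: str         # Read1:End
    target: str         # Read2:End
    overhang_read: str  # read that supplies the overhang sequence
    start: int
    end: int
    overlap: int
    identity: float     # percent identity
    edge_type: str      # assumed: "G" = kept, "TR" = transitively reduced


def parse_edge(line: str) -> Edge:
    f = line.split()
    return Edge(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]), float(f[6]), f[7])


EDGE_LINES = """\
0:E 1:B 1 2000 0 2000 99.8 G
1:B 2:E 2 2000 5500 3000 98.4 G
0:E 2:E 2 1000 5500 1000 99.6 TR
2:B 1:E 1 3000 4000 3000 99.1 G
1:E 0:B 0 3500 0 2000 99.5 G
2:B 0:B 0 4500 0 1000 99.5 TR
"""

edges = [parse_edge(line) for line in EDGE_LINES.splitlines()]
type_counts = Counter(edge.edge_type for edge in edges)
print(dict(type_counts))
```

Note that both TR edges skip over read 1: 0:E to 2:E is also reachable through 1:B, and 2:B to 0:B through 1:E.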
  18. Unitigs are Collections of Connected Edges

    • “Simple” unitigs are unambiguously resolved straight lines of edges
    • Consecutive edges usually documented as: start:NA:end
    • “Compound” unitigs demarcate bubbles or other ambiguities in the graph
    • Edges grouped as: start:via:end
    • via edge indicates which branch of a fork to take
    • “Contained” unitigs are the paths within another compound unitig
  19. Advanced: Visualizing the String Graph to Investigate Misassembly

    • In the CHM13 assembly, Contig 39 (blue) mapped to two different human chromosomes. The first 3 Mb aligned with chr13 and the last 17 Mb with chr14
    • FALCON includes Python code to identify neighbors in the string graph, which represent alternative hypotheses for the assembly
    For the very adventurous: https://dl.dropboxusercontent.com/u/38943405/assembly_graph_notebook/asm_graph_exploration_notebook_CHM13.slides.html
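The idea of looking up neighbors in the string graph can be sketched with a breadth-first search. This is not FALCON's own code: `neighborhood` is a hypothetical helper, and the toy graph reuses the six edges from slide 17, treating the first column as the source node and the second as the target.

```python
from collections import deque

# (source, target) pairs taken from the edge list on slide 17.
EDGES = [
    ("0:E", "1:B"),
    ("1:B", "2:E"),
    ("0:E", "2:E"),
    ("2:B", "1:E"),
    ("1:E", "0:B"),
    ("2:B", "0:B"),
]


def neighborhood(edges, start, max_depth):
    """Return {node: distance} for nodes reachable from start
    within max_depth directed steps (start itself excluded)."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_depth:
            continue  # do not expand beyond the requested radius
        for nxt in adjacency.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    del distance[start]
    return distance


print(neighborhood(EDGES, "0:E", max_depth=1))
```

On a real assembly, the nodes found this way around a suspicious join are the candidates for an alternative path through the graph.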
  20. The Future of FALCON

    • V 0.2 will remain stable
      – Fix bugs
    • V 0.3
      – New Daligner version
      – Support for 100 kb subreads
      – New consensus code for superior diploid handling
      – Other features under discussion on FALCON’s GitHub page
      – Release TBD
  21. For Research Use Only. Not for use in diagnostic procedures.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.