Track 2: Beyond Bacteria

PacBio
October 15, 2014

Transcript

  1. Beyond Bacteria, Assembling >100 Mb Genomes Using Only PacBio® Sequence and Hybrid Assembly Methods

    Richard J Hall / 10/9/2014
    FIND MEANING IN COMPLEXITY
    © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
  2. Options for >100 Mb Genomes

    • Gap filling
      – High-quality draft assembly
      – Low-coverage PacBio® data
    • Hybrid assembly
      – Low-coverage PacBio data (>10x to <40x)
      – Illumina data
    • PacBio-only assembly
      – High-coverage PacBio data (>40x)
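The coverage ranges on this slide can be read as a simple decision rule. The sketch below is a hypothetical helper, not part of any PacBio tool: the names `estimate_coverage` and `suggest_strategy` are introduced here for illustration, and the handling of exactly 10x and 40x (which the slide leaves open) is an assumption.

```python
def estimate_coverage(total_bases: int, genome_size: int) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def suggest_strategy(coverage: float) -> str:
    """Map PacBio fold coverage to one of the three options on the slide.

    Assumption: values of exactly 10x or 40x fall into the lower-coverage
    option; the slide only states >10x to <40x (hybrid) and >40x (PacBio-only).
    """
    if coverage > 40:
        return "PacBio-only assembly"
    if coverage > 10:
        return "hybrid assembly"
    return "gap filling"


if __name__ == "__main__":
    # Hypothetical project: 7.5 Gb of PacBio reads for a 150 Mb genome.
    cov = estimate_coverage(7_500_000_000, 150_000_000)
    print(cov, suggest_strategy(cov))
```

In practice the choice also depends on whether a high-quality draft or Illumina data already exists, as the bullets above indicate; coverage alone is only the first filter.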
  3. A Customer’s Perspective on Eukaryotic De Novo Assembly Experimental Design

    PAG 2013: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”
  4. Tools, additional data sets, and genome size constraints

    Name                       | Additional Data Sets                   | Genome Size Constraint
    HGAP                       | None                                   | <100 Mb – SMRT® Portal; <500 Mb – SMRT Pipe; >500 Mb – SMRT Make
    HBAR-DTK                   | None                                   | Compute power & time
    Falcon                     | None                                   | <150 Mb; command line only
    pacBioToCA self-correction | None                                   | Compute power & time
    Celera® Assembler          | None                                   | Compute power & time
    Sprai                      | None                                   | <2 Gbp; compute power & time

    Hybrid Assemblers
    pacBioToCA                 | >50X short-read data                   | <3 Gb; compute power & time
    ECTools                    | >50X short-read data                   | <3 Gb; compute power & time
    Spades                     | Illumina contigs                       | Consult the author
    Cerulean                   | ABySS assembly graph; Illumina contigs | Consult the author

    Gap Filling
    PBJelly 2                  | 50X short reads                        | Compute power & time

    Scaffolding
    AHA                        | High-confidence contigs                | <200 Mb; <20,000 contigs
    PBJelly 2                  | High-confidence scaffolds              | <4 Gb; compute power & time
  5. Gap Filling

  6. Towards Gap-Free Reference Genomes

    English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS One.

    Genome (size)              | Assembly | Gap Count | Total Gap Size (Mb) | Contig N50 (kb) | Contig N50 Improvement
    D. melanogaster (139.5 Mb) | Original | 4651      | 3.19                | 64              |
                               | PacBio   | 311       | 0.54                | 723.6           | 1030.6% (11.3x)
    D. pseudoobscura (176.04 Mb) | Original | 6026    | 6.67                | 53              |
                               | PacBio   | 1852      | 3.61                | 224.4           | 323.4% (4.2x)
    M. undulatus (1.23 Gb)     | Original | 49,376    | 154.9               | 134.4           |
                               | PacBio   | 39,204    | 134.6               | 233.27          | 73.6% (1.74x)
    C. atys (2.82 Gb)          | Original | 186,841   | 197.5               | 34.92           |
                               | PacBio   | 66,211    | 79.3                | 128.38          | 267.6% (3.68x)
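The "Contig N50 Improvement" figures can be re-derived from the N50 values reported on the slide. The following is a small arithmetic sketch; the dictionary and function names are introduced here for illustration, and the numbers are copied from the slide rather than recomputed from any assembly.

```python
# Contig N50 in kb, as reported on the slide: (original, after PacBio gap filling).
N50_KB = {
    "D. melanogaster": (64.0, 723.6),
    "D. pseudoobscura": (53.0, 224.4),
    "M. undulatus": (134.4, 233.27),
    "C. atys": (34.92, 128.38),
}


def n50_improvement(original_kb: float, upgraded_kb: float):
    """Return (fold change, percent improvement) between two N50 values."""
    fold = upgraded_kb / original_kb
    percent = (upgraded_kb - original_kb) / original_kb * 100.0
    return fold, percent


if __name__ == "__main__":
    for genome, (orig, new) in N50_KB.items():
        fold, percent = n50_improvement(orig, new)
        print(f"{genome}: {percent:.1f}% ({fold:.2f}x)")
```

Note that the percentage is the relative increase, so an 11.3x fold change corresponds to roughly a 1030% improvement, not 1130%.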
  7. PBJelly

    • Part of PBSuite
      – http://sourceforge.net/projects/pb-jelly/
      – PBJelly - the genome upgrading tool
      – PBHoney - the structural variation discovery tool
    • Advantages
      – Makes use of very low-coverage PacBio® data
      – Significant improvements for well-studied assemblies
      – Reasonable computation times, even for large genomes
    • Disadvantages
      – Requires a ‘good’ draft genome
      – Limited by the current assembly
      – Will not correct for missing data in the current assembly, due to the limitations of the technologies used
  8. Hybrid Assembly Methods

  9. Hybrid Assembly Tools PAG 2013: Michael Schatz, “De novo assembly

    of complex genomes using single molecule sequencing”
  10. ECTools

    • https://github.com/jgurtowski/ectools
    • Advantages
      – Makes use of low- to medium-coverage PacBio® data
      – More computationally efficient than correcting with short reads (PBcR - http://wgs-assembler.sourceforge.net/wiki/index.php?title=PBcR)
      – Limited largely by computational resources, large genomes
    • Disadvantages
      – Does not overcome all the limitations of short-read assembly
  11. Example: 15 kb Repeat Region Spanned with PacBio® Assembly Presented

    by S. Vij & L. Orban (Temasek Life Sciences Laboratory) at PAG Asia, May 2014
  12. Assembly with Only PacBio® Data

  13. HGAP Evolution

    Pipelines: Dazzler, MHAP, FALCON, HGAP.3, HGAP.2, HGAP.1
    Status labels: Experimental, Deprecated, Production

    Components per stage, in slide order:
    Alignment: Daligner, MHAP, FALCON, BLASR, BLASR, BLASR
    Correction: CA/dagcon, FALCON, PB/dagcon, PB/dagcon, AMOS/make-consensus
    Overlap: CA/overlap, FALCON, CA/overlap, CA/overlap, CA/overlap
    Layout: CA/unitigger, FALCON, CA/unitigger, CA/unitigger, CA/unitigger
    Consensus: CA/utgcns, PB/utgcns, CA/utgcns, CA/utgcns

    • Bug fixes in PB/dagcon benefit both HGAP.2 and HGAP.3
    • FALCON informs HGAP development
    • Nearly interchangeable
    • Replacing least efficient pieces first, while still maintaining functionality

    Dazzler - https://github.com/thegenemyers/DALIGNER
    MHAP - http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR
  14. For Research Use Only. Not for use in diagnostic procedures.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
  15. SMRT® Make

    James Drake – Bioinformatics Workshop 10/2014
  16. SMRT® Make

    • https://github.com/PacificBiosciences/smrtmake
    • Make file-driven, bare-bones SMRT Analysis workflows
    • Why make?
      – Re-entrant
      – Stable, well-documented
      – Portable
      – Make files are (mostly) self-contained, easily modified
      – Flexible
        • Custom analysis
        • Rapid prototyping, development, and experimentation
        • Not everything wraps nicely into SMRT Analysis
  17. Design Principles

    • Initially designed to be self-contained within a single file, but the desire to modularize is overwhelming.
      – Chunking, i.e., splitting the data up for parallel processing
      – Defining cluster submission pragmas (e.g., qsub calls)
      – Data filtering
      – Reports
    • Provides the general-purpose use case; the user customizes it to fit their needs.
    • Minimal parameters
    • Serves as a record of the experiment
    • Easily reproducible by others: just share your make file.
    • Create self-contained make files by simply `cat`ing them together.
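Chunking, as described above, means dividing the input so that each piece can be processed independently. The slides do not show how SMRT Make assigns inputs to chunks, so the following is only a minimal sketch under an assumed round-robin scheme; `split_into_chunks` and the file names are hypothetical.

```python
from typing import List


def split_into_chunks(items: List[str], n_chunks: int) -> List[List[str]]:
    """Distribute items round-robin into n_chunks lists.

    Assumption: round-robin assignment; the actual SMRT Make scheme may differ.
    Chunks can be empty when there are fewer items than chunks.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for index, item in enumerate(items):
        chunks[index % n_chunks].append(item)
    return chunks


if __name__ == "__main__":
    # Hypothetical input files, e.g. the entries of an input.fofn.
    inputs = [f"movie_{i}.bax.h5" for i in range(7)]
    for number, chunk in enumerate(split_into_chunks(inputs, 3)):
        print(number, chunk)
```

Each chunk could then be handed to its own cluster job, which is where the submission pragmas mentioned in the list come in.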
  18. Anatomy of a SMRT® Make File

    Sections: Parameters, Initialization, Recipes. Typically run in bottom-up order.

        # SGE queue name to submit jobs to
        QUEUE ?= huasm

        # Size of the genome
        GENOME_SIZE ?= 700000000

        # Splits data into this many chunks, each chunk processed independently
        CHUNK_SIZE ?= 15

        # How many threads a process will use (also how many SGE slots will be requested)
        NPROC ?= 32

        # Local temp root directory, must have write access and a decent amount of space (~100GB)
        LOCALTMP ?= /scratch
  19. Interacting with a SMRT® Make File

        # Typically just need an input.fofn file and a SMRT Make file
        > ls
        input.fofn hgap3.mk

        # Rename a make file so you don't have to explicitly pass it as an argument
        > mv hgap3.mk Makefile

        # Change the genome size
        > make GENOME_SIZE=100000000

        # Run 5 jobs at a time and split the data into 10 'chunks'
        > make -j 5 CHUNK_SIZE=10

        # Recover and continue, e.g., a large assembly that takes days
        > make -f hgap3.mk
        progress ... progress ... progress ... ERROR! (cluster dies, out of disk, etc)
        > fix ... fix ... fix ...
        > make -f hgap3.mk

        # Save/Share your make files
        > make -f my_super_cool_assembly.mk
        (makes best assembly ever)
        > mail -s "Check this out" collegue@university.edu < my_super_cool_assembly.mk
        # or submit a GitHub pull request to PacBio and share with the community!
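Recovering and continuing works because make only rebuilds a target that is missing or older than one of its prerequisites, so steps that finished before a failure are skipped on the next run. The sketch below imitates that rule in Python for illustration only; `needs_rebuild`, `run_pipeline`, and the step layout are hypothetical names, not part of make or SMRT Make.

```python
import os
import tempfile


def needs_rebuild(target: str, deps: list) -> bool:
    """Make-style check: rebuild if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in deps)


def run_pipeline(steps: list, log: list) -> None:
    """Run steps in order, skipping those whose target is already up to date.

    steps: list of (target, deps, action); action(target) must create target.
    log:   receives the targets that were actually (re)built.
    """
    for target, deps, action in steps:
        if needs_rebuild(target, deps):
            action(target)
            log.append(target)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "input.fofn")
    open(source, "w").close()
    target = os.path.join(workdir, "assembly.fasta")

    built = []
    steps = [(target, [source], lambda path: open(path, "w").close())]
    run_pipeline(steps, built)  # first run builds the target
    run_pipeline(steps, built)  # second run finds it up to date and skips it
    print(len(built))
```

If an action raises partway through, the targets built so far stay on disk, and a second call to `run_pipeline` resumes at the first missing or stale target.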
  20. Example - Circularization

    Diagram labels: HGAP, reads, assembly, Plasmids, Bacteria, Higher-organisms, circularization.mk

    1) Iterates over all FASTA-formatted entries
    2) Gets the beginning and end of each sequence
    3) Runs BLASR to detect overlap
    4) Declares if entry ‘looks’ circular based on thresholds
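The four steps above can be sketched in a few lines. This is not the logic of circularization.mk, which is not shown in the slides: the sketch swaps BLASR for an exact suffix-to-prefix string match, and the function names, window size, and overlap threshold are all assumptions chosen for illustration.

```python
def parse_fasta(text: str):
    """Yield (name, sequence) for each FASTA entry in a string."""
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            header = line[1:].split()
            name, parts = (header[0] if header else ""), []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def end_overlap(seq: str, window: int = 1000, min_overlap: int = 20) -> int:
    """Longest exact match between a suffix and a prefix of seq.

    Only the first and last `window` bases are compared. Returns 0 when no
    match of at least `min_overlap` bases exists (min_overlap should be >= 1).
    Exact matching is a stand-in for BLASR, which tolerates sequencing errors.
    """
    window = min(window, len(seq) // 2)
    head, tail = seq[:window], seq[len(seq) - window:]
    for k in range(window, min_overlap - 1, -1):
        if tail[-k:] == head[:k]:
            return k
    return 0


def looks_circular(seq: str, window: int = 1000, min_overlap: int = 20) -> bool:
    """Declare an entry circular if its ends overlap by at least min_overlap bases."""
    return end_overlap(seq, window, min_overlap) >= min_overlap


if __name__ == "__main__":
    repeat = "ACGTACGGTCATTGCA"
    fasta = ">contig_1\n" + repeat + "G" * 40 + repeat + "\n>contig_2\n" + "A" * 30 + "C" * 30 + "\n"
    for name, seq in parse_fasta(fasta):
        print(name, looks_circular(seq, min_overlap=10))
```

With real PacBio contigs the two ends rarely match exactly, which is why an aligner and identity and length thresholds are used instead of string equality.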