Track 2: Beyond Bacteria

FIND MEANING IN COMPLEXITY
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
Richard J Hall / 10/9/2014
Beyond Bacteria, Assembling >100 Mb Genomes Using Only PacBio® Sequence and Hybrid Assembly Methods

Options for >100 Mb Genomes
• Gap filling
  – High-quality draft assembly
  – Low-coverage PacBio® data
• Hybrid assembly
  – Low-coverage PacBio data >10x to <40x
  – Illumina data
• PacBio-only assembly
  – High-coverage PacBio data >40x
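The coverage ranges on this slide can be read as a rough decision rule. The sketch below is only an illustration of that reading: the function names are hypothetical, exactly 40x is assumed to count as PacBio-only, and gap filling is assumed to be available whenever a draft assembly exists, because the slide gives no numeric cutoff for "low coverage" there.

```python
def fold_coverage(total_read_bases, genome_size):
    """Average fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def candidate_strategies(coverage, has_draft_assembly=False, has_illumina=False):
    """Return the slide's options that the available data could support.

    Assumptions (not from the slide): 40x itself is treated as PacBio-only,
    and gap filling is offered for any non-zero coverage once a draft exists.
    """
    options = []
    if has_draft_assembly and coverage > 0:
        options.append("gap filling")
    if has_illumina and 10 < coverage < 40:
        options.append("hybrid assembly")
    if coverage >= 40:
        options.append("PacBio-only assembly")
    return options


# Hypothetical project: 15 Gb of PacBio reads for a 700 Mb genome, plus Illumina data.
cov = fold_coverage(15_000_000_000, 700_000_000)
print(round(cov, 1), candidate_strategies(cov, has_illumina=True))
```

The thresholds are planning guidance rather than hard limits, so borderline projects are better judged against compute budget and read-length distribution than against this rule alone.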

A Customer’s Perspective on Eukaryotic De Novo Assembly Experimental Design
PAG 2013: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”

Name | Additional Data Sets | Genome Size Constraint
HGAP | None | <100 Mb – SMRT® Portal; <500 Mb – SMRT Pipe; >500 Mb – SMRT Make
HBAR-DTK | None | Compute power & time
Falcon | None | <150 Mb, command line only
pacBioToCA self-correction | None | Compute power & time
Celera® Assembler | None | Compute power & time
Sprai | None | <2 Gbp, compute power & time
Hybrid Assemblers
pacBioToCA | >50X short-read data | <3 Gb, compute power & time
ECTools | >50X short-read data | <3 Gb, compute power & time
Spades | Illumina contigs | Consult the author
Cerulean | ABySS assembly graph; Illumina contigs | Consult the author
Gap Filling
PBJelly 2 | 50X short reads | Compute power & time
Scaffolding
AHA | High-confidence contigs | <200 Mb, <20,000 contigs
PBJelly 2 | High-confidence scaffolds | <4 Gb, compute power & time

Gap Filling

Towards Gap-Free Reference Genomes
English et al. (2012) Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS One.
Genome | Assembly | Gap Count | Total Gap Size (Mb) | Contig N50 (kb) | Contig N50 Improvement
D. melanogaster (139.5 Mb) | Original | 4651 | 3.19 | 64 |
D. melanogaster (139.5 Mb) | PacBio | 311 | 0.54 | 723.6 | 1030.6% (11.3x)
D. pseudoobscura (176.04 Mb) | Original | 6026 | 6.67 | 53 |
D. pseudoobscura (176.04 Mb) | PacBio | 1852 | 3.61 | 224.4 | 323.4% (4.2x)
M. undulatus (1.23 Gb) | Original | 49,376 | 154.9 | 134.4 |
M. undulatus (1.23 Gb) | PacBio | 39,204 | 134.6 | 233.27 | 73.6% (1.74x)
C. atys (2.82 Gb) | Original | 186,841 | 197.5 | 34.92 |
C. atys (2.82 Gb) | PacBio | 66,211 | 79.3 | 128.38 | 267.6% (3.68x)
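The "Contig N50 Improvement" figures follow from the two N50 values for each genome: the fold change is the upgraded N50 divided by the original, and the percentage is that fold change minus one. A minimal sketch of the arithmetic, together with the usual N50 definition; the function names are illustrative and are not part of PBJelly:

```python
def contig_n50(lengths):
    """Length L such that contigs of length >= L hold at least half the assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def n50_improvement(original_kb, upgraded_kb):
    """Return (percent improvement, fold change) between two N50 values."""
    fold = upgraded_kb / original_kb
    return (fold - 1) * 100, fold


# D. melanogaster row from the table: 64 kb before, 723.6 kb after.
pct, fold = n50_improvement(64, 723.6)
print(f"{pct:.1f}% ({fold:.1f}x)")
```

Applying the same function to the other three genomes gives the remaining percentages in the table after rounding to one decimal place.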

PBJelly
• Part of PBSuite
  – http://sourceforge.net/projects/pb-jelly/
  – PBJelly - the genome upgrading tool.
  – PBHoney - the structural variation discovery tool
• Advantages
  – Make use of very low-coverage PacBio® data
  – Significant improvements for well-studied assemblies
  – Reasonable computation times, even for large genomes
• Disadvantages
  – Requires a ‘good’ draft genome
  – Limited by the current assembly
  – Will not correct for missing data in the current assembly, due to the limitations of the technologies used

Hybrid Assembly Methods

Hybrid Assembly Tools
PAG 2013: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”

ECTools
• https://github.com/jgurtowski/ectools
• Advantages
  – Make use of low- to medium-coverage PacBio® data
  – More computationally efficient than correcting with short reads (PBcR - http://wgs-assembler.sourceforge.net/wiki/index.php?title=PBcR)
  – Limited largely by computational resources, large genomes
• Disadvantages
  – Does not overcome all the limitations of short-read assembly

Example: 15 kb Repeat Region Spanned with PacBio® Assembly
Presented by S. Vij & L. Orban (Temasek Life Sciences Laboratory) at PAG Asia, May 2014

Assembly with Only PacBio® Data

HGAP Evolution
Pipelines: Dazzler, MHAP, FALCON, HGAP.3, HGAP.2, HGAP.1
Alignment: Daligner, MHAP, FALCON, BLASR, BLASR, BLASR
Correction: CA/dagcon, FALCON, PB/dagcon, PB/dagcon, AMOS/make-consensus
Overlap: CA/overlap, FALCON, CA/overlap, CA/overlap, CA/overlap
Layout: CA/unitigger, FALCON, CA/unitigger, CA/unitigger, CA/unitigger
Consensus: CA/utgcns, PB/utgcns, CA/utgcns, CA/utgcns
Status labels: Experimental, Deprecated, Production
• Bug fixes in PB/dagcon benefit both HGAP.2 and HGAP.3
• FALCON informs HGAP development
• Nearly interchangeable
• Replacing least efficient pieces first, while still maintaining functionality
Dazzler - https://github.com/thegenemyers/DALIGNER
MHAP - http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR

For Research Use Only. Not for use in diagnostic procedures.
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.

James Drake – Bioinformatics Workshop 10/2014
SMRT® Make

SMRT® Make
• https://github.com/PacificBiosciences/smrtmake
• Make file-driven, bare-bones SMRT Analysis workflows
• Why make?
  – Re-entrant
  – Stable, well-documented
  – Portable
  – Make files are (mostly) self-contained, easily modified
  – Flexible
• Custom analysis
• Rapid prototyping, development, and experimentation
• Not everything wraps nicely into SMRT Analysis

Design Principles
• Initially designed to be self-contained within a single file, but desire to modularize is overwhelming.
  – Chunking, i.e., splitting the data up for parallel processing.
  – Defining cluster submission pragmas (e.g., qsub calls)
  – Data filtering
  – Reports
• Provide the general purpose use case, user customizes to fit their needs.
• Minimal parameters
• Serves as a record of the experiment
• Easily reproducible by others, just share your make file.
• Create self-contained make files by simply `cat`ing them together.
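Chunking, as described above, amounts to dividing the list of input files into groups that can be processed independently. The following is a minimal sketch of one way to do that, assuming a round-robin split; the function name and file paths are hypothetical, and the SMRT Make files themselves may divide the data differently.

```python
def split_into_chunks(paths, n_chunks):
    """Distribute paths round-robin into at most n_chunks non-empty lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [paths[i::n_chunks] for i in range(n_chunks)]
    # Drop empty chunks when there are fewer files than requested chunks.
    return [chunk for chunk in chunks if chunk]


# Hypothetical input.fofn contents: one read file path per line.
fofn_lines = [f"/data/run1/cell{i}.bax.h5" for i in range(1, 8)]
for index, chunk in enumerate(split_into_chunks(fofn_lines, 3)):
    print(f"chunk {index}: {len(chunk)} files")
```

Round-robin assignment keeps chunk sizes within one file of each other, which helps parallel jobs finish at roughly the same time when the input files are similar in size.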

Anatomy of a SMRT® Make File
# SGE queue name to submit jobs to
QUEUE ?= huasm
# Size of the genome
GENOME_SIZE ?= 700000000
# Splits data into this many chunks, each chunk processed independently
CHUNK_SIZE ?= 15
# How many threads a process will use (also how many SGE slots will be requested)
NPROC ?= 32
# Local temp root directory, must have write access and a decent amount of space (~100GB)
LOCALTMP ?= /scratch
• Parameters
• Initialization
• Recipes
  – Typically run in bottom-up order

Interacting with a SMRT® Make File
# Typically just need an input.fofn file and a SMRT Make file
> ls
input.fofn hgap3.mk
# Rename a make file so you don’t have to explicitly pass it as an argument
> mv hgap3.mk Makefile
# Change the genome size
> make GENOME_SIZE=100000000
# Run 5 jobs at a time and split the data into 10 ‘chunks’
> make -j 5 CHUNK_SIZE=10
# Recover and continue, e.g., a large assembly that takes days
> make -f hgap3.mk
progress ... progress ... progress ... ERROR! (cluster dies, out of disk, etc)
> fix ... fix ... fix ...
> make -f hgap3.mk
# Save/Share your make files
> make -f my_super_cool_assembly.mk
(makes best assembly ever)
> mail -s "Check this out" [email protected] < my_super_cool_assembly.mk
# or submit a github pull request to PacBio and share with the community!

Example - Circularization
Diagram labels: HGAP, reads, assembly, Plasmids, Bacteria, Higher-organisms
circularization.mk
1) Iterates over all fasta-formatted entries
2) Gets the beginning and end of each sequence
3) Runs BLASR to detect overlap
4) Declares if entry ‘looks’ circular based on thresholds
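The four steps can be sketched in a few lines. This is not the logic of circularization.mk: as a stand-in for BLASR, which tolerates sequencing errors, the sketch uses an exact match between the end and the beginning of each sequence, and the 20-base threshold and all names are assumptions chosen for illustration.

```python
def parse_fasta(text):
    """Yield (name, sequence) for each FASTA entry in a string."""
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            fields = line[1:].split()
            name, parts = (fields[0] if fields else ""), []
        else:
            parts.append(line.upper())
    if name is not None:
        yield name, "".join(parts)


def end_overlap(seq, min_overlap):
    """Longest k >= min_overlap where the last k bases equal the first k bases."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Exact matching stands in for BLASR here; real reads need error tolerance.
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[-k:] == seq[:k]:
            return k
    return 0


def looks_circular(seq, min_overlap=20):
    """Call an entry circular when its ends overlap by at least min_overlap bases."""
    return end_overlap(seq, min_overlap) > 0


# Toy input: the first entry repeats its first 24 bases at its end.
overlap = "ACGTTGCA" * 3
fasta = (
    ">plasmid_like\n" + overlap + "G" * 40 + overlap + "\n"
    ">linear_like\n" + "A" * 30 + "C" * 30 + "\n"
)
for name, seq in parse_fasta(fasta):
    print(name, "looks circular" if looks_circular(seq) else "looks linear")
```

Capping the overlap search at half the sequence length avoids reporting a sequence as overlapping with itself in full.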

