Slide 1

Slide 1 text

Comparing and Assessing Quality of Assemblies

FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

Learning Objectives

Scientists and Bioinformaticians:
• Involved in de novo assembly projects

After the training, you will:
• Have knowledge of standard assembly metrics
• Understand the limitations of standard assembly metrics
• Know what additional advanced tools are available

• SMRT® Technology
• PacBio® RS Workflow

Slide 3

Slide 3 text

Basic Assembly Metrics

• Commonly used metrics include:
  – Number of contigs
  – N50: Sort the contigs from longest to shortest and walk down the list, summing lengths, until the running total reaches 50% of the total assembled sequence; N50 is the length of the contig at which that happens
  – Max contig size
• Example: for contigs of 10, 4, 1, 1, 1, and 1 bp (18 bp in total):
  – N50 = 10 bp
  – Mean contig length = 3 bp
• Limitation of these metrics:
  – They do not capture information about assembly accuracy!
  – Example: You can trivially concatenate all reads together and get one contig
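The metrics on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not a standard tool: the function name `n50` and the `summary` dictionary are illustrative, and the contig lengths are the ones from the slide's example (10, 4, 1, 1, 1, 1 bp).

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken over contigs
    sorted longest first, reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


contigs = [10, 4, 1, 1, 1, 1]  # example lengths from the slide, in bp
summary = {
    "num_contigs": len(contigs),
    "n50": n50(contigs),
    "mean": sum(contigs) / len(contigs),
    "max": max(contigs),
}
print(summary)

# Gluing everything into one contig "improves" every metric
# without making the assembly any more correct.
concatenated = [sum(contigs)]
print(len(concatenated), n50(concatenated))
```

The last two lines illustrate the limitation named on the slide: a single 18 bp contig has a better contig count and N50 than the original six contigs, yet nothing about its accuracy has been measured.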

Slide 4

Slide 4 text

Assembly Accuracy Comes in Multiple Forms

• Misassemblies: Parts of the genome that are incorrectly joined together
• Base-level errors: Result of sequencing error
• Scientific goals
  – Can you detect the genes in which you’re interested?
  – Can you see relevant structural variation?

Slide 5

Slide 5 text

Best Assembly Might Not Be One Contig…

• There might be more than one chromosome (plasmid, phage, etc.)
• Contaminants may contribute a contig (such as a cloning vector)

Slide 6

Slide 6 text

Detecting Misassemblies using Mauve

[Figure: Mauve alignment, with one row labeled "Reference" and one labeled "Assembly"]

• Red lines: Contig boundaries
• Colored blocks: Stretches of one or more contigs that align continuously to the reference ("locally collinear blocks", or LCBs)
  – Multiple blocks point to possible misassemblies (or to genuine rearrangements relative to the reference)

Aaron C.E. Darling et al. 2004. Genome Research. 14(7):1394-1403.

Slide 7

Slide 7 text

Using Nucmer

• Calculate base-level accuracy using Nucmer

  /data/assembly/lambda.fasta /home/lhon/training/assembly/celera/asm/9-terminator/asm.ctg.fasta
  NUCMER

      [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [COV R]  [COV Q]  | [TAGS]
  ==========================================================================================================
        42    48496  |    48452        1  |    48455    48452  |    99.93  |    99.90   100.00  | ref000001|lambda_NEB3011 ctg7180000000001

• Dotplots
  – Blue line: inversion?
  – Break in lines: indel?

Kurtz S et al. Genome Biol. 2004;5(2):R12. Epub 2004 Jan 30.
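The alignment row on this slide can be read programmatically. The sketch below assumes the column layout shown on the slide (start/end in reference and query, alignment lengths, percent identity, percent coverage, then the two sequence IDs); `CoordsRow` and `parse_coords_row` are illustrative names, not part of MUMmer.

```python
from typing import NamedTuple


class CoordsRow(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    ref_aln_len: int
    qry_aln_len: int
    pct_identity: float
    ref_cov: float
    qry_cov: float
    ref_id: str
    qry_id: str


def parse_coords_row(line: str) -> CoordsRow:
    # Split on the first five pipes only, because sequence IDs such as
    # "ref000001|lambda_NEB3011" can themselves contain a pipe.
    cols = [col.split() for col in line.split("|", 5)]
    (s1, e1), (s2, e2), (len1, len2), (idy,), (cov_r, cov_q), (ref_id, qry_id) = cols
    return CoordsRow(int(s1), int(e1), int(s2), int(e2), int(len1), int(len2),
                     float(idy), float(cov_r), float(cov_q), ref_id, qry_id)


# The data row from the slide.
line = ("42 48496 | 48452 1 | 48455 48452 | 99.93 | 99.90 100.00 | "
        "ref000001|lambda_NEB3011 ctg7180000000001")
row = parse_coords_row(line)

# Query coordinates that run high-to-low mean the contig aligns to the
# reverse strand of the reference.
reverse_strand = row.qry_start > row.qry_end
# Differences per 100 aligned bases implied by the identity column.
pct_difference = 100.0 - row.pct_identity

print(row.qry_id, row.pct_identity, row.ref_cov, reverse_strand)
```

Read this way, the row says that a single contig covers 99.90% of the lambda reference at 99.93% identity, aligned in reverse orientation. A reverse-strand alignment of a whole contig is not by itself a misassembly; it only reflects the arbitrary strand the assembler reported.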

Slide 8

Slide 8 text

Detecting Misassemblies by Aligning Reads to Assembly

• No reference needed: just use the assembly
• Tools
  – SMRT® Portal Resequencing and SMRT View
  – BWA-SW and IGV/Tablet
• Tips:
  – Use long reads for more accurate mapping
  – “Polish” the assembly by getting consensus from resequencing job
  – If using BWA-SW, make sure that parameters are tuned for PacBio® data

Slide 9

Slide 9 text

Coverage Dips using SMRT® Portal Resequencing
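Outside of a graphical viewer, a coverage dip can also be flagged from a per-base depth track. The sketch below is only an illustration of the idea: `find_coverage_dips`, the 0.5 threshold, and the depth values are all hypothetical, and this is not how SMRT Portal computes or displays coverage.

```python
from statistics import median


def find_coverage_dips(depth, fraction=0.5):
    """Return half-open (start, end) intervals where per-base depth
    falls below `fraction` times the median depth."""
    threshold = fraction * median(depth)
    dips = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold and start is None:
            start = pos                    # a dip begins here
        elif d >= threshold and start is not None:
            dips.append((start, pos))      # the dip ended just before pos
            start = None
    if start is not None:                  # dip runs to the end of the contig
        dips.append((start, len(depth)))
    return dips


# Made-up depth values for a 10 bp stretch of one contig.
depth = [30, 31, 29, 30, 8, 5, 7, 30, 32, 31]
print(find_coverage_dips(depth))  # [(4, 7)]
```

Regions flagged this way are candidates to inspect by eye, since a sharp drop in read coverage across a junction is one sign that two sequences may have been joined incorrectly.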

Slide 10

Slide 10 text

Exploring Dips using SMRT® View

Slide 11

Slide 11 text

Per-base Accuracy and Assembly Polishing

• Why per-base accuracy is important in assemblies:
  – SNP detection in assemblies
  – Gene prediction using open reading frames
  – Differentiating repeats using single-base differences
• How to increase per-base accuracy?
  – Assembly polishing using Quiver
  – Available now on GitHub, coming to SMRT® Analysis in early 2013

Slide 12

Slide 12 text

Quiver: A New Consensus Caller for PacBio® Data

• Takes multiple reads of a given DNA template, outputs best guess of template’s identity
• Uses a QV-aware hidden Markov model to account for sequencing errors and a greedy algorithm to find the maximum-likelihood template
• Can achieve accuracy >Q50 (i.e., >99.999%) using pure PacBio raw reads
• Same underlying algorithm currently used for CCS generation
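The Q50 figure uses the Phred scale, where a quality value QV corresponds to an error probability of 10^(-QV/10). A minimal sketch of the conversion follows; the function names and the 5 Mb genome size are illustrative assumptions, not values from the slides.

```python
import math


def qv_to_accuracy(qv):
    """Phred-scaled quality value -> per-base accuracy (fraction correct)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy):
    """Per-base accuracy (fraction correct, < 1) -> Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(genome_size_bp, qv):
    """Expected number of erroneous bases in a consensus of the given size."""
    return genome_size_bp * 10.0 ** (-qv / 10.0)


print(f"Q50 accuracy: {qv_to_accuracy(50):.5%}")
# Hypothetical 5 Mb genome, used only to put the number in context.
print(f"Expected errors in 5 Mb at Q50: {expected_errors(5_000_000, 50):.1f}")
```

Under these assumptions, Q50 works out to roughly one wrong base per 100,000, which is why the slide equates it with 99.999% accuracy.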

Slide 13

Slide 13 text

Using Quiver

https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst

Slide 14

Slide 14 text

Summary

Key Points
• Basic assembly metrics are useful, but do not capture assembly quality
• Consider also misassemblies, base-level accuracy, and your scientific goals
• Mauve and Nucmer are tools to assess assembly quality
• Quiver can be used to polish assemblies

Where to Find More Information
• Mauve: http://gel.ahabs.wisc.edu/mauve/
• Nucmer: http://mummer.sourceforge.net/
• Quiver: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst

Slide 15

Slide 15 text

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.