Comparing and Assessing Quality of Assemblies

PacBio
April 02, 2013

Transcript

  1. FIND MEANING IN COMPLEXITY
    Comparing and Assessing Quality of Assemblies
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Learning Objectives
    Scientists and Bioinformaticians:
    • Involved in de novo assembly projects
    After the training, you will:
    • Have knowledge of standard assembly metrics
    • Understand the limitations of standard assembly metrics
    • Know what additional advanced tools are available
    • SMRT® Technology
    • PacBio® RS Workflow
  3. Basic Assembly Metrics
    • Commonly used metrics include:
      – Number of contigs
      – N50: the length of the contig you reach when you sort contigs from longest to shortest and walk down the list until the running total covers 50% of the total sequence
      – Mean contig length
      – Max contig size
    • Example (figure: six contigs of 10, 4, 1, 1, 1 and 1 bp):
      – N50 = 10 bp
      – Mean contig length = 3 bp
    • Limitation of these metrics:
      – They do not capture information about assembly accuracy!
      – Example: You can trivially concatenate all reads together and get one contig
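The N50 definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `n50` is hypothetical, and the contig lengths (10, 4, 1, 1, 1, 1 bp) are an assumption chosen to be consistent with the N50 of 10 bp and mean of 3 bp quoted on the slide.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp).

    Contigs are sorted from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches at least
    half of the total assembled sequence.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer form of running >= total / 2
            return length


# Assumed example contigs (not taken verbatim from the slide).
contigs = [10, 4, 1, 1, 1, 1]
example_n50 = n50(contigs)
example_mean = sum(contigs) / len(contigs)

# The limitation named on the slide: gluing everything into one contig
# maximizes N50 while saying nothing about whether the joins are correct.
concatenated = [sum(contigs)]
concatenated_n50 = n50(concatenated)

print(example_n50, example_mean, concatenated_n50)
```

Concatenating the same sequence into a single contig raises the N50 from 10 bp to 18 bp, which is why N50 alone cannot be read as a measure of assembly accuracy.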
  4. Assembly Accuracy Comes in Multiple Forms
    • Misassemblies: Parts of the genome that are incorrectly joined together
    • Base-level errors: Result of sequencing error
    • Scientific goals
      – Can you detect the genes in which you’re interested?
      – Can you see relevant structural variation?
  5. Best Assembly Might Not be One Contig…
    • There might be more than one chromosome (plasmid, phage, etc.)
    • Contaminants may contribute a contig (such as a cloning vector)
  6. Detecting Misassemblies using Mauve
    • Red lines: Contig boundaries
    • Colored blocks: Stretches of one or more contigs that align continuously to the reference ("locally collinear blocks", or LCBs)
      – Multiple blocks indicate potential misassemblies
    [Figure: Mauve alignment with two rows, labeled "Reference" and "Assembly"]
    Aaron C.E. Darling et al. 2004. Genome Research. 14(7):1394-1403.
  7. Using Nucmer
    • Calculate base-level accuracy using Nucmer

        /data/assembly/lambda.fasta /home/lhon/training/assembly/celera/asm/9-terminator/asm.ctg.fasta
        NUCMER

            [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [COV R]  [COV Q]  | [TAGS]
        ==========================================================================================================
              42    48496  |    48452        1  |    48455    48452  |    99.93  |    99.90   100.00  | ref000001|lambda_NEB3011  ctg7180000000001

    • Dotplots
      – Blue line: inversion?
      – Break in lines: indel?
    Kurtz S et al. Genome Biol. 2004;5(2):R12. Epub 2004 Jan 30.
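The alignment row shown on this slide can be pulled apart programmatically. The sketch below is an illustration only: `parse_coords_row` and its field names are hypothetical, and it assumes the pipe-delimited column layout printed on the slide ([S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [COV R] [COV Q] | [TAGS]). Other output options would produce a different layout.

```python
def parse_coords_row(row):
    """Parse one alignment row in the column layout shown on the slide.

    Only the first five '|' characters are treated as column separators,
    because sequence names in the TAGS column may contain '|' themselves
    (as in 'ref000001|lambda_NEB3011').
    """
    cols = [c.split() for c in row.split("|", 5)]
    ref_start, ref_end = (int(x) for x in cols[0])
    qry_start, qry_end = (int(x) for x in cols[1])
    len_ref, len_qry = (int(x) for x in cols[2])
    cov_ref, cov_qry = (float(x) for x in cols[4])
    ref_id, qry_id = cols[5]
    return {
        "ref_start": ref_start, "ref_end": ref_end,
        "qry_start": qry_start, "qry_end": qry_end,
        "len_ref": len_ref, "len_qry": len_qry,
        "identity": float(cols[3][0]),
        "cov_ref": cov_ref, "cov_qry": cov_qry,
        "ref_id": ref_id, "qry_id": qry_id,
        # A query start greater than its end means the contig aligns
        # to the reverse strand of the reference.
        "reverse": qry_start > qry_end,
    }


# The row from the slide, with whitespace collapsed.
row = ("42 48496 | 48452 1 | 48455 48452 | 99.93 | 99.90 100.00 | "
       "ref000001|lambda_NEB3011 ctg7180000000001")
hit = parse_coords_row(row)
print(hit["qry_id"], hit["identity"], hit["cov_ref"], hit["reverse"])
```

For the slide's example, a single contig covers 99.90% of the lambda reference at 99.93% identity, and the reversed query coordinates show that the contig was assembled in the opposite orientation to the reference.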
  8. Detecting Misassemblies by Aligning Reads to Assembly
    • No reference needed: just use the assembly
    • Tools
      – SMRT® Portal Resequencing and SMRT View
      – BWA-SW and IGV/Tablet
    • Tips:
      – Use long reads for more accurate mapping
      – “Polish” the assembly by getting consensus from a resequencing job
      – If using BWA-SW, make sure that parameters are tuned for PacBio® data
  9. Per-base Accuracy and Assembly Polishing
    • Why per-base accuracy is important in assemblies:
      – SNP detection in assemblies
      – Gene prediction using open reading frames
      – Differentiating repeats using single-base differences
    • How to increase per-base accuracy?
      – Assembly polishing using Quiver
      – Available now on GitHub, coming to SMRT® Analysis in early 2013
  10. Quiver: A New Consensus Caller for PacBio® Data
    • Takes multiple reads of a given DNA template, outputs best guess of template’s identity
    • QV-aware hidden Markov model to account for sequencing errors; a greedy algorithm to find the maximum likelihood template
    • Can achieve accuracy >Q50 (i.e. > 99.999%) using pure PacBio raw reads
    • Same underlying algorithm currently used for CCS generation
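The ">Q50 (i.e. > 99.999%)" figure follows from the standard Phred scale, where QV = -10 · log10(error rate). The short sketch below shows the conversion in both directions. The function names are hypothetical, and the 99.93% identity fed into the second call is borrowed from the Nucmer example on slide 7 purely as an illustration.

```python
import math


def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy (0..1)."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy):
    """Convert per-base accuracy (0..1) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0:
        return math.inf  # no observed errors: QV is unbounded
    return -10.0 * math.log10(error)


q50_accuracy = qv_to_accuracy(50)        # error rate of 1 in 100,000
unpolished_qv = accuracy_to_qv(0.9993)   # 99.93% identity, roughly QV 31.5

print(f"Q50 accuracy: {q50_accuracy:.5%}")
print(f"QV at 99.93% identity: {unpolished_qv:.1f}")
```

On this scale every additional 10 QV points means ten times fewer errors, so moving from roughly Q30 to Q50 is a hundredfold reduction in per-base errors.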
  11. Summary
    Key Points
    • Basic assembly metrics are useful, but do not capture assembly quality
    • Consider also misassemblies, base-level accuracy, and your scientific goals
    • Mauve and Nucmer are tools to assess assembly quality
    • Quiver can be used to polish assemblies
    Where to Find More Information
    • Mauve: http://gel.ahabs.wisc.edu/mauve/
    • Nucmer: http://mummer.sourceforge.net/
    • Quiver: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst
  12. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.