What After Getting Awesome Contigs? Challenges and Opportunities

Jason Chin
September 15, 2016

This is a lightning talk for #SMRTBFX 2016 in Gaithersburg. I discuss why contigs break and how we can use the assembly graph to get more information from an assembly than the contigs alone provide.

Here is the live demo video link: https://www.youtube.com/watch?v=oKSRzYRGwb8

Transcript

  1. For Research Use Only. Not for use in diagnostics procedures.

    © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. What After Getting Awesome Contigs? Challenges and Opportunities
  2. WHY CONTIGS BREAK? WHAT WE CAN DO ABOUT IT?

    Break points. Contigs break because of:
    • Not enough coverage
    • Repeat-induced ambiguity
  3. TUG OF WAR

    Only using the longest reads (and the longest overlaps) helps to reduce repeat-induced ambiguities, but at the cost of losing coverage. [Figure labels: coverage, longest reads]
  4. PARAMETER CHOICE CONSIDERATIONS

    • Length cutoff: coverage vs. repeat resolution
    • More coverage at shorter length may not be good
    • Lower coverage of “enough” longer reads is good
    • How much is “enough”?
    “Tuning assembler parameters (TAP)” developed by Shoudan may help: https://github.com/pb-sliang/TAP
    [Figure (just a guess): assembly contiguity vs. read length cutoff, with an RIB-limited regime and a coverage-limited regime. RIB = repeat-induced branching.]
  5. PARAMETER CHOICE CONSIDERATIONS

    • Length cutoff: coverage vs. repeat resolution
    • More coverage at shorter length may not be good
    • Lower coverage of “enough” longer reads is good
    • How much is “enough”?
    “Tuning assembler parameters (TAP)” developed by Shoudan may help: https://github.com/pb-sliang/TAP
    [Figure: N50 and assembly size vs. length cutoff (x-axis from 0 to 30,000), with RIB-limited and coverage-limited regimes marked. N50 data labels, in the order they appear: 748,163; 1,047,929; 1,293,937; 1,442,972; 1,272,783; 753,370. Assembly size data labels, in the order they appear: 2,184,510,437; 2,079,537,346; 2,029,248,251; 1,984,532,045; 1,943,822,269; 1,833,838,523.]
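
The coverage side of the length-cutoff trade-off on slides 4 and 5 can be sketched in a few lines of Python. This is only an illustration, not part of the talk or of TAP: the function name `coverage_at_cutoffs`, the toy read lengths, and the toy genome size are all made up, and the sketch says nothing about the repeat-resolution side of the trade-off.

```python
def coverage_at_cutoffs(read_lengths, genome_size, cutoffs):
    """Return {cutoff: fold coverage contributed by reads with length >= cutoff}."""
    result = {}
    for cutoff in cutoffs:
        # Only reads at or above the cutoff are kept for assembly.
        kept_bases = sum(length for length in read_lengths if length >= cutoff)
        result[cutoff] = kept_bases / genome_size
    return result


# Toy data, invented for illustration only (not from the slides).
reads = [5_000] * 40 + [10_000] * 30 + [20_000] * 15 + [30_000] * 5
genome_size = 50_000

table = coverage_at_cutoffs(reads, genome_size, [0, 10_000, 20_000, 30_000])
for cutoff, cov in table.items():
    print(f"length cutoff {cutoff:>6}: {cov:.1f}x coverage")
```

Raising the cutoff keeps only longer reads, which should help with repeats, but the retained coverage drops quickly; in this toy set it falls from 19x with no cutoff to 3x at a 30,000 bp cutoff.
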
  6. GRAPH TO CONTIGS

  7. WHY CONTIGS BREAK? WHAT WE CAN DO ABOUT IT?

    Need to understand the relationship between the break points and the underlying assembly graph data structure. Graph motifs shown: simple path, bubble, “balloon”, “lollipop”, “bridge”, “spur”, “hair ball”.
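
To make the link between break points and graph structure on slides 6 and 7 concrete, here is a minimal sketch that cuts a directed graph into maximal non-branching paths and then reports bubbles. It is a generic textbook-style procedure, not the algorithm of any particular assembler; the function names `simple_paths` and `bubbles` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def simple_paths(edges):
    """Return maximal non-branching paths of a directed graph as node lists.

    Paths start and end at nodes whose in-degree or out-degree is not 1,
    which is where a contig would break in this simplified model.
    Isolated cycles made only of non-branching nodes are not reported.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def is_internal(n):
        # Exactly one way in and one way out: the path passes straight through.
        return len(pred[n]) == 1 and len(succ[n]) == 1

    paths = []
    for n in sorted(nodes):
        if is_internal(n):
            continue
        for nxt in succ[n]:
            path = [n, nxt]
            while is_internal(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths


def bubbles(paths):
    """Group paths that share both endpoints; two or more form a bubble."""
    groups = defaultdict(list)
    for p in paths:
        groups[(p[0], p[-1])].append(p)
    return {ends: ps for ends, ps in groups.items() if len(ps) > 1}


# Toy graph (invented): a bubble between B and E, then a fork at F
# where one branch (F -> S) plays the role of a short spur.
edges = [("A", "B"), ("B", "C"), ("C", "E"), ("B", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("F", "S")]

paths = simple_paths(edges)
found = bubbles(paths)
print(paths)
print(found)
```

In this toy graph every branching node (B, E, F) ends a path, so a single underlying sequence is reported as several pieces even though the bubble between B and E is a purely local structure.
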
  8. Live Demo Check https://youtu.be/oKSRzYRGwb8

  9. 000062 ↔ 000146F, AMY

    Figure labels: Ctg 62, Ctg 146, CNV region of AMY genes, haplotype difference
  10. 000305 ↔ 000074, NXF2B, LONG INVERTED REPEAT

    Figure labels: Ctg 305, Ctg 74, Ctg 305 (INV), unique region
  11. MIS-ASSEMBLY Ctg 33 Ctg 120 Mis-assembly point

  12. CONTIGUITY: JOIN CONTIGS ACROSS COMPLICATED BUT LOCAL REPEATS

    • We find 91 junctions that can be joined in the NA19240 assembly
    • Boost in contiguity: from N50 24,239,162 / max 81,042,162 to N50 28,125,580 / max 109,042,162
    • What is the theoretical contig N50?
    • Take the current GRCh38 and break the reference on seg-dups > 50 kb
    • The N50 of the unbroken segments is 30,332,297 bp
    • Our N50 is close to the theoretical one (30.3 Mbp)
    • Largest 5 continuous non-seg-dup segments in GRCh38:
      132.529 Mb chr8:12,609,996-145,138,636
      109.831 Mb chr2:132,362,523-242,193,529 *
      109.453 Mb chr6:61,159,830-170,612,704
      108.732 Mb chr4:9,735,201-118,467,376
      104.559 Mb chr5:71,364,489-175,923,361
    * matches our longest “scaffold” (of 3 contigs)
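
The "theoretical N50" estimate on slide 12 (break the reference at large segmental duplications, then take the N50 of what remains) can be sketched as follows. This is an assumed reading of the slide, not the author's script: the helper names `n50` and `break_on_intervals`, the half-open coordinate convention, and the toy chromosome and seg-dup intervals are all hypothetical.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def break_on_intervals(chrom_length, breaks, min_break=50_000):
    """Split [0, chrom_length) at every break interval longer than min_break bp.

    breaks holds half-open (start, end) intervals; shorter ones are ignored,
    mirroring the "seg-dup > 50kb" threshold quoted on the slide.
    """
    segments = []
    cursor = 0
    for start, end in sorted(breaks):
        if end - start <= min_break:
            continue  # too short to count as a break
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        segments.append((cursor, chrom_length))
    return segments


# Toy chromosome and seg-dup intervals, invented for illustration.
chrom_length = 1_000_000
seg_dups = [(100_000, 160_000), (400_000, 420_000), (700_000, 800_000)]

segments = break_on_intervals(chrom_length, seg_dups)
lengths = [end - start for start, end in segments]
print(segments)
print(n50(lengths))
```

The 20 kb interval is below the threshold and does not break the sequence, so the toy chromosome splits into three segments. For a whole genome the same procedure would be run per chromosome and the N50 taken over all resulting segments.
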
  13. CHALLENGES & OPPORTUNITIES

    - Using the assembly graph to give a “quality value” indicating uncertainties or errors at a given point of the contigs
    - Are there systematic patterns of local repeats?
    - Graph complexity measurement
    - Combining different data types on an assembly graph before contigs
    - Diploid assembly
    - Haplotype-specific scaffolding
    - Visualization tools for “debugging” genome assembly
  14. For Research Use Only. Not for use in diagnostics procedures.

    © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. www.pacb.com