String Graph Assembly For Diploid Genomes With Long Reads

Jason Chin
November 08, 2013

This is from the talk I gave at the CSHL Genome Informatics Meeting 2013. It demonstrates a string-graph-based algorithm for assembling diploid genomes from PacBio(R) single-molecule reads longer than 10 kb.


  1. String Graph Assembly For Diploid Genomes With Long Reads

    Jason Chin, Paul Peluso & David Rank, Pacific Biosciences. Cold Spring Harbor Laboratory, Genome Informatics 2013.

    FIND MEANING IN COMPLEXITY. © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. String Graph

    •  String graph:
        –  A graph structure that models a genome
    •  Nodes:
        –  Particular positions in the genome (typically corresponding to the beginnings or endings of the read fragments)
    •  Edges:
        –  The sequence between the vertices
    •  Any string spelled out by a path is a possible assembly from the reads

    [Figure: a genome covered by reads R1, R2 and R3, and the corresponding string graph with edges e1 to e5]
  3. From Read Overlap to a String Graph

    For each overlap, two edges are constructed. Example, for overlapped reads f and g:
    •  Add f:B, g:B, f:E, g:E as vertices
    •  Add edges f:E → g:B and g:E → f:B

    The initial graph built from the overlaps becomes the string graph after transitive reduction.

    [Figure: overlapped reads and the new edges; overlaps among reads 1 to 4 (vertices 1:E, 2:E, 3:B, 4:E) drawn as the initial graph and as the string graph]
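The two steps on slide 3 can be sketched roughly as follows. This is a minimal illustration, not the code behind the talk: `edges_for_overlap` transcribes the slide's edge rule literally and ignores read orientation, `transitive_reduction` is a simplified one-pass version that only looks for two-step detours, and the example graph is hypothetical.

```python
from collections import defaultdict


def edges_for_overlap(f, g):
    """Two edges for one overlap, transcribed literally from the slide's rule."""
    return [(f"{f}:E", f"{g}:B"), (f"{g}:E", f"{f}:B")]


def transitive_reduction(edges):
    """Drop every edge u->w for which a two-step path u->v->w also exists."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    redundant = set()
    for u in list(succ):
        for v in succ[u]:
            for w in succ.get(v, ()):
                if w != v and w in succ[u]:
                    redundant.add((u, w))
    return {(u, v) for u, v in edges if (u, v) not in redundant}


# Hypothetical initial graph: each read end is linked to the next two read ends.
initial = [("1:E", "2:E"), ("1:E", "3:E"), ("2:E", "3:E"),
           ("2:E", "4:E"), ("3:E", "4:E")]
reduced = transitive_reduction(initial)
print(sorted(reduced))
```

In this toy input the edges 1:E → 3:E and 2:E → 4:E are implied by shorter steps, so only the chain 1:E → 2:E → 3:E → 4:E remains.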
  4. An Example

    •  E. coli simulated data
        –  10X coverage with 10000 bp reads, perfectly tiled every 1000 bp through the genome
        –  Resulting string graph: 9278 nodes, 9536 edges

    Why is the graph tangled in places? Any overlap region that falls fully in a repeat could create a “knot”.

    [Figure: reads 1 to 5 (vertices 1:E to 5:E) overlapping across a repeat]
  5. “Untangle” the Knots!!

    •  Use a simple “best overlapping logic” to “untangle” the knots.
    •  The 4E → 5E edge is better than 4E → 2E.
    •  The 1E → 3E edge is better than 1E → 2E.

    [Figure: reads 1 to 5 across a repeat, with one edge marked as the “wrong” edge; the best overlap string graph next to the desired final graph]
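A rough sketch of the best-overlap idea on slide 5 follows. The slide does not define "better", so this example assumes the best edge out of a node is the one with the longest overlap; the overlap lengths and the function name `best_overlap_edges` are made up for illustration.

```python
def best_overlap_edges(overlaps):
    """Keep, for each source node, only its outgoing edge with the longest overlap.

    `overlaps` is a list of (source, target, overlap_length) tuples.
    "Longest overlap wins" is an assumption, not the talk's stated criterion.
    """
    best = {}
    for src, dst, overlap_len in overlaps:
        if src not in best or overlap_len > best[src][1]:
            best[src] = (dst, overlap_len)
    return {(src, dst) for src, (dst, _) in best.items()}


# Hypothetical overlap lengths for the competing edges named on the slide.
overlaps = [("4:E", "5:E", 9000), ("4:E", "2:E", 3000),
            ("1:E", "3:E", 8000), ("1:E", "2:E", 2500)]
kept = best_overlap_edges(overlaps)
print(sorted(kept))
```

With these numbers the edges 4:E → 2:E and 1:E → 2:E are discarded, matching the preference stated on the slide.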
  6. Apply The Best Overlapping Rule On The E. Coli String Graph

    After untangling, the string graph (9278 nodes, 9280 edges) becomes two simple circles: one for the forward strand and one for the reversed strand.
  7. Simplify The String Graph

    Build “unitigs” from the non-branching parts of the string graph, then traverse the resulting unitig graph to generate contigs.

    [Figure: String Graph → Unitig Graph → graph traversal for generating contigs]
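The unitig step on slide 7 amounts to compressing maximal non-branching paths. The sketch below is one common way to do that and is not taken from the talk: a node counts as "simple" when it has exactly one incoming and one outgoing edge, and the example graph is hypothetical.

```python
from collections import defaultdict


def build_unitigs(edges):
    """Compress maximal non-branching paths of a directed graph into node lists."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    def is_simple(node):
        return indeg[node] == 1 and len(succ[node]) == 1

    unitigs, visited = [], set()
    # Start a unitig at every edge that leaves a branching or terminal node.
    for u in list(succ):
        if is_simple(u):
            continue
        for v in succ[u]:
            path = [u, v]
            visited.add((u, v))
            while is_simple(v):
                nxt = succ[v][0]
                visited.add((v, nxt))
                path.append(nxt)
                v = nxt
            unitigs.append(path)
    # Any edge left over lies on an isolated cycle made only of simple nodes.
    for u, v in edges:
        if (u, v) in visited:
            continue
        path = [u, v]
        visited.add((u, v))
        while v != u:
            nxt = succ[v][0]
            visited.add((v, nxt))
            path.append(nxt)
            v = nxt
        unitigs.append(path)
    return unitigs


# Hypothetical graph: a linear stretch, a bubble (c to f), then a tail.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("d", "f"), ("e", "f"), ("f", "g")]
unitigs = build_unitigs(edges)
print(unitigs)
```

Each returned list is one unitig; consecutive unitigs share their end nodes, which become the vertices of the unitig graph.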
  8. Challenge of Diploid Assembly

    •  It may be hard to distinguish homologous regions from repeats.

    To develop a diploid assembler, we need to find ways to distinguish these different cases, which have the same local topology.

    [Figure: sequences and their string graph / overlap graph for two cases with the same local topology: a repeat (R) shared by non-homologous regions, and homologous regions (H) of haplotype 1 and haplotype 2]
  9. Variations (or Errors) Induce Simple Bubbles

    With long reads, the string graph may have a quasi-linear structure with bubbles induced by variations between haplotypes. Hopefully, with longer read lengths, most differences between the two haplotypes appear as such simple bubbles.

    [Figure: reads 1 to 8 (vertices 1:E to 8:E) forming a simple bubble around an SV or error]
  10. Sequencing Plan for Generating a “Synthetic” Diploid Dataset

    •  Arabidopsis, 120 Mbp genome
    •  Two inbred strains, Ler-0 & Col-0, sequenced separately

    [Figure: workflow for Ler-0 and Col-0 with the labels Reads, Consensus Correction, Assembly, and Validation for the diploid assembly results]
  11. Example for the Input Data: Length Distribution of the Preassembled Reads For Assembly

    •  Accuracy > 99%
    •  Methods for pre-assembly consensus:
        –  S. Koren, et al., Genome Biology 2013, 14:R101
        –  C.-S. Chin, et al., Nature Methods 10, 563–569 (2013)

    [Figure: read length distribution annotated with common repeat element lengths: transposons, 45S rDNAs, retrotransposons]
  12. Diploid: Different Large and Small Scale Topological Features

    A unitig graph from Ler-0 + Col-0 data:
    •  The graph “diameter” is ~12 Mbp
    •  Mean edge size = 17.4 kbp
    •  “Bubbles” are caused by SV between homologous copies
    •  Branching points are caused by repeats
  13. The Plan

    •  Find an end-to-end path ⇒ an initial “primary contig”
    •  Build a “string bundle” along the path ⇒ the primary contig + locally “associated contigs”
    •  Break the string bundle at branching points caused by repeats ⇒ corrected primary contigs + locally associated contigs
    •  Repeat until no edge is left

    String bundle: compound paths that contain sequences from both haplotypes.
  14. From a “String Bundle” to a Primary Contig + Associated Contigs

    •  Choose a path through the string bundle to be the “primary contig”
    •  Identify the “associated contigs”
  15. Distinguish Vertices in The Bundle From Those in Branching Paths

    •  If, at a vertex u, the downstream paths from its successors v and w meet within a pre-specified radius, add v and w to the bundle.
    •  If the downstream paths from v and w do not meet within that radius, break the initial primary contig at that vertex (u’ in the figure).

    [Figure: vertices u, v, w and u’, v’, w’; legend: initial primary contig edges, associated contig edges]
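The test on slide 15 can be sketched as a bounded graph search. This is an illustration under stated assumptions, not the implementation used for the results: the radius is counted in edges (the talk may well measure it in bases), each branching vertex is assumed to have exactly two successors, and `reachable_within` and `classify_branch` are hypothetical names.

```python
from collections import deque


def reachable_within(succ, start, radius):
    """Nodes reachable from `start` in at most `radius` edges (breadth-first)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


def classify_branch(succ, u, radius):
    """Return "bundle" if the two paths leaving u rejoin within radius, else "break"."""
    v, w = succ[u]  # assumes exactly two successors at a branching vertex
    meet = reachable_within(succ, v, radius) & reachable_within(succ, w, radius)
    return "bundle" if meet else "break"


# Hypothetical graph: u opens a bubble that closes at x; u2 is a repeat-like fork.
succ = {
    "u": ["v", "w"], "v": ["x"], "w": ["x"], "x": ["y"],
    "u2": ["v2", "w2"], "v2": ["a"], "w2": ["b"],
}
print(classify_branch(succ, "u", 2))
print(classify_branch(succ, "u2", 2))
```

The radius acts as the knob that separates bubbles from repeat-induced branches: too small and long structural variants are split, too large and nearby repeats may be merged into the bundle.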
  16. Result I: Haploid Assembly of Col-0 (inbred line)

    Contig stats:
        #Seqs   512
        Max     10.2 Mbp
        Total   120 Mbp
        N50     6.2 Mbp

    [Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig. Also shown: un-used edges in the original string graph]
  17. Result II: Haploid Assembly of Ler-0 (inbred line)

    Contig stats:
        #Seqs   983
        Max     13.3 Mbp
        Total   123 Mbp
        N50     5.0 Mbp

    [Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]
  18. Result III: “Diploid Assembly”: Ler-0 + Col-0

    Primary contigs stats:
        #Seqs   1085
        Max     9.4 Mbp
        Total   127 Mbp
        N50     2.8 Mbp

    Full contigs stats:
        #Seqs   2483
        Max     12.5 Mbp (un-corrected)
        Total   177 Mbp
        N50     2.8 Mbp

    [Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]
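The result slides report N50. Under its usual definition, N50 is the contig length L such that contigs of length L or longer account for at least half of the assembled bases. A minimal sketch with made-up contig lengths (not the lengths from these assemblies):

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths in Mbp.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

Because N50 depends on the total, the primary-contig and full-contig sets on slide 18 can share an N50 even though their totals differ.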
  19. Validation I: Haplotype Structural Variants Resolved by a Simple Bubble

    [Figure: Branch 1 and Branch 2 of the bubble aligned against the Ler-0 and Col-0 haploid assemblies; the branches are labelled Ler-0 and Col-0]
  20. Validation II: Haplotype Structural Variants May Not Be Fully Resolved in A Complicated Bubble

    [Figure: Branch 1 and Branch 2 of the bubble aligned against the Ler-0 and Col-0 haploid assemblies; the branches are labelled “Ler-0 followed by Col-0” and “Col-0 followed by Ler-0”]
  21. Validation III: Identify The Structural Variations between The Homologous Copies

    Align the insertion/deletion elements to the “haploid assembly” of Ler-0 or Col-0.

    [Figure: size distribution of the structural variations between the haplotypes; alignment identity to the Ler-0 assembly versus alignment identity to the Col-0 assembly, for insertions in associated contigs and insertions in primary contigs, with clusters marked as unique elements in Ler-0 and unique elements in Col-0]
  22. Summary & Next Phase for Building a Full Diploid Assembler

    •  With enough long-read data, starting the assembly from reads > 10 kb reveals the diploid structure as quasi-linear chains (string bundles) in the string graph.
    •  Successfully assembled “diploid”-like long-read data with heuristics: N50 > 5 Mb for haploid and N50 > 2.5 Mb for diploid.
    •  Next:
        –  Diploid / polyploid graph traversal problem: from heuristics to a more rigorous theoretical framework
        –  Generate diploid consensus: need an efficient aligner that can align long reads to the string graph directly
        –  Phasing: combining SV discoveries and SNP calling to “unzip” the bubbles
        –  “Serialize the graph”: FASTG as output?
        –  More test cases:
            −  Real biological diploid genomes
            −  Other diploid genomes might have different structure
    •  Datasets: just search “PacBio Dataset”
  23. Acknowledgements

    •  Everyone in Pacific Biosciences. It is truly a team effort to bring useful data to the community.
    •  Col-0 DNA sample:
        –  Joe Ecker and Chongyun Lou (HHMI & Salk Institute)
    •  For several important things I learned about assembly tools/algorithms through social networks:
        –  Michael Schatz (@mike_schatz, CSHL), Adam Phillippy (@aphillippy, NBACC)
    •  Blasr, a long read aligner:
        –  Mark Chaisson (University of Washington, Eichler’s lab)