Jason Chin
November 08, 2013

# String Graph Assembly For Diploid Genomes With Long Reads

This is from the talk I gave at the CSHL Genome Informatics Meeting 2013. It demonstrates a string-graph-based algorithm for assembling diploid genomes from PacBio(R) single-molecule reads longer than 10 kb.

## Transcript

1. ### String Graph Assembly For Diploid Genomes With Long Reads

Jason Chin, Paul Peluso & David Rank, Pacific Biosciences

Cold Spring Harbor Laboratory, Genome Informatics 2013

© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
2. ### String Graph

•  String graph:
  –  A graph structure that models a genome
•  Nodes:
  –  Particular positions in the genome (typically corresponding to the beginnings or ends of the read fragments)
•  Edges:
  –  The sequence between the vertices
•  Any string spelled out by a path is a possible assembly from the reads

[Figure: reads R1, R2, and R3 laid out along the genome, and the corresponding string graph with edges e1 to e5]
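To make "a path spells out a possible assembly" concrete, here is a minimal sketch. The vertex names (`v0` to `v3`) and the edge sequences are made up for illustration and are not from the talk.

```python
def spell_path(edge_seq, path):
    """Concatenate the edge labels along a path given as a list of vertices."""
    return "".join(edge_seq[(u, v)] for u, v in zip(path, path[1:]))


# Hypothetical edge labels: each edge carries the sequence between two vertices.
edge_seq = {
    ("v0", "v1"): "ACGT",
    ("v1", "v2"): "TTGA",
    ("v2", "v3"): "CC",
}

print(spell_path(edge_seq, ["v0", "v1", "v2", "v3"]))
```

Every path through the graph yields one candidate string; the assembler's job is to choose paths that correspond to the real genome.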
3. ### From Read Overlap to a String Graph

For each overlap, two edges are constructed. Example, for overlapped reads f and g:

•  Add f:B, g:B, f:E, g:E as vertices
•  Add edges f:E → g:B and g:E → f:B

Overlaps → initial graph → (transitive reduction) → string graph

[Figure: overlapped reads f and g with the new edges, and an example with vertices 1:E, 2:E, 3:B, 4:E shown as overlaps, as the initial graph, and as the string graph]
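The transitive reduction step can be sketched as follows. This is a naive version that assumes the graph is acyclic; it is not the linear-time reduction used in production string graph assemblers, and the function name and example edges are illustrative only.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop every edge u -> w that is implied by a longer path u -> ... -> w.

    Naive sketch: assumes the input is a directed acyclic graph.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable_without_direct_edge(u, w):
        # Depth-first search from u's other successors; the direct edge
        # u -> w is skipped, so reaching w means a longer path exists.
        stack = [v for v in succ[u] if v != w]
        seen = set(stack)
        while stack:
            x = stack.pop()
            if x == w:
                return True
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return sorted(
        (u, w) for u, w in edges if not reachable_without_direct_edge(u, w)
    )


# Illustrative overlaps: read 1 overlaps reads 2 and 3, read 2 overlaps 3 and 4.
initial = [
    ("1:E", "2:E"), ("2:E", "3:E"), ("1:E", "3:E"),
    ("3:E", "4:E"), ("2:E", "4:E"),
]
print(transitive_reduction(initial))
```

The edges 1:E → 3:E and 2:E → 4:E carry no extra information once the shorter hops are present, so they are removed.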
4. ### An Example

•  E. coli simulated data
  –  10X coverage of 10,000 bp reads, perfectly tiled every 1,000 bp through the genome
•  Resulting string graph: 9278 nodes, 9536 edges

Why is the graph tangled? Any overlap region that falls entirely within a repeat can create a “knot”.

[Figure: the E. coli string graph, and reads 1 to 5 (vertices 1:E to 5:E) spanning a repeat]
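The node count follows from the tiling. The sketch below assumes a circular genome of 4,639,675 bp (the slide does not state the genome length; this is the commonly cited length of E. coli K-12 MG1655) and two vertices per read (a begin and an end vertex).

```python
def tiling_stats(genome_len, read_len, step):
    """Read count, coverage, and vertex count for a perfect tiling of a circle."""
    n_reads = genome_len // step           # one read starts every `step` bp
    coverage = n_reads * read_len / genome_len
    n_nodes = 2 * n_reads                  # assumed: one B and one E vertex per read
    return n_reads, coverage, n_nodes


# Assumed genome length; not given in the talk.
n_reads, coverage, n_nodes = tiling_stats(4_639_675, 10_000, 1_000)
print(n_reads, round(coverage, 2), n_nodes)
```

Under these assumptions the vertex count matches the 9278 nodes reported on the slide.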
5. ### “Untangle” the Knots!!

•  Use a simple “best overlapping logic” to “untangle” the knots.
  –  The 4:E → 5:E edge is better than 4:E → 2:E.
  –  The 1:E → 3:E edge is better than 1:E → 2:E.

[Figure: reads 1 to 5 across a repeat, shown as the original string graph, the best overlap string graph, and the desired final graph; one edge is marked as the “wrong” edge]
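A minimal sketch of a best-overlap filter is shown below. It assumes "best" means the largest overlap length and only looks at outgoing edges; the talk does not define the scoring, and the overlap lengths are invented for illustration.

```python
def best_overlap_edges(edges):
    """Keep, for each source vertex, only the outgoing edge with the largest score.

    edges: iterable of (src, dst, overlap_len). The score definition is an
    assumption for this sketch.
    """
    best = {}
    for src, dst, score in edges:
        if src not in best or score > best[src][1]:
            best[src] = (dst, score)
    return {src: dst for src, (dst, _score) in best.items()}


# Hypothetical overlap lengths for the edges named on the slide.
edges = [
    ("1:E", "2:E", 3000), ("1:E", "3:E", 8000),
    ("4:E", "2:E", 2500), ("4:E", "5:E", 9000),
]
print(best_overlap_edges(edges))
```

With these scores the filter keeps 1:E → 3:E and 4:E → 5:E and drops the two edges into 2:E, which is the choice described on the slide.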
6. ### Apply The Best Overlapping Rule On The E. coli String Graph

After untangling, the string graph becomes two simple circles (one forward strand and one reversed strand): 9278 nodes, 9280 edges.

[Figure: the untangled graph, with the forward-strand and reversed-strand circles]
7. ### Simplify The String Graph

Build “unitigs” from the non-branching parts of a string graph.

[Figure: string graph → unitig graph → graph traversal for generating contigs]
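Collapsing non-branching paths into unitigs can be sketched like this. It is an illustrative implementation, not the one used for the talk; it ignores isolated cycles that contain no branch point, and the example graph is made up.

```python
from collections import defaultdict


def build_unitigs(edges):
    """Return maximal non-branching paths (as vertex lists) of a directed graph."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def is_simple(n):
        # Exactly one way in and one way out: interior of a unitig.
        return len(pred[n]) == 1 and len(succ[n]) == 1

    unitigs = []
    for u in sorted(nodes):
        if is_simple(u):
            continue  # unitigs start only at sources, sinks, or branch points
        for v in succ[u]:
            path = [u, v]
            while is_simple(path[-1]):
                path.append(succ[path[-1]][0])
            unitigs.append(path)
    return unitigs


# Made-up graph: a linear stretch, a bubble between c and f, then a tail.
edges = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
    ("d", "f"), ("e", "f"), ("f", "g"),
]
for unitig in build_unitigs(edges):
    print(" -> ".join(unitig))
```

Each returned path becomes a single edge in the unitig graph, which is much smaller than the string graph and easier to traverse.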
8. ### Challenge of Diploid Assembly

•  It may be hard to distinguish homologous regions from repeats.
•  To develop a diploid assembler, we need to find ways to distinguish these different cases, which have the same local topology.

[Figure: sequences and string graph / overlap graph for a repeat (R) between non-homologous regions and for a homologous region (H) shared by haplotype 1 and haplotype 2; both show the same local topology]
9. ### Variations (or Errors) Induce Simple Bubbles

With long reads, the string graph may have a quasi-linear structure with bubbles induced by variations between haplotypes. Hopefully, with longer read lengths, most differences between the two haplotypes will appear as such simple bubbles.

[Figure: reads 1 to 8 (vertices 1:E to 8:E) forming a simple bubble around an SV or error]
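A simple bubble can be detected with a sketch like the one below. It assumes a simple bubble is a vertex with exactly two outgoing edges whose non-branching chains merge again at one vertex; this definition and the example edges are assumptions, not taken from the talk.

```python
from collections import defaultdict


def find_simple_bubbles(edges):
    """Return (source, sink, branch_a, branch_b) for each simple bubble."""
    succ, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    def walk(n):
        # Follow a chain while each vertex has one predecessor and one successor.
        path = [n]
        while indeg[path[-1]] == 1 and len(succ[path[-1]]) == 1:
            path.append(succ[path[-1]][0])
        return path

    bubbles = []
    for s in sorted(succ):
        if len(succ[s]) != 2:
            continue
        a, b = (walk(v) for v in succ[s])
        # Both chains must end at the same vertex, and nothing else may enter it.
        if a[-1] == b[-1] and indeg[a[-1]] == 2:
            bubbles.append((s, a[-1], a[:-1], b[:-1]))
    return bubbles


# Illustrative chain of reads 1..8 with two alternative paths between 3 and 6.
edges = [
    ("1:E", "2:E"), ("2:E", "3:E"),
    ("3:E", "4:E"), ("4:E", "6:E"),   # one haplotype
    ("3:E", "5:E"), ("5:E", "6:E"),   # the other haplotype (or an error)
    ("6:E", "7:E"), ("7:E", "8:E"),
]
print(find_simple_bubbles(edges))
```

Anything that does not reconverge this cleanly (nested or overlapping bubbles, repeat-induced branches) is left for more careful handling.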
10. ### Sequencing Plan for Generating a “Synthetic” Diploid Dataset

•  Arabidopsis, 120 Mbp genome
•  Two inbred strains, Ler-0 and Col-0, sequenced separately

[Figure: workflow for the Ler-0 and Col-0 data, with stages labeled reads, consensus correction, assembly, and validation for the diploid assembly results]
11. ### Example for the Input Data: Length Distribution of the Preassembled Reads For Assembly

•  Accuracy (Acc.) > 99%
•  Methods for pre-assembly consensus:
  –  S. Koren, et al., Genome Biology 2013, 14:R101
  –  C.-S. Chin, et al., Nature Methods 10, 563–569 (2013)

[Figure: length distribution of the preassembled reads, annotated with common repeat element lengths: transposons, 45S rDNAs, retrotransposons]
12. ### Diploid: Different Large and Small Scale Topological Features

A unitig graph from Ler-0 + Col-0 data:

•  The graph “diameter” is ~12 Mbp
•  Mean edge size = 17.4 kbp
•  “Bubbles” are caused by SVs between homologous copies
•  Branching points are caused by repeats
13. ### The Plan

•  Find an end-to-end path ⇒ an initial “primary contig”
•  Build a “string bundle” along the path ⇒ the primary contig + locally “associated contigs”
•  Break the string bundle at branching points caused by repeats ⇒ corrected primary contigs + locally associated contigs
•  Repeat until no edges are left

String bundle: compound paths that contain sequences from both haplotypes.
14. ### From a “String Bundle” to a Primary Contig + Associated Contigs

•  Start from a string bundle
•  Choose a path to be the “primary contig”
•  Identify “associated contigs”
15. ### Distinguish Vertices in The Bundle From Those in Branching Paths

•  If, at vertex u, the downstream paths from v and w meet within a pre-specified radius, add v and w to the bundle.
•  If, at vertex u, the downstream paths from v and w do not meet within that radius, break the initial primary contig at u’.

[Figure: vertices u, v, w and u’, v’, w’, with initial primary contig edges and associated contig edges marked]
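The "do the downstream paths meet within a radius" test can be sketched as two bounded breadth-first searches. Measuring the radius in number of edges is an assumption made here for simplicity (the talk does not say whether it is edges or base pairs), and the function names are hypothetical.

```python
def reachable_within(succ, start, radius):
    """Vertices reachable from `start` using at most `radius` edges."""
    seen = {start}
    frontier = [start]
    for _ in range(radius):
        next_frontier = []
        for n in frontier:
            for m in succ.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    next_frontier.append(m)
        frontier = next_frontier
    return seen


def paths_meet(succ, v, w, radius):
    """True if the downstream neighborhoods of v and w share a vertex."""
    return bool(reachable_within(succ, v, radius) & reachable_within(succ, w, radius))


# Bubble: both branches rejoin at x, so v and w belong to the bundle.
bubble = {"u": ["v", "w"], "v": ["x"], "w": ["x"]}
# Repeat-like branch: the two paths never rejoin, so the contig is broken.
branch = {"u": ["v", "w"], "v": ["a"], "a": ["b"], "w": ["c"], "c": ["d"]}

print(paths_meet(bubble, "v", "w", radius=3))
print(paths_meet(branch, "v", "w", radius=3))
```

The radius trades sensitivity for safety: a larger value tolerates longer structural variants inside a bundle but raises the chance of merging paths that only meet again through a repeat.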
16. ### Result I: Haploid Assembly of Col-0 (inbred line)

| Contig stat | Value |
|---|---|
| #Seqs | 512 |
| Max | 10.2 Mbp |
| Total | 120 Mbp |
| n50 | 6.2 Mbp |

[Figure: assembled contigs against the TAIR10 Col-0 reference; from tick mark to tick mark is a contig. Also shown: un-used edges in the original string graph]
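For reference, the n50 statistic in these tables can be computed as below, using the standard definition: the largest length L such that contigs of length at least L cover at least half of the total assembly. The example lengths are made up.

```python
def n50(lengths):
    """Length of the contig at which the running sum, taken from the longest
    contig downward, first reaches half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Made-up contig lengths in bp.
print(n50([2, 3, 4, 5, 6]))
```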
17. ### Result II: Haploid Assembly of Ler-0 (inbred line)

| Contig stat | Value |
|---|---|
| #Seqs | 983 |
| Max | 13.3 Mbp |
| Total | 123 Mbp |
| n50 | 5.0 Mbp |

[Figure: assembled contigs against the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]
18. ### Result III: “Diploid Assembly”: Ler-0 + Col-0

| Stat | Primary contigs | Full contigs |
|---|---|---|
| #Seqs | 1085 | 2483 |
| Max | 9.4 Mbp | 12.5 Mbp (un-corrected) |
| Total | 127 Mbp | 177 Mbp |
| n50 | 2.8 Mbp | 2.8 Mbp |

[Figure: assembled contigs against the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]
19. ### Validation I: Haplotype Structure Variants Resolved by a Simple Bubble

[Figure: branch 1 and branch 2 of a bubble compared against the Ler-0 and Col-0 haploid assemblies; the branches are labeled Ler-0 and Col-0]
20. ### Validation II: Haplotype Structure Variants May Not Be Fully Resolved in A Complicated Bubble

[Figure: branch 1 and branch 2 of a bubble compared against the Ler-0 and Col-0 haploid assemblies; the branches are labeled “Ler-0 followed by Col-0” and “Col-0 followed by Ler-0”]
21. ### Validation III: Identify The Structure Variations between The Homologous Copies

•  Align the insertion/deletion elements to the “haploid assembly” of Ler-0 or Col-0

[Figure: size distribution of the structure variations between the haplotypes, for insertions in associated contigs and insertions in primary contigs; alignment identity to the Ler-0 assembly versus alignment identity to the Col-0 assembly, with unique elements in Ler-0 and unique elements in Col-0 marked]
22. ### Summary & Next Phase for Building a Full Diploid Assembler

•  With enough long-read data, starting the assembly from reads > 10 kb reveals the diploid structure as quasi-linear chains (string bundles) in the string graph.
•  Successfully assembled “diploid”-like long-read data with heuristics: N50 > 5 Mb for haploid and N50 > 2.5 Mb for diploid.
•  Next:
  –  Diploid / polyploid graph traversal problem: from heuristics to a more rigorous theoretical framework
  –  Generate diploid consensus: need an efficient aligner that can align long reads to the string graph directly
  –  Phasing: combine SV discovery and SNP calling to “unzip” the bubbles
  –  “Serialize the graph”: FASTG as output?
  –  More test cases:
    −  Real biological diploid genomes
    −  Other diploid genomes might have different structure
•  Datasets: https://github.com/PacificBiosciences/DevNet/wiki/Datasets (or just search “PacBio Dataset”)
23. ### Acknowledgements

•  Everyone at Pacific Biosciences. It is truly a team effort to bring useful data to the community.
•  Col-0 DNA sample
  –  Joe Ecker and Chongyun Lou (HHMI & Salk Institute)
•  For several important things I learned about assembly tools/algorithms through social networks:
  –  Michael Schatz (@mike_schatz, CSHL), Adam Phillippy (@aphillippy, NBACC)
•  Blasr, a long read aligner:
  –  Mark Chaisson (University of Washington, Eichler’s lab)