String Graph Assembly For Diploid Genomes With Long Reads

Jason Chin, Paul Peluso & David Rank, Pacific Biosciences
Cold Spring Harbor Laboratory, Genome Informatics 2013
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

String Graph
•  String graph:
   –  A graph structure that models a genome
•  Nodes:
   –  Particular positions in the genome (typically corresponding to the beginnings or endings of the read fragments)
•  Edges:
   –  The sequence between the vertices
•  Any string spelled out by a path is a possible assembly of the reads
[Figure: a genome covered by reads R1, R2 and R3, and the corresponding string graph with edges e1 to e5]

From Read Overlap to a String Graph
•  For each overlap, two edges are constructed. Example for overlapped reads f and g:
   –  Add f:B, g:B, f:E, g:E as vertices
   –  Add edges f:E → g:B and g:E → f:B
•  Overlaps → initial graph → transitive reduction → string graph
[Figure: overlapped reads and the new edges among f:B, f:E, g:B, g:E; an example over nodes 1:E, 2:E, 3:B, 4:E shown as overlaps, initial graph, and string graph]
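The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes an end-to-end orientation convention (f:E → g:E on one strand and g:B → f:B on the other) so that consecutive overlaps chain into paths like the 1:E, 2:E labels in the figures. The function names, the overhang values, and the simple two-step transitive reduction are all assumptions of the sketch.

```python
from collections import defaultdict


def build_string_graph(overlaps):
    """Build the initial graph from overlaps.

    overlaps: iterable of (f, g, f_overhang, g_overhang), meaning the end of
    read f overlaps the beginning of read g. g_overhang is the number of bases
    g extends past the end of f; f_overhang is the number of bases of f that
    precede the start of g. (Hypothetical input format.)
    """
    graph = defaultdict(dict)
    for f, g, f_overhang, g_overhang in overlaps:
        graph[f"{f}:E"][f"{g}:E"] = g_overhang  # one strand
        graph[f"{g}:B"][f"{f}:B"] = f_overhang  # the opposite strand
    return graph


def transitive_reduction(graph):
    """Drop every edge u -> w that is implied by a two-step path u -> v -> w."""
    reduced = {u: dict(nbrs) for u, nbrs in graph.items()}
    for u, nbrs in graph.items():
        for v in nbrs:
            for w in graph.get(v, {}):
                reduced[u].pop(w, None)
    return reduced


# Four reads tiled 1000 bp apart; each read overlaps the next two.
overlaps = [
    (1, 2, 1000, 1000), (2, 3, 1000, 1000), (3, 4, 1000, 1000),
    (1, 3, 2000, 2000), (2, 4, 2000, 2000),
]
initial = build_string_graph(overlaps)
string_graph = transitive_reduction(initial)
for node in sorted(string_graph):
    print(node, "->", sorted(string_graph[node]))
```

In this toy case the five overlaps give ten edges in the initial graph, and the reduction leaves one simple chain per strand.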

An Example
•  E. coli simulated data
   –  10X coverage with 10,000 bp reads, tiled perfectly every 1,000 bp through the genome
•  Resulting string graph: 9278 nodes, 9536 edges
•  Why is the graph tangled in places? Any overlap region that falls entirely within a repeat can create a “knot”.
[Figure: reads 1 to 5 crossing a repeat, and the resulting knot among nodes 1:E to 5:E]
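The node count on this slide can be checked with a little arithmetic. The slide does not name the strain or the genome length, so the sketch below assumes the E. coli K-12 MG1655 reference length of 4,639,675 bp and one read starting every 1,000 bp, with two nodes (B and E) per read.

```python
# Assumed genome length (E. coli K-12 MG1655); not stated on the slide.
genome_length = 4_639_675
read_length = 10_000   # from the slide
step = 1_000           # from the slide: one read every 1,000 bp

n_reads = genome_length // step                    # reads in a perfect tiling
coverage = n_reads * read_length / genome_length   # average depth
n_nodes = 2 * n_reads                              # one :B and one :E node per read

print(n_reads, round(coverage, 2), n_nodes)
```

Under that assumption the tiling gives roughly 10X coverage and 9278 nodes, which matches the figure reported on the slide.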

“Untangle” the Knots!!
•  Using a simple “best overlapping logic” to “untangle” the knots.
•  The 4:E → 5:E edge is better than 4:E → 2:E.
•  The 1:E → 3:E edge is better than 1:E → 2:E (the “wrong” edge).
[Figure: reads 1 to 5 across a repeat; the best overlap string graph; the desired final graph]
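One way to read the “best overlapping logic” is as a per-node filter that keeps only the best outgoing edge. The slides do not say how “best” is scored, so the sketch below assumes it means the longest overlap. The function name and the overlap lengths in the example are made up for illustration.

```python
def best_out_edges(edges):
    """Keep, for every source node, only the outgoing edge with the longest overlap.

    edges: iterable of (src, dst, overlap_len). Scoring by overlap length is an
    assumption of this sketch.
    """
    best = {}
    for src, dst, overlap_len in edges:
        if src not in best or overlap_len > best[src][1]:
            best[src] = (dst, overlap_len)
    return {src: dst for src, (dst, _) in best.items()}


# Toy knot: the edges into 2:E come from short overlaps inside the repeat.
edges = [
    ("1:E", "2:E", 3000),
    ("1:E", "3:E", 8000),
    ("4:E", "2:E", 2500),
    ("4:E", "5:E", 9000),
]
print(best_out_edges(edges))
```

With these made-up lengths the filter keeps 1:E → 3:E and 4:E → 5:E and drops both edges into 2:E, mirroring the two comparisons on the slide.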

Apply The Best Overlapping Rule On The E. coli String Graph
•  After untangling: 9278 nodes, 9280 edges
•  The string graph becomes two simple circles (one forward strand and one reversed strand)

Simplify The String Graph
•  Building “unitigs” from the non-branching parts of a string graph.
[Figure: string graph, unitig graph, and graph traversal for generating contigs]
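Unitig construction amounts to collapsing every maximal non-branching path into a single edge. The sketch below shows one plain way to do that on a directed edge list. It is an illustration under simple assumptions (unlabeled edges, node names as strings), not the implementation behind the slides.

```python
from collections import defaultdict


def build_unitigs(edges):
    """Return maximal non-branching paths (as node lists) of a directed graph."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def is_simple(n):
        # exactly one way in and one way out: an interior node of a unitig
        return len(pred[n]) == 1 and len(succ[n]) == 1

    unitigs, visited = [], set()
    # Paths that start at a branching node or at a graph end.
    for start in sorted(nodes):
        if is_simple(start):
            continue
        for nxt in succ[start]:
            path = [start, nxt]
            while is_simple(path[-1]):
                visited.add(path[-1])
                path.append(succ[path[-1]][0])
            unitigs.append(path)
    # Isolated circles made only of simple nodes.
    for start in sorted(nodes):
        if is_simple(start) and start not in visited:
            path, cur = [start], succ[start][0]
            visited.add(start)
            while cur != start:
                visited.add(cur)
                path.append(cur)
                cur = succ[cur][0]
            path.append(start)
            unitigs.append(path)
    return unitigs


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("d", "f"), ("e", "f"), ("f", "g")]
for unitig in build_unitigs(edges):
    print(" -> ".join(unitig))
```

The second loop is there because a fully untangled circular genome, like the two circles in the E. coli example, has no branching node to start from.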

Challenge of Diploid Assembly
•  It may be hard to distinguish homologous regions from repeats
•  To develop a diploid assembler, we need to find ways to distinguish these different cases, which have the same local topology.
[Figure: sequences and their string graph / overlap graph for a repeat (R) and for homologous regions (H) of haplotype 1 and haplotype 2 flanked by non-homologous regions; both cases give the same local topology]

Variations (or Errors) Induce Simple Bubbles
•  With long reads, the string graph may have a quasi-linear structure, with bubbles induced by variations between haplotypes.
•  Hopefully, with longer read lengths, most differences between the two haplotypes will appear as such simple bubbles.
[Figure: reads 1 to 8 spanning an SV or error, and the string graph over nodes 1:E to 8:E with a bubble at that site]
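A simple bubble of this kind can be found by looking for a node with two outgoing edges whose branches are non-branching and rejoin at the same node. The sketch below is one hedged way to express that. The function name, the `max_steps` guard, and the toy graph (modeled loosely on the 1:E to 8:E figure) are assumptions for illustration.

```python
from collections import Counter


def find_simple_bubbles(succ, max_steps=50):
    """Return (source, sink, branch_1, branch_2) for every simple bubble.

    succ: dict mapping a node to the list of its successors.
    max_steps: hypothetical cap on how far each branch is followed.
    """
    pred_count = Counter(v for outs in succ.values() for v in outs)
    bubbles = []
    for u, outs in succ.items():
        if len(outs) != 2:
            continue
        branches = []
        for v in outs:
            path, steps = [v], 0
            # follow the branch while it stays non-branching
            while (len(succ.get(path[-1], [])) == 1
                   and pred_count[path[-1]] == 1
                   and steps < max_steps):
                path.append(succ[path[-1]][0])
                steps += 1
            branches.append(path)
        if branches[0][-1] == branches[1][-1]:
            sink = branches[0][-1]
            bubbles.append((u, sink, branches[0][:-1], branches[1][:-1]))
    return bubbles


succ = {
    "1:E": ["2:E"], "2:E": ["3:E"], "3:E": ["4:E", "5:E"],
    "4:E": ["6:E"], "5:E": ["6:E"], "6:E": ["7:E"], "7:E": ["8:E"],
}
print(find_simple_bubbles(succ))
```

Bubbles with nested branching are deliberately not reported here; those correspond to the more complicated cases that a real diploid assembler has to handle separately.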

Sequencing Plan for Generating a “Synthetic” Diploid Dataset
•  Arabidopsis, 120 Mbp genome
•  Two inbred strains, Ler-0 & Col-0, sequenced separately
[Figure: workflow for Ler-0 and Col-0 with the stages reads, consensus correction, assembly, and validation for the diploid assembly results]

Example for the Input Data: Length Distribution of the Preassembled Reads For Assembly
•  Accuracy > 99%
•  Methods for pre-assembly consensus: S. Koren, et al., Genome Biology 2013, 14:R101; C.-S. Chin, et al., Nature Methods 10, 563–569 (2013)
[Figure: read length distribution annotated with common repeat element lengths: transposons, 45S rDNAs, retrotransposons]

Diploid: Different Large and Small Scale Topological Features
•  A unitig graph from Ler-0 + Col-0 data
•  The graph “diameter” ~ 12 Mbp
•  Mean edge size = 17.4 kbp
•  “Bubbles” caused by SVs between homologous copies
•  Branching points caused by repeats

The Plan
•  Find an end-to-end path ⇒ an initial “primary contig”
•  Build a “string bundle” along the path ⇒ the primary contig + locally “associated contigs”
•  Break the string bundle at branching points caused by repeats ⇒ corrected primary contigs + locally associated contigs
•  Repeat until no edges are left
String bundle: compound paths that contain sequences from both haplotypes.

From a “String Bundle” to a Primary Contig + Associated Contigs
•  Choose a path to be the “primary contig”
•  Identify “associated contigs”
[Figure: a string bundle]

Distinguish Vertices in The Bundle From Those in Branching Paths
•  If, at vertex u, the downstream paths from v and w meet within a pre-specified radius, add v and w to the bundle.
•  If, at vertex u, the downstream paths from v and w do not meet within that radius, break the initial primary contig at u’.
[Figure: vertices u, v, w and u’, v’, w’; legend: initial primary contig edges, associated contig edges]
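The radius test above can be sketched as two bounded graph searches whose reachable sets are then intersected. The slides do not say how the radius is measured, so the sketch assumes edge lengths in bases and a radius in the same unit. The function names and the toy graphs are hypothetical.

```python
import heapq


def reachable_within(graph, start, radius):
    """Map each node reachable from start within `radius` to its distance.

    graph: dict node -> {successor: edge_length}. Lengths in bp are an assumption.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, length in graph.get(node, {}).items():
            nd = d + length
            if nd <= radius and nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def paths_meet(graph, v, w, radius):
    """True if the downstream paths from v and w share a node within radius."""
    from_v = reachable_within(graph, v, radius)
    from_w = reachable_within(graph, w, radius)
    return bool(from_v.keys() & from_w.keys())


# A bubble: both branches rejoin at x, so v and w belong to the bundle.
bubble = {"u": {"v": 5000, "w": 5000}, "v": {"x": 8000},
          "w": {"x": 9000}, "x": {"y": 12000}}
# A repeat-induced branch: the two paths never rejoin, so break the contig.
branch = {"u": {"a": 5000, "b": 5000}, "a": {"c": 20000}, "b": {"d": 20000}}

print(paths_meet(bubble, "v", "w", radius=10_000))
print(paths_meet(branch, "a", "b", radius=10_000))
```

The choice of radius is the main tuning knob: too small and long structural variants are mistaken for repeat branches, too large and nearby repeat branches may be absorbed into the bundle.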

Result I: Haploid Assembly of Col-0 (inbred line)
Contig Stats
#Seqs   512
Max     10.2 Mbp
Total   120 Mbp
N50     6.2 Mbp
[Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig. Also shown: unused edges in the original string graph]

Result II: Haploid Assembly of Ler-0 (inbred line)
Contig Stats
#Seqs   983
Max     13.3 Mbp
Total   123 Mbp
N50     5.0 Mbp
[Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]

Result III: “Diploid Assembly”: Ler-0 + Col-0
Stats    Primary Contigs   Full Contigs
#Seqs    1085              2483
Max      9.4 Mbp           12.5 Mbp (un-corrected)
Total    127 Mbp           177 Mbp
N50      2.8 Mbp           2.8 Mbp
[Figure: assembled contigs aligned to the TAIR10 Col-0 reference; from tick mark to tick mark is a contig]
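For readers unfamiliar with the N50 figures in these result slides: N50 is the contig length at which, going from the longest contig down, the running total first reaches half of the total assembly size. A minimal sketch, with made-up contig lengths rather than the actual assembly data:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy lengths (in Mbp): total is 20, and 6 + 5 = 11 passes the halfway mark.
print(n50([2, 3, 4, 5, 6]))
```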

Validation I: Haplotype Structural Variants Resolved by a Simple Bubble
[Figure: branch 1 and branch 2 of a bubble aligned to the Ler-0 and Col-0 haploid assemblies; one branch is labeled Ler-0 and the other Col-0]

Validation II: Haplotype Structural Variants May Not Be Fully Resolved in a Complicated Bubble
[Figure: branch 1 and branch 2 aligned to the Ler-0 and Col-0 haploid assemblies; one branch is Ler-0 followed by Col-0, the other is Col-0 followed by Ler-0]

Validation III: Identify The Structural Variations between The Homologous Copies
•  Align the insertion/deletion elements to the “haploid assembly” of Ler-0 or Col-0
[Figure: size distribution of the structural variations between the haplotypes, shown for insertions in associated contigs and insertions in primary contigs]
[Figure: alignment identity to the Ler-0 assembly against alignment identity to the Col-0 assembly, separating elements unique to Ler-0 from elements unique to Col-0]

Summary & Next Phase for Building a Full Diploid Assembler
•  With enough long-read data, starting the assembly from reads > 10 kb reveals the diploid structure as quasi-linear chains (string bundles) in the string graph.
•  “Diploid”-like long-read data were successfully assembled with heuristics: N50 > 5 Mb for haploid and N50 > 2.5 Mb for diploid.
•  Next:
   –  Diploid / polyploid graph traversal problem: from heuristics to a more rigorous theoretical framework
   –  Generate diploid consensus: needs an efficient aligner that can align long reads to the string graph directly
   –  Phasing: combine SV discovery and SNP calling to “unzip” the bubbles
   –  “Serialize the graph”: FASTG as output?
   –  More test cases:
      −  Real biological diploid genomes
      −  Other diploid genomes might have different structures
•  Datasets: https://github.com/PacificBiosciences/DevNet/wiki/Datasets (or just search “PacBio Dataset”)

Acknowledgements
•  Everyone in Pacific Biosciences. It is truly a team effort to bring useful data to the community.
•  Col-0 DNA sample
   –  Joe Ecker and Chongyun Lou (HHMI & Salk Institute)
•  For several important things I learned about assembly tools/algorithms through social networks:
   –  Michael Schatz (@mike_schatz, CSHL), Adam Phillippy (@aphillippy, NBACC)
•  Blasr, a long read aligner:
   –  Mark Chaisson (University of Washington, Eichler’s lab)
