Lecture 14 Genome Assembly II

shaunmahony
February 28, 2022

BMMB 554 Lecture 14

Transcript

  1. Today’s learning objectives

    • Learn how to find Eulerian paths on a graph. • Understand how De Bruijn graphs can be applied to the genome assembly problem.
  2. Königsberg Bridges Graph (Euler, 1735)

    • For the Königsberg Bridge Problem, we create a graph: • Nodes = the 4 land masses of the city • Edges = the 7 bridges connecting the land masses • Problem: find a path that crosses each bridge exactly once.
  3. Eulerian Cycles

    • Cycle: a path in a graph whose first and last nodes are the same. • An Eulerian cycle is a cycle that traverses each edge exactly once. • A graph containing such a cycle is called Eulerian. • If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph. • However, no such cycle exists. • Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
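The claim that no Eulerian cycle exists for Königsberg can be checked by counting bridges per land mass. The sketch below relies on the standard criterion for undirected graphs, which the slides do not state: a connected undirected graph has an Eulerian cycle only if every node has even degree. The land-mass labels A to D are our own naming, not the lecture's.

```python
from collections import Counter

# Hypothetical labels: A = north bank, B = south bank,
# C = central island, D = eastern land mass.
bridges = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

# Count how many bridges touch each land mass (the node's degree).
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# Any odd-degree node rules out an Eulerian cycle in an undirected graph.
odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)
print(dict(degree))
print("odd-degree nodes:", odd_nodes)
```

All four land masses touch an odd number of bridges, so the degree condition fails everywhere.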
  4. Directed Graphs

    • Directed Graph: A graph in which each edge has a direction (represented by an arrow). • An Eulerian cycle in a directed graph is simply a cycle that traverses every edge exactly once, always following the edge’s direction. [Figure: an undirected graph beside a directed graph]
  5. Balanced Graphs

    • indegree(v) = the number of edges leading into node v. • outdegree(v) = the number of edges leading out of v. • A graph is balanced if indegree(v) = outdegree(v) for every node v. • Label each node v with (indegree(v), outdegree(v)). • This graph isn’t balanced since some nodes don’t have equal indegree and outdegree. [Figure: graph with node labels (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1)]
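The degree labels above are easy to compute from an edge list. This is a minimal sketch; the function names `degrees` and `is_balanced` and the toy edge lists are illustrative, not taken from the lecture.

```python
from collections import defaultdict

def degrees(edges):
    """Map each node to its (indegree, outdegree) label."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    return {n: (indeg[n], outdeg[n]) for n in nodes}

def is_balanced(edges):
    """True when indegree equals outdegree at every node."""
    return all(i == o for i, o in degrees(edges).values())

# Toy graphs (made up for illustration): a 3-cycle, then the same
# cycle with one extra edge that breaks the balance at a and c.
cycle = [("a", "b"), ("b", "c"), ("c", "a")]
unbalanced = cycle + [("a", "c")]

print(degrees(unbalanced))
print(is_balanced(cycle), is_balanced(unbalanced))
```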
  6. Euler’s Theorem (directed graphs)

    • A graph is connected if for every pair of nodes {u, v}, we can travel either from u to v or from v to u. • Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced. • Connected + Balanced = Eulerian. [Figure: example graphs, one marked “Not Connected”; node labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1)]
  7. Making an Eulerian Cycle from a Balanced Graph

    • Place an ant on an arbitrary node v of the graph and let it walk along any edges it likes. • The ant cannot walk along any edge that has been previously traversed. • The ant must always walk along edges in the legal direction. [Figure: graph with node labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (1, 1), (2, 2)]
  8. Making an Eulerian Cycle from a Balanced Graph

    • One cycle found, but it is not Eulerian yet… • Remove the cycle’s edges and any nodes that are no longer connected. [Figure: graph with node labels (2, 2), (2, 2), (1, 1), (0, 0), (1, 1), (1, 1), (0, 0)]
  9. Making an Eulerian Cycle from a Balanced Graph

    • One cycle found, but it is not Eulerian yet… • Remove the cycle’s edges and any nodes that are no longer connected. [Figure: remaining graph with node labels (2, 2), (2, 2), (1, 1), (1, 1), (1, 1)]
  10. Making an Eulerian Cycle from a Balanced Graph

    • Again, let the ant walk through the graph however it chooses. • We always start with a balanced graph, which means that the ant can never “get stuck” at a node along the way, because it will always have an edge leading out of any node that it enters. • Another cycle found, but still not Eulerian… [Figure: graph with node labels (1, 1), (1, 1), (0, 0), (0, 0), (1, 1)]
  11. Making an Eulerian Cycle from a Balanced Graph

    • Let’s trim out this cycle one more time. • The ant is stranded, so let’s move it to a node that still has unused edges. • Now there’s only one way that the ant can walk through the graph. • Last cycle found: this one is Eulerian. [Figure: graph with node labels (0, 0), (0, 0), (0, 0)]
  12. Making an Eulerian Cycle from a Balanced Graph

    • Let’s bring back our original graph. • Highlight the three cycles found. • Follow the discovered cycles… • If we hit a node with no new edges to follow, we backtrack to a node shared with another cycle, and change the edge followed. [Figure: original graph with edges numbered 1 to 8 in traversal order]
  13. Making an Eulerian Cycle from a Balanced Graph

    • Let’s bring back our original graph. • Highlight the three cycles found. • Follow the discovered cycles… • If we hit a node with no new edges to follow, we backtrack to a node shared with another cycle, and change the edge followed. [Figure: original graph with edges numbered 1 to 11 in traversal order]
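The ant walk and cycle stitching described above can be sketched in a few lines. This is one common stack-based formulation (usually attributed to Hierholzer), not necessarily the exact procedure drawn on the slides: it walks until stuck, then backtracks and splices in any unused edges it finds on the way. The function name and the toy graph are assumptions for illustration, and the input is assumed to be connected and balanced.

```python
from collections import defaultdict

def eulerian_cycle(edges, start=None):
    """Return a node list that uses every directed edge exactly once.

    Assumes the graph is connected and balanced (not checked here).
    """
    unused = defaultdict(list)            # node -> unused outgoing targets
    for u, v in edges:
        unused[u].append(v)
    if start is None:
        start = edges[0][0]
    stack, tour = [start], []
    while stack:
        node = stack[-1]
        if unused[node]:
            # The "ant" follows any unused edge out of the current node.
            stack.append(unused[node].pop())
        else:
            # Stuck: backtrack, recording the node in the final tour.
            tour.append(stack.pop())
    return tour[::-1]

# Toy balanced graph (made up): two cycles sharing node "a".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "a")]
tour = eulerian_cycle(edges)
print(" -> ".join(tour))
```

Each edge is pushed and popped once, so the running time grows linearly with the number of edges, which is why the approach scales to very large graphs.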
  14. What’s the Big Deal?

    • The great thing about this method is that it can be easily generalized to any balanced graph to give an Eulerian cycle. • “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?” • Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy… [Figure: graph with edges numbered 1 to 11]
  15. What’s the Big Deal?

    • More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer. • Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute! [Figure: graph with edges numbered 1 to 11]
  16. Graphs for fragment assembly

    Simplifying assumptions: 1. Every k-mer occurring in the genome is generated by some read. 2. Reads are error-free. 3. Every k-mer occurring in the genome occurs exactly once. 4. The underlying genome consists of a single circular-shaped chromosome.
  17. Read assembly as graphs

    • Create a node for every read of length k (k-mer). • Prefix: the first k - 1 nucleotides of a k-mer (CA for CAA). • Suffix: the last k - 1 nucleotides of a k-mer (AA for CAA). • Different 3-mers may share a prefix/suffix: ATG, TGA, CTG (TG is the suffix of ATG and CTG and the prefix of TGA). • Reads: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG [Figure: one node per read]
  18. Read assembly as graphs

    • Connect node v to node w with a directed edge if the suffix of v matches the prefix of w. • Sequence assembly is a Hamiltonian cycle in this graph… • This is actually how overlap-layout-consensus assemblers phrase the problem. • Reads: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG [Figure: overlap graph with one node per read]
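The edge rule for this read-as-node graph can be written directly from the prefix/suffix definitions. A minimal sketch using the ten 3-mer reads from the slide; `overlap_graph` is an illustrative name.

```python
from itertools import permutations

reads = ["ATG", "CGT", "GGC", "AAT", "GTG",
         "TGG", "TGC", "CAA", "GCA", "GCG"]

def overlap_graph(kmers):
    """Directed edges v -> w where the (k-1)-suffix of v equals the (k-1)-prefix of w."""
    return [(v, w) for v, w in permutations(kmers, 2) if v[1:] == w[:-1]]

edges = overlap_graph(reads)
for v, w in edges:
    print(v, "->", w)
```

Assembling the genome in this graph means visiting every node once, which is the Hamiltonian cycle formulation mentioned above; the comparison of every ordered pair of reads also makes this construction quadratic in the number of reads.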
  19. Read assembly as De Bruijn graphs

    • De Bruijn graphs contain (k-1)-mers as nodes. • Edges on the De Bruijn graph connecting node A to node B represent a k-mer prefixed by A and suffixed by B.
  20. Read assembly as De Bruijn graphs

    • Form a different graph as follows: • Create a node for each distinct prefix/suffix from reads. • Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG • Nodes: CA, GC, CG, TG, GT, GG, AT, AA
  21. Read assembly as De Bruijn graphs

    • Form a different graph as follows: • Create a node for each distinct prefix/suffix from reads. • Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. • Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG • Nodes: CA, GC, CG, TG, GT, GG, AT, AA • Edges (one per read): ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT
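The construction above translates almost word for word into code. This is a minimal sketch using the slide's ten reads; `de_bruijn` is an illustrative name, and the adjacency-list representation is one choice among several.

```python
from collections import defaultdict

reads = ["TGC", "GGC", "CGT", "CAA", "AAT",
         "GTG", "GCG", "GCA", "ATG", "TGG"]

def de_bruijn(kmers):
    """Adjacency lists: each k-mer adds one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

graph = de_bruijn(reads)
nodes = set(graph) | {w for targets in graph.values() for w in targets}

print(sorted(nodes))
for v in sorted(graph):
    print(v, "->", sorted(graph[v]))
```

Unlike the overlap graph, this takes one pass over the reads: no read is ever compared against another.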
  22. Read assembly as De Bruijn graphs

    • Eulerian cycle: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT • Genome spelled by the cycle: ATGGCGTGCA [Figure: De Bruijn graph on nodes CA, GC, CG, TG, GT, GG, AT, AA with edges numbered 1 to 10 in traversal order]
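Reading a circular genome off an Eulerian cycle amounts to taking the first nucleotide of each k-mer edge in order. A small sketch, with `spell_circular` as an illustrative name; it also checks that consecutive k-mers really overlap, including the wrap-around from the last edge back to the first.

```python
cycle = ["ATG", "TGG", "GGC", "GCG", "CGT",
         "GTG", "TGC", "GCA", "CAA", "AAT"]

def spell_circular(kmers):
    """Spell the circular sequence traced by a cyclic list of overlapping k-mers."""
    # zip pairs each k-mer with its successor; the last pairs with the first.
    for a, b in zip(kmers, kmers[1:] + kmers[:1]):
        if a[1:] != b[:-1]:
            raise ValueError(f"{a} does not overlap {b}")
    return "".join(kmer[0] for kmer in kmers)

genome = spell_circular(cycle)
print(genome)  # ATGGCGTGCA
```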
  23. Read assembly as De Bruijn graphs

    • Good News: We now only have to find an Eulerian cycle in the graph. • Bad News: 1. There may be more than one Eulerian cycle in the graph. 2. How do we know that the graph even has an Eulerian cycle? • By Euler’s Theorem, we only need to show that E is a balanced graph.
  24. Relax assumptions for real data

    Recall our assumptions: 1. Every k-mer occurring in the genome is generated by some read. 2. Reads are error-free. 3. Every k-mer occurring in the genome occurs exactly once. 4. The underlying genome consists of a single circular-shaped chromosome.
  25. Assumption 1: Generating (nearly) all k-mers

    • Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6: ATGCAA, CAAGCT, CTAGCT, CTATGC. • We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces. • e.g. break every 100-nucleotide read into 46 overlapping 55-mers and further assemble the resulting 55-mers using de Bruijn graphs.
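The splitting step can be checked on the slide's own example. The sketch below assumes the genome is circular (assumption 4, and the only way the read CTATGC can arise), so the genome's 3-mers include the two that wrap around the end; `kmers` is an illustrative helper name.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

genome = "ATGCAAGCTAGCT"
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]
k = 3

# 3-mers recovered by breaking up the reads.
read_3mers = {km for read in reads for km in kmers(read, k)}

# 3-mers of the circular genome: append the first k-1 bases to wrap around.
genome_3mers = set(kmers(genome + genome[:k - 1], k))

print(sorted(genome_3mers - read_3mers))  # 3-mers the reads missed

# A read of length L yields L - k + 1 k-mers.
print(len(kmers("A" * 100, 55)))  # 46
```

The first print shows an empty list: the four reads cover every 3-mer of the genome even though they miss 6-mers such as TGCAAG.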
  26. Assumption 2: Handling errors in reads

    • What happens to the graph when some reads have errors? • Example: Say our graph for genome ATGGCGTGCAATG should look like this.
  27. Assumption 2: Handling errors in reads

    • What happens to the graph when some reads have errors? • Example: Say our graph for genome ATGGCGTGCAATG should look like this. • An error at the end of a read will cause a tip or spur. • e.g. TGGGCGA [Figure: graph with a tip; labels CGA, GCGA] • Typical rule: remove tips of length < 2k.
  28. Assumption 2: Handling errors in reads

    • What happens to the graph when some reads have errors? • Example: Say our graph for genome ATGGCGTGCAATG should look like this. • If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead. • This is called a bulge or bubble in the graph.
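To see why a mid-read error produces a bubble, compare the node paths that the correct and the erroneous read trace through the graph. A sketch with k = 3, chosen here for illustration because the slide does not state the k used in its figure; the helper names are also assumptions.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

genome = "ATGGCGTGCAATG"
good_read = "TGGCGTG"
bad_read = "TGGAGTG"   # C -> A substitution in the middle of the read
k = 3

# Nodes are (k-1)-mers, so a read's path is its list of 2-mers.
good_path = kmers(good_read, k - 1)
bad_path = kmers(bad_read, k - 1)

# Edges (3-mers) that exist only because of the error.
extra = sorted(set(kmers(bad_read, k)) - set(kmers(genome, k)))

print("correct path:  ", " -> ".join(good_path))
print("erroneous path:", " -> ".join(bad_path))
print("spurious edges:", extra)
```

Both paths leave node GG and rejoin at node GT, so the error adds a parallel detour of three spurious edges: the bubble.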
  29. Assumption 3: Handling repeated k-mers

    • The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC. • We would obtain the graph E below and reconstruct this genome as: ACGT • In other words, we can’t represent repeated k-mers in the genome! [Figure: graph E with nodes AC, CG, GT, TA and edges TAC, ACG, CGT, GTA]
  30. Assumption 3: Handling repeated k-mers

    • Define the multiplicity of a k-mer as the number of times it occurs in a genome. • We will add edges to graph E in order to form a new graph E* for which the number of edges connecting two nodes represents the multiplicity of the k-mer on that edge. • An Eulerian cycle in E* gives a candidate genome.
  31. Assumption 3: Handling repeated k-mers

    • Say that we have the following read multiplicities: • Multiplicity 1: ATG, GGC, AAT, TGG, CAA, GCA • Multiplicity 2: GCG, CGT, GTG, TGC • We reflect multiplicities as multiple edges. • Candidate genome: ATGCGTGGCGTGCA • E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears. [Figure: graph E* on nodes CA, GC, CG, TG, GT, GG, AT, AA with edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT]
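Both claims on this slide can be verified mechanically: that E* is balanced, and that the candidate genome uses each k-mer exactly as often as its multiplicity. The sketch treats the candidate as circular, per assumption 4; the variable names are illustrative.

```python
from collections import Counter

multiplicity = {"ATG": 1, "GGC": 1, "AAT": 1, "TGG": 1, "CAA": 1, "GCA": 1,
                "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2}

# E*: one prefix -> suffix edge per copy of each k-mer.
edges = [(kmer[:-1], kmer[1:])
         for kmer, count in multiplicity.items()
         for _ in range(count)]

indeg, outdeg = Counter(), Counter()
for u, v in edges:
    outdeg[u] += 1
    indeg[v] += 1
balanced = indeg == outdeg

# Count the 3-mers of the circular candidate genome.
candidate = "ATGCGTGGCGTGCA"
k = 3
circular = candidate + candidate[:k - 1]
observed = Counter(circular[i:i + k] for i in range(len(candidate)))
matches = observed == Counter(multiplicity)

print("edges in E*:", len(edges))
print("balanced:", balanced)
print("candidate matches multiplicities:", matches)
```

Because a graph can contain more than one Eulerian cycle, other candidates with the same k-mer counts may exist; the check confirms only that this one is consistent.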
  32. Assumption 4: Linear genomes

    • Say our linear DNA segment is ATGCGTGGCGTGCA. • An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once. • An Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same node. • Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all nodes are balanced or exactly two nodes are not balanced: one with outdegree = indegree + 1 (where the path starts) and one with indegree = outdegree + 1 (where it ends). [Figure: graph on nodes CA, GC, CG, TG, GT, GG, AT with edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA]
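For a linear segment, the same stack-based walk used for cycles works once it starts at the node with one extra outgoing edge. A sketch under that degree condition; `eulerian_path` and `kmers` are illustrative names, and k = 3 is chosen to match the slide's node labels.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def eulerian_path(edges):
    """Return a node list using every directed edge once, or raise ValueError."""
    unused = defaultdict(list)
    balance = Counter()                   # outdegree minus indegree
    for u, v in edges:
        unused[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    others = [n for n, b in balance.items() if b not in (-1, 0, 1)]
    if others or len(starts) > 1 or len(starts) != len(ends):
        raise ValueError("degree condition for an Eulerian path fails")
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if unused[node]:
            stack.append(unused[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("graph is not connected")
    return path

segment = "ATGCGTGGCGTGCA"
edges = [(km[:-1], km[1:]) for km in kmers(segment, 3)]
path = eulerian_path(edges)

# Spell the linear sequence: first node, then the last base of each later node.
assembled = path[0] + "".join(node[-1] for node in path[1:])
print(assembled)
```

The assembled string has the same 3-mer content as the original segment, but it need not be identical to it, since repeated k-mers allow more than one Eulerian path.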
  33. Further reading

    • Bioinformatics & Functional Genomics, pp. 396-398 • Bioinformatics for Biologists, Pevzner & Shamir, Chapter 3 • How to apply de Bruijn graphs to genome assembly, Compeau, et al.
  34. Summary

    • Genome assembly can be thought of as an Eulerian cycle problem. • De Bruijn graphs represent k-mers as edges, so an Eulerian cycle in the graph spells a candidate genome.