Lecture 14 Genome Assembly II

L14: DE BRUIJN GRAPHS FOR SHORT READ ASSEMBLY
Foundations in Data Driven Life Sciences BMMB-554

Today’s learning objectives
• Learn how to find Eulerian paths on a graph.
• Understand how De Bruijn graphs can be applied to the genome assembly problem.

Königsberg Bridges Graph (Euler, 1735)
• For the Königsberg Bridge Problem, we create a graph:
• Nodes = 4 land masses of the city
• Edges = 7 bridges connecting land areas
Problem: Find a path that crosses each bridge exactly once.

Eulerian Cycles
• Cycle: a path in a graph whose first and last nodes are the same.
• An Eulerian cycle is a cycle that traverses each edge exactly once.
• A graph containing such a cycle is called Eulerian.
• If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph.
• However, no such cycle exists.
• Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
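The claim that no such cycle exists can be checked by counting bridge endpoints. This is a minimal sketch, assuming the classical bridge layout and hypothetical labels A to D for the four land masses. It relies on the standard fact (not stated on the slide) that a connected undirected graph has an Eulerian cycle only if every node has even degree.

```python
from collections import Counter

# Hypothetical labels: A = central island, B and C = the two river banks,
# D = the eastern land mass. Each tuple is one bridge.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

# Count how many bridge ends touch each land mass.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)
# The graph is connected, so it is Eulerian only if no node has odd degree.
has_eulerian_cycle = len(odd_nodes) == 0
print(dict(degree), odd_nodes, has_eulerian_cycle)
```

Every land mass touches an odd number of bridges, so the walk Euler was asked for cannot exist.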

Directed Graphs
• Directed Graph: A graph in which each edge has a direction (represented by an arrow).
• An Eulerian cycle in a directed graph is simply a cycle that travels down all the edges in the correct direction.
Figure: an undirected graph next to a directed graph.

Balanced Graphs
• indegree(v) = the number of edges leading into node v.
• outdegree(v) = the number of edges leading out of v.
• A graph is balanced if indegree(v) = outdegree(v) for every node v.
• Label each node v with (indegree(v), outdegree(v))
• This graph isn’t balanced since some nodes don’t have equal indegree and outdegree.
Figure: directed graph with nodes labeled (indegree, outdegree): (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1).
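The balance test translates directly into code. This is a minimal sketch, assuming the graph is given as a list of directed edges (u, v). The two toy graphs are hypothetical and are not the graph drawn on the slide.

```python
from collections import Counter

def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges (u, v)."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def is_balanced(edges):
    """True if indegree(v) == outdegree(v) for every node v."""
    indeg, outdeg = degrees(edges)
    nodes = set(indeg) | set(outdeg)
    return all(indeg[n] == outdeg[n] for n in nodes)

# Hypothetical toy graphs.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]   # every node is (1, 1)
broken = [("a", "b"), ("b", "c"), ("a", "c")]     # a is (0, 2), c is (2, 0)

print(is_balanced(triangle), is_balanced(broken))
```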

Euler’s Theorem (directed graphs)
• A graph is connected if for every pair of nodes {u, v}, we can travel either from u to v or from v to u.
Connected + Balanced = Eulerian
Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
Figure: a graph that is not connected, next to a connected graph with nodes labeled (indegree, outdegree): (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1).
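Euler’s Theorem gives a test that never has to search for the cycle itself. This is a minimal sketch, assuming an edge-list input and the hypothetical function name `has_eulerian_cycle`. Connectivity is checked while ignoring edge direction, which is sufficient here because a balanced graph that is connected in that sense can also be traversed following the arrows.

```python
from collections import Counter, defaultdict

def has_eulerian_cycle(edges):
    """True if the directed graph given by `edges` is connected and balanced."""
    if not edges:
        return False
    indeg, outdeg = Counter(), Counter()
    neighbors = defaultdict(set)  # adjacency with direction ignored
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = set(neighbors)

    # Balanced: indegree equals outdegree at every node.
    if any(indeg[n] != outdeg[n] for n in nodes):
        return False

    # Connected: a depth-first search must reach every node.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(neighbors[node] - seen)
    return seen == nodes

# Hypothetical toy graphs.
one_cycle = [("a", "b"), ("b", "c"), ("c", "a")]
two_islands = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")]
print(has_eulerian_cycle(one_cycle), has_eulerian_cycle(two_islands))
```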

Making an Eulerian Cycle from a Balanced Graph
• Place an ant on an arbitrary node v of the graph and let it walk along any edges it likes.
• The ant cannot walk along any edge that has been previously traversed.
• The ant must always walk along edges in the legal direction.
Figure: balanced graph with nodes labeled (indegree, outdegree): (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (1, 1), (2, 2).

Making an Eulerian Cycle from a Balanced Graph
• One cycle found – not Eulerian yet…
• Remove cycle edges & nodes no longer connected
Figure: remaining degrees after removing the cycle’s edges: (2, 2), (2, 2), (1, 1), (0, 0), (1, 1), (1, 1), (0, 0); after dropping the isolated nodes: (2, 2), (2, 2), (1, 1), (1, 1), (1, 1).

Making an Eulerian Cycle from a Balanced Graph
• Again, let the ant walk through the graph however it chooses.
• We always start with a balanced graph, so the ant can never “get stuck” at a node along the way: every node it enters still has an unused edge leading out, until it returns to its starting node.
• Another cycle found – but still not Eulerian…
Figure: remaining degrees after removing this cycle: (1, 1), (1, 1), (0, 0), (0, 0), (1, 1).

Making an Eulerian Cycle from a Balanced Graph
• Let’s trim out this cycle one more time.
• The ant is stranded, so let’s move it to a node that still has unused edges.
• Now there’s only one way that the ant can walk through the graph.
• Last cycle found – this one uses every remaining edge, so it is Eulerian.
Figure: remaining degrees after removing the last cycle: (0, 0), (0, 0), (0, 0).

Making an Eulerian Cycle from a Balanced Graph
• Let’s bring back our original graph.
• Highlight the three cycles found.
• Follow the discovered cycles…
• If we hit a node with no new edges to follow, we backtrack to a node shared with another cycle, and change the edge followed
Figure: the original graph with its edges numbered in the order they are traversed (1 to 11).
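The ant’s procedure (walk until stuck, then splice in further cycles at shared nodes) is usually called Hierholzer’s algorithm; that name does not appear on the slides. This is a minimal sketch of one common stack-based version, assuming the input graph is already connected and balanced. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def eulerian_cycle(edges):
    """Return an Eulerian cycle as a list of nodes.

    Assumes the directed graph given by `edges` is connected and balanced;
    this is not checked.
    """
    unused = defaultdict(list)  # node -> targets of its unused outgoing edges
    for u, v in edges:
        unused[u].append(v)

    stack = [edges[0][0]]       # the ant's current walk
    cycle = []                  # finished part of the cycle, built in reverse
    while stack:
        node = stack[-1]
        if unused[node]:
            # Keep walking along any unused edge.
            stack.append(unused[node].pop())
        else:
            # No unused edge here: back up, recording the node. Backing up is
            # what splices later cycles into earlier ones at shared nodes.
            cycle.append(stack.pop())
    return cycle[::-1]

# Hypothetical toy graph: two cycles sharing node "b".
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d"), ("d", "b")]
toy_cycle = eulerian_cycle(toy_edges)
print(toy_cycle)
```

Each edge is pushed and popped once, so the work grows linearly with the number of edges.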

What’s the Big Deal?
• The great thing about this method is that it generalizes easily to any connected, balanced graph to give an Eulerian cycle.
• “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?”
• Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…
Figure: the example graph with edges numbered 1 to 11 along the Eulerian cycle.

What’s the Big Deal?
• More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer: its running time grows linearly with the number of edges.
• Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
Figure: the example graph with edges numbered 1 to 11 along the Eulerian cycle.

Graphs for fragment assembly
Simplifying assumptions:
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular-shaped chromosome.

Read assembly as graphs
• Create a node for every read of length k (a k-mer).
• Prefix: First k – 1 nucleotides of a k-mer (e.g. CA for CAA)
• Suffix: Last k – 1 nucleotides of a k-mer (e.g. AA for CAA)
• Different 3-mers may share a prefix/suffix: ATG, TGA, CTG (TG is the suffix of ATG and CTG and the prefix of TGA)
Reads: TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG
Figure: one node per read.

Read assembly as graphs
• Connect node v to node w with a directed edge if the suffix of v matches the prefix of w.
Reads: TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG
Figure: one node per read, with edges between overlapping reads.
Sequence assembly is a Hamiltonian cycle in this graph…
This is actually how overlap-layout-consensus assemblers phrase the problem
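The overlap rule can be sketched in a few lines using the ten 3-mer reads from the slide. The helper names `prefix` and `suffix` and the variable `overlap_edges` are hypothetical, and self-overlaps are skipped as a simplifying assumption.

```python
reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]

def prefix(kmer):
    """First k - 1 nucleotides of a k-mer."""
    return kmer[:-1]

def suffix(kmer):
    """Last k - 1 nucleotides of a k-mer."""
    return kmer[1:]

# One node per read; edge v -> w when the suffix of v equals the prefix of w.
overlap_edges = [(v, w) for v in reads for w in reads
                 if v != w and suffix(v) == prefix(w)]

for v, w in overlap_edges:
    print(v, "->", w)
```

In this graph the genome corresponds to a cycle that visits every node once, which is the Hamiltonian formulation mentioned on the slide.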

Read assembly as De Bruijn graphs
• De Bruijn graphs contain (k-1)-mers as nodes.
• Edges on the De Bruijn graph connecting node A to node B represent a k-mer prefixed by A and suffixed by B.

Read assembly as De Bruijn graphs
• Form a different graph as follows:
• Create a node for each distinct prefix/suffix from reads.
• Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG
Nodes: CA GC CG TG GT GG AT AA
Edges: ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT
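The construction above can be sketched as follows, using the reads from the slide. The function name `de_bruijn` and the adjacency-list representation are assumptions made for illustration.

```python
from collections import defaultdict

reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]

def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to.

    Every k-mer contributes exactly one edge: prefix -> suffix.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

graph = de_bruijn(reads)
nodes = set(graph) | {w for targets in graph.values() for w in targets}
edge_count = sum(len(targets) for targets in graph.values())

print(sorted(nodes))
print(edge_count)
```

The reads are now edges rather than nodes, so assembling them means using every edge once instead of visiting every node once.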

Read assembly as De Bruijn graphs
• Eulerian cycle:
• ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT
Assembled genome: A T G G C G T G C A
Figure: the De Bruijn graph (nodes CA GC CG TG GT GG AT AA) with its edges numbered 1 to 10 in cycle order.
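Reading the genome off the cycle amounts to taking the first letter of each k-mer edge in order. This is a minimal sketch, assuming the cycle is given as the ordered list of 3-mers shown on the slide. The function name `spell_circular` is hypothetical.

```python
cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]

def spell_circular(kmer_cycle):
    """Spell the circular genome traced by consecutive, overlapping k-mers."""
    following = kmer_cycle[1:] + kmer_cycle[:1]  # wrap around to close the cycle
    for current, nxt in zip(kmer_cycle, following):
        if current[1:] != nxt[:-1]:
            raise ValueError(f"{current} does not overlap {nxt}")
    # Each k-mer adds exactly one new nucleotide to a circular genome.
    return "".join(kmer[0] for kmer in kmer_cycle)

genome = spell_circular(cycle)
print(genome)
```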

Read assembly as De Bruijn graphs
• Good News: We now only have to find an Eulerian cycle in the graph.
• Bad News:
1. There may be more than one Eulerian cycle in the graph.
2. How do we know that the graph even has an Eulerian cycle?
• By Euler’s Theorem, we only need to show that the De Bruijn graph E is balanced.

Relax assumptions for real data
Recall our assumptions:
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular-shaped chromosome.

Assumption 1: Generating (nearly) all k-mers
• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:
Genome: ATGCAAGCTAGCT
Reads: ATGCAA CAAGCT CTAGCT CTATGC
• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
• e.g. break every 100-nucleotide read into 46 overlapping 55-mers and further assemble the resulting 55-mers using de Bruijn graphs.
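Both claims on this slide can be checked directly: the four reads cover every 3-mer of the circular genome, and a 100-nucleotide read yields 100 − 55 + 1 = 46 overlapping 55-mers. This is a minimal sketch; the helper names `kmers` and `circular_kmers` are hypothetical.

```python
def kmers(seq, k):
    """All overlapping k-mers of a linear sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def circular_kmers(genome, k):
    """All k-mers of a circular genome, including those spanning the wrap-around."""
    return kmers(genome + genome[:k - 1], k)

genome = "ATGCAAGCTAGCT"
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]

read_3mers = {kmer for read in reads for kmer in kmers(read, 3)}
genome_3mers = set(circular_kmers(genome, 3))
genome_6mers = set(circular_kmers(genome, 6))

print(read_3mers == genome_3mers)   # the reads cover every 3-mer
print(genome_6mers - set(reads))    # 6-mers that no read captured
print(len(kmers("A" * 100, 55)))    # number of 55-mers in a 100-nt read
```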

Assumption 2: Handling errors in reads
• What happens to the graph when some reads have errors?
• Example: Say our graph for genome ATGGCGTGCAATG should look like this.
• An error at the end of a read will cause a tip or spur.
Figure: tip example with labels TGGGCGA, CGA, GCGA.
Typical rule: remove tips of length < 2k
• If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead.
• This is called a bulge or bubble in the graph.
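The bubble can be seen by comparing the node paths of the correct and the erroneous read. This is a minimal sketch, assuming k = 3 (the slide does not state the k used for this example). The helper names are hypothetical.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def node_path(seq, k):
    """The (k-1)-mer nodes visited when seq is threaded through a De Bruijn graph."""
    return [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]

k = 3  # assumed value, chosen for illustration
correct_read = "TGGCGTG"
erroneous_read = "TGGAGTG"  # a single substitution in the middle of the read

# Edges (k-mers) that exist only because of the error.
new_kmers = sorted(set(kmers(erroneous_read, k)) - set(kmers(correct_read, k)))

correct_path = node_path(correct_read, k)
error_path = node_path(erroneous_read, k)

print(new_kmers)
print(" -> ".join(correct_path))
print(" -> ".join(error_path))
```

The two paths leave node GG and rejoin at node GT, which is the bubble shape; a single substitution here introduces k new edges.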

Assumption 3: Handling repeated k-mers
• The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC.
• We would obtain the graph E below and reconstruct this genome as: ACGT
• In other words, this graph can’t represent repeated k-mers in the genome!
Figure: graph E with nodes AC, CG, GT, TA and edges TAC, ACG, CGT, GTA.

Assumption 3: Handling repeated k-mers
• Define the multiplicity of a k-mer as the number of times it occurs in a genome.
• We will add edges to graph E in order to form a new graph E* for which the number of edges connecting two nodes represents the multiplicity of the k-mer on that edge.
• An Eulerian cycle in E* gives a candidate genome.
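Multiplicity is simply a k-mer count. This is a minimal sketch for the circular genome ACGTACGT from the previous slide; the helper name `circular_kmers` is hypothetical.

```python
from collections import Counter

def circular_kmers(genome, k):
    """All k-mers of a circular genome, including those spanning the wrap-around."""
    extended = genome + genome[:k - 1]
    return [extended[i:i + k] for i in range(len(genome))]

multiplicity = Counter(circular_kmers("ACGTACGT", 3))
print(multiplicity)
```

Every 3-mer occurs twice, so a graph with one edge per distinct 3-mer spells only half of the genome.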

Assumption 3: Handling repeated k-mers
• Say that we have the following read multiplicities:
• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, GCA
• Multiplicity 2: GCG, CGT, GTG, TGC
• We reflect multiplicities as multiple edges
• Candidate genome: ATGCGTGGCGTGCA
• E* is balanced because indegree(v) and outdegree(v) still equal the # of times v appears.
Figure: graph E* with nodes CA GC CG TG GT GG AT AA and edges ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT; the multiplicity-2 edges are drawn twice.
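The balance claim and the candidate genome can both be checked from the multiplicities. This is a minimal sketch, assuming E* is represented as a list with one (prefix, suffix) pair per copy of each k-mer. The variable names are hypothetical.

```python
from collections import Counter

multiplicity = {
    "ATG": 1, "GGC": 1, "AAT": 1, "TGG": 1, "CAA": 1, "GCA": 1,
    "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2,
}

# E*: one edge prefix -> suffix for every copy of every k-mer.
edges = [(kmer[:-1], kmer[1:])
         for kmer, count in multiplicity.items()
         for _ in range(count)]

indeg = Counter(w for _, w in edges)
outdeg = Counter(v for v, _ in edges)
balanced = all(indeg[n] == outdeg[n] for n in set(indeg) | set(outdeg))

# Count the 3-mers of the candidate genome, read as a circle.
candidate = "ATGCGTGGCGTGCA"
extended = candidate + candidate[:2]
candidate_counts = Counter(extended[i:i + 3] for i in range(len(candidate)))

print(balanced, len(edges), candidate_counts == Counter(multiplicity))
```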

Assumption 4: Linear genomes
• Say our linear DNA segment is ATGCGTGGCGTGCA.
• An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once.
• An Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same node.
• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all nodes are balanced or exactly two nodes are not balanced: one with outdegree = indegree + 1 (the start of the path) and one with indegree = outdegree + 1 (the end).
Figure: graph with nodes CA GC CG TG GT GG AT and edges ATG TGG GGC GCG CGT GTG TGC GCA.
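For the linear segment on this slide, the two unbalanced nodes mark where the Eulerian path has to start and end. This is a minimal sketch, assuming k = 3 and one edge per k-mer occurrence; connectivity is assumed rather than checked. The function names are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a linear sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def imbalance(edges):
    """Map each unbalanced node to outdegree minus indegree."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    return {n: outdeg[n] - indeg[n] for n in nodes if outdeg[n] != indeg[n]}

def degree_condition_for_path(edges):
    """True if the degrees allow an Eulerian path (connectivity not checked)."""
    diff = imbalance(edges)
    return not diff or sorted(diff.values()) == [-1, 1]

segment = "ATGCGTGGCGTGCA"
edges = [(kmer[:-1], kmer[1:]) for kmer in kmers(segment, 3)]
diff = imbalance(edges)

start = [n for n, d in diff.items() if d == 1]    # one extra outgoing edge
end = [n for n, d in diff.items() if d == -1]     # one extra incoming edge
print(degree_condition_for_path(edges), start, end)
```

The start and end nodes are the first and last (k-1)-mers of the segment, as expected for a linear sequence.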

Further reading
• Bioinformatics & Functional Genomics, p. 396-398
• Bioinformatics for Biologists, Pevzner & Shamir, Chapter 3
• How to apply de Bruijn graphs to genome assembly, Compeau et al.

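Summary
• Genome assembly can be thought of as an Eulerian cycle problem.
• De Bruijn graphs make this possible: each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and an Eulerian cycle through those edges spells a candidate genome.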
