Problem, we create a graph:
• Nodes = the 4 land masses of the city
• Edges = the 7 bridges connecting the land masses
Problem: Find a path that crosses each bridge exactly once.
last nodes are the same.
• An Eulerian cycle is a cycle that traverses each edge exactly once.
• A graph containing such a cycle is called Eulerian.
• If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph.
• However, no such cycle exists.
• Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
edge has a direction (represented by an arrow).
• An Eulerian cycle in a directed graph is a cycle that traverses every edge exactly once, always following the edge's direction.
[Figure: an undirected graph and a directed graph, side by side]
Balanced Graphs
• indegree(v) = the number of edges leading into v.
• outdegree(v) = the number of edges leading out of v.
• A graph is balanced if indegree(v) = outdegree(v) for every node v.
• Label each node v with (indegree(v), outdegree(v)).
• This graph isn’t balanced, since some nodes have unequal indegree and outdegree.
[Figure: directed graph with node labels (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1)]
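The balance condition can be checked mechanically. The following is a minimal sketch, assuming the graph is given as a list of directed edges (u, v); the names `degrees` and `is_balanced` are illustrative and do not come from the slides.

```python
from collections import Counter

def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges (u, v)."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def is_balanced(edges):
    """True if indegree(v) == outdegree(v) for every node that appears in an edge."""
    indeg, outdeg = degrees(edges)
    nodes = set(indeg) | set(outdeg)
    # Counter returns 0 for nodes that never appear on one side of an edge.
    return all(indeg[n] == outdeg[n] for n in nodes)

# A directed triangle is balanced; dropping one edge breaks the balance.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
print(is_balanced(triangle))      # True
print(is_balanced(triangle[:2]))  # False
```

Isolated nodes never appear in an edge list, so this representation ignores them; that is harmless here because a node with label (0, 0) is trivially balanced.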
Connected + Balanced = Eulerian
• A directed graph is connected if, for every pair of nodes {u, v}, we can travel either from u to v or from v to u.
• Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
[Figure: a graph marked "Not Connected", with node labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1)]
• Place an ant on an arbitrary node v of the graph and let it walk along any edges it likes.
• The ant cannot walk along any edge that it has already traversed.
• The ant must always walk along edges in the legal direction.
[Figure: node labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (1, 1), (2, 2)]
Making an Eulerian Cycle from a Balanced Graph
it chooses.
• We always start with a balanced graph, which means that the ant can never “get stuck” before it returns to its starting node: any other node that it enters still has an unused edge leading out of it.
• Another cycle found, but still not Eulerian…
[Figure: node labels (1, 1), (1, 1), (0, 0), (0, 0), (1, 1)]
• The ant is stranded, so let’s move it to a node that still has unused edges.
• Now there’s only one way that the ant can walk through the graph.
• Last cycle found: this one is Eulerian.
[Figure: node labels (0, 0), (0, 0), (0, 0)]
three cycles found.
• Follow the discovered cycles…
• If we hit a node with no new edges to follow, we backtrack to a node shared with another cycle and change the edge followed.
[Figure: traversal steps numbered 1 to 11]
method is that it can be easily generalized to any balanced graph to give an Eulerian cycle.
• “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?”
• Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…
find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer.
• Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
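The ant's walk and the splicing together of cycles can be written as one short routine, usually attributed to Hierholzer. This is a minimal sketch, not the slides' own implementation: it assumes the graph is a list of directed edges and is already connected and balanced, and the name `eulerian_cycle` is illustrative. Each edge is pushed onto the stack once and popped once, so the running time grows linearly with the number of edges.

```python
from collections import defaultdict

def eulerian_cycle(edges, start=None):
    """Return an Eulerian cycle as a list of nodes.

    Assumes the directed graph is connected and balanced; no check is done here.
    """
    unused = defaultdict(list)          # node -> targets of its unused out-edges
    for u, v in edges:
        unused[u].append(v)
    if start is None:
        start = edges[0][0]

    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if unused[node]:
            # The ant keeps walking along any unused edge.
            stack.append(unused[node].pop())
        else:
            # Stranded: backtrack, which splices finished sub-cycles together.
            cycle.append(stack.pop())
    return cycle[::-1]

# Two cycles sharing node "a" (a figure-eight).
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
cycle = eulerian_cycle(edges)
print(cycle)
```

Consecutive pairs of the returned list are the edges in traversal order, so the list has one more node than the graph has edges and begins and ends at the same node.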
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular chromosome.
k length read.
• Prefix: first k - 1 nucleotides of a k-mer (e.g. CA for CAA)
• Suffix: last k - 1 nucleotides of a k-mer (e.g. AA for CAA)
• Different 3-mers may share a prefix/suffix: ATG, TGA, CTG (TG is a suffix of ATG and CTG and a prefix of TGA)
Reads: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG
• Connect node v to node w with a directed edge if the suffix of v matches the prefix of w.
Reads: ATG, CGT, GGC, AAT, GTG, TGG, TGC, CAA, GCA, GCG
• Sequence assembly corresponds to a Hamiltonian cycle in this graph…
• This is how overlap-layout-consensus assemblers phrase the problem.
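The overlap rule can be sketched directly. This is an illustration under the assumption that every read has the same length k and that each read is a node; `overlap_graph` is a made-up name. It compares all pairs of reads, which is fine for ten reads but would not scale to real data.

```python
def overlap_graph(reads):
    """Edges (v, w) between distinct reads whose (k-1)-suffix and (k-1)-prefix match."""
    return [(v, w) for v in reads for w in reads
            if v != w and v[1:] == w[:-1]]

reads = ["ATG", "CGT", "GGC", "AAT", "GTG", "TGG", "TGC", "CAA", "GCA", "GCG"]
edges = overlap_graph(reads)

# One Hamiltonian cycle through the reads; every consecutive pair must be an edge.
tour = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
is_cycle = all((a, b) in edges for a, b in zip(tour, tour[1:] + tour[:1]))
print(is_cycle)  # True
```

Checking a given tour is easy; the difficulty lies in finding one, since no efficient general algorithm for Hamiltonian cycles is known.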
graph as follows:
• Create a node for each distinct prefix/suffix from reads.
• Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG
[Figure: graph with nodes CA, GC, CG, TG, GT, GG, AT, AA and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT]
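The construction amounts to turning each read into one edge. This is a minimal sketch assuming all reads are k-mers of the same length; the name `de_bruijn_edges` is illustrative.

```python
def de_bruijn_edges(kmers):
    """One directed edge (prefix, suffix) per k-mer; the nodes are (k-1)-mers."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]

reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
edges = de_bruijn_edges(reads)
nodes = sorted({node for edge in edges for node in edge})

print(nodes)       # the 8 distinct 2-mers
print(len(edges))  # 10, one edge per read
```

In contrast to the overlap graph, the reads are now edges rather than nodes, so visiting every read once means traversing every edge once.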
ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT
[Figure: the same graph (nodes CA, GC, CG, TG, GT, GG, AT, AA) with its edges numbered 1 to 10 in this order; the cycle spells the circular genome ATGGCGTGCA]
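Reading the genome off such a cycle takes one nucleotide from each edge. This is a small sketch assuming the k-mers are already listed in cycle order and the genome is circular; `spell_circular` is an illustrative name.

```python
def spell_circular(kmer_cycle):
    """Spell a circular genome from k-mers listed in Eulerian-cycle order."""
    following = kmer_cycle[1:] + kmer_cycle[:1]   # wraps around to the first k-mer
    for a, b in zip(kmer_cycle, following):
        if a[1:] != b[:-1]:
            raise ValueError(f"{a} is not followed by an overlapping k-mer: {b}")
    # Each k-mer contributes its first nucleotide; the rest overlaps its successor.
    return "".join(kmer[0] for kmer in kmer_cycle)

cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
print(spell_circular(cycle))  # ATGGCGTGCA
```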
now only have to find an Eulerian cycle in the graph.
• Bad News:
  1. There may be more than one Eulerian cycle in the graph.
  2. How do we know that the graph even has an Eulerian cycle?
• By Euler’s Theorem, we only need to show that E is a balanced graph.
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular chromosome.
Assumption 1: Generating (nearly) all k-mers
four reads of length 6:
Genome: ATGCAAGCTAGCT
Reads: ATGCAA, CAAGCT, CTAGCT, CTATGC
• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we obtain every 3-mer in the genome by splitting the reads into pieces.
• e.g. break every 100-nucleotide read into 46 overlapping 55-mers (100 - 55 + 1 = 46) and then assemble the resulting 55-mers using de Bruijn graphs.
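Splitting reads into shorter k-mers can be checked on the slide's own example. This is a sketch under the assumption that the genome is circular, as in the list of assumptions; the helper names `kmers` and `circular_kmers` are illustrative.

```python
def kmers(seq, k):
    """All overlapping k-mers of a linear sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def circular_kmers(genome, k):
    """All k-mers of a circular genome, including those that wrap around the end."""
    return kmers(genome + genome[:k - 1], k)

genome = "ATGCAAGCTAGCT"
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]

from_reads = {km for read in reads for km in kmers(read, 3)}
from_genome = set(circular_kmers(genome, 3))
covered = from_genome <= from_reads   # is every genomic 3-mer present in some read?
print(covered)  # True

# A read of length 100 yields 100 - 55 + 1 overlapping 55-mers.
pieces_per_read = len(kmers("A" * 100, 55))
print(pieces_per_read)  # 46
```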
the graph when some reads have errors?
• Example: Say our graph for genome ATGGCGTGCAATG should look like this.
• An error at the end of a read will cause a tip or spur.
• e.g. TGGGCGA CGA GCGA
• Typical rule: remove tips of length < 2k.
the graph when some reads have errors?
• Example: Say our graph for genome ATGGCGTGCAATG should look like this.
• If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead.
• This is called a bulge or bubble in the graph.
only four 3-mers: ACG, CGT, GTA, and TAC.
• We would obtain the graph E below and reconstruct this genome as: ACGT
• In other words, we can’t represent repeated k-mers in the genome!
[Figure: graph E with nodes AC, CG, GT, TA and edges ACG, CGT, GTA, TAC]
a k-mer as the number of times it occurs in a genome.
• We will add edges to graph E in order to form a new graph E* for which the number of edges connecting two nodes represents the multiplicity of the k-mer on that edge.
• An Eulerian cycle in E* gives a candidate genome.
the following read multiplicities:
• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, GCA
• Multiplicity 2: GCG, CGT, GTG, TGC
• We reflect multiplicities as multiple edges.
• Candidate genome: ATGCGTGGCGTGCA
• E* is balanced because indegree(v) and outdegree(v) still both equal the number of times v appears in the genome.
[Figure: graph E* with nodes CA, GC, CG, TG, GT, GG, AT, AA and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, CAA, AAT, with doubled edges for the multiplicity-2 k-mers]
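Building E* and reading a candidate genome from it can be sketched as follows. This is an illustration only: it assumes the multiplicities are known in advance, repeats a stack-based Eulerian cycle routine so that the block is self-contained, and uses made-up names throughout. Because E* can contain several Eulerian cycles, the genome it prints is one candidate and need not be the one shown on the slide.

```python
from collections import Counter, defaultdict

multiplicity = {"ATG": 1, "GGC": 1, "AAT": 1, "TGG": 1, "CAA": 1, "GCA": 1,
                "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2}

# E*: one parallel edge per copy of each k-mer.
edges = [(kmer[:-1], kmer[1:]) for kmer, m in multiplicity.items() for _ in range(m)]

def eulerian_cycle(edges, start):
    """Stack-based Eulerian cycle; assumes a connected, balanced directed graph."""
    unused = defaultdict(list)
    for u, v in edges:
        unused[u].append(v)
    stack, cycle = [start], []
    while stack:
        if unused[stack[-1]]:
            stack.append(unused[stack[-1]].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]

cycle = eulerian_cycle(edges, "AT")
# The cycle's last node repeats the first, so drop it before spelling.
genome = "".join(node[0] for node in cycle[:-1])
print(genome)

# Sanity check: the circular genome has exactly the required 3-mer multiplicities.
wrapped = genome + genome[:2]
observed = Counter(wrapped[i:i + 3] for i in range(len(genome)))
print(observed == Counter(multiplicity))  # True
```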
is ATGCGTGGCGTGCA.
• An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once.
• An Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same node.
• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all nodes are balanced, or exactly two nodes are not balanced: one with outdegree = indegree + 1 (where the path starts) and one with indegree = outdegree + 1 (where it ends).
[Figure: graph with nodes CA, GC, CG, TG, GT, GG, AT and edges ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA]
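The degree condition for an Eulerian path can be tested directly on the edge list. This is a minimal sketch that assumes the graph is connected and checks only the degrees; the function name `eulerian_path_ends` and its return convention are illustrative.

```python
from collections import Counter

def eulerian_path_ends(edges):
    """Return (start, end) if the degrees allow an Eulerian path, else None.

    Returns (None, None) when every node is balanced, since any node can then
    serve as both start and end. Connectivity is assumed, not checked.
    """
    surplus = Counter()                 # outdegree minus indegree per node
    for u, v in edges:
        surplus[u] += 1
        surplus[v] -= 1
    unbalanced = {n: d for n, d in surplus.items() if d != 0}
    if not unbalanced:
        return (None, None)
    if sorted(unbalanced.values()) == [-1, 1]:
        start = next(n for n, d in unbalanced.items() if d == 1)
        end = next(n for n, d in unbalanced.items() if d == -1)
        return (start, end)
    return None

kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
edges = [(k[:-1], k[1:]) for k in kmers]
print(eulerian_path_ends(edges))  # ('AT', 'CA')
```

For the eight 3-mers shown, the path has to start at AT and end at CA, which matches a linear sequence beginning with AT and ending with CA.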