Example: global vs. local alignment of s = CAGAG and t = GAGGC, scored with match +1, mismatch −1, gap −1.

[Dynamic-programming matrices for the global and local alignments of CAGAG and GAGGC]

The optimal global alignment is

CAGAG-
GAG-GC

with score 0, while an optimal local alignment is

GAG
GAG

with score 3. Note that the optimal global alignment fails to align the two identical subsequences. A requirement for the local alignment algorithm is that the scoring scheme must be such that random matches have an expected score of less than zero. Thus, it is very sensitive to a realistic choice of scoring scheme.
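As a concrete sketch, the local (Smith-Waterman) recurrence can be written in a few lines of Python, assuming the simple scoring used in the example (match +1, mismatch −1, gap −1); this is an illustration, not an optimized implementation.

```python
# Illustrative Smith-Waterman sketch using the scoring assumed in the
# example above (match +1, mismatch -1, gap -1).

def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-1):
    """Return the optimal local alignment score of s and t."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + sub,  # match/mismatch
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("CAGAG", "GAGGC"))  # 3 (GAG aligned to GAG)
```

Because every cell is clamped at zero, the best cell anywhere in the matrix gives the optimal local score.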
So far, every gap position has been charged the same cost, as one kind of event. In reality, indels do not occur only as single-character events: an indel of multiple positions can be more likely than several independent indels of a single position. Thus, we would like to handle gaps more realistically.

Arbitrary gap penalties
Define an arbitrary function γ(g) that specifies the cost of a gap (indel) of length g. The recurrence for finding the optimal alignment becomes:

S(s0..i, t0..j) = max of
  σ(si, tj) + S(s0..i−1, t0..j−1)
  −γ(k) + S(s0..i−k, t0..j)   for k = 1 … i
  −γ(k) + S(s0..i, t0..j−k)   for k = 1 … j

Since we have added iteration over ~n possible gap lengths to the computation of each cell, the algorithm is now O(n³).

Affine gap penalties
For specific forms of γ(g) the computation can be bounded. The most common case is the affine gap score, in which the cost of a gap depends on only two values: an initiation cost d, paid by every gap regardless of length, and an extension cost e for each additional base in the gap. In other words, γ(g) = d + (g − 1)e. This can be computed by the following recurrences:

S(s0..i, t0..j) = max of
  σ(si, tj) + S(s0..i−1, t0..j−1)
  σ(si, tj) + I(s0..i−1, t0..j−1)

I(s0..i, t0..j) = max of
  −d + S(s0..i, t0..j−1)
  −e + I(s0..i, t0..j−1)
  −d + S(s0..i−1, t0..j)
  −e + I(s0..i−1, t0..j)
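Both recurrences can be sketched in Python. The O(n³) version takes an arbitrary γ; the affine version uses a Gotoh-style formulation with separate gap states per sequence (a common, equivalent variant of a single-I recurrence). The scoring function σ, the costs d and e, and the test sequences are illustrative choices, not values from the slides.

```python
def arbitrary_gap_score(s, t, sigma, gamma):
    """O(n^3) global alignment score with an arbitrary gap cost gamma(g)."""
    n, m = len(s), len(t)
    S = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                S[i][j] = max(S[i][j], S[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]))
            for k in range(1, i + 1):   # gap of length k in t
                S[i][j] = max(S[i][j], S[i - k][j] - gamma(k))
            for k in range(1, j + 1):   # gap of length k in s
                S[i][j] = max(S[i][j], S[i][j - k] - gamma(k))
    return S[n][m]

def affine_gap_score(s, t, sigma, d, e):
    """O(n^2) global alignment where a gap of length g costs d + (g-1)*e."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    S  = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in match/mismatch
    Ix = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in a gap in t
    Iy = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends in a gap in s
    S[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i:
                Ix[i][j] = max(S[i - 1][j] - d, Ix[i - 1][j] - e)
            if j:
                Iy[i][j] = max(S[i][j - 1] - d, Iy[i][j - 1] - e)
            if i and j:
                prev = max(S[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
                S[i][j] = max(S[i][j], prev + sigma(s[i - 1], t[j - 1]))
    return max(S[n][m], Ix[n][m], Iy[n][m])

sigma = lambda a, b: 1 if a == b else -1
d, e = 2, 1
# With gamma(g) = d + (g-1)*e the two formulations agree:
print(arbitrary_gap_score("CAGAG", "GAGGC", sigma, lambda g: d + (g - 1) * e),
      affine_gap_score("CAGAG", "GAGGC", sigma, d, e))
```

The affine version does quadratic rather than cubic work because a longer gap is built one cell at a time inside the I matrices instead of being re-scanned at every cell.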
Summary
• Solved efficiently with "dynamic programming"
• Global (Needleman-Wunsch 1970)
• Local (Smith-Waterman 1981)
• Under specific models and scoring schemes
• Similarity matrix + linear or affine gap penalty (Gotoh 1982)
If gaps are very costly, then we would expect the optimal alignment to lie mostly along a diagonal.
• In banded Smith-Waterman, we ignore cells more than a certain distance from the diagonal (set them to zero). Thus, the complexity is reduced from O(nm) to roughly O(nk), where k is the width of the band
• (Similar to the microsatellite finding idea)
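A minimal banded variant might look like the sketch below, assuming a fixed band half-width `band` around the main diagonal (the parameter names and scoring values are illustrative):

```python
# Banded Smith-Waterman sketch: cells more than `band` away from the main
# diagonal are never filled in (left at zero), so the work drops from
# O(nm) to roughly O(n * band).

def banded_smith_waterman(s, t, band, match=1, mismatch=-1, gap=-1):
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        lo = max(1, i - band)
        hi = min(len(t), i + band)
        for j in range(lo, hi + 1):   # only cells with |i - j| <= band
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A wide band reproduces the full local score; a narrow band may miss
# alignments that stray far from the diagonal.
print(banded_smith_waterman("CAGAG", "GAGGC", band=5),
      banded_smith_waterman("CAGAG", "GAGGC", band=0))
```

The trade-off is exactly the one on the slide: a narrower band is faster but can miss the true optimum.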
short regions of very high similarity or identity
• Heuristic: rather than considering all possible alignments, consider only those that pass through or near short regions of high similarity
keep only those over threshold; attempt to join nearby diagonals
2) Extend (ungapped) until the score drops by a certain amount below the best seen; keep the best extension if its score is over threshold
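Step 2, the ungapped X-drop extension, might look like this sketch (extending to the right only; a real implementation also extends left from the seed, and all parameter names here are illustrative):

```python
# Ungapped X-drop extension sketch: starting from a seed at (i, j),
# accumulate match/mismatch scores along the diagonal and stop once the
# running score falls more than `xdrop` below the best seen so far.

def xdrop_extend_right(s, t, i, j, match=1, mismatch=-1, xdrop=3):
    score = best = 0
    while i < len(s) and j < len(t):
        score += match if s[i] == t[j] else mismatch
        best = max(best, score)
        if best - score > xdrop:   # score has dropped too far: stop extending
            break
        i += 1
        j += 1
    return best
```

Returning the best score seen (rather than the final one) is what lets the extension run past a few mismatches without being dragged down by them.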
along the diagonals that score over the required threshold
3) From the center of every HSP, perform a Smith-Waterman style alignment, stopping when the score drops by more than a certain amount
Filtration:
• Identify a set of initial candidate matches (seeds)
• Filter seeds using heuristics
• Extend seeds that pass filters using bounded dynamic programming
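The first step, collecting candidate seeds, can be sketched with a hash table of k-mers (exact-match seeds; the value of k and the sequences are illustrative):

```python
# Seed identification sketch: index every k-mer of s, then scan t for
# shared k-mers. Each hit is reported with its diagonal (i - j), which
# the later filtering/joining steps operate on.
from collections import defaultdict

def find_exact_seeds(s, t, k):
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    seeds = []
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], []):
            seeds.append((i, j, i - j))   # (pos in s, pos in t, diagonal)
    return seeds

print(find_exact_seeds("CAGAG", "GAGGC", 3))  # the shared word GAG
```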
extension
• The key to good performance is identifying just the right number of seeds to extend
• Seed criteria must not be so stringent as to miss many optimal alignments
• But must be stringent enough to achieve good performance
(exact seeds)
• Pairs that produce a score over T when matched (score seeds)
• Specific patterns of matches spaced with mismatches (spaced seeds)
• Fixed or spaced seeds allowing transitions at one or more positions (transition seeds)
• Arbitrary combinations
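A toy comparison of exact vs. spaced seeds: a pattern such as "1101" requires matches at the '1' positions and ignores the '0' position. The pattern and sequences below are invented for illustration.

```python
# Spaced-seed matcher sketch: '1' positions must match, '0' positions are
# ignored. An all-ones pattern degenerates to an exact seed.

def spaced_seed_hits(s, t, pattern):
    span = len(pattern)
    care = [p for p, c in enumerate(pattern) if c == "1"]
    hits = []
    for i in range(len(s) - span + 1):
        for j in range(len(t) - span + 1):
            if all(s[i + p] == t[j + p] for p in care):
                hits.append((i, j))
    return hits

# The spaced seed tolerates the single mismatch that defeats the exact seed:
print(spaced_seed_hits("GAGT", "GACT", "1101"))  # one hit
print(spaced_seed_hits("GAGT", "GACT", "1111"))  # no hits
```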
effective for particular classes of elements
• For non-coding genome alignment, transition seeds are much better than fixed spaced seeds
• Allowing transitions at any position is advantageous
• Optimal seed design is hard (Sun and Buhler, 2006)
high quality sequences with known homology
• Align with ungapped blast, +1/−1 scoring
• Discard high scoring HSPs (>70% identity)
• Infer new scoring scheme as a log-odds ratio: (Chiaromonte et al. 2002)

The score of the alignment column x-over-y is the log of an "odds ratio":

s(x, y) = log [ p(x, y) / (q1(x) q2(y)) ]

where p(x, y) is the frequency of x-over-y in the training set, expressed as a fraction of the observed aligned pairs, and q1(x) and q2(y) denote the background frequencies of nucleotides x and y as the upper and lower components (respectively).

Pacific Symposium on Biocomputing 7:115-126 (2002)
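A small sketch of the log-odds computation; the frequencies below are invented for illustration, and the scale factor is an assumption here, mimicking how published matrices rescale and round scores to integers.

```python
import math

# Log-odds score sketch: compare the observed frequency of the aligned
# pair (x, y) against what the background frequencies would predict.

def log_odds_score(p_xy, q1_x, q2_y, scale=100):
    """s(x, y) = log(p(x, y) / (q1(x) * q2(y))), scaled and rounded."""
    return round(scale * math.log(p_xy / (q1_x * q2_y)))

# A pair seen twice as often as chance predicts gets a positive score;
# a pair seen half as often gets the mirror-image negative score:
print(log_odds_score(0.125, 0.25, 0.25))    # positive
print(log_odds_score(0.03125, 0.25, 0.25))  # negative
```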
47.5% G+C (Chiaromonte et al. 2002)

        A     C     G     T
A      91  -114   -31  -123
C    -114   100  -125   -31
G     -31  -125   100  -114
T    -123   -31  -114    91

These scores can then be used in the traditional dynamic-programming algorithms.
two transitions
• Soft and dynamic masking of known repeats, and of regions that generate highly repetitive alignments
• Chaining: reduce overlapping alignments to a maximal-scoring non-overlapping set
• Interpolation: realign between best alignments with more sensitive parameters