Anton Nekrutenko
February 03, 2016
46

# BMMB554 2016 Lecture 4

## Anton Nekrutenko

February 03, 2016

## Transcript

1. Lecture 4
Alignment basics

2. Why sequence similarity?
‣ Mapping problems
‣ Database search problems
‣ Evolutionary problems
‣ Functional similarity

3. s/t dot plot

4. s/t dot plot

5. s/t dot plot (word = 4)

6. s/t dot plot (word = 4)

7. Distance (D) and Similarity (S)

8. you can turn s into t by:
‣ substituting a character in s from t
‣ delete character in s
‣ insert a character from t to s

9. general case

10. AGAG(s)versus AGG(t)

11. Distance (D) and Similarity (S)

13. DAG: directed acyclic graph
1 3
2 4
5
(i 1, j)
(i, j 1)
(i 1, j 1)
(i, j)
(si, tj
)
(i 1, j)
(i, j 1)
(i 1, j 1)
(i, j)
(si, tj
)

14. Dynamic programming:
AGAG versus AGG
ﬁlling matrix

15. Dynamic programming:
AGAG versus AGG
ﬁlling matrix

16. Dynamic programming:
AGAG versus AGG
ﬁlling matrix

17. Dynamic programming:
AGAG versus AGG
backtracking

18. Eﬀect of gap penalties

19. Global versus Local

20. C A G A G
G
A
G
G
C
0
-1
-2
-3
-4
-5
-1
-1
-2
-3
-4
-3
-2
-2
0
-1
-2
-3
-3
-1
-1
1
0
-1
-4
-2
0
0
0
-1
-5
-3
-1
1
1
0
C A G A G
G
A
G
G
C
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
0
2
1
0
0
0
2
1
1
0
0
1
1
3
2
1
l global alignment is
CAGAG-
GAG-GC
with score 0, while an optimal local alignment is
G
G
ote that the optimal global alignment fails to align the two identical subsequences.
r the local alignment algorithm is that the scoring scheme must be such that random
pected score of less than zero. us, it is very sensitive to a realistic choice of scorin
-1
-2
-3
-4
-5
-1
-2
-3
-4
-3
-2
0
-1
-2
-3
-1
-1
1
0
-1
-2
0
0
0
-1
-3
-1
1
1
0
G
A
G
G
C
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
1
0
2
1
0
ignment is
CAGAG-
GAG-GC
with score 0, while an optimal lo
e optimal global alignment fails to align the two identi
alignment algorithm is that the scoring scheme must b
G
A
G
G
C
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
1
0
2
1
0
0
2
1
1
0
1
1
3
2
1
while an optimal local alignment is
GAG
GAG
with
lign the two identical subsequences.
ring scheme must be such that random matches

21. More realistic scoring schemes

22. Arbitrary gap penalties
single position). However, this is unrealistic, particularly for one kind of event. In
o do occur as single character events in reality. Indels of multiple positions can be
of a single position. us, we would like to handle gaps more realistically.
trary Gap Penalties
ne an arbitrary function γ(g) that species the cost of a gap or indel of length
e for nding the optimal alignment becomes:
S(s0..i
, t0..j
) = max

σ(si
, tj
) + S(s0..i−1
, t0..j−1
)
−γ(k) + S(s0..i−k
, t0..j
)for k = 0 . . . i − 1
−γ(k) + S(s0..i
, t0..j−k
)for k = 0 . . . j − 1
we have added iteration over ∼ n possible gap lengths to the computation of ea
hm is now O(n3).
ne gap penalties
n specic forms of γ(g) the computation can be bounded. e common case
score, in which the score of a gap depends on only two values, an initiation c
dless of length, and an extension cost e for each additional base in the gap. In o
− (g − 1)e. is can be computed by the following recurrences:

23. Aﬃne gap penalties
. Ane gap penalties
For certain specic forms of γ(
ane gap score, in which the s
gaps regardless of length, and a
γ(g) = −d − (g − 1)e. is can b
S(so...i
,
I(so...i
,
Speciﬁc gap penalty function:
e gap penalties
n specic forms of γ(g) the computation can be bounded. e common case
score, in which the score of a gap depends on only two values, an initiation
dless of length, and an extension cost e for each additional base in the gap. In
− (g − 1)e. is can be computed by the following recurrences:
S(so...i
, t0...j
) = max
σ(si
, tj
) + S(s0...i−1
, t0...j−1
)
σ(si
, tj
) + I(s0...i−1
, t0...j−1
)
I(so...i
, t0...j
) = max

−d + S(s0...i
, t0...j−1
)
−e + I(s0...i
, t0...j−1
)
−d + S(s0...i−1
, t0...j
)
−e + I(s0...i−1
, t0...j
)

24. • Alignment as a recursive problem with “optimal
substructure”
• Solved eﬃciently with “dynamic programming”
• Global (Needleman-Wunsch 1970)
• Local (Smith-Waterman 1981)
• Under speciﬁc models and scoring schemes
• Similarity matrix + linear or aﬃne gap
penalty (Gotoh 1982)
Summary

25. Banded Smith-Waterman
• If we expect few gaps, or gaps are very costly,
then we would expect the optimal alignment lie
mostly on a diagonal
• In banded Smith-Waterman, we ignore cells more
than a certain distance from the diagonal (set
them to zero). Thus, the complexity is reduced
from O(n m) to ~O(n k) where k is the width of the
band
• (Similar to the microsatellite ﬁnding idea)

26. Banded Smith-Waterman
• Main problem, this is local alignment, the optimal
alignment may not be on the main diagonal
• How can we decide what bands to search?
• Inspiration: dotplot, use short matches to guide
search

27. (human/marmoset genomic DNA)

28. Banded Smith-Waterman
• Main problem, this is local alignment, the optimal
alignment may not be on the main diagonal
• How can we decide what bands to search?
• Inspiration: dotplot, use short matches to guide
search

29. Approximate local alignment with
word-match heuristics

30. Approximating Smith-Waterman
• Assumption: the optimal local alignment will
include short regions of very high similarity or
identity
• Heuristic: rather that considering all possible
alignments, consider only those that pass
through or near short regions of high similarity

31. Filtration algorithm concept
• Find a set of candidate partial solutions (seeds)
using a fast method
• Filter candidates using one or a series of more
expensive criteria
• Report results

32. Seeds
FASTA BLAST
For DNA: All exact
matches of length
4 to 6
For DNA: All exact
matches of length
11

33. Filtration
FASTA BLAST
1) Rank diagonals
by number of
exact matches
(roughly) and
keep only the top
10
1) Keep only
seeds that are
paired with
another seed on
the same diagonal
within a limited
distance

34. Filtration
FASTA BLAST
2) Rescore top
runs using
similarity matrix,
keep only those
over threshold.
Attempt to join
nearby diagonals
2) Extend
(ungapped) until
score drops by a
certain amount
over best seen,
keep best if score
over threshold

35. ungapped extension

36. Filtration
FASTA BLAST
3) Perform a
banded Smith-
Waterman
alignment along
the diagonals that
score over the
required
threshold
3) From the center
of every HSP,
perform a Smith-
Waterman style
alignment,
stopping when
the score drops
over a certain
amount

37. banded extension score bounded extension

38. Performing initial seed match
• Hash table:
• Data structure that provides a mapping from
keys to values (think dictionary)
• Inserting a new key/value pair and looking up
the value for a key are both O(1)

39. Performing initial seed match
target = “ACAGGAGGGAGA”
Table:
ACA → 0
CAG → 1
AGG → 2, 5
GGA → 3, 7
GAG → 4, 8
GGG → 6
AGA → 9
n - 3 3-tuples,
n - 3 inserts,
Build time O(n)

40. Performing initial seed match
query = “GGAGCCC”
Table:
ACA → 0
CAG → 1
AGG → 2, 5
GGA → 3, 7
GAG → 4, 8
GGG → 6
AGA → 9
Lookups:
GGA → 3, 7
GAG → 4, 8
AGC →
GCC →
CCC →
m - 3 3-tuples,
m - 3 lookups,
Lookup time O(m)

41. Performing initial seed match
m - 3 3-tuples,
m - 3 lookups,
Lookup time O(m)
n - 3 3-tuples,
n - 3 inserts,
Build time O(n)
To ﬁnd all matches,
build and then lookup:
Total time O(n+m)

42. Similarities between BLAST and FASTA
• Fairly similar core idea, Filtration:
• Identify a set of initial candidate matches
(seeds)
• Filter seeds using heuristics
• Extend seeds that pass ﬁlters using bounded
dynamic programming

43. Extension
• The vast majority of running time is in extension
• The key to good performance is identifying just
the right number of seeds to extend
• Seed criteria must not be so stringent as to
miss many optimal alignments
• But must be stringent enough to achieve good
performance

44. Seeding

45. Some types of seeds
• Exact matches of length k (exact seeds)
• Pairs that produce a score over T when matched
(score seeds)
• Speciﬁc patterns of matches spaced with
mismatches (spaced seeds)
• Fixed or spaced seeds allowing transitions at one
or more position (transition seeds)
• Arbitrary combinations

46. Evaluating seeds
• Seed sensitivity: probability that a biologically
meaningful alignments that contain that seed
• Seed speciﬁcity: 1 - probability of ﬁnding a seed
match in unrelated sequences

47. Balancing accuracy and performance
• Fixed seeds: hash table
• Scored seeds: neighborhood strategy + hash
table
• Spaced seeds: shifting + hashtable (non match
positions can be completely ignored)
• Transition seeds: transition insensitive encoding +
hash table

48. Seeds matter
• Diﬀerent seeds may be more or less eﬀective for
particular classes of elements
• For non-coding genome alignment, transition
seeds are much better than ﬁxed spaced seeds
• Allowing transitions at any position is
• Optimal seed design is hard
(Sun and Buhler, 2006)

49. Scoring parameters

50. Inferring a scoring matrix from data
• For proteins, estimate from hand tuned structural
alignments, using a time dependent model for
extrapolation
• No great analog for genomic alignments

51. Inferring a scoring matrix from data
• Start with two high quality sequences with
known homology
• Align with ungapped blast, +1/-1 scoring
• Discard high scoring HSPs (>70% identity)
• Infer new scoring scheme as log-odds ratio:
(Chiaromonte et al. 2002)
score of the alignment column x-over-y is the log of an “odds ratio”
s(x, y) = log p(x, y)
q1
(x)q2
(y)
e p(x, y) is the frequency of x-over-y in the training set, expressed as a f
of the observed aligned pairs, and q1
(x) and q2
(y) denote the backgro
encies of nucleotides x and y as the upper and lower components (re
Pacific Symposium on Biocomputing 7:115-126 (2002)

52. in the human genome. In all three
HOXD matrix h
47.5% G+C
T A C G T A
–117 91 –114 –31 –123 100
–20 –114 100 –125 –31 –123
–96 –31 –125 100 –114 –28
67 –123 –31 –114 91 –109
s can then be used in the tradition
(Chiaromonte et al. 2002)

53. Long genomic alignments
blastz and lastz

54. Blastz ideas
• 12of19 spaced seed (from PatternHunter) but
allowing two transitions
• Soft and dynamic masking of known repeats, and
regions that generate highly repetitive
alignments
• Chaining: reduce overlapping alignments to a
maximal scoring non-overlapping set
• Interpolation: realign between best alignments
with more sensitive parameters