Slide 1

Slide 1 text

Lecture 4 Alignment basics

Slide 2

Slide 2 text

Why sequence similarity? ‣ Mapping problems ‣ Database search problems ‣ Evolutionary problems ‣ Functional similarity

Slide 3

Slide 3 text

s/t dot plot

Slide 4

Slide 4 text

s/t dot plot

Slide 5

Slide 5 text

s/t dot plot (word = 4)

Slide 6

Slide 6 text

s/t dot plot (word = 4)

Slide 7

Slide 7 text

Distance (D) and Similarity (S)

Slide 8

Slide 8 text

You can turn s into t by: ‣ substituting a character of s with one from t ‣ deleting a character from s ‣ inserting a character from t into s
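A minimal Python sketch of the unit-cost edit distance defined by these three operations; it anticipates the dynamic programming recurrence developed on the later slides, and the function and variable names are illustrative rather than taken from the slides.

```python
def edit_distance(s, t):
    """Minimum number of substitutions, deletions and insertions
    needed to turn s into t (unit-cost edit distance)."""
    n, m = len(s), len(t)
    # D[i][j] = distance between the prefixes s[:i] and t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # substitute (free if equal)
                D[i - 1][j] + 1,                           # delete a character from s
                D[i][j - 1] + 1,                           # insert a character from t
            )
    return D[n][m]

# edit_distance("AGAG", "AGG") == 1  (one deletion), matching the example two slides on
```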

Slide 9

Slide 9 text

general case

Slide 10

Slide 10 text

AGAG (s) versus AGG (t)

Slide 11

Slide 11 text

Distance (D) and Similarity (S)

Slide 12

Slide 12 text

Adding cost

Slide 13

Slide 13 text

DAG: directed acyclic graph. Each cell (i, j) has incoming edges from (i−1, j), (i, j−1) and (i−1, j−1); the diagonal edge into (i, j) is weighted by the substitution score σ(si, tj), and an alignment corresponds to a path through this graph.

Slide 14

Slide 14 text

Dynamic programming: AGAG versus AGG filling matrix

Slide 15

Slide 15 text

Dynamic programming: AGAG versus AGG filling matrix

Slide 16

Slide 16 text

Dynamic programming: AGAG versus AGG filling matrix

Slide 17

Slide 17 text

Dynamic programming: AGAG versus AGG backtracking
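A short Python sketch of the matrix fill and backtracking just illustrated, assuming a +1/−1/−1 match/mismatch/gap scoring (the slides' exact scores may differ); names are illustrative.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment DP matrix and backtrack one optimal path."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap                      # leading gap in t
    for j in range(1, m + 1):
        S[0][j] = j * gap                      # leading gap in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch),
                S[i - 1][j] + gap,             # gap in t
                S[i][j - 1] + gap,             # gap in s
            )
    # Backtracking: walk from (n, m) to (0, 0), preferring the diagonal move.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        diag = S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch) if i > 0 and j > 0 else None
        if diag is not None and S[i][j] == diag:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return S[n][m], "".join(reversed(a)), "".join(reversed(b))

# needleman_wunsch("AGAG", "AGG") -> (2, 'AGAG', 'AG-G')  (one possible optimum)
```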

Slide 18

Slide 18 text

Effect of gap penalties

Slide 19

Slide 19 text

Global versus Local

Slide 20

Slide 20 text

Global versus local alignment of CAGAG and GAGGC (the slide shows the global and local dynamic programming matrices). The optimal global alignment is CAGAG- over GAG-GC, with score 0, while an optimal local alignment is GAG over GAG, with score 3. Note that the optimal global alignment fails to align the two identical subsequences. A requirement for the local alignment algorithm is that the scoring scheme must be such that random matches have an expected score of less than zero; thus, it is very sensitive to a realistic choice of scoring.
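A corresponding local-alignment (Smith-Waterman) sketch in Python, again assuming +1/−1/−1 scoring; only the best score is returned here, but a traceback from the best cell would recover the GAG/GAG alignment described above.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment: cells never drop below zero, and the answer is
    the best cell anywhere in the matrix rather than the corner."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                0,
                S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
            best = max(best, S[i][j])
    return best

# smith_waterman("CAGAG", "GAGGC") == 3   (the shared GAG)
```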

Slide 21

Slide 21 text

More realistic scoring schemes

Slide 22

Slide 22 text

Arbitrary gap penalties. Treating every gap position as an independent single-position event is unrealistic: indels of multiple positions do occur as single events in reality, and can be about as likely as an indel of a single position. Thus, we would like to handle gaps more realistically. Define an arbitrary function γ(g) that specifies the cost of a gap (indel) of length g. The recurrence for finding the optimal alignment becomes:

S(s0..i, t0..j) = max of:
  σ(si, tj) + S(s0..i−1, t0..j−1)
  −γ(k) + S(s0..i−k, t0..j)  for k = 1 … i
  −γ(k) + S(s0..i, t0..j−k)  for k = 1 … j

Since we have added iteration over ~n possible gap lengths to the computation of each cell, the algorithm is now O(n³).
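A direct, unoptimized Python transcription of this recurrence, with sigma and gamma passed in as functions (names are illustrative); it shows where the extra inner loops over gap lengths, and hence the O(n³) cost, come from.

```python
def general_gap_align(s, t, sigma, gamma):
    """Global alignment score with an arbitrary cost gamma(k) for a gap
    of length k; O(n^3) because every cell scans all gap lengths."""
    n, m = len(s), len(t)
    S = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0.0
    for i in range(1, n + 1):
        S[i][0] = -gamma(i)              # s[:i] aligned against a gap
    for j in range(1, m + 1):
        S[0][j] = -gamma(j)              # t[:j] aligned against a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1] + sigma(s[i - 1], t[j - 1])
            for k in range(1, i + 1):    # gap of length k in t
                best = max(best, S[i - k][j] - gamma(k))
            for k in range(1, j + 1):    # gap of length k in s
                best = max(best, S[i][j - k] - gamma(k))
            S[i][j] = best
    return S[n][m]

# Example: a linear cost gamma(k) = k recovers the simple per-position scheme.
# general_gap_align("AGAG", "AGG",
#                   sigma=lambda a, b: 1 if a == b else -1,
#                   gamma=lambda k: k)   # -> 2.0
```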

Slide 23

Slide 23 text

Affine gap penalties. For certain specific forms of γ(g) the computation can be bounded. The common case is the affine gap score, in which the score of a gap depends on only two values: an initiation cost d, charged regardless of length, and an extension cost e for each additional base in the gap, so a gap of length g contributes −d − (g − 1)e to the score. This can be computed by the following recurrences:

S(s0..i, t0..j) = max of:
  σ(si, tj) + S(s0..i−1, t0..j−1)
  σ(si, tj) + I(s0..i−1, t0..j−1)

I(s0..i, t0..j) = max of:
  −d + S(s0..i, t0..j−1)
  −e + I(s0..i, t0..j−1)
  −d + S(s0..i−1, t0..j)
  −e + I(s0..i−1, t0..j)
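A Python sketch transcribing the two-matrix S/I recurrence above (S ends in a match/mismatch, I ends in a gap in either sequence); d is the gap-open and e the gap-extend penalty, and the boundary values are an assumption since the slide does not show them.

```python
def affine_gap_align(s, t, sigma, d, e):
    """Global alignment score under affine gap penalties, following the
    S/I recurrences on this slide. O(n*m) time."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    S = [[NEG] * (m + 1) for _ in range(n + 1)]   # best score ending in a substitution
    I = [[NEG] * (m + 1) for _ in range(n + 1)]   # best score ending in a gap
    S[0][0] = 0.0
    for i in range(1, n + 1):
        I[i][0] = -(d + (i - 1) * e)              # leading gap in t
    for j in range(1, m + 1):
        I[0][j] = -(d + (j - 1) * e)              # leading gap in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sigma(s[i - 1], t[j - 1])
            S[i][j] = sub + max(S[i - 1][j - 1], I[i - 1][j - 1])
            I[i][j] = max(S[i][j - 1] - d, I[i][j - 1] - e,
                          S[i - 1][j] - d, I[i - 1][j] - e)
    return max(S[n][m], I[n][m])

# affine_gap_align("AGAG", "AGG",
#                  sigma=lambda a, b: 1 if a == b else -1, d=2, e=1)  # -> 1.0
```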

Slide 24

Slide 24 text

Summary • Alignment as a recursive problem with “optimal substructure” • Solved efficiently with “dynamic programming” • Global (Needleman-Wunsch 1970) • Local (Smith-Waterman 1981) • Under specific models and scoring schemes • Similarity matrix + linear or affine gap penalty (Gotoh 1982)

Slide 25

Slide 25 text

Banded Smith-Waterman • If we expect few gaps, or gaps are very costly, then we would expect the optimal alignment to lie mostly along a diagonal • In banded Smith-Waterman, we ignore cells more than a certain distance from the diagonal (set them to zero). Thus, the complexity is reduced from O(nm) to ~O(nk), where k is the width of the band • (Similar to the microsatellite-finding idea)
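A Python sketch of the banded idea, assuming +1/−1/−1 scoring: only cells within `band` of the main diagonal are filled, and everything else keeps the zero it was initialized to, so the work per row is roughly 2·band + 1 cells.

```python
def banded_smith_waterman(s, t, band, match=1, mismatch=-1, gap=-1):
    """Local alignment score, but only cells with |i - j| <= band are
    computed; out-of-band cells keep the value 0, as on the slide.
    (The full matrix is allocated here only for clarity.)"""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            S[i][j] = max(
                0,
                S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
            best = max(best, S[i][j])
    return best
```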

Slide 26

Slide 26 text

Banded Smith-Waterman • Main problem: this is local alignment, so the optimal alignment may not be near the main diagonal • How can we decide which bands to search? • Inspiration: the dot plot; use short matches to guide the search

Slide 27

Slide 27 text

(human/marmoset genomic DNA)

Slide 28

Slide 28 text

Banded Smith-Waterman • Main problem: this is local alignment, so the optimal alignment may not be near the main diagonal • How can we decide which bands to search? • Inspiration: the dot plot; use short matches to guide the search

Slide 29

Slide 29 text

Approximate local alignment with word-match heuristics

Slide 30

Slide 30 text

Approximating Smith-Waterman • Assumption: the optimal local alignment will include short regions of very high similarity or identity • Heuristic: rather than considering all possible alignments, consider only those that pass through or near short regions of high similarity

Slide 31

Slide 31 text

Filtration algorithm concept • Find a set of candidate partial solutions (seeds) using a fast method • Filter candidates using one or a series of more expensive criteria • Report results

Slide 32

Slide 32 text

Seeds • FASTA: for DNA, all exact matches of length 4 to 6 • BLAST: for DNA, all exact matches of length 11

Slide 33

Slide 33 text

Filtration • FASTA 1) Rank diagonals by number of exact matches (roughly) and keep only the top 10 • BLAST 1) Keep only seeds that are paired with another seed on the same diagonal within a limited distance

Slide 34

Slide 34 text

Filtration • FASTA 2) Rescore the top runs using a similarity matrix and keep only those over a threshold; attempt to join nearby diagonals • BLAST 2) Extend (ungapped) until the score drops by a certain amount below the best seen; keep the best extension if its score is over a threshold
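A Python sketch of BLAST-style ungapped X-drop extension. The seed is a length-w exact word match at s[i..i+w) / t[j..j+w); the +1/−1 scoring and the drop-off threshold `xdrop` are assumed parameters for illustration, not BLAST's actual defaults.

```python
def ungapped_extend(s, t, i, j, w, xdrop, match=1, mismatch=-1):
    """Extend a length-w exact match at s[i..i+w) / t[j..j+w) in both
    directions, stopping in each direction when the running score drops
    more than xdrop below the best seen. Returns the best HSP score."""
    seed = w * match

    # Extend to the right of the seed.
    run = best_right = 0
    a, b = i + w, j + w
    while a < len(s) and b < len(t):
        run += match if s[a] == t[b] else mismatch
        if run > best_right:
            best_right = run
        elif best_right - run > xdrop:
            break
        a += 1
        b += 1

    # Extend to the left of the seed.
    run = best_left = 0
    a, b = i - 1, j - 1
    while a >= 0 and b >= 0:
        run += match if s[a] == t[b] else mismatch
        if run > best_left:
            best_left = run
        elif best_left - run > xdrop:
            break
        a -= 1
        b -= 1

    return seed + best_right + best_left
```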

Slide 35

Slide 35 text

ungapped extension

Slide 36

Slide 36 text

Filtration • FASTA 3) Perform a banded Smith-Waterman alignment along the diagonals that score over the required threshold • BLAST 3) From the center of every HSP, perform a Smith-Waterman-style alignment, stopping when the score drops by more than a certain amount

Slide 37

Slide 37 text

banded extension vs. score-bounded extension

Slide 38

Slide 38 text

Performing initial seed match • Hash table: • Data structure that provides a mapping from keys to values (think dictionary) • Inserting a new key/value pair and looking up the value for a key are both O(1)

Slide 39

Slide 39 text

Performing initial seed match
target = “ACAGGAGGGAGA”
Table: ACA → 0; CAG → 1; AGG → 2, 5; GGA → 3, 7; GAG → 4, 8; GGG → 6; AGA → 9
n − 2 overlapping 3-tuples, n − 2 inserts, build time O(n)

Slide 40

Slide 40 text

Performing initial seed match
query = “GGAGCCC”
Table: ACA → 0; CAG → 1; AGG → 2, 5; GGA → 3, 7; GAG → 4, 8; GGG → 6; AGA → 9
Lookups: GGA → 3, 7; GAG → 4, 8; AGC → (no match); GCC → (no match); CCC → (no match)
m − 2 3-tuples, m − 2 lookups, lookup time O(m)

Slide 41

Slide 41 text

Performing initial seed match: n − 2 3-tuples and n − 2 inserts give build time O(n); m − 2 3-tuples and m − 2 lookups give lookup time O(m). To find all matches, build and then look up: total time O(n + m).
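The build-then-lookup procedure as a short Python sketch; names are illustrative, and k = 3 matches the 3-tuple example above.

```python
from collections import defaultdict

def find_seed_matches(target, query, k=3):
    """Build a k-mer hash table of the target in O(n), then look up every
    k-mer of the query in O(m); returns (query_pos, target_pos) pairs."""
    table = defaultdict(list)
    for i in range(len(target) - k + 1):          # n - k + 1 k-tuples
        table[target[i:i + k]].append(i)
    matches = []
    for j in range(len(query) - k + 1):           # m - k + 1 lookups
        for i in table.get(query[j:j + k], []):
            matches.append((j, i))
    return matches

# Example from the slides:
# find_seed_matches("ACAGGAGGGAGA", "GGAGCCC")
# -> [(0, 3), (0, 7), (1, 4), (1, 8)]   (the GGA and GAG hits)
```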

Slide 42

Slide 42 text

Similarities between BLAST and FASTA • Fairly similar core idea, Filtration: • Identify a set of initial candidate matches (seeds) • Filter seeds using heuristics • Extend seeds that pass filters using bounded dynamic programming

Slide 43

Slide 43 text

Extension • The vast majority of running time is in extension • The key to good performance is identifying just the right number of seeds to extend • Seed criteria must not be so stringent as to miss many optimal alignments • But must be stringent enough to achieve good performance

Slide 44

Slide 44 text

Seeding

Slide 45

Slide 45 text

Some types of seeds • Exact matches of length k (exact seeds) • Pairs that produce a score over T when matched (score seeds) • Specific patterns of matches spaced with mismatches (spaced seeds) • Fixed or spaced seeds allowing transitions at one or more positions (transition seeds) • Arbitrary combinations

Slide 46

Slide 46 text

Evaluating seeds • Seed sensitivity: the probability that a biologically meaningful alignment contains a match to the seed • Seed specificity: 1 − probability of finding a seed match in unrelated sequences

Slide 47

Slide 47 text

Balancing accuracy and performance • Fixed seeds: hash table • Scored seeds: neighborhood strategy + hash table • Spaced seeds: shifting + hash table (non-match positions can be completely ignored) • Transition seeds: transition-insensitive encoding + hash table
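A minimal Python illustration of why the non-match positions of a spaced seed can be ignored: hashing only the pattern's '1' positions means mismatches at '0' positions can never break a seed hit. The pattern below is an arbitrary illustrative one, not the actual blastz seed.

```python
def spaced_key(word, pattern):
    """Project a word onto the '1' (must-match) positions of a spaced seed
    pattern; the '0' positions are dropped before hashing, so mismatches
    there cannot prevent a seed hit."""
    return "".join(c for c, p in zip(word, pattern) if p == "1")

pattern = "1101011"                       # illustrative weight-5 spaced seed
# The two words below differ only at the '0' positions, so they still hit:
assert spaced_key("ACGTACG", pattern) == spaced_key("ACATTCG", pattern)
```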

Slide 48

Slide 48 text

Seeds matter • Different seeds may be more or less effective for particular classes of elements • For non-coding genome alignment, transition seeds are much better than fixed spaced seeds • Allowing transitions at any position is advantageous • Optimal seed design is hard (Sun and Buhler, 2006)

Slide 49

Slide 49 text

Scoring parameters

Slide 50

Slide 50 text

Inferring a scoring matrix from data • For proteins, estimate from hand-tuned structural alignments, using a time-dependent model for extrapolation • No great analog for genomic alignments

Slide 51

Slide 51 text

Inferring a scoring matrix from data • Start with two high-quality sequences with known homology • Align with ungapped blast, +1/−1 scoring • Discard high-scoring HSPs (>70% identity) • Infer a new scoring scheme as a log-odds ratio: the score of the alignment column x-over-y is the log of an “odds ratio”, s(x, y) = log( p(x, y) / (q1(x) q2(y)) ), where p(x, y) is the frequency of x-over-y in the training set, expressed as a fraction of the observed aligned pairs, and q1(x) and q2(y) denote the background frequencies of nucleotides x and y in the upper and lower sequences, respectively. (Chiaromonte et al., Pacific Symposium on Biocomputing 7:115-126, 2002)
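A Python sketch of the log-odds computation, estimating s(x, y) from a training set of aligned pairs; the names and the input representation (a list of (x, y) column pairs) are assumptions, and the published HOXD matrices additionally scale and round the scores, which is not shown here.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """aligned_pairs: list of (x, y) nucleotide pairs observed in the
    training alignments (x from the upper sequence, y from the lower).
    Returns s(x, y) = log( p(x, y) / (q1(x) * q2(y)) )."""
    pair_freq = Counter(aligned_pairs)
    total = sum(pair_freq.values())
    q1 = Counter(x for x, _ in aligned_pairs)     # upper-sequence background
    q2 = Counter(y for _, y in aligned_pairs)     # lower-sequence background
    scores = {}
    for (x, y), count in pair_freq.items():
        p_xy = count / total
        scores[(x, y)] = math.log(p_xy / ((q1[x] / total) * (q2[y] / total)))
    return scores
```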

Slide 52

Slide 52 text

HOXD substitution matrix, ~47.5% G+C (the HOXD70 matrix; Chiaromonte et al. 2002):

       A     C     G     T
  A   91  -114   -31  -123
  C -114   100  -125   -31
  G  -31  -125   100  -114
  T -123   -31  -114    91

These scores can then be used in the traditional alignment algorithms.

Slide 53

Slide 53 text

Long genomic alignments: blastz and lastz

Slide 54

Slide 54 text

Blastz ideas • 12-of-19 spaced seed (from PatternHunter), but allowing two transitions • Soft and dynamic masking of known repeats and of regions that generate highly repetitive alignments • Chaining: reduce overlapping alignments to a maximal-scoring non-overlapping set • Interpolation: realign between the best alignments with more sensitive parameters

Slides 55-69: figure-only slides (no extractable text)