lecture-11-sequence-alignment-2

L11: SEQUENCE ALIGNMENT II Foundations in Data-Driven Life Sciences BMMB/MCIBS
554

Today’s learning objectives • How do we align two biological
sequences? • Global alignment • Local alignment • Similarity scores • Sequence similarity vs. homology

Example global alignment A C G T A C T
0 -2 -4 -6 -8 -2 -4 -6 Scoring scheme Match = +3 Mismatch = -1 Gap = -2 +3 +6 +7 Problem: Align: ACGT vs ACT +1 -1 -3 +1 -1 +4 +5 +4 +2

Dynamic Programming See: B&FG – Chapter 3 Fi-1, j-1 Fi-1,
j Fi, j-1 Fi, j -d -d s(X i , Y j ) Let: Fi, j = optimal score of aligning X1 …Xi to Y1 …Yj Three possibilities: • Xi aligns to Yj : Fi, j = Fi-1, j-1 + s(Xi , Yj ) • Xi aligns to gap : Fi, j = Fi-1, j – d • Yj aligns to gap : Fi, j = Fi, j-1 – d

Needleman-Wunsch algorithm F0, 0 = 0 F0, 1…j = -
j * d F1…i, 0 = - i * d for each i = 1…M for each j = 1…N Fi-1, j-1 + s(Xi , Yj ) [match] Fi, j = max Fi-1, j – d [gap in X] Fi, j-1 – d [gap in Y] DIAG, if [match] Ptri, j = LEFT, if [gap in X] UP, if [gap in Y] Initialization Iteration Termination: FM, N is the score of the optimal alignment. Alignment path can be traced back from PtrM, N Problem: Find the best way to align X and Y from 0,0 to M,N

Algorithmic complexity • Given two sequences of length L •
Brute force alignment: • Possible pairwise alignments: • Needleman-Wunsch alignment: • 3 summations and a max operation per matrix entry • L x L matrix entries to compute • à O(L2) 22L 2πL

T G A A C T C C T A
C T G T A A G T T G T T C T T A C T G T C T A A G X: -TGAACTCCTACTGT--AAG Y: TTGTTCT--TACTGTCTAAG Global Alignment

A C C G A T G T A C
T G T A G G T G A G T C T A C T G T T T A A T C X: ACCGATGTACTGTAGGT Y: GAGTCTACTGTTTAATC Local Alignment

Local alignment (Smith-Waterman) Problem: Find optimal alignments between subsequences of
X and Y. Given X1 …XM and Y1 …YN , find i, j, k, l such that the score of alignment between Xi …Xj and Yk …Yl is maximal. Idea: If the alignment score becomes negative, it is better to start a new alignment. i.e. set the score to 0

Smith-Waterman algorithm F0, 0 = 0 F0, 1…j = 0
F1…i, 0 = 0 for each i = 1…M for each j = 1…N Fi-1, j-1 + s(Xi , Yj ) [match] Fi, j = max Fi-1, j – d [gap in X] Fi, j-1 – d [gap in Y] 0 DIAG, if [match] Ptri, j = LEFT, if [gap in X] UP, if [gap in Y] Initialization Iteration Termination: Best local alignment score is the Fi, j with maximum value. Best local alignment path can be traced back from Ptri, j corresponding to maximum Fi, j

Example local alignment T A C G A C T
0 0 0 0 0 0 0 0 Scoring scheme Match = +3 Mismatch = -3 Gap = -4 0 0 +3 Problem: Align: TACGT vs ACT +3 0 0 0 0 +2 +6 +2 T 0 0 0 +5 +3

Question • In what scenario would local alignment be appropriate?

Gap opening vs gap extension Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACGGGGTTCCCGCACTTCGACCTGAGCC Original
Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACG-G-G-G-TTCCCGCACTTCGACCTGAGCC Alignment 1 Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACGGGG----TTCCCGCACTTCGACCTGAGCC Alignment 2 How many evolutionary events does each alignment suggest? Should gap opening ”cost” the same as gap extension?

Affine gap penalties Gap extensions should be penalized less than
gap openings: Penalty for gap of length g: γ(g) = – d – (g – 1)e e = extension penalty M (+1,+1 ) Ix (+1,+0 ) Iy (+0,+1 ) -e -e -d -d s(xi , yj ) s(xi , yj ) s(xi , yj )

Affine gap penalties Gap extensions should be penalized less than
gap openings: Penalty for gap of length g: γ(g) = – d – (g – 1)e To implement, we need two more matrices to keep track of insertions: Fi, j = max{ Mi, j , Ixi, j , Iyi, j } Mi-1, j-1 + s(Xi , Yj ) Mi, j = max Ixi-1, j-1 + s(Xi , Yj ) Iyi-1, j-1 + s(Xi , Yj ) Mi-1, j – d Ixi, j = max Ixi-1, j – e Mi, j-1 – d Iyi, j = max Iyi, j-1 – e e = extension penalty

How do we define similarity?

Let’s think of substitution penalties as weight matrices s(Xi ,
Yj ) A C G T A +1 -2 -2 -2 C -2 +1 -2 -2 G -2 -2 +1 -2 T -2 -2 -2 +1 Default nucleotide substitution matrix used by BLAST A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved (https://www.ncbi.nlm.nih.gov/books/NBK279678) Gap open penalty = -5 Gap extension penalty = -2

Should our similarity metric depend on whether we are aligning
DNA or protein sequences?

What about aligning amino acid sequences? Human AAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCT |||||||||||||| |||||
|| | |||||||||||| Mouse AAGACCTACTTCCCTCACTTTGATGTAAGCCACGGCTCT Human alpha globin fragment

What about aligning amino acid sequences? K T Y F
P H F D L S H G S Human AAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCT |||||||||||||| ||||| || | |||||||||||| Mouse AAGACCTACTTCCCTCACTTTGATGTAAGCCACGGCTCT K T Y F P H F D V S H G S Human alpha globin fragment Synonymous mutation Non-synonymous mutation

Are all mismatches as bad as each other?

BLOSUM62 matrix

What should alignment scores represent?

What should an alignment scoring matrix represent? • Scores reflect
probability that two residues would appear at equivalent/aligned positions in homologous sequences. • Likelihood ratio between two hypotheses: • Hypothesis 1: residues are aligned due to chance • unrelated sequences • Hypothesis 2: residues are aligned due to common ancestry • evolutionarily related sequences

What should an alignment score represent? • Probability that two
similar sequences are homologous • i.e. related via evolution • Likelihood ratio between two hypotheses: • Hypothesis 1: alignment due to chance (unrelated sequences) • Hypothesis 2: alignment due to common ancestry (related sequences) • Calculate probability of observing an alignment according to each hypothesis • P(Xi ,Yj | U): prob of aligning Xi & Yj by model U (unrelated) • P(Xi ,Yj | R): prob of aligning Xi & Yj by model R (related) • Alignment score: likelihood ratio between the two • Relative likelihood that alignment not due to chance = P(Xi ,Yj | R) / P(Xi ,Yj | U) • Score = log(P)

Model for unrelated sequences • Assume that each position in
the alignment is randomly generated according to some amino acid frequencies. • Let qa be the probability of seeing amino acid a. • Probability of seeing an n-character alignment of X & Y: P(X ,Y |U) = q Xi i=1 n ∏ q Yi i=1 n ∏

Model for related sequences • Assume that each pair of
amino acids evolved from a common ancestor. • Let pa,b be the probability that evolution gave rise to amino acid a in one sequence and amino acid b in the other sequence. • Define from known high-quality alignments of related protein sequences. • Probability of seeing an n-character alignment of X & Y: P(X,Y | R) = p Xi ,Yi i=1 n ∏

Probabilistic model of alignments • How to decide which model
explains the alignment (U or R)? • Consider the relative likelihood of two possibilities (log-odds): P(X,Y | R) P(X,Y |U) = p Xi ,Yi i=1 n ∏ q Xi i=1 n ∏ q Yi i=1 n ∏ log P(X,Y | R) P(X,Y |U) = log p Xi ,Yi q Xi q Yi i n ∑

Substitution matrices • General idea: calculate substitution frequencies from high-confidence
alignments of many homologous proteins. • PAM matrices (“point accepted mutations”): • Began with closely related protein sequences (>85% identical). • Built phylogenetic trees to model evolutionary distances. • BLOSUM (“blocks substitution matrix”): • Based on ungapped alignments from BLOCKS database of related proteins. • Calculated substitution frequencies between related proteins with more than X% identity (e.g. X = 80% for BLOSUM80). BLOSUM80 PAM1 BLOSUM62 PAM120 BLOSUM45 PAM250 Less divergent More divergent

How can we quickly align a sequence against many other
sequences?

BLAST (Basic Local Alignment Search Tool)

BLAST algorithm overview • Idea: a high-scoring alignment is very
likely to contain a short stretch of very high scoring matches 1. Split query into overlapping words of length k. k = 20 for DNA, k = 6 for amino acids 2. Find neighborhood words for each k-mer. 3. Look up sequences that contain neighborhood words from table (seeds S). 4. Extend seeds S until score drops off. 5. Report significance and alignment of each match.

BLAST algorithm overview Query = KTYFPHFDLSHGS 1. Split query into
words (k=3): HFD 20 HYD 17 HWD 15 HMD 14 HID 14 HLD 14 HVD 13 2. Find neighborhood words (using BLOSUM) k-mer database 3. Search database for matches to neighborhood words Query = KTYFPHFDLSHGS |||| || | Sbjct = PYYFPHIDLDFAS 4. Extend seed into alignment

BLOSUM62 matrix

BLAST example blast.ncbi.nlm.nih.gov

BLAST example

Summary • Dynamic programming approaches enable several variations on the
alignment problem, each applicable to different biological questions/assumptions. • Substitution matrices provide sequence similarity scores that account for some of the ways that sequences evolve.

lecture-11-sequence-alignment-2

lecture-11-sequence-alignment-2

More Decks by shaunmahony

Featured

Transcript