Upgrade to Pro — share decks privately, control downloads, hide ads and more …

lecture-11-sequence-alignment-2

Avatar for shaunmahony shaunmahony
February 16, 2022
550

 lecture-11-sequence-alignment-2

Lecture 11: Sequence Alignment II

Avatar for shaunmahony

shaunmahony

February 16, 2022
Tweet

Transcript

  1. Today’s learning objectives • How do we align two biological

    sequences? • Global alignment • Local alignment • Similarity scores • Sequence similarity vs. homology
  2. Example global alignment A C G T A C T

    0 -2 -4 -6 -8 -2 -4 -6 Scoring scheme Match = +3 Mismatch = -1 Gap = -2 +3 +6 +7 Problem: Align: ACGT vs ACT +1 -1 -3 +1 -1 +4 +5 +4 +2
  3. Dynamic Programming See: B&FG – Chapter 3 Fi-1, j-1 Fi-1,

    j Fi, j-1 Fi, j -d -d s(X i , Y j ) Let: Fi, j = optimal score of aligning X1 …Xi to Y1 …Yj Three possibilities: • Xi aligns to Yj : Fi, j = Fi-1, j-1 + s(Xi , Yj ) • Xi aligns to gap : Fi, j = Fi-1, j – d • Yj aligns to gap : Fi, j = Fi, j-1 – d
  4. Needleman-Wunsch algorithm F0, 0 = 0 F0, 1…j = -

    j * d F1…i, 0 = - i * d for each i = 1…M for each j = 1…N Fi-1, j-1 + s(Xi , Yj ) [match] Fi, j = max Fi-1, j – d [gap in X] Fi, j-1 – d [gap in Y] DIAG, if [match] Ptri, j = LEFT, if [gap in X] UP, if [gap in Y] Initialization Iteration Termination: FM, N is the score of the optimal alignment. Alignment path can be traced back from PtrM, N Problem: Find the best way to align X and Y from 0,0 to M,N
  5. Algorithmic complexity • Given two sequences of length L •

    Brute force alignment: • Possible pairwise alignments: • Needleman-Wunsch alignment: • 3 summations and a max operation per matrix entry • L x L matrix entries to compute • à O(L2) 22L 2πL
  6. T G A A C T C C T A

    C T G T A A G T T G T T C T T A C T G T C T A A G X: -TGAACTCCTACTGT--AAG Y: TTGTTCT--TACTGTCTAAG Global Alignment
  7. A C C G A T G T A C

    T G T A G G T G A G T C T A C T G T T T A A T C X: ACCGATGTACTGTAGGT Y: GAGTCTACTGTTTAATC Local Alignment
  8. Local alignment (Smith-Waterman) Problem: Find optimal alignments between subsequences of

    X and Y. Given X1 …XM and Y1 …YN , find i, j, k, l such that the score of alignment between Xi …Xj and Yk …Yl is maximal. Idea: If the alignment score becomes negative, it is better to start a new alignment. i.e. set the score to 0
  9. Smith-Waterman algorithm F0, 0 = 0 F0, 1…j = 0

    F1…i, 0 = 0 for each i = 1…M for each j = 1…N Fi-1, j-1 + s(Xi , Yj ) [match] Fi, j = max Fi-1, j – d [gap in X] Fi, j-1 – d [gap in Y] 0 DIAG, if [match] Ptri, j = LEFT, if [gap in X] UP, if [gap in Y] Initialization Iteration Termination: Best local alignment score is the Fi, j with maximum value. Best local alignment path can be traced back from Ptri, j corresponding to maximum Fi, j
  10. Example local alignment T A C G A C T

    0 0 0 0 0 0 0 0 Scoring scheme Match = +3 Mismatch = -3 Gap = -4 0 0 +3 Problem: Align: TACGT vs ACT +3 0 0 0 0 +2 +6 +2 T 0 0 0 +5 +3
  11. Gap opening vs gap extension Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACGGGGTTCCCGCACTTCGACCTGAGCC Original

    Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACG-G-G-G-TTCCCGCACTTCGACCTGAGCC Alignment 1 Seq1 AAGACCTACGGGGGGGGTTCCCGCACTTCGACCTGAGCC Seq2 AAGACCTACGGGG----TTCCCGCACTTCGACCTGAGCC Alignment 2 How many evolutionary events does each alignment suggest? Should gap opening ”cost” the same as gap extension?
  12. Affine gap penalties Gap extensions should be penalized less than

    gap openings: Penalty for gap of length g: γ(g) = – d – (g – 1)e e = extension penalty M (+1,+1 ) Ix (+1,+0 ) Iy (+0,+1 ) -e -e -d -d s(xi , yj ) s(xi , yj ) s(xi , yj )
  13. Affine gap penalties Gap extensions should be penalized less than

    gap openings: Penalty for gap of length g: γ(g) = – d – (g – 1)e To implement, we need two more matrices to keep track of insertions: Fi, j = max{ Mi, j , Ixi, j , Iyi, j } Mi-1, j-1 + s(Xi , Yj ) Mi, j = max Ixi-1, j-1 + s(Xi , Yj ) Iyi-1, j-1 + s(Xi , Yj ) Mi-1, j – d Ixi, j = max Ixi-1, j – e Mi, j-1 – d Iyi, j = max Iyi, j-1 – e e = extension penalty
  14. Let’s think of substitution penalties as weight matrices s(Xi ,

    Yj ) A C G T A +1 -2 -2 -2 C -2 +1 -2 -2 G -2 -2 +1 -2 T -2 -2 -2 +1 Default nucleotide substitution matrix used by BLAST A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved (https://www.ncbi.nlm.nih.gov/books/NBK279678) Gap open penalty = -5 Gap extension penalty = -2
  15. What about aligning amino acid sequences? Human AAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCT |||||||||||||| |||||

    || | |||||||||||| Mouse AAGACCTACTTCCCTCACTTTGATGTAAGCCACGGCTCT Human alpha globin fragment
  16. What about aligning amino acid sequences? K T Y F

    P H F D L S H G S Human AAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCT |||||||||||||| ||||| || | |||||||||||| Mouse AAGACCTACTTCCCTCACTTTGATGTAAGCCACGGCTCT K T Y F P H F D V S H G S Human alpha globin fragment Synonymous mutation Non-synonymous mutation
  17. What should an alignment scoring matrix represent? • Scores reflect

    probability that two residues would appear at equivalent/aligned positions in homologous sequences. • Likelihood ratio between two hypotheses: • Hypothesis 1: residues are aligned due to chance • unrelated sequences • Hypothesis 2: residues are aligned due to common ancestry • evolutionarily related sequences
  18. What should an alignment score represent? • Probability that two

    similar sequences are homologous • i.e. related via evolution • Likelihood ratio between two hypotheses: • Hypothesis 1: alignment due to chance (unrelated sequences) • Hypothesis 2: alignment due to common ancestry (related sequences) • Calculate probability of observing an alignment according to each hypothesis • P(Xi ,Yj | U): prob of aligning Xi & Yj by model U (unrelated) • P(Xi ,Yj | R): prob of aligning Xi & Yj by model R (related) • Alignment score: likelihood ratio between the two • Relative likelihood that alignment not due to chance = P(Xi ,Yj | R) / P(Xi ,Yj | U) • Score = log(P)
  19. Model for unrelated sequences • Assume that each position in

    the alignment is randomly generated according to some amino acid frequencies. • Let qa be the probability of seeing amino acid a. • Probability of seeing an n-character alignment of X & Y: P(X ,Y |U) = q Xi i=1 n ∏ q Yi i=1 n ∏
  20. Model for related sequences • Assume that each pair of

    amino acids evolved from a common ancestor. • Let pa,b be the probability that evolution gave rise to amino acid a in one sequence and amino acid b in the other sequence. • Define from known high-quality alignments of related protein sequences. • Probability of seeing an n-character alignment of X & Y: P(X,Y | R) = p Xi ,Yi i=1 n ∏
  21. Probabilistic model of alignments • How to decide which model

    explains the alignment (U or R)? • Consider the relative likelihood of two possibilities (log-odds): P(X,Y | R) P(X,Y |U) = p Xi ,Yi i=1 n ∏ q Xi i=1 n ∏ q Yi i=1 n ∏ log P(X,Y | R) P(X,Y |U) = log p Xi ,Yi q Xi q Yi i n ∑
  22. Substitution matrices • General idea: calculate substitution frequencies from high-confidence

    alignments of many homologous proteins. • PAM matrices (“point accepted mutations”): • Began with closely related protein sequences (>85% identical). • Built phylogenetic trees to model evolutionary distances. • BLOSUM (“blocks substitution matrix”): • Based on ungapped alignments from BLOCKS database of related proteins. • Calculated substitution frequencies between related proteins with more than X% identity (e.g. X = 80% for BLOSUM80). BLOSUM80 PAM1 BLOSUM62 PAM120 BLOSUM45 PAM250 Less divergent More divergent
  23. BLAST algorithm overview • Idea: a high-scoring alignment is very

    likely to contain a short stretch of very high scoring matches 1. Split query into overlapping words of length k. k = 20 for DNA, k = 6 for amino acids 2. Find neighborhood words for each k-mer. 3. Look up sequences that contain neighborhood words from table (seeds S). 4. Extend seeds S until score drops off. 5. Report significance and alignment of each match.
  24. BLAST algorithm overview Query = KTYFPHFDLSHGS 1. Split query into

    words (k=3): HFD 20 HYD 17 HWD 15 HMD 14 HID 14 HLD 14 HVD 13 2. Find neighborhood words (using BLOSUM) k-mer database 3. Search database for matches to neighborhood words Query = KTYFPHFDLSHGS |||| || | Sbjct = PYYFPHIDLDFAS 4. Extend seed into alignment
  25. Summary • Dynamic programming approaches enable several variations on the

    alignment problem, each applicable to different biological questions/assumptions. • Substitution matrices provide sequence similarity scores that account for some of the ways that sequences evolve.