Lecture 10: Sequence alignment 1

L10: SEQUENCE ALIGNMENT I
Foundations in Data-Driven Life Sciences, BMMB/MCIBS 554

Today’s learning objectives
1. Understand the goal of sequence alignments.
2. Understand how dynamic programming can be applied to sequence alignment problems.
• Relevant reading: Bioinformatics & Functional Genomics (Pevsner), Chapter 3

Algorithm design techniques
• Exhaustive search ("brute force")
  • Examine every possible alternative to find a solution.
• Greedy algorithms
  • Choose the "most attractive" alternative at each iteration.
• Divide-and-conquer algorithms
  • Break the problem into non-overlapping subproblems.
  • Stitch the solutions of the subproblems together to solve the larger problem.
• Dynamic programming
  • Break the problem into overlapping subproblems.
  • Remember the solutions of subproblems, and use them to construct solutions to larger problems.
• Machine learning / statistical learning theory
  • Learn the solution from observed data.
  • Typically models problems probabilistically.

Dynamic programming
• Break a problem into overlapping subproblems.
• Remember the solutions of subproblems, and use them to construct solutions to larger problems.

Sequence alignment
• Why would we want to align two sequences?
  • To identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships.
• Sequence alignment:
  • Models the evolutionary events that occurred in a pair of homologous sequences since their last common ancestor.
  • Events: substitutions, insertions, deletions

Sequence alignment: problem definition

Input:
X = TGAACTCCTACTGTAAG
Y = TTGTTCTTACTGTCTAAG

Aligned:
X: -TGAACTCCTACTGT--AAG
Y: TTGTTCT--TACTGTCTAAG

Given two sequence strings (X and Y), an alignment assigns gaps to the sequences such that each letter in X lines up with a letter or a gap in Y, and vice versa. The best alignment assigns gaps such that the similarity between the resulting strings is maximized.
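The definition above can be checked mechanically. The sketch below is illustrative only: the function name `is_valid_alignment` and the rule that no column pairs a gap with a gap are assumptions made here, not something stated on the slide.

```python
GAP = "-"

def is_valid_alignment(x, y, x_aligned, y_aligned):
    """Return True if (x_aligned, y_aligned) is a pairwise alignment of x and y.

    Hypothetical helper for illustration; not part of the lecture material.
    """
    # Both rows must have the same number of columns.
    if len(x_aligned) != len(y_aligned):
        return False
    # Removing the gaps must give back the original sequences.
    if x_aligned.replace(GAP, "") != x or y_aligned.replace(GAP, "") != y:
        return False
    # Assumed convention: no column pairs a gap with a gap.
    return all(not (a == GAP and b == GAP) for a, b in zip(x_aligned, y_aligned))

X = "TGAACTCCTACTGTAAG"
Y = "TTGTTCTTACTGTCTAAG"
print(is_valid_alignment(X, Y, "-TGAACTCCTACTGT--AAG", "TTGTTCT--TACTGTCTAAG"))
```

Note that this only tests whether a pair of gapped strings is a legal alignment; it says nothing about whether the alignment is a good one.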

Scoring alignments

Alignment 1 (6 matches, 3 mismatches, 1 gap):
AGGCTAGT-T
AGCGAAGTAT

vs.

Alignment 2 (7 matches, 1 mismatch, 3 gaps):
AGGCTA-GT-T
AG-CGAAGTAT

Scoring function s(X_i, Y_j):
• X_i = Y_j (match): +m
• X_i ≠ Y_j (mismatch): -k
• Gap penalty: -d

Alignment score:
$F = (\text{number of matches}) \cdot m - (\text{number of mismatches}) \cdot k - (\text{number of gaps}) \cdot d$

Example: m = 3, k = 1, d = 2 (that is, match = +3, mismatch = -1, gap = -2)
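As a rough illustration of the scoring function, the sketch below counts matches, mismatches, and gap columns in an already-aligned pair and combines them with m, k, and d. The function name `score_alignment` is hypothetical, and k and d are taken as positive magnitudes that get subtracted, which is an assumption about the sign convention.

```python
def score_alignment(x_aligned, y_aligned, m=3, k=1, d=2):
    """Score an aligned pair: +m per match, -k per mismatch, -d per gap column.

    Returns (score, (matches, mismatches, gaps)). Illustrative helper only.
    """
    matches = mismatches = gaps = 0
    for a, b in zip(x_aligned, y_aligned):
        if a == "-" or b == "-":
            gaps += 1          # one letter lined up against a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * m - mismatches * k - gaps * d
    return score, (matches, mismatches, gaps)

# The two candidate alignments of AGGCTAGTT and AGCGAAGTAT from the slide.
print(score_alignment("AGGCTAGT-T", "AGCGAAGTAT"))
print(score_alignment("AGGCTA-GT-T", "AG-CGAAGTAT"))
```

Changing m, k, or d can change which of the two alignments scores higher, which is why the scoring scheme has to be fixed before talking about a "best" alignment.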

[Figure: alignment matrix with X = TGAACTCCTACTGTAAG along one axis and Y = TTGTTCTTACTGTCTAAG along the other]

[Figure: the same alignment matrix for X and Y, with the path corresponding to this alignment:]
X: -TGAACTCCTACTGT--AAG
Y: TTGTTCT--TACTGTCTAAG

Four possible outcomes in aligning two sequences (B&FG 3e, Fig. 3-20, p. 97)
1. Identity (stay along a diagonal)
2. Mismatch (stay along a diagonal)
3. Gap in one sequence (move vertically)
4. Gap in the other sequence (move horizontally)

Four possible outcomes in aligning two sequences (B&FG 3e, Fig. 3-20, p. 97)
• Match (diagonal)
• Mismatch (diagonal)
• Gap in seq1 (vertical)
• Gap in seq2 (horizontal)
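To make the link between alignment columns and moves in the matrix concrete, here is a minimal sketch that turns an aligned pair into a list of moves. It follows the slide's labeling (gap in seq1 is vertical, gap in seq2 is horizontal); the function name `alignment_to_moves` and the move labels are made up for illustration.

```python
def alignment_to_moves(seq1_aligned, seq2_aligned):
    """Map each alignment column to one of the four outcomes.

    Assumes the slide's convention: a gap in seq1 is a vertical move and
    a gap in seq2 is a horizontal move. Illustrative helper only.
    """
    moves = []
    for a, b in zip(seq1_aligned, seq2_aligned):
        if a == "-":
            moves.append("vertical")             # gap in seq1
        elif b == "-":
            moves.append("horizontal")           # gap in seq2
        elif a == b:
            moves.append("diagonal (match)")
        else:
            moves.append("diagonal (mismatch)")
    return moves

for move in alignment_to_moves("A-GT", "ACC-"):
    print(move)
```

Every alignment corresponds to exactly one such path through the matrix, which is what lets the alignment problem be treated as a path-finding problem.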

Alignment is additive

If:
• $X_1 \ldots X_i$ aligns to $Y_1 \ldots Y_j$, and
• $X_{i+1} \ldots X_M$ aligns to $Y_{j+1} \ldots Y_N$

Then:
$F(X_{1 \ldots M}, Y_{1 \ldots N}) = F(X_{1 \ldots i}, Y_{1 \ldots j}) + F(X_{i+1 \ldots M}, Y_{j+1 \ldots N})$

So the original problem, aligning $X_1 \ldots X_M$ to $Y_1 \ldots Y_N$, can be decomposed into smaller subproblems: aligning $X_1 \ldots X_i$ to $Y_1 \ldots Y_j$. Dynamic programming can then be applied to solve it.
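The additivity property can be checked numerically on a small case. This sketch assumes the column-wise scoring with m = 3, k = 1, d = 2 (magnitudes subtracted for mismatches and gaps); the helper names are hypothetical.

```python
def column_score(a, b, m=3, k=1, d=2):
    """Score of one alignment column (assumed scheme: +m, -k, -d)."""
    if a == "-" or b == "-":
        return -d
    return m if a == b else -k

def alignment_score(x_aligned, y_aligned):
    """Total score is the sum of the independent column scores."""
    return sum(column_score(a, b) for a, b in zip(x_aligned, y_aligned))

x_aln, y_aln = "AGGCTAGT-T", "AGCGAAGTAT"
total = alignment_score(x_aln, y_aln)

# Cutting the alignment at any column gives two parts whose scores add up.
for cut in range(len(x_aln) + 1):
    left = alignment_score(x_aln[:cut], y_aln[:cut])
    right = alignment_score(x_aln[cut:], y_aln[cut:])
    assert left + right == total

print(total)
```

Additivity holds here because each column is scored independently of its neighbors; a scoring scheme where gap cost depends on gap length would need a more careful decomposition.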

Needleman-Wunsch: dynamic programming
• Needleman-Wunsch is guaranteed to find optimal alignments, even though the algorithm does not search all possible alignments.
• It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. At each step of the alignment, a decision is made to find the pair of residues with the best score. These solutions are then reused by later steps of the algorithm.

Global pairwise alignment using Needleman-Wunsch (B&FG 3e, Fig. 3-21, p. 98)

Three possibilities:
• $X_i$ aligns to $Y_j$: $F_{i,j} = F_{i-1,j-1} + s(X_i, Y_j)$
• $X_i$ aligns to a gap: $F_{i,j} = F_{i-1,j} - d$
• $Y_j$ aligns to a gap: $F_{i,j} = F_{i,j-1} - d$
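One way to read the three possibilities is as a recursive definition of the best score, with previously computed subproblems remembered. The sketch below uses Python's `functools.lru_cache` for the remembering step. The boundary values (a prefix aligned entirely against gaps costs -d per letter) and the default scores m = 3, k = 1, d = 2 are assumptions here, and `global_score` is a hypothetical name.

```python
from functools import lru_cache

def global_score(x, y, m=3, k=1, d=2):
    """Best global alignment score of x and y via memoized recursion.

    Top-down sketch of the three-way recurrence; suitable for short
    sequences only, since Python limits recursion depth.
    """
    @lru_cache(maxsize=None)
    def F(i, j):
        # Assumed boundaries: a prefix aligned only against gaps.
        if i == 0:
            return -j * d
        if j == 0:
            return -i * d
        s = m if x[i - 1] == y[j - 1] else -k
        return max(
            F(i - 1, j - 1) + s,  # X_i aligns to Y_j
            F(i - 1, j) - d,      # X_i aligns to a gap
            F(i, j - 1) - d,      # Y_j aligns to a gap
        )

    return F(len(x), len(y))

print(global_score("ACGT", "ACT"))
```

Without the cache, the same (i, j) subproblems would be recomputed many times over; with it, each is solved once.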

Global pairwise alignment using Needleman-Wunsch (B&FG 3e, Fig. 3-21, p. 98)
Here the best score involves +1 (proceeding diagonally from the upper-left square to the gray, lower-right square). If we instead selected an alignment involving a gap, the score would be worse (-4).

Global pairwise alignment using Needleman-Wunsch (B&FG 3e, Fig. 3-21, p. 98)
Proceed to calculate the optimal score for the next position.

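Needleman-Wunsch algorithm

Initialization:
$F_{0,0} = 0$
$F_{0,j} = -j \cdot d$ for $j = 1 \ldots N$
$F_{i,0} = -i \cdot d$ for $i = 1 \ldots M$

Iteration: for each $i = 1 \ldots M$, for each $j = 1 \ldots N$:

$$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + s(X_i, Y_j) & \text{[$X_i$ aligned to $Y_j$]} \\ F_{i-1,j} - d & \text{[$X_i$ aligned to a gap]} \\ F_{i,j-1} - d & \text{[$Y_j$ aligned to a gap]} \end{cases}$$

$$\mathrm{Ptr}_{i,j} = \begin{cases} \text{DIAG} & \text{if the first case gives the maximum} \\ \text{LEFT} & \text{if the second case gives the maximum} \\ \text{UP} & \text{if the third case gives the maximum} \end{cases}$$

Termination: $F_{M,N}$ is the score of the optimal alignment. The alignment path can be traced back from $\mathrm{Ptr}_{M,N}$.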
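The initialization, iteration, and termination steps can be sketched bottom-up as follows. This is a minimal illustration, not a reference implementation: the function name, the default scores (m = 3, k = 1, d = 2), and the tie-breaking rule (prefer DIAG, then LEFT, then UP) are all assumptions. LEFT and UP follow the slide's pointer names, assuming X runs across the top of the matrix.

```python
def needleman_wunsch(x, y, m=3, k=1, d=2):
    """Global alignment of x and y. Returns (score, x_aligned, y_aligned)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: prefixes aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # Iteration: fill each cell from its three neighbors.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = m if x[i - 1] == y[j - 1] else -k
            choices = [
                (F[i - 1][j - 1] + s, "DIAG"),  # X_i aligned to Y_j
                (F[i - 1][j] - d, "LEFT"),      # X_i aligned to a gap
                (F[i][j - 1] - d, "UP"),        # Y_j aligned to a gap
            ]
            # max() keeps the first maximal entry, so ties prefer DIAG.
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Termination: trace pointers back from the last cell.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

score, x_aln, y_aln = needleman_wunsch("ACGT", "ACT")
print(score)
print(x_aln)
print(y_aln)
```

When several alignments share the optimal score, the traceback returns only one of them; which one depends on the tie-breaking order chosen above.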

Global Alignment
[Figure: alignment matrix for X and Y, with the global alignment path:]
X: -TGAACTCCTACTGT--AAG
Y: TTGTTCT--TACTGTCTAAG

Problem: Align ACGT vs ACT
Scoring scheme: Match = +3, Mismatch = -1, Gap = -2
[Figure: empty alignment matrix with ACGT across the top and ACT down the side]

Example global alignment

Problem: Align ACGT vs ACT
Scoring scheme: Match = +3, Mismatch = -1, Gap = -2

            A    C    G    T
       0   -2   -4   -6   -8
  A   -2   +3   +1   -1   -3
  C   -4   +1   +6   +4   +2
  T   -6   -1   +4   +5   +7

Algorithmic complexity
• Given two sequences of length L:
• Brute-force alignment:
  • Possible pairwise alignments: approximately $2^{2L} / \sqrt{2\pi L}$
• Needleman-Wunsch alignment:
  • 3 summations and a max operation per matrix entry
  • L × L matrix entries to compute
  • → $O(L^2)$
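To get a feel for the gap between the two approaches, the sketch below tabulates an exponential count against the L × L matrix size. It uses the central binomial coefficient C(2L, L) as the brute-force count, which is an assumption made for illustration (the slide gives only an approximation); the function names are hypothetical.

```python
from math import comb

def brute_force_count(L):
    """Assumed count of alignments to enumerate: C(2L, L).

    This choice is an illustrative assumption, not taken from the slide.
    """
    return comb(2 * L, L)

def needleman_wunsch_cells(L):
    """Number of matrix entries filled for two sequences of length L."""
    return L * L

for L in (5, 10, 20, 50):
    print(L, brute_force_count(L), needleman_wunsch_cells(L))
```

Even at L = 50 the matrix has only 2,500 entries, while the exponential count already has dozens of digits.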

Global Alignment
[Figure: alignment matrix for X and Y, with the global alignment path spanning both sequences end to end:]
X: -TGAACTCCTACTGT--AAG
Y: TTGTTCT--TACTGTCTAAG

Local Alignment
[Figure: alignment matrix with X = ACCGATGTACTGTAGGT along one axis and Y = GAGTCTACTGTTTAATC along the other]

Local alignment (Smith-Waterman)

Problem: Find optimal alignments between subsequences of X and Y.

Given $X_1 \ldots X_M$ and $Y_1 \ldots Y_N$, find i, j, k, l such that the score of the alignment between $X_i \ldots X_j$ and $Y_k \ldots Y_l$ is maximal.

Idea: If the alignment score becomes negative, it is better to start a new alignment, i.e. set the score to 0.

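Smith-Waterman algorithm

Initialization:
$F_{0,0} = 0$
$F_{0,j} = 0$ for $j = 1 \ldots N$
$F_{i,0} = 0$ for $i = 1 \ldots M$

Iteration: for each $i = 1 \ldots M$, for each $j = 1 \ldots N$:

$$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + s(X_i, Y_j) & \text{[$X_i$ aligned to $Y_j$]} \\ F_{i-1,j} - d & \text{[$X_i$ aligned to a gap]} \\ F_{i,j-1} - d & \text{[$Y_j$ aligned to a gap]} \\ 0 & \text{[start a new alignment]} \end{cases}$$

$$\mathrm{Ptr}_{i,j} = \begin{cases} \text{DIAG} & \text{if the first case gives the maximum} \\ \text{LEFT} & \text{if the second case gives the maximum} \\ \text{UP} & \text{if the third case gives the maximum} \end{cases}$$

Termination: The best local alignment score is the $F_{i,j}$ with maximum value. The best local alignment path can be traced back from the $\mathrm{Ptr}_{i,j}$ corresponding to that maximum $F_{i,j}$, stopping when a cell with score 0 is reached.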
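The local variant differs from the global one in three places: zero initialization, the extra 0 option in the max, and starting the traceback at the highest-scoring cell. A minimal sketch under assumed defaults (match +3, mismatch -3, gap -4, so m = 3, k = 3, d = 4) follows; the function name and the tie-breaking choices are hypothetical.

```python
def smith_waterman(x, y, m=3, k=3, d=4):
    """Local alignment of x and y. Returns (score, x_part, y_part)."""
    M, N = len(x), len(y)
    # Initialization: first row and column are all zeros.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = m if x[i - 1] == y[j - 1] else -k
            choices = [
                (0, None),                      # start a new alignment
                (F[i - 1][j - 1] + s, "DIAG"),  # X_i aligned to Y_j
                (F[i - 1][j] - d, "LEFT"),      # X_i aligned to a gap
                (F[i][j - 1] - d, "UP"),        # Y_j aligned to a gap
            ]
            # Ties resolve to the earliest entry, so a score of 0 ends the path.
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until a cell with no pointer is reached.
    ax, ay = [], []
    i, j = best_pos
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("TACGT", "ACT"))
```

If several cells share the maximum score, this sketch reports only the first one encountered in row-major order.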

Example local alignment

Problem: Align TACGT vs ACT
Scoring scheme: Match = +3, Mismatch = -3, Gap = -4

            T    A    C    G    T
       0    0    0    0    0    0
  A    0    0   +3    0    0    0
  C    0    0    0   +6   +2    0
  T    0   +3    0   +2   +3   +5

Question • In what scenario would local alignment be appropriate?

Summary
• Sequence alignment:
  • Modeling the evolutionary events that occurred in a pair of homologous sequences since their last common ancestor.
  • Placing gaps in sequences such that similarity is maximized.
• Dynamic programming: a strategy to solve a complex problem by breaking it into simpler subproblems.
• The Needleman-Wunsch algorithm uses a dynamic programming strategy to compute the optimal global alignment of two sequences.

Next up…
• Lecture 11: Sequence alignment continued


shaunmahony
