Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Topic 3 (BMMB554)

Topic 3 (BMMB554)

Sequence alignment: Introductory ideas. Slides for BMMB554 on-line course held at Penn State in Fall of 2015

Anton Nekrutenko

September 10, 2015
Tweet

More Decks by Anton Nekrutenko

Other Decks in Education

Transcript

  1. Why sequence similarity? ‣ Mapping problems ‣ Database search problems

    ‣ Evolutionary problems ‣ Functional similarity
  2. you can turn s into t by: ‣ substituting a

    character in s from t ‣ delete character in s ‣ insert a character from t to s
  3. DAG: directed acyclic graph 1 3 2 4 5 (i

    1, j) (i, j 1) (i 1, j 1) (i, j) (si, tj )
  4. C A G A G G A G G C

    0 -1 -2 -3 -4 -5 -1 -1 -2 -3 -4 -3 -2 -2 0 -1 -2 -3 -3 -1 -1 1 0 -1 -4 -2 0 0 0 -1 -5 -3 -1 1 1 0 C A G A G G A G G C 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 2 1 0 0 0 2 1 1 0 0 1 1 3 2 1 l global alignment is CAGAG- GAG-GC with score 0, while an optimal local alignment is G G ote that the optimal global alignment fails to align the two identical subsequences. r the local alignment algorithm is that the scoring scheme must be such that random pected score of less than zero. ￿us, it is very sensitive to a realistic choice of scorin -1 -2 -3 -4 -5 -1 -2 -3 -4 -3 -2 0 -1 -2 -3 -1 -1 1 0 -1 -2 0 0 0 -1 -3 -1 1 1 0 G A G G C 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 2 1 0 ignment is CAGAG- GAG-GC with score 0, while an optimal lo e optimal global alignment fails to align the two identi alignment algorithm is that the scoring scheme must b G A G G C 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 2 1 0 0 2 1 1 0 1 1 3 2 1 while an optimal local alignment is GAG GAG with lign the two identical subsequences. ring scheme must be such that random matches
  5. Arbitrary gap penalties single position). However, this is unrealistic, particularly

    for one kind of event. In o do occur as single character events in reality. Indels of multiple positions can be of a single position. ￿us, we would like to handle gaps more realistically. trary Gap Penalties ne an arbitrary function γ(g) that speci￿es the cost of a gap or indel of length e for ￿nding the optimal alignment becomes: S(s0..i , t0..j ) = max ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ σ(si , tj ) + S(s0..i−1 , t0..j−1 ) −γ(k) + S(s0..i−k , t0..j )for k = 0 . . . i − 1 −γ(k) + S(s0..i , t0..j−k )for k = 0 . . . j − 1 we have added iteration over ∼ n possible gap lengths to the computation of ea hm is now O(n3). ne gap penalties n speci￿c forms of γ(g) the computation can be bounded. ￿e common case score, in which the score of a gap depends on only two values, an initiation c dless of length, and an extension cost e for each additional base in the gap. In o − (g − 1)e. ￿is can be computed by the following recurrences:
  6. Affine gap penalties ￿.￿ A￿ne gap penalties For certain speci￿c

    forms of γ( a￿ne gap score, in which the s gaps regardless of length, and a γ(g) = −d − (g − 1)e. ￿is can b S(so...i , I(so...i , Specific gap penalty function: e gap penalties n speci￿c forms of γ(g) the computation can be bounded. ￿e common case score, in which the score of a gap depends on only two values, an initiation dless of length, and an extension cost e for each additional base in the gap. In − (g − 1)e. ￿is can be computed by the following recurrences: S(so...i , t0...j ) = max￿ σ(si , tj ) + S(s0...i−1 , t0...j−1 ) σ(si , tj ) + I(s0...i−1 , t0...j−1 ) I(so...i , t0...j ) = max ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ −d + S(s0...i , t0...j−1 ) −e + I(s0...i , t0...j−1 ) −d + S(s0...i−1 , t0...j ) −e + I(s0...i−1 , t0...j )
  7. • Alignment as a recursive problem with “optimal substructure” •

    Solved efficiently with “dynamic programming” • Global (Needleman-Wunsch 1970) • Local (Smith-Waterman 1981) • Under specific models and scoring schemes • Similarity matrix + linear or affine gap penalty (Gotoh 1982) Summary
  8. Banded Smith-Waterman • If we expect few gaps, or gaps

    are very costly, then we would expect the optimal alignment lie mostly on a diagonal • In banded Smith-Waterman, we ignore cells more than a certain distance from the diagonal (set them to zero). Thus, the complexity is reduced from O(n m) to ~O(n k) where k is the width of the band • (Similar to the microsatellite finding idea)
  9. Banded Smith-Waterman • Main problem, this is local alignment, the

    optimal alignment may not be on the main diagonal • How can we decide what bands to search? • Inspiration: dotplot, use short matches to guide search
  10. Banded Smith-Waterman • Main problem, this is local alignment, the

    optimal alignment may not be on the main diagonal • How can we decide what bands to search? • Inspiration: dotplot, use short matches to guide search
  11. Approximating Smith-Waterman • Assumption: the optimal local alignment will include

    short regions of very high similarity or identity • Heuristic: rather that considering all possible alignments, consider only those that pass through or near short regions of high similarity
  12. Filtration algorithm concept • Find a set of candidate partial

    solutions (seeds) using a fast method • Filter candidates using one or a series of more expensive criteria • Report results
  13. Seeds FASTA BLAST For DNA: All exact matches of length

    4 to 6 For DNA: All exact matches of length 11
  14. Filtration FASTA BLAST 1) Rank diagonals by number of exact

    matches (roughly) and keep only the top 10 1) Keep only seeds that are paired with another seed on the same diagonal within a limited distance
  15. Filtration FASTA BLAST 2) Rescore top runs using similarity matrix,

    keep only those over threshold. Attempt to join nearby diagonals 2) Extend (ungapped) until score drops by a certain amount over best seen, keep best if score over threshold
  16. Filtration FASTA BLAST 3) Perform a banded Smith- Waterman alignment

    along the diagonals that score over the required threshold 3) From the center of every HSP, perform a Smith- Waterman style alignment, stopping when the score drops over a certain amount
  17. Performing initial seed match • Hash table: • Data structure

    that provides a mapping from keys to values (think dictionary) • Inserting a new key/value pair and looking up the value for a key are both O(1)
  18. Performing initial seed match target = “ACAGGAGGGAGA” Table: ACA →

    0 CAG → 1 AGG → 2, 5 GGA → 3, 7 GAG → 4, 8 GGG → 6 AGA → 9 n - 3 3-tuples, n - 3 inserts, Build time O(n)
  19. Performing initial seed match query = “GGAGCCC” Table: ACA →

    0 CAG → 1 AGG → 2, 5 GGA → 3, 7 GAG → 4, 8 GGG → 6 AGA → 9 Lookups: GGA → 3, 7 GAG → 4, 8 AGC → GCC → CCC → m - 3 3-tuples, m - 3 lookups, Lookup time O(m)
  20. Performing initial seed match m - 3 3-tuples, m -

    3 lookups, Lookup time O(m) n - 3 3-tuples, n - 3 inserts, Build time O(n) To find all matches, build and then lookup: Total time O(n+m)
  21. BLAST Algorithm 1.Find all matches of at least length k

    • For DNA, these are exact matches • For Proteins, mismatches are allowed up front as long as the score is over some threshold 2.Extend exact matches without allowing gaps to form HSPs (can’t be extended more without decreasing score)
  22. BLAST Algorithm 1.Find all matches of at least length k

    • For DNA, these are exact matches • For Proteins, mismatches are allowed up front as long as the score is over some threshold 2.Extend exact matches without allowing gaps to form HSPs (heuristic: extension stops when difference between best score and score after extension is greater than X) 3.Report highest scoring results
  23. Two-hit gapped BLAST algorithm 1.Find initial matches as before 2.Require

    two high scoring words on the same diagonal to trigger extension 3.If ungapped extension scores over a certain threshold, do a gapped extension • Like banded alignment, but region is determined by score drop heuristic rather than distance from diagonal
  24. Blastz ideas • 12of19 spaced seed (from PatternHunter) but allowing

    two transitions • Soft and dynamic masking of known repeats, and regions that generate highly repetitive alignments • Chaining: reduce overlapping alignments to a maximal scoring non-overlapping set • Interpolation: realign between best alignments with more sensitive parameters