BMMB554 2016 Lecture 4

Anton Nekrutenko

February 03, 2016

Transcript

  1. Lecture 4
    Alignment basics

  2. Why sequence similarity?
    ‣ Mapping problems
    ‣ Database search problems
    ‣ Evolutionary problems
    ‣ Functional similarity

  3. s/t dot plot

  4. s/t dot plot

  5. s/t dot plot (word = 4)

  6. s/t dot plot (word = 4)

  7. Distance (D) and Similarity (S)

  8. You can turn s into t by:
    ‣ substituting a character in s with one from t
    ‣ deleting a character from s
    ‣ inserting a character from t into s

  9. general case

  10. AGAG (s) versus AGG (t)

  11. Distance (D) and Similarity (S)

  12. Adding cost

  13. DAG: directed acyclic graph
    Each cell (i, j) of the alignment matrix is reached by edges from (i−1, j), (i, j−1),
    and (i−1, j−1); the diagonal edge corresponds to aligning the pair (si, tj), the other
    two edges to gaps.

  14. Dynamic programming:
    AGAG versus AGG
    filling matrix

  15. Dynamic programming:
    AGAG versus AGG
    filling matrix

  16. Dynamic programming:
    AGAG versus AGG
    filling matrix

  17. Dynamic programming:
    AGAG versus AGG
    backtracking

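    The dynamic-programming idea on slides 13-17 can be written out directly. Below is a
    minimal Python sketch (not from the slides) of the global fill-and-backtrack step for
    AGAG versus AGG; the +1/-1/-1 scoring is an assumption for illustration.

    def global_align(s, t, match=1, mismatch=-1, gap=-1):
        """Global (Needleman-Wunsch) alignment: fill the DP matrix, then backtrack."""
        def sigma(a, b):
            return match if a == b else mismatch
        n, m = len(s), len(t)
        S = [[0] * (m + 1) for _ in range(n + 1)]   # S[i][j]: best score for s[:i] vs t[:j]
        for i in range(1, n + 1):
            S[i][0] = i * gap
        for j in range(1, m + 1):
            S[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                S[i][j] = max(S[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # diagonal
                              S[i - 1][j] + gap,                            # gap in t
                              S[i][j - 1] + gap)                            # gap in s
        # Backtracking: walk from (n, m) to (0, 0), preferring the diagonal move.
        a, b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
                a.append(s[i - 1]); b.append(t[j - 1]); i, j = i - 1, j - 1
            elif i > 0 and S[i][j] == S[i - 1][j] + gap:
                a.append(s[i - 1]); b.append('-'); i -= 1
            else:
                a.append('-'); b.append(t[j - 1]); j -= 1
        return S[n][m], ''.join(reversed(a)), ''.join(reversed(b))

    print(global_align("AGAG", "AGG"))   # (2, 'AGAG', 'AG-G')

    Each cell depends only on the three neighbors shown in the DAG of slide 13, so the fill
    is O(nm).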

  18. Effect of gap penalties

  19. Global versus Local

  20. Global versus local alignment of s = CAGAG and t = GAGGC (match +1, mismatch −1, gap −1)
    Global (Needleman-Wunsch) matrix:
             -    C    A    G    A    G
        -    0   -1   -2   -3   -4   -5
        G   -1   -1   -2   -1   -2   -3
        A   -2   -2    0   -1    0   -1
        G   -3   -3   -1    1    0    1
        G   -4   -4   -2    0    0    1
        C   -5   -3   -3   -1   -1    0
    Local (Smith-Waterman) matrix:
             -    C    A    G    A    G
        -    0    0    0    0    0    0
        G    0    0    0    1    0    1
        A    0    0    1    0    2    1
        G    0    0    0    2    1    3
        G    0    0    0    1    1    2
        C    0    1    0    0    0    1
    The optimal global alignment is
        CAGAG-
        GAG-GC
    with score 0, while an optimal local alignment is
        GAG
        GAG
    with score 3. Note that the optimal global alignment fails to align the two identical
    subsequences. A requirement for the local alignment algorithm is that the scoring
    scheme must be such that random matches have an expected score of less than zero.
    Thus, it is very sensitive to a realistic choice of scoring scheme.

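    A companion sketch of the local (Smith-Waterman) fill for the example above: every cell
    is floored at zero and traceback starts from the highest-scoring cell. The +1/-1/-1
    scoring is again an assumption.

    def local_align(s, t, match=1, mismatch=-1, gap=-1):
        """Local (Smith-Waterman) alignment; returns best score and the aligned substrings."""
        n, m = len(s), len(t)
        S = [[0] * (m + 1) for _ in range(n + 1)]
        best, best_ij = 0, (0, 0)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = match if s[i - 1] == t[j - 1] else mismatch
                S[i][j] = max(0,                        # option to start a new local alignment
                              S[i - 1][j - 1] + sub,
                              S[i - 1][j] + gap,
                              S[i][j - 1] + gap)
                if S[i][j] > best:
                    best, best_ij = S[i][j], (i, j)
        a, b = [], []
        i, j = best_ij                                  # trace back from the maximum cell
        while i > 0 and j > 0 and S[i][j] > 0:          # ...until a zero cell is reached
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                a.append(s[i - 1]); b.append(t[j - 1]); i, j = i - 1, j - 1
            elif S[i][j] == S[i - 1][j] + gap:
                a.append(s[i - 1]); b.append('-'); i -= 1
            else:
                a.append('-'); b.append(t[j - 1]); j -= 1
        return best, ''.join(reversed(a)), ''.join(reversed(b))

    print(local_align("CAGAG", "GAGGC"))   # (3, 'GAG', 'GAG')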

  21. More realistic scoring schemes

  22. Arbitrary gap penalties
    A single per-position gap penalty charges an indel of g positions g times the cost of a
    single position. However, this is unrealistic, particularly for one kind of event:
    indels of multiple positions do occur as single events in reality, and can be nearly as
    likely as indels of a single position. Thus, we would like to handle gaps more
    realistically.
    Define an arbitrary function γ(g) that specifies the cost of a gap (indel) of length g.
    The recurrence for finding the optimal alignment becomes:
        S(s0..i, t0..j) = max of:
            σ(si, tj) + S(s0..i−1, t0..j−1)
            −γ(k) + S(s0..i−k, t0..j)    for k = 1 … i
            −γ(k) + S(s0..i, t0..j−k)    for k = 1 … j
    Since we have added iteration over ~n possible gap lengths to the computation of each
    cell, the algorithm is now O(n³).

  23. Affine gap penalties
    For certain specific forms of γ(g) the computation can be bounded. The common case is
    the affine gap score, in which the cost of a gap depends on only two values: an
    initiation cost d charged to any gap regardless of length, and an extension cost e for
    each additional base in the gap.
    Specific gap penalty function: γ(g) = d + (g − 1)e
    This can be computed by the following recurrences:
        S(s0..i, t0..j) = max of:
            σ(si, tj) + S(s0..i−1, t0..j−1)
            σ(si, tj) + I(s0..i−1, t0..j−1)
        I(s0..i, t0..j) = max of:
            −d + S(s0..i, t0..j−1)
            −e + I(s0..i, t0..j−1)
            −d + S(s0..i−1, t0..j)
            −e + I(s0..i−1, t0..j)

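    A sketch of the two-matrix affine-gap recurrence above (score only, no traceback). The
    gap-open cost d = 5, gap-extend cost e = 1, and the +1/-1 substitution scores are
    illustrative assumptions.

    NEG = float('-inf')

    def affine_align_score(s, t, match=1, mismatch=-1, d=5, e=1):
        """Global alignment score under affine gap penalties, per the S/I recurrence above."""
        def sigma(a, b):
            return match if a == b else mismatch
        n, m = len(s), len(t)
        S = [[NEG] * (m + 1) for _ in range(n + 1)]   # S: column ends in an aligned pair
        I = [[NEG] * (m + 1) for _ in range(n + 1)]   # I: column ends in a gap
        S[0][0] = 0
        for i in range(1, n + 1):
            I[i][0] = -d - (i - 1) * e                # leading gap in t
        for j in range(1, m + 1):
            I[0][j] = -d - (j - 1) * e                # leading gap in s
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                S[i][j] = sigma(s[i - 1], t[j - 1]) + max(S[i - 1][j - 1], I[i - 1][j - 1])
                I[i][j] = max(S[i][j - 1] - d, I[i][j - 1] - e,    # open / extend a gap in s
                              S[i - 1][j] - d, I[i - 1][j] - e)    # open / extend a gap in t
        return max(S[n][m], I[n][m])

    print(affine_align_score("AGAG", "AGG"))   # -2: three matches (+3) minus one opened gap (-5)

    Because the gap length never has to be enumerated, the fill stays O(nm) rather than the
    O(n³) of the arbitrary-γ recurrence.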

  24. Summary
    • Alignment as a recursive problem with “optimal substructure”
    • Solved efficiently with “dynamic programming”
    • Global (Needleman-Wunsch 1970)
    • Local (Smith-Waterman 1981)
    • Under specific models and scoring schemes
    • Similarity matrix + linear or affine gap penalty (Gotoh 1982)

  25. Banded Smith-Waterman
    • If we expect few gaps, or gaps are very costly,
    then we would expect the optimal alignment to lie
    mostly along a diagonal
    • In banded Smith-Waterman, we ignore cells more
    than a certain distance from the diagonal (set
    them to zero). Thus, the complexity is reduced
    from O(nm) to ~O(nk), where k is the width of the
    band (see the sketch after this list)
    • (Similar to the microsatellite finding idea)

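    A minimal sketch of the banding idea from the second bullet above: cells farther than
    the band width from the main diagonal are simply never computed, so only ~O(n * band)
    cells are filled. The scoring values are assumptions.

    def banded_local_score(s, t, band=2, match=1, mismatch=-1, gap=-1):
        """Smith-Waterman score restricted to a band around the main diagonal."""
        n, m = len(s), len(t)
        S = [[0] * (m + 1) for _ in range(n + 1)]    # cells outside the band stay 0
        best = 0
        for i in range(1, n + 1):
            lo, hi = max(1, i - band), min(m, i + band)
            for j in range(lo, hi + 1):              # ~O(n * band) cells instead of O(n * m)
                sub = match if s[i - 1] == t[j - 1] else mismatch
                S[i][j] = max(0,
                              S[i - 1][j - 1] + sub,
                              S[i - 1][j] + gap,
                              S[i][j - 1] + gap)
                best = max(best, S[i][j])
        return best

    print(banded_local_score("CAGAG", "GAGGC", band=2))   # 3 (the GAG/GAG match lies in the band)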

  26. Banded Smith-Waterman
    • Main problem: this is local alignment, so the optimal
    alignment may not lie near the main diagonal
    • How can we decide what bands to search?
    • Inspiration: the dot plot; use short matches to guide
    the search

  27. (human/marmoset genomic DNA)

  28. Banded Smith-Waterman
    • Main problem: this is local alignment, so the optimal
    alignment may not lie near the main diagonal
    • How can we decide what bands to search?
    • Inspiration: the dot plot; use short matches to guide
    the search

  29. Approximate local alignment with
    word-match heuristics

  30. Approximating Smith-Waterman
    • Assumption: the optimal local alignment will
    include short regions of very high similarity or
    identity
    • Heuristic: rather than considering all possible
    alignments, consider only those that pass
    through or near short regions of high similarity

  31. Filtration algorithm concept
    • Find a set of candidate partial solutions (seeds)
    using a fast method
    • Filter candidates using one or a series of more
    expensive criteria
    • Report results

  32. Seeds
    FASTA: for DNA, all exact matches of length 4 to 6
    BLAST: for DNA, all exact matches of length 11

  33. Filtration
    FASTA: 1) Rank diagonals by the number of exact matches
    (roughly) and keep only the top 10
    BLAST: 1) Keep only seeds that are paired with another seed
    on the same diagonal within a limited distance

  34. Filtration
    FASTA: 2) Rescore the top runs using a similarity matrix and
    keep only those over a threshold; attempt to join nearby diagonals
    BLAST: 2) Extend (ungapped) until the score drops by a certain
    amount below the best seen; keep the best if its score is over
    a threshold

  35. ungapped extension

  36. Filtration
    FASTA: 3) Perform a banded Smith-Waterman alignment along
    the diagonals that score over the required threshold
    BLAST: 3) From the center of every HSP, perform a Smith-Waterman
    style alignment, stopping when the score drops by a certain amount

  37. banded extension versus score-bounded extension

  38. Performing initial seed match
    • Hash table:
    • Data structure that provides a mapping from
    keys to values (think dictionary)
    • Inserting a new key/value pair and looking up
    the value for a key are both O(1)

  39. Performing initial seed match
    target = “ACAGGAGGGAGA”
    Table:
    ACA → 0
    CAG → 1
    AGG → 2, 5
    GGA → 3, 7
    GAG → 4, 8
    GGG → 6
    AGA → 9
    n - 2 3-tuples,
    n - 2 inserts,
    Build time O(n)

  40. Performing initial seed match
    query = “GGAGCCC”
    Table:
    ACA → 0
    CAG → 1
    AGG → 2, 5
    GGA → 3, 7
    GAG → 4, 8
    GGG → 6
    AGA → 9
    Lookups:
    GGA → 3, 7
    GAG → 4, 8
    AGC →
    GCC →
    CCC →
    m - 2 3-tuples,
    m - 2 lookups,
    Lookup time O(m)

  41. Performing initial seed match
    m - 2 3-tuples,
    m - 2 lookups,
    Lookup time O(m)
    n - 2 3-tuples,
    n - 2 inserts,
    Build time O(n)
    To find all matches,
    build and then lookup:
    Total time O(n+m)

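    A sketch of the build-then-lookup step from slides 39-41, using the same target, query,
    and word length; the function names are illustrative, not from any particular tool.

    from collections import defaultdict

    def build_index(target, k=3):
        """Hash table mapping each k-mer of the target to its start positions; O(n) build."""
        index = defaultdict(list)
        for pos in range(len(target) - k + 1):
            index[target[pos:pos + k]].append(pos)
        return index

    def find_seeds(query, index, k=3):
        """Look up every k-mer of the query; O(m) lookups, O(n + m) overall with the build."""
        seeds = []                                   # (query position, target position) pairs
        for qpos in range(len(query) - k + 1):
            for tpos in index.get(query[qpos:qpos + k], []):
                seeds.append((qpos, tpos))
        return seeds

    index = build_index("ACAGGAGGGAGA")
    print(dict(index))
    # {'ACA': [0], 'CAG': [1], 'AGG': [2, 5], 'GGA': [3, 7], 'GAG': [4, 8], 'GGG': [6], 'AGA': [9]}
    print(find_seeds("GGAGCCC", index))
    # [(0, 3), (0, 7), (1, 4), (1, 8)]

    Each seed (qpos, tpos) lies on the diagonal tpos - qpos, which is what the FASTA and
    BLAST filtration steps operate on.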

  42. Similarities between BLAST and FASTA
    • Fairly similar core idea, Filtration:
    • Identify a set of initial candidate matches
    (seeds)
    • Filter seeds using heuristics
    • Extend seeds that pass filters using bounded
    dynamic programming

  43. Extension
    • The vast majority of running time is in extension
    • The key to good performance is identifying just
    the right number of seeds to extend
    • Seed criteria must not be so stringent as to
    miss many optimal alignments
    • But must be stringent enough to achieve good
    performance

  44. Seeding

  45. Some types of seeds
    • Exact matches of length k (exact seeds)
    • Pairs that produce a score over T when matched
    (score seeds)
    • Specific patterns of matches spaced with
    mismatches (spaced seeds)
    • Fixed or spaced seeds allowing transitions at one
    or more positions (transition seeds)
    • Arbitrary combinations

  46. Evaluating seeds
    • Seed sensitivity: probability that a biologically
    meaningful alignment contains a match to that seed
    • Seed specificity: 1 - probability of finding a seed
    match in unrelated sequences

  47. Balancing accuracy and performance
    • Fixed seeds: hash table
    • Scored seeds: neighborhood strategy + hash table
    • Spaced seeds: shifting + hash table (non-match
    positions can be completely ignored; see the sketch
    after this list)
    • Transition seeds: transition-insensitive encoding +
    hash table

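    A small sketch of the spaced-seed key extraction mentioned above: only the '1' positions
    of the pattern contribute to the hash key, so the '0' positions are ignored entirely.
    The pattern used here is a short made-up example, not the PatternHunter seed.

    PATTERN = "1101011"                      # weight 5, span 7 (illustrative only)
    ONES = [i for i, c in enumerate(PATTERN) if c == "1"]

    def spaced_keys(seq, span=len(PATTERN)):
        """Return (position, key) pairs where the key keeps only the '1' positions."""
        keys = []
        for pos in range(len(seq) - span + 1):
            window = seq[pos:pos + span]
            keys.append((pos, "".join(window[i] for i in ONES)))
        return keys

    print(spaced_keys("ACAGTCA"))   # [(0, 'ACGCA')]
    print(spaced_keys("ACGGTCA"))   # [(0, 'ACGCA')]  differs only at a '0' position, same key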

  48. Seeds matter
    • Different seeds may be more or less effective for
    particular classes of elements
    • For non-coding genome alignment, transition
    seeds are much better than fixed spaced seeds
    • Allowing transitions at any position is
    advantageous
    • Optimal seed design is hard
    (Sun and Buhler, 2006)

  49. Scoring parameters

  50. Inferring a scoring matrix from data
    • For proteins, estimate from hand-tuned structural
    alignments, using a time-dependent model for
    extrapolation
    • No great analog for genomic alignments

  51. Inferring a scoring matrix from data
    • Start with two high quality sequences with
    known homology
    • Align with ungapped BLAST, +1/-1 scoring
    • Discard high scoring HSPs (>70% identity)
    • Infer new scoring scheme as log-odds ratio
    (Chiaromonte et al. 2002):
    The score of the alignment column x-over-y is the log of an “odds ratio”:
        s(x, y) = log [ p(x, y) / (q1(x) q2(y)) ]
    where p(x, y) is the frequency of x-over-y in the training set, expressed as a fraction
    of the observed aligned pairs, and q1(x) and q2(y) denote the background frequencies of
    nucleotides x and y in the upper and lower components, respectively.
    Pacific Symposium on Biocomputing 7:115-126 (2002)

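    A sketch of the log-odds computation defined above, with tiny made-up pair counts; real
    applications estimate these from genome-scale alignments, and the resulting values are
    scaled and rounded to integers (as in the HOXD matrix on the next slide).

    from math import log

    def log_odds_matrix(pair_counts):
        """s(x, y) = log( p(x, y) / (q1(x) * q2(y)) ), as in Chiaromonte et al. 2002."""
        total = sum(pair_counts.values())
        p = {xy: c / total for xy, c in pair_counts.items()}   # observed pair frequencies
        q1, q2 = {}, {}                                        # background frequencies (upper, lower)
        for (x, y), f in p.items():
            q1[x] = q1.get(x, 0.0) + f
            q2[y] = q2.get(y, 0.0) + f
        return {(x, y): log(f / (q1[x] * q2[y])) for (x, y), f in p.items()}

    # Made-up counts of aligned pairs (upper nucleotide, lower nucleotide):
    counts = {("A", "A"): 90, ("A", "G"): 10, ("G", "A"): 12, ("G", "G"): 88}
    for xy, score in log_odds_matrix(counts).items():
        print(xy, round(score, 2))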

  52. The HOXD matrix (47.5% G+C):
             A      C      G      T
        A   91   -114    -31   -123
        C -114    100   -125    -31
        G  -31   -125    100   -114
        T -123    -31   -114     91
    Such matrices can then be used in the traditional way.
    (Chiaromonte et al. 2002)

  53. Long genomic alignments
    blastz and lastz

  54. Blastz ideas
    • 12-of-19 spaced seed (from PatternHunter) but
    allowing two transitions
    • Soft and dynamic masking of known repeats, and
    regions that generate highly repetitive
    alignments
    • Chaining: reduce overlapping alignments to a
    maximal scoring non-overlapping set
    • Interpolation: realign between best alignments
    with more sensitive parameters

  55-69. (image-only slides with no transcript text)