Wabi Four Russians Algorithm

ghuryejay
August 23, 2017

Slide deck for the talk given at WABI 2017 in Boston.

Transcript

  1. Better Greedy Sequence Clustering with Fast Banded Alignment

    Brian Brubach, Jay Ghurye, Mihai Pop and Aravind Srinivasan. Department of Computer Science, University of Maryland – College Park. WABI-2017
  2. Why is clustering difficult?

    Hierarchical clustering needs $O(n^2)$ comparisons, and each comparison is non-trivial. For a dataset with 3 million sequences, that is $9 \times 10^{12}$ comparisons (9 trillion)!
  3. Why is clustering difficult?

    Curse of dimensionality! How many DNA sequences lie within 5 mismatches in the first 500 bp and one mismatch in the last position? $3 \cdot 3^{5} \cdot \binom{500}{5}$, which is on the order of $10^{14}$. Filtering methods based on k-mers won't help!
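    The count on this slide can be checked with a few lines of arithmetic. The sketch below assumes exactly 5 mismatches among the first 500 positions, one mismatch in the last position, and a 4-letter DNA alphabet (so 3 alternative bases per mismatch); the function name and parameters are illustrative, not from the talk.

```python
from math import comb


def count_neighbors(prefix_len=500, mismatches=5, alphabet_size=4):
    """Count sequences with `mismatches` substitutions in the prefix
    and one substitution in the final position."""
    alternatives = alphabet_size - 1  # bases that differ from the original
    # Choose which prefix positions change, then which base each becomes.
    prefix_variants = comb(prefix_len, mismatches) * alternatives ** mismatches
    # The last position must also change: one of the 3 other bases.
    last_position_variants = alternatives
    return prefix_variants * last_position_variants


print(f"{count_neighbors():.2e}")
```

    With this many neighbours, enumerating candidate sequences around each center is not an option, which is the point the slide makes about filtering.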
  4. Greedy clustering

    1. Select a cluster center. 2. Recruit sequences to center. 3. Repeat until no more sequences remain. Any sequence within a specified edit distance or similarity from the cluster center is recruited by the cluster center.
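    The three steps can be sketched as a short loop. This is a minimal illustration, not the EDIT implementation: it assumes the caller has already ordered the sequences (for example by length or abundance) so that the first remaining sequence is the next center, and it uses a plain quadratic edit distance. The names `greedy_cluster` and `max_dist` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the DP matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def greedy_cluster(seqs, max_dist, distance=edit_distance):
    """Return a list of (center, recruited_sequences) pairs."""
    remaining = list(seqs)
    clusters = []
    while remaining:                       # 3. repeat until nothing remains
        center = remaining[0]              # 1. select a cluster center
        recruited, leftover = [], []
        for s in remaining[1:]:            # 2. recruit sequences to the center
            (recruited if distance(center, s) <= max_dist else leftover).append(s)
        clusters.append((center, recruited))
        remaining = leftover
    return clusters


reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "TTTTTTTA", "ACGAACGT"]
for center, members in greedy_cluster(reads, max_dist=1):
    print(center, members)
```

    Every recruitment pass compares the center against all remaining sequences, which is why the cost of a single comparison dominates the running time.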
  5. Step 1: Select a cluster center. …

    By length (longest remaining sequence): CD-HIT (Li et al. 2006), UCLUST (Edgar 2010), DNACLUST (Ghodsi et al. 2011). Possible to preserve the triangle inequality in semi-global / local alignment. By abundance (run fast de-replication): UCLUST (Edgar 2010), EDIT (this work, 2017). More likely to choose “true” cluster centers.
  6. Step 2: Recruit sequences to center.

    Example alignment: ATGTGA against -TGTCA, with match line x|||x| (| = match, x = edit). Edits are substitutions, insertions and deletions. Measures: edit (Levenshtein) distance; similarity (DNACLUST).
  7. Problem: How to calculate the edit distance between AGGTATCGC and ATGGC?

    Solution: dynamic programming. It takes $O(m^2)$ time, where m is the length of the sequence. The Strong Exponential Time Hypothesis (SETH) implies that it cannot be done in strongly subquadratic time, i.e. in $O(m^{2-\epsilon})$ for any $\epsilon > 0$ (Backurs and Indyk, 2015).
  8. Dynamic programming

    $D(i, j) = \min\{\, D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + (s_1[i] = s_2[j] \;?\; 0 : 1) \,\}$. Theorem: Under the standard formulation, any two adjacent cells in the edit distance matrix differ by at most 1. Hence the difference between adjacent cells can only be 0, 1 or -1 (Gusfield, 1997).
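    The recurrence translates directly into a table-filling routine. The sketch below is a textbook implementation with hypothetical function names; it also checks the adjacent-cell property from the theorem on the example pair from the previous slide.

```python
def edit_matrix(s1, s2):
    """Fill the full (len(s1)+1) x (len(s2)+1) edit distance matrix."""
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i                      # delete the first i characters of s1
    for j in range(cols):
        D[0][j] = j                      # insert the first j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + substitution)
    return D


def edit_distance(s1, s2):
    return edit_matrix(s1, s2)[-1][-1]


def adjacent_cells_differ_by_at_most_one(D):
    """Check the horizontal and vertical neighbours of every cell."""
    for i in range(len(D)):
        for j in range(len(D[0])):
            if i and abs(D[i][j] - D[i - 1][j]) > 1:
                return False
            if j and abs(D[i][j] - D[i][j - 1]) > 1:
                return False
    return True


matrix = edit_matrix("AGGTATCGC", "ATGGC")
print(matrix[-1][-1], adjacent_cells_differ_by_at_most_one(matrix))
```

    Because neighbouring cells differ by -1, 0 or 1, a row or column can be stored as a short vector of offsets instead of absolute values, which is what the block-based speedups rely on.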
  9. Standard Four Russians Speedup

    T = log(m). Running time = $O(m^2 / \log m)$. Block function: F(s1, s2, ) = [Figure: DP matrix of Sequence 1 against Sequence 2 (letters A T T G A T T A G C) divided into blocks, with offset values 1, 0 and -1 on the block boundaries.]
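    A block function of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the arguments missing from the transcript are assumed to be the offset vectors along the block's top row and left column, and the lookup table is filled lazily with `lru_cache` rather than precomputed for every possible input as in the classic method. All names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # plays the role of the lookup table
def block(cols, rows, top, left):
    """Map a block's inputs to its outputs.

    cols, rows: substrings labelling the block's columns and rows.
    top, left:  offsets (-1, 0 or 1) along the top row and left column,
                relative to the block's top-left corner.
    Returns the offsets along the bottom row and the right column.
    """
    w, h = len(cols), len(rows)
    cell = [[0] * (w + 1) for _ in range(h + 1)]
    for j in range(1, w + 1):
        cell[0][j] = cell[0][j - 1] + top[j - 1]
    for i in range(1, h + 1):
        cell[i][0] = cell[i - 1][0] + left[i - 1]
    for i in range(1, h + 1):
        for j in range(1, w + 1):
            cell[i][j] = min(cell[i - 1][j] + 1,
                             cell[i][j - 1] + 1,
                             cell[i - 1][j - 1] + (rows[i - 1] != cols[j - 1]))
    bottom = tuple(cell[h][j] - cell[h][j - 1] for j in range(1, w + 1))
    right = tuple(cell[i][w] - cell[i - 1][w] for i in range(1, h + 1))
    return bottom, right


def four_russians_distance(a, b, t):
    """Edit distance of a and b, computed block by block with block size t."""
    col_blocks = [a[j:j + t] for j in range(0, len(a), t)]
    row_blocks = [b[i:i + t] for i in range(0, len(b), t)]
    # Offsets along the current horizontal boundary; the first row is 0, 1, 2, ...
    tops = [(1,) * len(c) for c in col_blocks]
    for rows in row_blocks:
        left = (1,) * len(rows)  # the first column is 0, 1, 2, ...
        for k, cols in enumerate(col_blocks):
            tops[k], left = block(cols, rows, tops[k], left)
    # The bottom-left cell equals len(b); add the offsets along the bottom row.
    return len(b) + sum(sum(offsets) for offsets in tops)


print(four_russians_distance("AGGTATCGC", "ATGGC", 3))
```

    Only offsets cross block boundaries, so two blocks with the same substrings and the same input offsets always produce the same output, whatever their absolute position in the matrix.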
  10. Banded Four Russians Speedup

    Block function: F(s1, s2, ) = [Figure: overlapping blocks over S1 and S2, with offset values 0, 1 and -1 and a value δ on the block boundaries.]
  11. Constructing trie-like data structure

    S1 : A C T G G A C A G T T. S2 : A C T G G A C A A A C. S3 : A C T G G T C A A A C. Block size = 5, overlap = 2. Substrings: none yet.
  12. Constructing trie-like data structure

    Block 1 of S1 is marked. Substrings: 1: A C T G G.
  13. Constructing trie-like data structure

    Blocks 1 and 2 of S1 are marked. Substrings: 1: A C T G G; 2: G G A C A.
  14. Constructing trie-like data structure

    Blocks 1, 2 and 3 of S1 are marked. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T.
  15. Constructing trie-like data structure

    S1 is encoded as S1 : 1 2 3. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T.
  16. Constructing trie-like data structure

    S1 : 1 2 3 is inserted into the trie. Trie: Root → 1 → 2 → 3 (S1).
  17. Constructing trie-like data structure

    Blocks 1, 2 and 4 of S2 are marked. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T; 4: C A A A C. Trie: Root → 1 → 2 → 3 (S1).
  18. Constructing trie-like data structure

    S2 is encoded as S2 : 1 2 4. Trie: Root → 1 → 2 → 3 (S1).
  19. Constructing trie-like data structure

    S2 : 1 2 4 is inserted into the trie. Trie: Root → 1 → 2 → 3 (S1); Root → 1 → 2 → 4 (S2).
  20. Constructing trie-like data structure

    Blocks 1, 5 and 4 of S3 are marked. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T; 4: C A A A C; 5: G G T C A.
  21. Constructing trie-like data structure

    S3 is encoded as S3 : 1 5 4. Trie: Root → 1 → 2 → 3 (S1); Root → 1 → 2 → 4 (S2).
  22. Constructing trie-like data structure

    S3 : 1 5 4 is inserted into the trie. Encodings: S1 : 1 2 3; S2 : 1 2 4; S3 : 1 5 4. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T; 4: C A A A C; 5: G G T C A. Trie: Root → 1 → 2 → 3 (S1); Root → 1 → 2 → 4 (S2); Root → 1 → 5 → 4 (S3).
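    The construction walked through above can be sketched in a few lines. The sketch assumes that blocks start every (block size - overlap) positions, that substring IDs are handed out in order of first appearance, and that the trie is stored as nested dictionaries keyed by block ID; these representation choices and all names are illustrative.

```python
def split_blocks(seq, block_size, overlap):
    """Cut seq into blocks of block_size that share `overlap` characters."""
    step = block_size - overlap
    return [seq[i:i + block_size] for i in range(0, len(seq) - overlap, step)]


def build_trie(seqs, block_size, overlap):
    """Return (substring -> id, name -> list of block ids, nested-dict trie)."""
    ids, encoded, trie = {}, {}, {}
    for name, seq in seqs.items():
        node, path = trie, []
        for blk in split_blocks(seq, block_size, overlap):
            block_id = ids.setdefault(blk, len(ids) + 1)  # new substring, next id
            path.append(block_id)
            node = node.setdefault(block_id, {})          # walk or extend the trie
        node.setdefault("names", []).append(name)         # mark where the sequence ends
        encoded[name] = path
    return ids, encoded, trie


sequences = {
    "S1": "ACTGGACAGTT",
    "S2": "ACTGGACAAAC",
    "S3": "ACTGGTCAAAC",
}
ids, encoded, trie = build_trie(sequences, block_size=5, overlap=2)
print(encoded)
```

    For the three example sequences this reproduces the encodings 1 2 3, 1 2 4 and 1 5 4, with S1 and S2 sharing the path 1 → 2 and all three sharing block 1.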
  23. Example comparison

    Block size = 5, overlap = 2. Substrings: 1: A C T G G; 2: G G A C A; 3: C A G T T; 4: C A A A C; 5: G G T C A. Trie: Root → 1 → 2 → 3 (S1); Root → 1 → 2 → 4 (S2); Root → 1 → 5 → 4 (S3). Compare S1 = 123 to S3 = 154.
  24. Example comparison

    Same substrings and trie. Compare S1 = 123 to S3 = 154: block 1 of S1 against block 1 of S3.
  25. Example comparison

    Same substrings and trie. Compare S1 = 123 to S3 = 154: blocks 1 2 of S1 against blocks 1 5 of S3.
  26. Example comparison

    Same substrings and trie. Compare S1 = 123 to S3 = 154: blocks 1 2 3 of S1 against blocks 1 5 4 of S3.
  27. Speedup intuition

    Same substrings and trie. Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124.
  28. Speedup intuition

    Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124. Block 1: no need to recompute the prefix.
  29. Speedup intuition

    Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124. Block 1: no need to recompute the prefix. Block 2: block comparison.
  30. Speedup intuition

    Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124. Block 1: no need to recompute the prefix. Block 2: block comparison. The comparison moves on to block 3.
  31. Speedup intuition

    Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124. Block 1: no need to recompute the prefix. Block 2: block comparison. Block 3: stored in lookup table.
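    One way to read the intuition on these slides is as bookkeeping over block IDs. The sketch below does not compute any distances: it only labels, for each block position, whether the work is shared with an earlier sequence through the trie prefix, found in the lookup table, or computed fresh. As a simplifying assumption the table is keyed by the pair of block IDs alone, whereas a real lookup key would also have to include the offsets entering the block. All names are hypothetical.

```python
def classify_block_work(center, others):
    """Label each block comparison against `center`.

    center: list of block IDs of the cluster center.
    others: dict mapping a sequence name to its list of block IDs,
            in the order the comparisons are made.
    """
    seen_prefixes = set()  # trie paths that have already been walked
    table = set()          # (center block, other block) pairs already computed
    report = {}
    for name, blocks in others.items():
        labels = []
        for depth, (c, o) in enumerate(zip(center, blocks)):
            prefix = tuple(blocks[:depth + 1])
            if prefix in seen_prefixes:
                labels.append("prefix reused")   # same trie node as before
            elif (c, o) in table:
                labels.append("lookup table")    # block pair seen on another path
            else:
                labels.append("computed")        # new block comparison
            seen_prefixes.add(prefix)
            table.add((c, o))
        report[name] = labels
    return report


report = classify_block_work([1, 2, 3], {"S3": [1, 5, 4], "S2": [1, 2, 4]})
for name, labels in report.items():
    print(name, labels)
```

    Under these assumptions, comparing S1 to S3 computes all three blocks, while the later comparison to S2 reuses the prefix for block 1, computes block 2, and finds block 3 in the table.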
  32. Dataset

    • 16S rRNA gene sequences from a mouse concussion dataset • Highly similar to each other • 57 million sequences • De-replicated before running EDIT
  33. Theoretical Running Time Analysis

    b = number of distinct substrings at each level in the tree; d = maximum allowed edit distance; m = length of the sequence; n = number of sequences. Banded alignment: $O(n^2 m d)$. Theorem: If = ≤ ? @A B , all pairwise edit distances can be computed in $O(n^2 m)$ time.
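    The $O(n^2 m d)$ baseline comes from running a banded alignment, which costs $O(md)$ per pair, on all $O(n^2)$ pairs. The sketch below shows plain banded dynamic programming only, without the Four Russians blocks or the trie; the function name and the convention of returning `None` when the distance exceeds d are assumptions made for illustration.

```python
def banded_edit_distance(a, b, d):
    """Return the edit distance of a and b if it is at most d, else None.

    Only cells with |i - j| <= d are filled, so the work is O(len(a) * d).
    """
    if abs(len(a) - len(b)) > d:
        return None                      # the end cell lies outside the band
    inf = d + 1                          # stands for "more than d"
    prev = {j: j for j in range(min(len(b), d) + 1)}   # row 0 inside the band
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - d), min(len(b), i + d) + 1):
            if j == 0:
                cur[j] = i               # first column: i deletions
                continue
            best = min(prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),
                       prev.get(j, inf) + 1,
                       cur.get(j - 1, inf) + 1)
            cur[j] = min(best, inf)      # cap values that already exceed d
        prev = cur
    result = prev.get(len(b), inf)
    return result if result <= d else None


print(banded_edit_distance("ATGTGA", "TGTCA", 2))
```

    Any alignment with at most d edits stays within d cells of the main diagonal, so restricting the computation to the band loses nothing when the true distance is at most d.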
  34. Conclusion

    • Exploits sequence similarity that is not limited to the prefix • Combines the Four Russians method with banded alignment • Finds more compact and accurate clusters than UCLUST at high similarity • Can serve as a preprocessing step for more complicated clustering methods
  35. Acknowledgements

    Brian Brubach, Mihai Pop, Aravind Srinivasan. NSF Awards CNS 1010789 and CCF 1422569 (AS, BB); NIH grant R01-AI-100947 (MP, BB); Bill and Melinda Gates Foundation (MP, JG); Adobe, Inc. (AS).