Wabi Four Russians Algorithm

Better Greedy Sequence Clustering with Fast Banded Alignment
Brian Brubach, Jay Ghurye, Mihai Pop and Aravind Srinivasan
Department of Computer Science, University of Maryland – College Park
WABI-2017

Why is clustering difficult?
Hierarchical clustering: O(n^2) comparisons, with each comparison being non-trivial.
For a dataset with 3 million sequences, it would require 9 × 10^12 comparisons (9 trillion)!

Why is clustering difficult?
Curse of dimensionality!
How many DNA sequences are within 5 mismatches in the first 500 bp and one mismatch in the last position?
3 × 3^5 × C(500, 5), on the order of 10^14 sequences
Filtering methods based on k-mers won't help!

Greedy clustering
1. Select a cluster center.
2. Recruit sequences to center.
3. Repeat until no more sequences remain.
Any sequence within a specified edit distance or similarity from the cluster center is recruited by the cluster center.
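The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the EDIT implementation: the names `edit_distance` and `greedy_cluster` are hypothetical, the input is assumed to be already ordered by center priority, and the distance is a plain quadratic Levenshtein computation.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def greedy_cluster(sequences, max_dist, distance=edit_distance):
    """Greedy clustering: pick a center, recruit everything within max_dist, repeat.

    Assumption: `sequences` is already sorted so that the first remaining
    sequence is the preferred next center.
    """
    remaining = list(sequences)
    clusters = []
    while remaining:
        center, rest = remaining[0], remaining[1:]   # step 1: select a center
        recruited, leftover = [], []
        for seq in rest:                             # step 2: recruit to center
            if distance(center, seq) <= max_dist:
                recruited.append(seq)
            else:
                leftover.append(seq)
        clusters.append((center, recruited))
        remaining = leftover                         # step 3: repeat
    return clusters
```

Every recruitment pass compares the center against all remaining sequences, which is why the cost of a single comparison dominates the running time.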

1. Select a cluster center. …
Length (longest remaining sequence): CD-HIT (Li et al. 2006), UCLUST (Edgar 2010), DNACLUST (Ghodsi et al. 2011)
Abundance (run fast de-replication): UCLUST (Edgar 2010), EDIT (This work 2017)
Possible to preserve triangle inequality in semi-global / local alignment
More likely to choose “True” cluster centers

2. Recruit sequences to center.
ATGTGA
x|||x|
-TGTCA
Substitutions, insertions, deletions
Edit (Levenshtein) distance
Similarity (DNACLUST)

Problem
How to calculate edit distance between AGGTATCGC and ATGGC?
Solution
Dynamic programming
Takes O(m^2) time, where m is the length of the sequence.
The Strong Exponential Time Hypothesis (SETH) implies it cannot be done in strongly subquadratic time (Backurs and Indyk, 2015)

Dynamic programming
D(i, j) = min {
  D(i − 1, j) + 1,
  D(i, j − 1) + 1,
  D(i − 1, j − 1) + (s1[i] == s2[j] ? 0 : 1)
}
Theorem: Under the standard formulation, any two adjacent cells in the edit distance matrix differ by at most 1. Hence the difference between adjacent cells can only be 0, 1, or -1. (Gusfield, 1997)
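As a worked check of the recurrence and the adjacent-cell theorem, the sketch below fills the full matrix for the two example sequences from the problem slide. The function names are illustrative only, and the matrix uses 0-based Python indexing with an extra row and column for the empty prefix.

```python
def edit_distance_matrix(s1, s2):
    """Fill the full (len(s1)+1) x (len(s2)+1) edit distance matrix."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete i characters of s1
    for j in range(n + 1):
        D[0][j] = j                      # insert j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,
                D[i][j - 1] + 1,
                D[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1),
            )
    return D


def adjacent_differences(D):
    """Collect the differences between horizontally and vertically adjacent cells."""
    diffs = set()
    for i in range(len(D)):
        for j in range(len(D[0])):
            if i > 0:
                diffs.add(D[i][j] - D[i - 1][j])
            if j > 0:
                diffs.add(D[i][j] - D[i][j - 1])
    return diffs


D = edit_distance_matrix("AGGTATCGC", "ATGGC")
distance = D[-1][-1]                     # bottom-right cell holds the edit distance
diffs = adjacent_differences(D)          # every value lies in {-1, 0, 1}
```

Because neighbouring cells differ by only -1, 0, or 1, a row or column can be stored as a short vector of offsets, which is the property the Four Russians speedup relies on.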

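Standard Four Russians Speedup
Sequence 1: A T T G A T
Sequence 2: T A G C T
Block size = log(m)
Running time = O(m^2 / log(m))
Block function: F(s1, s2, …) = …
[Figure: edit distance matrix of Sequence 1 against Sequence 2 divided into blocks, with differences of 0, 1, and -1 between adjacent cells marked along the block borders]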
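The block idea can be sketched as follows, under stated assumptions. The classic method precomputes every possible block of size log(m) up front; this sketch instead memoizes blocks on demand with `functools.lru_cache`, which keeps it short but is a substitution, not the textbook table construction. The names `block_function` and `four_russians_distance`, and the default block size `t=2`, are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def block_function(a, b, top, left):
    """Solve one block from its input borders.

    a, b  : substrings labelling the block's rows and columns
    top   : tuple of horizontal differences along the block's top row
    left  : tuple of vertical differences down the block's left column
    Returns (bottom, right): differences along the bottom row and right column.
    """
    rows, cols = len(a), len(b)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]   # corner value fixed at 0
    for j in range(cols):
        M[0][j + 1] = M[0][j] + top[j]
    for i in range(rows):
        M[i + 1][0] = M[i][0] + left[i]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            M[i][j] = min(M[i - 1][j] + 1,
                          M[i][j - 1] + 1,
                          M[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    bottom = tuple(M[rows][j + 1] - M[rows][j] for j in range(cols))
    right = tuple(M[i + 1][cols] - M[i][cols] for i in range(rows))
    return bottom, right


def four_russians_distance(s1, s2, t=2):
    """Edit distance computed block by block, passing only border differences."""
    col_starts = list(range(0, len(s2), t))
    # The top row of the full matrix is 0, 1, 2, ... so every difference is +1.
    tops = [tuple([1] * len(s2[j:j + t])) for j in col_starts]
    for i in range(0, len(s1), t):
        a = s1[i:i + t]
        left = tuple([1] * len(a))       # left column of the full matrix: +1 steps
        for k, j in enumerate(col_starts):
            bottom, right = block_function(a, s2[j:j + t], tops[k], left)
            tops[k] = bottom             # becomes the top of the block below
            left = right                 # becomes the left of the next block
    # D[m][0] = len(s1); walk along the last row by summing its differences.
    return len(s1) + sum(sum(tp) for tp in tops)
```

Calling `block_function.cache_info()` after a run shows how many block evaluations were answered from the cache instead of being recomputed.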

Banded Alignment
Only want alignments which are at most edit distance d apart
[Figure: alignment matrix with a band of width 2d around the diagonal]
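A minimal sketch of banded alignment, assuming a band of cells with |i − j| ≤ d and a function name `banded_edit_distance` chosen for illustration. It returns the distance when it is at most d and `None` otherwise, touching O(md) cells instead of O(m^2).

```python
def banded_edit_distance(s, t, d):
    """Edit distance restricted to the band |i - j| <= d.

    Returns the exact distance if it is at most d, otherwise None.
    """
    m, n = len(s), len(t)
    if abs(m - n) > d:
        return None                       # length gap alone already exceeds d
    INF = d + 1                           # stands for "more than d"
    prev = {j: j for j in range(min(n, d) + 1)}   # row 0 inside the band
    for i in range(1, m + 1):
        cur = {}
        for j in range(max(0, i - d), min(n, i + d) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = min(
                prev.get(j - 1, INF) + (s[i - 1] != t[j - 1]),  # diagonal
                prev.get(j, INF) + 1,                           # from above
                cur.get(j - 1, INF) + 1,                        # from the left
            )
            cur[j] = min(best, INF)       # cap so values never grow past d + 1
        prev = cur
    result = prev.get(n, INF)
    return result if result <= d else None
```

Cells outside the band are treated as "more than d", which is safe because any alignment that leaves the band already costs more than d.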

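Banded Four Russians Speedup
Block function: F(s1, s2, …) = …
[Figure: overlapping blocks placed along the band of the S1 versus S2 matrix, with differences of 0, 1, and -1 and a value δ marked on the block borders]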

Constructing trie-like data structure
S1 : A C T G G A C A G T T
S2 : A C T G G A C A A A C
S3 : A C T G G T C A A A C
Block size = 5, overlap = 2

Substrings (numbered in order of first appearance)
1: A C T G G
2: G G A C A
3: C A G T T
4: C A A A C
5: G G T C A

Sequences as substring numbers
S1 : 1 2 3
S2 : 1 2 4
S3 : 1 5 4

Trie (built by inserting S1, then S2, then S3)
Root
  1
    2
      3 (S1)
      4 (S2)
    5
      4 (S3)
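The construction can be reproduced with a short sketch. The helper names `encode_blocks` and `build_trie`, the nested-dict trie, and the `"name"` leaf key are assumptions made for illustration; the sketch also assumes each sequence length fits the block layout, as it does for the three example sequences.

```python
def encode_blocks(sequences, block_size=5, overlap=2):
    """Split each sequence into overlapping blocks and number distinct blocks.

    Blocks are numbered from 1 in order of first appearance.
    """
    step = block_size - overlap
    ids = {}
    encoded = {}
    for name, seq in sequences.items():
        codes = []
        for start in range(0, len(seq) - overlap, step):
            block = seq[start:start + block_size]
            codes.append(ids.setdefault(block, len(ids) + 1))
        encoded[name] = codes
    return ids, encoded


def build_trie(encoded):
    """Insert each encoded sequence into a nested-dict trie keyed by block number."""
    root = {}
    for name, codes in encoded.items():
        node = root
        for code in codes:
            node = node.setdefault(code, {})
        node["name"] = name              # mark the leaf with the sequence name
    return root


sequences = {
    "S1": "ACTGGACAGTT",
    "S2": "ACTGGACAAAC",
    "S3": "ACTGGTCAAAC",
}
ids, encoded = encode_blocks(sequences)
trie = build_trie(encoded)
```

With block size 5 and overlap 2, consecutive blocks start 3 positions apart, so an 11-character sequence yields blocks at offsets 0, 3, and 6.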

Example comparison
Block size = 5, overlap = 2
Compare S1 = 123 to S3 = 154, one block pair at a time:
1 vs 1
2 vs 5
3 vs 4

Speedup intuition
Block size = 5, overlap = 2
Compare S1 = 123 to S3 = 154, then compare S1 to S2 = 124.
No need to recompute prefix: S2 and S3 share block 1 in the trie.
Block comparison: stored in lookup table.
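The prefix-sharing part of this intuition can be illustrated with a simplified analogue. The sketch below works on single characters instead of overlapping blocks and uses no lookup table, so it is not the banded Four Russians procedure itself; it only shows that walking a trie lets the dynamic programming rows for a shared prefix be computed once. The function name `distances_via_trie` and the `"$names"` marker key are hypothetical.

```python
def distances_via_trie(query, sequences):
    """Edit distance from `query` to every sequence, sharing work on common prefixes.

    Returns (distances, rows_computed), where rows_computed counts the DP rows
    actually filled; one row is filled per trie node.
    """
    root = {}
    for name, seq in sequences.items():   # build a character-level trie
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node.setdefault("$names", []).append(name)

    distances = {}
    rows_computed = 0

    def visit(node, prev_row):
        nonlocal rows_computed
        for name in node.get("$names", []):
            distances[name] = prev_row[-1]
        for ch, child in node.items():
            if ch == "$names":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + (query[j - 1] != ch)))
            rows_computed += 1
            visit(child, row)             # children reuse this row as their prefix

    visit(root, list(range(len(query) + 1)))
    return distances, rows_computed


distances, rows = distances_via_trie(
    "ACTGGACAGTT",                                   # S1
    {"S2": "ACTGGACAAAC", "S3": "ACTGGTCAAAC"},
)
```

S2 and S3 share the prefix ACTGG, so the five rows for that prefix are filled once; two independent comparisons would fill 22 rows, while the trie walk fills 17.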

Dataset
• 16S rRNA gene sequences from a mouse concussion dataset
• Highly similar to each other
• 57 million sequences
• De-replicated before running EDIT

Running Time
1.07 million sequences, 99% similarity threshold

Cluster Size
[Figure panels: 97% Similarity, 99% Similarity]

Edit distance diameter
[Figure panels: 97% Similarity, 99% Similarity]

Theoretical Running Time Analysis
b = number of distinct substrings at each level in the tree
d = maximum allowed edit distance
m = length of the sequence
n = number of sequences
Banded alignment: O(n^2 m d)
Theorem: If b ≤ … (bound garbled in the source), all pairwise edit distances can be computed in O(n^2 m) time.

Conclusion
• Exploits sequence similarity that is not limited to the prefix
• Combines the Four Russians method with banded alignment
• Finds more compact and accurate clusters than UCLUST at high similarity
• Can serve as a preprocessing step for more complicated clustering methods

Acknowledgements
Brian Brubach, Mihai Pop, Aravind Srinivasan
NSF Awards CNS 1010789 and CCF 1422569 (AS, BB)
NIH, grant R01-AI-100947 (MP, BB)
Bill and Melinda Gates Foundation (MP, JG)
Adobe, Inc. (AS)

Thank You
Questions?
Contact: [email protected] and [email protected]
