Speeding Up Minwise Hashing for Weighted Sets

4 Minwise hashing
• Useful if objects can be represented as sets of features
• and Jaccard similarity is an appropriate similarity measure
• Minwise hashing used for deduplication of similar web pages
Figure: Object → Set representation → Minwise hashing → Signature → Similarity estimation
Objects: “I hate the coronavirus!” and “I hate lockdowns!”
Set representations: {I, hate, the, coronavirus} and {I, hate, lockdowns}
Signatures: 25 21 18 41 98 12 15 41 and 25 32 18 11 98 56 33 72
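As a small illustration of the similarity measure named on this slide, the sketch below computes the exact Jaccard similarity of the two example sentences. The word-level tokenization and the function name `jaccard` are assumptions made for this example, not taken from the deck.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Set representations of the two example objects (word-level features assumed).
doc_a = {"I", "hate", "the", "coronavirus"}
doc_b = {"I", "hate", "lockdowns"}

# 2 shared words ("I", "hate") out of 5 distinct words overall.
similarity = jaccard(doc_a, doc_b)
```

Minwise hashing estimates this quantity from short signatures instead of comparing the full sets.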

5 Independent hash functions; the minimum hash value defines each signature component
Input set element: hash values
I: 25 63 98
hate: 67 41 18
the: 79 34 35
coronavirus: 36 21 52
Signature: 25 21 18
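The signature construction on this slide can be sketched as follows. This is a minimal sketch, not the speaker's implementation: it simulates independent hash functions by salting BLAKE2b with a seed, and the names `hash_value`, `minhash_signature`, and `estimate_jaccard` are hypothetical.

```python
import hashlib


def hash_value(element: str, seed: int) -> int:
    """64-bit hash of an element; each seed acts as a separate hash function (assumption)."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, m: int) -> list:
    """Signature of length m: component k is the minimum hash value under hash function k."""
    items = list(items)
    if not items:
        raise ValueError("signature of an empty set is undefined in this sketch")
    return [min(hash_value(x, seed) for x in items) for seed in range(m)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of equal signature components, which estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature({"I", "hate", "the", "coronavirus"}, m=1024)
sig_b = minhash_signature({"I", "hate", "lockdowns"}, m=1024)
estimate = estimate_jaccard(sig_a, sig_b)  # close to the exact similarity 2/5
```

Each signature component costs one hash evaluation per element, so this naive scheme needs m hash evaluations per element.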

8 Figure labels: hate, the, I, coronavirus

14 Figure labels: Step 1, Step 2 (shown twice)

15 claims that Ioffe’s algorithm is wrong!

31 “BagMinHash - Minwise hashing algorithm for weighted sets” (Ertl, KDD 2018)

32 “DartMinHash: Fast Sketching for Weighted Sets” (Christiani, 2020)

33 “DartMinHash: Fast Sketching for Weighted Sets” (Christiani, 2020)

34 https://github.com/oertl/treeminhash

36 http://www.nrbook.com/devroye/Devroye_files/chapter_five.pdf

40 DartMinHash performs best if weights are normalized
Performance of DartMinHash depends on total weight
https://github.com/oertl/treeminhash
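To make the role of the total weight concrete, here is a minimal sketch of the weighted Jaccard similarity (sum of element-wise minima divided by sum of element-wise maxima) together with a helper that rescales a weighted set to total weight 1. Reading "normalized" as "total weight 1" is an assumption, and the names `weighted_jaccard` and `normalize` and the example weights are hypothetical.

```python
def weighted_jaccard(x: dict, y: dict) -> float:
    """Weighted Jaccard similarity: sum of minima over sum of maxima of the weights."""
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    if denominator == 0:
        return 1.0  # convention chosen here for two empty weighted sets
    return numerator / denominator


def normalize(weights: dict) -> dict:
    """Rescale weights so they sum to 1 (assumed meaning of 'normalized')."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {k: w / total for k, w in weights.items()}


# Hypothetical word counts; both sets have total weight 4.
x = {"I": 1.0, "hate": 2.0, "coronavirus": 1.0}
y = {"I": 1.0, "hate": 1.0, "lockdowns": 2.0}

j_raw = weighted_jaccard(x, y)                         # minima sum to 2, maxima sum to 6
j_norm = weighted_jaccard(normalize(x), normalize(y))  # same value, totals were equal
```

Scaling both sets by the same factor leaves the weighted Jaccard similarity unchanged, but normalizing two sets with different total weights changes it, so normalization is only harmless when that change is acceptable for the application.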

43 “Maximally consistent sampling and the Jaccard index of probability
distributions” (Moulton & Jiang, ICDMW 2018)

44 “ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity” (Ertl, TKDE 2020)
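For reference, the probability Jaccard similarity that these two papers address can be computed exactly as below. This is a direct, quadratic-time sketch of the definition J_P(x, y) = Σ_i 1 / Σ_j max(x_j / x_i, y_j / y_i), with the outer sum over elements i that have positive weight in both sets. It is not one of the ProbMinHash sketching algorithms, and the function name is hypothetical.

```python
def probability_jaccard(x: dict, y: dict) -> float:
    """Exact probability Jaccard similarity of two weighted sets (O(n^2) reference code)."""
    keys = set(x) | set(y)
    # Only elements with positive weight in both sets contribute to the sum.
    shared = [k for k in keys if x.get(k, 0.0) > 0 and y.get(k, 0.0) > 0]
    total = 0.0
    for i in shared:
        denominator = sum(
            max(x.get(j, 0.0) / x[i], y.get(j, 0.0) / y[i]) for j in keys
        )
        total += 1.0 / denominator
    return total  # 0.0 when the sets share no element


x = {"I": 1.0, "hate": 2.0, "coronavirus": 1.0}
x_scaled = {k: 10.0 * w for k, w in x.items()}
y = {"I": 1.0, "hate": 1.0, "lockdowns": 2.0}

j_same = probability_jaccard(x, x_scaled)  # scaling one input does not change the result
j_xy = probability_jaccard(x, y)
```

Because only weight ratios within each set enter the formula, the result does not depend on the total weight of either input.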

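48 ProbMinHash variants by label sampling and point sampling
ProbMinHash1: label sampling with replacement, uncorrelated point sampling
ProbMinHash2: label sampling w/o replacement, uncorrelated point sampling
ProbMinHash3: label sampling with replacement, correlated point sampling
ProbMinHash4: label sampling w/o replacement, correlated point sampling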

50 Correlated point generation of ProbMinHash3/4 may reduce estimation error
for small sets!
