Speeding Up Minwise Hashing for Weighted Sets

Otmar Ertl
November 18, 2020

Minwise hashing (MinHash) has become a standard tool for calculating signatures (fingerprints) of sets and is used in many applications for similarity estimation and nearest-neighbor search. Generalizations have been proposed that calculate signatures for weighted sets and allow estimating either the weighted Jaccard similarity or the probability Jaccard similarity. While very fast algorithms already exist for calculating signatures of unweighted sets, until recently there were no such algorithms for weighted sets. This talk presents the basic ideas of the latest weighted minwise hashing algorithms: BagMinHash, DartMinHash, TreeMinHash, and ProbMinHash. All of them were developed within the last two years and can reduce computation costs by many orders of magnitude.
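To make the two similarity measures named above concrete, here is a minimal sketch of how they could be computed exactly for small weighted sets. It assumes weighted sets are given as Python dicts mapping elements to non-negative weights; the function names are illustrative and not taken from the talk. The weighted Jaccard similarity is the sum of element-wise minimum weights divided by the sum of element-wise maximum weights, and the probability Jaccard similarity follows the definition by Moulton & Jiang.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity: sum of min weights / sum of max weights."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def probability_jaccard(a, b):
    """Probability Jaccard similarity (definition by Moulton & Jiang).

    Only elements with positive weight in both sets contribute.
    """
    keys = set(a) | set(b)
    total = 0.0
    for d in keys:
        wa, wb = a.get(d, 0.0), b.get(d, 0.0)
        if wa <= 0.0 or wb <= 0.0:
            continue
        # Weights of all elements, measured relative to the weight of d.
        denom = sum(max(a.get(e, 0.0) / wa, b.get(e, 0.0) / wb) for e in keys)
        total += 1.0 / denom
    return total


# Hypothetical example data: unit weights reproduce the plain Jaccard index.
a = {"I": 1.0, "hate": 1.0, "the": 1.0, "coronavirus": 1.0}
b = {"I": 1.0, "hate": 1.0, "lockdowns": 1.0}
# Same set as a, with every weight multiplied by 10.
a_scaled = {k: 10.0 * w for k, w in a.items()}
```

With unit weights both measures equal the ordinary Jaccard index of the two sets, 2/5. Scaling all weights of one set changes the weighted Jaccard similarity but leaves the probability Jaccard similarity unchanged, which is the practical difference between the two.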

Transcript

  2. 4 • Useful if objects can be represented as sets of features • and Jaccard similarity is an appropriate similarity measure • Minwise hashing used for deduplication of similar web pages

    [Figure: object → set representation → minwise hashing → signature → similarity estimation. “I hate the coronavirus!” → {I, hate, the, coronavirus} → signature 25 21 18 41 98 12 15 41; “I hate lockdowns!” → {I, hate, lockdowns} → signature 25 32 18 11 98 56 33 72]
  3. 5 • Independent hash functions • Minimum hash value defines signature component

    input set      hash values
    I              25  63  98
    hate           67  41  18
    the            79  34  35
    coronavirus    36  21  52
    signature      25  21  18
  26. 40 • DartMinHash performs best if weights are normalized • Performance of DartMinHash depends on total weight • https://github.com/oertl/treeminhash
  28. 43 “Maximally consistent sampling and the Jaccard index of probability distributions” (Moulton & Jiang, ICDMW 2018)
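The unweighted minwise hashing example on slides 4 and 5 can be sketched in a few lines of Python. The hash values and signatures below are copied from the slide text; how the sixteen numbers on slide 4 split into two signatures of eight components is an assumption based on slide 5. The helper `hash_table_for` is hypothetical and only illustrates one way to obtain several hash functions, here by salting BLAKE2b; it is not the construction used in the talk.

```python
import hashlib


def minhash_signature(hash_table):
    """Signature component i is the minimum of hash function i over all elements."""
    rows = list(hash_table.values())
    return [min(column) for column in zip(*rows)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature components on which both signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have equal length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def hash_table_for(elements, m):
    """Hypothetical helper: m salted 64-bit hash values per element."""
    return {
        element: [
            int.from_bytes(
                hashlib.blake2b(
                    element.encode("utf-8"),
                    digest_size=8,
                    salt=i.to_bytes(8, "little"),
                ).digest(),
                "little",
            )
            for i in range(m)
        ]
        for element in elements
    }


# Hash values from slide 5: one row per element, one column per hash function.
slide_hash_values = {
    "I": [25, 63, 98],
    "hate": [67, 41, 18],
    "the": [79, 34, 35],
    "coronavirus": [36, 21, 52],
}
slide_signature = minhash_signature(slide_hash_values)  # [25, 21, 18]

# Signatures from slide 4 (assumed split into two 8-component signatures).
sig_coronavirus = [25, 21, 18, 41, 98, 12, 15, 41]
sig_lockdowns = [25, 32, 18, 11, 98, 56, 33, 72]
slide_estimate = estimate_jaccard(sig_coronavirus, sig_lockdowns)  # 3/8 = 0.375
```

Under that reading, the two signatures agree in 3 of 8 components, giving an estimate of 0.375, while the exact Jaccard similarity of {I, hate, the, coronavirus} and {I, hate, lockdowns} is 2/5 = 0.4. Longer signatures reduce the estimation error at the cost of more hash evaluations per element, which is the cost that the faster algorithms in the talk aim to reduce.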