Slide 9
Slide 9 text
Probabilistic interpretation
Probability that a random mapping by a hash function $h_i$ produces the same value for the subsets $x$ and $y$:
$\Pr(h_i(x) = h_i(y)) = JS(x, y) + \frac{1 - JS(x, y)}{2^k}$
Here $JS(x, y)$ is the Jaccard similarity of $x$ and $y$, and $k$ is the number of bits mapped by the hash function $h_i$, which is a random permutation of the bit vectors' coordinates (the same permutation on all vectors).
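As a small worked example of the collision probability above, the sketch below evaluates the formula and inverts it to recover the similarity from an observed collision rate. It takes the denominator to be 2^k (k bits give 2^k possible values), which is an assumption about the slide's notation; the function names are hypothetical.

```python
def collision_probability(js: float, k: int) -> float:
    """Pr(h_i(x) = h_i(y)) = JS + (1 - JS) / 2^k (denominator 2^k is an assumption)."""
    return js + (1.0 - js) / (2 ** k)


def jaccard_from_collision_rate(p: float, k: int) -> float:
    """Invert the formula: JS = (p - 2^-k) / (1 - 2^-k). Requires k >= 1."""
    chance = 1.0 / (2 ** k)  # collision rate of two unrelated k-bit values
    return (p - chance) / (1.0 - chance)


if __name__ == "__main__":
    # JS = 0.5 with k = 1 bit: 0.5 + 0.5 / 2 = 0.75
    print(collision_probability(0.5, 1))        # 0.75
    # Going back from the collision rate to the similarity
    print(jaccard_from_collision_rate(0.75, 1))  # 0.5
```

With few bits the chance term (1 - JS)/2^k is large, so the raw collision rate overstates the similarity; the inversion removes that offset.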
MinHashing: replace long text documents with much shorter, uniform-length MinHash signatures.
A MinHash function is defined as the index of the first bit, in the permuted
order, to have a value 1.
h(x) = (ax+b) mod p
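A minimal sketch of the MinHash definition above, using an explicit permutation of the coordinates rather than the linear hash h(x) = (ax+b) mod p, to stay close to the wording of the definition. The vectors `x` and `y` and the name `minhash_index` are illustrative assumptions. Enumerating every permutation of a tiny example shows that the fraction of permutations on which the two MinHash values agree equals the Jaccard similarity.

```python
from itertools import permutations


def minhash_index(bits, permutation):
    """Position, in permuted order, of the first coordinate whose bit is 1."""
    for position, coordinate in enumerate(permutation):
        if bits[coordinate] == 1:
            return position
    return None  # all-zero vector: the MinHash value is undefined


# Two bit vectors: x has 1s at {1, 4}, y has 1s at {1, 2, 4}.
# Jaccard similarity = |{1, 4}| / |{1, 2, 4}| = 2/3.
x = [0, 1, 0, 0, 1]
y = [0, 1, 1, 0, 1]

# The same permutation is applied to both vectors each time.
all_perms = list(permutations(range(len(x))))
agree = sum(minhash_index(x, p) == minhash_index(y, p) for p in all_perms)

print(agree, len(all_perms))  # 80 120, i.e. exactly 2/3
```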
Then, the MinHash signature of the set S is:
• $[h_{\pi_1}(S), h_{\pi_2}(S), h_{\pi_3}(S), \dots, h_{\pi_k}(S)]$,
• $\pi_1, \pi_2, \dots, \pi_k$ are random permutations of the bit vector's coordinates, and $h_1, h_2, h_3, \dots, h_k$ (that is, $h_{\pi_1}, \dots, h_{\pi_k}$) are their matching MinHash functions.
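The signature construction can be sketched as follows. This is one common approach, not necessarily the one the slide intends: each permutation is simulated by a linear hash of the form h(x) = (ax+b) mod p, and the MinHash value is the smallest hash over the set's elements. The prime `P`, the seed, and all function names are assumptions made for illustration.

```python
import random

P = 2_147_483_647  # assumed prime (2^31 - 1); must exceed every element ID


def make_hash_params(k, seed=0):
    """Draw k pairs (a, b) with a != 0, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def minhash_signature(elements, params):
    """elements: non-empty set of integer IDs in [0, P) marking the 1-bits of S."""
    return [min((a * x + b) % P for x in elements) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(u == v for u, v in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(k=200)
    s1 = set(range(0, 80))
    s2 = set(range(40, 120))  # true Jaccard similarity with s1 is 40/120
    sig1 = minhash_signature(s1, params)
    sig2 = minhash_signature(s2, params)
    print(len(sig1), estimate_jaccard(sig1, sig2))
```

Both sets are reduced to 200 integers regardless of their size, and the agreement rate of the two signatures serves as an estimate of the Jaccard similarity. Linear hashes only approximate true random permutations, so the estimate is itself approximate.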
1. MinHashing: convert large sets to short signatures (lists of integers) while preserving similarity.
2. Locality-sensitive hashing: focus on pairs of signatures likely to be
similar.
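Step 2 can be illustrated with the banding technique, a standard way to implement locality-sensitive hashing over MinHash signatures. The slide does not say which scheme it has in mind, so banding is an assumption here, as are the function name, the `bands` and `rows` parameters, and the toy signatures.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """signatures: dict mapping a name to a list of bands * rows integers.

    Two items become a candidate pair if they agree on every row of
    at least one band.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        start = band * rows
        for name, sig in signatures.items():
            key = tuple(sig[start:start + rows])  # this band's slice
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


signatures = {
    "doc_a": [1, 2, 3, 4, 5, 6],
    "doc_b": [1, 2, 3, 9, 9, 9],  # shares the first band with doc_a
    "doc_c": [7, 7, 7, 7, 7, 7],
}

print(candidate_pairs(signatures, bands=2, rows=3))  # {('doc_a', 'doc_b')}
```

Only the candidate pairs need a full signature comparison, which avoids comparing every pair of documents.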