Locality Sensitive Hashing at Lyst

Speeding up search with locality sensitive hashing, by Maciej Kula

Hi, I’m Maciej Kula. @maciej_kula

We collect the world of fashion into a customisable shopping experience.

Nearest neighbour search…
Given a point, find other points close to it.

At Lyst we use it for…
1.) Image Search
2.) Recommendations

1. Image Search
Convert images to points in space (vectors) & use nearest neighbour search to get similar images.
(-0.3, 2.1, 0.5)

Super useful for deduplication & search.

2. Recommendations
Convert products and users to points in space & use nearest neighbour search to get related products for the user.
user = (-0.3, 2.1, 0.5)
product = (5.2, 0.3, -0.5)

Great, but…

We have 80 million images

We have 9 million products

Exhaustive nearest neighbour search is too slow.
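To make the cost concrete, here is a minimal sketch of exhaustive search in plain Python. The function names, the Euclidean metric, and the random 3-dimensional data are assumptions for illustration, not taken from the talk. Every query touches every stored point, which is what stops scaling at tens of millions of items.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exhaustive_nn(query, points, k=10):
    """Return the ids of the k points closest to `query`.

    Every stored point is compared with the query, so the cost of a
    single query grows linearly with the number of points.
    """
    ranked = sorted(range(len(points)), key=lambda i: euclidean(query, points[i]))
    return ranked[:k]


random.seed(0)
# Hypothetical data: 1000 random points in 3 dimensions.
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(1000)]
query = points[42]
neighbours = exhaustive_nn(query, points, k=5)
```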

Locality sensitive hashing to the rescue!
Use a hash table.
Pick a hash function that puts similar points in the same bucket.
Only search within the bucket.
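The hash-table idea can be sketched as follows. The hash function used here, one bit per random hyperplane, is an assumed choice for illustration (it is a common one for cosine similarity); the talk does not say which hash this slide has in mind, and the names `make_hash` and `hash_point` are hypothetical.

```python
import random
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_hash(dim, n_bits, rng):
    """Build a hash function from n_bits random hyperplanes (assumed scheme)."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def hash_point(point):
        # One bit per hyperplane: which side of the plane the point falls on.
        return tuple(dot(plane, point) >= 0 for plane in planes)

    return hash_point


rng = random.Random(0)
hash_point = make_hash(dim=3, n_bits=4, rng=rng)

# Index: put every point id into the bucket named by its hash.
points = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1000)]
buckets = defaultdict(list)
for point_id, point in enumerate(points):
    buckets[hash_point(point)].append(point_id)

# Query: hash the query and look only at the points sharing its bucket.
query = points[7]
candidates = buckets.get(hash_point(query), [])
```

A query that is merely close to a stored point usually, but not always, lands in the same bucket; the candidates would then be ranked by true distance.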

We use Random Projection Forests

Partition by splitting on random vectors

Points to note
Keep splitting until the nodes are small enough.
Median splits give nicely balanced trees.
Build a forest of trees.
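A single random projection tree along these lines might look like the sketch below. It is an illustration under assumptions (dict-based nodes, Gaussian random directions, a hypothetical `leaf_size` of 50) and is not the rpforest implementation.

```python
import random
import statistics


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_tree(ids, points, leaf_size, rng):
    """Recursively split `ids` on random vectors until the nodes are small enough."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    dim = len(points[ids[0]])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    projections = {i: dot(points[i], direction) for i in ids}
    # Splitting at the median sends about half the points each way,
    # which keeps the tree balanced.
    median = statistics.median(projections.values())
    left = [i for i in ids if projections[i] <= median]
    right = [i for i in ids if projections[i] > median]
    if not left or not right:  # degenerate split, e.g. all points identical
        return {"leaf": ids}
    return {
        "direction": direction,
        "median": median,
        "left": build_tree(left, points, leaf_size, rng),
        "right": build_tree(right, points, leaf_size, rng),
    }


def find_leaf(tree, query):
    """Follow the splits down to the leaf whose region contains `query`."""
    while "leaf" not in tree:
        side = "left" if dot(query, tree["direction"]) <= tree["median"] else "right"
        tree = tree[side]
    return tree["leaf"]


rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1000)]
tree = build_tree(list(range(len(points))), points, leaf_size=50, rng=rng)
leaf = find_leaf(tree, points[7])  # candidate neighbours of point 7
```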

Why do we need a forest?
Some partitions split the true neighbourhood of a point.
Because partitions are random, other trees are unlikely to repeat the error.
Build more trees to trade off query speed for precision.
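A rough back-of-the-envelope version of this argument, under idealised assumptions: suppose one tree separates a given true neighbour from the query with probability p, and treat the T trees as independent. Both p and the independence assumption are illustrative and not measured figures from the talk.

```latex
% p: assumed probability that one tree separates a true neighbour from the query
% T: number of trees, treated as independent (an idealisation)
P(\text{missed by all } T \text{ trees}) = p^{T}
\qquad\text{e.g. } p = 0.5,\ T = 10:\quad 0.5^{10} = \tfrac{1}{1024} \approx 0.001
```

The price is that each query has to walk all T trees and rank the union of their leaves, so query time grows roughly in proportion to T.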

LSH in Python
annoy, Python wrapper for C++ code.
LSHForest, part of scikit-learn.
FLANN, an auto-tuning ANN index.

But…
LSHForest is slow.
FLANN is a pain to deploy.
annoy is great, but can’t add points to an existing index.

So we wrote our own.

github.com/lyst/rpforest
pip install rpforest

rpforest
Quite fast.
Allows adding new items to the index.
Does not require us to store points in memory.

We use it in conjunction with PostgreSQL
Send the query point to the ANN index.
Get ANN row ids back.
Plug them into postgres for filtering.
Final scoring done in postgres using C extensions.
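The flow above can be sketched end to end with stand-ins. In this illustration an in-memory SQLite database replaces PostgreSQL, plain Python replaces the C scoring extension, and the `products` table, its columns, and the `in_stock` filter are all hypothetical; only the shape of the pipeline (candidate ids in, filtered and scored rows out) follows the slide.

```python
import math
import sqlite3

# SQLite stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, in_stock INTEGER, x REAL, y REAL, z REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 5.2, 0.3, -0.5),
        (2, 0, 5.1, 0.2, -0.4),
        (3, 1, -0.3, 2.1, 0.5),
        (4, 1, 4.0, 0.0, 0.0),
    ],
)


def search(conn, query, candidate_ids, k):
    """Filter ANN candidates in SQL, then score and rank the survivors."""
    placeholders = ",".join("?" for _ in candidate_ids)
    rows = conn.execute(
        f"SELECT id, x, y, z FROM products "
        f"WHERE id IN ({placeholders}) AND in_stock = 1",
        list(candidate_ids),
    ).fetchall()
    # Final scoring: done in plain Python here, standing in for the
    # C extension that the talk runs inside postgres.
    scored = sorted(rows, key=lambda row: math.dist(query, row[1:]))
    return [row[0] for row in scored[:k]]


candidate_ids = [1, 2, 4]  # pretend these row ids came back from the ANN index
results = search(conn, (5.0, 0.3, -0.5), candidate_ids, k=2)
```

Product 2 is the second-closest candidate but is dropped by the filter, which is the point of running the exact filtering and scoring in the database after the approximate step.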

Side note: postgres is awesome.
Arrays & custom functions in C.

Gives us a fast and reliable ANN service
100x speed-up with 0.6 10-NN precision.
Allows us to serve real-time results.
All on top of a real database.
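One way to read the precision figure, assuming "0.6 10-NN precision" means that on average 6 of the true 10 nearest neighbours appear among the results returned: the small sketch below computes that fraction for one query. The function name and the example ids are hypothetical.

```python
def knn_precision(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbours that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)


# Hypothetical ids: the exact 10 nearest neighbours and an approximate result
# that recovers 6 of them.
exact = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0]
approx = [3, 8, 1, 9, 4, 7, 11, 12, 13, 14]
precision = knn_precision(approx, exact)
```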

thank you @maciej_kula
