PyNNDescent: Fast Approximate Nearest Neighbors with Numba

Fast Approximate Nearest Neighbour Search with Numba

What are Nearest Neighbours?

Given a set of points with a distance measure between
them…

… and a new “query point” …

Find the closest points to the query point
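
As a baseline, the problem statement above can be sketched as a brute-force search. This is only an illustration, assuming Euclidean distance and NumPy arrays; the function name `brute_force_knn` is made up for this example and is not part of PyNNDescent.

```python
import numpy as np

def brute_force_knn(points, query, k):
    """Return indices and distances of the k points closest to query (Euclidean)."""
    # Distance from the query to every point: O(n * d) work per query.
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    # Sort by distance; a stable sort keeps ties in index order.
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.1])
indices, distances = brute_force_knn(points, query, k=2)
```

Every query touches every point, which is the cost the rest of the talk tries to avoid.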

Why Nearest Neighbors?

Nearest Neighbour computations are at the heart of many machine
learning algorithms

KNN-Classifiers, KNN-Regressors

Clustering: HDBSCAN, DBSCAN, Single Linkage Clustering, Spectral Clustering
Image credits: https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg by Chire; https://www.flickr.com/photos/trevorpatt/41875889652/in/photostream/ by Trevor Patt

Dimension Reduction: t-SNE, Isomap, Spectral Embedding, UMAP
Image sources: http://lvdmaaten.github.io/tsne/ and http://www-clmc.usc.edu/publications/T/tenenbaum-Science2000.pdf

Recommender Systems, Query Expansion

Why Approximate Nearest Neighbours?

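Finding exact nearest neighbours is hard: brute force compares the query to every point, and tree-based bounds weaken as the dimension grows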

Approximate nearest neighbour search trades accuracy for performance
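
One common way to put a number on the "accuracy" side of this trade is recall: the fraction of the true nearest neighbours that the approximate search returns. The sketch below assumes that definition; `recall_at_k` is an illustrative helper, not a PyNNDescent function.

```python
def recall_at_k(approx_neighbors, exact_neighbors):
    """Fraction of true neighbours recovered, averaged over all queries."""
    total = 0.0
    for approx, exact in zip(approx_neighbors, exact_neighbors):
        # Order within each list does not matter, only membership.
        total += len(set(approx) & set(exact)) / len(exact)
    return total / len(exact_neighbors)

exact = [[1, 2, 3], [4, 5, 6]]    # true neighbours for two queries
approx = [[1, 2, 9], [4, 5, 6]]   # one wrong answer for the first query
score = recall_at_k(approx, exact)
```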

How Do You Find Nearest Neighbors?

Using Trees

Hierarchically divide up the space into a tree

Bound the search using the tree structure (and the triangle
inequality)
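
A minimal sketch of the bounding idea, assuming a simplified ball tree with Euclidean distance. The `Ball` class and `nearest` function are invented for this illustration and do not reflect any particular library. A whole ball can be skipped when the triangle inequality shows that none of its points can beat the best distance found so far.

```python
import numpy as np

class Ball:
    """A node covering points[indices] with a centre and a radius."""
    def __init__(self, points, indices, leaf_size=2):
        self.indices = indices
        self.center = points[indices].mean(axis=0)
        self.radius = np.linalg.norm(points[indices] - self.center, axis=1).max()
        self.children = []
        if len(indices) > leaf_size:
            # Split at the median of the coordinate with the largest spread.
            spread = points[indices].max(axis=0) - points[indices].min(axis=0)
            dim = int(np.argmax(spread))
            order = indices[np.argsort(points[indices, dim])]
            half = len(order) // 2
            self.children = [Ball(points, order[:half], leaf_size),
                             Ball(points, order[half:], leaf_size)]

def nearest(ball, points, query, best=(np.inf, -1)):
    """Return (distance, index) of the nearest point to query."""
    # Triangle inequality: every point in the ball is at least
    # d(query, center) - radius away from the query.
    lower_bound = np.linalg.norm(query - ball.center) - ball.radius
    if lower_bound >= best[0]:
        return best  # nothing in this ball can improve on the current best
    if not ball.children:
        for i in ball.indices:
            d = float(np.linalg.norm(points[i] - query))
            if d < best[0]:
                best = (d, int(i))
        return best
    # Visit the closer child first so the bound tightens early.
    for child in sorted(ball.children,
                        key=lambda c: np.linalg.norm(query - c.center)):
        best = nearest(child, points, query, best)
    return best

rng = np.random.default_rng(0)
points = rng.random((200, 2))
query = np.array([0.5, 0.5])
root = Ball(points, np.arange(len(points)))
best_dist, best_index = nearest(root, points, query)
```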

KD-Tree

Ball Tree

Random Projection Tree

Using Graphs

How do you search for nearest neighbours of a query
using a graph?
(Malkov and Yashunin, 2018; Dong, Moses and Li, 2011; Iwasaki and Miyazaki, 2018)

Start with a nearest neighbour graph of the training data
Assume we now want to find neighbours of a query point

Choose a starting node in the graph (potentially randomly) as
a candidate node

Look at all nodes connected by an edge to the
best untried candidate node in the graph
Add all these nodes to our potential candidate pool

Sort the candidate pool by closeness to the query point
Truncate the pool to the k best candidates

Return to the Expansion step unless we have already tried
all the candidates in the pool

Stop when there are no untried candidates in the pool
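
The search steps above can be sketched in a few lines. This is a simplified illustration, assuming the graph is a dict mapping each node to a list of neighbour indices and that distances are Euclidean; `graph_search` is a made-up name, not the PyNNDescent implementation.

```python
import numpy as np

def graph_search(graph, points, query, k, start=0):
    """Greedy k-nearest-neighbour search over a neighbour graph."""
    def dist(i):
        return float(np.linalg.norm(points[i] - query))

    pool = [(dist(start), start)]  # candidate pool of (distance, node), kept sorted
    tried = set()
    while True:
        untried = [node for _, node in pool if node not in tried]
        if not untried:
            break                  # stop: every candidate in the pool has been tried
        node = untried[0]          # best untried candidate (pool is sorted)
        tried.add(node)
        in_pool = {n for _, n in pool}
        for nbr in graph[node]:    # expansion: add all neighbours to the pool
            if nbr not in in_pool:
                pool.append((dist(nbr), nbr))
                in_pool.add(nbr)
        pool.sort()                # sort by closeness to the query
        pool = pool[:k]            # truncate to the k best candidates
    return [node for _, node in pool]

# Ten points on a line, each linked to its immediate left and right neighbour.
points = np.arange(10, dtype=float).reshape(-1, 1)
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
result = graph_search(graph, points, np.array([6.2]), k=2, start=0)
```

Starting from node 0, the search walks along the chain towards the query and stops once expanding the best candidates no longer changes the pool.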

Looks inefficient, but scales up well

Graph adapts to intrinsic dimension of the data

But how do we build the graph?!

The algorithm works (badly) even on a bad graph

Run one iteration of search for every node
Update the graph with new better neighbours
Search is better on the improved graph

Perfect accuracy of neighbours is not assured
We can get an approximate knn-graph quickly
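
A heavily simplified sketch of this refinement loop, assuming Euclidean distance and a random starting graph. It only looks at neighbours of neighbours; real implementations also use reverse neighbours, sampling, and heaps, none of which are shown here. The name `nn_descent` is used only for this illustration.

```python
import numpy as np

def nn_descent(points, k, n_iters=10, seed=0):
    """Refine a random k-neighbour graph by checking neighbours of neighbours."""
    rng = np.random.default_rng(seed)
    n = len(points)

    def dist(i, j):
        return float(np.linalg.norm(points[i] - points[j]))

    # Start from a random (bad) graph: k random neighbours per node.
    graph = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        graph.append(set(rng.choice(others, size=k, replace=False).tolist()))

    for _ in range(n_iters):
        changed = False
        for i in range(n):
            # Candidates: current neighbours plus their neighbours.
            candidates = set(graph[i])
            for j in graph[i]:
                candidates |= graph[j]
            candidates.discard(i)
            best = set(sorted(candidates, key=lambda j: (dist(i, j), j))[:k])
            if best != graph[i]:
                graph[i] = best    # update with new, better neighbours
                changed = True
        if not changed:
            break                  # no node improved in this pass
    return [sorted(nbrs, key=lambda j: dist(i, j)) for i, nbrs in enumerate(graph)]

rng = np.random.default_rng(1)
points = rng.random((60, 2))
knn_graph = nn_descent(points, k=4)
```

Each pass can only keep or improve a node's neighbour set, since the current neighbours are always among the candidates.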

How Do You Make it Fast?

Algorithm tricks

Figure labels: query node, expansion node, current neighbour; neighbour A, neighbour B, common node

Hubs have a lot of neighbours!

Sample neighbours when constructing the graph
Prune away edges before performing searches

Necessary to find green’s nearest neighbour
Necessary to find blue’s nearest neighbour
Not required since we can traverse through blue

For search remove the longest edges of any triangles in
the graph
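
One way to sketch this pruning rule, assuming Euclidean distance and a graph stored as a dict of neighbour lists. The function `prune_long_edges` is illustrative only and is not claimed to match how PyNNDescent prunes. An edge is dropped when an already kept, closer neighbour offers a shorter hop to the same target.

```python
import numpy as np

def prune_long_edges(graph, points):
    """Drop edges that are the longest side of a triangle in the graph."""
    def dist(a, b):
        return float(np.linalg.norm(points[a] - points[b]))

    pruned = {}
    for node, nbrs in graph.items():
        kept = []
        # Consider neighbours from closest to farthest.
        for cand in sorted(nbrs, key=lambda j: dist(node, j)):
            # If a kept neighbour is closer to cand than node is, then
            # node-cand is the longest edge of that triangle and search
            # can reach cand by traversing through the kept neighbour.
            if all(dist(other, cand) >= dist(node, cand) for other in kept):
                kept.append(cand)
        pruned[node] = kept
    return pruned

points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1], [0.0, 1.5]])
graph = {0: [1, 2, 3], 1: [0, 2], 2: [1, 0], 3: [0]}
pruned = prune_long_edges(graph, points)
```

In this toy graph the long edge between nodes 0 and 2 disappears in both directions, because node 1 sits between them.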

Initialize with Random Projection Trees
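
A rough sketch of how a random projection tree can group nearby points, assuming Euclidean data: pick two random points, split the rest by the hyperplane halfway between them, and recurse. Points sharing a leaf are reasonable starting guesses for each other's neighbours. `rp_tree_leaves` is a hypothetical helper written for this example.

```python
import numpy as np

def rp_tree_leaves(points, indices, leaf_size, rng):
    """Recursively split indices with random hyperplanes; return the leaves."""
    if len(indices) <= leaf_size:
        return [indices]
    # The hyperplane is the perpendicular bisector of two random points.
    a, b = rng.choice(indices, size=2, replace=False)
    normal = points[a] - points[b]
    midpoint = (points[a] + points[b]) / 2.0
    side = (points[indices] - midpoint) @ normal > 0
    left, right = indices[side], indices[~side]
    if len(left) == 0 or len(right) == 0:
        # Degenerate split (for example duplicate points): halve instead.
        half = len(indices) // 2
        left, right = indices[:half], indices[half:]
    return (rp_tree_leaves(points, left, leaf_size, rng)
            + rp_tree_leaves(points, right, leaf_size, rng))

rng = np.random.default_rng(0)
points = rng.random((100, 3))
leaves = rp_tree_leaves(points, np.arange(100), leaf_size=10, rng=rng)
```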

Implementation tricks

Profile and inspect LLVM code for innermost functions
Type declarations and code choices can help the compiler a lot!

    import numba
    import numpy as np

    @numba.jit
    def euclidean(x, y):
        return np.sqrt(np.sum((x - y)**2))

Query benchmark took 12s

    @numba.jit(fastmath=True)
    def euclidean(x, y):
        result = 0.0
        for i in range(x.shape[0]):
            result += (x[i] - y[i])**2
        return np.sqrt(result)

Query benchmark took 8.5s

    @numba.njit(
        numba.types.float32(
            numba.types.Array(numba.types.float32, 1, "C", readonly=True),
            numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        ),
        fastmath=True,
        locals={
            "result": numba.types.float32,
            "diff": numba.types.float32,
            "i": numba.types.uint16,
        },
    )
    def squared_euclidean(x, y):
        result = 0.0
        dim = x.shape[0]
        for i in range(dim):
            diff = x[i] - y[i]
            result += diff * diff
        return result

Query benchmark took 7.6s

Custom data structure implementations to help numba for often called
code

    @numba.njit(
        "i4(f4[::1],i4[::1],f4,i4)",
    )
    def simple_heap_push(priorities, indices, p, n):
        ...

Numba has significant function call overhead with large parameters
Use closures over static data instead

    @numba.njit()
    def frequently_called_function(param, large_readonly_data):
        ...
        val = access(large_readonly_data, param)
        ...

    def create_frequently_called_function(large_readonly_data):
        @numba.njit()
        def closure(param):
            ...
            val = access(large_readonly_data, param)
            ...

        return closure

How Does it Compare?

Performance

We can test query performance using ann-benchmarks
https://github.com/erikbern/ann-benchmarks

Consider the whole accuracy / performance trade-off space

Caveats:
• Newer algorithms and implementations
• Hardware can make a big difference
• No GPU support for pynndescent

Features

Out of the box support for a wide variety of distance measures:
Euclidean, Cosine, Hamming, Manhattan, Minkowski, Chebyshev, Jaccard, Haversine, Dice, Wasserstein, Hellinger, Spearman, Correlation, Mahalanobis, Canberra, Bray-Curtis, Angular, TSSS, +20 more measures
Image: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa by Maarten Grootendorst

Custom metrics in Python (using numba)

Support for sparse data

Drop-in replacement for sklearn KNeighborsTransformer

Summary

pip install pynndescent
conda install pynndescent
https://github.com/lmcinnes/pynndescent
[email protected] @leland_mcinnes

Questions? [email protected] @leland_mcinnes

