PyNNDescent: Fast Approximate Nearest Neighbors with Numba

A PDF version of slides for my SciPy 2021 talk on PyNNDescent.

Leland McInnes

July 16, 2021

Transcript

  1. How do you search for nearest neighbours of a query using a graph?

    Malkov and Yashunin, 2018; Dong, Moses and Li, 2011; Iwasaki and Miyazaki, 2018
  2. Start with a nearest neighbour graph of the training data

    Assume we now want to find neighbours of a query point
  3. Look at all nodes connected by an edge to the best untried candidate node in the graph

    Add all these nodes to our potential candidate pool
  4. Sort the candidate pool by closeness to the query point

    Truncate the pool to the k best candidates
  5. Run one iteration of search for every node

    Update the graph with new better neighbours
    Search is better on the improved graph
  6. Necessary to find green’s nearest neighbour

    Necessary to find blue’s nearest neighbour
    Not required since we can traverse through blue
  7. Profile and inspect LLVM code for innermost functions

    Type declarations and code choices can help the compiler a lot!
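  8. Plain Numba-jitted Euclidean distance:

        import numba
        import numpy as np

        @numba.jit(fastmath=True)
        def euclidean(x, y):
            # Accumulate squared differences, then take the root
            result = 0.0
            for i in range(x.shape[0]):
                result += (x[i] - y[i])**2
            return np.sqrt(result)

    Query benchmark took 8.5s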
  9. Explicit type signature, typed locals, and no square root:

        import numba

        @numba.njit(
            # Signature: float32(readonly float32[::1], readonly float32[::1])
            numba.types.float32(
                numba.types.Array(numba.types.float32, 1, "C", readonly=True),
                numba.types.Array(numba.types.float32, 1, "C", readonly=True),
            ),
            fastmath=True,
            locals={
                "result": numba.types.float32,
                "diff": numba.types.float32,
                "i": numba.types.uint16,
            },
        )
        def squared_euclidean(x, y):
            result = 0.0
            dim = x.shape[0]
            for i in range(dim):
                diff = x[i] - y[i]
                result += diff * diff
            return result

    Query benchmark took 7.6s
  10. Numba has significant function call overhead with large parameters

    Use closures over static data instead
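  11. Passing the large data as a parameter on every call:

        import numba

        # `access` and `...` stand for code elided on the slide
        @numba.njit()
        def frequently_called_function(param, large_readonly_data):
            ...
            val = access(large_readonly_data, param)
            ...

    Closing over the data instead:

        def create_frequently_called_function(large_readonly_data):
            @numba.njit()
            def closure(param):
                ...
                val = access(large_readonly_data, param)
                ...
            return closure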
  12. vs

  13. Out of the box support for a wide variety of distance measures:

    Euclidean, Cosine, Hamming, Manhattan, Minkowski, Chebyshev, Jaccard, Haversine, Dice, Wasserstein, Hellinger, Spearman, Correlation, Mahalanobis, Canberra, Bray-Curtis, Angular, TSSS, +20 more measures

    https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa (by Maarten Grootendorst)
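
Slides 2 to 4 describe the query procedure in words. A minimal pure-Python sketch of that loop follows, assuming the graph is stored as a dict of adjacency lists and the search begins from a fixed start node. The names `graph_search`, `euclidean`, and `start` are illustrative and are not taken from PyNNDescent, which runs Numba-compiled code and chooses its starting candidates differently.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def graph_search(graph, data, query, k, start=0):
    """Greedy search on a neighbour graph (illustrative sketch).

    graph: dict mapping node index -> list of neighbour indices
    data:  list of points, indexed by node
    """
    visited = {start}   # nodes whose distance to the query is already known
    tried = set()       # nodes whose edges have already been expanded
    pool = [(euclidean(data[start], query), start)]
    while True:
        untried = [(d, n) for d, n in pool if n not in tried]
        if not untried:
            break       # every candidate in the pool has been expanded
        _, node = min(untried)          # best untried candidate
        tried.add(node)
        for nb in graph[node]:          # all nodes connected by an edge
            if nb not in visited:
                visited.add(nb)
                pool.append((euclidean(data[nb], query), nb))
        pool.sort()                     # sort by closeness to the query
        del pool[k:]                    # truncate to the k best
    return [n for _, n in pool]


# Ten points on a line, each linked to its immediate neighbours.
data = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
result = graph_search(graph, data, (7.2,), k=2)
```

In this toy graph the search walks from node 0 along the line until no untried candidate remains in the truncated pool. On a real nearest neighbour graph the result is approximate, since the pool can converge before the true neighbours are reached.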
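
Slide 5 outlines how the graph itself is built: search from every node, keep any better neighbours found, and repeat. The sketch below is a simplified reading of that idea, assuming a random initial graph and a one-step search that only inspects neighbours of neighbours. The function names are hypothetical, and the full NN-descent algorithm of Dong, Moses and Li adds refinements (such as reverse neighbours and sampling) that are omitted here.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_graph(n, k, rng):
    """Give every node k distinct random neighbours (never itself)."""
    return {i: rng.sample([j for j in range(n) if j != i], k) for i in range(n)}


def refine_once(graph, data, k):
    """One pass: each node looks at its neighbours' neighbours and
    keeps the k closest candidates it has seen."""
    new_graph = {}
    for node, neighbours in graph.items():
        candidates = set(neighbours)
        for nb in neighbours:
            candidates.update(graph[nb])
        candidates.discard(node)
        ranked = sorted(candidates, key=lambda c: (euclidean(data[node], data[c]), c))
        new_graph[node] = ranked[:k]
    return new_graph


def total_distance(graph, data):
    """Sum of all edge lengths; lower means a better neighbour graph."""
    return sum(euclidean(data[i], data[j]) for i, nbs in graph.items() for j in nbs)


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(50)]
graph = random_graph(len(data), 5, rng)
before = total_distance(graph, data)
for _ in range(5):
    graph = refine_once(graph, data, 5)
after = total_distance(graph, data)
```

Because each node's current neighbours are always among its candidates, a pass can never make a node's neighbour list worse, so `after` is no larger than `before`. Each pass also searches an improved graph, which is the point made on the slide.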
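
Slide 6 labels which edges a search graph needs and which it can drop because the target is reachable through another neighbour. One common way to express that rule is sketched below: an edge is kept only if no already-kept neighbour is closer to the candidate than the node itself. This is an assumption about the kind of pruning meant, with a hypothetical function name `prune_edges`; it is not claimed to be PyNNDescent's exact criterion.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prune_edges(node, neighbours, data):
    """Drop edges to points that can be reached through a closer kept neighbour."""
    kept = []
    # Consider candidates from nearest to farthest.
    for cand in sorted(neighbours, key=lambda c: euclidean(data[node], data[c])):
        d_node = euclidean(data[node], data[cand])
        # Keep the edge only if no kept neighbour is closer to cand than node is.
        if all(euclidean(data[other], data[cand]) >= d_node for other in kept):
            kept.append(cand)
    return kept


# Node 0 with three neighbours; node 3 sits just beyond node 1.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (2.0, 0.1)]
pruned = prune_edges(0, [1, 2, 3], data)
```

Here the edge from node 0 to node 3 is dropped, since a search can reach node 3 by traversing through node 1, while the edges to nodes 1 and 2 are both kept.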