
PyNNDescent: Fast Approximate Nearest Neighbors with Numba

A PDF version of slides for my SciPy 2021 talk on PyNNDescent.


Leland McInnes

July 16, 2021


  1. Fast Approximate Nearest Neighbour Search with Numba

  2. What are Nearest Neighbours?

  3. Given a set of points with a distance measure between them …

  4. … and a new “query point” …

  5. Find the closest points to the query point

  6. Why Nearest Neighbors?

  7. Nearest Neighbour computations are at the heart of many machine

    learning algorithms
  8. KNN-Classifiers, KNN-Regressors

  9. Clustering: HDBSCAN, DBSCAN, Single Linkage Clustering, Spectral Clustering
     (Images: https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg by Chire;
     https://www.flickr.com/photos/trevorpatt/41875889652/in/photostream/ by Trevor Patt)
  10. Dimension Reduction: t-SNE, Isomap, Spectral Embedding, UMAP
     (Images: http://lvdmaaten.github.io/tsne/; http://www-clmc.usc.edu/publications/T/tenenbaum-Science2000.pdf)

  11. Recommender Systems, Query Expansion

  12. Why Approximate Nearest Neighbours?

  13. Finding exact nearest neighbours is hard

  14. Approximate nearest neighbour search trades accuracy for performance

  15. How Do You Find Nearest Neighbors?

  16. Using Trees

  17. Hierarchically divide up the space into a tree

  18. Bound the search using the tree structure (and the triangle inequality)

  19. KD-Tree

  20. Ball Tree

  21. Random Projection Tree

  22. Using Graphs

  23. How do you search for nearest neighbours of a query using a graph?
     (Malkov and Yashunin, 2018; Dong, Moses and Li, 2011; Iwasaki and Miyazaki, 2018)
  24. Start with a nearest neighbour graph of the training data.
     Assume we now want to find neighbours of a query point.
  25. Choose a starting node in the graph (potentially randomly) as

    a candidate node
  27. Look at all nodes connected by an edge to the best untried candidate
     node in the graph. Add all these nodes to our potential candidate pool.
  29. Sort the candidate pool by closeness to the query point.
     Truncate the pool to the k best candidates.
  31. Return to the Expansion step unless we have already tried

    all the candidates in the pool
  32. Stop when there are no untried candidates in the pool
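
The search loop on slides 25–32 can be sketched in plain Python. This is a simplified illustration, not PyNNDescent's actual implementation; `graph` maps each node to its neighbour list, `data` holds the points, and `dist` is any distance function:

```python
def graph_search(graph, dist, data, query, start, k=5):
    """Greedy graph search: repeatedly expand the best untried candidate,
    keeping a pool of the k closest nodes seen so far."""
    # candidate pool entries: (distance to query, node, tried flag)
    pool = [(dist(data[start], query), start, False)]
    visited = {start}
    while True:
        untried = [c for c in pool if not c[2]]
        if not untried:          # stop: every candidate has been tried
            break
        d, node, _ = min(untried)
        pool[pool.index((d, node, False))] = (d, node, True)
        # expansion: add every neighbour of the chosen node to the pool
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                pool.append((dist(data[nbr], query), nbr, False))
        # sort by closeness to the query and truncate to the k best
        pool.sort()
        pool = pool[:k]
    return [(node, d) for d, node, _ in pool]
```

On a toy chain of points this homes in on the query's true neighbours even from a distant starting node.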

  37. Looks inefficient. Scales up well.

  39. Graph adapts to intrinsic dimension of the data

  40. But how do we build the graph?!

  41. The algorithm works (badly) even on a bad graph

  42. Run one iteration of search for every node. Update the graph with
     new, better neighbours. Search is better on the improved graph.
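
A toy version of that iteration (illustrative only; real NN-descent, as in Dong, Moses and Li, samples candidates and uses many further optimisations):

```python
import random

def nn_descent(data, dist, k, n_iters=10):
    """Toy NN-descent: start from a random graph, then repeatedly try to
    improve each node's neighbour list using its neighbours' neighbours
    (following both forward and reverse edges)."""
    n = len(data)
    graph = {i: sorted(random.sample([j for j in range(n) if j != i], k),
                       key=lambda j: dist(data[i], data[j]))
             for i in range(n)}
    for _ in range(n_iters):
        # reverse edges: everyone who currently lists i as a neighbour
        rev = {i: [] for i in range(n)}
        for i in range(n):
            for j in graph[i]:
                rev[j].append(i)
        updated = False
        for i in range(n):
            # candidates: neighbours, reverse neighbours, and their neighbours
            cands = set(graph[i]) | set(rev[i])
            for j in list(cands):
                cands.update(graph[j])
                cands.update(rev[j])
            cands.discard(i)
            best = sorted(cands, key=lambda j: dist(data[i], data[j]))[:k]
            if set(best) != set(graph[i]):
                graph[i] = best
                updated = True
        if not updated:          # converged: no neighbour list improved
            break
    return graph
```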
  48. Perfect accuracy of neighbours is not assured.
     We can get an approximate knn-graph quickly.
  49. How Do You Make it Fast?

  50. Algorithm tricks

  51. Query node, Expansion node, Current neighbour

  52. Neighbour A, Neighbour B, Common node

  53. Hubs have a lot of neighbours!

  56. Sample neighbours when constructing the graph.
     Prune away edges before performing searches.
  57. Necessary to find green’s nearest neighbour. Necessary to find
     blue’s nearest neighbour. Not required since we can traverse through blue.
  58. For search, remove the longest edges of any triangles in the graph.
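
A sketch of that pruning rule, assuming an undirected graph stored as a dict of neighbour sets (the function name is illustrative, not the library's):

```python
def prune_long_triangle_edges(graph, dist, data):
    """Drop the longest edge of every triangle in the graph: a search can
    still reach that neighbour by traversing the two shorter edges."""
    pruned = {i: set(nbrs) for i, nbrs in graph.items()}
    for i in graph:
        for j in graph[i]:
            for k in graph[i]:
                if k in graph[j]:  # edges i-j, i-k, j-k form a triangle
                    d_ij = dist(data[i], data[j])
                    d_ik = dist(data[i], data[k])
                    d_jk = dist(data[j], data[k])
                    if d_ik > d_ij and d_ik > d_jk:
                        pruned[i].discard(k)  # i-k is the longest edge
    return {i: sorted(nbrs) for i, nbrs in pruned.items()}
```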
  59. Initialize with Random Projection Trees
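
The random projection trees used for initialisation can be sketched with a simplified splitting rule: pick two random points and partition by which of the two is closer, which is equivalent to splitting on the hyperplane bisecting them. PyNNDescent's actual trees differ in detail; this is an illustration only.

```python
import random

def rp_split(indices, data):
    """One random split: partition points by which of two randomly
    chosen pivot points they are closer to."""
    a, b = random.sample(list(indices), 2)
    left, right = [], []
    for i in indices:
        d_a = sum((x - y) ** 2 for x, y in zip(data[i], data[a]))
        d_b = sum((x - y) ** 2 for x, y in zip(data[i], data[b]))
        (left if d_a <= d_b else right).append(i)
    return left, right

def rp_tree_leaves(indices, data, leaf_size=30):
    """Recursively split until every leaf holds at most leaf_size points.
    The leaves then seed the approximate k-NN graph."""
    if len(indices) <= leaf_size:
        return [indices]
    left, right = rp_split(indices, data)
    if not left or not right:      # degenerate split; stop here
        return [indices]
    return (rp_tree_leaves(left, data, leaf_size) +
            rp_tree_leaves(right, data, leaf_size))
```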

  60. Implementation tricks

  62. Profile and inspect LLVM code for innermost functions.
     Type declarations and code choices can help the compiler a lot!
  63. Query benchmark took 12s:

      @numba.jit
      def euclidean(x, y):
          return np.sqrt(np.sum((x - y)**2))
  64. Query benchmark took 8.5s:

      @numba.jit(fastmath=True)
      def euclidean(x, y):
          result = 0.0
          for i in range(x.shape[0]):
              result += (x[i] - y[i])**2
          return np.sqrt(result)
  65. Query benchmark took 7.6s:

      @numba.njit(
          numba.types.float32(
              numba.types.Array(numba.types.float32, 1, "C", readonly=True),
              numba.types.Array(numba.types.float32, 1, "C", readonly=True),
          ),
          fastmath=True,
          locals={
              "result": numba.types.float32,
              "diff": numba.types.float32,
              "i": numba.types.uint16,
          },
      )
      def squared_euclidean(x, y):
          result = 0.0
          dim = x.shape[0]
          for i in range(dim):
              diff = x[i] - y[i]
              result += diff * diff
          return result
  66. Custom data structure implementations to help numba for often-called functions

  67. @numba.njit("i4(f4[::1],i4[::1],f4,i4)")
      def simple_heap_push(priorities, indices, p, n):
          ...
  68. Numba has significant function call overhead with large parameters.
     Use closures over static data instead.
  69. @numba.njit()
      def frequently_called_function(param, large_readonly_data):
          ...
          val = access(large_readonly_data, param)
          ...

      def create_frequently_called_function(large_readonly_data):
          @numba.njit()
          def closure(param):
              ...
              val = access(large_readonly_data, param)
              ...
          return closure
  70. How Does it Compare?

  71. Performance

  72. We can test query performance using ann-benchmarks https://github.com/erikbern/ann-benchmarks

  73. Consider the whole accuracy / performance trade-off space

  74. vs

  79. Caveats:
     • Newer algorithms and implementations exist
     • Hardware can make a big difference
     • No GPU support for pynndescent
  80. Features

  81. Out of the box support for a wide variety of distance measures:
     Euclidean, Cosine, Hamming, Manhattan, Minkowski, Chebyshev, Jaccard,
     Haversine, Dice, Wasserstein, Hellinger, Spearman, Correlation,
     Mahalanobis, Canberra, Bray-Curtis, Angular, TSSS, and 20+ more measures
     (https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa by Maarten Grootendorst)
  82. Custom metrics in Python (using numba)

  83. Support for sparse data

  84. Drop-in replacement for sklearn KNeighborsTransformer

  85. Summary

  86. pip install pynndescent (or conda install pynndescent)
     https://github.com/lmcinnes/pynndescent
     leland.mcinnes@gmail.com / @leland_mcinnes

  87. Questions? leland.mcinnes@gmail.com @leland_mcinnes