PAUL CONSTANTINE
University of Colorado
[email protected]

This note revisits a study from the above paper, which is a very nice read, on the notion of “nearest neighbor” for a set of points in high dimensions. The paper comes from the database literature, where nearest neighbor queries are often used. However, I am interested in this work as it relates to choosing points in a high-dimensional input space to study the behavior or characterize properties of a function of several variables, e.g., integration, approximation / interpolation, and optimization. This study was inspired by Brian Reich, who stated the following aphorism to me a few years ago: “There’s no such thing as neighbors in high dimensions.”
Fig 1 shows how we typically imagine a nearest neighbor query for a set of two-dimensional points. The authors call this a meaningful nearest neighbor query. In contrast, Fig 2 shows a set of points arranged in a circle, with the query point just off the center of the circle. Technically, the query point has one nearest neighbor. However, that nearest neighbor is not much closer than any other point on the circle. The study below shows that Fig 2 is more representative of the situation in high dimensions than Fig 1; in particular, the nearest neighbor is not much closer than the farthest point in the set.
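To make Fig 2’s point concrete, here is a minimal numerical check (my own illustration, not from the paper): points on a circle with a query just off the center give a max-to-min distance ratio barely above 1.

```python
import numpy as np

# 100 points on the unit circle; query point slightly off the center.
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
points = np.column_stack([np.cos(theta), np.sin(theta)])
q = np.array([0.05, 0.0])

d = np.linalg.norm(points - q, axis=1)  # distance from q to every point
print(d.max() / d.min())                # ~1.1: the nearest point is barely nearer
```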
I studied the following point sets used in numerical integration and design-of-experiments for computer experiments, all on the hypercube $[0,1]^m$:

• uniform random points
• Latin hypercube points
• Sobol sequences
• Halton sequences

Let $X = \{ x_1, \dots, x_N \}$ be a set of $N$ points in $m$ dimensions. Let $q$ be a query point drawn uniformly from the hypercube. Compute the ratio

$$ r \;=\; r(q, X) \;=\; \frac{\max_i \, \lVert q - x_i \rVert_2}{\min_i \, \lVert q - x_i \rVert_2}. $$

Repeat this computation for 100 query points. For the random point sets (uniform and Latin hypercube), repeat this computation for 100 realizations of the point set. Compute the sample average and standard deviation of the resulting ratios. Repeat these calculations for (i) the number $N$ of samples in the point set equal to 10, 100, 1000, and 10000, and (ii) the dimension $m$ from 1 to 30. A code sketch of this computation appears below.
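Here is a minimal sketch of the ratio computation, assuming NumPy and the generators in scipy.stats.qmc as stand-ins for the four point-set types; the parameter values are illustrative, not prescriptive.

```python
import numpy as np
from scipy.stats import qmc

def ratio(q, X):
    """Max-to-min distance ratio r(q, X) from the definition above."""
    d = np.linalg.norm(X - q, axis=1)  # ||q - x_i||_2 for every point x_i
    return d.max() / d.min()

def sample_points(kind, N, m, rng):
    """Draw N points in [0,1]^m from one of the four point-set types."""
    if kind == "uniform":
        return rng.random((N, m))
    if kind == "lhs":
        return qmc.LatinHypercube(d=m, seed=rng).random(N)
    if kind == "sobol":
        return qmc.Sobol(d=m, seed=rng).random(N)
    if kind == "halton":
        return qmc.Halton(d=m, seed=rng).random(N)
    raise ValueError(f"unknown point-set type: {kind}")

rng = np.random.default_rng(0)
m, N = 10, 1024                        # one (dimension, sample size) pair
X = sample_points("halton", N, m, rng)
ratios = [ratio(rng.random(m), X) for _ in range(100)]  # 100 uniform queries
print(np.mean(ratios), np.std(ratios))  # sample mean and standard deviation
```

Sweeping $m$ and $N$, and averaging over realizations for the random point sets, yields the statistics plotted below.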
The following plots show the results. Each figure corresponds to one type of point set. Each line represents a different number of samples; see the legend. The plots show the statistics (mean and standard deviation) of the ratio as a function of dimension.

REMARKS

For each type of point set, the expected ratio of the max-to-min distances rapidly approaches 1 as the dimension increases. In other words, the nearest point is not much closer than the farthest point. This has direct implications for Gaussian process emulators built with radially symmetric kernels, particularly for adaptive heuristics that use some flavor of the “explore vs. exploit” trade-off. Such a trade-off is meaningless if all points in high dimensions are explorations.

Surprisingly, the standard deviation of the ratio also converges rapidly to 0.1. I do not have a good explanation for this; I was expecting the standard deviation to increase with dimension, not decrease.

The point-set types other than uniform random points do not fare significantly better than the random points. Thus, I question the notion and utility of space-filling point sets.
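As a quick check on that last remark, the sketch below (my own, with assumed parameter values) compares the mean ratio for uniform random and Sobol points across dimensions, so you can see for yourself whether the space-filling set fares any better.

```python
import numpy as np
from scipy.stats import qmc

N = 1024  # a power of two keeps the Sobol sequence balanced
for m in [1, 5, 10, 20, 30]:
    rng = np.random.default_rng(m)
    queries = rng.random((100, m))  # 100 uniform query points
    for name, X in [("uniform", rng.random((N, m))),
                    ("sobol", qmc.Sobol(d=m, seed=rng).random(N))]:
        # pairwise distances: rows index queries, columns index points
        d = np.linalg.norm(X[None, :, :] - queries[:, None, :], axis=2)
        r = d.max(axis=1) / d.min(axis=1)  # max-to-min ratio per query
        print(f"m={m:2d}  {name:7s}  mean r = {r.mean():8.2f}")
```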