Slide 1

Convergence of Mean Shift Algorithm with radially symmetric kernels with sufficiently large bandwidth

Susovan PAL

December 10, 2021

Slide 2

Introduction to the Mean Shift (MS) algorithm

- Input: discrete data generated by a random variable with unknown probability density function (PDF)
- Goal: locate the local maxima (modes) of the probability density from the discrete data

Slide 3

Mean Shift: uses

- Mean Shift is used for clustering.
- Image segmentation requires clustering, so Mean Shift is useful there. [Figure: (a) the original image; (b), (c) segmented versions]
- Multivalued regression, e.g. for covariates/features following a Gaussian mixture model.

Slide 4

High-level description of the algorithm

- Start with a data point \(y_1\) and fix a region of interest centered at \(y_1\).
- Calculate the (weighted) sample mean in that region, weighted by a kernel depending on distance: far-away points get lower weights, nearby points higher weights. The sample mean should lie close to where points are relatively dense, i.e. where the PDF is high.
- Iteratively shift the center of the region of interest to this sample mean, hence 'mean shift'. The idea is that these sample means should approach the modes of the PDF (see the sketch below).
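
To make the loop concrete, below is a minimal Python sketch for a Gaussian kernel; the function name, tolerance, and iteration cap are illustrative assumptions, not part of the slides:

```python
import numpy as np

def mean_shift_point(y, X, h, tol=1e-6, max_iter=500):
    # Follow one starting point y toward a mode of the density of the
    # data X (one row per point), using Gaussian weights with bandwidth h.
    for _ in range(max_iter):
        # Gaussian profile applied to squared scaled distances ||(y - x_i)/h||^2
        w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1) / 2.0)
        y_next = w @ X / w.sum()   # weighted sample mean of the region
        if np.linalg.norm(y_next - y) < tol:
            break                  # consecutive estimates are close: stop
        y = y_next
    return y
```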

Slide 5

Mean shift: visualization of the previous algorithm

Slide 6

Mean Shift: setup

Let \(\{x_1, x_2, \ldots, x_n\}\) be a set of \(n\) independent data points in \(\mathbb{R}^d\) drawn from some unknown distribution. A radially symmetric kernel is a function \(K : \mathbb{R}^d \to [0, \infty)\) such that:

- \(K \in L^1(\mathbb{R}^d)\) and \(\int_{\mathbb{R}^d} K = 1\);
- \(K(x) = k(\|x\|^2)\) for some suitable non-increasing \(k : [0, \infty) \to [0, \infty)\).

Note that for the popular Gaussian kernel, \(k(r) := C e^{-r/2}\), where \(C\) is a constant making the total integral 1.
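
As a small numerical sanity check of the normalization requirement (an illustration assuming the Gaussian profile; the box size and sample count are arbitrary choices):

```python
import numpy as np

d = 2
C = (2 * np.pi) ** (-d / 2)   # normalizing constant of the Gaussian kernel in R^d

# Monte Carlo estimate of the integral of K over a box [-6, 6]^d
# that holds almost all of the kernel's mass
rng = np.random.default_rng(0)
pts = rng.uniform(-6.0, 6.0, size=(200_000, d))
K_vals = C * np.exp(-np.sum(pts ** 2, axis=1) / 2.0)   # K(x) = k(||x||^2), k(r) = C e^{-r/2}
print(12.0 ** d * K_vals.mean())                       # ≈ 1, i.e. K integrates to 1
```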

Slide 7

Mean Shift: description of the algorithm I

Since the goal is to find the modes of the KDE of the unknown PDF, we first define the KDE as:

\[ f_{h,k}(x) \equiv \hat f(x) := c_N \frac{1}{n} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right) \tag{1} \]

Above, \(h\) is referred to as the 'bandwidth'; it serves the purpose of a radius but is not the radius itself. A small bandwidth means the weights decrease fast, so essentially only points within a small distance of the center of the region of interest contribute to the sample mean. A large bandwidth means farther points also contribute to the sample mean.
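
Eq. (1) translates directly into code; a sketch below, with the Gaussian profile as default and \(c_N\) left at 1, since a positive constant factor does not move the modes:

```python
import numpy as np

def kde(x, X, h, k=lambda r: np.exp(-r / 2.0), c_N=1.0):
    # Kernel density estimate of eq. (1) at a point x, for data X (rows).
    r = np.sum(((x - X) / h) ** 2, axis=1)   # ||(x - x_i)/h||^2 for each i
    return c_N * np.mean(k(r))               # c_N * (1/n) * sum_i k(...)
```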

Slide 8

Mean Shift: description of the algorithm II

Assuming that \(k\) is differentiable with derivative \(k'\), and writing \(g := -k'\), taking the gradient of (1) w.r.t. \(x\) yields:

\[ \nabla f_{h,k}(x) = \frac{2 c_N}{n h^2} \left[\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)\right] \left[\frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x\right] \tag{2} \]

The second bracketed term is called the mean shift (MS) vector \(m_{h,g}(x)\), and hence (2) can be written in the form:

\[ \nabla f_{h,k}(x) = f_{h,g}(x)\, \frac{2 c_N}{h^2 c_N'}\, m_{h,g}(x) \tag{3} \]

where \(f_{h,g}\) is the density estimate built from the profile \(g\), with normalizing constant \(c_N'\).
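
The MS vector of eq. (2) is equally short in code; a sketch, again assuming the Gaussian profile, for which \(g(r) = -k'(r) = \tfrac{C}{2} e^{-r/2}\) (the constant cancels in the quotient):

```python
import numpy as np

def ms_vector(x, X, h, g=lambda r: np.exp(-r / 2.0)):
    # m_{h,g}(x): the g-weighted sample mean of the data minus x, eq. (2).
    w = g(np.sum(((x - X) / h) ** 2, axis=1))
    return (w @ X) / w.sum() - x
```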

Slide 9

Mean Shift: description of the algorithm III

The modes of the estimated density function \(f_{h,k}\) are located at the zeros of the gradient, i.e. where \(\nabla f_{h,k} = 0\). Equating (2) to zero reveals that the modes of the estimated PDF are fixed points of the following function:

\[ m_{h,g}(x) + x = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} \tag{4} \]

The MS algorithm initializes the mode estimate sequence at one of the observed data points. The mode estimate \(y_j\) in the \(j\)-th iteration is updated as:

\[ y_{j+1} = m_{h,g}(y_j) + y_j = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)} \tag{5} \]
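
Running update (5) from every data point and grouping the resulting modes gives the clustering use mentioned on slide 3; a sketch reusing ms_vector above (the threshold values are illustrative):

```python
import numpy as np

def ms_modes(X, h, tol=1e-5, max_iter=1000):
    # Iterate eq. (5) from each data point; points that end at (nearly)
    # the same mode belong to the same cluster.
    modes = []
    for y in X:
        for _ in range(max_iter):
            m = ms_vector(y, X, h)   # mean shift vector from the previous snippet
            y = y + m                # exactly the update y_{j+1} = y_j + m_{h,g}(y_j)
            if np.linalg.norm(m) < tol:
                break
        modes.append(y)
    return np.asarray(modes)
```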

Slide 10

Convergence of Mean Shift: brief history

- The MS algorithm iterates until the norm of the difference between two consecutive mode estimates becomes less than some predefined threshold, i.e. until \(\|m_{h,g}\|\) is small.
- A proof of convergence of \(\{y_j\}\) is unknown for general kernels.
- A proof in one dimension for general kernels is known (Ghassabeh).
- A proof in arbitrary dimension for the Gaussian kernel is known, but only for sufficiently large bandwidth (Ghassabeh).

Slide 11

Convergence of Mean Shift: strategy of proof 1

Some facts already known about the proof of convergence:

Theorem 1. There exists \(h_0 > 0\) such that, for each \(h \ge h_0\), the Hessian of \(f_{h,k}\), denoted \(\mathrm{Hess}(f_{h,k})\), is nonsingular at all the stationary points of \(f_{h,k}\), i.e. where \(\nabla f_{h,k}\) vanishes. In particular, one may take \(h_0 := \max_{1 \le i \le n} \|x_i\|\).

Theorem 2. If the Hessian matrix of a \(C^2\) function is nonsingular at its stationary points, then the stationary points are isolated. (Proof: apply the inverse function theorem to \(\nabla f_{h,k}\).)

Theorem 3. Let \(x_i \in \mathbb{R}^d\), \(i = 1, 2, \ldots, n\). Assume that the stationary points of the estimated PDF are isolated. Then the mode estimate sequence \(\{y_j\}\) in (5) converges.

Slide 12

Convergence of Mean Shift: strategy of proof 2

- Ghassabeh proved Theorem 1 for Gaussian kernels: \(\exists h_0 > 0\) such that \(\forall h \ge h_0\), \(\mathrm{Hess}(f_{h,k})\) is nonsingular for Gaussian kernels with \(k(r) = C e^{-r/2}\).
- This made the subsequent Theorem 2 and Theorem 3 go through.
- This means we just need to generalize Theorem 1 in order to ensure that \(\{y_j\}\) converges. This is precisely what we do.

Slide 13

Convergence of Mean Shift: new theorem

Below is a slightly weaker version of a generalization of Theorem 1:

Theorem 4 (P.). Assume that \(K(\cdot) = k(\|\cdot\|^2)\) is strictly positive definite, with the first two derivatives of \(k\) finite at 0. Then \(\exists h_0 > 0\) such that \(\forall h \ge h_0\), \(\mathrm{Hess}(f_{h,k})\) is nonsingular; therefore, by Theorem 2 and Theorem 3, the mode estimate sequence \(\{y_i = y_i(h)\}\) corresponding to \(K_{h,k}\) converges \(\forall h \ge h_0\).

Remark 1. Note that this theorem is slightly weaker (as of now) in the sense that it does not explicitly find \(h_0\), unlike Ghassabeh's result, where \(h_0 := \max_{1 \le i \le n} \|x_i\|\).

Slide 14

Proof of convergence of mean shift for general kernels

Our proof will require an alternate characterization of positive definite kernel matrices, so we state some definitions below:

Definition 5 (Completely monotone functions). A function \(k : [0, \infty) \to \mathbb{R}\) is called completely monotone if:

(1) \(k \in C^0([0, \infty)) \cap C^\infty((0, \infty))\);
(2) \((-1)^l k^{(l)} \ge 0\) on \((0, \infty)\) for all \(l = 0, 1, 2, \ldots\)

Examples: \(k(r) = r^s\) for \(s \le 0\); \(e^{-sr}\) for \(s \ge 0\); \(\ln(1 + 1/r)\); \(e^{1/r}\). If \(f, g\) are completely monotone and \(c, d > 0\), then \(cf + dg\) and \(fg\) are also completely monotone!
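
A quick symbolic spot check of property (2) for the example \(k(r) = e^{-sr}\), \(s \ge 0\) (illustrative only; checking finitely many derivatives is of course not a proof):

```python
import sympy as sp

r = sp.symbols('r', positive=True)
s = sp.symbols('s', nonnegative=True)
k = sp.exp(-s * r)

for l in range(5):
    # (-1)^l * k^(l)(r) simplifies to s**l * exp(-r*s), which is >= 0
    print(l, sp.simplify((-1) ** l * sp.diff(k, r, l)))
```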

Slide 15

Alternative characterizations of completely monotone functions I

Next, we state two alternate characterizations of completely monotone functions, one in each slide.

Theorem 6 (Hausdorff-Bernstein-Widder theorem: Laplace transform characterization of completely monotone functions). A function \(k : [0, \infty) \to \mathbb{R}\) is completely monotone if and only if it is the Laplace transform of a finite non-negative Borel measure \(\mu\) on \([0, \infty)\), i.e. \(k\) is of the form:

\[ k(r) = \mathcal{L}\mu(r) = \int_0^\infty e^{-rt}\, d\mu(t) \]
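
A worked example (not from the slides): the completely monotone function \(k(r) = 1/(1+r)\) arises from the absolutely continuous measure \(d\mu(t) = e^{-t}\, dt\):

```latex
\[
  \mathcal{L}\mu(r) = \int_0^\infty e^{-rt}\, e^{-t}\, dt
                    = \int_0^\infty e^{-(r+1)t}\, dt
                    = \frac{1}{r+1},
\]
% while the Gaussian profile k(r) = e^{-r} corresponds to the
% point mass mu = delta_1 (see slide 18).
```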

Slide 16

Alternative characterizations of completely monotone functions II

First recall that a function \(K\) is called non-negative definite if the corresponding kernel matrix \(\mathbf{K} := [K(x_i - x_j)]_{1 \le i,j \le n}\) is non-negative definite. This amounts to the matrix \(\mathbf{K} := [k(\|x_i - x_j\|^2)]_{1 \le i,j \le n}\) being non-negative definite.

Theorem 7 (Non-negative definite kernel matrices are constructed from completely monotone functions). A function \(k\) is completely monotone on \([0, \infty)\) if and only if \(K(x) := k(\|x\|^2)\) is non-negative definite and radial on \(\mathbb{R}^d\) for all \(d \in \mathbb{N}\). That is, the matrix \(\mathbf{K}\) with \(\mathbf{K}_{ij} := K(x_i - x_j) = k(\|x_i - x_j\|^2)\) is non-negative definite.
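
A numerical illustration of Theorem 7 (not a proof): build the kernel matrix for the Gaussian profile and check that its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                       # 30 points in R^3
# squared pairwise distances ||x_i - x_j||^2
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-sq)                                 # K_ij = k(||x_i - x_j||^2), k(r) = e^{-r}
print(np.linalg.eigvalsh(Kmat).min())              # >= 0 (up to round-off)
```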

Slide 17

Connection to our problem

Thanks to the previous two theorems, one can write any non-negative definite radially symmetric kernel matrix \(\mathbf{K} = (\mathbf{K}_{ij})\) as

\[ \mathbf{K}_{ij} = \int_0^\infty e^{-t \|x_i - x_j\|^2}\, d\mu(t) \]

for some finite Borel measure \(\mu\). This should be useful! Something more is true:

Theorem 8. A non-constant function \(k\) is completely monotone on \([0, \infty)\) if and only if \(\Phi(x) := k(\|x\|^2)\) is strictly positive definite and radial on \(\mathbb{R}^d\) for all \(d \in \mathbb{N}\).

Slide 18

Connection to our problem (continued)

And finally, something more:

Theorem 9. A non-constant function \(k : [0, \infty) \to \mathbb{R}\) is completely monotone if and only if it is the Laplace transform of a finite non-negative Borel measure \(\mu\) on \([0, \infty)\) not of the form \(c\delta_0\), \(c > 0\); i.e., \(k\) is of the form:

\[ k(r) = \mathcal{L}\mu(r) = \int_0^\infty e^{-rt}\, d\mu(t) \tag{6} \]

where \(\mu\) is not of the form \(c\delta_0\), \(c > 0\).

Note that for Gaussian kernels, \(k(r) = e^{-r}\) and \(\mu := \delta_1\).

Slide 19

Details of the proof of Theorem 4 (slide 13) - Part I

Start by computing the gradient and Hessian of the KDE; we abbreviate \(f_{h,k}\) by \(\hat f\) and absorb the factor \(1/n\) into \(c_N\). Recall:

\[ \hat f(x) = c_N \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right) \]

Taking the gradient w.r.t. \(x\) and using the representation (6):

\[ \nabla \hat f(x) = c_N \sum_{i=1}^{n} \frac{x_i - x}{h^2} \int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t) \]

Slide 20

Details of the proof of Theorem 4 - Part II

\[ \mathrm{Hess}\, \hat f(x) = c_N \sum_{i=1}^{n} \int_0^\infty \frac{2t}{h^2} \left[-I_{D \times D} + \frac{2t}{h^2}(x - x_i)(x - x_i)^T\right] \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t) \]

Define the following two functions:

\(C : \mathbb{R}^d \to \mathbb{R}\) by
\[ C(x) := \int_0^\infty 2t \sum_{i=1}^{n} \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t) \]

\(A : \mathbb{R}^d \to \mathbb{R}^{d \times d}\) by
\[ A(x) := \sum_{i=1}^{n} \int_0^\infty 4t^2 (x - x_i)(x - x_i)^T \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t) \]
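
As a spot check of this Hessian formula in the Gaussian case \(\mu = \delta_1\) (so \(k(r) = e^{-r}\), taking \(c_N = 1\)), one can compare it against finite differences of the KDE; all names and step sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3)); h = 2.0; x0 = np.array([0.3, -0.1, 0.5])

# hat f(x) with c_N = 1 and mu = delta_1
f = lambda x: np.sum(np.exp(-np.sum((x - X) ** 2, axis=1) / h ** 2))

def hess_formula(x):
    e = np.exp(-np.sum((x - X) ** 2, axis=1) / h ** 2)   # exp(-t r_i) at t = 1
    C = 2 * e.sum()                                      # C(x)
    A = 4 * np.einsum('i,ij,ik->jk', e, x - X, x - X)    # A(x)
    return -C / h ** 2 * np.eye(3) + A / h ** 4          # H(x), see next slide

# second-order central differences for the Hessian of f
eps = 1e-4
H_num = np.zeros((3, 3))
for a in range(3):
    for b in range(3):
        ea, eb = eps * np.eye(3)[a], eps * np.eye(3)[b]
        H_num[a, b] = (f(x0 + ea + eb) - f(x0 + ea - eb)
                       - f(x0 - ea + eb) + f(x0 - ea - eb)) / (4 * eps ** 2)

print(np.max(np.abs(H_num - hess_formula(x0))))          # small discrepancy
```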

Slide 21

Details of the proof of Theorem 4 - Part III

For motivation, note that the second (rightmost) part of the Hessian \(H(x)\) is nothing but \(c_N h^{-2} A(x) h^{-2}\), and the first part is nothing but \(-c_N h^{-2} C(x) I_{D \times D}\), i.e.:

\[ H(x) = -c_N \frac{C(x)}{h^2} I_{D \times D} + c_N \frac{A(x)}{h^4} \]

Next we argue by contradiction.

Slide 22

Details of the proof of Theorem 4 - Part IV

Assume, if possible, that \(H(x)\) is singular (not of full rank). Then \(H(x)\) has a zero eigenvalue, with some eigenvector \(v \in \mathbb{R}^{D \times 1}\), so that \(H(x)v = 0 \in \mathbb{R}^{D \times 1}\). Using the expression for \(H\) above:

\[ \left[-c_N \frac{C(x)}{h^2} I_{D \times D} + c_N \frac{A(x)}{h^4}\right] v = 0 \implies A(x)\, v = h^2\, C(x)\, v \]

i.e.

\[ \sum_{i=1}^{n} (x - x_i)(x - x_i)^T \left[\int_0^\infty 4t^2 \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)\right] v = h^2 \left[\sum_{i=1}^{n} \int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)\right] v \]

Slide 23

Details of the proof of Theorem 4 - Part IV (continued)

Let us abbreviate notation and write the above as:

\[ \sum_{i=1}^{n} (x - x_i)(x - x_i)^T J_i(h)\, v = h^2 \sum_{i=1}^{n} I_i(h)\, v \]

where:

\[ I_i(h) := \int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t), \qquad J_i(h) := \int_0^\infty 4t^2 \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t) \]

Slide 24

Details of the proof of Theorem 4 - Part V

Note that as \(h \to \infty\), each \(I_i(h) \to 2\int_0^\infty t\, d\mu(t) = -2k'(0)\) and \(J_i(h) \to 4\int_0^\infty t^2\, d\mu(t) = 4k''(0)\), so the above equation tends to:

\[ 4 k''(0) \sum_{i=1}^{n} (x - x_i)(x - x_i)^T v = -2 h^2\, n\, k'(0)\, v \]

This gives a contradiction as \(h \to \infty\): the first two derivatives of \(K\) at 0 being finite implies that \(k'(0)\) and \(k''(0)\) are finite, so the left-hand side stays bounded while the right-hand side grows like \(h^2\) (note \(k'(0) = -\int_0^\infty t\, d\mu(t) < 0\), since \(\mu\) is not of the form \(c\delta_0\)). So there exists a sufficiently large \(h_0 > 0\) such that \(\forall h \ge h_0\) the above equation cannot hold, meaning that the Hessian \(H(x)\) has no zero eigenvalue.
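
For the Gaussian profile (\(\mu = \delta_1\), \(k(r) = e^{-r}\), so \(k'(0) = -1\), \(k''(0) = 1\)), these limits can be checked numerically; the distance value is an arbitrary illustrative choice:

```python
import numpy as np

d_i = 7.3                              # a fixed squared distance ||x - x_i||^2
for h in [1.0, 10.0, 100.0, 1000.0]:
    I_i = 2 * np.exp(-d_i / h ** 2)    # -> 2 = -2 k'(0)
    J_i = 4 * np.exp(-d_i / h ** 2)    # -> 4 = 4 k''(0)
    print(h, I_i, J_i)
```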

Slide 25

Practical usefulness of the result

- Choosing a large bandwidth \(h\) amounts to taking too many points into the sample mean, which may keep the mode estimate sequence from converging to a genuine local mode of the density.
- So clusters may not be found correctly.
- On the other hand, choosing too small an \(h\) also won't let us locate the local sample mean (the local centroid of the density) correctly.
- Hence the result may not be practically useful.

Slide 26

Possible future work

- Finding a necessary and sufficient criterion for convergence: it has been shown that \(y_{j+1} - y_j \to 0\), but this does not imply that \(\{y_j\}\) is a Cauchy sequence; e.g. \(u_j := \sin(\sqrt{j})\) is a bounded, non-Cauchy, non-convergent sequence with \(u_{j+1} - u_j \to 0\). We can still leverage this property; see the numerical sketch below.
- Convergence of mean shift on manifolds should be a mathematically interesting theory to look into, as KDE on manifolds with nice properties is an active area of research.
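
A numerical illustration of the cautionary example (a sketch; the ranges are arbitrary):

```python
import numpy as np

j = np.arange(1, 2_000_000)
u = np.sin(np.sqrt(j))
# consecutive differences shrink toward 0 ...
print(np.abs(np.diff(u))[-5:])
# ... yet the sequence keeps oscillating across [-1, 1] and does not converge
print(u[-100_000:].min(), u[-100_000:].max())
```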

Slide 27

Thanks

THANK YOU!!!