
Susovan Pal

S³ Seminar
December 10, 2021

Susovan Pal (EPITA)

https://s3-seminar.github.io/seminars/susovan-pal/

Title — A sufficient condition for convergence of Mean Shift algorithm in any dimension, with radially symmetric, strictly positive definite kernels

Abstract — The mean shift (MS) is a non-parametric, density-based, iterative algorithm that has been used to find the modes of an estimated probability density function (pdf). Although the MS algorithm has been widely used in many applications, such as clustering, image segmentation, and object tracking, a rigorous proof of its convergence in the fully general case is still missing. Two significant steps in this direction were taken by Ghassabeh, who proved convergence for Gaussian kernels in any dimension, and also proved convergence in one dimension for kernels with differentiable, strictly decreasing, convex profiles. As of now, we are not aware of any proof of convergence of the MS algorithm for fully general kernels. This paper/talk aims to give a sufficient condition for convergence in any dimension and for any strictly positive definite, smooth kernel. Some open questions for further research will also be addressed, with no rigorous mathematical detail known to the author.


Transcript

  1. Convergence of Mean Shift Algorithm with radially symmetric kernels with sufficiently large bandwidth

    Susovan PAL, December 10, 2021
  2. Introduction to Mean Shift (MS) algorithm

    Input: discrete data generated by a random variable with unknown probability density function (PDF).
    Goal: to locate the maxima (global or local) of the probability density from the discrete data.
  3. Mean Shift: uses

    Mean Shift is used for clustering. Image segmentation requires clustering, so Mean Shift is useful here. [Figure: (a) the original image; (b), (c) segmented images.] Another use is multivalued regression, e.g. for covariates/features following a Gaussian mixture model.
  4. High-level description of the algorithm

    Start with a data point $y_1$ and fix a region of interest centered at $y_1$. Calculate the (weighted) sample mean in that region, with weights given by the kernel as a function of distance: far-away points should get lower weights, nearby ones higher weights. The sample mean should be close to where points are relatively dense, i.e. where the PDF is high. Iteratively shift the center of the region of interest to this sample mean, hence 'mean shift'. The idea is that these sample means should approach the modes of the PDF.
  5. Mean shift: visualization of the previous algorithm
  6. Mean Shift: setup

    Let $\{x_1, x_2, \dots, x_n\}$ be a set of $n$ independent data points in $\mathbb{R}^d$ drawn from some unknown distribution. A radially symmetric kernel is a function $K : \mathbb{R}^d \to [0, \infty)$ such that $K \in L^1(\mathbb{R}^d)$, $\int_{\mathbb{R}^d} K = 1$, and $K(x) = k(\|x\|^2)$ for some suitable non-increasing $k : [0, \infty) \to [0, \infty)$. Note that for the popular Gaussian kernel, $k(r) := C \exp(-r/2)$, where $C$ is a constant that makes the total integral equal to 1.
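As a worked check of the normalization in the Gaussian case (a standard Gaussian integral, not taken from the slides), the constant $C$ can be computed explicitly by factoring the integral over the $d$ coordinates:

```latex
\int_{\mathbb{R}^d} C\, e^{-\|x\|^2/2}\, dx
  = C \prod_{j=1}^{d} \int_{\mathbb{R}} e^{-x_j^2/2}\, dx_j
  = C\, (2\pi)^{d/2} = 1
  \quad\Longrightarrow\quad
  C = (2\pi)^{-d/2}.
```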
  7. Mean Shift: description of the algorithm I

    Since the goal is to find the modes of the KDE of the unknown PDF, we first define the KDE as:

    $$f_{h,k}(x) \equiv \hat{f}(x) := c_N \frac{1}{n} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right) \tag{1}$$

    Above, $h$ is referred to as the 'bandwidth', which plays the role of a radius but is not the radius itself. A small bandwidth means the weights decrease fast, so essentially only points within a small distance of the center of the region of interest contribute to the sample mean. A large bandwidth means farther points also contribute to the sample mean.
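A minimal sketch of equation (1) in plain Python may help make the role of the bandwidth concrete. The function names, the sample data, and the choice `c_n = 1.0` (an unnormalized estimate) are assumptions made for illustration, not part of the talk.

```python
import math

def gaussian_profile(r):
    """Gaussian profile k(r) = exp(-r/2), with the constant C left out."""
    return math.exp(-r / 2.0)

def kde(x, data, h, profile=gaussian_profile, c_n=1.0):
    """Evaluate c_N * (1/n) * sum_i k(||(x - x_i)/h||^2), as in equation (1)."""
    total = 0.0
    for xi in data:
        sq_dist = sum(((a - b) / h) ** 2 for a, b in zip(x, xi))
        total += profile(sq_dist)
    return c_n * total / len(data)

# Hypothetical data: two nearby points and one far away.
data = [(0.0, 0.0), (0.2, -0.1), (3.0, 3.0)]
near = kde((0.1, 0.0), data, h=0.5)       # query close to the dense pair
far = kde((1.5, 1.5), data, h=0.5)        # query in the empty middle
far_wide = kde((1.5, 1.5), data, h=5.0)   # same query, much larger bandwidth
print(near, far, far_wide)
```

With the small bandwidth only nearby points contribute, so the estimate in the empty region is tiny; with the large bandwidth all three points contribute there.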
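  8. Mean Shift: description of the algorithm II

    Assuming that $k$ is differentiable with derivative $k'$, taking the gradient of (1) w.r.t. $x$ yields (with $g = -k'$):

    $$\nabla f_{h,k}(x) = \frac{2c_N}{nh^2} \left[\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)\right] \left[\frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x\right] \tag{2}$$

    The second bracketed term is called the mean shift (MS) vector, $m_{h,g}(x)$, and hence (2) can be written in the form:

    $$\nabla f_{h,k}(x) = f_{h,g}(x)\, \frac{2c_N}{h^2 c'_N}\, m_{h,g}(x) \tag{3}$$

    where $f_{h,g}(x) := c'_N \frac{1}{n} \sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)$ is the density estimate built from the profile $g$ with its own constant $c'_N$.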
  9. Mean Shift: description of the algorithm III

    The modes of the estimated density function $f_{h,k}$ are located at the zeros of the gradient, i.e. where $\nabla f_{h,k} = 0$. Equating (2) to zero reveals that the modes of the estimated pdf are fixed points of the following function:

    $$m_{h,g}(x) + x = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} \tag{4}$$

    The MS algorithm initializes the mode estimate sequence at one of the observed data points. The mode estimate $y_j$ in the $j$-th iteration is updated as:

    $$y_{j+1} = m_{h,g}(y_j) + y_j = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)} \tag{5}$$
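The update (5) can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the Gaussian profile (so $g(r) \propto e^{-r/2}$, with constants cancelling in the ratio), and the function names, tolerance, iteration cap and two-point data set are all hypothetical choices.

```python
import math

def gaussian_g(r):
    """g = -k' for the Gaussian profile; the constant cancels in the ratio (5)."""
    return math.exp(-r / 2.0)

def mean_shift_step(y, data, h, g=gaussian_g):
    """One update y_{j+1} = sum_i x_i g(.) / sum_i g(.), as in equation (5)."""
    weights = [g(sum(((a - b) / h) ** 2 for a, b in zip(y, xi))) for xi in data]
    total = sum(weights)
    return tuple(
        sum(w * xi[d] for w, xi in zip(weights, data)) / total
        for d in range(len(y))
    )

def mean_shift(start, data, h, tol=1e-10, max_iter=10_000):
    """Iterate (5) until consecutive mode estimates differ by less than tol."""
    y = tuple(start)
    for _ in range(max_iter):
        y_next = mean_shift_step(y, data, h)
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_next, y)))
        y = y_next
        if shift < tol:
            break
    return y

# Two data points; start the iteration at the first one.
data = [(-1.0, 0.0), (1.0, 0.0)]
wide = mean_shift(data[0], data, h=3.0)    # large bandwidth
narrow = mean_shift(data[0], data, h=0.3)  # small bandwidth
print(wide, narrow)
```

For this symmetric two-point data set the large bandwidth drives the iterate to the midpoint, while the small bandwidth leaves it essentially at the starting data point.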
  10. Convergence of Mean Shift: brief history

    The MS algorithm iterates until the norm of the difference between two consecutive mode estimates becomes less than some predefined threshold, i.e. until $\|m_{h,g}(y_j)\|$ is small. A proof of convergence of $\{y_j\}$ is unknown for general kernels. A proof in one dimension for general kernels is known (Ghassabeh). A proof in arbitrary dimension for the Gaussian kernel is known only for sufficiently large bandwidth (Ghassabeh).
  11. Convergence of Mean Shift: strategy of proof 1

    Some facts already known about the proof of convergence:

    Theorem 1. There exists $h_0 > 0$ such that, for each $h \ge h_0$, the Hessian of $f_{h,k}$, denoted by $\mathrm{Hess}(f_{h,k})$, is nonsingular at all the stationary points of $f_{h,k}$, i.e. where $\nabla f_{h,k}$ vanishes. In particular, $h_0 := \max_{1 \le i \le n} \|x_i\|$.

    Theorem 2. If the Hessian matrix of any $C^2$ function at its stationary points is nonsingular, the stationary points are isolated. (Proof: inverse function theorem applied to $\nabla f_{h,k}$.)

    Theorem 3. Let $x_i \in \mathbb{R}^d$, $i = 1, 2, \dots, n$. Assume that the stationary points of the estimated pdf are isolated. Then the mode estimate sequence $\{y_j\}$ in (5) converges.
  12. Convergence of Mean Shift: strategy of proof 2

    Ghassabeh proved Theorem 1 for Gaussian kernels: $\exists h_0 > 0$ such that $\forall h \ge h_0$, $\mathrm{Hess}(f_{h,k})$ is nonsingular for Gaussian kernels with $k(r) = Ce^{-r/2}$. This made the subsequent Theorem 2 and Theorem 3 go through. This means we just need to generalize Theorem 1 in order to ensure that $\{y_j\}$ converges. This is precisely what we do.
  13. Convergence of Mean Shift: new theorem

    Below is a slightly weaker version of a generalization of Theorem 1:

    Theorem 4 (P.). Assume that $K(\cdot) = k(\|\cdot\|^2)$ is strictly positive definite, with the first two derivatives of $k$ finite at 0. Then $\exists h_0 > 0$ such that $\forall h \ge h_0$, $\mathrm{Hess}(f_{h,k})$ is nonsingular; therefore, by Theorem 2 and Theorem 3, the mode estimate sequence $\{y_j = y_j(h)\}$ corresponding to $K_{h,k}$ converges $\forall h \ge h_0$.

    Remark 1. Note that this theorem is slightly weaker (as of now) in the sense that it does not explicitly find $h_0$, unlike Ghassabeh's result, where $h_0 := \max_{1 \le i \le n} \|x_i\|$.
  14. Proof of convergence of mean shift for general kernels

    Our proof will require an alternate characterization of positive definite kernel matrices, so we state some definitions below:

    Definition 5 (Completely monotone functions). A function $k : [0, \infty) \to \mathbb{R}$ is called completely monotone if: (1) $k \in C^0([0, \infty)) \cap C^\infty((0, \infty))$; (2) $(-1)^l k^{(l)} \ge 0$ on $(0, \infty)$ for every $l = 0, 1, 2, \dots$

    Examples: $k(r) = r^s$ with $s \le 0$; $e^{-sr}$ with $s \ge 0$; $\ln(1 + \frac{1}{r})$; $e^{1/r}$. If $f, g$ are completely monotone and $c, d > 0$, then $cf + dg$ and $fg$ are also completely monotone!
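As a worked check of condition (2) for two of the listed examples (a routine computation, not taken from the slides), differentiating $l$ times on $(0, \infty)$ gives:

```latex
k(r) = e^{-sr},\ s \ge 0:
\qquad k^{(l)}(r) = (-s)^l e^{-sr}
\ \Longrightarrow\ (-1)^l k^{(l)}(r) = s^l e^{-sr} \ge 0,
\\[6pt]
k(r) = r^{s},\ s \le 0:
\qquad k^{(l)}(r) = \Big(\prod_{m=0}^{l-1} (s - m)\Big) r^{s-l}
\ \Longrightarrow\ (-1)^l k^{(l)}(r) = \Big(\prod_{m=0}^{l-1} (m - s)\Big) r^{s-l} \ge 0 .
```

In the second case every factor $m - s$ is non-negative because $s \le 0$, which is what produces the alternating sign.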
  15. Alternative characterizations of completely monotone functions I

    Next, we state two alternate characterizations of completely monotone functions, one on each slide.

    Theorem 6 (Hausdorff-Bernstein-Widder theorem: Laplace transform characterization of completely monotone functions). A function $k : [0, \infty) \to \mathbb{R}$ is completely monotone if and only if it is the Laplace transform of a finite non-negative Borel measure $\mu$ on $[0, \infty)$, i.e. $k$ is of the form:

    $$k(r) = \mathcal{L}\mu(r) = \int_0^\infty e^{-rt}\, d\mu(t)$$
  16. Alternative characterizations of completely monotone functions II

    First recall that a function $K$ is called non-negative definite if the corresponding kernel matrix $\mathbf{K} := [K(x_i - x_j)]_{1 \le i,j \le n}$ is non-negative definite (positive semidefinite). This amounts to the fact that $\mathbf{K} := [k(\|x_i - x_j\|^2)]_{1 \le i,j \le n}$ is non-negative definite.

    Theorem 7 (Non-negative definite kernel matrices are constructed from completely monotone functions). A function $k$ is completely monotone on $[0, \infty)$ if and only if $K(x) := k(\|x\|^2)$ is non-negative definite and radial on $\mathbb{R}^d$ $\forall d \in \mathbb{N}$. That is, the matrix $\mathbf{K}$ with $K_{ij} := K(x_i - x_j) = k(\|x_i - x_j\|^2)$ is non-negative definite.
  17. Connection to our problem

    Thanks to the previous two theorems, one can write any non-negative definite radially symmetric kernel matrix $\mathbf{K} = (K_{ij})$ as

    $$K_{ij} = \int_0^\infty e^{-t\|x_i - x_j\|^2}\, d\mu(t)$$

    for some finite Borel measure $\mu$. This should be useful! Something more is true:

    Theorem 8. A non-constant function $k$ is completely monotone on $[0, \infty)$ if and only if $\Phi(x) := k(\|x\|^2)$ is strictly positive definite and radial on $\mathbb{R}^d$ $\forall d \in \mathbb{N}$.
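Strict positive definiteness of a kernel matrix can be probed numerically with a Cholesky factorization, which succeeds with strictly positive pivots exactly when a symmetric matrix is strictly positive definite (in exact arithmetic). The sketch below is an illustration under assumptions: the helper names and sample points are hypothetical, and the profile $k(r) = e^{-r}$ is used as the non-constant completely monotone example.

```python
import math

def kernel_matrix(points, profile):
    """Build K_ij = k(||x_i - x_j||^2) for the given profile k."""
    return [
        [profile(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]

def is_strictly_positive_definite(matrix):
    """Try a Cholesky factorization; fail as soon as a pivot is <= 0."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

def gaussian(r):
    return math.exp(-r)   # non-constant, completely monotone

def constant(r):
    return 1.0            # constant profile, excluded by Theorem 8

distinct = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
repeated = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]

pd_distinct = is_strictly_positive_definite(kernel_matrix(distinct, gaussian))
pd_repeated = is_strictly_positive_definite(kernel_matrix(repeated, gaussian))
pd_constant = is_strictly_positive_definite(kernel_matrix(distinct, constant))
print(pd_distinct, pd_repeated, pd_constant)
```

Strict positive definiteness is a statement about pairwise distinct points, which is why the data set with a repeated point yields a singular matrix even for the Gaussian profile.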
  18. Connection to our problem

    And finally, something more:

    Theorem 9. A non-constant function $k : [0, \infty) \to \mathbb{R}$ is completely monotone if and only if it is the Laplace transform of a finite non-negative Borel measure $\mu$ on $[0, \infty)$ not of the form $c\delta_0$, $c > 0$, i.e. $k$ is of the form:

    $$k(r) = \mathcal{L}\mu(r) = \int_0^\infty e^{-rt}\, d\mu(t) \tag{6}$$

    where $\mu$ is not of the form $c\delta_0$, $c > 0$. Note that for Gaussian kernels, $k(r) = e^{-r}$ and $\mu := \delta_1$.
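The representation (6) also gives the derivatives of $k$ as moments of $\mu$. The following short derivation is a sketch that assumes differentiation under the integral sign is justified for $r > 0$ and that the first two moments of $\mu$ are finite, so that the limits $r \to 0^+$ exist:

```latex
k^{(l)}(r) = \frac{d^l}{dr^l} \int_0^\infty e^{-rt}\, d\mu(t)
           = (-1)^l \int_0^\infty t^l e^{-rt}\, d\mu(t), \qquad r > 0,
\\[6pt]
k'(0) = -\int_0^\infty t\, d\mu(t), \qquad
k''(0) = \int_0^\infty t^2\, d\mu(t).
```

For the Gaussian case $\mu = \delta_1$ this gives $k(r) = e^{-r}$, $k'(0) = -1$ and $k''(0) = 1$. Since $\mu$ is not of the form $c\delta_0$, the first moment is strictly positive, so $k'(0) \neq 0$.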
  19. Details of the proof of Theorem 4 (slides 12/13) - Part I

    Start by computing the gradient and Hessian of the KDE; we abbreviate the notation $f_{h,k}$ by $\hat{f}$. Recall:

    $$\hat{f}(x) = c_N \frac{1}{n} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right)$$

    Writing $k$ in the Laplace transform form (6) and taking the gradient w.r.t. $x$ (the factor $1/n$ is absorbed into $c_N$ from here on):

    $$\nabla \hat{f}(x) = c_N \sum_{i=1}^{n} \frac{x_i - x}{h^2} \int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)$$
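  20. Details of the proof of Theorem 4 (slides 12/13) - Part II

    $$\mathrm{Hess}\, \hat{f}(x) = c_N \sum_{i=1}^{n} \int_0^\infty \frac{2t}{h^2} \left[-I_{D \times D} + \frac{2t}{h^2}(x - x_i)(x - x_i)^T\right] \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)$$

    Define the following two functions, respectively. $C : \mathbb{R}^d \to \mathbb{R}$ by:

    $$C(x) := \int_0^\infty 2t \sum_{i=1}^{n} \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)$$

    and $A : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ by:

    $$A(x) := \sum_{i=1}^{n} \int_0^\infty 4t^2 (x - x_i)(x - x_i)^T \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)$$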
  21. Details of the proof of Theorem 4 (slides 12/13) - Part III

    For motivation, note that the right (later) part of the Hessian $H(x)$ is nothing but $c_N h^{-2} A(x) h^{-2}$, and the left (earlier) part is nothing but $-c_N h^{-2} C(x) I_{D \times D}$, i.e.:

    $$H(x) = -c_N \frac{C(x)}{h^2} I_{D \times D} + c_N \frac{A(x)}{h^4}$$

    Next we argue by contradiction.
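The decomposition of $H(x)$ into the $C(x)$ and $A(x)$ parts can be sanity-checked numerically. The sketch below is an illustration under assumptions: it takes $\mu = \delta_1$ (so $k(r) = e^{-r}$ and every integral reduces to $t = 1$), sets $c_N = 1$, works in two dimensions, and compares the formula with a central finite-difference Hessian. The data, evaluation point and step size are hypothetical.

```python
import math

DATA = [(0.0, 0.0), (1.0, 0.5), (-0.5, 2.0)]

def f(x, h):
    """Unnormalized KDE with k(r) = exp(-r), i.e. mu = delta_1 and c_N = 1."""
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / h ** 2)
        for xi in DATA
    )

def hessian_formula(x, h):
    """H(x) = -C(x)/h^2 * I + A(x)/h^4 with C and A evaluated at t = 1."""
    d = len(x)
    c_val = 0.0
    a_mat = [[0.0] * d for _ in range(d)]
    for xi in DATA:
        diff = [a - b for a, b in zip(x, xi)]
        e = math.exp(-sum(v * v for v in diff) / h ** 2)
        c_val += 2.0 * e
        for p in range(d):
            for q in range(d):
                a_mat[p][q] += 4.0 * diff[p] * diff[q] * e
    return [
        [a_mat[p][q] / h ** 4 - (c_val / h ** 2 if p == q else 0.0)
         for q in range(d)]
        for p in range(d)
    ]

def hessian_numeric(x, h, step=1e-3):
    """Central finite-difference approximation of the Hessian of f."""
    d = len(x)

    def shifted(p, sp, q, sq):
        y = list(x)
        y[p] += sp * step
        y[q] += sq * step
        return f(y, h)

    return [
        [(shifted(p, 1, q, 1) - shifted(p, 1, q, -1)
          - shifted(p, -1, q, 1) + shifted(p, -1, q, -1)) / (4 * step ** 2)
         for q in range(d)]
        for p in range(d)
    ]

x0 = (0.3, 0.7)
exact = hessian_formula(x0, 2.0)
approx = hessian_numeric(x0, 2.0)
max_gap = max(abs(exact[p][q] - approx[p][q]) for p in range(2) for q in range(2))

# For a large bandwidth the -C/h^2 term dominates A/h^4.
big = hessian_formula(x0, 10.0)
det_big = big[0][0] * big[1][1] - big[0][1] * big[1][0]
print(max_gap, det_big)
```

In this example the large-bandwidth Hessian is dominated by the $-C(x)/h^2$ term, which is the mechanism the contradiction argument relies on.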
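  22. Details of the proof of Theorem 4 (slides 12/13) - Part IV

    If possible, assume that $H$ is singular (not full rank). Then $H$ has a zero eigenvalue, i.e. there is a nonzero eigenvector $v \in \mathbb{R}^{D \times 1}$ such that $H(x)v = 0 \in \mathbb{R}^{D \times 1}$. Use the expression for $H$ above:

    $$H(x) = -c_N \frac{C(x)}{h^2} I_{D \times D} + c_N \frac{A(x)}{h^4}$$

    This yields:

    $$\left[-c_N \frac{C(x)}{h^2} I_{D \times D} + c_N \frac{A(x)}{h^4}\right] v = 0 \implies A(x)v = h^2 C(x)v$$

    $$\implies \sum_{i=1}^{n} (x - x_i)(x - x_i)^T \left[\int_0^\infty 4t^2 \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)\right] v = h^2 \sum_{i=1}^{n} \left[\int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)\right] v$$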
  23. Details of the proof of Theorem 4 (slides 12/13) - Part IV (continued)

    Let us abbreviate notation and write the equation $A(x)v = h^2 C(x)v$ as:

    $$\sum_{i=1}^{n} (x - x_i)(x - x_i)^T J_i(h)\, v = h^2 \sum_{i=1}^{n} I_i(h)\, v$$

    where:

    $$I_i(h) := \int_0^\infty 2t \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t), \qquad J_i(h) := \int_0^\infty 4t^2 \exp\left(-t\left\|\frac{x - x_i}{h}\right\|^2\right) d\mu(t)$$
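To see how $I_i(h)$ and $J_i(h)$ behave as the bandwidth grows, here is a small sketch for the Gaussian case. It assumes $\mu = \delta_1$, so both integrals reduce to their integrands at $t = 1$; the function name and the sample points are hypothetical.

```python
import math

def i_and_j(x, xi, h):
    """I_i(h) and J_i(h) for mu = delta_1: integrands evaluated at t = 1."""
    e = math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / h ** 2)
    return 2.0 * e, 4.0 * e

x, xi = (0.3, 0.7), (1.0, 0.5)
values = {h: i_and_j(x, xi, h) for h in (1.0, 10.0, 100.0, 1000.0)}
for h, (i_val, j_val) in values.items():
    print(h, i_val, j_val)
```

With the factor $2t$ in $I_i$ and $4t^2$ in $J_i$, the values increase towards $2 = -2k'(0)$ and $4 = 4k''(0)$ for $k(r) = e^{-r}$ as $h$ grows.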
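  24. Details of the proof of Theorem 4 (slides 12/13) - Part V

    Note that as $h \to \infty$, each $I_i(h) \to 2\int_0^\infty t\, d\mu(t) = -2k'(0)$ and $J_i(h) \to 4\int_0^\infty t^2\, d\mu(t) = 4k''(0)$, so the equation $\sum_{i=1}^{n} (x - x_i)(x - x_i)^T J_i(h)\, v = h^2 \sum_{i=1}^{n} I_i(h)\, v$ becomes:

    $$4 \sum_{i=1}^{n} (x - x_i)(x - x_i)^T k''(0)\, v = -2h^2 \sum_{i=1}^{n} k'(0)\, v$$

    This gives a contradiction as $h \to \infty$: the first two derivatives of $k$ at 0 being finite means that $k'(0)$ and $k''(0)$ are finite, so the left-hand side stays bounded while the right-hand side grows like $h^2$ (note that $k'(0) \neq 0$ for a non-constant completely monotone $k$). So there exists a sufficiently large $h_0 > 0$ such that $\forall h \ge h_0$ the above equation cannot hold, meaning that the Hessian $H(x)$ has no zero eigenvalue.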
  25. Practical usefulness of the result

    Choosing a large bandwidth $h$ amounts to considering too many points when calculating the sample mean, so the mode estimate sequence may fail to converge to a local mode of the density, and clusters may not be found correctly. On the other hand, choosing $h$ too small will not help locate the local sample mean (the local centroid of the density) correctly either. Hence the result may not be useful in practice.
  26. Possible future work

    Finding a necessary and sufficient criterion for convergence. It has been shown that $y_{j+1} - y_j \to 0$, but this does not mean that $\{y_j\}$ is a Cauchy sequence: e.g. $u_j := \sin(\sqrt{j})$ is a bounded, non-Cauchy, non-convergent sequence with $u_{j+1} - u_j \to 0$. We can still leverage the property. Convergence of mean shift on manifolds should be a mathematically interesting theory to look into, as KDE on manifolds with nice properties is an active area of research.
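The counterexample $u_j = \sin(\sqrt{j})$ is easy to inspect numerically. The sketch below uses a hypothetical index window: consecutive differences are small there, yet the values still sweep across almost the whole interval $[-1, 1]$.

```python
import math

def u(j):
    """The sequence u_j = sin(sqrt(j)) from the slide."""
    return math.sin(math.sqrt(j))

start = 10_000
window = range(start, 4 * start)   # sqrt(j) runs from 100 to 200

largest_step = max(abs(u(j + 1) - u(j)) for j in window)
highest = max(u(j) for j in window)
lowest = min(u(j) for j in window)
print(largest_step, highest, lowest)
```

The step bound follows from $|\sin a - \sin b| \le |a - b|$ and $\sqrt{j+1} - \sqrt{j} \le 1/(2\sqrt{j})$, so a stopping rule based only on small consecutive differences would be fooled by a sequence like this one.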
  27. Thanks

    THANK YOU!!!