
Susovan Pal (EPITA)
S³ Seminar, December 10, 2021
https://s3-seminar.github.io/seminars/susovan-pal/

Title — A sufficient condition for convergence of Mean Shift algorithm in any dimension, with radially symmetric, strictly positive definite kernels

Abstract — The mean shift (MS) is a non-parametric, density-based, iterative algorithm used to find the modes of an estimated probability density function (pdf). Although the MS algorithm has been widely used in many applications, such as clustering, image segmentation, and object tracking, a rigorous proof of its convergence in a fully general setting is still missing. Two significant steps in this direction were taken by Ghassabeh, who proved convergence for Gaussian kernels in any dimension, and, in another paper, convergence in one dimension for kernels with differentiable, strictly decreasing, convex profiles. As of now, we are not aware of any proof of convergence of the MS algorithm for fully general kernels. This talk aims to give a sufficient condition for convergence in any dimension, for any strictly positive definite, smooth kernel. Some open questions for further research will also be raised, for which no rigorous mathematical details are known to the author.


Transcript

  1. Convergence of Mean Shift Algorithm with
    radially symmetric kernels with sufficiently large
    bandwidth
    Susovan PAL
    December 10, 2021

  2. Introduction to the Mean Shift (MS) algorithm
    Input: discrete data generated by a random variable with
    unknown probability density function (PDF)
    Goal: locate the local maxima (modes) of the probability
    density from the discrete data

  3. Mean Shift: uses
    Mean Shift is used for clustering
    Image segmentation requires clustering, so Mean Shift is
    useful here; in the slide's figure, (a) is the original image
    and (b), (c) are segmented versions
    Multivalued regression, e.g. for covariates/features following
    a Gaussian mixture model

  4. High-level description of the algorithm
    Start with a data point y_1 and fix a region of interest
    centered at y_1
    Calculate the (weighted) sample mean in that region, weighted
    by the kernel depending on the distance: far-away points get
    lower weights, nearby ones higher weights
    The sample mean should be close to where points are relatively
    dense, i.e. where the PDF is high
    Iteratively shift the center of the region of interest to this
    sample mean, hence "mean shift" (see the sketch below)
    The idea is that these sample means should approach the modes
    of the PDF
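A minimal sketch of one such shift in Python, assuming a Gaussian profile g(r) = exp(−r/2); the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def mean_shift_step(y, X, h):
    """One mean shift update: the kernel-weighted sample mean of the
    data X around the current estimate y (Gaussian profile)."""
    r = np.sum(((y - X) / h) ** 2, axis=1)  # squared scaled distances
    w = np.exp(-r / 2)                      # far points get small weights
    return (w[:, None] * X).sum(axis=0) / w.sum()
```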

  5. Mean shift: visualization of the previous algorithm

  6. Mean Shift: setup
    Let {x_1, x_2, ..., x_n} be a set of n independent data points
    in R^d drawn from some unknown distribution.
    A radially symmetric kernel is a function K : R^d → [0, ∞)
    such that:
    K ∈ L^1(R^d), with ∫_{R^d} K = 1
    K(x) = k(||x||^2) for some suitable non-increasing
    k : [0, ∞) → [0, ∞)
    Note that for the popular Gaussian kernel, k(r) := C exp(−r/2),
    where C is a constant making the total integral 1.

  7. Mean Shift: description of the algorithm I
    Since the goal is to find the modes of the KDE of the unknown
    PDF, we first define the KDE as:

    f_{h,k}(x) ≡ f(x) := (c_N / n) Σ_{i=1}^n k(||(x − x_i)/h||^2)   (1)

    Above, h is referred to as the "bandwidth"; it serves the
    purpose of a radius, but isn't the radius itself.
    A small bandwidth means the weights decrease fast, so
    essentially only points within a small distance of the center
    of the region of interest contribute to the sample mean.
    A large bandwidth means farther points also contribute to the
    sample mean.
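As a concrete reading of (1), here is a small sketch; the Gaussian profile and the choice c_N = 1 are placeholder assumptions:

```python
import numpy as np

def kde(x, X, h, k=lambda r: np.exp(-r / 2), c_N=1.0):
    """Kernel density estimate f_{h,k}(x) from (1), up to normalization."""
    r = np.sum(((x - X) / h) ** 2, axis=1)  # ||(x - x_i)/h||^2 for each i
    return c_N * np.mean(k(r))
```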

  8. Mean Shift: description of the algorithm II
    Assuming that k is differentiable with derivative k', taking
    the gradient of (1) w.r.t. x yields (with g := −k'):

    ∇f_{h,k}(x) = (2c_N / (n h^2)) [ Σ_{i=1}^n g(||(x − x_i)/h||^2) ]
        × [ ( Σ_{i=1}^n x_i g(||(x − x_i)/h||^2) ) / ( Σ_{i=1}^n g(||(x − x_i)/h||^2) ) − x ]   (2)

    The second term is called the mean shift (MS) vector m_{h,g}(x),
    and hence (2) can be written in the form:

    ∇f_{h,k}(x) = (2c_N / (h^2 c_N')) f_{h,g}(x) m_{h,g}(x)   (3)

    where f_{h,g} is the KDE built from the profile g, with
    normalizing constant c_N'.
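A quick finite-difference check of identity (2) in the Gaussian case; everything below (the random data, step sizes, names) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X, h = rng.normal(size=(50, 2)), 1.5
g = lambda r: 0.5 * np.exp(-r / 2)  # g = -k' for k(r) = exp(-r/2)

def ms_vector(x):
    """The mean shift vector m_{h,g}(x): weighted mean minus x."""
    w = g(np.sum(((x - X) / h) ** 2, axis=1))
    return (w[:, None] * X).sum(axis=0) / w.sum() - x

def grad_f(x, eps=1e-6):
    """Central-difference gradient of the KDE f_{h,k}."""
    f = lambda z: np.mean(np.exp(-np.sum(((z - X) / h) ** 2, axis=1) / 2))
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

x = np.array([0.3, -0.2])
gf, m = grad_f(x), ms_vector(x)
print(gf[0] * m[1] - gf[1] * m[0])  # ~0: grad f is parallel to m_{h,g}
print(gf / m)                       # equal, positive componentwise ratios
```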

  9. Mean Shift: description of the algorithm III
    The modes of the estimated density function f_{h,k} are located
    at the zeros of the gradient, i.e. where ∇f_{h,k} = 0. Equating
    (2) to zero reveals that the modes of the estimated pdf are
    fixed points of the following map:

    m_{h,g}(x) + x = Σ_{i=1}^n x_i g(||(x − x_i)/h||^2) / Σ_{i=1}^n g(||(x − x_i)/h||^2)   (4)

    The MS algorithm initializes the mode estimate sequence at one
    of the observed data points. The mode estimate y_j in the j-th
    iteration is updated as:

    y_{j+1} = m_{h,g}(y_j) + y_j = Σ_{i=1}^n x_i g(||(y_j − x_i)/h||^2) / Σ_{i=1}^n g(||(y_j − x_i)/h||^2)   (5)

  10. Convergence of Mean Shift: brief history
    The MS algorithm iterates until the norm of the difference
    between two consecutive mode estimates becomes less than some
    predefined threshold, i.e. until ||m_{h,g}(y_j)|| is small.
    A proof of convergence of {y_j} is unknown for general kernels.
    A proof in one dimension for general kernels is known
    (Ghassabeh).
    A proof in arbitrary dimension for the Gaussian kernel is
    known only for sufficiently large bandwidth (Ghassabeh).
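Combining update (5) with this stopping rule gives the full loop; a sketch, again with the Gaussian profile and illustrative names:

```python
import numpy as np

def mean_shift(y0, X, h, tol=1e-8, max_iter=10_000):
    """Iterate (5) until ||y_{j+1} - y_j|| < tol (mode estimate)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1) / 2)  # g-weights
        y_next = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) < tol:
            break
        y = y_next
    return y_next
```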

  11. Convergence of Mean Shift: strategy of proof 1
    Some facts already known about the proof of convergence:
    Theorem 1
    There exists h_0 > 0 so that, for each h ≥ h_0, the Hessian of
    f_{h,k}, denoted Hess(f_{h,k}), is nonsingular at all the
    stationary points of f_{h,k}, i.e. where ∇f_{h,k} vanishes. In
    particular, one can take h_0 := max_{1≤i≤n} ||x_i||.
    Theorem 2
    If the Hessian matrix of a C^2 function is nonsingular at its
    stationary points, then the stationary points are isolated.
    (Proof: inverse function theorem applied to ∇f_{h,k}.)
    Theorem 3
    Let x_i ∈ R^d, i = 1, 2, ..., n. Assume that the stationary
    points of the estimated pdf are isolated. Then the mode
    estimate sequence {y_j} in (5) converges.
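Theorems 1 and 2 can be probed numerically: at a mode y* returned by the loop above, the finite-difference Hessian of the KDE should be nonsingular. A sketch (the step size and names are illustrative):

```python
import numpy as np

def hessian_kde(x, X, h, eps=1e-4):
    """Central-difference Hessian of the Gaussian KDE at x."""
    f = lambda z: np.mean(np.exp(-np.sum(((z - X) / h) ** 2, axis=1) / 2))
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = eps * np.eye(d)[i], eps * np.eye(d)[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Theorem 1 predicts det(hessian_kde(y_star, X, h)) != 0 for h >= h_0.
```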

  12. Convergence of Mean Shift: strategy of proof 2
    Ghassabeh proved Theorem 1 for Gaussian kernels: ∃ h_0 > 0
    such that ∀ h ≥ h_0, Hess(f_{h,k}) is nonsingular for Gaussian
    kernels with k(r) = C e^{−r/2}. This made the subsequent
    Theorem 2 and Theorem 3 go through.
    This means we just need to generalize Theorem 1 in order to
    ensure that {y_j} converges. This is precisely what we do.

  13. Convergence of Mean Shift: new theorem
    Below is a slightly weaker version of the generalization of
    Theorem 1:
    Theorem 4
    (P.) Assume that K(·) = k(||·||^2) is strictly positive
    definite, with the first two derivatives of k finite at 0.
    Then ∃ h_0 > 0 so that ∀ h ≥ h_0, Hess(f_{h,k}) is
    nonsingular; therefore, by Theorem 2 and Theorem 3, the mode
    estimate sequence {y_i = y_i(h)} corresponding to K_{h,k}
    converges ∀ h ≥ h_0.
    Remark 1
    Note that this theorem is slightly weaker (as of now) in the
    sense that it doesn't explicitly find h_0, unlike Ghassabeh's
    result, where h_0 := max_{1≤i≤n} ||x_i||.

  14. Proof of convergence of mean shift for general kernels
    Our proof will require an alternative characterization of
    positive definite kernel matrices, so we state some
    definitions below:
    Definition 5
    (Completely monotone functions) A function k : [0, ∞) → R is
    called completely monotone if:
    (1) k ∈ C^0([0, ∞)) ∩ C^∞((0, ∞))
    (2) (−1)^l k^{(l)}(r) ≥ 0 for all r > 0 and l = 0, 1, 2, ...
    Examples: k(r) = r^s, s ≤ 0; e^{−sr}, s ≥ 0; ln(1 + 1/r);
    e^{1/r}. If f, g are completely monotone and c, d > 0, then
    cf + dg and fg are also completely monotone!
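A symbolic sanity check of Definition 5 for the example k(r) = e^{−sr}, s ≥ 0 (a sketch using SymPy):

```python
import sympy as sp

r, s = sp.symbols('r s', positive=True)
k = sp.exp(-s * r)  # one of the examples above

# Complete monotonicity: (-1)^l k^(l)(r) >= 0 for every order l
for l in range(5):
    print(sp.simplify((-1) ** l * sp.diff(k, r, l)))  # s**l*exp(-r*s) >= 0
```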

  15. Alternative characterizations of completely monotone
    functions I
    Next, we state two alternative characterizations of completely
    monotone functions, one in each slide.
    Theorem 6
    (Hausdorff-Bernstein-Widder theorem: Laplace transform
    characterization of completely monotone functions)
    A function k : [0, ∞) → R is completely monotone if and only
    if it is the Laplace transform of a finite non-negative Borel
    measure µ on [0, ∞), i.e. k is of the form:

    k(r) = Lµ(r) = ∫_0^∞ e^{−rt} dµ(t)

  16. Alternative characterizations of completely monotone
    functions II
    First recall that a function K is called non-negative definite
    if the corresponding kernel matrix K := [K(x_i − x_j)]_{1≤i,j≤n}
    is non-negative definite for every choice of points. This
    amounts to the matrix K := [k(||x_i − x_j||^2)]_{1≤i,j≤n}
    being non-negative definite.
    Theorem 7
    (Non-negative definite kernel matrices are constructed from
    completely monotone functions)
    A function k is completely monotone on [0, ∞) if and only if
    K(x) := k(||x||^2) is non-negative definite and radial on R^d
    ∀ d ∈ N, that is, the matrix K with K_{ij} := K(x_i − x_j) =
    k(||x_i − x_j||^2) is non-negative definite.
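A numeric illustration of Theorem 7: kernel matrices built from the completely monotone profile k(r) = e^{−r} come out positive (semi-)definite. A sketch with random points (illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
# K_ij = k(||x_i - x_j||^2) with k(r) = exp(-r)
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-D2)
print(np.linalg.eigvalsh(K).min())  # > 0: eigenvalues are all positive
```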

  17. Connection to our problem
    Thanks to the previous two theorems, one can write any
    non-negative definite radially symmetric kernel matrix
    K = (K_{ij}) as:

    K_{ij} = ∫_0^∞ e^{−t ||x_i − x_j||^2} dµ(t)

    for some finite Borel measure µ. This should be useful!
    Something more is true:
    Theorem 8
    A non-constant function k is completely monotone on [0, ∞) if
    and only if Φ(x) := k(||x||^2) is strictly positive definite
    and radial on R^d ∀ d ∈ N.

  18. Connection to our problem
    And finally, something more:
    Theorem 9
    A non-constant function k : [0, ∞) → R is completely monotone
    if and only if it is the Laplace transform of a finite
    non-negative Borel measure µ on [0, ∞) not of the form c·δ_0,
    c > 0, i.e. k is of the form:

    k(r) = Lµ(r) = ∫_0^∞ e^{−rt} dµ(t)   (6)

    where µ is not of the form c·δ_0, c > 0.
    Note that for the Gaussian kernel k(r) = e^{−r}, µ := δ_1.
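For instance, evaluating (6) at the point mass µ = δ_1 recovers the Gaussian profile quoted above:

```latex
k(r) = \mathcal{L}\delta_1(r) = \int_0^\infty e^{-rt}\, d\delta_1(t) = e^{-r}
```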

  19. Details of the proof of Theorem 4 (slide 13) - Part I
    Start by computing the gradient and Hessian of the KDE; we
    abbreviate f_{h,k} by f̂, and absorb the factor 1/n into c_N.
    Recall:

    f̂(x) = c_N Σ_{i=1}^n k(||(x − x_i)/h||^2)

    Taking the gradient w.r.t. x and using the representation (6):

    ∇f̂(x) = c_N Σ_{i=1}^n ((x_i − x)/h^2) ∫_0^∞ 2t exp(−t ||(x − x_i)/h||^2) dµ(t)

  20. Details of the proof of Theorem 4 (slide 13) - Part II

    Hess f̂(x) = c_N Σ_{i=1}^n ∫_0^∞ (2t/h^2) [ −I_{d×d} +
        (2t/h^2)(x − x_i)(x − x_i)^T ] exp(−t ||(x − x_i)/h||^2) dµ(t)

    Define the following two functions, respectively:
    C : R^d → R by:

    C(x) := ∫_0^∞ 2t Σ_{i=1}^n exp(−t ||(x − x_i)/h||^2) dµ(t)

    A : R^d → R^{d×d} by:

    A(x) := Σ_{i=1}^n ∫_0^∞ 4t^2 (x − x_i)(x − x_i)^T exp(−t ||(x − x_i)/h||^2) dµ(t)

  21. Details of the proof of Theorem 4 (slide 13) - Part III
    For motivation, note that the second part of the Hessian
    H(x) := Hess f̂(x) is nothing but c_N A(x)/h^4, and the first
    part is nothing but −c_N C(x)/h^2, i.e.:

    H(x) = −c_N (C(x)/h^2) I_{d×d} + c_N A(x)/h^4

    Next we argue by contradiction.

  22. Details of the proof of Theorem 4 (slide 13) - Part IV
    If possible, assume that H is singular (not of full rank).
    Then H has a zero eigenvalue, with an eigenvector v ∈ R^d such
    that H(x)v = 0.
    Using the above expression for H:

    H(x) = −c_N (C(x)/h^2) I_{d×d} + c_N A(x)/h^4

    this yields:

    [ −c_N (C(x)/h^2) I_{d×d} + c_N A(x)/h^4 ] v = 0
    ⟹ A(x)v = h^2 C(x)v
    ⟹ [ Σ_{i=1}^n (x − x_i)(x − x_i)^T ∫_0^∞ 4t^2 exp(−t ||(x − x_i)/h||^2) dµ(t) ] v
        = h^2 [ Σ_{i=1}^n ∫_0^∞ 2t exp(−t ||(x − x_i)/h||^2) dµ(t) ] v

  23. Details of the proof of Theorem 4 (slide 13) - Part V
    Let us abbreviate notation and write the above as:

    Σ_{i=1}^n (x − x_i)(x − x_i)^T J_i(h) v = h^2 Σ_{i=1}^n I_i(h) v

    where:

    I_i(h) := ∫_0^∞ 2t exp(−t ||(x − x_i)/h||^2) dµ(t)
    J_i(h) := ∫_0^∞ 4t^2 exp(−t ||(x − x_i)/h||^2) dµ(t)

  24. Details of the proof of Theorem 4 (slide 13) - Part VI
    Note that as h → ∞, each I_i(h) → ∫_0^∞ 2t dµ(t) = −2k'(0)
    and each J_i(h) → ∫_0^∞ 4t^2 dµ(t) = 4k''(0), so the above
    equation tends to:

    4k''(0) Σ_{i=1}^n (x − x_i)(x − x_i)^T v = −2n h^2 k'(0) v

    This gives a contradiction as h → ∞: the first two derivatives
    of K at 0 being finite implies that k'(0), k''(0) are finite,
    so the left-hand side stays bounded, while the right-hand side
    grows like h^2 (note k'(0) < 0, since µ is not supported at 0
    alone).
    So there exists a sufficiently large h_0 > 0 such that
    ∀ h ≥ h_0 the above equation cannot hold, meaning that the
    Hessian H(x) has no zero eigenvalue.
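A tiny numeric illustration of this limit in the Gaussian case µ = δ_1 (so k(r) = e^{−r} and −2k'(0) = 2; the values below are illustrative):

```python
import numpy as np

# I_i(h) = 2*exp(-||x - x_i||^2 / h^2) when mu = delta_1
dist = 3.0  # a fixed ||x - x_i||
for h in [1, 10, 100, 1000]:
    print(h, 2 * np.exp(-(dist / h) ** 2))  # tends to 2 = -2k'(0)
```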

  25. Practical usefulness of the result
    Choosing a large bandwidth h amounts to using too many points
    to calculate the sample mean, so the mode estimate sequence
    may fail to converge to a genuinely local concentration of
    density
    So clusters may not be found correctly
    On the other hand, choosing too small an h won't help us
    locate the local sample mean (the local centroid of density)
    correctly either
    Hence the result, although it guarantees convergence, may be
    of limited practical use

  26. Possible future work
    Finding a necessary and sufficient criterion for convergence.
    It has been shown that ||y_{j+1} − y_j|| → 0, but this doesn't
    mean {y_j} is a Cauchy sequence: e.g. u_j := sin(√j) is a
    bounded, non-Cauchy, non-convergent sequence with
    |u_{j+1} − u_j| → 0 (see the numeric check below). We can
    still try to leverage this property.
    Convergence of mean shift on manifolds should be a
    mathematically interesting theory to look into, as KDE on
    manifolds with nice properties is an active area of research.
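The counterexample u_j = sin(√j) is easy to see numerically (sketch):

```python
import numpy as np

j = np.arange(1, 2_000_001)
u = np.sin(np.sqrt(j))
print(np.abs(np.diff(u))[-3:])  # consecutive differences -> 0
print(u[::400_000])             # yet the sequence keeps oscillating
```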

  27. Thanks
    THANK YOU!!!