Mouna Berrada's presentation of Hartigan and Wong’s 1979 K-Means Clustering Algorithm paper in JRSS C.

Xi'an
January 21, 2014

Transcript

  1. 1/42
    Introduction
    The K-means algorithm
    Discussion about the algorithm
    Conclusion
    Algorithm AS 136:
    A K-Means Clustering Algorithm
    J. A. Hartigan and M. A. Wong
    Reading Seminar in Statistical Classics
    Presented by Mouna Berrada
    Under the direction of Christian P. Robert
    January 20, 2014
    J.A Hartigan and M.A Wong Algorithm AS 136 : A K-Means Clustering Algorithm

  2. 2/42
    Presentation of the article
    Definition of clustering
    Presentation of the article
    Algorithm AS 136: A K-Means Clustering Algorithm
    Developed by J. A. Hartigan and M. A. Wong
    Published in 1979 in the Journal of the Royal Statistical Society,
    Series C (Applied Statistics), Vol. 28, No. 1, pp. 100-108

  3. 3/42
    The clustering
    Clustering is the task of dividing data points into
    homogeneous classes, or clusters, so that items in the same
    cluster are as similar as possible and items in different
    clusters are as dissimilar as possible.
    In short: given a collection of objects, put the objects into
    groups based on similarity.

  4. 4/42
    Possible applications of clustering
    Marketing: help marketers discover distinct groups in their
    customer bases, then use this knowledge to develop
    targeted marketing programs
    Insurance: identify groups of motor insurance policy holders
    with a high average claim cost
    Seismology: observed earthquake epicenters should be
    clustered along continental faults
    → Clustering is also called unsupervised learning; statisticians
    call it classification, psychologists sorting, and marketers
    segmentation.

  5. 5/42
    Table of contents
    1 Introduction
    Presentation of the article
    Definition of clustering
    2 The K-means algorithm
    The K-means function in R
    Description of the algorithm
    Interactive demonstration
    3 Discussion about the algorithm
    Comparison with related algorithms
    Strengths of the present algorithm
    Weaknesses and solutions
    4 Conclusion

  6. 6/42
    The K-means function in R
    Description of the algorithm
    Interactive demonstration
    Presentation of the K-means function
    kmeans() is part of the stats package, which ships with base R.
    It offers a choice of three algorithms; the default is the one
    published in Applied Statistics by Hartigan and Wong (1979).
    The original AS 136 code was written in FORTRAN.

  7. 7/42
    The aim of the K-means function
    The aim of the K-means algorithm is to divide M points
    in N dimensions into K clusters so that the within-cluster
    sum of squares is minimized.
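To make the objective concrete, here is a minimal Python sketch of the within-cluster sum of squares (WSS) for a given partition. The function name `within_cluster_ss` and the toy data are illustrative assumptions and do not come from AS 136.

```python
def within_cluster_ss(points, labels, k):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    n_dims = len(points[0])
    total = 0.0
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        if not members:
            continue  # an empty cluster contributes nothing
        centre = [sum(p[d] for p in members) / len(members) for d in range(n_dims)]
        total += sum((p[d] - centre[d]) ** 2 for p in members for d in range(n_dims))
    return total


# Toy data (hypothetical): M = 4 points in N = 2 dimensions, K = 2 clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_ss(points, [0, 0, 1, 1], 2)  # nearby points grouped together
bad = within_cluster_ss(points, [0, 1, 0, 1], 2)   # distant points grouped together
print(good, bad)
```

The partition that groups nearby points gives the smaller WSS, which is the quantity the algorithm tries to minimize.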

  8. 8/42
    Case study on K-means algorithm using R

  9. 9/42
    Case study on K-means algorithm using R
    We specify that we want to divide the Iris data set into 3 clusters:

  10. 10/42
    Case study on K-means algorithm using R
    We plot the clusters and their centers.

  11. 11/42
    Case study on K-means algorithm using R
    A cross-tabulation of type (species) against cluster membership shows how the clusters line up with the species.
    We can quantify the agreement between type and cluster using the
    Adjusted Rand Index (ARI), provided by the flexclust package.
    Figure: The ARI measures the agreement between two partitions, corrected for chance. It equals 1
    for perfect agreement and is close to 0 for a random labelling; it can be negative. Agreement
    between the class label (species) and the cluster solution is 0.73.
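As an illustration of how such an agreement index can be computed, the following Python sketch implements the standard pair-counting formula for the Adjusted Rand Index. It is not the flexclust implementation; the function name and the toy labels are assumptions, and it expects at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))  # contingency table cells
    rows = Counter(labels_a)                  # row totals
    cols = Counter(labels_b)                  # column totals
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # value expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both partitions
        return 1.0
    return (index - expected) / (maximum - expected)


# The index ignores how clusters are named: relabelled partitions agree fully.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The second toy example gives a negative value, which shows that the index is not confined to the interval from 0 to 1.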

  12. 12/42
    Main steps of the K-means algorithm
    Figure : Schema describing the iterations

  13. 13/42
    Initialization
    Figure : Schema describing the iterations

  14. 14/42
    Initialization
    For each point I (I = 1, 2, ..., M), find its closest cluster center
    IC1(I) and its second-closest cluster center IC2(I)
    Assign point I to cluster IC1(I)
    Update the cluster centers to be the averages of the points
    contained within them
    Initially, all clusters belong to the live set
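The initialization step can be sketched in Python as follows. This is an illustrative sketch, not the FORTRAN routine: the function names are hypothetical, indices are 0-based, at least two clusters are assumed, and keeping the old center of an empty cluster is an assumption made here for simplicity.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def initial_assignment(points, centres):
    """Return (ic1, ic2, new_centres, live_set) for the initialization step.

    ic1[i] and ic2[i] are the indices of the closest and second-closest
    centers of point i; new_centres are the means of the assigned points.
    """
    ic1, ic2 = [], []
    for x in points:
        order = sorted(range(len(centres)),
                       key=lambda l: squared_distance(x, centres[l]))
        ic1.append(order[0])
        ic2.append(order[1])
    new_centres = []
    for l, old in enumerate(centres):
        members = [x for x, lab in zip(points, ic1) if lab == l]
        if members:
            new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centres.append(tuple(old))  # assumption: keep an empty cluster's center
    live_set = set(range(len(centres)))  # every cluster starts in the live set
    return ic1, ic2, new_centres, live_set


# Hypothetical toy data: four points on a line, two starting centers.
pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
ic1, ic2, centres, live = initial_assignment(pts, [(0.0, 0.0), (10.0, 0.0)])
print(ic1, ic2, centres, live)
```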

  15. 15/42
    The optimal-transfer stage (OPTRA)
    Figure : Schema describing the iterations

  16. 16/42
    The optimal-transfer stage (OPTRA)
    → Let point I be in the active cluster L1
    Step 1
    Compute the minimum, over all clusters L ≠ L1, of the quantity
    $R = \frac{NC(L) \cdot D(I,L)^2}{NC(L) + 1}$
    NC(L) is the number of points in cluster L
    D(I,L) is the Euclidean distance between point I and the centre of cluster L

  17. 17/42
    The optimal-transfer stage (OPTRA)
    → Let L2 be the cluster with the smallest R.
    Step 2
    If
    $R2 = \frac{NC(L2) \cdot D(I,L2)^2}{NC(L2) + 1} \geq R1 = \frac{NC(L1) \cdot D(I,L1)^2}{NC(L1) - 1}$
    no reallocation is necessary: L2 becomes the new IC2(I).
    Step 3
    Otherwise, that is if
    $R2 = \frac{NC(L2) \cdot D(I,L2)^2}{NC(L2) + 1} < R1 = \frac{NC(L1) \cdot D(I,L1)^2}{NC(L1) - 1}$
    point I is reallocated to cluster L2: IC1(I) = L2 and IC2(I) = L1.
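The optimal-transfer test for a single point can be sketched in Python as below. The function name, the 0-based cluster indices and the guard for single-point clusters are assumptions made for this illustration; the bookkeeping of the live set is left out.

```python
def optimal_transfer(x, l1, centres, counts):
    """Decide whether point x, currently in cluster l1, should move.

    Returns (l2, move): l2 is the best alternative cluster and move says
    whether transferring x from l1 to l2 would reduce the WSS.
    """
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    # Step 1: minimise NC(L) * D(I,L)^2 / (NC(L) + 1) over all L != L1.
    candidates = [l for l in range(len(centres)) if l != l1]
    r2, l2 = min((counts[l] * sqdist(centres[l]) / (counts[l] + 1), l)
                 for l in candidates)

    # Assumption: a cluster holding a single point never gives it away,
    # which keeps every cluster non-empty and avoids dividing by zero.
    if counts[l1] <= 1:
        return l2, False

    # Steps 2 and 3: compare with R1 = NC(L1) * D(I,L1)^2 / (NC(L1) - 1).
    r1 = counts[l1] * sqdist(centres[l1]) / (counts[l1] - 1)
    return l2, r2 < r1


# Hypothetical toy setting: two clusters of three points each.
toy_centres = [(0.0, 0.0), (10.0, 0.0)]
toy_counts = [3, 3]
far_point = optimal_transfer((9.0, 0.0), 0, toy_centres, toy_counts)
near_point = optimal_transfer((1.0, 0.0), 0, toy_centres, toy_counts)
print(far_point, near_point)
```

A point of cluster 0 lying next to the centre of cluster 1 is flagged for transfer, while a point close to its own centre stays.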

  18. 18/42
    Empty Live Set
    Figure : Schema describing the iterations

  19. 19/42
    The quick-transfer stage (QTRAN)
    Figure : Schema describing the iterations

  20. 20/42
    The quick-transfer stage (QTRAN)
    → Let L1 = IC1(I) and L2 = IC2(I), for every point I (I = 1, 2, ..., M):
    Step 1
    If
    $R1 = \frac{NC(L1) \cdot D(I,L1)^2}{NC(L1) - 1} < R2 = \frac{NC(L2) \cdot D(I,L2)^2}{NC(L2) + 1}$
    no relocation takes place:
    point I remains in cluster L1.
    Step 2
    Otherwise, that is if
    $R1 = \frac{NC(L1) \cdot D(I,L1)^2}{NC(L1) - 1} \geq R2 = \frac{NC(L2) \cdot D(I,L2)^2}{NC(L2) + 1}$
    Switch IC1(I) and IC2(I)
    Update the centres of clusters L1 and L2
    The two clusters are noted for their involvement in a transfer at this step
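A quick-transfer step for one point might look like the Python sketch below. The function name and the in-place updates are assumptions for illustration; the centres are updated incrementally from the old means rather than recomputed, and the record of which clusters took part in a transfer is reduced to the returned flag.

```python
def quick_transfer(i, points, ic1, ic2, centres, counts):
    """Swap IC1(I) and IC2(I) for point i if that reduces the WSS.

    Mutates ic1, ic2, centres and counts in place; returns True on a transfer.
    """
    x = points[i]
    l1, l2 = ic1[i], ic2[i]
    n1, n2 = counts[l1], counts[l2]
    if n1 <= 1:
        return False  # assumption: never empty a cluster

    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    r1 = n1 * sqdist(centres[l1]) / (n1 - 1)
    r2 = n2 * sqdist(centres[l2]) / (n2 + 1)
    if r1 < r2:
        return False  # Step 1: point i remains in cluster L1

    # Step 2: move the point and update both centres from the old means.
    centres[l1] = tuple((n1 * c - a) / (n1 - 1) for c, a in zip(centres[l1], x))
    centres[l2] = tuple((n2 * c + a) / (n2 + 1) for c, a in zip(centres[l2], x))
    counts[l1], counts[l2] = n1 - 1, n2 + 1
    ic1[i], ic2[i] = l2, l1
    return True
```

For example, with points at 0, 1 and 9 in one cluster and 10 and 11 in the other, calling the function on the point at 9 moves it to the second cluster and shifts both centres accordingly.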

  21. 21/42
    Main steps of the K-means algorithm
    Figure : Schema describing the iterations
    → The iterations continue until convergence, that is, until no transfer
    of a point from one cluster to another reduces the within-cluster sum of squares (WSS).

  22. 22/42
    Interactive demonstration

  23. 23/42
    Initialization

  24. 24/42
    Iteration 1

  25. 25/42
    Iteration 2

  26. 26/42
    Iteration 3

  27. 27/42
    Iteration 4

  28. 28/42
    Iteration 5

  29. 29/42
    Iteration 6

  30. 30/42
    Last iteration

  31. 31/42
    Comparison with related algorithms
    Strengths of the present algorithm
    Weaknesses and solutions
    AS 113: A Transfer Algorithm for Non-Hierarchical Classification
    by Banfield and Bassill in 1977
    The user is allowed to choose the criterion function to be
    minimized.
    This algorithm uses swaps as well as transfers to try to
    overcome the problem of local optima.
    A significant amount of work is off-loaded onto the user,
    and the algorithm is more expensive than AS 136 for
    large M.

  32. 32/42
    AS 58: Euclidean Cluster Analysis
    by Sparks in 1973
    Like AS 136, it aims at finding a K-partition of the sample whose
    WSS cannot be reduced by moving points from one
    cluster to another.
    However, at the stage where each point is examined in turn to
    see if it should be reassigned to a different cluster, only the
    closest centre is used to check for possible reallocation of the
    given point.
    Algorithm AS 58 does not provide a locally optimal solution.

  33. 33/42
    Relatively efficient
    Figure: Algorithms AS 58 and AS 136 are tested and compared
    on various generated data sets
    ⇒ AS 136 achieves somewhat higher accuracy and requires less
    computation time than AS 58, especially when K is large.

  34. 34/42
    Relatively efficient
    Time ≈ C × M × N × K × I
    C: a constant that depends on the speed of the computer
    (for example, C = 2.1 × 10^-5 seconds for an IBM 370/158)
    M: Number of points
    N: Number of dimensions
    K: Number of clusters
    I: Number of iterations (usually less than 10)
    ⇒ The present algorithm converges quickly,
    although it may reach a local optimum rather than the global one.
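As a worked example of the running-time formula, the short Python sketch below plugs in the constant quoted for the IBM 370/158. The workload (1000 points, 10 dimensions, 10 clusters, 10 iterations) and the function name are hypothetical choices made for illustration.

```python
def estimated_seconds(c, m, n, k, iterations):
    """Approximate running time C * M * N * K * I, in seconds."""
    return c * m * n * k * iterations


# Hypothetical workload with the constant quoted for an IBM 370/158.
estimate = estimated_seconds(2.1e-5, 1000, 10, 10, 10)
print(estimate)  # roughly 21 seconds
```

Because the estimate is linear in each factor, doubling the number of points or the number of clusters doubles the expected time.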

  35. 35/42
    Other properties
    K-means clustering can handle large data sets
    It is based on a very simple idea and does not require much
    computation time
    The cost function is strictly decreasing
    The resulting partition has no empty clusters
    The resulting partition has distinct means

  36. 36/42
    Sensitive to the initial selection of centroids
    Recommended solution
    Run the algorithm several times, each with a different set of initial
    cluster centers.
    The kmeans() function has an nstart option that attempts multiple
    initial configurations and reports on the best one.
    Solution proposed in the article
    The points are first ordered by their distances to the overall mean
    of the sample. For cluster L (L = 1, 2, ..., K), the [1 + (L-1)*(M/K)]th
    point is chosen to be its initial cluster centre.
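The initialization rule proposed in the article can be sketched in Python as follows. The function name is hypothetical, and truncating M/K to an integer is an assumption of this sketch; the rule is 1-based, so the Python index is shifted by one.

```python
def initial_centres(points, k):
    """Pick K starting centres from points ordered by distance to the overall mean."""
    m = len(points)
    mean = [sum(col) / m for col in zip(*points)]
    ordered = sorted(points,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p, mean)))
    step = m // k  # assumption: M/K is truncated to an integer
    # The [1 + (L-1)*(M/K)]th point has Python index (L - 1) * step.
    return [ordered[(l - 1) * step] for l in range(1, k + 1)]


# Hypothetical one-dimensional sample of M = 6 points, K = 3 clusters.
sample = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
starts = initial_centres(sample, 3)
print(starts)
```

The rule is deterministic, so repeated runs on the same data start from the same centres, unlike the random restarts of the nstart option.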

  37. 37/42
    Sensitive to noisy data and outliers
    Book’s solution: the K-means algorithm with weights
    The algorithm is based on the mean, a descriptive statistic that is not
    robust to outliers. It also gives the same level of relevance to all the
    features in a dataset.
    The general idea is to include feature weighting in the K-means
    clustering criterion, in order to give a lower weight to the variables
    affected by high noise.

  38. 38/42
    Sensitive to noisy data and outliers
    Changes to include weights
    New parameters:
    Input, double WH(M), the weight of each point
    Output, double WHC(K), the weight of each cluster
    Center of cluster L:
    $C(L) = \sum_{I \in L} \frac{WH(I)}{WHC(L)} X_I$
    Quantities compared when a point changes cluster:
    $R1 = \frac{WHC(L1) \cdot D(I,L1)^2}{WHC(L1) - WH(I)}$
    $R2 = \frac{WHC(L2) \cdot D(I,L2)^2}{WHC(L2) + WH(I)}$
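The weighted quantities can be illustrated with the Python sketch below, which reads the centre formula as a sum over the points of cluster L. The function names are hypothetical, and the code assumes that the weight of cluster L1 exceeds the weight of the point being moved.

```python
def weighted_centre(members, weights):
    """Centre of a cluster as the weighted mean sum(WH(I) / WHC(L) * X_I)."""
    whc = sum(weights)  # WHC(L): total weight of the cluster
    return tuple(sum(w * x[d] for x, w in zip(members, weights)) / whc
                 for d in range(len(members[0])))


def weighted_criteria(x, wh, centre1, whc1, centre2, whc2):
    """Return (R1, R2) for moving point x of weight wh from cluster 1 to cluster 2."""
    d1 = sum((a - b) ** 2 for a, b in zip(x, centre1))
    d2 = sum((a - b) ** 2 for a, b in zip(x, centre2))
    r1 = whc1 * d1 / (whc1 - wh)  # requires whc1 > wh
    r2 = whc2 * d2 / (whc2 + wh)
    return r1, r2


# A heavier point pulls the centre towards itself (hypothetical values).
centre = weighted_centre([(0.0,), (4.0,)], [1.0, 3.0])
# With unit weights the criteria reduce to the unweighted R1 and R2.
unit = weighted_criteria((9.0, 0.0), 1.0, (0.0, 0.0), 3.0, (10.0, 0.0), 3.0)
print(centre, unit)
```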

  39. 39/42
    The number of clusters K needs to be specified in advance
    Recommended solution
    Perform multiple trials to find the best number of clusters.
    Other solution: rule of thumb
    The rule of thumb sets the number of clusters to:
    $K \approx \sqrt{N / 2}$
    N: the number of data points (denoted M elsewhere in these slides)
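A quick numerical check of the rule of thumb, written as a Python sketch with a hypothetical function name, using the 150 observations of the Iris data set from the case study:

```python
from math import sqrt


def rule_of_thumb_k(n_points):
    """Rule-of-thumb number of clusters, the square root of N / 2."""
    return sqrt(n_points / 2)


# The Iris data set has 150 observations.
k_iris = round(rule_of_thumb_k(150))
print(k_iris)
```

For Iris the rule suggests about nine clusters, whereas the case study used three (one per species), so the rule is best treated as a rough starting point for the multiple trials recommended above.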

  40. 40/42
    Conclusion
    K-means is one of the most widely used clustering algorithms, thanks to its
    inherent simplicity, speed, and empirical success.
    However, in its basic form, it has limitations such as
    sensitivity to the initial partition, sensitivity to noise, and the
    requirement of a predefined number of clusters.
    Thus, the algorithm needs to be improved in order to remain
    as popular.

  41. 41/42
    References
    J. A. Hartigan and M. A. Wong (1979)
    Algorithm AS 136: A K-Means Clustering Algorithm
    Journal of the Royal Statistical Society, Series C (Applied Statistics).
    J. A. Hartigan (1975)
    Clustering Algorithms
    Wiley Series in Probability and Mathematical Statistics.
    Laurence Morissette and Sylvain Chartier
    "The k-means clustering technique: General considerations and
    implementation in Mathematica"
    R-statistics blog
    "K-means Clustering (from 'R in Action')"
    Tutorial with an introduction to clustering algorithms and an interactive
    demonstration
    http://home.deib.polimi.it/matteucc/Clustering/tutorial-html/AppletKM.html

  42. 42/42
    Thank you for your time and attention!
    Any questions?