
Mouna Berrada's presentation of Hartigan and Wong’s 1979 K-Means Clustering Algorithm paper in JRSS C.

Xi'an
January 21, 2014

Transcript

  1. Algorithm AS 136: A K-Means Clustering Algorithm, by J.A. Hartigan and M.A. Wong. Reading Seminar in Statistical Classics, presented by Mouna Berrada under the direction of Christian P. Robert, January 20, 2014.
  2. Presentation of the article. Algorithm AS 136: A K-Means Clustering Algorithm, developed by J.A. Hartigan and M.A. Wong, published in the Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1, pp. 100-108, 1979.
  3. Definition of clustering. Clustering is the task of dividing data points into homogeneous classes or clusters, so that items in the same cluster are as similar as possible and items in different clusters are as dissimilar as possible. In short: given a collection of objects, put them into groups based on similarity.
  4. Possible applications of clustering. Marketing: help marketers discover distinct groups in their customer bases and use this knowledge to develop targeted marketing programs. Insurance: identify groups of motor insurance policy holders with a high average claim cost. Seismology: observed earthquake epicenters should cluster along continental faults. Clustering is also called unsupervised learning or classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
  5. Table of contents. 1 Introduction: presentation of the article; definition of clustering. 2 The K-means algorithm: the K-means function in R; description of the algorithm; interactive demonstration. 3 Discussion about the algorithm: comparison with related algorithms; strengths of the present algorithm; weaknesses and solutions. 4 Conclusion.
  6. Presentation of the K-means function. kmeans() is part of R's core stats package. It offers a choice of three different algorithms; the default is the one developed by Hartigan and Wong, published in Applied Statistics in 1979 and originally implemented in Fortran.
  7. The aim of the K-means function. The aim of the K-means algorithm is to divide M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized; the objective is written out as a formula below.
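Written out explicitly (the notation x_i for the points, C_k for the k-th cluster and \bar{x}_k for its mean is mine, not from the slides), the criterion being minimized is

    WSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2

that is, the sum over the K clusters of the squared Euclidean distances from each point to the mean of its cluster.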
  8. Case study of the K-means algorithm in R.
  9. Case study of the K-means algorithm in R. We specify that we want to divide the iris data set into 3 clusters:
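The code shown on the original slide is not part of the transcript; a minimal sketch of such a call in R (the object name km and the use of the four numeric iris columns are my choices) would be:

    data(iris)                                # built-in iris data set, 150 points in 4 dimensions
    set.seed(1234)                            # fix the random initial centres for reproducibility
    km <- kmeans(iris[, 1:4], centers = 3)    # Hartigan-Wong is the default algorithm
    km$size                                   # number of points assigned to each cluster
    km$centers                                # the three cluster centres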
  10. Case study of the K-means algorithm in R. We plot the clusters and their centers.
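The plotting code is also not in the transcript; a sketch assuming the km object from the previous snippet, plotted on the first two variables, could be:

    # colour the points by fitted cluster and overlay the matching coordinates of the centres
    plot(iris$Sepal.Length, iris$Sepal.Width, col = km$cluster,
         xlab = "Sepal.Length", ylab = "Sepal.Width")
    points(km$centers[, c("Sepal.Length", "Sepal.Width")],
           col = 1:3, pch = 8, cex = 2)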
  11. Case study of the K-means algorithm in R. A cross-tabulation of type (species) and cluster membership is given by a contingency table. We can quantify the agreement between species and cluster using the adjusted Rand index (ARI) provided by the flexclust package. The ARI measures the agreement between two partitions and ranges from -1 (no agreement) to 1 (perfect agreement); here the agreement between the species labels and the cluster solution is 0.73.
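The table and the 0.73 figure quoted above come from the slides; a sketch of how they can be computed, assuming the km object from above and that the flexclust package is installed:

    table(iris$Species, km$cluster)             # cross-tabulation of species vs. cluster label

    library(flexclust)                          # provides randIndex()
    randIndex(table(iris$Species, km$cluster))  # adjusted Rand index between the two partitions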
  12. Main steps of the K-means algorithm. Figure: schema describing the iterations.
  13. Initialization. Figure: schema describing the iterations.
  14. Initialization. For each point I (I = 1, 2, ..., M), find its closest cluster centre IC1(I) and its second closest cluster centre IC2(I). Assign point I to cluster IC1(I). Update the cluster centres to be the averages of the points they contain. All clusters belong to the live set. A sketch of this step in R is given below.
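A minimal sketch of this assignment step in plain R (the function name and the vectorized implementation are mine, not the paper's Fortran code):

    # x: M x N data matrix; centers: K x N matrix of initial cluster centres
    init_assign <- function(x, centers) {
      K <- nrow(centers)
      # squared Euclidean distance from every point to every centre (M x K matrix)
      d2 <- sapply(1:K, function(k) colSums((t(x) - centers[k, ])^2))
      ic1 <- apply(d2, 1, which.min)                            # closest centre IC1(I)
      ic2 <- apply(d2, 1, function(r) order(r)[2])              # second closest centre IC2(I)
      upd <- apply(x, 2, function(col) tapply(col, ic1, mean))  # centres as within-cluster averages
      list(ic1 = ic1, ic2 = ic2, centers = upd)
    }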
  15. The optimal-transfer stage (OPTRA). Figure: schema describing the iterations.
  16. The optimal-transfer stage (OPTRA). Let point I be in the active cluster L1. Step 1: compute the minimum, over all clusters L other than L1, of the quantity R = NC(L) * D(I, L)^2 / (NC(L) + 1), where NC(L) is the number of points in cluster L and D(I, L) is the Euclidean distance between point I and the centre of cluster L.
  17. The optimal-transfer stage (OPTRA). Let L2 be the cluster with the smallest R. Step 2: if R2 = NC(L2) * D(I, L2)^2 / (NC(L2) + 1) ≥ R1 = NC(L1) * D(I, L1)^2 / (NC(L1) - 1), no reallocation is necessary and L2 becomes the new IC2(I). Step 3: otherwise, point I is reallocated to cluster L2, so that IC1(I) = L2 and IC2(I) = L1. This test is sketched in R below.
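A minimal sketch of this transfer test for a single point, in R (function and argument names are my own):

    # n1, n2: cluster sizes NC(L1), NC(L2); d1, d2: distances D(I, L1), D(I, L2)
    optra_transfer <- function(n1, d1, n2, d2) {
      r1 <- n1 * d1^2 / (n1 - 1)   # reduction in the WSS of L1 if point I is removed
      r2 <- n2 * d2^2 / (n2 + 1)   # increase in the WSS of L2 if point I is added
      r2 < r1                      # TRUE: reallocate I to L2; FALSE: keep I in L1
    }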
  18. Empty live set. Figure: schema describing the iterations.
  19. The quick-transfer stage (QTRAN). Figure: schema describing the iterations.
  20. The quick-transfer stage (QTRAN). For each point I (I = 1, 2, ..., M), let L1 = IC1(I) and L2 = IC2(I). Step 1: if R1 = NC(L1) * D(I, L1)^2 / (NC(L1) - 1) < R2 = NC(L2) * D(I, L2)^2 / (NC(L2) + 1), no relocation takes place and point I remains in cluster L1. Step 2: otherwise, switch IC1(I) and IC2(I), update the centres of clusters L1 and L2, and note the two clusters as involved in a transfer at this step.
  21. Main steps of the K-means algorithm. Figure: schema describing the iterations. The iterations continue until convergence, that is, until no movement of a point from one cluster to another reduces the within-cluster sum of squares.
  22. Interactive demonstration.
  23. Interactive demonstration: initialization.
  24. Interactive demonstration: iteration 1.
  25. Interactive demonstration: iteration 2.
  26. Interactive demonstration: iteration 3.
  27. Interactive demonstration: iteration 4.
  28. Interactive demonstration: iteration 5.
  29. Interactive demonstration: iteration 6.
  30. Interactive demonstration: last iteration.
  31. Comparison with related algorithms. AS 113: A Transfer Algorithm for Non-Hierarchical Classification, by Banfield and Bassil (1977). The user is allowed to choose the criterion function to be minimized, and the algorithm uses swaps as well as transfers to try to overcome the problem of local optima. However, a significant amount of work is off-loaded onto the user, and the algorithm is more expensive than AS 136 for large M.
  32. Comparison with related algorithms. AS 58: Euclidean Cluster Analysis, by Sparks (1973). Like AS 136, it aims at finding a K-partition of the sample whose within-cluster sum of squares cannot be reduced by moving points from one cluster to another. However, at the stage where each point is examined in turn to see whether it should be reassigned to a different cluster, only the closest centre is used to check for a possible reallocation, so algorithm AS 58 does not provide a locally optimal solution.
  33. Strengths of the present algorithm: relatively efficient. Figure: the algorithms AS 58 and AS 136 are tested and compared on various generated data sets. AS 136 produces noticeably higher accuracy and requires less computation time than AS 58, especially when K is large.
  34. Strengths of the present algorithm: relatively efficient. Time = C * M * N * K * I, where C is the speed of the computer (for example, 2.1 x 10^-5 seconds for an IBM 370/158), M the number of points, N the number of dimensions, K the number of clusters, and I the number of iterations (usually fewer than 10). The algorithm therefore converges quickly, even though it finds a local optimum instead of a global one. A worked example follows.
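As a worked example of the timing formula (the values M = 150, N = 4, K = 3 and 10 iterations are illustrative, matching the iris case study, not figures from the paper):

    # estimated running time on an IBM 370/158 with Time = C * M * N * K * I
    C0 <- 2.1e-5                              # computer speed C, in seconds
    M <- 150; N <- 4; K <- 3; iters <- 10
    C0 * M * N * K * iters                    # about 0.38 seconds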
  35. Strengths of the present algorithm: other properties. K-means clustering can handle large data sets. It is based on a very simple idea and does not require much computation time. The cost function is strictly decreasing. The resulting partition has no empty clusters, and its cluster means are distinct.
  36. Weaknesses and solutions: sensitivity to the initial selection of centroids. Recommended solution: run the algorithm several times, each time with a different set of initial cluster centres; the kmeans() function has an nstart option that attempts multiple initial configurations and reports the best one. Solution proposed in the article: the points are first ordered by their distances to the overall mean of the sample, and for cluster L (L = 1, 2, ..., K) the [1 + (L-1)*(M/K)]-th point is chosen as its initial cluster centre. Both options are sketched below.
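A minimal sketch of both options in R, reusing the iris case study (the deterministic seeding code is my own reading of the rule quoted above, not the paper's Fortran):

    # Option 1: 25 random starts; kmeans() keeps the configuration with the smallest total WSS
    km_best <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

    # Option 2: the article's deterministic seeding rule
    x <- as.matrix(iris[, 1:4])
    M <- nrow(x); K <- 3
    d_to_mean <- sqrt(rowSums((x - matrix(colMeans(x), M, ncol(x), byrow = TRUE))^2))
    ord <- order(d_to_mean)                            # points ordered by distance to the overall mean
    seeds <- x[ord[1 + (0:(K - 1)) * floor(M / K)], ]  # the [1 + (L-1)*(M/K)]-th ordered points
    km_seeded <- kmeans(iris[, 1:4], centers = seeds)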
  37. Weaknesses and solutions: sensitivity to noisy data and outliers. Book's solution: a K-means algorithm including weights. The algorithm is based on the mean, a descriptive statistic that is not robust to outliers, and it gives the same level of relevance to all the features in a data set. The general idea is to include feature weighting in the K-means clustering criterion, in order to give a lower weight to the variables affected by high noise.
  38. Weaknesses and solutions: changes to include weights. New parameters: input, double WH(M), the weight of each point; output, double WHC(K), the weight of each cluster (the sum of the weights of the points it contains). The centre of cluster L becomes the weighted mean C(L) = sum over points I in cluster L of (WH(I) / WHC(L)) * X(I). When a point is considered for a change of cluster, the quantities compared are R1 = WHC(L1) * D(I, L1)^2 / (WHC(L1) - WH(I)) and R2 = WHC(L2) * D(I, L2)^2 / (WHC(L2) + WH(I)). These updates are sketched below.
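A minimal sketch of the weighted centre and of the weighted comparison, in R (function names are my own):

    # weighted centre of one cluster: C(L) = sum_I (WH(I) / WHC(L)) * X(I)
    weighted_center <- function(x, wh) {      # x: points of one cluster, wh: their weights
      colSums(x * wh) / sum(wh)               # sum(wh) plays the role of WHC(L)
    }

    # weighted transfer test: reallocate point I from L1 to L2 when R2 < R1
    weighted_transfer <- function(whc1, d1, whc2, d2, wh_i) {
      r1 <- whc1 * d1^2 / (whc1 - wh_i)
      r2 <- whc2 * d2^2 / (whc2 + wh_i)
      r2 < r1
    }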
  39. Weaknesses and solutions: the number of clusters K needs to be specified in advance. Recommended solution: perform multiple trials to find the best number of clusters. Other solution, a rule of thumb: set the number of clusters to K ≈ sqrt(N/2), where N is the number of data points. A quick numerical check follows.
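As a quick check of the rule of thumb (N = 150 is illustrative, matching the iris case study):

    N <- 150
    sqrt(N / 2)                               # about 8.7, i.e. roughly 9 clusters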
  40. Conclusion. K-means is the most widely used clustering algorithm, owing to its inherent simplicity, speed, and empirical success. However, in its basic form it has limitations, such as sensitivity to the initial partition, sensitivity to noise, and the requirement of a predefined number of clusters. The algorithm therefore needs to be improved in order to remain as popular.
  41. References.
  J.A. Hartigan and M.A. Wong (1979), Algorithm AS 136: A K-Means Clustering Algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics).
  J.A. Hartigan (1975), Clustering Algorithms, Wiley Series in Probability and Mathematical Statistics.
  Laurence Morissette and Sylvain Chartier, The k-means clustering technique: General considerations and implementation in Mathematica.
  R-statistics blog, K-means Clustering (from "R in Action").
  Tutorial with an introduction to clustering algorithms and an interactive demonstration: http://home.deib.polimi.it/matteucc/Clustering/tutorial-html/AppletKM.html
  42. Thank you for your time and attention! Any questions?