Algorithm AS 136: A K-Means Clustering Algorithm
J.A. Hartigan and M.A. Wong
Reading Seminar in Statistical Classics
Presented by Mouna Berrada, under the direction of Christian P. Robert
January 20, 2014
Presentation of the article
Algorithm AS 136: A K-Means Clustering Algorithm
Developed by J.A. Hartigan and M.A. Wong
Published by the Journal of the Royal Statistical Society in 1979: Applied Statistics (Series C), Vol. 28, No. 1, pp. 100-108
The clustering
Clustering is the task of dividing data points into homogeneous classes or clusters, so that items in the same cluster are as similar as possible and items in different clusters are as dissimilar as possible. Given a collection of objects, clustering puts them into groups based on similarity.
Possible applications of clustering
Marketing: help marketers discover distinct groups in their customer base, and then use this knowledge to develop targeted marketing programs
Insurance: identify groups of motor insurance policy holders with a high average claim cost
Seismology: observed earthquake epicenters should cluster along continental faults
→ Clustering is also called unsupervised learning; statisticians speak of classification, psychologists of sorting, and people in marketing of segmentation.
Table of contents
1 Introduction: Presentation of the article; Definition of clustering
2 The K-means algorithm: The K-means function in R; Description of the algorithm; Interactive demonstration
3 Discussion about the algorithm: Comparison with related algorithms; Strengths of the present algorithm; Weaknesses and solutions
4 Conclusion
Presentation of the K-means function
kmeans() is a function in base R. It lets the user select among three different algorithms; the default one is the algorithm developed by Hartigan and Wong, published in Applied Statistics in 1979 and originally implemented in FORTRAN.
The aim of the K-means function
The aim of the K-means algorithm is to divide M points in N dimensions into K clusters so that the within-cluster sum of squares (WSS) is minimized.
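As an illustration, the criterion being minimized can be computed directly (a minimal Python sketch; the function name, variable names, and toy data are mine, not taken from the article or from the R implementation):

```python
def wss(points, labels, centers):
    """Within-cluster sum of squares: sum, over all points, of the squared
    Euclidean distance between the point and the centre of its cluster."""
    total = 0.0
    for p, k in zip(points, labels):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centers[k]))
    return total

# Toy example: 4 points in 2 dimensions, split into 2 clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.5), (5.0, 5.5)]
print(wss(points, labels, centers))  # 0.25 * 4 = 1.0
```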
Case study on K-means algorithm using R
Case study on K-means algorithm using R
We specify that we want to divide the Iris data set into 3 clusters.
Case study on K-means algorithm using R
We plot the clusters and their centers.
Case study on K-means algorithm using R
We cross-tabulate type (species) against cluster membership, and quantify the agreement between type and cluster using the Adjusted Rand Index (ARI) provided by the flexclust package.
Figure: The ARI measures the agreement between two partitions. It ranges from -1 (no agreement) to 1 (perfect agreement). Agreement between the class label (species) and the cluster solution is 0.73.
Main steps of the K-means algorithm
Figure: Schema describing the iterations
Initialization
Figure: Schema describing the iterations
Initialization
For each point I (I = 1, 2, ..., M), find its closest cluster centre IC1(I) and second-closest cluster centre IC2(I)
Assign point I to cluster IC1(I)
Update the cluster centres to be the averages of the points contained within them
Initially, all clusters belong to the live set
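The initialization steps above can be sketched as follows (a hedged Python illustration assuming Euclidean distance and clusters that all receive at least one point; the function and variable names are mine):

```python
def sq_dist(p, c):
    """Squared Euclidean distance between a point and a centre."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def initial_assignment(points, centers):
    """For each point, record its closest (IC1) and second-closest (IC2)
    cluster centre, then recompute each centre as the mean of the points
    assigned to it.  Assumes every cluster ends up non-empty."""
    ic1, ic2 = [], []
    for p in points:
        order = sorted(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        ic1.append(order[0])
        ic2.append(order[1])
    new_centers = []
    for k in range(len(centers)):
        members = [p for p, c in zip(points, ic1) if c == k]
        new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return ic1, ic2, new_centers

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
ic1, ic2, centers = initial_assignment(points, centers)
print(ic1)      # [0, 0, 1]
print(centers)  # [(0.5, 0.0), (10.0, 10.0)]
```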
The optimal-transfer stage (OPTRA)
Figure: Schema describing the iterations
The optimal-transfer stage (OPTRA)
→ Let point I be in cluster L1, a cluster in the live set.
Step 1: Compute the minimum over all clusters L (L ≠ L1) of the quantity
R = NC(L) * D(I, L)^2 / (NC(L) + 1)
where NC(L) is the number of points in cluster L and D(I, L) is the Euclidean distance between point I and the centre of cluster L.
The optimal-transfer stage (OPTRA)
→ Let L2 be the cluster attaining the smallest R.
Step 2: If
R2 = NC(L2) * D(I, L2)^2 / (NC(L2) + 1) ≥ R1 = NC(L1) * D(I, L1)^2 / (NC(L1) − 1),
no reallocation is necessary; L2 becomes the new IC2(I).
Step 3: Otherwise, point I is reallocated to cluster L2: IC1(I) = L2 and IC2(I) = L1.
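The transfer test at the heart of this stage can be sketched as follows (an illustrative Python fragment; the function name and the toy numbers are my own, not from the published FORTRAN code):

```python
def transfer_gain(nc_l1, d_i_l1, nc_l2, d_i_l2):
    """Hartigan-Wong transfer test: moving point I from cluster L1 to L2
    reduces the total WSS exactly when R2 < R1, where
      R1 = NC(L1) * D(I, L1)^2 / (NC(L1) - 1)   (WSS decrease from removing I)
      R2 = NC(L2) * D(I, L2)^2 / (NC(L2) + 1)   (WSS increase from adding I)."""
    r1 = nc_l1 * d_i_l1 ** 2 / (nc_l1 - 1)
    r2 = nc_l2 * d_i_l2 ** 2 / (nc_l2 + 1)
    return r2 < r1  # True means: reallocate I to L2

# A point far from its 5-point cluster (distance 3) and close to a
# 4-point cluster (distance 1) should be transferred:
print(transfer_gain(5, 3.0, 4, 1.0))  # True
```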
Empty Live Set
Figure: Schema describing the iterations
The quick-transfer stage (QTRAN)
Figure: Schema describing the iterations
The quick-transfer stage (QTRAN)
→ Let L1 = IC1(I) and L2 = IC2(I), for each point I (I = 1, 2, ..., M):
Step 1: If
R1 = NC(L1) * D(I, L1)^2 / (NC(L1) − 1) < R2 = NC(L2) * D(I, L2)^2 / (NC(L2) + 1),
no relocation takes place and point I remains in cluster L1.
Step 2: Otherwise:
Switch IC1(I) and IC2(I)
Update the centres of clusters L1 and L2
Note the two clusters as involved in a transfer at this step
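A single quick-transfer check can be sketched like this (an illustrative Python fragment that updates only the memberships and cluster sizes, omitting the centre updates for brevity; all names are mine, not from the FORTRAN code):

```python
def quick_transfer(nc, d, ic1, ic2, i):
    """One quick-transfer check for point i: only its current cluster
    L1 = ic1[i] and its second-closest cluster L2 = ic2[i] are considered.
    nc[k] is the size of cluster k; d[i][k] is the distance from point i
    to centre k."""
    l1, l2 = ic1[i], ic2[i]
    r1 = nc[l1] * d[i][l1] ** 2 / (nc[l1] - 1)
    r2 = nc[l2] * d[i][l2] ** 2 / (nc[l2] + 1)
    if r1 >= r2:                  # the transfer reduces the WSS
        ic1[i], ic2[i] = l2, l1   # switch IC1(i) and IC2(i)
        nc[l1] -= 1
        nc[l2] += 1
        return True               # clusters l1 and l2 flagged as changed
    return False

nc = [5, 4]
d = [[3.0, 1.0]]   # point 0: far from its cluster 0, close to cluster 1
ic1, ic2 = [0], [1]
print(quick_transfer(nc, d, ic1, ic2, 0))  # True
print(ic1, nc)                             # [1] [4, 5]
```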
Main steps of the K-means algorithm
Figure: Schema describing the iterations
→ The iterations continue until convergence, that is, until no movement of a point from one cluster to another can reduce the within-cluster sum of squares.
Interactive demonstration
Initialization
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Last iteration
AS 113: A Transfer Algorithm for Non-Hierarchical Classification, by Banfield and Bassill (1977)
The user is allowed to choose the criterion function to be minimized.
The algorithm uses swaps as well as transfers, so it tries to overcome the problem of local optima.
However, a significant amount of work is off-loaded onto the user, and the algorithm is more expensive than AS 136 for large M.
AS 58: Euclidean Cluster Analysis, by Sparks (1973)
Like AS 136, it aims at finding a K-partition of the sample whose WSS cannot be reduced by moving points from one cluster to another. However, at the stage where each point is examined in turn to see whether it should be reassigned to a different cluster, only the closest centre is used to check for possible reallocation of the given point. As a consequence, algorithm AS 58 does not provide a locally optimal solution.
Relatively efficient
Figure: Algorithms AS 58 and AS 136 tested and compared on various generated data sets
⇒ AS 136 produces noticeably higher accuracy and requires less computation time than AS 58, especially when K is large.
Relatively efficient
Time = C * M * N * K * I
C: speed of the computer (for example, C = 2.1 × 10^-5 seconds for an IBM 370/158)
M: number of points
N: number of dimensions
K: number of clusters
I: number of iterations (usually less than 10)
⇒ The present algorithm converges quickly, even if it finds a local optimum instead of a global one.
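Plugging illustrative numbers into this cost model gives an order-of-magnitude estimate (the data-set dimensions below are my own example, not figures from the paper):

```python
# Cost model from the slide: Time = C * M * N * K * I
C = 2.1e-5            # seconds, speed quoted for an IBM 370/158
M, N, K, I = 1000, 4, 3, 10   # illustrative problem size
time_s = C * M * N * K * I
print(round(time_s, 2))  # 2.52 (seconds)
```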
Other properties
K-means clustering can handle large data sets
It is based on a very simple idea and does not require much computation time
The cost function is strictly decreasing
The resulting partition has no empty clusters
The resulting partition has distinct means
Sensitive to the initial selection of centroids
Recommended solution: run the algorithm several times, each with a different set of initial cluster centres. The kmeans() function has an nstart option that attempts multiple initial configurations and reports the best one.
Solution proposed in the article: the points are first ordered by their distances to the overall mean of the sample; for cluster L (L = 1, 2, ..., K), the [1 + (L−1) * M/K]-th point is chosen to be its initial cluster centre.
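The article's seeding rule can be sketched as follows (a hedged Python illustration using 0-based indices; the function name and toy data are mine):

```python
def as136_seeds(points, k):
    """Initial centres as suggested in the article: sort the points by
    their distance to the overall sample mean and, for cluster L
    (L = 1, ..., K), take the (1 + (L-1)*M/K)-th point, expressed here
    as the 0-based index (L-1)*M // K."""
    m = len(points)
    mean = tuple(sum(x) / m for x in zip(*points))
    ordered = sorted(points,
                     key=lambda p: sum((pi - mi) ** 2 for pi, mi in zip(p, mean)))
    return [ordered[(l - 1) * m // k] for l in range(1, k + 1)]

points = [(0.0,), (1.0,), (2.0,), (3.0,), (8.0,), (9.0,)]
print(as136_seeds(points, 3))  # [(3.0,), (1.0,), (8.0,)]
```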
Sensitive to noisy data and outliers
Book's solution: the K-means algorithm including weights. The algorithm is based on the mean, a descriptive statistic that is not robust to outliers, and it gives the same level of relevance to all the features in a dataset. The general idea is to include feature weighting in the K-means clustering criterion, in order to give a lower weight to the variables affected by high noise.
Sensitive to noisy data and outliers
Changes to include weights. New parameters:
Input: double WH(M), the weight of each point
Output: double WHC(K), the weight of each cluster
Centre of cluster L: C(L) = Σ_{I ∈ L} WH(I) X_I / WHC(L)
Quantities compared when a point changes cluster:
R1 = WHC(L1) * D(I, L1)^2 / (WHC(L1) − WH(I))
R2 = WHC(L2) * D(I, L2)^2 / (WHC(L2) + WH(I))
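The weighted centre update can be sketched as follows (an illustrative Python fragment of the weighted variant described above; the function name and toy data are mine):

```python
def weighted_center(points, weights, members):
    """Weighted centre of one cluster:
        C(L) = sum_{i in L} WH(i) * X_i / WHC(L),
    where WHC(L) is the total weight of the cluster's points."""
    whc = sum(weights[i] for i in members)
    dim = len(points[0])
    return tuple(sum(weights[i] * points[i][d] for i in members) / whc
                 for d in range(dim))

# A heavy point (weight 3) pulls the centre towards itself:
points = [(0.0, 0.0), (4.0, 0.0)]
weights = [3.0, 1.0]
print(weighted_center(points, weights, [0, 1]))  # (1.0, 0.0)
```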
The number of clusters K needs to be specified in advance
Recommended solution: perform multiple trials to find the best number of clusters.
Other solution, a rule of thumb: set the number of clusters to K ≈ √(N/2), where N is the number of data points.
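For example, applying the rule of thumb to a sample of 150 points (the size of the Iris data set used earlier, chosen here purely as an illustration):

```python
import math

N = 150                       # number of data points
k = round(math.sqrt(N / 2))   # rule of thumb: K ≈ sqrt(N/2)
print(k)  # 9
```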
Conclusion
K-means is the most widely used clustering algorithm, owing to its inherent simplicity, speed, and empirical success. However, in its basic form it has limitations, such as sensitivity to the initial partition, sensitivity to noise, and the requirement of a predefined number of clusters. Thus, the algorithm needs to be improved in order to remain as popular.
References
J.A. Hartigan and M.A. Wong (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1), 100-108.
J.A. Hartigan (1975). Clustering Algorithms. Wiley Series in Probability and Mathematical Statistics.
Laurence Morissette and Sylvain Chartier. The k-means clustering technique: General considerations and implementation in Mathematica.
R-statistics blog: K-means clustering (from "R in Action").
Tutorial with an introduction to clustering algorithms and an interactive demonstration: http://home.deib.polimi.it/matteucc/Clustering/tutorial-html/AppletKM.html