Xi'an
January 21, 2014

# Mouna Berrada's presentation of Hartigan and Wong's 1979 K-Means Clustering Algorithm paper in JRSS C.


## Transcript

1.
Introduction
The K-means algorithm
Conclusion
Algorithm AS 136: A K-Means Clustering Algorithm
J. A. Hartigan and M. A. Wong
Under the direction of Christian P. Robert
January 20, 2014
J.A Hartigan and M.A Wong Algorithm AS 136 : A K-Means Clustering Algorithm

2.
Presentation of the article
Definition of clustering
Presentation of the article
Algorithm AS 136 : A K-Means Clustering Algorithm
Developed by J. A. Hartigan and M. A. Wong
Applied Statistics (Series C), Vol. 28, No. 1, pp. 100-108

3.
The clustering
Clustering is the task of dividing data points into homogeneous classes or clusters, so that items in the same cluster are as similar as possible and items in different clusters are as dissimilar as possible.
In other words: given a collection of objects, put the objects into groups based on similarity.

4.
Possible applications of clustering
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
Seismology: Observed earthquake epicenters should be clustered along continental faults
→ Clustering is also called unsupervised learning or classification by statisticians, sorting by psychologists, and segmentation by people in marketing.

5.
1 Introduction
Presentation of the article
Definition of clustering
2 The K-means algorithm
The K-means function in R
Description of the algorithm
Interactive demonstration
3 Comparison with related algorithms
Strength of the present algorithm
Weaknesses and solutions
4 Conclusion

6.
The K-means function in R
Description of the algorithm
Interactive demonstration
Presentation of the K-means function
kmeans() is a function that belongs to base R.
It offers a choice of three different algorithms; the default algorithm, developed by Hartigan and Wong and published in Applied Statistics in 1979, was originally implemented in FORTRAN.

7.
The aim of the K-means function
The aim of the K-means algorithm is to divide M points
in N dimensions into K clusters so that the within-cluster
sum of squares is minimized.
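Concretely, the quantity being minimized can be written out. A minimal Python sketch, for illustration (the slides themselves use R):

```python
def wss(points, assign, centers):
    """Within-cluster sum of squares: total squared Euclidean
    distance from each point to the centre of its cluster."""
    total = 0.0
    for p, k in zip(points, assign):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centers[k]))
    return total

# Two 2-D clusters; the assignment matches the obvious grouping.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(wss(pts, [0, 0, 1, 1], centers))  # each point is 1 away -> 4.0
```

Swapping the assignment (points near one centre labelled with the other) makes the WSS much larger, which is exactly what the algorithm exploits.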

8.
Case study on K-means algorithm using R

9.
Case study on K-means algorithm using R
We specify that we want to divide the iris data set into 3 clusters:

10.
Case study on K-means algorithm using R
We plot the clusters and their centers.

11.
Case study on K-means algorithm using R
A cross-tabulation of type (species) and cluster membership is given by:
We can quantify the agreement between type and cluster using the Adjusted Rand Index provided by the flexclust package:
Figure : The ARI provides a measure of the agreement between two partitions. It ranges from -1 (no agreement) to 1 (perfect agreement). Agreement between the species labels and the cluster solution is 0.73.

12.
Main steps of the K-means algorithm
Figure : Schema describing the iterations

13.
Initialization
Figure : Schema describing the iterations

14.
Initialization
For each point I (I = 1,2,…,M), find its closest cluster centre IC1(I) and its second-closest cluster centre IC2(I)
Assign point I to cluster IC1(I)
Update the cluster centres to be the averages of the points contained within them
All clusters initially belong to the live set
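The initialization steps above can be sketched as follows (Python for illustration; the IC1/IC2 names follow the paper):

```python
def squared_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def initialize(points, centers):
    """For each point find the closest (IC1) and second-closest (IC2)
    cluster centre, then recompute each centre as the mean of the
    points assigned to it."""
    ic1, ic2 = [], []
    for p in points:
        order = sorted(range(len(centers)), key=lambda k: squared_dist(p, centers[k]))
        ic1.append(order[0])
        ic2.append(order[1])
    new_centers = []
    for k in range(len(centers)):
        members = [p for p, a in zip(points, ic1) if a == k]
        if members:  # average of the points contained in cluster k
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:        # keep the old centre if no point was assigned
            new_centers.append(centers[k])
    return ic1, ic2, new_centers

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
ic1, ic2, cen = initialize(pts, [(1.0, 1.0), (9.0, 1.0)])
print(ic1, cen)  # [0, 0, 1, 1] [(0.0, 1.0), (10.0, 1.0)]
```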

15.
The optimal-transfer stage (OPTRA)
Figure : Schema describing the iterations

16.
The optimal-transfer stage (OPTRA)
→ Let point I be in the active cluster L1
Step 1
Compute, over all clusters L (≠ L1), the minimum of the quantity
R = NC(L)·D(I,L)² / (NC(L) + 1)
where NC(L) is the number of points in cluster L and D(I,L) is the Euclidean distance between point I and the centre of cluster L.

17.
The optimal-transfer stage (OPTRA)
→ Let L2 be the cluster with the smallest R.
Step 2
If:
R2 = NC(L2)·D(I,L2)² / (NC(L2) + 1) ≥ R1 = NC(L1)·D(I,L1)² / (NC(L1) − 1)
no reallocation is necessary: L2 becomes the new IC2(I).
Step 3
Otherwise:
R2 = NC(L2)·D(I,L2)² / (NC(L2) + 1) < R1 = NC(L1)·D(I,L1)² / (NC(L1) − 1)
and point I is allocated to cluster L2: IC1(I) = L2, IC2(I) = L1.
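The comparison in Steps 2 and 3 reduces to a single inequality, sketched here in Python for illustration (NC and D follow the paper's notation):

```python
def should_transfer(n1, d1, n2, d2):
    """Decide whether point I should move from its current cluster L1
    (n1 = NC(L1), d1 = D(I, L1)) to candidate cluster L2 (n2 = NC(L2),
    d2 = D(I, L2)). Returns True when R2 < R1, i.e. when the move
    reduces the total within-cluster sum of squares."""
    r1 = n1 * d1 ** 2 / (n1 - 1)   # WSS decrease from removing I from L1
    r2 = n2 * d2 ** 2 / (n2 + 1)   # WSS increase from adding I to L2
    return r2 < r1

# A point far from its own centre but close to a neighbouring one moves:
print(should_transfer(5, 3.0, 5, 1.0))  # True
# A point already closest to its own centre stays:
print(should_transfer(5, 1.0, 5, 3.0))  # False
```

The NC(L1) − 1 and NC(L2) + 1 denominators account for the fact that moving point I changes the cluster sizes, and hence both centres.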

18.
Empty Live Set
Figure : Schema describing the iterations

19.
The quick-transfer stage (QTRAN)
Figure : Schema describing the iterations

20.
The quick-transfer stage (QTRAN)
→ Let L1 = IC1(I) and L2 = IC2(I). For each point I (I = 1,2,…,M):
Step 1
If:
R1 = NC(L1)·D(I,L1)² / (NC(L1) − 1) < R2 = NC(L2)·D(I,L2)² / (NC(L2) + 1)
no relocation takes place and point I remains in cluster L1.
Step 2
Otherwise:
R1 = NC(L1)·D(I,L1)² / (NC(L1) − 1) ≥ R2 = NC(L2)·D(I,L2)² / (NC(L2) + 1)
so switch IC1(I) and IC2(I), update the centres of clusters L1 and L2, and note the two clusters for their involvement in a transfer at this step.

21.
Main steps of the K-means algorithm
Figure : Schema describing the iterations
→ The iterations continue until convergence, that is, until no movement of a point from one cluster to another reduces the within-cluster sum of squares.
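Putting the stages together, here is a compact Python sketch of the transfer criterion driving the iterations. It deliberately omits the live-set and quick-transfer bookkeeping of AS 136, which speed up the search without changing the criterion, so it is an illustration rather than the published algorithm:

```python
def hartigan_kmeans(points, centers):
    """Simplified Hartigan-Wong-style K-means: a point is moved
    whenever the transfer test R2 < R1 shows the move reduces the
    total WSS. Live-set / quick-transfer bookkeeping is omitted."""
    def d2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

    centers = [list(c) for c in centers]
    k_num = len(centers)
    assign = [min(range(k_num), key=lambda k: d2(p, centers[k])) for p in points]
    counts = [assign.count(k) for k in range(k_num)]
    for k in range(k_num):  # make centres consistent with the assignment
        members = [p for p, a in zip(points, assign) if a == k]
        if members:
            centers[k] = [sum(x) / len(members) for x in zip(*members)]

    moved = True
    while moved:  # iterate until no transfer reduces the WSS
        moved = False
        for i, p in enumerate(points):
            l1 = assign[i]
            if counts[l1] == 1:
                continue  # never empty a cluster
            r1 = counts[l1] * d2(p, centers[l1]) / (counts[l1] - 1)
            best, best_r = None, r1
            for k in range(k_num):
                if k != l1:
                    r2 = counts[k] * d2(p, centers[k]) / (counts[k] + 1)
                    if r2 < best_r:
                        best, best_r = k, r2
            if best is not None:
                for j in range(len(p)):  # incremental centre updates
                    centers[l1][j] = (centers[l1][j] * counts[l1] - p[j]) / (counts[l1] - 1)
                    centers[best][j] = (centers[best][j] * counts[best] + p[j]) / (counts[best] + 1)
                counts[l1] -= 1
                counts[best] += 1
                assign[i] = best
                moved = True
    return assign, centers

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (0.0, 1.0), (10.0, 1.0)]
assign, cen = hartigan_kmeans(pts, [(0.0, 0.0), (1.0, 1.0)])
print(assign)  # [0, 0, 1, 1, 0, 1]: the two obvious groups are recovered
```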

22.
Interactive demonstration

23.
Initialization

24.
Iteration 1

25.
Iteration 2

26.
Iteration 3

27.
Iteration 4

28.
Iteration 5

29.
Iteration 6

30.
Last iteration

31.
Comparison with related algorithm
Strength of the present algorithm
Weaknesses and solutions
AS 113: A Transfer Algorithm for Non-Hierarchical Classification
by Banfield and Bassil in 1977
The user is allowed to choose the criterion function to be minimized.
This algorithm uses swaps as well as transfers, so as to try to overcome the problem of local optima.
However, a significant amount of work is off-loaded onto the user, and the algorithm is more expensive than AS 136 for large M.

32.
AS 58: Euclidean Cluster Analysis
by Sparks in 1973
Like AS 136, it aims at finding a K-partition of the sample whose WSS cannot be reduced by moving points from one cluster to another.
However, at the stage where each point is examined in turn to see if it should be reassigned to a different cluster, only the closest centre is used to check for possible reallocation of the given point.
As a result, algorithm AS 58 does not provide a locally optimal solution.

33.
Relatively efficient
Figure : Algorithms AS 58 and AS 136 are tested and compared on various generated data sets.
⇒ AS 136 achieves noticeably higher accuracy and requires less computation time than AS 58, especially when K is large.

34.
Relatively efficient
Time = C·M·N·K·I
C: speed constant of the computer (for example 2.1×10⁻⁵ seconds for an IBM 370/158)
M: number of points
N: number of dimensions
K: number of clusters
I: number of iterations (usually less than 10)
⇒ The present algorithm has a high speed of convergence, even if it finds a local optimum instead of a global one.

35.
Other properties
K-means clustering can handle large datasets
It is based on a very simple idea and does not require much computation time
The cost function is strictly decreasing
The resulting partition has no empty clusters
The resulting partition has distinct means

36.
Sensitive to the initial selection of centroids
Recommended solution
Run the algorithm several times, each time with a different set of initial cluster centers.
The kmeans() function has an nstart option that attempts multiple initial configurations and reports the best one.
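The restart strategy itself is generic and can be sketched as follows (Python; `run_once` is a hypothetical stand-in for one complete clustering run returning a result and its WSS, not part of the original slides):

```python
import random

def best_of_n_starts(run_once, n_starts, seed=0):
    """Run a clustering routine n_starts times with different random
    seeds and keep the result with the lowest within-cluster SS.
    `run_once` is any callable: seed -> (result, wss)."""
    rng = random.Random(seed)
    best_result, best_wss = None, float("inf")
    for _ in range(n_starts):
        result, wss = run_once(rng.random())
        if wss < best_wss:
            best_result, best_wss = result, wss
    return best_result, best_wss

# Toy stand-in: a "run" whose quality depends directly on its seed,
# so the helper simply keeps the smallest of five uniform draws.
result, wss = best_of_n_starts(lambda s: (round(s, 2), s), n_starts=5)
print(wss < 1.0)  # True
```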
Solution proposed in the article
The points are first ordered by their distances to the overall mean of the sample. For cluster L (L = 1,2,…,K), the [1 + (L−1)·(M/K)]-th point is chosen to be its initial cluster centre.
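The article's seeding rule can be sketched directly (Python for illustration; the slide's 1-based index [1 + (L−1)·(M/K)] becomes (L−1)·M//K in 0-based indexing, assuming M is a multiple of K):

```python
def initial_centres(points, k):
    """Seeding rule from the article: order the points by their
    distance to the overall mean of the sample, then take the
    [1 + (L-1)*(M/K)]-th point (1-based) as the initial centre
    of cluster L."""
    m = len(points)
    mean = [sum(x) / m for x in zip(*points)]

    def dist2(p):
        return sum((pi - mi) ** 2 for pi, mi in zip(p, mean))

    ordered = sorted(points, key=dist2)
    return [ordered[(l - 1) * m // k] for l in range(1, k + 1)]

# Six 1-D points; centres are spread from innermost to outermost.
pts = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
print(initial_centres(pts, 3))  # [(2.0,), (1.0,), (0.0,)]
```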

37.
Sensitive to noisy data and outliers
Book's solution: the K-means algorithm including weights
The algorithm is based on the mean, a descriptive statistic that is not robust to outliers, and it gives the same level of relevance to all the features in a dataset.
The general idea is to include feature weighting in the K-means clustering criterion, in order to give a lower weight to the variables affected by high noise.

38.
Sensitive to noisy data and outliers
Changes to include weights
New parameters:
Input: double WH(M), the weight of each point
Output: double WHC(K), the weight of each cluster
Centre of cluster L:
C(L) = ( Σ_{I∈L} WH(I)·X_I ) / WHC(L)
Quantities compared when a point changes cluster:
R1 = WHC(L1)·D(I,L1)² / (WHC(L1) − WH(I))
R2 = WHC(L2)·D(I,L2)² / (WHC(L2) + WH(I))

39.
The number of clusters K needs to be specified in advance
Recommended solution
Perform multiple trials to find the best number of clusters.
Other solution: rule of thumb
The rule of thumb sets the number of clusters to
K ≈ √(N/2)
where N is the number of data points.
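As a one-line sketch (Python; rounding to the nearest integer is a choice made here, since the slide only gives the approximation):

```python
import math

def rule_of_thumb_k(n):
    """K ~ sqrt(N/2), rounded to the nearest integer."""
    return round(math.sqrt(n / 2))

print(rule_of_thumb_k(200))  # sqrt(100) = 10
```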

40.
Conclusion
K-means is the most widely used clustering algorithm, owing to its inherent simplicity, speed, and empirical success.
However, in its basic form it has limitations, such as sensitivity to the initial partition, sensitivity to noise, and the requirement of a predefined number of clusters.
Thus, the algorithm needs to be improved in order to remain as popular.

41.
References
J. A. Hartigan and M. A. Wong (1979), "Algorithm AS 136: A K-Means Clustering Algorithm", Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1), 100-108.
J. A. Hartigan (1975), Clustering Algorithms, Wiley Series in Probability and Mathematical Statistics.
Laurence Morissette and Sylvain Chartier, "The k-means clustering technique: General considerations and implementation in Mathematica".
R-statistics blog, "K-means Clustering (from 'R in Action')".
Tutorial with an introduction to clustering algorithms and an interactive demonstration: http://home.deib.polimi.it/matteucc/Clustering/tutorial-html/AppletKM.html