Slide 46
K-Means Clustering
SOME METHODS FOR
CLASSIFICATION AND ANALYSIS
OF MULTIVARIATE OBSERVATIONS
J. MACQUEEN
UNIVERSITY OF CALIFORNIA, LOS ANGELES
1. Introduction
The main purpose of this paper is to describe a process for partitioning an
N-dimensional population into k sets on the basis of a sample. The process,
which is called 'k-means,' appears to give partitions which are reasonably
efficient in the sense of within-class variance. That is, if p is the probability mass
function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$,
$i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then
$W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the
method. We say 'tends to be low,' primarily because of intuitive considerations,
corroborated to some extent by mathematical analysis and practical computational
experience. Also, the k-means procedure is easily programmed and is
computationally economical, so that it is feasible to process very large samples
on a digital computer. Possible applications include methods for similarity
grouping, nonlinear prediction, approximating multivariate distributions, and
nonparametric tests for independence among several variables.
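As an illustration of the within-class variance criterion described above, the following minimal sketch computes a sample analogue of W^2(S): the average squared distance from each sample point to the mean of its own class. It assumes a finite sample with class labels already assigned, which replaces the integral over the population by a sum over the sample. The function names `class_means` and `within_class_variance` are hypothetical and do not come from the paper.

```python
def class_means(points, labels, k):
    """Return the mean of each of the k classes (None for an empty class)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += x[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]


def within_class_variance(points, labels, k):
    """Sample analogue of W^2(S): mean squared distance to the own-class mean.

    Assumption: the population integral is replaced by an average over the sample.
    """
    means = class_means(points, labels, k)
    total = 0.0
    for x, c in zip(points, labels):
        total += sum((xd - md) ** 2 for xd, md in zip(x, means[c]))
    return total / len(points)


# Four points forming two well-separated pairs.
sample = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_class_variance(sample, [0, 0, 1, 1], k=2)  # pairs kept together
bad = within_class_variance(sample, [0, 1, 0, 1], k=2)   # pairs split apart
```

In this toy sample the partition that keeps nearby points together yields the smaller value, which is the sense in which a partition is "efficient" in the passage above.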
In addition to suggesting practical classification methods, the study of k-means
has proved to be theoretically interesting. The k-means concept represents a
generalization of the ordinary sample mean, and one is naturally led to study the
pertinent asymptotic behavior, the object being to establish some sort of law of
large numbers for the k-means. This problem is sufficiently interesting, in fact,
for us to devote a good portion of this paper to it. The k-means are defined in
section 2.1, and the main results which have been obtained on the asymptotic
behavior are given there. The rest of section 2 is devoted to the proofs of these
results. Section 3 describes several specific possible applications, and reports
some preliminary results from computer experiments conducted to explore the
possibilities inherent in the k-means idea. The extension to general metric spaces
is indicated briefly in section 4.
The original point of departure for the work described here was a series of
problems in optimal classification (MacQueen [9]) which represented special
This work was supported by the Western Management Science Institute under a grant from
the Ford Foundation, and by the Office of Naval Research under Contract No. 233(75), Task
No. 047-041.
Bulletin de l'Académie Polonaise des Sciences
Cl. III, Vol. IV, No. 12, 1956
MATHEMATICS
Sur la division des corps matériels en parties (On the division of material bodies into parts)¹
by
H. STEINHAUS
Presented on 19 October 1956
A body Q is, by definition, a distribution of matter in space, given by a
function f(P); this function is called the density of the body in question. It is
defined for all points P of the space, and it is non-negative and measurable. We
assume that the characteristic set of the body, $E = \{P : f(P) > 0\}$, is bounded
and of positive measure; we also assume that the integral of f(P) over E is
finite: this is the mass of the body Q. Two bodies whose densities are equal
up to a set of measure zero are considered identical.
By decomposing the characteristic set of a body Q into n subsets $E_i$
(i = 1, 2, ..., n) of positive measure, one obtains a division of the body in
question into n partial bodies; their respective characteristic sets are the
$E_i$, and their densities are defined by the values that the density of the body Q
takes on these partial sets. Denoting the partial bodies by $Q_i$, we write
$Q = Q_1 + Q_2 + \ldots + Q_n$. When n bodies $Q_i$ are given first, whose
characteristic sets are pairwise disjoint up to a set of measure zero, there
evidently exists a body Q having these $Q_i$ as its parts; we write
$Q_1 + Q_2 + \ldots + Q_n = Q$. These remarks suffice to explain the division and
the composition of bodies.
The problem of this Note is the division of a body into n parts $K_i$
(i = 1, 2, ..., n) and the choice of n points $A_i$ so as to make the sum
(1) $S(K, A) = \sum_{i=1}^{n} I(K_i, A_i)$ $(K \equiv \{K_i\},\ A \equiv \{A_i\})$
as small as possible, where I(Q, P) denotes, in general, the moment of inertia of
an arbitrary body Q with respect to an arbitrary point P. To treat this
elementary problem we shall rely on the following lemmas:
1. This article by Hugo Steinhaus is the first to formulate explicitly, in finite
dimension, the k-means (k-moyennes) partitioning problem, also known as "nuées
dynamiques" (dynamic clouds). Its classical algorithm is the same as that of
Lloyd-Max optimal quantization. Because it is hard to obtain in digital form, it is
reproduced here, converted by Maciej Denkowski, forwarded by Jérôme Bolte, and
transcribed by Laurent Duval in July/August 2015. An effort was made to stay close
to the original pagination.
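To make the sum (1) in Steinhaus's note concrete, here is a minimal sketch that evaluates S(K, A) when each part K_i is approximated by a finite set of point masses. This discretization is an assumption made for illustration; the note itself works with measurable densities. The sketch also uses a standard fact from mechanics that is not stated in the excerpt: for a fixed part, the moment of inertia about a point is smallest when that point is the part's center of mass. The names `moment_of_inertia`, `centroid`, and `steinhaus_sum` are hypothetical.

```python
def moment_of_inertia(body, point):
    """I(Q, P): sum of mass times squared distance to `point`.

    Assumption: the body is a list of (mass, coordinates) point masses.
    """
    return sum(
        m * sum((x - p) ** 2 for x, p in zip(pos, point))
        for m, pos in body
    )


def centroid(body):
    """Center of mass of a body given as (mass, coordinates) pairs."""
    total_mass = sum(m for m, _ in body)
    dim = len(body[0][1])
    return tuple(
        sum(m * pos[d] for m, pos in body) / total_mass for d in range(dim)
    )


def steinhaus_sum(parts, points):
    """S(K, A): sum over i of I(K_i, A_i)."""
    return sum(moment_of_inertia(K_i, A_i) for K_i, A_i in zip(parts, points))


# Two parts, each made of two point masses.
parts = [
    [(1.0, (0.0, 0.0)), (1.0, (2.0, 0.0))],
    [(2.0, (10.0, 0.0)), (2.0, (10.0, 4.0))],
]
at_centroids = steinhaus_sum(parts, [centroid(K) for K in parts])
elsewhere = steinhaus_sum(parts, [(0.0, 0.0), (10.0, 0.0)])
```

With unit masses and the A_i taken as centers of mass, this sum coincides with the within-class sum of squares used in k-means, which is why the note is read as an early formulation of the same problem.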
1956-1967
Unsupervised
Clustering