K-Means Clustering

SOME METHODS FOR CLASSIFICATION AND ANALYSIS OF MULTIVARIATE OBSERVATIONS

J. MACQUEEN

UNIVERSITY OF CALIFORNIA, LOS ANGELES

1. Introduction

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then

$$W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$$

tends to be low for the partitions $S$ generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables.
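To make the within-class variance concrete, the following is a minimal sketch of its sample analogue, assuming the population is replaced by a finite sample in which every point carries equal mass 1/n. The function names (`mean`, `sq_dist`, `within_class_variance`) and the toy data are illustrative assumptions, not part of the paper.

```python
from typing import List, Sequence

Point = Sequence[float]


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance |a - b|^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_class_variance(partition: List[List[Point]]) -> float:
    """Sample analogue of W^2(S), assuming each point has mass 1/n.

    Each cell S_i contributes the squared distances of its points
    to the cell's own mean u_i.
    """
    n = sum(len(cell) for cell in partition)
    total = 0.0
    for cell in partition:
        if not cell:
            continue  # an empty cell contributes nothing
        u = mean(cell)
        total += sum(sq_dist(z, u) for z in cell)
    return total / n


# Hypothetical toy sample: two tight pairs of points, far apart.
sample = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [[sample[0], sample[1]], [sample[2], sample[3]]]  # groups nearby points
bad = [[sample[0], sample[2]], [sample[1], sample[3]]]   # groups distant points

w_good = within_class_variance(good)
w_bad = within_class_variance(bad)
print(w_good, w_bad)  # 1.0 25.0
```

The partition that groups nearby points has the lower value, which is the sense in which a partition is called efficient here.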

In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4.

The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special

This work was supported by the Western Management Science Institute under a grant from the Ford Foundation, and by the Office of Naval Research under Contract No. 233(75), Task No. 047-041.


Bulletin de l'Académie polonaise des sciences
Cl. III, Vol. IV, No. 12, 1956

MATHEMATICS

On the Division of Material Bodies into Parts (Sur la division des corps matériels en parties) ¹

by

H. STEINHAUS

Presented on 19 October 1956

A body $Q$ is, by definition, a distribution of matter in space, given by a function $f(P)$; this function is called the density of the body in question. It is defined for all points $P$ of space, and it is non-negative and measurable. We assume that the characteristic set of the body, $E = \underset{P}{E}\,\{f(P) > 0\}$ (that is, the set of points $P$ at which $f(P) > 0$), is bounded and of positive measure; we also assume that the integral of $f(P)$ over $E$ is finite: this is the mass of the body $Q$. Two bodies whose densities are equal except on a set of measure zero are regarded as identical.

By decomposing the characteristic set of a body $Q$ into $n$ subsets $E_i$ ($i = 1, 2, \ldots, n$) of positive measure, we obtain a division of the body in question into $n$ partial bodies; their respective characteristic sets are the $E_i$, and their densities are defined by the values that the density of the body $Q$ takes on these partial sets. Denoting the partial bodies by $Q_i$, we write $Q = Q_1 + Q_2 + \ldots + Q_n$. Conversely, when $n$ bodies $Q_i$ are given first, whose characteristic sets are pairwise disjoint up to a set of measure zero, there evidently exists a body $Q$ having these $Q_i$ as its parts; we write $Q_1 + Q_2 + \ldots + Q_n = Q$. These remarks suffice to explain the division and the composition of bodies.

The problem of this Note is the division of a body into $n$ parts $K_i$ ($i = 1, 2, \ldots, n$) and the choice of $n$ points $A_i$ so as to make as small as possible the sum

$$S(K, A) = \sum_{i=1}^{n} I(K_i, A_i) \qquad (K \equiv \{K_i\},\ A \equiv \{A_i\}), \tag{1}$$

where $I(Q, P)$ denotes, in general, the moment of inertia of an arbitrary body $Q$ with respect to an arbitrary point $P$. To treat this elementary problem we shall make use of the following lemmas:

1. This article by Hugo Steinhaus is the first to formulate explicitly, in finite dimension, the problem of partitioning by k-means (k-moyennes), also known as "nuées dynamiques" (dynamic clouds). Its classical algorithm is the same as that of Lloyd-Max optimal quantization. As the article is hard to obtain in digital form, it is reproduced here, converted ("transduit") by Maciej Denkowski, forwarded by Jérôme Bolte, and transcribed by Laurent Duval, in July/August 2015. An effort has been made to stay close to the original pagination.
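As an illustration of the sum (1), the following is a minimal sketch that replaces the continuous body by a finite set of point masses and takes the usual definition of the moment of inertia about a point, the sum of mass times squared distance. Both the discretization and the names (`Body`, `moment_of_inertia`, `centroid`, `total_inertia`) are assumptions made for this example and are not taken from the Note.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Assumption: a body is approximated by (position, mass) pairs,
# a discrete stand-in for the density f(P).
Body = List[Tuple[Point, float]]


def moment_of_inertia(body: Body, p: Point) -> float:
    """I(Q, P): sum of mass * squared distance to the point p."""
    return sum(
        m * sum((x - y) ** 2 for x, y in zip(pos, p))
        for pos, m in body
    )


def centroid(body: Body) -> List[float]:
    """Mass-weighted mean position of a body with positive total mass."""
    mass = sum(m for _, m in body)
    dim = len(body[0][0])
    return [sum(m * pos[d] for pos, m in body) / mass for d in range(dim)]


def total_inertia(parts: List[Body], points: List[Point]) -> float:
    """S(K, A) from (1): the sum over i of I(K_i, A_i)."""
    return sum(moment_of_inertia(k, a) for k, a in zip(parts, points))


# Hypothetical division of a body into two parts K_1 and K_2.
k1: Body = [((0.0, 0.0), 1.0), ((2.0, 0.0), 3.0)]
k2: Body = [((10.0, 0.0), 2.0), ((10.0, 4.0), 2.0)]
parts = [k1, k2]

at_centroids = total_inertia(parts, [centroid(k1), centroid(k2)])
elsewhere = total_inertia(parts, [(0.0, 0.0), (10.0, 0.0)])
print(at_centroids, elsewhere)  # 19.0 44.0
```

In this toy case, placing each point $A_i$ at the centroid of its part gives a smaller sum than the other choice shown, consistent with the parallel-axis relation for moments of inertia.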


1956-1967

Unsupervised

Clustering