KMEAN

Performs a K‑means (centroid) cluster analysis.

Required Arguments

XNOBS by NCOL matrix containing the observations to be clustered. (Input)
The only columns of X used are those indicated by IND and possibly IFRQ and/or IWT.

CMK by NVAR matrix containing, on input, the cluster seeds, i.e., estimates for the cluster centers, and the cluster means on output. (Input/Output)
The cluster seeds must be unique.

SWTK by NVAR matrix containing the sum of the weights used to compute each cluster mean. (Output)
Missing observations are excluded from SWT.

IC — Vector of length NOBS containing the cluster membership for each observation. (Output)

NC — Vector of length K containing the number of observations in each cluster. (Output)

WSS — Vector of length K containing the within sum of squares for each cluster. (Output)

Optional Arguments

NOBS — Number of observations. (Input)
Default: NOBS = size (X,1).

NCOL — Number of columns in X. (Input)
Default: NCOL = size (X,2).

NVAR — Number of variables to be used in computing the metric. (Input)
Default: NVAR = size (CM,2).

LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).

IFRQ — Frequency option. (Input)
IFRQ = 0 means all frequencies are 1. For positive IFRQ, column number IFRQ of X contains the nonnegative frequencies.
Default: IFRQ = 0.

IWT — Weighting option. (Input)
IWT = 0 means all weights are 1. For positive IWT, column number IWT contains the nonnegative weights.
Default: IWT = 0.

IND — Vector of length NVAR containing the columns of X to be used in computing the metric. (Input)
In the usual case in which X is the data matrix, no observation has multiple frequency, and unequal weighting is not desired, IND = (1, 2, 3, , NVAR).
By default, IND(I) = (I)

K — Number of clusters. (Input)
Default: K = size (CM,1).

MAXIT — Maximum number of iterations. (Input)
MAXIT = 30 is usually sufficient.
Default: MAXIT = 30.

LDCM — Leading dimension of CM exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCM = size (CM,1).

LDSWT — Leading dimension of SWT exactly as specified in the dimension statement in the calling program. (Input)
Default: LDSWT = size (SWT,1).

FORTRAN 90 Interface

Generic: CALL KMEAN (X, CM, SWT, IC, NC, WSS [])

Specific: The specific interface names are S_KMEAN and D_KMEAN.

FORTRAN 77 Interface

Single: CALL KMEAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS)

Double: The double precision name is DKMEAN.

Description

Routine KMEAN is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K‑means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. Routine KMEAN allows for missing values (coded as NaN, “not a number”) and for weights and frequencies.

Let p = NVAR denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K‑means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within‑cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,

 

where νim denotes the row index of the m‑th observation in the i‑th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is zero if the j‑th variable on observation νim is missing, otherwise δ is one; and

 

is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within‑cluster sums of squares. The user in referred to Hartigan and Wong (1979) or Hartigan (1975) for the details.

Comments

1. Workspace may be explicitly provided, if desired, by use of K2EAN/DK2EAN. The reference is:

CALL K2EAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS, IC2, NCP, D, ITRAN, LIVE)

The additional arguments are as follows:

IC2 — Work vector of length NOBS.

NCP — Work vector of length K.

D — Work vector of length NOBS.

ITRAN — Work vector of length K.

LIVE — Work vector of length K.

2. Informational Error

 

Type

Code

Description

3

1

Convergence did not occur within MAXIT iterations.

Example

This example performs K‑means cluster analysis on Fisher’s iris data, which is first obtained via routine GDATA (see Chapter 19, “Utilities”). The initial cluster seed for each iris type is an observation known to be in the iris type.

 

USE IMSL_LIBRARIES

 

IMPLICIT NONE

INTEGER IPRINT, K, LDCM, LDSWT, LDX, NCOL, NOBS, NV, NVAR

PARAMETER (IPRINT=0, K=3, NCOL=5, NOBS=150, NV=5, NVAR=4, &

LDCM=K, LDSWT=K, LDX=NOBS)

!

INTEGER IC(NOBS), IND(NVAR), NC(K), NXCOL, NXROW

REAL CM(K,NVAR), SWT(K,NVAR), WSS(K), X(NOBS,NV)

!

DATA IND/2, 3, 4, 5/

!

CALL GDATA (3, X, NXROW, NXCOL)

! Copy the cluster seeds into CM

CALL SCOPY (NVAR, X(1:,2), LDX, CM(1:,1), LDCM)

CALL SCOPY (NVAR, X(51:,2), LDX, CM(2:,1), LDCM)

CALL SCOPY (NVAR, X(101:,2), LDX, CM(3:,1), LDCM)

!

CALL KMEAN (X, CM, SWT, IC, NC, WSS, IND=IND)

!

CALL WRRRN ('CM', CM)

CALL WRRRN ('SWT', SWT)

CALL WRIRN ('IC', IC, 1, NOBS, 1)

CALL WRIRN ('NC', NC, 1, K, 1)

CALL WRRRN ('WSS', WSS, 1, K, 1)

END

Output

 

CM

1 2 3 4

1 5.006 3.428 1.462 0.246

2 5.902 2.748 4.394 1.434

3 6.850 3.074 5.742 2.071

 

SWT

1 2 3 4

1 50.00 50.00 50.00 50.00

2 62.00 62.00 62.00 62.00

3 38.00 38.00 38.00 38.00

 

IC

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

 

41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2

 

61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2

 

81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

 

100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115

2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2

 

116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131

3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3

132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147

3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2

 

148 149 150

3 3 2

 

NC

1 2 3

50 62 38

 

WSS

1 2 3

15.15 39.82 23.88