KMEAN
Performs a K‑means (centroid) cluster analysis.
Required Arguments
X — NOBS by NCOL matrix containing the observations to be clustered. (Input)
The only columns of X used are those indicated by IND and possibly IFRQ and/or IWT.
CM — K by NVAR matrix containing, on input, the cluster seeds, i.e., estimates for the cluster centers, and the cluster means on output. (Input/Output)
The cluster seeds must be unique.
SWT — K by NVAR matrix containing the sum of the weights used to compute each cluster mean. (Output)
Missing observations are excluded from SWT.
IC — Vector of length NOBS containing the cluster membership for each observation. (Output)
NC — Vector of length K containing the number of observations in each cluster. (Output)
WSS — Vector of length K containing the within sum of squares for each cluster. (Output)
Optional Arguments
NOBS — Number of observations. (Input)
Default: NOBS = size (X,1).
NCOL — Number of columns in X. (Input)
Default: NCOL = size (X,2).
NVAR — Number of variables to be used in computing the metric. (Input)
Default: NVAR = size (CM,2).
LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).
IFRQ — Frequency option. (Input)
IFRQ = 0 means all frequencies are 1. For positive IFRQ, column number IFRQ of X contains the nonnegative frequencies.
Default: IFRQ = 0.
IWT — Weighting option. (Input)
IWT = 0 means all weights are 1. For positive IWT, column number IWT contains the nonnegative weights.
Default: IWT = 0.
IND — Vector of length NVAR containing the columns of X to be used in computing the metric. (Input)
In the usual case in which X is the data matrix, no observation has multiple frequency, and unequal weighting is not desired, IND = (1, 2, 3, …, NVAR).
By default, IND(I) = (I)
K — Number of clusters. (Input)
Default: K = size (CM,1).
MAXIT — Maximum number of iterations. (Input)
MAXIT = 30 is usually sufficient.
Default: MAXIT = 30.
LDCM — Leading dimension of CM exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCM = size (CM,1).
LDSWT — Leading dimension of SWT exactly as specified in the dimension statement in the calling program. (Input)
Default: LDSWT = size (SWT,1).
FORTRAN 90 Interface
Generic: CALL KMEAN (X, CM, SWT, IC, NC, WSS [, …])
Specific: The specific interface names are S_KMEAN and D_KMEAN.
FORTRAN 77 Interface
Single: CALL KMEAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS)
Double: The double precision name is DKMEAN.
Description
Routine KMEAN is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K‑means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. Routine KMEAN allows for missing values (coded as NaN, “not a number”) and for weights and frequencies.
Let p = NVAR denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K‑means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within‑cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,
where νim denotes the row index of the m‑th observation in the i‑th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is zero if the j‑th variable on observation νim is missing, otherwise δ is one; and
is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within‑cluster sums of squares. The user in referred to Hartigan and Wong (1979) or Hartigan (1975) for the details.
Comments
1. Workspace may be explicitly provided, if desired, by use of K2EAN/DK2EAN. The reference is:
CALL K2EAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS, IC2, NCP, D, ITRAN, LIVE)
The additional arguments are as follows:
IC2 — Work vector of length NOBS.
NCP — Work vector of length K.
D — Work vector of length NOBS.
ITRAN — Work vector of length K.
LIVE — Work vector of length K.
2. Informational Error
Type |
Code |
Description |
3 |
1 |
Convergence did not occur within MAXIT iterations. |
Example
This example performs K‑means cluster analysis on Fisher’s iris data, which is first obtained via routine GDATA (see Chapter 19, “Utilities”). The initial cluster seed for each iris type is an observation known to be in the iris type.
USE IMSL_LIBRARIES
IMPLICIT NONE
INTEGER IPRINT, K, LDCM, LDSWT, LDX, NCOL, NOBS, NV, NVAR
PARAMETER (IPRINT=0, K=3, NCOL=5, NOBS=150, NV=5, NVAR=4, &
LDCM=K, LDSWT=K, LDX=NOBS)
!
INTEGER IC(NOBS), IND(NVAR), NC(K), NXCOL, NXROW
REAL CM(K,NVAR), SWT(K,NVAR), WSS(K), X(NOBS,NV)
!
DATA IND/2, 3, 4, 5/
!
CALL GDATA (3, X, NXROW, NXCOL)
! Copy the cluster seeds into CM
CALL SCOPY (NVAR, X(1:,2), LDX, CM(1:,1), LDCM)
CALL SCOPY (NVAR, X(51:,2), LDX, CM(2:,1), LDCM)
CALL SCOPY (NVAR, X(101:,2), LDX, CM(3:,1), LDCM)
!
CALL KMEAN (X, CM, SWT, IC, NC, WSS, IND=IND)
!
CALL WRRRN ('CM', CM)
CALL WRRRN ('SWT', SWT)
CALL WRIRN ('IC', IC, 1, NOBS, 1)
CALL WRIRN ('NC', NC, 1, K, 1)
CALL WRRRN ('WSS', WSS, 1, K, 1)
END
Output
CM
1 2 3 4
1 5.006 3.428 1.462 0.246
2 5.902 2.748 4.394 1.434
3 6.850 3.074 5.742 2.071
SWT
1 2 3 4
1 50.00 50.00 50.00 50.00
2 62.00 62.00 62.00 62.00
3 38.00 38.00 38.00 38.00
IC
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2
116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3
132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2
148 149 150
3 3 2
NC
1 2 3
50 62 38
WSS
1 2 3
15.15 39.82 23.88