clusterKMeans¶
Performs a K-means (centroid) cluster analysis.
Synopsis¶
clusterKMeans (nVariables, x, clusterSeeds)
Required Arguments¶
- int nVariables (Input) - Number of variables to be used in computing the metric.
- float x[[]] (Input) - Array of length nObservations × nVariables containing the observations to be clustered.
- float clusterSeeds[[]] (Input) - Array of length nClusters × nVariables containing the cluster seeds, i.e., estimates for the cluster centers.
Return Value¶
The cluster membership for each observation is returned.
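For orientation, a minimal call using only the required arguments might look like the following sketch; the data and seed values here are invented purely for illustration and are not part of the documented example.
from numpy import array
from pyimsl.stat.clusterKMeans import clusterKMeans

# Six observations of two variables, to be split into two clusters
x = array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# One seed row per cluster: initial estimates of the cluster centers
cluster_seeds = array([[1.0, 1.0], [5.0, 5.0]])

# Returns the cluster membership (cluster number) of each observation
membership = clusterKMeans(2, x, cluster_seeds)
print(membership)   # e.g. cluster 1 for the first three rows, cluster 2 for the rest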
Optional Arguments¶
- weights, float[] (Input) - Array of length nObservations containing the weight of each observation of matrix x. Default: weights = 1.
- frequencies, float[] (Input) - Array of length nObservations containing the frequency of each observation of matrix x. Default: frequencies = 1.
- maxIterations, int (Input) - Maximum number of iterations. Default: maxIterations = 30.
- clusterHistory (Output) - An array of size nIter by nObservations containing the cluster membership of each observation per iteration. Note that nIter is the number of completed iterations in the algorithm.
- clusterMeans (Output) - An array of length nClusters × nVariables containing the cluster means.
- clusterSsq (Output) - Array of length nClusters containing the within sum-of-squares for each cluster.
- clusterCounts (Output) - An array of length nClusters containing the number of observations in each cluster.
- clusterVariableColumns, int[] (Input) - Vector of length nVariables containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, …, nVariables. Default: clusterVariableColumns[] = 0, 1, 2, …, nVariables.
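The sketch below (illustrative values only, not the documented example) shows one way these optional arguments can be combined: input arguments such as weights and maxIterations are passed as keyword values, while output arguments such as clusterMeans are passed as empty Python lists that the function fills in, as the full example later in this section also does; clusterHistory is assumed here to follow the same empty-list pattern.
from numpy import array, ones
from pyimsl.stat.clusterKMeans import clusterKMeans

x = array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
cluster_seeds = array([[1.0, 1.0], [5.0, 5.0]])

wt = ones(6)          # one weight per observation (all 1.0, the default)
cluster_means = []    # output: filled with the nClusters x nVariables cluster means
history = []          # output: one row of memberships per completed iteration (assumed pattern)

membership = clusterKMeans(2, x, cluster_seeds,
                           weights=wt,
                           maxIterations=10,
                           clusterMeans=cluster_means,
                           clusterHistory=history)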
Description¶
Function clusterKMeans
is an implementation of Algorithm AS 136 by
Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric
clusters for an input matrix starting with initial estimates of the
K-cluster means. The function allows for missing values coded as NaN (Not
a Number) and for weights and frequencies.
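For example, a missing value in the data matrix is simply coded as NaN before the call; the small sketch below uses NumPy's nan, which is assumed here to be an acceptable NaN encoding.
from numpy import array, nan

# Illustrative 3 x 2 data matrix; the second variable of the first
# observation is missing and is therefore coded as NaN
x = array([[1.0, nan],
           [1.2, 0.9],
           [0.8, 1.1]])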
Let p = nVariables
be the number of variables to be used in computing
the Euclidean distance between observations. The idea in K-means cluster
analysis is to find a clustering (or grouping) of the observations so as to
minimize the total within-cluster sums-of-squares. In this case, the total
sums-of-squares within each cluster is computed as the sum of the centered
sum-of-squares over all nonmissing values of each variable. That is,

\[\sum_{i=1}^{K} \sum_{m=1}^{n_i} \sum_{j=1}^{p} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left(x_{\nu_{im},j} - \bar{x}_{ij}\right)^2\]

where K is the number of clusters (nClusters); \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix X; \(n_i\) is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is 0 if the j-th variable on observation \(\nu_{im}\) is missing, otherwise δ is 1; and

\[\bar{x}_{ij} = \frac{\sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j}\, x_{\nu_{im},j}}{\sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j}}\]

is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
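The criterion above can be written out directly. The short sketch below (plain NumPy, not part of pyimsl; total_within_ss is a hypothetical helper) evaluates the weighted total within-cluster sum-of-squares for a given membership vector and set of cluster means, skipping missing (NaN) entries just as δ does in the formula.
from numpy import nansum, ones

def total_within_ss(x, membership, means, frequencies=None, weights=None):
    """Weighted total within-cluster sum-of-squares; NaN entries contribute 0."""
    n_obs = x.shape[0]
    frequencies = ones(n_obs) if frequencies is None else frequencies
    weights = ones(n_obs) if weights is None else weights
    total = 0.0
    for m in range(n_obs):
        center = means[int(membership[m]) - 1]   # membership uses 1-based cluster numbers
        sq = (x[m] - center) ** 2                # NaN wherever a variable is missing
        total += frequencies[m] * weights[m] * nansum(sq)
    return total
Applied to the iris example below (unit weights and frequencies, no missing values), summing the clusterSsq output should give the same value this helper returns for the final grouping.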
Example¶
This example performs K-means cluster analysis on Fisher’s Iris data, which is obtained by function dataSets (see Chapter 15, Utilities). The initial cluster seed for each iris type is an observation known to be in the iris type.
from numpy import *
from pyimsl.stat.clusterKMeans import clusterKMeans
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.writeMatrix import writeMatrix
n_observations = 150
n_variables = 4
n_clusters = 3
cluster_seeds = empty(shape=(n_clusters, n_variables))
cluster_means = []
cluster_ssq = []
cluster_counts = []
# Retrieve the data set
x = dataSets(3)
# Assign initial cluster seeds
for i in range(0, n_variables):
    cluster_seeds[0][i] = x[0][i + 1]
    cluster_seeds[1][i] = x[50][i + 1]
    cluster_seeds[2][i] = x[100][i + 1]
# Perform the analysis, using the last four columns of x.
cluster_group = clusterKMeans(n_variables, x[0:n_observations, 1:5],
                              cluster_seeds,
                              clusterCounts=cluster_counts,
                              clusterMeans=cluster_means,
                              clusterSsq=cluster_ssq)
writeMatrix('Cluster Membership', cluster_group, writeFormat="%5i")
writeMatrix('Cluster Means', cluster_means)
writeMatrix('Cluster Sum of Squares', cluster_ssq)
writeMatrix('# Observations in Each Cluster', cluster_counts)
Output¶
Cluster Membership
1 2 3 4 5 6 7 8 9 10 11
1 1 1 1 1 1 1 1 1 1 1
12 13 14 15 16 17 18 19 20 21 22
1 1 1 1 1 1 1 1 1 1 1
23 24 25 26 27 28 29 30 31 32 33
1 1 1 1 1 1 1 1 1 1 1
34 35 36 37 38 39 40 41 42 43 44
1 1 1 1 1 1 1 1 1 1 1
45 46 47 48 49 50 51 52 53 54 55
1 1 1 1 1 1 2 2 3 2 2
56 57 58 59 60 61 62 63 64 65 66
2 2 2 2 2 2 2 2 2 2 2
67 68 69 70 71 72 73 74 75 76 77
2 2 2 2 2 2 2 2 2 2 2
78 79 80 81 82 83 84 85 86 87 88
3 2 2 2 2 2 2 2 2 2 2
89 90 91 92 93 94 95 96 97 98 99
2 2 2 2 2 2 2 2 2 2 2
100 101 102 103 104 105 106 107 108 109 110
2 3 2 3 3 3 3 2 3 3 3
111 112 113 114 115 116 117 118 119 120 121
3 3 3 2 2 3 3 3 3 2 3
122 123 124 125 126 127 128 129 130 131 132
2 3 2 3 3 2 2 3 3 3 3
133 134 135 136 137 138 139 140 141 142 143
3 2 3 3 3 3 2 3 3 3 2
144 145 146 147 148 149 150
3 3 3 2 3 3 2
Cluster Means
1 2 3 4
1 5.006 3.428 1.462 0.246
2 5.902 2.748 4.394 1.434
3 6.850 3.074 5.742 2.071
Cluster Sum of Squares
1 2 3
15.15 39.82 23.88
# Observations in Each Cluster
1 2 3
50 62 38
Warning Errors¶
IMSLS_NO_CONVERGENCE | Convergence did not occur.