clusterKMeans¶

Performs a K-means (centroid) cluster analysis.

Synopsis¶

clusterKMeans (nVariables, x, clusterSeeds)

Required Arguments¶

int nVariables (Input): Number of variables to be used in computing the metric.
float x[[]] (Input): Array of length nObservations × nVariables containing the observations to be clustered.
float clusterSeeds[[]] (Input): Array of length nClusters × nVariables containing the cluster seeds, i.e., estimates for the cluster centers.

Return Value¶

An array of length nObservations containing the cluster membership of each observation. Clusters are numbered 1, 2, …, nClusters.

Optional Arguments¶

weights, float[] (Input)

Array of length nObservations containing the weight of each observation of matrix x.

Default: weights = 1.

frequencies, float[] (Input)

Array of length nObservations containing the frequency of each observation of matrix x.

Default: frequencies = 1.

maxIterations, int (Input)

Maximum number of iterations.

Default: maxIterations = 30.

clusterHistory (Output)

Array of size nIter × nObservations containing the cluster membership of each observation at each iteration, where nIter is the number of iterations completed by the algorithm.
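One use of a membership history is to count how many observations changed cluster between consecutive iterations. The following is a minimal sketch in plain NumPy. It assumes the history has already been converted to an integer array of shape (nIter, nObservations); the helper name changes_per_iteration and the sample data are hypothetical and not part of PyIMSL.

```python
import numpy as np


def changes_per_iteration(history):
    """Count observations whose cluster label differs from the previous iteration.

    history: array-like of shape (n_iter, n_observations) (assumed layout).
    Returns an integer array of length n_iter - 1.
    """
    history = np.asarray(history)
    if history.ndim != 2:
        raise ValueError("history must be two-dimensional")
    # Compare each row with the row before it and count the mismatches.
    return np.count_nonzero(history[1:] != history[:-1], axis=1)


# Made-up history for 5 observations over 3 iterations (illustration only).
example_history = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 3, 3],
    [1, 2, 2, 3, 3],
]
print(changes_per_iteration(example_history))  # prints [2 0]
```

A trailing run of zeros in this count suggests that the assignments had stabilized before the iteration limit was reached.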

clusterMeans (Output)

An array of length nClusters × nVariables containing the cluster means.

clusterSsq (Output)

Array of length nClusters containing the within sum-of-squares for each cluster.

clusterCounts (Output)

An array of length nClusters containing the number of observations in each cluster.

clusterVariableColumns, int[] (Input)

Vector of length nVariables containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, …, nVariables − 1.

Default: clusterVariableColumns [ ] = 0, 1, 2, …, nVariables − 1.

Description¶

Function clusterKMeans is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.

Let p = nVariables be the number of variables to be used in computing the Euclidean distance between observations. K-means cluster analysis seeks a clustering (or grouping) of the observations that minimizes the total within-cluster sum of squares. Here, the sum of squares within each cluster is computed as the centered sum of squares over all nonmissing values of each variable, and the total is taken over all K clusters. That is,

\[\phi = \sum_{i=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left(x_{\nu_{im},j} - \overline{x}_{ij}\right)^2\]

where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix X; \(n_i\) is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is 0 if the j-th variable on observation \(\nu_{im}\) is missing, otherwise δ is 1; and

\[\overline{x}_{ij}\]

is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
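The objective φ can be evaluated directly for any given assignment, which also makes the reassignment rule concrete: a move is worthwhile when φ decreases. The sketch below is an illustration in plain NumPy, not the library's implementation. The function name total_within_ss is hypothetical, and it assumes that the cluster mean is the average of the nonmissing values weighted by frequency times weight, which the text above does not state explicitly.

```python
import numpy as np


def total_within_ss(x, membership, weights=None, frequencies=None):
    """Evaluate phi for a given assignment (illustrative, not the library code).

    x           : (n_observations, p) array; NaN marks a missing value.
    membership  : length n_observations cluster labels.
    weights, frequencies : length n_observations; both default to 1.
    """
    x = np.asarray(x, dtype=float)
    membership = np.asarray(membership)
    n = x.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    f = np.ones(n) if frequencies is None else np.asarray(frequencies, dtype=float)

    phi = 0.0
    for label in np.unique(membership):
        rows = membership == label
        xi = x[rows]
        fw = (f[rows] * w[rows])[:, None]   # f * w for each observation
        delta = ~np.isnan(xi)               # True where a value is present
        xi0 = np.where(delta, xi, 0.0)      # missing values replaced by 0
        denom = (fw * delta).sum(axis=0)
        # Assumption: the mean is the f*w-weighted average of nonmissing values.
        with np.errstate(invalid="ignore", divide="ignore"):
            mean = (fw * xi0).sum(axis=0) / denom
        mean = np.where(denom > 0, mean, 0.0)
        phi += float((fw * delta * (xi0 - mean) ** 2).sum())
    return phi


# Small made-up data set; the last row starts in the "wrong" cluster.
x = np.array([[1.0, 2.0], [1.2, np.nan], [5.0, 6.0], [5.2, 6.2], [1.1, 2.1]])
before = total_within_ss(x, [1, 1, 2, 2, 2])
after = total_within_ss(x, [1, 1, 2, 2, 1])   # last row reassigned to cluster 1
print(before > after)  # prints True
```

Recomputing φ from scratch for every candidate move is costly; it is used here only to keep the sketch close to the formula.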

Example¶

This example performs a K-means cluster analysis on Fisher’s Iris data, which is obtained by function dataSets (see Chapter 15, Utilities). The initial seed for each cluster is an observation known to belong to the corresponding iris type.

from numpy import empty
from pyimsl.stat.clusterKMeans import clusterKMeans
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.writeMatrix import writeMatrix

n_observations = 150
n_variables = 4
n_clusters = 3
cluster_seeds = empty(shape=(n_clusters, n_variables))
# Empty lists that receive the optional outputs
cluster_means = []
cluster_ssq = []
cluster_counts = []

# Retrieve the data set
x = dataSets(3)

# Assign initial cluster seeds
for i in range(0, n_variables):
    cluster_seeds[0][i] = x[0][i + 1]
    cluster_seeds[1][i] = x[50][i + 1]
    cluster_seeds[2][i] = x[100][i + 1]

# Perform the analysis, using the last four columns of x.
cluster_group = clusterKMeans(n_variables, x[0:n_observations, 1:5],
                              cluster_seeds,
                              clusterCounts=cluster_counts,
                              clusterMeans=cluster_means,
                              clusterSsq=cluster_ssq)

writeMatrix('Cluster Membership', cluster_group, writeFormat="%5i")
writeMatrix('Cluster Means', cluster_means)
writeMatrix('Cluster Sum of Squares', cluster_ssq)
writeMatrix('# Observations in Each Cluster', cluster_counts)

Output¶

 
                            Cluster Membership
    1      2      3      4      5      6      7      8      9     10     11
    1      1      1      1      1      1      1      1      1      1      1
 
   12     13     14     15     16     17     18     19     20     21     22
    1      1      1      1      1      1      1      1      1      1      1
 
   23     24     25     26     27     28     29     30     31     32     33
    1      1      1      1      1      1      1      1      1      1      1
 
   34     35     36     37     38     39     40     41     42     43     44
    1      1      1      1      1      1      1      1      1      1      1
 
   45     46     47     48     49     50     51     52     53     54     55
    1      1      1      1      1      1      2      2      3      2      2
 
   56     57     58     59     60     61     62     63     64     65     66
    2      2      2      2      2      2      2      2      2      2      2
 
   67     68     69     70     71     72     73     74     75     76     77
    2      2      2      2      2      2      2      2      2      2      2
 
   78     79     80     81     82     83     84     85     86     87     88
    3      2      2      2      2      2      2      2      2      2      2
 
   89     90     91     92     93     94     95     96     97     98     99
    2      2      2      2      2      2      2      2      2      2      2
 
  100    101    102    103    104    105    106    107    108    109    110
    2      3      2      3      3      3      3      2      3      3      3
 
  111    112    113    114    115    116    117    118    119    120    121
    3      3      3      2      2      3      3      3      3      2      3
 
  122    123    124    125    126    127    128    129    130    131    132
    2      3      2      3      3      2      2      3      3      3      3
 
  133    134    135    136    137    138    139    140    141    142    143
    3      2      3      3      3      3      2      3      3      3      2
 
  144    145    146    147    148    149    150
    3      3      3      2      3      3      2
 
                    Cluster Means
             1            2            3            4
1        5.006        3.428        1.462        0.246
2        5.902        2.748        4.394        1.434
3        6.850        3.074        5.742        2.071
 
       Cluster Sum of Squares
          1            2            3
      15.15        39.82        23.88
 
   # Observations in Each Cluster
          1            2            3
         50           62           38
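The three summary tables above follow from the membership vector and the data alone, so they can be cross-checked independently. The following is a minimal NumPy sketch under stated assumptions: complete data, unit weights and frequencies, and cluster labels running from 1 to nClusters as in the output above. The helper name summarize_clusters is hypothetical and not part of PyIMSL.

```python
import numpy as np


def summarize_clusters(x, membership, n_clusters):
    """Recompute per-cluster counts, means and within sums of squares.

    Assumes complete data, unit weights and frequencies, and labels
    1..n_clusters.
    """
    x = np.asarray(x, dtype=float)
    membership = np.asarray(membership)
    counts = np.zeros(n_clusters, dtype=int)
    means = np.full((n_clusters, x.shape[1]), np.nan)
    ssq = np.zeros(n_clusters)
    for k in range(1, n_clusters + 1):
        rows = x[membership == k]
        counts[k - 1] = rows.shape[0]
        if rows.shape[0] > 0:
            means[k - 1] = rows.mean(axis=0)
            # Centered sum of squares over all variables in cluster k.
            ssq[k - 1] = ((rows - means[k - 1]) ** 2).sum()
    return counts, means, ssq


# Tiny made-up data set (illustration only).
demo_x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
demo_counts, demo_means, demo_ssq = summarize_clusters(demo_x, [1, 1, 2], 2)
print(demo_counts, demo_ssq)
```

Applied to the arrays from the example, for instance summarize_clusters(x[0:n_observations, 1:5], cluster_group, n_clusters), the results would be expected to agree with the printed tables up to rounding.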

Warning Errors¶

IMSLS_NO_CONVERGENCE Convergence did not occur.