imsl.cluster.cluster_k_means

cluster_k_means(obs, cluster_seeds, weights=None, frequencies=None, max_iter=30, cluster_vars=None)

Perform a K-means (centroid) cluster analysis.

Parameters:
  • obs ((M,N) array_like) – Array of size M \(\times\) N containing the observations to be clustered.
  • cluster_seeds ((n_clusters, L) array_like) – Array containing the cluster seeds, i.e., estimates for the cluster centers. L denotes the number of columns of array obs used in the analysis; see argument cluster_vars.
  • weights ((M,) array_like, optional) –

    Array of length M containing the weight of each observation of array obs.

    Default: weights = [1, 1, …, 1].

  • frequencies ((M,) array_like, optional) –

    Array of length M containing the frequency of each observation of array obs.

    Default: frequencies = [1, 1, …, 1].

  • max_iter (int, optional) –

    The maximum number of iterations.

    Default: max_iter = 30.

  • cluster_vars ((L,) array_like, optional) –

    Array of length L containing the columns of obs to be used in computing the metric. The columns in array obs are numbered 0, 1, 2, …, N-1.

    Default: cluster_vars = [0, 1, 2, …, N - 1].
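
The shape relationship between obs, cluster_vars, and cluster_seeds can be illustrated with plain NumPy. The following is a minimal sketch: the data values and the name seed_rows are hypothetical and serve only to show that the seeds carry one column per entry of cluster_vars.

```python
import numpy as np

# Hypothetical data: 5 observations, 4 columns.
# Column 0 is a group label, not a measurement.
obs = np.array([[1.0, 5.1, 3.5, 1.4],
                [1.0, 4.9, 3.0, 1.4],
                [2.0, 7.0, 3.2, 4.7],
                [2.0, 6.4, 3.2, 4.5],
                [2.0, 6.9, 3.1, 4.9]])

cluster_vars = np.array([1, 2, 3])   # L = 3 columns enter the metric
seed_rows = [0, 2]                   # hypothetical: one seed row per cluster

# Seeds are restricted to the selected columns, so their shape is
# (n_clusters, L) rather than (n_clusters, N).
cluster_seeds = obs[np.ix_(seed_rows, cluster_vars)]

# Explicit versions of the documented defaults: unit weight and unit
# frequency for each of the M rows.
weights = np.ones(obs.shape[0])
frequencies = np.ones(obs.shape[0])
```

Arrays built this way match the documented shapes: cluster_seeds is (n_clusters, L), and weights and frequencies are each of length M.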

Returns:

  A named tuple with the following fields:
  • membership ((M,) ndarray) – Array containing the cluster membership for each observation.
  • history ((n_iter, M) ndarray) – Array of size n_iter \(\times\) M containing the cluster membership of each observation in array obs per iteration. Note that n_iter is the number of completed iterations in the algorithm.
  • means ((n_clusters, L) ndarray) – Array containing the cluster means.
  • ssq ((n_clusters,) ndarray) – Array containing the within sum-of-squares for each cluster.
  • counts ((n_clusters,) ndarray) – Array containing the number of observations in each cluster.
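
One way to use the history field is to count how many observations changed cluster from one iteration to the next. The sketch below assumes that the rows of history are ordered by iteration, as the description suggests. The helper name changes_per_iteration and the sample array are hypothetical.

```python
import numpy as np

def changes_per_iteration(history):
    """Count observations whose label differs from the previous iteration."""
    history = np.asarray(history)
    # Compare each row (iteration) with the row before it.
    return np.count_nonzero(history[1:] != history[:-1], axis=1)

# Hypothetical history: 3 iterations, 5 observations.
history = np.array([[1, 1, 2, 2, 2],
                    [1, 1, 1, 2, 2],
                    [1, 1, 1, 2, 2]])
changes = changes_per_iteration(history)
```

A trailing zero in the result indicates that the last completed iteration moved no observation.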

Notes

Function cluster_k_means is an implementation of Algorithm AS 136 by Hartigan and Wong ([1]). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.

Let p be the number of variables used to compute the Euclidean distance between observations (the quantity L in the parameter list above). K-means cluster analysis seeks a clustering (grouping) of the observations that minimizes the total within-cluster sum of squares. Here, the sum of squares within each cluster is the sum, over all variables, of the centered sum of squares of the nonmissing values of that variable.

That is,

\[\phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j}(x_{\nu_{im},j}-\bar{x}_{ij})^2\]

where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix obs; \(n_i\) is the number of rows of obs assigned to cluster i; f denotes the frequency of the observation; w denotes its weight; \(\delta\) is 0 if the j-th variable on observation \(\nu_{im}\) is missing, otherwise \(\delta\) is 1; and \(\bar{x}_{ij}\) is the average of the nonmissing observations for variable j in cluster i.

This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sum of squares. See [1] or [2] for details.
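
The objective above can be evaluated directly for a given assignment. The following NumPy sketch is not the library's implementation. It assumes cluster labels run from 1 to K (as in the example output below) and that the cluster mean is the frequency- and weight-weighted mean of the nonmissing values, which reduces to the plain average under the default unit weights and frequencies. The function name within_cluster_ssq and the sample data are hypothetical.

```python
import numpy as np

def within_cluster_ssq(obs, membership, n_clusters,
                       weights=None, frequencies=None):
    """Per-cluster within sum-of-squares; NaN entries count as missing."""
    obs = np.asarray(obs, dtype=float)
    membership = np.asarray(membership)
    n_obs = obs.shape[0]
    w = np.ones(n_obs) if weights is None else np.asarray(weights, dtype=float)
    f = (np.ones(n_obs) if frequencies is None
         else np.asarray(frequencies, dtype=float))

    ssq = np.zeros(n_clusters)
    for i in range(n_clusters):
        rows = membership == i + 1           # assumption: labels run 1..K
        x = obs[rows]
        fw = (f[rows] * w[rows])[:, np.newaxis]
        present = ~np.isnan(x)               # delta: True where observed
        x0 = np.where(present, x, 0.0)       # zero-fill missing entries
        total = (fw * present).sum(axis=0)   # per-variable sum of f*w
        # Assumption: weighted mean over the nonmissing values only.
        mean = np.divide((fw * x0).sum(axis=0), total,
                         out=np.zeros_like(total), where=total > 0)
        dev = np.where(present, x0 - mean, 0.0)
        ssq[i] = (fw * dev ** 2).sum()
    return ssq

# Hypothetical data: two clusters, one missing value.
obs = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [10.0, np.nan],
                [12.0, 8.0]])
membership = np.array([1, 1, 2, 2])
ssq = within_cluster_ssq(obs, membership, n_clusters=2)
phi = ssq.sum()                              # total objective
```

For the Iris example below, the per-cluster values from such a function can be compared against the ssq field when the four measurement columns and the membership field are passed in.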

References

[1] Hartigan, J.A. and M.A. Wong (1979), Algorithm AS 136: A K-means clustering algorithm, Applied Statistics, 28, 100-108.
[2] Hartigan, J.A. (1975), Clustering Algorithms, John Wiley & Sons, New York.

Examples

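This example performs a K-means cluster analysis on Fisher's Iris data. The first column of the data holds the iris type (1.0, 2.0, or 3.0) and is excluded from the analysis through cluster_vars; the remaining four columns hold the measurements. The initial seed for each cluster is an observation known to belong to the corresponding iris type.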

>>> import numpy as np
>>> import imsl.cluster as cluster
>>> fisher_iris_data = np.array(
... [[1.0, 5.1, 3.5, 1.4, .2], [1.0, 4.9, 3.0, 1.4, .2],
... [1.0, 4.7, 3.2, 1.3, .2], [1.0, 4.6, 3.1, 1.5, .2],
... [1.0, 5.0, 3.6, 1.4, .2], [1.0, 5.4, 3.9, 1.7, .4],
... [1.0, 4.6, 3.4, 1.4, .3], [1.0, 5.0, 3.4, 1.5, .2],
... [1.0, 4.4, 2.9, 1.4, .2], [1.0, 4.9, 3.1, 1.5, .1],
... [1.0, 5.4, 3.7, 1.5, .2], [1.0, 4.8, 3.4, 1.6, .2],
... [1.0, 4.8, 3.0, 1.4, .1], [1.0, 4.3, 3.0, 1.1, .1],
... [1.0, 5.8, 4.0, 1.2, .2], [1.0, 5.7, 4.4, 1.5, .4],
... [1.0, 5.4, 3.9, 1.3, .4], [1.0, 5.1, 3.5, 1.4, .3],
... [1.0, 5.7, 3.8, 1.7, .3], [1.0, 5.1, 3.8, 1.5, .3],
... [1.0, 5.4, 3.4, 1.7, .2], [1.0, 5.1, 3.7, 1.5, .4],
... [1.0, 4.6, 3.6, 1.0, .2], [1.0, 5.1, 3.3, 1.7, .5],
... [1.0, 4.8, 3.4, 1.9, .2], [1.0, 5.0, 3.0, 1.6, .2],
... [1.0, 5.0, 3.4, 1.6, .4], [1.0, 5.2, 3.5, 1.5, .2],
... [1.0, 5.2, 3.4, 1.4, .2], [1.0, 4.7, 3.2, 1.6, .2],
... [1.0, 4.8, 3.1, 1.6, .2], [1.0, 5.4, 3.4, 1.5, .4],
... [1.0, 5.2, 4.1, 1.5, .1], [1.0, 5.5, 4.2, 1.4, .2],
... [1.0, 4.9, 3.1, 1.5, .2], [1.0, 5.0, 3.2, 1.2, .2],
... [1.0, 5.5, 3.5, 1.3, .2], [1.0, 4.9, 3.6, 1.4, .1],
... [1.0, 4.4, 3.0, 1.3, .2], [1.0, 5.1, 3.4, 1.5, .2],
... [1.0, 5.0, 3.5, 1.3, .3], [1.0, 4.5, 2.3, 1.3, .3],
... [1.0, 4.4, 3.2, 1.3, .2], [1.0, 5.0, 3.5, 1.6, .6],
... [1.0, 5.1, 3.8, 1.9, .4], [1.0, 4.8, 3.0, 1.4, .3],
... [1.0, 5.1, 3.8, 1.6, .2], [1.0, 4.6, 3.2, 1.4, .2],
... [1.0, 5.3, 3.7, 1.5, .2], [1.0, 5.0, 3.3, 1.4, .2],
... [2.0, 7.0, 3.2, 4.7, 1.4], [2.0, 6.4, 3.2, 4.5, 1.5],
... [2.0, 6.9, 3.1, 4.9, 1.5], [2.0, 5.5, 2.3, 4.0, 1.3],
... [2.0, 6.5, 2.8, 4.6, 1.5], [2.0, 5.7, 2.8, 4.5, 1.3],
... [2.0, 6.3, 3.3, 4.7, 1.6], [2.0, 4.9, 2.4, 3.3, 1.0],
... [2.0, 6.6, 2.9, 4.6, 1.3], [2.0, 5.2, 2.7, 3.9, 1.4],
... [2.0, 5.0, 2.0, 3.5, 1.0], [2.0, 5.9, 3.0, 4.2, 1.5],
... [2.0, 6.0, 2.2, 4.0, 1.0], [2.0, 6.1, 2.9, 4.7, 1.4],
... [2.0, 5.6, 2.9, 3.6, 1.3], [2.0, 6.7, 3.1, 4.4, 1.4],
... [2.0, 5.6, 3.0, 4.5, 1.5], [2.0, 5.8, 2.7, 4.1, 1.0],
... [2.0, 6.2, 2.2, 4.5, 1.5], [2.0, 5.6, 2.5, 3.9, 1.1],
... [2.0, 5.9, 3.2, 4.8, 1.8], [2.0, 6.1, 2.8, 4.0, 1.3],
... [2.0, 6.3, 2.5, 4.9, 1.5], [2.0, 6.1, 2.8, 4.7, 1.2],
... [2.0, 6.4, 2.9, 4.3, 1.3], [2.0, 6.6, 3.0, 4.4, 1.4],
... [2.0, 6.8, 2.8, 4.8, 1.4], [2.0, 6.7, 3.0, 5.0, 1.7],
... [2.0, 6.0, 2.9, 4.5, 1.5], [2.0, 5.7, 2.6, 3.5, 1.0],
... [2.0, 5.5, 2.4, 3.8, 1.1], [2.0, 5.5, 2.4, 3.7, 1.0],
... [2.0, 5.8, 2.7, 3.9, 1.2], [2.0, 6.0, 2.7, 5.1, 1.6],
... [2.0, 5.4, 3.0, 4.5, 1.5], [2.0, 6.0, 3.4, 4.5, 1.6],
... [2.0, 6.7, 3.1, 4.7, 1.5], [2.0, 6.3, 2.3, 4.4, 1.3],
... [2.0, 5.6, 3.0, 4.1, 1.3], [2.0, 5.5, 2.5, 4.0, 1.3],
... [2.0, 5.5, 2.6, 4.4, 1.2], [2.0, 6.1, 3.0, 4.6, 1.4],
... [2.0, 5.8, 2.6, 4.0, 1.2], [2.0, 5.0, 2.3, 3.3, 1.0],
... [2.0, 5.6, 2.7, 4.2, 1.3], [2.0, 5.7, 3.0, 4.2, 1.2],
... [2.0, 5.7, 2.9, 4.2, 1.3], [2.0, 6.2, 2.9, 4.3, 1.3],
... [2.0, 5.1, 2.5, 3.0, 1.1], [2.0, 5.7, 2.8, 4.1, 1.3],
... [3.0, 6.3, 3.3, 6.0, 2.5], [3.0, 5.8, 2.7, 5.1, 1.9],
... [3.0, 7.1, 3.0, 5.9, 2.1], [3.0, 6.3, 2.9, 5.6, 1.8],
... [3.0, 6.5, 3.0, 5.8, 2.2], [3.0, 7.6, 3.0, 6.6, 2.1],
... [3.0, 4.9, 2.5, 4.5, 1.7], [3.0, 7.3, 2.9, 6.3, 1.8],
... [3.0, 6.7, 2.5, 5.8, 1.8], [3.0, 7.2, 3.6, 6.1, 2.5],
... [3.0, 6.5, 3.2, 5.1, 2.0], [3.0, 6.4, 2.7, 5.3, 1.9],
... [3.0, 6.8, 3.0, 5.5, 2.1], [3.0, 5.7, 2.5, 5.0, 2.0],
... [3.0, 5.8, 2.8, 5.1, 2.4], [3.0, 6.4, 3.2, 5.3, 2.3],
... [3.0, 6.5, 3.0, 5.5, 1.8], [3.0, 7.7, 3.8, 6.7, 2.2],
... [3.0, 7.7, 2.6, 6.9, 2.3], [3.0, 6.0, 2.2, 5.0, 1.5],
... [3.0, 6.9, 3.2, 5.7, 2.3], [3.0, 5.6, 2.8, 4.9, 2.0],
... [3.0, 7.7, 2.8, 6.7, 2.0], [3.0, 6.3, 2.7, 4.9, 1.8],
... [3.0, 6.7, 3.3, 5.7, 2.1], [3.0, 7.2, 3.2, 6.0, 1.8],
... [3.0, 6.2, 2.8, 4.8, 1.8], [3.0, 6.1, 3.0, 4.9, 1.8],
... [3.0, 6.4, 2.8, 5.6, 2.1], [3.0, 7.2, 3.0, 5.8, 1.6],
... [3.0, 7.4, 2.8, 6.1, 1.9], [3.0, 7.9, 3.8, 6.4, 2.0],
... [3.0, 6.4, 2.8, 5.6, 2.2], [3.0, 6.3, 2.8, 5.1, 1.5],
... [3.0, 6.1, 2.6, 5.6, 1.4], [3.0, 7.7, 3.0, 6.1, 2.3],
... [3.0, 6.3, 3.4, 5.6, 2.4], [3.0, 6.4, 3.1, 5.5, 1.8],
... [3.0, 6.0, 3.0, 4.8, 1.8], [3.0, 6.9, 3.1, 5.4, 2.1],
... [3.0, 6.7, 3.1, 5.6, 2.4], [3.0, 6.9, 3.1, 5.1, 2.3],
... [3.0, 5.8, 2.7, 5.1, 1.9], [3.0, 6.8, 3.2, 5.9, 2.3],
... [3.0, 6.7, 3.3, 5.7, 2.5], [3.0, 6.7, 3.0, 5.2, 2.3],
... [3.0, 6.3, 2.5, 5.0, 1.9], [3.0, 6.5, 3.0, 5.2, 2.0],
... [3.0, 6.2, 3.4, 5.4, 2.3], [3.0, 5.9, 3.0, 5.1, 1.8]])
>>> cluster_variables = np.array([1, 2, 3, 4])
>>> # Assign initial cluster seeds: the first observation of each iris
>>> # type (rows 0, 50, 100), restricted to the measurement columns
>>> cluster_seeds = fisher_iris_data[[0, 50, 100]][:, cluster_variables]
>>> # Perform the analysis
>>> clusters = cluster.cluster_k_means(fisher_iris_data, cluster_seeds,
...                                    cluster_vars=cluster_variables)
>>> # Print results
>>> np.set_printoptions(precision=3)
>>> print("Cluster Membership:\n\n" +
...       str(clusters.membership)) 
Cluster Membership:

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
 3 2]
>>> print("\nCluster Means:\n\n" +
...       str(clusters.means)) 

Cluster Means:

[[5.006  3.428  1.462  0.246]
 [5.902  2.748  4.394  1.434]
 [6.85   3.074  5.742  2.071]]
>>> print("\nCluster Sum of squares:\n\n" +
...       str(clusters.ssq)) 

Cluster Sum of squares:

[15.151  39.821  23.879]
>>> print("\n# Observations in Each Cluster:\n\n" +
...       str(clusters.counts)) 

# Observations in Each Cluster:

[50 62 38]
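
Because the iris type of each observation is known, the membership vector can be cross-tabulated against it to see where the clustering and the true grouping disagree. The sketch below uses only NumPy. The helper name cross_tabulate and the small sample arrays are hypothetical, and both label sets are assumed to run from 1 to the number of groups, as in the output above.

```python
import numpy as np

def cross_tabulate(true_labels, membership, n_groups):
    """Rows: true label (1..n_groups); columns: assigned cluster."""
    t = np.asarray(true_labels).astype(int) - 1
    c = np.asarray(membership).astype(int) - 1
    table = np.zeros((n_groups, n_groups), dtype=int)
    # Unbuffered add so repeated (row, column) pairs accumulate correctly.
    np.add.at(table, (t, c), 1)
    return table

# Hypothetical labels for six observations.
true_labels = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
membership = np.array([1, 1, 2, 3, 3, 3])
table = cross_tabulate(true_labels, membership, 3)
```

For the example above, the call would be cross_tabulate(fisher_iris_data[:, 0], clusters.membership, 3). The column sums of the resulting table equal the number of observations assigned to each cluster, so they can be checked against the counts field.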