imsl.cluster.cluster_k_means

cluster_k_means(obs, cluster_seeds, weights=None, frequencies=None, max_iter=30, cluster_vars=None)

Perform a K-means (centroid) cluster analysis.
Parameters:
- obs ((M,N) array_like) – Array of size M \(\times\) N containing the observations to be clustered.
- cluster_seeds ((n_clusters, L) array_like) – Array containing the cluster seeds, i.e., estimates for the cluster centers. L denotes the number of columns of array obs used in the analysis; see argument cluster_vars.
- weights ((M,) array_like, optional) –
Array of length M containing the weight of each observation of array obs.
Default: weights = [1, 1, …, 1].
- frequencies ((M,) array_like, optional) –
Array of length M containing the frequency of each observation of array obs.
Default: frequencies = [1, 1, …, 1].
- max_iter (int, optional) –
The maximum number of iterations.
Default: max_iter = 30.
- cluster_vars ((L,) array_like, optional) –
Array of length L containing the columns of obs to be used in computing the metric. The columns in array obs are numbered 0, 1, 2, …, N-1.
Default: cluster_vars = [0, 1, 2, …, N - 1].
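The role of cluster_vars (and the all-ones defaults for weights and frequencies) can be illustrated with plain NumPy indexing; the array below is a small hypothetical data set, not part of the IMSL API:

```python
import numpy as np

# Hypothetical 4-observation data set whose first column is a label
# (mirroring the Fisher iris layout in the Examples below).
obs = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.5, 3.5],
                [2.0, 8.0, 9.0],
                [2.0, 8.5, 9.5]])

# cluster_vars selects which columns of obs enter the Euclidean metric.
cluster_vars = np.array([1, 2])      # skip column 0 (the label)
used = obs[:, cluster_vars]          # shape (M, L), L = len(cluster_vars)

# The defaults for weights and frequencies are arrays of ones.
weights = np.ones(obs.shape[0])
frequencies = np.ones(obs.shape[0])
print(used.shape)                    # (4, 2)
```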
Returns:
A named tuple with the following fields:
- membership ((M,) ndarray) – Array containing the cluster membership for each observation.
- history ((n_iter, M) ndarray) – Array of size n_iter \(\times\) M containing the cluster membership of each observation in array obs per iteration. Note that n_iter is the number of completed iterations in the algorithm.
- means ((n_clusters, L) ndarray) – Array containing the cluster means.
- ssq ((n_clusters,) ndarray) – Array containing the within sum-of-squares for each cluster.
- counts ((n_clusters,) ndarray) – Array containing the number of observations in each cluster.
Notes
Function cluster_k_means is an implementation of Algorithm AS 136 by Hartigan and Wong ([1]). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.
Let p be the number of variables used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations that minimizes the total within-cluster sum-of-squares. Here the sum-of-squares within each cluster is computed as the sum of the centered sum-of-squares over all nonmissing values of each variable.
That is,
\[\phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j}(x_{\nu_{im},j}-\bar{x}_{ij})^2\]

where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix obs; \(n_i\) is the number of rows of obs assigned to group i; f denotes the frequency of the observation; w denotes its weight; \(\delta\) is 0 if the j-th variable on observation \(\nu_{im}\) is missing, otherwise \(\delta\) is 1; and
\(\bar{x}_{ij}\) is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See [1] or [2] for details.
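For a fixed membership vector, the objective \(\phi\) defined above can be sketched in plain NumPy. This is an illustration of the formula only, not the IMSL implementation; the function name and the small data set are hypothetical:

```python
import numpy as np

def within_cluster_ssq(x, membership, means, weights, frequencies):
    """Total within-cluster sum-of-squares (phi), with delta = 0 at NaNs."""
    phi = 0.0
    for m in range(x.shape[0]):
        i = membership[m]                     # cluster index of observation m
        d = x[m] - means[i]                   # per-variable deviations
        d = np.where(np.isnan(x[m]), 0.0, d)  # skip missing values (delta = 0)
        phi += frequencies[m] * weights[m] * np.sum(d * d)
    return phi

# Two observations in cluster 0, one (with a missing value) in cluster 1.
x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, np.nan]])
membership = np.array([0, 0, 1])
means = np.array([[0.0, 1.0], [10.0, 0.0]])
w = np.ones(3)                                # default weights
f = np.ones(3)                                # default frequencies
print(within_cluster_ssq(x, membership, means, w, f))   # 2.0
```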
References
[1] Hartigan, J.A. and M.A. Wong (1979), Algorithm AS 136: A K-means clustering algorithm, Applied Statistics, 28, 100-108.

[2] Hartigan, John A. (1975), Clustering Algorithms, John Wiley & Sons, New York.

Examples
This example performs K-means cluster analysis on Fisher’s Iris data. The initial cluster seed for each iris type is an observation known to be in the iris type.
>>> import numpy as np
>>> import imsl.cluster as cluster
>>> fisher_iris_data = np.array(
...     [[1.0, 5.1, 3.5, 1.4, .2], [1.0, 4.9, 3.0, 1.4, .2],
...      [1.0, 4.7, 3.2, 1.3, .2], [1.0, 4.6, 3.1, 1.5, .2],
...      [1.0, 5.0, 3.6, 1.4, .2], [1.0, 5.4, 3.9, 1.7, .4],
...      [1.0, 4.6, 3.4, 1.4, .3], [1.0, 5.0, 3.4, 1.5, .2],
...      [1.0, 4.4, 2.9, 1.4, .2], [1.0, 4.9, 3.1, 1.5, .1],
...      [1.0, 5.4, 3.7, 1.5, .2], [1.0, 4.8, 3.4, 1.6, .2],
...      [1.0, 4.8, 3.0, 1.4, .1], [1.0, 4.3, 3.0, 1.1, .1],
...      [1.0, 5.8, 4.0, 1.2, .2], [1.0, 5.7, 4.4, 1.5, .4],
...      [1.0, 5.4, 3.9, 1.3, .4], [1.0, 5.1, 3.5, 1.4, .3],
...      [1.0, 5.7, 3.8, 1.7, .3], [1.0, 5.1, 3.8, 1.5, .3],
...      [1.0, 5.4, 3.4, 1.7, .2], [1.0, 5.1, 3.7, 1.5, .4],
...      [1.0, 4.6, 3.6, 1.0, .2], [1.0, 5.1, 3.3, 1.7, .5],
...      [1.0, 4.8, 3.4, 1.9, .2], [1.0, 5.0, 3.0, 1.6, .2],
...      [1.0, 5.0, 3.4, 1.6, .4], [1.0, 5.2, 3.5, 1.5, .2],
...      [1.0, 5.2, 3.4, 1.4, .2], [1.0, 4.7, 3.2, 1.6, .2],
...      [1.0, 4.8, 3.1, 1.6, .2], [1.0, 5.4, 3.4, 1.5, .4],
...      [1.0, 5.2, 4.1, 1.5, .1], [1.0, 5.5, 4.2, 1.4, .2],
...      [1.0, 4.9, 3.1, 1.5, .2], [1.0, 5.0, 3.2, 1.2, .2],
...      [1.0, 5.5, 3.5, 1.3, .2], [1.0, 4.9, 3.6, 1.4, .1],
...      [1.0, 4.4, 3.0, 1.3, .2], [1.0, 5.1, 3.4, 1.5, .2],
...      [1.0, 5.0, 3.5, 1.3, .3], [1.0, 4.5, 2.3, 1.3, .3],
...      [1.0, 4.4, 3.2, 1.3, .2], [1.0, 5.0, 3.5, 1.6, .6],
...      [1.0, 5.1, 3.8, 1.9, .4], [1.0, 4.8, 3.0, 1.4, .3],
...      [1.0, 5.1, 3.8, 1.6, .2], [1.0, 4.6, 3.2, 1.4, .2],
...      [1.0, 5.3, 3.7, 1.5, .2], [1.0, 5.0, 3.3, 1.4, .2],
...      [2.0, 7.0, 3.2, 4.7, 1.4], [2.0, 6.4, 3.2, 4.5, 1.5],
...      [2.0, 6.9, 3.1, 4.9, 1.5], [2.0, 5.5, 2.3, 4.0, 1.3],
...      [2.0, 6.5, 2.8, 4.6, 1.5], [2.0, 5.7, 2.8, 4.5, 1.3],
...      [2.0, 6.3, 3.3, 4.7, 1.6], [2.0, 4.9, 2.4, 3.3, 1.0],
...      [2.0, 6.6, 2.9, 4.6, 1.3], [2.0, 5.2, 2.7, 3.9, 1.4],
...      [2.0, 5.0, 2.0, 3.5, 1.0], [2.0, 5.9, 3.0, 4.2, 1.5],
...      [2.0, 6.0, 2.2, 4.0, 1.0], [2.0, 6.1, 2.9, 4.7, 1.4],
...      [2.0, 5.6, 2.9, 3.6, 1.3], [2.0, 6.7, 3.1, 4.4, 1.4],
...      [2.0, 5.6, 3.0, 4.5, 1.5], [2.0, 5.8, 2.7, 4.1, 1.0],
...      [2.0, 6.2, 2.2, 4.5, 1.5], [2.0, 5.6, 2.5, 3.9, 1.1],
...      [2.0, 5.9, 3.2, 4.8, 1.8], [2.0, 6.1, 2.8, 4.0, 1.3],
...      [2.0, 6.3, 2.5, 4.9, 1.5], [2.0, 6.1, 2.8, 4.7, 1.2],
...      [2.0, 6.4, 2.9, 4.3, 1.3], [2.0, 6.6, 3.0, 4.4, 1.4],
...      [2.0, 6.8, 2.8, 4.8, 1.4], [2.0, 6.7, 3.0, 5.0, 1.7],
...      [2.0, 6.0, 2.9, 4.5, 1.5], [2.0, 5.7, 2.6, 3.5, 1.0],
...      [2.0, 5.5, 2.4, 3.8, 1.1], [2.0, 5.5, 2.4, 3.7, 1.0],
...      [2.0, 5.8, 2.7, 3.9, 1.2], [2.0, 6.0, 2.7, 5.1, 1.6],
...      [2.0, 5.4, 3.0, 4.5, 1.5], [2.0, 6.0, 3.4, 4.5, 1.6],
...      [2.0, 6.7, 3.1, 4.7, 1.5], [2.0, 6.3, 2.3, 4.4, 1.3],
...      [2.0, 5.6, 3.0, 4.1, 1.3], [2.0, 5.5, 2.5, 4.0, 1.3],
...      [2.0, 5.5, 2.6, 4.4, 1.2], [2.0, 6.1, 3.0, 4.6, 1.4],
...      [2.0, 5.8, 2.6, 4.0, 1.2], [2.0, 5.0, 2.3, 3.3, 1.0],
...      [2.0, 5.6, 2.7, 4.2, 1.3], [2.0, 5.7, 3.0, 4.2, 1.2],
...      [2.0, 5.7, 2.9, 4.2, 1.3], [2.0, 6.2, 2.9, 4.3, 1.3],
...      [2.0, 5.1, 2.5, 3.0, 1.1], [2.0, 5.7, 2.8, 4.1, 1.3],
...      [3.0, 6.3, 3.3, 6.0, 2.5], [3.0, 5.8, 2.7, 5.1, 1.9],
...      [3.0, 7.1, 3.0, 5.9, 2.1], [3.0, 6.3, 2.9, 5.6, 1.8],
...      [3.0, 6.5, 3.0, 5.8, 2.2], [3.0, 7.6, 3.0, 6.6, 2.1],
...      [3.0, 4.9, 2.5, 4.5, 1.7], [3.0, 7.3, 2.9, 6.3, 1.8],
...      [3.0, 6.7, 2.5, 5.8, 1.8], [3.0, 7.2, 3.6, 6.1, 2.5],
...      [3.0, 6.5, 3.2, 5.1, 2.0], [3.0, 6.4, 2.7, 5.3, 1.9],
...      [3.0, 6.8, 3.0, 5.5, 2.1], [3.0, 5.7, 2.5, 5.0, 2.0],
...      [3.0, 5.8, 2.8, 5.1, 2.4], [3.0, 6.4, 3.2, 5.3, 2.3],
...      [3.0, 6.5, 3.0, 5.5, 1.8], [3.0, 7.7, 3.8, 6.7, 2.2],
...      [3.0, 7.7, 2.6, 6.9, 2.3], [3.0, 6.0, 2.2, 5.0, 1.5],
...      [3.0, 6.9, 3.2, 5.7, 2.3], [3.0, 5.6, 2.8, 4.9, 2.0],
...      [3.0, 7.7, 2.8, 6.7, 2.0], [3.0, 6.3, 2.7, 4.9, 1.8],
...      [3.0, 6.7, 3.3, 5.7, 2.1], [3.0, 7.2, 3.2, 6.0, 1.8],
...      [3.0, 6.2, 2.8, 4.8, 1.8], [3.0, 6.1, 3.0, 4.9, 1.8],
...      [3.0, 6.4, 2.8, 5.6, 2.1], [3.0, 7.2, 3.0, 5.8, 1.6],
...      [3.0, 7.4, 2.8, 6.1, 1.9], [3.0, 7.9, 3.8, 6.4, 2.0],
...      [3.0, 6.4, 2.8, 5.6, 2.2], [3.0, 6.3, 2.8, 5.1, 1.5],
...      [3.0, 6.1, 2.6, 5.6, 1.4], [3.0, 7.7, 3.0, 6.1, 2.3],
...      [3.0, 6.3, 3.4, 5.6, 2.4], [3.0, 6.4, 3.1, 5.5, 1.8],
...      [3.0, 6.0, 3.0, 4.8, 1.8], [3.0, 6.9, 3.1, 5.4, 2.1],
...      [3.0, 6.7, 3.1, 5.6, 2.4], [3.0, 6.9, 3.1, 5.1, 2.3],
...      [3.0, 5.8, 2.7, 5.1, 1.9], [3.0, 6.8, 3.2, 5.9, 2.3],
...      [3.0, 6.7, 3.3, 5.7, 2.5], [3.0, 6.7, 3.0, 5.2, 2.3],
...      [3.0, 6.3, 2.5, 5.0, 1.9], [3.0, 6.5, 3.0, 5.2, 2.0],
...      [3.0, 6.2, 3.4, 5.4, 2.3], [3.0, 5.9, 3.0, 5.1, 1.8]])
>>> cluster_seeds = np.empty((3,4))
>>> cluster_variables = np.array([1, 2, 3, 4])
>>> # Assign initial cluster seeds
>>> for i in range(4):
...     cluster_seeds[0][i] = fisher_iris_data[0][i+1]
...     cluster_seeds[1][i] = fisher_iris_data[50][i+1]
...     cluster_seeds[2][i] = fisher_iris_data[100][i+1]
>>> # Perform the analysis
>>> clusters = cluster.cluster_k_means(fisher_iris_data, cluster_seeds,
...                                    cluster_vars=cluster_variables)
>>> # Print results
>>> np.set_printoptions(precision=3)
>>> print("Cluster Membership:\n\n" +
...       str(clusters.membership))
Cluster Membership:

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
 3 2]
>>> print("\nCluster Means:\n\n" +
...       str(clusters.means))

Cluster Means:

[[5.006 3.428 1.462 0.246]
 [5.902 2.748 4.394 1.434]
 [6.85  3.074 5.742 2.071]]
>>> print("\nCluster Sum of squares:\n\n" +
...       str(clusters.ssq))

Cluster Sum of squares:

[15.151 39.821 23.879]
>>> print("\n# Observations in Each Cluster:\n\n" +
...       str(clusters.counts))

# Observations in Each Cluster:

[50 62 38]
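The seed-then-reassign idea used above can be illustrated with a minimal Lloyd-style k-means on toy data. This is a self-contained sketch of the general centroid method, not Algorithm AS 136 as implemented by imsl; the function name and data are hypothetical, and it ignores weights, frequencies, and missing values:

```python
import numpy as np

def simple_k_means(x, seeds, max_iter=30):
    """Minimal unweighted k-means: assign to nearest mean, then recompute."""
    means = seeds.astype(float).copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from each observation to each mean.
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        membership = d2.argmin(axis=1)
        # Recompute each cluster mean from its current members.
        new_means = np.array([x[membership == k].mean(axis=0)
                              for k in range(means.shape[0])])
        if np.allclose(new_means, means):   # converged: no mean moved
            break
        means = new_means
    return membership, means

x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
seeds = x[[0, 2]]          # one seed known to lie in each group
membership, means = simple_k_means(x, seeds)
print(membership)          # [0 0 1 1]
```

Unlike this sketch, the Hartigan-Wong algorithm reassigns observations one at a time whenever a move lowers the total within-cluster sum-of-squares, which can find better optima than whole-pass reassignment.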