CNL Stat : Multivariate Analysis : cluster_k_means
cluster_k_means
Performs a K-means (centroid) cluster analysis.
Synopsis
#include <imsls.h>
int *imsls_f_cluster_k_means (int n_observations, int n_variables, float x[], int n_clusters, float cluster_seeds, ..., 0)
The type double function is imsls_d_cluster_k_means.
Required Arguments
int n_observations (Input)
Number of observations.
int n_variables (Input)
Number of variables to be used in computing the metric.
float x[] (Input)
Array of length n_observations × n_variables containing the observations to be clustered.
int n_clusters (Input)
Number of clusters.
float cluster_seeds[] (Input)
Array of length n_clusters × n_variables containing the cluster seeds, i.e., estimates for the cluster centers.
Return Value
The cluster membership for each observation is returned.
Synopsis with Optional Arguments
#include <imsls.h>
int *imsls_f_cluster_k_means (int n_observations, int n_variables, float x[], int n_clusters, float cluster_seeds,
IMSLS_WEIGHTS, float weights[],
IMSLS_FREQUENCIES, float frequencies[],
IMSLS_MAX_ITERATIONS, int  max_iterations,
IMSLS_CLUSTER_HISTORY, int *n_itr, int **cluster_history,
IMSLS_CLUSTER_HISTORY_USER, int *n_itr, int cluster_history[],
IMSLS_CLUSTER_MEANS, float **cluster_means,
IMSLS_CLUSTER_MEANS_USER, float cluster_means[],
IMSLS_CLUSTER_SSQ, float **cluster_ssq,
IMSLS_CLUSTER_SSQ_USER, float cluster_ssq[],
IMSLS_X_COL_DIM, int x_col_dim,
IMSLS_CLUSTER_MEANS_COL_DIM, int cluster_means_col_dim,
IMSLS_CLUSTER_SEEDS_COL_DIM, int cluster_seeds_col_dim,
IMSLS_CLUSTER_COUNTS, int **cluster_counts,
IMSLS_CLUSTER_COUNTS_USER, int cluster_counts[],
IMSLS_CLUSTER_VARIABLE_COLUMNS, int cluster_variables[],
IMSLS_RETURN_USER, int cluster_group[],
0)
Optional Arguments
IMSLS_WEIGHTS, float weights[] (Input)
Array of length n_observations containing the weight of each observation of matrix x.
Default: weights = 1.
IMSLS_FREQUENCIES, float frequencies[] (Input)
Array of length n_observations containing the frequency of each observation of matrix x.
Default: frequencies = 1.
IMSLS_MAX_ITERATIONS, int max_iterations (Input)
Maximum number of iterations.
Default: max_iterations = 30.
IMSLS_CLUSTER_HISTORY, int *n_itr, int **cluster_history (Output)
cluster_history is a pointer to an array of size n_iter by n_observations containing the cluster membership of each observation per iteration. Note that n_iter is the number of completed iterations in the algorithm.
IMSLS_CLUSTER_HISTORY_USER, int *n_itr, int cluster_history[] (Output)
Storage for array cluster_history is provided by the user. cluster_history is an array of size max_iterations by n_observations containing the cluster membership of each observation per iteration. Note that only the first n_itr rows of cluster_history is set upon return.
IMSLS_CLUSTER_MEANS, float **cluster_means (Output)
The address of a pointer to an internally allocated array of length n_clusters × n_variables containing the cluster means.
IMSLS_CLUSTER_MEANS_USER, float cluster_means[] (Output)
Storage for array cluster_means is provided by the user. See IMSLS_CLUSTER_MEANS.
IMSLS_CLUSTER_SSQ, float **cluster_ssq (Output)
The address of a pointer to internally allocated array of length n_clusters containing the within sum-of-squares for each cluster.
IMSLS_CLUSTER_SSQ_USER, float cluster_ssq[] (Output)
Storage for array cluster_ssq is provided by the user. See IMSLS_CLUSTER_SSQ.
IMSLS_X_COL_DIM, int x_col_dim (Input)
Column dimension of x.
Default: x_col_dim = n_variables.
IMSLS_CLUSTER_MEANS_COL_DIM, int cluster_means_col_dim (Input)
Column dimension for the vector cluster_means.
Default: cluster_means_col_dim = n_variables.
IMSLS_CLUSTER_SEEDS_COL_DIM, int cluster_seeds_col_dim (Input)
Column dimension for the vector cluster_seeds.
Default: cluster_seeds_col_dim = n_variables.
IMSLS_CLUSTER_COUNTS, int **cluster_counts (Output)
The address of a pointer to an internally allocated array of length n_clusters containing the number of observations in each cluster.
IMSLS_CLUSTER_COUNTS_USER, int cluster_counts[] (Output)
Storage for array cluster_counts is provided by the user. See IMSLS_CLUSTER_COUNTS.
IMSLS_CLUSTER_VARIABLE_COLUMNS, int cluster_variables[] (Input)
Vector of length n_variables containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, n_variables
Default: cluster_variables [ ] = 0, 1, 2, n_variables.
IMSLS_RETURN_USER, int cluster_group[] (Output)
User-allocated array of length n_observations containing the cluster membership for each observation.
Description
Function imsls_f_cluster_k_means is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.
Let p = n_variables be the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums-of-squares. In this case, the total sums-of-squares within each cluster is computed as the sum of the centered sum-of-squares over all nonmissing values of each variable. That is,
where νim denotes the row index of the m-th observation in the i-th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is 0 if the j-th variable on observation νim is missing, otherwise δ is 1; and
is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
Example
This example performs K-means cluster analysis on Fisher’s Iris data, which is obtained by function imsls_f_data_sets (see Chapter 15, Utilities). The initial cluster seed for each iris type is an observation known to be in the iris type.
 
#include <imsls.h>
 
int main()
{
#define N_OBSERVATIONS 150
#define N_VARIABLES 4
#define N_CLUSTERS 3
 
float x[N_OBSERVATIONS][5];
float cluster_seeds[N_CLUSTERS][N_VARIABLES];
float cluster_means[N_CLUSTERS][N_VARIABLES];
float cluster_ssq[N_CLUSTERS];
int cluster_variables[N_VARIABLES] = {1, 2, 3, 4};
int cluster_counts[N_CLUSTERS];
int cluster_group[N_OBSERVATIONS];
int i;
 
/* Retrieve the data set */
imsls_f_data_sets(3,
IMSLS_RETURN_USER, x,
0);
 
/* Assign initial cluster seeds */
for (i=0; i<N_VARIABLES; i++) {
cluster_seeds[0][i] = x[0][i+1];
cluster_seeds[1][i] = x[50][i+1];
cluster_seeds[2][i] = x[100][i+1];
}
 
/* Perform the analysis */
imsls_f_cluster_k_means(N_OBSERVATIONS, N_VARIABLES, (float*)x,
N_CLUSTERS, (float*)cluster_seeds,
IMSLS_X_COL_DIM, 5,
IMSLS_CLUSTER_VARIABLE_COLUMNS, cluster_variables,
IMSLS_CLUSTER_COUNTS_USER, cluster_counts,
IMSLS_CLUSTER_MEANS_USER, cluster_means,
IMSLS_CLUSTER_SSQ_USER, cluster_ssq,
IMSLS_RETURN_USER, cluster_group,
0);
 
/* Print results */
imsls_i_write_matrix("Cluster Membership", 1, N_OBSERVATIONS,
cluster_group,
0);
 
imsls_f_write_matrix("Cluster Means", N_CLUSTERS, N_VARIABLES,
(float*)cluster_means,
0);
 
imsls_f_write_matrix("Cluster Sum of Squares", 1, N_CLUSTERS,
cluster_ssq,
0);
 
imsls_i_write_matrix("# Observations in Each Cluster", 1,
N_CLUSTERS, cluster_counts,
0);
}
Output
 
Cluster Membership
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2
 
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
 
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 
100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2
 
116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3
 
132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2
 
148 149 150
3 3 2
 
Cluster Means
1 2 3 4
1 5.006 3.428 1.462 0.246
2 5.902 2.748 4.394 1.434
3 6.850 3.074 5.742 2.071
 
Cluster Sum of Squares
1 2 3
15.15 39.82 23.88
 
# Observations in Each Cluster
1 2 3
50 62 38
Warning Errors
IMSLS_NO_CONVERGENCE
Convergence did not occur.