Class ClusterKMeans
- All Implemented Interfaces:
Serializable,Cloneable
ClusterKMeans is an implementation of Algorithm AS 136 by
Hartigan and Wong (1979). It computes K-means (centroid) Euclidean
metric clusters for an input matrix starting with initial estimates of the
K cluster means. It allows for missing values (coded as NaN, not a
number) and for weights and frequencies.
Let p denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,
$$ \phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar x_{ij} \right)^2 $$where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix X; \(n_i\) is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; d is zero if the j-th variable on observation \(\nu_{im}\) is missing, otherwise \(\delta\) is one; and \(\bar x_{ij}\) is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
An algorithm for choosing efficient initial seeds was proposed by
Arthur and Vassilvitskii (2007), and is known as K-means++. This
algorithm has advantages over choosing arbitrary or random initial seeds from
the population of observations. Access the implementation of this algorithm
in this class using the constructor with the signature
ClusterKMeans(double[][], int) or by calling the
setClusterCenters(Random) method.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic classThere is a cluster with no pointsstatic classConvergence did not occur within the maximum number of iterations.static classDeprecated.static classDeprecated.No longer used, replaced with anIllegalArgumentException. -
Constructor Summary
ConstructorsConstructorDescriptionClusterKMeans(double[][] x, double[][] cs) Constructor forClusterKMeans.ClusterKMeans(double[][] x, int k) Constructor forClusterKMeansusing the K-means++ algorithm to select the initial seeds.ClusterKMeans(double[][] x, int k, Random r) Constructor forClusterKMeansusing the K-means++ algorithm to set the initial seeds. -
Method Summary
Modifier and TypeMethodDescriptionfinal double[][]compute()Computes the cluster means.int[]Returns the number of observations in each cluster.int[]Returns the cluster membership for each observation.double[]Returns the within sum of squares for each cluster.double[][]Returns the initial cluster centers.voidsetFrequencies(double[] frequencies) Sets the frequency for each observation.voidsetMaxIterations(int iterations) Sets the maximum number of iterations.voidsetWeights(double[] weights) Sets the weight for each observation.
-
Constructor Details
-
ClusterKMeans
public ClusterKMeans(double[][] x, double[][] cs) Constructor forClusterKMeans.- Parameters:
x- adoublematrix containing the observations to be clusteredcs- adoublematrix containing the cluster seeds, i.e. estimates for the cluster centers
-
ClusterKMeans
public ClusterKMeans(double[][] x, int k) Constructor forClusterKMeansusing the K-means++ algorithm to select the initial seeds.- Parameters:
x- adoublematrix containing the observations to be clusteredk- anintindicating the number of clusters
-
ClusterKMeans
Constructor forClusterKMeansusing the K-means++ algorithm to set the initial seeds.- Parameters:
x- adoublematrix containing the observations to be clusteredk- anintindicating the number of clustersr- a Random object, the random number generator to be used in the KMeans++ algorithmNote that
rcan be initialized with a seed for repeatable results.
-
-
Method Details
-
getInitialCenters
public double[][] getInitialCenters()Returns the initial cluster centers.The cluster centers are those set by the K-Means++ center selection algorithm or those supplied to the constructor.
- Returns:
- a
doublematrix containing the cluster centers
-
compute
public final double[][] compute() throws ClusterKMeans.NoConvergenceException, ClusterKMeans.ClusterNoPointsExceptionComputes the cluster means.- Returns:
- a
doublematrix containing computed result - Throws:
ClusterKMeans.NonnegativeFreqException- is thrown if a frequency is negativeClusterKMeans.NonnegativeWeightException- is thrown if a weight is negativeClusterKMeans.NoConvergenceException- is thrown if convergence did not occur within the maximum number of iterationsClusterKMeans.ClusterNoPointsException- is thrown if the cluster centers yield a cluster with no points
-
setWeights
public void setWeights(double[] weights) Sets the weight for each observation.- Parameters:
weights- adoublearray of sizex.lengthcontaining the weight for each observationDefault:
weights[]= 1
-
setFrequencies
public void setFrequencies(double[] frequencies) Sets the frequency for each observation.- Parameters:
frequencies- adoublearray of sizex.lengthcontaining the frequency for each observationDefault:
frequencies[]= 1
-
setMaxIterations
public void setMaxIterations(int iterations) Sets the maximum number of iterations.- Parameters:
iterations- anintscalar specifying the maximum number of iterationsDefault:
interations= 30
-
getClusterMembership
public int[] getClusterMembership()Returns the cluster membership for each observation.Note that the
compute()method must be invoked first before invoking this method. Otherwise, the method throws aIllegalStateExceptionexception.- Returns:
- an
intarray containing the cluster membership for each observationCluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc.
-
getClusterSSQ
public double[] getClusterSSQ()Returns the within sum of squares for each cluster.Note that the
computemethod must be invoked first before invoking this method. Otherwise, the method throws aIllegalStateExceptionexception.- Returns:
- a
doublearray containing the within sum of squares for each cluster
-
getClusterCounts
public int[] getClusterCounts()Returns the number of observations in each cluster.Note that the
computemethod must be invoked first before invoking this method. Otherwise, the method throws aIllegalStateExceptionexception.- Returns:
- an
intarray containing the number of observations in each cluster
-
IllegalArgumentException.