Package com.imsl.stat

Class ClusterKMeans

java.lang.Object
com.imsl.stat.ClusterKMeans
All Implemented Interfaces:
Serializable, Cloneable

public class ClusterKMeans extends Object implements Serializable, Cloneable
Perform a K-means (centroid) cluster analysis.

ClusterKMeans is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. It allows for missing values (coded as NaN, not a number) and for weights and frequencies.

Let p denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,

$$ \phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar x_{ij} \right)^2 $$

where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix X; \(n_i\) is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; d is zero if the j-th variable on observation \(\nu_{im}\) is missing, otherwise \(\delta\) is one; and \(\bar x_{ij}\) is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.

An algorithm for choosing efficient initial seeds was proposed by Arthur and Vassilvitskii (2007), and is known as K-means++. This algorithm has advantages over choosing arbitrary or random initial seeds from the population of observations. Access the implementation of this algorithm in this class using the constructor with the signature ClusterKMeans(double[][], int) or by calling the setClusterCenters(Random) method.

See Also:
  • Constructor Details

    • ClusterKMeans

      public ClusterKMeans(double[][] x, double[][] cs)
      Constructor for ClusterKMeans.
      Parameters:
      x - a double matrix containing the observations to be clustered
      cs - a double matrix containing the cluster seeds, i.e. estimates for the cluster centers
    • ClusterKMeans

      public ClusterKMeans(double[][] x, int k)
      Constructor for ClusterKMeans using the K-means++ algorithm to select the initial seeds.
      Parameters:
      x - a double matrix containing the observations to be clustered
      k - an int indicating the number of clusters
    • ClusterKMeans

      public ClusterKMeans(double[][] x, int k, Random r)
      Constructor for ClusterKMeans using the K-means++ algorithm to set the initial seeds.
      Parameters:
      x - a double matrix containing the observations to be clustered
      k - an int indicating the number of clusters
      r - a Random object, the random number generator to be used in the KMeans++ algorithm

      Note that r can be initialized with a seed for repeatable results.

  • Method Details

    • getInitialCenters

      public double[][] getInitialCenters()
      Returns the initial cluster centers.

      The cluster centers are those set by the K-Means++ center selection algorithm or those supplied to the constructor.

      Returns:
      a double matrix containing the cluster centers
    • compute

      Computes the cluster means.
      Returns:
      a double matrix containing computed result
      Throws:
      ClusterKMeans.NonnegativeFreqException - is thrown if a frequency is negative
      ClusterKMeans.NonnegativeWeightException - is thrown if a weight is negative
      ClusterKMeans.NoConvergenceException - is thrown if convergence did not occur within the maximum number of iterations
      ClusterKMeans.ClusterNoPointsException - is thrown if the cluster centers yield a cluster with no points
    • setWeights

      public void setWeights(double[] weights)
      Sets the weight for each observation.
      Parameters:
      weights - a double array of size x.length containing the weight for each observation

      Default: weights[] = 1

    • setFrequencies

      public void setFrequencies(double[] frequencies)
      Sets the frequency for each observation.
      Parameters:
      frequencies - a double array of size x.length containing the frequency for each observation

      Default: frequencies[] = 1

    • setMaxIterations

      public void setMaxIterations(int iterations)
      Sets the maximum number of iterations.
      Parameters:
      iterations - an int scalar specifying the maximum number of iterations

      Default: interations = 30

    • getClusterMembership

      public int[] getClusterMembership()
      Returns the cluster membership for each observation.

      Note that the compute() method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.

      Returns:
      an int array containing the cluster membership for each observation

      Cluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc.

    • getClusterSSQ

      public double[] getClusterSSQ()
      Returns the within sum of squares for each cluster.

      Note that the compute method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.

      Returns:
      a double array containing the within sum of squares for each cluster
    • getClusterCounts

      public int[] getClusterCounts()
      Returns the number of observations in each cluster.

      Note that the compute method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.

      Returns:
      an int array containing the number of observations in each cluster