com.imsl.stat.ClusterKMeans

All Implemented Interfaces:: Serializable, Cloneable

public class ClusterKMeans extends Object implements Serializable, Cloneable

Perform a K-means (centroid) cluster analysis.

ClusterKMeans is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. It allows for missing values (coded as NaN, not a number) and for weights and frequencies.

Let p denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,

$$ \phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar x_{ij} \right)^2 $$

where $\nu_{im}$ denotes the row index of the m-th observation in the i-th cluster in the matrix X; $n_i$ is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; d is zero if the j-th variable on observation $\nu_{im}$ is missing, otherwise $\delta$ is one; and $\bar x_{ij}$ is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.

An algorithm for choosing efficient initial seeds was proposed by Arthur and Vassilvitskii (2007), and is known as K-means++. This algorithm has advantages over choosing arbitrary or random initial seeds from the population of observations. Access the implementation of this algorithm in this class using the constructor with the signature ClusterKMeans(double[][], int) or by calling the setClusterCenters(Random) method.

See Also:

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

ClusterKMeans.ClusterNoPointsException

There is a cluster with no points

static class

ClusterKMeans.NoConvergenceException

Convergence did not occur within the maximum number of iterations.

static class

ClusterKMeans.NonnegativeFreqException

Deprecated.
No longer used, replaced with an IllegalArgumentException.

static class

ClusterKMeans.NonnegativeWeightException

Deprecated.
No longer used, replaced with an IllegalArgumentException.
Constructor Summary

Constructors

Constructor

Description

ClusterKMeans(double[][] x, double[][] cs)

Constructor for ClusterKMeans.

ClusterKMeans(double[][] x, int k)

Constructor for ClusterKMeans using the K-means++ algorithm to select the initial seeds.

ClusterKMeans(double[][] x, int k, Random r)

Constructor for ClusterKMeans using the K-means++ algorithm to set the initial seeds.
Method Summary

Modifier and Type

Method

Description

final double[][]

compute()

Computes the cluster means.

int[]

getClusterCounts()

Returns the number of observations in each cluster.

int[]

getClusterMembership()

Returns the cluster membership for each observation.

double[]

getClusterSSQ()

Returns the within sum of squares for each cluster.

double[][]

getInitialCenters()

Returns the initial cluster centers.

void

setFrequencies(double[] frequencies)

Sets the frequency for each observation.

void

setMaxIterations(int iterations)

Sets the maximum number of iterations.

void

setWeights(double[] weights)

Sets the weight for each observation.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ClusterKMeans
  
  public ClusterKMeans(double[][] x, double[][] cs)
  
  Constructor for ClusterKMeans.
  
  Parameters:
  
  x - a double matrix containing the observations to be clustered
  
  cs - a double matrix containing the cluster seeds, i.e. estimates for the cluster centers
- ClusterKMeans
  
  public ClusterKMeans(double[][] x, int k)
  
  Constructor for ClusterKMeans using the K-means++ algorithm to select the initial seeds.
  
  Parameters:
  
  x - a double matrix containing the observations to be clustered
  
  k - an int indicating the number of clusters
- ClusterKMeans
  
  public ClusterKMeans(double[][] x, int k, Random r)
  
  Constructor for ClusterKMeans using the K-means++ algorithm to set the initial seeds.
  
  Parameters:
  
  x - a double matrix containing the observations to be clustered
  
  k - an int indicating the number of clusters
  
  r - a Random object, the random number generator to be used in the KMeans++ algorithm
  Note that r can be initialized with a seed for repeatable results.
Method Details
- getInitialCenters
  
  public double[][] getInitialCenters()
  
  Returns the initial cluster centers.
  The cluster centers are those set by the K-Means++ center selection algorithm or those supplied to the constructor.
  
  Returns:
  
  a double matrix containing the cluster centers
- compute
  
  public final double[][] compute() throws ClusterKMeans.NoConvergenceException, ClusterKMeans.ClusterNoPointsException
  
  Computes the cluster means.
  
  Returns:
  
  a double matrix containing computed result
  
  Throws:
  
  ClusterKMeans.NonnegativeFreqException - is thrown if a frequency is negative
  
  ClusterKMeans.NonnegativeWeightException - is thrown if a weight is negative
  
  ClusterKMeans.NoConvergenceException - is thrown if convergence did not occur within the maximum number of iterations
  
  ClusterKMeans.ClusterNoPointsException - is thrown if the cluster centers yield a cluster with no points
- setWeights
  
  public void setWeights(double[] weights)
  
  Sets the weight for each observation.
  
  Parameters:
  
  weights - a double array of size x.length containing the weight for each observation
  Default: weights[] = 1
- setFrequencies
  
  public void setFrequencies(double[] frequencies)
  
  Sets the frequency for each observation.
  
  Parameters:
  
  frequencies - a double array of size x.length containing the frequency for each observation
  Default: frequencies[] = 1
- setMaxIterations
  
  public void setMaxIterations(int iterations)
  
  Sets the maximum number of iterations.
  
  Parameters:
  
  iterations - an int scalar specifying the maximum number of iterations
  Default: interations = 30
- getClusterMembership
  
  public int[] getClusterMembership()
  
  Returns the cluster membership for each observation.
  Note that the compute() method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.
  
  Returns:
  
  an int array containing the cluster membership for each observation
  Cluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc.
- getClusterSSQ
  
  public double[] getClusterSSQ()
  
  Returns the within sum of squares for each cluster.
  Note that the compute method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.
  
  Returns:
  
  a double array containing the within sum of squares for each cluster
- getClusterCounts
  
  public int[] getClusterCounts()
  
  Returns the number of observations in each cluster.
  Note that the compute method must be invoked first before invoking this method. Otherwise, the method throws a IllegalStateException exception.
  
  Returns:
  
  an int array containing the number of observations in each cluster

Class ClusterKMeans

Nested Class Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Details

ClusterKMeans

ClusterKMeans

ClusterKMeans

Method Details

getInitialCenters

compute

setWeights

setFrequencies

setMaxIterations

getClusterMembership

getClusterSSQ

getClusterCounts