public class ClusterKMeans extends Object implements Serializable, Cloneable
ClusterKMeans
is an implementation of Algorithm AS 136 by
Hartigan and Wong (1979). It computes K-means (centroid) Euclidean
metric clusters for an input matrix starting with initial estimates of the
K cluster means. It allows for missing values (coded as NaN, not a
number) and for weights and frequencies.
Let p denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,
$$ \phi = \sum_{i=1}^K \sum_{j=1}^p \sum_{m=1}^{n_i} f_{\nu_{im}} w_{\nu_{im}} \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar x_{ij} \right)^2 $$where \(\nu_{im}\) denotes the row index of the m-th observation in the i-th cluster in the matrix X; \(n_i\) is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; d is zero if the j-th variable on observation \(\nu_{im}\) is missing, otherwise \(\delta\) is one; and \(\bar x_{ij}\) is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
An algorithm for choosing efficient initial seeds was proposed by
Arthur and Vassilvitskii (2007), and is known as K-means++. This
algorithm has advantages over choosing arbitrary or random initial seeds from
the population of observations. Access the implementation of this algorithm
in this class using the constructor with the signature
ClusterKMeans(double[][], int)
or by calling the
setClusterCenters(Random)
method.
Modifier and Type | Class and Description |
---|---|
static class |
ClusterKMeans.ClusterNoPointsException
There is a cluster with no points
|
static class |
ClusterKMeans.NoConvergenceException
Convergence did not occur within the maximum number of iterations.
|
static class |
ClusterKMeans.NonnegativeFreqException
Deprecated.
No longer used, replaced with an
IllegalArgumentException . |
static class |
ClusterKMeans.NonnegativeWeightException
Deprecated.
No longer used, replaced with an
IllegalArgumentException . |
Constructor and Description |
---|
ClusterKMeans(double[][] x,
double[][] cs)
Constructor for
ClusterKMeans . |
ClusterKMeans(double[][] x,
int k)
Constructor for
ClusterKMeans using the K-means++ algorithm
to select the initial seeds. |
ClusterKMeans(double[][] x,
int k,
Random r)
Constructor for
ClusterKMeans using the K-means++ algorithm
to set the initial seeds. |
Modifier and Type | Method and Description |
---|---|
double[][] |
compute()
Computes the cluster means.
|
int[] |
getClusterCounts()
Returns the number of observations in each cluster.
|
int[] |
getClusterMembership()
Returns the cluster membership for each observation.
|
double[] |
getClusterSSQ()
Returns the within sum of squares for each cluster.
|
double[][] |
getInitialCenters()
Returns the initial cluster centers.
|
void |
setFrequencies(double[] frequencies)
Sets the frequency for each observation.
|
void |
setMaxIterations(int iterations)
Sets the maximum number of iterations.
|
void |
setWeights(double[] weights)
Sets the weight for each observation.
|
public ClusterKMeans(double[][] x, double[][] cs)
ClusterKMeans
.x
- a double
matrix containing the observations to be
clusteredcs
- a double
matrix containing the cluster seeds, i.e.
estimates for the cluster centerspublic ClusterKMeans(double[][] x, int k)
ClusterKMeans
using the K-means++ algorithm
to select the initial seeds.x
- a double
matrix containing the observations to be
clusteredk
- an int
indicating the number of clusterspublic ClusterKMeans(double[][] x, int k, Random r)
ClusterKMeans
using the K-means++ algorithm
to set the initial seeds.x
- a double
matrix containing the observations to be
clusteredk
- an int
indicating the number of clustersr
- a Random object, the random number generator to be used in the
KMeans++ algorithm
Note that r
can be initialized with a seed for repeatable
results.
public double[][] getInitialCenters()
The cluster centers are those set by the K-Means++ center selection algorithm or those supplied to the constructor.
double
matrix containing the cluster centerspublic final double[][] compute() throws ClusterKMeans.NoConvergenceException, ClusterKMeans.ClusterNoPointsException
double
matrix containing computed resultClusterKMeans.NonnegativeFreqException
- is thrown if a frequency is negativeClusterKMeans.NonnegativeWeightException
- is thrown if a weight is negativeClusterKMeans.NoConvergenceException
- is thrown if convergence did not occur
within the maximum number of iterationsClusterKMeans.ClusterNoPointsException
- is thrown if the cluster centers
yield a cluster with no pointspublic void setWeights(double[] weights)
weights
- a double
array of size x.length
containing the weight for each observation
Default: weights[]
= 1
public void setFrequencies(double[] frequencies)
frequencies
- a double
array of size
x.length
containing the frequency for each observation
Default: frequencies[]
= 1
public void setMaxIterations(int iterations)
iterations
- an int
scalar specifying the maximum
number of iterations
Default: interations
= 30
public int[] getClusterMembership()
Note that the compute()
method must be invoked first before
invoking this method. Otherwise, the method throws a
IllegalStateException
exception.
int
array containing the cluster membership for
each observation
Cluster membership 1 indicates the observation belongs to cluster 1, cluster membership 2 indicates the observation belongs to cluster 2, etc.
public double[] getClusterSSQ()
Note that the compute
method must be invoked first before
invoking this method. Otherwise, the method throws a
IllegalStateException
exception.
double
array containing the within sum of squares
for each clusterpublic int[] getClusterCounts()
Note that the
compute
method must be invoked first before invoking this
method. Otherwise, the method throws a IllegalStateException
exception.
int
array containing the number of observations
in each clusterCopyright © 2020 Rogue Wave Software. All rights reserved.