Cluster Analysis
Function imsl.cluster.cluster_k_means() performs a K-means cluster
analysis. Basic K-means clustering attempts to find a clustering that
minimizes the within-cluster sums of squares. In this method of clustering, the
data matrix X is grouped so that each observation (row in X) is assigned
to one of a fixed number, K, of clusters. The sum of the squared differences
of each observation about its assigned cluster’s mean is used as the criterion
for assignment. In the basic algorithm, observations are transferred from one
cluster to another when doing so decreases the within-cluster sum of squared
differences. When no transfer occurs in a pass through the entire data set, the
algorithm stops. Function imsl.cluster.cluster_k_means() is one
implementation of the basic algorithm.
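The stopping rule described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: it uses a simplified Lloyd-style pass (assign every observation to its nearest mean, then update the means), and the names basic_k_means, within_ss, seeds, and max_passes are hypothetical, chosen only for this illustration.

```python
def within_ss(points, labels, means):
    """Total within-cluster sum of squared differences about the cluster means."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, means[c]))
        for p, c in zip(points, labels)
    )


def basic_k_means(points, seeds, max_passes=100):
    """Simplified K-means sketch (assumption: Lloyd-style passes, not IMSL's method)."""
    means = [list(s) for s in seeds]
    labels = [None] * len(points)
    for _ in range(max_passes):
        moved = False
        for i, p in enumerate(points):
            # Squared distance from this observation to every cluster mean.
            dists = [sum((x - m) ** 2 for x, m in zip(p, mean)) for mean in means]
            best = dists.index(min(dists))
            if best != labels[i]:
                labels[i] = best
                moved = True
        if not moved:
            break  # no transfer occurred in a full pass, so stop
        # Recompute each cluster mean from its current members.
        for c in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous mean if a cluster is empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means


# Six observations (rows) on two variables, seeded with one row from each group.
X = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
labels, means = basic_k_means(X, seeds=[X[0], X[3]])
print(labels, within_ss(X, labels, means))
```

The result depends on the seeds, so a sketch like this only finds a clustering that no single pass can improve, which may not be the global minimum.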
The usual course of events in K-means cluster analysis is to use
imsl.cluster.cluster_k_means() to obtain the optimal clustering. The
clustering is then evaluated by other statistical functions in IMSL Library for Python. Often,
K-means clustering is performed with more than one value of K, and the
value of K that best fits the data is used.
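As a rough illustration of trying several values of K, the sketch below clusters a small one-dimensional data set for K = 1 to 4 and records the within-cluster sum of squares for each. It does not call the library; k_means_1d, within_ss, and the evenly spread seeding rule are hypothetical helpers written for this example, and comparing the sums of squares is only one possible way to judge fit.

```python
def k_means_1d(values, seeds, max_passes=100):
    """Simplified 1-D K-means sketch (assumption: not the library's algorithm)."""
    means = list(seeds)
    labels = [None] * len(values)
    for _ in range(max_passes):
        # Assign each value to the cluster with the nearest mean.
        new_labels = [
            min(range(len(means)), key=lambda c: (v - means[c]) ** 2)
            for v in values
        ]
        if new_labels == labels:
            break  # no observation changed cluster in this pass
        labels = new_labels
        for c in range(len(means)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                means[c] = sum(members) / len(members)
    return labels, means


def within_ss(values, labels, means):
    """Within-cluster sum of squared differences about the cluster means."""
    return sum((v - means[lab]) ** 2 for v, lab in zip(values, labels))


data = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8, 9.0, 9.2, 8.8]
ordered = sorted(data)
ss_by_k = {}
for k in (1, 2, 3, 4):
    # Hypothetical seeding: pick k values spread evenly through the sorted data.
    seeds = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels, means = k_means_1d(data, seeds)
    ss_by_k[k] = within_ss(data, labels, means)

for k, ss in ss_by_k.items():
    print(k, round(ss, 2))
```

For this toy data the within-cluster sum of squares falls sharply up to K = 3 and only slightly afterward, which suggests three clusters. Because the sum of squares generally keeps shrinking as K grows, the smallest value alone is not a sufficient reason to choose a K.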
Clustering can be performed either on observations or variables. The discussion
of the function imsl.cluster.cluster_k_means() assumes the clustering
is to be performed on the observations, which correspond to the rows of the
input data matrix. If variables, rather than observations, are to be clustered,
the data matrix should first be transposed. In the documentation for
imsl.cluster.cluster_k_means(), the words “observation” and “variable”
are interchangeable.
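A minimal sketch of the transposition step, using plain Python lists and made-up values: after transposing, each row holds one variable, so a row-oriented clustering routine would group variables instead of observations. With a NumPy array, the equivalent is the array's .T attribute.

```python
# Two observations (rows) measured on three variables (columns).
X = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
]

# Transpose so that each row corresponds to one variable.
X_t = [list(col) for col in zip(*X)]

print(len(X), len(X[0]))      # rows and columns before transposing
print(len(X_t), len(X_t[0]))  # rows and columns after transposing
```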