Cluster Analysis

Function imsl.cluster.cluster_k_means() performs a K-means cluster analysis. Basic K-means clustering attempts to find a clustering that minimizes the within-cluster sum of squares. In this method of clustering, the data matrix X is grouped so that each observation (row of X) is assigned to one of a fixed number, K, of clusters. The criterion for assignment is the sum of the squared differences between each observation and the mean of its assigned cluster. In the basic algorithm, observations are transferred from one cluster to another when doing so decreases the within-cluster sum of squared differences. When no transfer occurs in a pass through the entire data set, the algorithm stops. Function imsl.cluster.cluster_k_means() is one implementation of the basic algorithm.
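
The transfer rule described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used by imsl.cluster.cluster_k_means(). The function name k_means_transfer, its arguments, and the sample data are hypothetical and chosen only for illustration. The sketch assumes that moving an observation x out of a cluster with n members and mean m lowers that cluster's sum of squares by n/(n-1) times the squared distance from x to m, and that moving it into a cluster with n members raises that cluster's sum of squares by n/(n+1) times the squared distance to its mean.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means_transfer(x, seeds, max_passes=30):
    """Illustrative transfer-style K-means (hypothetical helper).

    x     : list of observations, each a list of floats
    seeds : list of K initial cluster centers
    Returns (membership, means).
    """
    k = len(seeds)
    # Initial assignment: each observation joins its nearest seed.
    member = [min(range(k), key=lambda j: sq_dist(row, seeds[j])) for row in x]
    counts = [member.count(j) for j in range(k)]
    means = []
    for j in range(k):
        rows = [row for row, m in zip(x, member) if m == j]
        if rows:
            means.append([sum(col) / len(rows) for col in zip(*rows)])
        else:
            means.append(list(seeds[j]))

    for _ in range(max_passes):
        moved = False
        for i, row in enumerate(x):
            a = member[i]
            if counts[a] <= 1:
                continue  # never empty a cluster
            # Decrease in cluster a's sum of squares if row leaves it.
            loss = counts[a] / (counts[a] - 1) * sq_dist(row, means[a])
            best, best_gain = a, loss
            for b in range(k):
                if b == a:
                    continue
                # Increase in cluster b's sum of squares if row joins it.
                gain = counts[b] / (counts[b] + 1) * sq_dist(row, means[b])
                if gain < best_gain:
                    best, best_gain = b, gain
            if best != a:
                # Transfer the observation and update both means in place.
                means[a] = [(counts[a] * m - v) / (counts[a] - 1)
                            for m, v in zip(means[a], row)]
                means[best] = [(counts[best] * m + v) / (counts[best] + 1)
                               for m, v in zip(means[best], row)]
                counts[a] -= 1
                counts[best] += 1
                member[i] = best
                moved = True
        if not moved:
            break  # a full pass with no transfer: stop
    return member, means


# Hypothetical one-variable data with deliberately poor seeds.
data = [[0.0], [1.0], [10.0], [11.0]]
member, means = k_means_transfer(data, seeds=[[0.0], [1.0]])
print(member)  # -> [0, 0, 1, 1]
print(means)
```

In this example the second observation starts in the wrong cluster and is transferred during the first pass. The second pass makes no transfer, so the loop stops.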

The usual course of events in K-means cluster analysis is to use imsl.cluster.cluster_k_means() to obtain a clustering that minimizes the within-cluster sum of squares, at least locally. The clustering is then evaluated with other statistical functions in the IMSL Library for Python. Often, K-means clustering is performed with more than one value of K, and the value of K that best fits the data is used.
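
As a rough illustration of trying several values of K, the sketch below runs a simple stand-in K-means routine for K = 1, 2, 3 and records the within-cluster sum of squares for each. The routine simple_k_means, the seed selection, and the data are all hypothetical. The stand-in reassigns every observation to its nearest mean and then recomputes the means, which is a simplification and not the algorithm used by imsl.cluster.cluster_k_means().

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def simple_k_means(x, seeds, max_iter=50):
    """Stand-in K-means: assign to nearest mean, recompute means, repeat."""
    means = [list(s) for s in seeds]
    member = None
    for _ in range(max_iter):
        new_member = [min(range(len(means)), key=lambda j: sq_dist(row, means[j]))
                      for row in x]
        if new_member == member:
            break  # assignments are stable
        member = new_member
        for j in range(len(means)):
            rows = [row for row, m in zip(x, member) if m == j]
            if rows:  # keep the old mean if a cluster is empty
                means[j] = [sum(col) / len(rows) for col in zip(*rows)]
    wcss = sum(sq_dist(row, means[m]) for row, m in zip(x, member))
    return member, means, wcss


# Hypothetical data: two well-separated groups of three observations.
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]

wcss_by_k = {}
for k in (1, 2, 3):
    # Assumed seeding rule for this sketch: K evenly spaced rows of the data.
    seeds = data[::len(data) // k][:k]
    _, _, wcss_by_k[k] = simple_k_means(data, seeds)

for k, wcss in wcss_by_k.items():
    print(k, wcss)
```

The within-cluster sum of squares generally shrinks as K grows, so the smallest value alone does not identify the best K. A common heuristic is to look for the K beyond which further increases give only a small improvement; here the large drop occurs between K = 1 and K = 2.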

Clustering can be performed on either observations or variables. The discussion of function imsl.cluster.cluster_k_means() assumes that the clustering is to be performed on the observations, which correspond to the rows of the input data matrix. If variables, rather than observations, are to be clustered, the data matrix should first be transposed. For this reason, the words “observation” and “variable” are interchangeable in the documentation for imsl.cluster.cluster_k_means().
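
The transposition step can be sketched as follows. The data matrix and variable names are hypothetical, and plain nested lists are used so the example has no dependencies. With a NumPy array the same result would be obtained from the array's T attribute.

```python
# Hypothetical data matrix: 4 observations (rows) by 3 variables (columns).
data = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
]

# Transpose so that each row now holds one variable across all observations.
# Passing this matrix to a clustering routine groups variables, not observations.
variables_as_rows = [list(col) for col in zip(*data)]

print(len(data), len(data[0]))                            # -> 4 3
print(len(variables_as_rows), len(variables_as_rows[0]))  # -> 3 4
```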