Usage Notes

FNLStat : Cluster Analysis : Usage Notes

Usage Notes

The routines described in this chapter perform various forms of hierarchical or K‑means cluster analysis. By appropriate manipulation of the input data, either variables or cases may be clustered. Additionally, for hierarchical clustering, similarity or dissimilarity (distance) matrices created by routines not included in this chapter can be clustered. Hartigan (1975) and Anderberg (1973) are general references that may be used in this chapter.

The first step in agglomerative hierarchical cluster analysis is to compute the distance between each observation (or variable). Initially, each observation (variable) is treated as a cluster. The two clusters that are closest to one another in distance are merged, and the distance of the new cluster from all other clusters is computed. This process continues until only one cluster remains. No attempt at finding an optimal clustering (in the sense of minimizing some criterion) is made.

The usual steps in a hierarchical cluster analysis might proceed as follows:

1. Routine CDIST is used to compute a distance (or possibly a similarity) matrix from the input data matrix. A scaled matrix of Euclidean distances is a common choice for a distance matrix, while a correlation matrix is a common choice for a similarity matrix. If a correlation matrix is to be used, many of the routines described in Chapter 3, “Correlation”, may also be used to compute the correlation measures for the matrix. In particular, routine CORVC (see Chapter 3, “Correlation”) from this chapter may be used.

2. Once the distance matrix has been computed, routine CLINK is used to perform the agglomerative hierarchical cluster analysis using either single, complete, average, or Ward’s linkage.

3. The results obtained from CLINK are examined, and if desired, the number of clusters is selected. Routine TREEP in Chapter 16, “Line Printer Graphics” may be used to print the cluster tree. This tree may aid in selecting the number of clusters, assuming that such a number is desired. Based upon the number of clusters selected, routine CNUMB is used to obtain the cluster number of each of the clustered observations (or variables).

4. Routines described in Chapter 1, “Basic Statistics” and other chapters in the IMSL STAT/LIBRARY are used to obtain descriptive and other statistics to evaluate the clustering.

Because routine CDIST produces similarity and distance matrices for either rows or columns, it is easy to cluster either observations or variables. Optionally, the user may wish to cluster a correlation matrix obtained from one of the routines in the correlation chapter or to input a matrix of similarities (or dissimilarities) obtained via experimentation. The objects within such matrices may be clustered directly in routine CLINK.

Basic K‑means clustering attempts to find a clustering that minimizes the within‑cluster sums of squares. In this method of clustering the data, matrix X is grouped so that each observation (row in X) is assigned to one of a fixed number, K, of clusters. The sum of the squared difference of each observation about its assigned clusters mean is used as the criterion for assignment. In the basic algorithm, observations are transferred from one cluster to another when doing so decreases the within‑cluster sums of squared differences. When, in a pass through the entire data set, no transfer occurs, the algorithm stops. Routine KMEAN is one implementation of the basic algorithm.

The usual course of events in K‑means cluster analysis might be to use routine KMEAN to obtain the optimal clustering. The clustering is then evaluated via routines described in Chapter 1, “Basic Statistics” and/or other chapters in the IMSL STAT/LIBRARY. Often, K‑means clustering with more than one value for K is performed, and the value of K that best fits the data is used.

Clustering can be performed either on observations or on variables. The discussion of the routine KMEAN assumes the clustering is to be performed on the observations, which correspond to the rows of the input data matrix. If variables, rather than observations, are to be clustered using KMEAN, the data matrix should first be transposed (possibly using routine TRNRR (IMSL MATH/LIBRARY)). In the documentation for KMEAN, the words “observation” and “variable” would then be exchanged.