Uses Fisher’s linear discriminant analysis method to reduce the number of variables.
Required Arguments
XMEAN — NGROUP by NVAR matrix containing the means of the variables in each group. (Input)
SUMWT — Vector of length NGROUP containing the sum of the weights of the observations in each group. (Input)
COV — NVAR by NVAR matrix containing the pooled within‑groups variance‑covariance matrix Sp. (Input)
NNV — Number of eigenvectors extracted from $S_p^{-1} S_b$, the standardized between‑groups variance‑covariance matrix. (Output) Sp is the pooled within‑groups variance‑covariance matrix, and Sb is the between‑groups variance‑covariance matrix. NNV is usually the minimum of NVAR and NGROUP‑1, but it may be smaller if any row of XMEAN or COV is a linear combination of the other rows.
EVAL — Vector of length NNV containing the eigenvalues extracted from the standardized between‑means variance‑covariance matrix, in descending order. (Output) NNV is less than or equal to the minimum of NVAR and (NGROUP‑1).
COEF — NVAR by NNV matrix of eigenvectors from the standardized between‑means variance‑covariance matrix. (Output) The eigenvector coefficients have been standardized such that the canonical scores can be obtained directly by multiplication of the original data by COEF.
CMEAN — NGROUP by NNV matrix of group means of the canonical variables. (Output)
Optional Arguments
NGROUP — Number of groups. (Input) Default: NGROUP = size (XMEAN,1).
NVAR — Number of variables. (Input) Default: NVAR = size (XMEAN,2).
LDXMEA — Leading dimension of XMEAN exactly as specified in the dimension statement in the calling program. (Input) Default: LDXMEA = size (XMEAN,1).
LDCOV — Leading dimension of COV exactly as specified in the dimension statement in the calling program. (Input) Default: LDCOV = size (COV,1).
LDCOEF — Leading dimension of COEF exactly as specified in the dimension statement in the calling program. (Input) Default: LDCOEF = size (COEF,1).
LDCMEA — Leading dimension of CMEAN exactly as specified in the dimension statement in the calling program. (Input) Default: LDCMEA = size (CMEAN,1).
Routine DMSCR is a natural generalization of R.A. Fisher’s linear discrimination procedure for two groups. This method of discrimination obtains those linear combinations of the observed random variables that maximize the between‑groups variation relative to the within‑groups variation. Denote the first of these linear combinations by
$$z_1 = \beta_1^T x$$
where β1 is a column vector of coefficients of length NVAR and x is an observation to be classified. On the basis of one linear combination, the discriminant rule assigns the observation x to the group whose mean score is closest, in Euclidean distance, to z1.
To obtain β1 (see, e.g., Tatsuoka 1971, page 158), let Sp denote the pooled within‑groups covariance matrix (Sp is defined and can be computed via routine DSCRM) and let Sb denote the between‑groups covariance matrix defined by
$$S_b = \frac{1}{g-1} \sum_{i=1}^{g} w_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$$
where g is the number of groups, $\bar{x}_i$ is the mean vector for the i-th group of observations, $\bar{x}$ denotes the vector of means over all observations, wi is the sum of the weights times the frequencies as input in SUMWT and as used in the computation of $\bar{x}_i$,
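and N is the total number of observations used in computing COV. Then β1, normalized so that
$$\beta_1^T S_p \beta_1 = 1$$
can be computed as the vector that maximizes
$$\frac{\beta_1^T S_b \beta_1}{\beta_1^T S_p \beta_1}$$
This yields β1 as the eigenvector associated with the largest eigenvalue of $S_p^{-1} S_b$.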
Generally, $S_p^{-1} S_b$ has rank m, where m = min(g‑1, p) and p = NVAR, so it has m such eigenvectors. The matrix COEF is obtained as (β1, β2, …, βm), where each βi is an eigenvector.
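As a rough illustration of the computation described above, the following Python sketch builds Sb from XMEAN and SUMWT and finds β1 by power iteration on Sp⁻¹Sb, rescaling at each step so that β1ᵀ Sp β1 = 1. It is not the algorithm DMSCR uses internally. The function names, the 1/(g‑1) scaling of Sb, the use of the weighted mean of the group means as the overall mean, and the sample numbers are all assumptions made for illustration.

```python
# Illustrative sketch only; not the DMSCR implementation.

def matvec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(row[c] * v[c] for c in range(len(v))) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def between_groups_cov(xmean, sumwt):
    """Assumed form: S_b = sum_i w_i d_i d_i^T / (g - 1), d_i = xbar_i - xbar."""
    g, p = len(xmean), len(xmean[0])
    total = sum(sumwt)
    # Overall mean taken as the weighted mean of the group means (assumption).
    grand = [sum(sumwt[i] * xmean[i][j] for i in range(g)) / total
             for j in range(p)]
    sb = [[0.0] * p for _ in range(p)]
    for i in range(g):
        d = [xmean[i][j] - grand[j] for j in range(p)]
        for r in range(p):
            for c in range(p):
                sb[r][c] += sumwt[i] * d[r] * d[c] / (g - 1)
    return sb

def first_discriminant(xmean, sumwt, cov, iters=200):
    """Power iteration on S_p^{-1} S_b with beta^T S_p beta = 1."""
    sb = between_groups_cov(xmean, sumwt)
    p = len(cov)
    # Arbitrary start; it must not be annihilated by S_b, or scale becomes 0.
    beta = [1.0] + [0.5] * (p - 1)
    for _ in range(iters):
        beta = solve(cov, matvec(sb, beta))           # beta <- S_p^{-1} S_b beta
        scale = dot(beta, matvec(cov, beta)) ** 0.5   # enforce beta^T S_p beta = 1
        beta = [b / scale for b in beta]
    eigenvalue = dot(beta, matvec(sb, beta))  # Rayleigh quotient; denominator is 1
    return beta, eigenvalue

# Made-up inputs: two groups, two variables.
xmean = [[0.0, 0.0], [2.0, 1.0]]
sumwt = [10.0, 10.0]
cov = [[2.0, 0.5], [0.5, 1.0]]
beta1, eigenvalue = first_discriminant(xmean, sumwt, cov)
```

With two groups, Sb has rank one, so β1 is proportional to Sp⁻¹ applied to the difference of the two group means, which is Fisher’s original two‑group discriminant. A different scaling of Sb would change the eigenvalue but not the direction of β1.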
The matrix CMEAN contains the within‑group means of the linear combinations zi defined by the β’s. For each observation x, scores
$$z_i = \beta_i^T x$$
can be computed. Because of the restriction on βi, the sample variance of each zi is 1.0. The observation is classified into the group (as specified by the group mean of the zi’s) to which, on the basis of the zi, the Euclidean distance is the least.
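The classification rule just described can be sketched in a few lines of Python. The helper names and the numeric values of `coef` and `xmean` below are hypothetical, and the sketch assumes that CMEAN equals XMEAN multiplied by COEF, in line with the statement that canonical scores are obtained by multiplying the original data by COEF.

```python
# Illustrative sketch only; values are made up, not DMSCR output.

def canonical_scores(x, coef):
    """Scores z_k = sum_j x[j] * coef[j][k]; coef is NVAR by NNV."""
    nnv = len(coef[0])
    return [sum(x[j] * coef[j][k] for j in range(len(x))) for k in range(nnv)]

def classify(x, coef, cmean):
    """Return the 0-based index of the group whose canonical mean is nearest."""
    z = canonical_scores(x, coef)
    dist2 = [sum((zk - mk) ** 2 for zk, mk in zip(z, row)) for row in cmean]
    return min(range(len(cmean)), key=dist2.__getitem__)

# Hypothetical coefficients (NVAR = 2, NNV = 2) and group means (NGROUP = 3).
coef = [[0.5, -0.2],
        [0.1, 0.4]]
xmean = [[0.0, 0.0], [2.0, 1.0], [4.0, 0.0]]
# Assumption: group means of the canonical variables are XMEAN times COEF.
cmean = [canonical_scores(row, coef) for row in xmean]

group = classify([1.9, 1.1], coef, cmean)
```

The function returns a 0‑based index, whereas groups in the Fortran routine are numbered from 1.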
Note that the linear combinations zi have meaning even when discrimination is not desired. The linear combination of the observed variables that most separates the g groups is z1; z2 gives the second highest such separation, orthogonal to the first, and so on. Thus, a plot of the group mean vectors of the first two canonical variables (the first two columns of CMEAN) gives a good two‑dimensional summary of the relationships between the groups.
Comments
1. Workspace may be explicitly provided, if desired, by use of D2SCR/DD2SCR. The reference is:
2. IMSL routine DSCRM may be used to calculate the input arrays for this routine from the original data.
Example
The following example illustrates a typical sequence. Fisher’s iris data is used (see routine GDATA, Chapter 19, “Utilities”). Routine DSCRM is first used to perform a discriminant analysis based on all the variables. COV, XMEAN, and NI are obtained from DSCRM. Routine DMSCR, which uses these arrays, is then called.