Performs a linear or a quadratic discriminant function analysis among several known groups.
Required Arguments
NROW — The absolute value of NROW is the number of rows of X that contain an observation. (Input) If NROW is negative, the observations are deleted from the discriminant statistics. If NROW is positive, they are added.
NVAR — Number of variables to be used in the discrimination. (Input)
X — ∣NROW∣ by NVAR + m matrix containing the data to be used on this call. (Input, if ∣NROW∣ > 0; X is not referenced otherwise) m is 1, 2, or 3 depending upon whether any columns in X contain frequencies or weights. One column in X must contain the group number for each observation. Group numbers must be 1.0, 2. 0, …, NGROUP. If present, IFRQ gives the column containing the frequencies, while IWT gives the column in X containing the weights.
NGROUP — Number of groups in the data. (Input)
COV — NVAR by NVAR by g matrix of covariances. (Output, for IDO = 0 or 1; input/output, for IDO = 2, 3, or 5; input, for IDO = 4; not referenced if IDO = 6) g = NGROUP + 1 when IMTH = 1, 2, 4, or 5, and g = 1 otherwise. When IMTH = 3 or 6, the within‑group covariance matrices are not computed. Regardless of the value of IMTH, the pooled covariance matrix is always computed and saved as the g‑th covariance matrix in COV.
COEF — NGROUP by NVAR + 1 matrix containing the linear discriminant function coefficients. (Output, if IDO = 0 or 3; input, if IDO = 4; not referenced if IDO = 1, 2, 5, or 6) The first column of COEF contains the constant term, and the remaining columns contain the variable coefficients. Row i of COEF corresponds to group i, for i = 1,…,NGROUP. Matrix COEF is always computed as the linear discriminant function coefficients even when quadratic discrimination is specified. Specifically, given the linear discriminant function , the intercept is assigned to COEF(i,1) and the coefficient vector is assigned to COEF(i,2:NVAR+1)for i = 1,…,NGROUP. Matrix COEF is updated if IDO is equal to 0 or 3.
ICLASS — Vector of length ∣NROW∣ containing the group to which the observation was classified. (Output, if IDO = 0 or 4; not referenced otherwise) If an observation has an invalid group number, frequency, or weight when the leaving‑out‑one method has been specified, then the observation is not classified and the corresponding elements of ICLASS and PROB are set to zero.
PROB — ∣NROW∣ by NGROUP matrix containing the posterior probabilities for each observation. (Output, if IDO = 0 or 4; not referenced otherwise)
CLASS — NGROUP by NGROUP matrix containing the classification table. (Output, if IDO = 0 or 1, input/output, if IDO = 4; not referenced otherwise) Each observation that is classified and has a group number equal to 1.0, 2.0, …, NGROUP is entered into the table. The rows of the table correspond to the known group membership. The columns refer to the group to which the observation was classified. Classification results accumulate with each call to DSCRM with IDO = 4. For example, if 2 calls with IDO = 4 are made, then the elements in CLASS sum to the total number of valid observations in the 2 calls.
D2 — NGROUP by NGROUP matrix containing the Mahalanobis distances
between the group means. (Output, when IDO= 0 or 3; not referenced otherwise) For linear discrimination, the Mahalanobis distance is computed using the pooled covariance matrix. Otherwise, the Mahalanobis distance
between group means i and j is computed using the within covariance matrix for group i in place of the pooled covariance matrix.
STAT — Vector of length 4 + 2 * (NGROUP + 1) containing statistics of interest. (Input/ Output, if IDO = 3 or 5; output, if IDO = 0 or 1; not referenced otherwise) The first element of STAT is the sum of the degrees of freedom for the within‑covariance matrices. The second, third and fourth elements of STAT correspond to the chi‑squared statistic, its degrees of freedom, and the probability of a greater chi‑squared, respectively, of a test of the homogeneity of the within‑covariance matrices (not computed if IMTH = 3 or 6). The 5‑th through 5 + NGROUP elements of STAT contain the log of the determinants of each group’s covariance matrix (not computed if IMTH = 3 or 6) and of the pooled covariance matrix (element 5 + NGROUP). Finally, the last NGROUP + 1 elements of STAT contain the sum of the weights within each group and, in the last position, the sum of the weights in all groups.
This is the only invocation of DSCRM; all the data are input at once.
1
This is the first invocation of DSCRM with this data, additional calls will be made. Initialization and updating for the NROW observations are performed.
2
This is an intermediate invocation of DSCRM; updating for the NROW observations is performed.
3
All statistics are updated for the NROW observations. The discriminant functions and other statistics are computed.
4
The discriminant functions are used to classify each of the NROW observations in X.
5
The covariance matrices are computed, and workspace is released. No further calls to DSCRM with IDO greater than 1 should be made without first calling DSCRM with IDO = 1.
6
Workspace is released. No further calls to DSCRM with IDO greater than 1 should be made without first calling DSCRM with IDO = 1. This option is not required if a call has been made with IDO = 5 or if workspace is explicitly provided by use of D2CRM.
See Comments 5 and 6 for further information.
NCOL — Number of columns in matrix X. (Input) Default: NCOL = size (X,2).
LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input) Default: LDX = size (X,1).
IND — Vector of length NVAR containing the column numbers in X to be used in the discrimination. (Input) By default, IND(I)=I.
IFRQ — Frequency option. (Input) IFRQ = 0 means that all frequencies are 1.0. Positive IFRQ indicates that column number IFRQ of X contains the frequencies. All frequencies should be integer values. If this is not the case, the NINT (nearest integer) function is used to obtain integer frequencies. Default: IFRQ = 0.
IWT — Weighting option. (Input) IWT = 0 means that all weights are 1.0. Positive IWT means that column IWT of X contains the weights. Negative weights are not allowed. Default: IWT = 0.
IGRP — Column number in X containing the group numbers. (Input) The group numbers must be 1.0, 2,0, …, NGROUP for an observation to be used in the discriminant functions. An observation will be classified regardless of its group number when the reclassification method is specified. Default: IGRP = NVAR + 1.
IMTH — Option parameter giving the method of discrimination. (Input) IMTH determines whether linear or quadratic discrimination is used whether the group covariance matrices are computed (the pooled covariance matrix is always computed) and whether the leaving‑out‑one or the reclassification method is used to classify each observation. Default: IMTH = 1.
IMTH
Discrim.
Coveriance
Classification
1
Linear
All
Reclassification
2
Quadratic
All
Reclassification
3
Linear
Pooled only
Reclassification
4
Linear
All
Leaving‑out‑one
5
Quadratic
All
Leaving‑out‑one
6
Linear
Pooled only
Leaving‑out‑one
In the leaving‑out‑one method of classification, the posterior probabilities are adjusted so as to eliminate the effect of the observation from the sample statistics prior to its classification. In the reclassification method, the effect of the observation is not eliminated from the classification function. Calls to DSCRM with IMTH = 1, 2, 4, or 5 can be intermixed, as can calls to DSCRM with IMTH = 3 or 6. Calls to DSCRM with IMTH = 1, 2, 4, or 5 cannot be intermixed with calls to DSCRM with IMTH = 3 or 6 without first calling DSCRM with IDO = 1 (or 0).
IPRINT — Printing option. (Input) Default: IPRINT = 0. For the given combination of IDO and IPRINT, the following arrays are printed.
Note that the only change from IPRINT = 1 to IPRINT = 2 is the printing when IDO = 4. Also, note that PRIOR is printed even though it may be input only.
PRIOR — Vector of length NGROUP containing the prior probabilities for each group. (Input, if PRIOR(1) is not ‑1.0 and IDO is 0, 3 or 4; input/output, if PRIOR(1) is ‑1.0 and IDO is 0 or 3; not referenced if IDO is 1, 2, 5 or 6) If PRIOR(1) is not ‑1.0, then the elements of PRIOR should sum to 1.0. Proportional priors can be selected by setting PRIOR(1) = ‑1.0. In this case, the prior probabilities will be proportional to the sample size in each group, and the elements of PRIOR will contain the proportional prior probabilities after the first call with IDO = 0 or 3. Use of this optional argument is mandatory if IDO = 4. Default: PRIOR(1) = ‑1.0.
NI — Vector of length NGROUP. (Input, for IDO = 4 or 5; input/output, for IDO = 1, 2 or 3; output, for IDO = 0; not referenced if IDO = 6) The i‑th element of NI contains the number of observations in group i. Use of this optional argument is mandatory if IDO = 1, 2, 3, 4, 5.
XMEAN — NGROUP by NVAR matrix. (Input, for IDO = 3, 4; input/output, for IDO = 2, 5; output, for IDO = 0 or 1; not referenced if IDO = 6) The i‑th row of XMEAN contains the group i variable means.
Note that for IDO = 1, 2, 3 ,4 the input and/or output values are just the weighted sums of the observations in each group, i.e. the group i variable means is calculated as
where Mi denotes the number of elements in group i and xik is the k-th observation in group i with associated weight wik and frequency fik.
For IDO = 0 or 5, the means are output and group-wise calculated as weighted arithmetic means,
Use of this optional argument is mandatory if IDO = 1,2,3,4,5.
LDXMEA — Leading dimension of XMEAN exactly as specified in the dimension statement in the calling program. (Input) Default: LDXMEA = size(XMEAN,1).
LDCOV — Leading and second dimensions of COV exactly as specified in the dimension statement of the calling program. (Input) The first two dimensions of COV must be equal. Default: LDCOV = size (COV,1).
LDCOEF — Leading dimension of COEF exactly as specified in the dimension statement in the calling program. (Input) Default: LDCOEF = size (COEF,1).
LDPROB — Leading dimension of PROB exactly as specified in the dimension statement in the calling program. (Input) Default: LDPROB = size (PROB,1).
LDCLAS — Leading dimension of CLASS exactly as specified in the dimension statement in the calling program. (Input) Default: LDCLAS = size (CLASS,1).
LDD2 — Leading dimension of D2 exactly as specified in the dimension statement of the calling program. (Input) Default: LDD2 = size (D2,1).
NRMISS — Number of rows of data encountered in calls to DSCRM containing missing values (NaN) for the classification, group, weight, and/or frequency variables. (Output, if IDO = 0 or 1; input/output, if IDO = 2 or 3, not referenced otherwise) If a row of data contains a missing value (NaN) for any of these variables, that row is excluded from the computations.
Routine DSCRM performs discriminant function analysis using either linear or quadratic discrimination. The output from DSCRM includes a measure of distance between the groups, a table summarizing the classification results, a matrix containing the posterior probabilities of group membership for each observation, and the within‑sample means and covariance matrices. The linear discriminant function coefficients are also computed.
All observations can be input during one call to DSCRM, a method of operation that has the advantage of simplicity. Alternatively, one or more rows of observations can be input during separate calls. This method does not require that all observations be memory resident, a significant advantage with large data sets. Note, however, that DSCRM requires two passes of the data. During the first pass the discriminant functions are computed while in the second pass, the observations are classified. Thus, with the second method of operation, the data will usually need to be input into DSCRM twice.
Because both methods result in the same operations being performed, the algorithm for DSCRM is discussed as if only a few observations are input during each call. The operations performed during each call to DSCRM depend upon the IDO parameter. IDO = 0 should be used if all observations are to be input at one time.
The IDO = 1 step is the initialization step. The variables XMEAN, CLASS, and COV are initialized to zero, and other program parameters are set. After this call, all subroutine arguments except IDO, NROW, X, LDX and IMTH should not be changed by the user except via another call to DSCRM with IDO = 0 or IDO = 1. IMTH can be changed from one call to the next within the two sets {1, 2, 4, 5} or {3, 6} but not between these sets when IDO > 1. That is, do not call DSCRM with IMTH = 1 in one call and IMTH = 3 in another call without first calling DSCRM with IDO = 1.
After initialization has been performed in the IDO = 1 step, the within‑group means are updated for all valid observations in X. Observations with invalid group numbers are ignored, as are observations with missing values. The LU factorization of the covariance matrices are updated by adding (or deleting) observations via Givens rotations.
The IDO = 2 step is used solely for adding or deleting observations from the model as in the above paragraph.
The IDO = 3 step begins by adding all observations in X to the means and the factorizations of the covariance matrices. It continues by computing some statistics of interest: the linear discriminant functions, the prior probabilities (if PRIOR(1) = ‑1.0), the log of the determinant of each of the covariance matrices, a test statistic for testing that all of the within‑group covariance matrices are equal, and a matrix of Mahalanobis distances between the groups. The matrix of Mahalanobis distances is computed via the pooled covariance matrix when linear discrimination is specified, the row covariance matrix is used when the discrimination is quadratic.
Covariance matrices are defined as follows. Let Ni denote the sum of the frequencies of the observations in group i, and let Mi denote the number of observations in group i. Then, if Si denotes the within‑group i covariance matrix,
where wj is the weight of the j-th observation in group i, fj is its frequency, xj is the j-th observation column vector (in group i), and denotes the mean vector of the observations in group i. The mean vectors are computed as
where
Given the means and the covariance matrices, the linear discriminant function for group i is computed as:
where ln(pi) is the natural log of the prior probability for the i‑th group, x is the observation to be classified, and Sp denotes the pooled covariance matrix.
Let S denote either the pooled covariance matrix or one of the within‑group covariance matrices Si. (S will be the pooled covariance matrix in linear discrimination, and Si otherwise.) The Mahalanobis distance between group i and group j is computed as:
Finally, the asymptotic chi‑squared test for the equality of covariance matrices is computed as follows (Morrison 1976, page 252):
where ni is the number of degrees of freedom in the i‑th sample covariance matrix, k is the number of groups, and
where p is the number of variables.
When IDO = 4, the estimated posterior probability of each observation x belonging to group i is computed using the prior probabilities and the sample mean vectors and estimated covariance matrices under a multivariate normal assumption. Under quadratic discrimination, the within‑group covariance matrices are used to compute the estimated posterior probabilities. The estimated posterior probability of an observation x belonging to group i is
where
For the leaving‑out‑one method of classification (IMTH = 4, 5, and 6), the sample mean vector and sample covariance matrices in the formula for
are adjusted so as to remove the observation x from their computation. For linear discrimination (IMTH = 1, 2, 4, and 6), the linear discriminant function coefficients are actually used to compute the same posterior probabilities.
Using the posterior probabilities, each observations in X is classified into a group; the result is tabulated in the matrix CLASS and saved in the vector ICLASS. CLASS is not altered at this stage if X(i, IGRP) contains a group number that is out of range. If the reclassification method is specified, then all observations with no missing values in the NVAR classification variables are classified. When the leaving‑out‑one method is used, observations with invalid group numbers, weights, frequencies or classification variables are not classified. Regardless of the frequency, a 1 is added (or subtracted) from CLASS for each row of X that is classified and contains a valid group number.
When IMTH > 3, adjustment is made to the posterior probabilities to remove the effect of the observation in the classification rule. In this adjustment, each observation is presumed to have a weight of X(i, IWT), if IWT > 0 and a frequency of 1.0. See Lachenbruch (1975, page 36) for the required adjustment.
Note that the X data in an IDO = 3 call do not have to be identical with the X data used in a subsequent IDO = 4 call if the reclassification method (IMTH=1,2,3) is used: The observations in the IDO = 3 call serve as a training sample to determine the discriminant functions that can be used in the following IDO = 4 call to classify new X data. Since the reclassification method classifies each observation regardless of its group number, it can also be used for the classification of new observations whose actual group assignment is not known (i.e. group column (IGRP) of X is not used when IDO = 4). See Example 3.
Finally, when IDO = 5, the covariance matrices are computed from their LU factorizations, and the mean values (XMEAN) are scaled.
Comments
1. Workspace may be explicitly provided, if desired, by use of D2CRM/DD2CRM. The reference is:
D — Work vector of length equal to (NGROUP + 1) *NVAR if IMTH is not 3 or 6, and of length NVAR otherwise.
OB — Work vector of length equal to NVAR.
OB1 — Work vector of length equal to NVAR.
2. Informational errors
Type
Code
Description
3
1
A row of the data matrix X has an invalid group number.
4
2
The variance‑covariance matrix for a group is singular.
4
3
The pooled variance‑covariance matrix is singular.
3
4
The variance‑covariance matrix for a group is singular. STAT(2) cannot be computed. STAT(2) and STAT(4) are set to the missing value code (NaN).
3
5
An element of PRIOR is less than or equal to 10−20.
3
6
The leaving‑out‑one method is specified, but this observation does not have a valid weight, or it does not have a valid frequency. This observation is ignored.
3
7
The leaving‑out‑one method is specified, but this observation does not have a valid group number. This observation is ignored.
3. Common choices for the Bayesian prior probabilities are given by:
In all cases, the prior probabilities should sum to 1.0.
4. Two passes of the data are made. In the first pass, the statistics required to compute the discriminant functions are obtained (IDO = 1, 2, and 3). In the second pass, the discriminant functions are used to classify the observations. When IDO = 0, all of the data are memory resident, and both passes are made in one call to DSCRM. When IDO > 0 and workspace is not explicitly provided by use of D2CRM, a third call to DSCRM involving no data is required with IDO = 5 or 6.
5. Here are a few rules and guidelines for the correct value of IDO in a series of calls.
(1)
Calls with IDO = 0 or 1 may be made at any time. These calls destroy all statistics from previous calls.
(2)
IDO may not be 2, 3, 4, 5, or 6 (a) immediately after a call where IDO was 0, (b) before a call with IDO = 1 has been made, or (c) immediately after a call with IDO = 5 or 6 has been made.
(3)
IDO may not be 4 or 5 before a call with IDO = 3 has been made.
(4)
Each series of calls to DSCRM which begins with IDO = 1 should end with IDO = 5 or 6 to ensure the proper release of workspace.This is a valid sequence of IDOs: 0, 1, 2, 3, 4, 5, 1, 3, 4, 3, 5, 1, 6, 1, 2, 6, 0, 0, 1, 3, 5.
6. Unlike many routines using the parameter IDO, because of the workspace allocation and saved variables, neither DSCRM or D2CRM can be called with IDO greater than 1 in consecutive invocations with more than one dataset.
Examples
Example 1
The following example uses linear discrimination with equal prior probabilities on Fisher’s (1936) iris data. This example illustrates the execution of DSCRM when one call is made.
Continuing with Fisher’s iris data, the following example computes the quadratic discriminant functions using values of IDO > 0. In the first loop, all observations are added to the functions, two observations at a time. In the second loop, each of three observations is classified, one by one, using the leaving‑out‑one method. Output for statistics that are identical to those reported in the first example are not printed here.
Fisher's iris data consists of 50 samples from each of three species of the Iris flower. The first 50 data sets belong to species 1, the second 50 data sets to species 2 and the last 50 data sets to species 3. In this example, 30 data sets from each of the 3 species are used to train the quadratic discrimination functions. In a second step, the computed discriminant functions are used to classify the remaining data sets. The classification results show that two data sets are misclassified by the discriminant functions: One data set belonging to species 2 is erroneously assigned to species 3 and one data set belonging to species 3 is erroneously assigned to species 2.