X — NROW by NVAR + 1 matrix containing the data to be used on this call. (Input/Output) One column in X must contain the group number for each observation. On output, X is sorted into a k‑d tree. The first NRULE + NCLASS rows of X must not contain missing values in the columns specified by IND and IGRP.
K — Number of nearest neighbors to be used in the discriminant rule. (Input)
NGROUP — Number of groups in the data. (Input)
NRULE — Number of observations in X to be used in the discriminant rule. (Input) The first ∣NRULE∣ observations in X are used as the set defining the rule. If NRULE is positive, then the NRULE observations defining the rule are classified. If NRULE is negative, the NRULE observations defining the rule are not classified.
NCLASS — Number of observations in X to classify. (Input) NCLASS is the number of observations in a second sample that may be used to test the rule formed from the first NRULE observations. If present, this sample is in rows NRULE + 1 through NRULE + NCLASS of X.
THRESH — Threshold for the posterior probabilities. (Input) If the maximum posterior probability is less than THRESH, the observation is classified into group NGROUP + 1 (the group “other”).
PART — Vector of length NRULE containing the values to be used in the partition of X for the k-d tree. (Output)
IDISCR — Vector of length NRULE containing the element number in IND that points to the column of X to be used as the discriminator in the k-d tree. (Output) IDISCR(i) = 0 if the observation is a terminal node. IND(IDISCR(i)) is the column number in X to be used as the discriminator.
NI — Vector of length NGROUP containing the number of observations in each group. (Output)
ICLASS — Vector of length m containing the group to which the observation was classified. (Output) If NRULE > 0, m = NRULE + NCLASS; otherwise, m = NCLASS. The i‑th element in ICLASS corresponds to to i‑th row in the sorted matrix X.
PROB — m by NGROUP matrix containing the posterior probabilities for each observation. (Output) The i‑th row in PROB corresponds to the i‑th row in the in the sorted matrix X.
CLASS — NGROUP by NGROUP + 1 matrix containing the classification table. (Output) Each observation that is classified and has a group number equal to 1.0, 2.0, …, NGROUP is entered into the table. The rows of the table correspond to the known group membership. The columns refer to the group to which the observation was classified. Column NGROUP + 1 refers to the column “other” (see THRESH).
Optional Arguments
NROW — Number of rows of X that contain an observation. (Input) Default: NROW = size (X,1).
NVAR — Number of variables to be used in the discrimination. (Input) Default: NVAR = size (X,2) – 1.
NCOL — Number of columns in matrix X. (Input) Default: NCOL = size (X,2).
LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input) Default: LDX = size (X,1).
IND — Vector of length NVAR containing the column numbers in X to be used in the discrimination. (Input) By default, IND(I)=1.
IGRP — Column number in X containing the group numbers. (Input) The group numbers must be 1.0, 2.0, …, NGROUP for an observation to be used in the discriminant functions. (Note, however, that the nearest integer (NINT) function is used to obtain the group numbers.) Default: IGRP = NVAR + 1.
METRIC — Metric to be used in computing the k nearest neighbors. (Input) Default: METRIC = 0.
METRIC
Metric used
0
Euclidean distance
1
L1 norm
2
L∞ norm
PRIOR — Vector of length NGROUP containing the prior probabilities for each group. (Input, if PRIOR(1) is not ‑1.0; input/output, if PRIOR(1) is ‑1.0) If PRIOR(1) is not ‑1.0, then the elements of PRIOR should sum to 1.0. Proportional priors can be selected by setting PRIOR(1) = ‑1.0. In this case, the prior probabilities will be proportional to the sample size in each group based upon the first NRULE observations, and the elements of PRIOR will contain the proportional prior probabilities on return from NNBRD. Default: PRIOR(1) = ‑1.0.
LDPROB — Leading dimension of PROB exactly as specified in the dimension statement in the calling program. (Input) Default: LDPROB = size (PROB,1).
LDCLAS — Leading dimension of CLASS exactly as specified in the dimension statement in the calling program. (Input) Default: LDCLAS = size (CLASS,1).
Routine NNBRD performs k‑th nearest neighbor discriminant function analysis. The k-d tree algorithm of Friedman, Bentley, and Finkel (1977) is used to find the nearest neighbors. Consult this reference for a discussion of k-d trees and how one goes about finding nearest neighbors in them.
In NNBRD, the k nearest neighbors of any observation used in forming the rule (i.e., one of the first NRULE observations in X), do not include the observation. Let ki(i = 1, …, NGROUP) denote the number of nearest neighbors found from each of the groups for a given observation (Σiki = k); let pi = PRIOR(i)( Σipi = 1); and let
denote the estimated posterior probability of membership in group i. Compute
where g = NGROUP. (If nj = 0 for some j, the associated term in the denominator is excluded and
is set to 0.0.)
Let m denote the index of the maximum
and ɸ = THRESH. Then if
the observation is classified into group m. If
or if the maximum is not unique, then the observation is not classified into any group and ICLASS is set to zero.
Three metrics are available in NNBRD for finding the nearest neighbors. These are Euclidean (L2) distance, L1 norm, and L∞ norm. In order to use Mahalanobis distance, xTΣ−1x, a transformation y = Σ−1∕2x is first needed so that Var(y) = I. These transformations can be accomplished by use of the mathematical routines. The L2 norm would then be used with y as input to obtain the Mahalanobis metric.
Comments
Workspace may be explicitly provided, if desired, by use of N2BRD/DN2BRD. The reference is:
BNDL — Work vector of length NVAR * (log2(NROW) + 3).
BNDH — Work vector of length NVAR * (log2(NROW) + 3).
XKEY — Work vector of length NVAR.
IPQR — Work vector of length K + 1.
PQD — Work vector of length K + 1.
Example
Fisher’s iris data are used to illustrate routine NNBRD. The data consist of three types of iris. NNBRD is called with k = 5 and Euclidean distance as the metric. The results show a clear separation of the groups.