CDIST

Computes a matrix of dissimilarities (or similarities) between the columns (or rows) of a matrix.

Required Arguments

X — NROW by NCOL matrix containing the data. (Input)

DIST — m by m matrix containing the computed dissimilarities or similarities, where m = NROW if IROW = 1 and m = NCOL otherwise. (Output)

Optional Arguments

NROW — Number of rows in the matrix. (Input)
Default: NROW = size (X,1).

NCOL — Number of columns in the matrix. (Input)
Default: NCOL = size (X,2).

LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).

NDSTM — Number of rows (columns, if IROW = 1) to be used in computing the distance measure between the columns (rows). (Input)
Default: NDSTM = size (IND,1) if IND is present. Otherwise, a default value of 2 is used.

IND — Vector of length NDSTM containing the indices of the rows (columns, if IROW = 1) to be used in computing the distance measure. (Input)
If IND(1) = 0, the first NDSTM rows (columns) are used.
By default, the first NDSTM rows (columns) are used.

IMETH — Method to be used in computing the dissimilarities or similarities. (Input)
Default: IMETH = 0.

 

IMETH   Method
0       Euclidean distance (L2 norm)
1       Sum of the absolute differences (L1 norm)
2       Maximum difference (L∞ norm)
3       Mahalanobis distance
4       Absolute value of the cosine of the angle between the vectors
5       Angle in radians (0, π) between the lines through the origin defined by the vectors
6       Correlation coefficient
7       Absolute value of the correlation coefficient
8       Number of exact matches

See the Description section for a more detailed description of each measure.

IROW — Row or column option. (Input)
If IROW = 1, distances are computed between the NROW rows of X. Otherwise, distances between the NCOL columns of X are computed.
Default: IROW = 1.

ISCALE — Scaling option. (Input)
ISCALE is not used for methods 3 through 8.
Default: ISCALE = 0.

 

ISCALE  Scaling Performed
0       No scaling is performed.
1       Scale each column (row, if IROW = 1) by the standard deviation of the column (row).
2       Scale each column (row, if IROW = 1) by the range of the column (row).

LDDIST — Leading dimension of DIST exactly as specified in the dimension statement in the calling program. (Input)
Default: LDDIST = size (DIST,1).

FORTRAN 90 Interface

Generic: CALL CDIST (X, DIST [, …])

Specific: The specific interface names are S_CDIST and D_CDIST.

FORTRAN 77 Interface

Single: CALL CDIST (NROW, NCOL, X, LDX, NDSTM, IND, IMETH, IROW, ISCALE, DIST, LDDIST)

Double: The double precision name is DCDIST.

Description

Routine CDIST computes an upper triangular matrix (excluding the diagonal) of dissimilarities (or similarities) between the columns or rows of a matrix. Nine different distance measures can be computed. For the first three measures, three different scaling options can be employed. Output from CDIST is generally used as input to clustering or multidimensional scaling routines.

The following discussion assumes that the distance measure is being computed between the columns of the matrix, i.e., that IROW is not 1. If distances between the rows of the matrix are desired, set IROW to 1.

For IMETH = 0 to 2, each row of X is first scaled according to the value of ISCALE. The scaling value for a row is either the standard deviation of the row or the row range; the standard deviation is computed from the unbiased estimate of the variance. If ISCALE is 0, no scaling is performed, and the scaling values in the following discussion are all 1.0. Once the scaling values (if any) have been computed, the distance between column i and column j is computed via the difference vector $z_k = (x_k - y_k)/s_k$, $k = 1, \ldots, \mathrm{NDSTM}$, where $x_k$ denotes the k‑th element in the i‑th column, $y_k$ denotes the corresponding element in the j‑th column, and $s_k$ is the corresponding scaling value. For given $z_k$, the metrics 0 to 2 are defined as:

IMETH   Metric
0       Euclidean distance, $\left(\sum_{k=1}^{\mathrm{NDSTM}} z_k^2\right)^{1/2}$
1       L1 norm, $\sum_{k=1}^{\mathrm{NDSTM}} |z_k|$
2       L∞ norm, $\max_{k} |z_k|$
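The three metrics above can be illustrated with a short, self-contained Fortran sketch. It does not call CDIST and is not the routine's implementation; the program name, the data values, and the choice of comparing columns 1 and 2 are assumptions made only for illustration. It follows the description above: one scaling value per row (standard deviation from the unbiased variance, or the range), then the scaled difference vector between two columns.

```fortran
program metric_sketch
  implicit none
  integer, parameter :: ndstm = 3, ncol = 3
  integer, parameter :: iscale = 1   ! 0: none, 1: standard deviation, 2: range
  real :: x(ndstm, ncol), s(ndstm), z(ndstm), mean
  integer :: k

  ! Hypothetical data: columns are the items compared, rows are the variables.
  x = reshape((/ 1.0, 2.0, 0.0,  3.0, 6.0, 1.0,  5.0, 4.0, 2.0 /), &
              (/ ndstm, ncol /))

  ! Scaling value s(k) for each row k; all 1.0 when no scaling is requested.
  s = 1.0
  do k = 1, ndstm
     select case (iscale)
     case (1)
        mean = sum(x(k, :)) / ncol
        s(k) = sqrt(sum((x(k, :) - mean)**2) / (ncol - 1))
     case (2)
        s(k) = maxval(x(k, :)) - minval(x(k, :))
     end select
  end do

  ! Scaled difference vector between columns 1 and 2.
  z = (x(:, 1) - x(:, 2)) / s

  print '(a, f8.4)', 'IMETH = 0 (L2 norm):   ', sqrt(sum(z**2))
  print '(a, f8.4)', 'IMETH = 1 (L1 norm):   ', sum(abs(z))
  print '(a, f8.4)', 'IMETH = 2 (Linf norm): ', maxval(abs(z))
end program metric_sketch
```

With these values the row standard deviations are 2, 2, and 1, so z = (-1, -2, -1) and the three metrics are sqrt(6) (about 2.4495), 4, and 2. The sketch assumes no row is constant; a zero scaling value would need separate handling.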

Distance measures corresponding to IMETH = 3 to 8 do not allow for scaling. These measures are defined via the column vectors $X = (x_i)$, $Y = (y_i)$, and $Z = (x_i - y_i)$ as follows:

IMETH   Metric
3       Mahalanobis distance, based on the quadratic form $Z^{T}\hat{\Sigma}^{-1}Z$, where $\hat{\Sigma}$ is the usual unbiased sample estimate of the covariance matrix of the rows.
4       $|\cos\theta| = |X^{T}Y| / (\|X\|\,\|Y\|)$, the absolute value of the dot product of X and Y divided by the length of X times the length of Y.
5       θ, where θ is defined in 4.
6       ρ = the usual (centered) estimate of the correlation between X and Y.
7       The absolute value of ρ (where ρ is defined in 6).
8       The number of times $x_i = y_i$, where $x_i$ and $y_i$ are elements of X and Y.

For the Mahalanobis distance, any variable used in computing the distance measure that is (numerically) linearly dependent upon the previous variables in the IND vector is omitted from the distance measure.
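As a rough illustration of measures 4 through 8, the following self-contained Fortran sketch evaluates them for a single pair of vectors. It is an illustration of the definitions only, not the code used by CDIST; the program name and the data are hypothetical, and the Mahalanobis distance (IMETH = 3) is left out because it needs a covariance estimate over all columns.

```fortran
program angle_corr_sketch
  implicit none
  integer, parameter :: n = 4
  real :: x(n), y(n), xc(n), yc(n), cosabs, theta, rho
  integer :: nmatch

  ! Hypothetical vectors X and Y.
  x = (/ 1.0, 2.0, 3.0, 4.0 /)
  y = (/ 1.0, 3.0, 2.0, 4.0 /)

  ! IMETH = 4: absolute value of the cosine of the angle between X and Y.
  cosabs = abs(dot_product(x, y)) / &
           sqrt(dot_product(x, x) * dot_product(y, y))

  ! IMETH = 5: the corresponding angle in radians
  ! (min guards against rounding slightly above 1).
  theta = acos(min(cosabs, 1.0))

  ! IMETH = 6 and 7: correlation of the centered vectors, and its magnitude.
  xc = x - sum(x) / n
  yc = y - sum(y) / n
  rho = dot_product(xc, yc) / &
        sqrt(dot_product(xc, xc) * dot_product(yc, yc))

  ! IMETH = 8: number of positions where the elements are exactly equal.
  nmatch = count(x == y)

  print '(a, f8.4)', 'IMETH = 4: ', cosabs
  print '(a, f8.4)', 'IMETH = 5: ', theta
  print '(a, f8.4)', 'IMETH = 6: ', rho
  print '(a, f8.4)', 'IMETH = 7: ', abs(rho)
  print '(a, i4)',   'IMETH = 8: ', nmatch
end program angle_corr_sketch
```

For these vectors the absolute cosine is 29/30, the correlation is 0.8, and two elements match exactly. The sketch assumes neither vector is zero or constant, since either case makes a denominator vanish.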

Comments

1. Workspace may be explicitly provided, if desired, by use of C2IST/DC2IST. The reference is:

CALL C2IST (NROW, NCOL, X, LDX, NDSTM, IND, IMETH, IROW, ISCALE, DIST, LDDIST, X1, X2, SCALE, WK, IND1)

The additional arguments are as follows:

X1 — Work vector of length NDSTM. Not used if IMETH = 8.

X2 — Work vector of length NDSTM. Not used if IMETH = 8.

SCALE — Work vector of length NDSTM if IMETH is less than 4; of length NCOL or NROW when IROW is 0 or 1, respectively, and IMETH is 4 or 5; and of length 2 * NCOL or 2 * NROW when IROW is 0 or 1, respectively, and IMETH is 6 or 7. SCALE is not used when IMETH is 8.

WK — Work vector of length NDSTM * NDSTM when IMETH is 3, or of length NDSTM when IMETH = 6 or 7. Not used otherwise.

IND1 — Integer work vector of length NDSTM.

2. Informational error

Type   Code   Description
3      3      A variable is numerically linearly dependent on the previous variables when IMETH is 3. The variable detected as being linearly dependent is omitted from the distance measure.

Example

The following example illustrates the use of CDIST for computing the Euclidean distance between the rows of a matrix.

 

      USE WRRRN_INT
      USE CDIST_INT

      IMPLICIT   NONE
      INTEGER    LDDIST, NCOL, NROW
      PARAMETER  (NCOL=2, NROW=4, LDDIST=NROW)
!
      REAL       DIST(LDDIST,NROW), X(NROW,NCOL)
!                                 X is stored by columns
      DATA X/1, 1, 1, 1, 1, 0, -1, 2/
      DATA DIST/16*0.0/
!                                 Print input matrix
      CALL WRRRN ('X', X)
!                                 Defaults IMETH = 0 and IROW = 1 give
!                                 Euclidean distances between the rows
      CALL CDIST (X, DIST)
!                                 Print distance matrix
      CALL WRRRN ('DIST', DIST)
!
      END

Output

 

        X
        1       2
1   1.000   1.000
2   1.000   0.000
3   1.000  -1.000
4   1.000   2.000

              DIST
        1       2       3       4
1   0.000   1.000   2.000   1.000
2   0.000   0.000   1.000   2.000
3   0.000   0.000   0.000   3.000
4   0.000   0.000   0.000   0.000
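The printed DIST matrix can be checked without the library. The sketch below is a plain Fortran program, not part of the manual's example, and its program name is hypothetical. It computes the unscaled Euclidean distance between each pair of rows of the same X and fills only the strict upper triangle, leaving the diagonal and lower triangle at zero as in the output above.

```fortran
program check_example
  implicit none
  integer, parameter :: nrow = 4, ncol = 2
  real :: x(nrow, ncol), dist(nrow, nrow)
  integer :: i, j

  ! Same data as the example, stored by columns.
  x = reshape((/ 1.0, 1.0, 1.0, 1.0,  1.0, 0.0, -1.0, 2.0 /), &
              (/ nrow, ncol /))
  dist = 0.0

  ! Fill only the strict upper triangle: distance between rows i and j, i < j.
  do i = 1, nrow - 1
     do j = i + 1, nrow
        dist(i, j) = sqrt(sum((x(i, :) - x(j, :))**2))
     end do
  end do

  ! Print the matrix one row per line.
  do i = 1, nrow
     print '(4f8.3)', dist(i, :)
  end do
end program check_example
```

Because the first column of X is constant, each distance reduces to the absolute difference of the second-column values, which gives 1, 2, 1 in the first row, 1, 2 in the second, and 3 in the third.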