dissimilarities
Computes a matrix of dissimilarities (or similarities) between the columns (or rows) of a matrix.
Synopsis
dissimilarities (x)
Required Arguments
- float x[[]] (Input): Array of size nrow by ncol containing the matrix.
Return Value
An array of size m by m containing the computed dissimilarities or similarities, where m = nrow if distances are computed between rows (the default, or when optional argument rows is used), and m = ncol if optional argument columns is used.
Optional Arguments
- rows (Input) or columns (Input): Exactly one of these options can be present to indicate whether distances are computed between rows or columns of x.
  Default: Distances are computed between rows.
- index, int[] (Input): Array of length ndstm containing the indices of the rows (columns if rows is used) to be used in computing the distance measure.
  Default: All rows (columns) are used.
- method, int (Input): Method to be used in computing the dissimilarities or similarities.
  Default: method = 0.

method | Method |
---|---|
0 | Euclidean distance (\(L_2\) norm) |
1 | Sum of the absolute differences (\(L_1\) norm) |
2 | Maximum difference (\(L_\infty\) norm) |
3 | Mahalanobis distance |
4 | Absolute value of the cosine of the angle between the vectors |
5 | Angle in radians (0, π) between the lines through the origin defined by the vectors |
6 | Correlation coefficient |
7 | Absolute value of the correlation coefficient |
8 | Number of exact matches |

See the Description section for a more detailed description of each measure.
- scale, int (Input): Scaling option. scale is not used for methods 3 through 8.
  Default: scale = 0.

scale | Scaling Performed |
---|---|
0 | No scaling is performed. |
1 | Scale each column (row, if rows is used) by the standard deviation of the column (row). |
2 | Scale each column (row, if rows is used) by the range of the column (row). |
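The effect of index can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name row_distance_subset is hypothetical, distances are taken between rows with the Euclidean measure, and 0-based indices are an assumption made for this illustration only.

```python
import math

def row_distance_subset(x, i, j, index=None):
    """Euclidean distance between rows i and j of x, using only the
    columns listed in `index` (assumed 0-based; all columns if None)."""
    cols = range(len(x[0])) if index is None else index
    return math.sqrt(sum((x[i][k] - x[j][k]) ** 2 for k in cols))

x = [[1.0, 1.0, 5.0],
     [1.0, 0.0, 9.0]]

all_columns = row_distance_subset(x, 0, 1)             # uses columns 0, 1, 2
first_two = row_distance_subset(x, 0, 1, index=[0, 1])  # ignores column 2
```

Here ndstm corresponds to the number of selected columns: 3 in the first call and 2 in the second.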
Description
Function dissimilarities computes an upper triangular matrix (excluding the diagonal) of dissimilarities (or similarities) between the columns or rows of a matrix. Nine different distance measures can be computed. For the first three measures, three different scaling options can be employed. Output from dissimilarities is generally used as input to clustering or multidimensional scaling functions.

The following discussion assumes that the distance measure is being computed between the columns of the matrix, i.e., that columns is used. If distances between the rows of the matrix are desired, use optional argument rows.
For method = 0 to 2, each row of x is first scaled according to the value of scale. The scaling parameter for a row is obtained from the values in that row, as either the standard deviation of the row or the row range; the standard deviation is computed from the unbiased estimate of the variance. If scale is 0, no scaling is performed, and the scaling parameters in the following discussion are all 1.0. Once the scaling parameters (if any) have been computed, the distance between column i and column j is computed via the difference vector \(z_k=(x_k-y_k)/s_k\), k = 1, …, ndstm, where \(x_k\) denotes the k-th element in the i-th column, \(y_k\) denotes the corresponding element in the j-th column, and \(s_k\) is the scaling parameter for the k-th row. For given \(z_k\), the metrics 0 to 2 are defined as:

method | Metric | Name |
---|---|---|
0 | \(\sqrt{\textstyle\sum_{k=1}^{\mathit{ndstm}}z_k^2}\) | Euclidean distance (\(L_2\) norm) |
1 | \(\textstyle\sum_{k=1}^{\mathit{ndstm}} \lvert z_k\rvert\) | \(L_1\) norm |
2 | \(\max_k \lvert z_k\rvert\) | \(L_\infty\) norm |
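The scaling step and the three metrics above can be sketched in plain Python. This is an illustration of the formulas as stated in this Description (columns are compared, rows supply the scaling parameters), not the library's code. The helper names scale_factors and column_distance are hypothetical, and a zero standard deviation or zero range is not handled.

```python
import math

def scale_factors(x, scale=0):
    """One scaling parameter s_k per row of x (hypothetical helper)."""
    factors = []
    for row in x:
        if scale == 1:
            # Standard deviation from the unbiased variance estimate.
            mean = sum(row) / len(row)
            var = sum((v - mean) ** 2 for v in row) / (len(row) - 1)
            factors.append(math.sqrt(var))
        elif scale == 2:
            # Range of the row.
            factors.append(max(row) - min(row))
        else:
            # scale = 0: no scaling.
            factors.append(1.0)
    return factors

def column_distance(x, i, j, method=0, scale=0):
    """Distance between columns i and j of x for method 0, 1, or 2."""
    s = scale_factors(x, scale)
    # Difference vector z_k = (x_k - y_k) / s_k.
    z = [(row[i] - row[j]) / s_k for row, s_k in zip(x, s)]
    if method == 0:
        return math.sqrt(sum(v * v for v in z))   # Euclidean (L2)
    if method == 1:
        return sum(abs(v) for v in z)             # L1
    if method == 2:
        return max(abs(v) for v in z)             # L-infinity
    raise ValueError("this sketch covers methods 0 to 2 only")

x = [[0.0, 3.0, 6.0],
     [1.0, 5.0, 1.0]]

euclid = column_distance(x, 0, 1, method=0)           # sqrt(3^2 + 4^2)
l1_range = column_distance(x, 0, 1, method=1, scale=2)  # 3/6 + 4/4
```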
Distance measures corresponding to method = 3 to 8 do not allow for scaling. These measures are defined via the column vectors \(X=(x_i)\), \(Y=(y_i)\), and \(Z=(x_i-y_i)\) as follows:

method | Metric |
---|---|
3 | \(Z^T\hat{\Sigma}^{-1}Z\) = Mahalanobis distance, where \(\hat{\Sigma}\) is the usual unbiased sample estimate of the covariance matrix of the rows. |
4 | \(\cos(\theta)=X^TY/\left(\sqrt{X^TX}\sqrt{Y^TY}\right)\) = the dot product of X and Y divided by the length of X times the length of Y. |
5 | θ, where θ is defined in 4. |
6 | ρ = the usual (centered) estimate of the correlation between X and Y. |
7 | The absolute value of ρ (where ρ is defined in 6). |
8 | The number of times \(x_i=y_i\), where \(x_i\) and \(y_i\) are elements of X and Y. |
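Measures 4 through 8 can be sketched directly from the table. The code below is a hedged illustration, not the library's implementation. The function names are hypothetical, method 4 returns the absolute value of the cosine as listed under the method argument, the cosine is clamped to [-1, 1] before the arccosine as a guard against rounding (an addition of this sketch), and the Mahalanobis distance (method 3) is left out because it needs a covariance inverse.

```python
import math

def cosine(x, y):
    """cos(theta) = X'Y / (|X| |Y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    """Centered correlation estimate between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity(x, y, method):
    """Measures 4 to 8 for two vectors (hypothetical helper)."""
    if method == 4:
        return abs(cosine(x, y))
    if method == 5:
        # Clamp to the valid arccosine domain before converting to an angle.
        return math.acos(max(-1.0, min(1.0, cosine(x, y))))
    if method == 6:
        return correlation(x, y)
    if method == 7:
        return abs(correlation(x, y))
    if method == 8:
        return sum(1 for a, b in zip(x, y) if a == b)
    raise ValueError("this sketch covers methods 4 to 8 only")

right_angle = similarity([1.0, 0.0], [0.0, 1.0], method=5)
matches = similarity([1.0, 2.0, 3.0], [1.0, 5.0, 3.0], method=8)
```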
For the Mahalanobis distance, any variable used in computing the distance measure that is (numerically) linearly dependent upon the previous variables in the index vector is omitted from the distance measure.
Example
The following example illustrates the use of dissimilarities for computing the Euclidean distance between the rows of a matrix.

```python
from pyimsl.stat.dissimilarities import dissimilarities
from pyimsl.stat.writeMatrix import writeMatrix

# Four observations (rows) on two variables (columns).
x = [[1., 1.],
     [1., 0.],
     [1., -1.],
     [1., 2.]]

# Defaults: Euclidean distance (method = 0), no scaling (scale = 0),
# distances computed between rows.
dist = dissimilarities(x)
writeMatrix('dist', dist)
```
Output

```
        dist
   1  2  3  4
1  0  1  2  1
2  0  0  1  2
3  0  0  0  3
4  0  0  0  0
```
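The output can be cross-checked without the library. The sketch below recomputes the Euclidean distances between the rows of x in plain Python and stores them in the upper triangle of a 4 by 4 matrix, leaving the diagonal and lower triangle at zero. The function name upper_triangular_distances is hypothetical and the code is only an independent check, not the library's algorithm.

```python
import math

def upper_triangular_distances(x):
    """Euclidean distances between rows of x, upper triangle only."""
    m = len(x)
    dist = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            dist[i][j] = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(x[i], x[j])))
    return dist

x = [[1.0, 1.0],
     [1.0, 0.0],
     [1.0, -1.0],
     [1.0, 2.0]]

dist = upper_triangular_distances(x)
for row in dist:
    print(row)
```

Because the first column of x is constant, each distance reduces to the absolute difference of the second-column values, which is why every entry is an integer.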