robustCovariances¶
Computes a robust estimate of a covariance matrix and mean vector.
Synopsis¶
robustCovariances (x, nGroups)
Required Argument¶
- float
x
(Input) - A
nRows
bynVariables
+ 1 matrix containing the data. The firstnVariables
columns correspond to the variables, and the last column (columnnVariables
) must contain the group numbers. - int
nGroups
(Input) - Number of groups in the data.
Return Value¶
Matrix of size nVariables
by nVariables
containing the matrix of
covariances.
Optional Arguments¶
xIndices
, intigrp
, intind[]
, intifrq
, intiwt
(Input)Each of the four arguments contains indices indicating column numbers of
x
in which particular types of data are stored.Parameter
igrp
contains the index for the column ofx
in which the group numbers are stored.Parameter
ind
contains the indices of the variables to be used in the analysis.Parameters
ifrq
andiwt
contain the column numbers ofx
in which the frequencies and weights, respectively, are stored. Setifrq
= −1 if there will be no column for frequencies. Setiwt
= −1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.Defaults:
igrp
=nVariables
,ind
[ ] = 0, 1, …,nVariables
− 1,ifrq
= −1, andiwt
= −1
initialEstMean
(Input)
or
initialEstMedian
(Input)
or
initialEstInput
, floatinputMean
, floatinputCov
(Input)If
initialEstMean
is specified, initial estimates are obtained as the usual estimate of a mean vector and of a covariance matrix.If
initialEstMedian
is specified, initial estimates are based upon the median and interquartile range are used.If
initialEstInput
is specified, the initial estimates are specified in arraysinputMean
andinputCov
. ArgumentinputMean
is an array of sizenGroups
bynVariables
, andinputCov
is an array of sizenVariables
bynVariables
.Default:
initialEstMean
estimationMethod
, int (Input)- Option parameter giving the algorithm to be used in computing the estimates.
estimationMethod | Method Used |
---|---|
0 | Huber’s conjugate-gradient algorithm is used. |
1 | Stahel’s algorithm is used. |
percentage
, float (Input)Percentage of gross errors expected in the data. Argument
percentage
must be in the range 0.0 to 100.0 and contains the percentage of outliers expected in the data. If the percentage of gross errors expected in the data is not known, a reasonable strategy is to choose a value ofpercentage
that is such that larger values do not result in significant changes in the estimates.Default:
percentage
= 5.0maxIterations
, int (Input)Maximum number of iterations.
Default:
maxIterations
= 30tolerance
, float (Input)Convergence criterion. When the maximum absolute change in a location or covariance estimate is less than
tolerance
, convergence is assumed.Default:
tolerance
= \(10^{-4}\)minimaxWeights
, floata
, floatb
, floatc
(Output)- Arguments
a
,b
, andc
contain the values for the parameters of the weighting function. See the Description section. groupCounts
(Output)- An integer array of length
nGroups
containing the number of observations in each group. sumWeights
(Output)- An array of length
nGroups
containing the sum of the weights times the frequencies in the groups. means
(Output)- An array of size
nGroups
bynVariables
. The i-th row ofmeans
contains the group i variable means. u
(Output)- An array of size
nVariables
bynVariables
containing the lower matrix U, the lower triangular for the robust sample cross-products matrix. U is computed from the robust sample covariance matrix, S (see the Description section), as \(S=U^TU\). beta
(Output)- Argument
beta
contains the constant used to ensure that the estimated covariance matrix has unbiased expectation (for a given mean vector) for a multivariate normal density. nRowsMissing
(Output)- Number of rows of data encountered in calls to
robustCovariances
containing missing values (NaN) for any of the variables used.
Description¶
Function robustCovariances
computes robust M-estimates of the mean and
covariance matrix from a matrix of observations. A pooled estimate of the
covariance matrix is computed when multiple groups are present in the input
data. M-estimate weights are obtained using the “minimax” weights of Huber
(1981, pp. 231-235), with percentage
expected gross errors. Huber’s
(1981) weighting equations are given by:
User specified observation weights and frequencies may be given for each row
in x
. Listwise deletion of missing values is assumed so that all
observations used are “complete”.
Let \(f(x;\mu_i,\Sigma)\) denote the density of an observation p-vector
x
in population (group) i with mean vector \(\mu_i\), for \(i=
1,\ldots,\tau\). Let the covariance matrix Σ be such that \(\Sigma=R^T
R\). If
then
It is assumed that \(g(y)\) is a spherically symmetric density in p-dimensions.
In robustCovariances
, Σ and \(\mu_i\) are estimated as the solutions
of the estimation equations
and
where i indexes the τ groups, \(n_i\), is the number of observations
in group i, \(f_{ij}\) is the frequency for the j‑th observation
in group i, \(w_{ij}\) is the observation weight specified in column
iwt
of x
, \(I_p\) is a p × p identity matrix,
\(w(r)\) and \(u(r)\) are the weighting functions, and where β is a constant computed by the program to make the expected weighted Mahalanobis distance (\(y^Ty\)) equal the expected Mahalanobis distance from a multivariate normal distribution (see Marazzi 1985). The constant β is described more fully below.
Function robustCovariances
uses one of two algorithms for solving the
estimation equations. The first algorithm is discussed in detail in Huber
(1981) and is a variant of the conjugate gradient method. The second
algorithm is due to Stahel (1981) and is discussed in detail by Marazzi
(1985). In both algorithms, correction vectors \(T_{ki}\) for the group
i means and correction matrix \(W_k=I_p+U_k\) for the Cholesky
factorization of Σ are found such that the updated mean vectors are given by
and the updated matrix R is given as
where k is the iteration number and
When all elements of \(U_k\) and \(T_{ki}\) are less than ɛ =
tolerance
, convergence is assumed.
Three methods for obtaining estimates are allowed. In the first method, the sample weighted estimate of Σ is computed. In the second method, estimates based upon the median and the interquartile range are used. Finally, in the last method, the user inputs initial estimates.
Function robustCovariances
computes estimates based on the “minimax”
weights discussed above. The constant β is chosen such that
\(E(u(r)r_2)=\rho\beta\) where the expectation is with respect to a
standard p‑variate multivariate normal distribution. This yields
estimates with the correct expectation for the multivariate normal
distribution (for given mean vector). The expectation is computed via
integration of estimated spline function. 200 knots are used on an equally
spaced grid from 0.0 to the 99.999 percentile of
distribution. An error estimate is computed based upon 100 of these knots. If the estimated relative error is greater than 0.0001, a warning message is issued. If β is not computed accurately (i.e., if the warning message is issued), the computed estimates are still optimal, but the scale of the estimated covariance matrix may need to be multiplied by a constant in order for
to have the correct multivariate normal covariance expectation.
Examples¶
Example 1¶
The following example computes a robust variance-covariance matrix. The last column of the data set is the group indicator.
from numpy import *
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix
n_groups = 2
x = [[2.2, 5.6, 1],
[3.4, 2.3, 1],
[1.2, 7.8, 1],
[3.2, 2.1, 2],
[4.1, 1.6, 2],
[3.7, 2.2, 2]]
cov = robustCovariances(x, n_groups)
writeMatrix("Robust Covariance Matrix", cov,
colNumberZero=True, rowNumberZero=True)
Output¶
Robust Covariance Matrix
0 1
0 0.522 -1.160
1 -1.160 2.862
Example 2¶
The following example computes estimates of the pooled covariance matrix for
the Fisher’s iris data. For comparison, the estimates are first computed via
function pooledCovariances. Function
robustCovariances
with percentage
= 2.0 is then used to compute the
robust estimates. As can be seen from the output, the resulting estimates
are quite similar.
Next, three observations are made into outliers, and again, estimates are
computed using functions pooledCovariances
and robustCovariances
.
When outliers are present, the estimates of pooledCovariances
are
adversely affected, while the estimates produced by robustCovariances
are close the estimates produced when no outliers are present.
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.pooledCovariances import pooledCovariances
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix
nobs = 150
nvar = 4
n_groups = 3
percentage = 2.0
igrp = 0
ifrq = -1
iwt = -1
ind = (1, 2, 3, 4)
x = dataSets(3)
xIndices = {}
xIndices['igrp'] = igrp
xIndices['ind'] = ind
xIndices['ifrq'] = ifrq
xIndices['iwt'] = iwt
cov = pooledCovariances(nobs, nvar, x, n_groups,
xIndices=xIndices)
writeMatrix("Pooled Covariance with No Outliers", cov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
rbcov = robustCovariances(x, n_groups,
percentage=percentage,
xIndices=xIndices)
writeMatrix("Robust Covariance with No Outliers", rbcov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
# Add Outliers
x[0, 1] = 100.0
x[3, 4] = 100.0
x[99, 2] = -100.0
xIndices = {}
xIndices['igrp'] = igrp
xIndices['ind'] = ind
xIndices['ifrq'] = ifrq
xIndices['iwt'] = iwt
cov = pooledCovariances(nobs, nvar, x, n_groups,
xIndices=xIndices)
writeMatrix("Pooled Covariance with Outliers", cov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
rbcov = robustCovariances(x, n_groups,
percentage=percentage,
xIndices=xIndices)
writeMatrix("Robust Covariance with Outliers", rbcov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
Output¶
Pooled Covariance with No Outliers
0 1 2 3
0 0.2650 0.0927 0.1675 0.0384
1 0.1154 0.0552 0.0327
2 0.1852 0.0427
3 0.0419
Robust Covariance with No Outliers
0 1 2 3
0 0.2474 0.0872 0.1535 0.0360
1 0.1073 0.0538 0.0322
2 0.1705 0.0412
3 0.0401
Pooled Covariance with Outliers
0 1 2 3
0 60.43 0.30 0.13 -1.56
1 70.53 0.17 -0.17
2 0.19 0.07
3 66.38
Robust Covariance with Outliers
0 1 2 3
0 0.2555 0.0876 0.1553 0.0359
1 0.1127 0.0545 0.0322
2 0.1723 0.0412
3 0.0424
Warning Errors¶
IMSLS_NO_CONVERGE_MAX_ITER |
Failure to converge within
“maxIterations ” = #
iterations for at least one of the
“nroot” = # roots. |
Fatal Errors¶
IMSLS_BAD_GROUP_2 |
The group number for observation # is equal to #. It must be greater than or equal to one and less than or equal to #, the number of groups. |