robustCovariances¶
Computes a robust estimate of a covariance matrix and mean vector.
Synopsis¶
robustCovariances (x, nGroups)
Required Argument¶
- float
x(Input) - A
nRowsbynVariables+ 1 matrix containing the data. The firstnVariablescolumns correspond to the variables, and the last column (columnnVariables) must contain the group numbers. - int
nGroups(Input) - Number of groups in the data.
Return Value¶
Matrix of size nVariables by nVariables containing the matrix of
covariances.
Optional Arguments¶
xIndices, intigrp, intind[], intifrq, intiwt(Input)Each of the four arguments contains indices indicating column numbers of
xin which particular types of data are stored.Parameter
igrpcontains the index for the column ofxin which the group numbers are stored.Parameter
indcontains the indices of the variables to be used in the analysis.Parameters
ifrqandiwtcontain the column numbers ofxin which the frequencies and weights, respectively, are stored. Setifrq= −1 if there will be no column for frequencies. Setiwt= −1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.Defaults:
igrp=nVariables,ind[ ] = 0, 1, …,nVariables− 1,ifrq= −1, andiwt= −1
initialEstMean (Input)
or
initialEstMedian (Input)
or
initialEstInput, floatinputMean, floatinputCov(Input)If
initialEstMeanis specified, initial estimates are obtained as the usual estimate of a mean vector and of a covariance matrix.If
initialEstMedianis specified, initial estimates are based upon the median and interquartile range are used.If
initialEstInputis specified, the initial estimates are specified in arraysinputMeanandinputCov. ArgumentinputMeanis an array of sizenGroupsbynVariables, andinputCovis an array of sizenVariablesbynVariables.Default:
initialEstMeanestimationMethod, int (Input)- Option parameter giving the algorithm to be used in computing the estimates.
| estimationMethod | Method Used |
|---|---|
| 0 | Huber’s conjugate-gradient algorithm is used. |
| 1 | Stahel’s algorithm is used. |
percentage, float (Input)Percentage of gross errors expected in the data. Argument
percentagemust be in the range 0.0 to 100.0 and contains the percentage of outliers expected in the data. If the percentage of gross errors expected in the data is not known, a reasonable strategy is to choose a value ofpercentagethat is such that larger values do not result in significant changes in the estimates.Default:
percentage= 5.0maxIterations, int (Input)Maximum number of iterations.
Default:
maxIterations= 30tolerance, float (Input)Convergence criterion. When the maximum absolute change in a location or covariance estimate is less than
tolerance, convergence is assumed.Default:
tolerance= \(10^{-4}\)minimaxWeights, floata, floatb, floatc(Output)- Arguments
a,b, andccontain the values for the parameters of the weighting function. See the Description section. groupCounts(Output)- An integer array of length
nGroupscontaining the number of observations in each group. sumWeights(Output)- An array of length
nGroupscontaining the sum of the weights times the frequencies in the groups. means(Output)- An array of size
nGroupsbynVariables. The i-th row ofmeanscontains the group i variable means. u(Output)- An array of size
nVariablesbynVariablescontaining the lower matrix U, the lower triangular for the robust sample cross-products matrix. U is computed from the robust sample covariance matrix, S (see the Description section), as \(S=U^TU\). beta(Output)- Argument
betacontains the constant used to ensure that the estimated covariance matrix has unbiased expectation (for a given mean vector) for a multivariate normal density. nRowsMissing(Output)- Number of rows of data encountered in calls to
robustCovariancescontaining missing values (NaN) for any of the variables used.
Description¶
Function robustCovariances computes robust M-estimates of the mean and
covariance matrix from a matrix of observations. A pooled estimate of the
covariance matrix is computed when multiple groups are present in the input
data. M-estimate weights are obtained using the “minimax” weights of Huber
(1981, pp. 231-235), with percentage expected gross errors. Huber’s
(1981) weighting equations are given by:
User specified observation weights and frequencies may be given for each row
in x. Listwise deletion of missing values is assumed so that all
observations used are “complete”.
Let \(f(x;\mu_i,\Sigma)\) denote the density of an observation p-vector
x in population (group) i with mean vector \(\mu_i\), for \(i=
1,\ldots,\tau\). Let the covariance matrix Σ be such that \(\Sigma=R^T
R\). If
then
It is assumed that \(g(y)\) is a spherically symmetric density in p-dimensions.
In robustCovariances, Σ and \(\mu_i\) are estimated as the solutions
of the estimation equations
and
where i indexes the τ groups, \(n_i\), is the number of observations
in group i, \(f_{ij}\) is the frequency for the j‑th observation
in group i, \(w_{ij}\) is the observation weight specified in column
iwt of x, \(I_p\) is a p × p identity matrix,
\(w(r)\) and \(u(r)\) are the weighting functions, and where β is a constant computed by the program to make the expected weighted Mahalanobis distance (\(y^Ty\)) equal the expected Mahalanobis distance from a multivariate normal distribution (see Marazzi 1985). The constant β is described more fully below.
Function robustCovariances uses one of two algorithms for solving the
estimation equations. The first algorithm is discussed in detail in Huber
(1981) and is a variant of the conjugate gradient method. The second
algorithm is due to Stahel (1981) and is discussed in detail by Marazzi
(1985). In both algorithms, correction vectors \(T_{ki}\) for the group
i means and correction matrix \(W_k=I_p+U_k\) for the Cholesky
factorization of Σ are found such that the updated mean vectors are given by
and the updated matrix R is given as
where k is the iteration number and
When all elements of \(U_k\) and \(T_{ki}\) are less than ɛ =
tolerance, convergence is assumed.
Three methods for obtaining estimates are allowed. In the first method, the sample weighted estimate of Σ is computed. In the second method, estimates based upon the median and the interquartile range are used. Finally, in the last method, the user inputs initial estimates.
Function robustCovariances computes estimates based on the “minimax”
weights discussed above. The constant β is chosen such that
\(E(u(r)r_2)=\rho\beta\) where the expectation is with respect to a
standard p‑variate multivariate normal distribution. This yields
estimates with the correct expectation for the multivariate normal
distribution (for given mean vector). The expectation is computed via
integration of estimated spline function. 200 knots are used on an equally
spaced grid from 0.0 to the 99.999 percentile of
distribution. An error estimate is computed based upon 100 of these knots. If the estimated relative error is greater than 0.0001, a warning message is issued. If β is not computed accurately (i.e., if the warning message is issued), the computed estimates are still optimal, but the scale of the estimated covariance matrix may need to be multiplied by a constant in order for
to have the correct multivariate normal covariance expectation.
Examples¶
Example 1¶
The following example computes a robust variance-covariance matrix. The last column of the data set is the group indicator.
from numpy import *
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix
n_groups = 2
x = [[2.2, 5.6, 1],
[3.4, 2.3, 1],
[1.2, 7.8, 1],
[3.2, 2.1, 2],
[4.1, 1.6, 2],
[3.7, 2.2, 2]]
cov = robustCovariances(x, n_groups)
writeMatrix("Robust Covariance Matrix", cov,
colNumberZero=True, rowNumberZero=True)
Output¶
Robust Covariance Matrix
0 1
0 0.522 -1.160
1 -1.160 2.862
Example 2¶
The following example computes estimates of the pooled covariance matrix for
the Fisher’s iris data. For comparison, the estimates are first computed via
function pooledCovariances. Function
robustCovariances with percentage = 2.0 is then used to compute the
robust estimates. As can be seen from the output, the resulting estimates
are quite similar.
Next, three observations are made into outliers, and again, estimates are
computed using functions pooledCovariances and robustCovariances.
When outliers are present, the estimates of pooledCovariances are
adversely affected, while the estimates produced by robustCovariances
are close the estimates produced when no outliers are present.
from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.pooledCovariances import pooledCovariances
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix
nobs = 150
nvar = 4
n_groups = 3
percentage = 2.0
igrp = 0
ifrq = -1
iwt = -1
ind = (1, 2, 3, 4)
x = dataSets(3)
xIndices = {}
xIndices['igrp'] = igrp
xIndices['ind'] = ind
xIndices['ifrq'] = ifrq
xIndices['iwt'] = iwt
cov = pooledCovariances(nobs, nvar, x, n_groups,
xIndices=xIndices)
writeMatrix("Pooled Covariance with No Outliers", cov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
rbcov = robustCovariances(x, n_groups,
percentage=percentage,
xIndices=xIndices)
writeMatrix("Robust Covariance with No Outliers", rbcov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
# Add Outliers
x[0, 1] = 100.0
x[3, 4] = 100.0
x[99, 2] = -100.0
xIndices = {}
xIndices['igrp'] = igrp
xIndices['ind'] = ind
xIndices['ifrq'] = ifrq
xIndices['iwt'] = iwt
cov = pooledCovariances(nobs, nvar, x, n_groups,
xIndices=xIndices)
writeMatrix("Pooled Covariance with Outliers", cov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
rbcov = robustCovariances(x, n_groups,
percentage=percentage,
xIndices=xIndices)
writeMatrix("Robust Covariance with Outliers", rbcov,
colNumberZero=True, rowNumberZero=True, printUpper=True)
Output¶
Pooled Covariance with No Outliers
0 1 2 3
0 0.2650 0.0927 0.1675 0.0384
1 0.1154 0.0552 0.0327
2 0.1852 0.0427
3 0.0419
Robust Covariance with No Outliers
0 1 2 3
0 0.2474 0.0872 0.1535 0.0360
1 0.1073 0.0538 0.0322
2 0.1705 0.0412
3 0.0401
Pooled Covariance with Outliers
0 1 2 3
0 60.43 0.30 0.13 -1.56
1 70.53 0.17 -0.17
2 0.19 0.07
3 66.38
Robust Covariance with Outliers
0 1 2 3
0 0.2555 0.0876 0.1553 0.0359
1 0.1127 0.0545 0.0322
2 0.1723 0.0412
3 0.0424
Warning Errors¶
IMSLS_NO_CONVERGE_MAX_ITER |
Failure to converge within
“maxIterations” = #
iterations for at least one of the
“nroot” = # roots. |
Fatal Errors¶
IMSLS_BAD_GROUP_2 |
The group number for observation # is equal to #. It must be greater than or equal to one and less than or equal to #, the number of groups. |