robustCovariances

Computes a robust estimate of a covariance matrix and mean vector.

Synopsis

robustCovariances (x, nGroups)

Required Argument

float x (Input)
An nRows by (nVariables + 1) matrix containing the data. The first nVariables columns correspond to the variables, and the last column (column nVariables) must contain the group numbers.
int nGroups (Input)
Number of groups in the data.

Return Value

Matrix of size nVariables by nVariables containing the matrix of covariances.

Optional Arguments

xIndices, int igrp, int ind[], int ifrq, int iwt (Input)

Each of the four arguments contains indices indicating column numbers of x in which particular types of data are stored.

Parameter igrp contains the index for the column of x in which the group numbers are stored.

Parameter ind contains the indices of the variables to be used in the analysis.

Parameters ifrq and iwt contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set ifrq = −1 if there will be no column for frequencies. Set iwt = −1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.

Defaults: igrp = nVariables, ind [ ] = 0, 1, …, nVariables − 1, ifrq = −1, and iwt = −1

initialEstMean (Input)

or

initialEstMedian (Input)

or

initialEstInput, float inputMean, float inputCov (Input)

If initialEstMean is specified, the usual sample estimates of the mean vector and covariance matrix are used as the initial estimates.

If initialEstMedian is specified, initial estimates based upon the median and interquartile range are used.

If initialEstInput is specified, the initial estimates are supplied in arrays inputMean and inputCov. Argument inputMean is an array of size nGroups by nVariables, and inputCov is an array of size nVariables by nVariables.

Default: initialEstMean

estimationMethod, int (Input)
Parameter specifying the algorithm to be used in computing the estimates.

estimationMethod   Method Used
0                  Huber’s conjugate-gradient algorithm is used.
1                  Stahel’s algorithm is used.
percentage, float (Input)

Percentage of gross errors (outliers) expected in the data. Argument percentage must be in the range 0.0 to 100.0. If the percentage of gross errors is not known, a reasonable strategy is to choose a value large enough that further increases do not result in significant changes in the estimates (see the sketch following these optional arguments).

Default: percentage = 5.0

maxIterations, int (Input)

Maximum number of iterations.

Default: maxIterations = 30

tolerance, float (Input)

Convergence criterion. When the maximum absolute change in a location or covariance estimate is less than tolerance, convergence is assumed.

Default: tolerance = \(10^{-4}\)

minimaxWeights, float a, float b, float c (Output)
Arguments a, b, and c contain the values for the parameters of the weighting function. See the Description section.
groupCounts (Output)
An integer array of length nGroups containing the number of observations in each group.
sumWeights (Output)
An array of length nGroups containing the sum of the weights times the frequencies in the groups.
means (Output)
An array of size nGroups by nVariables. The i-th row of means contains the group i variable means.
u (Output)
An array of size nVariables by nVariables containing the lower triangular matrix U, the triangular factor of the robust sample cross-products matrix. U is computed from the robust sample covariance matrix, S (see the Description section), as \(S=U^TU\).
beta (Output)
Argument beta contains the constant used to ensure that the estimated covariance matrix has unbiased expectation (for a given mean vector) for a multivariate normal density.
nRowsMissing (Output)
Number of rows of data encountered in calls to robustCovariances containing missing values (NaN) for any of the variables used.
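
A quick way to apply the strategy described under percentage is to recompute the estimate for a few candidate values and compare. The following sketch is illustrative only; it reuses the small data set from Example 1 below together with the percentage keyword demonstrated in Example 2.

import numpy as np
from pyimsl.stat.robustCovariances import robustCovariances

# Data laid out as in Example 1: variables in the first two columns,
# group numbers in the last column.
x = [[2.2, 5.6, 1],
     [3.4, 2.3, 1],
     [1.2, 7.8, 1],
     [3.2, 2.1, 2],
     [4.1, 1.6, 2],
     [3.7, 2.2, 2]]
n_groups = 2

# Recompute the robust covariance for increasing expected gross-error
# percentages; once further increases change the estimate very little,
# that percentage is a reasonable choice.
previous = None
for pct in (2.0, 5.0, 10.0, 20.0):
    cov = np.array(robustCovariances(x, n_groups, percentage=pct))
    if previous is not None:
        print("percentage = %5.1f, max change = %.4g"
              % (pct, np.max(np.abs(cov - previous))))
    previous = cov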

Description

Function robustCovariances computes robust M-estimates of the mean and covariance matrix from a matrix of observations. A pooled estimate of the covariance matrix is computed when multiple groups are present in the input data. M-estimate weights are obtained using the “minimax” weights of Huber (1981, pp. 231-235), with percentage expected gross errors. Huber’s (1981) weighting equations are given by:

\[\begin{split}\begin{array}{l} u(r) = \begin{cases} \tfrac{a^2}{r^2} & r < a \\ 1 & a \leq r \leq b \\ \tfrac{b^2}{r^2} & r > b\\ \end{cases} \\ \\ w(r) = \min\left(1, \tfrac{c}{r}\right) \end{array}\end{split}\]
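For reference, these weighting functions translate directly into code. The following sketch simply evaluates u(r) and w(r) for given cutoffs a, b, and c (the values the routine itself selects and returns through minimaxWeights); it illustrates the formulas and is not the library’s implementation.

def u(r, a, b):
    # Covariance weight: a^2/r^2 below a, 1 on [a, b], b^2/r^2 above b.
    if r < a:
        return (a / r) ** 2
    if r > b:
        return (b / r) ** 2
    return 1.0

def w(r, c):
    # Location weight: full weight up to c, then downweighted as c/r.
    return min(1.0, c / r)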

User specified observation weights and frequencies may be given for each row in x. Listwise deletion of missing values is assumed so that all observations used are “complete”.

Let \(f(x;\mu_i,\Sigma)\) denote the density of an observation p-vector x in population (group) i with mean vector \(\mu_i\), for \(i= 1,\ldots,\tau\). Let the covariance matrix Σ be such that \(\Sigma=R^T R\). If

\[y = R^{-T} \left(x - \mu_i\right)\]

then

\[g(y) = |\Sigma|^{1/2} f\left(R^Ty + \mu_i; \mu_i, \Sigma\right)\]

It is assumed that \(g(y)\) is a spherically symmetric density in p-dimensions.

In robustCovariances, Σ and \(\mu_i\) are estimated as the solutions

\[\left(\hat{\Sigma}, \hat{\mu}_i\right)\]

of the estimation equations

\[\tfrac{1}{n} \sum_{j=1}^{n_i} f_{ij} w_{ij} w\left(r_{ij}\right) y_{ij} = 0\]

and

\[\tfrac{1}{n} \sum_{i=1}^{\tau} \sum_{j=1}^{n_i} f_{ij} w_{ij} \left[u\left(r_{ij}\right) y_{ij} y_{ij}^{T} - \beta I_p\right] = 0\]

where i indexes the τ groups, \(n_i\) is the number of observations in group i, \(f_{ij}\) is the frequency for the j‑th observation in group i, \(w_{ij}\) is the observation weight specified in column iwt of x, \(I_p\) is a p × p identity matrix,

\[r_{ij} = \sqrt{y_{ij}^{T} y_{ij}}\]

\(w(r)\) and \(u(r)\) are the weighting functions, and where β is a constant computed by the program to make the expected weighted Mahalanobis distance (\(y^Ty\)) equal the expected Mahalanobis distance from a multivariate normal distribution (see Marazzi 1985). The constant β is described more fully below.
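
In numpy terms, the standardized residual y and the distance r for a single observation follow directly from these definitions. The sketch below is illustrative; R and mu_i denote the current factor of Σ and the current group mean, which are internal quantities of the algorithm.

import numpy as np

def standardized_residual(x_row, mu_i, R):
    # y = R^(-T) (x - mu_i): solve R^T y = x - mu_i rather than inverting R.
    y = np.linalg.solve(R.T, x_row - mu_i)
    r = np.sqrt(y @ y)  # r = sqrt(y^T y)
    return y, r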

Function robustCovariances uses one of two algorithms for solving the estimation equations. The first algorithm is discussed in detail in Huber (1981) and is a variant of the conjugate gradient method. The second algorithm is due to Stahel (1981) and is discussed in detail by Marazzi (1985). In both algorithms, correction vectors \(T_{ki}\) for the group i means and correction matrix \(W_k=I_p+U_k\) for the Cholesky factorization of Σ are found such that the updated mean vectors are given by

\[\hat{\mu}_{i,k+1} = \hat{\mu}_{i,k} + T_{ki}\]

and the updated matrix R is given as

\[\hat{R}_{k+1} = W_k \hat{R}_k\]

where k is the iteration number and

\[\hat{\Sigma}_k = R_k^T R_k\]

When all elements of \(U_k\) and \(T_{ki}\) are less than ε = tolerance, convergence is assumed.
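
Schematically, one iteration of either algorithm applies these corrections and tests for convergence as shown below. The corrections T_ki and U_k are produced by the respective algorithm and are taken as given here; the sketch covers only the update and the convergence test, not the computation of the corrections.

import numpy as np

def apply_corrections(mu, R, T, U, tolerance):
    # mu: sequence of group mean vectors; T: their correction vectors.
    # R: current factor of Sigma (Sigma = R^T R); U: its correction matrix.
    mu_new = [m + t for m, t in zip(mu, T)]
    W = np.eye(R.shape[0]) + U     # W_k = I_p + U_k
    R_new = W @ R                  # R_{k+1} = W_k R_k
    sigma_new = R_new.T @ R_new    # Sigma_{k+1} = R_{k+1}^T R_{k+1}
    converged = (np.max(np.abs(U)) < tolerance and
                 all(np.max(np.abs(t)) < tolerance for t in T))
    return mu_new, R_new, sigma_new, converged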

Three methods for obtaining the initial estimates are allowed. In the first method, the sample weighted estimate of Σ is computed. In the second method, estimates based upon the median and the interquartile range are used. In the third method, the user supplies the initial estimates.

Function robustCovariances computes estimates based on the “minimax” weights discussed above. The constant β is chosen such that \(E\left(u(r)r^2\right)=p\beta\), where the expectation is taken with respect to a standard p‑variate multivariate normal distribution. This yields estimates with the correct expectation for the multivariate normal distribution (for a given mean vector). The expectation is computed via integration of an estimated spline function. 200 knots are used on an equally spaced grid from 0.0 to the 99.999th percentile of the

\[\chi_p^2\]

distribution. An error estimate is computed based upon 100 of these knots. If the estimated relative error is greater than 0.0001, a warning message is issued. If β is not computed accurately (i.e., if the warning message is issued), the computed estimates are still optimal, but the scale of the estimated covariance matrix may need to be multiplied by a constant in order for

\[\hat{\Sigma}\]

to have the correct multivariate normal covariance expectation.
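
The defining relation \(E\left(u(r)r^2\right)=p\beta\) can also be checked numerically: under a standard p‑variate normal distribution, \(r^2=y^Ty\) has a chi-squared distribution with p degrees of freedom. The sketch below estimates β by ordinary quadrature from given cutoffs a and b (for example, the values returned in minimaxWeights). It uses scipy purely for illustration; the library itself uses the spline-based integration described above.

from math import sqrt
from scipy.integrate import quad
from scipy.stats import chi2

def beta_check(a, b, p):
    # Estimate beta from E[u(r) r^2] = p * beta, where r^2 = t follows a
    # chi-squared distribution with p degrees of freedom.
    def integrand(t):
        if t <= 0.0:
            return 0.0
        r = sqrt(t)
        if r < a:
            u = (a / r) ** 2
        elif r > b:
            u = (b / r) ** 2
        else:
            u = 1.0
        return u * t * chi2.pdf(t, p)

    upper = chi2.ppf(0.99999, p)  # mirror the 99.999th percentile cutoff
    expectation, _ = quad(integrand, 0.0, upper)
    return expectation / p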

Examples

Example 1

The following example computes a robust variance-covariance matrix. The last column of the data set is the group indicator.

from numpy import *
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix

n_groups = 2
x = [[2.2, 5.6, 1],
     [3.4, 2.3, 1],
     [1.2, 7.8, 1],
     [3.2, 2.1, 2],
     [4.1, 1.6, 2],
     [3.7, 2.2, 2]]

cov = robustCovariances(x, n_groups)

writeMatrix("Robust Covariance Matrix", cov,
            colNumberZero=True, rowNumberZero=True)

Output

 
 Robust Covariance Matrix
             0            1
0        0.522       -1.160
1       -1.160        2.862

Example 2

The following example computes estimates of the pooled covariance matrix for the Fisher’s iris data. For comparison, the estimates are first computed via function pooledCovariances. Function robustCovariances with percentage = 2.0 is then used to compute the robust estimates. As can be seen from the output, the resulting estimates are quite similar.

Next, three observations are made into outliers, and again, estimates are computed using functions pooledCovariances and robustCovariances. When outliers are present, the estimates of pooledCovariances are adversely affected, while the estimates produced by robustCovariances are close to the estimates produced when no outliers are present.

from numpy import *
from pyimsl.stat.dataSets import dataSets
from pyimsl.stat.pooledCovariances import pooledCovariances
from pyimsl.stat.robustCovariances import robustCovariances
from pyimsl.stat.writeMatrix import writeMatrix

nobs = 150
nvar = 4
n_groups = 3
percentage = 2.0
igrp = 0
ifrq = -1
iwt = -1
ind = (1, 2, 3, 4)
x = dataSets(3)
xIndices = {}
xIndices['igrp'] = igrp
xIndices['ind'] = ind
xIndices['ifrq'] = ifrq
xIndices['iwt'] = iwt

cov = pooledCovariances(nobs, nvar, x, n_groups,
                        xIndices=xIndices)

writeMatrix("Pooled Covariance with No Outliers", cov,
            colNumberZero=True, rowNumberZero=True, printUpper=True)

rbcov = robustCovariances(x, n_groups,
                          percentage=percentage,
                          xIndices=xIndices)

writeMatrix("Robust Covariance with No Outliers", rbcov,
            colNumberZero=True, rowNumberZero=True, printUpper=True)

# Add Outliers
x[0, 1] = 100.0
x[3, 4] = 100.0
x[99, 2] = -100.0

cov = pooledCovariances(nobs, nvar, x, n_groups,
                        xIndices=xIndices)

writeMatrix("Pooled Covariance with Outliers", cov,
            colNumberZero=True, rowNumberZero=True, printUpper=True)

rbcov = robustCovariances(x, n_groups,
                          percentage=percentage,
                          xIndices=xIndices)

writeMatrix("Robust Covariance with Outliers", rbcov,
            colNumberZero=True, rowNumberZero=True, printUpper=True)

Output

 
         Pooled Covariance with No Outliers
             0            1            2            3
0       0.2650       0.0927       0.1675       0.0384
1                    0.1154       0.0552       0.0327
2                                 0.1852       0.0427
3                                              0.0419
 
         Robust Covariance with No Outliers
             0            1            2            3
0       0.2474       0.0872       0.1535       0.0360
1                    0.1073       0.0538       0.0322
2                                 0.1705       0.0412
3                                              0.0401
 
           Pooled Covariance with Outliers
             0            1            2            3
0        60.43         0.30         0.13        -1.56
1                     70.53         0.17        -0.17
2                                   0.19         0.07
3                                               66.38
 
           Robust Covariance with Outliers
             0            1            2            3
0       0.2555       0.0876       0.1553       0.0359
1                    0.1127       0.0545       0.0322
2                                 0.1723       0.0412
3                                              0.0424

Warning Errors

IMSLS_NO_CONVERGE_MAX_ITER Failure to converge within “maxIterations” = # iterations for at least one of the “nroot” = # roots.

Fatal Errors

IMSLS_BAD_GROUP_2 The group number for observation # is equal to #. It must be greater than or equal to one and less than or equal to #, the number of groups.