robust_covariances

Chapter 3: Correlation and Covariance

robust_covariances

Computes a robust estimate of a covariance matrix and mean vector.

Synopsis

#include <imsls.h>

float *imsls_f_robust_covariances (int n_rows, int n_variables, float *x, int n_groups, ..., 0)

The type double function is imsls_d_robust_covariances.

Required Argument

int n_rows (Input)
Number of rows observations) in the input matrix x.

int n_variables (Input)
Number of variables to be used in computing the covariance matrix.

float *x (Input)
A n_rows by n_variables + 1 matrix containing the data. The first n_variables columns correspond to the variables, and the last column (column n_variables) must contain the group numbers.

int n_groups (Input)
Number of groups in the data.

Return Value

Matrix of size n_variables by n_variables containing the matrix of covariances.

Synopsis with Optional Arguments

#include <imsls.h>

float *imsls_f_robust_covariances (int n_rows, int n_variables, float x[], int n_groups,
IMSLS_X_COL_DIM, int x_col_dim,
IMSLS_X_INDICES, int igrp, int ind[], int ifrq, int iwt,
IMSLS_INITIAL_EST_MEAN,
IMSLS_INITIAL_EST_MEDIAN,
IMSLS_INITIAL_EST_INPUT, float input_means[], float input_cov[],
IMSLS_ESTIMATION_METHOD, int method,
IMSLS_PERCENTAGE, float percentage,
IMSLS_MAX_ITERATIONS, int maxit,
IMSLS_TOLERANCE, float tolerance,
IMSLS_MINIMAX_WEIGHTS, float *a, float *b, float *c,
IMSLS_GROUP_COUNTS, int **gcounts,
IMSLS_GROUP_COUNTS_USER, int gcounts[],
IMSLS_SUM_WEIGHTS, float **sum_weights,
IMSLS_SUM_WEIGHTS_USER, float sum_weights[],
IMSLS_MEANS, float **means,
IMSLS_MEANS_USER, float means[],
IMSLS_U, float **u,
IMSLS_U_USER, float u[],
IMSLS_BETA, float *beta,
IMSLS_N_ROWS_MISSING, int *nrmiss,
IMSLS_RETURN_USER, float c[],
0)

Optional Arguments

IMSLS_X_COL_DIM, int x_col_dim (Input)
Row/Column dimension of x.
Default: x_col_dim = n_variables + 1

IMSLS_X_INDICES, int igrp, int ind[], int ifrq, int iwt (Input)
Each of the four arguments contains indices indicating column numbers of x in which particular types of data are stored. Columns are numbered 0 … x_col_dim − 1.

Parameter igrp contains the index for the column of x in which the group numbers are stored.

Parameter ind contains the indices of the variables to be used in the analysis.

Parameters ifrq and iwt contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set ifrq = −1 if there will be no column for frequencies. Set iwt = −1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.

Defaults: igrp = n_variables, ind [ ] = 0, 1, …, n_variables − 1, ifrq = −1, and iwt = −1

IMSLS_INITIAL_EST_MEAN, or

IMSLS_INITIAL_EST_MEDIAN, or

IMSLS_INITIAL_EST_INPUT, float *input_mean, float *input_cov (Input)
If IMSLS_INITIAL_EST_MEAN is specified, initial estimates are obtained as the usual estimate of a mean vector and of a covariance matrix.

If IMSLS_INITIAL_EST_MEDIAN is specified, initial estimates are based upon the median and interquartile range are used.

If IMSLS_INITIAL_EST_INPUT is specified, the initial estimates are specified in arrays input_mean and input_cov. Argument input_mean is an array of size n_groups by n_variables, and input_cov is an array of size n_variables by n_variables.

Default: IMSLS_INITIAL_EST_MEAN

IMSLS_ESTIMATION_METHOD, int method (Input)
Option parameter giving the algorithm to be used in computing the estimates.

method	Method Used
0	Huber's conjugate-gradient algorithm is used.
1	Stahel's algorithm is used.

IMSLS_PERCENTAGE, float percentage (Input)
Percentage of gross errors expected in the data. Argument percentage must be in the range 0.0 to 100.0 and contains the percentage of outliers expected in the data. If the percentage of gross errors expected in the data is not known, a reasonable strategy is to choose a value of percentage that is such that larger values do not result in significant changes in the estimates.
Default: percentage = 5.0

IMSLS_MAX_ITERATIONS, int maxit (Input)
Maximum number of iterations.
Default: maxit = 30

IMSLS_TOLERANCE, float tolerance (Input)
Convergence criterion. When the maximum absolute change in a location or covariance estimate is less than tolerance, convergence is assumed.
Default: tolerance = 10−4

IMSLS_MINIMAX_WEIGHTS, float *a, float *b, float *c (Output)
Arguments a, b, and c contain the values for the parameters of the weighting function. See the “Description” section.

IMSLS_GROUP_COUNTS, int **gcounts (Output)
Address of a pointer to an integer array of length n_groups containing the number of observations in each group.

IMSLS_GROUP_COUNTS_USER, int gcounts[] (Output)
Storage for integer array gcounts is provided by the user. See IMSLS_GROUP_COUNTS.

IMSLS_SUM_WEIGHTS, float **sum_weights (Output)
Address of a pointer to an array of length n_groups containing the sum of the weights times the frequencies in the groups.

IMSLS_SUM_WEIGHTS_USER, float sum_weights[](Output)
Storage for array sum_weights is provided by the user. See IMSLS_SUM_WEIGHTS.

IMSLS_MEANS, float **means (Output)
Address of a pointer to an array of size n_groups by n_variables. The i-th row of means contains the group i variable means.

IMSLS_MEANS_USER, float means[] (Output)
Storage for array means is provided by the user. See IMSLS_MEANS.

IMSLS_U, float **u (Output)
Address of a pointer to an array of size n_variables by n_variables containing the lower matrix U, the lower triangular for the robust sample cross-products matrix. U is computed from the robust sample covariance matrix, S (See the “Description” section), as S = UTU.

IMSLS_U_USER, float u[] (Output)
Storage for array u is provided by the user. See IMSLS_U.

IMSLS_BETA, float *beta (Output)
Argument beta contains the constant used to ensure that the estimated covariance matrix has unbiased expectation (for a given mean vector) for a multivariate normal density.

IMSLS_N_ROWS_MISSING, int *nrmiss (Output)
Number of rows of data encountered in calls to robust_covariances containing missing values (NaN) for any of the variables used.

IMSLS_RETURN_USER, float c[] (Output)
If specified, c returns the covariance matrix. Storage for array c is provided by the user.

Description

Function imsls_f_robust_covariances computes robust M-estimates of the mean and covariance matrix from a matrix of observations. A pooled estimate of the covariance matrix is computed when multiple groups are present in the input data. M-estimate weights are obtained using the “minimax” weights of Huber (1981, pp. 231-235), with percentage expected gross errors. Huber's (1981) weighting equations are given by:

User specified observation weights and frequencies may be given for each row in x. Listwise deletion of missing values is assumed so that all observations used are “complete”.

Let f (x;μi, Σ) denote the density of an observation p-vector x in population (group) i with mean vector μi, for i = 1, …, τ. Let the covariance matrix Σ be such that Σ = RTR. If

y = R−T (x − μi)

then

It is assumed that g(y) is a spherically symmetric density in p-dimensions.

In imsls_f_robust_covariances, Σ and μi are estimated as the solutions

of the estimation equations

and

where i indexes the τ groups, ni, is the number of observations in group i, fij is the frequency for the j-th observation in group i, wij is the observation weight specified in column iwt of x, Ip is a p × p identity matrix,

w(r) and u(r) are the weighting functions, and where β is a constant computed by the program to make the expected weighted Mahalanobis distance (yTy) equal the expected Mahalanobis distance from a multivariate normal distribution (see Marazzi 1985). The constant β is described more fully below.

Function imsls_f_robust_covariances uses one of two algorithms for solving the estimation equations. The first algorithm is discussed in detail in Huber (1981) and is a variant of the conjugate gradient method. The second algorithm is due to Stahel (1981) and is discussed in detail by Marazzi (1985). In both algorithms, correction vectors Tki for the group i means and correction matrix Wk = Ip + Uk for the Cholesky factorization of Σ are found such that the updated mean vectors are given by

and the updated matrix R is given as

where k is the iteration number and

When all elements of Uk and Tki are less than ɛ = tolerance, convergence is assumed.

Three methods for obtaining estimates are allowed. In the first method, the sample weighted estimate of Σ is computed. In the second method, estimates based upon the median and the interquartile range are used. Finally, in the last method, the user inputs initial estimates.

Function imsls_f_robust_covariances computes estimates based on the “minimax” weights discussed above. The constant β is chosen such that E
(u(r)r2) = ρβ where the expectation is with respect to a standard p-variate multivariate normal distribution. This yields estimates with the correct expectation for the multivariate normal distribution (for given mean vector). The expectation is computed via integration of estimated spline function. 200 knots are used on an equally apaced grid from 0.0 to the 99.999 percentile of

distribution. An error estimate is computed based upon 100 of these knots. If the estimated relative error is greater than 0.0001, a warning message is issued. If β is not computed accurately (i.e., if the warning message is issued), the computed esimates are still optimal, but the scale of the estimated covariance matrix may need to be multiplied by a constant in order for

to have the correct multivariate normal covariance expectation.

Examples

Example 1

The following example computes a robust variance-covariance matrix. The last column of the data set is the group indicator.

#include <imsls.h>

#include <stdlib.h>

int main()

{

int nobs = 6;

int nvar = 2;

int n_groups = 2;

float *cov;

float x[18] = {

2.2, 5.6, 1,

3.4, 2.3, 1,

1.2, 7.8, 1,

3.2, 2.1, 2,

4.1, 1.6, 2,

3.7, 2.2, 2};

cov = imsls_f_robust_covariances(nobs, nvar, x, n_groups, 0);

imsls_f_write_matrix("Robust Covariance Matrix", nvar, nvar, cov,

IMSLS_COL_NUMBER_ZERO,

IMSLS_ROW_NUMBER_ZERO, 0);

imsls_free(cov);

}

Output

Robust Covariance Matrix

0 1

0 0.522 -1.160

1 -1.160 2.862

Example 2

The following example computes estimates of the pooled covariance matrix for the Fisher's iris data. For comparison, the estimates are first computed via function imsls_f_pooled_covariances. Function imsls_f_robust_covariances with percentage = 2.0 is then used to compute the robust estimates. As can be seen from the output, the resulting estimates are quite similar.

Next, three observations are made into outliers, and again, estimates are computed using functions imsls_f_pooled_covariances and imsls_f_robust_covariances. When outliers are present, the estimates of imsls_f_pooled_covariances are adversely affected, while the estimates produced by imsls_f_robust_covariances are close the estimates produced when no outliers are present.

include <imsls.h>

#include <stdlib.h>

int main()

{

int nobs = 150;

int nvar = 4;

int n_groups = 3;

float percentage = 2.0;

int igrp = 0;

int ifrq = -1;

int iwt = -1;

int ind[4] = {1, 2, 3, 4};

float *x, cov[16], rbcov[16];

x = imsls_f_data_sets(3, 0);

imsls_f_pooled_covariances(nobs, nvar, x, n_groups,

IMSLS_RETURN_USER, cov,

IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);

imsls_f_write_matrix("Pooled Covariance with No Outliers", nvar, nvar,
cov,

IMSLS_COL_NUMBER_ZERO,

IMSLS_ROW_NUMBER_ZERO,

IMSLS_PRINT_UPPER, 0);

imsls_f_robust_covariances(nobs, nvar, x, n_groups,

IMSLS_RETURN_USER, rbcov,

IMSLS_PERCENTAGE, percentage,

IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);

imsls_f_write_matrix("Robust Covariance with No Outliers", nvar, nvar,
rbcov,

IMSLS_COL_NUMBER_ZERO,

IMSLS_ROW_NUMBER_ZERO,

IMSLS_PRINT_UPPER, 0);

/* Add Outliers */

x[1] = 100.0;

x[19] = 100.0;

x[497] = -100.0;

imsls_f_pooled_covariances(nobs, nvar, x, n_groups,

IMSLS_RETURN_USER, cov,

IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);

imsls_f_write_matrix("Pooled Covariance with Outliers", nvar, nvar,
cov,

IMSLS_COL_NUMBER_ZERO,

IMSLS_ROW_NUMBER_ZERO,

IMSLS_PRINT_UPPER, 0);

imsls_f_robust_covariances(nobs, nvar, x, n_groups,

IMSLS_RETURN_USER, rbcov,

IMSLS_PERCENTAGE, percentage,

IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);

imsls_f_write_matrix("Robust Covariance with Outliers", nvar, nvar,
rbcov,

IMSLS_COL_NUMBER_ZERO,

IMSLS_ROW_NUMBER_ZERO,

IMSLS_PRINT_UPPER, 0);

imsls_free(x);

}

Output

Pooled Covariance with No Outliers

0 1 2 3

0 0.2650 0.0927 0.1675 0.0384

1 0.1154 0.0552 0.0327

2 0.1852 0.0427

3 0.0419

Robust Covariance with No Outliers

0 1 2 3

0 0.2474 0.0872 0.1535 0.0360

1 0.1073 0.0538 0.0322

2 0.1705 0.0412

3 0.0401

Pooled Covariance with Outliers

0 1 2 3

0 60.43 0.30 0.13 -1.56

1 70.53 0.17 -0.17

2 0.19 0.07

3 66.38

Robust Covariance with Outliers

0 1 2 3

0 0.2555 0.0876 0.1553 0.0359

1 0.1127 0.0545 0.0322

2 0.1723 0.0412

3 0.0424

Warning Errors

IMSLS_NO_CONVERGE_MAX_ITER Failure to converge within “maxit” = # iterations for at least one of the “nroot” = # roots.

Fatal Errors

IMSLS_BAD_GROUP_2 The group number for observation # is equal to #. It must be greater than or equal to one and less than or equal to #, the number of groups.

Visual Numerics, Inc.
Visual Numerics - Developers of IMSL and PV-WAVE
http://www.vni.com/
PHONE: 713.784.3131
FAX:713.781.9260