Computes a robust estimate of a covariance matrix and mean vector.
#include <imsls.h>
float *imsls_f_robust_covariances (int n_rows, int n_variables, float *x, int n_groups, ..., 0)
The type double function is imsls_d_robust_covariances.
int n_rows
(Input)
Number of rows observations) in the input matrix x.
int
n_variables (Input)
Number of variables to be used in
computing the covariance matrix.
float *x
(Input)
A n_rows by n_variables + 1
matrix containing the data. The first n_variables columns
correspond to the variables, and the last column (column n_variables) must
contain the group numbers.
int n_groups
(Input)
Number of groups in the data.
Matrix of size n_variables by n_variables containing the matrix of covariances.
#include <imsls.h>
float
*imsls_f_robust_covariances (int n_rows, int n_variables,
float
x[], int n_groups,
IMSLS_X_COL_DIM, int x_col_dim,
IMSLS_X_INDICES, int igrp, int ind[], int ifrq, int iwt,
IMSLS_INITIAL_EST_MEAN,
IMSLS_INITIAL_EST_MEDIAN,
IMSLS_INITIAL_EST_INPUT,
float
input_means[],
float input_cov[],
IMSLS_ESTIMATION_METHOD, int method,
IMSLS_PERCENTAGE,
float
percentage,
IMSLS_MAX_ITERATIONS, int maxit,
IMSLS_TOLERANCE,
float
tolerance,
IMSLS_MINIMAX_WEIGHTS,
float *a,
float *b, float
*c,
IMSLS_GROUP_COUNTS, int **gcounts,
IMSLS_GROUP_COUNTS_USER, int gcounts[],
IMSLS_SUM_WEIGHTS,
float
**sum_weights,
IMSLS_SUM_WEIGHTS_USER,
float
sum_weights[],
IMSLS_MEANS,
float
**means,
IMSLS_MEANS_USER, float means[],
IMSLS_U,
float **u,
IMSLS_U_USER, float u[],
IMSLS_BETA,
float
*beta,
IMSLS_N_ROWS_MISSING, int *nrmiss,
IMSLS_RETURN_USER,
float c[],
0)
IMSLS_X_COL_DIM, int x_col_dim
(Input)
Row/Column dimension of x.
Default: x_col_dim = n_variables + 1
IMSLS_X_INDICES, int igrp, int ind[], int ifrq, int iwt
(Input)
Each of the four arguments contains indices indicating column numbers
of x in which
particular types of data are stored. Columns are numbered 0 … x_col_dim − 1.
Parameter igrp contains the index for the column of x in which the group numbers are stored.
Parameter ind contains the indices of the variables to be used in the analysis.
Parameters ifrq and iwt contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set ifrq = −1 if there will be no column for frequencies. Set iwt = −1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.
Defaults: igrp = n_variables, ind [ ] = 0, 1, …, n_variables − 1, ifrq = −1, and iwt = −1
IMSLS_INITIAL_EST_MEAN, or
IMSLS_INITIAL_EST_MEDIAN, or
IMSLS_INITIAL_EST_INPUT, float *input_mean, float
*input_cov (Input)
If IMSLS_INITIAL_EST_MEAN
is specified, initial estimates are obtained as the usual estimate of a mean
vector and of a covariance matrix.
If IMSLS_INITIAL_EST_MEDIAN is specified, initial estimates are based upon the median and interquartile range are used.
If IMSLS_INITIAL_EST_INPUT is specified, the initial estimates are specified in arrays input_mean and input_cov. Argument input_mean is an array of size n_groups by n_variables, and input_cov is an array of size n_variables by n_variables.
Default: IMSLS_INITIAL_EST_MEAN
IMSLS_ESTIMATION_METHOD, int method
(Input)
Option parameter giving the algorithm to be used in computing the
estimates.
method |
Method Used |
0 |
Huber's conjugate-gradient algorithm is used. |
1 |
Stahel's algorithm is used. |
IMSLS_PERCENTAGE, float
percentage (Input)
Percentage of gross errors expected in
the data. Argument percentage must be in
the range 0.0 to 100.0 and contains the percentage of outliers expected in the
data. If the percentage of gross errors expected in the data is not known, a
reasonable strategy is to choose a value of percentage that is
such that larger values do not result in significant changes in the
estimates.
Default: percentage = 5.0
IMSLS_MAX_ITERATIONS, int maxit
(Input)
Maximum number of iterations.
Default: maxit = 30
IMSLS_TOLERANCE, float tolerance
(Input)
Convergence criterion. When the maximum absolute change in a location
or covariance estimate is less than tolerance, convergence
is assumed.
Default: tolerance = 10−4
IMSLS_MINIMAX_WEIGHTS, float *a, float *b, float *c
(Output)
Arguments a, b, and c contain the values
for the parameters of the weighting function. See the “Description”
section.
IMSLS_GROUP_COUNTS, int **gcounts
(Output)
Address of a pointer to an integer array of length n_groups containing
the number of observations in each group.
IMSLS_GROUP_COUNTS_USER, int gcounts[]
(Output)
Storage for integer array gcounts is provided by
the user. See IMSLS_GROUP_COUNTS.
IMSLS_SUM_WEIGHTS, float
**sum_weights (Output)
Address of a pointer to an array of
length n_groups
containing the sum of the weights times the frequencies in the
groups.
IMSLS_SUM_WEIGHTS_USER, float
sum_weights[](Output)
Storage for array sum_weights is
provided by the user. See IMSLS_SUM_WEIGHTS.
IMSLS_MEANS, float **means
(Output)
Address of a pointer to an array of size n_groups by n_variables. The
i-th row of means contains the
group i variable means.
IMSLS_MEANS_USER, float means[]
(Output)
Storage for array
means is provided by the user. See IMSLS_MEANS.
IMSLS_U, float **u
(Output)
Address of a
pointer to an array of size n_variables by n_variables containing
the lower matrix U, the lower triangular for the robust sample
cross-products matrix. U is computed from the robust sample covariance
matrix, S (See the “Description” section),
as S = UTU.
IMSLS_U_USER, float u[]
(Output)
Storage for array u is provided by the
user. See IMSLS_U.
IMSLS_BETA, float *beta
(Output)
Argument beta contains the
constant used to ensure that the estimated covariance matrix has unbiased
expectation (for a given mean vector) for a multivariate normal
density.
IMSLS_N_ROWS_MISSING, int *nrmiss
(Output)
Number of rows of data encountered in calls to robust_covariances
containing missing values (NaN) for any of the variables used.
IMSLS_RETURN_USER, float c[]
(Output)
If specified, c returns the
covariance matrix. Storage for array c is provided by the
user.
Function imsls_f_robust_covariances computes robust M-estimates of the mean and covariance matrix from a matrix of observations. A pooled estimate of the covariance matrix is computed when multiple groups are present in the input data. M-estimate weights are obtained using the “minimax” weights of Huber (1981, pp. 231-235), with percentage expected gross errors. Huber's (1981) weighting equations are given by:
User specified observation weights and frequencies may be given for each row in x. Listwise deletion of missing values is assumed so that all observations used are “complete”.
Let f (x;μi, Σ) denote the density of an observation p-vector x in population (group) i with mean vector μi, for i = 1, …, τ. Let the covariance matrix Σ be such that Σ = RTR. If
y = R−T (x − μi)
then
It is assumed that g(y) is a spherically symmetric density in p-dimensions.
In imsls_f_robust_covariances, Σ and μi are estimated as the solutions
of the estimation equations
and
where i indexes the τ groups, ni, is the number of observations in group i, fij is the frequency for the j-th observation in group i, wij is the observation weight specified in column iwt of x, Ip is a p × p identity matrix,
w(r) and u(r) are the weighting functions, and where β is a constant computed by the program to make the expected weighted Mahalanobis distance (yTy) equal the expected Mahalanobis distance from a multivariate normal distribution (see Marazzi 1985). The constant β is described more fully below.
Function imsls_f_robust_covariances uses one of two algorithms for solving the estimation equations. The first algorithm is discussed in detail in Huber (1981) and is a variant of the conjugate gradient method. The second algorithm is due to Stahel (1981) and is discussed in detail by Marazzi (1985). In both algorithms, correction vectors Tki for the group i means and correction matrix Wk = Ip + Uk for the Cholesky factorization of Σ are found such that the updated mean vectors are given by
and the updated matrix R is given as
where k is the iteration number and
When all elements of Uk and Tki are less than ɛ = tolerance, convergence is assumed.
Three methods for obtaining estimates are allowed. In the first method, the sample weighted estimate of Σ is computed. In the second method, estimates based upon the median and the interquartile range are used. Finally, in the last method, the user inputs initial estimates.
Function imsls_f_robust_covariances
computes estimates based on the “minimax” weights discussed above. The constant
β is chosen such that E
(u(r)r2) = ρβ where the expectation is
with respect to a standard p-variate multivariate normal distribution.
This yields estimates with the correct expectation for the multivariate normal
distribution (for given mean vector). The expectation is computed via
integration of estimated spline function. 200 knots are used on an equally
apaced grid from 0.0 to the 99.999 percentile of
distribution. An error estimate is computed based upon 100 of these knots. If the estimated relative error is greater than 0.0001, a warning message is issued. If β is not computed accurately (i.e., if the warning message is issued), the computed esimates are still optimal, but the scale of the estimated covariance matrix may need to be multiplied by a constant in order for
to have the correct multivariate normal covariance expectation.
The following example computes a robust variance-covariance matrix. The last column of the data set is the group indicator.
#include <imsls.h>
#include <stdlib.h>
int main()
{
int nobs = 6;
int nvar = 2;
int n_groups = 2;
float *cov;
float x[18] = {
2.2, 5.6, 1,
3.4, 2.3, 1,
1.2, 7.8, 1,
3.2, 2.1, 2,
4.1, 1.6, 2,
3.7, 2.2, 2};
cov = imsls_f_robust_covariances(nobs, nvar, x, n_groups, 0);
imsls_f_write_matrix("Robust Covariance Matrix", nvar, nvar, cov,
IMSLS_COL_NUMBER_ZERO,
IMSLS_ROW_NUMBER_ZERO, 0);
imsls_free(cov);
}
Robust Covariance Matrix
0 1
0 0.522 -1.160
1 -1.160 2.862
The following example computes estimates of the pooled covariance matrix for the Fisher's iris data. For comparison, the estimates are first computed via function imsls_f_pooled_covariances. Function imsls_f_robust_covariances with percentage = 2.0 is then used to compute the robust estimates. As can be seen from the output, the resulting estimates are quite similar.
Next, three observations are made into outliers, and again, estimates are computed using functions imsls_f_pooled_covariances and imsls_f_robust_covariances. When outliers are present, the estimates of imsls_f_pooled_covariances are adversely affected, while the estimates produced by imsls_f_robust_covariances are close the estimates produced when no outliers are present.
include <imsls.h>
#include <stdlib.h>
int main()
{
int nobs = 150;
int nvar = 4;
int n_groups = 3;
float percentage = 2.0;
int igrp = 0;
int ifrq = -1;
int iwt = -1;
int ind[4] = {1, 2, 3, 4};
float *x, cov[16], rbcov[16];
x = imsls_f_data_sets(3, 0);
imsls_f_pooled_covariances(nobs, nvar, x, n_groups,
IMSLS_RETURN_USER, cov,
IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);
imsls_f_write_matrix("Pooled Covariance
with No Outliers", nvar, nvar,
cov,
IMSLS_COL_NUMBER_ZERO,
IMSLS_ROW_NUMBER_ZERO,
IMSLS_PRINT_UPPER, 0);
imsls_f_robust_covariances(nobs, nvar, x, n_groups,
IMSLS_RETURN_USER, rbcov,
IMSLS_PERCENTAGE, percentage,
IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);
imsls_f_write_matrix("Robust Covariance
with No Outliers", nvar, nvar,
rbcov,
IMSLS_COL_NUMBER_ZERO,
IMSLS_ROW_NUMBER_ZERO,
IMSLS_PRINT_UPPER, 0);
/* Add Outliers */
x[1] = 100.0;
x[19] = 100.0;
x[497] = -100.0;
imsls_f_pooled_covariances(nobs, nvar, x, n_groups,
IMSLS_RETURN_USER, cov,
IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);
imsls_f_write_matrix("Pooled Covariance
with Outliers", nvar, nvar,
cov,
IMSLS_COL_NUMBER_ZERO,
IMSLS_ROW_NUMBER_ZERO,
IMSLS_PRINT_UPPER, 0);
imsls_f_robust_covariances(nobs, nvar, x, n_groups,
IMSLS_RETURN_USER, rbcov,
IMSLS_PERCENTAGE, percentage,
IMSLS_X_INDICES, igrp, ind, ifrq, iwt, 0);
imsls_f_write_matrix("Robust Covariance
with Outliers", nvar, nvar,
rbcov,
IMSLS_COL_NUMBER_ZERO,
IMSLS_ROW_NUMBER_ZERO,
IMSLS_PRINT_UPPER, 0);
imsls_free(x);
}
Pooled Covariance with No Outliers
0 1 2 3
0 0.2650 0.0927 0.1675 0.0384
1 0.1154 0.0552 0.0327
2 0.1852 0.0427
3 0.0419
Robust Covariance with No Outliers
0 1 2 3
0 0.2474 0.0872 0.1535 0.0360
1 0.1073 0.0538 0.0322
2 0.1705 0.0412
3 0.0401
Pooled Covariance with Outliers
0 1 2 3
0 60.43 0.30 0.13 -1.56
1 70.53 0.17 -0.17
2 0.19 0.07
3 66.38
Robust Covariance with Outliers
0 1 2 3
0 0.2555 0.0876 0.1553 0.0359
1 0.1127 0.0545 0.0322
2 0.1723 0.0412
3 0.0424
IMSLS_NO_CONVERGE_MAX_ITER Failure to converge within “maxit” = # iterations for at least one of the “nroot” = # roots.
IMSLS_BAD_GROUP_2 The group number for observation # is equal to #. It must be greater than or equal to one and less than or equal to #, the number of groups.
Visual Numerics, Inc. PHONE: 713.784.3131 FAX:713.781.9260 |