Locate and optionally replace dependent variable missing values with nearest neighbor estimates.
#include <imsls.h>
int imsls_f_impute_missing (int n_observations, int n_variables, int n_independent, int indind[], float x[], ..., 0)
The type double function is imsls_d_impute_missing.
int n_observations (Input)
Number of observations.
int n_variables (Input)
Number of variables.
int n_independent (Input)
Number of independent variables.
int indind[] (Input)
Array of size n_independent designating the indices of the columns of x containing the independent variables.
float x[] (Input)
Array of size n_observations × n_variables containing the observations. Missing values of the dependent variables may be imputed as functions of the independent variables, but if any of the independent variables have missing values, then imputation will not be performed and a warning will be issued. If one of the optional arguments IMSLS_REPLACEMENT_VALUE, IMSLS_IMPUTE_METHOD, or IMSLS_PURGE is supplied, x_imputed (see optional argument IMSLS_X_IMPUTED) contains the imputed data on output.
The number of missing values (n_miss) in the data array x.
#include <imsls.h>
int imsls_f_impute_missing (int n_observations, int n_variables, int n_independent, int indind[], float x[],
IMSLS_MISSING_VALUE, float mval,
IMSLS_METRIC_DIAG, float g[],
IMSLS_REPLACEMENT_VALUE, float replacement_value,
or
IMSLS_IMPUTE_METHOD, int method, int k,
or
IMSLS_PURGE, int *n_missing_rows, int **missing_row_indices,
IMSLS_MISSING_INDEX, int **indices,
IMSLS_X_IMPUTED, float **x_imputed,
IMSLS_X_IMPUTED_USER, float x_imputed[],
0)
IMSLS_MISSING_VALUE, float mval (Input)
Scalar value (other than NaN) representing a missing value. NaN always represents a missing value, so if mval is not NaN it will be treated as a second type of missing value.
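The rule above (NaN is always missing, and mval adds a second marker) can be sketched in plain C. This is only an illustration of the stated rule, not the library's implementation; the helper names is_missing and count_missing are hypothetical, and passing NAN as mval is used here as an assumed way to say "no second marker".

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Returns 1 if v counts as missing: NaN always does, and so does the
   optional user-defined marker mval. Pass NAN as mval to disable the
   second marker (an assumption of this sketch). */
int is_missing(float v, float mval)
{
    if (isnan(v))
        return 1;
    return !isnan(mval) && v == mval;
}

/* Counts the missing entries among the n values of a row-major array. */
int count_missing(const float *x, size_t n, float mval)
{
    int count = 0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        count += is_missing(x[i], mval);
    return count;
}
```

With no second marker, infinities are ordinary values; they only count as missing when mval is set to that infinity.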
IMSLS_METRIC_DIAG, float g[] (Input)
Array of length n_independent defining a diagonal metric for independent variable space. This scales the independent variables in the distance calculations used to determine nearest neighbors. The default measure of distance is Euclidean (g[i] = 1 for all i).
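As a rough illustration of how a diagonal metric scales the distance calculation, the sketch below weights each squared coordinate difference by g[i]. That weighting is an assumption about how the metric enters the computation, and the function name diag_metric_dist2 is hypothetical. It returns the squared distance, which ranks neighbors in the same order as the distance itself.

```c
#include <assert.h>
#include <stddef.h>

/* Squared distance between points a and b (each of length n) under the
   diagonal metric g: sum over i of g[i] * (a[i] - b[i])^2.
   Passing g == NULL selects the Euclidean default, g[i] = 1 for all i. */
double diag_metric_dist2(const float *a, const float *b, const float *g, int n)
{
    double sum = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double d = (double)a[i] - (double)b[i];
        sum += (g != NULL ? g[i] : 1.0) * d * d;
    }
    return sum;
}
```

Setting g[i] = 0 for some i removes that independent variable from the neighbor search, while a large g[i] makes differences in that variable dominate.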
IMSLS_REPLACEMENT_VALUE, float replacement_value (Input)
Replace missing values in x with replacement_value. Output data array is returned in x_imputed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
or
IMSLS_IMPUTE_METHOD, int method, int k (Input)
The method to be used for imputing missing values using k nearest neighbors. A missing value of dependent variable y at point x in the space of independent variables is replaced with the mode, mean, median, geometric mean, or linear regression estimate (selected by method) computed from y at those k nearest neighbors of x which have no missing values. To use all of the data and eliminate the need to compute neighborhoods, set k ≥ n_observations. If there are no independent variables, set k ≥ n_observations. Imputed data is returned in x_imputed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
Valid values for method are:
method | Description
IMSLS_MODE_METH | Mode
IMSLS_MEAN_METH | Mean
IMSLS_MEDIAN_METH | Median
IMSLS_GEOMEAN_METH | Geometric mean
IMSLS_LINEAR_METH | Linear regression
or
IMSLS_PURGE, int *n_missing_rows, int **missing_row_indices (Output)
All rows with missing values are removed from x and the resulting data array is returned in x_imputed. n_missing_rows is the number of rows that were removed. missing_row_indices contains the indices of the rows that were removed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
IMSLS_MISSING_INDEX, int **indices (Output)
Address of a pointer to the internally allocated array of size n_miss, containing the indices of x where missing values occur. n_miss is the function return value. If the data has no missing values, the pointer is returned as NULL.
IMSLS_X_IMPUTED, float **x_imputed (Output)
Array containing imputed data. This argument is required when IMSLS_REPLACEMENT_VALUE, IMSLS_IMPUTE_METHOD, or IMSLS_PURGE is supplied. For options IMSLS_REPLACEMENT_VALUE and IMSLS_IMPUTE_METHOD, x_imputed contains all data from x with missing values replaced in the dependent variable columns. For option IMSLS_PURGE, x_imputed is an array of size (n_observations - n_missing_rows) × n_variables, containing the data from x with the rows of missing data removed.
IMSLS_X_IMPUTED_USER, float x_imputed[] (Output)
Storage for array x_imputed is provided by the user. See IMSLS_X_IMPUTED. The size of this array must be the same as x, n_observations × n_variables. For the IMSLS_PURGE option, use only the first (n_observations - n_missing_rows) × n_variables values on output.
Function imsls_f_impute_missing locates missing values and, optionally, replaces them with estimated values. This replacement process, called imputation, applies only to dependent variables. If x denotes an arbitrary point in independent variable space and y denotes a dependent variable with a missing value at x = xi, then y at xi is estimated as y(xi) = f(xi), where f(x) is some function of x in some neighborhood of xi.
imsls_f_impute_missing provides five options (see IMSLS_IMPUTE_METHOD) for the form of f(x), and each option allows neighborhood size to be specified in terms of some given number of nearest neighbors. The neighbors exclude observations with missing values and are determined by distance, the norm relative to metric G,
$$\|x\|_G = \sqrt{x^T G x}$$
By default, G = I, but the IMSLS_METRIC_DIAG option can be used to specify any other diagonal metric G. A sixth option, IMSLS_REPLACEMENT_VALUE, allows the user to specify one value to be used as a replacement for all missing values.
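To make the nearest-neighbor idea concrete, here is a minimal sketch of mean imputation for a single missing entry, written in plain C without the library. It assumes a row-major data array, treats only NaN as missing, weights squared differences by the diagonal metric, and breaks distance ties in favor of the lower row index. The tie rule and the helper names row_has_nan, row_dist2, and knn_mean are assumptions of this sketch, so its results may differ from the library's when neighbors are equidistant.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Returns 1 if row r of the row-major array x (n_var columns) has a NaN. */
int row_has_nan(const float *x, int n_var, int r)
{
    for (int c = 0; c < n_var; c++)
        if (isnan(x[r * n_var + c]))
            return 1;
    return 0;
}

/* Squared distance between rows r and s over the independent columns in
   indind, weighted by the diagonal metric g (NULL means all weights 1). */
double row_dist2(const float *x, int n_var, const int *indind, int n_ind,
                 const float *g, int r, int s)
{
    double sum = 0.0;
    for (int i = 0; i < n_ind; i++) {
        double d = (double)x[r * n_var + indind[i]]
                 - (double)x[s * n_var + indind[i]];
        sum += (g != NULL ? g[i] : 1.0) * d * d;
    }
    return sum;
}

/* Estimates x[target][dep] as the mean of column dep over the k nearest
   rows that have no missing values. Ties go to the lower row index.
   Returns NAN if no complete row is available. */
float knn_mean(const float *x, int n_obs, int n_var, const int *indind,
               int n_ind, const float *g, int target, int dep, int k)
{
    char *used = calloc((size_t)n_obs, 1);  /* marks rows already chosen */
    double sum = 0.0;
    int found = 0;
    assert(used != NULL && k > 0);
    while (found < k) {
        int best = -1;
        double best_d = 0.0;
        for (int r = 0; r < n_obs; r++) {
            if (r == target || used[r] || row_has_nan(x, n_var, r))
                continue;
            double d = row_dist2(x, n_var, indind, n_ind, g, target, r);
            if (best < 0 || d < best_d) {
                best = r;
                best_d = d;
            }
        }
        if (best < 0)
            break;                          /* fewer than k complete rows */
        used[best] = 1;
        sum += x[best * n_var + dep];
        found++;
    }
    free(used);
    return found > 0 ? (float)(sum / found) : NAN;
}
```

The repeated linear scan costs O(k × n_observations) per missing value, which keeps the sketch short; a partial sort of the distances would be the usual choice for larger data.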
Instead of being used for imputation, imsls_f_impute_missing can be used to simply remove all observations which contain missing values. This is accomplished with the option IMSLS_PURGE. With this option, all rows with missing values are removed from the input data matrix. Unlike imputation, this option is not limited to dependent variables and can be used to handle missing values in the independent variables.
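The row-deletion behavior described above can be sketched in plain C as follows. This is an illustrative stand-in rather than the library routine: it treats only NaN as missing, the function name purge_rows is hypothetical, and the caller is assumed to supply output storage at least as large as the input.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

/* Copies every row of the row-major array x (n_obs rows, n_var columns)
   that contains no NaN into out, preserving row order. The zero-based
   indices of the dropped rows are written to removed, which must hold
   up to n_obs entries. Returns the number of rows dropped. */
int purge_rows(const float *x, int n_obs, int n_var, float *out, int *removed)
{
    int kept = 0;
    int n_removed = 0;
    assert(n_obs >= 0 && n_var > 0);
    for (int r = 0; r < n_obs; r++) {
        int bad = 0;
        for (int c = 0; c < n_var; c++) {
            if (isnan(x[r * n_var + c])) {
                bad = 1;
                break;
            }
        }
        if (bad) {
            removed[n_removed++] = r;
        } else {
            memcpy(out + kept * n_var, x + r * n_var,
                   (size_t)n_var * sizeof(float));
            kept++;
        }
    }
    return n_removed;
}
```

Only the first (n_obs - n_removed) × n_var values of out are meaningful after the call, which mirrors the sizing note given for IMSLS_X_IMPUTED_USER.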
Usually either imputation or deletion will be performed, but imsls_f_impute_missing can be used for the more basic task of returning the indices of missing values. The indices could then be used to implement other imputation methods.
Following the standard practice, missing data values are always represented by NaN. Option IMSLS_MISSING_VALUE allows the user to also specify a second value to represent missing values.
Count the missing values in a data set, where the only valid missing value is NaN.
#include <imsls.h>
#include <stdio.h>
#define N_OBSERVATIONS 20
#define N_VARIABLES 4
int main()
{
float x[N_OBSERVATIONS][N_VARIABLES];
int count, i, j;
/* create the test data */
for(i=0;i<N_OBSERVATIONS;i++) {
for(j=0;j<N_VARIABLES;j++) {
x[i][j]= (float)((i*N_VARIABLES)+j);
}
}
/* replace some of the data values */
x[3][1] = imsls_f_machine(6); /* NaN */
x[5][2] = imsls_f_machine(6); /* NaN */
x[7][2] = imsls_f_machine(7); /* positive infinity */
x[9][3] = imsls_f_machine(8); /* negative infinity */
/* declare no independent variables */
/* note +/-inf are not considered 'missing' */
count = imsls_f_impute_missing (N_OBSERVATIONS, N_VARIABLES, 0,
NULL, (float*)x,
0);
printf("number of missing values = %d\n", count);
}
number of missing values = 2
Set the value 20 to represent a missing value and find the indices in x which contain the missing value.
#include <imsls.h>
#include <stdio.h>
#define N_OBSERVATIONS 20
#define N_VARIABLES 4
int main()
{
float x[N_OBSERVATIONS][N_VARIABLES];
float mval;
int n_independent, count, i, j;
int indind[2];
int *indices;
/* declare 2 independent variables */
n_independent = 2;
indind[0] = 2; /* declare that column 2 is independent */
indind[1] = 3; /* declare that column 3 is independent */
/* missing value is represented by 20 */
/* and will be located at x[5][0] */
mval = 20.0;
/* create the test data */
for(i=0;i<N_OBSERVATIONS;i++) {
for(j=0;j<N_VARIABLES;j++) {
x[i][j] = (float)((i*N_VARIABLES)+j);
}
}
count = imsls_f_impute_missing (N_OBSERVATIONS, N_VARIABLES,
n_independent, indind, (float*)x,
IMSLS_MISSING_VALUE, mval,
IMSLS_MISSING_INDEX, &indices,
0);
printf("number of missing values = %d\n", count);
for (i=0; i<count;i++) {
printf("indices[%d] = %d\n", i, indices[i]);
}
}
number of missing values = 1
indices[0] = 20
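The reported index 20 corresponds to x[5][0], which suggests that the indices are zero-based offsets into the row-major data array. Under that assumption, an index can be converted back to a row and column as sketched below; the type cell_pos and the function flat_to_cell are hypothetical helpers, not part of the library.

```c
#include <assert.h>

/* Row and column of one entry in a two-dimensional data array. */
typedef struct {
    int row;
    int col;
} cell_pos;

/* Converts a zero-based, row-major flat index into (row, col) for an
   array with n_variables columns. */
cell_pos flat_to_cell(int index, int n_variables)
{
    cell_pos p;
    assert(index >= 0 && n_variables > 0);
    p.row = index / n_variables;
    p.col = index % n_variables;
    return p;
}
```

For the example above, flat_to_cell(20, 4) gives row 5 and column 0.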
In this example both NaN and positive infinity represent missing values in the original data. In the first call to imsls_f_impute_missing, missing values are replaced by negative infinity. In the second call, negative infinity is set to represent missing values and the rows containing them are purged for the final output.
#include <imsls.h>
#include <stdio.h>
#define N_ROWS 6
#define N_COLS 4
int main()
{
float *x_imputed, *x_purged;
float mval, replacement_value;
float data[N_ROWS][N_COLS];
int n_independent, count, npurge, i, j;
int indind[1];
int *bad_obs;
char *fmt="%6.2f";
/* create the test data */
for(i=0;i<N_ROWS;i++) {
for(j=0;j<N_COLS;j++) {
data[i][j] = (float)((i*N_COLS)+j);
}
}
/* insert bad values into data */
data[1][1] = imsls_f_machine(6); /* NaN */
data[2][2] = imsls_f_machine(7); /* positive infinity */
data[3][3] = imsls_f_machine(8); /* negative infinity */
imsls_f_write_matrix ("Original data with missing values",
N_ROWS, N_COLS, (float*)data,
IMSLS_WRITE_FORMAT, fmt,
0);
/* set the missing value to be +inf */
mval = imsls_f_machine(7);
/* replace missing values with neg inf */
replacement_value = imsls_f_machine(8);
/* declare one independent variable */
n_independent = 1;
indind[0] = 0;
/* replace NaN and +inf values with -inf */
count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent,
indind, (float*)data,
IMSLS_MISSING_VALUE, mval,
IMSLS_REPLACEMENT_VALUE, replacement_value,
IMSLS_X_IMPUTED, &x_imputed,
0);
imsls_f_write_matrix ("Data with values replaced",
N_ROWS, N_COLS, x_imputed,
IMSLS_WRITE_FORMAT, fmt,
0);
/* now purge all rows containing -inf */
mval = imsls_f_machine(8);
count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent,
indind, x_imputed,
IMSLS_MISSING_VALUE, mval,
IMSLS_PURGE, &npurge, &bad_obs,
IMSLS_X_IMPUTED, &x_purged,
0);
printf("\n number missing = %d, number of rows purged = %d\n",
count, npurge);
printf("\n Purged row numbers:\n");
for(i=0;i<npurge;i++) {
printf(" %d ",bad_obs[i]);
}
printf ("\n");
imsls_f_write_matrix ("New data with bad rows purged",
N_ROWS-npurge, N_COLS, x_purged,
IMSLS_WRITE_FORMAT, fmt,
0);
}
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ...... 6.00 7.00
3 8.00 9.00 ++++++ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
Data with values replaced
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ------ 6.00 7.00
3 8.00 9.00 ------ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
number missing = 3, number of rows purged = 3
Purged row numbers:
1 2 3
New data with bad rows purged
1 2 3 4
1 0.00 1.00 2.00 3.00
2 16.00 17.00 18.00 19.00
3 20.00 21.00 22.00 23.00
Replace missing values with the mean of the 3 nearest neighbors.
#include <imsls.h>
#include <stdio.h>
#define N_ROWS 10
#define N_COLS 4
int main()
{
float *x_imputed;
float data[N_ROWS][N_COLS];
int n_independent, count, i, j;
int indind[2];
char *fmt="%6.2f";
/* create the test data */
for(i=0;i<N_ROWS;i++) {
for(j=0;j<N_COLS;j++) {
data[i][j] = (float)((i*N_COLS)+j);
}
}
data[1][3] = imsls_f_machine(6); /* insert NaN at row 1 col 3 */
data[4][2] = imsls_f_machine(6); /* insert NaN at row 4 col 2 */
imsls_f_write_matrix ("Original data with missing values",
N_ROWS, N_COLS, (float*) data,
IMSLS_WRITE_FORMAT, fmt,
0);
/* declare two independent variables */
n_independent = 2;
indind[0] = 0;
indind[1] = 1;
/* replace missing values using mean method */
count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent, indind,
(float*)data, IMSLS_IMPUTE_METHOD,
IMSLS_MEAN_METH, 3,
IMSLS_X_IMPUTED, &x_imputed,
0);
imsls_f_write_matrix ("Imputed data (using mean method)",
N_ROWS, N_COLS,
x_imputed, IMSLS_WRITE_FORMAT, fmt,
0);
}
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 ......
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 ...... 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
Imputed data (using mean method)
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 9.67
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 20.67 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
IMSLS_NO_GOOD_ROW | Each row contains missing values. No imputation is performed.
IMSLS_INDEP_HAS_MISSING | At least one of the independent variables contains a missing value. No imputation is performed.