impute_missing
Locate and optionally replace dependent variable missing values with nearest neighbor estimates.
Synopsis
#include <imsls.h>
int imsls_f_impute_missing (int n_observations, int n_variables, int n_independent, int indind[], float x[], 0)
The type double function is imsls_d_impute_missing.
Required Arguments
int n_observations (Input)
Number of observations.
int n_variables (Input)
Number of variables.
int n_independent (Input)
Number of independent variables.
int indind[] (Input)
Array of size n_independent designating the indices of the columns of x containing the independent variables.
float x[] (Input)
Array of size n_observations × n_variables containing the observations. Missing values of the dependent variables may be imputed as functions of the independent variables, but if any of the independent variables have missing values, then imputation will not be performed and a warning will be issued. If one of the optional arguments, IMSLS_REPLACEMENT_VALUE, IMSLS_IMPUTE_METHOD, or IMSLS_PURGE is supplied, x_imputed (see optional argument IMSLS_X_IMPUTED) contains the imputed data on output.
Return Value
The number of missing values (n_miss) in the data array x.
Synopsis with Optional Arguments
#include <imsls.h>
int imsls_f_impute_missing (int n_observations, int n_variables, int n_independent, int indind[], float x[],
IMSLS_MISSING_VALUE, float mval,
IMSLS_METRIC_DIAG, float g[],
IMSLS_REPLACEMENT_VALUE, float replacement_value, or
IMSLS_IMPUTE_METHOD, int method, int k, or
IMSLS_PURGE, int *n_missing_rows, int **missing_row_indices,
IMSLS_MISSING_INDEX, int **indices,
IMSLS_X_IMPUTED, float **x_imputed,
IMSLS_X_IMPUTED_USER, float x_imputed[],
0)
Optional Arguments
IMSLS_MISSING_VALUE, float mval (Input)
Scalar value (other than NaN) representing a missing value. NaN always represents a missing value, so if mval is not NaN it will be treated as a second type of missing value.
IMSLS_METRIC_DIAG, float g[] (Input)
Array of length n_independent defining a diagonal metric for independent variable space. This scales the independent variables in the distance calculations used to determine nearest neighbors. The default measure of distance is Euclidean (g[i] = 1 for all i).
IMSLS_REPLACEMENT_VALUE, float replacement_value (Input)
Replace missing values in x with replacement_value. Output data array is returned in x_imputed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
or
IMSLS_IMPUTE_METHOD, int method, int k (Input)
The method to be used for imputing missing values using k nearest neighbors. A missing value of dependent variable y at point x in the space of independent variables is replaced with the mode, mean, median, geometric mean, or linear regression (method) of y on those k nearest neighbors of x which have no missing values. To use all of the data and eliminate the need to compute neighborhoods, set k ≥ n_observations. If there are no independent variables, set k ≥ n_observations. Imputed data is returned in x_imputed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
Valid values for method are:
method                Description
IMSLS_MODE_METH       Mode
IMSLS_MEAN_METH       Mean
IMSLS_MEDIAN_METH     Median
IMSLS_GEOMEAN_METH    Geometric mean
IMSLS_LINEAR_METH     Linear regression
or
IMSLS_PURGE, int *n_missing_rows, int **missing_row_indices (Output)
All rows with missing values are removed from x and the resulting data array is returned in x_imputed. n_missing_rows is the number of rows that were removed. missing_row_indices are the indices of the rows that were removed. Requires optional argument IMSLS_X_IMPUTED or IMSLS_X_IMPUTED_USER.
IMSLS_MISSING_INDEX, int **indices (Output)
Address of a pointer to the internally allocated array of size n_miss, containing the indices of x where missing values occur. n_miss is the function return value. If the data has no missing values, the pointer is returned as NULL.
IMSLS_X_IMPUTED, float **x_imputed (Output)
Array containing imputed data. This argument is required when IMSLS_REPLACEMENT_VALUE, IMSLS_IMPUTE_METHOD, or IMSLS_PURGE is supplied. For options IMSLS_REPLACEMENT_VALUE and IMSLS_IMPUTE_METHOD, x_imputed contains all data from x with missing values replaced in the dependent variable columns. For option IMSLS_PURGE, x_imputed is an array of size (n_observations − n_missing_rows) × n_variables, containing the data from x with the rows of missing data removed.
IMSLS_X_IMPUTED_USER, float x_imputed[] (Output)
Storage for array x_imputed is provided by the user. See IMSLS_X_IMPUTED. The size of this array must be the same as x, n_observations × n_variables. For the IMSLS_PURGE option, use only the first (n_observations − n_missing_rows) × n_variables values on output.
Description
Function imsls_f_impute_missing locates missing values and, optionally, replaces them with estimated values. This replacement process, called imputation, applies only to dependent variables. If x denotes an arbitrary point in independent variable space and y denotes a dependent variable with a missing value at x = x_i, then y at x_i is estimated as y(x_i) = f(x_i), where f(x) is some function of x in some neighborhood of x_i. imsls_f_impute_missing provides five options (see IMSLS_IMPUTE_METHOD) for the form of f(x), and each option allows the neighborhood size to be specified as a given number of nearest neighbors. The neighbors exclude observations with missing values and are determined by distance, the norm relative to metric G,
\[ \|x - x_i\|_G = \sqrt{(x - x_i)^T G \, (x - x_i)} \]
By default, G = I, but the IMSLS_METRIC_DIAG option can be used to specify any other diagonal metric G. A sixth option, IMSLS_REPLACEMENT_VALUE, allows the user to specify one value to be used as a replacement for all missing values.
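For a diagonal metric, the distance reduces to a weighted sum of squared coordinate differences. The following is a minimal sketch of that calculation, not the library's implementation: the helper name diag_metric_dist2 is hypothetical, and it assumes the diagonal entries g[i] multiply the squared differences directly. It returns the squared distance, which ranks neighbors in the same order as the distance itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the IMSL API).
 * Squared distance between points a and b in independent variable space
 * under a diagonal metric g:  sum_i g[i] * (a[i] - b[i])^2.
 * Assumption: g[i] weights the squared difference in coordinate i.
 * Passing g == NULL selects the identity metric (Euclidean distance). */
double diag_metric_dist2(const double *a, const double *b,
                         const double *g, size_t n)
{
    double sum = 0.0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += (g ? g[i] : 1.0) * d * d;
    }
    return sum;
}
```

With g = {4, 0}, for example, the second coordinate is ignored entirely and differences in the first coordinate count four times as much as under the default metric.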
Instead of being used for imputation, imsls_f_impute_missing can be used to simply remove all observations which contain missing values. This is accomplished with the option IMSLS_PURGE. With this option, all rows with missing values are removed from the input data matrix. Unlike imputation, this option is not limited to dependent variables and can be used to handle missing values in the independent variables.
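The effect of purging can be illustrated without the library. The sketch below is an assumption-laden stand-in, not the IMSL routine: purge_nan_rows is a hypothetical helper that treats only NaN as missing and copies the complete rows of a row-major array into caller-provided storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the IMSL API).
 * Copies every row of the row-major n_rows x n_cols array x that contains
 * no NaN into out, preserving row order. out must hold at least
 * n_rows * n_cols floats. Returns the number of rows kept. */
size_t purge_nan_rows(const float *x, size_t n_rows, size_t n_cols, float *out)
{
    size_t kept = 0;
    assert(x != NULL && out != NULL);
    for (size_t i = 0; i < n_rows; i++) {
        int has_nan = 0;
        for (size_t j = 0; j < n_cols; j++) {
            float v = x[i * n_cols + j];
            if (v != v) {            /* only NaN compares unequal to itself */
                has_nan = 1;
                break;
            }
        }
        if (has_nan)
            continue;
        for (size_t j = 0; j < n_cols; j++)
            out[kept * n_cols + j] = x[i * n_cols + j];
        kept++;
    }
    return kept;
}
```

Only the first kept × n_cols entries of out are meaningful afterward, which mirrors the sizing note given for IMSLS_X_IMPUTED_USER with IMSLS_PURGE.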
Usually either imputation or deletion will be performed, but imsls_f_impute_missing can be used for the more basic task of returning the indices of missing values. The indices could then be used to implement other imputation methods.
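To use the returned indices in a custom imputation scheme, each index has to be mapped back to a row and column. The sketch below assumes the indices are zero-based offsets into the row-major array x, which is consistent with Example 2 (the value at x[5][0] in a 4-column array is reported as index 20). The type cell_pos and the function flat_to_cell are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical type: zero-based position of one entry of x. */
typedef struct {
    int row;
    int col;
} cell_pos;

/* Hypothetical helper (not part of the IMSL API).
 * Assumption: flat_index is a zero-based offset into a row-major array
 * with n_variables columns. */
cell_pos flat_to_cell(int flat_index, int n_variables)
{
    cell_pos p;
    assert(n_variables > 0 && flat_index >= 0);
    p.row = flat_index / n_variables;
    p.col = flat_index % n_variables;
    return p;
}
```

Under this assumption, index 20 with n_variables = 4 maps to row 5, column 0.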
Following the standard practice, missing data values are always represented by NaN. Option IMSLS_MISSING_VALUE allows the user to also specify a second value to represent missing values.
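The two-marker rule can be sketched as a simple predicate. This is an illustration only: is_missing and count_missing are hypothetical helpers, and the exact equality comparison against mval is an assumption about how a second missing-value marker might be matched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the IMSL API).
 * A value is missing if it is NaN or equals the user-chosen marker mval.
 * Assumption: mval is matched by exact equality. If mval is itself NaN,
 * the second test is never true and only NaN counts as missing. */
int is_missing(float v, float mval)
{
    return (v != v) || (v == mval);
}

/* Counts the missing entries among the first n values of x. */
int count_missing(const float *x, size_t n, float mval)
{
    int count = 0;
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        count += is_missing(x[i], mval);
    return count;
}
```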
Examples
Example 1
Count the missing values in a data set, where the only valid missing value is NaN.
 
#include <imsls.h>
#include <stdio.h>
#define N_OBSERVATIONS 20
#define N_VARIABLES 4

int main()
{
    float x[N_OBSERVATIONS][N_VARIABLES];
    int count, i, j;

    /* create the test data */
    for (i = 0; i < N_OBSERVATIONS; i++) {
        for (j = 0; j < N_VARIABLES; j++) {
            x[i][j] = (float)((i * N_VARIABLES) + j);
        }
    }

    /* replace some of the data values */
    x[3][1] = imsls_f_machine(6); /* NaN */
    x[5][2] = imsls_f_machine(6); /* NaN */
    x[7][2] = imsls_f_machine(7); /* positive infinity */
    x[9][3] = imsls_f_machine(8); /* negative infinity */

    /* declare no independent variables */
    /* note +/-inf are not considered 'missing' */
    count = imsls_f_impute_missing (N_OBSERVATIONS, N_VARIABLES, 0,
                                    NULL, (float*)x,
                                    0);
    printf("number of missing values = %d\n", count);
}
Output
 
number of missing values = 2
 
Example 2
Set the value 20 to represent a missing value and find the indices in x which contain the missing value.
 
#include <imsls.h>
#include <stdio.h>
#define N_OBSERVATIONS 20
#define N_VARIABLES 4

int main()
{
    float x[N_OBSERVATIONS][N_VARIABLES];
    float mval;
    int n_independent, count, i, j;
    int indind[2];
    int *indices;

    /* declare 2 independent variables */
    n_independent = 2;
    indind[0] = 2; /* declare that column 2 is independent */
    indind[1] = 3; /* declare that column 3 is independent */

    /* missing value is represented by 20 */
    /* and will be located at x[5][0] */
    mval = 20.0;

    /* create the test data */
    for (i = 0; i < N_OBSERVATIONS; i++) {
        for (j = 0; j < N_VARIABLES; j++) {
            x[i][j] = (float)((i * N_VARIABLES) + j);
        }
    }

    count = imsls_f_impute_missing (N_OBSERVATIONS, N_VARIABLES,
                                    n_independent, indind, (float*)x,
                                    IMSLS_MISSING_VALUE, mval,
                                    IMSLS_MISSING_INDEX, &indices,
                                    0);

    printf("number of missing values = %d\n", count);
    for (i = 0; i < count; i++) {
        printf("indices[%d] = %d\n", i, indices[i]);
    }
}
 
Output
 
number of missing values = 1
indices[0] = 20
Example 3
In this example, both NaN and positive infinity represent missing values in the original data. The first call to imsls_f_impute_missing replaces these missing values with negative infinity. The second call sets negative infinity to represent missing values and purges every row that contains one to produce the final output.
 
#include <imsls.h>
#include <stdio.h>
#define N_ROWS 6
#define N_COLS 4

int main()
{
    float *x_imputed, *x_purged;
    float mval, replacement_value;
    float data[N_ROWS][N_COLS];
    int n_independent, count, npurge, i, j;
    int indind[1];
    int *bad_obs;
    char *fmt = "%6.2f";

    /* create the test data */
    for (i = 0; i < N_ROWS; i++) {
        for (j = 0; j < N_COLS; j++) {
            data[i][j] = (float)((i * N_COLS) + j);
        }
    }

    /* insert bad values into data */
    data[1][1] = imsls_f_machine(6); /* NaN */
    data[2][2] = imsls_f_machine(7); /* positive infinity */
    data[3][3] = imsls_f_machine(8); /* negative infinity */
    imsls_f_write_matrix ("Original data with missing values",
                          N_ROWS, N_COLS, (float*)data,
                          IMSLS_WRITE_FORMAT, fmt,
                          0);

    /* set the missing value to be +inf */
    mval = imsls_f_machine(7);
    /* replace missing values with -inf */
    replacement_value = imsls_f_machine(8);
    /* declare one independent variable */
    n_independent = 1;
    indind[0] = 0;

    /* replace NaN and +inf values with -inf */
    count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent,
                                    indind, (float*)data,
                                    IMSLS_MISSING_VALUE, mval,
                                    IMSLS_REPLACEMENT_VALUE, replacement_value,
                                    IMSLS_X_IMPUTED, &x_imputed,
                                    0);

    imsls_f_write_matrix ("Data with values replaced",
                          N_ROWS, N_COLS, x_imputed,
                          IMSLS_WRITE_FORMAT, fmt,
                          0);

    /* now purge all rows containing -inf */
    mval = imsls_f_machine(8);
    count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent,
                                    indind, x_imputed,
                                    IMSLS_MISSING_VALUE, mval,
                                    IMSLS_PURGE, &npurge, &bad_obs,
                                    IMSLS_X_IMPUTED, &x_purged,
                                    0);

    printf("\n number missing = %d, number of rows purged = %d\n",
           count, npurge);

    printf("\n Purged row numbers:\n");
    for (i = 0; i < npurge; i++) {
        printf(" %d ", bad_obs[i]);
    }
    printf ("\n");

    imsls_f_write_matrix ("New data with bad rows purged",
                          N_ROWS-npurge, N_COLS, x_purged,
                          IMSLS_WRITE_FORMAT, fmt,
                          0);
}
 
Output
 
Original data with missing values
        1       2       3       4
1    0.00    1.00    2.00    3.00
2    4.00  ......    6.00    7.00
3    8.00    9.00  ++++++   11.00
4   12.00   13.00   14.00  ------
5   16.00   17.00   18.00   19.00
6   20.00   21.00   22.00   23.00

Data with values replaced
        1       2       3       4
1    0.00    1.00    2.00    3.00
2    4.00  ------    6.00    7.00
3    8.00    9.00  ------   11.00
4   12.00   13.00   14.00  ------
5   16.00   17.00   18.00   19.00
6   20.00   21.00   22.00   23.00

number missing = 3, number of rows purged = 3

Purged row numbers:
1 2 3

New data with bad rows purged
        1       2       3       4
1    0.00    1.00    2.00    3.00
2   16.00   17.00   18.00   19.00
3   20.00   21.00   22.00   23.00
 
Example 4
Replace missing values with the mean of the 3 nearest neighbors.
 
#include <imsls.h>

#define N_ROWS 10
#define N_COLS 4

int main()
{
    float *x_imputed;
    float data[N_ROWS][N_COLS];
    int n_independent, count, i, j;
    int indind[2];
    char *fmt = "%6.2f";

    /* create the test data */
    for (i = 0; i < N_ROWS; i++) {
        for (j = 0; j < N_COLS; j++) {
            data[i][j] = (float)((i * N_COLS) + j);
        }
    }

    data[1][3] = imsls_f_machine(6); /* insert NaN at row 1 col 3 */
    data[4][2] = imsls_f_machine(6); /* insert NaN at row 4 col 2 */

    imsls_f_write_matrix ("Original data with missing values", N_ROWS,
                          N_COLS, (float*) data,
                          IMSLS_WRITE_FORMAT, fmt,
                          0);

    /* declare two independent variables */
    n_independent = 2;
    indind[0] = 0;
    indind[1] = 1;

    /* replace missing values using mean method */
    count = imsls_f_impute_missing (N_ROWS, N_COLS, n_independent,
                                    indind, (float*)data,
                                    IMSLS_IMPUTE_METHOD,
                                    IMSLS_MEAN_METH, 3,
                                    IMSLS_X_IMPUTED, &x_imputed,
                                    0);

    imsls_f_write_matrix ("Imputed data (using mean method)", N_ROWS,
                          N_COLS, x_imputed,
                          IMSLS_WRITE_FORMAT, fmt,
                          0);
}
 
Output
Original data with missing values
         1       2       3       4
 1    0.00    1.00    2.00    3.00
 2    4.00    5.00    6.00  ......
 3    8.00    9.00   10.00   11.00
 4   12.00   13.00   14.00   15.00
 5   16.00   17.00  ......   19.00
 6   20.00   21.00   22.00   23.00
 7   24.00   25.00   26.00   27.00
 8   28.00   29.00   30.00   31.00
 9   32.00   33.00   34.00   35.00
10   36.00   37.00   38.00   39.00

Imputed data (using mean method)
         1       2       3       4
 1    0.00    1.00    2.00    3.00
 2    4.00    5.00    6.00    9.67
 3    8.00    9.00   10.00   11.00
 4   12.00   13.00   14.00   15.00
 5   16.00   17.00   20.67   19.00
 6   20.00   21.00   22.00   23.00
 7   24.00   25.00   26.00   27.00
 8   28.00   29.00   30.00   31.00
 9   32.00   33.00   34.00   35.00
10   36.00   37.00   38.00   39.00
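The value 9.67 in row 2, column 4 of the output can be checked by hand. The sketch below is a standalone illustration of nearest-neighbor mean imputation, not the library's algorithm: knn_mean and row_has_nan are hypothetical helpers, the distance is the default (unweighted) squared Euclidean distance, and ties between equidistant rows are broken in favor of the lower row index, which is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if any entry of the given row is NaN. */
static int row_has_nan(const float *x, int n_vars, int row)
{
    for (int j = 0; j < n_vars; j++) {
        float v = x[row * n_vars + j];
        if (v != v)                  /* only NaN compares unequal to itself */
            return 1;
    }
    return 0;
}

/* Hypothetical helper (not part of the IMSL API).
 * Mean of column target_col over the (at most) k complete rows nearest to
 * target_row, measured by squared Euclidean distance on the independent
 * columns listed in indind. Rows containing NaN are never neighbors.
 * Assumption: ties go to the lower row index. */
double knn_mean(const float *x, int n_obs, int n_vars,
                const int *indind, int n_indep,
                int target_row, int target_col, int k)
{
    char *used = calloc((size_t)n_obs, 1);
    double sum = 0.0;
    int found = 0;
    assert(used != NULL && k > 0);

    for (int pick = 0; pick < k; pick++) {
        int best = -1;
        double best_d = 0.0;
        for (int i = 0; i < n_obs; i++) {
            double d = 0.0;
            if (i == target_row || used[i] || row_has_nan(x, n_vars, i))
                continue;
            for (int c = 0; c < n_indep; c++) {
                double diff = (double)x[i * n_vars + indind[c]]
                            - (double)x[target_row * n_vars + indind[c]];
                d += diff * diff;
            }
            if (best < 0 || d < best_d) {
                best = i;
                best_d = d;
            }
        }
        if (best < 0)
            break;                   /* ran out of complete rows */
        used[best] = 1;
        sum += x[best * n_vars + target_col];
        found++;
    }
    free(used);
    assert(found > 0);               /* at least one complete row is needed */
    return sum / found;
}
```

For the missing entry at zero-based row 1, column 3, the three nearest complete rows are rows 0, 2, and 3 (row 4 is skipped because it contains a missing value), so the estimate is (3 + 11 + 15) / 3 ≈ 9.67. The other missing entry has two candidate third neighbors at equal distance, so a sketch like this reproduces 20.67 only if it breaks the tie the same way the library does.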
Warning Errors
IMSLS_NO_GOOD_ROW
Each row contains missing values. No imputation is performed.
IMSLS_INDEP_HAS_MISSING
At least one of the independent variables contains a missing value. No imputation is performed.