imputeMissing

Locates missing values and optionally replaces missing values of the dependent variables with nearest-neighbor estimates.

Synopsis

imputeMissing (indind, x)

Required Arguments

int indind[] (Input)
Array of size nIndependent designating the indices of the columns of x containing the independent variables.
float x[[]] (Input)
Array of size nObservations × nVariables containing the observations. Missing values of the dependent variables may be imputed as functions of the independent variables. If any independent variable has a missing value, imputation is not performed and a warning is issued. If any of the optional arguments replacementValue, imputeMethod, or purge is supplied, xImputed (see optional argument xImputed) contains the output data.

Return Value

The number of missing values (nMiss) in the data array x.

Optional Arguments

missingValue, float (Input)
Scalar value (other than NaN) representing a missing value. NaN always represents a missing value, so if missingValue is not NaN it will be treated as a second type of missing value.
metricDiag, float[] (Input)
Array of length nIndependent defining a diagonal metric for independent variable space. This scales the independent variables in the distance calculations used to determine nearest neighbors. The default measure of distance is Euclidean (g[i] = 1 for all i).
replacementValue, float (Input)
Replace missing values in x with replacementValue. Output data array is returned in xImputed. Requires optional argument xImputed.

or

imputeMethod, int method, int k (Input)

The method used to impute missing values from k nearest neighbors. The missing value of dependent variable y at point x in the space of independent variables is replaced with the mode, mean, median, geometric mean, or linear regression (method) of y on those k nearest neighbors of x that have no missing values. To use all of the data and eliminate the need to compute neighborhoods, set k ≥ nObservations. If there are no independent variables, set k ≥ nObservations. Imputed data is returned in xImputed. Requires optional argument xImputed.

Valid values for method are:

method Description
MODE_METH Mode
MEAN_METH Mean
MEDIAN_METH Median
GEOMEAN_METH Geometric mean
LINEAR_METH Linear regression

or

purge (Output)
All rows with missing values are removed from x and the resulting data array is returned in xImputed. On output, the dictionary passed as purge holds nMissingRows, the number of rows that were removed, and missingRowIndices, the indices of those rows (see Example 3). Requires optional argument xImputed.

missingIndex (Output)
Array of size nMiss containing the indices of x where missing values occur. nMiss is the function return value. If the data has no missing values, missingIndex is returned as None.

xImputed (Output)
Array containing imputed data. This argument is required when replacementValue, imputeMethod, or purge is supplied. For options replacementValue and imputeMethod, xImputed contains all data from x with missing values replaced in the dependent variable columns. For option purge, xImputed is an array of size (nObservations − nMissingRows) × nVariables, containing the data from x with the rows of missing data removed.

Description

Function imputeMissing locates missing values, and optionally, replaces them with estimated values. This replacement process, called imputation, applies only to dependent variables. If x denotes an arbitrary point in independent variable space and y denotes a dependent variable with a missing value at \(x=x_i\), then y at \(x_i\) is estimated as \(y(x_i)=f(x_i)\) where \(f(x)\) is some function of x in some neighborhood of \(x_i\). imputeMissing provides five options (see imputeMethod) for the form of \(f(x)\), and each option allows neighborhood size to be specified in terms of some given number of nearest neighbors. The neighbors exclude observations with missing values and are determined by distance, the norm relative to metric G,

\[\|x\| = \sqrt{x^T Gx}.\]

By default, \(G=I\), but the metricDiag option can be used to specify any other diagonal metric G. A sixth option, replacementValue, allows the user to specify one value to be used as a replacement for all missing values.
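
The neighbor search described above can be illustrated with a minimal NumPy sketch. It is not the PyIMSL implementation: the helper names metric_distance and knn_impute_value, the test data, and the tie-breaking rule (ties resolved by row order) are assumptions made for illustration only. The sketch shows how a diagonal metric g changes which complete rows count as nearest.

```python
import numpy as np


def metric_distance(a, b, g):
    """Distance between points a and b under the diagonal metric g."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(g * d * d)))


def knn_impute_value(x, indind, row, col, k, g=None, estimator=np.mean):
    """Estimate x[row, col] from the k nearest rows that contain no NaN."""
    x = np.asarray(x, dtype=float)
    g = np.ones(len(indind)) if g is None else np.asarray(g, dtype=float)
    # Candidate neighbors: rows with no missing value in any column.
    complete = np.flatnonzero(~np.isnan(x).any(axis=1))
    dist = np.array([metric_distance(x[r, indind], x[row, indind], g)
                     for r in complete])
    # Stable sort: ties are broken by row order (an assumption of this sketch).
    nearest = complete[np.argsort(dist, kind="stable")[:k]]
    return estimator(x[nearest, col])


# Hypothetical data: columns 0 and 1 are independent, column 2 is dependent.
data = np.array([[0.0, 0.0, 1.0],
                 [1.0, 0.0, 2.0],
                 [1.0, 5.0, 9.0],
                 [1.0, 1.0, np.nan],
                 [9.0, 9.0, 50.0]])

# Default Euclidean metric: rows 1 and 0 are nearest to row 3.
euclid = knn_impute_value(data, [0, 1], row=3, col=2, k=2)
# Heavily weighting column 0 makes rows 1 and 2 the nearest instead.
scaled = knn_impute_value(data, [0, 1], row=3, col=2, k=2, g=[100.0, 0.01])
print(euclid, scaled)  # 1.5 5.5
```

Passing a different estimator, for example np.median, swaps the summary applied to the neighbors' values without changing the neighbor search.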

Instead of being used for imputation, imputeMissing can be used to simply remove all observations which contain missing values. This is accomplished with the option purge. With this option, all rows with missing values are removed from the input data matrix. Unlike imputation, this option is not limited to dependent variables and can be used to handle missing values in the independent variables.
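
The effect of row deletion can be sketched in plain NumPy. This is an illustration rather than the library's code: purge_rows is a hypothetical helper, and its return values only mirror the quantities nMissingRows and missingRowIndices described for option purge.

```python
import numpy as np


def purge_rows(x, missing_value=None):
    """Drop every row that contains a missing value.

    Returns (purged array, number of rows removed, removed row indices).
    """
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)                  # NaN always counts as missing
    if missing_value is not None:
        mask |= (x == missing_value)    # optional second missing marker
    bad_rows = np.flatnonzero(mask.any(axis=1))
    return np.delete(x, bad_rows, axis=0), bad_rows.size, bad_rows


data = np.arange(24, dtype=float).reshape(6, 4)
data[1, 1] = np.nan
data[2, 2] = -np.inf
data[3, 3] = -np.inf

purged, n_removed, removed = purge_rows(data, missing_value=-np.inf)
print(purged.shape, n_removed, removed.tolist())  # (3, 4) 3 [1, 2, 3]
```

Because the row mask is built over all columns, a missing value in an independent variable column removes the row just as one in a dependent variable column does.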

Usually either imputation or deletion will be performed, but imputeMissing can be used for the more basic task of returning the indices of missing values. The indices could then be used to implement other imputation methods.
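
As one illustration of building another imputation method on top of the missing-value indices, the NumPy sketch below fills each missing entry with its column median. The helper missing_indices is hypothetical, and the use of row-major flattened indices is an assumption inferred from Example 2, where the entry x[5][0] of a four-column array is reported as index 20.

```python
import numpy as np


def missing_indices(x, missing_value=None):
    """Row-major flattened positions of the missing entries of x."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    if missing_value is not None:
        mask |= (x == missing_value)
    return np.flatnonzero(mask)


data = np.arange(12, dtype=float).reshape(4, 3)
data[1, 2] = np.nan
data[3, 0] = np.nan

idx = missing_indices(data)
# Convert flattened positions back to (row, column) pairs.
rows, cols = np.unravel_index(idx, data.shape)

# Custom rule: replace each missing entry with the median of its column,
# computed over the non-missing entries only.
col_medians = np.nanmedian(data, axis=0)
filled = data.copy()
filled[rows, cols] = col_medians[cols]

print(idx.tolist(), filled[1, 2], filled[3, 0])  # [5, 9] 8.0 3.0
```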

Following the standard practice, missing data values are always represented by NaN. Option missingValue allows the user to also specify a second value to represent missing values.

Examples

Example 1

Count the missing values in a data set, where NaN is the only value treated as missing.

from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine

nObservations = 20
nVariables = 4

# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j

# Replace some of the data values,
# note that +/-inf are not considered 'missing'
x[3][1] = machine(6)  # NaN
x[5][2] = machine(6)  # NaN
x[7][2] = machine(7)  # positive infinity
x[9][3] = machine(8)  # negative infinity

# declare no independent variables
indind = []

count = imputeMissing(indind, x)

print("number of missing values = %d" % count)

Output

number of missing values = 2
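
The count above can be cross-checked without PyIMSL. The following NumPy sketch rebuilds the same array, assuming that machine(6), machine(7), and machine(8) yield NaN, positive infinity, and negative infinity as the comments in the example state. It counts NaN and infinite entries separately.

```python
import numpy as np

# Same 20 x 4 test array as in Example 1: x[i][j] = i * 4 + j.
x = np.arange(80, dtype=float).reshape(20, 4)
x[3, 1] = np.nan
x[5, 2] = np.nan
x[7, 2] = np.inf
x[9, 3] = -np.inf

# NaN never compares equal to itself, so it must be located with isnan.
n_nan = int(np.isnan(x).sum())
n_inf = int(np.isinf(x).sum())
print(n_nan, n_inf)  # 2 2
```

Only the two NaN entries correspond to the reported count; the two infinite entries are ordinary values unless one of them is designated through missingValue.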

Example 2

Set the value 20 to represent a missing value and find the indices in x which contain the missing value.

from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine

nObservations = 20
nVariables = 4

# Declare 2 independent variables
nIndependent = 2
# Declare that columns 2 and 3 are independent
indind = [2, 3]

# Missing value is represented by 20
# and will be located at x[5][0]
mval = 20.0

# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j

indices = []
count = imputeMissing(indind, x,
                      missingValue=mval,
                      missingIndex=indices)

print("number of missing values = %d" % count)
print("indices[0] = %d" % indices[0])

Output

number of missing values = 1
indices[0] = 20

Example 3

In this example both NaN and positive infinity represent missing values in the original data. In the first call to imputeMissing, missing values are replaced by negative infinity. In the second call to imputeMissing, negative infinity is set to represent missing values and the rows containing the missing values are purged for the final output.

from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix

nRows = 6
nCols = 4
fmt = "%6.2f"

# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j

# Insert bad values into data
data[1][1] = machine(6)  # NaN
data[2][2] = machine(7)  # positive infinity
data[3][3] = machine(8)  # negative infinity
writeMatrix("Original data with missing values", data,
            writeFormat=fmt)

# Set the missing value to be +inf
mval = machine(7)
# Replace missing values with neg inf
replacementValue = machine(8)
# Declare one independent variable
indind = [0]
# replace NaN and +inf values with -inf
xImputed = []
count = imputeMissing(indind, data,
                      missingValue=mval,
                      replacementValue=replacementValue,
                      xImputed=xImputed)


writeMatrix("Data with values replaced", xImputed,
            writeFormat=fmt)
# Now purge all rows containing -inf
mval = machine(8)
purge = {}
xPurged = []
count = imputeMissing(indind, xImputed,
                      missingValue=mval,
                      purge=purge,
                      xImputed=xPurged)

print("\n number missing = %d, number of rows purged = %d" %
      (count, purge['nMissingRows']))

print("\n Purged row numbers:")
badObs = purge['missingRowIndices']
for i in range(purge['nMissingRows']):
    print(" %d" % badObs[i])

writeMatrix("New data with bad rows purged", xPurged,
            writeFormat=fmt)

Output

 number missing = 3, number of rows purged = 3

 Purged row numbers:
 1
 2
 3
 
Original data with missing values
        1       2       3       4
1    0.00    1.00    2.00    3.00
2    4.00  ......    6.00    7.00
3    8.00    9.00  ++++++   11.00
4   12.00   13.00   14.00  ------
5   16.00   17.00   18.00   19.00
6   20.00   21.00   22.00   23.00
 
    Data with values replaced
        1       2       3       4
1    0.00    1.00    2.00    3.00
2    4.00  ------    6.00    7.00
3    8.00    9.00  ------   11.00
4   12.00   13.00   14.00  ------
5   16.00   17.00   18.00   19.00
6   20.00   21.00   22.00   23.00
 
  New data with bad rows purged
        1       2       3       4
1    0.00    1.00    2.00    3.00
2   16.00   17.00   18.00   19.00
3   20.00   21.00   22.00   23.00

Example 4

Replace missing values with the mean of the 3 nearest neighbors.

from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing, MEAN_METH
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix

nRows = 10
nCols = 4
fmt = "%6.2f"

# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j

# Insert bad values into data
data[1][3] = machine(6)  # NaN
data[4][2] = machine(6)  # NaN
writeMatrix("Original data with missing values", data,
            writeFormat=fmt)


# Declare two independent variables
indind = [0, 1]

# Replace missing values using mean method
xImputed = []
count = imputeMissing(indind, data,
                      imputeMethod={'method': MEAN_METH, "k": 3},
                      xImputed=xImputed)

writeMatrix("Imputed data (using mean method)", xImputed,
            writeFormat=fmt)

Output

 
 Original data with missing values
         1       2       3       4
 1    0.00    1.00    2.00    3.00
 2    4.00    5.00    6.00  ......
 3    8.00    9.00   10.00   11.00
 4   12.00   13.00   14.00   15.00
 5   16.00   17.00  ......   19.00
 6   20.00   21.00   22.00   23.00
 7   24.00   25.00   26.00   27.00
 8   28.00   29.00   30.00   31.00
 9   32.00   33.00   34.00   35.00
10   36.00   37.00   38.00   39.00
 
 Imputed data (using mean method)
         1       2       3       4
 1    0.00    1.00    2.00    3.00
 2    4.00    5.00    6.00    9.67
 3    8.00    9.00   10.00   11.00
 4   12.00   13.00   14.00   15.00
 5   16.00   17.00   20.67   19.00
 6   20.00   21.00   22.00   23.00
 7   24.00   25.00   26.00   27.00
 8   28.00   29.00   30.00   31.00
 9   32.00   33.00   34.00   35.00
10   36.00   37.00   38.00   39.00
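
The two imputed values can be reproduced by hand. The NumPy sketch below is an independent arithmetic check, not a call into PyIMSL. It uses zero-based row indices and Euclidean distance over columns 0 and 1. For the second missing value, two candidate rows are equidistant; the source does not state how ties are resolved, so both candidates are computed, and the documented value matches the second.

```python
import numpy as np

# Same 10 x 4 test array as in Example 4, with the same two NaN entries.
data = np.arange(40, dtype=float).reshape(10, 4)
data[1, 3] = np.nan
data[4, 2] = np.nan

# Rows without any missing value are the only admissible neighbors.
complete = np.flatnonzero(~np.isnan(data).any(axis=1))


def distances(row):
    """Euclidean distance from `row` to each complete row (columns 0 and 1)."""
    diff = data[complete][:, :2] - data[row, :2]
    return dict(zip(complete.tolist(), np.sqrt((diff ** 2).sum(axis=1)).tolist()))


d1 = distances(1)
d4 = distances(4)

# Row 1: rows 0 and 2 are closest, followed by row 3.
est_row1 = np.mean(data[[0, 2, 3], 3])          # (3 + 11 + 15) / 3

# Row 4: rows 3 and 5 are closest; rows 2 and 6 tie for third place.
est_row4_with_2 = np.mean(data[[3, 5, 2], 2])   # (14 + 22 + 10) / 3
est_row4_with_6 = np.mean(data[[3, 5, 6], 2])   # (14 + 22 + 26) / 3

print("%.2f %.2f %.2f" % (est_row1, est_row4_with_2, est_row4_with_6))
# 9.67 15.33 20.67
```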

Warning Errors

IMSLS_NO_GOOD_ROW Each row contains missing values. No imputation is performed.
IMSLS_INDEP_HAS_MISSING At least one of the independent variables contains a missing value. No imputation is performed.