imputeMissing¶
Locate and optionally replace dependent variable missing values with nearest neighbor estimates.
Synopsis¶
imputeMissing (indind, x)
Required Arguments¶
- int indind[] (Input) - Array of size nIndependent designating the indices of the columns of x containing the independent variables.
- float x[[]] (Input) - Array of size nObservations × nVariables containing the observations. Missing values of the dependent variables may be imputed as functions of the independent variables, but if any of the independent variables have missing values, then imputation will not be performed and a warning will be issued. If one of the optional arguments replacementValue, imputeMethod, or purge is supplied, xImputed (see optional argument xImputed) contains the imputed data on output.
Return Value¶
The number of missing values (nMiss) in the data array x.
Optional Arguments¶
- missingValue, float (Input) - Scalar value (other than NaN) representing a missing value. NaN always represents a missing value, so if missingValue is not NaN it will be treated as a second type of missing value.
- metricDiag, float (Input) - Array of length nIndependent defining a diagonal metric for independent variable space. This scales the independent variables in the distance calculations used to determine nearest neighbors. The default measure of distance is Euclidean (g[i] = 1 for all i, where g denotes the metricDiag array).
- replacementValue, float (Input) - Replace missing values in x with replacementValue. Output data array is returned in xImputed. Requires optional argument xImputed.
or
- imputeMethod, int method, int k (Input) - The method to be used for imputing missing values using k nearest neighbors. Replace the missing value of dependent variable y at point x in the space of independent variables with the mode, mean, median, geometric mean, or linear regression (method) of y on those k nearest neighbors of x which have no missing values. To use all of the data and eliminate the need to compute neighborhoods, set k ≥ nObservations. If there are no independent variables, set k ≥ nObservations. Imputed data is returned in xImputed. Requires optional argument xImputed.
Valid values for method are:
| method | Description |
|---|---|
| MODE_METH | Mode |
| MEAN_METH | Mean |
| MEDIAN_METH | Median |
| GEOMEAN_METH | Geometric mean |
| LINEAR_METH | Linear regression |
or
- purge (Output) - All rows with missing values are removed from x and the resulting data array is returned in xImputed. nMissingRows is the number of rows that were removed. missingRowIndices are the indices of the rows that were removed. As Example 3 shows, purge is passed as a dictionary that receives both entries on output. Requires optional argument xImputed.
- missingIndex (Output) - Array of size nMiss containing the indices of x where missing values occur. nMiss is the function return value. If the data has no missing values, missingIndex is returned as None.
- xImputed (Output) - Array containing imputed data. This argument is required when replacementValue, imputeMethod, or purge is supplied. For options replacementValue and imputeMethod, xImputed contains all data from x with missing values replaced in the dependent variable columns. For option purge, xImputed is an array of size (nObservations − nMissingRows) × nVariables, containing the data from x with the rows of missing data removed.
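As a rough illustration of what the five method values compute, the sketch below evaluates each estimator on the y values of one neighborhood using NumPy. It is not the library's implementation: the function name `estimate`, the string method labels, the tie-breaking rule for the mode, and the use of an ordinary least-squares fit with an intercept for the regression case are all assumptions made for this example.

```python
import numpy as np


def estimate(method, y_nbrs, x_nbrs=None, x_target=None):
    """Hypothetical helper: summarize neighbor values y_nbrs with one estimator."""
    y = np.asarray(y_nbrs, dtype=float)
    if method == "mode":
        values, counts = np.unique(y, return_counts=True)
        return float(values[np.argmax(counts)])  # assumption: smallest value wins ties
    if method == "mean":
        return float(y.mean())
    if method == "median":
        return float(np.median(y))
    if method == "geomean":
        return float(np.exp(np.log(y).mean()))  # only meaningful for y > 0
    if method == "linear":
        # Least-squares fit of y on the independent variables (with intercept),
        # then prediction at the point whose y value is missing.
        a = np.column_stack([np.ones(len(y)), np.asarray(x_nbrs, dtype=float)])
        coef, *_ = np.linalg.lstsq(a, y, rcond=None)
        return float(np.concatenate([[1.0], np.atleast_1d(x_target)]) @ coef)
    raise ValueError("unknown method: %r" % (method,))


# Mean of the values 3, 11 and 15, the same value as the first imputed entry in Example 4.
print(round(estimate("mean", [3.0, 11.0, 15.0]), 2))  # 9.67
```

The geometric mean is computed through logarithms, so this sketch assumes strictly positive neighbor values.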
Description¶
Function imputeMissing locates missing values and, optionally, replaces
them with estimated values. This replacement process, called imputation,
applies only to dependent variables. If x denotes an arbitrary point in
independent variable space and y denotes a dependent variable with a
missing value at \(x=x_i\), then y at \(x_i\) is estimated as
\(y(x_i)=f(x_i)\) where \(f(x)\) is some function of x in some
neighborhood of \(x_i\). imputeMissing provides five options (see
imputeMethod) for the form of \(f(x)\),
and each option allows neighborhood size to be specified in terms of some
given number of nearest neighbors. The neighbors exclude observations with
missing values and are determined by distance, the norm relative to metric G,
\(\|x\|_G=\sqrt{x^TGx}\).
By default, \(G=I\), but the metricDiag option can be used to specify
any other diagonal metric G. A sixth option, replacementValue, allows
the user to specify one value to be used as a replacement for all missing
values.
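The nearest-neighbor idea described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the routine's actual algorithm: the name `knn_mean_impute` is hypothetical, only the mean estimator is shown, neighbors are restricted to rows with no missing values at all, and ties in distance are broken by row order.

```python
import numpy as np


def knn_mean_impute(x, indind, k, metric_diag=None):
    """Fill NaNs in dependent columns with the mean over the k nearest complete rows."""
    x = np.asarray(x, dtype=float)
    cols = np.asarray(indind, dtype=int)
    # Diagonal of the metric G; all ones gives ordinary Euclidean distance.
    g = np.ones(cols.size) if metric_diag is None else np.asarray(metric_diag, dtype=float)
    if np.isnan(x[:, cols]).any():
        raise ValueError("independent variables contain missing values")
    # Candidate neighbors: rows with no missing values.
    donors = np.flatnonzero(~np.isnan(x).any(axis=1))
    if donors.size == 0:
        raise ValueError("every row contains a missing value")
    out = x.copy()
    for i, j in zip(*np.nonzero(np.isnan(x))):
        diff = x[np.ix_(donors, cols)] - x[i, cols]
        dist = np.sqrt((g * diff ** 2).sum(axis=1))  # distance under the diagonal metric
        nearest = donors[np.argsort(dist, kind="stable")[:k]]
        out[i, j] = x[nearest, j].mean()
    return out


nan = np.nan
data = np.array([[0.0, 1.0],
                 [1.0, 2.0],
                 [3.0, nan],
                 [4.0, 5.0],
                 [10.0, 20.0]])
filled = knn_mean_impute(data, indind=[0], k=2)
print(filled[2, 1])  # 3.5, the mean of y at x = 4 and x = 1
```

Passing `metric_diag` rescales each independent variable before distances are compared, which can change which rows count as nearest.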
Instead of being used for imputation, imputeMissing can be used to
simply remove all observations which contain missing values. This is
accomplished with the option purge. With this option, all rows with
missing values are removed from the input data matrix. Unlike imputation,
this option is not limited to dependent variables and can be used to handle
missing values in the independent variables.
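For readers who want to see the effect of purging without the library at hand, the following NumPy sketch removes every row that contains a missing value and reports which rows were dropped. The function name `purge_rows` and its return convention are assumptions for this illustration; the data mirrors the second call in Example 3, where negative infinity marks missing values.

```python
import numpy as np


def purge_rows(x, missing_value=None):
    """Return (rows without missing values, indices of the removed rows)."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)  # NaN always counts as missing
    if missing_value is not None and not np.isnan(missing_value):
        mask |= x == missing_value  # optional second missing-value marker
    bad = mask.any(axis=1)
    return x[~bad], np.flatnonzero(bad)


data = np.arange(24, dtype=float).reshape(6, 4)
data[1, 1] = data[2, 2] = data[3, 3] = -np.inf
kept, removed = purge_rows(data, missing_value=-np.inf)
print(removed.tolist())     # [1, 2, 3]
print(kept[:, 0].tolist())  # [0.0, 16.0, 20.0]
```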
Usually either imputation or deletion will be performed, but
imputeMissing can be used for the more basic task of returning the
indices of missing values. The indices could then be used to implement other
imputation methods.
Following the standard practice, missing data values are always represented
by NaN. Option missingValue allows the user to also specify a second
value to represent missing values.
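The missing-value convention can be illustrated with a short NumPy sketch. The helper name `find_missing` is hypothetical, and the use of flattened row-major indices is an inference from the output of Example 2 (where the value at x[5][0] is reported as index 20) rather than a documented guarantee.

```python
import numpy as np


def find_missing(x, missing_value=None):
    """Return flattened (row-major) indices of the missing entries of x."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)  # NaN is always treated as missing
    if missing_value is not None and not np.isnan(missing_value):
        mask |= x == missing_value  # second kind of missing value
    return np.flatnonzero(mask)


x = np.arange(80, dtype=float).reshape(20, 4)
idx = find_missing(x, missing_value=20.0)
print(len(idx), idx.tolist())  # 1 [20]
row, col = np.unravel_index(idx[0], x.shape)
print(row, col)  # 5 0
```

Converting a flat index back to a (row, column) pair with `np.unravel_index` is one way to feed these positions into a custom imputation scheme.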
Examples¶
Example 1¶
Count the missing values in a data set, where the only valid missing value is NaN.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
nObservations = 20
nVariables = 4
# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j
# Replace some of the data values;
# note that +/-inf are not considered 'missing'
x[3][1] = machine(6) # NaN
x[5][2] = machine(6) # NaN
x[7][2] = machine(7) # positive infinity
x[9][3] = machine(8) # negative infinity
# declare no independent variables
indind = []
count = imputeMissing(indind, x)
print("number of missing values = %d" % count)
Output¶
number of missing values = 2
Example 2¶
Set the value 20 to represent a missing value and find the indices in x
which contain the missing value.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
nObservations = 20
nVariables = 4
# Declare 2 independent variables
nIndependent = 2
# Declare that columns 2 and 3 are independent
indind = [2, 3]
# Missing value is represented by 20
# and will be located at x[5][0]
mval = 20.0
# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j
indices = []
count = imputeMissing(indind, x,
                      missingValue=mval,
                      missingIndex=indices)
print("number of missing values = %d" % count)
print("indices[0] = %d" % indices[0])
Output¶
number of missing values = 1
indices[0] = 20
Example 3¶
In this example both NaN and positive infinity represent missing values in the
original data. In the first call to imputeMissing, these missing values are
replaced by negative infinity. In the second call to imputeMissing,
negative infinity is set to represent missing values and the rows containing
the missing values are purged for the final output.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix
nRows = 6
nCols = 4
fmt = "%6.2f"
# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j
# Insert bad values into data
data[1][1] = machine(6) # NaN
data[2][2] = machine(7) # positive infinity
data[3][3] = machine(8) # negative infinity
writeMatrix("Original data with missing values", data,
writeFormat=fmt)
# Set the missing value to be +inf
mval = machine(7)
# Replace missing values with neg inf
replacementValue = machine(8)
# Declare one independent variable
indind = [0]
# Replace NaN and +inf values with -inf
xImputed = []
count = imputeMissing(indind, data,
                      missingValue=mval,
                      replacementValue=replacementValue,
                      xImputed=xImputed)
writeMatrix("Data with values replaced", xImputed,
            writeFormat=fmt)
# Now purge all rows containing -inf
mval = machine(8)
purge = {}
xPurged = []
count = imputeMissing(indind, xImputed,
                      missingValue=mval,
                      purge=purge,
                      xImputed=xPurged)
print("\n number missing = %d, number of rows purged = %d" %
      (count, purge['nMissingRows']))
print("\n Purged row numbers:")
badObs = purge['missingRowIndices']
for i in range(purge['nMissingRows']):
    print(" %d" % badObs[i])
writeMatrix("New data with bad rows purged", xPurged,
            writeFormat=fmt)
Output¶
number missing = 3, number of rows purged = 3
Purged row numbers:
1
2
3
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ...... 6.00 7.00
3 8.00 9.00 ++++++ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
Data with values replaced
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ------ 6.00 7.00
3 8.00 9.00 ------ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
New data with bad rows purged
1 2 3 4
1 0.00 1.00 2.00 3.00
2 16.00 17.00 18.00 19.00
3 20.00 21.00 22.00 23.00
Example 4¶
Replace missing values with the mean of the 3 nearest neighbors.
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing, MEAN_METH
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix
nRows = 10
nCols = 4
fmt = "%6.2f"
# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j
# Insert bad values into data
data[1][3] = machine(6) # NaN
data[4][2] = machine(6) # NaN
writeMatrix("Original data with missing values", data,
writeFormat=fmt)
# Declare two independent variables
indind = [0, 1]
# Replace missing values using mean method
xImputed = []
count = imputeMissing(indind, data,
                      imputeMethod={'method': MEAN_METH, 'k': 3},
                      xImputed=xImputed)
writeMatrix("Imputed data (using mean method)", xImputed,
writeFormat=fmt)
Output¶
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 ......
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 ...... 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
Imputed data (using mean method)
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 9.67
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 20.67 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
Warning Errors¶
| Warning | Description |
|---|---|
| IMSLS_NO_GOOD_ROW | Each row contains missing values. No imputation is performed. |
| IMSLS_INDEP_HAS_MISSING | At least one of the independent variables contains a missing value. No imputation is performed. |