imputeMissing¶
Locates missing values and optionally replaces missing values of the dependent variables with nearest neighbor estimates.
Synopsis¶
imputeMissing (indind, x)
Required Arguments¶
- int indind[] (Input) - Array of size nIndependent designating the indices of the columns of x containing the independent variables.
- float x[[]] (Input) - Array of size nObservations × nVariables containing the observations. Missing values of the dependent variables may be imputed as functions of the independent variables, but if any of the independent variables have missing values, then imputation will not be performed and a warning will be issued. If one of the optional arguments replacementValue, imputeMethod, or purge is supplied, xImputed (see optional argument xImputed) contains the imputed data on output.
Return Value¶
The number of missing values (nMiss) in the data array x.
Optional Arguments¶
missingValue, float (Input) - Scalar value (other than NaN) representing a missing value. NaN always represents a missing value, so if missingValue is not NaN it will be treated as a second type of missing value.
metricDiag, float (Input) - Array of length nIndependent defining a diagonal metric for independent variable space. This scales the independent variables in the distance calculations used to determine nearest neighbors. The default measure of distance is Euclidean (g[i] = 1 for all i).
replacementValue, float (Input) - Replace missing values in x with replacementValue. Output data array is returned in xImputed. Requires optional argument xImputed.
or
imputeMethod, int method, int k (Input) - The method to be used for imputing missing values using k nearest neighbors. Replace the missing value of dependent variable y at point x in the space of independent variables with the mode, mean, median, geometric mean, or linear regression (method) of y on those k nearest neighbors of x which have no missing values. To use all of the data and eliminate the need to compute neighborhoods, set k ≥ nObservations. If there are no independent variables, set k ≥ nObservations. Imputed data is returned in xImputed. Requires optional argument xImputed.
Valid values for method are:
method | Description |
---|---|
MODE_METH | Mode |
MEAN_METH | Mean |
MEDIAN_METH | Median |
GEOMEAN_METH | Geometric mean |
LINEAR_METH | Linear regression |
or
purge (Output) - All rows with missing values are removed from x and the resulting data array is returned in xImputed. nMissingRows is the number of rows that were removed. missingRowIndices are the indices of the rows that were removed. Requires optional argument xImputed.
missingIndex (Output) - The array of size nMiss, containing the indices of x where missing values occur. nMiss is the function return value. If the data has no missing values, the pointer is returned as None.
xImputed (Output) - Array containing imputed data. This argument is required when replacementValue, imputeMethod, or purge is supplied. For options replacementValue and imputeMethod, xImputed contains all data from x with missing values replaced in the dependent variable columns. For option purge, xImputed is an array of size (nObservations − nMissingRows) × nVariables, containing the data from x with the rows of missing data removed.
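As a rough illustration of what the method values listed for imputeMethod compute, the sketch below applies standard-library summaries to the dependent-variable values of a set of neighbors. It is not pyimsl code: the names ESTIMATORS and summarize_neighbors are hypothetical, the string keys merely mirror the documented constant names, and linear regression is omitted because it needs the independent variables as well.

```python
import statistics

# Hypothetical lookup table: the keys mirror the documented method names,
# but the functions are standard-library stand-ins, not pyimsl internals.
ESTIMATORS = {
    "MODE_METH": statistics.mode,
    "MEAN_METH": statistics.mean,
    "MEDIAN_METH": statistics.median,
    "GEOMEAN_METH": statistics.geometric_mean,
}


def summarize_neighbors(values, method):
    """Collapse the neighbors' dependent-variable values into one estimate."""
    return ESTIMATORS[method](values)


# Dependent-variable values observed at three neighboring points
neighbor_values = [2.0, 2.0, 8.0]
estimates = {name: summarize_neighbors(neighbor_values, name)
             for name in ESTIMATORS}
for name, value in estimates.items():
    print("%s: %.4f" % (name, value))
```

The four summaries can differ noticeably on skewed neighborhoods such as this one, which is one reason to choose the method with the distribution of the dependent variable in mind.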
Description¶
Function imputeMissing
locates missing values and, optionally, replaces
them with estimated values. This replacement process, called imputation,
applies only to dependent variables. If x denotes an arbitrary point in
independent variable space and y denotes a dependent variable with a
missing value at \(x=x_i\), then y at \(x_i\) is estimated as
\(y(x_i)=f(x_i)\) where \(f(x)\) is some function of x in some
neighborhood of \(x_i\). imputeMissing
provides five options (see
imputeMethod) for the form of \(f(x)\),
and each option allows neighborhood size to be specified in terms of some
given number of nearest neighbors. The neighbors exclude observations with
missing values and are determined by distance, measured as the norm relative to a metric G.
By default, \(G=I\), but the metricDiag
option can be used to specify
any other diagonal metric G. A sixth option, replacementValue
, allows
the user to specify one value to be used as a replacement for all missing
values.
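The neighborhood-based estimate described above can be sketched in plain Python. This is an illustration only, not the pyimsl implementation. The helper name knn_mean_impute is hypothetical, the distance form (square root of the sum of g[i] times the squared coordinate differences) is an assumption about how a diagonal metric enters, ties between equidistant neighbors are broken by row order, and the independent columns are assumed to contain no missing values.

```python
import math


def knn_mean_impute(x, indind, k, metric_diag=None):
    """Return a copy of x in which each NaN in a dependent column is
    replaced by the mean of that column over the k nearest complete rows."""
    # Assumption: default metric is the identity (plain Euclidean distance)
    g = metric_diag if metric_diag is not None else [1.0] * len(indind)
    # Only rows without any missing value may serve as neighbors
    complete = [row for row in x if not any(math.isnan(v) for v in row)]
    result = [list(row) for row in x]
    for row in result:
        for j in range(len(row)):
            if j in indind or not math.isnan(row[j]):
                continue

            def distance(other):
                # Weighted distance in the space of independent variables
                return math.sqrt(sum(gi * (row[c] - other[c]) ** 2
                                     for gi, c in zip(g, indind)))

            nearest = sorted(complete, key=distance)[:k]
            if nearest:  # leave the value missing if no complete row exists
                row[j] = sum(n[j] for n in nearest) / len(nearest)
    return result


nan = float("nan")
x = [[0.0, 1.0], [1.0, 3.0], [2.0, nan], [3.0, 7.0], [10.0, 50.0]]
imputed = knn_mean_impute(x, indind=[0], k=2)
all_rows = knn_mean_impute(x, indind=[0], k=len(x))
print(imputed[2][1], all_rows[2][1])
```

With k = 2 only the two closest complete rows contribute, while choosing k at least as large as the number of observations averages over every complete row, which matches the documented way of skipping the neighborhood computation.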
Instead of being used for imputation, imputeMissing
can be used to
simply remove all observations which contain missing values. This is
accomplished with the option purge
. With this option, all rows with
missing values are removed from the input data matrix. Unlike imputation,
this option is not limited to dependent variables and can be used to handle
missing values in the independent variables.
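The effect of purging can be mimicked with a short sketch, shown here on the same data layout as Example 3 after its replacement step. The function purge_missing_rows is a hypothetical stand-in written for this illustration and does not call pyimsl. It assumes a value is missing when it is NaN or equals an optional second missing value.

```python
import math


def purge_missing_rows(x, missing_value=None):
    """Return (kept_rows, missing_row_indices) for a list-of-lists matrix."""

    def is_missing(v):
        # NaN is always missing; missing_value is an optional second marker
        return math.isnan(v) or (missing_value is not None
                                 and v == missing_value)

    missing_row_indices = [i for i, row in enumerate(x)
                           if any(is_missing(v) for v in row)]
    kept_rows = [list(row) for i, row in enumerate(x)
                 if i not in missing_row_indices]
    return kept_rows, missing_row_indices


n_rows, n_cols = 6, 4
data = [[float(i * n_cols + j) for j in range(n_cols)]
        for i in range(n_rows)]
data[1][1] = float("nan")
data[2][2] = float("-inf")
data[3][3] = float("-inf")

kept, bad_rows = purge_missing_rows(data, missing_value=float("-inf"))
print("rows purged:", bad_rows)
print("rows kept:", len(kept))
```

Because whole rows are dropped, the check covers every column, including those that would be treated as independent variables during imputation.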
Usually either imputation or deletion will be performed, but
imputeMissing
can be used for the more basic task of returning the
indices of missing values. The indices could then be used to implement other
imputation methods.
Following standard practice, missing data values are always represented
by NaN. Option missingValue
allows the user to also specify a second
value to represent missing values.
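A minimal sketch of this detection rule follows. The function locate_missing is hypothetical and not part of pyimsl. It assumes that indices are reported as flat, row-major positions, which is consistent with the output of Example 2 (element x[5][0] of a 4-column array reported as index 20) but is not stated explicitly in this section.

```python
import math


def locate_missing(x, missing_value=None):
    """Return flat row-major indices of entries that are NaN or that
    equal the optional second missing value."""
    n_cols = len(x[0])
    indices = []
    for i, row in enumerate(x):
        for j, v in enumerate(row):
            if math.isnan(v) or (missing_value is not None
                                 and v == missing_value):
                indices.append(i * n_cols + j)
    return indices


n_obs, n_vars = 20, 4
x = [[float(i * n_vars + j) for j in range(n_vars)] for i in range(n_obs)]

# A second missing value of 20.0 flags x[5][0], as in Example 2
by_value = locate_missing(x, missing_value=20.0)

# NaN is always missing; infinities are not, as in Example 1
x[3][1] = float("nan")
x[5][2] = float("nan")
x[7][2] = float("inf")
x[9][3] = float("-inf")
by_nan = locate_missing(x)
print(by_value, by_nan)
```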
Examples¶
Example 1¶
Count the missing values in a data set, where the only valid missing value is NaN.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
nObservations = 20
nVariables = 4
# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j
# Replace some of the data values,
# note +/-inf are not considered 'missing'
x[3][1] = machine(6) # NaN
x[5][2] = machine(6) # NaN
x[7][2] = machine(7) # positive infinity
x[9][3] = machine(8) # negative infinity
# declare no independent variables
indind = []
count = imputeMissing(indind, x)
print("number of missing values = %d" % count)
Output¶
number of missing values = 2
Example 2¶
Set the value 20 to represent a missing value and find the indices in x
which contain the missing value.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
nObservations = 20
nVariables = 4
# Declare 2 independent variables
nIndependent = 2
# Declare that columns 2 and 3 are independent
indind = [2, 3]
# Missing value is represented by 20
# and will be located at x[5][0]
mval = 20.0
# Create the test data
x = zeros((nObservations, nVariables), dtype=double)
for i in range(nObservations):
    for j in range(nVariables):
        x[i][j] = i * nVariables + j
indices = []
count = imputeMissing(indind, x,
                      missingValue=mval,
                      missingIndex=indices)
print("number of missing values = %d" % count)
print("indices[0] = %d" % indices[0])
Output¶
number of missing values = 1
indices[0] = 20
Example 3¶
In this example both NaN and positive infinity represent missing values in the
original data. In the first call to imputeMissing
, missing values are
replaced by negative infinity. In the second call to imputeMissing
,
negative infinity is set to represent missing values and the rows containing
the missing values are purged for the final output.
from __future__ import print_function
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix
nRows = 6
nCols = 4
fmt = "%6.2f"
# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j
# Insert bad values into data
data[1][1] = machine(6) # NaN
data[2][2] = machine(7) # positive infinity
data[3][3] = machine(8) # negative infinity
writeMatrix("Original data with missing values", data,
            writeFormat=fmt)
# Set the missing value to be +inf
mval = machine(7)
# Replace missing values with neg inf
replacementValue = machine(8)
# Declare one independent variable
indind = [0]
# Replace NaN and +inf values with -inf
xImputed = []
count = imputeMissing(indind, data,
                      missingValue=mval,
                      replacementValue=replacementValue,
                      xImputed=xImputed)
writeMatrix("Data with values replaced", xImputed,
            writeFormat=fmt)
# Now purge all rows containing -inf
mval = machine(8)
purge = {}
xPurged = []
count = imputeMissing(indind, xImputed,
                      missingValue=mval,
                      purge=purge,
                      xImputed=xPurged)
print("\n number missing = %d, number of rows purged = %d" %
      (count, purge['nMissingRows']))
print("\n Purged row numbers:")
badObs = purge['missingRowIndices']
for i in range(purge['nMissingRows']):
    print(" %d" % badObs[i])
writeMatrix("New data with bad rows purged", xPurged,
            writeFormat=fmt)
Output¶
number missing = 3, number of rows purged = 3
Purged row numbers:
1
2
3
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ...... 6.00 7.00
3 8.00 9.00 ++++++ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
Data with values replaced
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 ------ 6.00 7.00
3 8.00 9.00 ------ 11.00
4 12.00 13.00 14.00 ------
5 16.00 17.00 18.00 19.00
6 20.00 21.00 22.00 23.00
New data with bad rows purged
1 2 3 4
1 0.00 1.00 2.00 3.00
2 16.00 17.00 18.00 19.00
3 20.00 21.00 22.00 23.00
Example 4¶
Replace missing values with the mean of the 3 nearest neighbors.
from numpy import *
from pyimsl.stat.imputeMissing import imputeMissing, MEAN_METH
from pyimsl.stat.machine import machine
from pyimsl.stat.writeMatrix import writeMatrix
nRows = 10
nCols = 4
fmt = "%6.2f"
# Create the test data
data = zeros((nRows, nCols), dtype=double)
for i in range(nRows):
    for j in range(nCols):
        data[i][j] = i * nCols + j
# Insert bad values into data
data[1][3] = machine(6) # NaN
data[4][2] = machine(6) # NaN
writeMatrix("Original data with missing values", data,
            writeFormat=fmt)
# Declare two independent variables
indind = [0, 1]
# Replace missing values using mean method
xImputed = []
count = imputeMissing(indind, data,
                      imputeMethod={'method': MEAN_METH, "k": 3},
                      xImputed=xImputed)
writeMatrix("Imputed data (using mean method)", xImputed,
            writeFormat=fmt)
Output¶
Original data with missing values
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 ......
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 ...... 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
Imputed data (using mean method)
1 2 3 4
1 0.00 1.00 2.00 3.00
2 4.00 5.00 6.00 9.67
3 8.00 9.00 10.00 11.00
4 12.00 13.00 14.00 15.00
5 16.00 17.00 20.67 19.00
6 20.00 21.00 22.00 23.00
7 24.00 25.00 26.00 27.00
8 28.00 29.00 30.00 31.00
9 32.00 33.00 34.00 35.00
10 36.00 37.00 38.00 39.00
Warning Errors¶
IMSLS_NO_GOOD_ROW | Each row contains missing values. No imputation is performed. |
IMSLS_INDEP_HAS_MISSING | At least one of the independent variables contains a missing value. No imputation is performed. |