estimateMissing¶
Estimates missing values in a time series.
Synopsis¶
estimateMissing(tpoints, z)
Required Arguments¶
- int
tpoints[]
(Input) - Vector of length
nObs
containing the time points \(t_1,\ldots,t_{\mathit{n\_obs}}\) at which the time series values were observed. The time points must be in strictly increasing order. Time points for missing values must lie in the open interval \(\left( t_1,t_{\mathit{n\_obs}}\right)\). - float
z[]
(Input) Vector of length
nObs
containing the time series values. The values must be ordered in accordance with the values in vectortpoints
. It is assumed that the time series after estimation of missing values contains values at equidistant time points where the distance between two consecutive time points is one. If the non-missing time series values are observed at time points \(t_1,\ldots t_{nObs}\), then missing values between \(t_i\) and \(t_{i+1}\), i = 1, …,nObs
- 1 , exist if \(t_{i+1}\) - \(t_i\) > 1. The size of the gap between \(t_i\) and \(t_{i+1}\) is then \(t_{i+1}-t_i-1\). The total length of the time series with non-missing and estimated missing values is\(t_{nObs}-t_i+1\), or
tpoints[nObs
‑1]
‑tpoints[0]
+1.
Return Value¶
An array of length (tpoints[nObs-
1]-tpoints[0]+
1) containing
the time series together with estimates for the missing values. If an error
occurred, None
is returned.
Optional Arguments¶
method
, int (Input)The method used for estimating the missing values:
Index Description 0 Use median. 1 Use cubic spline interpolation. 2 Use one-step-ahead forecasts from an AR(1) model. 3 Use one-step-ahead forecasts from an AR(p) model. If
method = 2
is chosen, then all values of gaps beginning at time points \(t_1+1\) or \(t_1+2\) are estimated bymethod
0. Ifmethod
=
3 is chosen and the first gap starts at \(t_1+1\), then the values of this gap are also estimated bymethod
0. If the length of the series before a gap, denotedlen
, is greater than 1 and less than 2⋅maxLag
, thenmaxLag
is reduced tolen/2
for the computation of the missing values within this gap.Default:
method
= 3maxLag
, int (Input)Maximum lag number when
method
=
3 was chosen.Default:
maxLag
=
10ntimes
(Output)- Number of elements in the time series with estimated missing values. Note
that
ntimes = tpoints[nObs
-1]-tpoints[0]
+1. meanEstimate
, float (Input)- Estimate of the mean of the time series.
relativeError
, float (Input)Stopping criterion for use in the nonlinear equation solver used by method 2.
Default:
relativeError
= 100 ×machine(4)
for single precision.maxIterations
, int (Input)Maximum number of iterations allowed in the nonlinear equations solver used by method 2.
Default:
maxIterations
=
200.timesArray
(Output)- An array of length
ntimes
=
tpoints[nObs
-1]-tpoints[
0]
+1 containing the time points of the time series with estimates for the missing values. missingIndex
(Output)- An array of length
(ntimes-nObs)
containing the indices for the missing values in array times. Ifntimes-nObs
= 0, then no missing value could be found andNone
is returned.
Description¶
Traditional time series analysis as described by Box, Jenkins and Reinsel
(1994) requires the observations made at equidistant time points
\(t_1,t_1+1,t_1+2,\ldots,t_n\). When observations are missing, the
problem occurs to determine suitable estimates. Function estimateMissing
offers 4 estimation methods:
Method 0 estimates the missing observations in a gap by the median of the last four time series values before and the first four values after the gap. If not enough values are available before or after the gap then the number is reduced accordingly. This method is very fast and simple, but its use is limited to stationary ergodic series without outliers and level shifts.
Method 1 uses a cubic spline interpolation method to estimate missing values. Here the interpolation is again done over the last four time series values before and the first four values after the gap. The missing values are estimated by the resulting interpolant. This method gives smooth transitions across missing values.
Method 2 assumes that the time series before the gap can be well described by an AR(1) process. If the last observation prior to the gap is made at time point \(t_m\) then it uses the time series values at \(t_1,t_1+1,\ldots,t_m\) to compute the one-step-ahead forecast at origin \(t_m\). This value is taken as an estimate for the missing value at time point \(t_m+1\). If the value at \(t_m+1\) is also missing then the values at time points \(t_1,t_1+1,\ldots,t_{m+1}\) are used to recompute the AR(1) model, estimate the value at \(t_m+2\) and so on. The coefficient \(\phi_1\) in the AR(1) model is computed internally by the method of least squares from function arma.
Finally, method 3 uses an AR(p) model to estimate missing values by a
one-step-ahead forecast . First, function autoUniAr,
applied to the time series prior to the missing values, is used to determine
the optimum p
from the set {0, 1, …, maxLag
} of possible values
and to compute the parameters \(\phi_1,\ldots,\phi_p\) of the resulting
AR(p) model. The parameters are estimated by the least squares method based
on Householder transformations as described in Kitagawa and Akaike (1978).
Denoting the mean of the series \(y_{t_1},y_{t_1+1},\ldots,y_{t_m}\) by μ
the one-step-ahead forecast at origin \(t_m\), \(\hat{y}_{t_m}(1)\),
can be computed by the formula
This value is used as an estimate for the missing value. The procedure starting with autoUniAr is then repeated for every further missing value in the gap. All four estimation methods treat gaps of missing values in increasing time order.
Example¶
Consider the AR(1) process
We assume that \(\{a_t\}\) is a Gaussian white noise process, \(a_t \sim N\left(0,\sigma^2\right)\). Then, \(E\left[Y_t\right]=0\) and \(\mathit{VAR} \left[ Y_t \right]=\sigma^2/\left( 1-\phi_1^2 \right)\) (see Anderson, p. 174).
The time series in the code below was artificially generated from an AR(1) process characterized by \(\phi_1=-0.7\) and \(\sigma^2=1-\phi_1^2=0.51\). This process is stationary with VAR[\(Y_t\)] = 1. As initial value, \(Y_0 :=a_0\) was taken. The sequence \(\{a_t\}\) was generated by a random number generator.
From the original series, we remove the observations at time points
\(t=130,t=140,t=141,t=160,t=175,t=176\). Then, estimateMissing
is
used to compute estimates for the missing values by all 4 estimation methods
available. The estimated values are compared with the actual values.
from __future__ import print_function
from numpy import *
from pyimsl.stat.estimateMissing import estimateMissing
maxlag = 20
times_1 = empty(200)
times_2 = empty(200)
x_1 = empty(200)
x_2 = empty(200)
times = []
missing_index = []
ntimes = []
y = array([
1.30540, -1.37166, 1.47905, -0.91059, 1.36191, -2.16966, 3.11254,
-1.99536, 2.29740, -1.82474, -0.25445, 0.33519, -0.25480, -0.50574,
-0.21429, -0.45932, -0.63813, 0.25646, -0.46243, -0.44104, 0.42733,
0.61102, -0.82417, 1.48537, -1.57733, -0.09846, 0.46311, 0.49156,
-1.66090, 2.02808, -1.45768, 1.36115, -0.65973, 1.13332, -0.86285,
1.23848, -0.57301, -0.28210, 0.20195, 0.06981, 0.28454, 0.19745,
-0.16490, -1.05019, 0.78652, -0.40447, 0.71514, -0.90003, 1.83604,
-2.51205, 1.00526, -1.01683, 1.70691, -1.86564, 1.84912, -1.33120,
2.35105, -0.45579, -0.57773, -0.55226, 0.88371, 0.23138, 0.59984,
0.31971, 0.59849, 0.41873, -0.46955, 0.53003, -1.17203, 1.52937,
-0.48017, -0.93830, 1.00651, -1.41493, -0.42188, -0.67010, 0.58079,
-0.96193, 0.22763, -0.92214, 1.35697, -1.47008, 2.47841, -1.50522,
0.41650, -0.21669, -0.90297, 0.00274, -1.04863, 0.66192, -0.39143,
0.40779, -0.68174, -0.04700, -0.84469, 0.30735, -0.68412, 0.25888,
-1.08642, 0.52928, 0.72168, -0.18199, -0.09499, 0.67610, 0.14636,
0.46846, -0.13989, 0.50856, -0.22268, 0.92756, 0.73069, 0.78998,
-1.01650, 1.25637, -2.36179, 1.99616, -1.54326, 1.38220, 0.19674,
-0.85241, 0.40463, 0.39523, -0.60721, 0.25041, -1.24967, 0.26727,
1.40042, -0.66963, 1.26049, -0.92074, 0.05909, -0.61926, 1.41550,
0.25537, -0.13240, -0.07543, 0.10413, 1.42445, -1.37379, 0.44382,
-1.57210, 2.04702, -2.22450, 1.27698, 0.01073, -0.88459, 0.88194,
-0.25019, 0.70224, -0.41855, 0.93850, 0.36007, -0.46043, 0.18645,
0.06337, 0.29414, -0.20054, 0.83078, -1.62530, 2.64925, -1.25355,
1.59094, -1.00684, 1.03196, -1.58045, 2.04295, -2.38264, 1.65095,
-0.33273, -1.29092, 0.14020, -0.11434, 0.04392, 0.05293, -0.42277,
0.59143, -0.03347, -0.58457, 0.87030, 0.19985, -0.73500, 0.73640,
0.29531, 0.22325, -0.60035, 1.42253, -1.11278, 1.30468, -0.41923,
-0.38019, 0.50937, 0.23051, 0.46496, 0.02459, -0.68478, 0.25821,
1.17655, -2.26629, 1.41173, -0.68331])
tpoints = array([
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154,
155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186,
187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200])
n_miss = 0
times_1[0] = times_2[0] = tpoints[0]
x_1[0] = x_2[0] = y[0]
k = 0
for i in range(1, 200):
times_1[i] = tpoints[i]
x_1[i] = y[i]
# Generate series with missing values
if i != 129 and i != 139 and i != 140 and i != 159 and i != 174 and i != 175:
k += 1
times_2[k] = times_1[i]
x_2[k] = x_1[i]
n_obs = k + 1
print("n_obs: ", n_obs)
for j in range(0, 4):
if j <= 2:
result = estimateMissing(times_2[0:n_obs], x_2[0:n_obs],
method=j,
ntimes=ntimes,
timesArray=times,
missingIndex=missing_index)
else:
result = estimateMissing(times_2[0:n_obs], x_2[0:n_obs],
method=j,
ntimes=ntimes,
maxLag=20,
timesArray=times,
missingIndex=missing_index)
if j == 0:
print("\nMethod: Median")
if j == 1:
print("\nMethod: Cubic Spline Interpolation")
if j == 2:
print("\nMethod: AR(1) Forecast")
if j == 3:
print("\nMethod: AR(p) Forecast")
print("ntimes = %d" % ntimes[0])
print("time actual predicted difference")
n_miss = ntimes[0] - n_obs
for i in range(0, n_miss):
miss_ind = missing_index[i]
print("%d, %10.5f, %10.5f, %11.6f" % (times[miss_ind],
x_1[miss_ind], result[miss_ind],
abs(x_1[miss_ind] - result[miss_ind])))
Output¶
n_obs: 194
Method: Median
ntimes = 200
time actual predicted difference
130, -0.92074, 0.26132, 1.182060
140, 0.44382, 0.05743, 0.386390
141, -1.57210, 0.05743, 1.629530
160, 2.64925, 0.04680, 2.602450
175, -0.42277, 0.04842, 0.471195
176, 0.59143, 0.04842, 0.543005
Method: Cubic Spline Interpolation
ntimes = 200
time actual predicted difference
130, -0.92074, 1.54109, 2.461829
140, 0.44382, -0.40730, 0.851119
141, -1.57210, 2.49709, 4.069194
160, 2.64925, -2.94712, 5.596371
175, -0.42277, 0.25066, 0.673430
176, 0.59143, 0.38032, 0.211107
Method: AR(1) Forecast
ntimes = 200
time actual predicted difference
130, -0.92074, -0.92987, 0.009135
140, 0.44382, 1.02824, 0.584424
141, -1.57210, -0.74520, 0.826896
160, 2.64925, 1.22876, 1.420490
175, -0.42277, 0.01049, 0.433256
176, 0.59143, 0.03679, 0.554637
Method: AR(p) Forecast
ntimes = 200
time actual predicted difference
130, -0.92074, -0.86385, 0.056895
140, 0.44382, 0.98098, 0.537164
141, -1.57210, -0.64489, 0.927205
160, 2.64925, 1.18966, 1.459592
175, -0.42277, -0.00105, 0.421722
176, 0.59143, 0.03773, 0.553705