estimateMissing

Estimates missing values in a time series.

Synopsis

estimateMissing(tpoints, z)

Required Arguments

int tpoints[] (Input)
Vector of length nObs containing the time points \(t_1,\ldots,t_{\mathit{n\_obs}}\) at which the time series values were observed. The time points must be in strictly increasing order. Time points for missing values must lie in the open interval \(\left( t_1,t_{\mathit{n\_obs}}\right)\).
float z[] (Input)

Vector of length nObs containing the time series values. The values must be ordered in accordance with the values in vector tpoints. It is assumed that the time series after estimation of missing values contains values at equidistant time points where the distance between two consecutive time points is one. If the non-missing time series values are observed at time points \(t_1,\ldots t_{nObs}\), then missing values between \(t_i\) and \(t_{i+1}\), i = 1, …, nObs - 1 , exist if \(t_{i+1}\) - \(t_i\) > 1. The size of the gap between \(t_i\) and \(t_{i+1}\) is then \(t_{i+1}-t_i-1\). The total length of the time series with non-missing and estimated missing values is

\(t_{nObs}-t_i+1\), or tpoints[nObs1]tpoints[0]+1.

Return Value

An array of length (tpoints[nObs-1]-tpoints[0]+1) containing the time series together with estimates for the missing values. If an error occurred, None is returned.

Optional Arguments

method, int (Input)

The method used for estimating the missing values:

Index Description
0 Use median.
1 Use cubic spline interpolation.
2 Use one-step-ahead forecasts from an AR(1) model.
3 Use one-step-ahead forecasts from an AR(p) model.

If method = 2 is chosen, then all values of gaps beginning at time points \(t_1+1\) or \(t_1+2\) are estimated by method 0. If method = 3 is chosen and the first gap starts at \(t_1+1\), then the values of this gap are also estimated by method 0. If the length of the series before a gap, denoted len, is greater than 1 and less than 2⋅ maxLag, then maxLag is reduced to len/2 for the computation of the missing values within this gap.

Default: method = 3

maxLag, int (Input)

Maximum lag number when method = 3 was chosen.

Default: maxLag = 10

ntimes (Output)
Number of elements in the time series with estimated missing values. Note that ntimes = tpoints[nObs-1]-tpoints[0]+1.
meanEstimate, float (Input)
Estimate of the mean of the time series.
relativeError, float (Input)

Stopping criterion for use in the nonlinear equation solver used by method 2.

Default: relativeError = 100 × machine(4) for single precision.

maxIterations, int (Input)

Maximum number of iterations allowed in the nonlinear equations solver used by method 2.

Default: maxIterations = 200.

timesArray (Output)
An array of length ntimes = tpoints[nObs-1]-tpoints[0]+1 containing the time points of the time series with estimates for the missing values.
missingIndex (Output)
An array of length (ntimes-nObs) containing the indices for the missing values in array times. If ntimes-nObs = 0, then no missing value could be found and None is returned.

Description

Traditional time series analysis as described by Box, Jenkins and Reinsel (1994) requires the observations made at equidistant time points \(t_1,t_1+1,t_1+2,\ldots,t_n\). When observations are missing, the problem occurs to determine suitable estimates. Function estimateMissing offers 4 estimation methods:

Method 0 estimates the missing observations in a gap by the median of the last four time series values before and the first four values after the gap. If not enough values are available before or after the gap then the number is reduced accordingly. This method is very fast and simple, but its use is limited to stationary ergodic series without outliers and level shifts.

Method 1 uses a cubic spline interpolation method to estimate missing values. Here the interpolation is again done over the last four time series values before and the first four values after the gap. The missing values are estimated by the resulting interpolant. This method gives smooth transitions across missing values.

Method 2 assumes that the time series before the gap can be well described by an AR(1) process. If the last observation prior to the gap is made at time point \(t_m\) then it uses the time series values at \(t_1,t_1+1,\ldots,t_m\) to compute the one-step-ahead forecast at origin \(t_m\). This value is taken as an estimate for the missing value at time point \(t_m+1\). If the value at \(t_m+1\) is also missing then the values at time points \(t_1,t_1+1,\ldots,t_{m+1}\) are used to recompute the AR(1) model, estimate the value at \(t_m+2\) and so on. The coefficient \(\phi_1\) in the AR(1) model is computed internally by the method of least squares from function arma.

Finally, method 3 uses an AR(p) model to estimate missing values by a one-step-ahead forecast . First, function autoUniAr, applied to the time series prior to the missing values, is used to determine the optimum p from the set {0, 1, …, maxLag} of possible values and to compute the parameters \(\phi_1,\ldots,\phi_p\) of the resulting AR(p) model. The parameters are estimated by the least squares method based on Householder transformations as described in Kitagawa and Akaike (1978). Denoting the mean of the series \(y_{t_1},y_{t_1+1},\ldots,y_{t_m}\) by μ the one-step-ahead forecast at origin \(t_m\), \(\hat{y}_{t_m}(1)\), can be computed by the formula

\[\hat{y}_{t_m}(1) = \mu \left(1 - \sum\nolimits_{j=1}^{p} \phi_j\right) + \sum\nolimits_{j=1}^{p} \phi_j y_{t_m + 1 - j}.\]

This value is used as an estimate for the missing value. The procedure starting with autoUniAr is then repeated for every further missing value in the gap. All four estimation methods treat gaps of missing values in increasing time order.

Example

Consider the AR(1) process

\[Y_t = \phi_1 Y_{t-1} + a_t, \phantom{...} t=1,2,3,\ldots\]

We assume that \(\{a_t\}\) is a Gaussian white noise process, \(a_t \sim N\left(0,\sigma^2\right)\). Then, \(E\left[Y_t\right]=0\) and \(\mathit{VAR} \left[ Y_t \right]=\sigma^2/\left( 1-\phi_1^2 \right)\) (see Anderson, p. 174).

The time series in the code below was artificially generated from an AR(1) process characterized by \(\phi_1=-0.7\) and \(\sigma^2=1-\phi_1^2=0.51\). This process is stationary with VAR[\(Y_t\)] = 1. As initial value, \(Y_0 :=a_0\) was taken. The sequence \(\{a_t\}\) was generated by a random number generator.

From the original series, we remove the observations at time points \(t=130,t=140,t=141,t=160,t=175,t=176\). Then, estimateMissing is used to compute estimates for the missing values by all 4 estimation methods available. The estimated values are compared with the actual values.

from __future__ import print_function
from numpy import *
from pyimsl.stat.estimateMissing import estimateMissing

maxlag = 20
times_1 = empty(200)
times_2 = empty(200)
x_1 = empty(200)
x_2 = empty(200)
times = []
missing_index = []
ntimes = []

y = array([
    1.30540, -1.37166, 1.47905, -0.91059, 1.36191, -2.16966, 3.11254,
    -1.99536, 2.29740, -1.82474, -0.25445, 0.33519, -0.25480, -0.50574,
    -0.21429, -0.45932, -0.63813, 0.25646, -0.46243, -0.44104, 0.42733,
    0.61102, -0.82417, 1.48537, -1.57733, -0.09846, 0.46311, 0.49156,
    -1.66090, 2.02808, -1.45768, 1.36115, -0.65973, 1.13332, -0.86285,
    1.23848, -0.57301, -0.28210, 0.20195, 0.06981, 0.28454, 0.19745,
    -0.16490, -1.05019, 0.78652, -0.40447, 0.71514, -0.90003, 1.83604,
    -2.51205, 1.00526, -1.01683, 1.70691, -1.86564, 1.84912, -1.33120,
    2.35105, -0.45579, -0.57773, -0.55226, 0.88371, 0.23138, 0.59984,
    0.31971, 0.59849, 0.41873, -0.46955, 0.53003, -1.17203, 1.52937,
    -0.48017, -0.93830, 1.00651, -1.41493, -0.42188, -0.67010, 0.58079,
    -0.96193, 0.22763, -0.92214, 1.35697, -1.47008, 2.47841, -1.50522,
    0.41650, -0.21669, -0.90297, 0.00274, -1.04863, 0.66192, -0.39143,
    0.40779, -0.68174, -0.04700, -0.84469, 0.30735, -0.68412, 0.25888,
    -1.08642, 0.52928, 0.72168, -0.18199, -0.09499, 0.67610, 0.14636,
    0.46846, -0.13989, 0.50856, -0.22268, 0.92756, 0.73069, 0.78998,
    -1.01650, 1.25637, -2.36179, 1.99616, -1.54326, 1.38220, 0.19674,
    -0.85241, 0.40463, 0.39523, -0.60721, 0.25041, -1.24967, 0.26727,
    1.40042, -0.66963, 1.26049, -0.92074, 0.05909, -0.61926, 1.41550,
    0.25537, -0.13240, -0.07543, 0.10413, 1.42445, -1.37379, 0.44382,
    -1.57210, 2.04702, -2.22450, 1.27698, 0.01073, -0.88459, 0.88194,
    -0.25019, 0.70224, -0.41855, 0.93850, 0.36007, -0.46043, 0.18645,
    0.06337, 0.29414, -0.20054, 0.83078, -1.62530, 2.64925, -1.25355,
    1.59094, -1.00684, 1.03196, -1.58045, 2.04295, -2.38264, 1.65095,
    -0.33273, -1.29092, 0.14020, -0.11434, 0.04392, 0.05293, -0.42277,
    0.59143, -0.03347, -0.58457, 0.87030, 0.19985, -0.73500, 0.73640,
    0.29531, 0.22325, -0.60035, 1.42253, -1.11278, 1.30468, -0.41923,
    -0.38019, 0.50937, 0.23051, 0.46496, 0.02459, -0.68478, 0.25821,
    1.17655, -2.26629, 1.41173, -0.68331])

tpoints = array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
    25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
    46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
    67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
    88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
    107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
    123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
    139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154,
    155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
    171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186,
    187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200])

n_miss = 0
times_1[0] = times_2[0] = tpoints[0]
x_1[0] = x_2[0] = y[0]
k = 0
for i in range(1, 200):
    times_1[i] = tpoints[i]
    x_1[i] = y[i]

    # Generate series with missing values
    if i != 129 and i != 139 and i != 140 and i != 159 and i != 174 and i != 175:
        k += 1
        times_2[k] = times_1[i]
        x_2[k] = x_1[i]

n_obs = k + 1
print("n_obs: ", n_obs)

for j in range(0, 4):
    if j <= 2:
        result = estimateMissing(times_2[0:n_obs], x_2[0:n_obs],
                                 method=j,
                                 ntimes=ntimes,
                                 timesArray=times,
                                 missingIndex=missing_index)
    else:
        result = estimateMissing(times_2[0:n_obs], x_2[0:n_obs],
                                 method=j,
                                 ntimes=ntimes,
                                 maxLag=20,
                                 timesArray=times,
                                 missingIndex=missing_index)
    if j == 0:
        print("\nMethod: Median")
    if j == 1:
        print("\nMethod: Cubic Spline Interpolation")
    if j == 2:
        print("\nMethod: AR(1) Forecast")
    if j == 3:
        print("\nMethod: AR(p) Forecast")
    print("ntimes = %d" % ntimes[0])
    print("time    actual      predicted   difference")
    n_miss = ntimes[0] - n_obs
    for i in range(0, n_miss):
        miss_ind = missing_index[i]
        print("%d, %10.5f, %10.5f, %11.6f" % (times[miss_ind],
                                              x_1[miss_ind], result[miss_ind],
                                              abs(x_1[miss_ind] - result[miss_ind])))

Output

n_obs:  194

Method: Median
ntimes = 200
time    actual      predicted   difference
130,   -0.92074,    0.26132,    1.182060
140,    0.44382,    0.05743,    0.386390
141,   -1.57210,    0.05743,    1.629530
160,    2.64925,    0.04680,    2.602450
175,   -0.42277,    0.04842,    0.471195
176,    0.59143,    0.04842,    0.543005

Method: Cubic Spline Interpolation
ntimes = 200
time    actual      predicted   difference
130,   -0.92074,    1.54109,    2.461829
140,    0.44382,   -0.40730,    0.851119
141,   -1.57210,    2.49709,    4.069194
160,    2.64925,   -2.94712,    5.596371
175,   -0.42277,    0.25066,    0.673430
176,    0.59143,    0.38032,    0.211107

Method: AR(1) Forecast
ntimes = 200
time    actual      predicted   difference
130,   -0.92074,   -0.92987,    0.009135
140,    0.44382,    1.02824,    0.584424
141,   -1.57210,   -0.74520,    0.826896
160,    2.64925,    1.22876,    1.420490
175,   -0.42277,    0.01049,    0.433256
176,    0.59143,    0.03679,    0.554637

Method: AR(p) Forecast
ntimes = 200
time    actual      predicted   difference
130,   -0.92074,   -0.86385,    0.056895
140,    0.44382,    0.98098,    0.537164
141,   -1.57210,   -0.64489,    0.927205
160,    2.64925,    1.18966,    1.459592
175,   -0.42277,   -0.00105,    0.421722
176,    0.59143,    0.03773,    0.553705