yates¶
Estimates missing observations in designed experiments using Yate’s method.
Synopsis¶
yates(x)
Required Arguments¶
- float
x[[]]
(Input/Output) - A
n
by (nIndependent+
1) 2-dimensional array containing the experimental observations and missing values. The firstnIndependent
columns contain values for the independent variables and the last column contains the corresponding observations for the dependent variable or response. The columns assigned to the independent variables should not contain any missing values. Missing values are included in this array by placing a NaN (not a number) in the last column ofx
. The NaN value can be set using the functionmachine(6)
. Upon successful completion, missing values are replaced with estimates calculated using Yates’ method.
Return Value¶
The number of missing values replaced with estimates using the Yates procedure. A negative return value indicates that the function was unable to successfully estimate all missing values. Typically this occurs when all of the observations for a particular treatment combination are missing. In this case, Yate’s missing value method does not produce a unique set of missing value estimates.
Optional Arguments¶
design
, int (Input)An integer indicating whether a custom or standard design is being used. The association of values for this variable and standard designs is described in the following table:
design Description 0 CRD – Completely Randomized Design. The input matrix, x, is assumed to have only two columns. The first is used to contain integers identifying the treatments. The second column should contain corresponding observations for the dependent variable. In this case, nIndependent=
1. Default value whennIndependent=
1.1 RCBD – Randomized Complete Block Design. The input matrix is assumed to have only three columns. The first is used to contain the treatment identifiers and the second the block identifiers. The last column contains the corresponding observations for the dependent variable. In this case, nIndependent=
2. This is the default value whennIndependent=
2.2 Another design. In this case, the function getSs
is a required input. The design matrix is passed to that function. Initial values for missing observations are set to the grand mean of the data, unless initial values are specified usinginitialEstimates
.Default:
design=
0 ordesign=
1, depending upon whethernIndependent
= 1 or 2 respectively. IfnIndependent
> 2, thendesign
must be set to 2, andgetSs
must be provided as input toyates
.initialEstimates
,nMissing
,initialEstimates
(Input)Initial estimates for the missing values. Argument
nMissing
is the number of missing values. ArgumentinitialEstimates
is an array of lengthnMissing
containing the initial estimates.Default: For
design=
0 anddesign=
1, the initial estimates are calculated using the Yates formula for those designs. Fordesign=
2, the mean of the non‑missing observations is used as the initial estimate for all missing values.maxItn
(Input)Maximum number of iterations in the optimization function for finding the missing value estimates that minimize the error sum of squares in the analysis of variance.
Default:
maxItn
= 500.getSs
, (n
,nIndependent
,nLevels
,dataMatrix
) (Input)A user-supplied function that returns the error sum of squares calculated using the n by (
nIndependent
+1) matrixdataMatrix
.yates
calculates the error sum of squares assuming thatdataMatrix
contains no missing observations. In general,dataMatrix
should be equal to the input matrixx
with missing values replaced by estimates.getSs
is required input whendesign=
2. The arraynLevels
should be of lengthnIndependent
and contain the number of levels associated with each of the firstnIndependent
columns in thedataMatrix
andx
arrays.Arguments
- int
n
(Input) - Number of observations.
- int
nIndependent
(Input) - Number of independent variables.
- int
nLevels[]
(Input) - An array
nLevels
and should be of lengthnIndependent
and contain the number of levels associated with each of the firstnIndependent
columns in thedataMatrix
andx
arrays. - float
dataMatrix[]
(Input) dataMatrix
should be equal to the input matrixx
with missing values replaced by estimates.dataMatrix
should not contain any missing observations.Return Value
Returns the error sum of squares.
gradTol
, float (Input)Scaled gradient tolerance used to determine whether the difference between the error sum of squares is small enough to stop the search for missing value estimates.
Default:
gradTol
= \(\varepsilon^{2/3}\), where \(\varepsilon\) is the machine precision.stepTol
, float (Input)Scaled step tolerance used to determine whether the difference between missing value estimates is small enough to stop the search for missing value estimates.
Default:
stepTol
= \(\varepsilon^{2/3}\), where \(\varepsilon\) is the machine precision.missingIndex
(Output)- An array of length
nMissing
containing the indices for the missing values inx
. The number of missing values,nMissing
, is the return value ofyates
. errorSs
(Output)- The value of the error sum of squares calculated using the missing value
estimates. If
design=
2 then this is equal to the value returned fromgetSs
using the Yates missing value estimates.
Description¶
Several functions for analysis of variance require balanced experimental data, i.e. data containing no missing values within a block and an equal number of replicates for each treatment. If the number of missing observations in an experiment is smaller than the Yates method as described in Yates (1933) and Steel and Torrie (1960), can be used to estimate the missing values. Once the missing values are replaced with these estimates, the data can be passed to an analysis of variance that requires balanced data.
The basic principle behind the Yates method for estimating missing observations is to replace the missing values with values that minimize the error sum of squares in the analysis of variance. Since the error sum of squares depends upon the underlying model for the analysis of variance, the Yates formulas for estimating missing values vary from anova to anova.
Consider, for example, the model underlying experiments conducted using a completely randomized design. If \(y_{ij}\) is the i‑th observation for the i‑th treatment then the error sum of squares for a CRD is calculated using the following formula:
If an observation \(y_{ij}\) is missing then SSE is minimized by replacing that missing observation with the estimate
For a randomized complete block design (RCBD), the calculation for estimating a single missing observation can be derived from the RCBD error sum of squares:
If only a single observation, \(y_{ij}\), is missing from the j‑th block and i‑th treatment, the estimate for this missing observation can be derived by solving the equation:
The solution is referred to as the Yates formula for a RCBD:
where r = nBlocks
, t = nTreatments
, \(y_i\) = total of all
non‑missing observations from the i‑th treatment, \(y_{.j}\) =total
of all non-missing observations from the j‑th block, and y = total of
all non‑missing observations.
If more than one observation is missing, yates
minimization procedure is
used to estimate missing values. For a CRD, all missing observations are set
equal to their corresponding treatment means calculated using the non‑missing
observations. That is, \(\hat{x}_{ij}=\overline{y}_{i.}\).
For RCBD designs with more than one missing value, Yate’s formula for estimating a single missing observation is used to obtain initial estimates for all missing values. These are passed to a function minimization routine to obtain the values that minimize SSE.
For other designs, specify design=
2 and getSs
. The function
getSs
is used to obtain the Yates missing value estimates by selecting
the estimates that minimize sum of squares returned by getSs
. When
called, getSs
calculates the error sum of squares at each iteration
assuming that the data matrix it receives is balanced and contains no
missing values.
Example¶
Missing values can occur in any experiment. Estimating missing values via the Yates method is usually done by minimizing the error sum of squares for that experiment. If only a single observation is missing and there is an analytical formula for calculating the error sum of squares then a formula for estimating the missing value is fairly easily derived. Consider for example a split-plot experiment with a single missing value.
Suppose, for example, that \(x_{ijk}\), the observation for the i‑th whole-plot, j‑th split plot and k‑th block is missing. Then the estimate for a single missing observation in the i‑th whole plot is equal to:
where r = number of blocks, s = number of split‑plots, W = total of all non‑missing values in same block as the missing observation, \(x_{ij.}\) = total of the non-missing observations across blocks of observations from i‑th whole-plot factor level and the j‑th split-plot level, and \(x_{i..}\) = the total of all observations, across split-plots and blocks of the non-missing observations for the i‑th whole plot.
If more than a single observation is missing, then an iterative solution is required to obtain missing value estimates that minimize the error sum of squares.
Function yates
simplifies this procedure. Consider, for example, a
split-plot experiment conducted at a single location using fixed-effects
whole and split plots. If there are no missing values, then the error sum of
squares can be calculated from a 3-way analysis of variance using
whole-plot, split-plot and blocks as the 3 factors. For balanced data
without missing values, the errors sum of squares would be equal to the sum
of the 3-way interaction between these factors and the split-plot by block
interaction.
Calculating the error sum of squares using this 3-way analysis of variance is achieved using the anovaFactorial.
The function is passed to the yates
as an argument, together with a
matrix containing the data for the split-plot experiment. For this example,
the following data matrix obtained from an agricultural experiment will be
used. In this experiment, 4 whole plots were randomly assigned to two 2
blocks. Whole-plots were subdivided into 2 split-plots. The whole-plot
factor consisted of 4 different seed lots, and the split-plot factor
consisted of 2 seed protectants. The data matrix of this example is an n =
24 by 4 matrix with two missing observations.
The program below uses these data with yates
to replace the two missing
values with Yates estimates.
from __future__ import print_function
import sys
from numpy import *
from pyimsl.stat.anovaFactorial import anovaFactorial
from pyimsl.stat.machine import machine
from pyimsl.stat.sortData import sortData
from pyimsl.stat.yates import yates
from pyimsl.stat.writeMatrix import writeMatrix
def get_ss(n, n_independent, n_levels, x):
n_lev_array = empty(n_independent, dtype="int")
for i in range(0, n_independent):
n_lev_array[i] = n_levels[i]
responses = empty((24), dtype='double')
# Copy responses from the last column of x into
# a 1-D array as expected by anovaFactorial.
for i in range(0, n):
responses[i] = x[i * (n_independent + 1) + n_independent]
# Compute the error sum of squares
test_effects = []
anova_table = []
pValue = anovaFactorial(n_lev_array, responses,
testEffects=test_effects,
anovaTable=anova_table,
poolInteractions=True)
errorSS = anova_table[4] + test_effects[5][1]
return errorSS
col_labels = [" ", "Whole", "Split", "Block", " "]
n = 24
n_independent = 3
whole = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
split = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
block = [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2,
2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2]
y = [0.0, 53.8, 49.5, 41.6, 0.0, 53.8,
53.3, 57.6, 59.8, 69.6, 69.6, 65.8,
62.3, 63.4, 64.5, 58.5, 50.4, 46.1,
75.4, 70.3, 68.8, 65.6, 67.3, 65.3]
x = empty((n, n_independent + 1), dtype='double')
# Set the first and fifth observations to missing values
y[0] = machine(6)
y[4] = machine(6)
# Fill the array X with classification vars and observations
for i in range(0, n):
x[i, 0] = whole[i]
x[i, 1] = split[i]
x[i, 2] = block[i]
x[i, 3] = y[i]
# Sort the data since anovaFactorial expects sorted data
sortData(x, 3)
error_ss = []
missing_idx = []
n_missing = yates([x], design=2, getSs=get_ss,
errorSs=error_ss, missingIndex=missing_idx)
print("Returned error sum of squares: ", error_ss[0])
print("\nMissing values replaced: ", n_missing)
print("Whole Split Block Estimate")
for i in range(0, n_missing):
sys.stdout.write("%3d %3d %3d %7.3f\n" %
(int(x[missing_idx[i]][0]),
int(x[missing_idx[i]][1]),
int(x[missing_idx[i]][2]),
x[missing_idx[i]][n_independent]))
writeMatrix("Sorted x, with estimates", x,
writeFormat="%-4.0f%-4.0f%-4.0f%5.2f",
colLabels=col_labels, noRowLabels=True)
Output¶
Returned error sum of squares: 95.62000000000052
Missing values replaced: 2
Whole Split Block Estimate
1 1 1 37.300
1 2 2 58.100
Sorted x, with estimates
Whole Split Block
1 1 1 37.30
1 1 2 41.60
1 2 1 53.80
1 2 2 58.10
1 3 1 49.50
1 3 2 53.80
2 1 1 53.30
2 1 2 69.60
2 2 1 57.60
2 2 2 69.60
2 3 1 59.80
2 3 2 65.80
3 1 1 62.30
3 1 2 58.50
3 2 1 63.40
3 2 2 50.40
3 3 1 64.50
3 3 2 46.10
4 1 1 75.40
4 1 2 65.60
4 2 1 70.30
4 2 2 67.30
4 3 1 68.80
4 3 2 65.30