regressorsForGlm¶
Generates regressors for a general linear model.
Synopsis¶
regressorsForGlm (x, nClass, nContinuous)
Required Arguments¶
- float x[[]] (Input)
  An nObservations × (nClass + nContinuous) array containing the data. The columns must be ordered such that the first nClass columns contain the class variables and the next nContinuous columns contain the continuous variables. (Exception: see optional argument xClassColumns.)
- int nClass (Input)
  Number of classification variables.
- int nContinuous (Input)
  Number of continuous variables.
Return Value¶
An integer (nRegressors) indicating the number of regressors generated.
Optional Arguments¶
- xClassColumns, int[] (Input)
  Index array of length nClass containing the column numbers of x that are the classification variables. The remaining variables are assumed to be continuous.
  Default: xClassColumns = 0, 1, …, nClass − 1
- modelOrder, int (Input)
  Order of the model. Model order can be specified as 1 or 2. Use optional argument indicesEffects to specify more complicated models.
  Default: modelOrder = 1

  or

- indicesEffects, int nVarEffects[], int indicesEffects[] (Input)
  Variable nEffects is the number of effects (sources of variation) in the model. Variable nVarEffects is an array of length nEffects containing the number of variables associated with each effect in the model. Argument indicesEffects is an index array of length nVarEffects[0] + nVarEffects[1] + … + nVarEffects[nEffects − 1]. The first nVarEffects[0] elements give the column numbers of x for each variable in the first effect. The next nVarEffects[1] elements give the column numbers for each variable in the second effect. The last nVarEffects[nEffects − 1] elements give the column numbers for each variable in the last effect.
- dummy, int (Input)
  Dummy variable option. Indicator variables are defined for each class variable as described in the Description section. Dummy variables are then generated from the n indicator variables in one of the following three ways:
| dummy | Method |
|---|---|
| all | The n indicator variables are the dummy variables (default). |
| leaveOutLast | The dummies are the first n − 1 indicator variables. |
| sumToZero | The n − 1 dummies are defined in terms of the indicator variables so that, for balanced data, the usual summation restrictions are imposed on the regression coefficients. |
- regressors (Output)
  The array of size nObservations × nRegressors containing the regressor variables generated from x.
Description¶
Function regressorsForGlm
generates regressors for a general linear
model from a data matrix. The data matrix can contain classification
variables as well as continuous variables. Regressors for effects composed
solely of continuous variables are generated as powers and crossproducts.
Consider a data matrix containing continuous variables in Columns 3 and 4.
The effect indices (3, 3) generate a regressor whose i‑th value is the
square of the i‑th value in Column 3. The effect indices (3, 4)
generate a regressor whose i‑th value is the product of the i‑th
value in Column 3 with the i‑th value in Column 4.
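This behavior can be sketched in plain Python (an illustrative sketch, not the IMSL implementation; `continuous_regressor` is a hypothetical helper):

```python
# Sketch: regressors for effects composed solely of continuous variables
# are powers (repeated indices) and crossproducts (distinct indices).
from math import prod

def continuous_regressor(x, effect_indices):
    """Return the regressor column for an effect given as 0-based column
    indices of x. Repeated indices yield powers; distinct ones, products."""
    return [prod(row[j] for j in effect_indices) for row in x]

# Five columns so that Columns 3 and 4 are the continuous variables.
x = [[0.0, 0.0, 0.0, 2.0, 3.0],
     [0.0, 0.0, 0.0, 5.0, 7.0]]

print(continuous_regressor(x, (3, 3)))  # square of Column 3: [4.0, 25.0]
print(continuous_regressor(x, (3, 4)))  # Column 3 times Column 4: [6.0, 35.0]
```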
Regressors for an effect (source of variation) composed of a single
classification variable are generated using indicator variables. Let the
classification variable A take on values \(a_1,a_2,\ldots,a_n\). From
this classification variable, regressorsForGlm
creates
n indicator variables. For \(k=1,2,\ldots,n\), we have
\[I_k = \begin{cases} 1 & \text{if } A \text{ takes the value } a_k \\ 0 & \text{otherwise} \end{cases}\]
For each classification variable, another set of variables is created from the indicator variables. These new variables are called dummy variables. Dummy variables are generated from the indicator variables in one of three manners:
- The dummies are the n indicator variables.
- The dummies are the first n – 1 indicator variables.
- The n – 1 dummies are defined in terms of the indicator variables so that for balanced data, the usual summation restrictions are imposed on the regression coefficients.
In particular, for dummy = all, the dummy variables are \(A_k=I_k\) (\(k=1,2,\ldots,n\)). For dummy = leaveOutLast, the dummy variables are \(A_k=I_k\) (\(k=1,2,\ldots,n-1\)). For dummy = sumToZero, the dummy variables are \(A_k=I_k-I_n\) (\(k=1,2,\ldots,n-1\)). The regressors generated for an effect composed of a single classification variable are the associated dummy variables.
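The three codings can be sketched in plain Python (an illustrative sketch, not the IMSL implementation; the helper names are hypothetical):

```python
# Sketch of the three dummy-variable codings built from indicator variables.

def indicator_variables(column):
    """One indicator column I_k per distinct level, levels in sorted order."""
    levels = sorted(set(column))
    return [[1 if v == a else 0 for v in column] for a in levels]

def dummies(column, method="all"):
    ind = indicator_variables(column)
    if method == "all":            # A_k = I_k, k = 1..n
        return ind
    if method == "leaveOutLast":   # A_k = I_k, k = 1..n-1
        return ind[:-1]
    if method == "sumToZero":      # A_k = I_k - I_n, k = 1..n-1
        last = ind[-1]
        return [[i - l for i, l in zip(row, last)] for row in ind[:-1]]
    raise ValueError(method)

b = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]   # a class variable with 3 levels
print(dummies(b, "all"))             # 3 dummy columns
print(dummies(b, "leaveOutLast"))    # 2 dummy columns
print(dummies(b, "sumToZero"))       # B1 - B3 = [1, 0, -1, 1, 0, -1], etc.
```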
Let \(m_j\) be the number of dummies generated for the j-th classification variable. Suppose there are two classification variables A and B with dummies \(A_1, A_2, \ldots, A_{m_1}\) and \(B_1, B_2, \ldots, B_{m_2}\), respectively. The regressors generated for an effect composed of the two classification variables A and B are
\[A \otimes B = \left(A_1 B_1, A_1 B_2, \ldots, A_1 B_{m_2}, A_2 B_1, A_2 B_2, \ldots, A_2 B_{m_2}, \ldots, A_{m_1} B_{m_2}\right)\]
More generally, the regressors generated for an effect composed of several
classification variables and several continuous variables are given by the
Kronecker products of variables, where the order of the variables is
specified in indicesEffects
. Consider a data matrix containing
classification variables in Columns 0 and 1 and continuous variables in
Columns 2 and 3. Label these four columns A, B, \(X_1\), and
\(X_2\). The regressors generated by the effect indices (0, 1, 2, 2, 3)
are \(A \otimes B \otimes X_1 X_1 X_2\).
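A minimal sketch of this construction (plain Python, not the IMSL implementation; `kron_regressors` and the data are illustrative only):

```python
# Sketch: regressors for a mixed effect are Kronecker products of the
# classification variables' dummy columns, with each product column scaled
# elementwise by the continuous part of the effect.
from itertools import product
from math import prod

def kron_regressors(dummy_sets, continuous_part):
    """One regressor per choice of one dummy column from each set; the
    indicators vary fastest for the last classification variable."""
    n = len(continuous_part)
    return [[continuous_part[i] * prod(col[i] for col in combo)
             for i in range(n)]
            for combo in product(*dummy_sets)]

# Two observations; A has dummies A1, A2 and B has dummies B1, B2, B3.
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0], [0, 0]]
x1, x2 = [2.0, 5.0], [3.0, 7.0]

# Effect indices (0, 1, 2, 2, 3): continuous part is X1 * X1 * X2.
cont = [a * a * b for a, b in zip(x1, x2)]   # [12.0, 175.0]
regs = kron_regressors([A, B], cont)
print(len(regs))  # 6 regressors: A1B1, A1B2, A1B3, A2B1, A2B2, A2B3
```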
Remarks¶
Let the data matrix \(x=(A,B,X_1)\), where A and B are classification
variables and \(X_1\) is a continuous variable. The model containing the
effects A, B, AB, \(X_1\), \(AX_1\), \(BX_1\), and
\(ABX_1\) is specified as follows (use optional keyword
indicesEffects
):
nClass = 2
nContinuous = 1
nEffects = 7
nVarEffects = (1, 1, 2, 1, 2, 2, 3)
indicesEffects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)
For this model, suppose that variable A has two levels, \(A_1\) and
\(A_2\), and that variable B has three levels, \(B_1\),
\(B_2\), and \(B_3\). For each dummy
option, the regressors in
their order of appearance in regressors
are given below.
| dummy | regressors |
|---|---|
| all | \(A_1\), \(A_2\), \(B_1\), \(B_2\), \(B_3\), \(A_1B_1\), \(A_1B_2\), \(A_1B_3\), \(A_2B_1\), \(A_2B_2\), \(A_2B_3\), \(X_1\), \(A_1X_1\), \(A_2X_1\), \(B_1X_1\), \(B_2X_1\), \(B_3X_1\), \(A_1B_1X_1\), \(A_1B_2X_1\), \(A_1B_3X_1\), \(A_2B_1X_1\), \(A_2B_2X_1\), \(A_2B_3X_1\) |
| leaveOutLast | \(A_1\), \(B_1\), \(B_2\), \(A_1B_1\), \(A_1B_2\), \(X_1\), \(A_1X_1\), \(B_1X_1\), \(B_2X_1\), \(A_1B_1X_1\), \(A_1B_2X_1\) |
| sumToZero | \(A_1-A_2\), \(B_1-B_3\), \(B_2-B_3\), \((A_1-A_2)(B_1-B_3)\), \((A_1-A_2)(B_2-B_3)\), \(X_1\), \((A_1-A_2)X_1\), \((B_1-B_3)X_1\), \((B_2-B_3)X_1\), \((A_1-A_2)(B_1-B_3)X_1\), \((A_1-A_2)(B_2-B_3)X_1\) |
Within a group of regressors corresponding to an interaction effect, the indicator variables composing the regressors vary most rapidly for the last classification variable, next most rapidly for the next to last classification variable, etc.
By default, regressorsForGlm
internally generates values for
nEffects
, nVarEffects
, and indicesEffects
, which correspond to a
first order model with NEF = nContinuous
+ nClass
. The variables
then are used to create the regressor variables. The effects are ordered
such that the first effect corresponds to the first column of x
, the
second effect corresponds to the second column of x
, etc. A second order
model corresponding to the columns (variables) of x
is generated if
modelOrder = 2 is specified.
There are
\[\mathit{NVAR} + \mathit{nContinuous} + \binom{\mathit{NVAR}}{2}\]
effects, where NVAR = nContinuous + nClass. The first NVAR effects
correspond to the columns of x
, such that the first effect corresponds
to the first column of x
, the second effect corresponds to the second
column of x
, …, the NVAR-th effect corresponds to the NVAR-th column
of x
(i.e., x
[NVAR − 1]). The next nContinuous effects correspond to squares of the
continuous variables. The last \(\binom{\mathit{NVAR}}{2}\) effects
correspond to the two-variable interactions.
Let the data matrix x = \((A,B,X_1)\), where A and B are classification variables and \(X_1\) is a continuous variable. The effects generated, in order of appearance, are
\[A,B,X_1,X_1^2,AB,AX_1,BX_1\]
Let the data matrix x = \((A,X_1,X_2)\), where A is a classification variable and \(X_1\) and \(X_2\) are continuous variables. The effects generated, in order of appearance, are
\[A,X_1,X_2,X_1^2,X_2^2,AX_1,AX_2,X_1 X_2\]
Let the data matrix x = \((X_1,A,X_2)\) (see xClassColumns), where A is a classification variable and \(X_1\) and \(X_2\) are continuous variables. The effects generated, in order of appearance, are
\[X_1,A,X_2,X_1^2,X_2^2,X_1 A, X_1 X_2, AX_2\]
Higher-order and more complicated models can be specified using
indicesEffects
.
Examples¶
Example 1¶
In the following example, there are two classification variables, A and
B, with two and three values, respectively. Regressors for a one-way model
(the default model order) are generated using the all
dummy method (the
default dummy method). The five regressors generated are \(A_1\),
\(A_2\), \(B_1\), \(B_2\), and \(B_3\).
from __future__ import print_function
from pyimsl.stat.regressorsForGlm import regressorsForGlm
n_observations = 6
n_class = 2
n_cont = 0
x = [[10.0, 5.0], [20.0, 15.0], [20.0, 10.0],
[10.0, 10.0], [10.0, 15.0], [20.0, 5.0]]
n_regressors = regressorsForGlm(x, n_class, n_cont)
print("Number of regressors = ", n_regressors)
Output¶
Number of regressors = 5
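The count can be cross-checked in plain Python (a sketch, not PyIMSL): with the all dummy method, each classification variable contributes one regressor per distinct level, so the total is 2 levels of A plus 3 levels of B.

```python
# Sanity check of Example 1's regressor count without PyIMSL.
x = [[10.0, 5.0], [20.0, 15.0], [20.0, 10.0],
     [10.0, 10.0], [10.0, 15.0], [20.0, 5.0]]
# Sum the number of distinct levels over the two classification columns.
n_regressors = sum(len({row[j] for row in x}) for j in range(2))
print(n_regressors)  # 5
```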
Example 2¶
In this example, a two-way analysis of covariance model containing all the
interaction terms is fit. First, regressorsForGlm
is called to produce a
matrix of regressors, regressors
, from the data x
. Then,
regressors
is used as the input matrix into regression
to produce
the final fit. The regressors, generated using dummy = leaveOutLast,
correspond to the model whose mean function is
\[\mu + \alpha_i + \beta_j + \gamma_{ij} + \delta x_{ij} + \zeta_i x_{ij} + \eta_j x_{ij} + \theta_{ij} x_{ij}, \qquad i = 1, 2; \; j = 1, 2, 3\]
where \(\alpha_2 = \beta_3 = \gamma_{13} = \gamma_{21} = \gamma_{22} = \gamma_{23} = \zeta_2 = \eta_3 = \theta_{13} = \theta_{21} = \theta_{22} = \theta_{23} = 0\).
from __future__ import print_function
from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.regressorsForGlm import regressorsForGlm, LEAVE_OUT_LAST
from pyimsl.stat.writeMatrix import writeMatrix
n_class = 2
n_cont = 1
regressors = []
x = array([
[1.0, 1.0, 1.11],
[1.0, 1.0, 2.22],
[1.0, 1.0, 3.33],
[1.0, 2.0, 1.11],
[1.0, 2.0, 2.22],
[1.0, 2.0, 3.33],
[1.0, 3.0, 1.11],
[1.0, 3.0, 2.22],
[1.0, 3.0, 3.33],
[2.0, 1.0, 1.11],
[2.0, 1.0, 2.22],
[2.0, 1.0, 3.33],
[2.0, 2.0, 1.11],
[2.0, 2.0, 2.22],
[2.0, 2.0, 3.33],
[2.0, 3.0, 1.11],
[2.0, 3.0, 2.22],
[2.0, 3.0, 3.33]])
y = [1.0, 2.0, 2.0, 4.0, 4.0, 6.0,
     3.0, 3.5, 4.0, 4.5, 5.0, 5.5,
     2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
class_col = (0, 1)
n_effects = 7
n_var_effects = (1, 1, 2, 1, 2, 2, 3)
indices_effects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)
indicesEffects = {}
indicesEffects["nVarEffects"] = n_var_effects
indicesEffects["indicesEffects"] = indices_effects
reg_labels = [" ", "Alpha1", "Beta1", "Beta2",
"Gamma11", "Gamma12",
"Delta", "Zeta1", "Eta1", "Eta2",
"Theta11", "Theta12"]
labels = ["degrees of freedom for the model",
"degrees of freedom for error",
"total (corrected) degrees of freedom",
"sum of squares for the model",
"sum of squares for error",
"total (corrected) sum of squares",
"model mean square", "error mean square",
"F-statistic", "p-value",
"R-squared (in percent)", "adjusted R-squared (in percent)",
"est. standard deviation of the model error",
"overall mean of y",
"coefficient of variation (in percent)"]
n_regressors = regressorsForGlm(x, n_class, n_cont,
xClassColumns=class_col,
dummy=LEAVE_OUT_LAST,
indicesEffects=indicesEffects,
regressors=regressors)
print("Number of regressors", n_regressors)
writeMatrix("regressors", regressors,
colLabels=reg_labels, writeFormat="%10.2f")
anova_table = []
coef = regression(regressors, y, anovaTable=anova_table)
writeMatrix("* * * Analysis of Variance * * *\n", anova_table,
rowLabels=labels, writeFormat="%11.4f", column=True)
Output¶
Number of regressors 11
regressors
Alpha1 Beta1 Beta2 Gamma11 Gamma12 Delta
1 1.00 1.00 0.00 1.00 0.00 1.11
2 1.00 1.00 0.00 1.00 0.00 2.22
3 1.00 1.00 0.00 1.00 0.00 3.33
4 1.00 0.00 1.00 0.00 1.00 1.11
5 1.00 0.00 1.00 0.00 1.00 2.22
6 1.00 0.00 1.00 0.00 1.00 3.33
7 1.00 0.00 0.00 0.00 0.00 1.11
8 1.00 0.00 0.00 0.00 0.00 2.22
9 1.00 0.00 0.00 0.00 0.00 3.33
10 0.00 1.00 0.00 0.00 0.00 1.11
11 0.00 1.00 0.00 0.00 0.00 2.22
12 0.00 1.00 0.00 0.00 0.00 3.33
13 0.00 0.00 1.00 0.00 0.00 1.11
14 0.00 0.00 1.00 0.00 0.00 2.22
15 0.00 0.00 1.00 0.00 0.00 3.33
16 0.00 0.00 0.00 0.00 0.00 1.11
17 0.00 0.00 0.00 0.00 0.00 2.22
18 0.00 0.00 0.00 0.00 0.00 3.33
Zeta1 Eta1 Eta2 Theta11 Theta12
1 1.11 1.11 0.00 1.11 0.00
2 2.22 2.22 0.00 2.22 0.00
3 3.33 3.33 0.00 3.33 0.00
4 1.11 0.00 1.11 0.00 1.11
5 2.22 0.00 2.22 0.00 2.22
6 3.33 0.00 3.33 0.00 3.33
7 1.11 0.00 0.00 0.00 0.00
8 2.22 0.00 0.00 0.00 0.00
9 3.33 0.00 0.00 0.00 0.00
10 0.00 1.11 0.00 0.00 0.00
11 0.00 2.22 0.00 0.00 0.00
12 0.00 3.33 0.00 0.00 0.00
13 0.00 0.00 1.11 0.00 0.00
14 0.00 0.00 2.22 0.00 0.00
15 0.00 0.00 3.33 0.00 0.00
16 0.00 0.00 0.00 0.00 0.00
17 0.00 0.00 0.00 0.00 0.00
18 0.00 0.00 0.00 0.00 0.00
* * * Analysis of Variance * * *
degrees of freedom for the model 11.0000
degrees of freedom for error 6.0000
total (corrected) degrees of freedom 17.0000
sum of squares for the model 43.9028
sum of squares for error 0.8333
total (corrected) sum of squares 44.7361
model mean square 3.9912
error mean square 0.1389
F-statistic 28.7364
p-value 0.0003
R-squared (in percent) 98.1372
adjusted R-squared (in percent) 94.7221
est. standard deviation of the model error 0.3727
overall mean of y 3.9722
coefficient of variation (in percent) 9.3821
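The regressor count of 11 can be cross-checked in plain Python (a sketch, not PyIMSL): with dummy = leaveOutLast, each classification variable contributes (levels − 1) dummies and each continuous variable contributes 1, and the number of regressors for an effect is the product over its variables.

```python
# Cross-check of Example 2's regressor count without PyIMSL.
from math import prod

levels = {0: 2, 1: 3}        # A has 2 levels, B has 3; column 2 is continuous
n_var_effects = (1, 1, 2, 1, 2, 2, 3)
indices_effects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)

def dummies_per_column(col):
    # leaveOutLast: levels - 1 dummies per class variable, 1 per continuous.
    return levels[col] - 1 if col in levels else 1

total, pos = 0, 0
for nv in n_var_effects:
    effect = indices_effects[pos:pos + nv]
    pos += nv
    total += prod(dummies_per_column(c) for c in effect)
print(total)  # 11
```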