regressorsForGlm¶
Generates regressors for a general linear model.
Synopsis¶
regressorsForGlm (x, nClass, nContinuous)
Required Arguments¶
- float x[[]] (Input)
  An nObservations × (nClass + nContinuous) array containing the data. The columns must be ordered such that the first nClass columns contain the class variables and the next nContinuous columns contain the continuous variables. (Exception: see optional argument xClassColumns.)
- int nClass (Input)
  Number of classification variables.
- int nContinuous (Input)
  Number of continuous variables.
Return Value¶
An integer (nRegressors) indicating the number of regressors generated.
Optional Arguments¶
- xClassColumns, int[] (Input)
  Index array of length nClass containing the column numbers of x that are the classification variables. The remaining variables are assumed to be continuous.
  Default: xClassColumns = 0, 1, …, nClass − 1
- modelOrder, int (Input)
  Order of the model. Model order can be specified as 1 or 2. Use optional argument indicesEffects to specify more complicated models.
  Default: modelOrder = 1

  or

- indicesEffects, int nVarEffects[], int indicesEffects[] (Input)
  Variable nEffects is the number of effects (sources of variation) in the model. Variable nVarEffects is an array of length nEffects containing the number of variables associated with each effect in the model. Argument indicesEffects is an index array of length nVarEffects[0] + nVarEffects[1] + … + nVarEffects[nEffects − 1]. The first nVarEffects[0] elements give the column numbers of x for each variable in the first effect. The next nVarEffects[1] elements give the column numbers for each variable in the second effect. The last nVarEffects[nEffects − 1] elements give the column numbers for each variable in the last effect.
- dummy, int (Input)
  Dummy variable option. Indicator variables are defined for each class variable as described in the Description section. Dummy variables are then generated from the n indicator variables in one of the following three ways:
| dummy | Method |
|---|---|
| all | The n indicator variables are the dummy variables (default). |
| leaveOutLast | The dummies are the first n − 1 indicator variables. |
| sumToZero | The n − 1 dummies are defined in terms of the indicator variables so that, for balanced data, the usual summation restrictions are imposed on the regression coefficients. |
- regressors (Output)
  The array of size nObservations × nRegressors containing the regressor variables generated from x.
Description¶
Function regressorsForGlm
generates regressors for a general linear
model from a data matrix. The data matrix can contain classification
variables as well as continuous variables. Regressors for effects composed
solely of continuous variables are generated as powers and crossproducts.
Consider a data matrix containing continuous variables in Columns 3 and 4.
The effect indices (3, 3) generate a regressor whose i‑th value is the
square of the i‑th value in Column 3. The effect indices (3, 4)
generate a regressor whose i‑th value is the product of the i‑th
value in Column 3 with the i‑th value in Column 4.
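This behavior can be sketched in plain Python (an illustrative sketch, not the IMSL implementation; `continuous_regressor` is a hypothetical helper):

```python
# Sketch: regressors for effects composed solely of continuous variables
# are powers (repeated indices) and crossproducts (distinct indices).
from math import prod

def continuous_regressor(x, effect_indices):
    """Return the regressor column for an effect given as 0-based column
    indices of x. Repeated indices yield powers; distinct ones, products."""
    return [prod(row[j] for j in effect_indices) for row in x]

# Five columns so that Columns 3 and 4 are the continuous variables.
x = [[0.0, 0.0, 0.0, 2.0, 3.0],
     [0.0, 0.0, 0.0, 5.0, 7.0]]

print(continuous_regressor(x, (3, 3)))  # square of Column 3: [4.0, 25.0]
print(continuous_regressor(x, (3, 4)))  # Column 3 times Column 4: [6.0, 35.0]
```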
Regressors for an effect (source of variation) composed of a single
classification variable are generated using indicator variables. Let the
classification variable A take on values \(a_1,a_2,\ldots,a_n\). From
this classification variable, regressorsForGlm
creates
n indicator variables. For \(k=1,2,\ldots,n\), we have
\[I_k = \begin{cases} 1 & \text{if } A \text{ takes the value } a_k \\ 0 & \text{otherwise} \end{cases}\]
For each classification variable, another set of variables is created from the indicator variables. These new variables are called dummy variables. Dummy variables are generated from the indicator variables in one of three manners:
- The dummies are the n indicator variables.
- The dummies are the first n – 1 indicator variables.
- The n – 1 dummies are defined in terms of the indicator variables so that for balanced data, the usual summation restrictions are imposed on the regression coefficients.
In particular, for dummy = all, the dummy variables are \(A_k=I_k\) (\(k=1,2,\ldots,n\)). For dummy = leaveOutLast, the dummy variables are \(A_k=I_k\) (\(k=1,2,\ldots,n-1\)). For dummy = sumToZero, the dummy variables are \(A_k=I_k-I_n\) (\(k=1,2,\ldots,n-1\)). The regressors generated for an effect composed of a single classification variable are the associated dummy variables.
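The three codings can be sketched in plain Python (an illustrative sketch, not the IMSL implementation; the helper names are hypothetical):

```python
# Sketch of the three dummy-variable codings built from indicator variables.

def indicator_variables(column):
    """One indicator column I_k per distinct level, levels in sorted order."""
    levels = sorted(set(column))
    return [[1 if v == a else 0 for v in column] for a in levels]

def dummies(column, method="all"):
    ind = indicator_variables(column)
    if method == "all":            # A_k = I_k, k = 1..n
        return ind
    if method == "leaveOutLast":   # A_k = I_k, k = 1..n-1
        return ind[:-1]
    if method == "sumToZero":      # A_k = I_k - I_n, k = 1..n-1
        last = ind[-1]
        return [[i - l for i, l in zip(row, last)] for row in ind[:-1]]
    raise ValueError(method)

b = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]   # a class variable with 3 levels
print(dummies(b, "all"))             # 3 dummy columns
print(dummies(b, "leaveOutLast"))    # 2 dummy columns
print(dummies(b, "sumToZero"))       # B1 - B3 = [1, 0, -1, 1, 0, -1], etc.
```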
Let \(m_j\) be the number of dummies generated for the j-th classification variable. Suppose there are two classification variables A and B with dummies \(A_1, A_2, \ldots, A_{m_1}\) and \(B_1, B_2, \ldots, B_{m_2}\), respectively. The regressors generated for an effect composed of the two classification variables A and B are
\[A \otimes B = \left(A_1 B_1, A_1 B_2, \ldots, A_1 B_{m_2}, A_2 B_1, A_2 B_2, \ldots, A_2 B_{m_2}, \ldots, A_{m_1} B_{m_2}\right)\]
More generally, the regressors generated for an effect composed of several
classification variables and several continuous variables are given by the
Kronecker products of variables, where the order of the variables is
specified in indicesEffects
. Consider a data matrix containing
classification variables in Columns 0 and 1 and continuous variables in
Columns 2 and 3. Label these four columns A, B, \(X_1\), and
\(X_2\). The regressors generated by the effect indices (0, 1, 2, 2, 3)
are \(A \otimes B \otimes X_1 X_1 X_2\).
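A minimal sketch of this construction (plain Python, not the IMSL implementation; `kron_regressors` and the data are illustrative only):

```python
# Sketch: regressors for a mixed effect are Kronecker products of the
# classification variables' dummy columns, with each product column scaled
# elementwise by the continuous part of the effect.
from itertools import product
from math import prod

def kron_regressors(dummy_sets, continuous_part):
    """One regressor per choice of one dummy column from each set; the
    indicators vary fastest for the last classification variable."""
    n = len(continuous_part)
    return [[continuous_part[i] * prod(col[i] for col in combo)
             for i in range(n)]
            for combo in product(*dummy_sets)]

# Two observations; A has dummies A1, A2 and B has dummies B1, B2, B3.
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0], [0, 0]]
x1, x2 = [2.0, 5.0], [3.0, 7.0]

# Effect indices (0, 1, 2, 2, 3): continuous part is X1 * X1 * X2.
cont = [a * a * b for a, b in zip(x1, x2)]   # [12.0, 175.0]
regs = kron_regressors([A, B], cont)
print(len(regs))  # 6 regressors: A1B1, A1B2, A1B3, A2B1, A2B2, A2B3
```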
Remarks¶
Let the data matrix \(x=(A,B,X_1)\), where A and B are classification
variables and \(X_1\) is a continuous variable. The model containing the
effects A, B, AB, \(X_1\), \(AX_1\), \(BX_1\), and
\(ABX_1\) is specified as follows (use optional keyword
indicesEffects
):
nClass = 2
nContinuous = 1
nEffects = 7
nVarEffects = (1, 1, 2, 1, 2, 2, 3)
indicesEffects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)
For this model, suppose that variable A has two levels, \(A_1\) and
\(A_2\), and that variable B has three levels, \(B_1\),
\(B_2\), and \(B_3\). For each dummy
option, the regressors in
their order of appearance in regressors
are given below.
| dummy | regressors |
|---|---|
| all | \(A_1\), \(A_2\), \(B_1\), \(B_2\), \(B_3\), \(A_1B_1\), \(A_1B_2\), \(A_1B_3\), \(A_2B_1\), \(A_2B_2\), \(A_2B_3\), \(X_1\), \(A_1X_1\), \(A_2X_1\), \(B_1X_1\), \(B_2X_1\), \(B_3X_1\), \(A_1B_1X_1\), \(A_1B_2X_1\), \(A_1B_3X_1\), \(A_2B_1X_1\), \(A_2B_2X_1\), \(A_2B_3X_1\) |
| leaveOutLast | \(A_1\), \(B_1\), \(B_2\), \(A_1B_1\), \(A_1B_2\), \(X_1\), \(A_1X_1\), \(B_1X_1\), \(B_2X_1\), \(A_1B_1X_1\), \(A_1B_2X_1\) |
| sumToZero | \(A_1-A_2\), \(B_1-B_3\), \(B_2-B_3\), \((A_1-A_2)(B_1-B_3)\), \((A_1-A_2)(B_2-B_3)\), \(X_1\), \((A_1-A_2)X_1\), \((B_1-B_3)X_1\), \((B_2-B_3)X_1\), \((A_1-A_2)(B_1-B_3)X_1\), \((A_1-A_2)(B_2-B_3)X_1\) |
Within a group of regressors corresponding to an interaction effect, the indicator variables composing the regressors vary most rapidly for the last classification variable, next most rapidly for the next to last classification variable, etc.
By default, regressorsForGlm
internally generates values for
nEffects
, nVarEffects
, and indicesEffects
, which correspond to a
first order model with NEF = nContinuous
+ nClass
. The variables
then are used to create the regressor variables. The effects are ordered
such that the first effect corresponds to the first column of x
, the
second effect corresponds to the second column of x
, etc. A second order
model corresponding to the columns (variables) of x
is generated if
modelOrder = 2 is specified.
There are
\[\mathit{NVAR} + \mathit{nContinuous} + \binom{\mathit{NVAR}}{2}\]
effects, where NVAR = nContinuous + nClass. The first NVAR effects
correspond to the columns of x
, such that the first effect corresponds
to the first column of x
, the second effect corresponds to the second
column of x
, …, the NVAR-th effect corresponds to the NVAR-th column
of x
(i.e., x
[NVAR − 1]). The next nContinuous effects correspond to squares of the
continuous variables. The last \(\binom{\mathit{NVAR}}{2}\) effects
correspond to the two-variable interactions.
Let the data matrix x = \((A,B,X_1)\), where A and B are classification variables and \(X_1\) is a continuous variable. The effects generated, in order of appearance, are
\[A,B,X_1,X_1^2,AB,AX_1,BX_1\]
Let the data matrix x = \((A,X_1,X_2)\), where A is a classification variable and \(X_1\) and \(X_2\) are continuous variables. The effects generated, in order of appearance, are
\[A,X_1,X_2,X_1^2,X_2^2,AX_1,AX_2,X_1 X_2\]
Let the data matrix x = \((X_1,A,X_2)\) (see xClassColumns), where A is a classification variable and \(X_1\) and \(X_2\) are continuous variables. The effects generated, in order of appearance, are
\[X_1,A,X_2,X_1^2,X_2^2,X_1 A, X_1 X_2, AX_2\]
Higher-order and more complicated models can be specified using
indicesEffects
.
Examples¶
Example 1¶
In the following example, there are two classification variables, A and
B, with two and three values, respectively. Regressors for a one-way model
(the default model order) are generated using the all
dummy method (the
default dummy method). The five regressors generated are \(A_1\),
\(A_2\), \(B_1\), \(B_2\), and \(B_3\).
from __future__ import print_function
from pyimsl.stat.regressorsForGlm import regressorsForGlm
n_observations = 6
n_class = 2
n_cont = 0
x = [[10.0, 5.0], [20.0, 15.0], [20.0, 10.0],
[10.0, 10.0], [10.0, 15.0], [20.0, 5.0]]
n_regressors = regressorsForGlm(x, n_class, n_cont)
print("Number of regressors = ", n_regressors)
Output¶
Number of regressors = 5
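The count can be cross-checked in plain Python (a sketch, not PyIMSL): with the all dummy method, each classification variable contributes one regressor per distinct level, so the total is 2 levels of A plus 3 levels of B.

```python
# Sanity check of Example 1's regressor count without PyIMSL.
x = [[10.0, 5.0], [20.0, 15.0], [20.0, 10.0],
     [10.0, 10.0], [10.0, 15.0], [20.0, 5.0]]
# Sum the number of distinct levels over the two classification columns.
n_regressors = sum(len({row[j] for row in x}) for j in range(2))
print(n_regressors)  # 5
```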
Example 2¶
In this example, a two-way analysis of covariance model containing all the
interaction terms is fit. First, regressorsForGlm
is called to produce a
matrix of regressors, regressors
, from the data x
. Then,
regressors
is used as the input matrix into regression
to produce
the final fit. The regressors, generated using dummy = leaveOutLast,
correspond to the model whose mean function is
\[\mu + \alpha_i + \beta_j + \gamma_{ij} + \delta x_{ij} + \zeta_i x_{ij} + \eta_j x_{ij} + \theta_{ij} x_{ij}, \qquad i = 1, 2; \; j = 1, 2, 3\]
where \(\alpha_2 = \beta_3 = \gamma_{13} = \gamma_{21} = \gamma_{22} = \gamma_{23} = \zeta_2 = \eta_3 = \theta_{13} = \theta_{21} = \theta_{22} = \theta_{23} = 0\).
from __future__ import print_function
from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.regressorsForGlm import regressorsForGlm, LEAVE_OUT_LAST
from pyimsl.stat.writeMatrix import writeMatrix
n_class = 2
n_cont = 1
regressors = []
x = array([
[1.0, 1.0, 1.11],
[1.0, 1.0, 2.22],
[1.0, 1.0, 3.33],
[1.0, 2.0, 1.11],
[1.0, 2.0, 2.22],
[1.0, 2.0, 3.33],
[1.0, 3.0, 1.11],
[1.0, 3.0, 2.22],
[1.0, 3.0, 3.33],
[2.0, 1.0, 1.11],
[2.0, 1.0, 2.22],
[2.0, 1.0, 3.33],
[2.0, 2.0, 1.11],
[2.0, 2.0, 2.22],
[2.0, 2.0, 3.33],
[2.0, 3.0, 1.11],
[2.0, 3.0, 2.22],
[2.0, 3.0, 3.33]])
y = [1.0, 2.0, 2.0, 4.0, 4.0, 6.0,
     3.0, 3.5, 4.0, 4.5, 5.0, 5.5,
     2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
class_col = (0, 1)
n_effects = 7
n_var_effects = (1, 1, 2, 1, 2, 2, 3)
indices_effects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)
indicesEffects = {}
indicesEffects["nVarEffects"] = n_var_effects
indicesEffects["indicesEffects"] = indices_effects
reg_labels = [" ", "Alpha1", "Beta1", "Beta2",
"Gamma11", "Gamma12",
"Delta", "Zeta1", "Eta1", "Eta2",
"Theta11", "Theta12"]
labels = ["degrees of freedom for the model",
"degrees of freedom for error",
"total (corrected) degrees of freedom",
"sum of squares for the model",
"sum of squares for error",
"total (corrected) sum of squares",
"model mean square", "error mean square",
"F-statistic", "p-value",
"R-squared (in percent)", "adjusted R-squared (in percent)",
"est. standard deviation of the model error",
"overall mean of y",
"coefficient of variation (in percent)"]
n_regressors = regressorsForGlm(x, n_class, n_cont,
xClassColumns=class_col,
dummy=LEAVE_OUT_LAST,
indicesEffects=indicesEffects,
regressors=regressors)
print("Number of regressors", n_regressors)
writeMatrix("regressors", regressors,
colLabels=reg_labels, writeFormat="%10.2f")
anova_table = []
coef = regression(regressors, y, anovaTable=anova_table)
writeMatrix("* * * Analysis of Variance * * *\n", anova_table,
rowLabels=labels, writeFormat="%11.4f", column=True)
Output¶
Number of regressors 11
regressors
Alpha1 Beta1 Beta2 Gamma11 Gamma12 Delta
1 1.00 1.00 0.00 1.00 0.00 1.11
2 1.00 1.00 0.00 1.00 0.00 2.22
3 1.00 1.00 0.00 1.00 0.00 3.33
4 1.00 0.00 1.00 0.00 1.00 1.11
5 1.00 0.00 1.00 0.00 1.00 2.22
6 1.00 0.00 1.00 0.00 1.00 3.33
7 1.00 0.00 0.00 0.00 0.00 1.11
8 1.00 0.00 0.00 0.00 0.00 2.22
9 1.00 0.00 0.00 0.00 0.00 3.33
10 0.00 1.00 0.00 0.00 0.00 1.11
11 0.00 1.00 0.00 0.00 0.00 2.22
12 0.00 1.00 0.00 0.00 0.00 3.33
13 0.00 0.00 1.00 0.00 0.00 1.11
14 0.00 0.00 1.00 0.00 0.00 2.22
15 0.00 0.00 1.00 0.00 0.00 3.33
16 0.00 0.00 0.00 0.00 0.00 1.11
17 0.00 0.00 0.00 0.00 0.00 2.22
18 0.00 0.00 0.00 0.00 0.00 3.33
Zeta1 Eta1 Eta2 Theta11 Theta12
1 1.11 1.11 0.00 1.11 0.00
2 2.22 2.22 0.00 2.22 0.00
3 3.33 3.33 0.00 3.33 0.00
4 1.11 0.00 1.11 0.00 1.11
5 2.22 0.00 2.22 0.00 2.22
6 3.33 0.00 3.33 0.00 3.33
7 1.11 0.00 0.00 0.00 0.00
8 2.22 0.00 0.00 0.00 0.00
9 3.33 0.00 0.00 0.00 0.00
10 0.00 1.11 0.00 0.00 0.00
11 0.00 2.22 0.00 0.00 0.00
12 0.00 3.33 0.00 0.00 0.00
13 0.00 0.00 1.11 0.00 0.00
14 0.00 0.00 2.22 0.00 0.00
15 0.00 0.00 3.33 0.00 0.00
16 0.00 0.00 0.00 0.00 0.00
17 0.00 0.00 0.00 0.00 0.00
18 0.00 0.00 0.00 0.00 0.00
* * * Analysis of Variance * * *
degrees of freedom for the model 11.0000
degrees of freedom for error 6.0000
total (corrected) degrees of freedom 17.0000
sum of squares for the model 43.9028
sum of squares for error 0.8333
total (corrected) sum of squares 44.7361
model mean square 3.9912
error mean square 0.1389
F-statistic 28.7364
p-value 0.0003
R-squared (in percent) 98.1372
adjusted R-squared (in percent) 94.7221
est. standard deviation of the model error 0.3727
overall mean of y 3.9722
coefficient of variation (in percent) 9.3821
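The regressor count of 11 can be cross-checked in plain Python (a sketch, not PyIMSL): with dummy = leaveOutLast, each classification variable contributes (levels − 1) dummies and each continuous variable contributes 1, and the number of regressors for an effect is the product over its variables.

```python
# Cross-check of Example 2's regressor count without PyIMSL.
from math import prod

levels = {0: 2, 1: 3}        # A has 2 levels, B has 3; column 2 is continuous
n_var_effects = (1, 1, 2, 1, 2, 2, 3)
indices_effects = (0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 1, 2)

def dummies_per_column(col):
    # leaveOutLast: levels - 1 dummies per class variable, 1 per continuous.
    return levels[col] - 1 if col in levels else 1

total, pos = 0, 0
for nv in n_var_effects:
    effect = indices_effects[pos:pos + nv]
    pos += nv
    total += prod(dummies_per_column(c) for c in effect)
print(total)  # 11
```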