regressionStepwise

Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.

Synopsis

regressionStepwise (x, y)

Required Arguments

float x[[]] (Input)
Array of size nRows × nCandidate containing the data for the candidate variables.
float y[] (Input)
Array of length nRows containing the responses for the dependent variable.

Optional Arguments

weights, float[] (Input)

Array of length nRows containing the weight for each row of x.

Default: weights[] = 1

frequencies, float[] (Input)

Array of length nRows containing the frequency for each row of x.

Default: frequencies[] = 1

firstStep, or

intermediateStep, or

lastStep, or

allSteps
One or none of these options can be specified. If none of these is specified, the action defaults to allSteps.
Argument Action
firstStep This is the first invocation; additional calls will be made. Initialization and stepping are performed.
intermediateStep This is an intermediate invocation. Stepping is performed.

lastStep This is the final invocation. Stepping and wrap-up computations are performed.
allSteps This is the only invocation. Initialization, stepping, and wrap-up computations are performed.
nSteps, int (Input)
For nonnegative nSteps, nSteps steps are taken. If nSteps = −1, stepping continues until completion.

forward, or

backward, or

stepwise
One or none of these options can be specified. If none is specified, the action defaults to backward.
Keyword Action
forward An attempt is made to add a variable to the model. A variable is added if its p-value is less than pValueIn. During initialization, only the forced variables enter the model.
backward An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds pValueOut. During initialization, all candidate independent variables enter the model.
stepwise A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.
pValueIn, float (Input)

Largest p-value for variables entering the model. Variables with p-values less than pValueIn may enter the model.

Default: pValueIn = 0.05

pValueOut, float (Input)

Smallest p-value for removing variables. Variables with p-values greater than pValueOut may leave the model. Argument pValueOut must be greater than or equal to pValueIn. A common choice for pValueOut is 2*pValueIn.

Default: pValueOut = 0.10
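
The forward, backward, and stepwise rules described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation. The function name choose_step and the dictionaries of p-values are hypothetical, and the sketch assumes the candidate considered at each step is the one with the most extreme p-value.

```python
def choose_step(method, p_in_model, p_not_in_model,
                p_value_in=0.05, p_value_out=0.10):
    """Return ("remove", name), ("add", name), or None for a single step.

    p_in_model and p_not_in_model map variable names to p-values.
    They are hypothetical inputs; the library computes p-values internally.
    """
    def try_backward():
        # Remove the least significant variable if its p-value exceeds pValueOut.
        if p_in_model:
            name = max(p_in_model, key=p_in_model.get)
            if p_in_model[name] > p_value_out:
                return ("remove", name)
        return None

    def try_forward():
        # Add the most significant candidate if its p-value is below pValueIn.
        if p_not_in_model:
            name = min(p_not_in_model, key=p_not_in_model.get)
            if p_not_in_model[name] < p_value_in:
                return ("add", name)
        return None

    if method == "backward":
        return try_backward()
    if method == "forward":
        return try_forward()
    if method == "stepwise":
        # A backward step is attempted first; a forward step only if nothing left.
        return try_backward() or try_forward()
    raise ValueError("unknown method: %r" % method)
```

Because pValueOut is at least pValueIn, a variable that has just entered with a p-value below pValueIn is not immediately eligible for removal.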

tolerance, float (Input)

Tolerance used in determining linear dependence.

Default: tolerance = 100*eps, where eps = machine(4) for single precision

anovaTable, float (Output)

The array containing the analysis of variance table. The analysis of variance statistics are as follows:

Element Analysis of Variance Statistic
0 degrees of freedom for regression
1 degrees of freedom for error
2 total degrees of freedom
3 sum of squares for regression
4 sum of squares for error
5 total sum of squares
6 regression mean square
7 error mean square
8 F-statistic
9 p-value
10 \(R^2\) (in percent)
11 adjusted \(R^2\) (in percent)
12 estimate of the standard deviation

Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.
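
The derived entries of the table follow from the degrees of freedom and sums of squares by the usual textbook definitions. The sketch below assumes those standard definitions for a model with an intercept; the helper name anova_summary is hypothetical, and the library's handling of weighted or no-intercept fits may differ. The inputs are the rounded values printed in the Example section, so the results agree with that output only to about two decimals.

```python
import math


def anova_summary(df_reg, df_err, ss_reg, ss_err):
    """Recompute derived ANOVA statistics from elements 0, 1, 3, and 4."""
    ss_total = ss_reg + ss_err          # element 5
    df_total = df_reg + df_err          # element 2
    ms_reg = ss_reg / df_reg            # element 6
    ms_err = ss_err / df_err            # element 7
    return {
        "F": ms_reg / ms_err,                                            # element 8
        "R2_percent": 100.0 * ss_reg / ss_total,                         # element 10
        "adj_R2_percent": 100.0 * (1.0 - ms_err / (ss_total / df_total)),  # element 11
        "std_dev": math.sqrt(ms_err),                                    # element 12
    }


# Rounded values taken from the Example output of this page.
stats = anova_summary(2, 10, 2657.86, 57.90)
```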

coefTTests (Output)
An array containing statistics relating to the regression coefficients for the final model in this invocation. The rows correspond to the nCandidate independent variables. The rows are in the same order as the variables in x (or, if inputCov is specified, the rows are in the same order as the variables in cov). Each row corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the row in question.
Column Description
0 coefficient estimate
1 estimated standard error of the coefficient estimate
2 t-statistic for the test that the coefficient is 0
3 p-value for the two-sided t test
coefVif (Output)

An array containing variance inflation factors for the final model in this invocation. The elements correspond to the nCandidate independent variables. The elements are in the same order as the variables in x (or, if inputCov is specified, the elements are in the same order as the variables in cov). Each element corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the element in question.

The squared multiple correlation coefficient obtained by regressing the i‑th regressor on all the others can be computed from coefVif[i] with the following formula:

\[1.0 - \frac{1.0}{\mathtt{coefVif}[i]}\]
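
As a small illustration of this formula, the helper below converts a variance inflation factor into the corresponding squared multiple correlation. The function name and the sample values are hypothetical and do not come from a real fit.

```python
def r_squared_from_vif(vif):
    """Squared multiple correlation of one regressor on the others."""
    if vif < 1.0:
        # A variance inflation factor is never smaller than 1.
        raise ValueError("variance inflation factor must be >= 1")
    return 1.0 - 1.0 / vif


# A factor of 1 means no correlation with the other regressors;
# a factor of 4 means 75 percent of the regressor's variance is explained by them.
uncorrelated = r_squared_from_vif(1.0)
inflated = r_squared_from_vif(4.0)
```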
level, int[] (Input)

Array of length nCandidate + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value which indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. Argument level[i] = 0 means the i‑th variable is never to enter the model. Argument level[i] = −1 means the i‑th variable is the dependent variable. Argument level[nCandidate] must correspond to the dependent variable, except when inputCov is specified.

Default: 1, 1, …, 1, −1 where −1 corresponds to level[nCandidate]
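
The entry rule for priority levels can be sketched as follows. This is an illustration of the rule as stated above, not the library's code. The function name eligible_to_enter and the sample level array are hypothetical.

```python
def eligible_to_enter(level, in_model):
    """Indices of variables allowed to compete for entry at the next step.

    level    -- list of priority levels (0 = never enters, -1 = dependent variable)
    in_model -- set of indices currently in the model
    """
    # Positive levels of variables that are still outside the model.
    waiting = [lv for i, lv in enumerate(level) if lv > 0 and i not in in_model]
    if not waiting:
        return []
    # Only the lowest outstanding level may compete; higher levels must wait.
    lowest = min(waiting)
    return [i for i, lv in enumerate(level)
            if lv == lowest and i not in in_model]


# Hypothetical setup: variable 0 has level 1, variables 1 and 2 have level 2,
# variable 3 never enters, and the last entry marks the dependent variable.
level = [1, 2, 2, 0, -1]
```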

force, int (Input)
Variables with levels 1, 2, …, force are forced into the model as independent variables. See level.
iend (Output)
Variable which indicates whether additional steps are possible.
iend Meaning
0 Additional steps may be possible.
1 No additional steps are possible.
sweptUser (Output)
A user-allocated array of length nCandidate + 1 with information to indicate the independent variables in the model. Argument sweptUser[nCandidate] usually corresponds to the dependent variable. See level.
sweptUser[i] Status of i‑th Variable
−1 Variable i is not in model.
1 Variable i is in model.
historyUser (Output)
User-allocated array of length nCandidate + 1 containing the recent history of the independent variables. Element historyUser[nCandidate] usually corresponds to the dependent variable. See level.
historyUser[i] Status of i-th Variable
0.0 Variable has never been added to model.
0.5 Variable was added into the model during initialization.
k > 0.0 Variable was added to the model during the k-th step.
k < 0.0 Variable was deleted from model during the k-th step.
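
The history codes above can be turned into readable messages with a small helper. The sketch below is hypothetical (the function name and sample values are not part of the library) and assumes that a negative entry k refers to step number abs(k). Pass only the entries for the candidate variables, since the last element usually corresponds to the dependent variable.

```python
def describe_history(history):
    """Translate historyUser-style codes into one message per variable."""
    messages = []
    for i, h in enumerate(history):
        if h == 0.0:
            status = "never added"
        elif h == 0.5:
            status = "added during initialization"
        elif h > 0.0:
            status = "added at step %d" % int(h)
        else:
            # Assumption: a negative code -k means deletion at step k.
            status = "deleted at step %d" % int(-h)
        messages.append("variable %d: %s" % (i, status))
    return messages


# Hypothetical history for four candidate variables.
report = describe_history([0.5, 2.0, -3.0, 0.0])
```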
covSweptUser (Output)
User-allocated array of length (nCandidate + 1) × (nCandidate + 1) that results after cov has been swept on the columns corresponding to the variables in the model. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of covSweptUser corresponding to the independent variables in the final model and multiplying the elements of this matrix by anovaTable[7].
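
The extraction described above can be sketched in plain Python. The function name coef_covariance and all numbers below are hypothetical and chosen only to show the indexing. The sketch assumes the last row and column belong to the dependent variable and that sweptUser marks variables in the model with 1.

```python
def coef_covariance(cov_swept, swept, error_mean_square):
    """Estimated variance-covariance matrix of the coefficients in the model.

    cov_swept         -- (nCandidate + 1) x (nCandidate + 1) nested list
    swept             -- sweptUser-style list (1 = in model, -1 = not in model)
    error_mean_square -- anovaTable[7]
    """
    # Skip the last entry, assumed to correspond to the dependent variable.
    idx = [i for i, s in enumerate(swept[:-1]) if s == 1]
    return [[cov_swept[r][c] * error_mean_square for c in idx] for r in idx]


# Hypothetical swept matrix for two candidates plus the dependent variable.
cov_swept = [[0.5, 0.0, 1.0],
             [0.0, 9.0, 2.0],
             [-1.0, 2.0, 4.0]]
one_var = coef_covariance(cov_swept, [1, -1, -1], 2.0)
two_var = coef_covariance(cov_swept, [1, 1, -1], 2.0)
```

The square roots of the diagonal of the result are the standard errors of the coefficient estimates.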
inputCov, int nObservations, float cov (Input)

An (nCandidate + 1) by (nCandidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Argument nObservations is an integer specifying the number of observations associated with cov. Argument cov can be computed using covariances. Arguments x, y, weights, and frequencies are not accessed when this option is specified.

By default, regressionStepwise computes cov from the input data matrices x and y.

Description

Function regressionStepwise builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. Function regressionStepwise is designed so the user can monitor, and perhaps change, the variables added to (or deleted from) the model after each step. In this case, multiple calls to regressionStepwise (using optional arguments firstStep, intermediateStep, …, lastStep) are made. Alternatively, regressionStepwise can be invoked once (default, or specify optional argument allSteps) in order to perform the stepping until a final model is selected.

Levels of priority can be assigned to the candidate independent variables (use optional argument level). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see optional argument force). Note that specifying optional argument force without also specifying optional argument level will result in all variables being forced into the model.

Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is required. Other possibilities are as follows:

  1. The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in cov (see optional argument inputCov). Argument nObservations must be set to one greater than the number of observations.
  2. An intercept is a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent and dependent variables are required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument nObservations must be set to one greater than the number of observations.
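
The difference between the mean-corrected and the raw sum-of-squares and crossproducts matrices can be illustrated with a short sketch. The helper name sscp and the tiny data set are hypothetical. In practice cov would normally be computed with covariances, as noted under inputCov.

```python
def sscp(columns, corrected=True):
    """Sum-of-squares and crossproducts matrix of the given columns.

    columns   -- list of equal-length lists, dependent variable last
    corrected -- True subtracts each column mean first (intercept forced in);
                 False returns the raw (uncorrected) matrix
    """
    if corrected:
        columns = [[v - sum(col) / len(col) for v in col] for col in columns]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


# Hypothetical data: one independent variable and the response.
x1 = [0, 1, 2]
y = [1, 3, 5]

corrected = sscp([x1, y])                   # intercept forced into every model
raw = sscp([x1, y], corrected=False)        # case 1: no intercept in the model
with_constant = sscp([[1, 1, 1], x1, y],    # case 2: intercept is a candidate
                     corrected=False)
```

In the third call the extra first row and column hold the crossproducts of the constant regressor with the other variables, and the remaining block equals the raw matrix from the second call.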

The stepwise regression algorithm is due to Efroymson (1960). Function regressionStepwise uses sweeps of the covariance matrix (input in cov, if optional argument inputCov is specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm is also given by Kennedy and Gentle (1980, pp. 335−340). The advantage of stepwise model building over all possible regressions (see function regressionSelection) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest \(R^2\)) for any subset size of independent variables.
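
To make the role of sweeping concrete, the following is a minimal sketch of a SWEEP operator in the style described by Goodnight (1979). It is not the library's implementation: the function name sweep is hypothetical, the sign and scaling conventions used internally by regressionStepwise may differ, and the linear-dependence check controlled by tolerance is omitted.

```python
def sweep(a, k):
    """Return a copy of the square matrix a swept on pivot k."""
    n = len(a)
    d = a[k][k]
    if d == 0:
        raise ZeroDivisionError("zero pivot; variable is linearly dependent")
    out = [row[:] for row in a]
    # Scale the pivot row by the pivot.
    out[k] = [v / d for v in a[k]]
    for i in range(n):
        if i == k:
            continue
        b = a[i][k]
        # Eliminate the pivot column from row i.
        out[i] = [a[i][j] - b * out[k][j] for j in range(n)]
        out[i][k] = -b / d
    out[k][k] = 1.0 / d
    return out


# Raw crossproducts of (constant, x, y) for x = 0, 1, 2 and y = 1 + 2x.
a = [[3.0, 3.0, 9.0],
     [3.0, 5.0, 13.0],
     [9.0, 13.0, 35.0]]
swept = sweep(sweep(a, 0), 1)
coefficients = [swept[0][2], swept[1][2]]   # intercept and slope
error_ss = swept[2][2]                      # residual sum of squares
```

Sweeping on a column brings that variable into the model. After both sweeps the last column holds the coefficient estimates and the bottom-right element holds the error sum of squares, which is why a step costs only one sweep rather than a full refit.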

Example

This example uses a data set from Draper and Smith (1981, pp. 629−630). Backward stepping is performed by default.

from __future__ import print_function
from numpy import array
from pyimsl.stat.regressionStepwise import regressionStepwise
from pyimsl.stat.writeMatrix import writeMatrix


labels = ["degrees of freedom for regression",
          "degrees of freedom for error",
          "total degrees of freedom",
          "sum of squares for regression",
          "sum of squares for error",
          "total sum of squares",
          "regression mean square",
          "error mean square",
          "F-statistic",
          "p-value",
          "R-squared (in percent)",
          "adjusted R-squared (in percent)",
          "est. standard deviation of within error"]
c_labels = ["variable",
            "estimate",
            "s.e.",
            "t",
            "prob > t"]
x = array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
           102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
aov = []
tt = []

regressionStepwise(x, y,
                   anovaTable=aov,
                   coefTTests=tt)

writeMatrix("* * * Analysis of Variance * * *\n", aov, column=True,
            rowLabels=labels, writeFormat="%9.2f")

writeMatrix("* * * Inference on Coefficients * * *\n", tt,
            colLabels=c_labels, writeFormat="%9.2f")

Output

 
         * * * Analysis of Variance * * *

degrees of freedom for regression             2.00
degrees of freedom for error                 10.00
total degrees of freedom                     12.00
sum of squares for regression              2657.86
sum of squares for error                     57.90
total sum of squares                       2715.76
regression mean square                     1328.93
error mean square                             5.79
F-statistic                                 229.50
p-value                                       0.00
R-squared (in percent)                       97.87
adjusted R-squared (in percent)              97.44
est. standard deviation of within error       2.41
 
       * * * Inference on Coefficients * * *

variable   estimate       s.e.          t   prob > t
       1       1.47       0.12      12.10       0.00
       2       0.66       0.05      14.44       0.00
       3       0.25       0.18       1.35       0.21
       4      -0.24       0.17      -1.37       0.21

Warning Errors

IMSLS_LINEAR_DEPENDENCE_1 Based on “tolerance” = #, there are linear dependencies among the variables to be forced.

Fatal Errors

IMSLS_NO_VARIABLES_ENTERED No variables entered the model. All elements of “anovaTable” are set to NaN.