regressionStepwise¶
Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.
Synopsis¶
regressionStepwise (x, y)
Required Arguments¶
float x[[]] (Input)
Array of size nRows × nCandidate containing the data for the candidate variables.

float y[] (Input)
Array of length nRows containing the responses for the dependent variable.
Optional Arguments¶
weights, float[] (Input)
Array of length nRows containing the weight for each row of x.
Default: weights[] = 1

frequencies, float[] (Input)
Array of length nRows containing the frequency for each row of x.
Default: frequencies[] = 1

firstStep, or intermediateStep, or lastStep, or allSteps
One or none of these options can be specified. If none of these is specified, the action defaults to allSteps.
Argument | Action |
---|---|
firstStep | This is the first invocation; additional calls will be made. Initialization and stepping are performed. |
intermediateStep | This is an intermediate invocation. Stepping is performed. |
lastStep | This is the final invocation. Stepping and wrap-up computations are performed. |
allSteps | This is the only invocation. Initialization, stepping, and wrap-up computations are performed. |

nSteps, int (Input)
For nonnegative nSteps, nSteps steps are taken. If nSteps = −1, stepping continues until completion.

forward, or backward, or stepwise
One or none of these options can be specified. If none is specified, the action defaults to backward.
Keyword | Action |
---|---|
forward | An attempt is made to add a variable to the model. A variable is added if its p-value is less than pValueIn. During initialization, only the forced variables enter the model. |
backward | An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds pValueOut. During initialization, all candidate independent variables enter the model. |
stepwise | A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization. |

pValueIn, float (Input)
Largest p-value for variables entering the model. Variables with p-values less than pValueIn may enter the model.
Default: pValueIn = 0.05

pValueOut, float (Input)
Smallest p-value for removing variables. Variables with p-values greater than pValueOut may leave the model. Argument pValueOut must be greater than or equal to pValueIn. A common choice for pValueOut is 2*pValueIn.
Default: pValueOut = 0.10

tolerance, float (Input)
Tolerance used in determining linear dependence.
Default: tolerance = 100*eps, where eps = machine(4) for single precision

anovaTable, float (Output)
The array containing the analysis of variance table. The analysis of variance statistics are as follows:
Element | Analysis of Variance Statistic |
---|---|
0 | degrees of freedom for regression |
1 | degrees of freedom for error |
2 | total degrees of freedom |
3 | sum of squares for regression |
4 | sum of squares for error |
5 | total sum of squares |
6 | regression mean square |
7 | error mean square |
8 | F-statistic |
9 | p-value |
10 | \(R^2\) (in percent) |
11 | adjusted \(R^2\) (in percent) |
12 | estimate of the standard deviation |
Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.

coefTTests (Output)
An array containing statistics relating to the regression coefficients for the final model in this invocation. The rows correspond to the nCandidate independent variables. The rows are in the same order as the variables in x (or, if inputCov is specified, the rows are in the same order as the variables in cov). Each row corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the row in question.
Column | Description |
---|---|
0 | coefficient estimate |
1 | estimated standard error of the coefficient estimate |
2 | t-statistic for the test that the coefficient is 0 |
3 | p-value for the two-sided t test |

coefVif (Output)
An array containing variance inflation factors for the final model in this invocation. The elements correspond to the nCandidate independent variables. The elements are in the same order as the variables in x (or, if inputCov is specified, the elements are in the same order as the variables in cov). Each element corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the element in question.
The square of the multiple correlation coefficient for the i‑th regressor after all others can be obtained from coefVif[i] by the following formula:
\[R^2 = 1.0 - \frac{1.0}{\text{coefVif}[i]}\]

level, int[] (Input)
Array of length nCandidate + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value which indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. Argument level[I] = 0 means the I‑th variable is never to enter the model. Argument level[I] = −1 means the I‑th variable is the dependent variable. Argument level[nCandidate] must correspond to the dependent variable, except when inputCov is specified.
Default: 1, 1, …, 1, −1 where −1 corresponds to level[nCandidate]

force, int (Input)
Variables with levels 1, 2, …, force are forced into the model as independent variables. See level.

iend (Output)
Variable which indicates whether additional steps are possible.
iend | Meaning |
---|---|
0 | Additional steps may be possible. |
1 | No additional steps are possible. |

sweptUser (Output)
A user-allocated array of length nCandidate + 1 with information to indicate the independent variables in the model. Argument sweptUser[nCandidate] usually corresponds to the dependent variable. See level.
sweptUser[i] | Status of i‑th Variable |
---|---|
−1 | Variable i is not in model. |
1 | Variable i is in model. |

historyUser (Output)
User-allocated array of length nCandidate + 1 containing the recent history of the independent variables. Element historyUser[nCandidate] usually corresponds to the dependent variable. See level.
historyUser[i] | Status of i‑th Variable |
---|---|
0.0 | Variable has never been added to model. |
0.5 | Variable was added into the model during initialization. |
k > 0.0 | Variable was added to the model during the k-th step. |
k < 0.0 | Variable was deleted from model during the k-th step. |

covSweptUser (Output)
User-allocated array of length (nCandidate + 1) × (nCandidate + 1) that results after cov has been swept on the columns corresponding to the variables in the model. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of covSweptUser corresponding to the independent variables in the final model and multiplying the elements of this matrix by anovaTable[7].

inputCov, int nObservations, float cov (Input)
An (nCandidate + 1) by (nCandidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Argument nObservations is an integer specifying the number of observations associated with cov. Argument cov can be computed using covariances. Arguments x, y, weights, and frequencies are not accessed when this option is specified.
By default, regressionStepwise computes cov from the input data matrices x and y.
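
The relationship between a variance inflation factor and the squared multiple correlation described under coefVif, \(R^2 = 1 - 1/\text{VIF}\), can be sketched with plain NumPy. This is an illustration only: the helper name r_squared_from_vif is hypothetical, and the factors are computed here with the textbook definition (diagonal of the inverse correlation matrix of all four candidate variables from the Example section) rather than taken from regressionStepwise, whose coefVif values refer to the final model of the invocation.

```python
import numpy as np


def r_squared_from_vif(vif):
    """Convert variance inflation factors to squared multiple correlations."""
    vif = np.asarray(vif, dtype=float)
    return 1.0 - 1.0 / vif


# Candidate variables from the Example section of this page.
x = np.array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])

# Assumption for illustration: textbook VIFs for a model containing all
# four candidates, i.e. the diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(np.corrcoef(x, rowvar=False)))

# Squared multiple correlation of each regressor on the remaining ones.
r2 = r_squared_from_vif(vif)
print(r2)
```

A value of r2 close to 1 for a regressor indicates that it is nearly a linear combination of the other regressors.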
Description¶
Function regressionStepwise builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. Function regressionStepwise is designed so the user can monitor, and perhaps change, the variables added to (or deleted from) the model after each step. In this case, multiple calls to regressionStepwise (using optional arguments firstStep, intermediateStep, …, lastStep) are made. Alternatively, regressionStepwise can be invoked once (the default, or specify optional argument allSteps) to perform the stepping until a final model is selected.
Levels of priority can be assigned to the candidate independent variables (use optional argument level). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see optional argument force). Note that specifying optional argument force without also specifying optional argument level will result in all variables being forced into the model, because the default level of every candidate variable is 1.
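
The priority rules above can be sketched in plain Python. This is one reading of the rules stated for level and force, not the library's internal logic; the function names entry_candidates and removal_candidates, the in_model flag list, and the force=0 default are all hypothetical.

```python
def entry_candidates(level, in_model):
    """Indices of variables allowed to compete for entry at the next step."""
    eligible = []
    for i, lev in enumerate(level):
        # Level 0 never enters; level -1 marks the dependent variable.
        if lev <= 0 or in_model[i]:
            continue
        # Every variable with a smaller nonzero level must already be in.
        if all(in_model[j] for j, other in enumerate(level) if 0 < other < lev):
            eligible.append(i)
    return eligible


def removal_candidates(level, in_model, force=0):
    """Indices of variables allowed to compete for deletion at the next step."""
    eligible = []
    for i, lev in enumerate(level):
        # Variables with levels 1..force are forced and cannot leave.
        if lev <= 0 or not in_model[i] or lev <= force:
            continue
        # Every variable with a higher level must already have left.
        if not any(in_model[j] for j, other in enumerate(level) if other > lev):
            eligible.append(i)
    return eligible


# Four candidates with levels 1, 1, 2, 3; the last entry is the response.
level = [1, 1, 2, 3, -1]
print(entry_candidates(level, [True, False, False, False, False]))
print(removal_candidates(level, [True, True, True, False, False], force=1))
```

With the default level array (all candidates at level 1), any force value of 1 or more marks every candidate as forced, which is the behavior noted in the paragraph above.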
Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is required. Other possibilities are as follows:
- The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in cov (see optional argument inputCov). Argument nObservations must be set to one greater than the number of observations.
- An intercept is a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent, and dependent variables is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument nObservations must be set to one greater than the number of observations.
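
The three matrices discussed above can be built with plain NumPy, as in the following sketch. It uses the first five observations from the Example section purely as illustrative data. Placing the constant regressor in the first row and column is an assumption; this page does not state where it belongs in cov.

```python
import numpy as np

# First five observations from the Example section (illustrative subset).
x = np.array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0]])
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9])

# Dependent variable goes in the last column.
data = np.column_stack([x, y])
n = data.shape[0]
mean = data.mean(axis=0)

# Typical case: SSCP matrix corrected for the mean (intercept forced in).
corrected_sscp = (data - mean).T @ (data - mean)

# No intercept in the model: raw (uncorrected) SSCP matrix.
raw_sscp = data.T @ data

# Intercept as a candidate: raw SSCP with a constant regressor (= 1).
# Assumption: the constant occupies the first row and column.
with_const = np.column_stack([np.ones(n), data])
raw_sscp_const = with_const.T @ with_const

print(corrected_sscp.shape, raw_sscp.shape, raw_sscp_const.shape)
```

The corrected matrix equals the raw matrix minus n times the outer product of the column means, and the extra row of raw_sscp_const holds the column sums.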
The stepwise regression algorithm is due to Efroymson (1960). Function regressionStepwise uses sweeps of the covariance matrix (input in cov, if optional argument inputCov is specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm is also given by Kennedy and Gentle (1980, pp. 335−340). The advantage of stepwise model building over all possible regressions (see function regressionSelection) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest \(R^2\)) for any subset size of independent variables.
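
How a sweep moves a variable into the model can be sketched in NumPy. The function below is one common form of the sweep operator (in-place Gauss-Jordan pivoting); its sign convention and the name sweep are assumptions and may differ from what regressionStepwise does internally. Sweeping the mean-corrected sum-of-squares and crossproducts matrix on the columns of variables 1 and 2 leaves their coefficient estimates in the last column and the error sum of squares in the last diagonal element.

```python
import numpy as np


def sweep(a, k):
    """Return a copy of square matrix a swept on pivot k."""
    a = np.asarray(a, dtype=float)
    d = a[k, k]
    out = a - np.outer(a[:, k], a[k, :]) / d  # update non-pivot entries
    out[k, :] = a[k, :] / d                   # pivot row
    out[:, k] = -a[:, k] / d                  # pivot column (sign is a convention)
    out[k, k] = 1.0 / d                       # pivot element
    return out


# Data from the Example section of this page.
x = np.array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
              102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])

# Mean-corrected SSCP matrix with the response in the last column.
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)
sscp = centered.T @ centered

# Bring variables 1 and 2 (columns 0 and 1) into the model.
swept = sweep(sweep(sscp, 0), 1)
coef = swept[[0, 1], 4]   # slope estimates for the two swept variables
sse = swept[4, 4]         # error sum of squares of the two-variable model
print(coef, sse)
```

The printed values should agree with the estimates for variables 1 and 2 and the sum of squares for error shown in the Output section. Sweeping the same pivot a second time with the opposite sign convention reverses the operation, which is how a variable is removed again.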
Example¶
This example uses a data set from Draper and Smith (1981, pp. 629−630). Backward stepping is performed by default.
from __future__ import print_function
from numpy import array
from pyimsl.stat.regressionStepwise import regressionStepwise
from pyimsl.stat.writeMatrix import writeMatrix

# Row labels for the analysis of variance table
labels = ["degrees of freedom for regression",
          "degrees of freedom for error",
          "total degrees of freedom",
          "sum of squares for regression",
          "sum of squares for error",
          "total sum of squares",
          "regression mean square",
          "error mean square",
          "F-statistic",
          "p-value",
          "R-squared (in percent)",
          "adjusted R-squared (in percent)",
          "est. standard deviation of within error"]
# Column labels for the coefficient t-test table
c_labels = ["variable",
            "estimate",
            "s.e.",
            "t",
            "prob > t"]
# Candidate variables: 13 observations on 4 candidates
x = array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
# Responses for the dependent variable
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
           102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
# Empty lists that receive the output arrays
aov = []
tt = []
# Single invocation; backward selection is the default
regressionStepwise(x, y,
                   anovaTable=aov,
                   coefTTests=tt)
writeMatrix("* * * Analysis of Variance * * *\n", aov, column=True,
            rowLabels=labels, writeFormat="%9.2f")
writeMatrix("* * * Inference on Coefficients * * *\n", tt,
            colLabels=c_labels, writeFormat="%9.2f")
Output¶
* * * Analysis of Variance * * *

degrees of freedom for regression            2.00
degrees of freedom for error                10.00
total degrees of freedom                    12.00
sum of squares for regression             2657.86
sum of squares for error                    57.90
total sum of squares                      2715.76
regression mean square                    1328.93
error mean square                            5.79
F-statistic                                229.50
p-value                                      0.00
R-squared (in percent)                      97.87
adjusted R-squared (in percent)             97.44
est. standard deviation of within error      2.41

* * * Inference on Coefficients * * *

variable    estimate        s.e.           t    prob > t
       1        1.47        0.12       12.10        0.00
       2        0.66        0.05       14.44        0.00
       3        0.25        0.18        1.35        0.21
       4       -0.24        0.17       -1.37        0.21
Warning Errors¶
IMSLS_LINEAR_DEPENDENCE_1 | Based on “tolerance” = #, there are linear dependencies among the variables to be forced. |
Fatal Errors¶
IMSLS_NO_VARIABLES_ENTERED | No variables entered the model. All elements of “anovaTable” are set to NaN. |