regressionSelection¶
Selects the best multiple linear regression models.
Synopsis¶
regressionSelection(x, y)
Required Arguments¶
- float x[[]] (Input)
  Array of size nRows × nCandidate containing the data for the candidate variables.
- float y[] (Input)
  Array of length nRows containing the responses for the dependent variable.
Optional Arguments¶
- t_print (Input)
  Printing is performed. This is the default.

  or

- noPrint (Input)
  Printing is not performed.

- weights, float[] (Input)
  Array of length nRows containing the weight for each row of x.
  Default: weights[] = 1

- frequencies, float[] (Input)
  Array of length nRows containing the frequency for each row of x.
  Default: frequencies[] = 1

- rSquared, int (Input)
  The \(R^2\) criterion is used, where subset sizes 1, 2, …, rSquared are examined. This option is the default, with rSquared = nCandidate.

  or

- adjRSquared
  The adjusted \(R^2\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.

  or

- mallowsCp
  Mallows’ \(C_p\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.

- maxNBest, int (Input)
  Number of best regressions to be found. If the \(R^2\) criterion is selected, the maxNBest best regressions for each subset size examined are found. If the adjusted \(R^2\) or Mallows’ \(C_p\) criterion is selected, the maxNBest best overall regressions are found.
  Default: maxNBest = 1

- maxNGoodSaved, int (Input)
  Maximum number of good regressions of each subset size to be saved in finding the best regressions. Argument maxNGoodSaved must be greater than or equal to maxNBest. Normally, maxNGoodSaved should be less than or equal to 10; it never needs to be larger than the maximum number of subsets for any subset size. Computing time required is inversely related to maxNGoodSaved.
  Default: maxNGoodSaved = 10

- criterions, indexCriterions, criterions (Output)
  Argument indexCriterions is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in criterions of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexCriterions[I], indexCriterions[I] + 1, …, indexCriterions[I + 1] − 1 of criterions correspond to the (I + 1)-st subset size. Argument criterions is an array of length max(indexCriterions[nsize] − 1, nCandidate) containing in its first indexCriterions[nsize] − 1 elements the criterion values for each subset considered, in increasing subset size order. (A sketch of walking this layout appears after this list.)

- independentVariables, indexVariables, independentVariables (Output)
  Argument indexVariables is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in independentVariables of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexVariables[I], indexVariables[I] + 1, …, indexVariables[I + 1] − 1 of independentVariables correspond to the (I + 1)-st subset size. Argument independentVariables is an array of length indexVariables[nsize] − 1 containing the variable numbers for each subset considered, in the same order as in criterions.

- coefStatistics, indexCoefficients, coefficients (Output)
  Argument indexCoefficients is an array of length ntbest + 1 containing the locations in coefficients of the first row for each of the best regressions. Here, ntbest is the total number of best regressions found; it is equal to rSquared × maxNBest if rSquared is specified, equal to maxNBest if either mallowsCp or adjRSquared is specified, and equal to maxNBest × nCandidate otherwise. For I = 0, 1, …, ntbest − 1, rows indexCoefficients[I], indexCoefficients[I] + 1, …, indexCoefficients[I + 1] − 1 of coefficients correspond to the (I + 1)-st regression. Argument coefficients is an array of size (indexCoefficients[ntbest] − 1) × 5 containing statistics relating to the regression coefficients of the best models. Each row corresponds to a coefficient for a particular regression. The regressions are ordered by increasing subset size; within each subset size, the better regressions appear first. The statistics in the columns are as follows (inferences are conditional on the selected model):

Column | Description
---|---
0 | variable number
1 | coefficient estimate
2 | estimated standard error of the estimate
3 | t-statistic for the test that the coefficient is 0
4 | p-value for the two-sided t test

- inputCov, int nObservations, float cov[] (Input)
  Argument nObservations is the number of observations associated with array cov. Argument cov is an (nCandidate + 1) × (nCandidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Array cov can be computed using covariances. Arguments x and y, and optional arguments frequencies and weights, are not accessed when this option is specified. Normally, regressionSelection computes cov from the input data matrices x and y; however, there may be cases when the user wishes to calculate the covariance matrix and manipulate it before calling regressionSelection. See the Description section below for a discussion of such cases.
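The compressed layout shared by criterions, independentVariables, and coefStatistics is easiest to see in code. The following sketch is illustrative only: it walks the criterions structure using the \(R^2\) values printed in Example 1 below, and it takes the stored locations to be 1-based element numbers, as the indexCriterions[nsize] − 1 length given above implies.

# Illustrative sketch: walking the compressed criterions layout.
criterions = [67.5, 66.6, 53.4, 28.6,        # subset size 1
              97.9, 97.2, 93.5, 68.0, 54.8,  # subset size 2
              98.2, 98.2, 98.1, 97.3,        # subset size 3
              98.2]                          # subset size 4
indexCriterions = [1, 5, 10, 14, 15]         # nsize + 1 locations

nsize = len(indexCriterions) - 1
for i in range(nsize):
    # Convert 1-based element numbers to 0-based Python slices.
    start = indexCriterions[i] - 1
    end = indexCriterions[i + 1] - 1
    print("Subset size", i + 1, "criteria:", criterions[start:end])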
Description¶
Function regressionSelection
finds the best subset regressions for a
regression problem with nCandidate
independent variables. Typically, the
intercept is forced into all models and is not a candidate variable. In this
case, a sum of squares and crossproducts matrix for the independent and
dependent variables corrected for the mean is computed internally. There may
be cases when it is convenient for the user to calculate the matrix; see the
description of optional argument inputCov.
“Best” is defined, on option, by one of the following three criteria:

- \(R^2\) (in percent):

  \[R^2 = 100\left(1 - \frac{SSE_p}{SST}\right)\]

- \(R_a^2\) (adjusted \(R^2\) in percent):

  \[R_a^2 = 100\left[1 - \left(\frac{n-1}{n-p}\right)\left(1 - \frac{R^2}{100}\right)\right]\]

  Note that maximizing this criterion is equivalent to minimizing the residual mean square, \(SSE_p/(n-p)\).

- Mallows’ \(C_p\) statistic:

  \[C_p = \frac{SSE_p}{s_{nCandidate}^2} + 2p - n\]

Here, n is equal to the sum of the frequencies (or nRows if frequencies is not specified) and SST is the total sum of squares. \(SSE_p\) is the error sum of squares in a model containing p regression parameters including \(\beta_0\) (or p − 1 of the nCandidate candidate variables). The quantity \(s_{nCandidate}^2\) is the error mean square from the model with all nCandidate variables in the model. Hocking (1972) and Draper and Smith (1981, pp. 296−302) discuss these criteria.
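For concreteness, the three criteria can be transcribed directly from the formulas above. This is a sketch for illustration, not the library's internal implementation.

# sse_p:   error sum of squares of a model with p parameters
# sst:     total sum of squares
# n:       sum of the frequencies (or nRows)
# s2_full: error mean square of the model with all nCandidate variables

def r_squared(sse_p, sst):
    return 100.0 * (1.0 - sse_p / sst)

def adj_r_squared(sse_p, sst, n, p):
    # 100 * (1 - ((n-1)/(n-p)) * (1 - R^2/100)); note 1 - R^2/100 = sse_p/sst.
    return 100.0 * (1.0 - (n - 1.0) / (n - p) * (sse_p / sst))

def mallows_cp(sse_p, s2_full, n, p):
    return sse_p / s2_full + 2.0 * p - n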
Function regressionSelection
is based on the algorithm of Furnival and
Wilson (1974). This algorithm finds maxNGoodSaved
candidate regressions
for each possible subset size. These regressions are used to identify a set
of best regressions. In large problems, many regressions are not computed.
They may be rejected without computation based on results for other subsets;
this yields an efficient technique for considering all possible regressions.
There are cases when the user may want to input the variance-covariance matrix rather than allow the function regressionSelection to calculate it. This can be accomplished using optional argument inputCov. Three situations in which the user may want to do this are as follows:

- The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required. Argument nObservations must be set to 1 greater than the number of observations. Form \(A^T A\), where \(A=\left[X, Y\right]\), to compute the raw sum of squares and crossproducts matrix (see the sketch after this list).
- An intercept is a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor (= 1.0), independent, and dependent variables is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument nObservations must be set to 1 greater than the number of observations.
- There are m variables to be forced into the models. A sum of squares and crossproducts matrix adjusted for the m variables is required (calculated by regressing the candidate variables on the variables to be forced into the model). Argument nObservations must be set to m less than the number of observations.
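As a minimal sketch of the first case, the raw sum of squares and crossproducts matrix can be formed with NumPy as shown below, using the first three observations of Example 1 for brevity. Only the matrix construction is shown; the call through inputCov itself follows the argument description above.

import numpy as np

# No-intercept case: form A^T A with A = [X, Y], where X holds the
# candidate variables and Y the responses.
X = np.array([[7.0, 26.0, 6.0, 60.0],
              [1.0, 29.0, 15.0, 52.0],
              [11.0, 56.0, 8.0, 20.0]])
Y = np.array([[78.5], [74.3], [104.3]])

A = np.hstack([X, Y])
cov = A.T @ A                    # (nCandidate + 1) x (nCandidate + 1)
nObservations = X.shape[0] + 1   # 1 greater than the number of rows

print(cov.shape, nObservations)  # (5, 5) 4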
Programming Notes¶
Function regressionSelection can save considerable CPU time over explicitly computing all possible regressions. However, the function has some limitations that can cause unexpected results for users who are unaware of them.

- For nCandidate + 1 > \(-\log_2(\varepsilon)\), where ɛ is machine(4) (see Chapter 15, Utilities), some results can be incorrect. This limitation arises because the possible models indicated (the model numbers \(1,2,\ldots,2^{nCandidate}\)) are stored as floating-point values; for sufficiently large nCandidate, the model numbers cannot be stored exactly. On many computers, this means regressionSelection can produce incorrect results in single precision for nCandidate > 24 and in double precision for nCandidate > 49. (A short snippet after this list illustrates the underlying floating-point collision.)
- Function regressionSelection eliminates some subsets of candidate variables by obtaining lower bounds on the error sum of squares from fitting larger models. First, the full model containing all nCandidate variables is fit sequentially using a forward stepwise procedure in which one variable enters the model at a time, and criterion values and model numbers for all the candidate variables that can enter at each step are stored. If linearly dependent variables are removed from the full model, the warning error IMSLS_VARIABLES_DELETED is issued. In that case, some submodels that contain variables removed from the full model because of linear dependency can be overlooked if they have not already been identified during the initial forward stepwise procedure. If this warning is issued and you want the removed variables to be considered in smaller models, rerun the program with a set of linearly independent variables.
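The single-precision limit in the first note above comes from consecutive model numbers becoming indistinguishable as floating-point values; the snippet below demonstrates the collision.

import numpy as np

# Consecutive integers above 2**24 are not all representable as
# 32-bit floats; the analogous double-precision limit is near 2**53.
print(np.float32(2**24 + 1) == np.float32(2**24))  # True: collision
print(np.float64(2**53 + 1) == np.float64(2**53))  # True: collision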
Examples¶
Example 1¶
This example uses a data set from Draper and Smith (1981, pp. 629−630).
Function regressionSelection
is invoked to find the best regression for
each subset size using the \(R^2\) criterion. By default, the function
prints the results.
from numpy import *
from pyimsl.stat.regressionSelection import regressionSelection

# Hald cement data of Draper and Smith (1981): 13 observations on
# four candidate independent variables.
x = array([
[7.0, 26.0, 6.0, 60.0],
[1.0, 29.0, 15.0, 52.0],
[11.0, 56.0, 8.0, 20.0],
[11.0, 31.0, 8.0, 47.0],
[7.0, 52.0, 6.0, 33.0],
[11.0, 55.0, 9.0, 22.0],
[3.0, 71.0, 17.0, 6.0],
[1.0, 31.0, 22.0, 44.0],
[2.0, 54.0, 18.0, 22.0],
[21.0, 47.0, 4.0, 26.0],
[1.0, 40.0, 23.0, 34.0],
[11.0, 66.0, 9.0, 12.0],
[10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
# Find the best regression of each subset size using the default
# R-squared criterion; results are printed automatically.
regressionSelection(x, y)
Output¶
Regressions with 1 variable(s) (R-squared)
Criterion Variables
67.5 4
66.6 2
53.4 1
28.6 3
Regressions with 2 variable(s) (R-squared)
Criterion Variables
97.9 1 2
97.2 1 4
93.5 3 4
68 2 4
54.8 1 3
Regressions with 3 variable(s) (R-squared)
Criterion Variables
98.2 1 2 4
98.2 1 2 3
98.1 1 3 4
97.3 2 3 4
Regressions with 4 variable(s) (R-squared)
Criterion Variables
98.2 1 2 3 4
Best Regression with 1 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
4 -0.7382 0.1546 -4.775 0.0006
Best Regression with 2 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.468 0.1213 12.10 0.0000
2 0.662 0.0459 14.44 0.0000
Best Regression with 3 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.452 0.1170 12.41 0.0000
2 0.416 0.1856 2.24 0.0517
4 -0.237 0.1733 -1.37 0.2054
Best Regression with 4 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.551 0.7448 2.083 0.0708
2 0.510 0.7238 0.705 0.5009
3 0.102 0.7547 0.135 0.8959
4 -0.144 0.7091 -0.203 0.8441
Example 2¶
This example uses the same data set as the first example, but Mallows’
\(C_p\) statistic is used as the criterion rather than \(R^2\). Note
that when Mallows’ \(C_p\) statistic (or adjusted \(R^2\)) is
specified, the argument maxNBest indicates the total number of “best”
regressions (rather than the number of best regressions per subset
size, as it does for the \(R^2\) criterion). In this example, the
three best regressions are found to be (1, 2), (1, 2, 4), and (1, 2, 3).
from numpy import *
from pyimsl.stat.regressionSelection import regressionSelection

# Same Hald cement data as in Example 1.
x = array([
[7.0, 26.0, 6.0, 60.0],
[1.0, 29.0, 15.0, 52.0],
[11.0, 56.0, 8.0, 20.0],
[11.0, 31.0, 8.0, 47.0],
[7.0, 52.0, 6.0, 33.0],
[11.0, 55.0, 9.0, 22.0],
[3.0, 71.0, 17.0, 6.0],
[1.0, 31.0, 22.0, 44.0],
[2.0, 54.0, 18.0, 22.0],
[21.0, 47.0, 4.0, 26.0],
[1.0, 40.0, 23.0, 34.0],
[11.0, 66.0, 9.0, 12.0],
[10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
# Find the three best regressions overall using Mallows' Cp criterion.
regressionSelection(x, y, mallowsCp=True, maxNBest=3)
Output¶
Regressions with 1 variable(s) (Mallows CP)
Criterion Variables
139 4
142 2
203 1
315 3
Regressions with 2 variable(s) (Mallows CP)
Criterion Variables
2.68 1 2
5.5 1 4
22.4 3 4
138 2 4
198 1 3
Regressions with 3 variable(s) (Mallows CP)
Criterion Variables
3.02 1 2 4
3.04 1 2 3
3.5 1 3 4
7.34 2 3 4
Regressions with 4 variable(s) (Mallows CP)
Criterion Variables
5 1 2 3 4
Best Regression with 2 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.468 0.1213 12.10 0.0000
2 0.662 0.0459 14.44 0.0000
Best Regression with 3 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.452 0.1170 12.41 0.0000
2 0.416 0.1856 2.24 0.0517
4 -0.237 0.1733 -1.37 0.2054
2nd Best Regression with 3 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.696 0.2046 8.29 0.0000
2 0.657 0.0442 14.85 0.0000
3 0.250 0.1847 1.35 0.2089
Warning Errors¶
IMSLS_VARIABLES_DELETED | At least one variable is deleted from the full model because the variance-covariance matrix cov is singular.
Fatal Errors¶
IMSLS_NO_VARIABLES | No variables can enter any model.