regressionSelection
Selects the best multiple linear regression models.
Synopsis
regressionSelection(x, y)
Required Arguments
- float x[[]] (Input)
  Array of size nRows × nCandidate containing the data for the candidate variables.
- float y[] (Input)
  Array of length nRows containing the responses for the dependent variable.
Optional Arguments
- t_print
  Printing is performed. This is the default.

or

- noPrint
  Printing is not performed.
- weights, float[] (Input)
  Array of length nRows containing the weight for each row of x.
  Default: weights[] = 1

- frequencies, float[] (Input)
  Array of length nRows containing the frequency for each row of x.
  Default: frequencies[] = 1

- rSquared, int (Input)
  The \(R^2\) criterion is used, where subset sizes 1, 2, …, rSquared are examined. This option is the default, with rSquared = nCandidate.
or

- adjRSquared
  The adjusted \(R^2\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.

or

- mallowsCp
  Mallows' \(C_p\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.

- maxNBest, int (Input)
  Number of best regressions to be found. If the \(R^2\) criterion is selected, the maxNBest best regressions for each subset size examined are found. If the adjusted \(R^2\) or Mallows' \(C_p\) criterion is selected, the maxNBest best overall regressions are found.
  Default: maxNBest = 1

- maxNGoodSaved, int (Input)
  Maximum number of good regressions of each subset size to be saved in finding the best regressions. Argument maxNGoodSaved must be greater than or equal to maxNBest. Normally, maxNGoodSaved should be less than or equal to 10. It never needs to be larger than the maximum number of subsets for any subset size. Computing time required is inversely related to maxNGoodSaved.
  Default: maxNGoodSaved = 10

- criterions, indexCriterions, criterions (Output)
  Argument indexCriterions is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in criterions of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexCriterions[I], indexCriterions[I] + 1, …, indexCriterions[I + 1] − 1 of criterions correspond to the (I + 1)-st subset size. Argument criterions is an array of length max(indexCriterions[nsize] − 1, nCandidate) containing in its first indexCriterions[nsize] − 1 elements the criterion values for each subset considered, in increasing subset size order. (A sketch of this indexing scheme follows the Optional Arguments list.)

- independentVariables, indexVariables, independentVariables (Output)
  Argument indexVariables is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in independentVariables of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexVariables[I], indexVariables[I] + 1, …, indexVariables[I + 1] − 1 of independentVariables correspond to the (I + 1)-st subset size. Argument independentVariables is an array of length indexVariables[nsize] − 1 containing the variable numbers for each subset considered, in the same order as in criterions.

- coefStatistics, indexCoefficients, coefficients (Output)
  Argument indexCoefficients is an array of length ntbest + 1 containing the locations in coefficients of the first row for each of the best regressions. Here, ntbest is the total number of best regressions found; it is equal to rSquared × maxNBest if rSquared is specified, equal to maxNBest if either mallowsCp or adjRSquared is specified, and equal to maxNBest × nCandidate otherwise. For I = 0, 1, …, ntbest − 1, rows indexCoefficients[I], indexCoefficients[I] + 1, …, indexCoefficients[I + 1] − 1 of coefficients correspond to the (I + 1)-st regression. Argument coefficients is an array of size (indexCoefficients[ntbest] − 1) × 5 containing statistics relating to the regression coefficients of the best models. Each row corresponds to a coefficient for a particular regression. The regressions are in order of increasing subset size, and within each subset size the better regressions appear first. The statistics in the columns are as follows (inferences are conditional on the selected model):
| Column | Description |
|---|---|
| 0 | variable number |
| 1 | coefficient estimate |
| 2 | estimated standard error of the estimate |
| 3 | t-statistic for the test that the coefficient is 0 |
| 4 | p-value for the two-sided t test |
- inputCov, int nObservations, float cov[] (Input)
  Argument nObservations is the number of observations associated with array cov. Argument cov is an (nCandidate + 1) × (nCandidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Array cov can be computed using covariances. Arguments x and y, and optional arguments frequencies and weights, are not accessed when this option is specified. Normally, regressionSelection computes cov from the input data matrices x and y; however, there may be cases when the user wishes to calculate the covariance matrix and manipulate it before calling regressionSelection. See the Description section below for a discussion of such cases.
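The criterions and independentVariables outputs above share one indexing scheme. The sketch below walks through it with illustrative literals matching the printed output of Example 1; the values are typed in by hand rather than retrieved from the function, and the locations are taken to be 1-based, since the number of stored elements is indexCriterions[nsize] − 1.

```python
# Illustrative values consistent with Example 1 (R-squared criterion,
# nsize = 4): subset sizes 1..4 hold 4, 5, 4, and 1 criterion values.
index_criterions = [1, 5, 10, 14, 15]        # length nsize + 1
criterions = [67.5, 66.6, 53.4, 28.6,        # subset size 1
              97.9, 97.2, 93.5, 68.0, 54.8,  # subset size 2
              98.2, 98.2, 98.1, 97.3,        # subset size 3
              98.2]                          # subset size 4

# Elements indexCriterions[I] ... indexCriterions[I + 1] - 1 belong to
# subset size I + 1; subtract 1 to convert the 1-based locations to
# Python's 0-based slicing.
for i in range(len(index_criterions) - 1):
    start = index_criterions[i] - 1
    stop = index_criterions[i + 1] - 1
    print("subset size", i + 1, ":", criterions[start:stop])
```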
Description
Function regressionSelection finds the best subset regressions for a
regression problem with nCandidate independent variables. Typically, the
intercept is forced into all models and is not a candidate variable. In this
case, a sum of squares and crossproducts matrix for the independent and
dependent variables corrected for the mean is computed internally. There may
be cases when it is convenient for the user to calculate the matrix; see the
description of optional argument inputCov.
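For reference, the mean-corrected sum of squares and crossproducts matrix just described can be formed in a few lines of NumPy. This is only a sketch with placeholder data; regressionSelection performs the equivalent computation internally.

```python
import numpy as np

x = np.array([[7.0, 26.0],
              [1.0, 29.0],
              [11.0, 56.0]])         # placeholder candidate variables
y = np.array([78.5, 74.3, 104.3])   # placeholder responses

z = np.column_stack([x, y])         # dependent variable in the last column
zc = z - z.mean(axis=0)             # correct for the column means
sscp = zc.T @ zc                    # corrected sum of squares and crossproducts
```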
“Best” is defined, by option, as one of the following three criteria:

- \(R^2\) (in percent):

  \[R^2 = 100\left(1 - \frac{SSE_p}{SST}\right)\]

- \(R_a^2\) (adjusted \(R^2\) in percent):

  \[R_a^2 = 100\left[1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SST}\right]\]

  Note that maximizing this criterion is equivalent to minimizing the residual mean square \(SSE_p/(n-p)\).

- Mallows' \(C_p\) statistic:

  \[C_p = \frac{SSE_p}{s^2_{nCandidate}} + 2p - n\]

Here, n is equal to the sum of the frequencies (or nRows if
frequencies is not specified) and SST is the total sum of squares.
\(SSE_p\) is the error sum of squares in a model containing p
regression parameters including \(\beta_0\) (or p − 1 of the
nCandidate candidate variables). The quantity \(s^2_{nCandidate}\)
is the error mean square from the model with all nCandidate variables in
the model. Hocking (1972) and Draper and Smith (1981, pp. 296−302) discuss
these criteria.
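As a quick numeric check of these formulas, the sketch below reproduces, to rounding, the criterion values the examples later report for the two-variable subset (1, 2). The sums of squares are rounded values for that data set, entered by hand for illustration rather than produced by the function.

```python
# Rounded quantities for the subset {1, 2} of the example data:
sse_p = 57.9      # error sum of squares of the subset model
sst = 2715.8      # total (corrected) sum of squares
s2_full = 5.98    # error mean square of the full four-variable model
n, p = 13, 3      # observations; parameters including the intercept

r2 = 100.0 * (1.0 - sse_p / sst)                          # about 97.9
r2_adj = 100.0 * (1.0 - (n - 1) / (n - p) * sse_p / sst)  # about 97.4
cp = sse_p / s2_full + 2 * p - n                          # about 2.68
print(round(r2, 1), round(cp, 2))
```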
Function regressionSelection is based on the algorithm of Furnival and
Wilson (1974). This algorithm finds maxNGoodSaved candidate regressions
for each possible subset size. These regressions are used to identify a set
of best regressions. In large problems, many regressions are not computed.
They may be rejected without computation based on results for other subsets;
this yields an efficient technique for considering all possible regressions.
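For intuition about what the algorithm computes (though not how it prunes), here is a brute-force counterpart in plain NumPy. It enumerates and fits every subset explicitly, which is exactly the work the Furnival and Wilson bounds avoid.

```python
from itertools import combinations

import numpy as np

def brute_force_best_r2(x, y):
    """Fit every candidate subset by least squares (intercept included)
    and return the best R^2 (in percent) and subset for each size."""
    n, k = x.shape
    sst = np.sum((y - y.mean()) ** 2)
    best = {}
    for size in range(1, k + 1):
        fits = []
        for subset in combinations(range(k), size):
            # Design matrix: intercept column plus the chosen candidates.
            a = np.column_stack([np.ones(n), x[:, subset]])
            coef, *_ = np.linalg.lstsq(a, y, rcond=None)
            sse = np.sum((y - a @ coef) ** 2)
            fits.append((100.0 * (1.0 - sse / sst), subset))
        best[size] = max(fits)
    return best
```

Run on the data of Example 1 below, this reproduces the R-squared column of the printed output (with variables numbered from 0 rather than 1).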
There are cases when the user may want to input the variance-covariance
matrix rather than allow the function regressionSelection to calculate
it. This can be accomplished using optional argument inputCov. Three
situations in which the user may want to do this are as follows:
- The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required. Argument nObservations must be set to 1 greater than the number of observations. Form \(A^T A\), where \(A = [X, Y]\), to compute the raw sum of squares and crossproducts matrix. (A sketch of this case follows this list.)
- An intercept is a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor (= 1.0), independent, and dependent variables is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument nObservations must be set to 1 greater than the number of observations.
- There are m variables to be forced into the models. A sum of squares and crossproducts matrix adjusted for the m variables is required (calculated by regressing the candidate variables on the variables to be forced into the model). Argument nObservations must be set to m less than the number of observations.
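A minimal sketch of the first case using NumPy only (placeholder data; the essential point is that cov is \(A^T A\) with the dependent variable in the last column):

```python
import numpy as np

# Placeholder data: nRows x nCandidate candidates and the responses.
x = np.array([[7.0, 26.0],
              [1.0, 29.0],
              [11.0, 56.0]])
y = np.array([78.5, 74.3, 104.3])

# Append the dependent variable as the last column, as cov requires.
a = np.column_stack([x, y])

# Raw (uncorrected) sum of squares and crossproducts matrix A^T A.
cov = a.T @ a

# For this no-intercept case, nObservations is one more than the number
# of observations actually used.
n_observations = x.shape[0] + 1
```

The resulting cov and adjusted observation count would then be supplied through the inputCov optional argument.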
Programming Notes
Function regressionSelection can save considerable CPU time over
explicitly computing all possible regressions. However, the function has
some limitations that can cause unexpected results for users who are
unaware of them.
- For nCandidate + 1 > \(-\log_2(\varepsilon)\), where ɛ is machine(4) (see Chapter 15, Utilities), some results can be incorrect. This limitation arises because the possible models indicated (the model numbers \(1, 2, \ldots, 2^{nCandidate}\)) are stored as floating-point values; for sufficiently large nCandidate, the model numbers cannot be stored exactly. On many computers, this means regressionSelection can produce incorrect results in single precision for nCandidate > 24 and in double precision for nCandidate > 49. (A short demonstration of this representation limit follows these notes.)
- Function regressionSelection eliminates some subsets of candidate variables by obtaining lower bounds on the error sum of squares from fitting larger models. First, the full model containing all nCandidate variables is fit sequentially using a forward stepwise procedure in which one variable enters the model at a time, and criterion values and model numbers for all the candidate variables that can enter at each step are stored. If linearly dependent variables are removed from the full model, the warning error IMSLS_VARIABLES_DELETED is issued. If this error is issued, some submodels that contain variables removed from the full model because of linear dependency can be overlooked if they have not already been identified during the initial forward stepwise procedure. If error IMSLS_VARIABLES_DELETED is issued and you want the variables that were removed from the full model to be considered in smaller models, you can rerun the program with a set of linearly independent variables.
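The representation limit in the first note can be demonstrated directly in double precision, where \(-\log_2(\varepsilon) = 52\) and adjacent integers first collide at \(2^{53}\):

```python
import numpy as np

# Number of significand bits available for exact integer model numbers:
print(-np.log2(np.finfo(np.float64).eps))    # 52.0

# Past that width, consecutive model numbers become indistinguishable:
print(float(2**53) == float(2**53) + 1.0)    # True
```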
Examples
Example 1
This example uses a data set from Draper and Smith (1981, pp. 629−630).
Function regressionSelection is invoked to find the best regression for
each subset size using the \(R^2\) criterion. By default, the function
prints the results.
```python
from numpy import array
from pyimsl.stat.regressionSelection import regressionSelection

# Data set from Draper and Smith (1981, pp. 629-630):
# 13 observations on 4 candidate variables.
x = array([
[7.0, 26.0, 6.0, 60.0],
[1.0, 29.0, 15.0, 52.0],
[11.0, 56.0, 8.0, 20.0],
[11.0, 31.0, 8.0, 47.0],
[7.0, 52.0, 6.0, 33.0],
[11.0, 55.0, 9.0, 22.0],
[3.0, 71.0, 17.0, 6.0],
[1.0, 31.0, 22.0, 44.0],
[2.0, 54.0, 18.0, 22.0],
[21.0, 47.0, 4.0, 26.0],
[1.0, 40.0, 23.0, 34.0],
[11.0, 66.0, 9.0, 12.0],
[10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
regressionSelection(x, y)
```
Output
```
Regressions with 1 variable(s) (R-squared)
Criterion Variables
67.5 4
66.6 2
53.4 1
28.6 3
Regressions with 2 variable(s) (R-squared)
Criterion Variables
97.9 1 2
97.2 1 4
93.5 3 4
68 2 4
54.8 1 3
Regressions with 3 variable(s) (R-squared)
Criterion Variables
98.2 1 2 4
98.2 1 2 3
98.1 1 3 4
97.3 2 3 4
Regressions with 4 variable(s) (R-squared)
Criterion Variables
98.2 1 2 3 4
Best Regression with 1 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
4 -0.7382 0.1546 -4.775 0.0006
Best Regression with 2 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.468 0.1213 12.10 0.0000
2 0.662 0.0459 14.44 0.0000
Best Regression with 3 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.452 0.1170 12.41 0.0000
2 0.416 0.1856 2.24 0.0517
4 -0.237 0.1733 -1.37 0.2054
Best Regression with 4 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.551 0.7448 2.083 0.0708
2 0.510 0.7238 0.705 0.5009
3 0.102 0.7547 0.135 0.8959
4 -0.144 0.7091 -0.203 0.8441
```
Example 2
This example uses the same data set as the first example, but Mallows'
\(C_p\) statistic is used as the criterion rather than \(R^2\). Note
that when Mallows' \(C_p\) statistic (or adjusted \(R^2\)) is
specified, the variable maxNBest indicates the total number of “best”
regressions (rather than the number of best regressions per
subset size, as with the \(R^2\) criterion). In this example,
the three best regressions are found to be (1, 2), (1, 2, 4), and (1, 2, 3).
```python
from numpy import array
from pyimsl.stat.regressionSelection import regressionSelection

# Same Draper and Smith (1981) data set as Example 1.
x = array([
[7.0, 26.0, 6.0, 60.0],
[1.0, 29.0, 15.0, 52.0],
[11.0, 56.0, 8.0, 20.0],
[11.0, 31.0, 8.0, 47.0],
[7.0, 52.0, 6.0, 33.0],
[11.0, 55.0, 9.0, 22.0],
[3.0, 71.0, 17.0, 6.0],
[1.0, 31.0, 22.0, 44.0],
[2.0, 54.0, 18.0, 22.0],
[21.0, 47.0, 4.0, 26.0],
[1.0, 40.0, 23.0, 34.0],
[11.0, 66.0, 9.0, 12.0],
[10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
regressionSelection(x, y, mallowsCp=True, maxNBest=3)
```
Output
```
Regressions with 1 variable(s) (Mallows CP)
Criterion Variables
139 4
142 2
203 1
315 3
Regressions with 2 variable(s) (Mallows CP)
Criterion Variables
2.68 1 2
5.5 1 4
22.4 3 4
138 2 4
198 1 3
Regressions with 3 variable(s) (Mallows CP)
Criterion Variables
3.02 1 2 4
3.04 1 2 3
3.5 1 3 4
7.34 2 3 4
Regressions with 4 variable(s) (Mallows CP)
Criterion Variables
5 1 2 3 4
Best Regression with 2 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.468 0.1213 12.10 0.0000
2 0.662 0.0459 14.44 0.0000
Best Regression with 3 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.452 0.1170 12.41 0.0000
2 0.416 0.1856 2.24 0.0517
4 -0.237 0.1733 -1.37 0.2054
2nd Best Regression with 3 variable(s) (Mallows CP)
Variable Coefficient Standard Error t-statistic p-value
1 1.696 0.2046 8.29 0.0000
2 0.657 0.0442 14.85 0.0000
3 0.250 0.1847 1.35 0.2089
```
Warning Errors

| Error | Description |
|---|---|
| IMSLS_VARIABLES_DELETED | At least one variable is deleted from the full model because the variance-covariance matrix cov is singular. |
Fatal Errors

| Error | Description |
|---|---|
| IMSLS_NO_VARIABLES | No variables can enter any model. |