regressionSelection

Selects the best multiple linear regression models.

Synopsis

regressionSelection (x, y)

Required Arguments

float x[[]] (Input)
Array of size nRows × nCandidate containing the data for the candidate variables.
float y[] (Input)
Array of length nRows containing the responses for the dependent variable.

Optional Arguments

t_print
Printing is performed. This is the default.

or

noPrint
Printing is not performed.
weights, float[] (Input)

Array of length nRows containing the weight for each row of x.

Default: weights[] = 1

frequencies, float[] (Input)

Array of length nRows containing the frequency for each row of x.

Default: frequencies[] = 1

rSquared, int (Input)
The \(R^2\) criterion is used, where subset sizes 1, 2, …, rSquared are examined. This option is the default with rSquared = nCandidate.

or

adjRSquared
The adjusted \(R^2\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.

or

mallowsCp
Mallows' \(C_p\) criterion is used, where subset sizes 1, 2, …, nCandidate are examined.
maxNBest, int (Input)

Number of best regressions to be found. If the \(R^2\) criterion is selected, the maxNBest best regressions for each subset size examined are found. If the adjusted \(R^2\) or Mallows' \(C_p\) criterion is selected, the maxNBest best regressions overall are found.

Default: maxNBest = 1

maxNGoodSaved, int (Input)

Maximum number of good regressions of each subset size to be saved in finding the best regressions. Argument maxNGoodSaved must be greater than or equal to maxNBest. Normally, maxNGoodSaved should be less than or equal to 10; it never needs to be larger than the maximum number of subsets for any subset size. Computing time required is inversely related to maxNGoodSaved.

Default: maxNGoodSaved = 10

criterions, indexCriterions, criterions (Output)
Argument indexCriterions is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in criterions of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexCriterions[I], indexCriterions[I] + 1, …, indexCriterions[I + 1] − 1 of criterions correspond to the (I + 1)-st subset size. Argument criterions is an array of length max(indexCriterions[nsize] − 1, nCandidate) containing in its first indexCriterions[nsize] − 1 elements the criterion values for each subset considered, in order of increasing subset size.
independentVariables, indexVariables, independentVariables (Output)
Argument indexVariables is an array of length nsize + 1 (where nsize is equal to rSquared if optional argument rSquared is specified; otherwise, nsize is equal to nCandidate) containing the locations in independentVariables of the first element for each subset size. For I = 0, 1, …, nsize − 1, element numbers indexVariables[I], indexVariables[I] + 1, …, indexVariables[I + 1] − 1 of independentVariables correspond to the (I + 1)-st subset size. Argument independentVariables is an array of length indexVariables[nsize] − 1 containing the variable numbers for each subset considered, in the same order as in criterions.
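
To make the packed layout concrete, the sketch below unpacks the criterion values and variable subsets by hand. The numbers are transcribed from the Example 1 output shown later; the assumption that the first location is 1 (so each location is shifted by one before slicing the 0-based Python arrays) follows from the array lengths quoted above and should be checked against your pyimsl version.

indexCriterions = [1, 5, 10, 14, 15]
criterions = [67.5, 66.6, 53.4, 28.6,        # subset size 1
              97.9, 97.2, 93.5, 68.0, 54.8,  # subset size 2
              98.2, 98.2, 98.1, 97.3,        # subset size 3
              98.2]                          # subset size 4
indexVariables = [1, 5, 15, 27, 31]
independentVariables = [4, 2, 1, 3,
                        1, 2, 1, 4, 3, 4, 2, 4, 1, 3,
                        1, 2, 4, 1, 2, 3, 1, 3, 4, 2, 3, 4,
                        1, 2, 3, 4]

nsize = len(indexCriterions) - 1
for i in range(nsize):
    size = i + 1
    crit = criterions[indexCriterions[i] - 1:indexCriterions[i + 1] - 1]
    flat = independentVariables[indexVariables[i] - 1:indexVariables[i + 1] - 1]
    # Each subset of this size occupies `size` consecutive variable numbers.
    subsets = [flat[j:j + size] for j in range(0, len(flat), size)]
    for c, s in zip(crit, subsets):
        print("size", size, "criterion", c, "variables", s)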
coefStatistics, indexCoefficients, coefficients (Output)
Argument indexCoefficients is an array of length ntbest + 1 containing the locations in coefficients of the first row for each of the best regressions. Here, ntbest is the total number of best regressions found; it is equal to rSquared × maxNBest if rSquared is specified, to maxNBest if either mallowsCp or adjRSquared is specified, and to maxNBest × nCandidate otherwise. For I = 0, 1, …, ntbest − 1, rows indexCoefficients[I], indexCoefficients[I] + 1, …, indexCoefficients[I + 1] − 1 of coefficients correspond to the (I + 1)-st regression. Argument coefficients is an array of size (indexCoefficients[ntbest] − 1) × 5 containing statistics relating to the regression coefficients of the best models. Each row corresponds to a coefficient for a particular regression. The regressions are in order of increasing subset size, and within each subset size the better regressions appear first. The statistics in the columns are as follows (inferences are conditional on the selected model):
Column  Description
0       variable number
1       coefficient estimate
2       estimated standard error of the estimate
3       t-statistic for the test that the coefficient is 0
4       p-value for the two-sided t-test
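
Reading coefStatistics follows the same pattern, with indexCoefficients giving row locations into the 5-column coefficients array. A minimal sketch, reusing two coefficient rows printed in Example 1 below and assuming the same 1-based location convention as above:

indexCoefficients = [1, 2, 4]   # two regressions: one row, then two rows
coefficients = [
    # variable, estimate, standard error, t-statistic, p-value
    [4, -0.7382, 0.1546, -4.775, 0.0006],
    [1, 1.468, 0.1213, 12.10, 0.0000],
    [2, 0.662, 0.0459, 14.44, 0.0000],
]
ntbest = len(indexCoefficients) - 1
for i in range(ntbest):
    rows = coefficients[indexCoefficients[i] - 1:indexCoefficients[i + 1] - 1]
    print("regression", i + 1)
    for var, est, se, t, p in rows:
        print("  variable", int(var), "estimate", est, "p-value", p)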
inputCov, int nObservations, float cov[] (Input)
Argument nObservations is the number of observations associated with array cov. Argument cov is an (nCandidate + 1) by (nCandidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Array cov can be computed using covariances. Arguments x and y, and optional arguments frequencies and weights are not accessed when this option is specified. Normally, regressionSelection computes cov from the input data matrices x and y. However, there may be cases when the user will wish to calculate the covariance matrix and manipulate it before calling regressionSelection. See the description section below for a discussion of such cases.

Description

Function regressionSelection finds the best subset regressions for a regression problem with nCandidate independent variables. Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum of squares and crossproducts matrix for the independent and dependent variables corrected for the mean is computed internally. There may be cases when it is convenient for the user to calculate the matrix; see the description of optional argument inputCov.

“Best” is defined by one of the following three criteria, selected by the corresponding optional argument:

  • \(R^2\) (in percent)
\[R^2 = 100 \left(1 - \frac{\mathrm{SSE}_p}{\mathrm{SST}}\right)\]
  • \(R_a^2\) (adjusted \(R^2\) in percent)
\[R_a^2 = 100 \left[1 - \left(\frac{n-1}{n-p}\right) \frac{\mathrm{SSE}_p}{\mathrm{SST}}\right]\]

Note that maximizing the criterion is equivalent to minimizing the residual mean square:

\[\frac{\mathrm{SSE}_p}{(n-p)}\]
  • Mallows’ \(C_p\) statistic
\[C_{\mathrm{p}} = \frac{\mathrm{SSE}_{\mathrm{p}}}{s_{\mathrm{nCandidate}}^2} + 2p - n\]

Here, n is equal to the sum of the frequencies (or nRows if frequencies is not specified) and SST is the total sum of squares.

\(SSE_p\) is the error sum of squares in a model containing p regression parameters including \(\beta_0\) (or p − 1 of the nCandidate candidate variables). Variable

\[s_{\mathrm{nCandidate}}^2\]

is the error mean square from the model with all nCandidate variables in the model. Hocking (1972) and Draper and Smith (1981, pp. 296−302) discuss these criteria.
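
For reference, the three criteria translate directly into Python; the argument names mirror the symbols above, with p counting regression parameters including the intercept:

def r_squared(sse_p, sst):
    # R^2 in percent
    return 100.0 * (1.0 - sse_p / sst)

def adj_r_squared(sse_p, sst, n, p):
    # adjusted R^2 in percent
    return 100.0 * (1.0 - (n - 1.0) / (n - p) * sse_p / sst)

def mallows_cp(sse_p, s2_full, n, p):
    # s2_full is the error mean square from the full nCandidate-variable model
    return sse_p / s2_full + 2.0 * p - n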

Function regressionSelection is based on the algorithm of Furnival and Wilson (1974). This algorithm finds maxNGoodSaved candidate regressions for each possible subset size. These regressions are used to identify a set of best regressions. In large problems, many regressions are not computed. They may be rejected without computation based on results for other subsets; this yields an efficient technique for considering all possible regressions.

There are cases when the user may want to input the variance-covariance matrix rather than allow the function regressionSelection to calculate it. This can be accomplished using optional argument inputCov. Three situations in which the user may want to do this are as follows:

  1. The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required. Argument nObservations must be set to 1 greater than the number of observations. Form \(A^T A\), where \(A = \left[X, y\right]\) is the matrix of candidate variables augmented with the column of responses, to compute the raw sum of squares and crossproducts matrix (a numpy sketch follows this list).
  2. An intercept is a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor (= 1.0), independent, and dependent variables is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument nObservations must be set to 1 greater than the number of observations.
  3. There are m variables to be forced into the models. A sum of squares and crossproducts matrix adjusted for the m variables is required (calculated by regressing the candidate variables on the variables to be forced into the model). Argument nObservations must be set to m less than the number of observations.
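
As an illustration of case 1, the following numpy sketch builds the raw sum of squares and crossproducts matrix from stand-in data (the x and y here are hypothetical, not the example data). How the resulting nObservations/cov pair is passed through inputCov depends on the pyimsl binding; consult the function signature before use.

import numpy as np

# Stand-in data: 3 observations on 2 candidate variables
x = np.array([[7.0, 26.0], [1.0, 29.0], [11.0, 56.0]])
y = np.array([78.5, 74.3, 104.3])

A = np.hstack([x, y.reshape(-1, 1)])  # A = [X, y]; last column is the response
cov = A.T @ A                         # raw SSCP matrix, (nCandidate+1) x (nCandidate+1)
nObservations = x.shape[0] + 1        # 1 greater than the number of observations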

Programming Notes

Function regressionSelection can save considerable CPU time over explicitly computing all possible regressions. However, the function has some limitations that can cause unexpected results for users who are unaware of the limitations of the software.

  1. For nCandidate + 1 > \(-\log_2(\varepsilon)\), where ɛ is machine(4) (see Chapter 15, Utilities), some results can be incorrect. This limitation arises because the possible models indicated (the model numbers \(1, 2, \ldots, 2^{\mathrm{nCandidate}}\)) are stored as floating-point values; for sufficiently large nCandidate, the model numbers cannot be stored exactly. On many computers, this means regressionSelection can produce incorrect results for nCandidate > 24 in single precision and for nCandidate > 49 in double precision (see the snippet after this list).
  2. Function regressionSelection eliminates some subsets of candidate variables by obtaining lower bounds on the error sum of squares from fitting larger models. First, the full model containing all nCandidate variables is fit sequentially using a forward stepwise procedure in which one variable enters the model at a time, and criterion values and model numbers for all the candidate variables that can enter at each step are stored. If linearly dependent variables are removed from the full model, warning IMSLS_VARIABLES_DELETED is issued. In that case, submodels containing variables that were removed from the full model because of linear dependency can be overlooked if they were not already identified during the initial forward stepwise procedure. If this warning is issued and you want the removed variables to be considered in smaller models, rerun the program with a set of linearly independent variables.
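
The first limitation is ordinary floating-point integer representability, which this two-line check illustrates:

import numpy as np

# A 24-bit significand cannot distinguish consecutive integers beyond 2**24,
# and a 53-bit significand beyond 2**53, so large model numbers collide.
print(np.float32(2**24) == np.float32(2**24) + np.float32(1))  # True
print(np.float64(2**53) == np.float64(2**53) + np.float64(1))  # True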

Examples

Example 1

This example uses a data set from Draper and Smith (1981, pp. 629−630). Function regressionSelection is invoked to find the best regression for each subset size using the \(R^2\) criterion. By default, the function prints the results.

from __future__ import print_function
from numpy import *
from pyimsl.stat.regressionSelection import regressionSelection


# Data from Draper and Smith (1981): 13 observations on 4 candidate variables
x = array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
# Responses for the dependent variable
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
           102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
# Find the best regressions by the default R-squared criterion; results print by default
regressionSelection(x, y)

Output

 Regressions with   1 variable(s) (R-squared)

        Criterion         Variables
             67.5          4
             66.6          2
             53.4          1
             28.6          3


 Regressions with   2 variable(s) (R-squared)

        Criterion         Variables
             97.9          1  2
             97.2          1  4
             93.5          3  4
               68          2  4
             54.8          1  3


 Regressions with   3 variable(s) (R-squared)

        Criterion         Variables
             98.2          1  2  4
             98.2          1  2  3
             98.1          1  3  4
             97.3          2  3  4


 Regressions with   4 variable(s) (R-squared)

        Criterion         Variables
             98.2          1  2  3  4

 
      Best Regression with   1 variable(s) (R-squared)
Variable  Coefficient  Standard Error  t-statistic  p-value
       4      -0.7382          0.1546       -4.775   0.0006


 
      Best Regression with   2 variable(s) (R-squared)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.468          0.1213        12.10   0.0000
       2        0.662          0.0459        14.44   0.0000


 
      Best Regression with   3 variable(s) (R-squared)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.452          0.1170        12.41   0.0000
       2        0.416          0.1856         2.24   0.0517
       4       -0.237          0.1733        -1.37   0.2054


 
      Best Regression with   4 variable(s) (R-squared)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.551          0.7448        2.083   0.0708
       2        0.510          0.7238        0.705   0.5009
       3        0.102          0.7547        0.135   0.8959
       4       -0.144          0.7091       -0.203   0.8441

Example 2

This example uses the same data set as the first example, but Mallows' \(C_p\) statistic is used as the criterion rather than \(R^2\). Note that when Mallows' \(C_p\) statistic (or adjusted \(R^2\)) is specified, the variable maxNBest indicates the total number of “best” regressions (rather than the number of best regressions per subset size, as with the \(R^2\) criterion). In this example, the three best regressions are found to be (1, 2), (1, 2, 4), and (1, 2, 3).

from __future__ import print_function
from numpy import *
from pyimsl.stat.regressionSelection import regressionSelection


x = array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
           102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
# Request the three best regressions overall by Mallows' Cp
regressionSelection(x, y, mallowsCp=True, maxNBest=3)

Output

 Regressions with   1 variable(s) (Mallows  CP)

        Criterion         Variables
              139          4
              142          2
              203          1
              315          3


 Regressions with   2 variable(s) (Mallows  CP)

        Criterion         Variables
             2.68          1  2
              5.5          1  4
             22.4          3  4
              138          2  4
              198          1  3


 Regressions with   3 variable(s) (Mallows  CP)

        Criterion         Variables
             3.02          1  2  4
             3.04          1  2  3
              3.5          1  3  4
             7.34          2  3  4


 Regressions with   4 variable(s) (Mallows  CP)

        Criterion         Variables
                5          1  2  3  4

 
     Best Regression with   2 variable(s) (Mallows CP)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.468          0.1213        12.10   0.0000
       2        0.662          0.0459        14.44   0.0000


 
     Best Regression with   3 variable(s) (Mallows CP)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.452          0.1170        12.41   0.0000
       2        0.416          0.1856         2.24   0.0517
       4       -0.237          0.1733        -1.37   0.2054


 
    2nd Best Regression with   3 variable(s) (Mallows CP)
Variable  Coefficient  Standard Error  t-statistic  p-value
       1        1.696          0.2046         8.29   0.0000
       2        0.657          0.0442        14.85   0.0000
       3        0.250          0.1847         1.35   0.2089

Warning Errors

IMSLS_VARIABLES_DELETED
At least one variable is deleted from the full model because the variance-covariance matrix cov is singular.

Fatal Errors

IMSLS_NO_VARIABLES
No variables can enter any model.