RBEST
Selects the best multiple linear regression models.
Required Arguments
COV — NVAR by NVAR matrix containing the variance-covariance matrix or sum of squares and crossproducts matrix. (Input)
Only the upper triangle of COV is referenced. The last column of COV must correspond to the dependent variable.
NOBS — Number of observations. NOBS must be greater than or equal to the number of variables plus 1 (NVAR + 1), when using Adjusted R2 or Mallows Cp criteria
(ICRIT > 1). (Input)
ICOEFX — Index vector of length NTBEST + 1 containing the locations in COEF of the first row for each of the best regressions. (Output)
Here, NTBEST is the total number of best regressions found and is given as follows:
ICRIT |
NTBEST |
<0 |
‑NBEST * ICRIT |
1 |
NBEST * (NVAR ‑ 1) |
2 |
NBEST |
3 |
NBEST |
For I = 1, 2, …, NTBEST, rows ICOEFX(I), ICOEFX(I) + 1, …, ICOEFX(I + 1) ‑ 1 of COEF correspond to the I-th regression.
COEF — ICOEFX(NTBEST + 1) ‑ 1 by 5 matrix containing statistics relating to the regression coefficients of the best models. (Output)
An upper bound on the number of rows in COEF is given as follows:
ICRIT |
Upper Bound on the Number of Rows in COEF |
<0 |
‑NBEST * ICRIT * (1 ‑ ICRIT)/2 |
1 |
NBEST * (NVAR ‑ 1) * NVAR/2 |
2 |
NBEST * (NVAR ‑ 1) |
3 |
NBEST * (NVAR ‑ 1) |
Each row corresponds to a coefficient for a particular regression. The regressions are in order of increasing subset size. Within each subset size, the regressions are ordered so that the better regressions appear first. The statistics in the columns are as follows:
Col. |
Description |
1 |
Variable number |
2 |
Coefficient estimate |
3 |
Estimated standard error of the estimate |
4 |
t-statistic for the test that the coefficient is zero |
5 |
p‑value for the two-sided t test |
(Inferences are conditional on the selected models.)
Optional Arguments
NVAR — Number of variables. (Input)
Default: NVAR = size (COV,2).
LDCOV — Leading dimension of COV exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCOV = size (COV,1).
ICRIT — Criterion option. (Input)
Default: ICRIT = 1.
ICRIT |
Criterion |
NSIZE |
< 0 |
R2 |
‑ICRIT |
1 |
R2 |
NVAR ‑ 1 |
2 |
Adjusted R2 |
NVAR ‑ 1 |
3 |
Mallows Cp |
NVAR ‑ 1 |
Subset sizes 1, 2, …, NSIZE are examined.
NBEST — Number of best regressions to be found. (Input)
If the R2 criterion is selected, the NBEST best regressions for each subset size examined are found. If the adjusted R2 or Mallows Cp criterion is selected, the NBEST best overall regressions are found.
Default: NBEST = 1.
NGOOD — Maximum number of good regressions of each subset size to be saved in finding the best regressions. (Input)
NGOOD must be greater than or equal to NBEST. Normally, NGOOD should be less than or equal to 10. It need not ever be larger than the maximum number of subsets for any subset size. Computing time required is inversely related to NGOOD.
Default: NGOOD = 10.
IPRINT — Printing option. (Input)
Default: IPRINT = 0.
IPRINT |
Action |
0 |
No printing is performed. |
1 |
Printing is performed. |
ICRITX — Index vector of length NSIZE + 1 containing the locations in CRIT of the first element for each subset size. (Output)
(See argument ICRIT for a definition of NSIZE. ) For I = 1, 2, …, NSIZE, element numbers ICRITX(I), ICRITX(I) + 1, …, ICRITX(I + 1) ‑ 1 of CRIT correspond to the I-th subset size.
CRIT — Vector of length max(ICRITX(NSIZE + 1) ‑ 1, NVAR ‑ 1) containing in its first ICRITX(NSIZE + 1) ‑ 1 elements the criterion values for each subset considered, in increasing subset size order. (Output)
An upper bound on the length of CRIT is max(NGOOD * NSIZE, NVAR ‑ 1). Within each subset size, results are returned in monotone order according to the criterion value with the results for the better regressions given first.
IVARX — Index vector of length NSIZE + 1 containing the locations in INDVAR of the first element for each subset size. (Output)
For I = 1,2, …, NSIZE, element numbers IVARX(I), IVARX(I) + 1, …, IVARX (I + 1) ‑ 1 of INDVAR correspond to the I-th subset size.
INDVAR — Index vector of length IVARX(NSIZE + 1) ‑ 1 containing the variable numbers for each subset considered and in the same order as in CRIT. (Output)
An upper bound on the length of INDVAR is NGOOD * NSIZE * (NSIZE + 1)/2.
LDCOEF — Leading dimension of COEF exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCOEF = size (COEF,1).
FORTRAN 90 Interface
Generic: CALL RBEST (COV, NOBS, ICOEFX, COEF [, …])
Specific: The specific interface names are S_RBEST and D_RBEST.
FORTRAN 77 Interface
Single: CALL RBEST (NVAR, COV, LDCOV, NOBS, ICRIT, NBEST, NGOOD, IPRINT, ICRITX, CRIT, IVARX, INDVAR, ICOEFX, COEF, LDCOEF)
Double: The double precision name is DRBEST.
Description
Routine RBEST finds the best subset regressions for a regression problem with NVAR ‑ 1 candidate independent variables. Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum of squares and crossproducts matrix for the independent and dependent variables corrected for the mean is input for COV. Routine CORVC in Chapter 3, “Correlation”, can be used to compute the corrected sum of squares and crossproducts. IMSL routine RORDM in Chapter 19, “Utilities,” can be used to reorder this matrix, if required. Other possibilities are
1. The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required for COV. NOBS must be set to one greater than the number of observations. Routine MXTXF (IMSL MATH/LIBRARY) can be used to compute the raw sum of squares and crossproducts matrix.
2. An intercept is to be a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor ( = 1), independent, and dependent variables is required for COV. In this case, COV contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in COV are the same as in the previous case. NOBS must be set to one greater than the number of observations.
3. There are m variables to be forced into the models. A sum of squares and crossproducts matrix adjusted for the m variables is required. NOBS must be set to m less than the number of observations. Routine RCOV can be used to compute the adjusted sum of squares and crossproducts matrix. This is accomplished by a regression of the candidate variables on the variables to be forced into the models. The error sum of squares and crossproducts matrix, SCPE from RCOV, is the input to COV in routine RBEST.
“Best” is defined, on option, by one of three criteria:
1. R2 (in percent)
2. (adjusted R2 in percent)
Note that maximizing this criterion is equivalent to minimizing the residual mean square, SSEp/(n ‑ p).
3. Mallows’ Cp statistic
Here, n is NOBS, and SST is the total sum of squares. SSEp is the error sum of squares in a model containing p regression parameters including β0 (or p ‑ 1 of the NVAR ‑ 1 candidate variables).
is the error mean square from the model with all NVAR ‑ 1 candidate variables in the model. Hocking (1972) and Draper and Smith (1981, pages 296‑302) discuss these criteria.
Routine RBEST is based on the algorithm of Furnival and Wilson (1974), this algorithm finds NGOOD candidate regressions for each possible subset size. These regressions are used to identify a set of best regressions. In large problems, many regressions are not computed. They may be rejected without computation based on results for other subsets, this yields an efficient technique for considering all possible regressions.
Comments
1. Workspace may be explicitly provided, if desired, by use of R2EST/DR2EST. The reference is:
CALL R2EST (NVAR, COV, LDCOV, NOBS, ICRIT, NBEST, NGOOD, IPRINT, ICRITX, CRIT, IVARX, INDVAR, ICOEFX, COEF, LDCOEF, WK, IWK)
The additional arguments are as follows:
WK — Work vector of length NVAR * (2 * NGOOD + 6) + (2 * NVAR3 + 4 * NVAR)/3. The first NVAR ‑ 1 locations indicate which variables are in the full model. If IWK(I) = 0, then variable I is in the full model, otherwise, the variable has been dropped.
IWK — Integer work vector of length 3 * NVAR2 + 6 * NVAR.
2. Informational errors
Type |
Code |
Description |
3 |
1 |
At least one variable is deleted from the full model because COV is singular. |
4 |
3 |
No variables can enter any model. |
Programming Notes
Routine RBEST can save considerable CPU time over explicitly computing all possible regressions. However, the routine has some limitations that can cause unexpected results for users that are unaware of the limitations of the software.
1. For NVAR > ‑ log2(ɛ) where ɛ is AMACH(4), (See the section “Machine-Dependent Constants” in the Reference Material), some results can be incorrect. This limitation arises because the possible models indicated by the model numbers 1, 2, …, 2NVAR−1, are stored as floating-point values, for sufficiently large NVAR, the model numbers cannot be stored exactly. On many computers, this means S_RBEST (for NVAR > 25) and D_RBEST (for NVAR > 50) can produce incorrect results.
2. Routine RBEST eliminates some subsets of candidate variables by obtaining lower bounds on the error sum of squares from fitting larger models. First, the full model containing all NVAR ‑ 1 is fit sequentially using a forward stepwise procedure in which one variable enters the model at a time, and criterion values and model numbers for all the candidate variables that can enter at each step are stored. If linearly dependent variables are removed from the full model, error code 1 is issued. If this error is issued, some submodels that contain variables removed from the full model because of linear dependency can be overlooked, if they have not already been identified during the initial forward stepwise procedure. If error code 1 is issued and you want the variables that were removed from the full model to be considered in smaller models, you may want to rerun the program with a set of linearly independent variables.
Example
This example uses a data set from Draper and Smith (1981, pages 629‑630). This data set is input to the matrix X by routine GDATA (see Chapter 19, “Utilities”). The first four columns contain the independent variables, and the last column contains the dependent variable. Routine CORVC in Chapter 3, “Correlation,” is invoked to compute the corrected sum of squares and crossproducts matrix. Routine RBEST is then invoked to find the best regression for each of the four subset sizes using the R2 criterion.
USE RBEST_INT
USE GDATA_INT
USE CORVC_INT
IMPLICIT NONE
INTEGER LDCOEF, LDCOV, LDX, NBEST, NGOOD, NSIZE, NTBEST, NVAR
PARAMETER (LDX=13, NBEST=1, NGOOD=10, NVAR=5, &
LDCOEF=NBEST*(NVAR-1)*NVAR/2, LDCOV=NVAR, &
NSIZE=NVAR-1, NTBEST=NBEST*(NVAR-1))
!
INTEGER ICOEFX(NTBEST+1), ICOPT, ICRIT, ICRITX(NSIZE+1), &
INCD(1,1), INDVAR(NGOOD*NSIZE*(NSIZE+1)/2), &
IPRINT, IVARX(NSIZE+1), NMISS, NOBS, NROW, NVAR1
REAL COEF(LDCOEF,5), COV(LDCOV,NVAR), CRIT(NGOOD*NSIZE), &
SUMWT, X(LDX,NVAR), XMEAN(NVAR)
!
CALL GDATA (5, X, NROW, NVAR1)
!
ICOPT = 1
CALL CORVC (NVAR, X, COV, ICOPT=ICOPT, NOBS=NOBS)
!
IPRINT = 1
CALL RBEST (COV, NOBS, ICOEFX, COEF, IPRINT=IPRINT)
!
END
Output
Regressions with 1 variable(s) (R-squared)
Criterion Variables
67.5 4
66.6 2
53.4 1
28.6 3
Regressions with 2 variable(s) (R-squared)
Criterion Variables
97.9 1 2
97.2 1 4
93.5 3 4
68.0 2 4
54.8 1 3
Regressions with 3 variable(s) (R-squared)
Criterion Variables
98.2 1 2 4
98.2 1 2 3
98.1 1 3 4
97.3 2 3 4
Regressions with 4 variable(s) (R-squared)
Criterion Variables
98.2 1 2 3 4
Best Regression with 1 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
4 -0.7382 0.1546 -4.775 0.0006
Best Regression with 2 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.468 0.1213 12.10 0.0000
2 0.662 0.0459 14.44 0.0000
Best Regression with 3 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.452 0.1170 12.41 0.0000
2 0.416 0.1856 2.24 0.0517
4 -0.237 0.1733 -1.36 0.2054
Best Regression with 4 variable(s) (R-squared)
Variable Coefficient Standard Error t-statistic p-value
1 1.551 0.7448 2.083 0.0708
2 0.510 0.7238 0.705 0.5009
3 0.102 0.7547 0.135 0.8959
4 -0.144 0.7091 -0.203 0.8441