RSTEP

Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.

Required Arguments

COV — NVAR by NVAR matrix containing the variance-covariance matrix or sum of squares and crossproducts matrix. (Input)
Only the upper triangle of COV is referenced.

NOBS — Number of observations. (Input)

AOV — Vector of length 13 containing statistics relating to the analysis of variance for the final model in this invocation. (Output)

 

I     AOV(I)
1     Degrees of freedom for regression
2     Degrees of freedom for error
3     Total degrees of freedom
4     Sum of squares for regression
5     Sum of squares for error
6     Total sum of squares
7     Regression mean square
8     Error mean square
9     F-statistic
10    p-value
11    R² (in percent)
12    Adjusted R² (in percent)
13    Estimated standard deviation of the model error

COEF — (NVAR - 1) by 5 matrix containing statistics relating to the regression coefficients for the final model in this invocation. (Output)
The rows correspond to the NVAR - 1 variables with LEVEL(I) nonnegative, i.e., all variables but the dependent variable. The rows are in the same order as the variables in COV except that the dependent variable is excluded. Each row corresponding to a variable not in the model contains the statistics for the model that would result if that variable were added.

 

Col.   Description
1      Coefficient estimate
2      Estimated standard error of the coefficient estimate
3      t-statistic for the test that the coefficient is zero
4      p-value for the two-sided t test
5      Variance inflation factor. The square of the multiple correlation coefficient for the I-th regressor after all others can be obtained from COEF(I, 5) by the formula 1.0 - 1.0/COEF(I, 5).
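As a worked instance of this formula (the symbols below are introduced here for illustration and are not library notation), write VIF_i for COEF(I, 5) and R_i² for the squared multiple correlation of the I-th regressor on the other regressors:

```latex
R_i^2 = 1 - \frac{1}{\mathrm{VIF}_i},
\qquad \text{e.g.}\quad
\mathrm{VIF}_i = 254.4 \;\Rightarrow\; R_i^2 = 1 - \frac{1}{254.4} \approx 0.9961 .
```

The value 254.4 is the variance inflation factor printed for variable 2 at step 0 of Example 1.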

COVS — NVAR by NVAR matrix that results after COV has been swept on the columns corresponding to the variables in the model. (Output, if INVOKE = 0 or 1; Input/Output, if INVOKE = 2 or 3)
The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of COVS corresponding to the independent variables in the final model and multiplying the elements of this matrix by AOV(8). If COV is not needed, COV and COVS can occupy the same storage locations.
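In matrix form, the rule above can be written as follows. The notation is an assumption made here for illustration: M denotes the index set of the independent variables in the final model, and COVS_{M,M} the submatrix of COVS formed from those rows and columns.

```latex
\widehat{\operatorname{Var}}\bigl(\hat{\beta}_M\bigr) = s^2 \,\mathrm{COVS}_{M,M},
\qquad s^2 = \mathrm{AOV}(8) \ \text{(error mean square)} .
```

Under this reading, the square roots of the diagonal elements should correspond to the standard errors reported in column 2 of COEF for the variables in the model.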

Optional Arguments

INVOKE — Invocation option. (Input)
Default: INVOKE = 0.

 

INVOKE   Action
0        This is the only invocation of RSTEP for this variance-covariance matrix. Initialization, stepping, and wrap-up computations are performed.
1        This is the first invocation of RSTEP, and additional calls to RSTEP will be made. Initialization and stepping are performed.
2        This is an intermediate invocation of RSTEP and stepping is performed.
3        This is the final invocation of RSTEP and stepping is performed.

NVAR — Number of variables. (Input)
Default: NVAR = size (COV,2).

LDCOV — Leading dimension of COV exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCOV = size (COV,1).

LEVEL — Vector of length NVAR containing levels of priority for variables entering and leaving the regression. (Input)
LEVEL(I) = -1 means the I-th variable is the dependent variable. LEVEL(I) = 0 means the I-th variable is never to enter into the model. Other variables must be assigned a positive value to indicate their level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step.
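A minimal sketch of how a LEVEL vector might be laid out, assuming -1 marks the dependent variable, 0 excludes a variable, and positive values give the level of entry. The six-variable layout and the program name are hypothetical and chosen only to show each kind of entry; the program does not call RSTEP.

```fortran
      PROGRAM LEVEL_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: NVAR = 6
      INTEGER :: LEVEL(NVAR), I
!     Hypothetical layout: variables 1-2 compete first (level 1),
!     variables 3-4 compete afterwards (level 2), variable 5 is
!     never a candidate, and variable 6 is the dependent variable.
      LEVEL = (/ 1, 1, 2, 2, 0, -1 /)
      DO I = 1, NVAR
         SELECT CASE (LEVEL(I))
         CASE (-1)
            PRINT '(A,I0,A)', 'Variable ', I, ': dependent variable'
         CASE (0)
            PRINT '(A,I0,A)', 'Variable ', I, ': never enters the model'
         CASE (1:)
            PRINT '(A,I0,A,I0)', 'Variable ', I, &
                  ': candidate, level of entry ', LEVEL(I)
         CASE DEFAULT
            PRINT '(A,I0,A)', 'Variable ', I, ': value not described above'
         END SELECT
      END DO
      END PROGRAM LEVEL_SKETCH
```

With such a vector, setting NFORCE = 1 would, per the NFORCE description, force the level-1 variables into the model.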

NFORCE — Variables with levels 1, 2, …, NFORCE are forced into the model as the independent variables. (Input)
Default: NFORCE = 0.

NSTEP — Step length option. (Input)
For nonnegative NSTEP, NSTEP steps are taken. NSTEP = -1 means stepping continues until completion.
Default: NSTEP = -1.

ISTEP — Stepping option. (Input)
Default: ISTEP = -1.

 

ISTEP   Action
-1      An attempt is made to remove a variable from the model (backward step). A variable is removed if its p-value exceeds POUT. During initialization, all candidate independent variables enter the model.
1       An attempt is made to add a variable to the model (forward step). A variable is added if its p-value is less than PIN. During initialization, only the forced variables enter the model.
0       A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.

PIN — Largest p‑value for entering variables. (Input)
Variables with p‑values less than PIN may enter the model. A common choice is PIN = 0.05.
Default: PIN = .05.

POUT — Smallest p‑value for removing variables. (Input)
Variables with p-values greater than POUT may leave the model. POUT must be greater than or equal to PIN. A common choice is POUT = 0.10 (or 2 * PIN).
Default: POUT = .10.

TOL — Tolerance used in determining linear dependence. (Input)
TOL = 100 * AMACH (4) is a common choice. See documentation for AMACH in the Reference Material.
Default: TOL = 1.e-5 for single precision and 2.d-14 for double precision.

IPRINT — Printing option. (Input)
Default: IPRINT = 0.

 

IPRINT   Action
0        No printing is performed.
1        Printing is performed on the final invocation.
2        Printing is performed after each step and on the final invocation.

SCALE — Vector of length NVAR containing the initial diagonal entries in COV. (Output, if INVOKE = 0 or 1; Input, if INVOKE = 2 or 3)

HIST — Vector of length NVAR containing the recent history of variables. (Output, if INVOKE = 0 or 1; Input/Output, otherwise)

 

HIST(I)   Meaning
k > 0     I-th variable was added to the model during the k-th step.
k < 0     I-th variable was deleted from the model during the |k|-th step.
0         I-th variable has never been in the model.
0.5       I-th variable was added into the model during initialization.
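The following sketch decodes a HIST vector according to the table above. The HIST values are an assumption: they are what the table suggests for the four independent variables after the backward elimination of Example 1 (all entered at initialization, variable 3 removed at step 1, variable 4 at step 2), not values copied from an actual run.

```fortran
      PROGRAM HIST_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: NCAND = 4
      REAL :: HIST(NCAND)
      INTEGER :: I
!     Assumed history values (see the lead-in); not RSTEP output.
      HIST = (/ 0.5, 0.5, -1.0, -2.0 /)
      DO I = 1, NCAND
         IF (HIST(I) == 0.0) THEN
            PRINT '(A,I0,A)', 'Variable ', I, ' has never been in the model'
         ELSE IF (HIST(I) == 0.5) THEN
            PRINT '(A,I0,A)', 'Variable ', I, ' entered during initialization'
         ELSE IF (HIST(I) > 0.0) THEN
            PRINT '(A,I0,A,I0)', 'Variable ', I, ' was added at step ', &
                  NINT(HIST(I))
         ELSE
            PRINT '(A,I0,A,I0)', 'Variable ', I, ' was deleted at step ', &
                  NINT(-HIST(I))
         END IF
      END DO
      END PROGRAM HIST_SKETCH
```

The exact comparisons with 0.0 and 0.5 are safe here because both values are exactly representable in floating point.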

IEND — Completion indicator. (Output)

 

IEND   Meaning
0      Additional steps may be possible.
1      No additional steps are possible.

LDCOEF — Leading dimension of COEF exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCOEF = size (COEF,1).

LDCOVS — Leading dimension of COVS exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCOVS = size (COVS,1).

FORTRAN 90 Interface

Generic: CALL RSTEP (COV, NOBS, AOV, COEF, COVS [, …])

Specific: The specific interface names are S_RSTEP and D_RSTEP.

FORTRAN 77 Interface

Single: CALL RSTEP (INVOKE, NVAR, COV, LDCOV, LEVEL, NFORCE, NSTEP, ISTEP, NOBS, PIN, POUT, TOL, IPRINT, SCALE, HIST, IEND, AOV, COEF, LDCOEF, COVS, LDCOVS)

Double: The double precision name is DRSTEP.

Description

Routine RSTEP builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. RSTEP is designed so that the user can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, multiple calls to RSTEP (with INVOKE = 1, 2, 2, …, 3) are made. Alternatively, RSTEP can be invoked once (with INVOKE = 0) to perform the stepping until a final model is selected.
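The multiple-call pattern might look like the sketch below, which takes one stepwise step per call on the data of Example 1. It uses only the optional argument names listed above, but the control flow is an assumption: in particular, closing with INVOKE = 3 and NSTEP = 0, and the cap MAXCALL on the number of calls, are illustrative choices and not taken from the library documentation. It requires the IMSL library to build.

```fortran
      USE RSTEP_INT
      IMPLICIT NONE
      INTEGER, PARAMETER :: NVAR = 5, MAXCALL = 20
      INTEGER :: I, IEND, INVOKE, NOBS
      REAL :: AOV(13), COEF(NVAR,5), COV(NVAR,NVAR), &
              COVS(NVAR,NVAR), HIST(NVAR), SCALE(NVAR)
!     Corrected sums of squares and crossproducts from Example 1
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
!
      NOBS   = 13
      INVOKE = 1
      DO I = 1, MAXCALL
!        One stepwise step (ISTEP = 0) per call; COVS, SCALE, and HIST
!        carry the state from one invocation to the next.
         CALL RSTEP (COV, NOBS, AOV, COEF, COVS, INVOKE=INVOKE, &
                     NSTEP=1, ISTEP=0, SCALE=SCALE, HIST=HIST, IEND=IEND)
!        The caller can inspect AOV, COEF, and HIST here.
         PRINT *, 'R-squared (percent) after this call: ', AOV(11)
         IF (IEND == 1) EXIT
         INVOKE = 2
      END DO
!     Assumed closing call: final invocation with no further steps.
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, INVOKE=3, NSTEP=0, &
                  ISTEP=0, SCALE=SCALE, HIST=HIST, IEND=IEND)
      END
```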

Levels of priority can be assigned to the candidate independent variables. All variables with a priority level of 1 must enter the model before any variable with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc.

Variables can also be forced into the model. If equal levels of priority are to be assumed, the levels of priority can all be set to 1.

Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum of squares and crossproducts matrix for the independent and dependent variables corrected for the mean is input for COV. Routine CORVC in Chapter 3, “Correlation”, can be used to compute the corrected sum of squares and crossproducts. Routine RORDM in Chapter 19, “Utilities”, can be used to reorder this matrix, if required. Other possibilities are:

1. The intercept is not in the model. A raw (uncorrected) sum of squares and crossproducts matrix for the independent and dependent variables is required for COV. NOBS must be set to one greater than the number of observations. IMSL routine MXTXF (IMSL MATH/LIBRARY) can be used to compute the raw sum of squares and crossproducts matrix.

2. An intercept is to be a candidate variable. A raw (uncorrected) sum of squares and crossproducts matrix for the constant regressor (= 1), independent and dependent variables is required for COV. In this case, COV contains one additional row and column corresponding to the constant regressor. This row/column contains the sum of squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in COV are the same as in the previous case. NOBS must be set to one greater than the number of observations.
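To make the distinction between the corrected and raw matrices concrete, the sketch below computes a mean-corrected sum of squares and crossproducts matrix directly from a small data matrix. The data values and program name are made up for illustration, and the loop is a plain stand-in for what CORVC would compute, not its implementation.

```fortran
      PROGRAM SSCP_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: NOBS = 4, NVAR = 3
      REAL :: X(NOBS,NVAR), XBAR(NVAR), COV(NVAR,NVAR)
      INTEGER :: I, J
!     Made-up data: one row per observation, last column = response.
      X(:,1) = (/ 1.0, 2.0, 3.0, 4.0 /)
      X(:,2) = (/ 2.0, 1.0, 4.0, 3.0 /)
      X(:,3) = (/ 3.0, 3.0, 7.0, 7.0 /)
!     Column means
      XBAR = SUM(X, DIM=1) / REAL(NOBS)
!     Corrected SSCP: sum over observations of
!     (x_i - mean_i) * (x_j - mean_j); fill both triangles.
      DO J = 1, NVAR
         DO I = 1, J
            COV(I,J) = SUM((X(:,I) - XBAR(I)) * (X(:,J) - XBAR(J)))
            COV(J,I) = COV(I,J)
         END DO
      END DO
      DO I = 1, NVAR
         PRINT '(3F10.3)', COV(I,:)
      END DO
      END PROGRAM SSCP_SKETCH
```

For the raw (uncorrected) matrix of cases 1 and 2, the means are not subtracted; in this sketch that would be MATMUL(TRANSPOSE(X), X).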

The stepwise regression algorithm is due to Efroymson (1960). Routine RSTEP uses sweeps of COV to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed by Goodnight (1979) is used. A description of the stepwise algorithm is given also by Kennedy and Gentle (1980, pages 335–340). The advantage of stepwise model building over all possible regressions (see routine RBEST) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R²) for any subset size of independent variables.
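The idea of moving a variable into the model by a sweep can be sketched as follows. This is a textbook form of the sweep operator applied to the full matrix of Example 1; the internal procedure SWEEP is hypothetical, and RSTEP's own storage, scaling, and sign conventions may differ. Sweeping on column 4 should give values close to the step 1 results of Example 2 (coefficient about -0.738, error sum of squares about 883.9).

```fortran
      PROGRAM SWEEP_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: NVAR = 5
      REAL :: A(NVAR,NVAR)
!     Corrected sums of squares and crossproducts from Example 1
      DATA A/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
!     Bring variable 4 into the model
      CALL SWEEP (A, NVAR, 4)
      PRINT '(A,F8.3)', 'Coefficient of variable 4: ', A(4,5)
      PRINT '(A,F8.1)', 'Error sum of squares:      ', A(5,5)
      CONTAINS
!     Hypothetical helper: sweep the N by N matrix A on pivot K.
      SUBROUTINE SWEEP (A, N, K)
      INTEGER, INTENT(IN) :: N, K
      REAL, INTENT(INOUT) :: A(N,N)
      REAL :: D, B
      INTEGER :: I
      D = A(K,K)
!     Scale the pivot row
      A(K,:) = A(K,:) / D
!     Eliminate column K from every other row
      DO I = 1, N
         IF (I /= K) THEN
            B = A(I,K)
            A(I,:) = A(I,:) - B * A(K,:)
            A(I,K) = -B / D
         END IF
      END DO
      A(K,K) = 1.0 / D
      END SUBROUTINE SWEEP
      END PROGRAM SWEEP_SKETCH
```

In this form, applying the same sweep to column 4 a second time restores the original matrix, which is how a variable can be taken out of the model again.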

Comments

1. Workspace may be explicitly provided, if desired, by use of R2TEP/DR2TEP. The reference is:

CALL R2TEP (INVOKE, NVAR, COV, LDCOV, LEVEL, NFORCE, NSTEP, ISTEP, NOBS, PIN, POUT, TOL, IPRINT, SCALE,HIST, IEND, AOV, COEF, LDCOEF, COVS, LDCOVS, SWEPT, IWK)

The additional arguments are as follows:

SWEPT — Work vector of length NVAR with information to indicate the independent variables in the model. (Output)
SWEPT(I) = 1.0 indicates that independent variable I is in the model. Otherwise, SWEPT(I) = -1.0. Routine RSUBM can be called with the arguments COVS and SWEPT to obtain the part of COVS pertaining to the current model.

IWK — Integer work vector of length 2 * NVAR.

2. Informational errors

 

Type   Code   Description
3      1      Based on TOL, there are linear dependencies among the variables to be forced.
4      2      No variables entered the model. Elements of AOV are set to NaN.

Examples

Example 1

Both examples use a data set from Draper and Smith (1981, pages 629–630). A corrected sum of squares and crossproducts matrix for these data is given in the DATA statement and can be computed using routine CORVC in Chapter 3, “Correlation”. The first four columns are for the independent variables and the last column is for the dependent variable. Here, RSTEP is invoked using the backward stepping option.

 

      USE RSTEP_INT

      IMPLICIT   NONE
      INTEGER    LDCOEF, LDCOV, LDCOVS, NVAR
      PARAMETER  (NVAR=5, LDCOEF=NVAR, LDCOV=NVAR, LDCOVS=NVAR)
!
      INTEGER    IEND, IPRINT, LEVEL(NVAR), NOBS
      REAL       AOV(13), COEF(LDCOEF,5), COV(LDCOV,NVAR), &
                 COVS(LDCOVS,NVAR), HIST(NVAR), SCALE(NVAR)
!                                 Corrected sums of squares and
!                                 crossproducts; column 5 is the
!                                 dependent variable
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
!                                 LEVEL is not passed in the call below
      DATA LEVEL/4*1, -1/
!
      NOBS   = 13
      IPRINT = 2
!                                 ISTEP is omitted, so the default
!                                 (backward stepping) is used
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, IPRINT=IPRINT)
!
      END

Output

 

BACKWARD ELIMINATION
STEP 0: 4 variable(s) entered.

Dependent   R-squared   Adjusted    Est. Std. Dev.
Variable    (percent)   R-squared   of Model Error
        5      98.238      97.356            2.446

* * * Analysis of Variance * * *
                    Sum of      Mean               Prob. of
Source       DF    Squares    Square   Overall F   Larger F
Regression    4     2667.9     667.0     111.480     0.0000
Error         8       47.9       6.0
Total        12     2715.8

* * * Inference on Coefficients * * *
(Conditional on the Selected Model)
              Coef.   Standard                 Prob. of    Variance
Variable   Estimate      Error   t-statistic   Larger t   Inflation
       1      1.551     0.7448         2.082     0.0709        38.5
       2      0.510     0.7238         0.704     0.5012       254.4
       3      0.102     0.7547         0.135     0.8963        46.9
       4     -0.144     0.7091        -0.204     0.8437       282.5

STEP 1 : Variable 3 removed.

Dependent   R-squared   Adjusted    Est. Std. Dev.
Variable    (percent)   R-squared   of Model Error
        5      98.234      97.645            2.309

* * * Analysis of Variance * * *
                    Sum of      Mean               Prob. of
Source       DF    Squares    Square   Overall F   Larger F
Regression    3     2667.8     889.3     166.835     0.0000
Error         9       48.0       5.3
Total        12     2715.8

* * * Inference on Coefficients * * *
(Conditional on the Selected Model)
              Coef.   Standard                 Prob. of    Variance
Variable   Estimate      Error   t-statistic   Larger t   Inflation
       1      1.452     0.1170        12.410     0.0000        1.07
       2      0.416     0.1856         2.242     0.0517       18.78
       4     -0.237     0.1733        -1.365     0.2054       18.94

* * * Statistics for Variables Not in the Model * * *
              Coef.   Standard   t-statistic   Prob. of    Variance
Variable   Estimate      Error      to enter   Larger t   Inflation
       3      0.102     0.7547         0.135     0.8963       46.87

STEP 2 : Variable 4 removed.

Dependent   R-squared   Adjusted    Est. Std. Dev.
Variable    (percent)   R-squared   of Model Error
        5      97.868      97.441            2.406

* * * Analysis of Variance * * *
                    Sum of      Mean               Prob. of
Source       DF    Squares    Square   Overall F   Larger F
Regression    2     2657.9    1328.9     229.502     0.0000
Error        10       57.9       5.8
Total        12     2715.8

* * * Inference on Coefficients * * *
(Conditional on the Selected Model)
              Coef.   Standard                 Prob. of    Variance
Variable   Estimate      Error   t-statistic   Larger t   Inflation
       1      1.468     0.1213        12.105     0.0000        1.06
       2      0.662     0.0459        14.442     0.0000        1.06

* * * Statistics for Variables Not in the Model * * *
              Coef.   Standard   t-statistic   Prob. of    Variance
Variable   Estimate      Error      to enter   Larger t   Inflation
       3      0.250     0.1847         1.354     0.2089        3.14
       4     -0.237     0.1733        -1.365     0.2054       18.94

* * * Backward Elimination Summary * * *
Variable   Step Removed
       3              1
       4              2

Example 2

This example uses the data set in Example 1. Here, RSTEP is invoked using the forward selection option (ISTEP = 1).

 

      USE RSTEP_INT

      IMPLICIT   NONE
      INTEGER    LDCOEF, LDCOV, LDCOVS, NVAR
      PARAMETER  (NVAR=5, LDCOEF=NVAR, LDCOV=NVAR, LDCOVS=NVAR)
!
      INTEGER    IEND, IPRINT, ISTEP, LEVEL(NVAR), NOBS
      REAL       AOV(13), COEF(LDCOEF,5), COV(LDCOV,NVAR), &
                 COVS(LDCOVS,NVAR), HIST(NVAR), SCALE(NVAR)
!                                 Same matrix as in Example 1
      DATA COV/415.231, 251.077, -372.615, -290.000, 775.962, 251.077, &
           2905.69, -166.538, -3041.00, 2292.95, -372.615, -166.538, &
           492.308, 38.0000, -618.231, -290.000, -3041.00, 38.0000, &
           3362.00, -2481.70, 775.962, 2292.95, -618.231, -2481.70, &
           2715.76/
!                                 LEVEL is not passed in the call below
      DATA LEVEL/4*1, -1/
!                                 ISTEP = 1 requests forward steps
      ISTEP  = 1
      NOBS   = 13
      IPRINT = 2
      CALL RSTEP (COV, NOBS, AOV, COEF, COVS, ISTEP=ISTEP, IPRINT=IPRINT)
!
      END

Output

 

FORWARD SELECTION
STEP 0: No variables entered.

* * * Statistics for Variables Not in the Model * * *
              Coef.   Standard   t-statistic   Prob. of    Variance
Variable   Estimate      Error      to enter   Larger t   Inflation
       1      1.869     0.5264         3.550     0.0046           1
       2      0.789     0.1684         4.686     0.0007           1
       3     -1.256     0.5984        -2.098     0.0598           1
       4     -0.738     0.1546        -4.775     0.0006           1

STEP 1 : Variable 4 entered.

Dependent   R-squared   Adjusted    Est. Std. Dev.
Variable    (percent)   R-squared   of Model Error
        5      67.454      64.496            8.964

* * * Analysis of Variance * * *
                    Sum of      Mean               Prob. of
Source       DF    Squares    Square   Overall F   Larger F
Regression    1     1831.9    1831.9      22.799     0.0006
Error        11      883.9      80.4
Total        12     2715.8

* * * Inference on Coefficients * * *
(Conditional on the Selected Model)
              Coef.   Standard                 Prob. of    Variance
Variable   Estimate      Error   t-statistic   Larger t   Inflation
       4     -0.738     0.1546        -4.775     0.0006        1.00

* * * Statistics for Variables Not in the Model * * *
              Coef.   Standard   t-statistic   Prob. of    Variance
Variable   Estimate      Error      to enter   Larger t   Inflation
       1      1.440     0.1384        10.403     0.0000        1.06
       2      0.311     0.7486         0.415     0.6867       18.74
       3     -1.200     0.1890        -6.348     0.0001        1.00

STEP 2 : Variable 1 entered.

Dependent   R-squared   Adjusted    Est. Std. Dev.
Variable    (percent)   R-squared   of Model Error
        5      97.247      96.697            2.734

* * * Analysis of Variance * * *
                    Sum of      Mean               Prob. of
Source       DF    Squares    Square   Overall F   Larger F
Regression    2     2641.0    1320.5     176.636     0.0000
Error        10       74.8       7.5
Total        12     2715.8

* * * Inference on Coefficients * * *
(Conditional on the Selected Model)
              Coef.   Standard                 Prob. of    Variance
Variable   Estimate      Error   t-statistic   Larger t   Inflation
       1      1.440     0.1384        10.403     0.0000        1.06
       4     -0.614     0.0486       -12.622     0.0000        1.06

* * * Statistics for Variables Not in the Model * * *
              Coef.   Standard   t-statistic   Prob. of    Variance
Variable   Estimate      Error      to enter   Larger t   Inflation
       2      0.416     0.1856         2.242     0.0517       18.78
       3     -0.410     0.1992        -2.058     0.0697        3.46

* * * Forward Selection Summary * * *
Variable   Step Entered
       1              2
       4              1

Example 3

For an extended version of Example 2 that in addition computes the intercept and standard error for the final model from RSTEP, see Example 2 for routine RSUBM.