regression_stepwise

Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.

Synopsis

#include <imsls.h>

void imsls_f_regression_stepwise (int n_rows, int n_candidate, float x[], float y[], ..., 0)

The type double function is imsls_d_regression_stepwise.

Required Arguments

int n_rows   (Input)
Number of rows in x and the number of elements in y.

int n_candidate   (Input)
Number of candidate variables (independent variables) or columns in x.

float x[]   (Input)
Array of size n_rows × n_candidate containing the data for the candidate variables.

float y[]   (Input)
Array of length n_rows containing the responses for the dependent variable.

Synopsis with Optional Arguments

#include <imsls.h>

void imsls_f_regression_stepwise (int n_rows, int n_candidate, float x[], float y[],
IMSLS_X_COL_DIM, int x_col_dim,
IMSLS_WEIGHTS, float weights[],
IMSLS_FREQUENCIES, float frequencies[],
IMSLS_FIRST_STEP, or
IMSLS_INTERMEDIATE_STEP, or
IMSLS_LAST_STEP, or
IMSLS_ALL_STEPS,
IMSLS_N_STEPS, int n_steps,
IMSLS_FORWARD, or
IMSLS_BACKWARD, or
IMSLS_STEPWISE,
IMSLS_P_VALUE_IN, float p_value_in,
IMSLS_P_VALUE_OUT, float p_value_out,
IMSLS_TOLERANCE, float tolerance,
IMSLS_ANOVA_TABLE, float **anova_table,
IMSLS_ANOVA_TABLE_USER, float anova_table[],
IMSLS_COEF_T_TESTS, float **coef_t_tests,
IMSLS_COEF_T_TESTS_USER, float coef_t_tests[],
IMSLS_COEF_VIF, float **coef_vif,
IMSLS_COEF_VIF_USER, float coef_vif[],
IMSLS_LEVEL, int level[],
IMSLS_FORCE, int n_force,
IMSLS_IEND, int *iend,
IMSLS_SWEPT_USER, int swept[],
IMSLS_HISTORY_USER, float history[],
IMSLS_COV_SWEPT_USER, float *covs,
IMSLS_INPUT_COV, int n_observations, float *cov,
0)

Optional Arguments

IMSLS_X_COL_DIM, int x_col_dim   (Input)
Column dimension of x.
Default: x_col_dim = n_candidate
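As an illustration of what the column dimension means, the sketch below addresses one element of x, assuming the usual C row-major layout in which each row occupies x_col_dim consecutive floats. The helper name candidate_value is hypothetical and is not part of the library.

```c
#include <stddef.h>

/* Hypothetical helper (not an IMSL function): returns the value of candidate
 * variable `col` for observation `row`, assuming row-major storage with
 * x_col_dim floats per row, where x_col_dim >= n_candidate. */
float candidate_value(const float *x, int x_col_dim, int row, int col)
{
    return x[(size_t)row * (size_t)x_col_dim + (size_t)col];
}
```

With x_col_dim larger than n_candidate, the trailing entries of each row are simply skipped.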

IMSLS_WEIGHTS, float weights[]   (Input)
Array of length n_rows containing the weight for each row of x.
Default: weights[] = 1

IMSLS_FREQUENCIES, float frequencies[]   (Input)
Array of length n_rows containing the frequency for each row of x.
Default: frequencies[] = 1

IMSLS_FIRST_STEP, or

IMSLS_INTERMEDIATE_STEP, or

IMSLS_LAST_STEP, or

IMSLS_ALL_STEPS
One or none of these options can be specified. If none of these is specified, the action defaults to IMSLS_ALL_STEPS.

 

Argument                   Action
IMSLS_FIRST_STEP           This is the first invocation; additional calls will be made. Initialization and stepping are performed.
IMSLS_INTERMEDIATE_STEP    This is an intermediate invocation. Stepping is performed.
IMSLS_LAST_STEP            This is the final invocation. Stepping and wrap-up computations are performed.
IMSLS_ALL_STEPS            This is the only invocation. Initialization, stepping, and wrap-up computations are performed.

 

IMSLS_N_STEPS, int n_steps   (Input)
For nonnegative n_steps, n_steps steps are taken. If n_steps = -1, stepping continues until completion.

IMSLS_FORWARD, or

IMSLS_BACKWARD, or

IMSLS_STEPWISE
One or none of these options can be specified. If none is specified, the action defaults to IMSLS_BACKWARD.

 

Keyword           Action
IMSLS_FORWARD     An attempt is made to add a variable to the model. A variable is added if its p-value is less than p_value_in. During initialization, only the forced variables enter the model.
IMSLS_BACKWARD    An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds p_value_out. During initialization, all candidate independent variables enter the model.
IMSLS_STEPWISE    A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.
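The three kinds of step can be sketched in plain C as follows. This is only an illustration of the rules stated in the table, not the library's implementation: the function names are hypothetical, the tie-breaking choice (smallest p-value enters, largest p-value leaves) is an assumption, and forced variables and priority levels are ignored. A real routine would also recompute the p-values after every change to the model.

```c
/* Hypothetical sketch, not IMSL code. in_model[i] is nonzero when
 * variable i is currently in the model; p[i] is its current p-value. */

/* Index of the variable outside the model with the smallest p-value
 * below p_value_in, or -1 if no variable qualifies. */
int pick_forward(const double *p, const int *in_model, int n, double p_value_in)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!in_model[i] && p[i] < p_value_in && (best < 0 || p[i] < p[best]))
            best = i;
    }
    return best;
}

/* Index of the variable in the model with the largest p-value
 * above p_value_out, or -1 if no variable qualifies. */
int pick_backward(const double *p, const int *in_model, int n, double p_value_out)
{
    int worst = -1;
    for (int i = 0; i < n; i++) {
        if (in_model[i] && p[i] > p_value_out && (worst < 0 || p[i] > p[worst]))
            worst = i;
    }
    return worst;
}

/* One stepwise step: try a backward step first and attempt a forward step
 * only if nothing was removed. Returns 1 if the model changed, else 0. */
int stepwise_step(const double *p, int *in_model, int n,
                  double p_value_in, double p_value_out)
{
    int i = pick_backward(p, in_model, n, p_value_out);
    if (i >= 0) {
        in_model[i] = 0;
        return 1;
    }
    i = pick_forward(p, in_model, n, p_value_in);
    if (i >= 0) {
        in_model[i] = 1;
        return 1;
    }
    return 0;
}
```

A return value of 0 from stepwise_step corresponds to the situation in which no additional steps are possible.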

 

IMSLS_P_VALUE_IN, float p_value_in   (Input)
Largest p-value for variables entering the model. Variables with p-values less than p_value_in may enter the model.
Default: p_value_in = 0.05

IMSLS_P_VALUE_OUT, float p_value_out   (Input)
Smallest p-value for removing variables. Variables with p-values greater than p_value_out may leave the model. Argument p_value_out must be greater than or equal to p_value_in. A common choice for p_value_out is 2*p_value_in.
Default: p_value_out = 0.10

IMSLS_TOLERANCE, float tolerance   (Input)
Tolerance used in determining linear dependence.
Default: tolerance = 100*eps, where eps = imsls_f_machine(4) for single precision and eps = imsls_d_machine(4) for double precision

IMSLS_ANOVA_TABLE, float **anova_table   (Output)
Address of a pointer to the internally allocated array containing the analysis of variance table. The analysis of variance statistics are as follows:

 

Element   Analysis of Variance Statistic
0         degrees of freedom for regression
1         degrees of freedom for error
2         total degrees of freedom
3         sum of squares for regression
4         sum of squares for error
5         total sum of squares
6         regression mean square
7         error mean square
8         F-statistic
9         p-value
10        R² (in percent)
11        adjusted R² (in percent)
12        estimate of the standard deviation

 

            Note that the p-value is returned as 0.0 when the value is so small that all significant digits have been lost.
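Several entries of the table are related by the standard analysis of variance identities. The sketch below derives elements 6, 7, 8, 10, and 11 from the degrees of freedom and sums of squares in elements 0 through 5. The function name anova_derived is hypothetical, and the sketch assumes the textbook definitions rather than documenting the library's internal computation. Element 9 (the p-value) and element 12 are left untouched.

```c
/* Hypothetical helper (not an IMSL function): fills the derived entries of a
 * 13-element ANOVA array from elements 0..5, using textbook identities. */
void anova_derived(double *aov)
{
    aov[6]  = aov[3] / aov[0];                 /* regression mean square = SSR / DFR */
    aov[7]  = aov[4] / aov[1];                 /* error mean square      = SSE / DFE */
    aov[8]  = aov[6] / aov[7];                 /* F-statistic            = MSR / MSE */
    aov[10] = 100.0 * aov[3] / aov[5];         /* R-squared in percent   = SSR / SST */
    aov[11] = 100.0 * (1.0 - aov[7] / (aov[5] / aov[2]));
                                               /* adjusted R-squared in percent */
}
```

Applied to the sums of squares printed in the Example output, these identities reproduce the printed mean squares and R-squared values to within rounding.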

IMSLS_ANOVA_TABLE_USER, float anova_table[]   (Output)
Storage for anova_table is provided by the user. See IMSLS_ANOVA_TABLE.

IMSLS_COEF_T_TESTS, float **coef_t_tests   (Output)
Address of a pointer to the internally allocated array containing statistics relating to the regression coefficients for the final model in this invocation. The rows correspond to the n_candidate independent variables. The rows are in the same order as the variables in x (or, if IMSLS_INPUT_COV is specified, the rows are in the same order as the variables in cov). Each row corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the row in question.

 

Column   Description
0        coefficient estimate
1        estimated standard error of the coefficient estimate
2        t-statistic for the test that the coefficient is 0
3        p-value for the two-sided t test

 

IMSLS_COEF_T_TESTS_USER, float coef_t_tests[]   (Output)
Storage for array coef_t_tests is provided by the user. See IMSLS_COEF_T_TESTS.

IMSLS_COEF_VIF, float **coef_vif   (Output)
Address of a pointer to the internally allocated array containing variance inflation factors for the final model in this invocation. The elements correspond to the n_candidate independent variables. The elements are in the same order as the variables in x (or, if IMSLS_INPUT_COV is specified, the elements are in the same order as the variables in cov). Each element corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the element in question.

            The square of the multiple correlation coefficient for the I-th regressor after all others can be obtained from coef_vif[I] by the following formula: 1.0 - 1.0/coef_vif[I]
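A variance inflation factor is conventionally defined as 1/(1 - R²), where R² is the squared multiple correlation of one regressor on the others, so the relation can be inverted in one line. The helper name r_squared_from_vif is hypothetical, and the sketch assumes a factor of at least 1.

```c
/* Hypothetical helper (not an IMSL function): recovers the squared multiple
 * correlation of a regressor on the other regressors from its variance
 * inflation factor, using VIF = 1 / (1 - R^2). Assumes vif >= 1. */
double r_squared_from_vif(double vif)
{
    return 1.0 - 1.0 / vif;
}
```

For instance, a factor of 4 corresponds to a squared multiple correlation of 0.75.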

IMSLS_COEF_VIF_USER, float coef_vif[]   (Output)
Storage for array coef_vif is provided by the user. See IMSLS_COEF_VIF.

IMSLS_LEVEL, int level[]   (Input)
Array of length n_candidate + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value which indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. Argument level[I] = 0 means the I-th variable is never to enter the model. Argument level[I] = -1 means the I-th variable is the dependent variable. Argument level[n_candidate] must correspond to the dependent variable, except when IMSLS_INPUT_COV is specified.
Default: 1, 1, ..., 1, -1 where -1 corresponds to level[n_candidate]
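The entry rule can be sketched as a small predicate. This is an illustration only: the function name may_enter and the in_model flag array are hypothetical, and the sketch assumes that a level of 0 excludes a variable and that a negative level marks the dependent variable, so only positive levels take part in the comparison.

```c
/* Hypothetical helper (not an IMSL function): returns 1 if variable i is
 * allowed to compete for entry, and 0 otherwise. in_model[j] is nonzero
 * when variable j is already in the model. */
int may_enter(const int *level, const int *in_model, int n_candidate, int i)
{
    /* Level 0 never enters; a negative level is assumed to mark the
     * dependent variable; a variable already in the model cannot enter. */
    if (level[i] <= 0 || in_model[i])
        return 0;

    /* Every variable with a smaller positive level must already be in. */
    for (int j = 0; j < n_candidate; j++) {
        if (level[j] > 0 && level[j] < level[i] && !in_model[j])
            return 0;
    }
    return 1;
}
```

Variables that pass this test at the same level would then compete on their p-values.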

IMSLS_FORCE, int n_force   (Input)
Variables with levels 1, 2, ..., n_force are forced into the model as independent variables. See IMSLS_LEVEL.

IMSLS_IEND, int *iend   (Output)
Variable which indicates whether additional steps are possible.

 

iend   Meaning
0      Additional steps may be possible.
1      No additional steps are possible.

 

IMSLS_SWEPT_USER, int swept[]   (Output)
A user-allocated array of length n_candidate + 1 with information to indicate the independent variables in the model. Argument swept[n_candidate] usually corresponds to the dependent variable. See IMSLS_LEVEL.

 

swept[i]   Status of i-th Variable
-1         Variable i is not in model.
1          Variable i is in model.
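A caller would typically turn this array into a list of variable indices. The sketch below does so under the assumption that a value of 1 marks a variable in the model and any other value marks a variable outside it. The function name model_variables is hypothetical.

```c
/* Hypothetical helper (not an IMSL function): writes the indices of the
 * independent variables currently in the model to `indices` (which must
 * hold at least n_candidate ints) and returns how many were found.
 * Assumes swept[i] == 1 means "in model". The last element of swept,
 * which usually corresponds to the dependent variable, is not examined. */
int model_variables(const int *swept, int n_candidate, int *indices)
{
    int count = 0;
    for (int i = 0; i < n_candidate; i++) {
        if (swept[i] == 1)
            indices[count++] = i;
    }
    return count;
}
```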

 

IMSLS_HISTORY_USER, float history[]   (Output)
User-allocated array of length n_candidate + 1 containing the recent history of the independent variables. Element history[n_candidate] usually corresponds to the dependent variable. See IMSLS_LEVEL.

 

 

history[i]   Status of i-th Variable
0.0          Variable has never been added to model.
0.5          Variable was added into the model during initialization.
k > 0.0      Variable was added to the model during the k-th step.
k < 0.0      Variable was deleted from model during the |k|-th step.

 

IMSLS_COV_SWEPT_USER, float *covs   (Output)
User-allocated array of length (n_candidate + 1) × (n_candidate + 1) that results after cov has been swept on the columns corresponding to the variables in the model. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of covs corresponding to the independent variables in the final model and multiplying the elements of this matrix by anova_table[7].
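The extraction described above can be sketched as follows. The function name coef_covariance is hypothetical. The sketch assumes covs is stored row-major with n_candidate + 1 columns, that swept[i] == 1 marks the variables in the model (see IMSLS_SWEPT_USER), and that the caller passes the error mean square taken from anova_table[7].

```c
/* Hypothetical helper (not an IMSL function): copies the rows and columns
 * of covs that belong to variables in the model into `out`, scaling each
 * element by the error mean square. `out` must hold p * p floats, where p
 * is the number of variables in the model. Returns p. */
int coef_covariance(const float *covs, const int *swept, int n_candidate,
                    float error_mean_square, float *out)
{
    int dim = n_candidate + 1;   /* covs is dim x dim, row-major (assumed) */
    int p = 0;

    for (int i = 0; i < n_candidate; i++) {
        if (swept[i] == 1)
            p++;
    }

    int r = 0;
    for (int i = 0; i < n_candidate; i++) {
        if (swept[i] != 1)
            continue;
        int c = 0;
        for (int j = 0; j < n_candidate; j++) {
            if (swept[j] != 1)
                continue;
            out[r * p + c] = covs[i * dim + j] * error_mean_square;
            c++;
        }
        r++;
    }
    return p;
}
```

The square roots of the diagonal of the result would be the coefficient standard errors.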

IMSLS_INPUT_COV, int n_observations, float *cov   (Input)
An (n_candidate + 1) by (n_candidate + 1) array containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. Argument n_observations is an integer specifying the number of observations associated with cov. Argument cov can be computed using imsls_f_covariances. Arguments x, y, weights, and frequencies are not accessed when this option is specified.

            By default, imsls_f_regression_stepwise computes cov from the input data matrices x and y.
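For readers who want to see what such a matrix looks like, the sketch below builds a mean-corrected sum of squares and crossproducts matrix from x and y, with the dependent variable in the last row and column. It is a minimal illustration, not the library's computation: the function names are hypothetical, weights and frequencies are ignored, and x is assumed to be row-major with exactly n_candidate columns.

```c
#include <stdlib.h>

/* Hypothetical helper: value of column `col` in the combined data [x | y]. */
double column_value(const float *x, const float *y, int n_candidate,
                    int row, int col)
{
    return col < n_candidate ? x[row * n_candidate + col] : y[row];
}

/* Hypothetical helper (not an IMSL function): fills cov, a row-major
 * (n_candidate + 1) x (n_candidate + 1) array, with the mean-corrected sums
 * of squares and crossproducts of the candidate variables and the response.
 * Returns 0 on success and -1 if memory could not be allocated. */
int corrected_sscp(const float *x, const float *y, int n_rows,
                   int n_candidate, double *cov)
{
    int dim = n_candidate + 1;
    double *mean = calloc((size_t)dim, sizeof *mean);
    if (mean == NULL)
        return -1;

    /* Column means of [x | y]. */
    for (int i = 0; i < n_rows; i++) {
        for (int j = 0; j < dim; j++)
            mean[j] += column_value(x, y, n_candidate, i, j) / n_rows;
    }

    /* Crossproducts of deviations from the means. */
    for (int a = 0; a < dim; a++) {
        for (int b = 0; b < dim; b++) {
            double s = 0.0;
            for (int i = 0; i < n_rows; i++) {
                s += (column_value(x, y, n_candidate, i, a) - mean[a])
                   * (column_value(x, y, n_candidate, i, b) - mean[b]);
            }
            cov[a * dim + b] = s;
        }
    }

    free(mean);
    return 0;
}
```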

Description

Function imsls_f_regression_stepwise builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. The function is designed so the user can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, multiple calls to imsls_f_regression_stepwise (using optional arguments IMSLS_FIRST_STEP, IMSLS_INTERMEDIATE_STEP, ..., IMSLS_LAST_STEP) are made. Alternatively, imsls_f_regression_stepwise can be invoked once (by default, or by specifying optional argument IMSLS_ALL_STEPS) to perform the stepping until a final model is selected.

Levels of priority can be assigned to the candidate independent variables (use optional argument IMSLS_LEVEL). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see optional argument IMSLS_FORCE). Note that specifying optional argument IMSLS_FORCE without also specifying optional argument IMSLS_LEVEL will result in all variables being forced into the model.

Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is required. Other possibilities are as follows:

1.     The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in cov (see optional argument IMSLS_INPUT_COV). Argument n_observations must be set to one greater than the number of observations.

2.     An intercept is a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent variables, and dependent variable is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument n_observations must be set to one greater than the number of observations.

The stepwise regression algorithm is due to Efroymson (1960). Function imsls_f_regression_stepwise uses sweeps of the covariance matrix (input in cov, if optional argument IMSLS_INPUT_COV is specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm is also given by Kennedy and Gentle (1980, pp. 335–340). The advantage of stepwise model building over all possible regression (see function imsls_f_regression_selection) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R²) for any subset size of independent variables.
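To make the idea of sweeping concrete, here is a minimal sketch of a sweep on one pivot of a square matrix, written in the form usually attributed to Goodnight (1979). It is an illustration under assumptions, not the library's code: the function name sweep is hypothetical, the sign convention may differ from the one used internally, and no tolerance test for linear dependence is made beyond rejecting an exactly zero pivot.

```c
/* Hypothetical sketch (not an IMSL function): sweeps the dim x dim row-major
 * matrix `a` in place on pivot k. Returns 0 on success and -1 if the pivot
 * is exactly zero. */
int sweep(double *a, int dim, int k)
{
    double d = a[k * dim + k];
    if (d == 0.0)
        return -1;

    /* Divide the pivot row by the pivot. */
    for (int j = 0; j < dim; j++)
        a[k * dim + j] /= d;

    /* Eliminate the pivot column from every other row. */
    for (int i = 0; i < dim; i++) {
        if (i == k)
            continue;
        double b = a[i * dim + k];
        for (int j = 0; j < dim; j++)
            a[i * dim + j] -= b * a[k * dim + j];
        a[i * dim + k] = -b / d;
    }

    /* The pivot itself becomes its reciprocal. */
    a[k * dim + k] = 1.0 / d;
    return 0;
}
```

For the 2 × 2 matrix with rows (4, 2) and (2, 3), read as x'x = 4, x'y = 2, y'y = 3, sweeping on the first pivot leaves 1/(x'x) = 0.25 in the pivot position, the coefficient estimate 0.5 in the last column, and the error sum of squares 2 in the lower right corner. Under this convention, sweeping a pivot brings the corresponding variable into the model.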

Example

This example uses a data set from Draper and Smith (1981, pp. 629–630). Backward stepping is performed by default.

 

#include <imsls.h>

#define N_OBSERVATIONS 13
#define N_CANDIDATE    4

int main()
{
    /* Row labels for the 13 elements of the analysis of variance table. */
    char           *labels[] = {
                    "degrees of freedom for regression",
                    "degrees of freedom for error",
                    "total degrees of freedom",
                    "sum of squares for regression",
                    "sum of squares for error",
                    "total sum of squares",
                    "regression mean square",
                    "error mean square",
                    "F-statistic",
                    "p-value",
                    "R-squared (in percent)",
                    "adjusted R-squared (in percent)",
                    "est. standard deviation of within error"
    };
    /* Column labels; the first entry heads the row-number column. */
    char           *c_labels[] = {
                    "variable",
                    "estimate",
                    "s.e.",
                    "t",
                    "prob > t"
    };
    float  *aov, *tt;
    float  x[N_OBSERVATIONS][N_CANDIDATE] =
        {7., 26.,  6., 60.,
         1., 29., 15., 52.,
        11., 56.,  8., 20.,
        11., 31.,  8., 47.,
         7., 52.,  6., 33.,
        11., 55.,  9., 22.,
         3., 71., 17.,  6.,
         1., 31., 22., 44.,
         2., 54., 18., 22.,
        21., 47.,  4., 26.,
         1., 40., 23., 34.,
        11., 66.,  9., 12.,
        10., 68.,  8., 12.};
    float  y[N_OBSERVATIONS] = {78.5, 74.3, 104.3, 87.6, 95.9,
        109.2, 102.7,  72.5, 93.1, 115.9, 83.8, 113.3, 109.4};

    /* Backward stepping over all steps is the default. */
    imsls_f_regression_stepwise(N_OBSERVATIONS, N_CANDIDATE, (float *) x, y,
        IMSLS_ANOVA_TABLE, &aov,
        IMSLS_COEF_T_TESTS, &tt,
        0);

    imsls_f_write_matrix("* * * Analysis of Variance * * *\n",
        13, 1, aov,
        IMSLS_ROW_LABELS, labels,
        IMSLS_WRITE_FORMAT, "%9.2f",
        0);

    imsls_f_write_matrix("* * * Inference on Coefficients * * *\n",
        4, 4, tt,
        IMSLS_COL_LABELS, c_labels,
        IMSLS_WRITE_FORMAT, "%9.2f",
        0);

    return 0;
}

Output

         * * * Analysis of Variance * * *

degrees of freedom for regression             2.00
degrees of freedom for error                 10.00
total degrees of freedom                     12.00
sum of squares for regression              2657.86
sum of squares for error                     57.90
total sum of squares                       2715.76
regression mean square                     1328.93
error mean square                             5.79
F-statistic                                 229.50
p-value                                       0.00
R-squared (in percent)                       97.87
adjusted R-squared (in percent)              97.44
est. standard deviation of within error       2.41

       * * * Inference on Coefficients * * *

variable   estimate       s.e.          t   prob > t
       1       1.47       0.12      12.10       0.00
       2       0.66       0.05      14.44       0.00
       3       0.25       0.18       1.35       0.21
       4      -0.24       0.17      -1.36       0.21

Warning Errors

IMSLS_LINEAR_DEPENDENCE_1              Based on “tolerance” = #, there are linear dependencies among the variables to be forced.

Fatal Errors

IMSLS_NO_VARIABLES_ENTERED           No variables entered the model. All elements of “anova_table” are set to NaN.

