Builds multiple linear regression models using forward selection, backward selection, or stepwise selection.
#include <imsls.h>
void imsls_f_regression_stepwise (int n_rows, int n_candidate, float x[], float y[], ..., 0)
The type double function is imsls_d_regression_stepwise.
int n_rows
(Input)
Number of rows in x and the number of
elements in y.
int n_candidate
(Input)
Number of candidate variables
(independent variables) or columns in x.
float x[]
(Input)
Array of size n_rows × n_candidate containing
the data for the candidate variables.
float y[]
(Input)
Array of length n_rows containing the
responses for the dependent variable.
#include <imsls.h>
void imsls_f_regression_stepwise (int n_rows, int n_candidate, float x[], float y[],
    IMSLS_X_COL_DIM, int x_col_dim,
    IMSLS_WEIGHTS, float weights[],
    IMSLS_FREQUENCIES, float frequencies[],
    IMSLS_FIRST_STEP, or
    IMSLS_INTERMEDIATE_STEP, or
    IMSLS_LAST_STEP, or
    IMSLS_ALL_STEPS,
    IMSLS_N_STEPS, int n_steps,
    IMSLS_FORWARD, or
    IMSLS_BACKWARD, or
    IMSLS_STEPWISE,
    IMSLS_P_VALUE_IN, float p_value_in,
    IMSLS_P_VALUE_OUT, float p_value_out,
    IMSLS_TOLERANCE, float tolerance,
    IMSLS_ANOVA_TABLE, float **anova_table,
    IMSLS_ANOVA_TABLE_USER, float anova_table[],
    IMSLS_COEF_T_TESTS, float **coef_t_tests,
    IMSLS_COEF_T_TESTS_USER, float coef_t_tests[],
    IMSLS_COEF_VIF, float **coef_vif,
    IMSLS_COEF_VIF_USER, float coef_vif[],
    IMSLS_LEVEL, int level[],
    IMSLS_FORCE, int n_force,
    IMSLS_IEND, int *iend,
    IMSLS_SWEPT_USER, int swept[],
    IMSLS_HISTORY_USER, float history[],
    IMSLS_COV_SWEPT_USER, float *covs,
    IMSLS_INPUT_COV, int n_observations, float *cov,
    0)
IMSLS_X_COL_DIM, int x_col_dim
(Input)
Column dimension of x.
Default: x_col_dim = n_candidate
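The role of x_col_dim can be illustrated with a small indexing helper. This is only a sketch: it assumes x is stored row by row (as in the example program for this function), and the helper name x_element is hypothetical, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Return element (row, col) of a matrix stored row by row in a flat
 * array whose physical row length is x_col_dim.
 * Hypothetical helper for illustration; not a library routine. */
float x_element(const float x[], int x_col_dim, int row, int col)
{
    assert(x_col_dim > 0 && row >= 0 && col >= 0 && col < x_col_dim);
    return x[(size_t)row * (size_t)x_col_dim + (size_t)col];
}
```

With n_candidate = 2 and x_col_dim = 3, each row occupies three floats in memory but only the first two would be read as candidate variables.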
IMSLS_WEIGHTS, float weights[]
(Input)
Array of length n_rows containing the
weight for each row of x.
Default: weights[] = 1
IMSLS_FREQUENCIES, float
frequencies[] (Input)
Array of length n_rows containing the
frequency for each row of x.
Default: frequencies[] = 1
IMSLS_FIRST_STEP, or
IMSLS_INTERMEDIATE_STEP, or
IMSLS_LAST_STEP, or
IMSLS_ALL_STEPS
One or none of these options can be specified. If none of these is specified, the
action defaults to IMSLS_ALL_STEPS.
Argument |
Action |
IMSLS_FIRST_STEP |
This is the first invocation; additional calls will be made. Initialization and stepping are performed. |
IMSLS_INTERMEDIATE_STEP |
This is an intermediate invocation. |
IMSLS_LAST_STEP |
This is the final invocation. Stepping and wrap-up computations are performed. |
IMSLS_ALL_STEPS |
This is the only invocation. Initialization, stepping, and wrap-up computations are performed. |
IMSLS_N_STEPS, int n_steps
(Input)
For nonnegative n_steps, n_steps steps are
taken. If n_steps = −1, stepping
continues until completion.
IMSLS_FORWARD, or
IMSLS_BACKWARD, or
IMSLS_STEPWISE
One or none of these options can be specified. If none is specified, the action
defaults to IMSLS_BACKWARD.
Keyword |
Action |
IMSLS_FORWARD |
An attempt is made to add a variable to the model. A variable is added if its p-value is less than p_value_in. During initialization, only the forced variables enter the model. |
IMSLS_BACKWARD |
An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds p_value_out. During initialization, all candidate independent variables enter the model. |
IMSLS_STEPWISE |
A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization. |
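The decision logic of a single stepwise step can be sketched as follows. This is a simplified illustration, not the library's implementation: it ignores forced variables and priority levels, it does not refit the model, and it assumes that the candidate for removal is the in-model variable with the largest p-value and the candidate for entry is the excluded variable with the smallest p-value. The type step_result and the function stepwise_step are hypothetical names.

```c
#include <assert.h>

/* Outcome of one simplified step (hypothetical type for illustration). */
typedef struct {
    int removed;  /* index of the variable removed, or -1 if none */
    int added;    /* index of the variable added, or -1 if none */
} step_result;

/* One simplified stepwise step over n variables.
 * in_model[i] is 1 if variable i is in the model, 0 otherwise.
 * p[i] is the current p-value of variable i (for an excluded variable,
 * the p-value it would have if it were added).
 * Assumption: largest p-value is tried for removal, smallest for entry. */
step_result stepwise_step(int n, int in_model[], const float p[],
                          float p_value_in, float p_value_out)
{
    step_result r = { -1, -1 };
    int i, worst = -1, best = -1;

    assert(n >= 0 && p_value_out >= p_value_in);

    /* Backward step: in-model variable with the largest p-value. */
    for (i = 0; i < n; i++)
        if (in_model[i] && (worst < 0 || p[i] > p[worst]))
            worst = i;
    if (worst >= 0 && p[worst] > p_value_out) {
        in_model[worst] = 0;
        r.removed = worst;
        return r;
    }

    /* Forward step, attempted only if nothing was removed. */
    for (i = 0; i < n; i++)
        if (!in_model[i] && (best < 0 || p[i] < p[best]))
            best = i;
    if (best >= 0 && p[best] < p_value_in) {
        in_model[best] = 1;
        r.added = best;
    }
    return r;
}
```

A real step would recompute every p-value after the model changes; the sketch only shows the order in which removal and entry are tried.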
IMSLS_P_VALUE_IN, float
p_value_in (Input)
Largest p-value for variables
entering the model. Variables with p-values less than p_value_in may enter
the model.
Default: p_value_in = 0.05
IMSLS_P_VALUE_OUT, float
p_value_out (Input)
Smallest p-value for removing
variables. Variables with p-values greater than
p_value_out may
leave the model. Argument p_value_out must be
greater than or equal to p_value_in. A common
choice for p_value_out is 2*p_value_in.
Default: p_value_out = 0.10
IMSLS_TOLERANCE, float tolerance
(Input)
Tolerance used in determining linear dependence.
Default: tolerance = 100*eps, where
eps = imsls_f_machine(4) for
single precision and eps = imsls_d_machine(4) for
double precision
IMSLS_ANOVA_TABLE, float
**anova_table (Output)
Address of a pointer to the
internally allocated array containing the analysis of variance table. The
analysis of variance statistics are as follows:
Element |
Analysis of Variance Statistic |
0 |
degrees of freedom for regression |
1 |
degrees of freedom for error |
2 |
total degrees of freedom |
3 |
sum of squares for regression |
4 |
sum of squares for error |
5 |
total sum of squares |
6 |
regression mean square |
7 |
error mean square |
8 |
F-statistic |
9 |
p-value |
10 |
R2 (in percent) |
11 |
adjusted R2 (in percent) |
12 |
estimate of the standard deviation |
Note that the p-value is returned as 0.0 when the value is so small that all significant digits have been lost.
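Several entries of the table are related to one another by the usual textbook definitions. The sketch below recomputes elements 6, 7, 8, 10, and 11 from the degrees of freedom and sums of squares. It assumes those standard definitions apply here; the function name anova_derived is hypothetical.

```c
#include <assert.h>

/* Fill elements 6, 7, 8, 10 and 11 of a 13-element ANOVA array from the
 * degrees of freedom (elements 0-2) and sums of squares (elements 3-5),
 * using the standard definitions (assumed, not taken from the library). */
void anova_derived(float aov[13])
{
    assert(aov[0] > 0.0f && aov[1] > 0.0f && aov[2] > 0.0f && aov[5] > 0.0f);
    aov[6] = aov[3] / aov[0];            /* regression mean square */
    aov[7] = aov[4] / aov[1];            /* error mean square */
    aov[8] = aov[6] / aov[7];            /* F-statistic */
    aov[10] = 100.0f * aov[3] / aov[5];  /* R2 in percent */
    /* adjusted R2 in percent: 1 - (error MS) / (total SS / total df) */
    aov[11] = 100.0f * (1.0f - aov[7] / (aov[5] / aov[2]));
}
```

The p-value (element 9) and the standard deviation estimate (element 12) are left untouched, since they need an F distribution function and a square root respectively.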
IMSLS_ANOVA_TABLE_USER, float
anova_table[] (Output)
Storage for anova_table is
provided by the user. See IMSLS_ANOVA_TABLE.
IMSLS_COEF_T_TESTS, float
**coef_t_tests (Output)
Address of a pointer to the
internally allocated array containing statistics relating to the regression
coefficients for the final model in this invocation. The rows correspond to
the n_candidate
independent variables. The rows are in the same order as the variables in x (or, if IMSLS_INPUT_COV is
specified, the rows are in the same order as the variables in cov). Each row
corresponding to a variable not in the model contains statistics for a model
which includes the variables of the final model and the variable corresponding
to the row in question.
Column |
Description |
0 |
coefficient estimate |
1 |
estimated standard error of the coefficient estimate |
2 |
t-statistic for the test that the coefficient is 0 |
3 |
p-value for the two-sided t test |
IMSLS_COEF_T_TESTS_USER, float
coef_t_tests[] (Output)
Storage for array coef_t_tests is
provided by the user. See IMSLS_COEF_T_TESTS.
IMSLS_COEF_VIF, float
**coef_vif (Output)
Address of a pointer to the internally
allocated array containing variance inflation factors for the final model in
this invocation. The elements correspond to the n_candidate independent
variables. The elements are in the same order as the variables in x (or, if IMSLS_INPUT_COV is
specified, the elements are in the same order as the variables in cov). Each element
corresponding to a variable not in the model contains statistics for a model
which includes the variables of the final model and the variable corresponding
to the element in question.
The square of the multiple correlation coefficient for the i-th regressor after all others can be obtained from coef_vif[i] by the following formula:
R2 = 1.0 − 1.0/coef_vif[i]
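A variance inflation factor is conventionally defined as VIF = 1/(1 − R2), so the squared multiple correlation is recovered as R2 = 1 − 1/VIF. A minimal sketch of that conversion, with a hypothetical function name:

```c
#include <assert.h>

/* Convert a variance inflation factor to the squared multiple correlation
 * of that regressor with the other regressors: R2 = 1 - 1/VIF.
 * Hypothetical helper for illustration; not a library routine. */
double r_squared_from_vif(double vif)
{
    assert(vif > 0.0);
    return 1.0 - 1.0 / vif;
}
```

A factor of 1 corresponds to a regressor that is uncorrelated with the others, and larger factors indicate stronger collinearity.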
IMSLS_COEF_VIF_USER, float
coef_vif[] (Output)
Storage for array coef_vif is provided
by the user. See IMSLS_COEF_VIF.
IMSLS_LEVEL, int level[]
(Input)
Array of length n_candidate + 1
containing levels of priority for variables entering and leaving the regression.
Each variable is assigned a positive value which indicates its level of entry
into the model. A variable can enter the model only after all variables with
smaller nonzero levels of entry have entered. Similarly, a variable can only
leave the model after all variables with higher levels of entry have left.
Variables with the same level of entry compete for entry (deletion) at each
step. Argument level[I] = 0 means
the I-th variable is
never to enter the model. Argument level[I] = −1 means the
I-th variable is
the dependent variable. Argument level[n_candidate] must
correspond to the dependent variable, except when IMSLS_INPUT_COV is
specified.
Default: 1, 1, ..., 1, −1 where −1 corresponds to
level[n_candidate]
IMSLS_FORCE, int n_force
(Input)
Variables with levels 1, 2, ..., n_force are forced
into the model as independent variables. See IMSLS_LEVEL.
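The interaction between level and n_force can be sketched with two small helpers. Both names are hypothetical. The first fills level with the documented defaults; the second counts the variables whose level falls in 1, 2, ..., n_force, which is the set the description above says is forced.

```c
#include <assert.h>

/* Fill level[0..n_candidate] with the documented defaults:
 * 1 for every candidate and -1 for the dependent variable. */
void default_levels(int n_candidate, int level[])
{
    int i;
    assert(n_candidate >= 0);
    for (i = 0; i < n_candidate; i++)
        level[i] = 1;
    level[n_candidate] = -1;
}

/* Count the variables whose level lies in 1..n_force. */
int count_forced(int n_candidate, const int level[], int n_force)
{
    int i, count = 0;
    assert(n_candidate >= 0);
    for (i = 0; i <= n_candidate; i++)
        if (level[i] >= 1 && level[i] <= n_force)
            count++;
    return count;
}
```

With the default levels, any n_force of 1 or more selects every candidate, so IMSLS_FORCE is normally paired with an explicit IMSLS_LEVEL array.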
IMSLS_IEND, int *iend
(Output)
Variable which indicates whether additional steps are possible.
iend |
Meaning |
0 |
Additional steps may be possible. |
1 |
No additional steps are possible. |
IMSLS_SWEPT_USER, int swept[]
(Output)
A user-allocated array of length n_candidate + 1 with
information to indicate the independent variables in the model. Argument swept[n_candidate] usually
corresponds to the dependent variable. See IMSLS_LEVEL.
swept[i] |
Status of i-th Variable |
−1 |
Variable i is not in model. |
1 |
Variable i is in model. |
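Reading the selected model out of swept amounts to collecting the indices whose entry is 1. The sketch below assumes the usual layout in which the last element belongs to the dependent variable and is therefore skipped; the function name variables_in_model is hypothetical.

```c
#include <assert.h>

/* Collect the indices i < n_candidate with swept[i] == 1 into indices[]
 * and return how many were found. indices[] must have room for
 * n_candidate entries. Assumes swept[n_candidate] is the dependent
 * variable, so it is not examined. */
int variables_in_model(int n_candidate, const int swept[], int indices[])
{
    int i, n = 0;
    assert(n_candidate >= 0);
    for (i = 0; i < n_candidate; i++)
        if (swept[i] == 1)
            indices[n++] = i;
    return n;
}
```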
IMSLS_HISTORY_USER, float history[]
(Output)
User-allocated array of length n_candidate + 1
containing the recent history of the independent variables. Element history[n_candidate] usually
corresponds to the dependent variable. See IMSLS_LEVEL.
history[i] |
Status of i-th Variable |
0.0 |
Variable has never been added to model. |
0.5 |
Variable was added into the model during initialization. |
k > 0.0 |
Variable was added to the model during the k-th step. |
k < 0.0 |
Variable was deleted from the model during the (−k)-th step. |
IMSLS_COV_SWEPT_USER, float *covs
(Output)
User-allocated array of length (n_candidate + 1) × (n_candidate + 1)
that results after cov has been swept on
the columns corresponding to the variables in the model. The estimated
variance-covariance matrix of the estimated regression coefficients in the final
model can be obtained by extracting the rows and columns of covs corresponding to
the independent variables in the final model and multiplying the elements of
this matrix by anova_table[7].
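The extraction described above can be sketched as follows. The sketch assumes covs is stored row by row with dimension n_candidate + 1, that swept marks the variables in the model with 1, and that the caller passes anova_table[7] as the error mean square. The function name coef_covariance is hypothetical.

```c
#include <assert.h>

/* Copy the rows and columns of covs that belong to variables in the model
 * (swept[i] == 1, i < n_candidate) into out[], scaled by the error mean
 * square. out[] is an m x m row-major matrix, where m is the return
 * value; it must have room for n_candidate * n_candidate entries. */
int coef_covariance(int n_candidate, const float covs[], const int swept[],
                    float error_mean_square, float out[])
{
    int dim = n_candidate + 1;
    int i, j, r, c, m = 0;

    assert(n_candidate >= 0);
    for (i = 0; i < n_candidate; i++)
        if (swept[i] == 1)
            m++;

    r = 0;
    for (i = 0; i < n_candidate; i++) {
        if (swept[i] != 1)
            continue;
        c = 0;
        for (j = 0; j < n_candidate; j++) {
            if (swept[j] != 1)
                continue;
            out[r * m + c] = covs[i * dim + j] * error_mean_square;
            c++;
        }
        r++;
    }
    return m;
}
```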
IMSLS_INPUT_COV, int n_observations
float
*cov (Input)
An (n_candidate + 1) by
(n_candidate +
1) array containing a variance-covariance or sum of squares and crossproducts
matrix, in which the last column must correspond to the dependent variable.
Argument n_observations is an
integer specifying the number of observations associated with cov. Argument cov can be computed
using imsls_f_covariances.
Arguments x,
y, weights, and frequencies are not
accessed when this option is specified.
By default, imsls_f_regression_stepwise computes cov from the input data matrices x and y.
Function imsls_f_regression_stepwise builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. The function is designed so that the user can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, multiple calls to imsls_f_regression_stepwise are made, using optional arguments IMSLS_FIRST_STEP, IMSLS_INTERMEDIATE_STEP, ..., IMSLS_LAST_STEP. Alternatively, imsls_f_regression_stepwise can be invoked once (the default, or with optional argument IMSLS_ALL_STEPS) to perform the stepping until a final model is selected.
Levels of priority can be assigned to the candidate independent variables (use optional argument IMSLS_LEVEL). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see optional argument IMSLS_FORCE). Note that specifying optional argument IMSLS_FORCE without also specifying optional argument IMSLS_LEVEL will result in all variables being forced into the model.
Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is required. Other possibilities are as follows:
1. The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in cov (see optional argument IMSLS_INPUT_COV). Argument n_observations must be set to one greater than the number of observations.
2. An intercept is a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent variables, and dependent variable is required for cov. In this case, cov contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in cov are the same as in the previous case. Argument n_observations must be set to one greater than the number of observations.
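For the typical case, in which the intercept is forced into the model, the mean-corrected sum-of-squares and crossproducts matrix can be formed directly from the data. The sketch below is an unweighted illustration that ignores weights and frequencies and assumes x is stored row by row with n_candidate columns; imsls_f_covariances is the documented way to obtain cov. All function names in the sketch are hypothetical.

```c
#include <assert.h>

/* Value in row i, column j of the augmented data [x | y]. */
double column_value(int n_candidate, const float x[], const float y[],
                    int i, int j)
{
    return (j < n_candidate) ? x[i * n_candidate + j] : y[i];
}

/* Mean of column j of the augmented data [x | y]. */
double column_mean(int n_rows, int n_candidate, const float x[],
                   const float y[], int j)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n_rows; i++)
        sum += column_value(n_candidate, x, y, i, j);
    return sum / n_rows;
}

/* Fill cov, a row-major (n_candidate + 1) x (n_candidate + 1) matrix,
 * with mean-corrected sums of squares and crossproducts. The last row
 * and column correspond to the dependent variable y. */
void corrected_sscp(int n_rows, int n_candidate, const float x[],
                    const float y[], float cov[])
{
    int dim = n_candidate + 1;
    int i, j, k;

    assert(n_rows > 0 && n_candidate >= 0);
    for (j = 0; j < dim; j++) {
        double mj = column_mean(n_rows, n_candidate, x, y, j);
        for (k = j; k < dim; k++) {
            double mk = column_mean(n_rows, n_candidate, x, y, k);
            double s = 0.0;
            for (i = 0; i < n_rows; i++)
                s += (column_value(n_candidate, x, y, i, j) - mj)
                   * (column_value(n_candidate, x, y, i, k) - mk);
            cov[j * dim + k] = (float)s;
            cov[k * dim + j] = (float)s;  /* the matrix is symmetric */
        }
    }
}
```

Dropping the subtraction of the means gives the raw (uncorrected) matrix used in the two cases listed above.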
The stepwise regression algorithm is due to Efroymson (1960). Function imsls_f_regression_stepwise uses sweeps of the covariance matrix (input in cov, if optional argument IMSLS_INPUT_COV is specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm is also given by Kennedy and Gentle (1980, pp. 335−340). The advantage of stepwise model building over all possible regressions (see function imsls_f_regression_selection) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R2) for any subset size of independent variables.
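One common form of the sweep operator is sketched below to show how a single pivot moves a variable into the model. It is an illustration only: the library's internal storage, sign conventions, and tolerance checks may differ, and the function name sweep is hypothetical. When the matrix is a sum-of-squares and crossproducts matrix with the dependent variable last, sweeping on the pivots of the model variables leaves the coefficient estimates in the last column and the error sum of squares in the bottom-right element.

```c
#include <assert.h>

/* Sweep the row-major dim x dim matrix a on pivot k, in place.
 * Sketch of one textbook variant; no linear-dependence tolerance check. */
void sweep(int dim, double a[], int k)
{
    double d;
    int i, j;

    assert(k >= 0 && k < dim);
    d = a[k * dim + k];
    assert(d != 0.0);

    /* Scale the pivot row. */
    for (j = 0; j < dim; j++)
        a[k * dim + j] /= d;

    /* Eliminate the pivot column from every other row. */
    for (i = 0; i < dim; i++) {
        double b;
        if (i == k)
            continue;
        b = a[i * dim + k];
        for (j = 0; j < dim; j++)
            a[i * dim + j] -= b * a[k * dim + j];
        a[i * dim + k] = -b / d;
    }

    /* The pivot element becomes its reciprocal. */
    a[k * dim + k] = 1.0 / d;
}
```

For one regressor with sum of squares 4, crossproduct 2 with the response, and response sum of squares 3, sweeping on pivot 0 yields the slope 0.5 in the top-right element and the error sum of squares 2 in the bottom-right element.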
This example uses a data set from Draper and Smith (1981, pp. 629−630). Backward stepping is performed by default.
#include <imsls.h>

#define N_OBSERVATIONS 13
#define N_CANDIDATE 4

int main()
{
    /* Row labels for the 13 entries of the analysis of variance table. */
    char *labels[] = {
        "degrees of freedom for regression",
        "degrees of freedom for error",
        "total degrees of freedom",
        "sum of squares for regression",
        "sum of squares for error",
        "total sum of squares",
        "regression mean square",
        "error mean square",
        "F-statistic",
        "p-value",
        "R-squared (in percent)",
        "adjusted R-squared (in percent)",
        "est. standard deviation of within error"
    };
    /* Column labels; the first entry heads the row-number column. */
    char *c_labels[] = {
        "variable",
        "estimate",
        "s.e.",
        "t",
        "prob > t"
    };
    float *aov, *tt;
    /* Candidate variables, one observation per row. */
    float x[N_OBSERVATIONS][N_CANDIDATE] =
        {7., 26., 6., 60.,
         1., 29., 15., 52.,
         11., 56., 8., 20.,
         11., 31., 8., 47.,
         7., 52., 6., 33.,
         11., 55., 9., 22.,
         3., 71., 17., 6.,
         1., 31., 22., 44.,
         2., 54., 18., 22.,
         21., 47., 4., 26.,
         1., 40., 23., 34.,
         11., 66., 9., 12.,
         10., 68., 8., 12.};
    float y[N_OBSERVATIONS] = {78.5, 74.3, 104.3, 87.6, 95.9,
        109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4};

    /* x is a two-dimensional array, so pass a pointer to its first element. */
    imsls_f_regression_stepwise(N_OBSERVATIONS, N_CANDIDATE, (float *) x, y,
        IMSLS_ANOVA_TABLE, &aov,
        IMSLS_COEF_T_TESTS, &tt,
        0);

    imsls_f_write_matrix("* * * Analysis of Variance * * *\n",
        13, 1, aov,
        IMSLS_ROW_LABELS, labels,
        IMSLS_WRITE_FORMAT, "%9.2f",
        0);
    imsls_f_write_matrix("* * * Inference on Coefficients * * *\n",
        4, 4, tt,
        IMSLS_COL_LABELS, c_labels,
        IMSLS_WRITE_FORMAT, "%9.2f",
        0);
    return 0;
}
* * * Analysis of Variance * * *
degrees of freedom for regression             2.00
degrees of freedom for error                 10.00
total degrees of freedom                     12.00
sum of squares for regression              2657.86
sum of squares for error                     57.90
total sum of squares                       2715.76
regression mean square                     1328.93
error mean square                             5.79
F-statistic                                 229.50
p-value                                       0.00
R-squared (in percent)                       97.87
adjusted R-squared (in percent)              97.44
est. standard deviation of within error       2.41

* * * Inference on Coefficients * * *
variable   estimate       s.e.          t   prob > t
       1       1.47       0.12      12.10       0.00
       2       0.66       0.05      14.44       0.00
       3       0.25       0.18       1.35       0.21
       4      -0.24       0.17      -1.36       0.21
IMSLS_LINEAR_DEPENDENCE_1 Based on “tolerance” = #, there are linear dependencies among the variables to be forced.
IMSLS_NO_VARIABLES_ENTERED No variables entered the model. All elements of “anova_table” are set to NaN.