Class SelectionRegression
- All Implemented Interfaces:
Serializable, Cloneable
Class SelectionRegression finds the best subset regressions
for a regression problem with three or more independent variables.
Typically, the intercept is forced into all models and is not a candidate
variable. In this case, a sum of squares and crossproducts matrix for the
independent and dependent variables corrected for the mean is computed
internally. Optionally, SelectionRegression supports
user-calculated sum-of-squares and crossproducts matrices; see the
description of the compute method.
"Best" is defined by using one of the following three criteria:
- \(R^2\) (in percent) $$R^2 = 100\left(1-\frac{{\mbox{SSE}}_p}{\mbox{SST}}\right)$$
- \(R^2_a\) (adjusted \(R^2\)) $$R^2_a = 100\left[1-\left(\frac{n-1}{n-p}\right)\frac{{\mbox{SSE}}_p}{\mbox{SST}}\right]$$ Note that maximizing \(R^2_a\) is equivalent to minimizing the residual mean squared error: $$\frac{{\mbox{SSE}}_p}{(n-p)}$$
- Mallow's \(C_p\) statistic $$C_p=\frac{{\mbox{SSE}}_p}{s^2_k}+2p-n$$
Here, n is equal to the sum of the frequencies (or the number of
rows in x if frequencies are not specified in the
compute method), and \(\mbox{SST}\) is the
total sum of squares. k is the number of candidate or independent
variables, represented as the nCandidate argument in the
SelectionRegression constructor. \({\mbox{SSE}}_p\)
is the error sum of squares in a model containing p regression
parameters including \(\beta_0\) (or p - 1 of the
k candidate variables). The quantity \(s^2_k\) is the error mean square from
the model containing all k candidate variables. Hocking (1972) and Draper and
Smith (1981, pp. 296-302) discuss these
criteria.
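For concreteness, here is a small hedged sketch that evaluates the three criteria directly from the formulas above; the numeric inputs (n, p, SSE_p, SST, and \(s^2_k\)) are hypothetical values chosen only for illustration and are not produced by this class.

    // Hypothetical values, only to illustrate the criterion formulas above.
    int n = 30;                 // sum of frequencies (number of observations)
    int p = 4;                  // regression parameters in the submodel, including the intercept
    double ssep = 12.5;         // SSE_p: error sum of squares for the submodel
    double sst = 80.0;          // SST: total sum of squares
    double s2k = 0.45;          // s^2_k: error mean square of the full k-variable model

    double rSquared    = 100.0 * (1.0 - ssep / sst);                          // R^2 in percent
    double adjRSquared = 100.0 * (1.0 - ((n - 1.0) / (n - p)) * ssep / sst);  // adjusted R^2
    double mallowsCp   = ssep / s2k + 2.0 * p - n;                            // Mallow's C_p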
Class SelectionRegression is based on the algorithm of
Furnival and Wilson (1974). For each possible subset size, the algorithm finds
up to the maximum number of good candidate regressions to be saved; for more
details, see method setMaximumGoodSaved(int). These regressions are used to
identify a set of best regressions. In large problems, many regressions are
never computed: they can be rejected without computation based on results for
other subsets, which yields an efficient technique for considering all
possible regressions.
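A minimal, hedged usage sketch follows, under the assumption that SelectionRegression and its nested Statistics class are imported from their containing package; the data values and the criterion choice are illustrative only, not part of this page.

    // Minimal sketch; data are hypothetical. The checked exceptions thrown by
    // compute are covered by the enclosing throws clause.
    static void bestSubsetExample() throws Exception {
        double[][] x = {                      // 6 observations, 3 candidate variables
            {1.0, 2.0, 1.5},
            {2.0, 1.0, 0.5},
            {3.0, 4.0, 2.5},
            {4.0, 3.0, 3.5},
            {5.0, 6.0, 4.5},
            {6.0, 5.0, 5.0}
        };
        double[] y = {2.1, 2.9, 5.2, 6.8, 9.1, 10.2};

        // nCandidate must equal the number of columns in x and be greater than 2.
        SelectionRegression sr = new SelectionRegression(x[0].length);
        sr.setCriterionOption(SelectionRegression.ADJUSTED_R_SQUARED_CRITERION);
        sr.compute(x, y);

        int nBest = sr.getNumberOfBestRegressions();
        SelectionRegression.Statistics stats = sr.getStatistics();
        // Coefficient statistics for each of the nBest models are available
        // through stats; see SelectionRegression.Statistics.getCoefficientStatistics(int).
    }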
There are cases when the user may want to input the variance-covariance
matrix rather than allow it to be calculated. This can be accomplished
using the appropriate compute method. Three situations in which
the user may want to do this are as follows:
- The intercept is not in the model. A raw (uncorrected) sum of
squares and crossproducts matrix for the independent and dependent
variables is required. Argument nObservations must be set to 1 greater
than the number of observations. Form \(A^TA\), where A = [X, Y] is the
matrix of independent variables augmented with the dependent variable, to
compute the raw sum of squares and crossproducts matrix (a short sketch
follows this list).
- An intercept is a candidate variable. A raw (uncorrected) sum of
squares and crossproducts matrix for the constant regressor (= 1.0),
independent, and dependent variables is required for cov. In this case,
cov contains one additional row and column corresponding to the constant
regressor. This row and column contain the sum of squares and
crossproducts of the constant regressor with the independent and
dependent variables. The remaining elements in cov are the same as in
the previous case. Argument nObservations must be set to 1 greater than
the number of observations.
- There are m variables that must be forced into the models. A
sum of squares and crossproducts matrix adjusted for the m
variables is required (calculated by regressing the candidate
variables on the variables to be forced into the model). Argument
nObservations must be set to m less than the number of observations.
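Continuing the first case above (no intercept), the following hedged sketch forms the raw sum of squares and crossproducts matrix \(A^TA\) for A = [X, y] and passes it to compute(double[][], int). The data values and dimensions are hypothetical and serve only to show the expected argument shapes.

    // Sketch only; assumes SelectionRegression is imported from its containing package.
    static void noInterceptExample() throws Exception {
        double[][] x = {            // 5 observations, 3 candidate variables, no intercept
            {1.0, 2.0, 0.5},
            {2.0, 1.5, 1.0},
            {3.0, 0.5, 2.0},
            {4.0, 2.5, 1.5},
            {5.0, 3.0, 2.5}
        };
        double[] y = {1.1, 2.3, 3.0, 4.2, 5.1};

        int n = x.length;           // number of observations
        int k = x[0].length;        // number of candidate variables

        // Form A = [X, y] and then the raw crossproducts matrix A^T A.
        double[][] a = new double[n][k + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(x[i], 0, a[i], 0, k);
            a[i][k] = y[i];
        }
        double[][] cov = new double[k + 1][k + 1];
        for (int i = 0; i <= k; i++) {
            for (int j = 0; j <= k; j++) {
                double s = 0.0;
                for (int r = 0; r < n; r++) {
                    s += a[r][i] * a[r][j];
                }
                cov[i][j] = s;
            }
        }

        // No intercept: nObservations must be 1 greater than the number of observations.
        SelectionRegression sr = new SelectionRegression(k);
        sr.compute(cov, n + 1);
    }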
Programming Notes
SelectionRegression can save considerable CPU time over
explicitly computing all possible regressions. However, the class has some
limitations that can cause unexpected results for users who are unaware of
them.
- For \(k+1 > -\log_2(\epsilon)\), where \(\epsilon\) is the largest
relative spacing for double precision, some results can be incorrect.
This limitation arises because the possible models indicated (the model
numbers 1, 2, ..., \(2^k\)) are stored as floating-point values; for
sufficiently large k, the model numbers cannot be stored exactly. On
many computers, this means SelectionRegression (for \(k > 49\)) can
produce incorrect results.
- SelectionRegression eliminates some subsets of candidate variables by
obtaining lower bounds on the error sum of squares from fitting larger
models. First, the full model containing all independent variables is
fit sequentially using a forward stepwise procedure in which one
variable enters the model at a time, and criterion values and model
numbers for all the candidate variables that can enter at each step are
stored. If linearly dependent variables are removed from the full model,
a "VariablesDeleted" warning is issued. In this case, some submodels
that contain variables removed from the full model because of linear
dependency can be overlooked if they have not already been identified
during the initial forward stepwise procedure. If this warning is issued
and you want the variables that were removed from the full model to be
considered in smaller models, you can rerun the program with a set of
linearly independent variables.
-
Nested Class Summary
Nested Classes:
- static class SelectionRegression.NoVariablesException: No variables can enter the model.
- class SelectionRegression.Statistics: Statistics contains statistics related to the regression coefficients.
-
Field Summary
Fields:
- static final int ADJUSTED_R_SQUARED_CRITERION: Indicates \(R^2_a\) (adjusted \(R^2\)) criterion regression.
- static final int MALLOWS_CP_CRITERION: Indicates Mallow's \(C_p\) criterion regression.
- static final int R_SQUARED_CRITERION: Indicates \(R^2\) criterion regression.
-
Constructor Summary
Constructors:
- SelectionRegression(int nCandidate): Constructs a new SelectionRegression object.
-
Method Summary
Methods:
- void compute(double[][] x, double[] y): Computes the best multiple linear regression models.
- void compute(double[][] x, double[] y, double[] weights): Computes the best weighted multiple linear regression models.
- void compute(double[][] x, double[] y, double[] weights, double[] frequencies): Computes the best weighted multiple linear regression models using frequencies for each observation.
- void compute(double[][] cov, int nObservations): Computes the best multiple linear regression models using a user-supplied covariance matrix.
- int getCriterionOption(): Returns the criterion option used to calculate the regression estimates.
- int getNumberOfBestRegressions(): Returns the number of best regression models computed.
- Statistics getStatistics(): Returns a new Statistics object.
- void setCriterionOption(int criterionOption): Sets the criterion to be used.
- void setMaximumBestFound(int maxFound): Sets the maximum number of best regressions to be found.
- void setMaximumGoodSaved(int maxSaved): Sets the maximum number of good regressions for each subset size saved.
- void setMaximumSubsetSize(int maxSubset): Sets the maximum subset size if the \(R^2\) criterion is used.
-
Field Details
-
R_SQUARED_CRITERION
public static final int R_SQUARED_CRITERION
Indicates \(R^2\) criterion regression.
-
ADJUSTED_R_SQUARED_CRITERION
public static final int ADJUSTED_R_SQUARED_CRITERION
Indicates \(R^2_a\) (adjusted \(R^2\)) criterion regression.
-
MALLOWS_CP_CRITERION
public static final int MALLOWS_CP_CRITERION
Indicates Mallow's \(C_p\) criterion regression.
-
-
Constructor Details
-
SelectionRegression
public SelectionRegression(int nCandidate)
Constructs a new SelectionRegression object.
- Parameters:
nCandidate - An int containing the number of candidate variables (independent variables). nCandidate must be greater than 2.
-
-
Method Details
-
getStatistics
public Statistics getStatistics()
Returns a new Statistics object.
- Returns:
A Statistics object containing the coefficient statistics.
-
setCriterionOption
public void setCriterionOption(int criterionOption)
Sets the criterion to be used. By default, for all criteria, subset sizes 1, 2, ..., k = nCandidate are considered. However, for the \(R^2\) criterion the maximum subset size can be restricted to maxSubset with the setMaximumSubsetSize(int) method.
Criterion option descriptions:
- R_SQUARED_CRITERION: For \(R^2\), subset sizes 1, 2, ..., maxSubset are examined. This is the default, with maxSubset = nCandidate.
- ADJUSTED_R_SQUARED_CRITERION: For adjusted \(R^2\), subset sizes 1, 2, ..., nCandidate are examined.
- MALLOWS_CP_CRITERION: For Mallow's \(C_p\), subset sizes 1, 2, ..., nCandidate are examined.
- Parameters:
criterionOption - An int containing the criterion option used for the best subset regression selection.
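For instance, a minimal hedged sketch of selecting the adjusted \(R^2\) criterion before calling compute; the number of candidate variables is hypothetical.

    // Illustrative only: select the adjusted R^2 criterion instead of the
    // default R^2 criterion before calling compute.
    SelectionRegression sr = new SelectionRegression(5);   // 5 candidate variables (hypothetical)
    sr.setCriterionOption(SelectionRegression.ADJUSTED_R_SQUARED_CRITERION);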
-
getCriterionOption
public int getCriterionOption()
Returns the criterion option used to calculate the regression estimates.
- Returns:
An int containing the criterion option.
-
getNumberOfBestRegressions
public int getNumberOfBestRegressions()
Returns the number of best regression models computed.
Depending on the criterion used, the number is usually equal to the number defined in SelectionRegression.Statistics.getCoefficientStatistics(int), but it can be smaller if one of the compute methods issues a warning.
- Returns:
An int containing the number of best models identified.
-
setMaximumSubsetSize
public void setMaximumSubsetSize(int maxSubset)
Sets the maximum subset size if the \(R^2\) criterion is used.
- Parameters:
maxSubset - An int containing the maximum subset size when the \(R^2\) criterion is used. Default: maxSubset = nCandidate.
-
setMaximumBestFound
public void setMaximumBestFound(int maxFound)
Sets the maximum number of best regressions to be found.
If the \(R^2\) criterion option is selected, the maxFound best regressions for each subset size examined are reported. If the adjusted \(R^2\) or Mallow's \(C_p\) criterion is selected, the maxFound best regressions among all possible regressions are found.
- Parameters:
maxFound - An int containing the maximum number of best regressions to be reported. Default: maxFound = 1.
-
setMaximumGoodSaved
public void setMaximumGoodSaved(int maxSaved)
Sets the maximum number of good regressions saved for each subset size.
Argument maxSaved must be greater than or equal to maxFound. Normally, maxSaved should be less than or equal to 10. It need never be larger than maxSubset, the maximum number of subsets for any subset size. Computing time required is inversely related to maxSaved.
- Parameters:
maxSaved - An int containing the maximum number of good regressions saved for each subset size. Default: maxSaved = maximum(10, maxSubset).
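A short hedged sketch combining this setting with setMaximumBestFound; the values are arbitrary and chosen only for illustration.

    // Illustrative values only: report the 3 best regressions and save up to
    // 10 good regressions per subset size (maxSaved must be >= maxFound).
    SelectionRegression sr = new SelectionRegression(8);   // 8 candidate variables (hypothetical)
    sr.setMaximumBestFound(3);
    sr.setMaximumGoodSaved(10);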
-
compute
public void compute(double[][] x, double[] y) throws SelectionRegression.NoVariablesException, Covariances.TooManyObsDeletedException, Covariances.MoreObsDelThanEnteredException, Covariances.DiffObsDeletedException
Computes the best multiple linear regression models.
- Parameters:
x - A double matrix containing the observations of the candidate (independent) variables. The number of columns in x must be equal to the number of variables set in the constructor.
y - A double array containing the observations of the dependent variable.
- Throws:
SelectionRegression.NoVariablesException - if no variables can enter any model
Covariances.TooManyObsDeletedException - more observations have been deleted than were originally entered
Covariances.MoreObsDelThanEnteredException - more observations are being deleted from the output covariance matrix than were originally entered
Covariances.DiffObsDeletedException - different observations are being deleted from the return matrix than were originally entered
-
compute
public void compute(double[][] x, double[] y, double[] weights) throws SelectionRegression.NoVariablesException, Covariances.NonnegativeWeightException, Covariances.TooManyObsDeletedException, Covariances.MoreObsDelThanEnteredException, Covariances.DiffObsDeletedException
Computes the best weighted multiple linear regression models.
- Parameters:
x - A double matrix containing the observations of the candidate (independent) variables. The number of columns in x must be equal to the number of variables set in the constructor.
y - A double array containing the observations of the dependent variable.
weights - A double array containing the weight for each of the observations.
- Throws:
SelectionRegression.NoVariablesException - if no variables can enter any model
Covariances.NonnegativeWeightException - weights must be nonnegative
Covariances.TooManyObsDeletedException - more observations have been deleted than were originally entered
Covariances.MoreObsDelThanEnteredException - more observations are being deleted from the output covariance matrix than were originally entered
Covariances.DiffObsDeletedException - different observations are being deleted from the return matrix than were originally entered
-
compute
public void compute(double[][] x, double[] y, double[] weights, double[] frequencies) throws SelectionRegression.NoVariablesException, Covariances.NonnegativeFreqException, Covariances.NonnegativeWeightException, Covariances.TooManyObsDeletedException, Covariances.MoreObsDelThanEnteredException, Covariances.DiffObsDeletedException
Computes the best weighted multiple linear regression models using frequencies for each observation.
- Parameters:
x - A double matrix containing the observations of the candidate (independent) variables. The number of columns in x must be equal to the number of variables set in the constructor.
y - A double array containing the observations of the dependent variable.
weights - A double array containing the weight for each of the observations.
frequencies - A double array containing the frequency for each of the observations of x.
- Throws:
SelectionRegression.NoVariablesException - if no variables can enter any model
Covariances.NonnegativeFreqException - frequencies must be nonnegative
Covariances.NonnegativeWeightException - weights must be nonnegative
Covariances.TooManyObsDeletedException - more observations have been deleted than were originally entered
Covariances.MoreObsDelThanEnteredException - more observations are being deleted from the output covariance matrix than were originally entered
Covariances.DiffObsDeletedException - different observations are being deleted from the return matrix than were originally entered
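A hedged sketch of this call with weights and frequencies; all data values below are hypothetical and serve only to show the expected argument shapes.

    // Illustrative only: 4 observations of 3 candidate variables, with
    // nonnegative per-observation weights and frequencies.
    static void weightedExample() throws Exception {
        double[][] x = {
            {1.0, 0.5, 2.0},
            {2.0, 1.5, 1.0},
            {3.0, 2.5, 0.5},
            {4.0, 3.5, 1.5}
        };
        double[] y = {1.2, 2.1, 2.9, 4.3};
        double[] weights = {1.0, 1.0, 0.5, 2.0};
        double[] frequencies = {1.0, 2.0, 1.0, 3.0};

        SelectionRegression sr = new SelectionRegression(x[0].length);
        sr.compute(x, y, weights, frequencies);
    }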
-
compute
public void compute(double[][] cov, int nObservations) throws SelectionRegression.NoVariablesException
Computes the best multiple linear regression models using a user-supplied covariance matrix.
- Parameters:
cov - A double matrix containing a variance-covariance or sum of squares and crossproducts matrix, in which the last column must correspond to the dependent variable. cov can be computed using the Covariances class.
nObservations - An int containing the number of observations used to compute cov.
- Throws:
SelectionRegression.NoVariablesException - if no variables can enter any model
-