Usage Notes¶
The regression models in this chapter include the simple and multiple linear regression models, the multivariate general linear model, the polynomial model, and the nonlinear regression model. Functions for fitting regression models, computing summary statistics from a fitted regression, computing diagnostics, and computing confidence intervals for individual cases are provided. This chapter also provides methods for building a model from a set of candidate variables.
Simple and Multiple Linear Regression¶
The simple linear regression model is

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad i = 1, 2, \ldots, n\]
where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable, the \(x_i\)’s are the settings of the independent (explanatory) variable, \(\beta_0\) and \(\beta_1\) are the intercept and slope parameters (respectively) and the \(\varepsilon_i\)’s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2\).
The multiple linear regression model is

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i \quad i = 1, 2, \ldots, n\]
where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable; the \(x_{i1}\)’s, \(x_{i2}\)’s, …, \(x_{ik}\)’s are the settings of the k independent (explanatory) variables; \(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression coefficients; and the \(\varepsilon_i\)’s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2\).
Function regression fits both the simple and multiple linear regression models using a fast Givens transformation and includes an option for excluding the intercept \(\beta_0\). The responses are input in array y, and the independent variables are input in array x, where the individual cases correspond to the rows and the variables correspond to the columns.
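As an illustration of the data layout and the least-squares fit this describes, the following NumPy sketch (illustrative data, not the library routine itself) fits a multiple linear regression with an intercept:

```python
import numpy as np

# Hypothetical data: 5 cases (rows) and 2 independent variables (columns of x).
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 7.0, 13.0, 14.0, 18.0])

# Augment x with a column of ones so beta_0 (the intercept) is estimated.
X = np.column_stack([np.ones(len(x)), x])

# Least-squares estimates of (beta_0, beta_1, beta_2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef
residuals = y - fitted
```

The residuals of a least-squares fit are orthogonal to every column of the regressor matrix, which is a quick sanity check on the solution.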
After the model has been fitted using regression, function regressionSummary computes summary statistics and regressionPrediction computes predicted values, confidence intervals, and case statistics for the fitted model. The information about the fit is communicated from regression to regressionSummary and regressionPrediction by passing an argument of type structure.
No Intercept Model¶
Several functions provide the option for excluding the intercept from a model. In most practical applications, the intercept should be included in the model. For functions that use the sums of squares and crossproducts matrix as input, the no-intercept case can be handled by using the raw sums of squares and crossproducts matrix as input in place of the corrected sums of squares and crossproducts. The raw sums of squares and crossproducts matrix can be computed as \((x_1,x_2,\ldots,x_k,y)^T (x_1,x_2,\ldots,x_k,y)\).
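The raw and corrected sums of squares and crossproducts matrices can be sketched with NumPy on hypothetical data (the raw matrix is simply the crossproduct of the data with itself):

```python
import numpy as np

# Columns: x1, x2, and the response y; one row per case (hypothetical values).
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 3.0, 7.0],
                 [4.0, 5.0, 9.0]])

# Raw sums of squares and crossproducts: (x1, x2, y)^T (x1, x2, y).
raw_sscp = data.T @ data

# The corrected SSCP subtracts the column means first (intercept model).
centered = data - data.mean(axis=0)
corrected_sscp = centered.T @ centered
```

The two matrices are related by the identity corrected = raw − n · (mean)(mean)ᵀ, which is why supplying the raw matrix in place of the corrected one handles the no-intercept case.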
Variable Selection¶
Variable selection can be performed by regressionSelection, which computes all best-subset regressions, or by regressionStepwise, which computes stepwise regression. The method used by regressionSelection is generally preferred over that used by regressionStepwise because regressionSelection implicitly examines all possible models in the search for a model that optimizes some criterion, while stepwise regression does not examine all possible models. However, the computer time and memory requirements for regressionSelection can be much greater than those for regressionStepwise when the number of candidate variables is large.
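The exhaustive search that best-subset selection performs can be sketched in plain NumPy (hypothetical data; not the library's branch-and-bound algorithm), here optimizing adjusted R-squared over every non-empty subset of four candidate variables:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 4
x = rng.normal(size=(n, k))
# The response depends only on columns 0 and 2 (plus small noise).
y = 1.0 + 2.0 * x[:, 0] - 1.5 * x[:, 2] + rng.normal(scale=0.1, size=n)

def rss(cols):
    """Residual sum of squares for the intercept model on the given columns."""
    X = np.column_stack([np.ones(n)] + [x[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r)

# Examine every non-empty subset; keep the one with the best adjusted R-squared.
sst = float(((y - y.mean()) ** 2).sum())
best = None
for size in range(1, k + 1):
    for cols in itertools.combinations(range(k), size):
        adj_r2 = 1.0 - (rss(cols) / (n - size - 1)) / (sst / (n - 1))
        if best is None or adj_r2 > best[0]:
            best = (adj_r2, cols)
```

With k candidates the search visits 2ᵏ − 1 subsets, which is why the time and memory cost grows so quickly with the number of candidate variables.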
Polynomial Model¶
The polynomial model is

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_k x_i^k + \varepsilon_i \quad i = 1, 2, \ldots, n\]
where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable; the \(x_i\)’s are the settings of the independent (explanatory) variable; \(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression coefficients; and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean 0 and variance \(\sigma^2\).
Function polyRegression fits a polynomial regression model with the option of determining the degree of the model and also produces summary information. Function polyPrediction computes predicted values, confidence intervals, and case statistics for the model fit by polyRegression. The information about the fit is communicated from polyRegression to polyPrediction by passing an argument of the structure type returned by polyRegression.
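The underlying computation can be illustrated with NumPy's generic polynomial fitting (hypothetical data; not polyRegression itself):

```python
import numpy as np

# Hypothetical data generated exactly from a quadratic, for clarity.
x = np.linspace(0.0, 4.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x**2

# Fit a degree-2 polynomial by least squares.
# NumPy returns coefficients with the highest degree first.
coef = np.polyfit(x, y, deg=2)
fitted = np.polyval(coef, x)
```

Because the data are exactly quadratic, the fit recovers the generating coefficients and the fitted values reproduce y.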
Specification of X for the General Linear Model¶
Variables used in the general linear model are either continuous or classification variables. Typically, multiple regression models use continuous variables, whereas analysis of variance models use classification variables. Although the notation used to specify analysis of variance models and multiple regression models may look quite different, the models are essentially the same. The term “general linear model” emphasizes that a common notational scheme is used for specifying a model that may contain both continuous and classification variables.
A general linear model is specified by its effects (sources of variation). An effect is referred to in this text as a single variable or a product of variables. (The term “effect” is often used in a narrower sense, referring only to a single regression coefficient.) In particular, an “effect” is composed of one of the following:
- A single continuous variable.
- A single classification variable.
- Several different classification variables.
- Several continuous variables, some of which may be the same.
- Continuous variables, some of which may be the same, and classification variables, which must be distinct.
Effects of the first type are common in multiple regression models. Effects of the second type appear as main effects in analysis of variance models. Effects of the third type appear as interactions in analysis of variance models. Effects of the fourth type appear in polynomial models and response surface models as powers and crossproducts of some basic variables. Effects of the fifth type appear in one-way analysis of covariance models as regression coefficients that indicate lack of parallelism of a regression function across the groups.
The analysis of a general linear model occurs in two stages. The first stage calls function regressorsForGlm to specify all regressors except the intercept. The second stage calls regression, at which point the model will be specified as either having (default) or not having an intercept.
For this discussion, define a variable INTCEP as follows:

Option | INTCEP | Action
---|---|---
noIntercept | 0 | An intercept is not in the model.
intercept (default) | 1 | An intercept is in the model.
The remaining variables (nContinuous, nClass, xClassColumns, nEffects, nVarEffects, and indicesEffects) are defined for function regressorsForGlm. All these variables have defaults except for nContinuous and nClass, both of which must be specified. (See the documentation for regressorsForGlm for a discussion of the defaults.) The meaning of each of these arguments is as follows:
nContinuous (Input)
- Number of continuous variables.

nClass (Input)
- Number of classification variables.

xClassColumns (Input)
- Index vector of length nClass containing the column numbers of x that are the classification variables.

nEffects (Input)
- Number of effects (sources of variation) in the model, excluding error.

nVarEffects (Input)
- Vector of length nEffects containing the number of variables associated with each effect in the model.

indicesEffects (Input)
- Index vector of length nVarEffects[0] + nVarEffects[1] + … + nVarEffects[nEffects − 1]. The first nVarEffects[0] elements give the column numbers of x for each variable in the first effect; the next nVarEffects[1] elements give the column numbers for each variable in the second effect; and finally, the last nVarEffects[nEffects − 1] elements give the column numbers for each variable in the last effect.
Suppose the data matrix has as its first four columns two continuous variables in Columns 0 and 1 and two classification variables in Columns 2 and 3. The data might appear as follows:
Column 0 | Column 1 | Column 2 | Column 3 |
---|---|---|---|
11.23 | 1.23 | 1.0 | 5.0 |
12.12 | 2.34 | 1.0 | 4.0 |
12.34 | 1.23 | 1.0 | 4.0 |
4.34 | 2.21 | 1.0 | 5.0 |
5.67 | 4.31 | 2.0 | 4.0 |
4.12 | 5.34 | 2.0 | 1.0 |
4.89 | 9.31 | 2.0 | 1.0 |
9.12 | 3.71 | 2.0 | 1.0 |
Each distinct value of a classification variable determines a level. The classification variable in Column 2 has two levels. The classification variable in Column 3 has three levels. (Integer values are recommended, but not required, for values of the classification variables. The values of the classification variables corresponding to the same level must be identical.) Some examples of regression functions and their specifications are as follows:
Model | INTCEP | nClass | xClassColumns |
---|---|---|---|
\(\beta_0+\beta_1x_1\) | 1 | 0 | |
\(\beta_0+\beta_1x_1+\beta_2x_1^2\) | 1 | 0 | |
\(\mu+\alpha_i\) | 1 | 1 | 2 |
\(\mu+\alpha_i+\beta_j+\gamma_{ij}\) | 1 | 2 | 2, 3 |
\(\mu_{ij}\) | 0 | 2 | 2, 3 |
\(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\) | 1 | 0 | |
\(\mu+\alpha_i+\beta x_{1i}+\beta_ix_{1i}\) | 1 | 1 | 2 |
Model | nEffects | nVarEffects | indicesEffects |
---|---|---|---|
\(\beta_0+\beta_1x_1\) | 1 | 1 | 0 |
\(\beta_0+\beta_1x_1+\beta_2x_1^2\) | 2 | 1, 2 | 0, 0, 0 |
\(\mu+\alpha_i\) | 1 | 1 | 2 |
\(\mu+\alpha_i+\beta_j+\gamma_{ij}\) | 3 | 1, 1, 2 | 2, 3, 2, 3 |
\(\mu_{ij}\) | 1 | 2 | 2, 3 |
\(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\) | 3 | 1, 1, 2 | 0, 1, 0, 1 |
\(\mu+\alpha_i+\beta x_{1i}+\beta_ix_{1i}\) | 3 | 1, 1, 2 | 2, 0, 0, 2 |
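As a sketch of how regressors arise from a classification variable (using NumPy directly rather than regressorsForGlm itself), each level of the variable contributes one indicator column:

```python
import numpy as np

# Column 2 of the example data: a classification variable with two levels.
class_var = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0])

# Each distinct value of the classification variable determines a level.
levels = np.unique(class_var)

# One indicator (dummy) column per level, the kind of regressor a
# main effect for this variable would generate.
indicators = (class_var[:, None] == levels[None, :]).astype(float)
```

Every case falls in exactly one level, so each row of the indicator matrix sums to 1; this is also why an intercept plus a full set of indicators is linearly dependent, which the non-full-rank handling described later in this chapter accommodates.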
Functions for Fitting the Model¶
Function regression fits a multivariate general linear model, where regressors for the general linear model have been generated using function regressorsForGlm.
Linear Dependence and the R Matrix¶
Linear dependence of the regressors frequently arises in regression models—sometimes by design and sometimes by accident. The functions in this chapter are designed to handle linear dependence of the regressors; i.e., the n × p matrix X (the matrix of regressors) in the general linear model can have rank less than p. Often, the models are referred to as non-full rank models.
As discussed in Searle (1971, Chapter 5), be careful to correctly use the results of the fitted non-full rank regression model for estimation and hypothesis testing. In the non-full rank case, not all linear combinations of the regression coefficients can be estimated. Those linear combinations that can be estimated are called “estimable functions.” If the functions are used to attempt to estimate linear combinations that cannot be estimated, error messages are issued. A good general discussion of estimable functions is given by Searle (1971, pp. 180–188).
The check used by functions in this chapter for linear dependence is sequential. The j‑th regressor is declared linearly dependent on the preceding j − 1 regressors if

\[\sqrt{1 - R_{j \cdot 1,2,\ldots,j-1}^2}\]

is less than or equal to tolerance. Here, \(R_{j \cdot 1,2,\ldots,j-1}\) is the multiple correlation coefficient of the j‑th regressor with the first j − 1 regressors. When a function declares the j‑th regressor to be linearly dependent on the first j − 1, the j‑th regression coefficient is set to 0. Essentially, this removes the j‑th regressor from the model.
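A minimal NumPy sketch of this sequential check (not the library's implementation) regresses each column on its predecessors and compares the square root of one minus the squared multiple correlation to the tolerance:

```python
import numpy as np

# Regressor matrix whose third column is an exact linear combination of
# the first two (col2 = col0 + col1), so the check should flag column 2.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0]])

tolerance = 100 * np.finfo(float).eps   # ~100 times machine epsilon
dependent = []

for j in range(1, X.shape[1]):
    prior, xj = X[:, :j], X[:, j]
    # Regress regressor j on the preceding j regressors.
    coef, *_ = np.linalg.lstsq(prior, xj, rcond=None)
    resid = xj - prior @ coef
    # Squared multiple correlation of regressor j with the prior ones.
    sst = float(((xj - xj.mean()) ** 2).sum())
    r2 = 1.0 - float(resid @ resid) / sst if sst > 0 else 1.0
    if np.sqrt(max(0.0, 1.0 - r2)) <= tolerance:
        dependent.append(j)   # declare a linear dependence; coefficient -> 0
```

For the exactly dependent third column the residual is zero to machine precision, so the test statistic falls below the tolerance and the column is flagged.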
The reason a sequential check is used is that practitioners frequently list first the variables they prefer to remain in the model. Also, the sequential check is based on many of the computations already performed, so it does not degrade the overall efficiency of the functions. There is no perfect test for linear dependence when finite-precision arithmetic is used. The optional argument tolerance allows the user some control over the check for linear dependence. If a model is known to be full rank, input tolerance = 0.0. Otherwise, tolerance should be input as approximately 100 times the machine epsilon. The machine epsilon is machine(4) in single precision. (See function machine in Chapter 15, Utilities.)
Functions performing least squares are based on QR decomposition of X or on a Cholesky factorization \(R^T R\) of \(X^T X\). Maindonald (1984, Chapters 1−5) discusses these methods extensively. The R matrix used by the regression function is a p × p upper-triangular matrix, i.e., all elements below the diagonal are 0. The signs of the diagonal elements of R are used as indicators of linearly dependent regressors and as indicators of parameter restrictions imposed by fitting a restricted model. The rows of R can be partitioned into three classes by the sign of the corresponding diagonal element:
- A positive diagonal element means the row corresponds to data.
- A negative diagonal element means the row corresponds to a linearly independent restriction imposed on the regression parameters by \(AB=Z\) in a restricted model.
- A zero diagonal element means a linear dependence of the regressors was declared. The regression coefficients in the corresponding row of \(\hat{B}\) are set to 0. This represents an arbitrary restriction that is imposed to obtain a solution for the regression coefficients. The elements of the corresponding row of R also are set to 0.
Nonlinear Regression Model¶
The nonlinear regression model is

\[y_i = f\left(x_i; \theta\right) + \varepsilon_i \quad i = 1, 2, \ldots, n\]
where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable, the \(x_i\)’s are the known vectors of values of the independent (explanatory) variables, f is a known function of an unknown regression parameter vector θ, and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean 0 and variance \(\sigma^2\).
Function nonlinearRegression performs the least-squares fit to the data for this model.
Weighted Least Squares¶
Functions throughout the chapter generally allow weights to be assigned to the observations. The vector weights is used throughout to specify the weighting for each row of X.

Computations that relate to statistical inference—e.g., t tests, F tests, and confidence intervals—are based on the multiple regression model except that the variance of \(\varepsilon_i\) is assumed to equal \(\sigma^2\) times the reciprocal of the corresponding weight.

If a single row of the data matrix corresponds to \(n_i\) observations, the vector frequencies can be used to specify the frequency for each row of X. Degrees of freedom for error are affected by frequencies but are unaffected by weights.
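Weighted least squares is equivalent to ordinary least squares after scaling each row of X and y by the square root of its weight; a NumPy sketch on hypothetical data:

```python
import numpy as np

# Hypothetical data with unequal case weights.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 3.9, 6.1, 8.0])
weights = np.array([1.0, 2.0, 2.0, 1.0])

# Scale each row by sqrt(w_i), then solve the ordinary LS problem.
sw = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Weighted residual sum of squares.
resid = y - X @ coef
sse = float(weights @ resid**2)
```

The resulting coefficients satisfy the weighted normal equations \(X^T W X \hat{\beta} = X^T W y\), which is the defining property of the weighted fit.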
Summary Statistics¶
Function regressionSummary can be used to compute and print statistics related to a regression for each of the q dependent variables fitted by regression. The summary statistics include the model analysis of variance table, sequential sums of squares and F-statistics, coefficient estimates, estimated standard errors, t-statistics, variance inflation factors, and estimated variance-covariance matrix of the estimated regression coefficients. Function polyRegression includes most of the same functionality for polynomial regressions.
The summary statistics are computed under the model \(y=X\beta+\varepsilon\), where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank \((X)=r\), β is the p × 1 vector of regression coefficients, and ɛ is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance \(\sigma^2/w_i\).
Given the results of a weighted least-squares fit of this model (with the \(w_i\)’s as the weights), most of the computed summary statistics are output in the following variables:
anovaTable
- One-dimensional array usually of length 15. In regressionStepwise, anovaTable is of length 13 because the last two elements of the array cannot be computed from the input. The array contains statistics related to the analysis of variance. The sources of variation examined are the regression, error, and total. The first 10 elements of anovaTable and the notation frequently used for them are described in the following table (here, AOV replaces anovaTable):

Model Analysis of Variance Table

Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | F | p-value
---|---|---|---|---|---
Regression | DFR = AOV[0] | SSR = AOV[3] | MSR = AOV[6] | AOV[8] | AOV[9]
Error | DFE = AOV[1] | SSE = AOV[4] | \(s^2\) = AOV[7] | |
Total | DFT = AOV[2] | SST = AOV[5] | | |

If the model has an intercept (default), the total sum of squares is the sum of squares of the deviations of \(y_i\) from its (weighted) mean \(\bar{y}\), the so-called corrected total sum of squares, denoted by the following:

\[\mathit{SST} = \sum_{i=1}^{n} w_i \left(y_i - \overline{y}\right)^2\]

If the model does not have an intercept (noIntercept), the total sum of squares is the sum of squares of \(y_i\), the so-called uncorrected total sum of squares, denoted by the following:

\[\mathit{SST} = \sum_{i=1}^{n} w_i y_i^2\]

The error sum of squares is given as follows:

\[\mathit{SSE} = \sum_{i=1}^{n} w_i \left(y_i - \hat{y}_i\right)^2\]

The error degrees of freedom is defined by \(\mathit{DFE} = n - r\).

The estimate of \(\sigma^2\) is given by \(s^2 = \mathit{SSE}/\mathit{DFE}\), which is the error mean square.

The computed F statistic for the null hypothesis, \(H_0 : \beta_1=\beta_2=\ldots=\beta_k=0\), versus the alternative that at least one coefficient is nonzero is given by \(F = \mathit{MSR}/s^2\). The p-value associated with the test is the probability of an F larger than that computed under the assumption of the model and the null hypothesis. A small p-value (less than 0.05) is customarily used to indicate there is sufficient evidence from the data to reject the null hypothesis. Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.

The remaining five elements in anovaTable frequently are displayed together with the actual analysis of variance table. The quantities R-squared (\(R^2\) = anovaTable[10]) and adjusted R-squared (\(R_a^2\) = anovaTable[11]) are expressed as a percentage and are defined as follows:

\[R^2 = 100\left(\mathit{SSR}/\mathit{SST}\right) = 100\left(1 - \mathit{SSE}/\mathit{SST}\right)\]

\[R_{a}^{2} = 100 \max \left\{ 0, 1 - \frac{s^2}{\mathit{SST} / \mathit{DFT}}\right\}\]

The square root of \(s^2\) (s = anovaTable[12]) is frequently referred to as the estimated standard deviation of the model error.

The overall mean of the responses \(\bar{y}\) is output in anovaTable[13].

The coefficient of variation (CV = anovaTable[14]) is expressed as a percentage and defined by \(CV = 100s/\bar{y}\).
coefTTests
- Two-dimensional matrix containing the regression coefficient vector β as one column and associated statistics (estimated standard error, t statistic, and p-value) in the remaining columns.

coefCovariances
- Estimated variance-covariance matrix of the estimated regression coefficients.
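The analysis-of-variance quantities above can be computed directly from a fitted model; the following NumPy sketch (hypothetical unweighted data, \(w_i = 1\)) reproduces the intercept-model definitions of SST, SSE, \(s^2\), \(R^2\), adjusted \(R^2\), and CV:

```python
import numpy as np

# Hypothetical unweighted fit: simple linear regression on 6 cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
X = np.column_stack([np.ones_like(x), x])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
n, r = len(y), 2                              # r = rank(X)

sst = float(((y - y.mean()) ** 2).sum())      # corrected total SS (intercept model)
sse = float(((y - fitted) ** 2).sum())        # error SS
ssr = sst - sse                               # regression SS
dfe = n - r                                   # error degrees of freedom
s2 = sse / dfe                                # error mean square, estimates sigma^2

r_squared = 100.0 * (1.0 - sse / sst)                        # percent
adj_r_squared = 100.0 * max(0.0, 1.0 - s2 / (sst / (n - 1))) # percent
cv = 100.0 * np.sqrt(s2) / y.mean()           # coefficient of variation
```

Since the adjustment penalizes the error mean square by its degrees of freedom, adjusted R-squared never exceeds R-squared for a model with more than one parameter.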
Tests for Lack-of-Fit¶
Tests for lack-of-fit are computed for the polynomial regression by function polyRegression. The output array ssqLof contains the lack-of-fit F tests for each degree polynomial 1, 2, …, k that is fit to the data. These tests are used to indicate the degree of the polynomial required to fit the data well.
Diagnostics for Individual Cases¶
Diagnostics for individual cases (observations) are computed by two functions: regressionPrediction for linear and nonlinear regressions and polyPrediction for polynomial regressions.
Statistics computed include predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.
The diagnostics are computed under the model \(y=X\beta+\varepsilon\), where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank \((X)=r\), β is the p × 1 vector of regression coefficients, and ɛ is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance \(\sigma^2/w_i\).
Given the results of a weighted least-squares fit of this model (with the \(w_i\)’s as the weights), the following five diagnostics are computed:
- Leverage
- Standardized residual
- Jackknife residual
- Cook’s distance
- DFFITS
The definition of these terms is given in the discussion that follows:
Let \(x_i\) be a column vector containing the elements of the i‑th row of X. A case can be unusual either because of \(x_i\) or because of the response \(y_i\). The leverage \(h_i\) is a measure of uniqueness of the \(x_i\). The leverage is defined by

\[h_i = \left(x_i^T \left(X^T W X\right)^- x_i\right) w_i\]
where \(W=\mathrm{diag}(w_1,w_2,\ldots,w_n)\) and \((X^T WX)^-\) denotes a generalized inverse of \(X^T WX\). The average value of the \(h_i\)’s is \(r/n\). Regression functions declare \(x_i\) unusual if \(h_i>2r/n\). Hoaglin and Welsch (1978) call a data point highly influential (i.e., a leverage point) when this occurs.
Let \(e_i = y_i - \hat{y}_i\) denote the residual for the i‑th case. The estimated variance of \(e_i\) is \((1-h_i)s^2/w_i\), where \(s^2\) is the residual mean square from the fitted regression. The i‑th standardized residual (also called the internally studentized residual) is by definition

\[r_i = e_i \sqrt{\frac{w_i}{s^2\left(1 - h_i\right)}}\]

and \(r_i\) follows an approximate standard normal distribution in large samples.
The i‑th jackknife residual or deleted residual involves the difference between \(y_i\) and its predicted value, based on the data set in which the i‑th case is deleted. This difference equals \(e_i/(1-h_i)\). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the i‑th case is deleted is as follows:

\[s_{-i}^2 = \frac{\left(n - r\right)s^2 - w_i e_i^2 / \left(1 - h_i\right)}{n - r - 1}\]

The jackknife residual is defined as

\[t_i = e_i \sqrt{\frac{w_i}{s_{-i}^2\left(1 - h_i\right)}}\]

and \(t_i\) follows a t distribution with n − r − 1 degrees of freedom.
Cook’s distance for the i‑th case is a measure of how much an individual case affects the estimated regression coefficients. It is given as follows:

\[D_i = \frac{w_i e_i^2 h_i}{r s^2 \left(1 - h_i\right)^2}\]
Weisberg (1985) states that if \(D_i\) exceeds the 50-th percentile of the \(F(r,n-r)\) distribution, it should be considered large. (This value is about 1. This statistic does not have an F distribution.)
DFFITS, like Cook’s distance, is also a measure of influence. For the i‑th case, DFFITS is computed by the formula below:

\[\mathit{DFFITS}_i = t_i \sqrt{\frac{h_i}{1 - h_i}}\]

Hoaglin and Welsch (1978) suggest that DFFITS greater than

\[2\sqrt{r/n}\]

is large.
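The five diagnostics can be computed directly with the standard textbook formulas; this NumPy sketch uses hypothetical unweighted data (\(w_i = 1\)) in which the last case is a leverage point:

```python
import numpy as np

# Hypothetical data; the last case (x = 10) sits far from the other x values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.0, 10.3])
X = np.column_stack([np.ones_like(x), x])
n, r = len(y), 2

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                         # residuals
dfe = n - r
s2 = float(e @ e) / dfe                  # residual mean square

# Leverage: diagonal of the hat matrix X (X^T X)^{-1} X^T.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Standardized (internally studentized) residuals.
r_std = e / np.sqrt(s2 * (1.0 - h))

# Jackknife (deleted) residuals, using the leave-one-out mean square.
s2_del = (dfe * s2 - e**2 / (1.0 - h)) / (dfe - 1)
t_jack = e / np.sqrt(s2_del * (1.0 - h))

# Cook's distance and DFFITS.
cooks_d = r_std**2 * h / (r * (1.0 - h))
dffits = t_jack * np.sqrt(h / (1.0 - h))
```

The leverages sum to the rank r, so their average is r/n; the isolated case at x = 10 has the largest leverage, matching the 2r/n rule of thumb for flagging unusual rows of X.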
Transformations¶
Transformations of the independent variables are sometimes useful in order to satisfy the regression model. The inclusion of squares and crossproducts of the variables \((x_1, x_2, x_1^2, x_2^2, x_1 x_2)\) is often needed. Logarithms of the independent variables are also used. (See Draper and Smith 1981, pp. 218−222; Box and Tidwell 1962; Atkinson 1985, pp. 177−180; Cook and Weisberg 1982, pp. 78−86.)
When the responses are described by a nonlinear function of the parameters, a transformation of the model equation often can be selected so that the transformed model is linear in the regression parameters. For example, by taking natural logarithms on both sides of the equation, the exponential model

\[y_i = \beta_0 e^{\beta_1 x_i} \varepsilon_i\]

can be transformed to a model that satisfies the linear regression model provided the \(\varepsilon_i\)’s have a log-normal distribution (Draper and Smith, pp. 222−225).
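A NumPy sketch of this linearization (hypothetical parameter values, noise omitted for clarity): taking logarithms turns the multiplicative exponential model into a straight-line fit.

```python
import numpy as np

# Exponential model y = b0 * exp(b1 * x); hypothetical values b0 = 2, b1 = 0.5.
x = np.linspace(0.0, 3.0, 7)
y = 2.0 * np.exp(0.5 * x)

# Taking natural logs gives ln y = ln b0 + b1 * x, a linear regression model.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

b0_hat = np.exp(coef[0])   # intercept back-transforms to b0
b1_hat = coef[1]
```

With multiplicative (log-normal) errors this transformed fit is appropriate; with additive errors a nonlinear fit such as nonlinearRegression would be used instead.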
When the responses are nonnormal and their distribution is known, a transformation of the responses can often be selected so that the transformed responses closely satisfy the regression model assumptions. The square-root transformation for counts with a Poisson distribution and the arc-sine transformation for binomial proportions are common examples (Snedecor and Cochran 1967, pp. 325−330; Draper and Smith, pp. 237−239).
Alternatives to Least Squares¶
The method of least squares has desirable characteristics when the errors are normally distributed, e.g., a least-squares solution produces maximum likelihood estimates of the regression parameters. However, when errors are not normally distributed, least squares may yield poor estimators. Function lnormRegression offers three alternatives to least-squares methodology: Least Absolute Value, Lp Norm, and Least Maximum Value.
The least absolute value (LAV, L1) criterion yields the maximum likelihood estimate when the errors follow a Laplace distribution. Option methodLav is often used when the errors have a heavy-tailed distribution or when a fit is needed that is resistant to outliers.
A more general approach, minimizing the Lp norm (\(p \geq 1\)), is given by option methodLlp. Although the routine requires about 30 times the CPU time for the case \(p = 1\) than would the use of methodLav, the generality of methodLlp allows the user to try several choices for \(p \geq 1\) by simply changing the input value of p in the calling program. The CPU time decreases as p gets larger. Generally, choices of p between 1 and 2 are of interest. However, the Lp norm solution for values of p larger than 2 can also be computed.
The minimax (LMV, L∞, Chebyshev) criterion is used by methodLmv. Its estimates are very sensitive to outliers; however, the minimax estimators are quite efficient if the errors are uniformly distributed.
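The outlier resistance of the L1 (LAV) criterion can be illustrated with a simple iteratively reweighted least-squares approximation in NumPy; this is only a sketch of the criterion's behavior, not the algorithm lnormRegression uses:

```python
import numpy as np

# Hypothetical data with one gross outlier in y (the last case).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0])
X = np.column_stack([np.ones_like(x), x])

# IRLS for the L1 criterion: weight each case by 1/|residual| and refit.
coef = np.linalg.lstsq(X, y, rcond=None)[0]
for _ in range(200):
    resid = y - X @ coef
    w = 1.0 / np.maximum(np.abs(resid), 1e-8)   # floor avoids division by zero
    sw = np.sqrt(w)
    coef = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

The L1 fit passes through the six well-behaved points (slope 1, intercept 0) and essentially ignores the outlier, whereas the initial least-squares fit is pulled strongly toward it.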
Function plsRegression provides an alternative method that is useful when there are many interrelated regression variables and relatively few observations. plsRegression finds linear combinations of the predictor variables that have the highest covariance with Y.
Missing Values¶
NaN (Not a Number) is the missing value code used by the regression
functions. Use function machine(6), Chapter
15,:doc:/stat/utilities/index to retrieve NaN. Any element of the data
matrix that is missing must be set to machine
(6). In fitting
regression models, any observation containing NaN for the independent,
dependent, weight, or frequency variables is omitted from the computation of
the regression parameters.
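The omission of observations containing missing values can be sketched as follows, using NumPy's NaN in place of machine(6):

```python
import numpy as np

# Data matrix with NaN as the missing-value code; columns are x1, x2, y.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, np.nan, 5.0],
                 [3.0, 1.0, 4.0],
                 [np.nan, 2.0, 6.0],
                 [4.0, 3.0, 8.0]])

# Keep only complete cases: drop any row containing NaN before fitting.
complete = data[~np.isnan(data).any(axis=1)]
x, y = complete[:, :2], complete[:, 2]
```

Here two of the five observations contain a NaN and are omitted, leaving three complete cases for the fit.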