Usage Notes

The regression models in this chapter include the simple and multiple linear regression models, the multivariate general linear model, the polynomial model, and the nonlinear regression model. Functions for fitting regression models, computing summary statistics from a fitted regression, computing diagnostics, and computing confidence intervals for individual cases are provided. This chapter also provides methods for building a model from a set of candidate variables.

Simple and Multiple Linear Regression

The simple linear regression model is

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \phantom{...} i = 1, 2, \ldots, n\]

where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable, the \(x_i\)’s are the settings of the independent (explanatory) variable, \(\beta_0\) and \(\beta_1\) are the intercept and slope parameters (respectively) and the \(\varepsilon_i\)’s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2\).

The multiple linear regression model is

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i \phantom{...} i = 1, 2, \ldots, n\]

where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable; the \(x_{i1}\)’s, \(x_{i2}\)’s, …, \(x_{ik}\)’s are the settings of the k independent (explanatory) variables; \(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression coefficients; and the \(\varepsilon_i\)’s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2\).

Function regression fits both the simple and multiple linear regression models using a fast Givens transformation and includes an option for excluding the intercept \(\beta_0\). The responses are input in array y, and the independent variables are input in array x, where the individual cases correspond to the rows and the variables correspond to the columns.

After the model has been fitted using regression, function regressionSummary computes summary statistics and regressionPrediction computes predicted values, confidence intervals, and case statistics for the fitted model. The information about the fit is communicated from regression to regressionSummary and regressionPrediction by passing an argument of type structure.
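For readers who want to see the computation spelled out, the following sketch fits the multiple linear regression model directly in NumPy under the same conventions (rows are cases, columns are variables, and a leading column of ones supplies the intercept). It is an illustration of the model only, not the regression function's interface, and the data are made up.

```python
import numpy as np

# Illustrative least-squares fit of y = beta0 + beta1*x1 + beta2*x2 + eps
# (made-up data; not the library's API).
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])                       # rows = cases, columns = variables
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

X = np.column_stack([np.ones(len(y)), x])        # leading column of ones = intercept
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                             # fitted values
sse = float(np.sum((y - y_hat) ** 2))            # error sum of squares
```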

No Intercept Model

Several functions provide the option for excluding the intercept from a model. In most practical applications, the intercept should be included in the model. For functions that use the sums of squares and crossproducts matrix as input, the no-intercept case can be handled by using the raw sums of squares and crossproducts matrix as input in place of the corrected sums of squares and crossproducts. The raw sums of squares and crossproducts matrix can be computed as \((x_1,x_2,\ldots,x_k,y)^T (x_1,x_2,\ldots,x_k,y)\).
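The following sketch shows, with made-up data, how the raw sums of squares and crossproducts matrix \((x_1,x_2,\ldots,x_k,y)^T (x_1,x_2,\ldots,x_k,y)\) differs from the corrected matrix formed from mean-centered columns; it is illustrative only.

```python
import numpy as np

# Raw versus corrected sums of squares and crossproducts matrices
# (made-up data; illustrative only).
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])
y = np.array([3.0, 4.0, 8.0])

Z = np.column_stack([x, y])                       # (x1, ..., xk, y)
raw_sscp = Z.T @ Z                                # raw crossproducts (no-intercept case)
Zc = Z - Z.mean(axis=0)                           # mean-centered columns
corrected_sscp = Zc.T @ Zc                        # corrected crossproducts
```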

Variable Selection

Variable selection can be performed by regressionSelection, which computes all best-subset regressions, or by regressionStepwise, which performs stepwise regression. The method used by regressionSelection is generally preferred over that used by regressionStepwise because regressionSelection implicitly examines all possible models in its search for a model that optimizes some criterion, whereas regressionStepwise does not examine all possible models. However, the computer time and memory requirements for regressionSelection can be much greater than those for regressionStepwise when the number of candidate variables is large.
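The brute-force loop below illustrates the idea behind a best-subset search, scoring every subset of candidate variables by adjusted \(R^2\). It is a conceptual sketch only: regressionSelection uses a much more efficient search and supports several selection criteria, and the data and criterion here are chosen purely for illustration.

```python
import numpy as np
from itertools import combinations

def adjusted_r2(X, y):
    # Adjusted R^2 for a model matrix X that already includes the intercept column.
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4))                      # 4 candidate variables (made up)
y = 2.0 + 1.5 * x[:, 0] - 0.7 * x[:, 2] + rng.normal(scale=0.5, size=30)

best = None
for k in range(1, x.shape[1] + 1):
    for cols in combinations(range(x.shape[1]), k):
        X = np.column_stack([np.ones(len(y)), x[:, cols]])
        score = adjusted_r2(X, y)
        if best is None or score > best[0]:
            best = (score, cols)
print("best subset:", best[1], "adjusted R^2:", round(best[0], 3))
```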

Polynomial Model

The polynomial model is

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_k x_i^k + \varepsilon_i \phantom{...} i=1,2,\ldots,n\]

where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable; the \(x_i\)’s are the settings of the independent (explanatory) variable; \(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression coefficients; and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean 0 and variance \(\sigma^2\).

Function polyRegression fits a polynomial regression model with the option of determining the degree of the model and also produces summary information. Function polyPrediction computes predicted values, confidence intervals, and case statistics for the model fit by polyRegression.

The information about the fit is communicated from polyRegression to polyPrediction by passing an argument of structure type polyRegression.
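The sketch below fits a degree-k polynomial by building the powers \(1, x, x^2, \ldots, x^k\) explicitly and solving the resulting least-squares problem. It illustrates the polynomial model only; it is not how polyRegression is called or implemented, and the library routine can also determine the degree as noted above.

```python
import numpy as np

# Degree-k polynomial fit via an explicit power basis (made-up data; illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 50.8])
k = 2

X = np.column_stack([x ** j for j in range(k + 1)])   # columns: 1, x, x^2, ..., x^k
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                                  # fitted polynomial values
```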

Specification of X for the General Linear Model

Variables used in the general linear model are either continuous or classification variables. Typically, multiple regression models use continuous variables, whereas analysis of variance models use classification variables. Although the notation used to specify analysis of variance models and multiple regression models may look quite different, the models are essentially the same. The term “general linear model” emphasizes that a common notational scheme is used for specifying a model that may contain both continuous and classification variables.

A general linear model is specified by its effects (sources of variation). An effect is referred to in this text as a single variable or a product of variables. (The term “effect” is often used in a narrower sense, referring only to a single regression coefficient.) In particular, an “effect” is composed of one of the following:

  1. A single continuous variable.
  2. A single classification variable.
  3. Several different classification variables.
  4. Several continuous variables, some of which may be the same.
  5. Continuous variables, some of which may be the same, and classification variables, which must be distinct.

Effects of the first type are common in multiple regression models. Effects of the second type appear as main effects in analysis of variance models. Effects of the third type appear as interactions in analysis of variance models. Effects of the fourth type appear in polynomial models and response surface models as powers and crossproducts of some basic variables. Effects of the fifth type appear in one-way analysis of covariance models as regression coefficients that indicate lack of parallelism of a regression function across the groups.

The analysis of a general linear model occurs in two stages. The first stage calls function regressorsForGlm to specify all regressors except the intercept. The second stage calls regression, at which point the model will be specified as either having (default) or not having an intercept.

For this discussion, define a variable INTCEP as follows:

Option INTCEP Action
noIntercept 0 An intercept is not in the model.
intercept (default) 1 An intercept is in the model.

The remaining variables (nContinuous, nClass, xClassColumns, nEffects, nVarEffects, and indicesEffects) are defined for function regressorsForGlm. All these variables have defaults except for nContinuous and nClass, both of which must be specified. (See the documentation for regressorsForGlm for a discussion of the defaults.) The meaning of each of these arguments is as follows:

nContinuous (Input)
Number of continuous variables.
nClass (Input)
Number of classification variables.
xClassColumns (Input)
Index vector of length nClass containing the column numbers of x that are the classification variables.
nEffects (Input)
Number of effects (sources of variation) in the model, excluding error.
nVarEffects (Input)
Vector of length nEffects containing the number of variables associated with each effect in the model.
indicesEffects (Input)
Index vector of length nVarEffects[0] + nVarEffects[1] + … + nVarEffects[nEffects - 1]. The first nVarEffects[0] elements give the column numbers of x for each variable in the first effect; the next nVarEffects[1] elements give the column numbers for each variable in the second effect; and finally, the last nVarEffects[nEffects - 1] elements give the column numbers for each variable in the last effect.

Suppose the data matrix has as its first four columns two continuous variables in Columns 0 and 1 and two classification variables in Columns 2 and 3. The data might appear as follows:

Column 0 Column 1 Column 2 Column 3
11.23 1.23 1.0 5.0
12.12 2.34 1.0 4.0
12.34 1.23 1.0 4.0
4.34 2.21 1.0 5.0
5.67 4.31 2.0 4.0
4.12 5.34 2.0 1.0
4.89 9.31 2.0 1.0
9.12 3.71 2.0 1.0

Each distinct value of a classification variable determines a level. The classification variable in Column 2 has two levels. The classification variable in Column 3 has three levels. (Integer values are recommended, but not required, for values of the classification variables. The values of the classification variables corresponding to the same level must be identical.) Some examples of regression functions and their specifications are as follows:

Model INTCEP nClass xClassColumns
\(\beta_0+\beta_1x_1\) 1 0  
\(\beta_0+\beta_1x_1+\beta_2x_1^2\) 1 0  
\(\mu+\alpha_i\) 1 1 2
\(\mu+\alpha_i+\beta_j+\gamma_{ij}\) 1 2 2, 3
\(\mu_{ij}\) 0 2 2, 3
\(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\) 1 0  
\(\mu+\alpha_i+\beta x_{1i}+\beta_ix_{1i}\) 1 1 2
Model nEffects nVarEffects indicesEffects
\(\beta_0+\beta_1x_1\) 1 1 0
\(\beta_0+\beta_1x_1+\beta_2x_1^2\) 2 1, 2 0, 0, 0
\(\mu+\alpha_i\) 1 1 2
\(\mu+\alpha_i+\beta_j+\gamma_{ij}\) 3 1, 1, 2 2, 3, 2, 3
\(\mu_{ij}\) 1 2 2, 3
\(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1x_2\) 3 1, 1, 2 0, 1, 0, 1
\(\mu+\alpha_i+\beta x_{1i}+\beta_ix_{1i}\) 3 1, 1, 2 2, 0, 0, 2
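To make the classification-variable examples concrete, the following sketch expands the two classification variables of the example data matrix (Columns 2 and 3) into indicator regressors for the model \(\mu+\alpha_i+\beta_j+\gamma_{ij}\). This all-indicator expansion is illustrative only; the regressors actually produced by regressorsForGlm depend on its dummy-variable conventions and may differ.

```python
import numpy as np

# Example data matrix from above: continuous variables in columns 0-1,
# classification variables in columns 2-3.
data = np.array([
    [11.23, 1.23, 1.0, 5.0],
    [12.12, 2.34, 1.0, 4.0],
    [12.34, 1.23, 1.0, 4.0],
    [ 4.34, 2.21, 1.0, 5.0],
    [ 5.67, 4.31, 2.0, 4.0],
    [ 4.12, 5.34, 2.0, 1.0],
    [ 4.89, 9.31, 2.0, 1.0],
    [ 9.12, 3.71, 2.0, 1.0],
])

def indicators(column):
    # One indicator (0/1) column per distinct level of a classification variable.
    levels = np.unique(column)
    return np.column_stack([(column == lev).astype(float) for lev in levels])

A = indicators(data[:, 2])                # alpha_i: 2 levels -> 2 indicator columns
B = indicators(data[:, 3])                # beta_j:  3 levels -> 3 indicator columns
AB = np.column_stack([A[:, i] * B[:, j]   # gamma_ij interaction: 2*3 = 6 columns
                      for i in range(A.shape[1]) for j in range(B.shape[1])])
regressors = np.column_stack([A, B, AB])  # the intercept (mu) is added at the fitting stage
```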

Functions for Fitting the Model

Function regression fits a multivariate general linear model, where regressors for the general linear model have been generated using function regressorsForGlm.

Linear Dependence and the R Matrix

Linear dependence of the regressors frequently arises in regression models—sometimes by design and sometimes by accident. The functions in this chapter are designed to handle linear dependence of the regressors; i.e., the n × p matrix X (the matrix of regressors) in the general linear model can have rank less than p. Often, the models are referred to as non-full rank models.

As discussed in Searle (1971, Chapter 5), be careful to correctly use the results of the fitted non-full rank regression model for estimation and hypothesis testing. In the non-full rank case, not all linear combinations of the regression coefficients can be estimated. Those linear combinations that can be estimated are called “estimable functions.” If the functions are used to attempt to estimate linear combinations that cannot be estimated, error messages are issued. A good general discussion of estimable functions is given by Searle (1971, pp. 180–188).

The check used by functions in this chapter for linear dependence is sequential. The j‑th regressor is declared linearly dependent on the preceding j - 1 regressors if

\[1 - R_{j(1,2,\ldots j-1)}^{2}\]

is less than or equal to tolerance. Here,

\[R_{j(1,2,\ldots j-1)}\]

is the multiple correlation coefficient of the j‑th regressor with the first j − 1 regressors. When a function declares the j-th regressor to be linearly dependent on the first j − 1, the j‑th regression coefficient is set to 0. Essentially, this removes the j‑th regressor from the model.

A sequential check is used because practitioners frequently list first the variables they prefer to keep in the model. Also, the sequential check reuses many computations that have already been performed, so it does not degrade the overall efficiency of the functions. There is no perfect test for linear dependence when finite precision arithmetic is used. The optional argument tolerance allows the user some control over the check for linear dependence. If the model is known to be full rank, tolerance can be input as 0.0; otherwise, tolerance should be input as approximately 100 times the machine epsilon. The machine epsilon is machine(4) in single precision. (See function machine in Chapter 15, Utilities.)
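The sketch below applies the sequential check exactly as defined above, by regressing each regressor on the ones that precede it and flagging it when \(1-R^2\) falls at or below tolerance. The library performs an equivalent check inside its factorization rather than through explicit auxiliary regressions; this code only illustrates the definition.

```python
import numpy as np

def sequentially_dependent(X, tolerance):
    # Flag the j-th regressor when 1 - R^2 (of regressor j on regressors 0..j-1,
    # with an intercept) is <= tolerance. Illustrative only.
    n, p = X.shape
    flags = np.zeros(p, dtype=bool)
    ones = np.ones((n, 1))
    for j in range(p):
        xj = X[:, j]
        Z = np.column_stack([ones, X[:, :j]])        # intercept + preceding regressors
        coef, *_ = np.linalg.lstsq(Z, xj, rcond=None)
        resid = xj - Z @ coef
        sst = np.sum((xj - xj.mean()) ** 2)
        r2 = 1.0 if sst == 0.0 else 1.0 - np.sum(resid ** 2) / sst
        if 1.0 - r2 <= tolerance:
            flags[j] = True                          # coefficient would be set to 0
    return flags
```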

Functions performing least squares are based on QR decomposition of X or on a Cholesky factorization \(R^T R\) of \(X^T X\). Maindonald (1984, Chapters 1−5) discusses these methods extensively. The R matrix used by the regression function is a p × p upper-triangular matrix, i.e., all elements below the diagonal are 0. The signs of the diagonal elements of R are used as indicators of linearly dependent regressors and as indicators of parameter restrictions imposed by fitting a restricted model. The rows of R can be partitioned into three classes by the sign of the corresponding diagonal element:

  1. A positive diagonal element means the row corresponds to data.
  2. A negative diagonal element means the row corresponds to a linearly independent restriction imposed on the regression parameters by \(AB=Z\) in a restricted model.
  3. A zero diagonal element means a linear dependence of the regressors was declared. The regression coefficients in the corresponding row of \(\hat{B}\) are set to 0. This represents an arbitrary restriction that is imposed to obtain a solution for the regression coefficients. The elements of the corresponding row of R also are set to 0.

Nonlinear Regression Model

The nonlinear regression model is

\[y_i = f\left(x_i; \theta\right) + \varepsilon_i \phantom{...} i=1,2, \ldots, n\]

where the observed values of the \(y_i\)’s constitute the responses or values of the dependent variable, the \(x_i\)’s are the known vectors of values of the independent (explanatory) variables, f is a known function of an unknown regression parameter vector θ, and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean 0 and variance \(\sigma^2\).

Function nonlinearRegression performs the least-squares fit to the data for this model.
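A compact Gauss-Newton loop illustrates the kind of least-squares problem being solved. The model function used here is hypothetical, and nonlinearRegression's own algorithm, options, and convergence checks are not represented.

```python
import numpy as np

# Minimal Gauss-Newton sketch for y_i = f(x_i; theta) + eps_i, using the
# hypothetical model f(x; theta) = theta0 * (1 - exp(-theta1 * x)).
def f(x, theta):
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def jacobian(x, theta):
    # Partial derivatives of f with respect to theta0 and theta1.
    e = np.exp(-theta[1] * x)
    return np.column_stack([1.0 - e, theta[0] * x * e])

def gauss_newton(x, y, theta, n_iter=50):
    for _ in range(n_iter):
        r = y - f(x, theta)                          # current residuals
        step, *_ = np.linalg.lstsq(jacobian(x, theta), r, rcond=None)
        theta = theta + step
    return theta

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.8, 1.3, 1.6, 1.7, 1.8, 1.85])        # made-up data
theta_hat = gauss_newton(x, y, theta=np.array([2.0, 0.5]))
```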

Weighted Least Squares

Functions throughout the chapter generally allow weights to be assigned to the observations. The vector weights is used throughout to specify the weighting for each row of X.

Computations that relate to statistical inference—e.g., t tests, F tests, and confidence intervals—are based on the multiple regression model except that the variance of \(\varepsilon_i\) is assumed to equal \(\sigma^2\) times the reciprocal of the corresponding weight.

If a single row of the data matrix corresponds to \(n_i\) observations, the vector frequencies can be used to specify the frequency for each row of X. Degrees of freedom for error are affected by frequencies but are unaffected by weights.
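The following sketch shows the standard device for a weighted fit: scaling each row of X and y by the square root of its weight reduces the weighted problem to an ordinary least-squares problem. It is illustrative only; frequencies, by contrast, act as case-replication counts and affect the error degrees of freedom.

```python
import numpy as np

# Weighted least squares via row scaling (made-up data; illustrative only).
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])                    # first column = intercept
y = np.array([1.1, 2.2, 2.8, 4.1])
w = np.array([1.0, 2.0, 1.0, 0.5])            # one weight per row of X

sw = np.sqrt(w)
beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
sse = np.sum(w * (y - X @ beta_hat) ** 2)     # weighted error sum of squares
```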

Summary Statistics

Function regressionSummary can be used to compute and print statistics related to a regression for each of the q dependent variables fitted by regression. The summary statistics include the model analysis of variance table, sequential sums of squares and F-statistics, coefficient estimates, estimated standard errors, t-statistics, variance inflation factors, and estimated variance-covariance matrix of the estimated regression coefficients. Function polyRegression includes most of the same functionality for polynomial regressions.

The summary statistics are computed under the model \(y=X\beta+\varepsilon\), where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank \((X)=r\), β is the p × 1 vector of regression coefficients, and \(\varepsilon\) is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance \(\sigma^2/w_i\).

Given the results of a weighted least-squares fit of this model (with the \(w_i\)’s as the weights), most of the computed summary statistics are output in the following variables:

anovaTable

One-dimensional array usually of length 15. In regressionStepwise, anovaTable is of length 13 because the last two elements of the array cannot be computed from the input. The array contains statistics related to the analysis of variance. The sources of variation examined are the regression, error, and total. The first 10 elements of anovaTable and the notation frequently used for them are described in the following table (here, AOV replaces anovaTable):

Model Analysis of Variance Table
Source of Variation Degrees of Freedom Sum of Squares Mean Square F p-value
Regression DFR = AOV[0] SSR = AOV[3] MSR = AOV[6] AOV[8] AOV[9]
Error DFE = AOV[1] SSE = AOV[4] \(s^2\) = AOV[7]    
Total DFT = AOV[2] SST = AOV[5]      

If the model has an intercept (default), the total sum of squares is the sum of squares of the deviations of \(y_i\) from its (weighted) mean \(\bar{y}\) —the so-called corrected total sum of squares, denoted by the following:

\[\mathit{SST} = \sum_{i=1}^{n} w_i \left(y_i - \overline{y}\right)^2\]

If the model does not have an intercept (noIntercept), the total sum of squares is the sum of squares of \(y_i\)—the so-called uncorrected total sum of squares, denoted by the following:

\[\mathrm{SST} = \sum_{i=1}^{n} w_i y_i^2\]

The error sum of squares is given as follows:

\[\mathit{SSE} = \sum_{i=1}^{n} w_i \left(y_i - \hat{y}_i\right)^2\]

The error degrees of freedom is defined by \(DFE=n-r\).

The estimate of \(\sigma^2\) is given by \(s^2=SSE/DFE\), which is the error mean square.

The computed F statistic for the null hypothesis, \(H_0 : \beta_1=\beta_2=\ldots=\beta_k=0\), versus the alternative that at least one coefficient is nonzero is given by \(F=MSR/s^2\). The p-value associated with the test is the probability of an F larger than that computed under the assumption of the model and the null hypothesis. A small p-value (less than 0.05) is customarily used to indicate there is sufficient evidence from the data to reject the null hypothesis. Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.

The remaining five elements in anovaTable frequently are displayed together with the actual analysis of variance table. The quantities R-squared (\(R^2\) = anovaTable[10]) and adjusted R-squared (\(R_a^2\) = anovaTable[11]) are expressed as a percentage and are defined as follows:

\[R^2 = 100\left(\mathit{SSR}/\mathit{SST}\right) = 100\left(1 - \mathit{SSE}/\mathit{SST}\right)\]
\[R_{\mathrm{a}}^{2} = 100 \max \left\{ 0,1 - \frac{s^2}{\mathrm{SST} / \mathrm{DFT}}\right\}\]

The square root of \(s^2\) (s = anovaTable[12]) is frequently referred to as the estimated standard deviation of the model error.

The overall mean of the responses \(\bar{y}\) is output in anovaTable[13].

The coefficient of variation (CV = anovaTable[14]) is expressed as a percentage and defined by \(CV=100s/\bar{y}\).
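The sketch below reproduces the main anovaTable quantities for a weighted fit with an intercept, using the definitions above. It is illustrative only, since regressionSummary returns these values directly.

```python
import numpy as np

def anova_summary(X, y, w):
    # Model analysis of variance quantities for a weighted fit (intercept model).
    n, p = X.shape
    sw = np.sqrt(w)
    Xw = X * sw[:, None]
    beta, *_ = np.linalg.lstsq(Xw, y * sw, rcond=None)
    r = np.linalg.matrix_rank(Xw)                     # rank of the model matrix
    y_hat = X @ beta
    y_bar = np.sum(w * y) / np.sum(w)                 # weighted mean of the responses
    sst = np.sum(w * (y - y_bar) ** 2)                # corrected total sum of squares
    sse = np.sum(w * (y - y_hat) ** 2)                # error sum of squares
    ssr = sst - sse                                   # regression sum of squares
    dfr, dfe, dft = r - 1, n - r, n - 1
    msr, s2 = ssr / dfr, sse / dfe
    return {
        "F": msr / s2,
        "R2 (%)": 100.0 * ssr / sst,
        "adjusted R2 (%)": 100.0 * max(0.0, 1.0 - s2 / (sst / dft)),
        "s": np.sqrt(s2),
        "CV (%)": 100.0 * np.sqrt(s2) / y_bar,
    }
```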

coefTTests
Two-dimensional matrix containing the regression coefficient vector β as one column and associated statistics (estimated standard error, t statistic, and p-value) in the remaining columns.

coefCovariances
Estimated variance-covariance matrix of the estimated regression coefficients.

Tests for Lack-of-Fit

Tests for lack-of-fit are computed for the polynomial regression by the function polyRegression. The output array ssqLof contains the lack-of-fit F tests for the polynomial fits of degree 1, 2, …, k to the data. These tests are used to indicate the degree of the polynomial required to fit the data well.

Diagnostics for Individual Cases

Diagnostics for individual cases (observations) are computed by two functions: regressionPrediction for linear and nonlinear regressions and polyPrediction for polynomial regressions.

Statistics computed include predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.

The diagnostics are computed under the model \(y=X\beta+\varepsilon\), where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank \((X)=r\), β is the p × 1 vector of regression coefficients, and \(\varepsilon\) is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance \(\sigma^2/w_i\).

Given the results of a weighted least-squares fit of this model (with the \(w_i\)’s as the weights), the following five diagnostics are computed:

  1. Leverage
  2. Standardized residual
  3. Jackknife residual
  4. Cook’s distance
  5. DFFITS

The definitions of these terms are given in the discussion that follows:

Let \(x_i\) be a column vector containing the elements of the i‑th row of X. A case can be unusual either because of \(x_i\) or because of the response \(y_i\). The leverage \(h_i\) is a measure of uniqueness of the \(x_i\). The leverage is defined by

\[h_i = \left[x_i^T \left(X^T WX\right)^- x_i\right] w_i\]

where \(W=\mathrm{diag}(w_1,w_2,\ldots,w_n)\) and \((X^T WX)^-\) denotes a generalized inverse of \(X^T WX\). The average value of the \(h_i\)’s is \(r/n\). Regression functions declare \(x_i\) unusual if \(h_i>2r/n\). Hoaglin and Welsch (1978) call a data point highly influential (i.e., a leverage point) when this occurs.

Let \(e_i\) denote the residual

\[y_i - \hat{y}_i\]

for the i‑th case. The estimated variance of \(e_i\) is \((1-h_i)s^2/w_i\), where \(s^2\) is the residual mean square from the fitted regression. The i‑th standardized residual (also called the internally studentized residual) is by definition

\[r_i = e_i \left(\frac{w_i}{s^2 \left(1 - h_i\right)}\right)^{1/2}\]

and \(r_i\) follows an approximate standard normal distribution in large samples.

The i‑th jackknife residual or deleted residual involves the difference between \(y_i\) and its predicted value, based on the data set in which the i‑th case is deleted. This difference equals \(e_i/(1-h_i)\). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the i‑th case is deleted is as follows:

\[s_i^2 = \frac{(n-r)s^2 - w_i e_i^2 / \left(1 - h_i\right)}{n-r-1}\]

The jackknife residual is defined as

\[t_i = e_i \sqrt{\frac{w_i}{s_i^2 \left(1 - h_i\right)}}\]

and \(t_i\) follows a t distribution with \(n - r - 1\) degrees of freedom.

Cook’s distance for the i‑th case is a measure of how much an individual case affects the estimated regression coefficients. It is given as follows:

\[D_i = \frac{w_i h_i e_i^2}{rs^2 \left(1 - h_i\right)^2}\]

Weisberg (1985) states that if \(D_i\) exceeds the 50-th percentile of the \(F(r,n-r)\) distribution, it should be considered large. (This value is about 1. This statistic does not have an F distribution.)

DFFITS, like Cook’s distance, is also a measure of influence. For the i‑th case, DFFITS is computed by the formula below.

\[\mathrm{DFFITS}_i = e_i \sqrt{\frac{w_i h_i}{s_i^2 \left(1 - h_i\right)^2}}\]

Hoaglin and Welsch (1978) suggest that DFFITS greater than

\[2 \sqrt{r/n}\]

is large.
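The following sketch computes all five diagnostics from the definitions above for a weighted linear fit. It is illustrative only, since regressionPrediction and polyPrediction return these statistics directly.

```python
import numpy as np

def case_diagnostics(X, y, w):
    # Leverage, standardized residual, jackknife residual, Cook's distance, and
    # DFFITS for a weighted linear fit, following the formulas above.
    n = len(y)
    sw = np.sqrt(w)
    Xw = X * sw[:, None]
    beta, *_ = np.linalg.lstsq(Xw, y * sw, rcond=None)
    r = np.linalg.matrix_rank(Xw)
    e = y - X @ beta                                   # residuals e_i
    G = np.linalg.pinv(X.T @ (w[:, None] * X))         # generalized inverse of X'WX
    h = np.einsum("ij,jk,ik->i", X, G, X) * w          # leverages h_i
    s2 = np.sum(w * e ** 2) / (n - r)                  # residual mean square s^2
    std_res = e * np.sqrt(w / (s2 * (1.0 - h)))        # standardized residuals r_i
    s2_i = ((n - r) * s2 - w * e ** 2 / (1.0 - h)) / (n - r - 1)
    jack_res = e * np.sqrt(w / (s2_i * (1.0 - h)))     # jackknife residuals t_i
    cooks_d = w * h * e ** 2 / (r * s2 * (1.0 - h) ** 2)
    dffits = e * np.sqrt(w * h / (s2_i * (1.0 - h) ** 2))
    return h, std_res, jack_res, cooks_d, dffits
```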

Transformations

Transformations of the independent variables are sometimes useful in order to satisfy the regression model. The inclusion of squares and crossproducts of the variables

\[\left(x_1, x_2, x_1^2, x_2^2, x_1 x_2\right)\]

is often needed. Logarithms of the independent variables are used also. (See Draper and Smith 1981, pp. 218−222; Box and Tidwell 1962; Atkinson 1985, pp. 177−180; Cook and Weisberg 1982, pp. 78−86.)

When the responses are described by a nonlinear function of the parameters, a transformation of the model equation often can be selected so that the transformed model is linear in the regression parameters. For example, by taking natural logarithms on both sides of the equation, the exponential model

\[y = e^{\beta_0 + \beta_1 x_1} \varepsilon\]

can be transformed to a model that satisfies the linear regression model provided the \(\varepsilon_i\)’s have a log-normal distribution (Draper and Smith, pp. 222−225).

When the responses are nonnormal and their distribution is known, a transformation of the responses can often be selected so that the transformed responses closely satisfy the regression model assumptions. The square-root transformation for counts with a Poisson distribution and the arc-sine transformation for binomial proportions are common examples (Snedecor and Cochran 1967, pp. 325−330; Draper and Smith, pp. 237−239).
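As a small illustration of the exponential example above, the sketch below linearizes the model by taking logarithms of the responses and then fits the transformed model by ordinary least squares (made-up data; illustrative only).

```python
import numpy as np

# Fit log(y) = beta0 + beta1*x by ordinary least squares after the log transformation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.8, 7.2, 20.5, 54.0, 150.0])

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
beta0_hat, beta1_hat = coef                    # estimates on the log scale
```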

Alternatives to Least Squares

The method of least squares has desirable characteristics when the errors are normally distributed, e.g., a least-squares solution produces maximum likelihood estimates of the regression parameters. However, when errors are not normally distributed, least squares may yield poor estimators. Function lnormRegression offers three alternatives to least-squares methodology: Least Absolute Value, Lp Norm, and Least Maximum Value.

The least absolute value (LAV, L1) criterion yields the maximum likelihood estimate when the errors follow a Laplace distribution. Option methodLav is often used when the errors have a heavy-tailed distribution or when a fit is needed that is resistant to outliers.

A more general approach, minimizing the Lp norm (\(p \geq 1\)), is given by option methodLlp. Although methodLlp requires about 30 times as much CPU time for the case \(p=1\) as methodLav, its generality allows the user to try several choices for p ≥ 1 by simply changing the input value of p in the calling program. The CPU time decreases as p gets larger. Generally, choices of p between 1 and 2 are of interest. However, the Lp norm solution for values of p larger than 2 can also be computed.

The minimax (LMV, L∞, Chebyshev) criterion is used by methodLmv. Its estimates are very sensitive to outliers; however, the minimax estimators are quite efficient if the errors are uniformly distributed.
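The iteratively reweighted least-squares loop below is one conceptual way to minimize the Lp criterion for \(1 \leq p \leq 2\); it only illustrates the criterion and is not the algorithm used by lnormRegression for its LAV, Lp, or minimax options.

```python
import numpy as np

def lp_fit(X, y, p=1.0, n_iter=50, eps=1e-8):
    # Iteratively reweighted least squares for min sum_i |y_i - x_i'beta|^p, 1 <= p <= 2.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # start from the least-squares fit
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.maximum(np.abs(r), eps) ** (p - 2.0)    # IRLS weights |r_i|^(p-2), floored
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```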

Function plsRegression provides an alternative method that is useful when there are many interrelated regression variables and relatively few observations. plsRegression finds linear combinations of the predictor variables that have the highest covariance with Y.

Missing Values

NaN (Not a Number) is the missing value code used by the regression functions. Use function machine(6) (see Chapter 15, Utilities) to retrieve NaN. Any element of the data matrix that is missing must be set to machine(6). In fitting regression models, any observation containing NaN for the independent, dependent, weight, or frequency variables is omitted from the computation of the regression parameters.
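The row-wise omission rule can be pictured as follows (illustrative NumPy sketch with made-up data):

```python
import numpy as np

# Drop any case with NaN in the dependent or independent variables before fitting.
x = np.array([[1.0, 2.0],
              [np.nan, 1.0],
              [3.0, 4.0]])
y = np.array([3.0, 4.0, np.nan])

keep = ~(np.isnan(x).any(axis=1) | np.isnan(y))   # rows with no missing values
x_clean, y_clean = x[keep], y[keep]
```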