Usage Notes
Simple Linear Regression
The simple linear regression model is

yi = β0 + β1xi + ɛi    for i = 1, 2, …, n
where the observed values of the yi’s constitute the responses or values of the dependent variable, the xi’s are the settings of the independent (explanatory) variable, β0 and β1 are the intercept and slope parameters, respectively, and the ɛi’s are independently distributed normal errors each with mean zero and variance σ2.
Routine RLINE fits a straight line and computes summary statistics for the simple linear regression model. There are no options with this routine.
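As a minimal sketch (the data here are hypothetical, and the calling sequence CALL RLINE (NOBS, XDATA, YDATA, B0, B1, STAT), with STAT of length 12, is assumed from the RLINE routine description), a fit might look like this:

      INTEGER    NOBS
      PARAMETER  (NOBS=5)
      REAL       B0, B1, STAT(12), XDATA(NOBS), YDATA(NOBS)
C                                 Hypothetical data
      DATA XDATA/1.0, 2.0, 3.0, 4.0, 5.0/
      DATA YDATA/1.1, 1.9, 3.2, 3.9, 5.1/
C                                 Fit the line y = B0 + B1*x
      CALL RLINE (NOBS, XDATA, YDATA, B0, B1, STAT)
      WRITE (*,*) 'B0 =', B0, '  B1 =', B1
      END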
Routine RONE analyzes a simple linear regression model, using a data matrix as input. There is an option for excluding the intercept β0 from the model. The variables x, y, weights (optional), and frequencies (optional) must correspond to columns in this matrix. The simple linear regression model is fit, summary statistics are computed (including a test for lack of fit), and confidence intervals and diagnostics for individual cases are computed. There are options for printing and plotting the results.
Routines RINCF and RINPF solve the inverse regression (calibration) problem using a fitted simple linear regression. Routines RLINE or RONE can be used to compute the fit. Routine RINCF estimates settings of the independent variable that restrict, at a specified confidence percentage, y to a given specified range. Routine RINPF computes a confidence interval on the setting of the independent variable for a given response y0.
Multiple Linear Regression
The multiple linear regression model is

yi = β0 + β1xi1 + β2xi2 + … + βkxik + ɛi    for i = 1, 2, …, n
where the observed values of the yi’s constitute the responses or values of the dependent variable, the xi1’s, xi2’s, …, xik’s are the settings of the k independent (explanatory) variables, β0, β1, …, βk are the regression coefficients, and the ɛi’s are independently distributed normal errors each with mean zero and variance σ2.
Routine RLSE fits the multiple linear regression model. There is an option for excluding the intercept β0. There are no other options. The responses are input in a one-dimensional array Y, and the independent variables are input in a two-dimensional array X that contains the individual cases as the rows and the variables as the columns.
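A minimal sketch follows (hypothetical data; the calling sequence CALL RLSE (NOBS, Y, NIND, X, LDX, INTCEP, B, SST, SSE) is assumed from the RLSE routine description):

      INTEGER    NOBS, NIND, LDX, INTCEP
      PARAMETER  (NOBS=6, NIND=2, LDX=NOBS, INTCEP=1)
      REAL       B(NIND+INTCEP), SSE, SST, X(LDX,NIND), Y(NOBS)
C                                 Hypothetical data: rows are cases;
C                                 the two columns are x1 and x2
      DATA X/1.0, 2.0, 3.0, 4.0, 5.0, 6.0,
     &       2.0, 1.0, 4.0, 3.0, 6.0, 5.0/
      DATA Y/3.1, 4.2, 7.0, 8.1, 11.2, 11.9/
C                                 Fit y = B(1) + B(2)*x1 + B(3)*x2
      CALL RLSE (NOBS, Y, NIND, X, LDX, INTCEP, B, SST, SSE)
      WRITE (*,*) 'B =', B, '  SST =', SST, '  SSE =', SSE
      END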
By specifying a single dependent variable, either RGIVN or RCOV can also be used to fit the multiple linear regression. (These routines are designed to fit any number of dependent variables simultaneously. See the section Multivariate General Linear Model.)
Routine RGIVN fits the model using fast Givens transformations. For large data sets that cannot be stored in a single array, RGIVN is designed to allow multiple invocations. In this case, only some of the rows from the entire data set are input at any one time. Alternatively, the data set can be input in a single array.
Routine RCOV fits the multiple linear regression model from the sum of squares and crossproducts matrix for the data (x1, x2, …, xk, y). Routine CORVC in Chapter 3, “Correlation” can compute the required sums of squares and crossproducts matrix for input into RCOV. Routine RORDM in Chapter 19, “Utilities” can reorder this matrix, if required.
Three routines in the IMSL MATH/LIBRARY can be used for fitting the multiple linear regression model. Routine LSQRR (IMSL MATH/LIBRARY) computes the fit via the Householder QR decomposition. Routine LSBRR (IMSL MATH/LIBRARY) computes the fit via iterative refinement. Routine LSVRR (IMSL MATH/LIBRARY) computes the singular value decomposition of a matrix. Routines LSQRR and LSBRR use the regressors and dependent variable as two input arrays. Routine LSVRR computes the singular value decomposition of the matrix of regressors, from which the regression coefficients can be obtained. Kennedy and Gentle (1980, section 8.1) discuss some of the computational advantages and disadvantages of the various methods for least-squares computations.
No Intercept Model
Several routines provide the option for excluding the intercept from a model. In most practical applications, the intercept should be included in the model. For routines that use the sums of squares and crossproducts matrix as input, the no-intercept case can be handled by using the raw sums of squares and crossproducts matrix as input in place of the corrected sums of squares and crossproducts. The raw sum of squares and crossproducts matrix can be computed as (x1, x2, …, xk, y)T(x1, x2, …, xk, y) using the matrix multiplication routine MXTXF (IMSL MATH/LIBRARY).
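For illustration, a sketch of forming the raw sums of squares and crossproducts matrix for two regressors and a response follows (hypothetical data; the calling sequence CALL MXTXF (NRA, NCA, A, LDA, NB, B, LDB) is assumed from the MXTXF routine description):

      INTEGER    NRA, NCA
      PARAMETER  (NRA=4, NCA=3)
      REAL       A(NRA,NCA), ATA(NCA,NCA)
C                                 Columns of A are x1, x2, and y
      DATA A/1.0, 2.0, 3.0, 4.0,
     &       2.0, 1.0, 4.0, 3.0,
     &       3.1, 4.2, 7.0, 8.1/
C                                 ATA = (A**T)*A, the raw sums of
C                                 squares and crossproducts matrix
      CALL MXTXF (NRA, NCA, A, NRA, NCA, ATA, NCA)
      END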
Variable Selection
Variable selection can be performed by RBEST, which does all best subset regressions, or by RSTEP, which does stepwise regression. In either case, the sum of squares and crossproducts matrix must first be formed. The method used by RBEST is generally preferred over that used by RSTEP because RBEST implicitly examines all possible models in its search for a model that optimizes some criterion, while stepwise regression does not examine all possible models. However, the computer time and memory requirements for RBEST can be much greater than those for RSTEP when the number of candidate variables is large.
Two utility routines, GSWEP and RSUBM, are also provided for variable selection. Routine GSWEP performs a generalized sweep of a nonnegative definite matrix. Routine RSUBM can be invoked after either GSWEP or RSTEP in order to extract the symmetric submatrix whose rows and columns have been swept, i.e., whose rows and columns have entered the stepwise model. Routines GSWEP and RSUBM can be invoked prior to RBEST in order to force certain variables into all the models considered by RBEST.
Polynomial Model
The polynomial model is

yi = β0 + β1xi + β2xi² + … + βkxiᵏ + ɛi    for i = 1, 2, …, n
where the observed values of the yi’s constitute the responses or values of the dependent variable, the xi’s are the settings of the independent (explanatory) variable, β0, β1, …, βk are the regression coefficients, and the ɛi’s are independently distributed normal errors each with mean zero and variance σ2.
Routine RCURV fits a polynomial of specified degree. Routine RPOLY determines the degree of polynomial to fit and analyzes this model. If only a decomposition of the sum of squares for first-, second-, …, k-th degree effects in a polynomial model is required, either RCURV or the service routine RFORP can be used to compute this decomposition. The other service routines (RSTAP, RCASP, OPOLY) can be used to perform other parts of the full analysis.
Multivariate General Linear Model
Routines for the multivariate general linear model use the model
Y = XB + ɛ
where Y is the n × q matrix of responses, X is the n × p matrix of regressors, B is the p × q matrix of regression coefficients, and ɛ is the n × q matrix of errors whose q‑dimensional rows are identically and independently distributed multivariate normal with mean vector 0 and variance-covariance matrix Σ.
Specification of X for the General Linear Model
Variables used in the general linear model are either continuous or classification variables. Typically, multiple regression models use continuous variables, whereas analysis of variance models use classification variables. Although the notation used to specify analysis of variance models and multiple regression models may look quite different, the models are essentially the same. The term general linear model emphasizes that a common notational scheme is used for specifying a model that may contain both continuous and classification variables.
A general linear model is specified by its effects (sources of variation). We refer to an effect as a single variable or a product of variables. (The term effect is often used in a narrower sense, referring only to a single regression coefficient.) In particular, an effect is composed of one of the following:
1. a single continuous variable
2. a single classification variable
3. several different classification variables
4. several continuous variables, some of which may be the same
5. continuous variables, some of which may be the same, and classification variables, which must be distinct
Effects of the first type are common in multiple regression models. Effects of the second type appear as main effects in analysis of variance models. Effects of the third type appear as interactions in analysis of variance models. Effects of the fourth type appear in polynomial models and response surface models as powers and crossproducts of some basic variables. Effects of the fifth type appear in one-way analysis of covariance models as regression coefficients that indicate lack of parallelism of a regression function across the groups.
The specification of a general linear model is through arguments INTCEP, NCLVAR, INDCL, NEF, NVEF, and INDEF, whose meanings are as follows:
INTCEP — Intercept option. (Input)
INTCEP   Action
0        An intercept is not in the model.
1        An intercept is in the model.
NCLVAR — Number of classification variables. (Input)
INDCL — Index vector of length NCLVAR containing the column numbers of X that are the classification variables. (Input)
NEF — Number of effects (sources of variation) in the model excluding error. (Input)
NVEF — Vector of length NEF containing the number of variables associated with each effect in the model. (Input)
INDEF — Index vector of length NVEF(1) + NVEF(2) + … + NVEF(NEF). (Input)
The first NVEF(1) elements give the column numbers of X for each variable in the first effect; the next NVEF(2) elements give the column numbers for each variable in the second effect; and so on. The last NVEF(NEF) elements give the column numbers for each variable in the last effect.
Suppose the data matrix has as its first 4 columns two continuous variables in columns 1 and 2 and two classification variables in columns 3 and 4. The data might appear as follows:
Column 1   Column 2   Column 3   Column 4
11.23      1.23       1.0        5.0
12.12      2.34       1.0        4.0
12.34      1.23       1.0        4.0
 4.34      2.21       1.0        5.0
 5.67      4.31       2.0        4.0
 4.12      5.34       2.0        1.0
 4.89      9.31       2.0        1.0
 9.12      3.71       2.0        1.0
Each distinct value of a classification variable determines a level. The classification variable in column 3 has two levels. The classification variable in column 4 has three levels. (Integer values are recommended, but not required, for values of the classification variables. If real numbers are used, the values of the classification variables corresponding to the same level must be identical.) Some examples of regression functions and their specifications are as follows:
Regression Function              INTCEP   NCLVAR   INDCL   NEF   NVEF    INDEF
β0 + β1x1                        1        0                1     1       1
β0 + β1x1 + β2x1²                1        0                2     1,2     1,1,1
μ + αi                           1        1        3       1     1       3
μ + αi + βj + γij                1        2        3,4     3     1,1,2   3,4,3,4
μij                              0        2        3,4     1     2       3,4
β0 + β1x1 + β2x2 + β3x1x2        1        0                3     1,1,2   1,2,1,2
μ + αi + βx1i + βix1i            1        1        3       3     1,1,2   3,1,1,3
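As an illustration, the fourth row of the table above (the two-way model μ + αi + βj + γij, with classification variables in columns 3 and 4 of the data matrix) could be specified with declarations such as the following sketch; the argument values are taken directly from the table, and the surrounding call to a fitting routine such as RGLM is omitted:

      INTEGER    INTCEP, NCLVAR, INDCL(2), NEF, NVEF(3), INDEF(4)
C                                 Classification variables are in
C                                 columns 3 and 4 of the data matrix
      DATA INTCEP/1/, NCLVAR/2/, INDCL/3, 4/
C                                 Three effects: A, B, and the A*B
C                                 interaction
      DATA NEF/3/, NVEF/1, 1, 2/, INDEF/3, 4, 3, 4/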
Routines for Fitting the Model
Routine RGLM fits a multivariate general linear model. If the data set is too large to be stored in a single array, RGLM is designed so that multiple invocations can be made. In this case, one or more rows of the entire data set can be input at each invocation. Alternatively, the data set can be input all at once in a single array. Index vectors are used to specify the column numbers of the data matrix used as classification variables, effects, and dependent variables. This is useful if several models with different effects need to be fit from the same data matrix.
Routine RLEQU can be called after RGIVN or RGLM to impose linear equality restrictions AB = Z on the regression parameters. RLEQU checks consistency of the restrictions. Routine RLEQU is useful for fitting spline functions where restrictions on the regression parameters arise from continuity and differentiability conditions on the regression function.
Routine RLEQU can be used to test the multivariate general linear hypothesis AB = Z by fitting the restricted model after the full model is fit. The additional degrees of freedom for error (and the additional sum of squares and crossproducts for error) gained in the restricted model can be used for computing a test statistic. However, a more efficient approach for computing the sum of squares and crossproducts for a multivariate general linear hypothesis is provided by RHPSS. See the next section entitled “Multivariate General Linear Hypothesis” for a brief description of the problem and related routines.
Two utility routines GCLAS and GRGLM are provided to determine the values of the classification variables and then to use those values and the specified general linear model to generate the regressors in the model. These routines would not be required if you use RGLM to fit the model since RGLM does this automatically. However, if other routines in this chapter are used that require the actual regressors and not the classification variables, then these routines could be used.
Linear Dependence and the R Matrix
Linear dependence of the regressors frequently arises in regression models—sometimes by design and sometimes by accident. The routines in this chapter are designed to handle linear dependence of the regressors, i.e., the n × p matrix X (the matrix of regressors) in the general linear model can have rank less than p. Often, the models are referred to as nonfull rank models.
As discussed in Searle (1971, Chapter 5) some care must be taken to use correctly the results of the fitted nonfull rank regression model for estimation and hypothesis testing. In the nonfull rank case, not all linear combinations of the regression coefficients can be estimated. Those linear combinations that can be estimated are called “estimable functions.” If routines in this chapter are used to attempt to estimate linear combinations that cannot be estimated, error messages are issued. A good general discussion of estimable functions is given by Searle (1971, pages 180–188).
The check used by routines in this chapter for linear dependence is sequential. The j-th regressor is declared linearly dependent on the preceding j − 1 regressors if

√(1 − R²j·1,2,…,j−1)

is less than or equal to TOL. Here, Rj·1,2,…,j−1 is the multiple correlation coefficient of the j-th regressor with the first j − 1 regressors. Also, TOL is a tolerance that must be input by the user. When a routine declares the j-th regressor to be linearly dependent on the first j − 1 regressors, the j-th regression coefficient is set to zero. Essentially, this removes the j-th regressor from the model.
The reason a sequential check is used is that frequently practitioners include first the variables that they prefer to remain in the model. Also, the sequential check is based on many of the computations already performed, so it does not degrade the overall efficiency of the routines. There is no perfect test for linear dependence when finite precision arithmetic is used. The input of the tolerance TOL allows the user some control over the check for linear dependence. If you know your model is full rank, you can input TOL = 0.0. However, generally TOL should be input as approximately 100 times the machine epsilon. The machine epsilon is AMACH(4) in single precision and DMACH(4) in double precision. (See routines AMACH and DMACH in the Reference Material.)
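For example, a typical setting of TOL in single precision might be sketched as follows:

      REAL       AMACH, TOL
      EXTERNAL   AMACH
C                                 Tolerance for the linear dependence
C                                 check: about 100 * machine epsilon
      TOL = 100.0*AMACH(4)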
Routines in this chapter performing least squares are based on QR decomposition of X or on a Cholesky factorization RTR of XTX. Maindonald (1984, chapters 1–5) discusses these methods extensively. The R matrix used by the regression routines is taken to be a p × p upper triangular matrix, i.e., all elements below the diagonal are zero. The signs of the diagonal elements of R are used as indicators of linearly dependent regressors and as indicators of parameter restrictions imposed by fitting a restricted model. The rows of R can be partitioned into three classes by the sign of the corresponding diagonal element:
1. A positive diagonal element means the row corresponds to data.
2. A negative diagonal element means the row corresponds to a linearly independent restriction imposed on the regression parameters by AB = Z in a restricted model.
3. A zero diagonal element means a linear dependence of the regressors was declared. The regression coefficients in the corresponding row of B̂, the matrix of estimated regression coefficients, are set to zero. This represents an arbitrary restriction which is imposed to obtain a solution for the regression coefficients. The elements of the corresponding row of R are also set to zero.
Multivariate General Linear Hypothesis
Routine RHPSS computes the matrix of sums of squares and crossproducts for the general linear hypothesis HBU = G for the multivariate general linear model Y = XB + ɛ with possible linear equality restrictions AB = Z. The R matrix and B̂ from the routines that fit the model are required for input to RHPSS.
The rows of H must be linear combinations of the rows of R, i.e., the hypothesis HBU = G must be completely testable. If the hypothesis is not completely testable, routine CESTI can be used to construct an equivalent completely testable hypothesis.
Routine RHPTE computes several test statistics and approximate p‑values for the multivariate general linear hypothesis. The test statistics computed include Wilks’ lambda, Roy’s maximum root, Hotelling’s trace, and Pillai’s trace. Seber (1984, pages 409–416) and Morrison (1976, pages 222–224) discuss the procedures and compare the test statistics. The error sum of squares and crossproducts matrix (SCPE) output from the fit of the model is required for input to RHPTE. In addition, the hypothesis sum of squares and crossproducts matrix (SCPH), which can be computed using RHPSS, is required for input to RHPTE.
Nonlinear Regression Model
The nonlinear regression model is

yi = f(xi; θ) + ɛi    for i = 1, 2, …, n
where the observed values of the yi’s constitute the responses or values of the dependent variable, the xi’s are the known vectors of values of the independent (explanatory) variables, f is a known function of an unknown regression parameter vector θ, and the ɛi’s are independently distributed normal errors each with mean zero and variance σ2.
Routine RNLIN performs the least-squares fit to the data for this model. The routine RCOVB can be used to compute the large sample variance-covariance matrix of the estimated nonlinear regression parameters from the output of RNLIN.
Weighted Least Squares
Routines throughout the chapter generally allow weights to be assigned to the observations. The argument IWT is used throughout to specify the weighting option. (IWT = 0 means ordinary least squares; a positive IWT means weighted least squares with weights in column IWT of the data set.) All of the weights must be nonnegative. For routines requiring a sum of squares and crossproducts matrix for input, a weighted analysis can be performed by using as input a weighted sum of squares and crossproducts matrix. Routine CORVC in Chapter 3, “Correlation” can compute the required weighted sum of squares and crossproducts matrix.
Computations that relate to statistical inference, e.g., t tests, F tests, and confidence intervals, are based on the multiple regression model except that the variance of ɛi is assumed to equal σ2 (or Σ in the multivariate case) times the reciprocal of the corresponding weight.
If a single row of the data matrix corresponds to ni observations, the argument IFRQ can be used to specify the frequency option. IFRQ = 0 means that for all rows, ni = 1; a positive IFRQ means the frequencies are entered into column IFRQ of the data matrix. Degrees of freedom for error are affected by frequencies, but are unaffected by weights.
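For example, with a hypothetical data matrix whose column 3 holds the weights and whose column 4 holds the frequencies, the two options would be specified as in this sketch:

      INTEGER    IWT, IFRQ
C                                 Column 3 of the data matrix holds
C                                 the weights; column 4 holds the
C                                 frequencies
      DATA IWT/3/, IFRQ/4/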
Summary Statistics
Summary statistics for a single dependent variable are computed by several routines in the regression chapter. The routines RONE, RLSE, RSTEP, and RPOLY output some summary statistics with the fit of the model. For additional summary statistics, the routines RSTAT and RSTAP can be used.
Routine RSTAT can be used to compute and print statistics related to a regression for each of the q dependent variables fitted by RGIVN, RGLM, RLEQU, or RCOV. Routine RSTAT computes summary statistics that include the model analysis of variance table, sequential sums of squares and F-statistics, coefficient estimates, estimated standard errors, t-statistics, variance inflation factors, and the estimated variance-covariance matrix of the estimated regression coefficients. If only the variance-covariance matrix of the estimated regression coefficients is needed, routine RCOVB can be used.
The summary statistics are computed under the model y = Xβ + ɛ, where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank(X) = r, β is the p × 1 vector of regression coefficients, and ɛ is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance σ2/wi.
Given the results of a weighted least-squares fit of this model (with the wi’s as the weights), most of the computed summary statistics are output in the following variables:
AOV — a one-dimensional array usually of length 15. In RSTEP, AOV is of length 13 because the last two elements of the array cannot be computed from the input. The array contains statistics related to the analysis of variance. The sources of variation examined are the regression, error, and total. The first 10 elements of AOV and the notation frequently used for them are described in the following table:
Model Analysis of Variance Table
Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square    F        p‑value
Regression            DFR = AOV(1)         SSR = AOV(4)     MSR = AOV(7)   AOV(9)   AOV(10)
Error                 DFE = AOV(2)         SSE = AOV(5)     s2 = AOV(8)
Total                 DFT = AOV(3)         SST = AOV(6)
In the case an intercept is indicated (INTCEP = 1), the total sum of squares is the sum of squares of the deviations of yi from its (weighted) mean ȳ, the so-called corrected total sum of squares, denoted by

SST = Σ wi(yi − ȳ)²

In the case an intercept is not indicated (INTCEP = 0), the total sum of squares is the sum of squares of yi, the so-called uncorrected total sum of squares, denoted by

SST = Σ wi yi²

(Sums here and below are over i = 1, 2, …, n.)
The error sum of squares is given by

SSE = Σ wi(yi − ŷi)²

where ŷi denotes the i-th fitted value.
The error degrees of freedom is defined by
DFE = n − r
The estimate of σ2 is given by
s2 = SSE/DFE
which is the error mean square.
The computed F statistic for the null hypothesis H0 : β1 = β2 = … = βk = 0 versus the alternative that at least one coefficient is nonzero is given by
F = MSR/s2
The p‑value associated with the test is the probability of an F larger than that computed under the assumption of the model and the null hypothesis. A small p‑value (less than 0.05) is customarily used to indicate that there is sufficient evidence from the data to reject the null hypothesis.
The remaining five elements in AOV frequently are displayed together with the actual analysis of variance table. The quantities R-squared (R2 = AOV(11)) and adjusted R-squared (R2a = AOV(12)) are expressed as percentages and are defined by

R2 = 100(SSR/SST) = 100(1 − SSE/SST)

R2a = 100 max{0, 1 − s2/(SST/DFT)}
The square root of s2 (s = AOV(13)) is frequently referred to as the estimated standard deviation of the model error.
The overall mean of the responses, ȳ, is output in AOV(14).
The coefficient of variation (CV = AOV(15)) is expressed as a percentage and is defined by CV = 100 s/ȳ.
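For illustration with hypothetical values (and using the definitions above), suppose SST = 100.0, SSE = 20.0, DFT = 19, DFE = 16, and ȳ = 10.0. Then R2 = 100(1 − 20/100) = 80%, s2 = 20/16 = 1.25, R2a = 100 max{0, 1 − 1.25/(100/19)} ≈ 76.3%, s = √1.25 ≈ 1.118, and CV = 100(1.118)/10.0 ≈ 11.2%.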
COEF — a two-dimensional array containing the estimated regression coefficients as one column and associated statistics (including the estimated standard error, t statistic, and p‑value) in the remaining columns.
SQSS — a two-dimensional array containing sequential sums of squares as one column and associated statistics (including degrees of freedom, F statistic, and p‑value) in the remaining columns.
COVB — the estimated variance-covariance matrix of the estimated regression coefficients.
Tests for Lack of Fit
Tests for lack of fit are computed for simple linear regression by RONE, for polynomial regression by routines RPOLY and RSTAP, and for multiple regression by routines RLOFE and RLOFN.
In the case of polynomial regression, the two-dimensional output array TLOF contains the lack of fit F tests for each degree polynomial, 1, 2, …, k, that is fit to the data. These tests are useful for indicating the degree of polynomial required to fit the data well.
In the case of simple and multiple regression, the one-dimensional output array TESTLF of length 10 contains the analysis of variance table for the test of lack of fit. Two routines RLOFE and RLOFN can be used to compute a test for lack of fit. Routine RLOFE requires exact replicates of the independent variables, i.e., there must be at least two cases in the data set that have the same settings of all the independent variables, while RLOFN does not require exact replicates. Customarily, one would require there to be several sets of duplicate settings of the independent variables in order to use RLOFE.
For RLOFE, the 10 elements of TESTLF and the notation frequently used for them are described in the following table:
Lack of Fit Analysis of Variance Table
Source of Variation   Degrees of Freedom   Sum of Squares     Mean Square   F           p‑value
Lack of Fit           TESTLF(1)            TESTLF(4)          TESTLF(7)     TESTLF(9)   TESTLF(10)
Pure Error            DFPE = TESTLF(2)     SSPE = TESTLF(5)   TESTLF(8)
Error                 DFE = TESTLF(3)      SSE = TESTLF(6)
For RLOFN, the 10 elements of TESTLF are similar to those in the previous table. However, since there may not be exact replicates in the data, the data are grouped into sets of near replicates. Then, instead of computing a pure error (or within) sum of squares using a one-way analysis of variance model, an expanded one-way analysis of covariance model using the clusters of near replicates as the groups is computed. The error from this expanded model replaces the pure error in the preceding table in order to compute an exact F test for lack of fit conditional on the selected clusters.
Diagnostics for Individual Cases
Diagnostics for individual cases (observations) are computed by several routines in the regression chapter. Routines RONE and RPOLY output diagnostics for individual cases with the fit. If the fit of the model is done by other routines, RCASE and RCASP can be used to compute the diagnostics.
Routine RCASE computes confidence intervals and diagnostics for individual cases in the data matrix. The cases can be stored in a single data matrix or multiple invocations can be made in which one or more rows of the entire data set are input at any one time. Statistics computed by RCASE include predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.
If not all of the statistics computed by RCASE are needed, ROTIN can be used to obtain some of the statistics.
The diagnostics are computed under the model y = Xβ + ɛ, where y is the n × 1 vector of responses, X is the n × p matrix of regressors with rank(X) = r, β is the p × 1 vector of regression coefficients, and ɛ is the n × 1 vector of errors whose elements are independently normally distributed with mean 0 and variance σ2/wi.
Given the results of a weighted least-squares fit of this model (with the wi’s as the weights), the following five diagnostics are computed: (1) leverage, (2) standardized residual, (3) jackknife residual, (4) Cook’s distance, and (5) DFFITS. These diagnostics are stored in the FORTRAN matrix CASE. The definitions of these terms are given in the discussion that follows:
Let xi be a column vector containing the elements of the i-th row of X. A case could be unusual either because of xi or because of the response yi. The leverage hi is a measure of the unusualness of xi. The leverage is defined by

hi = wi xiT(XTWX)−xi

where W = diag(w1, w2, …, wn) and (XTWX)− denotes a generalized inverse of XTWX. The average value of the hi’s is r/n. Regression routines declare xi unusual if hi > 2r/n. A row label X is printed beside a case that is unusual because of xi. Hoaglin and Welsch (1978) call a data point highly influential (i.e., a leverage point) when this occurs.
Let ei denote the residual

ei = yi − ŷi

for the i-th case. The estimated variance of ei is (1 − hi)s2/wi, where s2 is the residual mean square from the fitted regression. The i-th standardized residual (also called the internally studentized residual) is by definition

ri = ei √(wi/(s2(1 − hi)))

and ri follows an approximate standard normal distribution in large samples.
The i-th jackknife residual or deleted residual involves the difference between yi and its predicted value based on the data set in which the i-th case is deleted. This difference equals ei/(1 − hi). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the i-th case is deleted is

si² = ((n − r)s² − wi ei²/(1 − hi))/(n − r − 1)
The jackknife residual is defined to be

ti = ei √(wi/(si²(1 − hi)))

and ti follows a t distribution with n − r − 1 degrees of freedom. The regression routines declare yi unusual (an outlier) if a jackknife residual greater than 2.0 in absolute value is computed. A row label Y is printed beside a case that is unusual because of yi.
Cook’s distance for the i-th case is a measure of how much an individual case affects the estimated regression coefficients. It is given as

Di = wi ei² hi/(r s²(1 − hi)²)
Weisberg (1985) states that if Di exceeds the 50-th percentile of the F(r, n − r) distribution, it should be considered large. (This value is about 1. This statistic does not have an F distribution.)
DFFITS, like Cook’s distance, is also a measure of influence. For the i-th case, DFFITS is computed by the formula

DFFITSi = ei √(wi hi/(si²(1 − hi)²))

Hoaglin and Welsch (1978) suggest that DFFITSi is large if it exceeds 2√(r/n) in absolute value.
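For illustration with hypothetical values r = 3 and n = 30: a case would be flagged with the row label X if its leverage exceeded 2r/n = 0.2, flagged with the row label Y if its jackknife residual exceeded 2.0 in absolute value, considered influential by the Cook’s distance criterion if Di exceeded roughly 1 (the 50-th percentile of the F(3, 27) distribution), and considered influential by the DFFITS criterion if DFFITSi exceeded 2√(3/30) ≈ 0.63 in absolute value.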
Transformations
Transformations of the independent variables are sometimes useful in order to satisfy the regression model. The inclusion of squares and crossproducts of the variables (x1², x2², x1x2) is often needed. Logarithms of the independent variables are also often used. (See Draper and Smith, 1981, pages 218–222; Box and Tidwell, 1962; Atkinson, 1985, pages 177–180; and Cook and Weisberg, 1982, pages 78–86.)
When the responses are described by a nonlinear function of the parameters, a transformation of the model equation can often be selected so that the transformed model is linear in the regression parameters. For example, the exponential model

yi = exp(β0 + β1xi)ɛi

can be transformed, by taking natural logarithms of both sides of the equation, to a model that satisfies the linear regression model provided the ɛi’s have a log-normal distribution (Draper and Smith, pages 222–225).
When the responses are nonnormal and their distribution is known, a transformation of the responses can often be selected so that the transformed responses closely satisfy the regression model assumptions. The square root transformation for counts with a Poisson distribution and the arc-sine transformation for binomial proportions are common examples (Snedecor and Cochran, 1967, pages 325–330; Draper and Smith, pages 237–239).
If the distribution of the responses is not known, the data can be used to select a transformation so that the transformed responses may more closely obey the regression model. For a positive response variable y > 0, the family of power transformations indexed by λ,

y(λ) = (y^λ − 1)/λ   for λ ≠ 0
y(λ) = ln y          for λ = 0

and generalizations of this family are useful. Routine BCTR (see Chapter 8, “Time Series Analysis and Forecasting”) can be used to perform the transformation. A method to estimate λ and to compute an approximate test for λ = 1 is given by Atkinson (1973). Also, Atkinson (1986) discusses transformation deletion statistics for computing the estimate and test with a single observation left out, since the evidence for a transformation of the response may sometimes depend crucially on one or a few observations.
Alternatives to Least Squares
The method of least squares has desirable characteristics when the errors are normally distributed, e.g., a least-squares solution produces maximum likelihood estimates of the regression parameters. However, when errors are not normally distributed, least squares may yield poor estimators. The least absolute value (LAV, L1) criterion yields the maximum likelihood estimate when the errors follow a Laplace distribution. Routine RLAV is often used when the errors have a heavy tailed distribution or when a fit is needed that is resistant to outliers.
A more general approach, minimizing the Lp norm (p ≥ 1), is given by routine RLLP. Although for the case p = 1 the routine requires about 30 times the CPU time that RLAV would use, the generality of RLLP allows the user to try several choices for p ≥ 1 by simply changing the input value of p in the calling program. The CPU time decreases as p gets larger. Generally, choices of p between 1 and 2 are of interest. However, the Lp norm solution for values of p larger than 2 can also be computed.
The minimax (LMV, L∞, Chebyshev) criterion is used by RLMV. Its estimates are very sensitive to outliers; however, the minimax estimators are quite efficient if the errors are uniformly distributed.
Routine PLSR provides a fourth alternative that is useful when there are many interrelated regression variables and relatively few observations. PLSR finds linear combinations of the predictor variables that have the highest covariance with Y.
Missing Values
NaN (not a number) is the missing value code used by the regression routines. Use function AMACH(6) (or function DMACH(6) with double precision regression routines) to retrieve NaN. (See the section Machine-Dependent Constants in Reference Material.) Any element of the data matrix that is missing must be set to AMACH(6) (or DMACH(6) for double precision). In fitting regression models, any row of the data matrix containing NaN for the independent, dependent, weight, or frequency variables is omitted from the computation of the regression parameters.
Often predicted values and confidence intervals are desired for combinations of settings of the independent variables not used in computing the regression fit. This can be accomplished by including additional rows in the data matrix. These additional rows should contain the desired settings of the independent variables along with the responses set equal to NaN. The cases with NaN will not be used in determining the estimates of the regression parameters, and a predicted value and confidence interval will be computed from the given settings of the independent variables.
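A sketch of this usage follows (hypothetical data matrix with the response in column 3; the last row carries desired settings of the independent variables with the response set to NaN so that the row is used only for prediction):

      INTEGER    NOBS
      PARAMETER  (NOBS=5)
      REAL       AMACH, X(NOBS,3)
      EXTERNAL   AMACH
C     (fill rows 1 through NOBS-1 of X with observed data here)
C                                 Desired settings of the independent
C                                 variables for prediction
      X(NOBS,1) = 2.5
      X(NOBS,2) = 7.0
C                                 Missing response: this row is
C                                 excluded from the fit, but a
C                                 predicted value and confidence
C                                 interval are computed for it
      X(NOBS,3) = AMACH(6)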