regression

Fits a multiple linear regression model using least squares.

Synopsis

regression (x, y)

Required Arguments

float x[[]] (Input)
Array of size nObservations × nIndependent containing the matrix of independent (explanatory) variables.
float y[] (Input)
Array of length nObservations containing the dependent (response) variable.

Return Value

If the optional argument noIntercept is not used, regression returns an array of length nIndependent + 1 containing a least-squares solution for the regression coefficients. The estimated intercept is the initial component of the array.
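For example, under the default model the intercept can be separated from the slope estimates directly. A minimal sketch with hypothetical data:

from pyimsl.math.regression import regression

# Hypothetical data: 4 observations on 2 independent variables.
x = [[1.0, 2.0],
     [2.0, 1.0],
     [3.0, 4.0],
     [4.0, 3.0]]
y = [3.1, 2.9, 7.2, 6.8]

coefficients = regression(x, y)
intercept = coefficients[0]   # the initial component
slopes = coefficients[1:]     # one slope per independent variable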

Optional Arguments

noIntercept

By default, the fitted value for observation i is

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \ldots + \hat{\beta}_kx_{ik}\]

where k = nIndependent. If noIntercept is specified, the intercept term

\[\hat{\beta}_0\]

is omitted from the model.

tolerance, float (Input)
The tolerance used in determining linear dependence. For regression, tolerance = 100 × machine(4) is the default. See machine.
rank (Output)
The rank of the fitted model is returned in rank.
coefCovariances (Output)
The m × m array containing the estimated variances and covariances of the estimated regression coefficients. Here, m is the number of regression coefficients in the model: if noIntercept is specified, m = nIndependent; otherwise, m = nIndependent + 1. (See the sketch following the table of anovaTable elements below.)
xMean (Output)
The array containing the estimated means of the independent variables.
residual (Output)
The array containing the residuals.
anovaTable (Output)

The array containing the analysis of variance table.

The analysis of variance statistics are given as follows:

Element Analysis of Variance Statistics
0 degrees of freedom for the model
1 degrees of freedom for error
2 total (corrected) degrees of freedom
3 sum of squares for the model
4 sum of squares for error
5 total (corrected) sum of squares
6 model mean square
7 error mean square
8 overall F-statistic
9 p-value
10 \(R^2\) (in percent)
11 adjusted \(R^2\) (in percent)
12 estimate of the standard deviation
13 overall mean of y
14 coefficient of variation (in percent)
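
The output arguments above follow the convention used in Example 2 below: pass an empty list, which regression fills in place. The following sketch retrieves rank and coefCovariances and computes coefficient standard errors; it assumes, hypothetically, that the scalar rank arrives as the single element of its list and the covariance matrix as a nested sequence indexable as coef_cov[i][j].

from math import sqrt
from pyimsl.math.regression import regression

# Hypothetical data: 4 observations on 2 independent variables.
x = [[1.0, 2.0],
     [2.0, 1.0],
     [3.0, 4.0],
     [4.0, 3.0]]
y = [3.1, 2.9, 7.2, 6.8]

rank = []        # receives the rank of the fitted model
coef_cov = []    # receives the m x m covariance matrix, m = 3 here
coefficients = regression(x, y, rank=rank, coefCovariances=coef_cov)

# Standard errors are the square roots of the diagonal of the
# estimated covariance matrix of the coefficients.
std_errors = [sqrt(coef_cov[i][i]) for i in range(len(coefficients))]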

Description

The function regression fits a multiple linear regression model with or without an intercept. By default, the multiple linear regression model is

\[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_kx_{ik} + \varepsilon_i \phantom{...} i = 1, 2, \ldots, n\]

where the observed values of the \(y_i\)’s (input in y) are the responses or values of the dependent variable; the \(x_{i1}\)’s, \(x_{i2}\)’s, …, \(x_{ik}\)’s (input in x) are the settings of the k (input in nIndependent) independent variables; \(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression coefficients whose estimated values are to be output by regression; and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean zero and variance \(\sigma^2\). Here, n is the number of rows in the augmented matrix (x,y), i.e., n equals nObservations. Note that by default, \(\beta_0\) is included in the model.

The function regression computes estimates of the regression coefficients by minimizing the sum of squares of the deviations of the observed response \(y_i\) from the fitted response

\[\hat{y}_i\]

for the n observations. This minimum sum of squares (the error sum of squares) is output as one of the analysis of variance statistics if anovaTable is specified and is computed as

\[\mathit{SSE} = \sum_{i=1}^{n} \left(y_i-\hat{y}_i\right)^2\]

Another analysis of variance statistic is the total sum of squares. By default, the total sum of squares is the sum of squares of the deviations of \(y_i\) from its mean

\[\overline{y}\]

the so-called corrected total sum of squares. This statistic is computed as

\[\mathit{SST} = \sum_{i=1}^{n} \left(y_i-\overline{y}\right)^2\]

When noIntercept is specified, the total sum of squares is the sum of squares of \(y_i\), the so-called uncorrected total sum of squares. This is computed as

\[\mathit{SST} = \sum_{i=1}^{n} y_i^2\]

See Draper and Smith (1981) for a good general treatment of the multiple linear regression model, its analysis, and many examples.
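
These quantities are easy to recompute directly from the fitted coefficients. The following sketch, using the data of Example 1 below, cross-checks anovaTable elements 4, 5, and 10:

import numpy as np
from pyimsl.math.regression import regression

# Data from Example 1 below.
x = np.array([[7.0, 5.0, 6.0], [2.0, -1.0, 6.0], [7.0, 3.0, 5.0],
              [-3.0, 1.0, 4.0], [2.0, -1.0, 0.0], [2.0, 1.0, 7.0],
              [-3.0, -1.0, 3.0], [2.0, 1.0, 1.0], [2.0, 1.0, 4.0]])
y = np.array([7.0, -5.0, 6.0, 5.0, 5.0, -2.0, 0.0, 8.0, 3.0])

b = np.asarray(regression(x, y))
yhat = b[0] + x @ b[1:]                # fitted values (intercept model)

sse = np.sum((y - yhat) ** 2)          # anovaTable element 4
sst = np.sum((y - y.mean()) ** 2)      # anovaTable element 5
r2 = 100.0 * (1.0 - sse / sst)         # anovaTable element 10, percent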

In order to compute a least-squares solution, regression performs an orthogonal reduction of the matrix of regressors to upper-triangular form. The reduction is based on one pass through the rows of the augmented matrix (x, y) using fast Givens transformations. (See Golub and Van Loan 1983, pp. 156-162; Gentleman 1974.) This method has the advantage that the loss of accuracy resulting from forming the crossproduct matrix used in the normal equations is avoided.
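
To make the one-pass reduction concrete, the sketch below folds each row of the augmented matrix into an upper-triangular factor using ordinary Givens rotations. This is a simplified illustration, not the library's implementation: regression uses the fast, square-root-free variant and also centers the data as described next.

import numpy as np

def givens_update(R, row):
    # Fold one data row into the upper-triangular factor R, in place.
    r = np.asarray(row, dtype=float).copy()
    for j in range(R.shape[0]):
        if r[j] == 0.0:
            continue
        h = np.hypot(R[j, j], r[j])
        c, s = R[j, j] / h, r[j] / h
        Rj = R[j, j:].copy()
        R[j, j:] = c * Rj + s * r[j:]
        r[j:] = -s * Rj + c * r[j:]

# Data from Example 1 below, with a column of ones prepended for the
# intercept; the last column of the augmented matrix is y.
x = np.array([[7.0, 5.0, 6.0], [2.0, -1.0, 6.0], [7.0, 3.0, 5.0],
              [-3.0, 1.0, 4.0], [2.0, -1.0, 0.0], [2.0, 1.0, 7.0],
              [-3.0, -1.0, 3.0], [2.0, 1.0, 1.0], [2.0, 1.0, 4.0]])
y = np.array([7.0, -5.0, 6.0, 5.0, 5.0, -2.0, 0.0, 8.0, 3.0])
aug = np.column_stack([np.ones(len(y)), x, y])

k = aug.shape[1] - 1                  # number of coefficients
R = np.zeros((k + 1, k + 1))
for row in aug:                       # one pass through the rows
    givens_update(R, row)

beta = np.linalg.solve(R[:k, :k], R[:k, k])   # back substitution
print(beta)                           # matches Example 1's coefficients
print(R[k, k] ** 2)                   # the error sum of squares, SSE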

By default, the current means of the dependent and independent variables are used to internally center the data for improved accuracy. Let \(x_i\) be a column vector containing the i-th row of data for the independent variables. Let \(\overline{x}_i\) represent the mean vector for the independent variables given the data for rows 1, 2, …, i. The current mean vector is defined to be

\[\overline{x}_i = \frac{\sum_{j=1}^{i} x_j}{i}\]

The i-th row of data has \(\overline{x}_i\) subtracted from it and is then weighted by \(i/(i − 1)\). Although a crossproduct matrix is not computed, the validity of this centering operation can be seen from the following formula for the sum of squares and crossproducts matrix:

\[\sum_{i=1}^{n} \left(x_i-\overline{x}_n\right) \left(x_i-\overline{x}_n\right)^T = \sum_{i=2}^{n} \frac{i}{i-1} \left(x_i-\overline{x}_i\right) \left(x_i-\overline{x}_i\right)^T\]
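
This identity is easy to verify numerically; a quick illustrative check with numpy:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((9, 3))   # any data matrix serves here
n = len(x)

# Left side: sums of squares and crossproducts about the overall mean.
d = x - x.mean(axis=0)
lhs = d.T @ d

# Right side: terms built from the current (running) means.
rhs = np.zeros((3, 3))
for i in range(2, n + 1):             # 1-based i = 2, ..., n
    xbar_i = x[:i].mean(axis=0)       # mean of the first i rows
    r = x[i - 1] - xbar_i             # i-th row less its current mean
    rhs += i / (i - 1) * np.outer(r, r)

print(np.allclose(lhs, rhs))          # True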

An orthogonal reduction on the centered matrix is computed. When the final computations are performed, the intercept estimate and the first row and column of the estimated covariance matrix of the estimated coefficients are updated (if coefCovariances is specified) to reflect the statistics for the original (uncentered) data. This means that the estimate of the intercept is for the uncentered data.

As part of the final computations, regression checks for linearly dependent regressors. In particular, linear dependence of the regressors is declared if any of the following three conditions are satisfied:

  • A regressor equals zero.
  • Two or more regressors are constant.
  • \(\sqrt{1-R_{i\cdot 1,2,\ldots,i-1}^2}\) is less than or equal to tolerance. Here, \(R_{i\cdot 1,2,\ldots,i-1}\) is the multiple correlation coefficient of the i-th independent variable with the first i − 1 independent variables. If no intercept is in the model, the “multiple correlation” coefficient is computed without adjusting for the mean.

On completion of the final computations, if the i-th regressor is declared to be linearly dependent upon the previous i − 1 regressors, then the i-th coefficient estimate and all elements in the i-th row and i-th column of the estimated variance-covariance matrix of the estimated coefficients (if coefCovariances is specified) are set to zero. Finally, if a linear dependence is declared, an informational (error) message, code IMSL_RANK_DEFICIENT, is issued indicating the model is not full rank.
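
For instance, a duplicated regressor triggers the dependence check. A minimal sketch with hypothetical data:

from pyimsl.math.regression import regression

# The second regressor duplicates the first, so it is declared
# linearly dependent on the preceding regressors.
x = [[1.0, 1.0],
     [2.0, 2.0],
     [3.0, 3.0],
     [4.0, 4.0]]
y = [1.0, 2.0, 3.0, 4.0]

rank = []
coefficients = regression(x, y, rank=rank)
# IMSL_RANK_DEFICIENT is issued; the dependent regressor's coefficient
# is set to zero, and the rank returned is less than the number of
# regression coefficients (three here, counting the intercept).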

Examples

Example 1

A regression model

\[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \varepsilon_i \phantom{...} i = 1,2,\ldots,9\]

is fitted to data taken from Maindonald (1984, pp. 203-204).

from pyimsl.math.regression import regression
from pyimsl.math.writeMatrix import writeMatrix

x = [[7.0, 5.0, 6.0],
     [2.0, -1.0, 6.0],
     [7.0, 3.0, 5.0],
     [-3.0, 1.0, 4.0],
     [2.0, -1.0, 0.0],
     [2.0, 1.0, 7.0],
     [-3.0, -1.0, 3.0],
     [2.0, 1.0, 1.0],
     [2.0, 1.0, 4.0]]
y = [7.0, -5.0, 6.0, 5.0, 5.0, -2.0, 0.0, 8.0, 3.0]

coefficients = regression(x, y)
writeMatrix('Least-Squares Coefficients', coefficients,
            colNumberZero=True)

Output

 
            Least-Squares Coefficients
          0            1            2            3
      7.733       -0.200        2.333       -1.667

Example 2

A weighted least-squares fit is computed using the model

\[y_i = \beta_0x_{i0} + \beta_1x_{i1} + \beta_2x_{i2} + \varepsilon_i \phantom{...} i = 1, 2, \ldots, 4\]

and weights \(1/i^2\) discussed by Maindonald (1984, pp. 67-68). To compute the weighted least-squares fit using an ordinary least-squares function (regression), the regressors (including the column of ones for the intercept term) and the responses must be transformed before regression is invoked. Specifically, the i-th response and regressors are multiplied by the square root of the i-th weight. noIntercept must be specified, since the column of ones corresponding to the intercept term in the untransformed model is transformed by the weights and is regarded as an additional independent variable.

In the example, anovaTable is specified. The minimum sum of squares for error in terms of the original untransformed regressors and responses for this weighted regression is

\[\mathit{SSE} = \sum_{i=1}^{4} w_i \left(y_i - \hat{y}_i\right)^2\]

where \(w_i=1/i^2\). Also, since noIntercept is specified, the uncorrected total sum of squares in terms of the original untransformed responses is

\[\mathit{SST} = \sum_{i=1}^{4} w_i y_i^2\]

from numpy import array, sqrt
from pyimsl.math.regression import regression
from pyimsl.math.writeMatrix import writeMatrix

n_observations = 4
n_independent = 3
x = array([[1.0, -2.0, 0.0],
           [1.0, -1.0, 2.0],
           [1.0, 2.0, 5.0],
           [1.0, 7.0, 3.0]])
y = [-3.0, 1.0, 2.0, 6.0]
anova_row_labels = ["degrees of freedom for regression",
                    "degrees of freedom for error",
                    "total (uncorrected) degrees of freedom",
                    "sum of squares for regression",
                    "sum of squares for error",
                    "total (uncorrected) sum of squares",
                    "regression mean square",
                    "error mean square", "F-statistic",
                    "p-value", "R-squared (in percent)",
                    "adjusted R-squared (in percent)",
                    "est. standard deviation of model error",
                    "overall mean of y",
                    "coefficient of variation (in percent)"]

power = 0.0
for i in range(0, n_observations):
    power += 1.0
    # The square root of the weight
    w = sqrt(1.0 / (power * power))
    # Transform response
    y[i] *= w
    # Transform regressors
    for j in range(0, n_independent):
        x[i][j] *= w
anova_table = []
coefficients = regression(x, y, noIntercept=True,
                          anovaTable=anova_table)
writeMatrix("Least-Squares Coefficients", coefficients)
writeMatrix("* * * Analysis of Variance * * *\n", anova_table,
            rowLabels=anova_row_labels,
            writeFormat="%10.2f",
            column=True)

Output

 
     Least-Squares Coefficients
          1            2            3
     -1.431        0.658        0.748
 
         * * * Analysis of Variance * * *

degrees of freedom for regression             3.00
degrees of freedom for error                  1.00
total (uncorrected) degrees of freedom        4.00
sum of squares for regression                10.93
sum of squares for error                      1.01
total (uncorrected) sum of squares           11.94
regression mean square                        3.64
error mean square                             1.01
F-statistic                                   3.60
p-value                                       0.37
R-squared (in percent)                       91.52
adjusted R-squared (in percent)              66.08
est. standard deviation of model error        1.01
overall mean of y                            -0.08
coefficient of variation (in percent)     -1207.73
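
As a quick check on the table, the regression and error sums of squares add to the uncorrected total (10.93 + 1.01 = 11.94), and R² is the regression share of that total in percent: 100 × 10.93/11.94 ≈ 91.5, agreeing with the 91.52 shown up to the rounding of the displayed sums of squares.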

Warning Errors

IMSL_RANK_DEFICIENT The model is not full rank. There is not a unique least-squares solution.