regression

Fits a multivariate linear regression model using least squares.

Synopsis

regression(x, y)

Required Arguments

float x[[]] (Input)
Array of size nRows × nIndependent containing the independent (explanatory) variable(s). The i‑th column of x contains the i‑th independent variable.
float y[] (Input)
Array of size nRows × nDependent containing the dependent (response) variable(s). The i‑th column of y contains the i‑th dependent variable. See optional argument nDependent to set the value of nDependent.

Return Value

If the optional argument noIntercept is not used, regression returns an array of length nDependent × (nIndependent + 1) containing a least-squares solution for the regression coefficients. The estimated intercept is the initial component of each row, where the i‑th row contains the regression coefficients for the i‑th dependent variable.

Optional Arguments

nDependent, int (Input)
Number of dependent variables. Input matrix y must be declared of size nRows by nDependent, where column i of y contains the i‑th dependent variable. Default: nDependent = 1
xIndices, int indind[], int inddep[], int ifrq, int iwt (Input)

This argument allows an alternative method of data specification. All of the data (independent, dependent, frequency, and weight values) are stored in the data matrix x. The argument y and the optional arguments frequencies and weights are ignored.

Each of the four arguments contains indices indicating column numbers of x in which particular types of data are stored.

Parameter indind contains the indices of the independent variables.

Parameter inddep contains the indices of the dependent variables.

Parameters ifrq and iwt contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set ifrq = −1 if there is no column for frequencies. Set iwt = −1 if there is no column for weights. Frequencies are rounded to the nearest integer. Negative weights are not allowed.

Note that the required input argument y is not referenced, and can be passed as a placeholder (see Example 3).
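For instance, a condensed sketch of this calling pattern (Example 3 below gives the complete program):

indind = (0, 1, 2)    # columns 0-2 of x hold the independent variables
inddep = (3, 4)       # columns 3-4 of x hold the dependent variables
xIndices = {"indind": indind, "inddep": inddep,
            "ifrq": -1,       # no frequency column
            "iwt": -1}        # no weight column
dummy = []                    # y is not referenced
coefficients = regression(x, dummy, nDependent=2, xIndices=xIndices)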

ido, int (Input)

Processing option.

The argument ido must be one of 0, 1, 2, or 3. If ido = 0 (the default), all of the observations are input in one invocation. If ido = 1, 2, or 3, blocks of rows of the data can be processed sequentially in separate invocations of regression; with this option it is not a requirement that all observations be memory resident, so very large data sets can be handled. (Example 4 demonstrates this usage.)

ido Action
0 This is the only invocation; all the data are input at once. (Default)
1 This is the first invocation with this data; additional calls will be made. Initialization and updating for the nRows observations of x will be performed.
2 This is an intermediate invocation; updating for the nRows observations of x will be performed.
3 This is the final invocation of this function. Updating for the data in x and wrap-up computations are performed, and workspace is released. No further invocations of regression with ido greater than 1 should be made without first invoking regression with ido = 1.

Default: ido = 0

rowsAdd, or

rowsDelete

By default (or if rowsAdd is specified), the observations in x are added to the regression statistics. If rowsDelete is specified, the observations are deleted.

If ido = 0, these optional arguments are ignored (data is always added if there is only one invocation).
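As a sketch (x_block and y_block are hypothetical names for one block of data; the boolean-keyword form for the flag is assumed, consistent with the other optional arguments in this document):

# remove a previously added block during sequential processing
# (rowsDelete flag form assumed; x_block/y_block are hypothetical)
coefficients = regression(x_block, y_block, ido=2, rowsDelete=True)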

intercept, or

noIntercept

intercept is the default, in which case the fitted value for observation i is

\[\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \ldots + \hat{\beta}_k x_{ik}\]

where k = nIndependent. If noIntercept is specified, the intercept term

\[\left(\hat{\beta}_0\right)\]

is omitted from the model and the return value from regression is an array of length nDependent × nIndependent.
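A minimal sketch (the boolean-keyword form for the flag is assumed, consistent with the other optional arguments in this document):

# fit through the origin; the result has nDependent x nIndependent entries
coefficients = regression(x, y, noIntercept=True)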

tolerance, float (Input)
Tolerance used in determining linear dependence. Default: tolerance = 100 × machine(4). (See the documentation for function machine in Chapter 15, Utilities.)
rank (Output)
Rank of the fitted model is returned in rank.
coefCovariances (Output)

The nDependent × m × m array containing the estimated variances and covariances of the estimated regression coefficients. Here, m is the number of regression coefficients in the model. If noIntercept is specified, m = nIndependent; otherwise, m = nIndependent + 1.

The first m × m elements contain the matrix for the first dependent variable, the next m × m elements contain the matrix for the next dependent variable, … and so on.

xMean (Output)
The array containing the estimated means of the independent variables.
residual (Output)
The array of size nRows by nDependent containing the residuals. Residuals may not be requested if ido > 0.
anovaTable (Output)

The array of size 15 × nDependent containing the analysis of variance table for each dependent variable. The i‑th column corresponds to the analysis for the i‑th dependent variable.

The analysis of variance statistics are given as follows:

Element Analysis of Variance Statistics
0 degrees of freedom for the model
1 degrees of freedom for error
2 total (corrected) degrees of freedom
3 sum of squares for the model
4 sum of squares for error
5 total (corrected) sum of squares
6 model mean square
7 error mean square
8 overall F-statistic
9 p-value
10 \(R^2\) (in percent)
11 adjusted \(R^2\) (in percent)
12 estimate of the standard deviation
13 overall mean of y
14 coefficient of variation (in percent)

The anova statistics may not be requested if ido > 0. Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.
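For a single dependent variable, the statistics can then be read out by the element numbers above; a short sketch (the array is flattened first, so the indexing holds regardless of how the output is nested):

from numpy import array

anova_table = []
coefficients = regression(x, y, anovaTable=anova_table)
stats = array(anova_table).flatten()  # the 15 statistics for one dependent variable
p_value = stats[9]                    # element 9: p-value of the overall F-test
r_squared = stats[10]                 # element 10: R-squared, in percent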

scpe (Output)
An array of size nDependent × nDependent containing the error (residual) sums of squares and crossproducts. scpe[m][n] contains the sum of crossproducts for the m‑th and n‑th dependent variables.
frequencies, float[] (Input)

Array of length nRows containing the frequency for each observation.

Default: frequencies[] = 1

weights, float[] (Input)

Array of length nRows containing the weight for each observation.

Default: weights[] = 1

regressionInfo (Output)
A structure containing information about the regression fit. This structure is required as input for functions regressionPrediction and regressionSummary.
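A typical follow-up is sketched below; the exact calling sequence of regressionPrediction should be checked against its own documentation, and the assumption here is that the structure is retrieved as the first element of the supplied list:

from pyimsl.stat.regressionPrediction import regressionPrediction

regression_info = []
coefficients = regression(x, y, regressionInfo=regression_info)
# assumption: the fitted structure is stored as the first list element
y_hat = regressionPrediction(regression_info[0], x)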

Description

Function regression fits a multivariate multiple linear regression model with or without an intercept. The multiple linear regression model is

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n\]

where the observed values of the \(y_i\)’s are the responses or values of the dependent variable; the \(x_{i1}\)’s, \(x_{i2}\)’s, …, \(x_{ik}\)’s are the settings of the k (= nIndependent) independent variables; \(\beta_0,\beta_1,\ldots,\beta_k\) are the regression coefficients whose estimated values are to be output by regression; and the \(\varepsilon_i\)’s are independently distributed normal errors each with mean 0 and variance \(\sigma^2\). Here, n is the sum of the frequencies for all nonmissing observations, i.e.,

\[n = \sum_{i=0}^{\mathtt{nRows}-1} f_i\]

where \(f_i\) is equal to frequencies[i] if optional argument frequencies is specified and equal to 1.0 otherwise. Note that by default, \(\beta_0\) is included in the model.

More generally, regression fits a multivariate regression model. See the chapter introduction for a description of the multivariate model.

Function regression computes estimates of the regression coefficients by minimizing the sum of squares of the deviations of the observed response \(y_i\) from the fitted response

\[\hat{y}_i\]

for the n observations. This minimum sum of squares (the error sum of squares) is output as one of the analysis of variance statistics if anovaTable is specified and is computed as follows:

\[\mathrm{SSE} = \sum_{i=1}^{n} w_i \left(y_i - \hat{y}_i\right)^2\]

Another analysis of variance statistic is the total sum of squares. By default, the total sum of squares is the sum of squares of the deviations of \(y_i\) from its mean

\[\overline{y}\]

the so-called corrected total sum of squares. This statistic is computed as follows:

\[\mathrm{SST} = \sum_{i=1}^{n} w_i \left(y_i - \overline{y}\right)^2\]

When noIntercept is specified, the total sum of squares is the sum of squares of \(y_i\), the so-called uncorrected total sum of squares. This is computed as follows:

\[\mathrm{SST} = \sum_{i=1}^{n} w_i y_i^2\]
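These sums are straightforward to reproduce outside the library. A small NumPy helper (an independent check, not part of pyimsl) mirroring the formulas above, with \(\overline{y}\) taken as the weighted mean of y (the quantity reported as the overall mean in the ANOVA table):

import numpy as np

def anova_sums(y, y_hat, w=None, intercept=True):
    """Return (SSE, SST) as defined above; w defaults to all ones."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    w = np.ones_like(y) if w is None else np.asarray(w, float)
    sse = np.sum(w * (y - y_hat) ** 2)        # error sum of squares
    if intercept:
        y_bar = np.sum(w * y) / np.sum(w)     # weighted mean of y
        sst = np.sum(w * (y - y_bar) ** 2)    # corrected total SS
    else:
        sst = np.sum(w * y ** 2)              # uncorrected total SS
    return sse, sst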

See Draper and Smith (1981) for a good general treatment of the multiple linear regression model, its analysis, and many examples.

In order to compute a least-squares solution, regression performs an orthogonal reduction of the matrix of regressors to upper-triangular form. The reduction is based on one pass through the rows of the augmented matrix (x, y) using fast Givens transformations. (See Golub and Van Loan 1983, pp. 156–162; Gentleman 1974.) This method has the advantage that the loss of accuracy resulting from forming the crossproduct matrix used in the normal equations is avoided.

By default, the current means of the dependent and independent variables are used to internally center the data for improved accuracy. Let \(x_i\) be a column vector containing the i‑th row of data for the independent variables. Let \(\overline{x}_i\) represent the mean vector for the independent variables given the data for rows \(1,2,\ldots,i\).

The current mean vector is defined as follows:

\[\overline{x}_i = \frac{\sum\limits_{j=1}^{i} w_j f_j x_j}{\sum\limits_{j=1}^{i} w_j f_j}\]

where the \(w_j\)’s and the \(f_j\)’s are the weights and frequencies. The i‑th row of data has

\[\overline{x}_i\]

subtracted from it and is multiplied by

\[\sqrt{w_i f_i \frac{a_i}{a_{i-1}}}\]

where

\[a_i = \sum_{j=1}^{i} w_j f_j\]

Although a crossproduct matrix is not computed, the validity of this centering operation can be seen from the following formula for the sum of squares and crossproducts matrix:

\[\sum_{i=1}^{n} w_i f_i \left( x_i - \overline{x}_n \right) \left( x_i - \overline{x}_n \right)^T = \sum_{i=2}^{n} \frac{a_i}{a_{i-1}} w_i f_i \left( x_i - \overline{x}_i \right) \left( x_i - \overline{x}_i \right)^T\]

An orthogonal reduction on the centered matrix is computed. When the final computations are performed, the intercept estimate and the first row and column of the estimated covariance matrix of the estimated coefficients are updated (if coefCovariances is specified) to reflect the statistics for the original (uncentered) data. This means that the estimate of the intercept is for the uncentered data.
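The centering identity above can be confirmed numerically. A short NumPy check (independent of the library) with random data, weights, and frequencies:

import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
x = rng.normal(size=(n, p))                    # data rows
w = rng.uniform(0.5, 2.0, size=n)              # weights w_i
f = rng.integers(1, 4, size=n).astype(float)   # frequencies f_i

a = np.cumsum(w * f)                                         # a_i
xbar = np.cumsum((w * f)[:, None] * x, axis=0) / a[:, None]  # current means

lhs = sum(w[i] * f[i] * np.outer(x[i] - xbar[-1], x[i] - xbar[-1])
          for i in range(n))
rhs = sum(a[i] / a[i - 1] * w[i] * f[i]
          * np.outer(x[i] - xbar[i], x[i] - xbar[i])
          for i in range(1, n))
assert np.allclose(lhs, rhs)                   # the two sides agree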

As part of the final computations, regression checks for linearly dependent regressors. In particular, linear dependence of the regressors is declared if any of the following three conditions are satisfied:

  • A regressor equals 0.

  • Two or more regressors are constant.

  • The quantity

    \[\sqrt{1 - R_{i \cdot 1,2,\ldots,i-1}^{2}}\]

    is less than or equal to tolerance. Here,

    \[R_{i \cdot 1,2,\ldots,i-1}\]

    is the multiple correlation coefficient of the i‑th independent variable with the first i − 1 independent variables. If no intercept is in the model, the multiple correlation coefficient is computed without adjusting for the mean.

On completion of the final computations, if the i‑th regressor is declared to be linearly dependent upon the previous i − 1 regressors, the i‑th coefficient estimate and all elements in the i‑th row and i‑th column of the estimated variance-covariance matrix of the estimated coefficients (if coefCovariances is specified) are set to 0. Finally, if a linear dependence is declared, an informational (error) message, code IMSLS_RANK_DEFICIENT, is issued indicating the model is not full rank.
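A rough NumPy illustration of the third rule (a hypothetical helper based on ordinary least squares, not the library's internal Givens-based computation; it assumes an intercept model and a non-constant column i):

import numpy as np

def nearly_dependent(x, i, tolerance):
    """True if sqrt(1 - R^2) of column i on columns 0..i-1 is <= tolerance."""
    x = np.asarray(x, dtype=float)
    xi = x[:, i]
    # earlier columns plus an intercept (the with-intercept case)
    prev = np.column_stack([np.ones(len(xi)), x[:, :i]])
    coef = np.linalg.lstsq(prev, xi, rcond=None)[0]
    sse = np.sum((xi - prev @ coef) ** 2)   # unexplained variation
    sst = np.sum((xi - xi.mean()) ** 2)     # variation about the mean
    return np.sqrt(sse / sst) <= tolerance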

Examples

Example 1

A regression model

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i, \quad i = 1, 2, \ldots, 9\]

is fitted to data taken from Maindonald (1984, pp. 203–204).

from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.writeMatrix import writeMatrix

intercept = 1
n_independent = 3
n_coefficients = (intercept + n_independent)
n_observations = 9
x = [[7.0, 5.0, 6.0],
     [2.0, -1.0, 6.0],
     [7.0, 3.0, 5.0],
     [-3.0, 1.0, 4.0],
     [2.0, -1.0, 0.0],
     [2.0, 1.0, 7.0],
     [-3.0, -1.0, 3.0],
     [2.0, 1.0, 1.0],
     [2.0, 1.0, 4.0]]
y = [7.0, -5.0, 6.0, 5.0, 5.0, -2.0, 0.0, 8.0, 3.0]

coefficients = regression(x, y)

writeMatrix("Least-Squares Coefficients", coefficients,
            colNumberZero=True)

Output

 
            Least-Squares Coefficients
          0            1            2            3
      7.733       -0.200        2.333       -1.667

Example 2

A weighted least-squares fit is computed using the model

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \quad i = 1, 2, \ldots, 4\]

and weights \(1/i^2\) discussed by Maindonald (1984, pp. 67−68).

In the example, weights is specified. The minimum sum of squares for error in terms of the original untransformed regressors and responses for this weighted regression is

\[\mathit{SSE} = \sum_{i=1}^{4} w_i \left(y_i - \hat{y}_i\right)^2\]

where \(w_i=1/i^2\), represented in the program below as the array w.

from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.writeMatrix import writeMatrix

n_independent = 2
n_coefficients = n_independent + 1
n_observations = 4
x = [[-2.0, 0.0], [-1.0, 2.0], [2.0, 5.0], [7.0, 3.0]]
y = [-3.0, 1.0, 2.0, 6.0]
anova_row_labels = \
    ["degrees of freedom for regression",
     "degrees of freedom for error",
     "total (uncorrected) degrees of freedom",
     "sum of squares for regression",
     "sum of squares for error",
     "total (uncorrected) sum of squares",
     "regression mean square",
     "error mean square", "F-statistic",
     "p-value", "R-squared (in percent)",
     "adjusted R-squared (in percent)",
     "est. standard deviation of model error",
     "overall mean of y",
     "coefficient of variation (in percent)"]

# Calculate weights
power = 0.0
w = empty((n_observations), dtype=double)
for i in range(0, n_observations):
    power = power + 1.0
    w[i] = 1.0 / (power * power)

# Perform analysis
anova_table = []
coefficients = regression(x, y, weights=w,
                          anovaTable=anova_table)

# Print results
writeMatrix("Least Squares Coefficients", coefficients)
writeMatrix("* * * Analysis of Variance * * *\n",
            anova_table, rowLabels=anova_row_labels,
            writeFormat="%10.2f", column=True)

Output

 
     Least Squares Coefficients
          1            2            3
     -1.431        0.658        0.748
 
         * * * Analysis of Variance * * *

degrees of freedom for regression             2.00
degrees of freedom for error                  1.00
total (corrected) degrees of freedom          3.00
sum of squares for regression                 7.68
sum of squares for error                      1.01
total (corrected) sum of squares              8.69
regression mean square                        3.84
error mean square                             1.01
F-statistic                                   3.79
p-value                                       0.34
R-squared (in percent)                       88.34
adjusted R-squared (in percent)              65.03
est. standard deviation of model error        1.01
overall mean of y                            -1.51
coefficient of variation (in percent)       -66.55

Example 3

A multivariate regression is performed for a data set with two dependent variables. Also, usage of the keyword xIndices is demonstrated. Note that the required input variable y is not referenced, and is passed as an empty placeholder list (dummy).

from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.writeMatrix import writeMatrix

n_independent = 3
n_dependent = 2
dummy = []
scpe = []
anova_table = []
regressionInfo = []
rank = []
anova_row_labels = \
    ["degrees of freedom for regression",
     "degrees of freedom for error",
     "total (uncorrected) degrees of freedom",
     "sum of squares for regression",
     "sum of squares for error",
     "total (uncorrected) sum of squares",
     "regression mean square",
     "error mean square", "F-statistic",
     "p-value", "R-squared (in percent)",
     "adjusted R-squared (in percent)",
     "est. standard deviation of model error",
     "overall mean of y",
     "coefficient of variation (in percent)"]
x = ([[7.0, 5.0, 6.0, 7.0, 1.0],
      [2.0, -1.0, 6.0, -5.0, 4.0],
      [7.0, 3.0, 5.0, 6.0, 10.0],
      [-3.0, 1.0, 4.0, 5.0, 5.0],
      [2.0, -1.0, 0.0, 5.0, -2.0],
      [2.0, 1.0, 7.0, -2.0, 4.0],
      [-3.0, -1.0, 3.0, 0.0, -6.0],
      [2.0, 1.0, 1.0, 8.0, 2.0],
      [2.0, 1.0, 4.0, 3.0, 0.0]])
ifrq = -1
iwt = -1
indind = (0, 1, 2)
inddep = (3, 4)
xIndices = {}
xIndices["indind"] = indind
xIndices["inddep"] = inddep
xIndices["ifrq"] = ifrq
xIndices["iwt"] = iwt

coefficients = regression(x, dummy,
                          nDependent=n_dependent,
                          xIndices=xIndices,
                          scpe=scpe,
                          anovaTable=anova_table,
                          regressionInfo=regressionInfo,
                          rank=rank)

writeMatrix("Least Squares Coefficients", coefficients)
writeMatrix("SCPE", scpe, writeFormat="%10.4f")
writeMatrix("* * * Analysis of Variance * * *\n",
            anova_table, rowLabels=anova_row_labels,
            writeFormat="%10.2f")

Output

 
             Least Squares Coefficients
             1            2            3            4
1        7.733       -0.200        2.333       -1.667
2       -1.633        0.400        0.167        0.667
 
          SCPE
            1           2
1      4.0000     20.0000
2     20.0000    110.0000
 
     * * * Analysis of Variance * * *

                              1           2
degrees of freedom         3.00        3.00
   for regression                          
degrees of freedom         5.00        5.00
   for error                               
total (corrected)          8.00        8.00
   degrees of                              
   freedom                                 
sum of squares           152.00       56.00
   for regression                          
sum of squares             4.00      110.00
   for error                               
total (corrected)        156.00      166.00
   sum of squares                          
regression mean           50.67       18.67
   square                                  
error mean square          0.80       22.00
F-statistic               63.33        0.85
p-value                    0.00        0.52
R-squared (in             97.44       33.73
   percent)                                
adjusted R-squared        95.90        0.00
   (in percent)                            
est. standard              0.89        4.69
   deviation of                            
   model error                             
overall mean of y          3.00        2.00
coefficient of            29.81      234.52
   variation (in                           
   percent)

Example 4

Continuing with Example 1 data, the example below invokes the regression function using values of ido greater than 0. Also, usage of the keywords coefCovariances and xMean is demonstrated.

from numpy import *
from pyimsl.stat.regression import regression
from pyimsl.stat.writeMatrix import writeMatrix

n_independent = 3
n_observations_block_1 = 3
n_observations_block_2 = 3
n_observations_block_3 = 3
n_coefficients = 4

x1 = [[7.0, 5.0, 6.0],
      [2.0, -1.0, 6.0],
      [7.0, 3.0, 5.0]]
x2 = [[-3.0, 1.0, 4.0],
      [2.0, -1.0, 0.0],
      [2.0, 1.0, 7.0]]
x3 = [[-3.0, -1.0, 3.0],
      [2.0, 1.0, 1.0],
      [2.0, 1.0, 4.0]]
y1 = [7.0, -5.0, 6.0]
y2 = [5.0, 5.0, -2.0]
y3 = [0.0, 8.0, 3.0]
coefCovariances = []
xMean = []
coefficients = regression(x1, y1, ido=1)
coefficients = regression(x2, y2, ido=2)
coefficients = regression(x3, y3, ido=3,
                          coefCovariances=coefCovariances,
                          xMean=xMean)

writeMatrix("Least-Squares Coefficients", coefficients)
writeMatrix("Coefficient Covariance", coefCovariances,
            printUpper=True)
writeMatrix("x means", xMean)

Output

 
            Least-Squares Coefficients
          1            2            3            4
      7.733       -0.200        2.333       -1.667
 
               Coefficient Covariance
             1            2            3            4
1       0.3951      -0.0120       0.0289      -0.0778
2                    0.0160      -0.0200       0.0000
3                                 0.0556      -0.0111
4                                              0.0222
 
               x means
          1            2            3
          2            1            4

Warning Errors

IMSLS_RANK_DEFICIENT The model is not full rank. There is not a unique least-squares solution.

Fatal Errors

IMSLS_BAD_IDO_6 “ido” = #. Initial allocations must be performed by invoking the function with “ido” = 1.
IMSLS_BAD_IDO_7 “ido” = #. A new analysis may not begin until the previous analysis is terminated by invoking the function with “ido” = 3.