regression¶
Fits a multiple linear regression model using least squares.
Synopsis¶
regression(x, y)
Required Arguments¶
- float x[[]] (Input): Array of size nObservations × nIndependent containing the matrix of independent (explanatory) variables.
- float y[] (Input): Array of length nObservations containing the dependent (response) variable.
Return Value¶
If the optional argument noIntercept is not used, regression returns an array of length nIndependent + 1 containing a least-squares solution for the regression coefficients. The estimated intercept is the initial component of the array. If noIntercept is used, the returned array has length nIndependent and contains no intercept component.
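As a quick sketch of the return shape (the data here are hypothetical; only the array lengths are the point):

```python
from pyimsl.math.regression import regression

# Hypothetical data: 5 observations, 2 independent variables
x = [[1.0, 2.0],
     [2.0, 1.0],
     [3.0, 4.0],
     [4.0, 3.0],
     [5.0, 6.0]]
y = [3.1, 3.9, 7.2, 6.8, 11.1]

coefficients = regression(x, y)
print(len(coefficients))    # nIndependent + 1 = 3; coefficients[0] is the intercept

no_int = regression(x, y, noIntercept=True)
print(len(no_int))          # nIndependent = 2; no intercept component
```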
Optional Arguments¶
- noIntercept (Input): By default, the fitted value for observation i is
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \ldots + \hat{\beta}_k x_{ik}\]
where k = nIndependent. If noIntercept is specified, the intercept term \(\hat{\beta}_0\) is omitted from the model.
- tolerance, float (Input): The tolerance used in determining linear dependence. For regression, tolerance = 100 × machine(4) is the default. See machine.
- rank (Output): The rank of the fitted model is returned in rank.
- coefCovariances (Output): The m × m array containing the estimated variances and covariances of the estimated regression coefficients. Here, m is the number of regression coefficients in the model. If noIntercept is specified, m = nIndependent; otherwise, m = nIndependent + 1.
- xMean (Output): The array containing the estimated means of the independent variables.
- residual (Output): The array containing the residuals.
- anovaTable (Output): The array containing the analysis of variance table. The analysis of variance statistics are given as follows:
Element | Analysis of Variance Statistics |
---|---|
0 | degrees of freedom for the model |
1 | degrees of freedom for error |
2 | total (corrected) degrees of freedom |
3 | sum of squares for the model |
4 | sum of squares for error |
5 | total (corrected) sum of squares |
6 | model mean square |
7 | error mean square |
8 | overall F-statistic |
9 | p-value |
10 | \(R^2\) (in percent) |
11 | adjusted \(R^2\) (in percent) |
12 | estimate of the standard deviation |
13 | overall mean of y |
14 | coefficient of variation (in percent) |
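A hedged sketch of how these optional outputs are retrieved (hypothetical data; output arguments are passed as empty lists that the function fills, as in Example 2 below, and the ANOVA indices follow the table above):

```python
from pyimsl.math.regression import regression

# Hypothetical data: 5 observations, 2 independent variables
x = [[1.0, 2.0],
     [2.0, 1.0],
     [3.0, 4.0],
     [4.0, 3.0],
     [5.0, 6.0]]
y = [3.1, 3.9, 7.2, 6.8, 11.1]

rank = []            # output: rank of the fitted model
anova_table = []     # output: the 15 statistics listed above
residual = []        # output: one residual per observation
x_mean = []          # output: means of the independent variables

coefficients = regression(x, y, rank=rank, anovaTable=anova_table,
                          residual=residual, xMean=x_mean)

print("rank:", rank[0])                          # assumed one-element list
print("R-squared (percent):", anova_table[10])   # element 10 of the table
print("error sum of squares:", anova_table[4])   # element 4 of the table
```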
Description¶
The function regression
fits a multiple linear regression model with or
without an intercept. By default, the multiple linear regression model is

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i \qquad i = 1, 2, \ldots, n\]

where the observed values of the \(y_i\)’s (input in y
) are the
) are the
responses or values of the dependent variable; the \(x_{i1}\)’s,
\(x_{i2}\)’s, …, \(x_{ik}\)’s (input in x
) are the settings
of the k (input in nIndependent
) independent variables;
\(\beta_0\), \(\beta_1\), …, \(\beta_k\) are the regression
coefficients whose estimated values are to be output by regression
; and
the \(\varepsilon_i\)’s are independently distributed normal errors
each with mean zero and variance \(\sigma^2\). Here, n is the number of
rows in the augmented matrix (x,y), i.e., n equals nObservations
. Note
that by default, \(\beta_0\) is included in the model.
The function regression
computes estimates of the regression
coefficients by minimizing the sum of squares of the deviations of the
observed response \(y_i\) from the fitted response

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_k x_{ik}\]

for the n observations. This minimum sum of squares (the error sum of squares) is output as one of the analysis of variance statistics if anovaTable is specified and is computed as

\[\mathit{SSE} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2\]

Another analysis of variance statistic is the total sum of squares. By default, the total sum of squares is the sum of squares of the deviations of \(y_i\) from its mean \(\overline{y}\), the so-called corrected total sum of squares. This statistic is computed as

\[\mathit{SST} = \sum_{i=1}^{n} \left(y_i - \overline{y}\right)^2\]

When noIntercept is specified, the total sum of squares is the sum of squares of \(y_i\), the so-called uncorrected total sum of squares. This is computed as

\[\mathit{SST} = \sum_{i=1}^{n} y_i^2\]
See Draper and Smith (1981) for a good general treatment of the multiple linear regression model, its analysis, and many examples.
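These definitions can be checked numerically against the anovaTable output. A minimal sketch, assuming the element indices from the table above and hypothetical data:

```python
import numpy as np
from pyimsl.math.regression import regression

# Hypothetical data: 5 observations, 2 independent variables
x = [[1.0, 2.0],
     [2.0, 1.0],
     [3.0, 4.0],
     [4.0, 3.0],
     [5.0, 6.0]]
y = np.array([3.1, 3.9, 7.2, 6.8, 11.1])

anova_table = []
residual = []
coefficients = regression(x, y, anovaTable=anova_table, residual=residual)

sse = float(np.sum(np.asarray(residual).ravel() ** 2))  # error sum of squares
sst = float(np.sum((y - y.mean()) ** 2))                # corrected total SS

print(sse, anova_table[4])   # should agree
print(sst, anova_table[5])   # should agree
```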
In order to compute a least-squares solution, regression
performs an
orthogonal reduction of the matrix of regressors to upper-triangular form.
The reduction is based on one pass through the rows of the augmented matrix
(x, y) using fast Givens transformations. (See Golub and Van Loan 1983, pp.
156-162; Gentleman 1974.) This method has the advantage that the loss of
accuracy resulting from forming the crossproduct matrix used in the normal
equations is avoided.
By default, the current means of the dependent and independent variables are used to internally center the data for improved accuracy. Let \(x_i\) be a column vector containing the i-th row of data for the independent variables. Let \(\overline{x}_i\) represent the mean vector for the independent variables given the data for rows 1, 2, …, i. The current mean vector is defined to be

\[\overline{x}_i = \frac{\sum_{m=1}^{i} x_m}{i}\]

The i-th row of data has \(\overline{x}_i\) subtracted from it and is then weighted by \(i/(i - 1)\). Although a crossproduct matrix is not computed, the validity of this centering operation can be seen from the following formula for the sum of squares and crossproducts matrix:

\[\sum_{i=1}^{n} \left(x_i - \overline{x}_n\right)\left(x_i - \overline{x}_n\right)^T = \sum_{i=2}^{n} \frac{i}{i-1} \left(x_i - \overline{x}_i\right)\left(x_i - \overline{x}_i\right)^T\]
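The identity itself is easy to verify with NumPy (illustrative only; this is not part of the regression interface):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 3))          # 10 rows of 3 independent variables
n = x.shape[0]

# Left-hand side: sum of squares and crossproducts about the overall mean
d = x - x.mean(axis=0)
lhs = d.T @ d

# Right-hand side: rows centered about the current (running) means
rhs = np.zeros((3, 3))
for i in range(2, n + 1):                 # 1-based row index, as in the formula
    xbar_i = x[:i].mean(axis=0)           # current mean through row i
    d_i = x[i - 1] - xbar_i
    rhs += (i / (i - 1)) * np.outer(d_i, d_i)

print(np.allclose(lhs, rhs))              # True
```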
An orthogonal reduction on the centered matrix is computed. When the final
computations are performed, the intercept estimate and the first row and
column of the estimated covariance matrix of the estimated coefficients are
updated (if coefCovariances
is specified) to reflect the statistics for
the original (uncentered) data. This means that the estimate of the
intercept is for the uncentered data.
As part of the final computations, regression
checks for linearly
dependent regressors. In particular, linear dependence of the regressors is
declared if any of the following three conditions are satisfied:
- A regressor equals zero.
- Two or more regressors are constant.
- \(\sqrt{1-R_{i\cdot 1,2,\ldots,i-1}^2}\) is less than or equal to tolerance. Here, \(R_{i\cdot 1,2,\ldots,i-1}\) is the multiple correlation coefficient of the i-th independent variable with the first i − 1 independent variables. If no intercept is in the model, the “multiple correlation” coefficient is computed without adjusting for the mean.
On completion of the final computations, if the i-th regressor is declared
to be linearly dependent upon the previous i − 1 regressors, then the
i-th coefficient estimate and all elements in the i-th row and i-th
column of the estimated variance-covariance matrix of the estimated
coefficients (if coefCovariances
is specified) are set to zero. Finally,
if a linear dependence is declared, an informational (error) message, code
IMSL_RANK_DEFICIENT
, is issued indicating the model is not full rank.
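As an illustration of this behavior, the following hedged sketch fits a deliberately rank-deficient model (hypothetical data; the third regressor duplicates the first, so the IMSL_RANK_DEFICIENT warning is expected and the dependent regressor's coefficient comes back as zero):

```python
from pyimsl.math.regression import regression

# Hypothetical data: the third column is an exact copy of the first
x = [[1.0, 2.0, 1.0],
     [2.0, 1.0, 2.0],
     [3.0, 4.0, 3.0],
     [4.0, 3.0, 4.0]]
y = [2.0, 3.0, 6.0, 7.0]

rank = []
coefficients = regression(x, y, rank=rank)  # issues IMSL_RANK_DEFICIENT

print(rank[0])        # less than 4 (intercept plus 3 regressors)
print(coefficients)   # coefficient of the dependent regressor is zero
```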
Examples¶
Example 1¶
A regression model

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \qquad i = 1, 2, \ldots, 9\]

is fitted to data taken from Maindonald (1984, pp. 203-204).
```python
from pyimsl.math.regression import regression
from pyimsl.math.writeMatrix import writeMatrix

x = [[7.0, 5.0, 6.0],
     [2.0, -1.0, 6.0],
     [7.0, 3.0, 5.0],
     [-3.0, 1.0, 4.0],
     [2.0, -1.0, 0.0],
     [2.0, 1.0, 7.0],
     [-3.0, -1.0, 3.0],
     [2.0, 1.0, 1.0],
     [2.0, 1.0, 4.0]]
y = [7.0, -5.0, 6.0, 5.0, 5.0, -2.0, 0.0, 8.0, 3.0]

# Fit the default model, which includes an intercept term
coefficients = regression(x, y)
writeMatrix('Least-Squares Coefficients', coefficients,
            colNumberZero=True)
```
Output¶
Least-Squares Coefficients
0 1 2 3
7.733 -0.200 2.333 -1.667
Example 2¶
A weighted least-squares fit is computed using the model

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \qquad i = 1, 2, 3, 4\]

and weights \(1/i^2\) discussed by Maindonald (1984, pp. 67-68). In
order to compute the weighted least-squares fit, using an ordinary
least-squares function (regression
), the regressors (including the
column of ones for the intercept term) and the responses must be transformed
prior to invocation of regression
. Specifically, the i-th response and
regressors are multiplied by a square root of the i-th weight.
noIntercept
must be specified since the column of ones corresponding to
the intercept term in the untransformed model is transformed by the weights
and is regarded as an additional independent variable.
In the example, anovaTable is specified. The minimum sum of squares for error in terms of the original untransformed regressors and responses for this weighted regression is

\[\mathit{SSE} = \sum_{i=1}^{4} w_i \left(y_i - \hat{y}_i\right)^2\]

where \(w_i = 1/i^2\). Also, since noIntercept is specified, the uncorrected total sum of squares in terms of the original untransformed responses is

\[\mathit{SST} = \sum_{i=1}^{4} w_i y_i^2\]
```python
from numpy import *
from pyimsl.math.regression import regression
from pyimsl.math.writeMatrix import writeMatrix

n_observations = 4
n_independent = 3
x = array([[1.0, -2.0, 0.0],
           [1.0, -1.0, 2.0],
           [1.0, 2.0, 5.0],
           [1.0, 7.0, 3.0]])
y = [-3.0, 1.0, 2.0, 6.0]
anova_row_labels = ["degrees of freedom for regression",
                    "degrees of freedom for error",
                    "total (uncorrected) degrees of freedom",
                    "sum of squares for regression",
                    "sum of squares for error",
                    "total (uncorrected) sum of squares",
                    "regression mean square",
                    "error mean square", "F-statistic",
                    "p-value", "R-squared (in percent)",
                    "adjusted R-squared (in percent)",
                    "est. standard deviation of model error",
                    "overall mean of y",
                    "coefficient of variation (in percent)"]

power = 0.0
for i in range(0, n_observations):
    power += 1.0
    # The square root of the weight
    w = sqrt(1.0 / (power * power))
    # Transform response
    y[i] *= w
    # Transform regressors
    for j in range(0, n_independent):
        x[i][j] *= w

anova_table = []
coefficients = regression(x, y, noIntercept=True,
                          anovaTable=anova_table)
writeMatrix("Least-Squares Coefficients", coefficients)
writeMatrix("* * * Analysis of Variance * * *\n", anova_table,
            rowLabels=anova_row_labels,
            writeFormat="%10.2f",
            column=True)
```
Output¶
Least-Squares Coefficients
1 2 3
-1.431 0.658 0.748
* * * Analysis of Variance * * *
degrees of freedom for regression 3.00
degrees of freedom for error 1.00
total (uncorrected) degrees of freedom 4.00
sum of squares for regression 10.93
sum of squares for error 1.01
total (uncorrected) sum of squares 11.94
regression mean square 3.64
error mean square 1.01
F-statistic 3.60
p-value 0.37
R-squared (in percent) 91.52
adjusted R-squared (in percent) 66.08
est. standard deviation of model error 1.01
overall mean of y -0.08
coefficient of variation (in percent) -1207.73
Warning Errors¶
Error Code | Description |
---|---|
IMSL_RANK_DEFICIENT | The model is not full rank. There is not a unique least-squares solution. |