regressionSummary

Produces summary statistics for a regression model given the information from the fit.

Synopsis

regressionSummary (regressionInfo)

Required Argument

structure regressionInfo (Input)
A structure containing information about the regression fit. See regression.

Optional Arguments

indexRegression, int (Input)

Given a multivariate regression fit, this option specifies the regression (i.e., the dependent variable) for which summary statistics are computed.

Default: indexRegression = 0

coefTTests (Output)

An npar × 4 array containing statistics relating to the regression coefficients, where npar is the number of parameters in the model.

Each row corresponds to a coefficient in the model. Row i + intcep corresponds to the i‑th independent variable, where intcep is equal to 1 if an intercept is in the model and 0 otherwise, for \(i=0,1,2,\ldots,npar-intcep-1\).

The statistics in the columns are as follows:

Column  Statistic
0       coefficient estimate
1       estimated standard error of the coefficient estimate
2       t-statistic for the test that the coefficient is 0
3       p-value for the two-sided t test

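For instance, after the call in the example below has filled coef_t_tests, the two-sided p-value for the i‑th coefficient can be read off as:

p_value_i = coef_t_tests[i][3]  # column 3 holds the two-sided p-value
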
coefVif (Output)

An array of length npar containing the variance inflation factors, where npar is the number of parameters. Element i + intcep corresponds to the i‑th independent variable, where \(i=0,1,2,\ldots,npar-intcep-1\), and intcep is equal to 1 if an intercept is in the model and 0 otherwise.

The square of the multiple correlation coefficient for the i‑th regressor after all others can be obtained from coefVif by

\[1.0 - \frac{1.0}{\mathrm{coefVif}[i]}\]

If there is no intercept, or there is an intercept and \(i=0\), the multiple correlation coefficient is not adjusted for the mean.
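
For instance, with the coef_vif list from the example below, the squared multiple correlation coefficient of the i‑th regressor on the remaining regressors is:

r_squared_i = 1.0 - 1.0 / coef_vif[i]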

coefCovariances (Output)

An npar × npar array (where npar is the number of parameters in the model) that is the estimated variance-covariance matrix of the estimated regression coefficients when R is nonsingular and is from an unrestricted regression fit. See Remarks for an explanation of coefCovariances when R is singular and is from a restricted regression fit.
anovaTable (Output)

An array of size 15 containing the analysis of variance table.

Row  Analysis of Variance Statistic
0    degrees of freedom for the model
1    degrees of freedom for error
2    total (corrected) degrees of freedom
3    sum of squares for the model
4    sum of squares for error
5    total (corrected) sum of squares
6    model mean square
7    error mean square
8    overall F-statistic
9    p-value
10   \(R^2\) (in percent)
11   adjusted \(R^2\) (in percent)
12   estimate of the standard deviation
13   overall mean of y
14   coefficient of variation (in percent)

If the model has an intercept, the regression and total are corrected for the mean; otherwise, the regression and total are not corrected for the mean, and anovaTable[13] and anovaTable[14] are set to NaN. Note that the p‑value is returned as 0.0 when the value is so small that all significant digits have been lost.
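
Several of these entries are tied together by the usual ANOVA identities, which can serve as a sanity check; for example, with the anova_table list from the example below (plain arithmetic, not additional pyimsl functionality):

f_stat = anova_table[6] / anova_table[7]             # overall F-statistic
r_squared = 100.0 * anova_table[3] / anova_table[5]  # R-squared, in percent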

Description

Function regressionSummary computes summary statistics from a fitted general linear model. The model is \(y=X\beta+\varepsilon\), where y is the n × 1 vector of responses, X is the n × p matrix of regressors, β is the p × 1 vector of regression coefficients, and ɛ is the n × 1 vector of errors whose elements are each independently distributed with mean 0 and variance \(\sigma^2\). Function regression can be used to compute the fit of the model. Next, regressionSummary uses the results of this fit to compute summary statistics, including analysis of variance, sequential sum of squares, t tests, and an estimated variance-covariance matrix of the estimated regression coefficients.

Some generalizations of the general linear model are allowed. If the i‑th element of ɛ has variance of

\[\frac{\sigma^2}{w_i}\]

and the weights \(w_i\) are used in the fit of the model, regressionSummary produces summary statistics from the weighted least-squares fit. More generally, if the variance-covariance matrix of ɛ is \(\sigma^2 V\), regressionSummary can be used to produce summary statistics from the generalized least-squares fit. Function regression can be used to perform a generalized least-squares fit by regressing \(y^*\) on \(X^*\), where \(y^*=(T^{-1})^T y\), \(X^*=(T^{-1})^T X\), and T satisfies \(T^TT=V\).
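
As an illustration, the transformation can be carried out with numpy before calling regression. The following is a minimal sketch, assuming V is positive definite and that V, x, and y are already in hand (none of this is part of the pyimsl interface):

import numpy as np

L = np.linalg.cholesky(V)         # V = L L^T, so T = L^T satisfies T^T T = V
T = L.T
y_star = np.linalg.solve(T.T, y)  # y* = (T^{-1})^T y
x_star = np.linalg.solve(T.T, x)  # X* = (T^{-1})^T X
# Regressing y_star on x_star with regression() gives the generalized
# least-squares fit.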

The sequential sum of squares for the i‑th regression parameter is given by

\[\left(R \hat{\beta}\right)_i^2\]

The regression sum of squares is given by the sum of the sequential sums of squares. If an intercept is in the model, the regression sum of squares is adjusted for the mean, i.e.,

\[\left(R \hat{\beta}\right)_0^2\]

is not included in the sum.
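
Equivalently, with intcep equal to 1 if an intercept is in the model and 0 otherwise, the regression sum of squares is

\[\mathit{SSR} = \sum_{i=\mathit{intcep}}^{\mathit{npar}-1} \left(R\hat{\beta}\right)_i^2\]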

The estimate of \(\sigma^2\) is \(s^2\) (stored in anovaTable[7]), which is computed as SSE/DFE.

If R is nonsingular, the estimated variance-covariance matrix of

\[\hat{\beta}\]

(stored in coefCovariances) is computed by \(s^2 R^{-1} (R^{-1})^T\).
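
In numpy terms this is a one-line computation; the sketch below assumes R (the nonsingular upper triangular matrix from the fit) and s2 (the error mean square SSE/DFE) are already available:

import numpy as np

R_inv = np.linalg.inv(R)
coef_cov = s2 * R_inv @ R_inv.T    # s^2 R^{-1} (R^{-1})^T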

If R is singular, corresponding to rank(X) < p, a generalized inverse is used. For a matrix G to be a \(g_i\) (i = 1, 2, 3, or 4) inverse of a matrix A, G must satisfy conditions j (for \(j\leq i\)) for the Moore-Penrose inverse but generally must fail conditions k (for \(k> i\)). The four conditions for G to be a Moore-Penrose inverse of A are as follows:

  1. AGA = A.
  2. GAG = G.
  3. AG is symmetric.
  4. GA is symmetric.

In the case where R is singular, the method for obtaining coefCovariances follows the discussion of Maindonald (1984, pp. 101–103). Let Z be the diagonal matrix with diagonal elements defined by the following:

\[z_{ii} = \begin{cases} 1 & \text{if } r_{ii} \neq 0 \\ 0 & \text{if } r_{ii} = 0 \end{cases}\]

Let G be the solution to \(RG=Z\) obtained by setting the i‑th (\(\{ i : r_{ii}=0\}\)) row of G to 0. Argument coefCovariances is set to \(s^2 GG^T\). (G is a \(g_3\) inverse of R, represented by \(R^{g_3}\); the result

\[R^{g_3}\left(R^{g_3}\right)^T\]

is a symmetric \(g_2\) inverse of \(R^T R=X^T X\). See Sallas and Lionti 1988.)
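
A minimal numpy sketch of this construction follows; coef_cov_singular is a hypothetical helper (not part of pyimsl), R is the possibly singular upper triangular matrix from the fit, and s2 = SSE/DFE:

import numpy as np

def coef_cov_singular(R, s2, tol=1.0e-10):
    # Solve RG = Z by back substitution, setting the rows of G with
    # r_ii = 0 to zero (Maindonald 1984), then form s^2 G G^T.
    npar = R.shape[0]
    z = (np.abs(np.diag(R)) > tol).astype(float)  # diagonal of Z
    G = np.zeros_like(R)
    for j in range(npar):
        for i in range(npar - 1, -1, -1):
            if z[i] == 0.0:
                continue                   # i-th row of G stays 0
            rhs = z[j] if i == j else 0.0  # Z is diagonal
            G[i, j] = (rhs - R[i, i + 1:] @ G[i + 1:, j]) / R[i, i]
    return s2 * (G @ G.T)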

Note that argument coefCovariances can be used only to get variances and covariances of estimable functions of the regression coefficients, i.e., nonestimable functions (linear combinations of the regression coefficients not in the space spanned by the nonzero rows of R) must not be used. See, for example, Maindonald (1984, pp. 166–168) for a discussion of estimable functions.

The estimated standard errors of the estimated regression coefficients (stored in Column 1 of coefTTests) are computed as square roots of the corresponding diagonal entries in coefCovariances.

For the case where an intercept is in the model, put \(\overline{R}\) equal to the matrix R with the first row and column deleted. Generally, the variance inflation factor (VIF) for the i‑th regression coefficient is computed as the product of the i‑th diagonal element of \(R^T R\) and the i‑th diagonal element of its computed inverse. If an intercept is in the model, the VIF for those coefficients not corresponding to the intercept uses the diagonal elements of \(\overline{R}^T \overline{R}\) (see Maindonald 1984, p. 40).
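
A small numpy sketch of this rule (vif_from_r is a hypothetical helper, not part of pyimsl; with intercept=True it returns the VIFs for the non-intercept coefficients only):

import numpy as np

def vif_from_r(R, intercept=True):
    # Product of the i-th diagonal element of R^T R (of R-bar^T R-bar
    # when an intercept is present) and the i-th diagonal element of
    # its inverse.
    Rbar = R[1:, 1:] if intercept else R
    A = Rbar.T @ Rbar
    return np.diag(A) * np.diag(np.linalg.inv(A))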

Remarks

When R is nonsingular and comes from an unrestricted regression fit, coefCovariances is the estimated variance-covariance matrix of the estimated regression coefficients, and coefCovariances = (SSE/DFE) \(\left(R^TR\right)^{-1}\). Otherwise, variances and covariances of estimable functions of the regression coefficients can be obtained using coefCovariances, and coefCovariances = (SSE/DFE) \(\left(GDG^T\right)\). Here, D is the diagonal matrix with diagonal elements equal to 0 if the corresponding rows of R are restrictions and with diagonal elements equal to 1 otherwise. Also, G is a particular generalized inverse of R.

Example

from numpy import array
from pyimsl.stat.regression import regression
from pyimsl.stat.regressionSummary import regressionSummary
from pyimsl.stat.writeMatrix import writeMatrix

x = array([
    [7.0, 26.0, 6.0, 60.0],
    [1.0, 29.0, 15.0, 52.0],
    [11.0, 56.0, 8.0, 20.0],
    [11.0, 31.0, 8.0, 47.0],
    [7.0, 52.0, 6.0, 33.0],
    [11.0, 55.0, 9.0, 22.0],
    [3.0, 71.0, 17.0, 6.0],
    [1.0, 31.0, 22.0, 44.0],
    [2.0, 54.0, 18.0, 22.0],
    [21.0, 47.0, 4.0, 26.0],
    [1.0, 40.0, 23.0, 34.0],
    [11.0, 66.0, 9.0, 12.0],
    [10.0, 68.0, 8.0, 12.0]])
y = array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2,
           102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
anova_row_labels = \
    ["degrees of freedom for regression",
     "degrees of freedom for error",
     "total (uncorrected) degrees of freedom",
     "sum of squares for regression",
     "sum of squares for error",
     "total (uncorrected) sum of squares",
     "regression mean square",
     "error mean square", "F-statistic",
     "p-value", "R-squared (in percent)",
     "adjusted R-squared (in percent)",
     "est. standard deviation of model error",
     "overall mean of y",
     "coefficient of variation (in percent)"]
# Output holders: pyimsl fills these lists in place
anova_table = []
coef_t_tests = []
coef_vif = []
coef_covariances = []
regression_info = []

# Fit the regression model
coefficients = regression(x, y, regressionInfo=regression_info)

# Generate summary statistics
regressionSummary(regression_info[0],
                  anovaTable=anova_table,
                  coefTTests=coef_t_tests,
                  coefVif=coef_vif,
                  coefCovariances=coef_covariances)

# Print results
writeMatrix("* * * Analysis of Variance * * *\n",
            anova_table, rowLabels=anova_row_labels,
            writeFormat="%10.2f", column=True)
writeMatrix("* * * Inference on Coefficients * * *\n",
            coef_t_tests, writeFormat="%10.2f")
writeMatrix("* * * Variance Inflation Factors * * *\n",
            coef_vif, writeFormat="%10.2f", column=True)
writeMatrix("* * Variance-Covariance Matrix * * *\n",
            coef_covariances, writeFormat="%10.2f")

Output

 
         * * * Analysis of Variance * * *

degrees of freedom for regression             4.00
degrees of freedom for error                  8.00
total (corrected) degrees of freedom         12.00
sum of squares for regression              2667.90
sum of squares for error                     47.86
total (corrected) sum of squares           2715.76
regression mean square                      666.97
error mean square                             5.98
F-statistic                                 111.48
p-value                                       0.00
R-squared (in percent)                       98.24
adjusted R-squared (in percent)              97.36
est. standard deviation of model error        2.45
overall mean of y                            95.42
coefficient of variation (in percent)         2.56
 
     * * * Inference on Coefficients * * *

            1           2           3           4
1       62.41       70.07        0.89        0.40
2        1.55        0.74        2.08        0.07
3        0.51        0.72        0.70        0.50
4        0.10        0.75        0.14        0.90
5       -0.14        0.71       -0.20        0.84
 
* * * Variance Inflation Factors * * *

             1    10668.51
             2       38.50
             3      254.42
             4       46.87
             5      282.51
 
           * * * Variance-Covariance Matrix * * *

            1           2           3           4           5
1     4909.94      -50.51      -50.60      -51.66      -49.60
2      -50.51        0.55        0.51        0.55        0.51
3      -50.60        0.51        0.52        0.53        0.51
4      -51.66        0.55        0.53        0.57        0.52
5      -49.60        0.51        0.51        0.52        0.50