Class LinearRegression
- All Implemented Interfaces:
Serializable, Cloneable
Fits a multiple linear regression model with or without an intercept. If hasIntercept is true, the multiple linear
regression model is $$y_i = \beta _0 + \beta _1
x_{i1} + \beta _2 x_{i2} + \, \ldots + \beta _k x_{ik}+ \varepsilon _i \,
\,\,\,\, i = 1,\,2,\, \ldots ,\,n$$ where the observed values of the
\( y_i \)'s constitute the responses or values of the
dependent variable, the \( x_{i1} \)'s, \( x_{i2} \)'s,
\( \ldots, x_{ik} \)'s are the settings of the
independent variables, \( \beta_0,\beta_1, \ldots, \beta_k \)
are the regression coefficients, and the \( \varepsilon_i \)'s are
independently distributed normal errors each with mean zero and variance
\( \sigma^2/w_i \). If hasIntercept is
false, \( \beta_0 \) is not included in the model.
LinearRegression computes estimates of the regression
coefficients by minimizing the sum of squares of the deviations of the
observed response \( y_i \) from the fitted response \( \hat y_i \)
for the observations. This minimum sum of squares (the error sum of squares) is output in the ANOVA table and is denoted by
$$ {\rm SSE}=\sum\limits_{i=1}^n w_i (y_i-\hat y_i)^2 $$
In addition, the total sum of squares is output in the ANOVA table. For the
case hasIntercept is true, the total sum of squares is the sum
of squares of the deviations of \( y_i \) from its mean
--the so-called corrected total sum of squares; it is denoted by
$$ {\rm SST}=\sum\limits_{i=1}^n w_i (y_i-\bar y)^2 $$
For the case hasIntercept is false, the total sum
of squares is the sum of squares of \( y_i \) -- the so-called
uncorrected total sum of squares; it is denoted by
$$ {\rm SST}=\sum\limits_{i=1}^n w_i y_i^2 $$
See Draper and Smith (1981) for a good general treatment of the multiple linear regression model, its analysis, and many examples.
In order to compute a least-squares solution, LinearRegression
performs an orthogonal reduction of the matrix of regressors to upper
triangular form. Givens rotations are used to reduce the matrix. This method
has the advantage that the loss of accuracy resulting from forming the
crossproduct matrix used in the normal equations is avoided, while not
requiring the storage of the full matrix of regressors. The method is
described by Lawson and Hanson, pages 207-212.
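For illustration, here is a minimal usage sketch of the fitting interface documented below. The import path com.imsl.stat and the data values are assumptions made only for this example; adjust them to your installation.

    import com.imsl.stat.LinearRegression;  // package name assumed for this sketch

    public class LinearRegressionExample {
        public static void main(String[] args) {
            // Two explanatory variables, fitted with an intercept.
            LinearRegression regression = new LinearRegression(2, true);

            // Each row of x holds the settings of the independent variables for
            // one observation; y holds the corresponding responses (made-up data).
            double[][] x = {
                {1.0, 2.0},
                {2.0, 1.0},
                {3.0, 4.0},
                {4.0, 3.0},
                {5.0, 5.0}
            };
            double[] y = {7.1, 6.9, 13.2, 12.8, 17.0};

            // Accumulate the observations; update may also be called row by row.
            regression.update(x, y);

            // With hasIntercept == true, the 0-th entry is the intercept estimate,
            // followed by one coefficient per variable.
            double[] beta = regression.getCoefficients();
            System.out.println("intercept = " + beta[0]);
            System.out.println("beta1     = " + beta[1]);
            System.out.println("beta2     = " + beta[2]);
        }
    }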
From a general linear model fitted using the \( w_i \)'s as
the weights, inner class LinearRegression.CaseStatistics can also
compute predicted values, confidence intervals, and diagnostics for detecting
outliers and cases that greatly influence the fitted regression. Let
\( x_i \) be a column vector containing elements of the
\( i \)-th row of \( X \). Let \( W={\rm diag}(w_1, w_2, \ldots, w_n) \). The leverage is defined as
$$ h_i=\left(x_i^T(X^TWX)^{-}x_i\right)w_i $$
where the superscript minus denotes a generalized inverse. (In the case of linear equality restrictions on \( \beta \), the leverage is defined in terms of the reduced model.)
Put \( D={\rm diag}(d_1, d_2, \ldots, d_k) \) with \( d_j=1 \) if the \( j \)-th diagonal element of \( R \) is positive and 0 otherwise. The leverage is computed as \( h_i=(a^T Da)w_i \), where \( a \) is a solution to \( R^T a=x_i \). The estimated variance of
$$ \hat{y}_i=x_i^T \hat{\beta} $$ is given by \( h_i s^2 /w_i \), where \( s^2={\rm SSE}/{\rm DFE} \). The computation of the remainder of the case statistics follows easily from their definitions.
Let \( e_i \) denote the residual
$$ y_i-\hat{y}_i $$ for the \( i \)th case. The estimated variance of \( e_i \) is \( (1-h_i)s^2 /w_i \), where \( s^2 \) is the residual mean square from the fitted regression. The \( i \)th standardized residual (also called the internally studentized residual) is by definition
$$ r_i=e_i\sqrt{\frac{w_i}{s^2(1-h_i)}} $$ and \( r_i \) follows an approximate standard normal distribution in large samples.
The \( i \)th jackknife residual or deleted residual involves the difference between \( y_i \) and its predicted value based on the data set in which the \( i \)th case is deleted. This difference equals \( e_i/(1-h_i) \). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the \( i \)th case is deleted is
$$ s_i^2=\frac{(n-r)s^2-w_ie_i^2/(1-h_i)}{n-r-1} $$ The jackknife residual is defined to be $$ t_i=e_i \sqrt{\frac{w_i}{s_i^2(1-h_i)}} $$ and \( t_i \) follows a \( t \) distribution with \( n-r-1 \) degrees of freedom.
Cook's distance for the \( i \)th case is a measure of how much an individual case affects the estimated regression coefficients. It is given by
$$ D_i=\frac{w_i h_i e_i^2}{r s^2(1-h_i)^2} $$ Weisberg (1985) states that if \( D_i \) exceeds the 50th percentile of the \( F(r,n-r) \) distribution, it should be considered large. (This value is about 1. This statistic does not have an \( F \) distribution.)
DFFITS, like Cook's distance, is also a measure of influence. For the \( i \)th case, DFFITS is computed by the formula
$$ {\rm DFFITS}_i=e_i\sqrt{\frac{w_i h_i}{s_i^2(1-h_i)^2}} $$ Hoaglin and Welsch (1978) suggest that \( {\rm DFFITS}_i \) greater than
$$ 2\sqrt{r/n} $$ is large.
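To make the diagnostic formulas above concrete, the following standalone sketch (a hypothetical helper, not part of this class or its CaseStatistics inner class) evaluates the standardized residual, jackknife residual, Cook's distance, and DFFITS for a single case from the quantities \( e_i \), \( h_i \), \( w_i \), \( s^2 \), \( r \), and \( n \) defined above. The numeric values are illustrative only.

    // Hypothetical helper illustrating the case-statistic formulas above.
    public final class CaseDiagnostics {
        public static void main(String[] args) {
            double e = 1.3;    // residual e_i = y_i - yhat_i
            double h = 0.25;   // leverage h_i
            double w = 1.0;    // case weight w_i
            double s2 = 2.0;   // residual mean square s^2 = SSE / DFE
            int n = 30;        // number of observations
            int r = 3;         // rank of the regression (number of fitted parameters)

            // Standardized (internally studentized) residual
            double ri = e * Math.sqrt(w / (s2 * (1.0 - h)));

            // Residual mean square with the i-th case deleted
            double si2 = ((n - r) * s2 - w * e * e / (1.0 - h)) / (n - r - 1);

            // Jackknife (deleted) residual
            double ti = e * Math.sqrt(w / (si2 * (1.0 - h)));

            // Cook's distance
            double di = w * h * e * e / (r * s2 * (1.0 - h) * (1.0 - h));

            // DFFITS
            double dffits = e * Math.sqrt(w * h / (si2 * (1.0 - h) * (1.0 - h)));

            System.out.printf("r_i=%.4f  t_i=%.4f  D_i=%.4f  DFFITS_i=%.4f%n",
                    ri, ti, di, dffits);
        }
    }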
Often predicted values and confidence intervals are desired for combinations
of settings of the effect variables not used in computing the regression fit.
This can be accomplished using a single data matrix by including these
settings of the variables as part of the data matrix and by setting the
response equal to Double.NaN.
LinearRegression will omit the case when performing the fit and a
predicted value and confidence interval for the missing response will be
computed from the given settings of the effect variables.
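As a short sketch of this technique (the import path and data values are assumptions for the example), the last row below supplies settings of the effect variables together with a Double.NaN response; as described above, that case is omitted from the fit while a predicted value and confidence interval are computed for it from the given settings.

    import com.imsl.stat.LinearRegression;  // package name assumed for this sketch

    public class PredictionWithMissingResponse {
        public static void main(String[] args) {
            LinearRegression regression = new LinearRegression(2, true);

            double[][] x = {
                {1.0, 2.0},
                {2.0, 1.0},
                {3.0, 4.0},
                {2.5, 3.5}   // settings for which only a prediction is wanted
            };
            // The NaN response marks the case to be omitted from the fit; a predicted
            // value and confidence interval are computed from its variable settings.
            double[] y = {7.1, 6.9, 13.2, Double.NaN};

            regression.update(x, y);
            System.out.println("coefficients: "
                    + java.util.Arrays.toString(regression.getCoefficients()));
        }
    }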
Nested Class Summary
Nested Classes
class LinearRegression.CaseStatistics - Inner class CaseStatistics allows for the computation of predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.
class LinearRegression.CoefficientTTests - Contains statistics related to the regression coefficients.
-
Constructor Summary
Constructors
LinearRegression(int nVariables, boolean hasIntercept) - Constructs a new linear regression object.
-
Method Summary
ANOVA getANOVA() - Get an analysis of variance table and related statistics.
CaseStatistics getCaseStatistics(double[] x, double y) - Returns the case statistics for an observation.
CaseStatistics getCaseStatistics(double[] x, double y, double w) - Returns the case statistics for an observation and a weight.
CaseStatistics getCaseStatistics(double[] x, double y, double w, int pred) - Returns the case statistics for an observation, weight, and future response count for the desired prediction interval.
CaseStatistics getCaseStatistics(double[] x, double y, int pred) - Returns the case statistics for an observation and future response count for the desired prediction interval.
double[] getCoefficients() - Returns the regression coefficients.
CoefficientTTests getCoefficientTTests() - Returns statistics relating to the regression coefficients.
int[] getPermute() - Returns an integer vector containing information about the permutation of the columns of the matrix of regressors during QR factorization.
double[][] getR() - Returns a copy of the R matrix.
int getRank() - Returns the rank of the matrix.
double[] getRHS() - Returns the right hand side of the regression problem.
void update(double[][] x, double[] y) - Updates the regression object with a new set of observations.
void update(double[][] x, double[] y, double[] w) - Updates the regression object with a new set of observations and weights.
void update(double[] x, double y) - Updates the regression object with a new observation.
void update(double[] x, double y, double w) - Updates the regression object with a new observation and weight.
-
Constructor Details
-
LinearRegression
public LinearRegression(int nVariables, boolean hasIntercept)
Constructs a new linear regression object.
Parameters:
nVariables - an int, the number of variables in the regression
hasIntercept - a boolean which indicates whether or not an intercept is in this regression model
-
-
Method Details
-
update
public void update(double[] x, double y)
Updates the regression object with a new observation.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
-
update
public void update(double[] x, double y, double w)
Updates the regression object with a new observation and weight.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight
-
update
public void update(double[][] x, double[] y)
Updates the regression object with a new set of observations.
Parameters:
x - a double matrix containing the independent (explanatory) variables. The number of rows in x must equal the length of y, and the number of columns must be equal to the number of variables set in the constructor.
y - a double array containing the dependent (response) variables
-
update
public void update(double[][] x, double[] y, double[] w)
Updates the regression object with a new set of observations and weights.
Parameters:
x - a double matrix containing the independent (explanatory) variables. The number of rows in x must equal the length of y, and the number of columns must be equal to the number of variables set in the constructor.
y - a double array containing the dependent (response) variables
w - a double array representing the weights
-
getCoefficients
public double[] getCoefficients()
Returns the regression coefficients.
Returns:
a double array containing the regression coefficients. If hasIntercept is false, its length is equal to the number of variables. If hasIntercept is true, its length is the number of variables plus one and the 0-th entry is the value of the intercept. If the model is not full rank, the regression coefficients are not uniquely determined. In this case, a warning is issued and a solution with all linearly dependent regressors set to zero is returned.
-
getR
public double[][] getR()
Returns a copy of the R matrix. R is the upper triangular matrix from a QR decomposition of the matrix of regressors.
Returns:
a double matrix containing a copy of the R matrix
-
getANOVA
public ANOVA getANOVA()
Get an analysis of variance table and related statistics.
Returns:
an ANOVA table and related statistics
-
getRank
public int getRank()
Returns the rank of the matrix.
Returns:
the int rank of the matrix
-
getCoefficientTTests
public LinearRegression.CoefficientTTests getCoefficientTTests()
Returns statistics relating to the regression coefficients.
-
getCaseStatistics
public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y)
Returns the case statistics for an observation.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the LinearRegression constructor.
y - a double representing the dependent (response) variable
Returns:
the CaseStatistics for the observation
-
getCaseStatistics
public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, double w)
Returns the case statistics for an observation and a weight.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight
Returns:
the CaseStatistics for the observation
-
getCaseStatistics
public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, int pred)
Returns the case statistics for an observation and future response count for the desired prediction interval.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
pred - an int representing the number of future responses for which the prediction interval is desired on the average of the future responses
Returns:
the CaseStatistics for the observation
-
getCaseStatistics
public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, double w, int pred)
Returns the case statistics for an observation, weight, and future response count for the desired prediction interval.
Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight
pred - an int representing the number of future responses for which the prediction interval is desired on the average of the future responses
Returns:
the CaseStatistics for the observation
-
getRHS
public double[] getRHS()
Returns the right-hand side of the regression problem.
Returns:
a double array containing a copy of the right-hand side d of the upper triangular system Rx=d from the QR-transformed regression problem
-
getPermute
public int[] getPermute()
Returns an integer vector containing information about the permutation of the columns of the matrix of regressors during QR factorization.
Returns:
an int array containing column pivoting information described by the column permutation matrix P of the QR factorization of the regressor matrix X, trans(Q)*X*P = R. The k-th element of the array contains the index of the column of the original matrix X that has been interchanged into column k.
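As a closing illustration of how getR(), getRHS(), getRank(), and getPermute() fit together, the hypothetical helper below solves the documented upper triangular system Rx = d by back substitution and then undoes the column pivoting. It is a sketch only: it assumes a full-rank fit and 0-based indices in the permutation array, neither of which is stated by this documentation.

    import com.imsl.stat.LinearRegression;  // package name assumed for this sketch

    public final class QRPieces {
        // Hypothetical helper, not part of this class: recovers coefficient estimates
        // from the QR pieces exposed by getR(), getRHS(), and getPermute(). Assumes a
        // full-rank fit and 0-based indices in the permutation array.
        public static double[] coefficientsFromQR(LinearRegression regression) {
            double[][] R = regression.getR();
            double[] d = regression.getRHS();
            int[] permute = regression.getPermute();
            int p = d.length;

            // Back substitution on the upper triangular system R x = d.
            double[] pivoted = new double[p];
            for (int i = p - 1; i >= 0; i--) {
                double sum = d[i];
                for (int j = i + 1; j < p; j++) {
                    sum -= R[i][j] * pivoted[j];
                }
                pivoted[i] = sum / R[i][i];
            }

            // Undo the pivoting: pivoted column i came from original column permute[i].
            double[] beta = new double[p];
            for (int i = 0; i < p; i++) {
                beta[permute[i]] = pivoted[i];
            }
            return beta;  // for a full-rank fit this should agree with getCoefficients()
        }
    }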
-