public class LinearRegression extends Object implements Serializable, Cloneable
Fits a multiple linear regression model with or without an intercept. If hasIntercept is true, the multiple linear regression model is
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n $$
where the observed values of the \( y_i \)'s constitute the responses, or values of the dependent variable; the \( x_{i1} \)'s, \( x_{i2} \)'s, \( \ldots \), \( x_{ik} \)'s are the settings of the independent variables; \( \beta_0, \beta_1, \ldots, \beta_k \) are the regression coefficients; and the \( \varepsilon_i \)'s are independently distributed normal errors, each with mean zero and variance \( \sigma^2/w_i \). If hasIntercept is false, \( \beta_0 \) is not included in the model.
LinearRegression computes estimates of the regression coefficients by minimizing the sum of squares of the deviations of the observed response \( y_i \) from the fitted response \( \hat y_i \) for the \( n \) observations. This minimum sum of squares (the error sum of squares) appears in the ANOVA output and is denoted by
$$ {\rm SSE}=\sum\limits_{i=1}^n w_i (y_i-\hat y_i)^2 $$
In addition, the total sum of squares is output in the ANOVA table. For the case hasIntercept is true, the total sum of squares is the sum of squares of the deviations of \( y_i \) from its mean \( \bar y \), the so-called corrected total sum of squares; it is denoted by
$$ {\rm SST}=\sum\limits_{i=1}^n w_i (y_i-\bar y)^2 $$
For the case hasIntercept is false, the total sum of squares is the sum of squares of \( y_i \), the so-called uncorrected total sum of squares; it is denoted by
$$ {\rm SST}=\sum\limits_{i=1}^n w_i y_i^2 $$
See Draper and Smith (1981) for a good general treatment of the multiple linear regression model, its analysis, and many examples.
In order to compute a least-squares solution, LinearRegression
performs an orthogonal reduction of the matrix of regressors to upper
triangular form. Givens rotations are used to reduce the matrix. This method
has the advantage that the loss of accuracy resulting from forming the
crossproduct matrix used in the normal equations is avoided, while not
requiring the storage of the full matrix of regressors. The method is
described by Lawson and Hanson, pages 207-212.
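A minimal usage sketch of this fit-and-query cycle follows. The data values are illustrative, and the package name com.imsl.stat is an assumption about the installation; only constructors and methods listed on this page are used.

```java
import com.imsl.stat.LinearRegression;  // package name assumed; adjust to your installation

public class LinearRegressionExample {
    public static void main(String[] args) {
        // Illustrative data: five observations on two independent variables.
        double[][] x = {
            {1.0, 2.0},
            {2.0, 1.0},
            {3.0, 4.0},
            {4.0, 3.0},
            {5.0, 5.0}
        };
        double[] y = {7.1, 8.2, 13.9, 15.1, 19.8};

        // Two variables, with an intercept term (beta_0).
        LinearRegression lr = new LinearRegression(2, true);
        lr.update(x, y);

        // With hasIntercept == true, coefficients[0] is the intercept.
        double[] coefficients = lr.getCoefficients();
        for (int j = 0; j < coefficients.length; j++) {
            System.out.println("beta_" + j + " = " + coefficients[j]);
        }
        System.out.println("rank = " + lr.getRank());
        // SSE, SST, and related statistics are available via lr.getANOVA().
    }
}
```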
From a general linear model fitted using the \( w_i \)'s as the weights, the inner class LinearRegression.CaseStatistics can also compute predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression. Let \( x_i \) be a column vector containing the elements of the \( i \)-th row of \( X \), and let \( W = {\rm diag}(w_1, w_2, \ldots, w_n) \). The leverage is defined as
$$ h_i = \left( x_i^T (X^T W X)^- x_i \right) w_i $$
where \( (X^T W X)^- \) denotes a generalized inverse of \( X^T W X \). (In the case of linear equality restrictions on \( \beta \), the leverage is defined in terms of the reduced model.) Put \( D = {\rm diag}(d_1, d_2, \ldots, d_k) \) with \( d_j = 1 \) if the \( j \)-th diagonal element of \( R \) is positive and 0 otherwise. The leverage is computed as \( h_i = (a^T D a) w_i \), where \( a \) is a solution to \( R^T a = x_i \). The estimated variance of
$$ \hat y_i = x_i^T \hat\beta $$
is given by \( h_i s^2 / w_i \), where \( s^2 = {\rm SSE}/{\rm DFE} \). The computation of the remainder of the case statistics follows easily from their definitions.
Let \( e_i \) denote the residual
$$ e_i = y_i - \hat y_i $$
for the \( i \)-th case. The estimated variance of \( e_i \) is \( (1-h_i)s^2/w_i \), where \( s^2 \) is the residual mean square from the fitted regression. The \( i \)-th standardized residual (also called the internally studentized residual) is by definition
$$ r_i = e_i \sqrt{\frac{w_i}{s^2(1-h_i)}} $$
and \( r_i \) follows an approximate standard normal distribution in large samples.
The \( i \)-th jackknife residual, or deleted residual, involves the difference between \( y_i \) and its predicted value based on the data set in which the \( i \)-th case is deleted. This difference equals \( e_i/(1-h_i) \). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the \( i \)-th case is deleted is
$$ s_i^2 = \frac{(n-r)s^2 - w_i e_i^2/(1-h_i)}{n-r-1} $$
The jackknife residual is defined to be
$$ t_i = e_i \sqrt{\frac{w_i}{s_i^2(1-h_i)}} $$
and \( t_i \) follows a \( t \) distribution with \( n-r-1 \) degrees of freedom.
Cook's distance for the \( i \)-th case is a measure of how much an individual case affects the estimated regression coefficients. It is given by
$$ D_i = \frac{w_i h_i e_i^2}{r s^2 (1-h_i)^2} $$
Weisberg (1985) states that if \( D_i \) exceeds the 50th percentile of the \( F(r, n-r) \) distribution, it should be considered large. (This value is about 1. This statistic does not have an \( F \) distribution.)
DFFITS, like Cook's distance, is also a measure of influence. For the \( i \)-th case, DFFITS is computed by the formula
$$ {\rm DFFITS}_i = e_i \sqrt{\frac{w_i h_i}{s_i^2(1-h_i)^2}} $$
Hoaglin and Welsch (1978) suggest that a value of \( {\rm DFFITS}_i \) greater than \( 2\sqrt{r/n} \) is large.
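As a sketch of obtaining these diagnostics through the API, continuing the example above: the entry point is getCaseStatistics, but the CaseStatistics accessor names used below (getLeverage, getCooksDistance, getJackknifeResidual, getDFFITS) are assumptions not confirmed by this page; consult the LinearRegression.CaseStatistics documentation for the exact names.

```java
// Diagnostics for one case, continuing the fit from the earlier sketch.
// NOTE: the CaseStatistics getter names below are assumed for illustration.
double[] xi = {3.0, 4.0};
double yi = 13.9;
LinearRegression.CaseStatistics cs = lr.getCaseStatistics(xi, yi);

double leverage = cs.getLeverage();        // assumed accessor for h_i
double cooksD = cs.getCooksDistance();     // assumed accessor for D_i
double tJack = cs.getJackknifeResidual();  // assumed accessor for t_i

// Screening rules from the text: D_i larger than about 1 is large, and
// |DFFITS_i| > 2*sqrt(r/n) flags an influential case.
int r = lr.getRank();
int n = 5;  // number of observations used in the fit above
if (cooksD > 1.0
        || Math.abs(cs.getDFFITS()) > 2.0 * Math.sqrt((double) r / n)) {
    System.out.println("Case is flagged as influential.");
}
```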
Often predicted values and confidence intervals are desired for combinations of settings of the effect variables not used in computing the regression fit. This can be accomplished using a single data matrix by including these settings of the variables as part of the data matrix and by setting the response equal to Double.NaN. LinearRegression will omit the case when performing the fit, and a predicted value and confidence interval for the missing response will be computed from the given settings of the effect variables.
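A sketch of this device with illustrative data follows. The final row and the use of getCaseStatistics to read the prediction back are assumptions about the natural workflow; the exact CaseStatistics accessors are documented on the inner class page.

```java
// Fit with a held-out case: the Double.NaN response is omitted from the
// fit, but a prediction for its settings can still be requested.
double[][] xAll = {
    {1.0, 2.0},
    {2.0, 1.0},
    {3.0, 4.0},
    {4.0, 3.0},
    {5.0, 5.0},
    {6.0, 2.0}   // settings for which only a prediction is wanted
};
double[] yAll = {7.1, 8.2, 13.9, 15.1, 19.8, Double.NaN};

LinearRegression lr2 = new LinearRegression(2, true);
lr2.update(xAll, yAll);

// pred = 1 requests a prediction interval for a single future response.
LinearRegression.CaseStatistics held =
        lr2.getCaseStatistics(new double[] {6.0, 2.0}, Double.NaN, 1);
// `held` carries the predicted value and confidence interval for this case;
// the specific accessor names are not listed on this page.
```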
Modifier and Type | Class and Description
---|---
class | LinearRegression.CaseStatistics. Inner class CaseStatistics allows for the computation of predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.
class | LinearRegression.CoefficientTTests. Contains statistics related to the regression coefficients.
Constructor and Description
---
LinearRegression(int nVariables, boolean hasIntercept). Constructs a new linear regression object.
Modifier and Type | Method and Description
---|---
ANOVA | getANOVA(). Get an analysis of variance table and related statistics.
LinearRegression.CaseStatistics | getCaseStatistics(double[] x, double y). Returns the case statistics for an observation.
LinearRegression.CaseStatistics | getCaseStatistics(double[] x, double y, double w). Returns the case statistics for an observation and a weight.
LinearRegression.CaseStatistics | getCaseStatistics(double[] x, double y, double w, int pred). Returns the case statistics for an observation, weight, and future response count for the desired prediction interval.
LinearRegression.CaseStatistics | getCaseStatistics(double[] x, double y, int pred). Returns the case statistics for an observation and future response count for the desired prediction interval.
double[] | getCoefficients(). Returns the regression coefficients.
LinearRegression.CoefficientTTests | getCoefficientTTests(). Returns statistics relating to the regression coefficients.
int[] | getPermute(). Returns an integer vector containing information about the permutation of the columns of the matrix of regressors during the QR factorization.
double[][] | getR(). Returns a copy of the R matrix.
int | getRank(). Returns the rank of the matrix.
double[] | getRHS(). Returns the right-hand side of the regression problem.
void | update(double[][] x, double[] y). Updates the regression object with a new set of observations.
void | update(double[][] x, double[] y, double[] w). Updates the regression object with a new set of observations and weights.
void | update(double[] x, double y). Updates the regression object with a new observation.
void | update(double[] x, double y, double w). Updates the regression object with a new observation and weight.
public LinearRegression(int nVariables, boolean hasIntercept)

Parameters:
nVariables - an int, the number of variables in the regression
hasIntercept - a boolean which indicates whether or not an intercept is in this regression model

public void update(double[] x, double y)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable

public void update(double[] x, double y, double w)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight

public void update(double[][] x, double[] y)

Parameters:
x - a double matrix containing the independent (explanatory) variables. The number of rows in x must equal the length of y, and the number of columns must be equal to the number of variables set in the constructor.
y - a double array containing the dependent (response) variables

public void update(double[][] x, double[] y, double[] w)

Parameters:
x - a double matrix containing the independent (explanatory) variables. The number of rows in x must equal the length of y, and the number of columns must be equal to the number of variables set in the constructor.
y - a double array containing the dependent (response) variables
w - a double array representing the weights
public double[] getCoefficients()

Returns:
a double array containing the regression coefficients. If hasIntercept is false, its length is equal to the number of variables. If hasIntercept is true, then its length is the number of variables plus one, and the 0-th entry is the value of the intercept. If the model is not full rank, the regression coefficients are not uniquely determined; in this case, a warning is issued and a solution with all linearly dependent regressors set to zero is returned.
public double[][] getR()

Returns:
a double matrix containing a copy of the R matrix

public ANOVA getANOVA()

Returns:
an ANOVA table and related statistics

public int getRank()

Returns:
an int, the rank of the matrix

public LinearRegression.CoefficientTTests getCoefficientTTests()

Returns:
statistics relating to the regression coefficients
public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the LinearRegression constructor.
y - a double representing the dependent (response) variable

public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, double w)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight

public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, int pred)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
pred - an int representing the number of future responses for which the prediction interval is desired on the average of the future responses

public LinearRegression.CaseStatistics getCaseStatistics(double[] x, double y, double w, int pred)

Parameters:
x - a double array containing the independent (explanatory) variables. Its length must be equal to the number of variables set in the constructor.
y - a double representing the dependent (response) variable
w - a double representing the weight
pred - an int representing the number of future responses for which the prediction interval is desired on the average of the future responses
public double[] getRHS()

Returns:
a double array containing a copy of the right-hand side d of the upper triangular system Rx = d from the QR-transformed regression problem

public int[] getPermute()

Returns:
an int array containing column pivoting information described in the column permutation matrix P of the QR factorization of the regressor matrix X, trans(Q)*X*P = R. The k-th element of the array contains the index of the column of the original matrix X that has been interchanged into column k.
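As an illustration of that relationship, the coefficients can be recovered by hand from getR(), getRHS(), and getPermute(): back-solve the upper triangular system R b = d, then undo the column pivoting. This is a sketch only; it assumes a full-rank model (nonzero diagonal of R) and 0-based permutation indices, and continues the lr object from the first example.

```java
// Recover the coefficients from the QR-transformed problem:
// trans(Q)*X*P = R and R*b = d, so beta = P*b.
double[][] R = lr.getR();
double[] d = lr.getRHS();
int[] perm = lr.getPermute();
int k = d.length;

// Back-substitution for the upper triangular system R*b = d
// (assumes full rank, i.e. nonzero diagonal of R).
double[] b = new double[k];
for (int i = k - 1; i >= 0; i--) {
    double sum = d[i];
    for (int j = i + 1; j < k; j++) {
        sum -= R[i][j] * b[j];
    }
    b[i] = sum / R[i][i];
}

// Undo the pivoting: perm[col] is the original column moved into column col.
double[] beta = new double[k];
for (int col = 0; col < k; col++) {
    beta[perm[col]] = b[col];
}
// beta should match lr.getCoefficients() up to rounding.
```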