
Package com.imsl.stat


Statistical methods. The IMSL stats package includes classes for descriptive statistics, hypothesis testing, multivariate analysis, time series analysis, regression analysis, parameter estimation, and more.

Design of experiments

The classes ANOVA, ANOVAFactorial, ANCOVA, and MultipleComparisons are for commonly used experimental designs. Typically, responses are stored in the input vector y in a pattern that takes advantage of the balanced design structure. Consequently, the full set of model subscripts is not needed to identify each response. The classes assume the usual pattern, which requires that the last model subscript change most rapidly, followed by the model subscript next in line, and so forth, with the first subscript changing at the slowest rate. This pattern is referred to as lexicographical ordering.
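
For example, in a balanced two-factor design with replicates, the position of a response in y follows directly from its model subscripts. The following plain-Java fragment (an illustration only, not part of the package API) shows the index computation implied by this ordering:

// Illustration only: mapping the model subscripts (i, j, k) of a balanced
// two-factor design with replicates to the position in the response vector y,
// using the lexicographical ordering described above (the last subscript
// changes most rapidly, the first most slowly).
public class ResponseIndexExample {
    public static void main(String[] args) {
        int nA = 2, nB = 3, nRep = 2;               // levels of factor A, factor B, replicates
        double[] y = new double[nA * nB * nRep];

        for (int i = 0; i < nA; i++) {              // slowest-changing subscript
            for (int j = 0; j < nB; j++) {
                for (int k = 0; k < nRep; k++) {    // fastest-changing subscript
                    int index = (i * nB + j) * nRep + k;
                    y[index] = 0.0;                 // store the response for cell (i, j), replicate k
                }
            }
        }
    }
}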

The ANOVA class allows missing responses if confidence interval information is not requested. Double.NaN (NaN = Not a Number) is the missing value code used by these classes. Any element of y that is missing must be set to NaN. The other classes do not allow missing responses because they assume balanced designs.

Basic Statistical Methods and Tests

The classes for computing basic statistics include Summary, Covariances, PartialCovariances, PooledCovariances, Sort, Ranks, and EmpiricalQuantiles. They generally have relatively simple arguments, and most of them allow for missing values. Missing value codes can be set by using Double.NaN.

The classes NormOneSample, NormTwoSample, WelchsTTest, ChiSquaredTest, and WilcoxonRankSum perform statistical tests and return a "p-value" for the test. The p-value is the probability, under the null hypothesis, of observing data that would yield a test statistic as extreme as or more extreme than the one computed from the sample. Hence, a small p-value is evidence for rejecting the null hypothesis.

Classes for tabulation include TableOneWay, TableTwoWay, and TableMultiWay.

Categorical and Discrete Data Analysis

The ContingencyTable class computes many statistics of interest in a two-way table. Statistics computed by this class include the usual chi-squared statistics, measures of association, Kappa, and many others.

The CategoricalGenLinModel class fits generalized linear models to discrete data. This class may be used to compute estimates and associated statistics for probit, logistic, minimum extreme value, Poisson, negative binomial (with known number of successes), and logarithmic models. Classification variables as well as weights, frequencies, and additive constants may be used, so quite general linear models can be fit. Residuals, a measure of influence, the coefficient estimates, and other statistics are returned for each model fit. When infinite parameter estimates are required, extended maximum likelihood estimation may be used. Log-linear models may be fit through the use of Poisson regression models. Results from Poisson regression models involving structural and sampling zeros will be identical to the results obtained from the log-linear model but will be fit by a quasi-Newton algorithm rather than through iterative proportional fitting.

Tests of Goodness of Fit

Goodness-of-fit is a term that refers to how well a hypothesized distribution fits a set of data (see Conover, 1980). The methods for goodness-of-fit include ChiSquaredTest, NormalityTest, KolmogorovOneSample, and KolmogorovTwoSample.

The chi-squared goodness-of-fit test ChiSquaredTest may be used with discrete as well as continuous distributions.

The Kolmogorov-Smirnov methods KolmogorovOneSample and KolmogorovTwoSample compute exact probabilities in small to moderate sample sizes.

The Kolmogorov-Smirnov and chi-squared goodness-of-fit test routines allow for missing values (NaN, not a number) in the input data.
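
As an illustration of the one-sample statistic, the following plain-Java sketch computes the Kolmogorov-Smirnov statistic \(D = \sup_x |F_n(x) - F(x)|\) against a hypothesized standard normal distribution, using the Cdf.normal(double) method from this package. In practice the KolmogorovOneSample class performs the test and also returns the associated probabilities.

import com.imsl.stat.Cdf;
import java.util.Arrays;

// Sketch of the one-sample Kolmogorov-Smirnov statistic against a hypothesized
// standard normal CDF. The KolmogorovOneSample class performs this test and
// computes probabilities; this fragment only illustrates the statistic itself.
public class KSStatisticSketch {
    public static void main(String[] args) {
        double[] x = {-0.8, -0.3, 0.1, 0.4, 1.2, 1.9};
        Arrays.sort(x);
        int n = x.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = Cdf.normal(x[i]);                  // hypothesized CDF at x[i]
            double dPlus = (i + 1) / (double) n - f;      // F_n(x) - F(x) just after the step
            double dMinus = f - i / (double) n;           // F(x) - F_n(x) just before the step
            d = Math.max(d, Math.max(dPlus, dMinus));
        }
        System.out.println("D = " + d);
    }
}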

Multivariate Analysis

Multivariate analysis is concerned with the relationships among and between multiple variables.

Cluster Analysis

Clustering deals with the problem of segmenting or organizing a set of individual objects into groups or "clusters" of similar objects, so that objects within a cluster are more similar to each other than they are to objects in other clusters. In practice, objects are virtually anything for which attributes or features can be measured and quantified into a data matrix X. Typically a set of clusters is constructed from a training data set. Then the clusters are used to predict the class label for a new object.

The library includes three classes that perform clustering and classification: ClusterKMeans, ClusterKNN, and ClusterHierarchical.

ClusterKMeans performs a K-means cluster analysis. The K-means method attempts to find a clustering that minimizes the within-cluster sum-of-squares. In this method of clustering, the data matrix X is grouped so that each observation (row in X) is assigned to one of a fixed number, K, of clusters. The sum of the squared difference of each observation about the mean of its assigned cluster is used as the criterion for assignment. In the basic algorithm, observations are transferred from one cluster to another when doing so decreases the within-cluster sum-of-squared differences. When no transfer occurs in a pass through the entire data set, the algorithm stops. The clustering can be evaluated by methods described in "Basic Statistics," or others throughout the library. Often, K-means clustering is performed with more than one value of K, and the value of K that best fits the data is used.
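
The assignment criterion can be illustrated with a short plain-Java fragment (not the library API); ClusterKMeans implements the complete transfer algorithm:

// Plain-Java sketch of the K-means assignment criterion: each observation is
// placed in the cluster whose mean minimizes the squared Euclidean distance.
// The ClusterKMeans class implements the full transfer algorithm; this
// fragment only illustrates the assignment step.
public class KMeansAssignmentSketch {
    static int[] assign(double[][] x, double[][] clusterMeans) {
        int[] membership = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int k = 0; k < clusterMeans.length; k++) {
                double ss = 0.0;
                for (int j = 0; j < x[i].length; j++) {
                    double d = x[i][j] - clusterMeans[k][j];
                    ss += d * d;                  // squared difference about the cluster mean
                }
                if (ss < best) {
                    best = ss;
                    membership[i] = k;
                }
            }
        }
        return membership;
    }
}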

ClusterKNN performs k-Nearest Neighbor, or KNN, cluster analysis. For a new object, KNN finds a cluster consisting of the object's K closest neighbors, where "closest" is in the sense of a specified distance metric, such as the L2 norm or the L1 norm. The new object then receives the class label that the majority of the objects in its neighborhood have (majority vote). The algorithm works with a training set. For each observation in the data X (representing a new object), the distance to each object in a training set is calculated. The distances are sorted, and the smallest K distances are noted. The objects with the K smallest distances form the cluster. Note that unlike K-Means, KNN does not provide a clustering of the data set. In a sense KNN forms clusters "on the fly" for the purpose of classification.
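
A plain-Java sketch of the classification rule (illustration only; ClusterKNN provides this functionality in the library) follows:

import java.util.Arrays;

// Plain-Java sketch of the k-nearest-neighbor rule described above: find the
// K training rows closest to a new observation (L2 norm here) and take a
// majority vote over their class labels.
public class KNNSketch {
    static int classify(double[][] train, int[] labels, double[] xNew, int k, int nClasses) {
        int n = train.length;
        double[][] distIdx = new double[n][2];
        for (int i = 0; i < n; i++) {
            double ss = 0.0;
            for (int j = 0; j < xNew.length; j++) {
                double d = train[i][j] - xNew[j];
                ss += d * d;
            }
            distIdx[i][0] = Math.sqrt(ss);        // L2 distance to training row i
            distIdx[i][1] = i;
        }
        Arrays.sort(distIdx, (a, b) -> Double.compare(a[0], b[0]));
        int[] votes = new int[nClasses];
        for (int m = 0; m < k; m++) {
            votes[labels[(int) distIdx[m][1]]]++; // vote from the m-th nearest neighbor
        }
        int best = 0;
        for (int c = 1; c < nClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }
}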

ClusterHierarchical performs a variant of hierarchical cluster analysis. ClusterHierarchical implements a bottom-up, or agglomerative approach. Starting with every individual object in a data set as a cluster, the two clusters (objects) that are closest in the sense of some distance metric are then combined into a new cluster. With the new set of n - 1 clusters, the process is repeated, so that at the h-th stage, there are n - h clusters. The clustering continues until there is only one cluster. Each level represents a different clustering of the data that can be evaluated using fit statistics and other metrics.

Principal Components and Factor Analysis

The idea in principal components is to find a small number of linear combinations of the original variables that maximize the variance accounted for in the original data. This amounts to an eigensystem analysis of the covariance (or correlation) matrix. The class FactorAnalysis performs either principal component analysis or factor analysis. When the principal component model is used, FactorAnalysis computes standard errors for the eigenvalues. Correlations of the original variables with the principal component scores also are computed.

Factor analysis and principal component analysis, while quite different in assumptions, often serve the same ends. Unlike principal components, in which linear combinations yielding the highest possible variances are obtained, factor analysis generally obtains linear combinations of the observed variables according to a model relating the observed variables to hypothesized underlying factors, plus a random error term called the unique error or uniqueness. In factor analysis, the unique errors associated with each variable are usually assumed to be independent of the factors. In addition, in the common factor model, the unique errors are assumed to be mutually independent. The factor analysis model is expressed in the following equation: $$x - \mu = \Lambda f + e $$ where \(x\) is the \(p\) vector of observed values, \(\mu\) is the \(p\) vector of variable means, \(\Lambda\) is the \(p \times k\) matrix of factor loadings, \(f\) is the \(k\) vector of hypothesized underlying random factors, \(e\) is the \(p\) vector of hypothesized unique random errors, \(p\) is the number of observed variables, and \(k\) is the number of factors.

Because much of the computation in factor analysis was originally done by hand or was expensive on early computers, quick (but dirty) algorithms that made the calculations possible were developed. One result is the many factor extraction methods available today. Generally speaking, in the exploratory or model building phase of a factor analysis, a method of factor extraction that is not computationally intensive (such as principal components, principal factor, or image analysis) is used. If desired, a computationally intensive method is then used to obtain the final factors.

Discriminant Analysis

Discriminant analysis involves fitting a mathematical "boundary" between groups for the purpose of classifying new observations into those groups. The class DiscriminantAnalysis performs linear or quadratic discrimination together with a number of approaches for classifying new observations, including reclassification, leave-out-one, and split-sample. The data consist of multivariate observations along with known memberships in a fixed number of groups. The discriminant function is then fit to the known data. The analysis can be executed in an online mode, meaning that one or more observations can be added to the rule during each invocation of DiscriminantAnalysis.update(double[][]). One or more observations can also be removed from the state of the model, using DiscriminantAnalysis.downdate(double[][], int[]).

Nonparametric Statistics

Nonparametric statistics are derived without any assumptions about the underlying probability distribution. Examples include measures of location and scale (see "Basic Statistics"), descriptive measures in a contingency table (see "Categorical and Discrete Data Analysis") and statistics used in goodness of fit tests (see "Tests of Goodness of Fit"). In addition, the classes SignTest and WilcoxonRankSum perform nonparametric tests comparing two populations.

Probability Distribution Functions and Inverses

The classes Pdf and Cdf provide static methods for several continuous and discrete probability distributions.

The distribution function for the (real, single-valued) random variable X is the function F defined for all real x by

$$F(x) = {\rm {Prob}}(X \leq x)$$

where \(\rm Prob(\cdot)\) denotes the probability of an event. The distribution function is often called the cumulative distribution function (CDF).

For distributions with finite ranges, such as the beta distribution, the CDF is \(0\) for values less than the left endpoint and \(1\) for values greater than the right endpoint. The methods in the Cdf class return the correct values for the distribution functions when values outside of the range of the random variable are input, but warning error conditions are set in these cases.

Discrete Distributions

For discrete distributions, the function giving the probability that the random variable takes on specific values is called the probability mass function, defined by

$$p(x) = {\rm {Prob}}(X = x)$$

The CDF for a discrete random variable is

$$F(x) = \sum\limits_A p(k) $$

where the sum is over the set \(A = \{k : k \leq x\}\). Since the distribution function is a step function, its inverse does not exist uniquely.

Continuous Distributions

For continuous distributions, a probability mass function, as defined above, would not be useful because the probability of any given point is \(0\). For such distributions, the useful analog is the probability density function (PDF). The integral of the PDF over an interval gives the probability that the random variable falls in that interval. If the continuous random variable X has PDF f, then

$${\rm Prob}(a \le X \le b) = \int_a^b f(x)\,dx $$

The relationship between the CDF and the PDF is

$$F(x) = \int_{-\infty}^x f(t) \, dt $$

For (absolutely) continuous distributions, the value of \(F(x)\) uniquely determines \(x\) within the support of the distribution. The "inverse" methods in the Cdf class compute the inverses of the distribution functions, that is, given \(F(x)\), they compute \(x\). The inverses are defined only over the open interval \((0,1)\).

Additional Comments

Whenever a probability close to \(1.0\) results from a call to a distribution function or is to be input to an inverse function, it is often impossible to achieve good accuracy because of the nature of the representation of numeric values. In this case, it may be better to work with the complementary distribution function (one minus the distribution function). If the distribution is symmetric about some point (as is the normal distribution, for example) or is reflective about some point (as is the beta distribution, for example), the complementary distribution function has a simple relationship with the distribution function. For example, evaluating the standard normal distribution at \(4.0\) with the Cdf.normal method directly gives, to six places, \(0.999968\). Only two of those digits are really useful, however. A more useful result is \(1.000000\) minus this value, which can be obtained to six places as \(3.16712\text{e-}5\) by evaluating the normal distribution function at \(-4.0\). For the normal distribution, the two values are related by \(\Phi(x) = 1 - \Phi(-x)\), where \(\Phi (\cdot)\) is the normal distribution function. Another example is the beta distribution with parameters \(2\) and \(10\). This distribution is skewed to the right, so evaluating the beta distribution function at \(0.7\) gives \(0.999953\). A more precise result is obtained by evaluating the beta distribution with parameters \(10\) and \(2\) at \(0.3\). This yields \(4.72392\text{e-}5\).
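
The following sketch reproduces the normal example above with the Cdf.normal(double) method:

import com.imsl.stat.Cdf;

// Evaluating the upper tail of the standard normal distribution at 4.0.
// Cdf.normal(4.0) returns roughly 0.999968; the complementary probability is
// obtained with full relative accuracy by evaluating at -4.0 instead.
public class TailProbabilityExample {
    public static void main(String[] args) {
        double lower = Cdf.normal(4.0);    // about 0.999968, few useful digits in the tail
        double upper = Cdf.normal(-4.0);   // about 3.16712e-5, the complementary probability
        System.out.println(lower + "  " + upper);
    }
}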

Many of the algorithms used by the classes in this section are discussed by Abramowitz and Stegun (1964). The algorithms make use of various expansions and recursive relationships and often use different methods in different regions.

Cumulative distribution functions are defined for all real arguments. However, if the input to one of the distribution functions in this section is outside the range of the random variable, an error is issued.

Definitions and discussions of the basic terms can be found in Johnson and Kotz (1969, 1970a, 1970b). These are also good references for the specific distributions.

Many of the methods for probability distributions are written for standard forms of statistical distributions. Hence, the number of parameters for any given distribution may be fewer than the number often associated with the distribution. Also, the methods relating to the normal distribution, Cdf.normal(double) and Cdf.inverseNormal(double), are for a normal distribution with mean equal to zero and variance equal to one. For other means and variances, it is very easy for the user to standardize the variables by subtracting the mean and dividing by the square root of the variance.
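
For example, probabilities and quantiles for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) can be obtained by standardizing, as in the following sketch using the documented Cdf.normal(double) and Cdf.inverseNormal(double) methods (the numeric values here are illustrative):

import com.imsl.stat.Cdf;

// Using the standard normal methods with a normal distribution that has
// mean mu and variance sigma^2 by standardizing the argument.
public class StandardizeExample {
    public static void main(String[] args) {
        double mu = 10.0, sigma = 2.0, x = 13.0;
        double p = Cdf.normal((x - mu) / sigma);          // Prob(X <= 13) for N(10, 4)
        double q = mu + sigma * Cdf.inverseNormal(0.975); // 97.5th percentile of N(10, 4)
        System.out.println(p + "  " + q);
    }
}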

Random Number Generation

This section describes classes and methods for the generation of random numbers that are useful for applications in simulation studies.

In the following discussions, the phrases random numbers, random deviates, deviates, and variates are used interchangeably. The phrase pseudorandom is sometimes used to emphasize that the numbers generated are really not random since they result from a deterministic process. The usefulness of pseudorandom numbers is derived from the similarity, in a statistical sense, of samples of the pseudorandom numbers to samples of observations from the specified distributions. In short, while the pseudorandom numbers are completely deterministic and repeatable, they simulate the realizations of independent and identically distributed random variables.

Class Random extends java.util.Random. It adds the ability to generate random numbers from a number of different non-uniform distributions. The non-uniform random number generators use a uniform random number generator. The uniform random number generator in java.util.Random can be used, and Random also adds a linear congruential generator. It is also possible to use another uniform generator as long as it implements the Random.BaseGenerator interface.

The MersenneTwister and MersenneTwister64 uniform random number generators generate random number series with very long periods. They both implement the Random.BaseGenerator interface and so can be used as the base uniform generator in com.imsl.stat.Random.
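
A minimal sketch of generating non-uniform deviates is shown below; the nextNormal() method name is assumed here, and the base generator can be replaced as described above (consult the Random and MersenneTwister class documentation for the exact constructors and method names).

import com.imsl.stat.Random;

// Minimal sketch of non-uniform random number generation. The nextNormal()
// method name is assumed; see the Random class documentation for the full
// list of non-uniform generators and for plugging in a base generator such
// as MersenneTwister through the Random.BaseGenerator interface.
public class RandomSketch {
    public static void main(String[] args) {
        Random r = new Random();
        r.setSeed(123457L);          // setSeed is inherited from java.util.Random
        double z = r.nextNormal();   // one standard normal deviate (assumed method name)
        System.out.println(z);
    }
}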

The class FaureSequence generates the low-discrepancy Faure sequence. This is also called a quasi-random generator. The sequence is a series of points in n dimensions that are close to being as equally spaced as possible. Low-discrepancy refers to how far the sequence departs from being exactly equally spaced.

The FaureSequence implements the RandomSequence interface. This interface defines a sequence of points in n-dimensions. It is used by the class HyperRectangleQuadrature to evaluate n-dimensional integrals.

Regression Analysis

The classes LinearRegression and RegressorsForGLM contain methods for linear regression and the multivariate general linear model. The class NonlinearRegression is available for nonlinear regression models. Methods for estimating or fitting regression models, methods for computing summary statistics and diagnostics, and methods for computing confidence intervals for individual cases are provided. There are also methods for building a model from a set of candidate variables, i.e., variable selection.

Simple and Multiple Linear Regression

The simple linear regression model is

$$y_i=\beta_0+\beta_1 x_i+\varepsilon_i$$

for \(i=1,2,\ldots ,n\). The \(y_i\)'s represent the observed values of the response or dependent variable, the \(x_i\)'s are the settings of the independent (explanatory) variable, \(\beta_0 \) and \(\beta_1\) are the intercept and slope parameters (respectively) and the \(\varepsilon_i\)'s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2\).

The multiple linear regression model is

$$y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\ldots+ \beta_k x_{ik}+\varepsilon_i$$

for \(i=1,2,\ldots,n\). The \(y_i\)'s constitute the observed values of the response or dependent variable; the \(x_{i1} \)'s, \(x_{i2}\)'s,\(\ldots,\) \(x_{ik}\)'s are the settings of the k independent (explanatory) variables; \(\beta_0,\beta_1,\ldots,\beta_k\) are the regression coefficients; and the \(\varepsilon_i\)'s are independently distributed normal errors, each with mean 0 and variance \(\sigma^2 \).

The class LinearRegression fits both the simple and multiple linear regression models using a fast Givens transformation and includes an option for excluding the intercept \(\beta_0\). The responses are input in array y, and the independent variables are input in array x, where the individual cases correspond to the rows and the variables correspond to the columns.

After the model has been fitted using the LinearRegression class, member "get" methods such as LinearRegression.getCoefficientTTests() can be used to retrieve summary statistics. Predicted values, confidence intervals, and case statistics for the fitted model can be obtained through the inner class returned by LinearRegression.getCaseStatistics(double[], double).
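
A minimal sketch of a fit is shown below. The constructor, update, and getCoefficients signatures are assumed from typical usage and should be checked against the LinearRegression class documentation; getANOVA() and getCoefficientTTests() are the accessors described above.

import com.imsl.stat.ANOVA;
import com.imsl.stat.LinearRegression;

// Minimal sketch of fitting a multiple linear regression and retrieving
// summary statistics. Constructor and update signatures shown here are
// assumed from typical usage.
public class RegressionSketch {
    public static void main(String[] args) throws Exception {
        double[][] x = {{1.0, 2.0}, {2.0, 1.0}, {3.0, 4.0}, {4.0, 3.0}, {5.0, 5.0}};
        double[] y = {3.2, 3.9, 7.1, 7.0, 10.1};

        LinearRegression reg = new LinearRegression(2, true); // two regressors, with intercept (assumed constructor)
        reg.update(x, y);                                     // accumulate the observations (assumed signature)

        double[] beta = reg.getCoefficients();                // intercept and slopes (assumed accessor)
        ANOVA aov = reg.getANOVA();                           // analysis-of-variance summary
        System.out.println(beta.length + " coefficients; summary statistics in aov.getArray()");
    }
}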

No Intercept Model

Several classes provide the option for excluding the intercept from a model. In most practical applications, the intercept should be included in the model. For classes that use the sums of squares and crossproducts matrix as input, the no-intercept case can be handled by using the raw sums of squares and crossproducts matrix as input in place of the corrected sums of squares and crossproducts. The raw sums of squares and crossproducts matrix can be computed as \(\left({x_1,x_2,\ldots,x_k,y}\right)^T\left({x_1,x_2,\ldots,x_k,y} \right)\).

Specification of X for the General Linear Model

Variables used in the general linear model are either continuous or classification variables. Typically, multiple regression models use continuous variables, whereas analysis of variance models use classification variables. Although the notation used to specify analysis of variance models and multiple regression models may look quite different, the models are essentially the same. The term general linear model emphasizes that a common notational scheme is used for specifying a model that may contain both continuous and classification variables.

A general linear model is specified by its effects (sources of variation). An effect is a single variable or a product of variables. In particular, an effect is composed of one of the following:

  1. a single continuous variable
  2. a single classification variable
  3. several different classification variables
  4. several continuous variables, some of which may be the same
  5. continuous variables, some of which may be the same, and classification variables, which must be distinct

Effects of the first type are common in multiple regression models. Effects of the second type appear as main effects in analysis of variance models. Effects of the third type appear as interactions in analysis of variance models. Effects of the fourth type appear in polynomial models and response surface models as powers and crossproducts of some basic variables. Effects of the fifth type appear in one-way analysis of covariance models as regression coefficients that indicate lack of parallelism of a regression function across the groups.

The analysis of a general linear model occurs in two stages. The first stage calls class RegressorsForGLM to specify all regressors except the intercept. The second stage uses LinearRegression at which point the model will be specified as either having or not having an intercept.

Suppose the data matrix has as its first four columns two continuous variables in Columns \(0\) and \(1\) and two classification variables in Columns \(2\) and \(3\). The data might appear as follows:

Column 0 | Column 1 | Column 2 | Column 3
11.23 | 1.23 | 1.0 | 5.0
12.12 | 2.34 | 1.0 | 4.0
12.34 | 1.23 | 1.0 | 4.0
4.34 | 2.21 | 1.0 | 5.0
5.67 | 4.31 | 2.0 | 4.0
4.12 | 5.34 | 2.0 | 1.0
4.89 | 9.31 | 2.0 | 1.0
9.12 | 3.71 | 2.0 | 1.0

Each distinct value of a classification variable determines a level. The classification variable in Column \(2\) has two levels. The classification variable in Column \(3\) has three levels. (Integer values are recommended, but not required, for values of the classification variables. The values of the classification variables corresponding to the same level must be identical.) Some examples of regression functions and their specifications are as follows:

Regression Function | Intercept | Class Columns | Effects
\(\beta_0+\beta_1 x_1\) | true | \(\{\}\) | \(\{\{0\}\}\)
\(\beta_0+\beta_1 x_1 + \beta_2 x_1^2\) | true | \(\{\}\) | \(\{\{0\}, \{0,0\}\}\)
\(\mu+\alpha_i\) | true | \(\{2\}\) | \(\{\{2\}\}\)
\(\mu+\alpha_i+\beta_j+\gamma_{ij}\) | true | \(\{2, 3\}\) | \(\{\{2\}, \{3\}, \{2, 3\}\}\)
\(\mu_{ij}\) | false | \(\{2, 3\}\) | \(\{\{2, 3\}\}\)
\(\beta_0+\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2\) | true | \(\{\}\) | \(\{\{0\}, \{1\}, \{0, 1\}\}\)

Variable Selection

Variable selection can be performed by SelectionRegression, which computes all best-subset regressions, or by StepwiseRegression, which performs stepwise regression. The method used by SelectionRegression is generally preferred over that used by StepwiseRegression because SelectionRegression implicitly examines all possible models in the search for a model that optimizes some criterion, while StepwiseRegression does not examine all possible models. However, the computer time and memory requirements for SelectionRegression can be much greater than those for StepwiseRegression when the number of candidate variables is large.

Nonlinear Regression

The nonlinear regression model is

$$y_i=f(x_i;\theta)+\varepsilon_i$$

for \(i=1,2,\ldots,n\). The \(y_i\)'s are the values of the response or dependent variable, the \(x_i\)'s are the known vectors of values of the independent (explanatory) variables, f is a known function of an unknown regression parameter vector \(\theta\), and the \(\varepsilon_i\)'s are independently distributed normal errors each with mean \(0\) and variance \(\sigma^2\).

Class NonlinearRegression performs the least-squares fit to the data for this model.

Weighted Least Squares

Classes throughout the section generally allow weights to be assigned to the observations. A weight argument is used throughout to specify the weighting for particular rows of X.

Computations that relate to statistical inference (e.g., t tests, F tests, and confidence intervals) are based on the multiple regression model, except that the variance of \(\varepsilon_i\) is assumed to equal \(\sigma^2\) times the reciprocal of the corresponding weight.

If a single row of the data matrix corresponds to \(n_i\) observations, the vector frequencies can be used to specify the frequency for each row of X. Degrees of freedom for error are affected by frequencies but are unaffected by weights.

Regression Summary Statistics

Methods LinearRegression.getANOVA(), LinearRegression.getCoefficientTTests(), NonlinearRegression.getR() and StepwiseRegression.getCoefficientVIF() can be used to compute statistics related to a regression for each of the dependent variables. The summary statistics include the model analysis of variance table, sequential sums of squares and F-statistics, coefficient estimates, estimated standard errors, t-statistics, variance inflation factors and estimated variance-covariance matrix of the estimated regression coefficients.

The summary statistics are computed under the model \(y=X\beta+ \varepsilon\), where y is the \(n\times 1\) vector of responses, X is the \(n\times p\) matrix of regressors with rank\((X)=r\), \(\beta\) is the \(p\times 1\) vector of regression coefficients, and \(\varepsilon\) is the \(n\times 1\) vector of errors whose elements are independently normally distributed with mean 0 and variance \(\sigma^2 /w_i\).

Given the results of a weighted least-squares fit of this model (with the \(w_i\)'s as the weights), most of the computed summary statistics are output in an ANOVA object.

The getANOVA() methods in several of the regression classes return an ANOVA object. Summary statistics can be retrieved via specific "get" methods or the ANOVA.getArray() method, which returns a one-dimensional array. In StepwiseRegression, ANOVA.getArray() returns Double.NaN for the last two elements of the array because they cannot be computed from the input. The array contains statistics related to the analysis of variance. The sources of variation examined are the regression, error, and total. The first 10 elements of the array returned by ANOVA.getArray() and the notation frequently used for them are described in the following table (here, AOV denotes the array returned by ANOVA.getArray()):

Model Analysis of Variance Table

Source of Variation | Degrees of Freedom | Sum of Squares | Mean Square | F | p-value
Regression | DFR = AOV[0] | SSR = AOV[3] | MSR = AOV[6] | AOV[8] | AOV[9]
Error | DFE = AOV[1] | SSE = AOV[4] | \(s^2\) = AOV[7] | |
Total | DFT = AOV[2] | SST = AOV[5] | | |
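
For example, the quantities in the table can be read from the array as follows (the ANOVA object is assumed to come from a regression fit, e.g., LinearRegression.getANOVA()):

import com.imsl.stat.ANOVA;

// Reading the analysis-of-variance quantities out of the array returned by
// ANOVA.getArray(), using the index layout in the table above.
public class AnovaArraySketch {
    static void printSummary(ANOVA anova) {
        double[] aov = anova.getArray();
        System.out.println("Regression: df = " + aov[0] + ", SS = " + aov[3] + ", MS = " + aov[6]);
        System.out.println("Error:      df = " + aov[1] + ", SS = " + aov[4] + ", MS = " + aov[7]);
        System.out.println("Total:      df = " + aov[2] + ", SS = " + aov[5]);
        System.out.println("F = " + aov[8] + ", p-value = " + aov[9] + ", R-squared = " + aov[10] + "%");
    }
}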

If the model has an intercept (default), the total sum of squares is the sum of squares of the deviations of \(y_i\) from its (weighted) mean \(\bar {y}\), the so-called corrected total sum of squares, denoted by the following:

$$\textrm{SST}=\sum\limits_{i=1}^n w_i(y_i-\bar y)^2 $$

If the model does not have an intercept (hasIntercept = false), the total sum of squares is the sum of squares of \(y_i\), the so-called uncorrected total sum of squares, denoted by the following:

$$\textrm{SST}=\sum\limits_{i=1}^n w_i y_i^2 $$

The error sum of squares is given as follows:

$$\textrm{SSE}=\sum\limits_{i=1}^n{w_i}\left(y_i- \hat{y_i}\right)^2$$

where \( \hat{y_i} = x_i\hat{\beta} \) is the value estimated by the fitted regression model.

The error degrees of freedom is defined by \(\textrm{DFE}=n-r \).

The estimate of \(\sigma^2\) is given by \(s^2= \textrm{SSE}/\textrm{DFE}\), which is the error mean square.

The computed F statistic for the null hypothesis, \(H_0: \beta_1=\beta_2=\ldots=\beta_p=0\), versus the alternative that at least one coefficient is nonzero is given by \(F=\textrm{MSR}/s^2 \). The p-value associated with the test is the probability of an F larger than that computed under the assumption of the model and the null hypothesis. A small p-value (less than 0.05) is customarily used to indicate there is sufficient evidence from the data to reject the null hypothesis.

The remaining five elements in AOV frequently are displayed together with the actual analysis of variance table. The quantities R-squared (\(R^2\) = AOV[10]) and adjusted R-squared (\(R_a^2\) = AOV[11]) are expressed as a percentage and are defined as follows:

$$R^2=100\left(\textrm{SSR}/\textrm{SST}\right)=100 \left(1-\textrm{SSE}/\textrm{SST}\right)$$

$$R_a^2=100\ \textrm{max}\left\{0,1-\frac{s^2}{\textrm{ SST}/ \textrm{DFT}}\right\}$$

The square root of \(s^2\), \(s\) = AOV[12], is frequently referred to as the estimated standard deviation of the model error.

The overall mean of the responses \({\bar y}\) is output in AOV[13].

The coefficient of variation (\(CV\) = AOV[14]) is expressed as a percentage and defined by \(\textrm{CV}=100s/\bar y \).

LinearRegression.CoefficientTTests is a nested class within the LinearRegression and StepwiseRegression classes. The statistics (estimated standard error, t-statistic, and p-value) associated with each coefficient can be retrieved with their associated "get" methods. Finally, getR() returns the variance-covariance matrix of the estimated coefficients.

Diagnostics for Individual Cases

Diagnostics for individual cases (observations) are computed by the LinearRegression.CaseStatistics class for linear regression.

Statistics computed include predicted values, confidence intervals, and diagnostics for detecting outliers and cases that greatly influence the fitted regression.

The diagnostics are computed under the model \(y=X\beta+\varepsilon\), where y is the \(n\times 1\) vector of responses, X is the \(n\times p\) matrix of regressors with rank(X) = r, \(\beta\) is the \(p\times 1 \) vector of regression coefficients, and \(\varepsilon \) is the \(n\times 1\) vector of errors whose elements are independently normally distributed with mean \(0\) and variance \(\sigma^2/w_i\).

Given the results of a weighted least-squares fit of this model (with the \(w_i\)'s as the weights), the following five diagnostics are computed:

  1. leverage
  2. standardized residual
  3. jackknife residual
  4. Cook's distance
  5. DFFITS

The definition of these terms is given in the discussion that follows:

Let \(x_i\) be a column vector containing the elements of the i-th row of X. A case can be unusual either because of \(x_i\) or because of the response \(y_i\). The leverage \(h_i \) is a measure of uniqueness of the \(x_i\). The leverage is defined by

$$h_i=[x_i^T\left({X^T WX}\right)^-x_i]w_i $$

where \(W=\textrm{diag}\left({w_1,w_2,\ldots,w_n}\right)\) and \(\left({X^T WX}\right)^-\) denotes a generalized inverse of \(X^T WX\). The average value of the \(h_i \)'s is \(r/n\). Regression classes declare \(x_i\) unusual if \(h_i>2r/n\). Hoaglin and Welsch (1978) call a data point highly influential (i.e., a leverage point) when this occurs.

Let \(e_i\) denote the residual

$$y_i-\hat y_i$$

for the i-th case. The estimated variance of \(e_i\) is \((1-h_i)s^2/w_i\), where \(s^2\) is the residual mean square from the fitted regression. The i-th standardized residual (also called the internally studentized residual) is by definition

$$r_i=e_i\sqrt{\frac{w_i}{s^2\left({1-h_i}\right)}} $$

and \(r_i\) follows an approximate standard normal distribution in large samples.

The i-th jackknife residual or deleted residual involves the difference between \(y_i\) and its predicted value, based on the data set in which the i-th case is deleted. This difference equals \(e_i/\left({1-h_i}\right)\). The jackknife residual is obtained by standardizing this difference. The residual mean square for the regression in which the i-th case is deleted is as follows:

$$s_i^2=\frac{{\left({n-r}\right)s^2-w_i e_i^2/ \left({1-h_i}\right)}}{{n-r-1}}$$

The jackknife residual is defined as

$$t_i=e_i\sqrt{\frac{w_i}{s_i^2\left({1-h_i} \right)}}$$

and \(t_i\) follows a \(t\)-distribution with \(n-r-1\) degrees of freedom.

Cook's distance for the i-th case is a measure of how much an individual case affects the estimated regression coefficients. It is given as follows:

$$D_i=\frac{w_i h_i e_i^2}{rs^2\left({1-h_i}\right) ^2}$$

Weisberg (1985) states that if \(D_i\) exceeds the 50-th percentile of the F(r, n - r) distribution, it should be considered large. (This value is about 1. This statistic does not have an F distribution.)

DFFITS, like Cook's distance, is also a measure of influence. For the i-th case, DFFITS is computed by the formula below.

$$\text{DFFITS}_i=e_i\sqrt{\frac{w_i h_i}{s_i^2 \left({1-h_i}\right)^2}}$$

Hoaglin and Welsch (1978) suggest that DFFITS greater than

$$2\sqrt{r/n}$$

is large.
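
The following plain-Java sketch collects these five diagnostics for a single case from its residual e, leverage h, and weight w, together with \(s^2\), r, and n from the fitted model; the LinearRegression.CaseStatistics class computes the same quantities in the library.

// Plain-Java sketch of the case diagnostics defined above, computed for one
// case from its residual e, leverage h, weight w, the residual mean square s2,
// the model rank r, and the number of observations n.
public class CaseDiagnosticsSketch {
    static void diagnostics(double e, double h, double w, double s2, int r, int n) {
        double ri = e * Math.sqrt(w / (s2 * (1.0 - h)));                       // standardized residual
        double si2 = ((n - r) * s2 - w * e * e / (1.0 - h)) / (n - r - 1);     // deleted residual mean square
        double ti = e * Math.sqrt(w / (si2 * (1.0 - h)));                      // jackknife residual
        double di = w * h * e * e / (r * s2 * (1.0 - h) * (1.0 - h));          // Cook's distance
        double dffits = e * Math.sqrt(w * h / (si2 * (1.0 - h) * (1.0 - h)));  // DFFITS
        boolean leveragePoint = h > 2.0 * r / (double) n;                      // flag an unusual x_i
        System.out.println(ri + " " + ti + " " + di + " " + dffits + " " + leveragePoint);
    }
}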

Transformations

Transformations of the independent variables are sometimes useful in order to satisfy the regression model. The inclusion of squares and crossproducts of the variables

$$\left({x_1,x_2,x_1^2,x_2^2,x_1 x_2}\right) $$

is often needed. Logarithms of the independent variables are used also. (See Draper and Smith 1981, pp. 218-222; Box and Tidwell 1962; Atkinson 1985, pp. 177-180; Cook and Weisberg 1982, pp. 78-86.)

When the responses are described by a nonlinear function of the parameters, a transformation of the model equation often can be selected so that the transformed model is linear in the regression parameters. For example, by taking natural logarithms on both sides of the equation, the exponential model

$$y=e^{\beta_0+\beta_1 x_1}\varepsilon $$

can be transformed to a model that satisfies the linear regression model provided the \(\varepsilon_i\)'s have a log-normal distribution (Draper and Smith, pp. 222-225).
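
Explicitly, the transformed model is

$$\ln y = \beta_0 + \beta_1 x_1 + \ln\varepsilon$$

which has the form of the simple linear regression model with response \(\ln y\) and error term \(\ln\varepsilon\).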

When the responses are nonnormal and their distribution is known, a transformation of the responses can often be selected so that the transformed responses closely satisfy the regression model assumptions. The square-root transformation for counts with a Poisson distribution and the arc-sine transformation for binomial proportions are common examples (Snedecor and Cochran 1967, pp. 325-330; Draper and Smith, pp. 237-239).

Missing Values

NaN (Not a Number) is the missing value code used by the regression classes. Use field Double.NaN to retrieve NaN. Any element of the data matrix that is missing must be set to Double.NaN. In fitting regression models, any observation containing NaN for the independent, dependent, weight, or frequency variables is omitted from the computation of the regression parameters.

Survival and Reliability Analysis

The classes described in this section have primary application in the areas of reliability and life testing, but they may find application in any situation in which analysis of binomial events over time is of interest. Kalbfleisch and Prentice (1980), Elandt-Johnson and Johnson (1980), Lee (1980), Gross and Clark (1975), Lawless (1982), Chiang (1968), and Tanner and Wong (1984) discuss the models and methods described in this section. The class KaplanMeierEstimates produces Kaplan-Meier (product-limit) estimates of the survival distribution in a single population. The class ProportionalHazards computes the parameter estimates in a proportional hazards model. The class LifeTables computes and (optionally) prints an actuarial table based either upon a cohort followed over time or a cross-section of a population.

Time Series and Forecasting

The classes in this section allow users to filter and analyze time series data using autoregressive, ARIMA, GARCH and state-space models. Some are designed for automatic model selection, estimation and forecasting using algorithms based upon minimizing AIC (Akaike's Information Criterion).

General Methodology

Model Identification and Data Filtering

Most time series classes in the library assume the data are stationary and collected at equally spaced times with no missing observations in the series. In order to prepare a series for analysis, any missing values must be replaced with estimates. The class ARMAEstimateMissing can be used to automatically estimate missing values using one of four user-selected methods, provided the largest time gap with missing values has no more than 4 missing values. An exception is the class AutoARIMA, which accepts a data series with missing values.

In addition to estimating missing values, if a series is nonstationary or contains seasonal variation, its values should be filtered before fitting a model and preparing forecasts. Class ARSeasonalFit evaluates seasonal adjustments prior to modeling the series. Users specify a range of adjustments, and ARSeasonalFit evaluates each adjustment to identify the one that minimizes the AIC. If a user already knows the series needs to be filtered using differences between consecutive values, the class Difference can be used to filter a series into an equivalent series of differences.

Several classes compute the correlation structure of time series: AutoCorrelation, CrossCorrelation, and MultiCrossCorrelation. Class AutoCorrelation computes the autocorrelation and partial autocorrelation matrices from a time series or its filtered values. Box and Jenkins (1976) describe how to identify the structure of an ARIMA model using the autocorrelation and partial autocorrelation matrices.

Classes CrossCorrelation and MultiCrossCorrelation are used to calculate the cross-correlation matrix for two univariate or multivariate time series, respectively.

Model Estimation and Forecasting 

There are several classes useful for modeling stationary, equally spaced time series: ARMA, ARMAMaxLikelihood, GARCH, and KalmanFilter. Class ARMA can be used to fit autoregressive, moving-average, and ARIMA (autoregressive, integrated moving average) models. Rather than passing the original series to this class, users are required to construct this class using the autocorrelation matrix of the series. Class AutoCorrelation is useful for computing this matrix prior to constructing the ARMA class.

The other classes require users to construct the class using the original series or its filtered values. Class ARMAMaxLikelihood computes maximum likelihood estimates of the ARMA parameters and prepares forecasts from these values. Class ARAutoUnivariate selects the best autoregressive model for an equally spaced, stationary time series using AIC. Class GARCH estimates the parameters of a GARCH model, and the KalmanFilter class estimates the parameters of a state-space model using Kalman filtering and calculates one-step-ahead forecasts.

Several classes contain methods for obtaining forecasts from an ARIMA model. Class KalmanFilter can be used to obtain one-step ahead forecasts for a state-space model.

ARIMA Model (Autoregressive Integrated Moving Average)

A small, yet comprehensive, class of stationary time-series models consists of the nonseasonal ARMA processes defined by

\(\phi \left( B \right)\left( {W_t - \mu } \right) = \theta \left( B \right)A_t , \,\,\,\,\,\,\,\, t \in Z\)

where \(Z = \{\dots, -2, -1, 0, 1, 2, \dots\}\) denotes the set of integers, B is the backward shift operator defined by \(B^k W_t = W_{t-k}\), \(\mu\) is the mean of \(W_t\), and the following equations are true:

\( \phi(B) = 1 - \phi_1 B - \phi_2B^2 - \dots - \phi_pB^p, p \geq 0\)

\(\theta(B) = 1 -\theta_1 B- \theta_2B^2 - \dots - \theta_qB^q, q \geq 0 \)

The model is of order (p, q) and is referred to as an ARMA (p, q) model.

An equivalent version of the ARMA (p, q) model is given by

\(\phi(B) W_t = \theta_0 + \theta(B)A_t, \,\,\,\,\,\, t \in Z\)

where \(\theta_0\) is an overall constant defined by the following:

\(\theta _0 = \mu \left( {1 - \sum\limits_{i = 1}^p {\phi _i } } \right) \)

See Box and Jenkins (1976, pp. 92-93) for a discussion of the meaning and usefulness of the overall constant.

If the "raw" data, \(\left\{ {Z_t } \right\}\), are homogeneous and nonstationary, then differencing using the Difference class induces stationarity, and the model is called ARIMA (AutoRegressive Integrated Moving Average). Parameter estimation is performed on the stationary time series \(W_t = \Delta^dZ_t\) , where \(\Delta^d = (1 - B)^d\) is the backward difference operator with period 1 and order \(d, {\rm d} \gt 0\). For the ARIMA model, see AutoARIMA class for parameter estimation and model selection.

There are two classes for estimating the parameters in an ARMA model and preparing forecasts: ARMA and ARMAMaxLikelihood. Class ARMA estimates model parameters using either the method of moments or least squares. The method of moments is selected by default or by setting property Method to METHOD_OF_MOMENTS. Method-of-moments estimates are good initial values for obtaining the least-squares estimates by setting property Method to LEAST_SQUARES. Other initial estimates provided by the user can be used. The least-squares procedure can be used to compute conditional or unconditional least-squares estimates of the parameters, depending on the choice of the backcasting length. The parameter estimates from either the method of moments or least-squares procedures can be used in the forecast method. The methods for preliminary parameter estimation, least-squares parameter estimation, and forecasting follow the approach of Box and Jenkins (1976, Programs 2-4, pp. 498-509).

Class ARMAMaxLikelihood estimates model parameters using maximum likelihood. Users can tailor the maximum likelihood algorithm by setting tolerances and providing initial estimates for the parameters. By default, ARMAMaxLikelihood uses method of moment estimates to initialize the parameters. However, this can be overridden with the ARMAMaxLikelihood.setAR(double[]) method.

Exponential Smoothing Methods

Exponential smoothing approximates the value of a time series at time t with a weighted average of previous values, with weights defined in such a way that they decay exponentially over time. The weights can be determined by a smoothing parameter \(\alpha\) and the relation,

$$\hat y_t = \alpha\, y_{t-1} + (1 - \alpha)\, \hat y_{t-1} \;\;\Rightarrow\;\; \hat y_t = \sum\limits_{j = 0}^{t - 1} \alpha (1 - \alpha)^j\, y_{t-j-1} = \sum\limits_{j = 0}^{t - 1} w_j\, y_{t-j-1}$$

The parameter \(\alpha\) is on the interval (0,1) and controls the rate of decay. For values close to 1, the effect decays rapidly, whereas for values close to 0, the influence of a past value persists for some time. Exponential smoothing as a forecasting procedure is widely used and largely effective for short term, mean level forecasting. With the addition of a term for linear trend and terms or factors for seasonal patterns, exponential smoothing is an intuitive procedure for many series that arise in business applications. Class HoltWintersExponentialSmoothing performs the Holt-Winters method, otherwise known as triple exponential smoothing, and allows for either an additive or a multiplicative seasonal effect.
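
The following plain-Java fragment illustrates the basic recursion for simple exponential smoothing; HoltWintersExponentialSmoothing implements the full Holt-Winters method with trend and seasonal terms.

// Plain-Java sketch of the smoothing recursion above: the smoothed value
// (one-step forecast) at time t is a weighted average of the previous
// observation and the previous smoothed value, so the weights on past
// observations decay geometrically.
public class ExponentialSmoothingSketch {
    static double[] smooth(double[] y, double alpha) {
        double[] yhat = new double[y.length];
        yhat[0] = y[0];                     // arbitrary starting value for the recursion
        for (int t = 1; t < y.length; t++) {
            yhat[t] = alpha * y[t - 1] + (1.0 - alpha) * yhat[t - 1];  // weighted average
        }
        return yhat;
    }
}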


Copyright © 2020 Rogue Wave Software. All rights reserved.