Performs partial least squares regression for one or more response variables and one or more predictor variables.
Required Arguments
Y — Array of size ny by h containing the values of the responses, where ny≥NOBS is the number of rows of Y and h is the number of response variables. (Input)
X — Array of size nx by p containing the values of the predictor variables, where nx≥NOBS is the number of rows of X and p is the number of predictor variables. (Input)
COEF — Array of size SIZE(IXIND) by SIZE(IYIND) containing the final PLS regression coefficient estimates. (Output)
Optional Arguments
NOBS — Positive integer specifying the number of observations to be used in the analysis. (Input) Default: NOBS = min(size(Y,1), size(X,1)).
IYIND — Array containing column indices of Y specifying which response variables to use in the analysis. MAXVAL(IYIND)≤h. (Input) Default: IYIND = 1, 2, …, h.
IXIND — Array containing column indices of X specifying which predictor variables to use in the analysis. MAXVAL(IXIND)≤p. (Input) Default: IXIND = 1, 2,…, p.
NCOMPS — The number of PLS components to fit. NCOMPS ≤ SIZE(IXIND). (Input) Default: NCOMPS = SIZE(IXIND).
Note: If CV = .TRUE., models with 1 up to NCOMPS components are tested using cross-validation. The model with the lowest predicted residual sum of squares is reported.
CV — Logical. If .TRUE., the routine performs K-fold cross validation to select the number of components. If .FALSE., the routine fits only the model specified by NCOMPS. (Input) Default: CV = .TRUE.
K — Integer specifying the number of folds to use in K-fold cross validation. K must be between 2 and NOBS, inclusive. K is ignored if CV = .FALSE. (Input) Default: K = 5.
Note: If NOBS/K ≤ 3, the routine performs leave-one-out cross validation instead of K-fold cross validation.
SCALE — Logical. If .TRUE., Y and X are centered and scaled to have mean 0 and standard deviation 1. If .FALSE., Y and X are centered only. (Input) Default: SCALE = .FALSE.
YHAT — Array of size NOBS by h containing the predicted values for the response variables using the final values of the coefficients. (Output)
RESIDS — Array of size NOBS by h containing residuals of the final fit for each response variable. (Output)
SE — Array of size p by h containing the standard errors of the PLS coefficients. (Output)
PRESS — Array of size NCOMPS by h providing the predicted residual error sum of squares obtained by cross-validation for each model of size j = 1, …, NCOMPS components. The argument PRESS is ignored if CV = .FALSE. (Output)
XSCRS — Array of size NOBS by NCOMPS containing X-scores. (Output)
YSCRS — Array of size NOBS by NCOMPS containing Y-scores. (Output)
XLDGS — Array of size p by NCOMPS containing X-loadings. (Output)
YLDGS — Array of size h by NCOMPS containing Y-loadings. (Output)
WTS — Array of size p by NCOMPS containing the weight vectors. (Output)
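The effect of the SCALE option on a single data column can be sketched in a few lines. This is an illustrative Python sketch, not the library's code: the helper name center_and_scale is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the argument description does not state which divisor PLSR uses.

```python
from statistics import mean, stdev

def center_and_scale(column, scale=False):
    """Center a column; if scale is True, also divide by its standard deviation.

    Assumption: the sample standard deviation (n - 1 divisor) is used.
    """
    m = mean(column)
    s = stdev(column) if scale else 1.0
    return [(v - m) / s for v in column]

# SCALE = .FALSE. analogue: centering only.
centered = center_and_scale([2.0, 4.0, 6.0])
# SCALE = .TRUE. analogue: centering, then scaling to unit standard deviation.
standardized = center_and_scale([2.0, 4.0, 6.0], scale=True)
```

With scaling enabled, predictors measured in very different units contribute on a comparable footing, which is usually the reason for requesting it.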
FORTRAN
Generic: CALL PLSR (Y, X, COEF [, …])
Specific: The specific interface names are S_PLSR and D_PLSR.
Description
Routine PLSR performs partial least squares regression for a response matrix Y and a set of p explanatory variables, X. PLSR finds linear combinations of the predictor variables that have highest covariance with Y. In so doing, PLSR produces a predictive model for Y using components (linear combinations) of the individual predictors. Other names for these linear combinations are scores, factors, or latent variables. Partial least squares regression is an alternative method to ordinary least squares for problems with many, highly collinear predictor variables. For further discussion see, for example, Abdi (2010), and Frank and Friedman (1993).
In Partial Least Squares (PLS), a score, or component matrix, T, is selected to represent both X and Y as in,
$$X = TP^T + E_x$$
and
$$Y = TQ^T + E_y.$$
The matrices P and Q are the least squares solutions of X and Y regressed on T.
That is,
$$Q^T = (T^T T)^{-1} T^T Y$$
and
$$P^T = (T^T T)^{-1} T^T X.$$
The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in $Y = UQ^T + G$ are the corresponding Y-scores, where G is a residual matrix and Q, as defined above, contains the Y-loadings.
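For a single component, the least squares formulas for the loadings reduce to scalar arithmetic, because the product of the score vector with itself is just a number. The following Python sketch computes one X-loading and one Y-loading this way. The helper names and the data values are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def loading(t, column):
    """Least squares coefficient of `column` regressed on the score vector t."""
    return dot(t, column) / dot(t, t)

t = [-1.0, 0.0, 1.0]     # one centered score vector (made-up values)
x1 = [-2.0, 0.0, 2.0]    # a centered predictor column, exactly 2 * t
y = [-3.0, 1.5, 1.5]     # a centered response column

p1 = loading(t, x1)      # X-loading for this predictor
q = loading(t, y)        # Y-loading for this response

# The part of y not explained by the score; it is orthogonal to t.
residual = [yi - q * ti for yi, ti in zip(y, t)]
```

Because the residual is orthogonal to the score vector, adding further components that are orthogonal to t does not change the loadings already computed.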
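Restricting T to be linear in X, the problem is to find a set of weight vectors (columns of W) such that T = XW predicts both X and Y reasonably well.
Formally, $W = (w_1, \ldots, w_M)$, where each $w_j$ is a column vector of length p, M ≤ p is the number of components, and the m-th partial least squares (PLS) component $w_m$ solves:
$$w_m = \arg\max_{\alpha} \; \mathrm{Corr}^2(Y, X\alpha)\,\mathrm{Var}(X\alpha) \quad \text{subject to} \quad \|\alpha\| = 1, \quad \alpha^T S w_l = 0, \quad l = 1, \ldots, m-1,$$
where $S = X^T X$ and $\|\alpha\| = \sqrt{\alpha^T \alpha}$ is the Euclidean norm. For further details see Hastie et al. (2001), pages 80-82.
That is, $w_m$ is the vector which maximizes the product of the squared correlation between Y and Xα and the variance of Xα, subject to being orthogonal to each previous weight vector left multiplied by S. The PLS regression coefficients arise from substituting T = XW into $Y = TQ^T + E_y$:
$$\hat{\beta}_{PLS} = W Q^T = W (T^T T)^{-1} T^T Y.$$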
Algorithms to solve the above optimization problem include NIPALS (nonlinear iterative partial least squares) developed by Herman Wold (1966, 1985) and numerous variations, including the SIMPLS algorithm of de Jong (1993). Subroutine PLSR implements the SIMPLS method. SIMPLS is appealing because it finds a solution in terms of the original predictor variables, whereas NIPALS reduces the matrices at each step. For univariate Y it has been shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights matrices will be proportional between the two methods).
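To make the iteration concrete, here is a minimal Python sketch of a SIMPLS-style fit for a single response variable. It is written from the published description of the method, not from the PLSR source, so the function name, the relative tolerance tol, and the early-exit rule are all assumptions. The early exit mirrors the situation in which the residuals converge in fewer components than requested.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simpls_single_response(x_rows, y, ncomps, tol=1e-10):
    """Return (coef, intercept, components_used) for one response variable."""
    n, p = len(x_rows), len(x_rows[0])
    x_mean = [sum(row[j] for row in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[row[j] - x_mean[j] for j in range(p)] for row in x_rows]
    yc = [yi - y_mean for yi in y]
    x_cols = [[row[j] for row in xc] for j in range(p)]

    s = [dot(col, yc) for col in x_cols]       # cross-product X^T y
    s0 = math.sqrt(dot(s, s))
    coef = [0.0] * p
    basis = []                                  # orthonormal basis of the loadings
    used = 0
    for _ in range(ncomps):
        if math.sqrt(dot(s, s)) <= tol * s0:    # nothing left to explain (assumed rule)
            break
        t = [dot(row, s) for row in xc]         # unnormalized score X s
        norm_t = math.sqrt(dot(t, t))
        r = [sj / norm_t for sj in s]           # weight vector, scaled so |X r| = 1
        t = [ti / norm_t for ti in t]
        load = [dot(col, t) for col in x_cols]  # X-loading X^T t
        q = dot(yc, t)                          # Y-loading y^T t
        v = load[:]
        for b in basis:                         # orthogonalize against earlier loadings
            c = dot(b, load)
            v = [vj - c * bj for vj, bj in zip(v, b)]
        norm_v = math.sqrt(dot(v, v))
        v = [vj / norm_v for vj in v]
        basis.append(v)
        c = dot(v, s)                           # deflate the cross-product, not X
        s = [sj - c * vj for sj, vj in zip(s, v)]
        coef = [bj + q * rj for bj, rj in zip(coef, r)]
        used += 1
    intercept = y_mean - dot(x_mean, coef)
    return coef, intercept, used

# Made-up data in which y = 1 + 2*x1 - 3*x2 exactly.
x_demo = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y_demo = [-3.0, 2.0, -8.0, 0.0, -7.0]
coef_full, intercept_full, used_full = simpls_single_response(x_demo, y_demo, 2)
```

With as many components as predictors and an exactly linear response, the sketch recovers the generating coefficients up to rounding. With fewer components, the coefficient vector is restricted to the subspace spanned by the weight vectors found so far. Only the cross-product vector is deflated at each step; the centered predictor matrix is left unchanged.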
If CV = .TRUE., PLSR searches for the best number of PLS components using K-fold cross-validation. That is, for each M = 1, 2, …, p, PLSR estimates a PLS model with M components using all of the data except a hold-out set of size roughly equal to NOBS/K. Using the resulting model estimates, PLSR predicts the outcomes in the hold-out set and calculates the predicted residual sum of squares (PRESS). The procedure then selects the next hold-out sample and repeats, for a total of K times (that is, K folds). For further details see Hastie et al. (2001), pages 241-245.
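The hold-out procedure can be sketched generically in Python. This is a sketch under stated assumptions, not the routine's implementation: the function names are hypothetical, folds are assigned by interleaving indices (the text does not say how PLSR assigns observations to folds), NOBS/K is treated as a real-valued ratio when testing the leave-one-out rule, and a trivial stand-in model (the training mean) replaces the PLS fit so that the block stays self-contained.

```python
def cv_folds(nobs, k=5):
    """Split indices 0..nobs-1 into hold-out sets (interleaved assignment assumed)."""
    if not 2 <= k <= nobs:
        raise ValueError("k must be between 2 and nobs, inclusive")
    if nobs / k <= 3:          # too few observations per fold: leave-one-out
        k = nobs
    return [list(range(start, nobs, k)) for start in range(k)]

def cross_validated_press(x, y, fit_predict, k=5):
    """Accumulate squared hold-out prediction errors over all folds."""
    press = 0.0
    for hold_out in cv_folds(len(y), k):
        held = set(hold_out)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([x[i] for i in train],
                            [y[i] for i in train],
                            [x[i] for i in hold_out])
        press += sum((y[i] - p) ** 2 for i, p in zip(hold_out, preds))
    return press

def mean_model(train_x, train_y, test_x):
    """Stand-in model for illustration: predict the training mean everywhere."""
    m = sum(train_y) / len(train_y)
    return [m for _ in test_x]

# Four observations with k = 2 gives NOBS/K = 2, so leave-one-out is used.
press_demo = cross_validated_press([[0.0]] * 4, [1.0, 2.0, 3.0, 4.0], mean_model, k=2)
```

In an actual search over the number of components, the same accumulation would be repeated once per candidate M, with fit_predict replaced by a PLS fit using M components.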
For each response variable, PLSR returns results for the model with the lowest PRESS. The best model (the number of components giving the lowest PRESS) will generally differ from one response variable to another.
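The selection step is a per-response minimum over the PRESS values. The short Python sketch below applies it to the PRESS values printed in Example 1b of this section. The function name is hypothetical, and resolving ties in favor of the smaller number of components is an assumption (the printed values are rounded, so an exact tie in the underlying computation is not certain).

```python
def best_ncomps(press_by_ncomps):
    """Return the 1-based number of components with the lowest PRESS.

    Assumption: ties go to the model with fewer components.
    """
    j = min(range(len(press_by_ncomps)), key=press_by_ncomps.__getitem__)
    return j + 1

press_response_1 = [542903.8, 830049.8, 830049.8]   # from Example 1b
press_response_2 = [5079.6, 1263.4, 1263.4]         # from Example 1b

best_1 = best_ncomps(press_response_1)
best_2 = best_ncomps(press_response_2)
```

The two responses select different model sizes here, which matches the "best model" lines reported in the Example 1b output.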
When requested via the optional argument SE, PLSR calculates modified jackknife estimates of the standard errors as described in Martens and Martens (2000).
Comments
1. PLSR defaults to leave-one-out cross-validation when there are too few observations to form K folds in the data. The user is cautioned that in this case there may be too few observations to make strong inferences from the results.
2. Informational errors
Type   Code   Description
2      1      For response #, residuals converged in # components, while # is the requested number of components.
3. This implementation of PLSR does not handle missing values. The user should remove missing values and NaNs from the input data before calling PLSR.
Examples
Example 1
The following artificial data set is provided in de Jong (1993).
The first call to PLSR fixes the number of components at 3 for both response variables, and the second call sets cv = .true. in order to perform K-fold cross-validation. Note that because the number of folds is equal to n, PLSR performs leave-one-out (LOO) cross-validation.
Example 1a: no cross-validation, request 3 components.
PLS Coeff
          1        2
1       0.7     10.3
2      17.2    -29.0
3     398.5      5.0

Predicted Y
          1        2
1     430.0    -94.0
2    -436.0     12.0
3    -361.0    -22.0
4     367.0    104.0

Std. Errors
          1        2
1     131.5      5.1
2     263.0     10.3
3     526.0     20.5

*** ALERT ERROR 1 from s_plsr. For response 2, residuals converged in 2
*** components, while 3 is the requested number of components.
Example 1b: cross-validation
Cross-validated results for response 1:
Comp       PRESS
   1    542903.8
   2    830049.8
   3    830049.8
The best model has 1 component(s).

Cross-validated results for response 2:
Comp       PRESS
   1      5079.6
   2      1263.4
   3      1263.4
The best model has 2 component(s).

PLS Coeff
          1        2
1      15.9     12.7
2      49.2    -23.9
3     371.1      0.6

Predicted Y
          1        2
1     405.8    -97.8
2    -533.3     -3.5
3    -208.8      2.2
4     336.4     99.1

Std. Errors
          1        2
1     134.1      7.1
2     269.9      3.8
3     478.5     19.5

*** ALERT ERROR 1 from s_plsr. For response 2, residuals converged in 2
*** components, while 3 is the requested number of components.
Example 2
The data, as they appear in S. Wold et al. (2001), consist of a single response variable, the “free energy of the unfolding of a protein”, and 7 different, highly correlated predictor variables measured on 19 amino acids.