plsRegression¶
Performs partial least squares (PLS) regression for one or more response variables and one or more predictor variables.
Synopsis¶
plsRegression (y, x)
Required Arguments¶
- float y[[]] (Input) - Array of length ny × h containing the values of the responses.
- float x[[]] (Input) - Array of length nx × p containing the values of the predictor variables.
Return Value¶
An array of length ix × iy containing the final PLS regression
coefficient estimates, where ix ≤ p is the number of predictor
variables in the model, and iy ≤ h is the number of response
variables. If the estimates cannot be computed, None is returned.
Optional Arguments¶
nObservations, int (Input)
Positive integer specifying the number of observations to be used in the analysis.
Default: nObservations = min(ny, nx).

yIndices, float iyind[] (Input)
Argument iyind is an array of length iy containing column indices of y specifying which response variables to use in the analysis. Each element in iyind must be less than or equal to h - 1.
Default: yIndices = 0, 1, …, h - 1.

xIndices, float ixind[] (Input)
Argument ixind is an array of length ix containing column indices of x specifying which predictor variables to use in the analysis. Each element in ixind must be less than or equal to p - 1.
Default: xIndices = 0, 1, …, p - 1.

nComponents, int (Input)
The number of PLS components to fit. nComponents ≤ ix.
Default: nComponents = ix.
If crossValidation = 1 is used, models with 1 up to nComponents components are tested using cross-validation. The model with the lowest predicted residual sum of squares is reported.

crossValidation, int (Input)
If crossValidation = 0, the function fits only the model specified by nComponents. If crossValidation = 1, the function performs K-fold cross-validation to select the number of components.
Default: crossValidation = 1.

nFold, int (Input)
The number of folds to use in K-fold cross-validation. nFold must be between 1 and nObservations, inclusive. nFold is ignored if crossValidation = 0 is used.
Default: nFold = 5.
If nObservations/nFold ≤ 3, the routine performs leave-one-out cross-validation as opposed to K-fold cross-validation.
scale, int (Input)
If scale = 1, y and x are centered and scaled to have mean 0 and standard deviation of 1. If scale = 0, y and x are centered to have mean 0 but are not scaled.
Default: scale = 0.

printLevel, int (Input)
Printing option.

printLevel   Action
0            No printing.
1            Prints final results only.
2            Prints intermediate and final results.

Default: printLevel = 0.

predicted (Output)
Argument predicted is an array of length nObservations × iy containing the predicted values for the response variables using the final values of the coefficients.

residuals (Output)
Argument residuals is an array of length nObservations × iy containing the residuals of the final fit for each response variable.

stdErrors (Output)
Argument stdErrors is an array of length ix × iy containing the standard errors of the PLS coefficients.

press (Output)
Argument press is an array of length nComponents × iy containing the predicted residual error sum of squares obtained by cross-validation for each model of size \(j = 1\), …, nComponents components. The argument press is ignored if crossValidation = 0 is used.

xScores (Output)
Argument xScores is an array of length nObservations × nComponents containing the X-scores.

yScores (Output)
Argument yScores is an array of length nObservations × nComponents containing the Y-scores.

xLoadings (Output)
Argument xLoadings is an array of length ix × nComponents containing the X-loadings.

yLoadings (Output)
Argument yLoadings is an array of length iy × nComponents containing the Y-loadings.

weights (Output)
Argument weights is an array of length ix × nComponents containing the weight vectors.
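As a quick illustration of the calling convention for the output-only optional arguments, they are passed as empty lists and populated in place by the call, as the sketch below shows. The data are small illustrative arrays (the same values used later in Example 1), and the particular outputs requested here are chosen arbitrarily.

from pyimsl.stat.plsRegression import plsRegression

# Illustrative data: 4 observations, 3 predictors, 2 responses.
x = [[-4.0, 2.0, 1.0],
     [-4.0, -2.0, -1.0],
     [4.0, 2.0, -1.0],
     [4.0, -2.0, 1.0]]
y = [[430.0, -94.0],
     [-436.0, 12.0],
     [-361.0, -22.0],
     [367.0, 104.0]]

# Output-only optional arguments are passed as empty lists and are
# filled in place by the call.
predicted = []
xScores = []
xLoadings = []
weights = []

coef = plsRegression(y, x,
                     nComponents=3,
                     crossValidation=0,
                     predicted=predicted,
                     xScores=xScores,
                     xLoadings=xLoadings,
                     weights=weights)
# coef has shape ix x iy; xScores has shape nObservations x nComponents.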
Description¶
Function plsRegression performs partial least squares regression for a
response matrix \(Y(n_y\times h)\) and a set of p explanatory
variables, \(X(n_x\times p)\). plsRegression finds linear
combinations of the predictor variables that have highest covariance with
Y. In so doing, plsRegression produces a predictive model for Y
using components (linear combinations) of the individual predictors. Other
names for these linear combinations are scores, factors, or latent variables.
Partial least squares regression is an alternative method to ordinary least
squares for problems with many, highly collinear predictor variables. For
further discussion see, for example, Abdi (2009), and Frank and Friedman
(1993).
In Partial Least Squares (PLS), a score, or component matrix, T, is selected to represent both X and Y, as in

\[X = TP^T + E\]

and

\[Y = TQ^T + F.\]

The matrices P and Q are the least squares solutions of X and Y regressed on T. That is,

\[P^T = \left(T^T T\right)^{-1} T^T X\]

and

\[Q^T = \left(T^T T\right)^{-1} T^T Y.\]
The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in \(Y=UQ^T+G\) are the corresponding Y scores, where G is a residual matrix and Q, as defined above, contains the Y-loadings.
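For a concrete view of those least squares solutions, the loadings can be computed directly from a given score matrix. The following NumPy sketch uses arbitrary stand-in arrays (not output from plsRegression) purely to illustrate the formulas above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))    # 10 observations, 4 predictors
Y = rng.standard_normal((10, 2))    # 2 responses
T = rng.standard_normal((10, 3))    # stand-in score matrix, 3 components

# P^T = (T^T T)^{-1} T^T X  and  Q^T = (T^T T)^{-1} T^T Y
PT = np.linalg.solve(T.T @ T, T.T @ X)
QT = np.linalg.solve(T.T @ T, T.T @ Y)

E = X - T @ PT    # residual matrix for X
F = Y - T @ QT    # residual matrix for Y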
Restricting T to be linear in X, the problem is to find a set of weight vectors (columns of W) such that \(T=XW\) predicts both X and Y reasonably well.
Formally, \(W=\left[w_1,\ldots,w_{m-1},w_m,\ldots,w_M\right]\), where each \(w_j\) is a column vector of length p, \(M\leq p\) is the number of components, and the m-th partial least squares (PLS) component \(w_m\) solves

\[\hat{w}_m = \underset{\alpha}{\arg\max}\; \operatorname{Corr}^2(Y,X\alpha)\operatorname{Var}(X\alpha)\]
\[\text{subject to } \|\alpha\|=1,\quad \alpha^T S \hat{w}_l = 0,\quad l=1,\ldots,m-1,\]

where \(S=X^T X\) and \(\| \alpha\|=\sqrt{\alpha^T \alpha}\) is the Euclidean norm. For further details see Hastie et al. (2001), pages 80-82.
That is, \(w_m\) is the vector which maximizes the product of the squared correlation between Y and \(X\alpha\) and the variance of \(X\alpha\), subject to being orthogonal to each previous weight vector left multiplied by S. The PLS regression coefficients \(\hat{\beta}_{\mathit{PLS}}\) arise from

\[\hat{Y} = TQ^T = XWQ^T, \quad \text{so that} \quad \hat{\beta}_{\mathit{PLS}} = WQ^T.\]
Algorithms to solve the above optimization problem include NIPALS (nonlinear
iterative partial least squares) developed by Herman Wold (1966, 1985) and
numerous variations, including the SIMPLS algorithm of de Jong (1993).
plsRegression implements the SIMPLS method. SIMPLS is appealing because
it finds a solution in terms of the original predictor variables, whereas
NIPALS reduces the matrices at each step. For univariate Y it has been
shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights
matrices will be proportional between the two methods).
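To make the structure of SIMPLS concrete, here is a rough NumPy sketch of the iteration: center the data, extract the dominant singular direction of the current cross-product matrix, form and normalize the score, record the loadings, then deflate the cross-product matrix so later scores are orthogonal to earlier ones. This is only an illustration of that outline, not the plsRegression implementation, and it omits optional scaling, cross-validation, and standard errors.

import numpy as np

def simpls_sketch(X, Y, n_components):
    # Illustrative SIMPLS outline (not the plsRegression implementation).
    X0 = np.asarray(X, dtype=float)
    Y0 = np.asarray(Y, dtype=float)
    X0 = X0 - X0.mean(axis=0)            # center predictors
    Y0 = Y0 - Y0.mean(axis=0)            # center responses
    p, h = X0.shape[1], Y0.shape[1]
    R = np.zeros((p, n_components))      # weight vectors, T = X0 @ R
    P = np.zeros((p, n_components))      # X-loadings
    Q = np.zeros((h, n_components))      # Y-loadings
    V = np.zeros((p, n_components))      # orthonormal deflation directions
    S = X0.T @ Y0                        # cross-product matrix
    for m in range(n_components):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                      # dominant direction of S
        t = X0 @ r                       # X-score for this component
        norm_t = np.linalg.norm(t)
        t, r = t / norm_t, r / norm_t    # scale the score to unit length
        p_m = X0.T @ t                   # X-loading
        q_m = Y0.T @ t                   # Y-loading
        v = p_m - V[:, :m] @ (V[:, :m].T @ p_m)   # orthogonalize vs. earlier
        v = v / np.linalg.norm(v)
        S = S - np.outer(v, v @ S)       # deflate the cross-product matrix
        R[:, m], P[:, m], Q[:, m], V[:, m] = r, p_m, q_m, v
    B = R @ Q.T                          # coefficients: Yhat0 = X0 @ B
    return R, P, Q, B

Because the deflation acts on the cross-product matrix rather than on X itself, the weights in R apply to the original (centered) predictors, which is the property of SIMPLS noted above.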
By default, plsRegression searches for the best number of PLS components
using K-fold cross-validation. That is, for each \(M=1,2,\ldots,p\),
plsRegression estimates a PLS model with M components using all of the
data except a hold-out set of size roughly equal to nObservations/k.
Using the resulting model estimates, plsRegression predicts the outcomes
in the hold-out set and calculates the predicted residual sum of squares
(PRESS). The procedure then selects the next hold-out sample and repeats for
a total of K times (i.e., folds). For further details see Hastie et al. (2001), pages 241-245.
For each response variable, plsRegression returns results for the model
with lowest PRESS. The best model (the number of components giving lowest
PRESS), generally will be different for different response variables.
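The selection procedure can be pictured with a short sketch. The helper below is hypothetical (it builds on the simpls_sketch function above, itself only illustrative) and accumulates the squared hold-out prediction errors for each candidate number of components; plsRegression carries out the equivalent selection internally when crossValidation = 1.

import numpy as np

def press_by_components(X, Y, max_components, fit, n_folds=5):
    # Illustrative K-fold PRESS accumulation; `fit` is any PLS fitter that
    # returns coefficients B on centered data (e.g. simpls_sketch above).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(0).permutation(n), n_folds)
    press = np.zeros((max_components, Y.shape[1]))
    for hold in folds:
        train = np.setdiff1d(np.arange(n), hold)
        x_mean, y_mean = X[train].mean(axis=0), Y[train].mean(axis=0)
        for m in range(1, max_components + 1):
            _, _, _, B = fit(X[train], Y[train], m)
            yhat = (X[hold] - x_mean) @ B + y_mean
            press[m - 1] += ((Y[hold] - yhat) ** 2).sum(axis=0)
    return press  # smallest entry in each column marks the best model size

# e.g. press = press_by_components(x, y, max_components=3, fit=simpls_sketch, n_folds=4)
#      best = press.argmin(axis=0) + 1   # best number of components per response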
When requested via the optional argument, stdErrors, plsRegression
calculates modified jackknife estimates of the standard errors as described
in Martens and Martens (2000).
Comments¶
- plsRegression defaults to leave-one-out cross-validation when there are too few observations to form K folds in the data. The user is cautioned that there may be too few observations to make strong inferences from the results.
- This implementation of plsRegression does not handle missing values. The user should remove missing values or NaN’s from the input data.
Examples¶
Example 1¶
The following artificial data set is provided in de Jong (1993).
The first call to plsRegression fixes the number of components to 3 for
both response variables, and the second call performs K-fold cross-validation.
Note that because n is small, plsRegression performs leave-one-out (LOO) cross-validation.
from __future__ import print_function
from numpy import *
from pyimsl.stat.plsRegression import plsRegression
from pyimsl.stat.writeMatrix import writeMatrix
from pyimsl.stat.errorOptions import errorOptions
ncomps = 3
x = [[-4.0, 2.0, 1.0],
[-4.0, -2.0, -1.0],
[4.0, 2.0, -1.0],
[4.0, -2.0, 1.0]]
y = [[430.0, -94.0],
[-436.0, 12.0],
[-361.0, -22.0],
[367.0, 104.0]]
yhat = []
se = []
yhat2 = []
se2 = []
# Print out informational error.
errorPrintSetting = {"type": 2, "setting": 1}
errorOptions(setPrint=errorPrintSetting)
print("Example 1a: no cross-validation, request %d components." % ncomps)
coef = plsRegression(y, x,
                     nComponents=ncomps,
                     crossValidation=0,
                     printLevel=1,
                     predicted=yhat,
                     stdErrors=se)
print("\nExample 1b: cross-validation")
coef2 = plsRegression(y, x,
                      nFold=4,
                      printLevel=1,
                      predicted=yhat2,
                      stdErrors=se2)
Output¶
Example 1a: no cross-validation, request 3 components.
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***
Example 1b: cross-validation
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***
PLS Coeff
1 2
1 0.8 10.2
2 17.2 -29.0
3 398.5 5.0
Predicted Y
1 2
1 430 -94
2 -436 12
3 -361 -22
4 367 104
Std. Errors
1 2
1 131.5 5.1
2 263.0 10.3
3 526.0 20.5
Cross-validated results for response 1:
Comp PRESS
1 3860649
2 5902575
3 5902575
The best model has 1 component(s).
Cross-validated results for response 2:
Comp PRESS
1 36121
2 8984
3 8984
The best model has 2 component(s).
PLS Coeff
1 2
1 15.9 12.7
2 49.2 -23.9
3 371.1 0.6
Predicted Y
1 2
1 405.8 -97.8
2 -533.3 -3.5
3 -208.8 2.2
4 336.4 99.1
Std. Errors
1 2
1 134.1 7.1
2 269.9 3.8
3 478.5 19.5
Example 2¶
The data, which appear in S. Wold et al. (2001), consist of a single response variable, the “free energy of the unfolding of a protein”, while the predictor variables are 7 different, highly correlated measurements taken on 19 amino acids.
from __future__ import print_function
from numpy import *
from pyimsl.stat.plsRegression import plsRegression
from pyimsl.stat.writeMatrix import writeMatrix
ncomps = 7
x = [[0.23, 0.31, -0.55, 254.2, 2.126, -0.02, 82.2],
[-0.48, -0.6, 0.51, 303.6, 2.994, -1.24, 112.3],
[-0.61, -0.77, 1.2, 287.9, 2.994, -1.08, 103.7],
[0.45, 1.54, -1.4, 282.9, 2.933, -0.11, 99.1],
[-0.11, -0.22, 0.29, 335.0, 3.458, -1.19, 127.5],
[-0.51, -0.64, 0.76, 311.6, 3.243, -1.43, 120.5],
[0.0, 0.0, 0.0, 224.9, 1.662, 0.03, 65.0],
[0.15, 0.13, -0.25, 337.2, 3.856, -1.06, 140.6],
[1.2, 1.8, -2.1, 322.6, 3.35, 0.04, 131.7],
[1.28, 1.7, -2.0, 324.0, 3.518, 0.12, 131.5],
[-0.77, -0.99, 0.78, 336.6, 2.933, -2.26, 144.3],
[0.9, 1.23, -1.6, 336.3, 3.86, -0.33, 132.3],
[1.56, 1.79, -2.6, 366.1, 4.638, -0.05, 155.8],
[0.38, 0.49, -1.5, 288.5, 2.876, -0.31, 106.7],
[0.0, -0.04, 0.09, 266.7, 2.279, -0.4, 88.5],
[0.17, 0.26, -0.58, 283.9, 2.743, -0.53, 105.3],
[1.85, 2.25, -2.7, 401.8, 5.755, -0.31, 185.9],
[0.89, 0.96, -1.7, 377.8, 4.791, -0.84, 162.7],
[0.71, 1.22, -1.6, 295.1, 3.054, -0.13, 115.6]]
y = [8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1, 16.8,
15.0, 7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0]
yhat = []
se = []
yhat2 = []
se2 = []
print("Example 2a: no cross-validation, request %d components." % ncomps)
coef = plsRegression(y, x,
                     nComponents=ncomps,
                     crossValidation=0,
                     scale=1,
                     printLevel=2,
                     predicted=yhat,
                     stdErrors=se)
print("\nExample 2b: cross-validation")
coef2 = plsRegression(y, x,
                      scale=1,
                      printLevel=2,
                      predicted=yhat2,
                      stdErrors=se2)
Output¶
Example 2a: no cross-validation, request 7 components.
Example 2b: cross-validation
Standard PLS Coefficients
1
-5.458
1.668
0.625
1.431
-2.550
4.861
4.857
PLS Coeff
1
-20.03
4.63
1.42
0.09
-7.27
20.90
0.46
Predicted Y
1
9.38
7.30
8.09
12.02
8.79
6.76
7.24
10.44
15.79
14.35
8.41
9.95
11.52
8.64
8.23
8.40
11.12
8.97
12.39
Std. Errors
1
3.566
2.422
0.808
3.207
1.638
3.310
3.525
Corrected Std. Errors
1
13.09
6.72
1.83
0.20
4.67
14.23
0.33
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
2 42.8
3 58.3
4 65.1
5 68.2
6 75.0
7 75.5
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
2 96.3
3 97.4
4 97.9
5 98.2
6 98.3
7 98.4
Cross-validated results for response 1:
Comp PRESS
1 167.5
2 162.9
3 166.5
4 168.9
5 264.6
6 221.0
7 185.0
The best model has 2 component(s).
Standard PLS Coefficients
1
0.1598
0.2163
-0.1673
0.0095
-0.0136
0.1649
0.0294
PLS Coeff
1
0.5867
0.6000
-0.3797
0.0006
-0.0388
0.7089
0.0028
Predicted Y
1
9.86
7.71
7.35
11.02
8.32
7.46
9.32
9.00
12.09
12.09
6.59
11.11
12.46
10.27
9.02
9.51
12.82
10.69
11.09
Std. Errors
1
0.0603
0.1508
0.0570
0.0574
0.0751
0.0703
0.0927
Corrected Std. Errors
1
0.2212
0.4183
0.1292
0.0036
0.2140
0.3021
0.0087
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
2 42.8
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
2 96.3
Alert Errors¶
IMSLS_RESIDUAL_CONVERGED
For response #, residuals converged in # components, while # is the requested number of components.