plsRegression¶
Performs partial least squares (PLS) regression for one or more response variables and one or more predictor variables.
Synopsis¶
plsRegression (y, x)
Required Arguments¶
- float y[[]] (Input)
  Array of length ny × h containing the values of the responses.
- float x[[]] (Input)
  Array of length nx × p containing the values of the predictor variables.
Return Value¶
An array of length ix × iy containing the final PLS regression coefficient estimates, where ix ≤ p is the number of predictor variables in the model and iy ≤ h is the number of response variables. If the estimates cannot be computed, None is returned.
Optional Arguments¶
nObservations, int (Input)
    Positive integer specifying the number of observations to be used in the analysis.
    Default: nObservations = min(ny, nx).
yIndices, int iyind[] (Input)
    Argument iyind is an array of length iy containing column indices of y specifying which response variables to use in the analysis. Each element in iyind must be less than or equal to h - 1.
    Default: yIndices = 0, 1, …, h - 1.
xIndices, int ixind[] (Input)
    Argument ixind is an array of length ix containing column indices of x specifying which predictor variables to use in the analysis. Each element in ixind must be less than or equal to p - 1.
    Default: xIndices = 0, 1, …, p - 1.
nComponents, int (Input)
    The number of PLS components to fit. nComponents ≤ ix.
    Default: nComponents = ix.
    If crossValidation = 1 is used, models with 1 up to nComponents components are tested using cross-validation. The model with the lowest predicted residual sum of squares is reported.
crossValidation, int (Input)
    If crossValidation = 0, the function fits only the model specified by nComponents. If crossValidation = 1, the function performs K-fold cross validation to select the number of components.
    Default: crossValidation = 1.
nFold, int (Input)
    The number of folds to use in K-fold cross validation. nFold must be between 1 and nObservations, inclusive. nFold is ignored if crossValidation = 0 is used.
    Default: nFold = 5.
    If nObservations/nFold ≤ 3, the routine performs leave-one-out cross validation as opposed to K-fold cross validation.
scale, int (Input)
    If scale = 1, y and x are centered and scaled to have mean 0 and standard deviation 1. If scale = 0, y and x are centered to have mean 0 but are not scaled.
    Default: scale = 0.
printLevel, int (Input)
    Printing option.

    printLevel   Action
    0            No printing.
    1            Prints final results only.
    2            Prints intermediate and final results.

    Default: printLevel = 0.
predicted (Output)
    Argument predicted is an array of length nObservations × iy containing the predicted values for the response variables using the final values of the coefficients.
residuals (Output)
    Argument residuals is an array of length nObservations × iy containing the residuals of the final fit for each response variable.
stdErrors (Output)
    Argument stdErrors is an array of length ix × iy containing the standard errors of the PLS coefficients.
press (Output)
    Argument press is an array of length nComponents × iy containing the predicted residual error sum of squares obtained by cross-validation for each model of size \(j = 1, \ldots,\) nComponents components. The argument press is ignored if crossValidation = 0 is used.
xScores (Output)
    Argument xScores is an array of length nObservations × nComponents containing the X-scores.
yScores (Output)
    Argument yScores is an array of length nObservations × nComponents containing the Y-scores.
xLoadings (Output)
    Argument xLoadings is an array of length ix × nComponents containing the X-loadings.
yLoadings (Output)
    Argument yLoadings is an array of length iy × nComponents containing the Y-loadings.
weights (Output)
    Argument weights is an array of length ix × nComponents containing the weight vectors.
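As a brief illustration of how the optional arguments combine, the following hypothetical call (made-up data and settings, chosen only for demonstration) fits a fixed two-component model to a subset of the predictor and response columns:

from pyimsl.stat.plsRegression import plsRegression

# Made-up data: 4 observations, 3 predictors, 2 responses.
x = [[1.0, 2.0, 3.0],
     [2.0, 1.0, 4.0],
     [3.0, 4.0, 1.0],
     [4.0, 3.0, 2.0]]
y = [[1.0, 0.5],
     [2.0, 1.5],
     [3.0, 2.5],
     [4.0, 3.5]]

predicted = []  # on return, nObservations x iy predicted values
stdErrors = []  # on return, ix x iy standard errors

# Fit a fixed 2-component model using predictor columns 0 and 2 and
# response column 1 only; no cross-validation.
coef = plsRegression(y, x,
                     xIndices=[0, 2],
                     yIndices=[1],
                     nComponents=2,
                     crossValidation=0,
                     predicted=predicted,
                     stdErrors=stdErrors)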
Description¶
Function plsRegression
performs partial least squares regression for a
response matrix \(Y(n_y\times h)\) and a set of p explanatory
variables, \(X(n_x\times p)\). plsRegression
finds linear
combinations of the predictor variables that have highest covariance with
Y. In so doing, plsRegression
produces a predictive model for Y
using components (linear combinations) of the individual predictors. Other
names for these linear combinations are scores, factors, or latent variables.
Partial least squares regression is an alternative method to ordinary least
squares for problems with many, highly collinear predictor variables. For
further discussion see, for example, Abdi (2009), and Frank and Friedman
(1993).
In Partial Least Squares (PLS), a score, or component, matrix T is selected to represent both X and Y, as in

\[X = TP^T + E\]

and

\[Y = TQ^T + F,\]

where E and F are matrices of residuals. The matrices P and Q are the least squares solutions of X and Y regressed on T. That is,

\[P^T = \left(T^T T\right)^{-1} T^T X\]

and

\[Q^T = \left(T^T T\right)^{-1} T^T Y.\]

The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in \(Y=UQ^T+G\) are the corresponding Y-scores, where G is a residual matrix and Q, as defined above, contains the Y-loadings.
Restricting T to be linear in X, the problem is to find a set of weight vectors (columns of W) such that \(T=XW\) predicts both X and Y reasonably well.
Formally, \(W=\left[w_1,\ldots,w_{m-1},w_m,\ldots,w_M\right]\), where each \(w_j\) is a column vector of length p, \(M\leq p\) is the number of components, and where the m-th partial least squares (PLS) component \(w_m\) solves:

\[w_m = \operatorname*{arg\,max}_{\alpha}\ \operatorname{Corr}^2(Y,X\alpha)\operatorname{Var}(X\alpha)\]

subject to

\[\|\alpha\| = 1, \quad \alpha^T S w_l = 0, \quad l = 1,\ldots,m-1,\]

where \(S=X^T X\) and \(\|\alpha\|=\sqrt{\alpha^T \alpha}\) is the Euclidean norm. For further details see Hastie et al. (2001), pages 80-82.

That is, \(w_m\) is the vector which maximizes the product of the squared correlation between Y and Xα and the variance of Xα, subject to being orthogonal to each previous weight vector left multiplied by S. The PLS regression coefficients \(\hat{\beta}_{\mathit{PLS}}\) arise from substituting \(T=XW\) into the regression of Y on T, so that \(\hat{Y}=TQ^T=XWQ^T\) and

\[\hat{\beta}_{\mathit{PLS}} = WQ^T.\]
Algorithms to solve the above optimization problem include NIPALS (nonlinear
iterative partial least squares) developed by Herman Wold (1966, 1985) and
numerous variations, including the SIMPLS algorithm of de Jong (1993).
plsRegression
implements the SIMPLS method. SIMPLS is appealing because
it finds a solution in terms of the original predictor variables, whereas
NIPALS reduces the matrices at each step. For univariate Y it has been
shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights
matrices will be proportional between the two methods).
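The following is a minimal NumPy sketch of the SIMPLS iteration for centered data, in the spirit of de Jong (1993); it is illustrative only and is not the pyimsl implementation. The names R (the weights applied to the original X) and V (an orthonormal basis used to deflate the cross-product matrix) are local to this sketch.

import numpy as np

def simpls(X, Y, M):
    # Illustrative SIMPLS sketch for centered X (n x p) and Y (n x h).
    n, p = X.shape
    h = Y.shape[1]
    S = X.T @ Y                        # p x h cross-product matrix
    R = np.zeros((p, M))               # weights: T = X @ R
    T = np.zeros((n, M))               # X-scores
    P = np.zeros((p, M))               # X-loadings
    Q = np.zeros((h, M))               # Y-loadings
    V = np.zeros((p, M))               # orthonormal basis for deflation
    for m in range(M):
        # weight vector: dominant left singular vector of current S
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]
        t = X @ r                      # score for component m
        nt = np.linalg.norm(t)
        t, r = t / nt, r / nt          # normalize so that t't = 1
        P[:, m] = X.T @ t              # X-loading
        Q[:, m] = Y.T @ t              # Y-loading
        # orthogonalize the new X-loading against the previous basis
        v = P[:, m] - V[:, :m] @ (V[:, :m].T @ P[:, m])
        V[:, m] = v / np.linalg.norm(v)
        # deflate the cross-product matrix instead of X itself
        S = S - np.outer(V[:, m], V[:, m] @ S)
        R[:, m], T[:, m] = r, t
    beta = R @ Q.T                     # coefficients for the centered data
    return beta, T, P, Q, R

# usage (data must be centered first):
# Xc = X - X.mean(axis=0); Yc = Y - Y.mean(axis=0)
# beta, T, P, Q, R = simpls(Xc, Yc, 3)

Note that the deflation acts on the cross-product matrix S rather than on X, which is what allows the weights R to apply to the original predictors.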
By default, plsRegression
searches for the best number of PLS components
using K-fold cross-validation. That is, for each \(M=1,2,\ldots,p\),
plsRegression
estimates a PLS model with M components using all of the
data except a hold-out set of size roughly equal to nObservations/K.
Using the resulting model estimates, plsRegression
predicts the outcomes
in the hold-out set and calculates the predicted residual sum of squares
(PRESS). The procedure then selects the next hold-out sample and repeats for
a total of K times (i.e., folds). For further details see Hastie et al.
(2001), pages 241-245.
For each response variable, plsRegression
returns results for the model
with lowest PRESS. The best model (the number of components giving lowest
PRESS) generally will be different for different response variables.
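The selection procedure can be summarized with a short sketch. Here fit_predict(x_train, y_train, x_test, m) is a hypothetical helper (not part of the pyimsl API) that fits an m-component PLS model and returns predictions for x_test; x and y are assumed to be NumPy arrays.

import numpy as np

def kfold_press(x, y, max_components, k, fit_predict):
    # PRESS for each model size m = 1, ..., max_components via K-fold CV.
    n = len(x)
    folds = np.array_split(np.arange(n), k)   # K hold-out index sets
    press = np.zeros(max_components)
    for m in range(1, max_components + 1):
        for hold in folds:
            train = np.setdiff1d(np.arange(n), hold)
            yhat = fit_predict(x[train], y[train], x[hold], m)
            press[m - 1] += np.sum((y[hold] - yhat) ** 2)
    best = int(np.argmin(press)) + 1          # lowest PRESS wins
    return press, best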
When requested via the optional argument, stdErrors
, plsRegression
calculates modified jackknife estimates of the standard errors as described
in Martens and Martens (2000).
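The exact modification of Martens and Martens (2000) is not reproduced here, but the underlying idea, measuring the variability of the coefficient estimates across cross-validation segments, can be sketched as a plain jackknife with hypothetical inputs:

import numpy as np

def jackknife_se(beta_full, beta_folds):
    # beta_full:  coefficients from the fit on all data, shape (ix, iy)
    # beta_folds: K coefficient arrays, each from a fit with one
    #             cross-validation fold held out
    # Plain jackknife; Martens and Martens (2000) apply a further
    # modification that is not reproduced in this sketch.
    K = len(beta_folds)
    d = np.stack(beta_folds) - beta_full      # per-fold deviations
    return np.sqrt(np.sum(d * d, axis=0) * (K - 1) / K)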
Comments¶
- plsRegression defaults to leave-one-out cross-validation when there are too few observations to form K folds in the data. The user is cautioned that there may be too few observations to make strong inferences from the results.
- This implementation of plsRegression does not handle missing values. The user should remove missing values or NaNs from the input data, as in the sketch below.
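For instance, rows containing NaNs can be dropped with NumPy before calling the function; in this sketch x and y are assumed to be 2-D NumPy arrays (the data values are made up):

import numpy as np

x = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([[1.0], [2.0], [3.0]])

# keep only rows where both x and y are free of NaNs
mask = ~(np.isnan(x).any(axis=1) | np.isnan(y).any(axis=1))
x_clean, y_clean = x[mask], y[mask]  # the row with NaN in x is dropped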
Examples¶
Example 1¶
The following artificial data set is provided in de Jong (1993).
The first call to plsRegression
fixes the number of components to 3 for
both response variables, and the second call performs K-fold cross
validation. Note that because n is small, plsRegression
performs
leave-one-out (LOO) cross-validation.
from __future__ import print_function
from numpy import *
from pyimsl.stat.plsRegression import plsRegression
from pyimsl.stat.writeMatrix import writeMatrix
from pyimsl.stat.errorOptions import errorOptions
ncomps = 3
x = [[-4.0, 2.0, 1.0],
[-4.0, -2.0, -1.0],
[4.0, 2.0, -1.0],
[4.0, -2.0, 1.0]]
y = [[430.0, -94.0],
[-436.0, 12.0],
[-361.0, -22.0],
[367.0, 104.0]]
yhat = []
se = []
yhat2 = []
se2 = []
# Print out informational error.
errorPrintSetting = {"type": 2, "setting": 1}
errorOptions(setPrint=errorPrintSetting)
print("Example 1a: no cross-validation, request %d components." % ncomps)
coef = plsRegression(y, x,
nComponents=ncomps,
crossValidation=0,
printLevel=1,
predicted=yhat,
stdErrors=se)
print("\nExample 1b: cross-validation")
coef2 = plsRegression(y, x,
nFold=4,
printLevel=1,
predicted=yhat2,
stdErrors=se2)
Output¶
Example 1a: no cross-validation, request 3 components.
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***
Example 1b: cross-validation
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***
PLS Coeff
1 2
1 0.8 10.2
2 17.2 -29.0
3 398.5 5.0
Predicted Y
1 2
1 430 -94
2 -436 12
3 -361 -22
4 367 104
Std. Errors
1 2
1 131.5 5.1
2 263.0 10.3
3 526.0 20.5
Cross-validated results for response 1:
Comp PRESS
1 3860649
2 5902575
3 5902575
The best model has 1 component(s).
Cross-validated results for response 2:
Comp PRESS
1 36121
2 8984
3 8984
The best model has 2 component(s).
PLS Coeff
1 2
1 15.9 12.7
2 49.2 -23.9
3 371.1 0.6
Predicted Y
1 2
1 405.8 -97.8
2 -533.3 -3.5
3 -208.8 2.2
4 336.4 99.1
Std. Errors
1 2
1 134.1 7.1
2 269.9 3.8
3 478.5 19.5
Example 2¶
The data, which appear in S. Wold et al. (2001), consist of a single response variable, the “free energy of the unfolding of a protein,” while the predictor variables are 7 different, highly correlated measurements taken on 19 amino acids.
from __future__ import print_function
from numpy import *
from pyimsl.stat.plsRegression import plsRegression
from pyimsl.stat.writeMatrix import writeMatrix
ncomps = 7
x = [[0.23, 0.31, -0.55, 254.2, 2.126, -0.02, 82.2],
[-0.48, -0.6, 0.51, 303.6, 2.994, -1.24, 112.3],
[-0.61, -0.77, 1.2, 287.9, 2.994, -1.08, 103.7],
[0.45, 1.54, -1.4, 282.9, 2.933, -0.11, 99.1],
[-0.11, -0.22, 0.29, 335.0, 3.458, -1.19, 127.5],
[-0.51, -0.64, 0.76, 311.6, 3.243, -1.43, 120.5],
[0.0, 0.0, 0.0, 224.9, 1.662, 0.03, 65.0],
[0.15, 0.13, -0.25, 337.2, 3.856, -1.06, 140.6],
[1.2, 1.8, -2.1, 322.6, 3.35, 0.04, 131.7],
[1.28, 1.7, -2.0, 324.0, 3.518, 0.12, 131.5],
[-0.77, -0.99, 0.78, 336.6, 2.933, -2.26, 144.3],
[0.9, 1.23, -1.6, 336.3, 3.86, -0.33, 132.3],
[1.56, 1.79, -2.6, 366.1, 4.638, -0.05, 155.8],
[0.38, 0.49, -1.5, 288.5, 2.876, -0.31, 106.7],
[0.0, -0.04, 0.09, 266.7, 2.279, -0.4, 88.5],
[0.17, 0.26, -0.58, 283.9, 2.743, -0.53, 105.3],
[1.85, 2.25, -2.7, 401.8, 5.755, -0.31, 185.9],
[0.89, 0.96, -1.7, 377.8, 4.791, -0.84, 162.7],
[0.71, 1.22, -1.6, 295.1, 3.054, -0.13, 115.6]]
y = [8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1, 16.8,
15.0, 7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0]
yhat = []
se = []
yhat2 = []
se2 = []
print("Example 2a: no cross-validation, request %d components." % ncomps)
coef = plsRegression(y, x,
nComponents=ncomps,
crossValidation=0,
scale=1,
printLevel=2,
predicted=yhat,
stdErrors=se)
print("\nExample 2b: cross-validation")
coef2 = plsRegression(y, x,
scale=1,
printLevel=2,
predicted=yhat2,
stdErrors=se2)
Output¶
Example 2a: no cross-validation, request 7 components.
Example 2b: cross-validation
Standard PLS Coefficients
1
-5.458
1.668
0.625
1.431
-2.550
4.861
4.857
PLS Coeff
1
-20.03
4.63
1.42
0.09
-7.27
20.90
0.46
Predicted Y
1
9.38
7.30
8.09
12.02
8.79
6.76
7.24
10.44
15.79
14.35
8.41
9.95
11.52
8.64
8.23
8.40
11.12
8.97
12.39
Std. Errors
1
3.566
2.422
0.808
3.207
1.638
3.310
3.525
Corrected Std. Errors
1
13.09
6.72
1.83
0.20
4.67
14.23
0.33
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
2 42.8
3 58.3
4 65.1
5 68.2
6 75.0
7 75.5
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
2 96.3
3 97.4
4 97.9
5 98.2
6 98.3
7 98.4
Cross-validated results for response 1:
Comp PRESS
1 167.5
2 162.9
3 166.5
4 168.9
5 264.6
6 221.0
7 185.0
The best model has 2 component(s).
Standard PLS Coefficients
1
0.1598
0.2163
-0.1673
0.0095
-0.0136
0.1649
0.0294
PLS Coeff
1
0.5867
0.6000
-0.3797
0.0006
-0.0388
0.7089
0.0028
Predicted Y
1
9.86
7.71
7.35
11.02
8.32
7.46
9.32
9.00
12.09
12.09
6.59
11.11
12.46
10.27
9.02
9.51
12.82
10.69
11.09
Std. Errors
1
0.0603
0.1508
0.0570
0.0574
0.0751
0.0703
0.0927
Corrected Std. Errors
1
0.2212
0.4183
0.1292
0.0036
0.2140
0.3021
0.0087
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 39.7
2 42.8
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.1
2 96.3
Alert Errors¶
IMSLS_RESIDUAL_CONVERGED
    For response #, residuals converged in # components, while # is the requested number of components.