plsRegression

Performs partial least squares (PLS) regression for one or more response variables and one or more predictor variables.

Synopsis

plsRegression (y, x)

Required Arguments

float y[[]] (Input)
Array of length ny × h containing the values of the responses.
float x[[]] (Input)
Array of length nx × p containing the values of the predictor variables.

Return Value

An array of length ix × iy containing the final PLS regression coefficient estimates, where ix ≤ p is the number of predictor variables in the model and iy ≤ h is the number of response variables. If the estimates cannot be computed, None is returned.

Optional Arguments

nObservations, int (Input)

Positive integer specifying the number of observations to be used in the analysis.

Default: nObservations = min(ny, nx).

yIndices, int iyind[] (Input)

Argument iyind is an array of length iy containing column indices of y specifying which response variables to use in the analysis. Each element in iyind must be less than or equal to h-1.

Default: yIndices = 0, 1, …, h-1.

xIndices, int ixind[] (Input)

Argument ixind is an array of length ix containing column indices of x specifying which predictor variables to use in the analysis. Each element in ixind must be less than or equal to p-1.

Default: xIndices = 0, 1, …, p-1.

nComponents, int (Input)

The number of PLS components to fit. nComponents ≤ ix.

Default: nComponents = ix.

If crossValidation = 1 is used, models with 1 to nComponents components are tested using cross-validation. The model with the lowest predicted residual sum of squares is reported.

crossValidation, int (Input)

If crossValidation = 0, the function fits only the model specified by nComponents. If crossValidation = 1, the function performs K-fold cross validation to select the number of components.

Default: crossValidation = 1.

nFold, int (Input)

The number of folds to use in K-fold cross validation. nFold must be between 1 and nObservations, inclusive. nFold is ignored if crossValidation = 0 is used.

Default: nFold = 5.

If nObservations/nFold ≤ 3, the routine performs leave-one-out cross validation as opposed to K-fold cross validation.

scale, int (Input)

If scale = 1, y and x are centered and scaled to have mean 0 and standard deviation 1. If scale = 0, y and x are centered to have mean 0 but are not scaled. (A short sketch of this standardization appears at the end of this argument list.)

Default: scale = 0.

printLevel, int (Input)

Printing option.

printLevel   Action
0            No printing.
1            Prints final results only.
2            Prints intermediate and final results.

Default: printLevel = 0.

predicted (Output)

Argument predicted is an array of length nObservations × iy containing the predicted values for the response variables using the final values of the coefficients.

residuals (Output)

Argument residuals is an array of length nObservations × iy containing the residuals of the final fit for each response variable.

stdErrors (Output)

Argument stdErrors is an array of length ix × iy containing the standard errors of the PLS coefficients.

press (Output)

Argument press is an array of length nComponents × iy containing the predicted residual error sum of squares obtained by cross-validation for each model of size \(j=1\), …, nComponents components. The argument press is ignored if crossValidation = 0.

xScores (Output)

Argument xScores is an array of length nObservations × nComponents containing the X-scores.

yScores (Output)

Argument yScores is an array of length nObservations × nComponents containing the Y-scores.

xLoadings (Output)

Argument xLoadings is an array of length ix × nComponents containing the X-loadings.

yLoadings (Output)

Argument yLoadings is an array of length iy × nComponents containing the Y-loadings.

weights (Output)

Argument weights is an array of length ix × nComponents containing the weight vectors.
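
As a point of reference for the scale argument above, the centering and scaling amount to ordinary column standardization. Below is a minimal NumPy sketch; the use of the sample standard deviation (divisor n - 1) is an assumption, as the library's exact divisor is not documented here.

import numpy as np

x = np.array([[0.23, 0.31, -0.55],
              [-0.48, -0.60, 0.51],
              [-0.61, -0.77, 1.20]])

# scale = 0: each column is centered to mean 0.
x_centered = x - x.mean(axis=0)

# scale = 1: columns are additionally scaled to standard deviation 1.
# ddof=1 gives the sample standard deviation; this is an assumption.
x_standardized = x_centered / x.std(axis=0, ddof=1)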

Description

Function plsRegression performs partial least squares regression for a response matrix \(Y(n_y\times h)\) and a set of p explanatory variables, \(X(n_x\times p)\). plsRegression finds linear combinations of the predictor variables that have highest covariance with Y. In so doing, plsRegression produces a predictive model for Y using components (linear combinations) of the individual predictors. Other names for these linear combinations are scores, factors, or latent variables. Partial least squares regression is an alternative method to ordinary least squares for problems with many, highly collinear predictor variables. For further discussion see, for example, Abdi (2009), and Frank and Friedman (1993).

In Partial Least Squares (PLS), a score, or component matrix, T, is selected to represent both X and Y as in,

\[X = TP^T + E_x\]

and

\[Y = TQ^T + E_y\]

The matrices P and Q are the least squares solutions of X and Y regressed on T.

That is,

\[P^T = \left(T^T T\right)^{-1} T^T X\]

and

\[Q^T = \left(T^T T\right)^{-1} T^T Y\]

The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in \(Y=UQ^T+G\) are the corresponding Y scores, where G is a residual matrix and Q, as defined above, contains the Y-loadings.
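
To make these relations concrete, the loadings are ordinary least squares solutions given the scores. The following is a minimal NumPy sketch with placeholder matrices; the values are illustrative, not output from plsRegression.

import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((10, 2))  # scores, n x M (placeholder values)
X = rng.standard_normal((10, 3))  # centered predictors, n x p
Y = rng.standard_normal((10, 1))  # centered responses, n x h

# P^T = (T^T T)^(-1) T^T X  and  Q^T = (T^T T)^(-1) T^T Y
Pt = np.linalg.solve(T.T @ T, T.T @ X)
Qt = np.linalg.solve(T.T @ T, T.T @ Y)

# Residual matrices of the bilinear decomposition
Ex = X - T @ Pt   # E_x
Ey = Y - T @ Qt   # E_y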

Restricting T to be linear in X, the problem is to find a set of weight vectors (columns of W) such that \(T=XW\) predicts both X and Y reasonably well.

Formally, \(W=\left[w_1,\ldots,w_{m-1},w_m,\ldots,w_M\right]\) where each \(w_j\) is a column vector of length p, \(M\leq p\) is the number of components, and where the m-th partial least squares (PLS) component \(w_m\) solves:

\[\begin{split}\left\{ \begin{array}{c} \max_{\alpha} \mathrm{Corr}^2(Y,X\alpha)\,\mathrm{Var}(X\alpha) \\ \textit{s.t.} \\ \|\alpha\| = 1 \\ \alpha^T Sw_l = 0, \quad l=1, \ldots, m-1 \\ \end{array} \right.\end{split}\]

where \(S=X^T X\) and \(\| \alpha\|=\sqrt{\alpha^T \alpha}\) is the Euclidean norm. For further details see Hastie et al. (2001, pages 80-82).

That is, \(w_m\) is the vector which maximizes the product of the squared correlation between Y and Xα and the variance of Xα, subject to being orthogonal to each previous weight vector left multiplied by S. The PLS regression coefficients \(\hat{\beta}_{\mathit{PLS}}\) arise from

\[Y = X\hat{\beta}_{\mathit{PLS}} + E_y = TQ^T + E_y = XWQ^T + E_y, \textit{ or } \hat{\beta}_{\mathit{PLS}} = WQ^T\]

Algorithms to solve the above optimization problem include NIPALS (nonlinear iterative partial least squares) developed by Herman Wold (1966, 1985) and numerous variations, including the SIMPLS algorithm of de Jong (1993). plsRegression implements the SIMPLS method. SIMPLS is appealing because it finds a solution in terms of the original predictor variables, whereas NIPALS reduces the matrices at each step. For univariate Y it has been shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights matrices will be proportional between the two methods).
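
To show the shape of the computation, here is a compact sketch of the SIMPLS iteration in NumPy. It assumes X and Y are already centered and is an illustration of de Jong's (1993) algorithm, not the library's implementation.

import numpy as np

def simpls(X, Y, n_components):
    # Illustrative SIMPLS for centered X (n x p) and Y (n x h).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    h = Y.shape[1]
    R = np.zeros((p, n_components))  # weight vectors (columns of W)
    T = np.zeros((n, n_components))  # X-scores
    P = np.zeros((p, n_components))  # X-loadings
    Q = np.zeros((h, n_components))  # Y-loadings
    V = np.zeros((p, n_components))  # orthonormal basis used for deflation
    S = X.T @ Y                      # p x h cross-product matrix
    for a in range(n_components):
        # Weight vector: dominant left singular vector of S.
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]
        t = X @ r                        # X-score for this component
        norm_t = np.linalg.norm(t)
        t, r = t / norm_t, r / norm_t    # normalize the score to unit length
        pa = X.T @ t                     # X-loading
        qa = Y.T @ t                     # Y-loading
        # Orthogonalize the loading and deflate the cross-product matrix.
        v = pa - V[:, :a] @ (V[:, :a].T @ pa)
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)
        R[:, a], T[:, a], P[:, a], Q[:, a], V[:, a] = r, t, pa, qa, v
    B = R @ Q.T                      # coefficients, matching beta = W Q^T
    return B, T, P, Q, R

Because the scores are normalized to unit length, the coefficient matrix reduces to B = R Q^T, matching the identity for the PLS coefficients above (with R playing the role of W).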

By default, plsRegression searches for the best number of PLS components using K-fold cross-validation. That is, for each \(M=1,2,\ldots,p\), plsRegression estimates a PLS model with M components using all of the data except a hold-out set of size roughly equal to nObservations/K. Using the resulting model estimates, plsRegression predicts the outcomes in the hold-out set and calculates the predicted residual sum of squares (PRESS). The procedure then selects the next hold-out sample and repeats, for a total of K times (i.e., folds). For further details see Hastie et al. (2001, pages 241-245).
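
The cross-validation bookkeeping can be sketched as follows. To keep the example self-contained, ordinary least squares stands in for the per-fold PLS fit, and the round-robin fold assignment is an assumption; the library's partitioning may differ.

import numpy as np

def kfold_press(X, Y, n_folds):
    # PRESS via K-fold cross validation (OLS stands in for the PLS fit).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    folds = np.arange(X.shape[0]) % n_folds  # assumed fold assignment
    press = np.zeros(Y.shape[1])
    for k in range(n_folds):
        hold = folds == k                    # hold-out set for this fold
        beta, *_ = np.linalg.lstsq(X[~hold], Y[~hold], rcond=None)
        resid = Y[hold] - X[hold] @ beta     # hold-out prediction errors
        press += (resid ** 2).sum(axis=0)    # accumulate PRESS per response
    return press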

For each response variable, plsRegression returns results for the model with the lowest PRESS. The best model (the number of components giving the lowest PRESS) will generally differ across response variables.

When requested via the optional argument stdErrors, plsRegression calculates modified jackknife estimates of the standard errors as described in Martens and Martens (2000).
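
As a rough illustration only, the following computes textbook delete-one jackknife standard errors, again with ordinary least squares standing in for the PLS fit; the modified jackknife of Martens and Martens (2000) differs in its details.

import numpy as np

def jackknife_se(X, Y):
    # Delete-one jackknife standard errors (generic, not the modified form).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    full, *_ = np.linalg.lstsq(X, Y, rcond=None)   # fit on all observations
    sq_dev = np.zeros_like(full)
    for i in range(n):
        keep = np.arange(n) != i                   # drop observation i
        sub, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
        sq_dev += (sub - full) ** 2
    return np.sqrt((n - 1.0) / n * sq_dev)         # jackknife scaling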

Comments

  1. plsRegression defaults to leave-one-out cross-validation when there are too few observations to form K folds in the data. The user is cautioned that there may be too few observations to make strong inferences from the results.
  2. This implementation of plsRegression does not handle missing values. The user should remove missing values or NaNs from the input data, for example as sketched below.
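
For comment 2, a simple NumPy mask is usually sufficient. A sketch, assuming x and y are arrays with matching rows:

import numpy as np

x = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([[1.5], [2.5], [np.nan]])

# Keep only the rows where neither x nor y contains a NaN.
keep = ~(np.isnan(x).any(axis=1) | np.isnan(y).any(axis=1))
x_clean, y_clean = x[keep], y[keep]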

Examples

Example 1

The following artificial data set is provided in de Jong (1993).

\[\begin{split}\begin{array}{l} X = \begin{bmatrix} -4 & 2 & 1 \\ -4 & -2 & -1 \\ 4 & 2 & -1 \\ 4 & -2 & 1 \\ \end{bmatrix} \\ \\ Y = \begin{bmatrix} 430 & -94 \\ -436 & 12 \\ -361 & -22 \\ 367 & 104 \\ \end{bmatrix} \\ \end{array}\end{split}\]

The first call to plsRegression fixes the number of components to 3 for both response variables, and the second call performs K-fold cross validation. Note that because n is small, plsRegression performs leave-one-out (LOO) cross-validation.

from __future__ import print_function
from pyimsl.stat.plsRegression import plsRegression
from pyimsl.stat.errorOptions import errorOptions

ncomps = 3
x = [[-4.0, 2.0, 1.0],
     [-4.0, -2.0, -1.0],
     [4.0, 2.0, -1.0],
     [4.0, -2.0, 1.0]]
y = [[430.0, -94.0],
     [-436.0, 12.0],
     [-361.0, -22.0],
     [367.0, 104.0]]
yhat = []
se = []
yhat2 = []
se2 = []

# Print out informational error.
errorPrintSetting = {"type": 2, "setting": 1}
errorOptions(setPrint=errorPrintSetting)

print("Example 1a: no cross-validation, request %d components." % ncomps)

coef = plsRegression(y, x,
                     nComponents=ncomps,
                     crossValidation=0,
                     printLevel=1,
                     predicted=yhat,
                     stdErrors=se)

print("\nExample 1b: cross-validation")
coef2 = plsRegression(y, x,
                      nFold=4,
                      printLevel=1,
                      predicted=yhat2,
                      stdErrors=se2)

Output

Example 1a: no cross-validation, request 3 components.
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***

Example 1b: cross-validation
***
*** Warning error issued from IMSL function plsRegression:
*** Error code 11432.
***
 
         PLS Coeff
             1            2
1          0.8         10.2
2         17.2        -29.0
3        398.5          5.0
 
        Predicted Y
             1            2
1          430          -94
2         -436           12
3         -361          -22
4          367          104
 
        Std. Errors
             1            2
1        131.5          5.1
2        263.0         10.3
3        526.0         20.5

Cross-validated results for response 1:

Comp      PRESS 
1      3860649
2      5902575
3      5902575

The best model has 1 component(s).

Cross-validated results for response 2:

Comp      PRESS 
1        36121
2         8984
3         8984

The best model has 2 component(s).
 
         PLS Coeff
             1            2
1         15.9         12.7
2         49.2        -23.9
3        371.1          0.6
 
        Predicted Y
             1            2
1        405.8        -97.8
2       -533.3         -3.5
3       -208.8          2.2
4        336.4         99.1
 
        Std. Errors
             1            2
1        134.1          7.1
2        269.9          3.8
3        478.5         19.5

Example 2

The data, which appear in S. Wold et al. (2001), consist of a single response variable, the “free energy of the unfolding of a protein”, while the predictor variables are 7 different, highly correlated measurements taken on 19 amino acids.

from __future__ import print_function
from pyimsl.stat.plsRegression import plsRegression

ncomps = 7

x = [[0.23, 0.31, -0.55, 254.2, 2.126, -0.02, 82.2],
     [-0.48, -0.6, 0.51, 303.6, 2.994, -1.24, 112.3],
     [-0.61, -0.77, 1.2, 287.9, 2.994, -1.08, 103.7],
     [0.45, 1.54, -1.4, 282.9, 2.933, -0.11, 99.1],
     [-0.11, -0.22, 0.29, 335.0, 3.458, -1.19, 127.5],
     [-0.51, -0.64, 0.76, 311.6, 3.243, -1.43, 120.5],
     [0.0, 0.0, 0.0, 224.9, 1.662, 0.03, 65.0],
     [0.15, 0.13, -0.25, 337.2, 3.856, -1.06, 140.6],
     [1.2, 1.8, -2.1, 322.6, 3.35, 0.04, 131.7],
     [1.28, 1.7, -2.0, 324.0, 3.518, 0.12, 131.5],
     [-0.77, -0.99, 0.78, 336.6, 2.933, -2.26, 144.3],
     [0.9, 1.23, -1.6, 336.3, 3.86, -0.33, 132.3],
     [1.56, 1.79, -2.6, 366.1, 4.638, -0.05, 155.8],
     [0.38, 0.49, -1.5, 288.5, 2.876, -0.31, 106.7],
     [0.0, -0.04, 0.09, 266.7, 2.279, -0.4, 88.5],
     [0.17, 0.26, -0.58, 283.9, 2.743, -0.53, 105.3],
     [1.85, 2.25, -2.7, 401.8, 5.755, -0.31, 185.9],
     [0.89, 0.96, -1.7, 377.8, 4.791, -0.84, 162.7],
     [0.71, 1.22, -1.6, 295.1, 3.054, -0.13, 115.6]]


y = [8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1, 16.8,
     15.0, 7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0]
yhat = []
se = []
yhat2 = []
se2 = []

print("Example 2a: no cross-validation, request %d components." % ncomps)

coef = plsRegression(y, x,
                     nComponents=ncomps,
                     crossValidation=0,
                     scale=1,
                     printLevel=2,
                     predicted=yhat,
                     stdErrors=se)

print("\nExample 2b: cross-validation")
coef2 = plsRegression(y, x,
                      scale=1,
                      printLevel=2,
                      predicted=yhat2,
                      stdErrors=se2)

Output

Example 2a: no cross-validation, request 7 components.

Example 2b: cross-validation
 
Standard PLS Coefficients
                 1
            -5.458
             1.668
             0.625
             1.431
            -2.550
             4.861
             4.857
 
 PLS Coeff
          1
     -20.03
       4.63
       1.42
       0.09
      -7.27
      20.90
       0.46
 
Predicted Y
          1
       9.38
       7.30
       8.09
      12.02
       8.79
       6.76
       7.24
      10.44
      15.79
      14.35
       8.41
       9.95
      11.52
       8.64
       8.23
       8.40
      11.12
       8.97
      12.39
 
Std. Errors
          1
      3.566
      2.422
      0.808
      3.207
      1.638
      3.310
      3.525
 
Corrected Std. Errors
               1
           13.09
            6.72
            1.83
            0.20
            4.67
           14.23
            0.33

Variance Analysis
=============================================
Pctge of Y variance explained
Component    Cum. Pctge
1          39.7
2          42.8
3          58.3
4          65.1
5          68.2
6          75.0
7          75.5
=============================================
Pctge of X variance explained
Component    Cum. Pctge
1          64.1
2          96.3
3          97.4
4          97.9
5          98.2
6          98.3
7          98.4

Cross-validated results for response 1:

Comp      PRESS 
1        167.5
2        162.9
3        166.5
4        168.9
5        264.6
6        221.0
7        185.0

The best model has 2 component(s).
 
Standard PLS Coefficients
                 1
            0.1598
            0.2163
           -0.1673
            0.0095
           -0.0136
            0.1649
            0.0294
 
 PLS Coeff
          1
     0.5867
     0.6000
    -0.3797
     0.0006
    -0.0388
     0.7089
     0.0028
 
Predicted Y
          1
       9.86
       7.71
       7.35
      11.02
       8.32
       7.46
       9.32
       9.00
      12.09
      12.09
       6.59
      11.11
      12.46
      10.27
       9.02
       9.51
      12.82
      10.69
      11.09
 
Std. Errors
          1
     0.0603
     0.1508
     0.0570
     0.0574
     0.0751
     0.0703
     0.0927
 
Corrected Std. Errors
               1
          0.2212
          0.4183
          0.1292
          0.0036
          0.2140
          0.3021
          0.0087

Variance Analysis
=============================================
Pctge of Y variance explained
Component    Cum. Pctge
1          39.7
2          42.8
=============================================
Pctge of X variance explained
Component    Cum. Pctge
1          64.1
2          96.3

Alert Errors

IMSLS_RESIDUAL_CONVERGED For response #, residuals converged in # components, while # is the requested number of components.