pls_regression
Performs partial least squares (PLS) regression for one or more response variables and one or more predictor variables.
Synopsis
#include <imsls.h>
float *imsls_f_pls_regression (int ny, int h, float y[], int nx, int p, float x[], ..., 0)
The type double function is imsls_d_pls_regression.
Required Arguments
int ny (Input)
The number of rows of y.
int h (Input)
The number of response variables.
float y[] (Input)
Array of length ny × h containing the values of the responses.
int nx (Input)
The number of rows of x.
int p (Input)
The number of predictor variables.
float x[] (Input)
Array of length nx × p containing the values of the predictor variables.
Return Value
A pointer to the array of length ix × iy containing the final PLS regression coefficient estimates for the mean-centered variables, where ix ≤ p is the number of predictor variables in the model, and iy ≤ h is the number of response variables. To release this space, use imsls_free. If the estimates cannot be computed, NULL is returned.
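For orientation, the following minimal sketch (not part of the original manual text) calls the function with only the required arguments. The data values and dimensions (10 observations, 1 response, 3 predictors) are hypothetical placeholders.
#include <imsls.h>
#include <stdio.h>
int main() {
    /* Hypothetical placeholder data: 10 observations, 1 response, 3 predictors. */
    int i;
    float x[10][3], y[10], *coef = NULL;
    for (i = 0; i < 10; i++) {
        x[i][0] = (float)i;
        x[i][1] = (float)(i * i) / 10.0f;
        x[i][2] = (float)((3 * i) % 7);
        y[i] = 1.5f * x[i][0] - 0.5f * x[i][2];
    }
    /* ny = 10, h = 1, nx = 10, p = 3; by default, K-fold cross-validation
       (k = 5) selects the number of components. */
    coef = imsls_f_pls_regression(10, 1, y, 10, 3, &x[0][0], 0);
    if (coef != NULL) {
        for (i = 0; i < 3; i++)
            printf("coef[%d] = %g\n", i, coef[i]);
        imsls_free(coef);   /* release the space allocated by the function */
    }
}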
Synopsis with Optional Arguments
#include <imsls.h>
float *imsls_f_pls_regression (int ny, int h, float y[], int nx, int p, float x[],
IMSLS_N_OBSERVATIONS, int nobs,
IMSLS_Y_INDICES, int iy, int iyind[],
IMSLS_X_INDICES, int ix, int ixind[],
IMSLS_N_COMPONENTS, int ncomps,
IMSLS_CROSS_VALIDATION, int cv,
IMSLS_N_FOLD, int k,
IMSLS_SCALE, int scale,
IMSLS_PRINT_LEVEL, int iprint,
IMSLS_OPT_N_COMPONENTS, int *optcomps,
IMSLS_PREDICTED, float **yhat,
IMSLS_PREDICTED_USER, float yhat[],
IMSLS_RESIDUALS, float **resids,
IMSLS_RESIDUALS_USER, float resids[],
IMSLS_STD_ERRORS, float **se,
IMSLS_STD_ERRORS_USER, float se[],
IMSLS_PRESS, float **press,
IMSLS_PRESS_USER, float press[],
IMSLS_X_SCORES, float **xscrs,
IMSLS_X_SCORES_USER, float xscrs[],
IMSLS_Y_SCORES, float **yscrs,
IMSLS_Y_SCORES_USER, float yscrs[],
IMSLS_X_LOADINGS, float **xldgs,
IMSLS_X_LOADINGS_USER, float xldgs[],
IMSLS_Y_LOADINGS, float **yldgs,
IMSLS_Y_LOADINGS_USER, float yldgs[],
IMSLS_WEIGHTS, float **wts,
IMSLS_WEIGHTS_USER, float wts[],
IMSLS_STANDARD_COEF, float **standard_coef,
IMSLS_STANDARD_COEF_USER, float standard_coef[],
IMSLS_INTERCEPT_TERMS, float **intercepts,
IMSLS_INTERCEPT_TERMS_USER, float intercepts[],
IMSLS_PCT_VAR, float **pctvar,
IMSLS_PCT_VAR_USER, float pctvar[],
IMSLS_RETURN_USER, float coef[],
0)
Optional Arguments
IMSLS_N_OBSERVATIONS, int nobs (Input)
Positive integer specifying the number of observations to be used in the analysis.
Default: nobs = min(ny, nx).
IMSLS_Y_INDICES, int iy, int iyind[] (Input)
Argument iyind is an array of length iy containing column indices of y specifying which response variables to use in the analysis. Each element in iyind must be less than or equal to h-1.
Default: iy = h, iyind = 0, 1, …, h-1.
IMSLS_X_INDICES, int ix, int ixind[] (Input)
Argument ixind is an array of length ix containing column indices of x specifying which predictor variables to use in the analysis. Each element in ixind must be less than or equal to p-1.
Default: ix = p, ixind = 0, 1, …, p-1.
IMSLS_N_COMPONENTS, int ncomps (Input)
The number of PLS components to fit. ncomps ≤ ix.
Default: ncomps = ix.
Note: If cv = 1 is used, models with 1 to ncomps components are tested using cross-validation, and the model with the lowest predicted residual sum of squares (PRESS) is reported.
IMSLS_CROSS_VALIDATION, int cv (Input)
If cv = 0, the function fits only the model specified by ncomps. If cv = 1, the function performs K-fold cross validation to select the number of components.
Default: cv = 1.
IMSLS_N_FOLD, int k (Input)
The number of folds to use in K-fold cross validation. k must be between 2 and nobs, inclusive. k is ignored if cv = 0 is used.
Default: k = 5.
IMSLS_SCALE, int scale (Input)
If scale = 1, y and x are centered and scaled to have mean 0 and standard deviation of 1. If scale = 0, y and x are centered to have mean 0 but are not scaled.
Default: scale = 0.
IMSLS_PRINT_LEVEL, int iprint (Input)
Printing option.
iprint   Action
0        No printing.
1        Prints final results only.
2        Prints intermediate and final results.
Default: iprint = 0.
IMSLS_OPT_N_COMPONENTS, int *optcomps (Output)
The number of components in the optimal model. If cv = 0 is used, optcomps equals ncomps.
IMSLS_PREDICTED, float **yhat (Output)
Argument yhat is the address of an array of length nobs × iy, containing the predicted values for the response variables using the final values of the coefficients.
IMSLS_PREDICTED_USER, float yhat[] (Output)
Storage for array yhat is provided by the user. See IMSLS_PREDICTED.
IMSLS_RESIDUALS, float **resids (Output)
Argument resids is the address of an array of length nobs × iy, containing residuals of the final fit for each response variable.
IMSLS_RESIDUALS_USER, float resids[] (Output)
Storage for array resids is provided by the user. See IMSLS_RESIDUALS.
IMSLS_STD_ERRORS, float **se (Output)
Argument se is the address of an array of length ix × iy, containing the standard errors of the PLS coefficients.
IMSLS_STD_ERRORS_USER, float se[] (Output)
Storage for array se is provided by the user. See IMSLS_STD_ERRORS.
IMSLS_PRESS, float **press (Output)
Argument press is the address of an array of length ncomps × iy, containing the predicted residual error sum of squares obtained by cross-validation for each model of size j = 1, …, ncomps components. The argument press is ignored if cv = 0 is used for IMSLS_CROSS_VALIDATION.
IMSLS_PRESS_USER, float press[] (Output)
Storage for array press is provided by the user. See IMSLS_PRESS.
IMSLS_X_SCORES, float **xscrs (Output)
Argument xscrs is the address of an array of length nobs × ncomps containing X‑scores.
IMSLS_X_SCORES_USER, float xscrs[] (Output)
Storage for array xscrs is provided by the user. See IMSLS_X_SCORES.
IMSLS_Y_SCORES, float **yscrs (Output)
Argument yscrs is the address of an array of length nobs × ncomps containing Y‑scores.
IMSLS_Y_SCORES_USER, float yscrs[] (Output)
Storage for array yscrs is provided by the user. See IMSLS_Y_SCORES.
IMSLS_X_LOADINGS, float **xldgs (Output)
Argument xldgs is the address of an array of length ix × ncomps, containing X‑loadings.
IMSLS_X_LOADINGS_USER, float xldgs[] (Output)
Storage for array xldgs is provided by the user. See IMSLS_X_LOADINGS.
IMSLS_Y_LOADINGS, float **yldgs (Output)
Argument yldgs is the address of an array of length iy × ncomps, containing Y‑loadings.
IMSLS_Y_LOADINGS_USER, float yldgs[] (Output)
Storage for array yldgs is provided by the user. See IMSLS_Y_LOADINGS.
IMSLS_WEIGHTS, float **wts (Output)
Argument wts is the address of an array of length ix × ncomps, containing the weight vectors.
IMSLS_WEIGHTS_USER, float wts[] (Output)
Storage for array wts is provided by the user. See IMSLS_WEIGHTS.
IMSLS_STANDARD_COEF, float **standard_coef (Output)
Argument standard_coef is the address of an array of length ix × iy, containing the final PLS regression coefficient estimates for the centered (if scale = 0) or standardized variables (if scale = 1). The contents of standard_coef and coef are identical if scale = 0 is used.
IMSLS_STANDARD_COEF_USER, float standard_coef[] (Output)
Storage for array standard_coef is provided by the user. See IMSLS_STANDARD_COEF.
IMSLS_INTERCEPT_TERMS, float **intercepts (Output)
Argument intercepts is the address of an array of length iy, containing the intercept terms of the PLS regression.
IMSLS_INTERCEPT_TERMS_USER, float intercepts[] (Output)
Storage for array intercepts is provided by the user. See IMSLS_INTERCEPT_TERMS.
IMSLS_PCT_VAR, float **pctvar (Output)
Argument pctvar is the address of an array of length 2 × ncomps, containing the percentage of variance explained by the model in its first optcomps columns. The first row contains the percentage of variance of x explained by each component, the second row the percentage of variance of y explained by each component.
IMSLS_PCT_VAR_USER, float pctvar[] (Output)
Storage for array pctvar is provided by the user. See IMSLS_PCT_VAR.
IMSLS_RETURN_USER, float coef[] (Output)
If specified, the final PLS regression coefficient estimates are stored in array coef provided by the user.
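The *_USER variants and IMSLS_RETURN_USER let the caller supply output storage rather than having the function allocate it. The following is a brief, hedged sketch with hypothetical data; the array lengths follow the descriptions above, i.e., coef has length ix × iy = 2 and yhat has length nobs × iy = 5.
#include <imsls.h>
#include <stdio.h>
int main() {
    /* Hypothetical data: 5 observations, 1 response, 2 predictors. */
    float x[5][2] = {
        1.0, 2.0,
        2.0, 1.0,
        3.0, 4.0,
        4.0, 3.0,
        5.0, 6.0
    };
    float y[5] = {3.1, 2.9, 7.2, 6.8, 11.1};
    float coef[2], yhat[5];   /* user-provided storage */
    int i;
    imsls_f_pls_regression(5, 1, y, 5, 2, &x[0][0],
        IMSLS_CROSS_VALIDATION, 0,      /* fit the full 2-component model only */
        IMSLS_RETURN_USER, coef,        /* coefficients stored in coef[]       */
        IMSLS_PREDICTED_USER, yhat,     /* predictions stored in yhat[]        */
        0);
    for (i = 0; i < 5; i++)
        printf("yhat[%d] = %6.2f\n", i, yhat[i]);
}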
Description
Function imsls_f_pls_regression performs partial least squares regression for a response matrix Y(ny × h) and a set of p explanatory variables, X(nx × p). imsls_f_pls_regression finds linear combinations of the predictor variables that have highest covariance with Y. In so doing, imsls_f_pls_regression produces a predictive model for Y using components (linear combinations) of the individual predictors. Other names for these linear combinations are scores, factors, or latent variables. Partial least squares regression is an alternative method to ordinary least squares for problems with many, highly collinear predictor variables. For further discussion see, for example, Abdi (2010), and Frank and Friedman (1993).
In partial least squares (PLS), a score, or component, matrix T is selected to represent both X and Y, as in
X = TP^T + E_x
and
Y = TQ^T + E_y.
The matrices P and Q are the least squares solutions of X and Y regressed on T. That is,
P^T = (T^T T)^{-1} T^T X
and
Q^T = (T^T T)^{-1} T^T Y.
The columns of T in the above relations are often called X-scores, while the columns of P are the X-loadings. The columns of the matrix U in Y = UQ^T + G are the corresponding Y-scores, where G is a residual matrix and Q, as defined above, contains the Y-loadings.
Restricting T to be linear in X, the problem is to find a set of weight vectors (columns of W) such that T = XW predicts both X and Y reasonably well.
Formally, W = [w_1, …, w_{m-1}, w_m, …, w_M], where each w_j is a column vector of length p, M ≤ p is the number of components, and the m-th partial least squares (PLS) component w_m solves
w_m = argmax_α Corr^2(Y, Xα) Var(Xα)
subject to
||α|| = 1 and α^T S w_j = 0 for j = 1, …, m − 1,
where S = X^T X and ||α|| = sqrt(α^T α) is the Euclidean norm. For further details see Hastie et al., pages 80-82 (2001).
That is, w_m is the vector that maximizes the product of the squared correlation between Y and Xα and the variance of Xα, subject to being orthogonal to each previous weight vector left-multiplied by S. Because T = XW and Q^T = (T^T T)^{-1} T^T Y, the PLS regression coefficients arise from
B_PLS = W Q^T.
Algorithms to solve the above optimization problem include NIPALS (nonlinear iterative partial least squares) developed by Herman Wold (1966, 1985) and numerous variations, including the SIMPLS algorithm of de Jong (1993). imsls_f_pls_regression implements the SIMPLS method. SIMPLS is appealing because it finds a solution in terms of the original predictor variables, whereas NIPALS reduces the matrices at each step. For univariate Y it has been shown that SIMPLS and NIPALS are equivalent (the score, loading, and weights matrices will be proportional between the two methods).
By default, imsls_f_pls_regression searches for the best number of PLS components using K-fold cross-validation. That is, for each M = 1, 2, …, p, imsls_f_pls_regression estimates a PLS model with M components using all of the data except a hold-out set of size roughly equal to nobs/k. Using the resulting model estimates, imsls_f_pls_regression predicts the outcomes in the hold-out set and calculates the predicted residual sum of squares (PRESS). The procedure then selects the next hold-out sample and repeats, for a total of K times (i.e., folds). For further details see Hastie et al., pages 241-245 (2001).
For each response variable, imsls_f_pls_regression returns results for the model with the lowest PRESS. The best model (the number of components giving the lowest PRESS) will generally be different for different response variables.
When requested via the optional argument IMSLS_STD_ERRORS, imsls_f_pls_regression calculates modified jackknife estimates of the standard errors as described in Martens and Martens (2000).
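To make the cross-validation workflow concrete, the sketch below (hypothetical data, not taken from the manual) recovers the selected model size through IMSLS_OPT_N_COMPONENTS and the PRESS values for each candidate model through IMSLS_PRESS, using leave-one-out cross-validation.
#include <imsls.h>
#include <stdio.h>
int main() {
    /* Hypothetical data: 6 observations, 1 response, 3 predictors. */
    float x[6][3] = {
        1.0, 0.5, 2.0,
        2.0, 1.5, 1.0,
        3.0, 2.0, 4.0,
        4.0, 2.5, 3.0,
        5.0, 4.0, 6.0,
        6.0, 4.5, 5.0
    };
    float y[6] = {2.1, 3.8, 6.2, 7.9, 10.1, 12.2};
    float *coef = NULL, *press = NULL;
    int j, optcomps = 0;
    coef = imsls_f_pls_regression(6, 1, y, 6, 3, &x[0][0],
        IMSLS_N_FOLD, 6,                    /* k = nobs: leave-one-out CV     */
        IMSLS_OPT_N_COMPONENTS, &optcomps,  /* number of components selected  */
        IMSLS_PRESS, &press,                /* PRESS for models of size 1..3  */
        0);
    printf("Selected number of components: %d\n", optcomps);
    for (j = 0; j < 3; j++)                 /* ncomps defaults to p = 3       */
        printf("PRESS(%d) = %8.3f\n", j + 1, press[j]);
    imsls_free(press);
    imsls_free(coef);
}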
Comments
This implementation of imsls_f_pls_regression does not handle missing values. The user should remove missing values or NaNs from the input data.
Examples
Example 1
The following artificial data set is provided in de Jong (1993).
The first call to imsls_f_pls_regression fixes the number of components at 3 for both response variables, and the second call performs K-fold cross-validation. Because the number of folds is equal to the number of observations, imsls_f_pls_regression performs leave-one-out (LOO) cross-validation.
#include <imsls.h>
#include <stdio.h>
#define H 2
#define N 4
#define P 3
int main() {
    int iprint = 1, ncomps = 3;
    float x[N][P] = {
        -4.0, 2.0, 1.0,
        -4.0, -2.0, -1.0,
        4.0, 2.0, -1.0,
        4.0, -2.0, 1.0
    };
    float y[N][H] = {
        430.0, -94.0,
        -436.0, 12.0,
        -361.0, -22.0,
        367.0, 104.0
    };
    float *coef = NULL, *yhat = NULL, *se = NULL;
    float *coef2 = NULL, *yhat2 = NULL, *se2 = NULL;

    /* Print out informational error. */
    imsls_error_options(IMSLS_SET_PRINT, IMSLS_ALERT, 1, 0);

    printf("Example 1a: no cross-validation, request %d components.\n",
        ncomps);
    coef = imsls_f_pls_regression(N, H, &y[0][0], N, P, &x[0][0],
        IMSLS_N_COMPONENTS, ncomps,
        IMSLS_CROSS_VALIDATION, 0,
        IMSLS_PRINT_LEVEL, iprint,
        IMSLS_PREDICTED, &yhat,
        IMSLS_STD_ERRORS, &se,
        0);

    printf("\nExample 1b: cross-validation\n");
    coef2 = imsls_f_pls_regression(N, H, &y[0][0], N, P, &x[0][0],
        IMSLS_N_FOLD, N,
        IMSLS_PRINT_LEVEL, iprint,
        IMSLS_PREDICTED, &yhat2,
        IMSLS_STD_ERRORS, &se2,
        0);
}
Output
Example 1a: no cross-validation, request 3 components.
PLS Coeff
1 2
1 0.8 10.3
2 17.3 -29.0
3 398.5 5.0
Predicted Y
1 2
1 430 -94
2 -436 12
3 -361 -22
4 367 104
Std. Errors
1 2
1 131.5 5.1
2 263.0 10.3
3 526.0 20.5
*** WARNING Error IMSLS_PLS_REGRESSION_CONVERGED from imsls_f_pls_regression.
*** The PLS regression algorithm converged in 2 iterations, but the
*** number of requested PLS components is 3. The number of computed
*** PLS components is reduced to 2.
Example 1b: cross-validation
Cross-validated results for response 1:
Comp PRESS
1 3860649
2 5902575
3 5902575
The best model has 1 component(s).
Cross-validated results for response 2:
Comp PRESS
1 36121
2 8984
3 8984
The best model has 2 component(s).
PLS Coeff
1 2
1 6.0 -0.2
2 66.1 -2.2
3 361.4 -11.8
Predicted Y
1 2
1 469.5 -15.4
2 -517.6 17.0
3 -205.3 6.7
4 253.4 -8.3
Std. Errors
1 2
1 131.2 18.5
2 114.8 10.1
3 561.5 22.5
*** WARNING Error IMSLS_PLS_REGRESSION_CONVERGED from imsls_f_pls_regression.
*** The PLS regression algorithm converged in 2 iterations, but the
*** number of requested PLS components is 3. The number of computed
*** PLS components is reduced to 2.
Example 2
The data set, which appears in S. Wold et al. (2001), consists of a single response variable, the “free energy of the unfolding of a protein”, while the predictor variables are 7 different, highly correlated measurements taken on 19 amino acids.
#include <imsls.h>
#include <stdio.h>
#define H 1
#define N 19
#define P 7
int main() {
    int iprint = 2, ncomps = 7;
    float x[N][P] = {
        0.23, 0.31, -0.55, 254.2, 2.126, -0.02, 82.2,
        -0.48, -0.6, 0.51, 303.6, 2.994, -1.24, 112.3,
        -0.61, -0.77, 1.2, 287.9, 2.994, -1.08, 103.7,
        0.45, 1.54, -1.4, 282.9, 2.933, -0.11, 99.1,
        -0.11, -0.22, 0.29, 335.0, 3.458, -1.19, 127.5,
        -0.51, -0.64, 0.76, 311.6, 3.243, -1.43, 120.5,
        0.0, 0.0, 0.0, 224.9, 1.662, 0.03, 65.0,
        0.15, 0.13, -0.25, 337.2, 3.856, -1.06, 140.6,
        1.2, 1.8, -2.1, 322.6, 3.35, 0.04, 131.7,
        1.28, 1.7, -2.0, 324.0, 3.518, 0.12, 131.5,
        -0.77, -0.99, 0.78, 336.6, 2.933, -2.26, 144.3,
        0.9, 1.23, -1.6, 336.3, 3.86, -0.33, 132.3,
        1.56, 1.79, -2.6, 366.1, 4.638, -0.05, 155.8,
        0.38, 0.49, -1.5, 288.5, 2.876, -0.31, 106.7,
        0.0, -0.04, 0.09, 266.7, 2.279, -0.4, 88.5,
        0.17, 0.26, -0.58, 283.9, 2.743, -0.53, 105.3,
        1.85, 2.25, -2.7, 401.8, 5.755, -0.31, 185.9,
        0.89, 0.96, -1.7, 377.8, 4.791, -0.84, 162.7,
        0.71, 1.22, -1.6, 295.1, 3.054, -0.13, 115.6
    };
    float y[N][H] = {
        8.5, 8.2, 8.5, 11.0, 6.3, 8.8, 7.1, 10.1,
        16.8, 15.0, 7.9, 13.3, 11.2, 8.2, 7.4, 8.8, 9.9, 8.8, 12.0
    };
    float *coef = NULL, *yhat = NULL, *se = NULL;
    float *coef2 = NULL, *yhat2 = NULL, *se2 = NULL;

    printf("Example 2a: no cross-validation, request %d components.\n",
        ncomps);
    coef = imsls_f_pls_regression(N, H, &y[0][0], N, P, &x[0][0],
        IMSLS_N_COMPONENTS, ncomps,
        IMSLS_CROSS_VALIDATION, 0,
        IMSLS_SCALE, 1,
        IMSLS_PRINT_LEVEL, iprint,
        IMSLS_PREDICTED, &yhat,
        IMSLS_STD_ERRORS, &se,
        0);

    printf("\nExample 2b: cross-validation\n");
    coef2 = imsls_f_pls_regression(N, H, &y[0][0], N, P, &x[0][0],
        IMSLS_SCALE, 1,
        IMSLS_PRINT_LEVEL, iprint,
        IMSLS_PREDICTED, &yhat2,
        IMSLS_STD_ERRORS, &se2,
        0);
}
Output
Example 2a: no cross-validation, request 7 components.
Standard PLS Coefficients
1
-5.468
1.668
0.624
1.424
-2.550
4.870
4.871
PLS Coeff
1
-20.07
4.63
1.42
0.09
-7.27
20.93
0.46
Predicted Y
1
9.37
7.30
8.10
12.02
8.79
6.76
7.24
10.45
15.79
14.36
8.41
9.94
11.52
8.64
8.22
8.40
11.13
8.97
12.39
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 42.3
2 45.5
3 61.2
4 68.5
5 71.6
6 78.7
7 78.8
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.2
2 97.7
3 99.0
4 99.5
5 99.8
6 99.9
7 100.0
Std. Errors
1
13.13
6.72
1.84
0.20
4.68
14.30
0.33
Example 2b: cross-validation
Cross-validated results for response 1:
Comp PRESS
1 167.5
2 162.9
3 166.5
4 168.8
5 264.6
6 221.1
7 184.7
The best model has 2 component(s).
Standard PLS Coefficients
1
0.1598
0.2163
-0.1673
0.0095
-0.0136
0.1649
0.0294
PLS Coeff
1
0.5867
0.6000
-0.3797
0.0006
-0.0388
0.7089
0.0028
Predicted Y
1
9.86
7.71
7.35
11.02
8.32
7.46
9.32
9.00
12.09
12.09
6.59
11.11
12.46
10.27
9.02
9.51
12.82
10.69
11.09
Variance Analysis
=============================================
Pctge of Y variance explained
Component Cum. Pctge
1 42.3
2 45.5
=============================================
Pctge of X variance explained
Component Cum. Pctge
1 64.2
2 97.7
Std. Errors
1
0.2615
0.2029
0.1302
0.0041
0.2078
0.4279
0.0064
Warning Errors
IMSLS_PLS_REGRESSION_CONVERGED
The PLS regression algorithm converged in # iterations, but the number of requested PLS components is #. The number of computed PLS components is reduced to #.