Lecture 18: Principal Component Regression and Partial Least Squares
University of British Columbia Okanagan
PCA (once you’ve removed some components) is generally viewed as a dimensionality reduction technique.
Most commonly, you’ll be interested in taking that reduced space and using it for more machine learning, be it supervised or unsupervised!
Principal Components Regression, or PCReg for short, is the most natural (though not necessarily the most useful) way to make use of PCA in a supervised context.
Suppose we perform PCA on \(\mathbf{X}\), which is an \(n \times p\) matrix of observations (on \(p\) predictors), and decide to retain \(M\) components. That is, \[\boldsymbol{\Sigma} \approx \mathbf{P} \boldsymbol{\Lambda} \mathbf{P}'\] where \(\boldsymbol{\Sigma}\) is the \(p \times p\) covariance/correlation matrix of the predictors, \(\mathbf{P}\) is a \(p \times M\) matrix of eigenvectors (loadings), and \(\boldsymbol{\Lambda}\) is an \(M \times M\) diagonal matrix holding the \(M\) largest eigenvalues.
We then map our \(n \times p\) matrix \(\mathbf{X}\) to an \(n \times M\) matrix \(\mathbf{S}\) of scores via \(\mathbf{S} = \mathbf{X}\mathbf{P}\).
If we have a continuous response variable \(Y\), we can then assume the linear regression model \[Y = \beta_0 + \beta_1 S_1 + \beta_2 S_2 + \cdots + \beta_M S_M + \epsilon\] with all the usual linear regression assumptions (for inferential purposes).
And that’s it! That is PCReg: using your retained principal components as predictors for a continuous response.
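In R, a minimal PCReg recipe might look like the sketch below, where X, y, and M are placeholders for a predictor matrix, a response vector, and a chosen number of components (they are not objects from the lecture code).
pca <- prcomp(X, scale. = TRUE)   # PCA on the predictors
S <- pca$x[, 1:M]                 # scores on the first M components
pcreg <- lm(y ~ S)                # PCReg: regress y on the retained scores
summary(pcreg)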
Source: ISLR Figure 6.18. PCR was applied to two simulated data sets; in each panel, the horizontal dashed line represents the irreducible error.
If \(M=p\) (aka, you retain all PCs), then the model is equivalent to ordinary least squares on the original \(p\) predictors.
Obviously the estimated coefficients differ, but your predictions \(\hat Y\) would be the same.
PCReg can be viewed as a discretized regularizer, removing low variance components (but not entire predictors).
So it has some commonality with ridge regression, but because ridge regression’s regularization shrinks coefficients smoothly rather than discarding whole components, it is often the better option.
There is absolutely no guarantee that higher variance components (variance here is with respect to the predictor space) contain more predictive power for \(Y\)!
Let’s add a \(Y\) response variable
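One way such a simulation could be set up is sketched below; the seed, sample size, and standard deviations are assumptions chosen to mimic the output that follows, not the lecture’s actual code.
set.seed(2023)                         # assumed seed
n <- 40
z1 <- rnorm(n, sd = 6)                 # high-variance direction
z2 <- rnorm(n, sd = 0.4)               # low-variance direction
x1 <- (z1 - 2*z2)/sqrt(5)              # rotate so PC1 lies along (1, 2)
x2 <- (2*z1 + z2)/sqrt(5)              #   and PC2 lies along (-2, 1)
y <- -2*x1 + x2 + rnorm(n, sd = 0.1)   # y depends only on the low-variance direction
pca <- prcomp(cbind(x1, x2))           # unscaled PCA, as in the first attempt below
PC1_scores <- pca$x[, 1]
PC2_scores <- pca$x[, 2]
summary(lm(y ~ x1 + x2))               # both predictors highly significant
summary(lm(y ~ PC1_scores))            # the high-variance PC explains almost nothing
summary(lm(y ~ PC2_scores))            # the low-variance PC captures nearly all the signal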
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-0.26489 -0.07545 -0.01078 0.08809 0.21513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007547 0.038730 0.195 0.847
x1 -2.040311 0.043519 -46.883 <2e-16 ***
x2 1.020113 0.022127 46.103 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1199 on 37 degrees of freedom
Multiple R-squared: 0.9835, Adjusted R-squared: 0.9826
F-statistic: 1100 on 2 and 37 DF, p-value: < 2.2e-16
Call:
lm(formula = y ~ PC1_scores)
Residuals:
Min 1Q Median 3Q Max
-1.9419 -0.7712 0.2144 0.6891 1.6439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.14517 0.584 0.562
PC1_scores 0.00834 0.02419 0.345 0.732
Residual standard error: 0.9181 on 38 degrees of freedom
Multiple R-squared: 0.003117, Adjusted R-squared: -0.02312
F-statistic: 0.1188 on 1 and 38 DF, p-value: 0.7322
Call:
lm(formula = y ~ PC2_scores)
Residuals:
Min 1Q Median 3Q Max
-0.261207 -0.080209 -0.007321 0.081190 0.261425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.02039 4.161 0.000174 ***
PC2_scores 2.28110 0.05241 43.527 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1289 on 38 degrees of freedom
Multiple R-squared: 0.9803, Adjusted R-squared: 0.9798
F-statistic: 1895 on 1 and 38 DF, p-value: < 2.2e-16
Note that scaling does not fix the problem …
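A brief sketch of the scaled version (again assuming the simulated x1, x2, and y from the earlier sketch):
spca <- prcomp(cbind(x1, x2), scale. = TRUE)   # PCA on standardized predictors
scaledPC1_scores <- spca$x[, 1]
scaledPC2_scores <- spca$x[, 2]
summary(lm(y ~ scaledPC1_scores))
summary(lm(y ~ scaledPC2_scores))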
Call:
lm(formula = y ~ scaledPC1_scores)
Residuals:
Min 1Q Median 3Q Max
-2.0009 -0.7343 0.2586 0.7136 1.5962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.14463 0.587 0.561
scaledPC1_scores 0.06621 0.10390 0.637 0.528
Residual standard error: 0.9147 on 38 degrees of freedom
Multiple R-squared: 0.01057, Adjusted R-squared: -0.01546
F-statistic: 0.4061 on 1 and 38 DF, p-value: 0.5278
Call:
lm(formula = y ~ scaledPC2_scores)
Residuals:
Min 1Q Median 3Q Max
-0.257674 -0.126315 0.009366 0.095724 0.306733
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.02394 3.543 0.00107 **
scaledPC2_scores 7.89705 0.21389 36.921 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1514 on 38 degrees of freedom
Multiple R-squared: 0.9729, Adjusted R-squared: 0.9722
F-statistic: 1363 on 1 and 38 DF, p-value: < 2.2e-16
A more proper approach is to somehow consider the response variable \(Y\) while performing the PCA in the first place!
Partial Least Squares (PLS) does exactly this.
It seeks linear combinations of the predictors that have high variance AND high correlation with \(Y\).
Just like PCA, it is not scale invariant, so the predictors are generally scaled to have mean 0 and variance 1.
From Elements of Statistical Learning (for the interested reader only)
We do this in R using the plsr function from the pls package.
Its method options include:
- "kernelpls" (default): kernelized version of PLS.
- "simpls": uses the SIMPLS (SIMultaneous PLS) algorithm.
- "oscorespls": computes orthogonal scores during the PLS iterations.
- "simpls.linear": a variant of SIMPLS that ensures the weights of the X and Y variables are orthogonal.
Notice that the % variance explained in the PLS output is equivalent to the R-squared value obtained when regressing y on the first two PLS components:
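A sketch of the calls that would produce the output below, assuming plsmod is a PLS fit to the simulated x1 and x2 (the exact call is an assumption):
library(pls)
plsmod <- plsr(y ~ x1 + x2, scale = TRUE)                  # PLS on the simulated predictors
summary(plsmod)                                            # % variance explained per component
summary(lm(y ~ plsmod$scores[, 1] + plsmod$scores[, 2]))   # R-squared matches the % variance explained for y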
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.007e-15 1.895e-02 0.000 1
plsmod$scores[, 1] 2.935e-02 4.800e-03 6.114 4.4e-07 ***
plsmod$scores[, 2] 1.491e+00 3.206e-02 46.496 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1199 on 37 degrees of freedom
Multiple R-squared: 0.9835, Adjusted R-squared: 0.9826
F-statistic: 1100 on 2 and 37 DF, p-value: < 2.2e-16
...
# remove name, calories (our Y response), and group
nupca <- prcomp(nutrition[,-c(1,2, 28)], scale.=TRUE)
summary(nupca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0905 1.6613 1.45197 1.2787 1.24728 1.17387 1.13369
Proportion of Variance 0.1748 0.1104 0.08433 0.0654 0.06223 0.05512 0.05141
Cumulative Proportion 0.1748 0.2852 0.36953 0.4349 0.49715 0.55227 0.60368
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 1.07393 0.98081 0.95224 0.92905 0.85969 0.83157 0.78987
Proportion of Variance 0.04613 0.03848 0.03627 0.03453 0.02956 0.02766 0.02496
Cumulative Proportion 0.64982 0.68830 0.72457 0.75909 0.78865 0.81631 0.84127
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 0.77604 0.76121 0.70758 0.64976 0.62872 0.59025 0.5678
Proportion of Variance 0.02409 0.02318 0.02003 0.01689 0.01581 0.01394 0.0129
Cumulative Proportion 0.86536 0.88854 0.90856 0.92545 0.94126 0.95520 0.9681
PC22 PC23 PC24 PC25
Standard deviation 0.53484 0.46515 0.40082 0.36685
Proportion of Variance 0.01144 0.00865 0.00643 0.00538
Cumulative Proportion 0.97954 0.98819 0.99462 1.00000
If we wanted to retain roughly 65% of the variation, then we would need to keep 8 components.
Now we can run PCReg for calories as a response…
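A sketch of the fit, assuming scr holds the first 8 columns of the score matrix from nupca:
scr <- nupca$x[, 1:8]                      # scores on the first 8 PCs
summary(lm(nutrition$calories ~ scr))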
Call:
lm(formula = nutrition$calories ~ scr)
Residuals:
Min 1Q Median 3Q Max
-624.91 -40.83 -10.22 29.49 236.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.1200 1.5068 133.475 < 2e-16 ***
scrPC1 35.9805 0.7210 49.907 < 2e-16 ***
scrPC2 5.0377 0.9072 5.553 3.15e-08 ***
scrPC3 51.3592 1.0380 49.479 < 2e-16 ***
scrPC4 26.6879 1.1787 22.642 < 2e-16 ***
scrPC5 -55.3065 1.2083 -45.770 < 2e-16 ***
scrPC6 20.3169 1.2839 15.824 < 2e-16 ***
scrPC7 28.3220 1.3294 21.304 < 2e-16 ***
scrPC8 22.3124 1.4034 15.899 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.42 on 2175 degrees of freedom
Multiple R-squared: 0.7969, Adjusted R-squared: 0.7962
F-statistic: 1067 on 8 and 2175 DF, p-value: < 2.2e-16
With PLS, on the other hand, it seems only 2 components are needed to explain 89% of the variation in calories!
nuplsmod <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE, method="oscorespls")
summary(nuplsmod)
Data: X dimension: 2184 25
Y dimension: 2184 1
Fit method: oscorespls
Number of components considered: 25
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 14.53 24.29 29.77 34.48 42.06 47.18 51.29
calories 69.29 89.08 94.64 97.56 98.41 99.10 99.35
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 53.93 58.15 62.05 65.75 68.60 71.63 74.91
calories 99.48 99.51 99.51 99.51 99.52 99.52 99.52
15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
X 77.47 79.78 82.71 84.67 86.75 89.30 92.12
calories 99.52 99.52 99.52 99.52 99.52 99.52 99.52
22 comps 23 comps 24 comps 25 comps
X 94.66 96.10 97.89 100.00
calories 99.52 99.52 99.52 99.52
The plsr function’s validation argument can be "none" (default), "CV", or "LOO"; see ?plsr.
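For example, a cross-validated fit can be inspected roughly as follows (a sketch; the object name cvpls is an assumption):
cvpls <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE,
              method="oscorespls", validation="CV")
validationplot(cvpls, val.type = "RMSEP")   # cross-validated RMSEP vs number of components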
The minimum RMSEP occurs with 10 PLS components.
Data: X dimension: 2184 25
Y dimension: 2184 1
Fit method: oscorespls
Number of components considered: 25
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 156 93.99 57.90 41.57 30.76 22.10 17.64
adjCV 156 93.11 57.29 41.04 30.15 22.14 17.42
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 13.48 12.82 12.18 12.09 12.13 12.12 12.15
adjCV 13.42 12.67 12.09 12.01 12.04 12.04 12.07
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
CV 12.15 12.14 12.14 12.13 12.13 12.13 12.13
adjCV 12.06 12.05 12.05 12.05 12.05 12.05 12.05
21 comps 22 comps 23 comps 24 comps 25 comps
CV 12.13 12.13 12.13 12.13 12.13
adjCV 12.05 12.05 12.05 12.05 12.05
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 14.53 24.29 29.77 34.48 42.06 47.18 51.29
calories 69.29 89.08 94.64 97.56 98.41 99.10 99.35
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 53.93 58.15 62.05 65.75 68.60 71.63 74.91
calories 99.48 99.51 99.51 99.51 99.52 99.52 99.52
15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
X 77.47 79.78 82.71 84.67 86.75 89.30 92.12
calories 99.52 99.52 99.52 99.52 99.52 99.52 99.52
22 comps 23 comps 24 comps 25 comps
X 94.66 96.10 97.89 100.00
calories 99.52 99.52 99.52 99.52
Compared with the 2-component PCReg solution (shown below), this is quite an improvement!
Call:
lm(formula = nutrition$calories ~ scr[, 1:2])
Residuals:
Min 1Q Median 3Q Max
-887.77 -92.51 -45.45 74.04 646.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.120 2.920 68.888 <2e-16 ***
scr[, 1:2]PC1 35.981 1.397 25.758 <2e-16 ***
scr[, 1:2]PC2 5.038 1.758 2.866 0.0042 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 136.4 on 2181 degrees of freedom
Multiple R-squared: 0.2355, Adjusted R-squared: 0.2348
F-statistic: 335.8 on 2 and 2181 DF, p-value: < 2.2e-16
Once we finalize the model, we can make predictions on new observations.
set.seed(1); n <- nrow(nutrition)
index <- sample(1:n, 0.7*n)
training_set <- nutrition[index, ]
testing_set <- nutrition[-index, ]
cv_mod <- plsr(calories~., data=training_set[,-c(1,28)],
               scale=TRUE, method="oscorespls", validation="CV")
# summary(cv_mod) # <--------- suggests 10 components (again)
pls_pred <- predict(cv_mod, testing_set, ncomp=10)
# calculate test RMSE
y_test = testing_set$calories
sqrt(mean((pls_pred - y_test)^2))
[1] 12.24655
Comments
Clearly PCReg has some issues that need fixing…
One simple approach is to instead consider all \(p\) PCs and then perform standard stepwise regression (or another variable selection or regularization technique); a sketch of this idea follows below.
This would fix the issue in our simulation, but would not address every potential problem.
Leaving the problems aside for a moment, note that PCs can be used as predictors in any supervised context (random forests, neural nets, discriminant analysis, etc.).
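A minimal sketch of that stepwise idea, using objects from the earlier simulation sketch (the names are assumptions):
allPCs <- as.data.frame(pca$x)             # all p PC scores as a data frame
full <- lm(y ~ ., data = allPCs)           # regress y on every PC
step(full, direction = "both", trace = 0)  # stepwise selection; here it would likely drop PC1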