Lecture 18: Principal Component Regression and Partial Least Squares
University of British Columbia Okanagan
PCA (once you’ve removed some components) is generally viewed as a dimensionality reduction technique.
Most commonly, you’ll be interested in taking that reduced space and using it for more machine learning, be it supervised or unsupervised!
Principal Components Regression, or PCReg for short, is the most natural (though not necessarily the most useful) way to make use of PCA in a supervised context.
Suppose we perform PCA on \(\mathbf{X}\), which is an \(n \times p\) matrix of observations (on \(p\) predictors), and decide to retain \(M\) components. That is, \[\boldsymbol{\Sigma} \approx \mathbf{P} \boldsymbol{\Lambda} \mathbf{P}'\] where \(\boldsymbol{\Sigma}\) is the \(p \times p\) covariance/correlation matrix of the predictors, \(\mathbf{P}\) is a \(p \times M\) matrix of eigenvectors (loadings), and \(\boldsymbol{\Lambda}\) is an \(M \times M\) diagonal matrix holding the \(M\) largest eigenvalues.
We then map our \(n \times p\) matrix \(\mathbf{X}\) to an \(n \times M\) matrix \(\mathbf{S}\) of scores via \(\mathbf{S} = \mathbf{X}\mathbf{P}\).
If we have a continuous response variable \(Y\), we can then assume the linear regression model \[Y = \beta_0 + \beta_1 S_1 + \beta_2 S_2 + \cdots + \beta_M S_M + \epsilon\] with all the usual linear regression assumptions (for inferential purposes).
And that’s it! That is PCReg: using your retained principal components as predictors for a continuous response.
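In R, a minimal PCReg recipe might look like the sketch below, where X, y, and M are placeholders for a predictor matrix, a response vector, and a chosen number of components (they are not objects from the lecture code).
pca <- prcomp(X, scale. = TRUE)   # PCA on the predictors
S <- pca$x[, 1:M]                 # scores on the first M components
pcreg <- lm(y ~ S)                # PCReg: regress y on the retained scores
summary(pcreg)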
Source: ISLR Figure 6.18. PCR was applied to two simulated data sets; in each panel, the horizontal dashed line represents the irreducible error.
If \(M=p\) (aka, you retain all PCs), then the model is equivalent to ordinary least squares on the original \(p\) predictors.
Obviously the estimated coefficients differ, but your predictions \(\hat Y\) would be the same.
PCReg can be viewed as a discretized regularizer, removing low variance components (but not entire predictors).
So it has some commonality with ridge regression, but because ridge regression’s regularization shrinks coefficients smoothly rather than discarding whole components, it is often the better option.
There is absolutely no guarantee that higher variance components (variance here is with respect to the predictor space) contain more predictive power for \(Y\)!
Let’s add a \(Y\) response variable
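One way such a simulation could be set up is sketched below; the seed, sample size, and standard deviations are assumptions chosen to mimic the output that follows, not the lecture’s actual code.
set.seed(2023)                         # assumed seed
n <- 40
z1 <- rnorm(n, sd = 6)                 # high-variance direction
z2 <- rnorm(n, sd = 0.4)               # low-variance direction
x1 <- (z1 - 2*z2)/sqrt(5)              # rotate so PC1 lies along (1, 2)
x2 <- (2*z1 + z2)/sqrt(5)              #   and PC2 lies along (-2, 1)
y <- -2*x1 + x2 + rnorm(n, sd = 0.1)   # y depends only on the low-variance direction
pca <- prcomp(cbind(x1, x2))           # unscaled PCA, as in the first attempt below
PC1_scores <- pca$x[, 1]
PC2_scores <- pca$x[, 2]
summary(lm(y ~ x1 + x2))               # both predictors highly significant
summary(lm(y ~ PC1_scores))            # the high-variance PC explains almost nothing
summary(lm(y ~ PC2_scores))            # the low-variance PC captures nearly all the signal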
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-0.26489 -0.07545 -0.01078 0.08809 0.21513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007547 0.038730 0.195 0.847
x1 -2.040311 0.043519 -46.883 <2e-16 ***
x2 1.020113 0.022127 46.103 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1199 on 37 degrees of freedom
Multiple R-squared: 0.9835, Adjusted R-squared: 0.9826
F-statistic: 1100 on 2 and 37 DF, p-value: < 2.2e-16
Call:
lm(formula = y ~ PC1_scores)
Residuals:
Min 1Q Median 3Q Max
-1.9419 -0.7712 0.2144 0.6891 1.6439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.14517 0.584 0.562
PC1_scores 0.00834 0.02419 0.345 0.732
Residual standard error: 0.9181 on 38 degrees of freedom
Multiple R-squared: 0.003117, Adjusted R-squared: -0.02312
F-statistic: 0.1188 on 1 and 38 DF, p-value: 0.7322
Call:
lm(formula = y ~ PC2_scores)
Residuals:
Min 1Q Median 3Q Max
-0.261207 -0.080209 -0.007321 0.081190 0.261425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.02039 4.161 0.000174 ***
PC2_scores 2.28110 0.05241 43.527 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1289 on 38 degrees of freedom
Multiple R-squared: 0.9803, Adjusted R-squared: 0.9798
F-statistic: 1895 on 1 and 38 DF, p-value: < 2.2e-16
Note that scaling does not fix the problem …
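A brief sketch of the scaled version (again assuming the simulated x1, x2, and y from the earlier sketch):
spca <- prcomp(cbind(x1, x2), scale. = TRUE)   # PCA on standardized predictors
scaledPC1_scores <- spca$x[, 1]
scaledPC2_scores <- spca$x[, 2]
summary(lm(y ~ scaledPC1_scores))
summary(lm(y ~ scaledPC2_scores))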
Call:
lm(formula = y ~ scaledPC1_scores)
Residuals:
Min 1Q Median 3Q Max
-2.0009 -0.7343 0.2586 0.7136 1.5962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.14463 0.587 0.561
scaledPC1_scores 0.06621 0.10390 0.637 0.528
Residual standard error: 0.9147 on 38 degrees of freedom
Multiple R-squared: 0.01057, Adjusted R-squared: -0.01546
F-statistic: 0.4061 on 1 and 38 DF, p-value: 0.5278
Call:
lm(formula = y ~ scaledPC2_scores)
Residuals:
Min 1Q Median 3Q Max
-0.257674 -0.126315 0.009366 0.095724 0.306733
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08484 0.02394 3.543 0.00107 **
scaledPC2_scores 7.89705 0.21389 36.921 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1514 on 38 degrees of freedom
Multiple R-squared: 0.9729, Adjusted R-squared: 0.9722
F-statistic: 1363 on 1 and 38 DF, p-value: < 2.2e-16
A more proper approach is to somehow consider the response variable \(Y\) while performing the PCA in the first place!
Partial Least Squares (PLS) does exactly this.
It seeks linear combinations of the predictors that have high variance AND high correlation with \(Y\).
Just like PCA, it is not scale invariant, so the predictors are generally scaled to have mean 0 and variance 1.
From Elements of Statistical Learning (for the interested reader only)
We do this in R using the plsr function from the pls package.
Its method options include:
- "kernelpls" (default): kernelized version of PLS.
- "simpls": uses the SIMPLS (SIMultaneous PLS) algorithm.
- "oscorespls": computes orthogonal scores during the PLS iterations.
- "simpls.linear": a variant of SIMPLS that ensures the weights of the X and Y variables are orthogonal.
Notice that the % variance explained in the PLS output is equivalent to the R-squared value obtained when regressing y on the first two PLS components:
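A sketch of the calls that would produce the output below, assuming plsmod is a PLS fit to the simulated x1 and x2 (the exact call is an assumption):
library(pls)
plsmod <- plsr(y ~ x1 + x2, scale = TRUE)                  # PLS on the simulated predictors
summary(plsmod)                                            # % variance explained per component
summary(lm(y ~ plsmod$scores[, 1] + plsmod$scores[, 2]))   # R-squared matches the % variance explained for y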
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.007e-15 1.895e-02 0.000 1
plsmod$scores[, 1] 2.935e-02 4.800e-03 6.114 4.4e-07 ***
plsmod$scores[, 2] 1.491e+00 3.206e-02 46.496 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1199 on 37 degrees of freedom
Multiple R-squared: 0.9835, Adjusted R-squared: 0.9826
F-statistic: 1100 on 2 and 37 DF, p-value: < 2.2e-16
...
# remove name, calories (our Y response), and group
nupca <- prcomp(nutrition[,-c(1,2, 28)], scale.=TRUE)
summary(nupca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0905 1.6613 1.45197 1.2787 1.24728 1.17387 1.13369
Proportion of Variance 0.1748 0.1104 0.08433 0.0654 0.06223 0.05512 0.05141
Cumulative Proportion 0.1748 0.2852 0.36953 0.4349 0.49715 0.55227 0.60368
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 1.07393 0.98081 0.95224 0.92905 0.85969 0.83157 0.78987
Proportion of Variance 0.04613 0.03848 0.03627 0.03453 0.02956 0.02766 0.02496
Cumulative Proportion 0.64982 0.68830 0.72457 0.75909 0.78865 0.81631 0.84127
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 0.77604 0.76121 0.70758 0.64976 0.62872 0.59025 0.5678
Proportion of Variance 0.02409 0.02318 0.02003 0.01689 0.01581 0.01394 0.0129
Cumulative Proportion 0.86536 0.88854 0.90856 0.92545 0.94126 0.95520 0.9681
PC22 PC23 PC24 PC25
Standard deviation 0.53484 0.46515 0.40082 0.36685
Proportion of Variance 0.01144 0.00865 0.00643 0.00538
Cumulative Proportion 0.97954 0.98819 0.99462 1.00000
If we wanted to retain roughly 65% of the variation, then we would need to keep 8 components.
Now we can run PCReg for calories as a response…
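A sketch of the fit, assuming scr holds the first 8 columns of the score matrix from nupca:
scr <- nupca$x[, 1:8]                      # scores on the first 8 PCs
summary(lm(nutrition$calories ~ scr))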
Call:
lm(formula = nutrition$calories ~ scr)
Residuals:
Min 1Q Median 3Q Max
-624.91 -40.83 -10.22 29.49 236.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.1200 1.5068 133.475 < 2e-16 ***
scrPC1 35.9805 0.7210 49.907 < 2e-16 ***
scrPC2 5.0377 0.9072 5.553 3.15e-08 ***
scrPC3 51.3592 1.0380 49.479 < 2e-16 ***
scrPC4 26.6879 1.1787 22.642 < 2e-16 ***
scrPC5 -55.3065 1.2083 -45.770 < 2e-16 ***
scrPC6 20.3169 1.2839 15.824 < 2e-16 ***
scrPC7 28.3220 1.3294 21.304 < 2e-16 ***
scrPC8 22.3124 1.4034 15.899 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.42 on 2175 degrees of freedom
Multiple R-squared: 0.7969, Adjusted R-squared: 0.7962
F-statistic: 1067 on 8 and 2175 DF, p-value: < 2.2e-16
With PLS, on the other hand, it seems only 2 components are needed to explain 89% of the variation in calories!
nuplsmod <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE, method="oscorespls")
summary(nuplsmod)
Data: X dimension: 2184 25
Y dimension: 2184 1
Fit method: oscorespls
Number of components considered: 25
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 14.53 24.29 29.77 34.48 42.06 47.18 51.29
calories 69.29 89.08 94.64 97.56 98.41 99.10 99.35
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 53.93 58.15 62.05 65.75 68.60 71.63 74.91
calories 99.48 99.51 99.51 99.51 99.52 99.52 99.52
15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
X 77.47 79.78 82.71 84.67 86.75 89.30 92.12
calories 99.52 99.52 99.52 99.52 99.52 99.52 99.52
22 comps 23 comps 24 comps 25 comps
X 94.66 96.10 97.89 100.00
calories 99.52 99.52 99.52 99.52
The plsr function’s validation argument can be "none" (default), "CV", or "LOO"; see ?plsr.
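For example, a cross-validated fit can be inspected roughly as follows (a sketch; the object name cvpls is an assumption):
cvpls <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE,
              method="oscorespls", validation="CV")
validationplot(cvpls, val.type = "RMSEP")   # cross-validated RMSEP vs number of components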
The minimum RMSEP occurs with 10 PLS components.
Data: X dimension: 2184 25
Y dimension: 2184 1
Fit method: oscorespls
Number of components considered: 25
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 156 93.99 57.90 41.57 30.76 22.10 17.64
adjCV 156 93.11 57.29 41.04 30.15 22.14 17.42
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 13.48 12.82 12.18 12.09 12.13 12.12 12.15
adjCV 13.42 12.67 12.09 12.01 12.04 12.04 12.07
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
CV 12.15 12.14 12.14 12.13 12.13 12.13 12.13
adjCV 12.06 12.05 12.05 12.05 12.05 12.05 12.05
21 comps 22 comps 23 comps 24 comps 25 comps
CV 12.13 12.13 12.13 12.13 12.13
adjCV 12.05 12.05 12.05 12.05 12.05
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 14.53 24.29 29.77 34.48 42.06 47.18 51.29
calories 69.29 89.08 94.64 97.56 98.41 99.10 99.35
8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
X 53.93 58.15 62.05 65.75 68.60 71.63 74.91
calories 99.48 99.51 99.51 99.51 99.52 99.52 99.52
15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
X 77.47 79.78 82.71 84.67 86.75 89.30 92.12
calories 99.52 99.52 99.52 99.52 99.52 99.52 99.52
22 comps 23 comps 24 comps 25 comps
X 94.66 96.10 97.89 100.00
calories 99.52 99.52 99.52 99.52
Compared with the 2-component PCReg solution (shown below), this is quite an improvement!
Call:
lm(formula = nutrition$calories ~ scr[, 1:2])
Residuals:
Min 1Q Median 3Q Max
-887.77 -92.51 -45.45 74.04 646.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.120 2.920 68.888 <2e-16 ***
scr[, 1:2]PC1 35.981 1.397 25.758 <2e-16 ***
scr[, 1:2]PC2 5.038 1.758 2.866 0.0042 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 136.4 on 2181 degrees of freedom
Multiple R-squared: 0.2355, Adjusted R-squared: 0.2348
F-statistic: 335.8 on 2 and 2181 DF, p-value: < 2.2e-16
Once we finalize the model, we can make predictions on new observations.
set.seed(1); n <- nrow(nutrition)
index <- sample(1:n, 0.7*n)
training_set <- nutrition[index, ]
testing_set <- nutrition[-index, ]
cv_mod <- plsr(calories~., data=training_set[,-c(1,28)],
               scale=TRUE, method="oscorespls", validation="CV")
# summary(cv_mod) # <--------- suggests 10 components (again)
pls_pred <- predict(cv_mod, testing_set, ncomp=10)
# calculate test RMSE
y_test = testing_set$calories
sqrt(mean((pls_pred - y_test)^2))
[1] 12.24655
Comments
Clearly PCReg has some issues that need fixing…
One simple approach is to instead consider all \(p\) PCs and then perform standard stepwise regression (or another variable selection or regularization technique); a sketch of this idea follows below.
This would fix the issue in our simulation, but would not address every potential problem.
Leaving the problems aside for a moment, note that PCs can be used as predictors in any supervised context (random forests, neural nets, discriminant analysis, etc.).
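A minimal sketch of that stepwise idea, using objects from the earlier simulation sketch (the names are assumptions):
allPCs <- as.data.frame(pca$x)             # all p PC scores as a data frame
full <- lm(y ~ ., data = allPCs)           # regress y on every PC
step(full, direction = "both", trace = 0)  # stepwise selection; here it would likely drop PC1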