From the scree plot it is not clear how many principal components we should use, maybe three? Let’s try to understand what these PCs might correspond to by investigating the loading matrix.
loading_mat <- round(pca1$rotation[, 1:3], 2)
Let’s start with the first component. This component has relatively low loadings on nearly all of the variables. Interestingly, all of the loadings (even the small ones) are negative. That is, every variable in the data set loads in the same direction.
To emphasize which features load most strongly on each PC, let’s suppress any loading values less than 0.2 in magnitude:
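A minimal sketch of one way to do this, continuing with the loading_mat object above (the 0.2 cutoff matches the text; loading_sup is just an assumed name):

loading_sup <- loading_mat                   # work on a copy of the rounded loadings
loading_sup[abs(loading_sup) < 0.2] <- 0     # zero out anything below 0.2 in magnitude
loading_sup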
We might interpret this component as something like “nutrient density”. We would expect any food item that scores low on this PC to be nutrient rich (high in calories, protein, fat, fiber, iron, magnesium, etc.), while anything scoring high would be a “low-energy” food (like water).
Let’s see which food items have the lowest scores on this PC:
head(nutrition[order(pca1$x[,1]),1])
[1] "Fruit-flavored drink, powder, with high vitamin C with other added vitamins, low calorie"
[2] "Leavening agents, yeast, baker's, active dry"
[3] "Malted drink mix, chocolate, with added nutrients, powder"
[4] "Rice bran, crude"
[5] "Spices, basil, dried"
[6] "Malted drink mix, natural, with added nutrients, powder"
and the highest scores:
tail(nutrition[order(pca1$x[,1]),1])
[1] "Carbonated beverage, low calorie, cola or pepper-type, with aspartame, without caffeine"
[2] "Tea, herb, chamomile, brewed"
[3] "Tea, herb, other than chamomile, brewed"
[4] "Carbonated beverage, low calorie, other than cola or pepper, with aspartame, contains caffeine"
[5] "Carbonated beverage, club soda"
[6] "Carbonated beverage, low calorie, other than cola or pepper, without caffeine"
At first glance these food items might seem a bit unintuitive, as most of the entries on both lists are drinks. BUT READ CLOSELY! The first, third, and sixth most “nutrient dense” items are actually POWDERED drink mixes. Since these nutritional values are all based on 100 g servings, by weight these powdered items are likely packed with sugar (hence very high in calories), and in some cases the description even mentions added vitamins. In fact, all of the top 6 are concentrated/dried items. On the other hand, the bottom six are liquid drinks rather than powdered ones, and these are specifically devoid of most nutrients.
Let’s move on to the second component. Judging by the loading matrix, it appears that any item scoring high on PC2 will be high in protein/fat/cholesterol/vitamin B12 and low in carbs/fiber/iron/magnesium/manganese/potassium. For the most part, it seems like a measure of protein-dense foods like meat and nuts. Let’s look at the top 6 and bottom 6 on this component:
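A sketch of that check, assuming the same pca1 object and that high PC2 scores correspond to the protein-dense direction:

tail(nutrition[order(pca1$x[,2]), 1])   # highest PC2 scores
head(nutrition[order(pca1$x[,2]), 1])   # lowest PC2 scores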
For the most part it seems like this interpretation is accurate.
Finally, it appears that anything scoring high on the third component will be low in calories and fat, but high in vitamins. So generally speaking, let’s just call it “vitamin-dense”.
[1] "Pork, fresh, backfat, raw"
[2] "Butter, whipped, with salt"
[3] "Butter, without salt"
[4] "Butter, salted"
[5] "Nuts, coconut meat, dried (desiccated), not sweetened"
[6] "Butter oil, anhydrous"
The bottom six are no-brainers (basically just fats), but the top six are again a bit strange. Again, considering these nutritional values are given per 100 g of food, it makes sense that powdered drink mixes and dehydrated vegetables would appear alongside beef liver (which is very rich in various vitamins and minerals).
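To verify the top six yourself, something like the following should work (a sketch, assuming high PC3 scores are the vitamin-dense end):

tail(nutrition[order(pca1$x[,3]), 1])   # highest PC3 scores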
PC Regression
Next, we will run PC regression. To set up a regression problem we’ll need a natural continuous response variable. Calories is likely the most natural choice, since it is one of the most common focal points in nutritional labeling. Since PCA is unsupervised, we will need to remove calories from the original PCA call. We’ll also remove the character variables food (the name of the food) and group (the food-group categorization):
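A sketch of that setup; the column names food, group, and calories, and the object names nut_X and pca2, are assumptions here (scaling is used since the variables are on very different scales):

nut_X <- nutrition[, !(names(nutrition) %in% c("food", "group", "calories"))]
pca2 <- prcomp(nut_X, scale. = TRUE)
screeplot(pca2, npcs = ncol(nut_X), type = "lines")   # show every component on the x-axis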
Note how we expanded the default x-axis of the plot using npcs. The scree plot gives no clear picture of how many components to retain, so let’s look at the summary:
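Continuing with the assumed pca2 object:

summary(pca2)   # standard deviations and (cumulative) proportion of variance per component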
Using the Kaiser criterion would suggest retaining 8 components; note, however, that these 8 PCs retain only approximately 65% of the variability in the data. Let’s store the associated scores on those 8 components in an object for future use.
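A sketch of that bookkeeping, again using the assumed pca2 object:

sum(pca2$sdev^2 > 1)                      # Kaiser criterion: components with eigenvalue > 1
scores8 <- as.data.frame(pca2$x[, 1:8])   # scores on the first 8 components, for later use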
Cool! Everything appears interesting and useful for modelling. Of course, our PCA was done unsupervised, and therefore without respect to what we were trying to model (calories). So perhaps PLS can help us improve…
PLS
To run PLS we will need the pls library:
# install.packages("pls")
library(pls)
Now, we will feed the function plsr a formula with calories as the response and all of the other numeric variables as predictors.
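A sketch of that call, reusing the assumed nut_X predictor set from the PCA step:

nut_pls <- data.frame(calories = nutrition$calories, nut_X)
pls1 <- plsr(calories ~ ., data = nut_pls, scale = TRUE)
summary(pls1)   # % variance explained in the predictors and in calories, per component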
This gives a very different picture… namely, it appears that 2 components are probably sufficient (cumulatively explaining just over 89% of the variation in calories). Compare that to PC regression, where 8 components explained only 79% of the variation in calories. Clearly an improvement. In fact, if we only take the first two components for PC regression…
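A sketch of that comparison, using the assumed pca2 and scores8 objects from above:

pcreg8 <- lm(nutrition$calories ~ ., data = scores8)          # regression on the 8 PC scores
pcreg2 <- lm(nutrition$calories ~ pca2$x[, 1] + pca2$x[, 2])  # regression on just 2 PC scores
summary(pcreg8)$r.squared
summary(pcreg2)$r.squared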
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
sum(carlm$residuals^2)
[1] 11353.52
library(neuralnet)
nn <- neuralnet(dist ~ speed, data = cars, hidden = 0)
plot(nn)
sum((predict(nn, cars) - cars$dist)^2)
[1] 11353.52
Note that the neuralnet package can only deal with numeric responses. If we have a binary response, we can approximate a classification scheme by pretending we’re modelling the probability (though of course it may predict values outside the realm of probability, i.e. < 0 or > 1). Also, neuralnet cannot take simplified formulas such as Response ~ . ; instead, one has to explicitly type out each response and predictor by name in the formula (there are tricks around this, but it’s aggravating regardless).
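One common trick is to paste the formula together from the column names; here dat and y are placeholder names, not objects from this analysis:

# Builds "y ~ x1 + x2 + ..." without typing every predictor by hand
f <- as.formula(paste("y ~", paste(setdiff(names(dat), "y"), collapse = " + ")))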
Fortunately, there are many neural network packages out there… nnet is another one. It has no built-in plotting function, so we also install NeuralNetTools for visualization purposes. Let’s look at the body example from class.
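A sketch of that fit; the column name Gender and the choice of 4 hidden nodes are assumptions, though 4 nodes is consistent with the 105 weights reported below (24 inputs: 4*(24+1) + (4+1) = 105):

library(nnet)
set.seed(1)   # nnet starts from random weights
nnbod2 <- nnet(factor(Gender) ~ ., data = body, size = 4)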
# weights: 105
initial value 356.704610
iter 10 value 8.078480
iter 20 value 0.807377
iter 30 value 0.019682
iter 40 value 0.001804
iter 50 value 0.000120
final value 0.000086
converged
table(body[,25], predict(nnbod2, type="class"))
1 2
0 260 0
1 0 247
library(NeuralNetTools)
plotnet(nnbod2)
0 misclassifications! But again, as discussed in class, this is probably overfitting… let’s set up a training set and a testing set.
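A sketch of one possible split and refit (the split size, the seed, and the names btest and nnbod3 are assumptions; btrain is the name used in the random forest call further down):

set.seed(1)
idx <- sample(nrow(body), 250)   # roughly half the rows for training
btrain <- body[idx, ]
btest  <- body[-idx, ]
nnbod3 <- nnet(factor(Gender) ~ ., data = btrain, size = 4)
table(btest$Gender, predict(nnbod3, btest, type = "class"))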
# weights: 105
initial value 169.109314
iter 10 value 3.552025
iter 20 value 2.533965
iter 30 value 1.921988
iter 40 value 1.399205
iter 50 value 1.391963
iter 60 value 1.389062
iter 70 value 1.386696
iter 80 value 1.386382
final value 1.386325
converged
In lecture, we left it at that. But we can easily set up a loop to investigate whether there’s a better choice for the number of hidden-layer nodes according to the testing error. Since it fits relatively quickly, let’s make a quick loop. Note that nnet prints a progress trace every time it is fitted; the argument trace=FALSE silences it.
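A sketch of such a loop, using the btrain/btest split above and trying 1 through 10 hidden nodes:

test_err <- numeric(10)
for (k in 1:10) {
  fit  <- nnet(factor(Gender) ~ ., data = btrain, size = k, trace = FALSE)
  pred <- predict(fit, btest, type = "class")
  test_err[k] <- mean(pred != btest$Gender)   # test-set misclassification rate
}
test_err
which.min(test_err)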
From this fit, it appears the best neural net (ignoring changing activation functions, number of hidden layers, or any other tuning parameters) will result in a predicted misclassification rate around 1.6%. We can compare this to LDA…
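A sketch of the LDA comparison on the same split:

library(MASS)
ldafit  <- lda(factor(Gender) ~ ., data = btrain)
ldapred <- predict(ldafit, btest)$class
mean(ldapred != btest$Gender)   # LDA test-set misclassification rate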
…which appears nearly equivalent to the neural net in terms of predictive power on this data. Importantly though, LDA provides us a structure for inference as well (we can investigate the estimated mean vector of each group, the covariances, decision boundary, etc). What about random forests…
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Call:
randomForest(formula = factor(Gender) ~ ., data = data.frame(btrain))
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 5.2%
Confusion matrix:
1 2 class.error
1 127 7 0.05223881
2 6 110 0.05172414
Once more, since the output provides OOB error rates, these are believable estimates for future predictions. That said, let’s predict on the test set for a direct comparison to the LDA and NN models…
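A sketch of that comparison; the name rfbod for the fitted forest is an assumption (the call above does not show how it was stored):

rfpred <- predict(rfbod, newdata = data.frame(btest))
table(btest$Gender, rfpred)
mean(rfpred != btest$Gender)   # random forest test-set misclassification rate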
According to the Kaiser criterion, we would retain 2 components, which explain 78% of the variation in the data. We can view the data projected onto these components in a biplot…
biplot(pcard)
Note that this is equivalent to plotting the scores (which are the projected/rotated data). We can verify this by simply plotting:
plot(pcard$x[,1:2])
Of course, the biplot labels the data by observation number… easy enough to replicate ourselves.
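A sketch of labelling the score plot by observation number:

plot(pcard$x[, 1:2], type = "n")                # empty frame with the right axes
text(pcard$x[, 1:2], labels = 1:nrow(pcard$x))  # label each car by its row number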
Let’s actually take a look at the component loadings (eigenvectors), which provide the coefficients of the original variables. We’ll round them so that they are a bit more easily interpreted:
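Continuing with the pcard object:

round(pcard$rotation[, 1:2], 2)   # loadings for the first two components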
Focusing on the first component, we see that the magnitudes of the coefficients are pretty similar across the board. But note that while most are positive-valued, there are several negative-valued coefficients. If we had to summarize them, most of the positively weighted measurements have something to do with (or are positively correlated with) the size of the vehicle, while the negative ones relate to fuel efficiency (which is generally negatively correlated with size). As such, we can broadly interpret this first component as the size of the car.
The second component is a bit less clear. Let’s do our best to focus on the larger magnitudes.
So we have high positive loadings for price, horsepower, and RPM, along with large negative loadings for MPG, rear-seat room, and luggage space. What would you envision for a car that scores high on this component? Probably sports cars. We could therefore consider this a measure of ‘sportiness’… relatively small, expensive cars with strong engines.
This makes some sense from a rotation standpoint. If the bulk of the variation (the first component) measures cars by their size contrasted with fuel efficiency, sports cars would not ‘fit in’ with the rest of the data on that component, as they are small cars that are not fuel efficient.
As a proof of concept on these components, here are the four cars that score highest on PC2 (sportiness):
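One way to pull these (a sketch, assuming card is the original data frame with Manufacturer, Model, and Type in its first three columns):

head(card[order(pcard$x[, 2], decreasing = TRUE), 1:3], 4)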
Manufacturer Model Type
24 Dodge Stealth Sporty
45 Lexus SC300 Midsize
43 Infiniti Q45 Midsize
52 Mercedes-Benz 300E Midsize
Only the first is actually designated as ‘sporty’, though the next three are all expensive, luxury midsize vehicles. Perhaps we misinterpreted slightly with ‘sportiness’ and should rather designate this component as ‘luxuriousness’…but either way, cars that score highly on it will be expensive and fuel inefficient relative to their actual size.
Now that we have the data projected onto the principal components, there is nothing stopping us from moving forward with other analyses we have discussed during this course. For example, let’s perform some clustering on PCA data.
Clustering on PCA
First, let’s show that if we do NOT reduce components, then the distances between observations are retained.
The dendrograms are identical. But frankly, we don’t need to eyeball it; we can have R compare the distance matrices directly. Note that the scale call adds a bunch of attributes that won’t be held in the prcomp object, so we set check.attributes=FALSE so that we are only comparing the pairwise distances contained in the objects…
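A sketch of the check, followed by clustering on just the two retained components (linkage is left at the hclust default; the name pcclust matches the object used below):

all.equal(dist(scale(card_num)), dist(pcard$x), check.attributes = FALSE)
pcclust <- hclust(dist(pcard$x[, 1:2]))   # cluster on the first two PC scores only
plot(pcclust)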
This dendrogram suggests two groups. Let’s compare them to car types to see what falls out…
table(card$Type, cutree(pcclust,2))
1 2
Compact 13 3
Large 0 11
Midsize 2 20
Small 21 0
Sporty 9 3
The classification table shows that the predominant group structure on the first two principal components separates the smaller cars (Compact, Small, Sporty) from the larger cars (Midsize, Large). This matches up well with the interpretation of the first principal component (which is where the bulk of the variance lies).