In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice the syntax of lda(): like lm() (linear regression) and glm() (as seen in logistic regression), this function accepts a formula as its first argument:
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':
Boston
(lda.m1 <- lda(default ~ balance + student, data = train))
Call:
lda(default ~ balance + student, data = train)
Prior probabilities of groups:
No Yes
0.9677526 0.0322474
Group means:
balance studentYes
No 804.968 0.2956254
Yes 1776.971 0.3948718
Coefficients of linear discriminants:
LD1
balance 0.002232368
studentYes -0.227922283
The LDA output indicates that our prior probabilities are $\hat{\pi}_1 = 0.968$ and $\hat{\pi}_2 = 0.032$; in other words, 96.8% of the training observations are customers who did not default and 3.2% represent those that defaulted.
It also provides the group means; these are the averages of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that customers who tend to default have, on average, a credit card balance of $1776.97 and are more likely to be students than non-defaulters.
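These quantities can also be reproduced directly from the training data; a quick sketch, assuming the same train data frame used to fit the model:
# empirical priors: class proportions in the training data
prop.table(table(train$default))
# group means: average balance and proportion of students within each class
tapply(train$balance, train$default, mean)
tapply(train$student == "Yes", train$default, mean)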
Making Predictions
We can use the predict() function to create predictions for new values. Recall from our regression models that these new values need to be stored in a data frame whose column names are identical to those of our training set. Consider the following new observations:
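A minimal way to construct this data frame (the name df is an assumption, matching the object passed to predict() below):
(df <- data.frame(balance = rep(c(1000, 2000), 2),
                  student = c("No", "No", "Yes", "Yes")))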
balance student
1 1000 No
2 2000 No
3 1000 Yes
4 2000 Yes
Below we see that predict returns a list with three elements. The first element, class, contains LDA’s predictions about the customer defaulting. Here we see that the second observation (non-student with balance of $2,000) is the only one that is predicted to default.
(df.pred <- predict(lda.m1, df))
$class
[1] No Yes No No
Levels: No Yes
$posterior
No Yes
1 0.9903138 0.009686200
2 0.4585630 0.541436990
3 0.9940401 0.005959891
4 0.5801226 0.419877403
$x
LD1
1 0.4335197
2 2.6658881
3 0.2055974
4 2.4379658
The second element, posterior, is a matrix that contains the posterior probability that the corresponding observation will or will not default. Here we see that the only observation with a posterior probability of defaulting greater than 50% is observation 2, which is why the LDA model predicts that this observation will default. However, we also see that observation 4 has a 42% probability of defaulting. Right now the model predicts that this observation will not default because this probability is less than 50%; however, we will see shortly how we can adjust our posterior probability threshold. Finally, x contains the linear discriminant values; we will return to these values in our PCA unit.
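As a rough check on where these values come from: MASS appears to apply the discriminant coefficients to centered predictors, where the centering values (roughly 836.3 for balance and 0.299 for studentYes, my own calculation) are the prior-weighted averages of the group means:

$$\text{LD1} \approx 0.002232\,(\text{balance} - 836.3) - 0.2279\,(\text{studentYes} - 0.299)$$

For observation 1 (a non-student with a balance of $1,000) this gives roughly 0.002232 × 163.7 + 0.2279 × 0.299 ≈ 0.43, matching the printed value.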
As previously mentioned, the default is to use a 50% threshold for the posterior probabilities. We can recreate the predictions contained in the class element above:
# number of non-defaulters
sum(df.pred$posterior[, 1] >= .5)
[1] 3
# number of defaulters
sum(df.pred$posterior[, 2] > .5)
[1] 1
Adjusting the threshold
If we wanted to use a posterior probability threshold other than 50% to make predictions, we could easily do so. For instance, suppose that a credit card company is extremely risk-averse and wants to treat any customer with a 40% or greater probability of defaulting as high-risk. We can easily count the number of such high-risk customers.
# number of high-risk customers with 40% probability of defaulting
sum(df.pred$posterior[, 2] > .4)
[1] 2
Fitting QDA
Similar to lda(), we can use the MASS library to fit a QDA model, this time with the qda() function. The output is very similar to the lda() output: it contains the prior probabilities and the group means, but not the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
(qda.m1 <- qda(default ~ balance + student, data = train))
Call:
qda(default ~ balance + student, data = train)
Prior probabilities of groups:
No Yes
0.9677526 0.0322474
Group means:
balance studentYes
No 804.968 0.2956254
Yes 1776.971 0.3948718
Making Predictions
The predict() function works in exactly the same fashion as for LDA, except it does not return the linear discriminant values. Comparing this simple prediction example to the one in the LDA section, we see minor changes in the posterior probabilities. Most notably, the posterior probability that observation 4 will default increases by about 6.5 percentage points (from roughly 42% to 48.5%).
predict(qda.m1, df)
$class
[1] No Yes No No
Levels: No Yes
$posterior
No Yes
1 0.9957697 0.004230299
2 0.4381383 0.561861660
3 0.9980862 0.001913799
4 0.5148050 0.485194962
Prediction Evaluation
Now that we understand the basics of making predictions and evaluating them, let's assess how well our two models (lda.m1 and qda.m1) perform on our test data set. First we need to apply our models to the test data.
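A minimal sketch of this step, assuming a held-out data frame named test from the same train/test split used to build train (the object names match those used in the metric calls later on):
test.predicted.lda <- predict(lda.m1, newdata = test)
test.predicted.qda <- predict(qda.m1, newdata = test)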
Now we can evaluate how well our models predict by assessing the different classification rates discussed in the logistic regression tutorial. First we'll look at the confusion matrix in percentage form. The results below show that the models perform in a very similar manner: 96% of the predicted observations are true negatives and about 1% are true positives. Both models have a type II error of less than 3%, in which the model predicts the customer will not default but they actually did, and both have a type I error of less than 1%, in which the model predicts the customer will default but they never did.
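A sketch of how those confusion matrices might be built; the orientation (true default status in rows, predicted class in columns) is an assumption consistent with the adjusted tables shown later:
# raw confusion matrices: truth in rows, predictions in columns
lda.cm <- table(test$default, test.predicted.lda$class)
qda.cm <- table(test$default, test.predicted.qda$class)
# percentage form
lapply(list(LDA_model = lda.cm, QDA_model = qda.cm), prop.table)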
However, the overall error may be less important than understanding the precision of our model. If we look at the raw numbers in our confusion matrices (lda.cm and qda.cm) we can compute the precision:
list(LDA_model = lda.cm, QDA_model = qda.cm)
$LDA_model
No Yes
No 3809 6
Yes 109 29
$QDA_model
No Yes
No 3808 7
Yes 103 35
LDA model: 29/(109+29) = 21%
QDA model: 35/(103+35) = 25%
So our QDA model has a slightly higher precision than the LDA model.
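Equivalently, the same calculations can be pulled straight from the stored confusion matrices (assuming the truth-in-rows orientation above):
lda.cm[2, 2] / sum(lda.cm[2, ])   # LDA: 29 / (109 + 29)
qda.cm[2, 2] / sum(qda.cm[2, ])   # QDA: 35 / (103 + 35)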
If we are concerned with increasing the precision of our model, we can tune it by adjusting the posterior probability threshold. For instance, we might label any customer with a posterior probability of default above 20% as high-risk. Now the precision of our QDA model improves to 83/(83+55) = 60% (see the adjusted confusion matrices below). However, the overall error rate has increased to about 4%. But a credit card company may consider this slight increase in the total error rate a small price to pay for more accurate identification of individuals who do indeed default. It's important to keep these kinds of trade-offs in mind, which are common with classification models - tuning a model can improve certain classification rates while worsening others.
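A sketch of how these adjusted predictions and confusion matrices might be produced; the names lda.pred.adj and qda.pred.adj match the labels in the output below:
# re-label the test-set predictions using a 20% posterior threshold
lda.pred.adj <- ifelse(test.predicted.lda$posterior[, 2] > .2, "Yes", "No")
qda.pred.adj <- ifelse(test.predicted.qda$posterior[, 2] > .2, "Yes", "No")
# overall test error rate for the adjusted QDA predictions (~4%)
mean(qda.pred.adj != test$default)
# adjusted confusion matrices (output shown below)
list(LDA_model = table(test$default, lda.pred.adj),
     QDA_model = table(test$default, qda.pred.adj))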
$LDA_model
lda.pred.adj
No Yes
No 3731 84
Yes 69 69
$QDA_model
qda.pred.adj
No Yes
No 3699 116
Yes 55 83
ROC
While we never covered this in lecture, another important metric is the ROC curve. The Receiver Operating Characteristic (ROC) curve is a graphical tool used to evaluate the performance of binary classification models. It provides a way to visualize and analyze the trade-off between a model's true positive rate (sensitivity) and its false positive rate (1 - specificity) across different classification thresholds.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
For this data, both models have an AUC of 0.94, which is close to the maximum of 1.0 and would be considered very good. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training). ROC curves are useful for comparing different classifiers since they take into account all possible thresholds.
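The AUC values below come from prediction objects named pred.lda and pred.qda, presumably built with the ROCR package (which provides prediction() and performance()). A sketch of how they might be created, along with the ROC curves themselves:
library(ROCR)
# ROCR prediction objects: predicted probability of default vs. true class
pred.lda <- prediction(test.predicted.lda$posterior[, 2], test$default)
pred.qda <- prediction(test.predicted.qda$posterior[, 2], test$default)
# ROC curves: true positive rate against false positive rate
plot(performance(pred.lda, measure = "tpr", x.measure = "fpr"), col = "blue")
plot(performance(pred.qda, measure = "tpr", x.measure = "fpr"), col = "red", add = TRUE)
abline(a = 0, b = 1, lty = 2)  # the chance line (AUC = 0.5)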
# model 1 AUC
performance(pred.lda, measure = "auc")@y.values
[[1]]
[1] 0.9420727
# model 2 AUC
performance(pred.qda, measure = "auc")@y.values
[[1]]
[1] 0.9420746
We’ll use the MLmetrics library to compute the measures automatically for us…
library(MLmetrics)
Attaching package: 'MLmetrics'
The following object is masked from 'package:base':
Recall
library(gclus)
Loading required package: cluster
data(body)
body$Gender <- factor(body$Gender)
Other metrics
We can use the MLmetrics library to compute the other classification metrics we discussed in lecture.
For LDA:
library(MLmetrics)
# LDA
Sensitivity(test$default, test.predicted.lda$class)
Recall(test$default, test.predicted.lda$class) # same as above
Precision(test$default, test.predicted.lda$class)
Specificity(test$default, test.predicted.lda$class)
F1_Score(test$default, test.predicted.lda$class)
For QDA:
# QDA
Sensitivity(test$default, test.predicted.qda$class)
Recall(test$default, test.predicted.qda$class) # same as above
Precision(test$default, test.predicted.qda$class)
Specificity(test$default, test.predicted.qda$class)
F1_Score(test$default, test.predicted.qda$class)
To compute the log-loss, we can use the LogLoss() function². Judging from the help files (fairly sparse), the first argument appears to be the predicted probability vector of classifying a 1 (in this case the predicted probability of default). Notice that they have (bizarrely) swapped the argument order from function(truth, predicted) to function(predicted, truth) compared to the other metrics we've used thus far. Finally, note that the truth vector is required to be 0-1 labels.
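For reference, binary log-loss is the average cross-entropy between the 0-1 truth labels and the predicted probabilities:

$$\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\,\Big]$$

where $y_i \in \{0, 1\}$ indicates whether customer $i$ defaulted and $\hat{p}_i$ is the predicted probability of default.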
# need to convert "No"/"Yes" to 0/1
truth <- as.numeric(test$default) - 1
LogLoss(test.predicted.lda$posterior[, 2], truth)
[1] 0.086328
LogLoss(test.predicted.qda$posterior[,2], truth)
[1] 0.08664758
Recall that smaller is better for log-loss, so LDA performs slightly better based on the log-loss metric.
Footnotes
1. I have intentionally used a variety of code for doing this so you can familiarize yourself with other methods.↩︎
2. Note that the LogLoss() function is specifically for binary classification. There is also MultiLogLoss() from the same library for multiclass scenarios.↩︎