In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice the syntax of lda(): like lm() (linear regression) and glm() (as seen in logistic regression), this function accepts a formula as its first argument:
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':
Boston
(lda.m1 <- lda(default ~ balance + student, data = train))
Call:
lda(default ~ balance + student, data = train)
Prior probabilities of groups:
No Yes
0.9677526 0.0322474
Group means:
balance studentYes
No 804.968 0.2956254
Yes 1776.971 0.3948718
Coefficients of linear discriminants:
LD1
balance 0.002232368
studentYes -0.227922283
The LDA output indicates that our prior probabilities are $\hat{\pi}_1 = 0.968$ and $\hat{\pi}_2 = 0.032$; in other words, 96.8% of the training observations are customers who did not default and 3.2% represent those that defaulted.
It also provides the group means; these are the averages of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that customers who tend to default have, on average, a credit card balance of $1776.97 and are more likely to be students than non-defaulters.
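These quantities can also be reproduced directly from the training data; a quick sketch, assuming the same train data frame used to fit the model:
# empirical priors: class proportions in the training data
prop.table(table(train$default))
# group means: average balance and proportion of students within each class
tapply(train$balance, train$default, mean)
tapply(train$student == "Yes", train$default, mean)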
Making Predictions
We can use the predict() function to create predictions for new values. Recall from our regression models that these new values need to be stored in a data frame whose column names are identical to those of our training set. Consider the following new observations:
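A minimal way to construct this data frame (the name df is an assumption, matching the object passed to predict() below):
(df <- data.frame(balance = rep(c(1000, 2000), 2),
                  student = c("No", "No", "Yes", "Yes")))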
balance student
1 1000 No
2 2000 No
3 1000 Yes
4 2000 Yes
Below we see that predict returns a list with three elements. The first element, class, contains LDA’s predictions about the customer defaulting. Here we see that the second observation (non-student with balance of $2,000) is the only one that is predicted to default.
(df.pred <- predict(lda.m1, df))
$class
[1] No Yes No No
Levels: No Yes
$posterior
No Yes
1 0.9903138 0.009686200
2 0.4585630 0.541436990
3 0.9940401 0.005959891
4 0.5801226 0.419877403
$x
LD1
1 0.4335197
2 2.6658881
3 0.2055974
4 2.4379658
The second element, posterior, is a matrix that contains the posterior probability that the corresponding observation will or will not default. Here we see that the only observation with a posterior probability of defaulting greater than 50% is observation 2, which is why the LDA model predicts that this observation will default. However, we also see that observation 4 has a 42% probability of defaulting. Right now the model predicts that this observation will not default because this probability is less than 50%; however, we will see shortly how we can adjust our posterior probability threshold. Finally, x contains the linear discriminant values; we will return to these values in our PCA unit.
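As a rough check on where these values come from: MASS appears to apply the discriminant coefficients to centered predictors, where the centering values (roughly 836.3 for balance and 0.299 for studentYes, my own calculation) are the prior-weighted averages of the group means:

$$\text{LD1} \approx 0.002232\,(\text{balance} - 836.3) - 0.2279\,(\text{studentYes} - 0.299)$$

For observation 1 (a non-student with a balance of $1,000) this gives roughly 0.002232 × 163.7 + 0.2279 × 0.299 ≈ 0.43, matching the printed value.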
As previously mentioned, the default is to use a 50% threshold for the posterior probabilities. We can recreate the predictions contained in the class element above:
# number of non-defaulters
sum(df.pred$posterior[, 1] >= .5)
[1] 3
# number of defaulters
sum(df.pred$posterior[, 2] > .5)
[1] 1
Adjusting the threshold
If we wanted to use a posterior probability threshold other than 50% to make predictions, we could easily do so. For instance, suppose that a credit card company is extremely risk-averse and wants to treat any customer with a 40% or greater probability of defaulting as high-risk. We can easily count the number of such high-risk customers.
# number of high-risk customers with 40% probability of defaulting
sum(df.pred$posterior[, 2] > .4)
[1] 2
Fitting QDA
Similar to lda(), we can use the MASS library to fit a QDA model, this time with the qda() function. The output is very similar to the lda() output: it contains the prior probabilities and the group means, but not the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
(qda.m1 <- qda(default ~ balance + student, data = train))
Call:
qda(default ~ balance + student, data = train)
Prior probabilities of groups:
No Yes
0.9677526 0.0322474
Group means:
balance studentYes
No 804.968 0.2956254
Yes 1776.971 0.3948718
Making Predictions
The predict() function works in exactly the same fashion as for LDA, except it does not return the linear discriminant values. Comparing this simple prediction example to the one in the LDA section, we see minor changes in the posterior probabilities. Most notably, the posterior probability that observation 4 will default increases by about 6.5 percentage points (from roughly 42% to 48.5%).
predict(qda.m1, df)
$class
[1] No Yes No No
Levels: No Yes
$posterior
No Yes
1 0.9957697 0.004230299
2 0.4381383 0.561861660
3 0.9980862 0.001913799
4 0.5148050 0.485194962
Prediction Evaluation
Now that we understand the basics of making predictions and evaluating them, let's assess how well our two models (lda.m1 and qda.m1) perform on our test data set. First we need to apply our models to the test data.
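A minimal sketch of this step, assuming a held-out data frame named test from the same train/test split used to build train (the object names match those used in the metric calls later on):
test.predicted.lda <- predict(lda.m1, newdata = test)
test.predicted.qda <- predict(qda.m1, newdata = test)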
Now we can evaluate how well our models predict by assessing the different classification rates discussed in the logistic regression tutorial. First we'll look at the confusion matrix in percentage form. The results below show that the models perform in a very similar manner: 96% of the predicted observations are true negatives and about 1% are true positives. Both models have a type II error of less than 3%, in which the model predicts the customer will not default but they actually did, and both have a type I error of less than 1%, in which the model predicts the customer will default but they never did.
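A sketch of how those confusion matrices might be built; the orientation (true default status in rows, predicted class in columns) is an assumption consistent with the adjusted tables shown later:
# raw confusion matrices: truth in rows, predictions in columns
lda.cm <- table(test$default, test.predicted.lda$class)
qda.cm <- table(test$default, test.predicted.qda$class)
# percentage form
lapply(list(LDA_model = lda.cm, QDA_model = qda.cm), prop.table)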
However, the overall error may be less important than understanding the precision of our model. If we look at the raw numbers in our confusion matrices (lda.cm and qda.cm) we can compute the precision:
list(LDA_model = lda.cm, QDA_model = qda.cm)
$LDA_model
No Yes
No 3809 6
Yes 109 29
$QDA_model
No Yes
No 3808 7
Yes 103 35
LDA model: 29/(109+29) = 21%
QDA model: 35/(103+35) = 25%
So our QDA model has a slightly higher precision than the LDA model.
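Equivalently, the same calculations can be pulled straight from the stored confusion matrices (assuming the truth-in-rows orientation above):
lda.cm[2, 2] / sum(lda.cm[2, ])   # LDA: 29 / (109 + 29)
qda.cm[2, 2] / sum(qda.cm[2, ])   # QDA: 35 / (103 + 35)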
If we are concerned with increasing the precision of our model, we can tune it by adjusting the posterior probability threshold. For instance, we might label any customer with a posterior probability of default above 20% as high-risk. Now the precision of our QDA model improves to 83/(83+55) = 60% (see the adjusted confusion matrices below). However, the overall error rate has increased to about 4%. But a credit card company may consider this slight increase in the total error rate a small price to pay for more accurate identification of individuals who do indeed default. It's important to keep these kinds of trade-offs in mind, which are common with classification models - tuning a model can improve certain classification rates while worsening others.
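A sketch of how these adjusted predictions and confusion matrices might be produced; the names lda.pred.adj and qda.pred.adj match the labels in the output below:
# re-label the test-set predictions using a 20% posterior threshold
lda.pred.adj <- ifelse(test.predicted.lda$posterior[, 2] > .2, "Yes", "No")
qda.pred.adj <- ifelse(test.predicted.qda$posterior[, 2] > .2, "Yes", "No")
# overall test error rate for the adjusted QDA predictions (~4%)
mean(qda.pred.adj != test$default)
# adjusted confusion matrices (output shown below)
list(LDA_model = table(test$default, lda.pred.adj),
     QDA_model = table(test$default, qda.pred.adj))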
$LDA_model
lda.pred.adj
No Yes
No 3731 84
Yes 69 69
$QDA_model
qda.pred.adj
No Yes
No 3699 116
Yes 55 83
ROC
While we never covered this in lecture, another important metric is the ROC curve. The Receiver Operating Characteristic (ROC) curve is a graphical tool used to evaluate the performance of binary classification models. It provides a way to visualize and analyze the trade-off between a model's true positive rate (sensitivity) and its false positive rate (1 - specificity) across different classification thresholds.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
For this data, both models have an AUC of 0.94, which is close to the maximum of 1.0 and would be considered very good. We expect a classifier that performs no better than chance to have an AUC of 0.5 (when evaluated on an independent test set not used in model training). ROC curves are useful for comparing different classifiers since they take into account all possible thresholds.
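The AUC values below come from prediction objects named pred.lda and pred.qda, presumably built with the ROCR package (which provides prediction() and performance()). A sketch of how they might be created, along with the ROC curves themselves:
library(ROCR)
# ROCR prediction objects: predicted probability of default vs. true class
pred.lda <- prediction(test.predicted.lda$posterior[, 2], test$default)
pred.qda <- prediction(test.predicted.qda$posterior[, 2], test$default)
# ROC curves: true positive rate against false positive rate
plot(performance(pred.lda, measure = "tpr", x.measure = "fpr"), col = "blue")
plot(performance(pred.qda, measure = "tpr", x.measure = "fpr"), col = "red", add = TRUE)
abline(a = 0, b = 1, lty = 2)  # the chance line (AUC = 0.5)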
# model 1 AUC
performance(pred.lda, measure = "auc")@y.values
[[1]]
[1] 0.9420727
# model 2 AUC
performance(pred.qda, measure = "auc")@y.values
[[1]]
[1] 0.9420746
We’ll use the MLmetrics library to compute the measures automatically for us…
library(MLmetrics)
Attaching package: 'MLmetrics'
The following object is masked from 'package:base':
Recall
library(gclus)
Loading required package: cluster
data(body)
body$Gender <- factor(body$Gender)
Other metrics
We can use the MLmetrics library to compute the other classification metrics we discussed in lecture.
For LDA:
library(MLmetrics)
# LDA
Sensitivity(test$default, test.predicted.lda$class)
Recall(test$default, test.predicted.lda$class) # same as above
Precision(test$default, test.predicted.lda$class)
Specificity(test$default, test.predicted.lda$class)
F1_Score(test$default, test.predicted.lda$class)
For QDA:
# QDA
Sensitivity(test$default, test.predicted.qda$class)
Recall(test$default, test.predicted.qda$class) # same as above
Precision(test$default, test.predicted.qda$class)
Specificity(test$default, test.predicted.qda$class)
F1_Score(test$default, test.predicted.qda$class)
To compute the log-loss, we can use the LogLoss() function². Judging from the help files (fairly sparse), the first argument appears to be the predicted probability vector of classifying a 1 (in this case the predicted probability of default). Notice that they have (bizarrely) swapped the argument order from function(truth, predicted) to function(predicted, truth) compared to the other metrics we've used thus far. Finally, note that the truth vector is required to be 0-1 labels.
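For reference, binary log-loss is the average cross-entropy between the 0-1 truth labels and the predicted probabilities:

$$\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\,\Big]$$

where $y_i \in \{0, 1\}$ indicates whether customer $i$ defaulted and $\hat{p}_i$ is the predicted probability of default.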
# need to convert "No"/"Yes" to 0/1
truth <- as.numeric(test$default) - 1
LogLoss(test.predicted.lda$posterior[, 2], truth)
[1] 0.086328
LogLoss(test.predicted.qda$posterior[,2], truth)
[1] 0.08664758
Recall that smaller is better for log-loss, so LDA performs slightly better based on the log-loss metric.
Footnotes
1. I have intentionally used a variety of code for doing this so you can familiarize yourself with other methods.↩︎
2. Note that the LogLoss() function is specifically for binary classification. There is also MultiLogLoss() from the same library for multiclass scenarios.↩︎