library(gclus)
data(body)
bodyFitting and Assessing Classification Models
DATA 311: Machine Learning
Introduction
In this lab, we will focus on comparing three types of classification models: Logistic Regression, K-Nearest Neighbors (KNN), and Discriminant Analysis (LDA and QDA).
While in lecture we usually referred to categories of \(y\) as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.
Learning Outcomes
By the end of this lab students will be able to:
- Fit and interpret Logistic Regression, KNN, LDA, and QDA.
- Distinguish prediction scales for GLMs (linear predictor vs. response probability) and map probabilities to class labels.
- Compute test error and confusion matrices correctly (mind the factor level order!).
- Adjust classification thresholds and quantify trade‑offs across metrics (precision, recall/sensitivity, specificity, F1).
- Use ROC curves and AUC to compare classifiers across all thresholds.
- Compute log‑loss and interpret “smaller is better.”
- Evaluate model performance using test error rate and confusion matrix.
Data
The body dataset contains 21 body dimension measurements as well as age, weight (in kg), height (in cm), and gender (binary) on 507 individuals. You can find this in the gclus library as body. As always, lets load our data set into our R session and make sure that it looks as we expect.1
As you will notice in the print out above, Gender is being treated as a numeric integer (as indicated by <int> under the column heading) rather than a factor. If ever you are unsure if you are unsure if your variable is being treated as an integer or a factor, you can check using class
class(body$Gender)[1] "integer"
As in the linear regression setting where we had to use something like lm(y~ x1 + factor(x2)) to let R know that a predictor variable is a categorical, we need to convert the Gender to the factor class to ensure that these values are being treated as a categories (i.e. levels) rather than numbers. To accomplish this we could run:
body$Gender <- factor(body$Gender)Alternatively if we want to convert the 1’s to male and 0 to female2 we could use something like (to give the levels more meaningful labels):
body$Gender <- factor(body$Gender, levels = c(1,0), labels = c("male", "female"))Now we can see that Gender is being treated as a categorical variable (i.e. a factor variable in R).
class(body$Gender)[1] "factor"
Training and Testing
Before fitting our model, let’s divide our data into 60% training and 40% testing. To make our results reproducible let’s also set a seed3.
set.seed(1696623282) # as.integer(Sys.time())
n <- nrow(body)
ptest <- 0.4 # 40% testing (60% training)
ntest <- round(n*ptest)
rind = sample(n)
# indices
test_ind = rind[1:ntest]
train_ind = rind[(ntest+1):n]
# data frames
test_data = body[test_ind, ]
train_data = body[train_ind, ]Logistic Regression
This model was covered in detail in the previous lab. Please review if you are shaking on some of the details.
Using Gender as the response variable we can fit the model in R using:
simlog <- glm(Gender ~ Height , family="binomial", data = body, subset = train_ind)
summary(simlog)
Call:
glm(formula = Gender ~ Height, family = "binomial", data = body,
subset = train_ind)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 45.75626 5.13902 8.904 <2e-16 ***
Height -0.26691 0.02999 -8.899 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 421.10 on 303 degrees of freedom
Residual deviance: 240.86 on 302 degrees of freedom
AIC: 244.86
Number of Fisher Scoring iterations: 5
Notice that since we already converted Gender to a factor we need not wrap the variable name in factor() in the glm function. Notice that the subset argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simlog is the equivalent to:
# output surpressed to save space
glm(Gender ~ Height , family="binomial", data = train_data)For binary GLMs with family = "binomial", predict(..., type = "link") (default) returns log‑odds; type = "response" returns probability \(\Pr(Y= \text{``event class"} \mid x)\) where the “event class” is the second factor level. Because of the way we re-leveled our response in Listing 1, the resulting probability that is being returned is the \(\Pr(Y= \text{female} \mid x)\)
# Example prediction at Height = 174
predict(simlog, newdata = data.frame(Height = 174)) # log-odds 1
-0.6866771
predict(simlog, newdata = data.frame(Height = 174), type = "response") # prob 1
0.3347727
These predictions can be verified “by-hand”:
\[\begin{align*} \text{log-odds of being female} &= 45.75626 -0.26691 (\texttt{Height})\\ \implies \quad P(Y=\text{female} \mid X) &= \dfrac{e^{45.75626 -0.26691 (\texttt{Height})}} {1+ e^{45.75626 -0.26691 (\texttt{Height})}} \end{align*}\]
Hence with a height of, say, 174:
\[\begin{align*} \text{log-odds of being female} &= -0.68608 \\ \Pr(Y=\text{female} \mid X=174) &= \dfrac{e^{45.75626 + -0.26691 (174)}} {1+ e^{45.75626 + -0.26691 (174)}}\\ &= 0.3349057 \end{align*}\]
Classification Performance
Test Error
To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.
To calculate the test error in R:
- Predict probabilities for the test set using the trained model.
- Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
- Compare the predictions to the true class labels and calculate the error rate.
Step 1: Predict probabilities for the test set
We first predict the probabilities of the test set observations belonging to the female class.
# Predict probabilities (of being female) on the test set
pred.prob <- predict(simlog, newdata = test_data, type = "response")
pred_gender <- factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=c("male", "female"))
y = test_data$Gender
# see the first 10:
head(cbind(pred_gender, y)) pred_gender y
20 1 1
228 2 1
334 2 2
205 1 1
503 1 2
42 2 1
Step 2: Convert probabilities to class predictions
Next, we classify each observation as either class female or male using a threshold of 0.5. In other words, if the \(\Pr(Y_i = \text{female} \mid x_i) > 0.5\) then \(x_i\) is classified as female, otherwise, male.
# Convert probabilities to class predictions using a 0.5 threshold
pred_gender <- factor(ifelse(pred.prob >= 0.5, "female", "male"))Step 3: Compare predictions to actual test labels and calculate the error rate
We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.
# Actual test labels
actual_gender <- test_data$Gender
# print the result
data.frame(yhat = pred_gender, y = actual_gender)To calculated the test error (or misclassification rate) we use:
\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{1}\]
where
- \(n_{\text{text}}\) are the number of observations in our test set
- \(\hat{y}_i\) represents the predicted class and
- \({y}_i\) represents the actual class
This can be computer in R using:
# Calculate the error rate
test_error <- mean(pred_gender != actual_gender)
test_error[1] 0.1674877
Confusion Matrix
As discussed in lecture and in previous labs, we can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table() function:
tab <- table(test_data$Gender, pred_gender)
tab pred_gender
female male
male 18 82
female 87 16
Usually when we are organizing a confusiong matrix it should have the correct classifications on the diagonal. This is not the case here. We can fix this by changing the order.
levels(test_data$Gender)[1] "male" "female"
levels(pred_gender)[1] "female" "male"
To reorder the factor levels in R, you can use the factor() function with the levels argument to explicitly specify the new order of the levels. In our case, since we want the levels of pred_gender to match the levels of test_data$Gender, you can reorder pred_gender like this:
# Reorder the levels of pred_gender to match test_data$Gender
pred_gender <- factor(pred_gender, levels = c("male", "female")) # same as:
pred_gender <- factor(pred_gender, levels = levels(test_data$Gender)) Now that the levels of these two vectors not consistent, lets redo our table.
tab <- table(test_data$Gender, pred_gender)
tab pred_gender
male female
male 82 18
female 16 87
That’s better. Now the off-diagonal elements represent the number of misclassified observations, where predictions did not match the actual gender. The diagonal elements represent the correctly classified observations, where predictions matched the actual gender. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 1 by summing the misclassified observations and dividing by the total number of observations.
\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{18+16}{203} = 0.1674877 \end{equation}\]
Always ensure that the levels of factor variables are in the same order between the predicted and actual values! When the order of the factor levels are not consistent, this leads to incorrect results in the confusion matrix. You can reorder the factor levels using the factor(, levels = )
Now let’s repeat for the other methods
KNN Classification
Algorithm
K-Nearest Neighbors (KNN) is a simple, intuitive classification algorithm that makes predictions based on the similarity of new data points to existing labeled data points. In KNN, a data point is classified by a majority vote of its \(k\) nearest neighbors.
Step 1: Given a positive integer \(k\) and a test observation \(x_0\), the KNN classifier first identifies the \(k\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\).
Step 2: For each class \(j\), find \[\begin{equation} P(Y=j \mid X=x_0) \approx \frac{1}{K} \sum_{i \in N_0} I(y_i = j) \end{equation}\]
Step 3: Assign observation \(x_0\) to the class (\(j\)) for which the above probability is largest; “ties” are typical broken at random (i.e. a coin toss)
Note that this is a nonparametric approach. That is, KNN does not make any assumptions about the underlying distribution of the data.
R code
While this would be easy to code up ourselves, we’ll make use of a package.To fit a K-Nearest Neighbors (KNN) classification we can use the knn function from the class package in R. For demonstration purposes we will use \(k = 5\) and \(k = 11\) (I am strategically using odd numbers to avoid ties). Don’t worry, we’ll learn a more systematic way to select \(k\) once we discuss cross-validation.
Note that the arguments for this algorithm are not as easy as glm and lm. Rather than accepting a formula we will need to specify
train= \(\boldsymbol X_{\text{train}}\), the matrix of predictors in the training set,test= \(\boldsymbol X_{\text{test}}\), the matrix of predictors in the test set, andcl= \(\boldsymbol y_{\text{train}}\), the true classes for the observations in the training set.
Since we are only using Height as a predictor, we can create this (one-column) matrix4 as follows:
library(class)
# Separate features and labels for KNN
train_x <- as.matrix(train_data[, "Height"]) # Convert to matrix
train_y <- train_data$Gender # Labels for training data
test_x <- as.matrix(test_data[, "Height"]) # Convert to matrix
test_y <- test_data$Gender # Labels for testing data
# Fit KNN with k=5
knn_pred5 <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
# Fit KNN with k=11
knn_pred11 <- knn(train = train_x, test = test_x, cl = train_y, k = 11)Note that the outputted vector is the predictions on the test set so we can go straight to calculating performance metrics.
Classification performance
As before we can compare the predicted gender in the test set, to the actual gender:
knn_pred5 <- factor(knn_pred5, levels = levels(actual_gender))
data.frame(yhat = knn_pred5, y = actual_gender, disagree = knn_pred5 != actual_gender) # actual_gender is the same as test_yOur confusion matrix can be found as:
tab_knn <- table(actual_gender, knn_pred5)
tab_knn knn_pred5
actual_gender male female
male 80 20
female 19 84
Our test error is:
test_error_knn <- mean(knn_pred5 != actual_gender)
test_error_knn[1] 0.1921182
Similarly for KNN with \(k = 11\)
knn_pred11 <- factor(knn_pred11, levels = levels(actual_gender))
data.frame(yhat = knn_pred11, y = actual_gender, disagree = knn_pred11 != actual_gender) # actual_gender is the same as test_yOur confusion matrix can be found as:
tab_knn11 <- table(actual_gender, knn_pred11)
tab_knn11 knn_pred11
actual_gender male female
male 73 27
female 12 91
Our test error is:
test_error_knn11 <- mean(knn_pred11 != actual_gender)
test_error_knn11[1] 0.1921182
QDA and LDA
Linear Discriminant Analysis (LDA) is a classification algorithm that assumes data is normally distributed and models each class with its own Gaussian distribution. LDA creates a linear decision boundary. Quadratic Discriminant Analysis (QDA) is an extension of LDA that allows for a quadratic decision boundary. Unlike LDA, which assumes equal covariance matrices for each class, QDA estimates a separate covariance matrix for each class. This flexibility allows QDA to handle more complex classification problems but can lead to overfitting with small datasets.
R code
We can use the MASS package for LDA and QDA. Like glm it accepts a formula, a data input, and an optional subset for training index.
library(MASS)
# LDA model
lda_model <- lda(Gender ~ Height, data = body, subset = train_ind)
# QDA model
qda_model <- qda(Gender ~ Height, data = body, subset = train_ind)Prediction
We make predictions using the predict() function; see ?predict.lda and ?predict.qda for details:
lda_fit <- predict(lda_model, newdata = test_data)A matrix of the posterior probabilities for each class is stored in $posterior
# posterior probabilities of the first 6 observations
head(lda_fit$posterior) male female
20 0.75107307 0.2489269
228 0.41261467 0.5873853
334 0.01340679 0.9865932
205 0.89888636 0.1011136
503 0.77381700 0.2261830
42 0.48709961 0.5129004
The predicted class labels for the observations are stored in $class
lda_pred <- lda_fit$class
lda_pred [1] male female female male male female male male female male
[11] male female female male male male male male male male
[21] male female male female male male female male male female
[31] female female male female female male male female female male
[41] male female female female female female female male female female
[51] female female male male female female female female female male
[61] female male male male male female female male female female
[71] female male female male male female female female female female
[81] male male male male male female female male female female
[91] female male female male female male male male female female
[101] female male male female male female female male male female
[111] male female female female female female female male female male
[121] male male male female male male female female male male
[131] female female female female female female male male male female
[141] male male female male female female female male female female
[151] female female male male female male male male female female
[161] male female female male female male male female male female
[171] male female female female male female female male male male
[181] female male female male male male male male male female
[191] male male male female female female male male female male
[201] female female female
Levels: male female
# remember we should reorder the levels for later:
lda_pred <- factor(lda_pred, levels = levels(actual_gender))Similarly for QDA:
qda_fit <- predict(qda_model, newdata = test_data)
qda_pred <- factor(qda_fit$class, levels = levels(actual_gender))Classification performance
As before we can compare the predicted gender in the test set, to the actual gender:
data.frame(yhat = lda_pred, y = actual_gender, disagree = lda_pred != actual_gender) Our confusion matrix can be found as:
tab_lda <- table(actual_gender, lda_pred)
tab_lda lda_pred
actual_gender male female
male 82 18
female 16 87
Our test error is:
test_error_lda <- mean(lda_pred != actual_gender)
test_error_lda[1] 0.1674877
Similarly for QDA:
data.frame(yhat = qda_pred, y = actual_gender, disagree = lda_pred != actual_gender) Our confusion matrix can be found as:
tab_qda <- table(actual_gender, qda_pred)
tab_qda qda_pred
actual_gender male female
male 82 18
female 16 87
Our test error is:
test_error_qda <- mean(qda_pred != actual_gender)
test_error_qda[1] 0.1674877
As we can see, LDA and QDA are performing identically in this case.
Conclusion
In this lab we’ve fitted several classification methods. Overall, logistic regression, LDA, and QDA perform equally as well on the test data and all achieve and test error rate of 16.75%.
| Method | test error |
|---|---|
| Logistic Regression | 0.1674877 |
| KNN (\(k\) = 5) | 0.1921182 |
| KNN (\(k\) = 11) | 0.1921182 |
| LDA | 0.1674877 |
| QDA | 0.1674877 |
tab pred_gender
male female
male 82 18
female 16 87
tab_lda lda_pred
actual_gender male female
male 82 18
female 16 87
tab_qda qda_pred
actual_gender male female
male 82 18
female 16 87
In this lab we just looked at some basic metrics for comparison (test error and confusion matrices). Next week we will look at other evaluation metrics including precision, recall, specificity, and the F1 score.
ROC
While we never covered this in lecture, another important metrics is the ROC curve. The Receiver Operating Characteristic, or ROC curve is a graphical tool used to evaluate the performance of binary classification models. It provides a way to visualize and analyze the trade-off between a model’s true positive rate (sensitivity) and its false positive rate (1 - specificity) for different classification thresholds. So instead of choosing a fixed cutoff (like 0.5), the ROC curve shows the trade-off between:
- True Positive Rate (TPR) — also called sensitivity or recall: \[ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} \]
- False Positive Rate (FPR) — the proportion of negatives incorrectly predicted as positives: \[ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives + True Negatives}} \]
for all possible thresholds.
A model that perfectly distinguishes classes would reach the top-left corner (TPR = 1, FPR = 0). A model that guesses randomly lies along the diagonal line (TPR ≈ FPR).
Area Under the Curve (AUC)
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). Since the ideal ROC curve will hug the top left corner, this implies that the larger the AUC, the better the classifier. The AUC summarizes the ROC curve in a single number between 0 and 1:
- AUC = 1 → perfect classifier
- AUC = 0.5 → performs no better than random guessing
- AUC < 0.5 → worse than random (predictions reversed)
Using the ROCR package
We will use the ROCR package to plot the ROC curve and obtain the AUC metric. Every classifier evaluation using ROCR starts with creating a prediction object. This function is used to transform the input data (which can be in vector, matrix, data frame, or list form) into a standardized format.
prediction.obj = prediction(predictions, labels)predictionsis a vector of probabilities for the “event class” andlabelsis a binary label vector.
library(ROCR)
# Predicted probabilities of the positive class ("female")
prob_log <- predict(simlog, newdata = test_data, type = "response")
prob_lda <- lda_fit$posterior[, "female"]
prob_qda <- qda_fit$posterior[, "female"]
# Truth vector as 0/1 with female = 1 (matches factor order: c("male","female"))
truth01 <- as.numeric(actual_gender == "female")
# ROCR prediction objects
pred_log <- prediction(prob_log, truth01)
pred_lda <- prediction(prob_lda, truth01)
pred_qda <- prediction(prob_qda, truth01)Performance measures or combinations thereof are computed by feeding the prediction object into the performance() function:
performance(prediction.obj, measure, x.measure = "cutoff", ...)prediction.objAn object of class prediction.measurePerformance measure to use for the evaluation. A complete list of the performance measures that are available for measure and x.measure is given in the ‘Details’ section.x.measureA second performance measure. If different from the default, a two-dimensional curve, with x.measure taken to be the unit in direction of the x axis, and measure to be the unit in direction of the y axis, is created. This curve is parametrized with the cutoff.
The resulting performance object can be visualized using plot(). For example, an ROC curve that trades off the rate of true positives (tpr) against the rate of false positives (fpr) is obtained as follows:
# Overlay ROC curves for side-by-side comparison
perf_log <- performance(pred_log, "tpr", "fpr")
perf_lda <- performance(pred_lda, "tpr", "fpr")
perf_qda <- performance(pred_qda, "tpr", "fpr")
par(mfrow=c(1,3))
plot(perf_log, main = "Logistic Regression")
plot(perf_lda, main = "LDA")
plot(perf_qda, main = "QDA")
To obtain the area under the ROC curve (aka the AUC) we use:
# AUCs
auc_log <- performance(pred_log, "auc")@y.values[[1]]
auc_lda <- performance(pred_lda, "auc")@y.values[[1]]
auc_qda <- performance(pred_qda, "auc")@y.values[[1]]
c(Logistic_AUC = auc_log, LDA_AUC = auc_lda, QDA_AUC = auc_qda)Logistic_AUC LDA_AUC QDA_AUC
0.9179126 0.9179126 0.9179126
Logistic Regression, LDA, and QDA yield nearly identical AUC of 0.92, indicating strong separation between male and female based on height. Recall that the maximum AUC is 1.0, so these classifiers would be considered very good.
MLmetrics
To help us calcuate some of the popular metrics for assessing classification meolds we can use the MLmetrics library
library(MLmetrics)
Attaching package: 'MLmetrics'
The following object is masked from 'package:base':
Recall
For LDA:
# LDA
Sensitivity(test_data$Gender, lda_fit$class)
Recall(test_data$Gender, lda_fit$class) # same as above
Precision(test_data$Gender, lda_fit$class)
Specificity(test_data$Gender, lda_fit$class)
F1_Score(test_data$Gender, lda_fit$class)[1] 0.82
[1] 0.82
[1] 0.8367347
[1] 0.8446602
[1] 0.8282828
For QDA:
Sensitivity(test_data$Gender, qda_fit$class)
Recall(test_data$Gender, qda_fit$class) # same as above
Precision(test_data$Gender, qda_fit$class)
Specificity(test_data$Gender, qda_fit$class)
F1_Score(test_data$Gender, qda_fit$class)[1] 0.82
[1] 0.82
[1] 0.8367347
[1] 0.8446602
[1] 0.8282828
Log-Loss
While accuracy and AUC depend only on whether a prediction is correct or on its relative ordering, log-loss (or cross-entropy loss) penalizes how confident a model is in its predictions.
Formally, for binary classification with \(n\) test observations: \[ \text{LogLoss} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i) \right] \]
where:
- \(y_i \in {0, 1}\) are the true labels
- \(\hat{p}_i\) are the predicted probabilities of the positive class
A model that is confident and correct (e.g., predicts 0.95 for a true positive) receives a small penalty, whereas a model that is confident and wrong (e.g., predicts 0.95 for a false positive) receives a large penalty.
Hence:
- Smaller log-loss → better calibrated predictions
- Perfect classifier → LogLoss = 0
Because it takes into account both accuracy and confidence, log-loss is often used for probabilistic classifiers such as logistic regression, LDA, and QDA.
To compute the logloss, we can use the LogLoss() function5. Judging from help files (fairly sparse), the first argument seems to be the predicted probability vector of classifying a 1 (in this case the predicted probabilty that they will default). Notice that they (bizarrely) have swapped the arguments order from function(truth, predicted) to function(predicted, truth) as compared to the other metrics we’ve used thus far. Thirdly, note that the truth vector is required to be 0-1 labels.
# need to convert "No"/"Yes" to 0/1
truth = as.numeric(test_data$Gender)-1
LogLoss(lda_fit$posterior[,2], truth)[1] 0.3683008
LogLoss(qda_fit$posterior[,2], truth)[1] 0.3683334
Recall that smaller is better for log-loss, so LDA performs slightly better based on the log-loss metric.
Footnotes
For example, you can call
View(...)orhead(...)orsrt(...)on your data set to accomplish this.↩︎the information regarded this encoding can be found in the help file for the
bodydata set which can be accessed using?body↩︎In case you were wondering how a generate this seed, I used
as.integer(Sys.time()). I like to do this to avoid using similar seeds over and over again, e.g.set.seed(<year>).↩︎the
knnfunction expectstrainandtestto be a matrix or data frame. If you provide a vector, you will get an error.↩︎Note that the
LogLossfunction is specifically for binary. There is alsoMultiLogLossfrom the same library for multiclass scenarios.↩︎