Fitting and Assessing Classification Models

DATA 311: Machine Learning

Author

Affiliation

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

In this lab, we will focus on comparing three types of classification models: Logistic Regression, K-Nearest Neighbors (KNN), and Discriminant Analysis (LDA and QDA).

Notation

While in lecture we usually referred to categories of $y$ as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.

Learning Outcomes

By the end of this lab students will be able to:

Fit and interpret Logistic Regression, KNN, LDA, and QDA.
Distinguish prediction scales for GLMs (linear predictor vs. response probability) and map probabilities to class labels.
Compute test error and confusion matrices correctly (mind the factor level order!).
Adjust classification thresholds and quantify trade‑offs across metrics (precision, recall/sensitivity, specificity, F1).
Use ROC curves and AUC to compare classifiers across all thresholds.
Compute log‑loss and interpret “smaller is better.”
Evaluate model performance using test error rate and confusion matrix.

Data

The body dataset contains 21 body dimension measurements as well as age, weight (in kg), height (in cm), and gender (binary) on 507 individuals. You can find this in the gclus library as body. As always, lets load our data set into our R session and make sure that it looks as we expect.¹

library(gclus)
data(body)
body

As you will notice in the print out above, Gender is being treated as a numeric integer (as indicated by <int> under the column heading) rather than a factor. If ever you are unsure if you are unsure if your variable is being treated as an integer or a factor, you can check using class

class(body$Gender)

[1] "integer"

As in the linear regression setting where we had to use something like lm(y~ x1 + factor(x2)) to let R know that a predictor variable is a categorical, we need to convert the Gender to the factor class to ensure that these values are being treated as a categories (i.e. levels) rather than numbers. To accomplish this we could run:

body$Gender <- factor(body$Gender)

Alternatively if we want to convert the 1’s to male and 0 to female² we could use something like (to give the levels more meaningful labels):

Listing 1: Recode gender variable as a factor with labels ‘male’ and ‘female’.

body$Gender <- factor(body$Gender, levels = c(1,0), labels = c("male", "female"))

Now we can see that Gender is being treated as a categorical variable (i.e. a factor variable in R).

class(body$Gender)

[1] "factor"

Training and Testing

Before fitting our model, let’s divide our data into 60% training and 40% testing. To make our results reproducible let’s also set a seed³.

set.seed(1696623282) #  as.integer(Sys.time()) 
n <- nrow(body)
ptest <- 0.4 # 40% testing (60% training)
ntest <- round(n*ptest)
rind  = sample(n)

# indices
test_ind = rind[1:ntest]
train_ind = rind[(ntest+1):n]

# data frames
test_data = body[test_ind, ]
train_data = body[train_ind, ]

Logistic Regression

Note

This model was covered in detail in the previous lab. Please review if you are shaking on some of the details.

Using Gender as the response variable we can fit the model in R using:

simlog <- glm(Gender ~ Height , family="binomial",  data = body, subset = train_ind)
summary(simlog)


Call:
glm(formula = Gender ~ Height, family = "binomial", data = body, 
    subset = train_ind)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 45.75626    5.13902   8.904   <2e-16 ***
Height      -0.26691    0.02999  -8.899   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 421.10  on 303  degrees of freedom
Residual deviance: 240.86  on 302  degrees of freedom
AIC: 244.86

Number of Fisher Scoring iterations: 5

Notice that since we already converted Gender to a factor we need not wrap the variable name in factor() in the glm function. Notice that the subset argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simlog is the equivalent to:

# output surpressed to save space
glm(Gender ~ Height , family="binomial", data = train_data)

For binary GLMs with family = "binomial", predict(..., type = "link") (default) returns log‑odds; type = "response" returns probability $\Pr(Y= \text{``event class"} \mid x)$ where the “event class” is the second factor level. Because of the way we re-leveled our response in Listing 1, the resulting probability that is being returned is the $\Pr(Y= \text{female} \mid x)$

# Example prediction at Height = 174
predict(simlog, newdata = data.frame(Height = 174))              # log-odds

         1 
-0.6866771

predict(simlog, newdata = data.frame(Height = 174), type = "response") # prob

        1 
0.3347727

These predictions can be verified “by-hand”:

\[\begin{align*} \text{log-odds of being female} &= 45.75626 -0.26691 (\texttt{Height})\\ \implies \quad P(Y=\text{female} \mid X) &= \dfrac{e^{45.75626 -0.26691 (\texttt{Height})}} {1+ e^{45.75626 -0.26691 (\texttt{Height})}} \end{align*}\]

Hence with a height of, say, 174:

\[\begin{align*} \text{log-odds of being female} &= -0.68608 \\ \Pr(Y=\text{female} \mid X=174) &= \dfrac{e^{45.75626 + -0.26691 (174)}} {1+ e^{45.75626 + -0.26691 (174)}}\\ &= 0.3349057 \end{align*}\]

Classification Performance

Test Error

To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.

To calculate the test error in R:

Predict probabilities for the test set using the trained model.
Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
Compare the predictions to the true class labels and calculate the error rate.

Step 1: Predict probabilities for the test set

We first predict the probabilities of the test set observations belonging to the female class.

# Predict probabilities (of being female) on the test set
pred.prob <- predict(simlog, newdata = test_data, type = "response")

pred_gender <- factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=c("male", "female"))
y = test_data$Gender
# see the first 10:
head(cbind(pred_gender, y))

    pred_gender y
20            1 1
228           2 1
334           2 2
205           1 1
503           1 2
42            2 1

Step 2: Convert probabilities to class predictions

Next, we classify each observation as either class female or male using a threshold of 0.5. In other words, if the $\Pr(Y_i = \text{female} \mid x_i) > 0.5$ then $x_i$ is classified as female, otherwise, male.

# Convert probabilities to class predictions using a 0.5 threshold
pred_gender <- factor(ifelse(pred.prob >= 0.5, "female", "male"))

Step 3: Compare predictions to actual test labels and calculate the error rate

We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.

# Actual test labels
actual_gender <- test_data$Gender

# print the result
data.frame(yhat = pred_gender, y = actual_gender)

To calculated the test error (or misclassification rate) we use:

\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{1}\]

where

$n_{\text{text}}$ are the number of observations in our test set
$\hat{y}_i$ represents the predicted class and
${y}_i$ represents the actual class

This can be computer in R using:

# Calculate the error rate
test_error <- mean(pred_gender != actual_gender)
test_error

[1] 0.1674877

Confusion Matrix

As discussed in lecture and in previous labs, we can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table() function:

tab <- table(test_data$Gender, pred_gender)
tab

        pred_gender
         female male
  male       18   82
  female     87   16

Usually when we are organizing a confusiong matrix it should have the correct classifications on the diagonal. This is not the case here. We can fix this by changing the order.

levels(test_data$Gender)

[1] "male"   "female"

levels(pred_gender)

[1] "female" "male"

To reorder the factor levels in R, you can use the factor() function with the levels argument to explicitly specify the new order of the levels. In our case, since we want the levels of pred_gender to match the levels of test_data$Gender, you can reorder pred_gender like this:

# Reorder the levels of pred_gender to match test_data$Gender
pred_gender <- factor(pred_gender, levels = c("male", "female")) # same as: 
pred_gender <- factor(pred_gender, levels = levels(test_data$Gender))

Now that the levels of these two vectors not consistent, lets redo our table.

tab <- table(test_data$Gender, pred_gender)
tab

        pred_gender
         male female
  male     82     18
  female   16     87

That’s better. Now the off-diagonal elements represent the number of misclassified observations, where predictions did not match the actual gender. The diagonal elements represent the correctly classified observations, where predictions matched the actual gender. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 1 by summing the misclassified observations and dividing by the total number of observations.

\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{18+16}{203} = 0.1674877 \end{equation}\]

Factor Level Mismatch Warning

Always ensure that the levels of factor variables are in the same order between the predicted and actual values! When the order of the factor levels are not consistent, this leads to incorrect results in the confusion matrix. You can reorder the factor levels using the factor(, levels = )

Now let’s repeat for the other methods

KNN Classification

Algorithm

K-Nearest Neighbors (KNN) is a simple, intuitive classification algorithm that makes predictions based on the similarity of new data points to existing labeled data points. In KNN, a data point is classified by a majority vote of its $k$ nearest neighbors.

Step 1: Given a positive integer $k$ and a test observation $x_0$, the KNN classifier first identifies the $k$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$.

Step 2: For each class $j$, find \[\begin{equation} P(Y=j \mid X=x_0) \approx \frac{1}{K} \sum_{i \in N_0} I(y_i = j) \end{equation}\]

Step 3: Assign observation $x_0$ to the class ($j$) for which the above probability is largest; “ties” are typical broken at random (i.e. a coin toss)

Note that this is a nonparametric approach. That is, KNN does not make any assumptions about the underlying distribution of the data.

R code

While this would be easy to code up ourselves, we’ll make use of a package.To fit a K-Nearest Neighbors (KNN) classification we can use the knn function from the class package in R. For demonstration purposes we will use $k = 5$ and $k = 11$ (I am strategically using odd numbers to avoid ties). Don’t worry, we’ll learn a more systematic way to select $k$ once we discuss cross-validation.

Note that the arguments for this algorithm are not as easy as glm and lm. Rather than accepting a formula we will need to specify

train = $\boldsymbol X_{\text{train}}$, the matrix of predictors in the training set,
test = $\boldsymbol X_{\text{test}}$, the matrix of predictors in the test set, and
cl = $\boldsymbol y_{\text{train}}$, the true classes for the observations in the training set.

Since we are only using Height as a predictor, we can create this (one-column) matrix⁴ as follows:

library(class)

# Separate features and labels for KNN
train_x <- as.matrix(train_data[, "Height"])    # Convert to matrix
train_y <- train_data$Gender                    # Labels for training data
test_x  <- as.matrix(test_data[, "Height"])     # Convert to matrix
test_y  <- test_data$Gender                     # Labels for testing data

# Fit KNN with k=5
knn_pred5 <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Fit KNN with k=11
knn_pred11 <- knn(train = train_x, test = test_x, cl = train_y, k = 11)

Note that the outputted vector is the predictions on the test set so we can go straight to calculating performance metrics.

Classification performance

As before we can compare the predicted gender in the test set, to the actual gender:

knn_pred5 <- factor(knn_pred5, levels = levels(actual_gender))
data.frame(yhat = knn_pred5, y = actual_gender, disagree = knn_pred5 != actual_gender) # actual_gender is the same as test_y

Our confusion matrix can be found as:

tab_knn <- table(actual_gender, knn_pred5)
tab_knn

             knn_pred5
actual_gender male female
       male     80     20
       female   19     84

Our test error is:

test_error_knn <- mean(knn_pred5 != actual_gender)
test_error_knn

[1] 0.1921182

Similarly for KNN with $k = 11$

knn_pred11 <- factor(knn_pred11, levels = levels(actual_gender))
data.frame(yhat = knn_pred11, y = actual_gender, disagree = knn_pred11 != actual_gender) # actual_gender is the same as test_y

Our confusion matrix can be found as:

tab_knn11 <- table(actual_gender, knn_pred11)
tab_knn11

             knn_pred11
actual_gender male female
       male     73     27
       female   12     91

Our test error is:

test_error_knn11 <- mean(knn_pred11 != actual_gender)
test_error_knn11

[1] 0.1921182

QDA and LDA

Linear Discriminant Analysis (LDA) is a classification algorithm that assumes data is normally distributed and models each class with its own Gaussian distribution. LDA creates a linear decision boundary. Quadratic Discriminant Analysis (QDA) is an extension of LDA that allows for a quadratic decision boundary. Unlike LDA, which assumes equal covariance matrices for each class, QDA estimates a separate covariance matrix for each class. This flexibility allows QDA to handle more complex classification problems but can lead to overfitting with small datasets.

R code

We can use the MASS package for LDA and QDA. Like glm it accepts a formula, a data input, and an optional subset for training index.

library(MASS)

# LDA model
lda_model <- lda(Gender ~ Height, data = body, subset = train_ind)

# QDA model
qda_model <- qda(Gender ~ Height, data = body, subset = train_ind)

Prediction

We make predictions using the predict() function; see ?predict.lda and ?predict.qda for details:

lda_fit <- predict(lda_model, newdata = test_data)

A matrix of the posterior probabilities for each class is stored in $posterior

#  posterior probabilities of the first 6 observations
head(lda_fit$posterior)

          male    female
20  0.75107307 0.2489269
228 0.41261467 0.5873853
334 0.01340679 0.9865932
205 0.89888636 0.1011136
503 0.77381700 0.2261830
42  0.48709961 0.5129004

The predicted class labels for the observations are stored in $class

lda_pred <- lda_fit$class
lda_pred

  [1] male   female female male   male   female male   male   female male  
 [11] male   female female male   male   male   male   male   male   male  
 [21] male   female male   female male   male   female male   male   female
 [31] female female male   female female male   male   female female male  
 [41] male   female female female female female female male   female female
 [51] female female male   male   female female female female female male  
 [61] female male   male   male   male   female female male   female female
 [71] female male   female male   male   female female female female female
 [81] male   male   male   male   male   female female male   female female
 [91] female male   female male   female male   male   male   female female
[101] female male   male   female male   female female male   male   female
[111] male   female female female female female female male   female male  
[121] male   male   male   female male   male   female female male   male  
[131] female female female female female female male   male   male   female
[141] male   male   female male   female female female male   female female
[151] female female male   male   female male   male   male   female female
[161] male   female female male   female male   male   female male   female
[171] male   female female female male   female female male   male   male  
[181] female male   female male   male   male   male   male   male   female
[191] male   male   male   female female female male   male   female male  
[201] female female female
Levels: male female

# remember we should reorder the levels for later: 
lda_pred <- factor(lda_pred, levels = levels(actual_gender))

Similarly for QDA:

qda_fit <- predict(qda_model, newdata = test_data)
qda_pred <- factor(qda_fit$class, levels = levels(actual_gender))

Classification performance

As before we can compare the predicted gender in the test set, to the actual gender:

data.frame(yhat = lda_pred, y = actual_gender, disagree = lda_pred != actual_gender)

Our confusion matrix can be found as:

tab_lda <- table(actual_gender, lda_pred)
tab_lda

             lda_pred
actual_gender male female
       male     82     18
       female   16     87

Our test error is:

test_error_lda <- mean(lda_pred != actual_gender)
test_error_lda

[1] 0.1674877

Similarly for QDA:

data.frame(yhat = qda_pred, y = actual_gender, disagree = lda_pred != actual_gender)

Our confusion matrix can be found as:

tab_qda <- table(actual_gender, qda_pred)
tab_qda

             qda_pred
actual_gender male female
       male     82     18
       female   16     87

Our test error is:

test_error_qda <- mean(qda_pred != actual_gender)
test_error_qda

[1] 0.1674877

As we can see, LDA and QDA are performing identically in this case.

Conclusion

In this lab we’ve fitted several classification methods. Overall, logistic regression, LDA, and QDA perform equally as well on the test data and all achieve and test error rate of 16.75%.

Method	test error
Logistic Regression	0.1674877
KNN ($k$ = 5)	0.1921182
KNN ($k$ = 11)	0.1921182
LDA	0.1674877
QDA	0.1674877

tab

        pred_gender
         male female
  male     82     18
  female   16     87

tab_lda

             lda_pred
actual_gender male female
       male     82     18
       female   16     87

tab_qda

             qda_pred
actual_gender male female
       male     82     18
       female   16     87

In this lab we just looked at some basic metrics for comparison (test error and confusion matrices). Next week we will look at other evaluation metrics including precision, recall, specificity, and the F1 score.

ROC

While we never covered this in lecture, another important metrics is the ROC curve. The Receiver Operating Characteristic, or ROC curve is a graphical tool used to evaluate the performance of binary classification models. It provides a way to visualize and analyze the trade-off between a model’s true positive rate (sensitivity) and its false positive rate (1 - specificity) for different classification thresholds. So instead of choosing a fixed cutoff (like 0.5), the ROC curve shows the trade-off between:

True Positive Rate (TPR) — also called sensitivity or recall: \[ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} \]
False Positive Rate (FPR) — the proportion of negatives incorrectly predicted as positives: \[ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives + True Negatives}} \]

for all possible thresholds.

A model that perfectly distinguishes classes would reach the top-left corner (TPR = 1, FPR = 0). A model that guesses randomly lies along the diagonal line (TPR ≈ FPR).

Area Under the Curve (AUC)

The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). Since the ideal ROC curve will hug the top left corner, this implies that the larger the AUC, the better the classifier. The AUC summarizes the ROC curve in a single number between 0 and 1:

AUC = 1 → perfect classifier
AUC = 0.5 → performs no better than random guessing
AUC < 0.5 → worse than random (predictions reversed)

Using the ROCR package

We will use the ROCR package to plot the ROC curve and obtain the AUC metric. Every classifier evaluation using ROCR starts with creating a prediction object. This function is used to transform the input data (which can be in vector, matrix, data frame, or list form) into a standardized format.

prediction.obj = prediction(predictions, labels)

predictions is a vector of probabilities for the “event class” and
labels is a binary label vector.

library(ROCR)

# Predicted probabilities of the positive class ("female")
prob_log <- predict(simlog, newdata = test_data, type = "response")
prob_lda <- lda_fit$posterior[, "female"]
prob_qda <- qda_fit$posterior[, "female"]

# Truth vector as 0/1 with female = 1  (matches factor order: c("male","female"))
truth01 <- as.numeric(actual_gender == "female")

# ROCR prediction objects
pred_log <- prediction(prob_log, truth01)
pred_lda <- prediction(prob_lda, truth01)
pred_qda <- prediction(prob_qda, truth01)

Performance measures or combinations thereof are computed by feeding the prediction object into the performance() function:

performance(prediction.obj, measure,  x.measure = "cutoff", ...)

prediction.obj An object of class prediction.
measure Performance measure to use for the evaluation. A complete list of the performance measures that are available for measure and x.measure is given in the ‘Details’ section.
x.measure A second performance measure. If different from the default, a two-dimensional curve, with x.measure taken to be the unit in direction of the x axis, and measure to be the unit in direction of the y axis, is created. This curve is parametrized with the cutoff.

The resulting performance object can be visualized using plot(). For example, an ROC curve that trades off the rate of true positives (tpr) against the rate of false positives (fpr) is obtained as follows:

# Overlay ROC curves for side-by-side comparison
perf_log <- performance(pred_log, "tpr", "fpr")
perf_lda <- performance(pred_lda, "tpr", "fpr")
perf_qda <- performance(pred_qda, "tpr", "fpr")


par(mfrow=c(1,3))
plot(perf_log, main = "Logistic Regression")
plot(perf_lda, main = "LDA")
plot(perf_qda, main = "QDA")

To obtain the area under the ROC curve (aka the AUC) we use:

# AUCs
auc_log <- performance(pred_log, "auc")@y.values[[1]]
auc_lda <- performance(pred_lda, "auc")@y.values[[1]]
auc_qda <- performance(pred_qda, "auc")@y.values[[1]]

c(Logistic_AUC = auc_log, LDA_AUC = auc_lda, QDA_AUC = auc_qda)

Logistic_AUC      LDA_AUC      QDA_AUC 
   0.9179126    0.9179126    0.9179126

Logistic Regression, LDA, and QDA yield nearly identical AUC of 0.92, indicating strong separation between male and female based on height. Recall that the maximum AUC is 1.0, so these classifiers would be considered very good.

MLmetrics

To help us calcuate some of the popular metrics for assessing classification meolds we can use the MLmetrics library

library(MLmetrics)


Attaching package: 'MLmetrics'

The following object is masked from 'package:base':

    Recall

For LDA:

# LDA
Sensitivity(test_data$Gender, lda_fit$class)
Recall(test_data$Gender, lda_fit$class) # same as above
Precision(test_data$Gender, lda_fit$class)
Specificity(test_data$Gender, lda_fit$class)
F1_Score(test_data$Gender, lda_fit$class)

[1] 0.82
[1] 0.82
[1] 0.8367347
[1] 0.8446602
[1] 0.8282828

For QDA:

Sensitivity(test_data$Gender, qda_fit$class)
Recall(test_data$Gender, qda_fit$class) # same as above
Precision(test_data$Gender, qda_fit$class)
Specificity(test_data$Gender, qda_fit$class)
F1_Score(test_data$Gender, qda_fit$class)

[1] 0.82
[1] 0.82
[1] 0.8367347
[1] 0.8446602
[1] 0.8282828

Log-Loss

While accuracy and AUC depend only on whether a prediction is correct or on its relative ordering, log-loss (or cross-entropy loss) penalizes how confident a model is in its predictions.

Formally, for binary classification with $n$ test observations: \[ \text{LogLoss} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i) \right] \]

where:

$y_i \in {0, 1}$ are the true labels
$\hat{p}_i$ are the predicted probabilities of the positive class

A model that is confident and correct (e.g., predicts 0.95 for a true positive) receives a small penalty, whereas a model that is confident and wrong (e.g., predicts 0.95 for a false positive) receives a large penalty.

Hence:

Smaller log-loss → better calibrated predictions
Perfect classifier → LogLoss = 0

Because it takes into account both accuracy and confidence, log-loss is often used for probabilistic classifiers such as logistic regression, LDA, and QDA.

To compute the logloss, we can use the LogLoss() function⁵. Judging from help files (fairly sparse), the first argument seems to be the predicted probability vector of classifying a 1 (in this case the predicted probabilty that they will default). Notice that they (bizarrely) have swapped the arguments order from function(truth, predicted) to function(predicted, truth) as compared to the other metrics we’ve used thus far. Thirdly, note that the truth vector is required to be 0-1 labels.

# need to convert "No"/"Yes" to 0/1
truth = as.numeric(test_data$Gender)-1
LogLoss(lda_fit$posterior[,2], truth)

[1] 0.3683008

LogLoss(qda_fit$posterior[,2], truth)

[1] 0.3683334

Recall that smaller is better for log-loss, so LDA performs slightly better based on the log-loss metric.

Footnotes

For example, you can call View(...) or head(...) or srt(...) on your data set to accomplish this.↩︎
the information regarded this encoding can be found in the help file for the body data set which can be accessed using ?body↩︎
In case you were wondering how a generate this seed, I used as.integer(Sys.time()). I like to do this to avoid using similar seeds over and over again, e.g. set.seed(<year>).↩︎
the knn function expects train and test to be a matrix or data frame. If you provide a vector, you will get an error.↩︎
Note that the LogLoss function is specifically for binary. There is also MultiLogLoss from the same library for multiclass scenarios.↩︎