Lab 3

Fitting and Assessing Classification Models

Author

Dr. Irene Vrbik

Published

December 1, 2024

Introduction

In this lab, we move from regression problems, where the response variable is continuous, to classification, where the response variable is categorical. We will focus on fitting three types of classification models: Logistic Regression, K-Nearest Neighbors (KNN), and Discriminant Analysis (LDA and QDA).

The primary aim of this lab is to guide you through the process of implementing these models and assessing their performance using two key evaluation methods: the test error rate and the confusion matrix. These metrics will help you understand how well the model generalizes to unseen data and how its predictions align with actual class labels. This lab does not dive into more advanced metrics like precision, recall, or F1 score, which will be covered in the next lab.

Notation

While in lecture we usually referred to categories of $y$ as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.

Learning Outcomes

By the end of this lab students will be able to:

Fit a Logistic Regression model to binary classification data.
Apply a K-Nearest Neighbors (KNN) classifier.
Use Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for classification.
Evaluate model performance using test error rate and confusion matrix.
Understand how to correctly interpret classification model outputs, including:
- managing factor releveling (to ensure consistent interpretation of class labels) and
- determining which class corresponds to $Y = 1$ in binary models, especially for logistic regression.

Data

The body dataset contains 21 body dimension measurements as well as age, weight (in kg), height (in cm), and gender (binary) on 507 individuals. You can find this in the gclus library as body. As always, lets load our data set into our R session and make sure that it looks as we expect.¹

library(gclus)
data(body)
body

As you will notice in the print out above, Gender is being treated as a numeric integer (as indicated by <int> under the column heading) rather than a factor. If ever you are unsure if you are unsure if your variable is being treated as an integer or a factor, you can check using class

class(body$Gender)

[1] "integer"

As in the linear regression setting where we had to use something like lm(y~ x1 + factor(x2)) to let R know that a predictor variable is a categorical, we need to convert the Gender to the factor class to ensure that these values are being treated as a categories (i.e. levels) rather than numbers. To accomplish this we could run:

body$Gender <- factor(body$Gender)

Alternatively if we want to convert the 1’s to male and 0 to female² we could use something like:

body$Gender <- factor(body$Gender, levels = c(1,0), labels = c("male", "female"))

Now we can see that Gender is being treated as a categorical variable (i.e. a factor variable in R).

class(body$Gender)

[1] "factor"

Training and Testing

Before fitting our model, let’s divide our data into training and testing. To make our results reproducible let’s also set a seed³.

set.seed(1696623282) #  as.integer(Sys.time()) 
n <- nrow(body)
ptest <- 0.4 # 40% testing (60% training)
ntest <- round(n*ptest)
rind  = sample(n)
test_ind = rind[1:ntest]
train_ind = rind[(ntest+1):n]

test_data = body[test_ind, ]
train_data = body[train_ind, ]

Logistic Regression

Model

Logistic regression is a type of regression analysis used when the dependent variable is categorical, typically binary (i.e., having two possible outcomes, such as “yes” or “no,” “success” or “failure”, 0 or 1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of the outcome belonging to a particular category. For binary classification using logistic regression the outputted probability that the dependent variable $Y$ is 1 (i.e., the “positive” class).

\[ \Pr(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n}} \tag{1}\]

where:

$\Pr(Y=1 \mid x)$ is the probability that the outcome is 1 given the predictor variables.
$\beta_0$ is the intercept, and
$\beta_1, \beta_2, ..., \beta_n$ are the coefficients for the predictor variables $X_1, X_2, ..., X_n$.

Hence the odds of $Y=1$ given $X =x$ is $\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}$ and the log-odds⁴ is given by $\log\left(\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}\right)$. Logistic regression models the relationship between the independent variables and the log-odds of the dependent variable as follows:

\[ \text{log-odds} = \log\left(\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n \tag{2}\]

R code

Logistic regression is a type of generalized linear models, hence we fit it using the glm function in R. In terms of inputs, it will feel very similar to using lm, i.e. its first input will be of the formula form Y~X. The family="binomial" indicates that R should fit logistic regression.

# General format for logistic regression
fit <- glm(formula, family = "binomial", data = ..., subset = ...)

Which factor level do the probabilities correspond to?

For the binomial families, the “$1$” in the $\Pr(Y = 1 \mid X)$ corresponds to the second level of our factor. By default R will order levels alphabetically. Our levels are not in alphabetical order because we hard coded them in the Data section. You can read more here for a discussion on how do I know which factor level of my response is coded as 1 in logistic regression?

In our case, since female is the second level of the Gender factor, the model will be predicting the probability of being female, $P(Y = \text{female} \mid X)$. You can check the order of the factor levels using the following code:

levels(body$Gender)

[1] "male"   "female"

Application

Using Gender as the response variable we can fit the model in R using:

simlog <- glm(Gender ~ Height , family="binomial",  data = body, subset = train_ind)
summary(simlog)


Call:
glm(formula = Gender ~ Height, family = "binomial", data = body, 
    subset = train_ind)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 45.75626    5.13902   8.904   <2e-16 ***
Height      -0.26691    0.02999  -8.899   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 421.10  on 303  degrees of freedom
Residual deviance: 240.86  on 302  degrees of freedom
AIC: 244.86

Number of Fisher Scoring iterations: 5

Notice that since we already converted Gender to a factor we need not wrap the variable name in factor() in the lm function. Notice that the subset argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simlog is the equivalent to:

# output surpressed to save space
glm(Gender ~ Height , family="binomial", data = train_data)

Prediction

Like with lm, the output of glm provides the estimates of the parameters, i.e. the $\hat\beta$s. As discussed in further detail in this section, that the the logistic regression coefficients give the change in the log odds of the given observation being female⁵ for every one unit increase in the predictor variable. Plugging our estimates into equations Equation 2 and Equation 1 we get:

\[\begin{align*} \text{log-odds of being female} &= 45.75626 -0.26691 (\texttt{Height})\\ \implies \quad P(Y=\text{female} \mid X) &= \dfrac{e^{45.75626 -0.26691 (\texttt{Height})}} {1+ e^{45.75626 -0.26691 (\texttt{Height})}} \end{align*}\]

Hence with a height of, say, 174:

\[\begin{align*} \text{log-odds of being female} &= -0.68608 \\ \Pr(Y=\text{female} \mid X=174) &= \dfrac{e^{45.75626 + -0.26691 (174)}} {1+ e^{45.75626 + -0.26691 (174)}}\\ &= 0.3349057 \end{align*}\]

We could find this using the predict() function. You have two ways of specifying the type of prediction (as determined by the type argument in predict.glm; see ?predict.gml for details):

The default is on the scale of the linear predictors; Thus for a logistic regression default predictions are of log-odds value for the new data point, which represents the natural logarithm of the odds that the observation belongs to the positive class.
the alternative type = "response" is on the scale of the response variable. Thus for logistic regression, type = "response" gives the predicted probabilities.

# returns the log-odds of being a female
predict(simlog, newdata = data.frame(Height = 174))

         1 
-0.6866771

# returns the probability of being a female 
predict(simlog, newdata = data.frame(Height = 174), type = "response")

        1 
0.3347727

Warning

The predict function will not work unless your data is stored in a data frame with column names that match the variables used in training your logistic regression model.

# WRONG ways to specify 'newdata' (will produce an error)
predict(simlog, newdata = 174, type = "response")

Since we will be reusing this data, let’s save it to an object:

newdata1 <- data.frame(Height = 174)

To convert the above probability into a classification, simply map observations to whichever class for which it has the highest probability of belonging. In our case \[ \begin{align} \Pr(Y = \text{female} \mid X = 174) &= 0.3347727\\ \implies \Pr(Y = \text{male} \mid X = 174) &= 1 - 0.3347727 \\ &= 0.6652273 \end{align} \]

Since a person with the height of 174 has a higher probability of being male than female, the observation is predicted to be male.

Important

Recall: the output of logistic regression is $\Pr(Y = 1\mid x)$. Therefore, for binary response variables ($y = 1$ or $y = 0$) any observations with predicted probability greater that 0.5 are classified as class 1.

Classification Performance

Test Error

To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.

To calculate the test error in R:

Predict probabilities for the test set using the trained model.
Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
Compare the predictions to the true class labels and calculate the error rate.

Step 1: Predict probabilities for the test set

We first predict the probabilities of the test set observations belonging to the female class.

# Predict probabilities (of being female) on the test set
pred.prob <- predict(simlog, newdata = test_data, type = "response")

pred_gender <- factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=c("male", "female"))
y = test_data$Gender
# see the first 10:
head(cbind(pred_gender, y))

    pred_gender y
20            1 1
228           2 1
334           2 2
205           1 1
503           1 2
42            2 1

Step 2: Convert probabilities to class predictions

Next, we classify each observation as either class female or male using a threshold of 0.5. In other words, if the $\Pr(Y_i = \text{female} \mid x_i) > 0.5$ then $x_i$ is classified as female, otherwise, male.

# Convert probabilities to class predictions using a 0.5 threshold
pred_gender <- factor(ifelse(pred.prob >= 0.5, "female", "male"))

Step 3: Compare predictions to actual test labels and calculate the error rate

We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.

# Actual test labels
actual_gender <- test_data$Gender

# print the result
data.frame(yhat = pred_gender, y = actual_gender)

To calculated the test error (or misclassification rate) we use:

\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{3}\]

where

$n_{\text{text}}$ are the number of observations in our test set
$\hat{y}_i$ represents the predicted class and
${y}_i$ represents the actual class

This can be computer in R using:

# Calculate the error rate
test_error <- mean(pred_gender != actual_gender)
test_error

[1] 0.1674877

Confusion Matrix

As discussed in lecture and in previous labs, we can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table() function:

tab <- table(test_data$Gender, pred_gender)
tab

        pred_gender
         female male
  male       18   82
  female     87   16

Usually when we are organizing a confusiong matrix it should have the correct classifications on the diagonal. This is not the case here. We can fix this by changing the order.

levels(test_data$Gender)

[1] "male"   "female"

levels(pred_gender)

[1] "female" "male"

To reorder the factor levels in R, you can use the factor() function with the levels argument to explicitly specify the new order of the levels. In our case, since we want the levels of pred_gender to match the levels of test_data$Gender, you can reorder pred_gender like this:

# Reorder the levels of pred_gender to match test_data$Gender
pred_gender <- factor(pred_gender, levels = c("male", "female")) # same as: 
pred_gender <- factor(pred_gender, levels = levels(test_data$Gender))

Now that the levels of these two vectors not consistent, lets redo our table.

tab <- table(test_data$Gender, pred_gender)
tab

        pred_gender
         male female
  male     82     18
  female   16     87

That’s better. Now the off-diagonal elements represent the number of misclassified observations, where predictions did not match the actual gender. The diagonal elements represent the correctly classified observations, where predictions matched the actual gender. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 3 by summing the misclassified observations and dividing by the total number of observations.

\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{18+16}{203} = 0.1674877 \end{equation}\]

Factor Level Mismatch Warning

Always ensure that the levels of factor variables are in the same order between the predicted and actual values! When the order of the factor levels are not consistent, this leads to incorrect results in the confusion matrix. You can reorder the factor levels using the factor(, levels = )

Now let’s repeat for the other methods

KNN Classification

Algorithm

K-Nearest Neighbors (KNN) is a simple, intuitive classification algorithm that makes predictions based on the similarity of new data points to existing labeled data points. In KNN, a data point is classified by a majority vote of its $k$ nearest neighbors.

Step 1: Given a positive integer $k$ and a test observation $x_0$, the KNN classifier first identifies the $k$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$.

Step 2: For each class $j$, find \[\begin{equation} P(Y=j \mid X=x_0) \approx \frac{1}{K} \sum_{i \in N_0} I(y_i = j) \end{equation}\]

Step 3: Assign observation $x_0$ to the class ($j$) for which the above probability is largest; “ties” are typical broken at random (i.e. a coin toss)

Note that this is a nonparametric approach. That is, KNN does not make any assumptions about the underlying distribution of the data.

R code

While this would be easy to code up ourselves, we’ll make use of a package.To fit a K-Nearest Neighbors (KNN) classification we can use the knn function from the class package in R. For demonstration purposes we will use $k = 5$ and $k = 11$ (I am strategically using odd numbers to avoid ties). Don’t worry, we’ll learn a more systematic way to select $k$ once we discuss cross-validation.

Note that the arguments for this algorithm are not as easy as glm and lm. Rather than accepting a formula we will need to specify

train = $\boldsymbol X_{\text{train}}$, the matrix of predictors in the training set,
test = $\boldsymbol X_{\text{test}}$, the matrix of predictors in the test set, and
cl = $\boldsymbol y_{\text{train}}$, the true classes for the observations in the training set.

Since we are only using Height as a predictor, we can create this (one-column) matrix⁶ as follows:

library(class)

# Separate features and labels for KNN
train_x <- as.matrix(train_data[, "Height"])    # Convert to matrix
train_y <- train_data$Gender                    # Labels for training data
test_x  <- as.matrix(test_data[, "Height"])     # Convert to matrix
test_y  <- test_data$Gender                     # Labels for testing data

# Fit KNN with k=5
knn_pred5 <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Fit KNN with k=11
knn_pred11 <- knn(train = train_x, test = test_x, cl = train_y, k = 11)

Note that the outputted vector is the predictions on the test set so we can go straight to calculating performance metrics.

Classification performance

As before we can compare the predicted gender in the test set, to the actual gender:

knn_pred5 <- factor(knn_pred5, levels = levels(actual_gender))
data.frame(yhat = knn_pred5, y = actual_gender, disagree = knn_pred5 != actual_gender) # actual_gender is the same as test_y

Our confusion matrix can be found as:

tab_knn <- table(actual_gender, knn_pred5)
tab_knn

             knn_pred5
actual_gender male female
       male     80     20
       female   19     84

Our test error is:

test_error_knn <- mean(knn_pred5 != actual_gender)
test_error_knn

[1] 0.1921182

Similarly for KNN with $k = 11$

knn_pred11 <- factor(knn_pred11, levels = levels(actual_gender))
data.frame(yhat = knn_pred11, y = actual_gender, disagree = knn_pred11 != actual_gender) # actual_gender is the same as test_y

Our confusion matrix can be found as:

tab_knn11 <- table(actual_gender, knn_pred11)
tab_knn11

             knn_pred11
actual_gender male female
       male     73     27
       female   12     91

Our test error is:

test_error_knn11 <- mean(knn_pred11 != actual_gender)
test_error_knn11

[1] 0.1921182

QDA and LDA

Linear Discriminant Analysis (LDA) is a classification algorithm that assumes data is normally distributed and models each class with its own Gaussian distribution. LDA creates a linear decision boundary. Quadratic Discriminant Analysis (QDA) is an extension of LDA that allows for a quadratic decision boundary. Unlike LDA, which assumes equal covariance matrices for each class, QDA estimates a separate covariance matrix for each class. This flexibility allows QDA to handle more complex classification problems but can lead to overfitting with small datasets.

R code

We can use the MASS package for LDA and QDA. Like glm it accepts a formula, a data input, and an optional subset for training index.

library(MASS)

# LDA model
lda_model <- lda(Gender ~ Height, data = body, subset = train_ind)

# QDA model
qda_model <- qda(Gender ~ Height, data = body, subset = train_ind)

Prediction

We make predictions using the predict() function; see ?predict.lda and ?predict.qda for details:

lda_fit <- predict(lda_model, newdata = test_data)

A matrix of the posterior probabilities for each class is stored in $posterior

#  posterior probabilities of the first 6 observations
head(lda_fit$posterior)

          male    female
20  0.75107307 0.2489269
228 0.41261467 0.5873853
334 0.01340679 0.9865932
205 0.89888636 0.1011136
503 0.77381700 0.2261830
42  0.48709961 0.5129004

The predicted class labels for the observations are stored in $class

lda_pred <- lda_fit$class
lda_pred

  [1] male   female female male   male   female male   male   female male  
 [11] male   female female male   male   male   male   male   male   male  
 [21] male   female male   female male   male   female male   male   female
 [31] female female male   female female male   male   female female male  
 [41] male   female female female female female female male   female female
 [51] female female male   male   female female female female female male  
 [61] female male   male   male   male   female female male   female female
 [71] female male   female male   male   female female female female female
 [81] male   male   male   male   male   female female male   female female
 [91] female male   female male   female male   male   male   female female
[101] female male   male   female male   female female male   male   female
[111] male   female female female female female female male   female male  
[121] male   male   male   female male   male   female female male   male  
[131] female female female female female female male   male   male   female
[141] male   male   female male   female female female male   female female
[151] female female male   male   female male   male   male   female female
[161] male   female female male   female male   male   female male   female
[171] male   female female female male   female female male   male   male  
[181] female male   female male   male   male   male   male   male   female
[191] male   male   male   female female female male   male   female male  
[201] female female female
Levels: male female

# remember we should reorder the levels for later: 
lda_pred <- factor(lda_pred, levels = levels(actual_gender))

Similarly for QDA:

qda_fit <- predict(qda_model, newdata = test_data)
qda_pred <- factor(qda_fit$class, levels = levels(actual_gender))

Classification performance

As before we can compare the predicted gender in the test set, to the actual gender:

data.frame(yhat = lda_pred, y = actual_gender, disagree = lda_pred != actual_gender)

Our confusion matrix can be found as:

tab_lda <- table(actual_gender, lda_pred)
tab_lda

             lda_pred
actual_gender male female
       male     82     18
       female   16     87

Our test error is:

test_error_lda <- mean(lda_pred != actual_gender)
test_error_lda

[1] 0.1674877

Similarly for QDA:

data.frame(yhat = qda_pred, y = actual_gender, disagree = lda_pred != actual_gender)

Our confusion matrix can be found as:

tab_qda <- table(actual_gender, qda_pred)
tab_qda

             qda_pred
actual_gender male female
       male     82     18
       female   16     87

Our test error is:

test_error_qda <- mean(qda_pred != actual_gender)
test_error_qda

[1] 0.1674877

Conclusion

In this lab we’ve fitted several classification methods. Overall, logistic regression, LDA, and QDA perform equally as well on the test data and all achieve and test error rate of 17%.

Method	test error
Logistic Regression	0.1674877
KNN ($k$ = 5)	0.1921182
KNN ($k$ = 11)	0.1921182
LDA	0.1674877
QDA	0.1674877

tab

        pred_gender
         male female
  male     82     18
  female   16     87

tab_lda

             lda_pred
actual_gender male female
       male     82     18
       female   16     87

tab_qda

             qda_pred
actual_gender male female
       male     82     18
       female   16     87

In this lab we just looked at some basic metrics for comparison (test error and confusion matrices). Next week we will look at other evaluation metrics including precision, recall, specificity, and the F1 score.

Footnotes

For example, you can call View(...) or head(...) or srt(...) on your data set to accomplish this.↩︎
the information regarded this encoding can be found in the help file for the body data set which can be accessed using ?body↩︎
In case you were wondering how a generate this seed, I used as.integer(Sys.time()). I like to do this to avoid using similar seeds over and over again, e.g. set.seed(<year>).↩︎
we are referring to the natural logarithm↩︎
why female? because this for family="binomial" the second level of the response variable y is treated as the positive, i.e. $Y = 1$, class. See Which factor level do the probabilities correspond to? for more explanation.↩︎
the knn function expects train and test to be a matrix or data frame. If you provide a vector, you will get an error.↩︎