library(gclus)
data(body)
body
Lab 3
Fitting and Assessing Classification Models
Introduction
In this lab, we move from regression problems, where the response variable is continuous, to classification, where the response variable is categorical. We will focus on fitting three types of classification models: Logistic Regression, K-Nearest Neighbors (KNN), and Discriminant Analysis (LDA and QDA).
The primary aim of this lab is to guide you through the process of implementing these models and assessing their performance using two key evaluation methods: the test error rate and the confusion matrix. These metrics will help you understand how well the model generalizes to unseen data and how its predictions align with actual class labels. This lab does not dive into more advanced metrics like precision, recall, or F1 score, which will be covered in the next lab.
While in lecture we usually referred to categories of \(y\) as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.
Learning Outcomes
By the end of this lab students will be able to:
- Fit a Logistic Regression model to binary classification data.
- Apply a K-Nearest Neighbors (KNN) classifier.
- Use Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for classification.
- Evaluate model performance using test error rate and confusion matrix.
- Understand how to correctly interpret classification model outputs, including:
- managing factor releveling (to ensure consistent interpretation of class labels) and
- determining which class corresponds to \(Y = 1\) in binary models, especially for logistic regression.
Data
The body
dataset contains 21 body dimension measurements as well as age, weight (in kg), height (in cm), and gender (binary) on 507 individuals. You can find this in the gclus library as body
. As always, lets load our data set into our R session and make sure that it looks as we expect.1
As you will notice in the print out above, Gender
is being treated as a numeric integer (as indicated by <int> under the column heading) rather than a factor. If ever you are unsure if you are unsure if your variable is being treated as an integer or a factor, you can check using class
class(body$Gender)
[1] "integer"
As in the linear regression setting where we had to use something like lm(y~ x1 + factor(x2))
to let R know that a predictor variable is a categorical, we need to convert the Gender
to the factor class to ensure that these values are being treated as a categories (i.e. levels) rather than numbers. To accomplish this we could run:
$Gender <- factor(body$Gender) body
Alternatively if we want to convert the 1’s to male
and 0 to female
2 we could use something like:
$Gender <- factor(body$Gender, levels = c(1,0), labels = c("male", "female")) body
Now we can see that Gender
is being treated as a categorical variable (i.e. a factor variable in R).
class(body$Gender)
[1] "factor"
Training and Testing
Before fitting our model, let’s divide our data into training and testing. To make our results reproducible let’s also set a seed3.
set.seed(1696623282) # as.integer(Sys.time())
<- nrow(body)
n <- 0.4 # 40% testing (60% training)
ptest <- round(n*ptest)
ntest = sample(n)
rind = rind[1:ntest]
test_ind = rind[(ntest+1):n]
train_ind
= body[test_ind, ]
test_data = body[train_ind, ] train_data
Logistic Regression
Model
Logistic regression is a type of regression analysis used when the dependent variable is categorical, typically binary (i.e., having two possible outcomes, such as “yes” or “no,” “success” or “failure”, 0 or 1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of the outcome belonging to a particular category. For binary classification using logistic regression the outputted probability that the dependent variable \(Y\) is 1 (i.e., the “positive” class).
\[ \Pr(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n}} \tag{1}\]
where:
- \(\Pr(Y=1 \mid x)\) is the probability that the outcome is 1 given the predictor variables.
- \(\beta_0\) is the intercept, and
- \(\beta_1, \beta_2, ..., \beta_n\) are the coefficients for the predictor variables \(X_1, X_2, ..., X_n\).
Hence the odds of \(Y=1\) given \(X =x\) is \(\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}\) and the log-odds4 is given by \(\log\left(\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}\right)\). Logistic regression models the relationship between the independent variables and the log-odds of the dependent variable as follows:
\[ \text{log-odds} = \log\left(\frac{\Pr(Y=1 \mid x)}{1- \Pr(Y=1 \mid x)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n \tag{2}\]
R code
Logistic regression is a type of generalized linear models, hence we fit it using the glm
function in R. In terms of inputs, it will feel very similar to using lm
, i.e. its first input will be of the formula form Y~X
. The family="binomial"
indicates that R should fit logistic regression.
# General format for logistic regression
<- glm(formula, family = "binomial", data = ..., subset = ...) fit
For the binomial families, the “\(1\)” in the \(\Pr(Y = 1 \mid X)\) corresponds to the second level of our factor. By default R will order levels alphabetically. Our levels are not in alphabetical order because we hard coded them in the Data section. You can read more here for a discussion on how do I know which factor level of my response is coded as 1 in logistic regression?
In our case, since female
is the second level of the Gender
factor, the model will be predicting the probability of being female, \(P(Y = \text{female} \mid X)\). You can check the order of the factor levels using the following code:
levels(body$Gender)
[1] "male" "female"
Application
Using Gender
as the response variable we can fit the model in R using:
<- glm(Gender ~ Height , family="binomial", data = body, subset = train_ind)
simlog summary(simlog)
Call:
glm(formula = Gender ~ Height, family = "binomial", data = body,
subset = train_ind)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 45.75626 5.13902 8.904 <2e-16 ***
Height -0.26691 0.02999 -8.899 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 421.10 on 303 degrees of freedom
Residual deviance: 240.86 on 302 degrees of freedom
AIC: 244.86
Number of Fisher Scoring iterations: 5
Notice that since we already converted Gender
to a factor we need not wrap the variable name in factor()
in the lm
function. Notice that the subset
argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simlog
is the equivalent to:
# output surpressed to save space
glm(Gender ~ Height , family="binomial", data = train_data)
Prediction
Like with lm
, the output of glm
provides the estimates of the parameters, i.e. the \(\hat\beta\)s. As discussed in further detail in this section, that the the logistic regression coefficients give the change in the log odds of the given observation being female5 for every one unit increase in the predictor variable. Plugging our estimates into equations Equation 2 and Equation 1 we get:
\[\begin{align*} \text{log-odds of being female} &= 45.75626 -0.26691 (\texttt{Height})\\ \implies \quad P(Y=\text{female} \mid X) &= \dfrac{e^{45.75626 -0.26691 (\texttt{Height})}} {1+ e^{45.75626 -0.26691 (\texttt{Height})}} \end{align*}\]
Hence with a height of, say, 174:
\[\begin{align*} \text{log-odds of being female} &= -0.68608 \\ \Pr(Y=\text{female} \mid X=174) &= \dfrac{e^{45.75626 + -0.26691 (174)}} {1+ e^{45.75626 + -0.26691 (174)}}\\ &= 0.3349057 \end{align*}\]
We could find this using the predict()
function. You have two ways of specifying the type of prediction (as determined by the type
argument in predict.glm
; see ?predict.gml
for details):
- The default is on the scale of the linear predictors; Thus for a logistic regression default predictions are of log-odds value for the new data point, which represents the natural logarithm of the odds that the observation belongs to the positive class.
- the alternative
type = "response"
is on the scale of the response variable. Thus for logistic regression,type = "response"
gives the predicted probabilities.
# returns the log-odds of being a female
predict(simlog, newdata = data.frame(Height = 174))
1
-0.6866771
# returns the probability of being a female
predict(simlog, newdata = data.frame(Height = 174), type = "response")
1
0.3347727
The predict
function will not work unless your data is stored in a data frame with column names that match the variables used in training your logistic regression model.
# WRONG ways to specify 'newdata' (will produce an error)
predict(simlog, newdata = 174, type = "response")
Since we will be reusing this data, let’s save it to an object:
<- data.frame(Height = 174) newdata1
To convert the above probability into a classification, simply map observations to whichever class for which it has the highest probability of belonging. In our case \[ \begin{align} \Pr(Y = \text{female} \mid X = 174) &= 0.3347727\\ \implies \Pr(Y = \text{male} \mid X = 174) &= 1 - 0.3347727 \\ &= 0.6652273 \end{align} \]
Since a person with the height of 174 has a higher probability of being male than female, the observation is predicted to be male.
Recall: the output of logistic regression is \(\Pr(Y = 1\mid x)\). Therefore, for binary response variables (\(y = 1\) or \(y = 0\)) any observations with predicted probability greater that 0.5 are classified as class 1.
Classification Performance
Test Error
To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.
To calculate the test error in R:
- Predict probabilities for the test set using the trained model.
- Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
- Compare the predictions to the true class labels and calculate the error rate.
Step 1: Predict probabilities for the test set
We first predict the probabilities of the test set observations belonging to the female
class.
# Predict probabilities (of being female) on the test set
<- predict(simlog, newdata = test_data, type = "response")
pred.prob
<- factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=c("male", "female"))
pred_gender = test_data$Gender
y # see the first 10:
head(cbind(pred_gender, y))
pred_gender y
20 1 1
228 2 1
334 2 2
205 1 1
503 1 2
42 2 1
Step 2: Convert probabilities to class predictions
Next, we classify each observation as either class female
or male
using a threshold of 0.5. In other words, if the \(\Pr(Y_i = \text{female} \mid x_i) > 0.5\) then \(x_i\) is classified as female
, otherwise, male
.
# Convert probabilities to class predictions using a 0.5 threshold
<- factor(ifelse(pred.prob >= 0.5, "female", "male")) pred_gender
Step 3: Compare predictions to actual test labels and calculate the error rate
We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.
# Actual test labels
<- test_data$Gender
actual_gender
# print the result
data.frame(yhat = pred_gender, y = actual_gender)
To calculated the test error (or misclassification rate) we use:
\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{3}\]
where
- \(n_{\text{text}}\) are the number of observations in our test set
- \(\hat{y}_i\) represents the predicted class and
- \({y}_i\) represents the actual class
This can be computer in R using:
# Calculate the error rate
<- mean(pred_gender != actual_gender)
test_error test_error
[1] 0.1674877
Confusion Matrix
As discussed in lecture and in previous labs, we can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table()
function:
<- table(test_data$Gender, pred_gender)
tab tab
pred_gender
female male
male 18 82
female 87 16
Usually when we are organizing a confusiong matrix it should have the correct classifications on the diagonal. This is not the case here. We can fix this by changing the order.
levels(test_data$Gender)
[1] "male" "female"
levels(pred_gender)
[1] "female" "male"
To reorder the factor levels in R, you can use the factor()
function with the levels
argument to explicitly specify the new order of the levels. In our case, since we want the levels of pred_gender
to match the levels of test_data$Gender
, you can reorder pred_gender
like this:
# Reorder the levels of pred_gender to match test_data$Gender
<- factor(pred_gender, levels = c("male", "female")) # same as:
pred_gender <- factor(pred_gender, levels = levels(test_data$Gender)) pred_gender
Now that the levels of these two vectors not consistent, lets redo our table.
<- table(test_data$Gender, pred_gender)
tab tab
pred_gender
male female
male 82 18
female 16 87
That’s better. Now the off-diagonal elements represent the number of misclassified observations, where predictions did not match the actual gender. The diagonal elements represent the correctly classified observations, where predictions matched the actual gender. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 3 by summing the misclassified observations and dividing by the total number of observations.
\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{18+16}{203} = 0.1674877 \end{equation}\]
Always ensure that the levels of factor variables are in the same order between the predicted and actual values! When the order of the factor levels are not consistent, this leads to incorrect results in the confusion matrix. You can reorder the factor levels using the factor(, levels = )
Now let’s repeat for the other methods
KNN Classification
Algorithm
K-Nearest Neighbors (KNN) is a simple, intuitive classification algorithm that makes predictions based on the similarity of new data points to existing labeled data points. In KNN, a data point is classified by a majority vote of its \(k\) nearest neighbors.
Step 1: Given a positive integer \(k\) and a test observation \(x_0\), the KNN classifier first identifies the \(k\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\).
Step 2: For each class \(j\), find \[\begin{equation} P(Y=j \mid X=x_0) \approx \frac{1}{K} \sum_{i \in N_0} I(y_i = j) \end{equation}\]
Step 3: Assign observation \(x_0\) to the class (\(j\)) for which the above probability is largest; “ties” are typical broken at random (i.e. a coin toss)
Note that this is a nonparametric approach. That is, KNN does not make any assumptions about the underlying distribution of the data.
R code
While this would be easy to code up ourselves, we’ll make use of a package.To fit a K-Nearest Neighbors (KNN) classification we can use the knn
function from the class
package in R. For demonstration purposes we will use \(k = 5\) and \(k = 11\) (I am strategically using odd numbers to avoid ties). Don’t worry, we’ll learn a more systematic way to select \(k\) once we discuss cross-validation.
Note that the arguments for this algorithm are not as easy as glm
and lm
. Rather than accepting a formula we will need to specify
train
= \(\boldsymbol X_{\text{train}}\), the matrix of predictors in the training set,test
= \(\boldsymbol X_{\text{test}}\), the matrix of predictors in the test set, andcl
= \(\boldsymbol y_{\text{train}}\), the true classes for the observations in the training set.
Since we are only using Height
as a predictor, we can create this (one-column) matrix6 as follows:
library(class)
# Separate features and labels for KNN
<- as.matrix(train_data[, "Height"]) # Convert to matrix
train_x <- train_data$Gender # Labels for training data
train_y <- as.matrix(test_data[, "Height"]) # Convert to matrix
test_x <- test_data$Gender # Labels for testing data
test_y
# Fit KNN with k=5
<- knn(train = train_x, test = test_x, cl = train_y, k = 5)
knn_pred5
# Fit KNN with k=11
<- knn(train = train_x, test = test_x, cl = train_y, k = 11) knn_pred11
Note that the outputted vector is the predictions on the test set so we can go straight to calculating performance metrics.
Classification performance
As before we can compare the predicted gender in the test set, to the actual gender:
<- factor(knn_pred5, levels = levels(actual_gender))
knn_pred5 data.frame(yhat = knn_pred5, y = actual_gender, disagree = knn_pred5 != actual_gender) # actual_gender is the same as test_y
Our confusion matrix can be found as:
<- table(actual_gender, knn_pred5)
tab_knn tab_knn
knn_pred5
actual_gender male female
male 80 20
female 19 84
Our test error is:
<- mean(knn_pred5 != actual_gender)
test_error_knn test_error_knn
[1] 0.1921182
Similarly for KNN with \(k = 11\)
<- factor(knn_pred11, levels = levels(actual_gender))
knn_pred11 data.frame(yhat = knn_pred11, y = actual_gender, disagree = knn_pred11 != actual_gender) # actual_gender is the same as test_y
Our confusion matrix can be found as:
<- table(actual_gender, knn_pred11)
tab_knn11 tab_knn11
knn_pred11
actual_gender male female
male 73 27
female 12 91
Our test error is:
<- mean(knn_pred11 != actual_gender)
test_error_knn11 test_error_knn11
[1] 0.1921182
QDA and LDA
Linear Discriminant Analysis (LDA) is a classification algorithm that assumes data is normally distributed and models each class with its own Gaussian distribution. LDA creates a linear decision boundary. Quadratic Discriminant Analysis (QDA) is an extension of LDA that allows for a quadratic decision boundary. Unlike LDA, which assumes equal covariance matrices for each class, QDA estimates a separate covariance matrix for each class. This flexibility allows QDA to handle more complex classification problems but can lead to overfitting with small datasets.
R code
We can use the MASS
package for LDA and QDA. Like glm
it accepts a formula
, a data
input, and an optional subset
for training index.
library(MASS)
# LDA model
<- lda(Gender ~ Height, data = body, subset = train_ind)
lda_model
# QDA model
<- qda(Gender ~ Height, data = body, subset = train_ind) qda_model
Prediction
We make predictions using the predict()
function; see ?predict.lda
and ?predict.qda
for details:
<- predict(lda_model, newdata = test_data) lda_fit
A matrix of the posterior probabilities for each class is stored in $posterior
# posterior probabilities of the first 6 observations
head(lda_fit$posterior)
male female
20 0.75107307 0.2489269
228 0.41261467 0.5873853
334 0.01340679 0.9865932
205 0.89888636 0.1011136
503 0.77381700 0.2261830
42 0.48709961 0.5129004
The predicted class labels for the observations are stored in $class
<- lda_fit$class
lda_pred lda_pred
[1] male female female male male female male male female male
[11] male female female male male male male male male male
[21] male female male female male male female male male female
[31] female female male female female male male female female male
[41] male female female female female female female male female female
[51] female female male male female female female female female male
[61] female male male male male female female male female female
[71] female male female male male female female female female female
[81] male male male male male female female male female female
[91] female male female male female male male male female female
[101] female male male female male female female male male female
[111] male female female female female female female male female male
[121] male male male female male male female female male male
[131] female female female female female female male male male female
[141] male male female male female female female male female female
[151] female female male male female male male male female female
[161] male female female male female male male female male female
[171] male female female female male female female male male male
[181] female male female male male male male male male female
[191] male male male female female female male male female male
[201] female female female
Levels: male female
# remember we should reorder the levels for later:
<- factor(lda_pred, levels = levels(actual_gender)) lda_pred
Similarly for QDA:
<- predict(qda_model, newdata = test_data)
qda_fit <- factor(qda_fit$class, levels = levels(actual_gender)) qda_pred
Classification performance
As before we can compare the predicted gender in the test set, to the actual gender:
data.frame(yhat = lda_pred, y = actual_gender, disagree = lda_pred != actual_gender)
Our confusion matrix can be found as:
<- table(actual_gender, lda_pred)
tab_lda tab_lda
lda_pred
actual_gender male female
male 82 18
female 16 87
Our test error is:
<- mean(lda_pred != actual_gender)
test_error_lda test_error_lda
[1] 0.1674877
Similarly for QDA:
data.frame(yhat = qda_pred, y = actual_gender, disagree = lda_pred != actual_gender)
Our confusion matrix can be found as:
<- table(actual_gender, qda_pred)
tab_qda tab_qda
qda_pred
actual_gender male female
male 82 18
female 16 87
Our test error is:
<- mean(qda_pred != actual_gender)
test_error_qda test_error_qda
[1] 0.1674877
Conclusion
In this lab we’ve fitted several classification methods. Overall, logistic regression, LDA, and QDA perform equally as well on the test data and all achieve and test error rate of 17%.
Method | test error |
---|---|
Logistic Regression | 0.1674877 |
KNN (\(k\) = 5) | 0.1921182 |
KNN (\(k\) = 11) | 0.1921182 |
LDA | 0.1674877 |
QDA | 0.1674877 |
tab
pred_gender
male female
male 82 18
female 16 87
tab_lda
lda_pred
actual_gender male female
male 82 18
female 16 87
tab_qda
qda_pred
actual_gender male female
male 82 18
female 16 87
In this lab we just looked at some basic metrics for comparison (test error and confusion matrices). Next week we will look at other evaluation metrics including precision, recall, specificity, and the F1 score.
Footnotes
For example, you can call
View(...)
orhead(...)
orsrt(...)
on your data set to accomplish this.↩︎the information regarded this encoding can be found in the help file for the
body
data set which can be accessed using?body
↩︎In case you were wondering how a generate this seed, I used
as.integer(Sys.time())
. I like to do this to avoid using similar seeds over and over again, e.g.set.seed(<year>)
.↩︎we are referring to the natural logarithm↩︎
why female? because this for
family="binomial"
the second level of the response variabley
is treated as the positive, i.e. \(Y = 1\), class. See Which factor level do the probabilities correspond to? for more explanation.↩︎the
knn
function expectstrain
andtest
to be a matrix or data frame. If you provide a vector, you will get an error.↩︎