Fitting and Assessing Logistic Regression Models

DATA 311: Machine Learning

Author
Affiliation

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

In this lab we move from regression with continuous outcomes to classification, focusing on binary logistic regression using the UCI Estimation of Obesity Levels dataset. We’ll (i) prepare the data (including handling ordinal items and avoiding information leakage from BMI-related variables), (ii) fit logistic models, (iii) compare models via AIC using forward, backward, and both-direction stepwise selection, and (iv) assess performance with a confusion matrix, test error, precision/recall, and F1.

The primary aim of this lab is to guide you through the process of implementing these models and assessing their performance using two key evaluation methods: the test error rate and the confusion matrix. These metrics will help you understand how well the model generalizes to unseen data and how its predictions align with actual class labels.

Notation

While in lecture we usually referred to categories of \(y\) as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.

Learning Outcomes

By the end of this lab students will be able to:

  • Prepare and clean data for binary classification (including ordered factors and avoiding information leakage).
  • Fit logistic regression models (null, simple, full) and run stepwise selection (forward, backward, both).
  • Interpret logistic regression output (coefficients on the log-odds scale).
  • Evaluate models with confusion matrix, test error, precision, recall, and F1.
  • Compare models fairly using held-out test data.

Data

Our dataset comes from the UCI Estimation of Obesity Levels study. It contains information on eating habits, lifestyle, and physical condition for individuals, along with their diagnosed weight category.

For the purposes of this lab, we will focus on predicting a binary outcome:

  • Obese: if the individual is classified as Obesity Type I, II, or III
  • Not Obese: if the individual is classified as Insufficient Weight, Normal Weight, or Overweight (Levels I & II)

This allows us to apply binary logistic regression, where the model estimates the probability:

\[ \pi(x) = \Pr(Y = \text{Obese} \mid X = x) \]

Throughout the lab, we will fit different classification models, evaluate their performance on a held-out test set, and compare how well they predict obesity status from eating habits and physical activity.

library(tidyverse)
# Read in the dataset
obesity <- read.csv("obesity.csv", stringsAsFactors = TRUE) 

Because the dataset is partly synthetic, some of these ordinal survey variables (e.g., FCVC, NCP, CH2O, FAF, TUE) contain decimal values (e.g., 2.3 or 1.7). These decimals are artifacts of the imputation process used when creating the dataset. To restore their intended meaning, we will round and convert back to ordered factors.

Warning

When I first converted NCP into an ordered factor, I noticed some values were being turned into NA. This happened because the dataset contains values of 4, even though the documentation states that NCP should represent the number of main meals per day (1 = between 1 and 2, 2 = three, 3 = more than three). Since the variable had unexpected values, I decided it was simpler (and also somewhat reasonable) to treat NCP as a numeric variable.

To make this a binary classification problem, we create a new variable called Obese, which will be 1 if the individual is obese (Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III)) and 0 if the individual is Insufficient_Weight, Normal_Weight,

Code
library(tidyverse)

# Convert to appropriate types: round synthetic decimals and restore ordered factors
obesity <- obesity |>
  mutate(
    Gender = factor(Gender),
    family_history_with_overweight = factor(family_history_with_overweight),
    FAVC = factor(FAVC),
    SMOKE = factor(SMOKE),
    SCC = factor(SCC),
    MTRANS = factor(MTRANS),
    CALC = factor(CALC, levels = c("no","Sometimes","Frequently","Always"),
                  labels = c("Never","Sometimes","Frequently","Always"),
                  ordered = TRUE),

    # Ordinal items with decimals in the UCI file -> round then make ordered
    FCVC = factor(round(FCVC), levels = c(1,2,3),
                  labels = c("Never","Sometimes","Always"), ordered = TRUE),
    # nvm, i will keep this as numeric -------------------------------------
    # NCP  = factor(round(NCP),  levels = c(1,2,3),
    #               labels = c("Between 1 and 2","Three","More than three"), ordered = TRUE),
    CH2O = factor(round(CH2O), levels = c(1,2,3),
                  labels = c("<1 L","1–2 L",">2 L"), ordered = TRUE),
    FAF  = factor(round(FAF),  levels = c(0,1,2,3),
                  labels = c("None","1–2 times/wk","2–4 times/wk",">4 times/wk"), ordered = TRUE),
    TUE  = factor(round(TUE),  levels = c(0,1,2),
                  labels = c("0–2 hrs","3–5 hrs",">5 hrs"), ordered = TRUE),

    # Outcome (binary)
    Obese = if_else(NObeyesdad %in% c("Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"),
                    "Obese","NotObese"),
    Obese = factor(Obese, levels = c("NotObese","Obese"))
  )

obesity
Code
# Check for any missing values here:
isna <- function(x) {
  any(is.na(x))
}

xmissing <- apply(obesity, 2, isna)
xmissing[xmissing==TRUE] # none. great!
named logical(0)

Training and Testing

Before fitting our model, let’s divide our data into training and testing. To make our results reproducible let’s also set a seed1. For this lab we’ll use a 60/40 train/test split.

set.seed(1759442669)              # as.integer(Sys.time()) 
n <- nrow(obesity)                # number of obs. in dataset         
ptest <- 0.4                      # 40% testing 
ntest <- round(n*ptest)           # 60% training

# indices:
rind  = sample(n)                 # shuffle the row indices (1 through n)
test_ind = rind[1:ntest]          # take the first 40% to be test
train_ind = rind[(ntest+1):n]     # the remaining 60% to be training

# data frames:
test_data = obesity[test_ind, ]   # create test set using the test indices
train_data = obesity[train_ind, ] # create training set using the training indices

Logistic Regression

Logistic regression is a type of regression analysis used when the dependent variable is categorical, typically binary (i.e., having two possible outcomes, such as “yes” or “no,” “success” or “failure”, 0 or 1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of the outcome belonging to a particular category. For binary classification using logistic regression the outputted probability that the dependent variable \(Y\) is 1 (i.e., the “positive” class).

\[ \Pr(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}} \tag{1}\]

where:

  • \(\Pr(Y=1 \mid x)\) is the probability that the outcome is 1 given the predictor variables.
  • \(\beta_0\) is the intercept, and
  • \(\beta_1, \beta_2, ..., \beta_n\) are the coefficients for the predictor variables \(X_1, X_2, ..., X_n\).
Note

For notational convenience we let \(\pi = \Pr(Y=1 \mid x)\)

Hence the odds of \(Y=1\) given \(X =x\) is \(\frac{\pi}{1- \pi}\) and the log-odds2 is given by \(\log\left(\frac{\pi}{1- \pi}\right)\).

Model

Logistic regression models the relationship between the predictor variables and the log-odds of the dependent variable as follows:

\[ \text{log-odds} = \log\left(\frac{\pi}{1- \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n \tag{2}\]

Which factor level do the probabilities correspond to?

In logistic regression, \(\pi\) represents the probability of the second level of the response factor3. By default R orders factor levels alphabetically (for text coerced to factor) or numerically (for numbers coerced to factor):

  • So for binary variables coded up as 0s and 1, the \(\pi\) will correspond to the \(\Pr(Y= 1 \mid x)\);
  • For binary variabls coded up as A and B, \(\pi\) will correspond to the \(\Pr(Y= B \mid x)\);
Important: Know your “event” class

Factor variables can always be re-leveled, so you should always check which is the “event” class corresponding to \(Y=1\) before interpreting results.

To check the ordering of those levels you can use:

levels(obesity$Obese)
[1] "NotObese" "Obese"   

Since Obese is the second level, the fitted probabilities correspond to

\[\pi = P(Y = \text{Obese} \mid X)\]

GLMs

Logistic regression is a type of generalized linear models (GLM), hence we fit it using the glm function in R. In terms of inputs, it will feel very similar to using lm, i.e. its first input will be of the formula form, e.g. Y~X. The family="binomial" indicates that R should fit logistic regression.

# General format for logistic regression
fit <- glm(formula, family = "binomial", data = ..., subset = ...)

Model Deviance

In logistic regression, model fit is assessed using deviance, which plays a role similar to the residual sum of squares in linear regression.

  • Null deviance: the deviance of the intercept-only model (no predictors). It shows how well we can predict the outcome just using the overall proportion of cases.

  • Residual deviance: the deviance of the fitted model with predictors included. Smaller values indicate better fit.

  • Difference in deviance: the reduction in deviance when predictors are added is analogous to the reduction in residual sum of squares in linear regression. This difference can be tested with a chi-squared test.

Formally, deviance is defined as:

\[D = -2 \big(\ell_{\text{fitted}} - \ell_{\text{saturated}}\big)\]

where:

  • \(\ell_{\text{fitted}}\) is the log-likelihood of the fitted model,
  • \(\ell_{\text{saturated}}\) is the log-likelihood of the “perfect” model that fits every observation exactly.

👉 Thus, lower deviance = model closer to saturated model = better fit.

Application

Using Obese as the response variable we will compare three logistic regression fits:

  1. Null (intercept-only)
  2. Simple logistic model (only using Weight)
  3. Full: all remaining body measurements as predictors
  4. Stepwise (AIC): data-driven subset chosen by MASS::stepAIC
Predictors to exclude

Because obesity level classifications are defined by BMI (a function of Weight and Height), including these variables would let the model simply reproduce the outcome. Similarly, since Obese was derived from NObeyesdad, using it as a predictor would amount to “giving the model the answer”. Excluding these ensures we test whether lifestyle and behavioral factors predict obesity, rather than variables that already encode it.

Null Model

The null model (sometimes called the intercept-only model) is the simplest possible logistic regression. In other words, it is the logistic regression model that ignores all predictors:

\[ \text{log odds} = \beta_0 \]

So the null model is not interesting by itself (it always predicts the same probability), but it’s essential as a baseline for evaluating whether predictors improve classification. We fit it using the following:

Listing 1: Null model. Read more on defining statistical formulae here.
null_glm <- glm(Obese ~ 1, data = train_data, family = "binomial")
summary(null_glm)

Call:
glm(formula = Obese ~ 1, family = "binomial", data = train_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.1249     0.0563  -2.218   0.0266 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1751.5  on 1266  degrees of freedom
Residual deviance: 1751.5  on 1266  degrees of freedom
AIC: 1753.5

Number of Fisher Scoring iterations: 3
Note

Notice that the Null deviance and Residual deviance are identical here. This is because the model being estimated is itself the baseline model used to compute the null deviance. 👉 Normally, residual deviance < null deviance, because adding predictors should improve the fit (lower deviance).

Simple Logistic Model

We will also consider a simple one-predictor model (e.g., using NCP).

Listing 2: A simple logistic model.
simp_glm  <- glm(Obese ~ NCP,  
                  data = obesity,
                  subset = train_ind, 
                  family = "binomial")
summary(simp_glm)

Call:
glm(formula = Obese ~ NCP, family = "binomial", data = obesity, 
    subset = train_ind)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.44402    0.19772  -2.246   0.0247 *
NCP          0.11952    0.07088   1.686   0.0917 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1751.5  on 1266  degrees of freedom
Residual deviance: 1748.6  on 1265  degrees of freedom
AIC: 1752.6

Number of Fisher Scoring iterations: 3
one.pred <- glm(Obese ~ family_history_with_overweight,  
                  data = obesity,
                  subset = train_ind, 
                  family = "binomial")
summary(one.pred)

Call:
glm(formula = Obese ~ family_history_with_overweight, family = "binomial", 
    data = obesity, subset = train_ind)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -3.6288     0.4136  -8.773   <2e-16 ***
family_history_with_overweightyes   3.9029     0.4184   9.329   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1751.5  on 1266  degrees of freedom
Residual deviance: 1471.2  on 1265  degrees of freedom
AIC: 1475.2

Number of Fisher Scoring iterations: 6
Note

Notice that since we already converted Obese to a factor we need not wrap the variable name in factor() in the glm function. Notice that the subset argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simp_glm in Listing 2 is the equivalent to the following:

# output surpressed to save space
glm(Obese ~ NCP,  data = train_data, family = "binomial")

Full Model

On the flip side, we will also fit the full model; that is, the model that uses all available predictors.

# Refit full model, forcing R to use only complete cases from the start
full_glm <- glm(
  Obese ~ . - NObeyesdad - Weight - Height, 
  data = train_data, 
  family = "binomial")

summary(full_glm)

Call:
glm(formula = Obese ~ . - NObeyesdad - Weight - Height, family = "binomial", 
    data = train_data)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                       -9.16418    0.98666  -9.288  < 2e-16 ***
GenderMale                         0.11203    0.16391   0.683 0.494314    
Age                                0.08587    0.01701   5.049 4.45e-07 ***
family_history_with_overweightyes  3.38583    0.43723   7.744 9.65e-15 ***
FAVCyes                            2.04242    0.33891   6.026 1.68e-09 ***
FCVC.L                             0.17874    0.25916   0.690 0.490376    
FCVC.Q                             0.64354    0.17406   3.697 0.000218 ***
NCP                                0.09893    0.10172   0.973 0.330769    
CAECFrequently                    -2.86862    0.77757  -3.689 0.000225 ***
CAECno                            -0.63347    1.00787  -0.629 0.529663    
CAECSometimes                      0.62297    0.57527   1.083 0.278849    
SMOKEyes                           0.87937    0.62036   1.418 0.156331    
CH2O.L                             0.26271    0.16727   1.571 0.116293    
CH2O.Q                             0.31824    0.12450   2.556 0.010581 *  
SCCyes                            -1.51934    0.65809  -2.309 0.020960 *  
FAF.L                             -0.95769    0.28288  -3.385 0.000711 ***
FAF.Q                             -0.30943    0.22696  -1.363 0.172770    
FAF.C                             -0.21301    0.15913  -1.339 0.180706    
TUE.L                             -0.21488    0.17197  -1.250 0.211460    
TUE.Q                             -0.16899    0.13828  -1.222 0.221670    
CALC.L                            -0.39459    0.35680  -1.106 0.268767    
CALC.Q                            -0.34947    0.21985  -1.590 0.111923    
MTRANSBike                         1.53712    2.62840   0.585 0.558674    
MTRANSMotorbike                   -0.28555    1.39927  -0.204 0.838300    
MTRANSPublic_Transportation        1.24122    0.23441   5.295 1.19e-07 ***
MTRANSWalking                     -1.17092    0.83094  -1.409 0.158789    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1751.5  on 1266  degrees of freedom
Residual deviance: 1133.4  on 1241  degrees of freedom
AIC: 1185.4

Number of Fisher Scoring iterations: 6
Note

In R, the . is a shorthand in formulas that means “all other variables in the dataset.”. The notation ~.- NObeyesdad - Weight - Height means regress on all other variables except Weight, Height, and NObeyesdad.

Checking for multicollinearity

As before, we can use the vif() function to check for multicollinearity

car::vif(full_glm)
                                   GVIF Df GVIF^(1/(2*Df))
Gender                         1.237341  1        1.112358
Age                            2.307886  1        1.519173
family_history_with_overweight 1.037089  1        1.018376
FAVC                           1.067430  1        1.033165
FCVC                           1.269038  2        1.061375
NCP                            1.123820  1        1.060104
CAEC                           1.159248  3        1.024934
SMOKE                          1.137303  1        1.066444
CH2O                           1.169733  2        1.039972
SCC                            1.058587  1        1.028877
FAF                            1.234771  3        1.035773
TUE                            1.263027  2        1.060115
CALC                           1.206600  2        1.048071
MTRANS                         2.192304  4        1.103094

In the output above, not predictors have a VIF score above 10. This tells us that the model doesn’t suffer from severe multicollinearity.

Note

In R, the package::function() notation tells R which package a function should come from.

  • car::vif() means use the vif function from the car package.
  • This is useful when two packages contain functions with the same name (for example, both MASS and car have a select() function).
  • Using :: avoids ambiguity and makes your code easier to read by showing exactly which package each function belongs to.

In our case, I’m using this notation because I want to use the vif() function without loading the whole package with library(car). (This is is helpful bceause it avoids loading extra functions into your R session that I won’t need).

Stepwise Logistic Regression

Stepwise regression builds (or prunes) a model one variable at a time based on a chosen criterion:

  • Forward selection: Start with no predictors → add the most useful one at each step.
  • Backward elimination: Start with all predictors → drop the least useful at each step.
  • Stepwise (both directions): Combines the two — at each step it may add or remove a variable.

In R we can accomplish this using the step() function or the stepAIC function from the MASS library. As its name suggests, it will aims to minimize the AIC \(\dots\)

Caution about stepwise

It can overfit and produce unstable selections. Treat the chosen model as exploratory; validate on the test set.

AIC

The AIC (Akaike Information Criterion) is a measure of model quality that balances fit and complexity. It is defined as
\[AIC = -2 \times \ell(\hat{\theta}) + 2p\] where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(p\) is the number of estimated parameters.

  • The first term rewards models that fit the data well (higher likelihood → lower AIC).
  • The second term penalizes models with more parameters (to discourage overfitting).

When comparing models, the one with the lower AIC is preferred.

Stepwise Logistic Regression

To execute the stepwise selection we will be using the step() function (alternatively you could use the stepAIC function from the MASS library).

step(object, scope, direction, trace ...)
  • object the name of the fitted model
  • scope the range of models exaimed in the stepwise search. This should be either a single formula, or a list containing components upper and lower, both formulae.
  • direction the type of stepwise search to use ("backward", "forward", or "both")
  • trace if positive, information is printed during the running of step.

Forward Selection

forward_glm <- step(null_glm, scope = formula(full_glm), direction = "forward")
Start:  AIC=1753.51
Obese ~ 1

                                 Df Deviance    AIC
+ family_history_with_overweight  1   1471.2 1475.2
+ CAEC                            3   1545.7 1553.7
+ FAVC                            1   1642.8 1646.8
+ SCC                             1   1700.3 1704.3
+ Age                             1   1707.7 1711.7
+ FAF                             3   1716.0 1724.0
+ CALC                            2   1719.6 1725.6
+ FCVC                            2   1720.7 1726.7
+ MTRANS                          4   1717.3 1727.3
+ CH2O                            2   1730.8 1736.8
+ TUE                             2   1744.5 1750.5
+ NCP                             1   1748.7 1752.7
<none>                                1751.5 1753.5
+ Gender                          1   1751.5 1755.5
+ SMOKE                           1   1751.5 1755.5

Step:  AIC=1475.25
Obese ~ family_history_with_overweight

         Df Deviance    AIC
+ CAEC    3   1355.4 1365.4
+ FAVC    1   1399.9 1405.9
+ CALC    2   1427.4 1435.4
+ FCVC    2   1430.6 1438.6
+ FAF     3   1436.4 1446.4
+ SCC     1   1442.8 1448.8
+ MTRANS  4   1440.7 1452.7
+ CH2O    2   1452.9 1460.9
+ Age     1   1455.0 1461.0
+ TUE     2   1460.4 1468.4
<none>        1471.2 1475.2
+ Gender  1   1469.3 1475.3
+ NCP     1   1471.0 1477.0
+ SMOKE   1   1471.2 1477.2

Step:  AIC=1365.38
Obese ~ family_history_with_overweight + CAEC

         Df Deviance    AIC
+ FAVC    1   1293.5 1305.5
+ FCVC    2   1308.0 1322.0
+ FAF     3   1324.3 1340.3
+ MTRANS  4   1326.2 1344.2
+ CALC    2   1331.8 1345.8
+ CH2O    2   1334.6 1348.6
+ SCC     1   1337.5 1349.5
+ Age     1   1344.0 1356.0
+ TUE     2   1346.7 1360.7
+ Gender  1   1351.2 1363.2
<none>        1355.4 1365.4
+ NCP     1   1353.6 1365.6
+ SMOKE   1   1354.3 1366.3

Step:  AIC=1305.53
Obese ~ family_history_with_overweight + CAEC + FAVC

         Df Deviance    AIC
+ FCVC    2   1247.7 1263.7
+ FAF     3   1265.6 1283.6
+ MTRANS  4   1267.1 1287.1
+ CH2O    2   1272.5 1288.5
+ CALC    2   1276.8 1292.8
+ SCC     1   1280.6 1294.6
+ Age     1   1282.8 1296.8
+ TUE     2   1282.4 1298.4
+ Gender  1   1285.3 1299.3
+ SMOKE   1   1289.4 1303.4
<none>        1293.5 1305.5
+ NCP     1   1292.5 1306.5

Step:  AIC=1263.65
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC

         Df Deviance    AIC
+ FAF     3   1220.3 1242.3
+ Age     1   1231.7 1249.7
+ SCC     1   1232.4 1250.4
+ MTRANS  4   1227.1 1251.1
+ CH2O    2   1235.0 1255.0
+ CALC    2   1237.6 1257.6
+ TUE     2   1240.8 1260.8
+ SMOKE   1   1243.6 1261.6
<none>        1247.7 1263.7
+ Gender  1   1246.9 1264.9
+ NCP     1   1247.1 1265.1

Step:  AIC=1242.27
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF

         Df Deviance    AIC
+ SCC     1   1207.0 1231.0
+ MTRANS  4   1201.4 1231.4
+ CH2O    2   1207.2 1233.2
+ Age     1   1209.9 1233.9
+ CALC    2   1213.8 1239.8
+ SMOKE   1   1216.7 1240.7
+ TUE     2   1214.7 1240.7
<none>        1220.3 1242.3
+ NCP     1   1219.5 1243.5
+ Gender  1   1220.3 1244.3

Step:  AIC=1231.04
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC

         Df Deviance    AIC
+ MTRANS  4   1187.5 1219.5
+ CH2O    2   1194.9 1222.9
+ Age     1   1199.2 1225.2
+ SMOKE   1   1202.2 1228.2
+ CALC    2   1201.6 1229.6
+ TUE     2   1202.0 1230.0
<none>        1207.0 1231.0
+ NCP     1   1206.5 1232.5
+ Gender  1   1207.0 1233.0

Step:  AIC=1219.52
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS

         Df Deviance    AIC
+ Age     1   1152.3 1186.3
+ CH2O    2   1176.0 1212.0
+ SMOKE   1   1182.5 1216.5
+ TUE     2   1181.1 1217.1
+ CALC    2   1182.6 1218.6
<none>        1187.5 1219.5
+ NCP     1   1186.9 1220.9
+ Gender  1   1187.5 1221.5

Step:  AIC=1186.29
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age

         Df Deviance    AIC
+ CH2O    2   1142.8 1180.8
+ NCP     1   1149.8 1185.8
<none>        1152.3 1186.3
+ SMOKE   1   1150.4 1186.4
+ CALC    2   1148.8 1186.8
+ Gender  1   1151.8 1187.8
+ TUE     2   1150.2 1188.2

Step:  AIC=1180.75
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age + CH2O

         Df Deviance    AIC
<none>        1142.8 1180.8
+ SMOKE   1   1140.8 1180.8
+ NCP     1   1141.0 1181.0
+ CALC    2   1139.2 1181.2
+ Gender  1   1142.0 1182.0
+ TUE     2   1140.3 1182.3
Note

👉 The <none> row in the backwards stepwise output is R’s way of saying:

➡️ What if we stop here and don’t drop any more variables?

In other words, it’s a baseline reference — the option of not changing the model.

Here we can see that the forward selection started with the null (intercept-only) model.

At each step, the algorithm considered adding one variable at a time, and kept the change only if it improved the model fit according to AIC.

The process continued until no further additions decreased the AIC. In the end, the forward selection landed on a model with 9 predictors (the first predictor that was added was family_history_with_overweight, the second one was CAEC, \(\dots\), the 9th one was CH2O. The final fitted model obtained is:

formula(forward_glm)
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age + CH2O

Backwards Selection

backward_glm <- step(full_glm, direction = "backward")
Start:  AIC=1185.42
Obese ~ (Gender + Age + Height + Weight + family_history_with_overweight + 
    FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE + 
    CALC + MTRANS + NObeyesdad) - NObeyesdad - Weight - Height

                                 Df Deviance    AIC
- TUE                             2   1135.7 1183.7
- Gender                          1   1133.9 1183.9
- CALC                            2   1136.1 1184.1
- NCP                             1   1134.4 1184.4
<none>                                1133.4 1185.4
- SMOKE                           1   1135.5 1185.5
- SCC                             1   1140.4 1190.4
- CH2O                            2   1143.0 1191.0
- FAF                             3   1147.4 1193.4
- Age                             1   1160.3 1210.3
- FCVC                            2   1165.5 1213.5
- MTRANS                          4   1174.7 1218.7
- FAVC                            1   1180.8 1230.8
- CAEC                            3   1225.1 1271.1
- family_history_with_overweight  1   1260.7 1310.7

Step:  AIC=1183.67
Obese ~ Gender + Age + family_history_with_overweight + FAVC + 
    FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + CALC + MTRANS

                                 Df Deviance    AIC
- Gender                          1   1136.0 1182.0
- CALC                            2   1138.7 1182.7
- NCP                             1   1136.8 1182.8
- SMOKE                           1   1137.6 1183.6
<none>                                1135.7 1183.7
- CH2O                            2   1144.8 1188.8
- SCC                             1   1142.9 1188.9
- FAF                             3   1149.4 1191.4
- Age                             1   1166.1 1212.1
- FCVC                            2   1170.2 1214.2
- MTRANS                          4   1177.9 1217.9
- FAVC                            1   1182.0 1228.0
- CAEC                            3   1228.4 1270.4
- family_history_with_overweight  1   1262.4 1308.4

Step:  AIC=1181.98
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 
    NCP + CAEC + SMOKE + CH2O + SCC + FAF + CALC + MTRANS

                                 Df Deviance    AIC
- CALC                            2   1139.0 1181.0
- NCP                             1   1137.2 1181.2
<none>                                1136.0 1182.0
- SMOKE                           1   1138.0 1182.0
- CH2O                            2   1144.9 1186.9
- SCC                             1   1143.4 1187.4
- FAF                             3   1149.4 1189.4
- Age                             1   1166.2 1210.2
- FCVC                            2   1171.3 1213.3
- MTRANS                          4   1177.9 1215.9
- FAVC                            1   1183.9 1227.9
- CAEC                            3   1230.2 1270.2
- family_history_with_overweight  1   1263.7 1307.7

Step:  AIC=1181.04
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 
    NCP + CAEC + SMOKE + CH2O + SCC + FAF + MTRANS

                                 Df Deviance    AIC
- NCP                             1   1140.8 1180.8
- SMOKE                           1   1141.0 1181.0
<none>                                1139.0 1181.0
- CH2O                            2   1147.9 1185.9
- SCC                             1   1146.8 1186.8
- FAF                             3   1153.8 1189.8
- Age                             1   1170.7 1210.7
- FCVC                            2   1177.2 1215.2
- MTRANS                          4   1183.0 1217.0
- FAVC                            1   1190.0 1230.0
- CAEC                            3   1245.8 1281.8
- family_history_with_overweight  1   1265.9 1305.9

Step:  AIC=1180.81
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 
    CAEC + SMOKE + CH2O + SCC + FAF + MTRANS

                                 Df Deviance    AIC
- SMOKE                           1   1142.8 1180.8
<none>                                1140.8 1180.8
- CH2O                            2   1150.4 1186.4
- SCC                             1   1149.1 1187.1
- FAF                             3   1155.9 1189.9
- Age                             1   1171.2 1209.2
- MTRANS                          4   1183.4 1215.4
- FCVC                            2   1179.5 1215.5
- FAVC                            1   1192.4 1230.4
- CAEC                            3   1246.2 1280.2
- family_history_with_overweight  1   1268.4 1306.4

Step:  AIC=1180.75
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 
    CAEC + CH2O + SCC + FAF + MTRANS

                                 Df Deviance    AIC
<none>                                1142.8 1180.8
- CH2O                            2   1152.3 1186.3
- SCC                             1   1150.3 1186.3
- FAF                             3   1157.7 1189.7
- Age                             1   1176.0 1212.0
- FCVC                            2   1181.2 1215.2
- MTRANS                          4   1186.8 1216.8
- FAVC                            1   1192.9 1228.9
- CAEC                            3   1246.4 1278.4
- family_history_with_overweight  1   1271.0 1307.0

Here we can see that the backward selection started with the full model (including all available predictors, except those we excluded such as Weight, Height, and NObeyesdad).

At each step, the algorithm considered removing one variable at a time, and kept the change only if it improved the model fit according to AIC.

The process continued until no further removals decreased the AIC. In the end, the backward selection landed on a model with 9 predictors (after dropping unnecessary ones).

formula(backward_glm)
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 
    CAEC + CH2O + SCC + FAF + MTRANS
Note

Notably, the predictors that were seleced using backward selection are not identical oto the predictors chosen using foreward selection.

Both-Direction Selection

Finally, we can try stepwise selection in both directions.

This approach starts from the null model (intercept-only), and at each step the algorithm is free to either:

  • Add a variable (like in forward selection), or
  • Remove a variable that was previously added (like in backward selection).

This flexibility allows it to explore a wider range of possible models.

both_glm <- step(
  null_glm,
  scope = list(lower = null_glm, upper = full_glm),
  direction = "both",
  trace = TRUE
)
Start:  AIC=1753.51
Obese ~ 1

                                 Df Deviance    AIC
+ family_history_with_overweight  1   1471.2 1475.2
+ CAEC                            3   1545.7 1553.7
+ FAVC                            1   1642.8 1646.8
+ SCC                             1   1700.3 1704.3
+ Age                             1   1707.7 1711.7
+ FAF                             3   1716.0 1724.0
+ CALC                            2   1719.6 1725.6
+ FCVC                            2   1720.7 1726.7
+ MTRANS                          4   1717.3 1727.3
+ CH2O                            2   1730.8 1736.8
+ TUE                             2   1744.5 1750.5
+ NCP                             1   1748.7 1752.7
<none>                                1751.5 1753.5
+ Gender                          1   1751.5 1755.5
+ SMOKE                           1   1751.5 1755.5

Step:  AIC=1475.25
Obese ~ family_history_with_overweight

                                 Df Deviance    AIC
+ CAEC                            3   1355.4 1365.4
+ FAVC                            1   1399.9 1405.9
+ CALC                            2   1427.4 1435.4
+ FCVC                            2   1430.6 1438.6
+ FAF                             3   1436.4 1446.4
+ SCC                             1   1442.8 1448.8
+ MTRANS                          4   1440.7 1452.7
+ CH2O                            2   1452.9 1460.9
+ Age                             1   1455.0 1461.0
+ TUE                             2   1460.4 1468.4
<none>                                1471.2 1475.2
+ Gender                          1   1469.3 1475.3
+ NCP                             1   1471.0 1477.0
+ SMOKE                           1   1471.2 1477.2
- family_history_with_overweight  1   1751.5 1753.5

Step:  AIC=1365.38
Obese ~ family_history_with_overweight + CAEC

                                 Df Deviance    AIC
+ FAVC                            1   1293.5 1305.5
+ FCVC                            2   1308.0 1322.0
+ FAF                             3   1324.3 1340.3
+ MTRANS                          4   1326.2 1344.2
+ CALC                            2   1331.8 1345.8
+ CH2O                            2   1334.6 1348.6
+ SCC                             1   1337.5 1349.5
+ Age                             1   1344.0 1356.0
+ TUE                             2   1346.7 1360.7
+ Gender                          1   1351.2 1363.2
<none>                                1355.4 1365.4
+ NCP                             1   1353.6 1365.6
+ SMOKE                           1   1354.3 1366.3
- CAEC                            3   1471.2 1475.2
- family_history_with_overweight  1   1545.7 1553.7

Step:  AIC=1305.53
Obese ~ family_history_with_overweight + CAEC + FAVC

                                 Df Deviance    AIC
+ FCVC                            2   1247.7 1263.7
+ FAF                             3   1265.6 1283.6
+ MTRANS                          4   1267.1 1287.1
+ CH2O                            2   1272.5 1288.5
+ CALC                            2   1276.8 1292.8
+ SCC                             1   1280.6 1294.6
+ Age                             1   1282.8 1296.8
+ TUE                             2   1282.4 1298.4
+ Gender                          1   1285.3 1299.3
+ SMOKE                           1   1289.4 1303.4
<none>                                1293.5 1305.5
+ NCP                             1   1292.5 1306.5
- FAVC                            1   1355.4 1365.4
- CAEC                            3   1399.9 1405.9
- family_history_with_overweight  1   1463.7 1473.7

Step:  AIC=1263.65
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC

                                 Df Deviance    AIC
+ FAF                             3   1220.3 1242.3
+ Age                             1   1231.7 1249.7
+ SCC                             1   1232.4 1250.4
+ MTRANS                          4   1227.1 1251.1
+ CH2O                            2   1235.0 1255.0
+ CALC                            2   1237.6 1257.6
+ TUE                             2   1240.8 1260.8
+ SMOKE                           1   1243.6 1261.6
<none>                                1247.7 1263.7
+ Gender                          1   1246.9 1264.9
+ NCP                             1   1247.1 1265.1
- FCVC                            2   1293.5 1305.5
- FAVC                            1   1308.0 1322.0
- CAEC                            3   1358.8 1368.8
- family_history_with_overweight  1   1422.9 1436.9

Step:  AIC=1242.27
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF

                                 Df Deviance    AIC
+ SCC                             1   1207.0 1231.0
+ MTRANS                          4   1201.4 1231.4
+ CH2O                            2   1207.2 1233.2
+ Age                             1   1209.9 1233.9
+ CALC                            2   1213.8 1239.8
+ SMOKE                           1   1216.7 1240.7
+ TUE                             2   1214.7 1240.7
<none>                                1220.3 1242.3
+ NCP                             1   1219.5 1243.5
+ Gender                          1   1220.3 1244.3
- FAF                             3   1247.7 1263.7
- FCVC                            2   1265.6 1283.6
- FAVC                            1   1277.3 1297.3
- CAEC                            3   1329.2 1345.2
- family_history_with_overweight  1   1393.5 1413.5

Step:  AIC=1231.04
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC

                                 Df Deviance    AIC
+ MTRANS                          4   1187.5 1219.5
+ CH2O                            2   1194.9 1222.9
+ Age                             1   1199.2 1225.2
+ SMOKE                           1   1202.2 1228.2
+ CALC                            2   1201.6 1229.6
+ TUE                             2   1202.0 1230.0
<none>                                1207.0 1231.0
+ NCP                             1   1206.5 1232.5
+ Gender                          1   1207.0 1233.0
- SCC                             1   1220.3 1242.3
- FAF                             3   1232.4 1250.4
- FCVC                            2   1254.0 1274.0
- FAVC                            1   1260.0 1282.0
- CAEC                            3   1311.3 1329.3
- family_history_with_overweight  1   1361.9 1383.9

Step:  AIC=1219.52
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS

                                 Df Deviance    AIC
+ Age                             1   1152.3 1186.3
+ CH2O                            2   1176.0 1212.0
+ SMOKE                           1   1182.5 1216.5
+ TUE                             2   1181.1 1217.1
+ CALC                            2   1182.6 1218.6
<none>                                1187.5 1219.5
+ NCP                             1   1186.9 1220.9
+ Gender                          1   1187.5 1221.5
- MTRANS                          4   1207.0 1231.0
- SCC                             1   1201.4 1231.4
- FAF                             3   1211.1 1237.1
- FCVC                            2   1229.0 1257.0
- FAVC                            1   1238.0 1268.0
- CAEC                            3   1294.3 1320.3
- family_history_with_overweight  1   1343.6 1373.6

Step:  AIC=1186.29
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age

                                 Df Deviance    AIC
+ CH2O                            2   1142.8 1180.8
+ NCP                             1   1149.8 1185.8
<none>                                1152.3 1186.3
+ SMOKE                           1   1150.4 1186.4
+ CALC                            2   1148.8 1186.8
+ Gender                          1   1151.8 1187.8
+ TUE                             2   1150.2 1188.2
- SCC                             1   1160.4 1192.4
- FAF                             3   1165.8 1193.8
- Age                             1   1187.5 1219.5
- MTRANS                          4   1199.2 1225.2
- FCVC                            2   1195.4 1225.4
- FAVC                            1   1203.7 1235.7
- CAEC                            3   1255.5 1283.5
- family_history_with_overweight  1   1288.7 1320.7

Step:  AIC=1180.75
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age + CH2O

                                 Df Deviance    AIC
<none>                                1142.8 1180.8
+ SMOKE                           1   1140.8 1180.8
+ NCP                             1   1141.0 1181.0
+ CALC                            2   1139.2 1181.2
+ Gender                          1   1142.0 1182.0
+ TUE                             2   1140.3 1182.3
- CH2O                            2   1152.3 1186.3
- SCC                             1   1150.3 1186.3
- FAF                             3   1157.7 1189.7
- Age                             1   1176.0 1212.0
- FCVC                            2   1181.2 1215.2
- MTRANS                          4   1186.8 1216.8
- FAVC                            1   1192.9 1228.9
- CAEC                            3   1246.4 1278.4
- family_history_with_overweight  1   1271.0 1307.0

In the end, the both-direction stepwise landed on a model with
9 predictors. In this case, it never removed any predictors.

formula(both_glm)
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 
    FAF + SCC + MTRANS + Age + CH2O

Prediction

Like with lm, the output of glm provides the estimates of the parameters, i.e. the \(\hat\beta\)s. As discussed in further detail in this section, that the the logistic regression coefficients give the change in the log odds of the given observation being obsese for every one unit increase in the predictor variable. Plugging our estimates into equations Equation 2 and Equation 1 we get:

\[\begin{align*} \text{log-odds of being Obese} &= -0.44 + 0.12 \times \texttt{NCP}\\ \implies \quad P(Y=\text{Obese} \mid X) &= \dfrac{e^{-0.44 + 0.12 \times \texttt{NCP}}} {1+ e^{-0.44 + 0.12 \times \texttt{NCP}}} \end{align*}\]

Hence with a NCP of, say, 3:

\[\begin{align*} \text{log-odds of being Obese} &= -0.44 + 0.12 \times \texttt{NCP}\\ \implies \quad P(Y=\text{Obese} \mid \texttt{NCP} = 3) &= \dfrac{e^{-0.44 + 0.12 \times 3}} {1+ e^{-0.44 + 0.12 \times 3}}\\ &= 0.4786473 \end{align*}\]

We could find this using the predict() function. You have two ways of specifying the type of prediction (as determined by the type argument in predict.glm; see ?predict.gml for details):

  1. The default is on the scale of the linear predictors; Thus for a logistic regression default predictions are of log-odds value for the new data point, which represents the natural logarithm of the odds that the observation belongs to the positive class.
  2. the alternative type = "response" is on the scale of the response variable. Thus for logistic regression, type = "response" gives the predicted probabilities.

where newdata1 is a dataframe containing the NCP of our new individual.

newdata1
# returns the log-odds 
predict(simp_glm, newdata = newdata1)
          1 
-0.08546284 

By default, the predict() function returns the log-odds of being Obese. In order to get the corresponding probability you need to use:

# returns the probability 
predict(simp_glm, newdata1, type = "response")
        1 
0.4786473 
Important

Recall: the output of logistic regression is \(\Pr(Y = 1\mid x)\). Therefore, for binary response variables (\(y = 1\) or \(y = 0\)) any observations with predicted probability greater that 0.5 are classified as class 1.

To convert the above probability into a classification, simply map observations to whichever class for which it has the highest probability of belonging. In our case

\[ \begin{align} \Pr(Y = \text{Obese} \mid X = 3) &= 0.4786473\\ \implies \Pr(Y = \text{not Obese} \mid X = 3) &= 1 - 0.4786473 \\ &= 0.5213527 \end{align} \]

Since a person with the NCP of 3 has a 0.4786473 probability of being Obese, they are predicted as not Obese.

Classification Performance

Test Error

To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.

To calculate the test error in R:

  1. Predict probabilities for the test set using the trained model.
  2. Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
  3. Compare the predictions to the true class labels and calculate the error rate.

Step 1: Predict probabilities for the test set

We first predict the probabilities of the test set observations belonging to the Obese class.

# Predict probabilities on the test set
pred.prob <- predict(simp_glm, newdata = test_data, type = "response")

y = test_data$Obese

# see the first 10:
head(cbind(pred.prob, y))
     pred.prob y
1502 0.4786473 2
1909 0.4786473 2
1633 0.4786473 2
1686 0.4786473 2
1150 0.4701871 1
1679 0.4786473 2

Step 2: Convert probabilities to class predictions

Next, we classify each observation as either obese or not based on the 0.5 threshold. In other words, if the \(\Pr(Y_i = \text{obese} \mid x_i) > 0.5\) then \(x_i\) is classified as Obese, otherwise, they are predicted as not Obese.

# Convert probabilities to class predictions using a 0.5 threshold
pred_class <-  factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=levels(test_data$Obese))
Note

The 0.5 threshold is conventional; lowering it generally increases recall and lowers precision, and vice versa.

Step 3: Compare predictions to actual test labels and calculate the error rate

We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.

# Actual test labels
actual_class <- test_data$Obese

# print the result
data.frame(yhat = pred_class, y = actual_class)

To calculated the test error (or misclassification rate) we use:

\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{3}\]

where

  • \(n_{\text{text}}\) are the number of observations in our test set
  • \(\hat{y}_i\) represents the predicted class and
  • \({y}_i\) represents the actual class

This can be computed in R using:

# Calculate the error rate
test_error <- mean(pred_class != actual_class)
test_error
[1] 0.4988152

Confusion Matrix

We can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table() function:

tab <- table(actual_class, pred_class)
tab
            pred_class
actual_class NotObese Obese
    NotObese      422    44
    Obese         377     1

Usually when we are organizing a confusion matrix so that the diagonal elements represent the correctly classified observations and the off-diagonal elements represent the number of misclassified observations. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 3 by summing the misclassified observations and dividing by the total number of observations.

\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{44+377}{844} = 0.4988152 \end{equation}\]

Other Metrics

The caret package has a nice wrapper function confusionMatrix() that gives you precision, recall (called sensitivity), specificity, and more.

library(caret)

# suppose `pred` are your predictions and `truth` is the true outcome
cm <- confusionMatrix(data = pred_class, reference = actual_class, positive = "Obese")
cm
Confusion Matrix and Statistics

          Reference
Prediction NotObese Obese
  NotObese      422   377
  Obese          44     1
                                          
               Accuracy : 0.5012          
                 95% CI : (0.4669, 0.5355)
    No Information Rate : 0.5521          
    P-Value [Acc > NIR] : 0.9987          
                                          
                  Kappa : -0.1001         
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.002646        
            Specificity : 0.905579        
         Pos Pred Value : 0.022222        
         Neg Pred Value : 0.528160        
             Prevalence : 0.447867        
         Detection Rate : 0.001185        
   Detection Prevalence : 0.053318        
      Balanced Accuracy : 0.454112        
                                          
       'Positive' Class : Obese           
                                          

If we want to calculate the F1 score, we can use the MLmetrics package which has functions you can call directly for the different metrics:

library(MLmetrics)

Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':

    MAE, RMSE
The following object is masked from 'package:base':

    Recall
# calculates F1 score
F1_Score(y_pred = pred_class, y_true = actual_class, positive = "Obese")
[1] 0.004728132
# also can calculate other metrics...
Precision(y_pred = pred_class, y_true = actual_class, positive = "Obese")
[1] 0.02222222
Recall(y_pred = pred_class, y_true = actual_class, positive = "Obese")
[1] 0.002645503

Comparing Models

Up to this point, we have fit several logistic regression models: the null model, a simple one-predictor model, the full model with all predictors, and three stepwise variants (forward, backward, and both directions). Each of these represents a different trade-off between model complexity and fit.

To move beyond individual summaries, it is helpful to evaluate them side by side using a consistent set of metrics. This allows us to answer questions like:

  • Which model fits the training data best according to AIC?

  • Which model generalizes best to unseen data?

  • Do simpler models perform comparably to more complex ones?

We will compare models using both in-sample and out-of-sample metrics:

  • AIC: a likelihood-based measure that penalizes complexity.

  • Test-set performance: accuracy, error rate, precision, recall, F1 score, ROC-AUC, and log-loss.

Together, these metrics provide a fuller picture: AIC rewards parsimony, while test metrics assess real-world predictive power. By aligning both views, we can better judge whether stepwise selection or the full model is most effective for predicting obesity from lifestyle and behavioral factors.

Code
# Packages for metrics
library(MLmetrics)   # F1_Score, Precision, Recall
library(dplyr)
library(tibble)
library(stringr)

# Helper to evaluate one model
eval_model <- function(mod, data_test, positive = "Obese", thresh = 0.5) {
  # Probabilities on test set
  p <- predict(mod, newdata = data_test, type = "response")
  # Predicted class at threshold
  neg <- setdiff(levels(data_test$Obese), positive)
  yhat <- factor(ifelse(p >= thresh, positive, neg), levels = levels(data_test$Obese))
  # Truth
  y <- data_test$Obese

  # Metrics
  acc  <- mean(yhat == y)
  err  <- 1 - acc
  prec <- Precision(y_pred = yhat, y_true = y, positive = positive)
  rec  <- Recall(y_pred = yhat, y_true = y, positive = positive)
  f1   <- F1_Score(y_pred = yhat, y_true = y, positive = positive)
  
  tibble(
    model    = deparse(formula(mod)),
    AIC      = AIC(mod),
    accuracy = acc,
    error    = err,
    precision= prec,
    recall   = rec,
    F1       = f1
  )
}

# Evaluate a list of models without purrr
model_names <- c("null","simple","forward","backward","both","full")
model_list  <- list(null = null_glm,
                    simple = simp_glm,
                    forward = forward_glm,
                    backward = backward_glm,
                    both = both_glm,
                    full = full_glm)

results_list <- vector("list", length(model_list))
names(results_list) <- names(model_list)

for (i in seq_along(model_list)) {
  # try() to keep going even if one model errors
  res <- try(eval_model(model_list[[i]], data_test = test_data), silent = TRUE)
  if (inherits(res, "try-error")) {
    # capture the error in a one-row tibble so the table still prints
    results_list[[i]] <- tibble(
      model    = names(model_list)[i],
      AIC      = NA_real_,
      accuracy = NA_real_,
      error    = NA_real_,
      precision= NA_real_,
      recall   = NA_real_,
      F1       = NA_real_
    )
  } else {
    results_list[[i]] <- res
  }
}

results <- dplyr::bind_rows(results_list)
# order the results by best F1 score:
knitr::kable(results[order(results$F1), ], digits = 3)
model AIC accuracy error precision recall F1
Obese ~ NCP 1752.646 0.501 0.499 0.022 0.003 0.005
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 1180.751 0.768 0.232 0.697 0.852 0.767
FAF + SCC + MTRANS + Age + CH2O 1180.751 0.768 0.232 0.697 0.852 0.767
Obese ~ Age + family_history_with_overweight + FAVC + FCVC + 1180.751 0.768 0.232 0.697 0.852 0.767
CAEC + CH2O + SCC + FAF + MTRANS 1180.751 0.768 0.232 0.697 0.852 0.767
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + 1180.751 0.768 0.232 0.697 0.852 0.767
FAF + SCC + MTRANS + Age + CH2O 1180.751 0.768 0.232 0.697 0.852 0.767
Obese ~ 1 1753.506 0.552 0.448 NaN 0.000 NaN
full NA NA NA NA NA NA

We are seeing some NAs for the full_glm model, since the training set did not exhibit any examples of the Always category for the CALC variable. This happens becuase this option was only selected once in our data set (and it happened to be assigned to the test set)

table(obesity$CALC)

     Never  Sometimes Frequently     Always 
       639       1401         70          1 
table(train_data$CALC)

     Never  Sometimes Frequently     Always 
       390        838         39          0 
table(test_data$CALC)

     Never  Sometimes Frequently     Always 
       249        563         31          1 

Handling Rare Factor Levels

Because the training set didn’t contain that level, glm() has no coefficient for it, and predict() fails on test rows with “Always”. In practice, ultra-rare categories are unstable and hard to estimate. A common remedy is to merge rare levels with a neighboring or semantically similar level before splitting into train/test.

We’ll merge “Always” into “Frequently” across the full dataset, and then re-do the split and model fits.

library(forcats)

# Collapse "Always" into "Frequently" BEFORE train/test split
obesity <- obesity %>%
  mutate(
    CALC = fct_collapse(CALC, Frequently = c("Frequently", "Always"))
  )

# sanity check
table(obesity$CALC)

     Never  Sometimes Frequently 
       639       1401         71 

Exercise: Now recreate the train/test split, refit models, and your full_glm will predict on the test set without NA/level errors.

Conclusion

Looking across the different logistic regression fits (null, simple, stepwise, and full), we see that the stepwise-selected models perform almost identically. This indicates that no single predictor dominates and that multiple subsets of variables can provide essentially the same predictive power.

In contrast, the null model and the one-predictor model (NCP only) perform substantially worse, with accuracy close to random guessing and near-zero recall. These serve as useful baselines, showing that the inclusion of behavioral and lifestyle factors meaningfully improves predictive ability.

The stepwise models select different predictors but achieve nearly identical accuracy, recall, and F1 on the test set. This suggests overlapping/partly redundant information among predictors: many subsets work about equally well. Use stepwise as an exploratory tool, and prioritize out-of-sample performance over any particular selected subset.

Footnotes

  1. In case you were wondering how a generate this seed, I used as.integer(Sys.time()). I like to do this to avoid using similar seeds over and over again, e.g. set.seed(<year>).↩︎

  2. we are referring to the natural logarithm↩︎

  3. You can read more about this on StackOverflow: which class is being predicted?↩︎