library(tidyverse)
# Read in the dataset
obesity <- read.csv("obesity.csv", stringsAsFactors = TRUE) Fitting and Assessing Logistic Regression Models
DATA 311: Machine Learning
Introduction
In this lab we move from regression with continuous outcomes to classification, focusing on binary logistic regression using the UCI Estimation of Obesity Levels dataset. We’ll (i) prepare the data (including handling ordinal items and avoiding information leakage from BMI-related variables), (ii) fit logistic models, (iii) compare models via AIC using forward, backward, and both-direction stepwise selection, and (iv) assess performance with a confusion matrix, test error, precision/recall, and F1.
The primary aim of this lab is to guide you through the process of implementing these models and assessing their performance using two key evaluation methods: the test error rate and the confusion matrix. These metrics will help you understand how well the model generalizes to unseen data and how its predictions align with actual class labels.
While in lecture we usually referred to categories of \(y\) as groups, or classes, to keep consistent with R terminology we will also be referring to the categorical variable as a factor variable, and it’s classes/groups as levels.
Learning Outcomes
By the end of this lab students will be able to:
- Prepare and clean data for binary classification (including ordered factors and avoiding information leakage).
- Fit logistic regression models (null, simple, full) and run stepwise selection (forward, backward, both).
- Interpret logistic regression output (coefficients on the log-odds scale).
- Evaluate models with confusion matrix, test error, precision, recall, and F1.
- Compare models fairly using held-out test data.
Data
Our dataset comes from the UCI Estimation of Obesity Levels study. It contains information on eating habits, lifestyle, and physical condition for individuals, along with their diagnosed weight category.
For the purposes of this lab, we will focus on predicting a binary outcome:
- Obese: if the individual is classified as Obesity Type I, II, or III
- Not Obese: if the individual is classified as Insufficient Weight, Normal Weight, or Overweight (Levels I & II)
This allows us to apply binary logistic regression, where the model estimates the probability:
\[ \pi(x) = \Pr(Y = \text{Obese} \mid X = x) \]
Throughout the lab, we will fit different classification models, evaluate their performance on a held-out test set, and compare how well they predict obesity status from eating habits and physical activity.
Because the dataset is partly synthetic, some of these ordinal survey variables (e.g., FCVC, NCP, CH2O, FAF, TUE) contain decimal values (e.g., 2.3 or 1.7). These decimals are artifacts of the imputation process used when creating the dataset. To restore their intended meaning, we will round and convert back to ordered factors.
When I first converted NCP into an ordered factor, I noticed some values were being turned into NA. This happened because the dataset contains values of 4, even though the documentation states that NCP should represent the number of main meals per day (1 = between 1 and 2, 2 = three, 3 = more than three). Since the variable had unexpected values, I decided it was simpler (and also somewhat reasonable) to treat NCP as a numeric variable.
To make this a binary classification problem, we create a new variable called Obese, which will be 1 if the individual is obese (Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III)) and 0 if the individual is Insufficient_Weight, Normal_Weight,
Code
library(tidyverse)
# Convert to appropriate types: round synthetic decimals and restore ordered factors
obesity <- obesity |>
mutate(
Gender = factor(Gender),
family_history_with_overweight = factor(family_history_with_overweight),
FAVC = factor(FAVC),
SMOKE = factor(SMOKE),
SCC = factor(SCC),
MTRANS = factor(MTRANS),
CALC = factor(CALC, levels = c("no","Sometimes","Frequently","Always"),
labels = c("Never","Sometimes","Frequently","Always"),
ordered = TRUE),
# Ordinal items with decimals in the UCI file -> round then make ordered
FCVC = factor(round(FCVC), levels = c(1,2,3),
labels = c("Never","Sometimes","Always"), ordered = TRUE),
# nvm, i will keep this as numeric -------------------------------------
# NCP = factor(round(NCP), levels = c(1,2,3),
# labels = c("Between 1 and 2","Three","More than three"), ordered = TRUE),
CH2O = factor(round(CH2O), levels = c(1,2,3),
labels = c("<1 L","1–2 L",">2 L"), ordered = TRUE),
FAF = factor(round(FAF), levels = c(0,1,2,3),
labels = c("None","1–2 times/wk","2–4 times/wk",">4 times/wk"), ordered = TRUE),
TUE = factor(round(TUE), levels = c(0,1,2),
labels = c("0–2 hrs","3–5 hrs",">5 hrs"), ordered = TRUE),
# Outcome (binary)
Obese = if_else(NObeyesdad %in% c("Obesity_Type_I","Obesity_Type_II","Obesity_Type_III"),
"Obese","NotObese"),
Obese = factor(Obese, levels = c("NotObese","Obese"))
)
obesityCode
# Check for any missing values here:
isna <- function(x) {
any(is.na(x))
}
xmissing <- apply(obesity, 2, isna)
xmissing[xmissing==TRUE] # none. great!named logical(0)
Training and Testing
Before fitting our model, let’s divide our data into training and testing. To make our results reproducible let’s also set a seed1. For this lab we’ll use a 60/40 train/test split.
set.seed(1759442669) # as.integer(Sys.time())
n <- nrow(obesity) # number of obs. in dataset
ptest <- 0.4 # 40% testing
ntest <- round(n*ptest) # 60% training
# indices:
rind = sample(n) # shuffle the row indices (1 through n)
test_ind = rind[1:ntest] # take the first 40% to be test
train_ind = rind[(ntest+1):n] # the remaining 60% to be training
# data frames:
test_data = obesity[test_ind, ] # create test set using the test indices
train_data = obesity[train_ind, ] # create training set using the training indicesLogistic Regression
Logistic regression is a type of regression analysis used when the dependent variable is categorical, typically binary (i.e., having two possible outcomes, such as “yes” or “no,” “success” or “failure”, 0 or 1). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of the outcome belonging to a particular category. For binary classification using logistic regression the outputted probability that the dependent variable \(Y\) is 1 (i.e., the “positive” class).
\[ \Pr(Y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}} = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}} \tag{1}\]
where:
- \(\Pr(Y=1 \mid x)\) is the probability that the outcome is 1 given the predictor variables.
- \(\beta_0\) is the intercept, and
- \(\beta_1, \beta_2, ..., \beta_n\) are the coefficients for the predictor variables \(X_1, X_2, ..., X_n\).
For notational convenience we let \(\pi = \Pr(Y=1 \mid x)\)
Hence the odds of \(Y=1\) given \(X =x\) is \(\frac{\pi}{1- \pi}\) and the log-odds2 is given by \(\log\left(\frac{\pi}{1- \pi}\right)\).
Model
Logistic regression models the relationship between the predictor variables and the log-odds of the dependent variable as follows:
\[ \text{log-odds} = \log\left(\frac{\pi}{1- \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n \tag{2}\]
In logistic regression, \(\pi\) represents the probability of the second level of the response factor3. By default R orders factor levels alphabetically (for text coerced to factor) or numerically (for numbers coerced to factor):
- So for binary variables coded up as
0s and1, the \(\pi\) will correspond to the \(\Pr(Y= 1 \mid x)\); - For binary variabls coded up as
AandB, \(\pi\) will correspond to the \(\Pr(Y= B \mid x)\);
Factor variables can always be re-leveled, so you should always check which is the “event” class corresponding to \(Y=1\) before interpreting results.
To check the ordering of those levels you can use:
levels(obesity$Obese)[1] "NotObese" "Obese"
Since Obese is the second level, the fitted probabilities correspond to
\[\pi = P(Y = \text{Obese} \mid X)\]
GLMs
Logistic regression is a type of generalized linear models (GLM), hence we fit it using the glm function in R. In terms of inputs, it will feel very similar to using lm, i.e. its first input will be of the formula form, e.g. Y~X. The family="binomial" indicates that R should fit logistic regression.
# General format for logistic regression
fit <- glm(formula, family = "binomial", data = ..., subset = ...)Model Deviance
In logistic regression, model fit is assessed using deviance, which plays a role similar to the residual sum of squares in linear regression.
Null deviance: the deviance of the intercept-only model (no predictors). It shows how well we can predict the outcome just using the overall proportion of cases.
Residual deviance: the deviance of the fitted model with predictors included. Smaller values indicate better fit.
Difference in deviance: the reduction in deviance when predictors are added is analogous to the reduction in residual sum of squares in linear regression. This difference can be tested with a chi-squared test.
Formally, deviance is defined as:
\[D = -2 \big(\ell_{\text{fitted}} - \ell_{\text{saturated}}\big)\]
where:
- \(\ell_{\text{fitted}}\) is the log-likelihood of the fitted model,
- \(\ell_{\text{saturated}}\) is the log-likelihood of the “perfect” model that fits every observation exactly.
👉 Thus, lower deviance = model closer to saturated model = better fit.
Application
Using Obese as the response variable we will compare three logistic regression fits:
- Null (intercept-only)
- Simple logistic model (only using
Weight) - Full: all remaining body measurements as predictors
- Stepwise (AIC): data-driven subset chosen by
MASS::stepAIC
Because obesity level classifications are defined by BMI (a function of Weight and Height), including these variables would let the model simply reproduce the outcome. Similarly, since Obese was derived from NObeyesdad, using it as a predictor would amount to “giving the model the answer”. Excluding these ensures we test whether lifestyle and behavioral factors predict obesity, rather than variables that already encode it.
Null Model
The null model (sometimes called the intercept-only model) is the simplest possible logistic regression. In other words, it is the logistic regression model that ignores all predictors:
\[ \text{log odds} = \beta_0 \]
So the null model is not interesting by itself (it always predicts the same probability), but it’s essential as a baseline for evaluating whether predictors improve classification. We fit it using the following:
null_glm <- glm(Obese ~ 1, data = train_data, family = "binomial")
summary(null_glm)
Call:
glm(formula = Obese ~ 1, family = "binomial", data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1249 0.0563 -2.218 0.0266 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1751.5 on 1266 degrees of freedom
Residual deviance: 1751.5 on 1266 degrees of freedom
AIC: 1753.5
Number of Fisher Scoring iterations: 3
Notice that the Null deviance and Residual deviance are identical here. This is because the model being estimated is itself the baseline model used to compute the null deviance. 👉 Normally, residual deviance < null deviance, because adding predictors should improve the fit (lower deviance).
Simple Logistic Model
We will also consider a simple one-predictor model (e.g., using NCP).
simp_glm <- glm(Obese ~ NCP,
data = obesity,
subset = train_ind,
family = "binomial")
summary(simp_glm)
Call:
glm(formula = Obese ~ NCP, family = "binomial", data = obesity,
subset = train_ind)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.44402 0.19772 -2.246 0.0247 *
NCP 0.11952 0.07088 1.686 0.0917 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1751.5 on 1266 degrees of freedom
Residual deviance: 1748.6 on 1265 degrees of freedom
AIC: 1752.6
Number of Fisher Scoring iterations: 3
one.pred <- glm(Obese ~ family_history_with_overweight,
data = obesity,
subset = train_ind,
family = "binomial")
summary(one.pred)
Call:
glm(formula = Obese ~ family_history_with_overweight, family = "binomial",
data = obesity, subset = train_ind)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.6288 0.4136 -8.773 <2e-16 ***
family_history_with_overweightyes 3.9029 0.4184 9.329 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1751.5 on 1266 degrees of freedom
Residual deviance: 1471.2 on 1265 degrees of freedom
AIC: 1475.2
Number of Fisher Scoring iterations: 6
Notice that since we already converted Obese to a factor we need not wrap the variable name in factor() in the glm function. Notice that the subset argument above allows us to specify the index of training observations in our full data set (as opposed to feeding the data frame of training observations as we will do later for KNN). In other words, simp_glm in Listing 2 is the equivalent to the following:
# output surpressed to save space
glm(Obese ~ NCP, data = train_data, family = "binomial")Full Model
On the flip side, we will also fit the full model; that is, the model that uses all available predictors.
# Refit full model, forcing R to use only complete cases from the start
full_glm <- glm(
Obese ~ . - NObeyesdad - Weight - Height,
data = train_data,
family = "binomial")
summary(full_glm)
Call:
glm(formula = Obese ~ . - NObeyesdad - Weight - Height, family = "binomial",
data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.16418 0.98666 -9.288 < 2e-16 ***
GenderMale 0.11203 0.16391 0.683 0.494314
Age 0.08587 0.01701 5.049 4.45e-07 ***
family_history_with_overweightyes 3.38583 0.43723 7.744 9.65e-15 ***
FAVCyes 2.04242 0.33891 6.026 1.68e-09 ***
FCVC.L 0.17874 0.25916 0.690 0.490376
FCVC.Q 0.64354 0.17406 3.697 0.000218 ***
NCP 0.09893 0.10172 0.973 0.330769
CAECFrequently -2.86862 0.77757 -3.689 0.000225 ***
CAECno -0.63347 1.00787 -0.629 0.529663
CAECSometimes 0.62297 0.57527 1.083 0.278849
SMOKEyes 0.87937 0.62036 1.418 0.156331
CH2O.L 0.26271 0.16727 1.571 0.116293
CH2O.Q 0.31824 0.12450 2.556 0.010581 *
SCCyes -1.51934 0.65809 -2.309 0.020960 *
FAF.L -0.95769 0.28288 -3.385 0.000711 ***
FAF.Q -0.30943 0.22696 -1.363 0.172770
FAF.C -0.21301 0.15913 -1.339 0.180706
TUE.L -0.21488 0.17197 -1.250 0.211460
TUE.Q -0.16899 0.13828 -1.222 0.221670
CALC.L -0.39459 0.35680 -1.106 0.268767
CALC.Q -0.34947 0.21985 -1.590 0.111923
MTRANSBike 1.53712 2.62840 0.585 0.558674
MTRANSMotorbike -0.28555 1.39927 -0.204 0.838300
MTRANSPublic_Transportation 1.24122 0.23441 5.295 1.19e-07 ***
MTRANSWalking -1.17092 0.83094 -1.409 0.158789
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1751.5 on 1266 degrees of freedom
Residual deviance: 1133.4 on 1241 degrees of freedom
AIC: 1185.4
Number of Fisher Scoring iterations: 6
In R, the . is a shorthand in formulas that means “all other variables in the dataset.”. The notation ~.- NObeyesdad - Weight - Height means regress on all other variables except Weight, Height, and NObeyesdad.
Checking for multicollinearity
As before, we can use the vif() function to check for multicollinearity
car::vif(full_glm) GVIF Df GVIF^(1/(2*Df))
Gender 1.237341 1 1.112358
Age 2.307886 1 1.519173
family_history_with_overweight 1.037089 1 1.018376
FAVC 1.067430 1 1.033165
FCVC 1.269038 2 1.061375
NCP 1.123820 1 1.060104
CAEC 1.159248 3 1.024934
SMOKE 1.137303 1 1.066444
CH2O 1.169733 2 1.039972
SCC 1.058587 1 1.028877
FAF 1.234771 3 1.035773
TUE 1.263027 2 1.060115
CALC 1.206600 2 1.048071
MTRANS 2.192304 4 1.103094
In the output above, not predictors have a VIF score above 10. This tells us that the model doesn’t suffer from severe multicollinearity.
In R, the package::function() notation tells R which package a function should come from.
car::vif()means use theviffunction from the car package.
- This is useful when two packages contain functions with the same name (for example, both MASS and car have a
select()function).
- Using
::avoids ambiguity and makes your code easier to read by showing exactly which package each function belongs to.
In our case, I’m using this notation because I want to use the vif() function without loading the whole package with library(car). (This is is helpful bceause it avoids loading extra functions into your R session that I won’t need).
Stepwise Logistic Regression
Stepwise regression builds (or prunes) a model one variable at a time based on a chosen criterion:
- Forward selection: Start with no predictors → add the most useful one at each step.
- Backward elimination: Start with all predictors → drop the least useful at each step.
- Stepwise (both directions): Combines the two — at each step it may add or remove a variable.
In R we can accomplish this using the step() function or the stepAIC function from the MASS library. As its name suggests, it will aims to minimize the AIC \(\dots\)
It can overfit and produce unstable selections. Treat the chosen model as exploratory; validate on the test set.
AIC
The AIC (Akaike Information Criterion) is a measure of model quality that balances fit and complexity. It is defined as
\[AIC = -2 \times \ell(\hat{\theta}) + 2p\] where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(p\) is the number of estimated parameters.
- The first term rewards models that fit the data well (higher likelihood → lower AIC).
- The second term penalizes models with more parameters (to discourage overfitting).
When comparing models, the one with the lower AIC is preferred.
Stepwise Logistic Regression
To execute the stepwise selection we will be using the step() function (alternatively you could use the stepAIC function from the MASS library).
step(object, scope, direction, trace ...)objectthe name of the fitted modelscopethe range of models exaimed in the stepwise search. This should be either a single formula, or a list containing componentsupperandlower, both formulae.directionthe type of stepwise search to use ("backward","forward", or"both")traceif positive, information is printed during the running ofstep.
Forward Selection
forward_glm <- step(null_glm, scope = formula(full_glm), direction = "forward")Start: AIC=1753.51
Obese ~ 1
Df Deviance AIC
+ family_history_with_overweight 1 1471.2 1475.2
+ CAEC 3 1545.7 1553.7
+ FAVC 1 1642.8 1646.8
+ SCC 1 1700.3 1704.3
+ Age 1 1707.7 1711.7
+ FAF 3 1716.0 1724.0
+ CALC 2 1719.6 1725.6
+ FCVC 2 1720.7 1726.7
+ MTRANS 4 1717.3 1727.3
+ CH2O 2 1730.8 1736.8
+ TUE 2 1744.5 1750.5
+ NCP 1 1748.7 1752.7
<none> 1751.5 1753.5
+ Gender 1 1751.5 1755.5
+ SMOKE 1 1751.5 1755.5
Step: AIC=1475.25
Obese ~ family_history_with_overweight
Df Deviance AIC
+ CAEC 3 1355.4 1365.4
+ FAVC 1 1399.9 1405.9
+ CALC 2 1427.4 1435.4
+ FCVC 2 1430.6 1438.6
+ FAF 3 1436.4 1446.4
+ SCC 1 1442.8 1448.8
+ MTRANS 4 1440.7 1452.7
+ CH2O 2 1452.9 1460.9
+ Age 1 1455.0 1461.0
+ TUE 2 1460.4 1468.4
<none> 1471.2 1475.2
+ Gender 1 1469.3 1475.3
+ NCP 1 1471.0 1477.0
+ SMOKE 1 1471.2 1477.2
Step: AIC=1365.38
Obese ~ family_history_with_overweight + CAEC
Df Deviance AIC
+ FAVC 1 1293.5 1305.5
+ FCVC 2 1308.0 1322.0
+ FAF 3 1324.3 1340.3
+ MTRANS 4 1326.2 1344.2
+ CALC 2 1331.8 1345.8
+ CH2O 2 1334.6 1348.6
+ SCC 1 1337.5 1349.5
+ Age 1 1344.0 1356.0
+ TUE 2 1346.7 1360.7
+ Gender 1 1351.2 1363.2
<none> 1355.4 1365.4
+ NCP 1 1353.6 1365.6
+ SMOKE 1 1354.3 1366.3
Step: AIC=1305.53
Obese ~ family_history_with_overweight + CAEC + FAVC
Df Deviance AIC
+ FCVC 2 1247.7 1263.7
+ FAF 3 1265.6 1283.6
+ MTRANS 4 1267.1 1287.1
+ CH2O 2 1272.5 1288.5
+ CALC 2 1276.8 1292.8
+ SCC 1 1280.6 1294.6
+ Age 1 1282.8 1296.8
+ TUE 2 1282.4 1298.4
+ Gender 1 1285.3 1299.3
+ SMOKE 1 1289.4 1303.4
<none> 1293.5 1305.5
+ NCP 1 1292.5 1306.5
Step: AIC=1263.65
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC
Df Deviance AIC
+ FAF 3 1220.3 1242.3
+ Age 1 1231.7 1249.7
+ SCC 1 1232.4 1250.4
+ MTRANS 4 1227.1 1251.1
+ CH2O 2 1235.0 1255.0
+ CALC 2 1237.6 1257.6
+ TUE 2 1240.8 1260.8
+ SMOKE 1 1243.6 1261.6
<none> 1247.7 1263.7
+ Gender 1 1246.9 1264.9
+ NCP 1 1247.1 1265.1
Step: AIC=1242.27
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF
Df Deviance AIC
+ SCC 1 1207.0 1231.0
+ MTRANS 4 1201.4 1231.4
+ CH2O 2 1207.2 1233.2
+ Age 1 1209.9 1233.9
+ CALC 2 1213.8 1239.8
+ SMOKE 1 1216.7 1240.7
+ TUE 2 1214.7 1240.7
<none> 1220.3 1242.3
+ NCP 1 1219.5 1243.5
+ Gender 1 1220.3 1244.3
Step: AIC=1231.04
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC
Df Deviance AIC
+ MTRANS 4 1187.5 1219.5
+ CH2O 2 1194.9 1222.9
+ Age 1 1199.2 1225.2
+ SMOKE 1 1202.2 1228.2
+ CALC 2 1201.6 1229.6
+ TUE 2 1202.0 1230.0
<none> 1207.0 1231.0
+ NCP 1 1206.5 1232.5
+ Gender 1 1207.0 1233.0
Step: AIC=1219.52
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS
Df Deviance AIC
+ Age 1 1152.3 1186.3
+ CH2O 2 1176.0 1212.0
+ SMOKE 1 1182.5 1216.5
+ TUE 2 1181.1 1217.1
+ CALC 2 1182.6 1218.6
<none> 1187.5 1219.5
+ NCP 1 1186.9 1220.9
+ Gender 1 1187.5 1221.5
Step: AIC=1186.29
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age
Df Deviance AIC
+ CH2O 2 1142.8 1180.8
+ NCP 1 1149.8 1185.8
<none> 1152.3 1186.3
+ SMOKE 1 1150.4 1186.4
+ CALC 2 1148.8 1186.8
+ Gender 1 1151.8 1187.8
+ TUE 2 1150.2 1188.2
Step: AIC=1180.75
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age + CH2O
Df Deviance AIC
<none> 1142.8 1180.8
+ SMOKE 1 1140.8 1180.8
+ NCP 1 1141.0 1181.0
+ CALC 2 1139.2 1181.2
+ Gender 1 1142.0 1182.0
+ TUE 2 1140.3 1182.3
👉 The <none> row in the backwards stepwise output is R’s way of saying:
➡️ What if we stop here and don’t drop any more variables?
In other words, it’s a baseline reference — the option of not changing the model.
Here we can see that the forward selection started with the null (intercept-only) model.
At each step, the algorithm considered adding one variable at a time, and kept the change only if it improved the model fit according to AIC.
The process continued until no further additions decreased the AIC. In the end, the forward selection landed on a model with 9 predictors (the first predictor that was added was family_history_with_overweight, the second one was CAEC, \(\dots\), the 9th one was CH2O. The final fitted model obtained is:
formula(forward_glm)Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age + CH2O
Backwards Selection
backward_glm <- step(full_glm, direction = "backward")Start: AIC=1185.42
Obese ~ (Gender + Age + Height + Weight + family_history_with_overweight +
FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE +
CALC + MTRANS + NObeyesdad) - NObeyesdad - Weight - Height
Df Deviance AIC
- TUE 2 1135.7 1183.7
- Gender 1 1133.9 1183.9
- CALC 2 1136.1 1184.1
- NCP 1 1134.4 1184.4
<none> 1133.4 1185.4
- SMOKE 1 1135.5 1185.5
- SCC 1 1140.4 1190.4
- CH2O 2 1143.0 1191.0
- FAF 3 1147.4 1193.4
- Age 1 1160.3 1210.3
- FCVC 2 1165.5 1213.5
- MTRANS 4 1174.7 1218.7
- FAVC 1 1180.8 1230.8
- CAEC 3 1225.1 1271.1
- family_history_with_overweight 1 1260.7 1310.7
Step: AIC=1183.67
Obese ~ Gender + Age + family_history_with_overweight + FAVC +
FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + CALC + MTRANS
Df Deviance AIC
- Gender 1 1136.0 1182.0
- CALC 2 1138.7 1182.7
- NCP 1 1136.8 1182.8
- SMOKE 1 1137.6 1183.6
<none> 1135.7 1183.7
- CH2O 2 1144.8 1188.8
- SCC 1 1142.9 1188.9
- FAF 3 1149.4 1191.4
- Age 1 1166.1 1212.1
- FCVC 2 1170.2 1214.2
- MTRANS 4 1177.9 1217.9
- FAVC 1 1182.0 1228.0
- CAEC 3 1228.4 1270.4
- family_history_with_overweight 1 1262.4 1308.4
Step: AIC=1181.98
Obese ~ Age + family_history_with_overweight + FAVC + FCVC +
NCP + CAEC + SMOKE + CH2O + SCC + FAF + CALC + MTRANS
Df Deviance AIC
- CALC 2 1139.0 1181.0
- NCP 1 1137.2 1181.2
<none> 1136.0 1182.0
- SMOKE 1 1138.0 1182.0
- CH2O 2 1144.9 1186.9
- SCC 1 1143.4 1187.4
- FAF 3 1149.4 1189.4
- Age 1 1166.2 1210.2
- FCVC 2 1171.3 1213.3
- MTRANS 4 1177.9 1215.9
- FAVC 1 1183.9 1227.9
- CAEC 3 1230.2 1270.2
- family_history_with_overweight 1 1263.7 1307.7
Step: AIC=1181.04
Obese ~ Age + family_history_with_overweight + FAVC + FCVC +
NCP + CAEC + SMOKE + CH2O + SCC + FAF + MTRANS
Df Deviance AIC
- NCP 1 1140.8 1180.8
- SMOKE 1 1141.0 1181.0
<none> 1139.0 1181.0
- CH2O 2 1147.9 1185.9
- SCC 1 1146.8 1186.8
- FAF 3 1153.8 1189.8
- Age 1 1170.7 1210.7
- FCVC 2 1177.2 1215.2
- MTRANS 4 1183.0 1217.0
- FAVC 1 1190.0 1230.0
- CAEC 3 1245.8 1281.8
- family_history_with_overweight 1 1265.9 1305.9
Step: AIC=1180.81
Obese ~ Age + family_history_with_overweight + FAVC + FCVC +
CAEC + SMOKE + CH2O + SCC + FAF + MTRANS
Df Deviance AIC
- SMOKE 1 1142.8 1180.8
<none> 1140.8 1180.8
- CH2O 2 1150.4 1186.4
- SCC 1 1149.1 1187.1
- FAF 3 1155.9 1189.9
- Age 1 1171.2 1209.2
- MTRANS 4 1183.4 1215.4
- FCVC 2 1179.5 1215.5
- FAVC 1 1192.4 1230.4
- CAEC 3 1246.2 1280.2
- family_history_with_overweight 1 1268.4 1306.4
Step: AIC=1180.75
Obese ~ Age + family_history_with_overweight + FAVC + FCVC +
CAEC + CH2O + SCC + FAF + MTRANS
Df Deviance AIC
<none> 1142.8 1180.8
- CH2O 2 1152.3 1186.3
- SCC 1 1150.3 1186.3
- FAF 3 1157.7 1189.7
- Age 1 1176.0 1212.0
- FCVC 2 1181.2 1215.2
- MTRANS 4 1186.8 1216.8
- FAVC 1 1192.9 1228.9
- CAEC 3 1246.4 1278.4
- family_history_with_overweight 1 1271.0 1307.0
Here we can see that the backward selection started with the full model (including all available predictors, except those we excluded such as Weight, Height, and NObeyesdad).
At each step, the algorithm considered removing one variable at a time, and kept the change only if it improved the model fit according to AIC.
The process continued until no further removals decreased the AIC. In the end, the backward selection landed on a model with 9 predictors (after dropping unnecessary ones).
formula(backward_glm)Obese ~ Age + family_history_with_overweight + FAVC + FCVC +
CAEC + CH2O + SCC + FAF + MTRANS
Notably, the predictors that were seleced using backward selection are not identical oto the predictors chosen using foreward selection.
Both-Direction Selection
Finally, we can try stepwise selection in both directions.
This approach starts from the null model (intercept-only), and at each step the algorithm is free to either:
- Add a variable (like in forward selection), or
- Remove a variable that was previously added (like in backward selection).
This flexibility allows it to explore a wider range of possible models.
both_glm <- step(
null_glm,
scope = list(lower = null_glm, upper = full_glm),
direction = "both",
trace = TRUE
)Start: AIC=1753.51
Obese ~ 1
Df Deviance AIC
+ family_history_with_overweight 1 1471.2 1475.2
+ CAEC 3 1545.7 1553.7
+ FAVC 1 1642.8 1646.8
+ SCC 1 1700.3 1704.3
+ Age 1 1707.7 1711.7
+ FAF 3 1716.0 1724.0
+ CALC 2 1719.6 1725.6
+ FCVC 2 1720.7 1726.7
+ MTRANS 4 1717.3 1727.3
+ CH2O 2 1730.8 1736.8
+ TUE 2 1744.5 1750.5
+ NCP 1 1748.7 1752.7
<none> 1751.5 1753.5
+ Gender 1 1751.5 1755.5
+ SMOKE 1 1751.5 1755.5
Step: AIC=1475.25
Obese ~ family_history_with_overweight
Df Deviance AIC
+ CAEC 3 1355.4 1365.4
+ FAVC 1 1399.9 1405.9
+ CALC 2 1427.4 1435.4
+ FCVC 2 1430.6 1438.6
+ FAF 3 1436.4 1446.4
+ SCC 1 1442.8 1448.8
+ MTRANS 4 1440.7 1452.7
+ CH2O 2 1452.9 1460.9
+ Age 1 1455.0 1461.0
+ TUE 2 1460.4 1468.4
<none> 1471.2 1475.2
+ Gender 1 1469.3 1475.3
+ NCP 1 1471.0 1477.0
+ SMOKE 1 1471.2 1477.2
- family_history_with_overweight 1 1751.5 1753.5
Step: AIC=1365.38
Obese ~ family_history_with_overweight + CAEC
Df Deviance AIC
+ FAVC 1 1293.5 1305.5
+ FCVC 2 1308.0 1322.0
+ FAF 3 1324.3 1340.3
+ MTRANS 4 1326.2 1344.2
+ CALC 2 1331.8 1345.8
+ CH2O 2 1334.6 1348.6
+ SCC 1 1337.5 1349.5
+ Age 1 1344.0 1356.0
+ TUE 2 1346.7 1360.7
+ Gender 1 1351.2 1363.2
<none> 1355.4 1365.4
+ NCP 1 1353.6 1365.6
+ SMOKE 1 1354.3 1366.3
- CAEC 3 1471.2 1475.2
- family_history_with_overweight 1 1545.7 1553.7
Step: AIC=1305.53
Obese ~ family_history_with_overweight + CAEC + FAVC
Df Deviance AIC
+ FCVC 2 1247.7 1263.7
+ FAF 3 1265.6 1283.6
+ MTRANS 4 1267.1 1287.1
+ CH2O 2 1272.5 1288.5
+ CALC 2 1276.8 1292.8
+ SCC 1 1280.6 1294.6
+ Age 1 1282.8 1296.8
+ TUE 2 1282.4 1298.4
+ Gender 1 1285.3 1299.3
+ SMOKE 1 1289.4 1303.4
<none> 1293.5 1305.5
+ NCP 1 1292.5 1306.5
- FAVC 1 1355.4 1365.4
- CAEC 3 1399.9 1405.9
- family_history_with_overweight 1 1463.7 1473.7
Step: AIC=1263.65
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC
Df Deviance AIC
+ FAF 3 1220.3 1242.3
+ Age 1 1231.7 1249.7
+ SCC 1 1232.4 1250.4
+ MTRANS 4 1227.1 1251.1
+ CH2O 2 1235.0 1255.0
+ CALC 2 1237.6 1257.6
+ TUE 2 1240.8 1260.8
+ SMOKE 1 1243.6 1261.6
<none> 1247.7 1263.7
+ Gender 1 1246.9 1264.9
+ NCP 1 1247.1 1265.1
- FCVC 2 1293.5 1305.5
- FAVC 1 1308.0 1322.0
- CAEC 3 1358.8 1368.8
- family_history_with_overweight 1 1422.9 1436.9
Step: AIC=1242.27
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF
Df Deviance AIC
+ SCC 1 1207.0 1231.0
+ MTRANS 4 1201.4 1231.4
+ CH2O 2 1207.2 1233.2
+ Age 1 1209.9 1233.9
+ CALC 2 1213.8 1239.8
+ SMOKE 1 1216.7 1240.7
+ TUE 2 1214.7 1240.7
<none> 1220.3 1242.3
+ NCP 1 1219.5 1243.5
+ Gender 1 1220.3 1244.3
- FAF 3 1247.7 1263.7
- FCVC 2 1265.6 1283.6
- FAVC 1 1277.3 1297.3
- CAEC 3 1329.2 1345.2
- family_history_with_overweight 1 1393.5 1413.5
Step: AIC=1231.04
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC
Df Deviance AIC
+ MTRANS 4 1187.5 1219.5
+ CH2O 2 1194.9 1222.9
+ Age 1 1199.2 1225.2
+ SMOKE 1 1202.2 1228.2
+ CALC 2 1201.6 1229.6
+ TUE 2 1202.0 1230.0
<none> 1207.0 1231.0
+ NCP 1 1206.5 1232.5
+ Gender 1 1207.0 1233.0
- SCC 1 1220.3 1242.3
- FAF 3 1232.4 1250.4
- FCVC 2 1254.0 1274.0
- FAVC 1 1260.0 1282.0
- CAEC 3 1311.3 1329.3
- family_history_with_overweight 1 1361.9 1383.9
Step: AIC=1219.52
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS
Df Deviance AIC
+ Age 1 1152.3 1186.3
+ CH2O 2 1176.0 1212.0
+ SMOKE 1 1182.5 1216.5
+ TUE 2 1181.1 1217.1
+ CALC 2 1182.6 1218.6
<none> 1187.5 1219.5
+ NCP 1 1186.9 1220.9
+ Gender 1 1187.5 1221.5
- MTRANS 4 1207.0 1231.0
- SCC 1 1201.4 1231.4
- FAF 3 1211.1 1237.1
- FCVC 2 1229.0 1257.0
- FAVC 1 1238.0 1268.0
- CAEC 3 1294.3 1320.3
- family_history_with_overweight 1 1343.6 1373.6
Step: AIC=1186.29
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age
Df Deviance AIC
+ CH2O 2 1142.8 1180.8
+ NCP 1 1149.8 1185.8
<none> 1152.3 1186.3
+ SMOKE 1 1150.4 1186.4
+ CALC 2 1148.8 1186.8
+ Gender 1 1151.8 1187.8
+ TUE 2 1150.2 1188.2
- SCC 1 1160.4 1192.4
- FAF 3 1165.8 1193.8
- Age 1 1187.5 1219.5
- MTRANS 4 1199.2 1225.2
- FCVC 2 1195.4 1225.4
- FAVC 1 1203.7 1235.7
- CAEC 3 1255.5 1283.5
- family_history_with_overweight 1 1288.7 1320.7
Step: AIC=1180.75
Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age + CH2O
Df Deviance AIC
<none> 1142.8 1180.8
+ SMOKE 1 1140.8 1180.8
+ NCP 1 1141.0 1181.0
+ CALC 2 1139.2 1181.2
+ Gender 1 1142.0 1182.0
+ TUE 2 1140.3 1182.3
- CH2O 2 1152.3 1186.3
- SCC 1 1150.3 1186.3
- FAF 3 1157.7 1189.7
- Age 1 1176.0 1212.0
- FCVC 2 1181.2 1215.2
- MTRANS 4 1186.8 1216.8
- FAVC 1 1192.9 1228.9
- CAEC 3 1246.4 1278.4
- family_history_with_overweight 1 1271.0 1307.0
In the end, the both-direction stepwise landed on a model with
9 predictors. In this case, it never removed any predictors.
formula(both_glm)Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC +
FAF + SCC + MTRANS + Age + CH2O
Prediction
Like with lm, the output of glm provides the estimates of the parameters, i.e. the \(\hat\beta\)s. As discussed in further detail in this section, that the the logistic regression coefficients give the change in the log odds of the given observation being obsese for every one unit increase in the predictor variable. Plugging our estimates into equations Equation 2 and Equation 1 we get:
\[\begin{align*} \text{log-odds of being Obese} &= -0.44 + 0.12 \times \texttt{NCP}\\ \implies \quad P(Y=\text{Obese} \mid X) &= \dfrac{e^{-0.44 + 0.12 \times \texttt{NCP}}} {1+ e^{-0.44 + 0.12 \times \texttt{NCP}}} \end{align*}\]
Hence with a NCP of, say, 3:
\[\begin{align*} \text{log-odds of being Obese} &= -0.44 + 0.12 \times \texttt{NCP}\\ \implies \quad P(Y=\text{Obese} \mid \texttt{NCP} = 3) &= \dfrac{e^{-0.44 + 0.12 \times 3}} {1+ e^{-0.44 + 0.12 \times 3}}\\ &= 0.4786473 \end{align*}\]
We could find this using the predict() function. You have two ways of specifying the type of prediction (as determined by the type argument in predict.glm; see ?predict.gml for details):
- The default is on the scale of the linear predictors; Thus for a logistic regression default predictions are of log-odds value for the new data point, which represents the natural logarithm of the odds that the observation belongs to the positive class.
- the alternative
type = "response"is on the scale of the response variable. Thus for logistic regression,type = "response"gives the predicted probabilities.
where newdata1 is a dataframe containing the NCP of our new individual.
newdata1# returns the log-odds
predict(simp_glm, newdata = newdata1) 1
-0.08546284
By default, the predict() function returns the log-odds of being Obese. In order to get the corresponding probability you need to use:
# returns the probability
predict(simp_glm, newdata1, type = "response") 1
0.4786473
Recall: the output of logistic regression is \(\Pr(Y = 1\mid x)\). Therefore, for binary response variables (\(y = 1\) or \(y = 0\)) any observations with predicted probability greater that 0.5 are classified as class 1.
To convert the above probability into a classification, simply map observations to whichever class for which it has the highest probability of belonging. In our case
\[ \begin{align} \Pr(Y = \text{Obese} \mid X = 3) &= 0.4786473\\ \implies \Pr(Y = \text{not Obese} \mid X = 3) &= 1 - 0.4786473 \\ &= 0.5213527 \end{align} \]
Since a person with the NCP of 3 has a 0.4786473 probability of being Obese, they are predicted as not Obese.
Classification Performance
Test Error
To evaluate the performance of our logistic regression model, we compute the test error by applying the model to the test set, predicting the class labels for each observation, and comparing these predictions to the actual class labels. The test error reflects how well the model generalizes to new, unseen data.
To calculate the test error in R:
- Predict probabilities for the test set using the trained model.
- Convert probabilities to class predictions using a threshold (typically 0.5 for binary classification).
- Compare the predictions to the true class labels and calculate the error rate.
Step 1: Predict probabilities for the test set
We first predict the probabilities of the test set observations belonging to the Obese class.
# Predict probabilities on the test set
pred.prob <- predict(simp_glm, newdata = test_data, type = "response")
y = test_data$Obese
# see the first 10:
head(cbind(pred.prob, y)) pred.prob y
1502 0.4786473 2
1909 0.4786473 2
1633 0.4786473 2
1686 0.4786473 2
1150 0.4701871 1
1679 0.4786473 2
Step 2: Convert probabilities to class predictions
Next, we classify each observation as either obese or not based on the 0.5 threshold. In other words, if the \(\Pr(Y_i = \text{obese} \mid x_i) > 0.5\) then \(x_i\) is classified as Obese, otherwise, they are predicted as not Obese.
# Convert probabilities to class predictions using a 0.5 threshold
pred_class <- factor(pred.prob>0.5, levels= c(FALSE, TRUE), labels=levels(test_data$Obese))The 0.5 threshold is conventional; lowering it generally increases recall and lowers precision, and vice versa.
Step 3: Compare predictions to actual test labels and calculate the error rate
We now compare the predicted classes to the actual classes in the test set and compute the test error. The test error is the proportion of incorrectly classified observations.
# Actual test labels
actual_class <- test_data$Obese
# print the result
data.frame(yhat = pred_class, y = actual_class)To calculated the test error (or misclassification rate) we use:
\[ \text{Test Error} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbb{I}(\hat{y}_i \neq y_i) = \frac{\text{total misclassified}}{\text{total no. observations}} \tag{3}\]
where
- \(n_{\text{text}}\) are the number of observations in our test set
- \(\hat{y}_i\) represents the predicted class and
- \({y}_i\) represents the actual class
This can be computed in R using:
# Calculate the error rate
test_error <- mean(pred_class != actual_class)
test_error[1] 0.4988152
Confusion Matrix
We can summarize and compare our predictions to the truth using a confusion matrix. It can be found quickly using the table() function:
tab <- table(actual_class, pred_class)
tab pred_class
actual_class NotObese Obese
NotObese 422 44
Obese 377 1
Usually when we are organizing a confusion matrix so that the diagonal elements represent the correctly classified observations and the off-diagonal elements represent the number of misclassified observations. Using the confusion matrix, we can easily recalculate the test error as shown in Equation Equation 3 by summing the misclassified observations and dividing by the total number of observations.
\[\begin{equation} \frac{\text{total misclassified}}{\text{total no. observations}} =\frac{44+377}{844} = 0.4988152 \end{equation}\]
Other Metrics
The caret package has a nice wrapper function confusionMatrix() that gives you precision, recall (called sensitivity), specificity, and more.
library(caret)
# suppose `pred` are your predictions and `truth` is the true outcome
cm <- confusionMatrix(data = pred_class, reference = actual_class, positive = "Obese")
cmConfusion Matrix and Statistics
Reference
Prediction NotObese Obese
NotObese 422 377
Obese 44 1
Accuracy : 0.5012
95% CI : (0.4669, 0.5355)
No Information Rate : 0.5521
P-Value [Acc > NIR] : 0.9987
Kappa : -0.1001
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.002646
Specificity : 0.905579
Pos Pred Value : 0.022222
Neg Pred Value : 0.528160
Prevalence : 0.447867
Detection Rate : 0.001185
Detection Prevalence : 0.053318
Balanced Accuracy : 0.454112
'Positive' Class : Obese
If we want to calculate the F1 score, we can use the MLmetrics package which has functions you can call directly for the different metrics:
library(MLmetrics)
Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':
MAE, RMSE
The following object is masked from 'package:base':
Recall
# calculates F1 score
F1_Score(y_pred = pred_class, y_true = actual_class, positive = "Obese")[1] 0.004728132
# also can calculate other metrics...
Precision(y_pred = pred_class, y_true = actual_class, positive = "Obese")[1] 0.02222222
Recall(y_pred = pred_class, y_true = actual_class, positive = "Obese")[1] 0.002645503
Comparing Models
Up to this point, we have fit several logistic regression models: the null model, a simple one-predictor model, the full model with all predictors, and three stepwise variants (forward, backward, and both directions). Each of these represents a different trade-off between model complexity and fit.
To move beyond individual summaries, it is helpful to evaluate them side by side using a consistent set of metrics. This allows us to answer questions like:
Which model fits the training data best according to AIC?
Which model generalizes best to unseen data?
Do simpler models perform comparably to more complex ones?
We will compare models using both in-sample and out-of-sample metrics:
AIC: a likelihood-based measure that penalizes complexity.
Test-set performance: accuracy, error rate, precision, recall, F1 score, ROC-AUC, and log-loss.
Together, these metrics provide a fuller picture: AIC rewards parsimony, while test metrics assess real-world predictive power. By aligning both views, we can better judge whether stepwise selection or the full model is most effective for predicting obesity from lifestyle and behavioral factors.
Code
# Packages for metrics
library(MLmetrics) # F1_Score, Precision, Recall
library(dplyr)
library(tibble)
library(stringr)
# Helper to evaluate one model
eval_model <- function(mod, data_test, positive = "Obese", thresh = 0.5) {
# Probabilities on test set
p <- predict(mod, newdata = data_test, type = "response")
# Predicted class at threshold
neg <- setdiff(levels(data_test$Obese), positive)
yhat <- factor(ifelse(p >= thresh, positive, neg), levels = levels(data_test$Obese))
# Truth
y <- data_test$Obese
# Metrics
acc <- mean(yhat == y)
err <- 1 - acc
prec <- Precision(y_pred = yhat, y_true = y, positive = positive)
rec <- Recall(y_pred = yhat, y_true = y, positive = positive)
f1 <- F1_Score(y_pred = yhat, y_true = y, positive = positive)
tibble(
model = deparse(formula(mod)),
AIC = AIC(mod),
accuracy = acc,
error = err,
precision= prec,
recall = rec,
F1 = f1
)
}
# Evaluate a list of models without purrr
model_names <- c("null","simple","forward","backward","both","full")
model_list <- list(null = null_glm,
simple = simp_glm,
forward = forward_glm,
backward = backward_glm,
both = both_glm,
full = full_glm)
results_list <- vector("list", length(model_list))
names(results_list) <- names(model_list)
for (i in seq_along(model_list)) {
# try() to keep going even if one model errors
res <- try(eval_model(model_list[[i]], data_test = test_data), silent = TRUE)
if (inherits(res, "try-error")) {
# capture the error in a one-row tibble so the table still prints
results_list[[i]] <- tibble(
model = names(model_list)[i],
AIC = NA_real_,
accuracy = NA_real_,
error = NA_real_,
precision= NA_real_,
recall = NA_real_,
F1 = NA_real_
)
} else {
results_list[[i]] <- res
}
}
results <- dplyr::bind_rows(results_list)
# order the results by best F1 score:
knitr::kable(results[order(results$F1), ], digits = 3)| model | AIC | accuracy | error | precision | recall | F1 |
|---|---|---|---|---|---|---|
| Obese ~ NCP | 1752.646 | 0.501 | 0.499 | 0.022 | 0.003 | 0.005 |
| Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| FAF + SCC + MTRANS + Age + CH2O | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| Obese ~ Age + family_history_with_overweight + FAVC + FCVC + | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| CAEC + CH2O + SCC + FAF + MTRANS | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| Obese ~ family_history_with_overweight + CAEC + FAVC + FCVC + | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| FAF + SCC + MTRANS + Age + CH2O | 1180.751 | 0.768 | 0.232 | 0.697 | 0.852 | 0.767 |
| Obese ~ 1 | 1753.506 | 0.552 | 0.448 | NaN | 0.000 | NaN |
| full | NA | NA | NA | NA | NA | NA |
We are seeing some NAs for the full_glm model, since the training set did not exhibit any examples of the Always category for the CALC variable. This happens becuase this option was only selected once in our data set (and it happened to be assigned to the test set)
table(obesity$CALC)
Never Sometimes Frequently Always
639 1401 70 1
table(train_data$CALC)
Never Sometimes Frequently Always
390 838 39 0
table(test_data$CALC)
Never Sometimes Frequently Always
249 563 31 1
Handling Rare Factor Levels
Because the training set didn’t contain that level, glm() has no coefficient for it, and predict() fails on test rows with “Always”. In practice, ultra-rare categories are unstable and hard to estimate. A common remedy is to merge rare levels with a neighboring or semantically similar level before splitting into train/test.
We’ll merge “Always” into “Frequently” across the full dataset, and then re-do the split and model fits.
library(forcats)
# Collapse "Always" into "Frequently" BEFORE train/test split
obesity <- obesity %>%
mutate(
CALC = fct_collapse(CALC, Frequently = c("Frequently", "Always"))
)
# sanity check
table(obesity$CALC)
Never Sometimes Frequently
639 1401 71
Exercise: Now recreate the train/test split, refit models, and your full_glm will predict on the test set without NA/level errors.
Conclusion
Looking across the different logistic regression fits (null, simple, stepwise, and full), we see that the stepwise-selected models perform almost identically. This indicates that no single predictor dominates and that multiple subsets of variables can provide essentially the same predictive power.
In contrast, the null model and the one-predictor model (NCP only) perform substantially worse, with accuracy close to random guessing and near-zero recall. These serve as useful baselines, showing that the inclusion of behavioral and lifestyle factors meaningfully improves predictive ability.
The stepwise models select different predictors but achieve nearly identical accuracy, recall, and F1 on the test set. This suggests overlapping/partly redundant information among predictors: many subsets work about equally well. Use stepwise as an exploratory tool, and prioritize out-of-sample performance over any particular selected subset.
Footnotes
In case you were wondering how a generate this seed, I used
as.integer(Sys.time()). I like to do this to avoid using similar seeds over and over again, e.g.set.seed(<year>).↩︎we are referring to the natural logarithm↩︎
You can read more about this on StackOverflow: which class is being predicted?↩︎