Lately we’ve been focused on the regression problem; that is, the supervised machine learning problem in which our response variable is a continuous quantity.
Machine learning algorithms that aim at predicting a categorical (AKA qualitative) response are referred to as classification techniques.
In this section, rather than trying to predict a continuous response, we will be trying to classify observations into categories (AKA classes).
Logistic Regression is a supervised statistical method used for binary classification problems.
Examples: Spam detection, medical diagnosis, …
Note that the name is a bit of a misnomer; this is not a method for regression (predicting a continuous \(y\)) but a classification technique (predicting a categorical \(y\)).
Mathematical Definition
As we will see, logistic regression outputs a probability, e.g. \(P(\text{eye colour} = \text{green} \mid X)\).
Can we use this in a linear model such that a predicted value of 0.5 suggests a 50-50 chance that a person’s eye colour is green?
Example: body
Consider the body data set (from the gclus package) from previous lectures.
Let’s attempt to model the recorded Gender as a linear function of Height.
# install.packages("gclus")   # don't include install commands in an Rmd/R script
library(gclus)                # data() cannot find body before library()
data(body); attach(body)      # cannot attach() before data()
fit1 <- lm(Gender ~ Height)   # Gender/Height are only visible after attach()
plot(Gender ~ Height)         # note the formula syntax: Y ~ X
Multiple Linear Regression
Recall the linear regression model: \[\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon
\end{equation}\]
\(Y\) is the random, quantitative response variable
\(X_j\) is the \(j^{th}\) random predictor variable
\(\beta_0\) is the intercept
\(\beta_j\) for \(j = 1, \dots, p\) are the regression coefficients
\(\epsilon\) is the error term
Fitted line
plot(Gender ~ Height, ylab = "Probability of Male")
abline(fit1, lwd = 2, col = 2)
Same Plot (zoomed out)
plot(Gender ~ Height, ylim = c(-0.5, 1.5), ylab = "Probability of Male")
abline(fit1, lwd = 2, col = 2)
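A quick numerical sketch of the problem (the two heights, 140 cm and 210 cm, are hypothetical values): the fitted straight line produces "probabilities" below 0 and above 1.
range(fitted(fit1))                                         # fitted values from the linear model
predict(fit1, newdata = data.frame(Height = c(140, 210)))   # predictions typically fall outside [0, 1]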
Generalized Linear Model
Rather than modeling continuous \(Y \in (-\infty, \infty)\) using: \[\begin{equation}
Y = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}_{\eta}
\end{equation}\]
we can use a generalized linear model (GLM). A GLM consists of three components:
Systematic component: a linear combination of the predictors (\(\eta\))
Random component: the probability distribution of \(Y\)
Link function: Connects \(E[Y]\) to \(\eta\).
Systematic Component
The systematic component represents the linear combination of the independent variables.
It is typically expressed as \[\begin{equation}
\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
\end{equation}\] where
\(\eta\) is the linear predictor,
\(\beta_0, \beta_1, \dots, \beta_p\) are the coefficients, and
\(X_1, X_2, \dots, X_p\) are the predictors.
Random Component
We assume that \(y_1, \dots, y_n\) are samples of independent random variables \(Y_1, \dots, Y_n\) respectively.
The random component specifies the probability distribution \(f(y_i; \theta_i)\) of the response variable.
For GLMs, the probability distributions are assumed to come from the exponential family.
Logistic regression is a type of GLM where the response variable is binary.
Exponential family
The exponential family includes a large class of probability distributions, e.g. the normal, binomial, Poisson, and gamma distributions, among others.
Classical regression assumes:
\(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\)
\(E[Y_i] = \mu_i\)
Logistic regression assumes
\(Y_i \sim \text{Bernoulli}(\pi_i)\)
\(E[Y_i] = \pi_i\)
Link and Inverse Link Function
The mean of the distribution, \(\mu = E[Y_i]\), depends on the independent variables, \(X\), through the link function.
The inverse link function (aka activation function), \(g^{-1}\), transforms the linear predictor \(\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\) back to the scale of the response variable.
Link function
The link function is a transformation that relates the mean of the response variable to the linear predictor.
\[\begin{equation}
\eta = g(\mu)
\end{equation}\]
where \(\mu = E[Y_i]\)
Link function: Linear Regression
The link function is a transformation that relates the mean of the response variable to the linear predictor.
\[\begin{equation}
\eta = g(\mu)
\end{equation}\]
In classical linear regression, we use the identity link: \(\eta = g(\mu) = \mu\)
Link function: Logistic Regression
In logistic regression, we use the logit link: \(\eta = g(\mu) = \ln \left( \dfrac{\mu}{1-\mu}\right)\)
where \(\mu = E[Y_i] = \pi_i = P(Y_i = 1)\) when \(Y_i \sim \text{Bernoulli}(\pi_i)\)
Logit Function
More generally if \(p\) is a probability, then
\(\dfrac{p}{1 − p}\) is the corresponding odds and
\(\text{logit}(p) = \ln \left( \dfrac{p}{1-p}\right)\) is the logit of the probability \(p\)
this value is sometimes referred to as the log-odds or “logits”
Plotted log-odds
Probabilities converted to the logit (log-odds) scale. Notice how the logit function allows us to map values in \((0, 1)\) to values from negative infinity to infinity.
If \(p = 0.5\), the odds \(\left(\frac{p}{1-p}\right)\) are even and the log-odds are zero.
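A quick numerical sketch of the logit and its inverse in R (qlogis() and plogis() are base R's logit and inverse-logit functions):
p <- c(0.1, 0.5, 0.9)
log_odds <- log(p / (1 - p))   # logit by hand: maps (0, 1) to (-Inf, Inf)
log_odds                       # -2.197  0.000  2.197
qlogis(p)                      # built-in logit gives the same values
plogis(log_odds)               # inverse logit maps the log-odds back to (0, 1)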
Assumptions of Logistic Regression
Binary dependent variable: The response variable \(Y\) must be binary (i.e., it takes on only two possible outcomes, such as 0 and 1, yes or no, success or failure).
Independence of observations: the observations must be independent of one another.
Linearity of the Logit: There should be a linear relationship between the logit of the outcome (log-odds) and each continuous predictor variable.
No multicollinearity: the predictors should not be highly correlated with one another.
iClicker
logit function
In the context of logistic regression, what does the logit function represent?
A. The probability of success in a binary outcome.
B. The natural logarithm of the odds of success.
C. The inverse of the probability of failure.
D. The square root of the expected value of the response variable.
Correct answer: B
iClicker
logit function
In a logistic regression model, what is the range of values that \(Y\) can take?
A. Any real number
B. Any integer
C. 0 or 1
D. 0 to infinity
Correct answer: C
Interpretation of Coefficients
Since the probability \(\pi_i\) varies non-linearly with \(X\), we cannot interpret the coefficients as we did in regular regression.
The best we can do is talk about the direction:
Holding all other variables constant, if \(\beta_j\) is positive, then an increase in \(X_j\) is associated with an increase in \(\pi_i\)
Holding all other variables constant, if \(\beta_j\) is negative, then an increase in \(X_j\) is associated with a decrease in \(\pi_i\).
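A tiny numerical sketch of both points, using made-up (hypothetical) coefficients rather than fitted ones: with a positive \(\beta_1\), the probability increases as \(x\) increases, but not by a constant amount per unit.
b0 <- -2; b1 <- 0.5             # hypothetical coefficients, not estimated from data
x <- 0:8
round(plogis(b0 + b1 * x), 3)   # plogis() is the inverse logit; values increase, but unevenly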
Estimation
Let \(G_1\) be the set of observations where \(Y=1\), let \(G_0\) be the set of observations where \(Y=0\), and let \(\boldsymbol{\beta}\) be the set of coefficients. The likelihood is given by: \[\begin{equation}
\ell(\boldsymbol{\beta}) = \prod_{i \in G_1} \pi_i \prod_{i \in G_0} (1 - \pi_i)
\end{equation}\]
The estimates \(\hat{\boldsymbol{\beta}}\) are the values that maximize this likelihood (maximum likelihood estimation).
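As a small sanity check of this formula (a sketch assuming the body data is loaded and attached as in the earlier code), we can evaluate the log of this likelihood at the fitted coefficients and compare it with what R reports:
simlog <- glm(factor(Gender) ~ Height, family = "binomial")
pi_hat <- fitted(simlog)          # fitted probabilities pi_i
# log of: product over G1 of pi_i  times  product over G0 of (1 - pi_i)
ll <- sum(log(pi_hat[Gender == 1])) + sum(log(1 - pi_hat[Gender == 0]))
ll                                # log-likelihood from the formula above
logLik(simlog)                    # should agree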
Aside: in glm(), family = binomial('probit') fits a probit regression (a common alternative to logistic regression).
Fitted Logistic Regression Plot
We use the logistic function with the fitted \(\hat \beta\) values to plot the fitted S-curve.
# Fit the logistic regression model
simlog <- glm(factor(Gender) ~ Height, family = "binomial")
# store the beta hats for easy referencing
betas <- coef(simlog)
# plot the data (renaming the y-axis)
plot(Gender ~ Height, ylab = "Probability of being Male")
# plot the fitted curve p(x) = e^(b0 + b1*x)/(1 + e^(b0 + b1*x))
curve((exp(betas[1] + betas[2]*x))/(1 + exp(betas[1] + betas[2]*x)),
      add = TRUE,  # superimpose onto the scatterplot (otherwise a new plot)
      lwd = 2,     # line width (make the line a bit thicker)
      col = 2)     # 2 = red
Summary Output
summary(simlog)
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -46.76328 4.00571 -11.67 <2e-16 ***
Height 0.27292 0.02339 11.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 702.52 on 506 degrees of freedom
Residual deviance: 389.59 on 505 degrees of freedom
AIC: 393.59
Number of Fisher Scoring iterations: 5
...
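As a quick illustration of using these estimates (180 cm is just a hypothetical height), plug the coefficients into the inverse logit to get a predicted probability of being male:
b <- coef(simlog)            # intercept and Height estimates from the fit above
eta <- b[1] + b[2] * 180     # linear predictor (log-odds) at Height = 180
exp(eta) / (1 + exp(eta))    # predicted probability, roughly 0.91
plogis(eta)                  # same value via the built-in inverse logit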
iClicker
Interpretation of coefficients
In a logistic regression model, the coefficient \(\beta_1\) associated with a predictor variable \(X_1\) is interpreted as:
A. The change in the predicted probability of the outcome for a one-unit increase in \(X_1\), holding all other variables constant.
B. The change in the odds of the outcome for a one-unit increase in \(X_1\), holding all other variables constant.
C. The change in the log-odds of the outcome for a one-unit increase in \(X_1\), holding all other variables constant.
D. The change in the outcome value for a one-unit increase in \(X_1\), holding all other variables constant.
Correct answer: C
iClicker
Parameter estimation
Which method is used to estimate the coefficients (parameters) in a logistic regression model?
A. Ordinary Least Squares (OLS)
B. Maximum Likelihood Estimation (MLE)
C. Method of Moments
D. Bayesian Estimation
Correct answer: B
Multiple Logistic Regression
As with linear regression, we can just as easily add more predictors.
# Fit the multiple logistic regression model
mult_log_reg <- glm(factor(Gender) ~ Height + WaistG + BicepG, family = "binomial")
summary(mult_log_reg)
...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -55.32670 5.51960 -10.024 < 2e-16 ***
Height 0.21534 0.02864 7.520 5.50e-14 ***
WaistG 0.05167 0.02891 1.787 0.0739 .
BicepG 0.46541 0.08265 5.631 1.79e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 702.52 on 506 degrees of freedom
Residual deviance: 232.90 on 503 degrees of freedom
AIC: 240.9
Number of Fisher Scoring iterations: 6
...
Model fit
Deviance is a lack of fit measure (the smaller the better) that plays the role of RSS for a broader class of models.
Null Deviance is the deviance of a model with no predictors (only an intercept). It serves as a baseline.
The residual deviance measures the deviance that remains unexplained after fitting the logistic regression model.
The Akaike Information Criterion (AIC) is a goodness-of-fit measure that penalizes for the number of parameters.
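These quantities can be pulled straight from a fitted model object; a minimal sketch using the mult_log_reg fit from above:
mult_log_reg$null.deviance   # null deviance (intercept-only model)
deviance(mult_log_reg)       # residual deviance
AIC(mult_log_reg)            # for 0/1 data this equals residual deviance + 2 * (number of coefficients)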
Metrics for Classification
We can evaluate the model by making predictions on new unseen data and assessing its performance; see Lecture 3
error/misclassification rate
accuracy
precision
recall
specificity
F1-score
These can all be computed from a so-called classification table (aka confusion matrix)
Validation of Predicted Values
With a fitted logistic regression model, the predict() function (see ?predict.glm) can output a predicted probability for each observation, which we can convert into a predicted categorical response (e.g. by thresholding at 0.5); see the sketch below.
Typically the off-diagonal entries are errors and the diagonal entries are correctly classified observations.
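A minimal sketch of building such a table with the mult_log_reg fit from above, thresholding the predicted probabilities at 0.5 (counts may differ from Table 1 depending on the model and cutoff used):
pi_hat <- predict(mult_log_reg, type = "response")   # predicted probabilities of being male
y_hat  <- ifelse(pi_hat > 0.5, 1, 0)                 # convert probabilities to class labels
table(actual = Gender, predicted = y_hat)            # classification table (confusion matrix)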
Classification Table
             predicted male   predicted female
1 - male          216                44
0 - female         45               202
Table 1: This table presents the confusion matrix which summarizes the performance of a classification model in predicting Gender. It shows the counts of correctly and incorrectly classified instances for both males and females. The rows represent the actual gender, while the columns represent the predicted gender.
The error (misclassification) rate is \[\begin{equation}
\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i)
\end{equation}\] where \(\hat y_i\) is the predicted class label for the \(i\)th observation \(x_i\) using \(\hat f\) and \[\begin{equation}
I(y_i \neq \hat y_i)=
\begin{cases}
1, & y_i \neq \hat y_i\\
0, & \text{otherwise}
\end{cases}
\end{equation}\]
The error rate for this classification table is \[\begin{align}
\dfrac{44+45}{216+44+45+202} = \dfrac{89}{507} = 0.1755424 \approx 17.6\%
\end{align}\]
F1-score is particularly useful when there is an uneven class distribution (i.e. “unbalanced” classes).
Summary of Metrics
Accuracy: 82.45%
Precision: 82.76%
Recall: 83.08%
Specificity: 81.78%
F1-Score: 82.92%
Accuracy measures overall correctness.
Precision measures the accuracy of the positive predictions.
Recall measures how well the model identifies true positives.
Specificity measures how well the model identifies true negatives.
F1-Score balances precision and recall
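These values can be reproduced directly from the counts in Table 1; a minimal sketch treating male (the 1 class) as the positive class:
TP <- 216; FN <- 44    # actual males:   predicted male / predicted female
FP <- 45;  TN <- 202   # actual females: predicted male / predicted female
accuracy    <- (TP + TN) / (TP + FN + FP + TN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)            # aka sensitivity
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)
round(100 * c(accuracy, precision, recall, specificity, f1), 2)   # 82.45 82.76 83.08 81.78 82.92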
Which metric to use?
Accuracy: When the dataset is balanced (i.e., both classes are equally represented)
Precision: use when the cost of false positives is high, e.g. spam
Recall: use when the cost of false negatives is high, e.g. disease screening
Specificity: use when it's important to correctly identify negative cases, e.g. fraud detection
F1-score: When there is an imbalance between precision and recall, and you want a single metric that accounts for both. Useful in imbalanced datasets.
Testing vs Training error rate
Akin to our earlier discussions of MSE for regression, we are generally interested in the error rate on the testing set rather than the training set.
That is, for a set of new observations \((x_0, y_0)\), a good classifier achieves the smallest test error on average: \[\begin{equation}
E[I(y_0 \neq \hat y_0)],
\end{equation}\] where \(\hat y_0\) is the predicted class label for \(x_0\) that uses \(\hat f\).
Comments
Is it reasonable to assume a linear relationship between the log-odds and the predictors?
Your guess is as good as mine…(no more residuals to check)
What if we have multiple categories for our response?
No problem; it requires a bit of a change in notation and mathematics, but R can handle it just fine (e.g. multinomial regression via family = "multinomial").
✏️ Next class we’ll move on to another (more natural) model for classification (Bring some paper and a pen!)