Lecture 5: Linear Regression

DATA 311: Machine Learning

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • Today we introduce our first model for the supervised regression problem: linear regression.

  • Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

  • The objective of this model is to find the best-fitting line (or hyperplane) that represents the relationship between the dependent and independent variables.

Learning objectives

  • Refresh the basics of simple linear regression (SLR)
  • Introduce multiple linear regression (MLR)
  • Interpret regression coefficients in context
  • Explore extensions: polynomial terms, interactions, categorical predictors
  • Diagnose and address common regression issues

iClicker

iClicker

How familiar are you with regression analysis?

A. Very familiar; I’ve worked with multiple regression models, interactions, and diagnostics

B. Familiar; I’ve worked with multiple regression models

C. Somewhat familiar; I’ve seen simple linear regression before

D. Only heard of it and haven’t had much hands on experience with

E. Not familiar at all

Outline

Any slides in gray you don’t need to pay too much attention to.

Why Linear Regression

  • Regression is one of the simplest machine learning algorithms to understand/implement.
  • It provides a solid foundation for learning more complex models and concepts, e.g. model evaluation (MSE), variable selection, overfitting and regularization, \(\dots\)
  • Linear regression serves as a versatile and widely used benchmark model for many machine learning tasks.

Advertising Dataset

  • The Advertising data set (csv) consists of the sales (in thousands) of a certain product and the advertising budgets (in thousands of dollars) for the product in three media:

    • TV, radio, and newspaper
  • Our goal is to develop an accurate model that can be used to predict sales on the basis of the media budgets.

library(dplyr)
Advertising <- read.csv("https://www.statlearning.com/s/Advertising.csv")
Advertising <- rename(Advertising, market = X) 
Advertising

Advertising Plots

Below plots the Sales vs TV, Radio and Newspaper, with a blue simple linear-regression line fit separately to each.

ISLR Fig 2.1

Simple linear regression

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

  • \(Y\) is the quantitative response (aka dependent) variable
  • \(X\) is the random predictor (aka independent) variable
  • \(\beta_0\) is the intercept (unknown)
  • \(\beta_1\) is the slope (unknown)
  • \(\epsilon\) is the error
  • \(i\) is the index for each observation

Assumptions

  1. Linearity: The relationship between the independent and dependent variables should be (approximately) linear.
  1. Independence of Errors: The errors (residuals) in the model should be independent of each other.
  1. Homoscedasticity: the variance of the residuals is constant across all levels of the predictors.
  1. Normality: The error terms are normally distributed

Notation

For example, \(X_i\) may represent the amount spent in TV advertising in market \(i\), and \(Y_i\) may represent sales of that product in market \(i\).

Mathematically, we can write: \[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV}\]

Here we “regress sales onto TV”, or we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\)).

Scatterplot

plot(sales~TV, data = Advertising)

Scatterplot with Sqrt Transformation

plot(sales~sqrt(TV), data = Advertising)

lm() function

fitlm <- lm(formula, data, subset)
Type of model formula
Single predictor (SLR) response ~ predictor
Multiple predictors (MLR) response ~ predictor1 + predictor2 + ... + predictorp
All variables response ~ .
Excluding a specific predictor response ~ . - predictor
Interaction terms response ~ predictor1 * predictor2
Polynomial terms response ~ poly(predictor, degree)
Categorical Variables response ~ factor(predictor)
Transformed Variables response ~ I(log(predictor))

Data Splitting

Before we fit our model, we will first split our data into training and testing; here we’ll use a 50/50 split.

# number of observations
n <- nrow(Advertising)

# set seed for reproducibility
set.seed(123)

# randomly select 50% to be in our training
train_ind <- sample(1:n, n/2)

# the other 50%
test_ind <- setdiff(1:n, train_ind)

SLR in R

Typically we will save our fitted model to an object in R1

# Fit model
fit1 <- lm(sales ~ TV, data = Advertising, subset = train_ind)
fit1

Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)

Coefficients:
(Intercept)           TV  
    7.44821      0.04381  

Our estimate for \(\beta_0\) is \(\hat \beta_0\) = 7.44821
Our estimate for \(\beta_1\) is \(\hat \beta_1\) = 0.04381

Interpretation

  • Intercept (\(\hat\beta_0\))
    • Predicted (expected) value of \(y\) when \(x = 0\)
    • Interpretable only if \(X=0\) is meaningful in the data context.
  • Slope (\(\hat\beta_1\))
    • The expected change in \(Y\) for a one-unit increase in \(X\).

iClicker

iClicker

Which is the correct interpretation of the intercept \(\hat\beta_0\)?

A. If sales are 0, then the expected TV advertising spending is 7.45 thousand dollars.

B. If TV advertising spending is 0, then the expected sales of the product is 7.448 units.

C. If TV advertising spending is 0, then the expected sales of the product is 7448 units.

D. The intercept is not interpretable in this scenario.

iClicker

iClicker

Which is the best interpretation of the slope \(\hat\beta_1\)?

A. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 0.04381 units.

B. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 44 units.

C. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase by 43.81 units.

D. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase increases by $43.81.

Plot of Line

Extrapolation

  • Regression models are reliable within the range of observed data
  • Predictions outside this range (extrapolation) can be very misleading
  • The model assumes the same linear trend continues indefinitely
  • In practice: relationships may change beyond observed values

Warning

The \(y\)-intercept is often a large extrapolation (and therefore “meaningless”), so be careful with the interpretation.

Ordinary Least Squares

  • The \(\hat \beta\)s are most commonly estimated by ordinary least squares (OLS)

  • Using some basic calculs, this statistical technique involves minimizing the residual sum of squares:

\[ \text{RSS}=\sum_{i=1}^n (\underbrace{y_i-\hat y_i}_{\text{residual}})^{2} \]

Advertising SLR

ISLR Fig 3.1: The least squares fit for the regression of sales onto TV. In this case, a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot

3D Visualization

ISLR Fig 3.4: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.

Fitted model

The “fitted model” is given by:

\[ \hat f(X) = \hat \beta_0 + \hat \beta_1 X_1 \]

For our SLR model:

\[ \texttt{sales} = 7.45 + 0.04\times \texttt{TV} \]

Important

❌ Don’t hard-code results into reports ✅ Do pull values from fitted object

e.g., slope can be extracted using:

fit1$coefficients["TV"] # or [2]
        TV 
0.04380845 

e.g., intercept extracted using:

fit1$coefficients[1] # or ["(Intercept)"]
(Intercept) 
   7.448213 

Summary of fit

# For a more detailed output use summary()
summary(fit1) # scroll to the bottom!

Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7700 -1.9111  0.1785  1.8400  6.3916 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.448213   0.596828   12.48   <2e-16 ***
TV          0.043808   0.003494   12.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.028 on 98 degrees of freedom
Multiple R-squared:  0.6161,    Adjusted R-squared:  0.6121 
F-statistic: 157.3 on 1 and 98 DF,  p-value: < 2.2e-16

lm() output

  • Call: shows the model you fit
  • Residuals: quick check of symmetry/spread
  • Coefficients Table : estimates, stardand errors, p-values

Below the coefficients table are a few measures of fit:

Coefficients Table

  • Estimates: the fitted coefficients (\(\hat\beta_0\) and \(\hat\beta_1\))
  • Std. Error: the estimated standard deviations of the coefficient estimates; measures uncertainty in \(\hat\beta\)s
  • t value: test statistic for \(H_0: \beta = 0\)

\[t = \dfrac{\text{Estimate}}{\text{Std. Error}}\]

  • Pr(>|t|): p-value for the test \(H_0: \beta = 0\)
    • Small value → evidence the predictor is useful

Asterisks Notation in R Output

R’s regression summary gives a quick way to indicate the statistical significance of a coefficient:

  • *** p < 0.001 highly significant
  • ** p < 0.01 very significant
  • * p < 0.05 signficant significant
  • . p < 0.1 almost significant
  • (blank) not significant not significant

⚠️ These stars are a convenience, not a substitute for careful interpretation.

Residual Standard Error

Residual standard error (RSE)

\[ \text{RSE} = \sqrt{\frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2} \]

  • \(n\) = number of observations,
  • \(p\) = number of estimated parameters (including the intercept),
  • the standard deviation of residuals
  • same units as the response (e.g. sales in thousands of units)

Compared with Mean Squar Error: \[ \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2 \]

  • the average squared residual
  • Units are squared (e.g., sales\(^2\)).

RSE vs MSE

MSE is a more general metric used in ML. RSE is essentially a scaled version of MSE. Both measure model fit quality, but RSE is often easier to explain

Multiple R-squared

R-squared, \(R^2\), or the coefficient of determination represents proportion of the variance in the response variable that is predictable from the independent variable(s).

\[ \begin{align} R^2 &= 1 - \frac{\text{RSS}}{\text{TSS}}\\ &= 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \end{align} \]

  • Hence it ranges from 0 to 1 (0% to 100%)

Example values

Adjusted R-squared

Warning

\(R^2\) can be misleading in multiple regression since \(R^2\) always increases (or stays the same) when more predictors are added.

The Adjusted \(R^2\) is a penalized \(R^2\) value that adjusts for the number of predictors \(p\):

\[ R^2_\text{adj} = 1 - \Bigg(\frac{n-1}{n-p}\Bigg)\frac{\text{RSS}}{\text{TSS}} \]

Multiple Linear Regression

We assume that we can write the following model \[\begin{equation} Y = \beta_0 + \beta_1 X_{1} + \cdots + \beta_p X_{p} + \epsilon \end{equation}\]

  • \(Y\) is the random, quantitative response variable
  • \(X_j\) is the \(j^{th}\) random predictor variable
  • \(\beta_0\) is the intercept
  • \(\beta_j\) from \(j = 1, \dots p\) are the regression coefficients
  • \(\epsilon\) is the error term

MLR

Here we regress sales onto the predictors TV, Radio, and Newspaper.

\[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV} + \beta_1 \texttt{radio} + \beta_2 \texttt{newspaper}\]

# Fit the MLR model
fit2 <- lm(sales ~ TV + radio + newspaper, 
           data = Advertising, subset = train_ind)
fit2

Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising, 
    subset = train_ind)

Coefficients:
(Intercept)           TV        radio    newspaper  
    3.68340      0.04354      0.18608     -0.01352  

Interpreting MLR Coefficients

  • Intercept (\(\hat\beta_0\))
    • Predicted (expected) value of \(Y\) when all predictors and 0
    • ⚠️ Often not meaningful if 0 is outside the data range
  • Slope (\(\hat\beta_1\))
  • Expected change in \(Y\) for a 1-unit increase in \(X_j\),
    holding all other predictors constant
  • Units: expected units of \(Y\) per one unit of \(X_j\)

Some important questions

  1. Is at least one of the predictors \(X_1, X_2, \dots, X_p\)​ useful in predicting the response?

  2. Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?

  3. How well does the model fit the data?

  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Multicollinearity

  • Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.

    • e.g., redundant variables (like height in inches and height in centimeters), correlated predictors (e.g. age and years of experience)
  • This can lead to unstable coefficient estimates, inflated standard errors, reuduction in overall interpretiblity.

Checking for Multicollinearity

  • Check1 for correlated variables as they can cause problems

  • Unstable Coefficient Estimates The variance of all coefficients tend to increase, sometimes dramatically

  • Difficulty in Interpreting Coefficients Coefficients may not reflect the true relationship between each predictor and the outcome, leading to misleading conclusions about the importance or impact of each predictor.

Checking Assumptions

Are the assumptions of the linear regression model met?

Note

These assumptions will usually be checked using Diagnostic Plots

Diagnostic Plots

By calling plot() on an lm object, you automatically get a set of diagnostic plots:

plot(fit1)   # diagnostic plots shown on next slide
  1. Residuals vs Fitted

  2. Normal QQ-plot

  3. Scale-Location

  4. Residual vs Leverage

Code
# plotting options
par(mfrow=c(2,2)) # sets 2x2 layout matrix for plotting
par(mar=c(5,4,1.2,0.1)) # removes some white space around margins

# plot the diagnostic plots
plot(fit1)

Residuals vs Fitted

plot(fit1, which = 1)

  • Purpose: This plot checks for non-linearity, constant variance (homoscedasticity), and the presence of outliers.

  • Ideally, the points should be randomly scattered around the horizontal line (y=0), with no clear pattern.

  • A systematic pattern (such as a curve) suggests a non-linear relationship, and a “funnel” shape indicates heteroscedasticity (non-constant variance).

Good: If the model assumptions are correct, the residuals should fall within an area representing a horizontal band.

Bad (Curvature): Residuals follow a curve which violates linearity.

Bad (Heteroscedasticity): Fan/funnel shape implies non-constant variance.

Bad (Heteroscedasticity): Implies non-constant variance. Source of images: PSU STAT 504: Regression Diagnostics

Normal QQ-plot

plot(fit1, which = 2)

  • Purpose: This plot assesses whether the residuals are normally distributed.

  • Ideally, the points should lie approximately along the reference line.

  • Deviations from this line, particularly in the tails, indicate that the residuals are not normally distributed, which can affect the validity of hypothesis tests and confidence intervals.

Code
set.seed(123)

# Good: Standard normal
normal_data = rnorm(1000, mean=7, sd = 10)
qqnorm(normal_data, main = "Good: Normal Residuals")
qqline(normal_data, col = "red")
# Bad: Heavy tails (t with df=4)
t_data = rt(1000, df = 4)
qqnorm(t_data, main = "Bad: Heavy Tails")
qqline(t_data, col = "red")
# Bad: Skewed (exponential)
skewed_data = rexp(1000, rate = 4)
qqnorm(skewed_data, main = "Bad: Skewed")
qqline(skewed_data, col = "red")

Good: Points fall close to the line, indicating residuals are approximately normal.

Bad (Heavy Tails): t-distribution with small df shows many outliers in tails.

Bad (Skewed): Exponential distribution produces systematic curvature.

Scale-Location

(Spread-Location Plot)

plot(fit1, which = 3)

  • Purpose: This plot checks for homoscedasticity (constant variance of residuals).

  • It plots the square root of standardized residuals against the fitted values.

  • Ideally, the points should be scattered horizontally with no discernible pattern. A trend or pattern (e.g., increasing or decreasing spread) suggests heteroscedasticity.

Residual vs Leverage

plot(fit1, which = 5)

  • Purpose: This plot identifies influential observations that have a disproportionate impact on the model fit, also known as high-leverage points.

  • It helps to spot influential data points that might unduly affect the regression results.

  • Points that are far from others in the x-direction (high leverage) or those with high Cook’s distance (indicated by dashed lines) could be potential concerns.

Exemplary Residual Plot

xx <- runif(500); eps <- rnorm(500); yy = 8 + 6*xx + eps
simfit <- lm(yy~xx); plot(simfit)

scatter plot of Residuals vs. fitted.  Points are distributed evenly about the horizontal line drawn at 0 on the y-axis.  That is, it appears that residuals are evenly distributed for positive and negative values.

Summary of Assumptions Checked

  • Linearity: Assessed through the Residuals vs Fitted plot.
  • Normality of Residuals: Assessed through the Normal Q-Q plot.
  • Homoscedasticity: Assessed through both the Residuals vs Fitted and Scale-Location plots.
  • Independence: Not directly assessed; however, patterns in residuals can indicate problems.

Dealing with Multicollinearity

  • To deal with correlated variables you may consider removing one of the highly correlated predictors (e.g. removing predictors with vif() > 5)

  • Alternatively, you could combine correlated predictors into a single composite variable.

  • Regularization methods add a penalty to the regression model to shrink coefficient estimates and reduce multicollinearity effects (we will look at Ridge Regression and Lasso in future lectures)

Testing for Multicollinearity with VIF

Variance Inflation Factor (VIF): Measures how much the variance of a regression coefficient is inflated due to correlation among predictors.

\[ \text{VIF}(X_j) = \frac{1}{1 - R^2_j} \] where \(R_j^2\) is the \(R^2\) from regressing predictor \(X_j\) on all the other predictors.

  • VIF = 1: No multicollinearity.
  • VIF > 5: Moderate concern.
  • VIF > 10: Serious multicollinearity problem.

R demo

library(car)
fit2 <- lm(sales ~ TV + radio + newspaper, data = Advertising, subset = train_ind)
vif(fit2)
       TV     radio newspaper 
 1.007046  1.136856  1.143884 

Since all VIFs are close to 1, there is no evidence of harmful multicollinearity, hence standard errors of the coefficients are not inflated due to collinearity.

Note

If your goal is prediction (not understanding individual coefficient effects), collinearity might be okay if overall predictive performance is good. Read: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions Statistics by Jim

Dealing with too many predictors

Question: Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?

  • Each \(\beta_j\) has an associated \(t\)-statistic and \(p\)-value in the R output

  • These check whether each predictor is linearly related to the response, after adjusting for all the other predictors considered

  • These are useful, but we have to be careful…especially for large numbers of predictors (multiple testing problem)

Attempt no. 1

  • A natural temptation when considering all these individualized hypothesis tests is to “toss out” any variable(s) that does not appear significant (i.e. \(p\)-value > 0.05)

  • This brings us into the topic of variable selection, which we will go into much more detail later in the course.

  • As our first foray, we’ll look at the most straightforward (and also oldest) techniques for doing so… Stepwise Regression

Attempt no. 2

  • We have a few ways we can compare models (e.g. RSE, adjusted \(R^2\), test MSE, or anova()1) but how do we go about choosing which models to compare?

  • Ideally we would try every combination of \(X_j\), but that becomes computationally impractical very quickly2

  • Instead, we may decided on a systematic approach to perform model selection: …

Stepwise Regression

  • Forward Selection: Start with only the intercept (no predictors, ie null model). Create \(p\) simple linear regressions, whichever predictor has the lowest RSS, add it into the model. Continue until, for instance, we see a decrease in our ‘best model’ measurement.

  • Backward Selection: Start with all predictors1. Remove the predictor with the largest p-value. Continue until, for instance, all variables are considered significant.

  • Mixed Selection: Start with no predictors. Perform forward selection, but if any \(p\)-values for variables currently in the model reach above a threshold, remove that variable. Continue until, for instance, all variables in your model are significant and any additional variable would not be.

Assessing Model Fit

How well does the model fit the data?

  • While the summary() function does not spit out the MSE by default, we can still calculate this value and compare the resulting MSE with the MSE of other fitted models using the same data.

  • That is to say the MSE isn’t so useful on it’s own but more-so as a comparative value across multiple models.

Making Predictions

Once a model is fit, we can use it to predict responses for “new” data:

test = Advertising[test_ind,] 
yhat_test <- predict(fit1, newdata = test)

To evaluate predictive accuracy, compare predictions to observed values in the test set:

mse_test <- mean((test$sales - yhat_test)^2)
mse_test
[1] 12.28183

Looking Forward

  • In this weeks lab, you cover how to fit SLR and MLR in R using a training set, and assess them on a test set.
  • We will look at some extension to the linear regression model that will make them more flexible.
  • Look at an non-parametric alternatives for regression problems (more flexible).
  • For more practice, go through ISLR 3.6 Lab: Linear Regression