DATA 311 – Lecture 5: Linear Regression

Introduction

Today we introduce our first model for the supervised regression problem: linear regression.
Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
The objective of this model is to find the best-fitting line (or hyperplane) that represents the relationship between the dependent and independent variables.

Learning objectives

Refresh the basics of simple linear regression (SLR)
Introduce multiple linear regression (MLR)
Interpret regression coefficients in context
Explore extensions: polynomial terms, interactions, categorical predictors
Diagnose and address common regression issues

iClicker

iClicker

How familiar are you with regression analysis?

A. Very familiar; I’ve worked with multiple regression models, interactions, and diagnostics

B. Familiar; I’ve worked with multiple regression models

C. Somewhat familiar; I’ve seen simple linear regression before

D. Only heard of it and haven’t had much hands on experience with

E. Not familiar at all

Fitting linear models in R
Diagnostic Plots
Measuring Fit
- $R^2$ , adjusted $R^2$, RSE

Any slides in gray you don’t need to pay too much attention to.

Why Linear Regression

Regression is one of the simplest machine learning algorithms to understand/implement.

It provides a solid foundation for learning more complex models and concepts, e.g. model evaluation (MSE), variable selection, overfitting and regularization, $\dots$

Linear regression serves as a versatile and widely used benchmark model for many machine learning tasks.

Advertising Dataset

The Advertising data set (csv) consists of the sales (in thousands) of a certain product and the advertising budgets (in thousands of dollars) for the product in three media:
- TV, radio, and newspaper
Our goal is to develop an accurate model that can be used to predict sales on the basis of the media budgets.

library(dplyr)
Advertising <- read.csv("https://www.statlearning.com/s/Advertising.csv")
Advertising <- rename(Advertising, market = X) 
Advertising

If we establish a link between advertising and sales, we can advise our client to modify their advertising budgets, consequently leading to an indirect increase in sales.

suppose we are consultants hired to determine if there is an association between advertisting and sales for a product.
we want to instruct the client on how to adjust advertising budgets (with the goal of indirectly increasing sales)
IOW we want to develop an accurate model that can be used to predcit sales on the basis of the three media budgets.
Input (predictions, ind. var, features, “variables”) varials: $X_1$ = TV, $X_2$ = Radio, $X_3$ = Newspaper. Output: sales (response, dependent variable)

The Advertising dataset. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot, we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.

Advertising Plots

Below plots the Sales vs TV, Radio and Newspaper, with a blue simple linear-regression line fit separately to each.

ISLR Fig 2.1

Simple linear regression

\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

$Y$ is the quantitative response (aka dependent) variable
$X$ is the random predictor (aka independent) variable
$\beta_0$ is the intercept (unknown)
$\beta_1$ is the slope (unknown)
$\epsilon$ is the error
$i$ is the index for each observation

Assumptions

Linearity: The relationship between the independent and dependent variables should be (approximately) linear.

Independence of Errors: The errors (residuals) in the model should be independent of each other.

Homoscedasticity: the variance of the residuals is constant across all levels of the predictors.

Normality: The error terms are normally distributed

Notation

For example, $X_i$ may represent the amount spent in TV advertising in market $i$, and $Y_i$ may represent sales of that product in market $i$.

Mathematically, we can write: \[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV}\]

Here we “regress sales onto TV”, or we are regressing $Y$ on $X$ (or $Y$ onto $X$).

Scatterplot

plot(sales~TV, data = Advertising)

Scatterplot with Sqrt Transformation

plot(sales~sqrt(TV), data = Advertising)

`lm()` function

fitlm <- lm(formula, data, subset)

Type of model	`formula`
Single predictor (SLR)	`response ~ predictor`
Multiple predictors (MLR)	`response ~ predictor1 + predictor2 + ... + predictorp`
All variables	`response ~ .`
Excluding a specific predictor	`response ~ . - predictor`
Interaction terms	`response ~ predictor1 * predictor2`
Polynomial terms	`response ~ poly(predictor, degree)`
Categorical Variables	`response ~ factor(predictor)`
Transformed Variables	`response ~ I(log(predictor))`

Data Splitting

Before we fit our model, we will first split our data into training and testing; here we’ll use a 50/50 split.

# number of observations
n <- nrow(Advertising)

# set seed for reproducibility
set.seed(123)

# randomly select 50% to be in our training
train_ind <- sample(1:n, n/2)

# the other 50%
test_ind <- setdiff(1:n, train_ind)

SLR in R

Typically we will save our fitted model to an object in R¹

# Fit model
fit1 <- lm(sales ~ TV, data = Advertising, subset = train_ind)
fit1


Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)

Coefficients:
(Intercept)           TV  
    7.44821      0.04381

Our estimate for $\beta_0$ is $\hat \beta_0$ = 7.44821
Our estimate for $\beta_1$ is $\hat \beta_1$ = 0.04381

Interpretation

Intercept ($\hat\beta_0$)
- Predicted (expected) value of $y$ when $x = 0$
- Interpretable only if $X=0$ is meaningful in the data context.

Slope ($\hat\beta_1$)
- The expected change in $Y$ for a one-unit increase in $X$.

iClicker

iClicker

Which is the correct interpretation of the intercept $\hat\beta_0$?

A. If sales are 0, then the expected TV advertising spending is 7.45 thousand dollars.

B. If TV advertising spending is 0, then the expected sales of the product is 7.448 units.

C. If TV advertising spending is 0, then the expected sales of the product is 7448 units.

D. The intercept is not interpretable in this scenario.

iClicker

iClicker

Which is the best interpretation of the slope $\hat\beta_1$?

A. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 0.04381 units.

B. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 44 units.

C. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase by 43.81 units.

D. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase increases by $43.81.

Plot of Line

Extrapolation

Regression models are reliable within the range of observed data
Predictions outside this range (extrapolation) can be very misleading
The model assumes the same linear trend continues indefinitely
In practice: relationships may change beyond observed values

Warning

The $y$-intercept is often a large extrapolation (and therefore “meaningless”), so be careful with the interpretation.

Predicting house price ($y$) from square footage ($x$).

\[ \text{Price} = \beta_0 + \beta_1 \cdot \text{Size (sq ft)} + \varepsilon \]

If you fit this model on houses ranging from 800–3000 sq ft, the intercept $\beta_0$ corresponds to the predicted price at 0 sq ft.

But a “0 sq ft house” doesn’t exist → so the intercept is mathematically defined but not interpretable in practice.

Extrapolation is the process of making predictions or estimating values outside the range of the observed data.

Extrapolating to 110 assumes that the same linear relationship holds, which might not be true.

Plugging in blindly could lead to Unreliable Predictions and should be avoided.

that said, we’ll still use $\beta_0$ (which usually is an extrapolation) and not usually intrept it, but rather consider a parameter to make our model identifiable.

Ordinary Least Squares

The $\hat \beta$s are most commonly estimated by ordinary least squares (OLS)
Using some basic calculs, this statistical technique involves minimizing the residual sum of squares:

\[ \text{RSS}=\sum_{i=1}^n (\underbrace{y_i-\hat y_i}_{\text{residual}})^{2} \]

Advertising SLR

ISLR Fig 3.1: The least squares fit for the regression of sales onto TV. In this case, a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot

3D Visualization

ISLR Fig 3.4: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.

Fitted model

The “fitted model” is given by:

\[ \hat f(X) = \hat \beta_0 + \hat \beta_1 X_1 \]

For our SLR model:

\[ \texttt{sales} = 7.45 + 0.04\times \texttt{TV} \]

Important

❌ Don’t hard-code results into reports ✅ Do pull values from fitted object

e.g., slope can be extracted using:

fit1$coefficients["TV"] # or [2]

        TV 
0.04380845

e.g., intercept extracted using:

fit1$coefficients[1] # or ["(Intercept)"]

(Intercept) 
   7.448213

Summary of fit

# For a more detailed output use summary()
summary(fit1) # scroll to the bottom!


Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7700 -1.9111  0.1785  1.8400  6.3916 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.448213   0.596828   12.48   <2e-16 ***
TV          0.043808   0.003494   12.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.028 on 98 degrees of freedom
Multiple R-squared:  0.6161,    Adjusted R-squared:  0.6121 
F-statistic: 157.3 on 1 and 98 DF,  p-value: < 2.2e-16

lm() output

Call: shows the model you fit
Residuals: quick check of symmetry/spread
Coefficients Table : estimates, stardand errors, p-values

Below the coefficients table are a few measures of fit:

Residual Standard Error
Multiple R-squared
Adjusted R-squared

Coefficients Table

Estimates: the fitted coefficients ($\hat\beta_0$ and $\hat\beta_1$)
Std. Error: the estimated standard deviations of the coefficient estimates; measures uncertainty in $\hat\beta$s
t value: test statistic for $H_0: \beta = 0$

\[t = \dfrac{\text{Estimate}}{\text{Std. Error}}\]

Pr(>|t|): p-value for the test $H_0: \beta = 0$
- Small value → evidence the predictor is useful

Asterisks Notation in R Output

R’s regression summary gives a quick way to indicate the statistical significance of a coefficient:

*** p < 0.001 highly significant
** p < 0.01 very significant
* p < 0.05 signficant significant
. p < 0.1 almost significant
(blank) not significant not significant

⚠️ These stars are a convenience, not a substitute for careful interpretation.

Residual Standard Error

Residual standard error (RSE)

\[ \text{RSE} = \sqrt{\frac{1}{n-p}\sum_{i=1}^n (y_i - \hat y_i)^2} \]

$n$ = number of observations,
$p$ = number of estimated parameters (including the intercept),
the standard deviation of residuals
same units as the response (e.g. sales in thousands of units)

Compared with Mean Squar Error: \[ \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2 \]

the average squared residual
Units are squared (e.g., sales$^2$).

RSE vs MSE

MSE is a more general metric used in ML. RSE is essentially a scaled version of MSE. Both measure model fit quality, but RSE is often easier to explain

Multiple R-squared

R-squared, $R^2$, or the coefficient of determination represents proportion of the variance in the response variable that is predictable from the independent variable(s).

\[ \begin{align} R^2 &= 1 - \frac{\text{RSS}}{\text{TSS}}\\ &= 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \end{align} \]

Hence it ranges from 0 to 1 (0% to 100%)

Example values

Adjusted R-squared

Warning

$R^2$ can be misleading in multiple regression since $R^2$ always increases (or stays the same) when more predictors are added.

The Adjusted $R^2$ is a penalized $R^2$ value that adjusts for the number of predictors $p$:

\[ R^2_\text{adj} = 1 - \Bigg(\frac{n-1}{n-p}\Bigg)\frac{\text{RSS}}{\text{TSS}} \]

Multiple Linear Regression

We assume that we can write the following model \[\begin{equation} Y = \beta_0 + \beta_1 X_{1} + \cdots + \beta_p X_{p} + \epsilon \end{equation}\]

$Y$ is the random, quantitative response variable
$X_j$ is the $j^{th}$ random predictor variable
$\beta_0$ is the intercept
$\beta_j$ from $j = 1, \dots p$ are the regression coefficients
$\epsilon$ is the error term

MLR

Here we regress sales onto the predictors TV, Radio, and Newspaper.

\[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV} + \beta_1 \texttt{radio} + \beta_2 \texttt{newspaper}\]

# Fit the MLR model
fit2 <- lm(sales ~ TV + radio + newspaper, 
           data = Advertising, subset = train_ind)
fit2


Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising, 
    subset = train_ind)

Coefficients:
(Intercept)           TV        radio    newspaper  
    3.68340      0.04354      0.18608     -0.01352

Interpreting MLR Coefficients

Intercept ($\hat\beta_0$)
- Predicted (expected) value of $Y$ when all predictors and 0
- ⚠️ Often not meaningful if 0 is outside the data range

Slope ($\hat\beta_1$)
Expected change in $Y$ for a 1-unit increase in $X_j$,
holding all other predictors constant
Units: expected units of $Y$ per one unit of $X_j$

Some important questions

Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting the response?
Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Multicollinearity

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.
- e.g., redundant variables (like height in inches and height in centimeters), correlated predictors (e.g. age and years of experience)
This can lead to unstable coefficient estimates, inflated standard errors, reuduction in overall interpretiblity.

Checking for Multicollinearity

Check¹ for correlated variables as they can cause problems
Unstable Coefficient Estimates The variance of all coefficients tend to increase, sometimes dramatically
Difficulty in Interpreting Coefficients Coefficients may not reflect the true relationship between each predictor and the outcome, leading to misleading conclusions about the importance or impact of each predictor.

Checking Assumptions

Are the assumptions of the linear regression model met?

Linearity
Normality of residuals
Homoscedasticity
Independence

Note

These assumptions will usually be checked using Diagnostic Plots

Diagnostic Plots

By calling plot() on an lm object, you automatically get a set of diagnostic plots:

plot(fit1)   # diagnostic plots shown on next slide

Residuals vs Fitted
Normal QQ-plot
Scale-Location
Residual vs Leverage

Code

# plotting options
par(mfrow=c(2,2)) # sets 2x2 layout matrix for plotting
par(mar=c(5,4,1.2,0.1)) # removes some white space around margins

# plot the diagnostic plots
plot(fit1)

Residuals vs Fitted

plot(fit1, which = 1)

Purpose: This plot checks for non-linearity, constant variance (homoscedasticity), and the presence of outliers.
Ideally, the points should be randomly scattered around the horizontal line (y=0), with no clear pattern.
A systematic pattern (such as a curve) suggests a non-linear relationship, and a “funnel” shape indicates heteroscedasticity (non-constant variance).

**Good**: If the model assumptions are correct, the residuals should fall within an area representing a horizontal band.

**Bad (Curvature)**: Residuals follow a curve which violates linearity.

**Bad (Heteroscedasticity)**: Fan/funnel shape implies non-constant variance.

**Bad (Heteroscedasticity)**: Implies non-constant variance. *Source of images: PSU STAT 504: Regression Diagnostics*

Normal QQ-plot

plot(fit1, which = 2)

Purpose: This plot assesses whether the residuals are normally distributed.
Ideally, the points should lie approximately along the reference line.
Deviations from this line, particularly in the tails, indicate that the residuals are not normally distributed, which can affect the validity of hypothesis tests and confidence intervals.

Code

set.seed(123)

# Good: Standard normal
normal_data = rnorm(1000, mean=7, sd = 10)
qqnorm(normal_data, main = "Good: Normal Residuals")
qqline(normal_data, col = "red")
# Bad: Heavy tails (t with df=4)
t_data = rt(1000, df = 4)
qqnorm(t_data, main = "Bad: Heavy Tails")
qqline(t_data, col = "red")
# Bad: Skewed (exponential)
skewed_data = rexp(1000, rate = 4)
qqnorm(skewed_data, main = "Bad: Skewed")
qqline(skewed_data, col = "red")

**Good**: Points fall close to the line, indicating residuals are approximately normal.

**Bad (Heavy Tails)**: t-distribution with small df shows many outliers in tails.

**Bad (Skewed)**: Exponential distribution produces systematic curvature.

Scale-Location

(Spread-Location Plot)

plot(fit1, which = 3)

Purpose: This plot checks for homoscedasticity (constant variance of residuals).
It plots the square root of standardized residuals against the fitted values.
Ideally, the points should be scattered horizontally with no discernible pattern. A trend or pattern (e.g., increasing or decreasing spread) suggests heteroscedasticity.

Residual vs Leverage

plot(fit1, which = 5)

Purpose: This plot identifies influential observations that have a disproportionate impact on the model fit, also known as high-leverage points.
It helps to spot influential data points that might unduly affect the regression results.
Points that are far from others in the x-direction (high leverage) or those with high Cook’s distance (indicated by dashed lines) could be potential concerns.

Exemplary Residual Plot

xx <- runif(500); eps <- rnorm(500); yy = 8 + 6*xx + eps
simfit <- lm(yy~xx); plot(simfit)

scatter plot of Residuals vs. fitted. Points are distributed evenly about the horizontal line drawn at 0 on the y-axis. That is, it appears that residuals are evenly distributed for positive and negative values.

Summary of Assumptions Checked

Linearity: Assessed through the Residuals vs Fitted plot.
Normality of Residuals: Assessed through the Normal Q-Q plot.
Homoscedasticity: Assessed through both the Residuals vs Fitted and Scale-Location plots.
Independence: Not directly assessed; however, patterns in residuals can indicate problems.

Dealing with Multicollinearity

To deal with correlated variables you may consider removing one of the highly correlated predictors (e.g. removing predictors with vif() > 5)
Alternatively, you could combine correlated predictors into a single composite variable.
Regularization methods add a penalty to the regression model to shrink coefficient estimates and reduce multicollinearity effects (we will look at Ridge Regression and Lasso in future lectures)

Testing for Multicollinearity with VIF

Variance Inflation Factor (VIF): Measures how much the variance of a regression coefficient is inflated due to correlation among predictors.

\[ \text{VIF}(X_j) = \frac{1}{1 - R^2_j} \] where $R_j^2$ is the $R^2$ from regressing predictor $X_j$ on all the other predictors.

VIF = 1: No multicollinearity.
VIF > 5: Moderate concern.
VIF > 10: Serious multicollinearity problem.

High VIF inflates standard errors → coefficients may appear “insignificant” even if they’re truly important.

- e.g. unexpected sign flips, and inflated standard errors.

R demo

library(car)
fit2 <- lm(sales ~ TV + radio + newspaper, data = Advertising, subset = train_ind)
vif(fit2)

       TV     radio newspaper 
 1.007046  1.136856  1.143884

Since all VIFs are close to 1, there is no evidence of harmful multicollinearity, hence standard errors of the coefficients are not inflated due to collinearity.

Note

If your goal is prediction (not understanding individual coefficient effects), collinearity might be okay if overall predictive performance is good. Read: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions Statistics by Jim

Dealing with too many predictors

Question: Do all the predictors help to explain $Y$, or is only a subset of the predictors useful?

Each $\beta_j$ has an associated $t$-statistic and $p$-value in the R output
These check whether each predictor is linearly related to the response, after adjusting for all the other predictors considered
These are useful, but we have to be careful…especially for large numbers of predictors (multiple testing problem)

Attempt no. 1

A natural temptation when considering all these individualized hypothesis tests is to “toss out” any variable(s) that does not appear significant (i.e. $p$-value > 0.05)
This brings us into the topic of variable selection, which we will go into much more detail later in the course.
As our first foray, we’ll look at the most straightforward (and also oldest) techniques for doing so… Stepwise Regression

Attempt no. 2

We have a few ways we can compare models (e.g. RSE, adjusted $R^2$, test MSE, or anova()¹) but how do we go about choosing which models to compare?
Ideally we would try every combination of $X_j$, but that becomes computationally impractical very quickly²

Instead, we may decided on a systematic approach to perform model selection: …

Stepwise Regression

Forward Selection: Start with only the intercept (no predictors, ie null model). Create $p$ simple linear regressions, whichever predictor has the lowest RSS, add it into the model. Continue until, for instance, we see a decrease in our ‘best model’ measurement.
Backward Selection: Start with all predictors¹. Remove the predictor with the largest p-value. Continue until, for instance, all variables are considered significant.
Mixed Selection: Start with no predictors. Perform forward selection, but if any $p$-values for variables currently in the model reach above a threshold, remove that variable. Continue until, for instance, all variables in your model are significant and any additional variable would not be.

Assessing Model Fit

How well does the model fit the data?

While the summary() function does not spit out the MSE by default, we can still calculate this value and compare the resulting MSE with the MSE of other fitted models using the same data.
That is to say the MSE isn’t so useful on it’s own but more-so as a comparative value across multiple models.

Making Predictions

Once a model is fit, we can use it to predict responses for “new” data:

test = Advertising[test_ind,] 
yhat_test <- predict(fit1, newdata = test)

To evaluate predictive accuracy, compare predictions to observed values in the test set:

mse_test <- mean((test$sales - yhat_test)^2)
mse_test

[1] 12.28183

Looking Forward

In this weeks lab, you cover how to fit SLR and MLR in R using a training set, and assess them on a test set.
We will look at some extension to the linear regression model that will make them more flexible.
Look at an non-parametric alternatives for regression problems (more flexible).
For more practice, go through ISLR 3.6 Lab: Linear Regression

Lecture 5: Linear Regression

Introduction

Learning objectives

iClicker

Outline

Why Linear Regression

Advertising Dataset

Advertising Plots

Simple linear regression

Assumptions

Notation

Scatterplot

Scatterplot with Sqrt Transformation

lm() function

Data Splitting

SLR in R

Interpretation

iClicker

iClicker

Plot of Line

Extrapolation

Ordinary Least Squares

Advertising SLR

3D Visualization

Fitted model

Summary of fit

lm() output

Coefficients Table

Residual Standard Error

Multiple R-squared

Example values

Adjusted R-squared

Multiple Linear Regression

MLR

Interpreting MLR Coefficients

Some important questions

Multicollinearity

Checking for Multicollinearity

Checking Assumptions

Diagnostic Plots

Residuals vs Fitted

Normal QQ-plot

Scale-Location

Residual vs Leverage

Exemplary Residual Plot

Summary of Assumptions Checked

Dealing with Multicollinearity

Testing for Multicollinearity with VIF

R demo

Dealing with too many predictors

Attempt no. 1

Attempt no. 2

Stepwise Regression

Assessing Model Fit

Making Predictions

Looking Forward

`lm()` function