DATA 311: Machine Learning
University of British Columbia Okanagan
Today we introduce our first model for the supervised regression problem: linear regression.
Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
The objective of this model is to find the best-fitting line (or hyperplane) that represents the relationship between the dependent and independent variables.
iClicker
How familiar are you with regression analysis?
A. Very familiar; I’ve worked with multiple regression models, interactions, and diagnostics
B. Familiar; I’ve worked with multiple regression models
C. Somewhat familiar; I’ve seen simple linear regression before
D. I've only heard of it and haven't had much hands-on experience with it
E. Not familiar at all
Measuring Fit
Any slides in gray you don’t need to pay too much attention to.
The Advertising data set (csv) consists of the sales (in thousands of units) of a certain product and the advertising budgets (in thousands of dollars) for the product in three media:
TV, radio, and newspaper
Our goal is to develop an accurate model that can be used to predict sales on the basis of the media budgets.
The plots below show sales against the TV, radio, and newspaper budgets, with a blue simple linear regression line fit separately to each.
ISLR Fig 2.1
\[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
For example, \(X_i\) may represent the amount spent in TV advertising in market \(i\), and \(Y_i\) may represent sales of that product in market \(i\).
Mathematically, we can write: \[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV}\]
Here we “regress sales onto TV”, or we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\)).
lm() function

| Type of model | formula |
|---|---|
| Single predictor (SLR) | response ~ predictor |
| Multiple predictors (MLR) | response ~ predictor1 + predictor2 + ... + predictorp |
| All variables | response ~ . |
| Excluding a specific predictor | response ~ . - predictor |
| Interaction terms | response ~ predictor1 * predictor2 |
| Polynomial terms | response ~ poly(predictor, degree) |
| Categorical Variables | response ~ factor(predictor) |
| Transformed Variables | response ~ I(log(predictor)) |
Before we fit our model, we will first split our data into training and testing; here we’ll use a 50/50 split.
Typically we will save our fitted model to an object in R.
Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)
Coefficients:
(Intercept) TV
7.44821 0.04381
Our estimate for \(\beta_0\) is \(\hat \beta_0\) = 7.44821
Our estimate for \(\beta_1\) is \(\hat \beta_1\) = 0.04381
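The split-and-fit steps above can be sketched roughly as follows. This is a minimal sketch on simulated data standing in for the Advertising data set (the simulated values, the seed, and the sample size of 200 are assumptions for illustration), so the estimates will not match the output above.

```r
# Simulated stand-in for the Advertising data (NOT the real data set);
# the column names `TV` and `sales` mirror the slides.
set.seed(311)
n <- 200
Advertising <- data.frame(TV = runif(n, min = 0, max = 300))
Advertising$sales <- 7 + 0.05 * Advertising$TV + rnorm(n, sd = 3)

# 50/50 split: sample half of the row indices for training
train_ind <- sample(seq_len(n), size = n / 2)

# Fit on the training rows only and save the fitted model to an object
fit <- lm(sales ~ TV, data = Advertising, subset = train_ind)

coef(fit)  # estimates of beta_0 (Intercept) and beta_1 (TV)
```

Saving the model to an object (here `fit`) lets us reuse it later with functions such as summary(), plot(), and predict().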
iClicker
Which is the correct interpretation of the intercept \(\hat\beta_0\)?
A. If sales are 0, then the expected TV advertising spending is 7.45 thousand dollars.
B. If TV advertising spending is 0, then the expected sales of the product is 7.448 units.
C. If TV advertising spending is 0, then the expected sales of the product is 7448 units.
D. The intercept is not interpretable in this scenario.
iClicker
Which is the best interpretation of the slope \(\hat\beta_1\)?
A. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 0.04381 units.
B. For each additional dollar spent on TV advertising, the sales of this product is expected to increase by an average of 44 units.
C. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase by 43.81 units.
D. For each additional $1,000 spent on TV advertising, the sales of this product is expected to increase by $43.81.
Warning
The \(y\)-intercept is often a large extrapolation (and therefore “meaningless”), so be careful with the interpretation.
The \(\hat \beta\)s are most commonly estimated by ordinary least squares (OLS).
Using some basic calculus, this statistical technique involves minimizing the residual sum of squares:
\[ \text{RSS}=\sum_{i=1}^n (\underbrace{y_i-\hat y_i}_{\text{residual}})^{2} \]
ISLR Fig 3.1: The least squares fit for the regression of sales onto TV. In this case, a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot
ISLR Fig 3.4: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.
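For simple linear regression, setting the derivatives of the RSS with respect to \(\beta_0\) and \(\beta_1\) to zero gives the closed-form estimates \(\hat\beta_1 = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2\) and \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\). The sketch below checks these against lm() on simulated data (the data and variable names are made up for illustration).

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(50, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(50, sd = 3)

# Closed-form least squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Residual sum of squares at the least squares solution
rss <- sum((y - (b0 + b1 * x))^2)

# lm() should agree with the hand calculation
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)
```

Any other choice of intercept and slope gives a larger RSS on the same data.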
The “fitted model” is given by:
\[ \hat f(X) = \hat \beta_0 + \hat \beta_1 X_1 \]
For our SLR model:
\[ \texttt{sales} = 7.45 + 0.04\times \texttt{TV} \]
Call:
lm(formula = sales ~ TV, data = Advertising, subset = train_ind)
Residuals:
Min 1Q Median 3Q Max
-7.7700 -1.9111 0.1785 1.8400 6.3916
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.448213 0.596828 12.48 <2e-16 ***
TV 0.043808 0.003494 12.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.028 on 98 degrees of freedom
Multiple R-squared: 0.6161, Adjusted R-squared: 0.6121
F-statistic: 157.3 on 1 and 98 DF, p-value: < 2.2e-16
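Call: shows the model you fit.
Residuals: a quick check of symmetry and spread.
Below the coefficients table are a few measures of fit (residual standard error, \(R^2\), adjusted \(R^2\), and the F-statistic).
In the coefficients table, each t value is the estimate divided by its standard error:
\[t = \dfrac{\text{Estimate}}{\text{Std. Error}}\]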
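As a quick check, the t values in the coefficients table can be reproduced from the printed estimates and standard errors. The numbers below are copied from the summary() output above; the two-sided p-value calculation with pt() is a sketch of the standard approach.

```r
# Values copied from the summary() output above
est <- c(Intercept = 7.448213, TV = 0.043808)
se  <- c(Intercept = 0.596828, TV = 0.003494)
df  <- 98  # residual degrees of freedom from the output

# t value = Estimate / Std. Error
t_val <- est / se
round(t_val, 2)  # 12.48 and 12.54, matching the table

# Two-sided p-values from the t distribution with 98 degrees of freedom
p_val <- 2 * pt(-abs(t_val), df = df)
p_val < 2e-16    # both TRUE, consistent with "<2e-16" in the table
```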
Asterisks Notation in R Output
R’s regression summary gives a quick way to indicate the statistical significance of a coefficient:
*** p < 0.001 highly significant
** p < 0.01 very significant
* p < 0.05 significant
. p < 0.1 almost significant
⚠️ These stars are a convenience, not a substitute for careful interpretation.
Residual standard error (RSE)
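\[ \text{RSE} = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^n (y_i - \hat y_i)^2} \]
where \(p\) is the number of predictors, so \(n-p-1\) is the residual degrees of freedom (98 in the output above).
The RSE is in the same units as the response (here, sales in thousands of units).
Compared with the Mean Squared Error: \[ \text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2 \]
The MSE is in squared units of the response (sales\(^2\)).
RSE vs MSE
MSE is a more general metric used in ML. RSE is essentially the square root of the MSE with a degrees-of-freedom adjustment. Both measure model fit quality, but RSE is often easier to explain because it is on the scale of the response.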
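The relationship between the two measures can be checked by hand. This is a minimal sketch on simulated data (not the Advertising data); it uses the residual degrees of freedom stored in the fitted model as the divisor for the RSE.

```r
# Simulated data, for illustration only
set.seed(2)
n <- 100
x <- runif(n, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)
mse <- rss / n                       # average squared residual
rse <- sqrt(rss / df.residual(fit))  # divides by the residual df instead of n

c(MSE = mse, RMSE = sqrt(mse), RSE = rse)
sigma(fit)  # the "Residual standard error" reported by summary()
```

The RSE is always slightly larger than the square root of the training MSE because it divides by a smaller number.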
R-squared, \(R^2\), or the coefficient of determination, represents the proportion of the variance in the response variable that is predictable from the independent variable(s).
\[ \begin{align} R^2 &= 1 - \frac{\text{RSS}}{\text{TSS}}\\ &= 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \end{align} \]
Warning
\(R^2\) can be misleading in multiple regression since \(R^2\) always increases (or stays the same) when more predictors are added.
The Adjusted \(R^2\) is a penalized \(R^2\) value that adjusts for the number of predictors \(p\):
\[ R^2_\text{adj} = 1 - \Bigg(\frac{n-1}{n-p-1}\Bigg)\frac{\text{RSS}}{\text{TSS}} \]
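To see the warning in action, the sketch below computes \(R^2\) and adjusted \(R^2\) by hand before and after adding a pure-noise predictor. The data are simulated, and the helper functions `r2()` and `adj_r2()` are hypothetical names written for this example; the adjustment uses the residual degrees of freedom of the fitted model.

```r
# Simulated data: x2 has no relationship with y
set.seed(3)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)

# R^2 = 1 - RSS / TSS
r2 <- function(fit) {
  y_obs <- fitted(fit) + resid(fit)
  1 - sum(resid(fit)^2) / sum((y_obs - mean(y_obs))^2)
}

# Adjusted R^2 rescales RSS / TSS by (n - 1) / (residual df)
adj_r2 <- function(fit) {
  1 - (nobs(fit) - 1) / df.residual(fit) * (1 - r2(fit))
}

c(r2(fit1), r2(fit2))          # R^2 cannot decrease when x2 is added
c(adj_r2(fit1), adj_r2(fit2))  # adjusted R^2 can decrease
```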
We assume that we can write the following model \[\begin{equation} Y = \beta_0 + \beta_1 X_{1} + \cdots + \beta_p X_{p} + \epsilon \end{equation}\]
Here we regress sales onto the predictors TV, Radio, and Newspaper.
\[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV} + \beta_2 \texttt{radio} + \beta_3 \texttt{newspaper}\]
Is at least one of the predictors \(X_1, X_2, \dots, X_p\) useful in predicting the response?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.
This can lead to unstable coefficient estimates, inflated standard errors, and a reduction in overall interpretability.
Check for correlated variables, as they can cause problems:
Unstable Coefficient Estimates: The variances of the coefficient estimates tend to increase, sometimes dramatically.
Difficulty in Interpreting Coefficients: Coefficients may not reflect the true relationship between each predictor and the outcome, leading to misleading conclusions about the importance or impact of each predictor.
Are the assumptions of the linear regression model met?

Note
These assumptions will usually be checked using Diagnostic Plots
By calling plot() on an lm object, you automatically get a set of diagnostic plots:
Purpose: The Residuals vs Fitted plot checks for non-linearity, non-constant variance (heteroscedasticity), and the presence of outliers.
Ideally, the points should be randomly scattered around the horizontal line (y=0), with no clear pattern.
A systematic pattern (such as a curve) suggests a non-linear relationship, and a “funnel” shape indicates heteroscedasticity (non-constant variance).




Purpose: The Normal Q-Q plot assesses whether the residuals are normally distributed.
Ideally, the points should lie approximately along the reference line.
Deviations from this line, particularly in the tails, indicate that the residuals are not normally distributed, which can affect the validity of hypothesis tests and confidence intervals.
set.seed(123)
# Good: normally distributed data (any mean and sd give a straight Q-Q line)
normal_data <- rnorm(1000, mean = 7, sd = 10)
qqnorm(normal_data, main = "Good: Normal Residuals")
qqline(normal_data, col = "red")
# Bad: Heavy tails (t with df = 4)
t_data <- rt(1000, df = 4)
qqnorm(t_data, main = "Bad: Heavy Tails")
qqline(t_data, col = "red")
# Bad: Skewed (exponential)
skewed_data <- rexp(1000, rate = 4)
qqnorm(skewed_data, main = "Bad: Skewed")
qqline(skewed_data, col = "red")


Purpose: The Scale-Location plot checks for homoscedasticity (constant variance of residuals).
It plots the square root of the absolute standardized residuals against the fitted values.
Ideally, the points should be scattered horizontally with no discernible pattern. A trend or pattern (e.g., increasing or decreasing spread) suggests heteroscedasticity.
Purpose: The Residuals vs Leverage plot helps spot influential observations, i.e. data points that have a disproportionate impact on the model fit and might unduly affect the regression results.
Points that are far from others in the x-direction (high leverage) or those with high Cook’s distance (indicated by dashed lines) could be potential concerns.
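The four diagnostic plots described above can be produced in one go. This is a minimal sketch on simulated data (not the Advertising data); by default plot() on an lm object draws Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

```r
# Simulated data, for illustration only
set.seed(4)
x <- runif(100, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(fit)                   # the default set of diagnostic plots
par(op)                     # restore the previous plotting layout

# The quantities behind the last plot can also be inspected directly
lev   <- hatvalues(fit)       # leverage of each observation
cooks <- cooks.distance(fit)  # Cook's distance of each observation
head(sort(cooks, decreasing = TRUE), 3)
```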
To deal with correlated variables, you may consider removing one of the highly correlated predictors (e.g. removing predictors with vif() > 5).
Alternatively, you could combine correlated predictors into a single composite variable.
Regularization methods add a penalty to the regression model to shrink coefficient estimates and reduce multicollinearity effects (we will look at Ridge Regression and Lasso in future lectures).
Variance Inflation Factor (VIF): Measures how much the variance of a regression coefficient is inflated due to correlation among predictors.
\[ \text{VIF}(X_j) = \frac{1}{1 - R^2_j} \] where \(R_j^2\) is the \(R^2\) from regressing predictor \(X_j\) on all the other predictors.
library(car)
fit2 <- lm(sales ~ TV + radio + newspaper, data = Advertising, subset = train_ind)
vif(fit2)
       TV     radio newspaper
 1.007046  1.136856  1.143884
Since all VIFs are close to 1, there is no evidence of harmful multicollinearity, hence standard errors of the coefficients are not inflated due to collinearity.
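To connect the formula to the numbers, the VIF can be computed by hand in base R by regressing each predictor on the others. This is a sketch on simulated predictors (the names `x1`, `x2`, `x3` and the helper object `vif_by_hand` are made up for illustration), where `x2` is deliberately built to be highly correlated with `x1`.

```r
# Simulated predictors, for illustration only
set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                       # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF(X_j) = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j
# on all the other predictors
vif_by_hand <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2_j <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2_j)
})
round(vif_by_hand, 2)
```

Here `x1` and `x2` should show large VIFs while `x3` stays close to 1, in contrast to the Advertising predictors above.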
Note
If your goal is prediction (not understanding individual coefficient effects), collinearity might be okay if overall predictive performance is good. Read: “Multicollinearity in Regression Analysis: Problems, Detection, and Solutions” (Statistics by Jim).
Question: Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
Each \(\beta_j\) has an associated \(t\)-statistic and \(p\)-value in the R output
These check whether each predictor is linearly related to the response, after adjusting for all the other predictors considered
These are useful, but we have to be careful…especially for large numbers of predictors (multiple testing problem)
A natural temptation when considering all these individualized hypothesis tests is to “toss out” any variable(s) that does not appear significant (i.e. \(p\)-value > 0.05)
This brings us into the topic of variable selection, which we will go into much more detail later in the course.
As our first foray, we’ll look at the most straightforward (and also oldest) techniques for doing so… Stepwise Regression
We have a few ways we can compare models (e.g. RSE, adjusted \(R^2\), test MSE, or anova()), but how do we go about choosing which models to compare?
Ideally we would try every combination of the \(X_j\), but that becomes computationally impractical very quickly, since there are \(2^p\) possible subsets of \(p\) predictors.
Forward Selection: Start with only the intercept (no predictors, i.e. the null model). Fit \(p\) simple linear regressions and add the predictor whose model has the lowest RSS. Then keep adding, one at a time, the predictor that gives the lowest RSS, continuing until, for instance, our ‘best model’ measurement stops improving.
Backward Selection: Start with all predictors. Remove the predictor with the largest p-value. Continue until, for instance, all remaining variables are considered significant.
Mixed Selection: Start with no predictors. Perform forward selection, but if any \(p\)-values for variables currently in the model reach above a threshold, remove that variable. Continue until, for instance, all variables in your model are significant and any additional variable would not be.
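One way to try these procedures in R is the built-in step() function. Note that step() chooses variables by AIC rather than by RSS or p-values, so this is a sketch of the same idea and not the exact rules described above. The data are simulated, with `newspaper` constructed to have no effect on `sales`.

```r
# Simulated data, for illustration only
set.seed(6)
n <- 200
dat <- data.frame(tv = rnorm(n), radio = rnorm(n), newspaper = rnorm(n))
dat$sales <- 3 + 2 * dat$tv + 1.5 * dat$radio + rnorm(n)

null_fit <- lm(sales ~ 1, data = dat)  # intercept only
full_fit <- lm(sales ~ ., data = dat)  # all predictors

# Forward: start from the null model and add terms up to the full model
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ tv + radio + newspaper),
            direction = "forward", trace = 0)

# Backward: start from the full model and drop terms
bwd <- step(full_fit, direction = "backward", trace = 0)

formula(fwd)
formula(bwd)
```

Setting direction = "both" gives a mixed procedure that can both add and drop terms at each step.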
How well does the model fit the data?
While the summary() function does not spit out the MSE by default, we can still calculate this value and compare the resulting MSE with the MSE of other fitted models using the same data.
That is to say, the MSE isn’t so useful on its own but more so as a comparative value across multiple models.
Once a model is fit, we can use it to predict responses for “new” data:
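A minimal sketch of prediction and test MSE, again using simulated data in place of the Advertising data set (the data, the seed, and the `new_budgets` values are assumptions for illustration):

```r
# Simulated stand-in for the Advertising data (NOT the real data set)
set.seed(7)
n <- 200
Advertising <- data.frame(TV = runif(n, min = 0, max = 300))
Advertising$sales <- 7 + 0.05 * Advertising$TV + rnorm(n, sd = 3)
train_ind <- sample(seq_len(n), size = n / 2)
fit <- lm(sales ~ TV, data = Advertising, subset = train_ind)

# Predictions for hypothetical new TV budgets; column names must match
# the predictor names used when fitting
new_budgets <- data.frame(TV = c(50, 150, 250))
pred <- predict(fit, newdata = new_budgets)
pred_int <- predict(fit, newdata = new_budgets, interval = "prediction")

# Test MSE: average squared prediction error on the held-out half
test <- Advertising[-train_ind, ]
test_mse <- mean((test$sales - predict(fit, newdata = test))^2)
test_mse
```

The `interval = "prediction"` option returns lower and upper bounds alongside each prediction, which speaks to how accurate the prediction is.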