Linear Regression
University of British Columbia Okanagan
Today we introduce our first model for the supervised regression problem: linear regression.
Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
The objective of this model is to find the best-fitting line (or hyperplane) that represents the relationship between the dependent and independent variables.
You don’t need to pay too much attention to any slides shown in gray.
The Advertising data set (csv) consists of the sales of a certain product, along with advertising budgets for the product in three media: TV, radio, and newspaper.
If we establish a link between advertising and sales, we can advise our client to modify their advertising budgets, consequently leading to an indirect increase in sales.
Our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
Can we predict Sales using these three? Below we plot Sales vs TV, Radio, and Newspaper, with a blue simple linear-regression line fit separately to each.
ISLR Fig 2.1
We assume that we can write the following model \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
For simple linear regression (SLR) we only have a single \(X\).
For example, \(X_i\) may represent the amount spent on TV advertising in market \(i\), and \(Y_i\) may represent sales of that product in market \(i\). Mathematically, we can write this linear relationship as \[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV}\]
Here we “regress sales onto TV”, or we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\)).
We assume that we can write the following model \[\begin{equation} Y_i = \beta_0 + \beta_1 X_{1,i} + \cdots + \beta_p X_{p,i} + \epsilon_i \end{equation}\]
For example, \(X_1\) may represent TV advertising, \(X_2\) may represent Radio advertising, \(X_3\) may represent Newspaper advertising, and \(Y\) may represent sales. Mathematically, we can write this linear relationship as \[\texttt{sales} \approx \beta_0 + \beta_1 \texttt{TV} + \beta_2 \texttt{Radio} + \beta_3 \texttt{Newspaper}\]
Here we regress sales onto the predictors TV, Radio, and Newspaper.
In terms of our general supervised learning setup, \(Y = f(X) + \epsilon\), we have:
\[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p\]
After a model has been selected, we need a procedure that uses the training data to fit or train the model.
To estimate \(f(X)\), one only needs to estimate the \(p+1\) coefficients \(\beta_0, \beta_1, \ldots, \beta_p\).
We denote the estimate of \(\beta_i\) by \(\hat \beta_i\).
The most common approach to fitting the linear regression model is referred to as (ordinary) least squares, or OLS.
This least squares technique involves minimizing the sum of the squared residuals, i.e., the vertical distances from each point to the line (or curve).
You can imagine this as wiggling the line about until the sum of the squared vertical distances is minimized.
This minimization can be achieved using some basic calculus.
Given the parameter estimates \(\hat\beta_0\) and \(\hat\beta_1\) we can predict \(y_i\)
\[\hat y_i = \hat \beta_0 + \hat \beta_1 x_i\]
For any line, we can define the observed error as \[e_i = y_i - \hat y_i\] these are more often referred to as residuals.
To find the “best” line, we instead minimize the residual sum of squares RSS = \(\sum_{i=1}^n e_i^2\).
Derivations for \(\hat \beta_0\) and \(\hat \beta_1\) are easy with some basic calculus (e.g., you can find the derivation for SLR here).
It turns out that the following equations minimize \(\sum e_i^2\):
\[\begin{align} \hat \beta_1 &= \frac{\sum (y_i - \bar y)(x_i-\bar x)}{\sum (x_i-\bar x)^2}\\ \hat \beta_0 & = \overline{y}- \hat \beta_1\overline{x} \end{align}\]
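A quick sanity check of these formulas in R, using the built-in cars data (introduced later in these notes), confirming they agree with lm():
# Hand-computed least squares estimates vs. lm(), using the built-in cars data
x <- cars$speed
y <- cars$dist
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)  # slope formula above
b0 <- mean(y) - b1 * mean(x)                                     # intercept formula above
c(intercept = b0, slope = b1)
coef(lm(dist ~ speed, data = cars))  # should match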
The estimate of the intercept, \(\hat \beta_0\), tells us what value we would guess for \(y\) when \(x=0\).
It is often a large extrapolation (and therefore “meaningless”), so be careful with the interpretation.
The estimate of the slope tells us that for each one-unit increase in the predictor (\(x\)) we expect the response (\(y\)) to change by \(\hat \beta_1\).
Avoid Extrapolation
Ensure predictions are made within the observed range (i.e., within the range of values in the training data).
In MLR we have \(p\) predictors: \(X_j\), for \(j = 1, \dots, p\).
\(\beta_j\) quantifies the association between the \(j^{th}\) predictor variable and the response.
\(\beta_j\) is the average effect on \(Y\) of a one unit increase in \(X_j\), holding all other predictors fixed
This interpretation is not very useful in practice as predictors will often change together!
Is at least one of the predictors \(X_1, X_2, \dots, X_p\) useful in predicting the response?
Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with each other.
This can lead to unstable coefficient estimates, inflated standard errors, and a reduction in overall interpretability.
Let’s consider the cars example from the datasets package (loaded automatically every time you open R).
It gives the speed of cars (speed, in mph) and the distances taken to stop (dist, in ft).
As a first step, it is always a good idea to plot the data (see the sketch below).
Note that the data were recorded in the 1920s.
General trend: the faster the car is going, the longer (in terms of distance) it takes to stop.
Goal: find the “best” line to model this trend.
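A minimal sketch of such a plot, using base R graphics:
# Scatterplot of stopping distance against speed for the built-in cars data
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")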
We fit a linear model in R using the lm() function.
Fitting linear models in R
If we don’t attach the data frame, we need to specify the data frame via the data argument.
Alternatively, you could attach() the data set so that the columns in the data frame can be called by name in the lm() function.
Typically, we store the model output in an object and use the summary() function on it to extract more detailed information.
Clicking on “lm()” in these notes will bring you to its online documentation. To learn more about how summary() works on the output from lm(), use ?summary.lm in R.
More generally we will have the following format where a variety of formula
specifications are allowed:
| Type of model | formula |
|---|---|
| Single predictor (SLR) | `response ~ predictor` |
| Multiple predictors (MLR) | `response ~ predictor1 + predictor2 + ...` |
| All variables | `response ~ .` |
| Excluding a specific predictor | `response ~ . - predictor` |
| Interaction terms | `response ~ predictor1 * predictor2` |
| Polynomial terms | `response ~ poly(predictor, degree)` |
| Categorical variables | `response ~ factor(predictor)` |
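For reference, a minimal fit of stopping distance on speed that produces output like the summary shown below (a sketch; the Call line will differ slightly since we pass data = cars rather than attaching the data frame):
carlm <- lm(dist ~ speed, data = cars)  # fit the SLR of dist on speed
summary(carlm)                          # coefficients, RSE, R-squared, F-test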
Call:
lm(formula = dist ~ speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
While you should be able to extract this information directly from the summary output, if you want to extract the values from the R object (which is handy for reports) use:
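For example (a sketch, assuming the fitted object is stored as carlm as above):
coef(carlm)                   # estimated coefficients
confint(carlm)                # confidence intervals for the coefficients
summary(carlm)$coefficients   # coefficient table (estimates, SEs, t values, p-values)
summary(carlm)$r.squared      # R-squared
summary(carlm)$adj.r.squared  # adjusted R-squared
summary(carlm)$sigma          # residual standard error (RSE)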
iClicker
In linear regression we can have more than one independent variable
A. True
B. False
Correct answer: A. True
iClicker
Which term is used to describe the problem when independent variables in a multiple linear regression model are highly correlated?
A. Homoscedasticity
B. Multicollinearity
C. Autocorrelation
D. Heteroscedasticity
Correct answer: B. Multicollinearity
ISLR Fig 3.4: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.
ISLR Fig 3.1: The least squares fit for the regression of sales onto TV. In this case, a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
Are the assumptions of the linear regression model met?
Residuals vs Fitted: checks for linearity. A “good” plot will have a red horizontal line, without distinct patterns.
Normal Q-Q: checks if residuals are normally distributed. It’s “good” if points follow the straight dashed line.
Scale-Location: checks the equal variance of the residuals. “Good” to see horizontal line with equally spread points.
Residuals vs Leverage: identifies influential points.
# plotting options
par(mfrow=c(2,2)) # sets 2x2 layout matrix for plotting
par(mar=c(5,4,1.2,0.1)) # removes some white space around margins
# load in the data
library(dplyr)
Advertising <- read.csv("data/Advertising.csv")
Advertising <- rename(Advertising, market = X)
# fit the linear regression model
adlm <- lm(sales~TV, data = Advertising)
# plot the diagnostic plots
plot(adlm)
Purpose: This plot checks for non-linearity, constant variance (homoscedasticity), and the presence of outliers.
Ideally, the points should be randomly scattered around the horizontal line (y=0), with no clear pattern.
A systematic pattern (such as a curve) suggests a non-linear relationship, and a “funnel” shape indicates heteroscedasticity (non-constant variance).
Purpose: This plot assesses whether the residuals are normally distributed.
Ideally, the points should lie approximately along the reference line.
Deviations from this line, particularly in the tails, indicate that the residuals are not normally distributed, which can affect the validity of hypothesis tests and confidence intervals.
(Spread-Location Plot)
Purpose: This plot checks for homoscedasticity (constant variance of residuals).
It plots the square root of standardized residuals against the fitted values.
Ideally, the points should be scattered horizontally with no discernible pattern. A trend or pattern (e.g., increasing or decreasing spread) suggests heteroscedasticity.
Purpose: This plot identifies influential observations that have a disproportionate impact on the model fit, also known as high-leverage points.
It helps to spot influential data points that might unduly affect the regression results.
Points that are far from others in the x-direction (high leverage) or those with high Cook’s distance (indicated by dashed lines) could be potential concerns.
Check for correlated variables, as they can cause problems.
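One quick check (a sketch, assuming the Advertising data frame loaded earlier, with its first column renamed to market) is the correlation matrix:
round(cor(Advertising), 3)  # pairwise correlations between all numeric columns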
market TV radio newspaper sales
market 1.000 0.018 -0.111 -0.155 -0.052
TV 0.018 1.000 0.055 0.057 0.782
radio -0.111 0.055 1.000 0.354 0.576
newspaper -0.155 0.057 0.354 1.000 0.228
sales -0.052 0.782 0.576 0.228 1.000
Unstable Coefficient Estimates: the variance of all coefficients tends to increase, sometimes dramatically.
Difficulty in Interpreting Coefficients: coefficients may not reflect the true relationship between each predictor and the outcome, leading to misleading conclusions about the importance or impact of each predictor.
To deal with correlated variables you may consider removing one of the highly correlated predictors (you will look at a technique called backward selection in the lab)
Alternatively, you could combine correlated predictors into a single composite variable (we don’t cover this here).
Regularization methods add a penalty to the regression model to shrink coefficient estimates and reduce multicollinearity effects (we will look at Ridge Regression and Lasso in future lectures)
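A common numerical diagnostic for multicollinearity is the variance inflation factor (VIF). A minimal sketch using the vif() function from the car package (assumed to be installed):
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors.
# Values above roughly 5-10 are commonly flagged as problematic.
library(car)  # install.packages("car") if needed
mlr <- lm(sales ~ TV + radio + newspaper, data = Advertising)
vif(mlr)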
The standard error of an estimator reflects how it varies under repeated sampling. We have \[\begin{align*} SE(\hat{\beta}_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},\\ SE(\hat{\beta}_0)^2 &= \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right), \end{align*}\] where \(\sigma^2 = \text{Var}(\epsilon)\).
The \(\sigma\) used in the standard error formulas represents the standard deviation of the error terms, i.e., \(\sigma^2 = \text{Var}(\epsilon_i)\).
Since \(\sigma\) is generally unknown, it is estimated with \[\begin{equation} RSE=\sqrt{\frac{RSS}{n-p-1}} = \sqrt{\frac{\sum (Y_i-\hat{Y}_i)^2}{n-p-1}} \end{equation}\]
An approximate 95% confidence interval for \(\beta_1\) takes the form \[\hat{\beta}_1 \pm 1.96 \cdot SE(\hat{\beta}_1)\]
That is, there is approximately a 95% chance that the interval
\[[\hat{\beta}_1 - 1.96 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 1.96 \cdot SE(\hat{\beta}_1)]\]
will contain the true value of \(\beta_1\) (under a scenario where we get repeated samples like the present sample). For the advertising data, the 95% confidence interval for \(\beta_1\) is \[[0.042, 0.053]\]
Advertising <- read.csv("https://www.statlearning.com/s/Advertising.csv")
adlm <- lm(sales~TV, data = Advertising)
sumlm <- summary(adlm)
# Confidence interval for the coefficient of TV
beta_tv <- sumlm$coefficients["TV", c("Estimate", "Std. Error")]
# Calculate the critical value (two-tailed)
alpha <- 0.05
n <- nrow(Advertising)
critical_value <- qt(1 - alpha / 2, df = n - length(adlm$coefficients))
# Calculate the lower and upper bound of the CI
lower_bound <- beta_tv["Estimate"] - critical_value * beta_tv["Std. Error"]
upper_bound <- beta_tv["Estimate"] + critical_value * beta_tv["Std. Error"]
print(c(lower_bound, upper_bound))
Estimate Estimate
0.04223072 0.05284256
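For comparison, the built-in confint() function returns the same interval directly (assuming the adlm object fit above):
confint(adlm, parm = "TV", level = 0.95)  # matches the hand-computed bounds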
\[\begin{align} H_0: & \text{ There is no relationship between $X$ and $Y$ vs. }\\ H_A: & \text{ There is some relationship between $X$ and $Y$}. \end{align}\]
In the multiple regression setting with \(p\) predictors, we need to ask whether all of the regression coefficients are zero, i.e., whether \[\beta_1 = \beta_2 = \ldots = \beta_p = 0.\]
To answer this question we test the null hypothesis, \[\begin{align*} H_0:& \beta_1 = \beta_2 = \ldots = \beta_p = 0 \\ H_a:& \text{at least one } \beta_j \text{ is non-zero} \end{align*}\]
This hypothesis test is performed by computing the F-statistic where \(F\) follows an \(F\)-distribution with degrees of freedom \(\nu_1 = p\) and \(\nu_2 = n-p-1\)
\[\begin{equation} F = \frac{\text{TSS - RSS}/p}{\text{RSS}/(n - p - 1)} \end{equation}\]
If there is no relationship between the response and predictors, the F-statistic will be close to 1. Larger values produce smaller p-values and a higher chance of rejecting \(H_0\).
Once we have determined that it is likely that there is a significant relationship between \(X\) and \(Y\), we can ask: how well does the linear model fit the data?
You may remember assessing the quality of a linear regression fit using the following methods
\[\begin{equation} R^2 = \frac{\text{TSS-RSS}}{\text{TSS}} = 1-\frac{\text{RSS}}{\text{TSS}} = 1- \frac{\sum (y_i - \hat y_i)^2}{\sum (y_i - \bar y)^2}. \end{equation}\]
The coefficient of determination \(R^2\) is the proportion of the total variation that is explained by the regression line.
It measures how close the fitted line is to the observed data.
It is easy to interpret as it always lies between 0 and 1.
Adjusted \(R^2\) is produced in the R output and is calculated as:
\[ R^2_{\text{adj}} = 1 - \dfrac{RSS/(n-p-1)}{TSS/(n-1)} \]
We usually look at adjusted \(R^2\) when comparing models with different numbers of predictors.
RSE is calculated as follows:
\[\begin{equation} RSE = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\end{equation}\]
where \(n\) is the number of data points, \(p\) is the number of predictors (features), \(y_i\) is the observed value, and \(\hat y_i\) is the predicted value for the \(i^{th}\) data point.
RSE is primarily used in the context of linear regression.
It quantifies the typical or average magnitude of the residuals (the differences between observed and predicted values) in the model.
RSE is essentially the standard deviation of the residuals \(\left(\sqrt{\text{Var}(e_i)}\right)\).
RSE is used to estimate \(\sigma\), the standard deviation of the errors \(\left(\sqrt{\text{Var}(\epsilon)}\right)\).
Recall our Mean Squared Error (MSE) formula from last lecture:
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
MSE is a more general metric used in various machine learning and statistical modeling contexts
RSE is essentially the square root of a degrees-of-freedom-adjusted version of the MSE.
RSE is a specific metric used in linear regression that adjusts for the number of predictors, whereas MSE is a more general metric used in various predictive modeling tasks.
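To make the relationship concrete, here is a small sketch computing both quantities from the residuals of the same fit (assuming the adlm model for sales ~ TV from earlier):
res <- residuals(adlm)
n <- length(res)                        # number of observations
p <- 1                                  # number of predictors in sales ~ TV
MSE <- mean(res^2)                      # average squared residual
RSE <- sqrt(sum(res^2) / (n - p - 1))   # degrees-of-freedom adjusted; equals summary(adlm)$sigma
c(MSE = MSE, RSE = RSE)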
Let’s consider a data set containing the life expectancy and illiteracy rates (among other measurements) for all 50 US states from the 1970s.
This dataset is available in R as state.x77 in the datasets package (which gets loaded automatically with each new R session); see ?state.x77.
We’ll fit a simple linear regression with life expectancy (Life.Exp) as the response variable.
Our response variable is Life.Exp, the life expectancy in years (1969–71). Let’s consider the predictors (a setup sketch follows this list):
- Illiteracy: illiteracy (1970, percent of population)
- Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
- HS.Grad: percent high-school graduates (1970)
- Frost: mean number of days with minimum temperature below freezing (1931–1960) in the capital or a large city
Call:
lm(formula = Life.Exp ~ Illiteracy, data = stdata)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.3949 0.3383 213.973 < 2e-16 ***
Illiteracy -1.2960 0.2570 -5.043 6.97e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.097 on 48 degrees of freedom
Multiple R-squared: 0.3463, Adjusted R-squared: 0.3327
F-statistic: 25.43 on 1 and 48 DF, p-value: 6.969e-06
...
iClicker coefficient
What is the coefficient of Illiteracy?
A. 72.39
B. -1.3
C. 0.35
D. 0.33
E. None of the above
Correct answer: B
iClicker p-value
What is the \(p\)-value associated with the coefficient of Illiteracy?
A. 3.47e-73
B. 6.97e-06
C. 213.97
D. -5.04
E. None of the above
Correct answer: B
iClicker: p-value associated with regression coefficients
What does the p-value associated with the Illiteracy coefficient represent?
A. The probability that Illiteracy has no effect on Life Expectancy.
B. The proportion of Life Expectancy variance explained by Illiteracy.
C. The probability of observing the estimated relationship between Illiteracy and Life Expectancy, or something more extreme, if Illiteracy had no impact on Life Expectancy.
D. The change in Life Expectancy for each additional unit of Illiteracy.
Correct Answer: C
The R-squared value 0.35 suggests that Illiteracy explains around 35% of the variance in life expectancy.
Residual Standard Error (RSE): 1.097, indicating the average distance of observed values from the regression line.
Call:
lm(formula = Life.Exp ~ Frost, data = stdata)
Residuals:
Min 1Q Median 3Q Max
-2.6515 -0.7852 -0.1183 0.9382 3.4284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.171631 0.418883 167.521 <2e-16 ***
Frost 0.006768 0.003597 1.881 0.066 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.309 on 48 degrees of freedom
Multiple R-squared: 0.06868, Adjusted R-squared: 0.04928
F-statistic: 3.54 on 1 and 48 DF, p-value: 0.06599
The weak positive relationship (\(\hat \beta_{\text{Frost}}\) = 0.0068) between Frost and Life Expectancy is not significant (\(p\)-value = 0.066).
This suggests that there is no strong evidence of a relationship between the number of frost days and life expectancy.
Call:
lm(formula = Life.Exp ~ Illiteracy + Murder + HS.Grad + Frost,
data = stdata)
Residuals:
Min 1Q Median 3Q Max
-1.48906 -0.51040 0.09793 0.55193 1.33480
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.519958 1.320487 54.162 < 2e-16 ***
Illiteracy -0.181608 0.327846 -0.554 0.58236
Murder -0.273118 0.041138 -6.639 3.5e-08 ***
HS.Grad 0.044970 0.017759 2.532 0.01490 *
Frost -0.007678 0.002828 -2.715 0.00936 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7483 on 45 degrees of freedom
Multiple R-squared: 0.7146, Adjusted R-squared: 0.6892
F-statistic: 28.17 on 4 and 45 DF, p-value: 9.547e-12
To test whether at least one predictor is related to the response (Life.Exp), we test:
\[\begin{align} H_0&: \beta_1 = \beta_2 = \cdots = \beta_p = 0 && H_a: \text{not all }\beta_j = 0 \end{align}\]
iClicker: \(p\)-value associated with the F-test
What is the \(p\)-value associated with the F-test?
Correct Answer: E. 9.547e-12
Interpretation
The \(p\)-value of the F-statistic is highly significant, which means that at least one of the predictor variables is significantly related to the outcome variable.
Q: Why do we use F-tests instead of looking at the individual p-values associated with each predictor?
Even if all \(\beta_j = 0\) (that is, no variable is truly associated with the response), we would still expect about 5% of the individual hypothesis tests for the \(\beta_j\) to be rejected by chance alone when testing at the 5% level.
The \(F\)-test therefore combats the multiple testing problem.
Warning
If \(p>n\) we cannot fit the MLR model using least squares.
For the question of whether at least one predictor is useful, we use the F-statistic and the F-test.
Since the value is large (28.17) and the p-value is small (9.547e-12), we would reject the null \[H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\] in favour of the alternative
\[ H_a:\text{ at least one }\beta_j \neq 0. \]
Each \(\beta_j\) has an associated \(t\)-statistic and \(p\)-value in the R output
These check whether each predictor is linearly related to the response, after adjusting for all the other predictors considered
These are useful, but we have to be careful…especially for large numbers of predictors (multiple testing problem)
A natural temptation when considering all these individualized hypothesis tests is to “toss out” any variable(s) that does not appear significant (i.e., \(p\)-value > 0.05).
This brings us into the topic of variable selection, which we will go into much more detail later in the course.
As our first foray, we’ll look at the most straightforward (and also oldest) techniques for doing so.
Assuming we now have an appropriate measuring stick (e.g. RSE or \(R^2\)) how do we go about choosing which models to fit?
Ideally we would try every combination of the \(X_j\), but that becomes computationally impractical very quickly (with \(p\) predictors there are \(2^p\) possible models).
Forward Selection: Start with only the intercept (no predictors, ie null model). Create \(p\) simple linear regressions, whichever predictor has the lowest RSS, add it into the model. Continue until, for instance, we see a decrease in our ‘best model’ measurement.
Backward Selection: Start with all predictors. Remove the predictor with the largest p-value. Continue until, for instance, all variables are considered significant (see the sketch after this list for an automated variant).
Mixed Selection: Start with no predictors. Perform forward selection, but if any \(p\)-values for variables currently in the model reach above a threshold, remove that variable. Continue until, for instance, all variables in your model are significant and any additional variable would not be.
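As a rough illustration of backward selection (a sketch only; the built-in step() function drops predictors using AIC rather than p-values, so the lab procedure may differ):
# Start from the full model and repeatedly drop the predictor whose removal
# most improves (lowers) the AIC, stopping when no removal helps.
full <- lm(Life.Exp ~ Illiteracy + Murder + HS.Grad + Frost, data = stdata)
step(full, direction = "backward")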
We are always going to encounter error by using models.
Because we are estimating the population parameters \(\beta_j\) based on a sample there will always be some variability in our model and inaccuracy related to the reducible error.
Our model will most likely be a simpler approximation to reality; hence, we introduce model bias and add to our reducible error.
Even if the multiple linear model was exactly \(f\), there would still be irreducible error. That is, our predictions \(\hat y_i \neq y_i\) for most (probabilistically all) \(i\).
While the summary() function does not report the MSE by default, we can still calculate this value and compare the resulting MSE with the MSE of other fitted models using the same data.
That is to say, the MSE isn’t so useful on its own, but more so as a comparative value across multiple models.
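For example (a sketch, assuming the Advertising data frame from earlier), we can compare the training MSEs of two candidate models fit to the same data:
m1 <- lm(sales ~ TV, data = Advertising)
m2 <- lm(sales ~ TV + radio, data = Advertising)
c(MSE_TV = mean(residuals(m1)^2),          # training MSE, TV only
  MSE_TV_radio = mean(residuals(m2)^2))    # training MSE, TV + radio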
We can use the fitted model to make predictions for \(Y\) on the basis of a set of predictors \(X_1, X_2, \dots, X_p\) using functions in R and the fitted \(\hat \beta\)’s
To account for the three sorts of uncertainty associated with this prediction we have two methods: confidence intervals (for the mean response) and prediction intervals (for a new individual response), described below.
Using the SLR model for the cars data, we can predict the stopping distance for a new speed value \(x_0\).
Its stopping distance will be determined by \[Y_0 = \beta_0 + \beta_1 x_0 + \epsilon\]
Our prediction for this value would be \(\hat\beta_0 + \hat\beta_1 x_0\)
To assess the variance of this prediction, we must take into account the variance of the \(\hat\beta\)s and \(\epsilon\).
Suppose we are interested in the average stopping distance this 1920s car would incur at a speed of \(x_0\).
Our prediction for this value would be \[\hat y_0 = \hat\beta_0 + \hat\beta_1 x_0\]
To assess the variance of this prediction, we only need to take into account the variance of the \(\hat\beta\)s.
Prediction intervals are concerned with the value of a new observation \(Y_{0}\) for a given value \(x_0\) of the predictor
Confidence intervals are concerned with the value of the mean response \(E(Y_0) = \mu_{Y_0}\) for a given value \(x_0\) of the predictor
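A sketch of both interval types using predict() on the cars fit (the speed value of 21 mph is just a hypothetical example):
carlm <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 21)                           # hypothetical new speed (mph)
predict(carlm, newdata = new_obs, interval = "confidence")  # CI for the mean stopping distance
predict(carlm, newdata = new_obs, interval = "prediction")  # PI for a single new car's distance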
Simplicity: easy to understand, interpret and implement.
Efficiency: It is computationally inexpensive, trains very quickly compared to more complex models, especially on smaller datasets.
Baseline Model: serves as a good baseline model for more complex techniques, helping to determine if further sophistication is necessary.
Predictive Power: if the relationship between the variables is truly linear, it can produce highly accurate predictions.
Strong Assumptions: relies on assumptions (e.g., linearity, independence, and constant variance of the errors); if these assumptions are violated, the model may not perform well.
Limited to Linear Relationships: underfits when applied to complex relationships.
Sensitivity to Outliers: highly sensitive to outliers, which can disproportionately affect the model’s accuracy.
Multicollinearity: When independent variables are highly correlated, linear regression coefficients can become unstable and misleading.
Comments
Alternatively, we could have attached our data and done the following.
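A minimal sketch (for completeness only; see the caveat below):
attach(cars)
carlm <- lm(dist ~ speed)   # columns speed and dist are now visible by name
detach(cars)                # remember to detach afterwards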
However, this approach is not recommended for this course, as it can lead to confusion when working with different datasets (e.g., training, testing, cross-validation sets).