library(ISLR2)
attach(Auto)
AutoLab 2
Regression Diagnostics and Predictive Modeling: Exploring Linear, Polynomial, and KNN Regression
We should probably include how to use anova() and AIC to choose between models. See page 117 of ISLR (lab 3).
also vif()
also stepwise regression
Introduction
This lab will cover how to make predictions, analyze diagnostic plots, identify potential problems in multiple linear regression, and compare multiple regression models using the test MSE.
In this lab, we will explore extensions to the linear regression and K-Nearest Neighbors (KNN) regression using the Auto dataset from the ISLR2 package. The dataset includes various characteristics of cars, such as engine horsepower, weight, and gas mileage, which will be used to build predictive models. The primary focus is to understand how well different regression models can predict gas mileage (measured in miles per gallon) based on these car attributes.
Supplemental: ILSR2 Section 3.3.3. and Lab 3.6, Lecture 5.
Learning outcomes
By the end of this lab students will be able to:
- Use linear regression to make predictions on new data
- Split data into training and test sets and use test mean squared error (\(MSE_{te}\)) to evaluate model performance on unseen data.
- Extend linear models to include polynomial terms to better capture non-linear relationships in the data.
- Compare the performance of simple linear models and polynomial models using test set MSE.
- Fit KNN regression models and understand the effect of different values of
kon model performance. - Compare the flexibility of KNN regression to linear models and discuss the trade-offs between bias and variance.
Perform diagnostic checks on linear regression models to verify assumptions such as linearity, constant variance, normality, and independence of errors.
Identify potential issues with the model, such as non-linearity, outliers, and high leverage points, and understand their impact on model validity.
Prediction
Let’s fit a linear regression model to the Auto dataset available in the ISLR2 package.
Let’s begin by splitting the data into a training and testing set. We’ll choose 60% for the training set and 40% for the test set.
# Set a random seed to ensure reproducibility of the train/test split
set.seed(2023)
# Total number of observations in the dataset
n <- nrow(Auto)
# Randomly select ~60% of the indices (235 out of 392) for the training set
# Sampling is done without replacement to avoid duplicate rows
train <- sample(n, round(.6*n))
# this will be the index of samples in our test set
test <- setdiff(1:n, train)fit1 attempts to explain gas mileage (in mpg mpg) using the predictor variable of the engines horsepower horsepower.
fit1 <- lm(mpg~horsepower, data = Auto, subset = train)
summary(fit1)
Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)
Residuals:
Min 1Q Median 3Q Max
-13.376 -2.911 -0.294 2.545 15.464
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.32363 0.86238 45.60 <2e-16 ***
horsepower -0.15205 0.00769 -19.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.658 on 233 degrees of freedom
Multiple R-squared: 0.6265, Adjusted R-squared: 0.6249
F-statistic: 390.9 on 1 and 233 DF, p-value: < 2.2e-16
Judging by the very low p-values for \(\beta_1\) (slope), it appears that there a relationship between the predictor and the response. Since the sign for slope is negative (\(\hat\beta_1\) =-0.1578) we expect this relationship to be negative, that is, as the horsepower of our engine increase, gas mileage tends to go down. This relationship would also appear to be relatively strong as indicated by the high \(R^2\) value = 0.6049 (ie approximately 60% of the variance in the response is being explained by the simple model.
Prediction on the test set
To see how we predict on the test set we could use
pred_test = predict(fit1, Auto[test,])We can compare the predicted values to the actual values on an individual basis:
head(cbind(pred_test, Auto[test,]$mpg)) pred_test
1 19.557151 18
3 16.516154 18
4 16.516154 16
6 9.217761 15
8 6.632914 14
9 5.112416 14
We can calculate the MSE test using:
# mean((y - yhat)^2)
mse_test = mean((Auto[test,]$mpg - pred_test)^2)
mse_test[1] 27.70364
Prediction of a single value
To find the predicted mpg associated with a horsepower of say 90, we could use the predict function (see ?predict.lm for the help file; note that the newdata argument must be a data frame data structure):
predict(fit1, data.frame("horsepower"=90)) 1
25.63914
We could have constructed this prediction “by hand”, by obtaining the line of best fit and substituting in 90 for x but this obviously takes a little bit more work:
\[\begin{align} \hat y &= \hat \beta_0 + \hat \beta_1 \times \texttt{horsepower}\\ &= 39.3236312 + -0.1520498 \times 90\\ &= 25.64 \end{align}\]
It is also handy to note that we could see all the fitted \(\hat y_i\) values using the following command:
yhat_train <- fitted(fit1)
# (output suppressed to save space)From these, we could calculated the training MSE, i.e. \(MSE_{tr}\)
y_train <- Auto[train,]$mpg
mse_train = mean((y_train - yhat_train)^2)
mse_train[1] 21.51454
Notice that this can calculated from the Residual Standard Error (RSE) provided in the summary(fit1)$sigma. Recall since:
\[
\begin{align}
RSE &= \sqrt{\frac{RSS}{n−p−1}} = \sqrt{\frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{n−p−1}}\\
MSE_{tr} &= \frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{n} \\
\implies MSE_{tr}&=RSE^2*\frac{(n-p-1)}{n}
\end{align}
\] Notice that mse_train is the same as \(RSE^2*\frac{(n-p-1)}{n}\)
RSE <- summary(fit1)$sigma
n_train <- length(train) # number of obs in training set
p = 1 # number of predictors used in fit1
RSE^2*(n_train-p-1)/n_train[1] 21.51454
We could also calculate the MSE training directly from the sum of the residuals \[ \begin{align} \sum_{i=1}^n e_i &= \sum_{i=1}^n e_i = y_i - \hat y_i\\ \implies MSE_{tr} &= \sum_{i=1}^n \frac{e_i}{n} \end{align} \] To see the corresponding residuals, use:
sum(resid(fit1)^2)/n_train[1] 21.51454
Prediction and Confidence Intervals
To get a prediction interval we use:
predict(fit1, data.frame("horsepower"=90), interval="predict") fit lwr upr
1 25.63914 16.4392 34.83909
To get a confidence interval we use:
predict(fit1, data.frame("horsepower"=90), interval="confidence") fit lwr upr
1 25.63914 24.99907 26.27922
N.B. when using predict your data need to be in a data frame and the predictors need to be named in the same way as your training data.
Prediction and Confidence Bands
To create a band similar to the one we saw in lecture, you will need to create a sequence of plausible x values. For our purpose I will just take the observed range of hosepower values within our data set.
plot(mpg ~ horsepower, data = Auto)
abline(fit1)
# Create a new data frame for prediction intervals
new_data <- data.frame("horsepower" = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))
# Confidence band
conf_interval <- predict(fit1, newdata=new_data,
interval="confidence", level = 0.95)
head(conf_interval) fit lwr upr
1 32.32934 31.25414 33.40454
2 32.04674 30.99481 33.09867
3 31.76414 30.73525 32.79304
4 31.48155 30.47542 32.48767
5 31.19895 30.21531 32.18259
6 30.91635 29.95490 31.87780
lines(new_data[,1], conf_interval[,2], col="blue", lty=2)
lines(new_data[,1], conf_interval[,3], col="blue", lty=2)
# Prediction band
pred_interval <- predict(fit1, newdata=new_data,
interval="prediction", level = 0.95)
lines(new_data[,1], pred_interval[,2], col="orange", lty=2)
lines(new_data[,1], pred_interval[,3], col="orange", lty=2)
Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for \(f(X)\) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error)
Diagnostics
While the conclusions we made above may seem appropriate based on the R output, we would be amiss to not check the assumptions of this model before we jump to any conclusions. Recall that the assumptions for a linear regression model are that:
- There exists an (at least approximate) linear relationship between \(Y\) and \(X\).
- The distribution of \(\epsilon_i\) has constant variance.
- \(\epsilon_i\) is normally distributed.
- \(\epsilon_i\) are independent of one another. For example, \(\epsilon_2\) is not affected by \(\epsilon_1\).
To check the first assumption, we may like to look at the scatterplot with the line of best fit. As demonstrated below, it would appear that the response and predictor variable have a curved relationship rather than a linear one.
plot(mpg~horsepower, data=Auto)
abline(fit1, col="red")
We could also explore the diagnostic plots produced when call plot() on the output of an lm() object. To see them all in one plot, let’s change the panel layout of our plot window to store 4 figures (in this case in a 2 by 2 matrix) (for more details see ?par). In other words we could call:
par(mfrow=c(2,2))
plot(fit1)
A strong U-shaped pattern can be seen in the residual plot (located in top left corner) of the diagnostic plots below. This U-shape—and in general, any strong pattern—in the residuals indicates non-linearity in the data.
Polynomial Regression
We can extend the linear model to accommodate non-linearity by transforming predictor variables. For this particular example, the relationship between mpg and horsepower appears to be curved. Consequently, a model which includes a quadratic horsepower term, may provide a better fit.
Note that \(\sim\) in R indicates a formula object; for more details see ?formula. In formula mode, ^ is used to specify interactions (eg. ^2 denotes all second order interactions from the preceding term). So in order to use the caret to denote the arithmetic squared function, we need to insulate the term using the I() function. Alternatively, we could call/define a new variable, say hp2, that stores the squared values of hoursepower squared.
To fit this linear model—and YES this a linear model (i.e. a multiple linear regression model with \(X_1 =\) horsepower and \(X_2\) = horsepower\({}^2\) as predictors 1—we have a choice between the following two set-ups:
# option 1
hp2 <- Auto$horsepower^2
auto2 <- cbind(Auto, hp2)
fit2a <- lm(mpg~horsepower+hp2, data = auto2, subset = train)
summary(fit2a)
Call:
lm(formula = mpg ~ horsepower + hp2, data = auto2, subset = train)
Residuals:
Min 1Q Median 3Q Max
-14.219 -2.421 -0.191 2.473 14.078
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.6713582 2.2060649 23.876 < 2e-16 ***
horsepower -0.3957113 0.0382281 -10.351 < 2e-16 ***
hp2 0.0009721 0.0001499 6.486 5.25e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared: 0.6839, Adjusted R-squared: 0.6812
F-statistic: 250.9 on 2 and 232 DF, p-value: < 2.2e-16
# option 2
fit2 <- lm(mpg~horsepower+I(horsepower^2),
data = auto2, subset = train)
summary(fit2)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto2,
subset = train)
Residuals:
Min 1Q Median 3Q Max
-14.219 -2.421 -0.191 2.473 14.078
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.6713582 2.2060649 23.876 < 2e-16 ***
horsepower -0.3957113 0.0382281 -10.351 < 2e-16 ***
I(horsepower^2) 0.0009721 0.0001499 6.486 5.25e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared: 0.6839, Adjusted R-squared: 0.6812
F-statistic: 250.9 on 2 and 232 DF, p-value: < 2.2e-16
Again we can compute the MSE on the test set using
pred_test2 = predict(fit2, Auto[test,])
mse_test2 = mean((Auto[test,]$mpg - pred_test2)^2)
mse_test2[1] 20.82022
Unsurprisingly, this MSE test is lower than the MSE test obtained by the simple linear regression model.
Lets compare the two fits on the scatterplot. It appears the fit with the squared horsepower is fitting the data much better:
# to create a smoother curve lets create a sequence
# of plausible x values based on the range of x's in
# our data frame:
newdata <- data.frame(horsepower=seq(min(Auto$horsepower),
max(Auto$horsepower), length.out = 1000))
newdata$pred1 <- predict(fit2, newdata)
plot(mpg ~ horsepower, data = Auto)
abline(fit1, col="red", lwd=2)
lines(newdata$horsepower, newdata$pred1, col = "green", lwd=3)
legend('topright', lwd = 2,lty=1, col=c("red","green"),
legend=c("SLR (fit1)","MLR (fit2)"))
Higher Order polynomial regression
We could now fit higher order terms (eg. horsepower\(^3\), horsepower\(^4\), ), however, we run the risk of overfitting. Note the use of poly() for creating higher-order polynomials. For instance, the eight-degree polynomial model (fit8) found below appears to be too wiggly. (This relates back to our bias and variance tradeoff).
fit3 <- lm(mpg~horsepower+I(horsepower^2)+I(horsepower^3), subset = train)
fit4 <- lm(mpg ~ poly(horsepower, degree = 5), data = Auto, subset = train)
fit5 <- lm(mpg ~ poly(horsepower, degree = 8), data = Auto, subset = train)
fit6 <- lm(mpg ~ poly(horsepower, degree = 10), data = Auto, subset = train)As can be seen with the MSE test calculation below, this curve is not as generalizable as the second-degree polynomial regression model (i.e. fit2 with MSE test = 20.820224)
pred_test3 = predict(fit3, Auto[test,])
pred_test4 = predict(fit4, Auto[test,])
pred_test5 = predict(fit5, Auto[test,])
pred_test6 = predict(fit6, Auto[test,])
mse_test3 = mean((Auto[test,]$mpg - pred_test3)^2)
mse_test4 = mean((Auto[test,]$mpg - pred_test4)^2)
mse_test5 = mean((Auto[test,]$mpg - pred_test5)^2)
mse_test6 = mean((Auto[test,]$mpg - pred_test6)^2)
paste(round(c(mse_test3, mse_test4, mse_test5, mse_test6), 2))[1] "21.23" "20.9" "20.9" "20.92"
As we can see from the MSE test, the highly flexible tenth-degree polynomial regression model (fit6) obtains a lower MSE test than the simpler fifth-degree polynomial model (fit4) indicating that we entering the overfitting territory.
newdata$pred2 <- predict(fit2, newdata)
newdata$pred3 <- predict(fit3, newdata)
newdata$pred4 <- predict(fit4, newdata)
newdata$pred5 <- predict(fit5, newdata)
newdata$pred6 <- predict(fit6, newdata)
plot(mpg ~ horsepower, data = Auto)
lines(newdata$horsepower, newdata$pred2, col = "green", lwd=2)
lines(newdata$horsepower, newdata$pred3, col = "blue", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred4, col = "purple", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred5, col = "orange", lwd=2)
lines(newdata$horsepower, newdata$pred6, col = "red", lwd=2, lty=2)
legend('topright', lwd = 3,lty=c(1, rep(2,4)), col=c("green", "blue", "purple", "orange","red"),
legend=c("2nd degree polynomial regression", "5th degree polynomial regression", "8th degree polynomial regression", "10th degree polynomial regression"))
Sticking with fit2, we can now replot our diagnostic plots to investigate assumptions. The first assumption of linearity can be checked in the residual plots. Note that the 3d plotting of this data in R is not as straight forward as their 2d counterparts. But we visualize the linear assumption in MLR with 2 predictors as having points lying approximately on a 2D plane in our 3D space (having x, y, and z variables equal to horsepower, mpg and horsepower\(^2\), respectively).
par(mfrow=c(2,2))
plot(fit2)
While there is no more curvature in our residual plots, we are prehaps seeing some evidence of heteroscedasticity In other words, assumption 2 may be violated. In addition, our QQ-plot suggests that the residuals are not normally distributed (we are seeing high amounts of deviation at the extremities).
Outliers and high leverage points
As seen in the Scale-Location graph, it appears there may also been some outliers in our data. Using the ISLR recommendation, we can check the observations whose studentized residuals and observe which ones are greater than 3.
which(rstudent(fit2)>3)310 330
40 148
It appears from the Residuals vs Leverage graph that a number of points could be high leverage points. To investigate the top, say 5, observations having the highest cooks distance, we could do the following:
order(cooks.distance(fit2), decreasing = TRUE)[1:5][1] 69 182 124 148 3
Collinearity
Now lets consider a MLR model using more predictors from the Auto data from above. We discussed in lecture how some of these predictor variables might be correlated with one another. To investigate this, lets have a look at a scatterplot matrix which includes all of the variables in the data set using the pairs() function.
pairs(Auto)
We can see that horespower and displacement for example are highly correlated. We might also decide to look at the matrix of correlations between the variables using the function cor().
# remove "name" (since it is not numeric)
cor(Auto[,names(Auto)!="name"]) mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
Notice that the correlation between horespower and displacement is very high (0.8972570) as is the correlation between cylinders and displacement is very high (0.9508233).
KNN Regression
Below is the example from our KNN Regression example. We will be using the knnreg function from the carret package as a demo. There was some discussion in class about the best way we might choose \(K\). In practice the most common way to do this is to use cross-validation. While this will be covered later on in the course, for now, one solution we might just plot the error (e.g. MSE test) as a function of \(k\) and select the value of \(k\) corresponding to best, i.e. lowest MSE test.
library(caret)Loading required package: ggplot2
Attaching package: 'ggplot2'
The following object is masked from 'Auto':
mpg
Loading required package: lattice
train_data <- Auto[train, ]
test_data <- Auto[-train, ] # same as Auto[test, ]
knn_mse_test = NULL
y = Auto[test,]$mpg
for (k in 2:10){
kmod <- knnreg(mpg~horsepower, data = Auto, subset = train, k = k)
yhat = predict(kmod, Auto[test,])
knn_mse_test[k] = mean((y - yhat)^2)
}
plot(knn_mse_test, type = "b")
In this case, since \(k\)=4 obtained the smallest MSE test, we would select the KNN model with 4 nearest neighours (which obtained a MSE test of 20.827562). We can plot the fitted model using:
plot(mpg~horsepower, data=Auto[train,], main="KNN Regression")
kmod4 <- knnreg(mpg~horsepower, data = Auto, subset = train, k = 4)
knn_yhat = predict(kmod4, new_data)
lines(new_data$horsepower, knn_yhat, col="red", lwd=2) 
Notably, this model still appears to be too “wiggly” and appears to be overfitting to the training set.
Conclusion
In this lab we’ve fitted six variations of the regression model, performed some diagnostic checks and compared results with the non-parameter KNN regression model. Overall, the best model (as determined by the MSE test) is the second order polynomial regression model.
| \(MSE_{test}\) | Model | Fitted R object |
|---|---|---|
| 27.704 | Simple linear regression | fit1 |
| 20.82 | Second order polynomial regression | fit2 |
| 21.23 | Third order polynomial regression | fit3 |
| 20.903 | Fifth order polynomial regression | fit4 |
| 20.899 | Eighth order polynomial regression | fit5 |
| 20.924 | Tenth order polynomial regression | fit6 |
| 20.828 | KNN regression with \(k\)=4 nearest nighbours | kmod4 |
Discussion Questions
- Replace the seed value
2023with the current year (i.e.,set.seed(2025)in Listing 1) and re-run the analysis.- Did the “best” model (according to the test MSE) change?
- If so, why might that happen?
- Did the “best” model (according to the test MSE) change?
- If the best model can change depending on the random seed, what does that suggest about the reliability of relying on a single train/test split?
How might we obtain a more reliable estimate of model performance?
If you can’t think of an answer right away, don’t worry—this will be the topic of a future lecture.
Footnotes
recall that we will include all higher and lower order terms because of the hierarchy principal↩︎