Lab 2

Regression Diagnostics and Predictive Modeling: Exploring Linear, Polynomial, and KNN Regression

Author

Affiliation

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

This lab will cover how to make predictions, analyze diagnostic plots, identify potential problems in multiple linear regression, and compare multiple regression models using the test MSE.

In this lab, we will explore extensions to the linear regression and K-Nearest Neighbors (KNN) regression using the Auto dataset from the ISLR2 package. The dataset includes various characteristics of cars, such as engine horsepower, weight, and gas mileage, which will be used to build predictive models. The primary focus is to understand how well different regression models can predict gas mileage (measured in miles per gallon) based on these car attributes.

Supplemental: ILSR2 Section 3.3.3. and Lab 3.6, Lecture 5.

Learning outcomes

By the end of this lab students will be able to:

Use linear regression to make predictions on new data
Split data into training and test sets and use test mean squared error ($MSE_{te}$) to evaluate model performance on unseen data.
Extend linear models to include polynomial terms to better capture non-linear relationships in the data.
Compare the performance of simple linear models and polynomial models using test set MSE.
Fit KNN regression models and understand the effect of different values of k on model performance.
Compare the flexibility of KNN regression to linear models and discuss the trade-offs between bias and variance.

Perform diagnostic checks on linear regression models to verify assumptions such as linearity, constant variance, normality, and independence of errors.
Identify potential issues with the model, such as non-linearity, outliers, and high leverage points, and understand their impact on model validity.

Prediction

Let’s fit a linear regression model to the Auto dataset available in the ISLR2 package.

library(ISLR2)
attach(Auto)
Auto

Let’s begin by splitting the data into a training and testing set. We’ll choose 60% for the training set and 40% for the test set.

Listing 1: Splitting the dataset into training and test sets

# Set a random seed to ensure reproducibility of the train/test split
set.seed(2023)

# Total number of observations in the dataset
n <- nrow(Auto)

# Randomly select ~60% of the indices (235 out of 392) for the training set
# Sampling is done without replacement to avoid duplicate rows
train <- sample(n, round(.6*n))

# this will be the index of samples in our test set
test <- setdiff(1:n, train)

fit1 attempts to explain gas mileage (in mpg mpg) using the predictor variable of the engines horsepower horsepower.

fit1 <- lm(mpg~horsepower, data = Auto, subset = train)
summary(fit1)


Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.376  -2.911  -0.294   2.545  15.464 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.32363    0.86238   45.60   <2e-16 ***
horsepower  -0.15205    0.00769  -19.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.658 on 233 degrees of freedom
Multiple R-squared:  0.6265,    Adjusted R-squared:  0.6249 
F-statistic: 390.9 on 1 and 233 DF,  p-value: < 2.2e-16

Judging by the very low p-values for $\beta_1$ (slope), it appears that there a relationship between the predictor and the response. Since the sign for slope is negative ($\hat\beta_1$ =-0.1578) we expect this relationship to be negative, that is, as the horsepower of our engine increase, gas mileage tends to go down. This relationship would also appear to be relatively strong as indicated by the high $R^2$ value = 0.6049 (ie approximately 60% of the variance in the response is being explained by the simple model.

Prediction on the test set

To see how we predict on the test set we could use

pred_test = predict(fit1, Auto[test,])

We can compare the predicted values to the actual values on an individual basis:

head(cbind(pred_test, Auto[test,]$mpg))

  pred_test   
1 19.557151 18
3 16.516154 18
4 16.516154 16
6  9.217761 15
8  6.632914 14
9  5.112416 14

We can calculate the MSE test using:

#  mean((y - yhat)^2)
mse_test = mean((Auto[test,]$mpg - pred_test)^2)
mse_test

[1] 27.70364

Prediction of a single value

To find the predicted mpg associated with a horsepower of say 90, we could use the predict function (see ?predict.lm for the help file; note that the newdata argument must be a data frame data structure):

predict(fit1, data.frame("horsepower"=90))

       1 
25.63914

We could have constructed this prediction “by hand”, by obtaining the line of best fit and substituting in 90 for x but this obviously takes a little bit more work:

\[\begin{align} \hat y &= \hat \beta_0 + \hat \beta_1 \times \texttt{horsepower}\\ &= 39.3236312 + -0.1520498 \times 90\\ &= 25.64 \end{align}\]

It is also handy to note that we could see all the fitted $\hat y_i$ values using the following command:

yhat_train <- fitted(fit1)
# (output suppressed to save space)

From these, we could calculated the training MSE, i.e. $MSE_{tr}$

y_train <- Auto[train,]$mpg
mse_train = mean((y_train - yhat_train)^2)
mse_train

[1] 21.51454

Notice that this can calculated from the Residual Standard Error (RSE) provided in the summary(fit1)$sigma. Recall since:

\[ \begin{align} RSE &= \sqrt{\frac{RSS}{n−p−1}} = \sqrt{\frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{n−p−1}}\\ MSE_{tr} &= \frac{\sum_{i=1}^n ( y_i - \hat y_i)^2}{n} \\ \implies MSE_{tr}&=RSE^2*\frac{(n-p-1)}{n} \end{align} \] Notice that mse_train is the same as $RSE^2*\frac{(n-p-1)}{n}$

RSE <- summary(fit1)$sigma
n_train <- length(train) # number of obs in training set
p = 1 # number of predictors used in fit1
RSE^2*(n_train-p-1)/n_train

[1] 21.51454

We could also calculate the MSE training directly from the sum of the residuals \[ \begin{align} \sum_{i=1}^n e_i &= \sum_{i=1}^n e_i = y_i - \hat y_i\\ \implies MSE_{tr} &= \sum_{i=1}^n \frac{e_i}{n} \end{align} \] To see the corresponding residuals, use:

sum(resid(fit1)^2)/n_train

[1] 21.51454

Prediction and Confidence Intervals

To get a prediction interval we use:

predict(fit1, data.frame("horsepower"=90), interval="predict")

       fit     lwr      upr
1 25.63914 16.4392 34.83909

To get a confidence interval we use:

predict(fit1, data.frame("horsepower"=90), interval="confidence")

       fit      lwr      upr
1 25.63914 24.99907 26.27922

N.B. when using predict your data need to be in a data frame and the predictors need to be named in the same way as your training data.

Prediction and Confidence Bands

To create a band similar to the one we saw in lecture, you will need to create a sequence of plausible x values. For our purpose I will just take the observed range of hosepower values within our data set.

plot(mpg ~ horsepower, data = Auto)
abline(fit1)

# Create a new data frame for prediction intervals
new_data <- data.frame("horsepower" = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))

# Confidence band
conf_interval <- predict(fit1, newdata=new_data, 
                      interval="confidence", level = 0.95)
head(conf_interval)

       fit      lwr      upr
1 32.32934 31.25414 33.40454
2 32.04674 30.99481 33.09867
3 31.76414 30.73525 32.79304
4 31.48155 30.47542 32.48767
5 31.19895 30.21531 32.18259
6 30.91635 29.95490 31.87780

lines(new_data[,1], conf_interval[,2], col="blue", lty=2)
lines(new_data[,1], conf_interval[,3], col="blue", lty=2)

# Prediction band
pred_interval <- predict(fit1, newdata=new_data,
                         interval="prediction", level = 0.95)
lines(new_data[,1], pred_interval[,2], col="orange", lty=2)
lines(new_data[,1], pred_interval[,3], col="orange", lty=2)

Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for $f(X)$ (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error)

Diagnostics

While the conclusions we made above may seem appropriate based on the R output, we would be amiss to not check the assumptions of this model before we jump to any conclusions. Recall that the assumptions for a linear regression model are that:

There exists an (at least approximate) linear relationship between $Y$ and $X$.
The distribution of $\epsilon_i$ has constant variance.
$\epsilon_i$ is normally distributed.
$\epsilon_i$ are independent of one another. For example, $\epsilon_2$ is not affected by $\epsilon_1$.

To check the first assumption, we may like to look at the scatterplot with the line of best fit. As demonstrated below, it would appear that the response and predictor variable have a curved relationship rather than a linear one.

plot(mpg~horsepower, data=Auto)
abline(fit1, col="red")

We could also explore the diagnostic plots produced when call plot() on the output of an lm() object. To see them all in one plot, let’s change the panel layout of our plot window to store 4 figures (in this case in a 2 by 2 matrix) (for more details see ?par). In other words we could call:

par(mfrow=c(2,2))
plot(fit1)

A strong U-shaped pattern can be seen in the residual plot (located in top left corner) of the diagnostic plots below. This U-shape—and in general, any strong pattern—in the residuals indicates non-linearity in the data.

Polynomial Regression

We can extend the linear model to accommodate non-linearity by transforming predictor variables. For this particular example, the relationship between mpg and horsepower appears to be curved. Consequently, a model which includes a quadratic horsepower term, may provide a better fit.

Note that $\sim$ in R indicates a formula object; for more details see ?formula. In formula mode, ^ is used to specify interactions (eg. ^2 denotes all second order interactions from the preceding term). So in order to use the caret to denote the arithmetic squared function, we need to insulate the term using the I() function. Alternatively, we could call/define a new variable, say hp2, that stores the squared values of hoursepower squared.

To fit this linear model—and YES this a linear model (i.e. a multiple linear regression model with $X_1 =$ horsepower and $X_2$ = horsepower${}^2$ as predictors ¹—we have a choice between the following two set-ups:

# option 1
hp2 <- Auto$horsepower^2
auto2 <- cbind(Auto, hp2)
fit2a <- lm(mpg~horsepower+hp2, data = auto2, subset = train)
summary(fit2a)


Call:
lm(formula = mpg ~ horsepower + hp2, data = auto2, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.219  -2.421  -0.191   2.473  14.078 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.6713582  2.2060649  23.876  < 2e-16 ***
horsepower  -0.3957113  0.0382281 -10.351  < 2e-16 ***
hp2          0.0009721  0.0001499   6.486 5.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared:  0.6839,    Adjusted R-squared:  0.6812 
F-statistic: 250.9 on 2 and 232 DF,  p-value: < 2.2e-16

# option 2
fit2 <- lm(mpg~horsepower+I(horsepower^2), 
           data = auto2, subset = train)
summary(fit2)


Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto2, 
    subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.219  -2.421  -0.191   2.473  14.078 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     52.6713582  2.2060649  23.876  < 2e-16 ***
horsepower      -0.3957113  0.0382281 -10.351  < 2e-16 ***
I(horsepower^2)  0.0009721  0.0001499   6.486 5.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared:  0.6839,    Adjusted R-squared:  0.6812 
F-statistic: 250.9 on 2 and 232 DF,  p-value: < 2.2e-16

Again we can compute the MSE on the test set using

pred_test2 = predict(fit2, Auto[test,])
mse_test2 = mean((Auto[test,]$mpg - pred_test2)^2)
mse_test2

[1] 20.82022

Unsurprisingly, this MSE test is lower than the MSE test obtained by the simple linear regression model.

Lets compare the two fits on the scatterplot. It appears the fit with the squared horsepower is fitting the data much better:

# to create a smoother curve lets create a sequence
# of plausible x values based on the range of x's in 
# our data frame:
newdata <- data.frame(horsepower=seq(min(Auto$horsepower), 
                      max(Auto$horsepower), length.out = 1000))
newdata$pred1 <- predict(fit2, newdata)
plot(mpg ~ horsepower, data = Auto)
abline(fit1, col="red", lwd=2)
lines(newdata$horsepower, newdata$pred1, col = "green", lwd=3)
legend('topright', lwd = 2,lty=1, col=c("red","green"), 
       legend=c("SLR (fit1)","MLR (fit2)"))

Higher Order polynomial regression

We could now fit higher order terms (eg. horsepower$^3$, horsepower$^4$, ), however, we run the risk of overfitting. Note the use of poly() for creating higher-order polynomials. For instance, the eight-degree polynomial model (fit8) found below appears to be too wiggly. (This relates back to our bias and variance tradeoff).

fit3 <- lm(mpg~horsepower+I(horsepower^2)+I(horsepower^3), subset = train)
fit4 <- lm(mpg ~ poly(horsepower, degree = 5), data = Auto, subset = train)
fit5 <- lm(mpg ~ poly(horsepower, degree = 8), data = Auto, subset = train)
fit6 <- lm(mpg ~ poly(horsepower, degree = 10), data = Auto, subset = train)

As can be seen with the MSE test calculation below, this curve is not as generalizable as the second-degree polynomial regression model (i.e. fit2 with MSE test = 20.820224)

pred_test3 = predict(fit3, Auto[test,])
pred_test4 = predict(fit4, Auto[test,])
pred_test5 = predict(fit5, Auto[test,])
pred_test6 = predict(fit6, Auto[test,])
mse_test3 = mean((Auto[test,]$mpg - pred_test3)^2)
mse_test4 = mean((Auto[test,]$mpg - pred_test4)^2)
mse_test5 = mean((Auto[test,]$mpg - pred_test5)^2)
mse_test6 = mean((Auto[test,]$mpg - pred_test6)^2)
paste(round(c(mse_test3, mse_test4, mse_test5, mse_test6), 2))

[1] "21.23" "20.9"  "20.9"  "20.92"

As we can see from the MSE test, the highly flexible tenth-degree polynomial regression model (fit6) obtains a lower MSE test than the simpler fifth-degree polynomial model (fit4) indicating that we entering the overfitting territory.

newdata$pred2 <- predict(fit2, newdata)
newdata$pred3 <- predict(fit3, newdata)
newdata$pred4 <- predict(fit4, newdata)
newdata$pred5 <- predict(fit5, newdata)
newdata$pred6 <- predict(fit6, newdata)
plot(mpg ~ horsepower, data = Auto)
lines(newdata$horsepower, newdata$pred2, col = "green", lwd=2)
lines(newdata$horsepower, newdata$pred3, col = "blue", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred4,  col = "purple", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred5,  col = "orange", lwd=2)
lines(newdata$horsepower, newdata$pred6,  col = "red", lwd=2, lty=2)
legend('topright', lwd = 3,lty=c(1, rep(2,4)), col=c("green", "blue", "purple", "orange","red"), 
       legend=c("2nd degree polynomial regression", "5th degree polynomial regression", "8th degree polynomial regression", "10th degree polynomial regression"))

Sticking with fit2, we can now replot our diagnostic plots to investigate assumptions. The first assumption of linearity can be checked in the residual plots. Note that the 3d plotting of this data in R is not as straight forward as their 2d counterparts. But we visualize the linear assumption in MLR with 2 predictors as having points lying approximately on a 2D plane in our 3D space (having x, y, and z variables equal to horsepower, mpg and horsepower$^2$, respectively).

par(mfrow=c(2,2))
plot(fit2)

While there is no more curvature in our residual plots, we are prehaps seeing some evidence of heteroscedasticity In other words, assumption 2 may be violated. In addition, our QQ-plot suggests that the residuals are not normally distributed (we are seeing high amounts of deviation at the extremities).

Outliers and high leverage points

As seen in the Scale-Location graph, it appears there may also been some outliers in our data. Using the ISLR recommendation, we can check the observations whose studentized residuals and observe which ones are greater than 3.

which(rstudent(fit2)>3)

310 330 
 40 148

It appears from the Residuals vs Leverage graph that a number of points could be high leverage points. To investigate the top, say 5, observations having the highest cooks distance, we could do the following:

order(cooks.distance(fit2), decreasing = TRUE)[1:5]

[1]  69 182 124 148   3

Collinearity

Now lets consider a MLR model using more predictors from the Auto data from above. We discussed in lecture how some of these predictor variables might be correlated with one another. To investigate this, lets have a look at a scatterplot matrix which includes all of the variables in the data set using the pairs() function.

pairs(Auto)

We can see that horespower and displacement for example are highly correlated. We might also decide to look at the matrix of correlations between the variables using the function cor().

# remove "name" (since it is not numeric)
cor(Auto[,names(Auto)!="name"])

                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000

Notice that the correlation between horespower and displacement is very high (0.8972570) as is the correlation between cylinders and displacement is very high (0.9508233).

KNN Regression

Below is the example from our KNN Regression example. We will be using the knnreg function from the carret package as a demo. There was some discussion in class about the best way we might choose $K$. In practice the most common way to do this is to use cross-validation. While this will be covered later on in the course, for now, one solution we might just plot the error (e.g. MSE test) as a function of $k$ and select the value of $k$ corresponding to best, i.e. lowest MSE test.

library(caret)

Loading required package: ggplot2


Attaching package: 'ggplot2'

The following object is masked from 'Auto':

    mpg

Loading required package: lattice

train_data <- Auto[train, ]
test_data <- Auto[-train, ] # same as Auto[test, ] 

knn_mse_test = NULL
y = Auto[test,]$mpg
for (k in 2:10){
  kmod <- knnreg(mpg~horsepower, data = Auto, subset = train, k = k)
  yhat = predict(kmod, Auto[test,])
  knn_mse_test[k] = mean((y - yhat)^2)
}

plot(knn_mse_test, type = "b")

In this case, since $k$=4 obtained the smallest MSE test, we would select the KNN model with 4 nearest neighours (which obtained a MSE test of 20.827562). We can plot the fitted model using:

plot(mpg~horsepower, data=Auto[train,], main="KNN Regression")
kmod4 <- knnreg(mpg~horsepower, data = Auto, subset = train, k = 4)
knn_yhat = predict(kmod4, new_data)
lines(new_data$horsepower, knn_yhat, col="red", lwd=2)

Notably, this model still appears to be too “wiggly” and appears to be overfitting to the training set.

Conclusion

In this lab we’ve fitted six variations of the regression model, performed some diagnostic checks and compared results with the non-parameter KNN regression model. Overall, the best model (as determined by the MSE test) is the second order polynomial regression model.

$MSE_{test}$	Model	Fitted R object
27.704	Simple linear regression	`fit1`
20.82	Second order polynomial regression	`fit2`
21.23	Third order polynomial regression	`fit3`
20.903	Fifth order polynomial regression	`fit4`
20.899	Eighth order polynomial regression	`fit5`
20.924	Tenth order polynomial regression	`fit6`
20.828	KNN regression with $k$=4 nearest nighbours	`kmod4`

Discussion Questions

Replace the seed value 2023 with the current year (i.e., set.seed(2025) in Listing 1) and re-run the analysis.
- Did the “best” model (according to the test MSE) change?
- If so, why might that happen?
If the best model can change depending on the random seed, what does that suggest about the reliability of relying on a single train/test split?

🍎 Food for thought 🧠

How might we obtain a more reliable estimate of model performance?
If you can’t think of an answer right away, don’t worry—this will be the topic of a future lecture.

Footnotes

recall that we will include all higher and lower order terms because of the hierarchy principal↩︎