Lab 3

Regression Diagnostics and KNN Regression

Author

Dr. Irene Vrbik

Published

September 26, 2023


Summary This lab will cover how to make predictions, analyze diagnostic plots, identify potential problems in multiple linear regression, and compare multiple regression models using the test MSE. For more details, consult your textbook (ISLR2 Section 3.3.3 and Lab 3.6).


Learning outcomes

By the end of this lab students will be able to:

  • Fit polynomial regression models

  • Assess linear model assumptions using diagnostic plots in R

  • Create a training and testing set

  • Fit a KNN regression model

  • Calculate predictions based on a fitted model

  • Compare various regression models using test MSE


Prediction

Let’s fit a linear regression model to the Auto dataset available in the ISLR2 package.

library(ISLR2)
attach(Auto)
head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500

Let’s begin by splitting the data into a training and testing set. We’ll choose 60% for the training set and 40% for the test set.

# store the index for the training set:
set.seed(2023)
n <- nrow(Auto)

# this will sample 235 numbers from 1 to 392 without replacement
train <- sample(n, round(.6*n))
# this will be the index of samples in our test set
test <- setdiff(1:n, train)
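
As a quick sanity check, we can confirm that the split behaved as expected:

length(train)                   # 235 (60% of the 392 rows)
length(test)                    # 157
length(intersect(train, test))  # 0 -- the two index sets do not overlap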

fit1 attempts to explain gas mileage (mpg) using the engine’s horsepower (horsepower) as the predictor variable.

fit1 <- lm(mpg~horsepower, data = Auto, subset = train)
summary(fit1)

Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.376  -2.911  -0.294   2.545  15.464 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.32363    0.86238   45.60   <2e-16 ***
horsepower  -0.15205    0.00769  -19.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.658 on 233 degrees of freedom
Multiple R-squared:  0.6265,    Adjusted R-squared:  0.6249 
F-statistic: 390.9 on 1 and 233 DF,  p-value: < 2.2e-16

Judging by the very low p-value for \(\beta_1\) (the slope), it appears that there is a relationship between the predictor and the response. Since the sign of the slope is negative (\(\hat\beta_1\) = -0.152), we expect this relationship to be negative; that is, as the horsepower of our engine increases, gas mileage tends to go down. This relationship would also appear to be relatively strong, as indicated by the high \(R^2\) value of 0.6265 (i.e. approximately 63% of the variance in the response is explained by this simple model).

Prediction on the test set

To see how we predict on the test set we could use

pred_test = predict(fit1, Auto[test,])

We can compare the predicted values to the actual values on an individual basis:

head(cbind(pred_test, Auto[test,]$mpg))
  pred_test   
1 19.557151 18
3 16.516154 18
4 16.516154 16
6  9.217761 15
8  6.632914 14
9  5.112416 14

We can calculate the MSE test using:

#  mean((y - yhat)^2)
mse_test = mean((Auto[test,]$mpg - pred_test)^2)
mse_test
[1] 27.70364

Prediction of a single value

To find the predicted mpg associated with a horsepower of, say, 90, we could use the predict function (see ?predict.lm for the help file; note that the newdata argument must be a data frame):

predict(fit1, data.frame("horsepower"=90))
       1 
25.63914 

We could have constructed this prediction “by hand” by obtaining the line of best fit and substituting in 90 for \(x\), but this obviously takes a little more work:

coef(fit1)
(Intercept)  horsepower 
 39.3236312  -0.1520498 
x <- 90
coef(fit1)["(Intercept)"] + coef(fit1)["horsepower"]*x
(Intercept) 
   25.63914 

It is also handy to note that we could see all the fitted \(\hat y_i\) values using the following command:

fitted(fit1)
# (output suppressed to save space)

To see the corresponding residuals, use:

resid(fit1)
# (output suppressed to save space)

Prediction and Confidence Intervals

To get a prediction interval we use:

predict(fit1, data.frame("horsepower"=90), interval="predict")
       fit     lwr      upr
1 25.63914 16.4392 34.83909

To get a confidence interval we use:

predict(fit1, data.frame("horsepower"=90), interval="confidence")
       fit      lwr      upr
1 25.63914 24.99907 26.27922

N.B. When using predict, your new data must be in a data frame and the predictors must be named in the same way as in your training data.
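
For example (hp below is a hypothetical misspelled column name, shown only to illustrate the pitfall):

# works: the column name matches the predictor used to fit fit1
predict(fit1, data.frame(horsepower = c(90, 120)))
# predict(fit1, data.frame(hp = c(90, 120)))
# ^ would NOT behave as intended: newdata contains no column named
#   horsepower, so R cannot match the new values to the predictor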

Prediction and Confidence Bands

To create a band similar to the one we saw in lecture, you will need to create a sequence of plausible x values. For our purposes, I will just take the observed range of horsepower values within our data set.

plot(mpg ~ horsepower, data = Auto)
abline(fit1)

# Create a new data frame for prediction intervals
new_data <- data.frame("horsepower" = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))

# Confidence band
conf_interval <- predict(fit1, newdata=new_data, 
                      interval="confidence", level = 0.95)
head(conf_interval)
       fit      lwr      upr
1 32.32934 31.25414 33.40454
2 32.04674 30.99481 33.09867
3 31.76414 30.73525 32.79304
4 31.48155 30.47542 32.48767
5 31.19895 30.21531 32.18259
6 30.91635 29.95490 31.87780
lines(new_data[,1], conf_interval[,2], col="blue", lty=2)
lines(new_data[,1], conf_interval[,3], col="blue", lty=2)

# Prediction band
pred_interval <- predict(fit1, newdata=new_data,
                         interval="prediction", level = 0.95)
lines(new_data[,1], pred_interval[,2], col="orange", lty=2)
lines(new_data[,1], pred_interval[,3], col="orange", lty=2)

Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for \(f(X)\) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).
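
We can verify this numerically by comparing the two interval widths at horsepower = 90 (using the output above):

conf_90 <- predict(fit1, data.frame("horsepower" = 90), interval = "confidence")
pred_90 <- predict(fit1, data.frame("horsepower" = 90), interval = "prediction")
conf_90[, "upr"] - conf_90[, "lwr"]  # confidence interval width: about 1.28
pred_90[, "upr"] - pred_90[, "lwr"]  # prediction interval width: about 18.40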

Diagnostics

While the conclusions we made above may seem appropriate based on the R output, we would be remiss not to check the assumptions of this model before we jump to any conclusions. Recall that the assumptions for a linear regression model are that:

  1. There exists an (at least approximate) linear relationship between \(Y\) and \(X\).
  2. The distribution of \(\epsilon_i\) has constant variance.
  3. \(\epsilon_i\) is normally distributed.
  4. \(\epsilon_i\) are independent of one another. For example, \(\epsilon_2\) is not affected by \(\epsilon_1\) (an informal check for this assumption is sketched below).

To check the first assumption, we may like to look at the scatterplot with the line of best fit. As demonstrated below, it would appear that the response and predictor variable have a curved relationship rather than a linear one.

plot(mpg~horsepower, data=Auto)
abline(fit1, col="red")

We could also explore the diagnostic plots produced by calling plot() on a fitted lm object. To see them all in one plot, let’s change the panel layout of our plot window to hold 4 figures (in this case in a 2 by 2 matrix; for more details see ?par). In other words, we could call:

par(mfrow=c(2,2))
plot(fit1)

A strong U-shaped pattern can be seen in the residual plot (located in top left corner) of the diagnostic plots below. This U-shape—and in general, any strong pattern—in the residuals indicates non-linearity in the data.
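
Note that these panels speak mainly to assumptions 1–3; assumption 4 (independence) is not directly displayed. One quick, informal check (assuming the row order of Auto is meaningful) is to plot the residuals against observation order and look for tracking or runs:

# informal check of assumption 4: residuals vs observation order
plot(resid(fit1), type = "b", xlab = "observation order", ylab = "residual")
abline(h = 0, lty = 2)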

Polynomial Regression

We can extend the linear model to accommodate non-linearity by transforming predictor variables. For this particular example, the relationship between mpg and horsepower appears to be curved. Consequently, a model which includes a quadratic horsepower term may provide a better fit.

Note that \(\sim\) in R indicates a formula object; for more details see ?formula. In formula mode, ^ is used to specify interactions (e.g. ^2 denotes all second-order interactions of the preceding terms). So in order to use the caret to denote the arithmetic squaring operation, we need to insulate the term using the I() function. Alternatively, we could define a new variable, say hp2, that stores the squared values of horsepower.
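
To see why the I() wrapper matters, compare the two fits below: without I(), the ^2 is consumed by formula mode and we silently get back the simple linear model.

# in formula mode, horsepower^2 collapses to horsepower (there is no
# second term to interact with), so this is just the SLR from before
coef(lm(mpg ~ horsepower^2, data = Auto, subset = train))
# with I(), the squaring is arithmetic and a quadratic term is included
coef(lm(mpg ~ horsepower + I(horsepower^2), data = Auto, subset = train))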

To fit this linear model—and YES, this is a linear model (i.e. a multiple linear regression model with \(X_1 =\) horsepower and \(X_2 =\) horsepower\({}^2\) as predictors)1—we have a choice between the following two set-ups:

# option 1
hp2 <- Auto$horsepower^2
auto2 <- cbind(Auto, hp2)
fit2a <- lm(mpg~horsepower+hp2, data = auto2, subset = train)
summary(fit2a)

Call:
lm(formula = mpg ~ horsepower + hp2, data = auto2, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.219  -2.421  -0.191   2.473  14.078 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.6713582  2.2060649  23.876  < 2e-16 ***
horsepower  -0.3957113  0.0382281 -10.351  < 2e-16 ***
hp2          0.0009721  0.0001499   6.486 5.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared:  0.6839,    Adjusted R-squared:  0.6812 
F-statistic: 250.9 on 2 and 232 DF,  p-value: < 2.2e-16
# option 2
fit2 <- lm(mpg~horsepower+I(horsepower^2), 
           data = auto2, subset = train)
summary(fit2)

Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = auto2, 
    subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.219  -2.421  -0.191   2.473  14.078 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     52.6713582  2.2060649  23.876  < 2e-16 ***
horsepower      -0.3957113  0.0382281 -10.351  < 2e-16 ***
I(horsepower^2)  0.0009721  0.0001499   6.486 5.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.295 on 232 degrees of freedom
Multiple R-squared:  0.6839,    Adjusted R-squared:  0.6812 
F-statistic: 250.9 on 2 and 232 DF,  p-value: < 2.2e-16

Again, we can compute the MSE on the test set using:

pred_test2 = predict(fit2, Auto[test,])
mse_test2 = mean((Auto[test,]$mpg - pred_test2)^2)
mse_test2
[1] 20.82022

Unsurprisingly, this MSE test is lower than the MSE test obtained by the simple linear regression model.

Let’s compare the two fits on the scatterplot. It appears the fit with the squared horsepower term fits the data much better:

# to create a smoother curve lets create a sequence
# of plausible x values based on the range of x's in 
# our data frame:
newdata <- data.frame(horsepower=seq(min(Auto$horsepower), 
                      max(Auto$horsepower), length.out = 1000))
newdata$pred1 <- predict(fit2, newdata)
plot(mpg ~ horsepower, data = Auto)
abline(fit1, col="red", lwd=2)
lines(newdata$horsepower, newdata$pred1, col = "green", lwd=3)
legend('topright', lwd = 2,lty=1, col=c("red","green"), 
       legend=c("SLR (fit1)","MLR (fit2)"))

Higher order polynomial regression

We could now fit higher-order terms (e.g. horsepower\(^3\), horsepower\(^4\), …); however, we run the risk of overfitting. Note the use of poly() for creating higher-order polynomials. For instance, the eighth-degree polynomial model (fit5) found below appears to be too wiggly. (This relates back to our bias and variance tradeoff.)

fit3 <- lm(mpg~horsepower+I(horsepower^2)+I(horsepower^3), data = Auto, subset = train)
fit4 <- lm(mpg ~ poly(horsepower, degree = 5), data = Auto, subset = train)
fit5 <- lm(mpg ~ poly(horsepower, degree = 8), data = Auto, subset = train)
fit6 <- lm(mpg ~ poly(horsepower, degree = 10), data = Auto, subset = train)
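
A side note on poly(): by default it generates orthogonal polynomials, so its coefficients will not match the I(horsepower^2) parameterization, but the fitted values (and hence the MSE test) are identical. Passing raw = TRUE reproduces the raw-polynomial coefficients; a quick check:

fit2_raw <- lm(mpg ~ poly(horsepower, degree = 2, raw = TRUE),
               data = Auto, subset = train)
coef(fit2_raw)  # same estimates as fit2, up to coefficient names
all.equal(unname(fitted(fit2_raw)), unname(fitted(fit2)))  # TRUE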

As can be seen with the MSE test calculations below, this curve is not as generalizable as the second-degree polynomial regression model (i.e. fit2 with MSE test = 20.820224):

pred_test3 = predict(fit3, Auto[test,])
pred_test4 = predict(fit4, Auto[test,])
pred_test5 = predict(fit5, Auto[test,])
pred_test6 = predict(fit6, Auto[test,])
mse_test3 = mean((Auto[test,]$mpg - pred_test3)^2)
mse_test4 = mean((Auto[test,]$mpg - pred_test4)^2)
mse_test5 = mean((Auto[test,]$mpg - pred_test5)^2)
mse_test6 = mean((Auto[test,]$mpg - pred_test6)^2)
paste(round(c(mse_test3, mse_test4, mse_test5, mse_test6), 2))
[1] "21.23" "20.9"  "20.9"  "20.92"

As we can see from the MSE test, the highly flexible tenth-degree polynomial regression model (fit6) obtains a higher MSE test than the simpler fifth-degree polynomial model (fit4), indicating that we are entering overfitting territory.

newdata$pred2 <- predict(fit2, newdata)
newdata$pred3 <- predict(fit3, newdata)
newdata$pred4 <- predict(fit4, newdata)
newdata$pred5 <- predict(fit5, newdata)
newdata$pred6 <- predict(fit6, newdata)
plot(mpg ~ horsepower, data = Auto)
lines(newdata$horsepower, newdata$pred2, col = "green", lwd=2)
lines(newdata$horsepower, newdata$pred3, col = "blue", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred4,  col = "purple", lwd=2, lty=2)
lines(newdata$horsepower, newdata$pred5,  col = "orange", lwd=2)
lines(newdata$horsepower, newdata$pred6,  col = "red", lwd=2, lty=2)
legend('topright', lwd = 2, lty = c(1, 2, 2, 1, 2),
       col = c("green", "blue", "purple", "orange", "red"),
       legend = c("2nd degree polynomial regression", "3rd degree polynomial regression",
                  "5th degree polynomial regression", "8th degree polynomial regression",
                  "10th degree polynomial regression"))

Sticking with fit2, we can now replot our diagnostic plots to investigate the assumptions. The first assumption of linearity can be checked in the residual plot. Note that 3D plotting of this data in R is not as straightforward as its 2D counterpart, but we can visualize the linearity assumption in MLR with 2 predictors as the points lying approximately on a 2D plane in 3D space (with the x, y, and z variables equal to horsepower, mpg, and horsepower\(^2\), respectively).

par(mfrow=c(2,2))
plot(fit2)

While there is no longer any curvature in our residual plot, we are perhaps seeing some evidence of heteroscedasticity. In other words, assumption 2 may be violated. In addition, our QQ-plot suggests that the residuals are not normally distributed (we are seeing large deviations at the extremities).
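
A common remedy for heteroscedasticity (see ISLR2 Section 3.3.3) is to transform the response using a concave function such as \(\log Y\) or \(\sqrt{Y}\). A minimal sketch, not pursued further here:

# refit the quadratic model with a log-transformed response
fit2_log <- lm(log(mpg) ~ horsepower + I(horsepower^2), data = Auto, subset = train)
par(mfrow = c(2, 2))
plot(fit2_log)  # check whether the residual spread is now more even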

Outliers and high leverage points

As seen in the Scale-Location graph, it appears there may also be some outliers in our data. Following the ISLR recommendation, we can compute the studentized residuals and observe which ones are greater than 3.

which(rstudent(fit2)>3)
310 330 
 40 148 

It appears from the Residuals vs Leverage graph that a number of points could be high leverage points. To investigate the top, say, 5 observations having the highest Cook’s distance, we could do the following:

order(cooks.distance(fit2), decreasing = TRUE)[1:5]
[1]  69 182 124 148   3
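
We could also inspect the leverage statistics directly. A minimal sketch, using the common rule of thumb that leverages well above the average value \((p+1)/n\) deserve a closer look:

h <- hatvalues(fit2)                        # leverage of each training observation
avg_lev <- length(coef(fit2)) / length(h)   # average leverage: (p+1)/n
head(sort(h[h > 2 * avg_lev], decreasing = TRUE))  # flag points above twice the average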

Collinearity

Now let’s consider an MLR model using more predictors from the Auto data above. We discussed in lecture how some of these predictor variables might be correlated with one another. To investigate this, let’s have a look at a scatterplot matrix which includes all of the variables in the data set using the pairs() function.

pairs(Auto)

We can see that horsepower and displacement, for example, are highly correlated. We might also decide to look at the matrix of correlations between the variables using the cor() function.

# remove "name" (since it is not numeric)
cor(Auto[,names(Auto)!="name"])
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000

Notice that the correlation between horsepower and displacement is very high (0.8972570), as is the correlation between cylinders and displacement (0.9508233).
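
A more formal way to quantify collinearity is the variance inflation factor, \(VIF_j = 1/(1 - R^2_{X_j \mid X_{-j}})\), where \(R^2_{X_j \mid X_{-j}}\) is the \(R^2\) from regressing \(X_j\) on the other predictors. Below is a minimal by-hand sketch for horsepower in a hypothetical model whose other predictors are cylinders, displacement, and weight; the vif() function in the car package automates this.

# R^2 from regressing horsepower on the remaining predictors
r2_hp <- summary(lm(horsepower ~ cylinders + displacement + weight,
                    data = Auto))$r.squared
1 / (1 - r2_hp)  # VIF for horsepower; values above 5-10 typically signal trouble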

KNN Regression

Below is the KNN regression example from lecture. We will be using the knnreg function from the caret package as a demo. There was some discussion in class about the best way to choose \(K\). In practice, the most common way to do this is cross-validation. While this will be covered later on in the course, for now one solution is to simply plot the error (e.g. MSE test) as a function of \(k\) and select the value of \(k\) corresponding to the best, i.e. lowest, MSE test.

library(caret)
Loading required package: ggplot2

Attaching package: 'ggplot2'
The following object is masked from 'Auto':

    mpg
Loading required package: lattice
train_data <- Auto[train, ]
test_data <- Auto[-train, ] # same as Auto[test, ] 

knn_mse_test = NULL
y = Auto[test,]$mpg
for (k in 2:10){
  kmod <- knnreg(mpg~horsepower, data = Auto, subset = train, k = k)
  yhat = predict(kmod, Auto[test,])
  knn_mse_test[k] = mean((y - yhat)^2)
}

plot(knn_mse_test, type = "b")
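
Rather than reading the best \(k\) off the plot, we can extract it programmatically (note that knn_mse_test[1] is NA because the loop starts at \(k = 2\); which.min() skips NAs):

best_k <- which.min(knn_mse_test)  # the index is k itself, since we stored the MSE at position k
best_k
knn_mse_test[best_k]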

In this case, since \(k\)=4 obtained the smallest MSE test, we would select the KNN model with 4 nearest neighbours (which obtained an MSE test of 20.827562). We can plot the fitted model using:

plot(mpg~horsepower, data=Auto[train,], main="KNN Regression")
kmod4 <- knnreg(mpg~horsepower, data = Auto, subset = train, k = 4)
knn_yhat = predict(kmod4, new_data)
lines(new_data$horsepower, knn_yhat, col="red", lwd=2)  

Notably, this model still appears to be too “wiggly” and appears to be overfitting to the training set.

Conclusion

In this lab we’ve fitted six variations of the linear regression model, performed some diagnostic checks, and compared the results with the non-parametric KNN regression model. Overall, the best model (as determined by the MSE test) is the second-order polynomial regression model.

\(MSE_{test}\)   Model fitted                                    R object
27.703637        Simple linear regression                        fit1
20.820224        Second order polynomial regression              fit2
21.2296584       Third order polynomial regression               fit3
20.9027872       Fifth order polynomial regression               fit4
20.8993936       Eighth order polynomial regression              fit5
20.923951        Tenth order polynomial regression               fit6
20.827562        KNN regression with \(k\)=4 nearest neighbours  kmod4

Footnotes

  1. Recall that whenever we include a higher order term, we also include all lower order terms because of the hierarchy principle.↩︎