DATA 311 – Lecture 8: Cross Validation and Data Splitting

Introduction

# README
# 
# Irene, you should redo the pictures so that it is clear that the test set is still reserved when we do cross-validation

# GO back to the old pictures /lecture
# 1. train/test (bad cuz we peak)
# 2. train/test/validation (bad cuz only one example)
# 3. train(with cv)/test more stable
#
# k-fold more stable than loocv
#
# use something other than the cars data, it's too weird with it's discrete X
# and its not big enough to split into train and test first.
#
# we have a hacky way to estimate standard error of this estimate, but we'll get a better way to do this next class (bootsrap!)

In this lecture we’ll be looking at a resampling method called cross validation (or CV for short)
This technique will be used for validating/accessing our model by providing an estimate of model performance on unseen data (i.e. not used in training).
By the end of this lecture, you should understand the difference between training, validation, and test datasets, the need for these splits, and how to implement CV.

Metrics for Model Assessment

For regression we aim to reduce the test MSE \[MSE = \frac{1}{m} \sum_{i=1}^m (y_i-\hat y_i)^2\]

For classification we may consider the test error rate \[\frac{1}{m} \sum_{i=1}^m I(y_i \neq \hat y_i)\]

Where \(1, \dots, m\) denotes unseen test observations that not used in the model-fitting process.

Recall

Ideally, we could fit our model on a sample of data, and then compare the predictions from the estimated model on a very large data set (not used in fitting) to estimate the MSE¹.
If we only have one sample, we often split it into a training set (used to fit the model) and testing set (used to estimate the test MSE).
This train/test split approach is sometimes referred to validation approach.

Training vs Test-set Performance

Recall: training error rate often is quite different from the test error rate [Image source]

Validation Approach

Train/Test Split

set.seed(1234) # for reproducibility (eg. student number)
N = 200        # total number of obs. in our data set

# Randomly sample 80%/20% for training/testing
tr.ind <- sample(N, 0.8*N, replace=FALSE)
te.ind <- setdiff(1:N, tr.ind)

tr.ind

2, 3, 4, 6, 7, 8, 9, 10, 11, 14, …

te.ind

1, 5, 12, 13, 15, 18, 24, 31, …

In this case, we fit the model with samples with index tr.ind and estimate the MSE using samples with index te.ind

Validation Approach

Train/Test Split

set.seed(67)   # for reproducibility (eg. student number)
N = 200        # total number of obs. in our data set

# Randomly sample 80%/20% for training/testing
tr.ind <- sample(N, 0.8*N, replace=FALSE)
te.ind <- setdiff(1:N, tr.ind)

tr.ind

1, 2, 3, 4, 6, 7, 8, 9, 11, 12, …

te.ind

5, 10, 21, 27, 28, 30, 34, 40, …

If we use a different seed, we get a different split.

Validation Approach Schematic

A rectangle divided into two parts: a red part (representing the 80% of training data) and a blue part (representing the 20% of testing/validation data.

A schematic display of the validation set approach. A set of N observations are randomly split into training set (shown in red, containing \(n\) observations 2,3,4,6,7,8,9,10,11,14,…, among others) and a validation set (shown in blue, containing \(m\) observation 1, 5, 12, 13, 15, 18, 24, 31, 33, among others). The ML method is fit on the training set, and its performance is evaluated on the validation set.

Can you think of any limitations/issues with this this method?

ISLR Train Test Figure

jump forward to cross-validation approach

ILSR Fig 5.2 Compares linear vs. higher-order polynomial terms in Linear Regression Left: Validation error estimates for a single split into training (196 obs.) and validation (196 obs.) sets. Right: The validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE from this approach applied to the Auto data set.

Validation Approach: some problems

High Variance: Results depend heavily on how the data is split.
Inefficient Use of Data: Less data is available for training, especially with small datasets.
Single Evaluation: One run may not reflect true generalization.
Data leakage: the model “learns” from data it shouldn’t have access to during training which may lead to overly optimistic performance estimates
Overfitting Risk: The model may overfit to the validation set.

Since the model is evaluated on a single validation set, the performance estimate can vary significantly depending on how the data is split. Different random splits may yield different results, leading to unreliable estimates of model performance.
- If the dataset is small or not well-balanced, a single unlucky split can make the model look much better or worse than it really is.
By splitting the data into just a training and validation set, you’re not using all available data for training. This is especially problematic with small datasets, where reducing the size of the training set can hurt model performance.
- even wehn you are just performing cv just on your training set, you make fuller use of your training data:
- you train and validate on multiple partitions instead of holding back a single validation set. — points are used both in training and validating. e.g. Each observation serves as validation once and as training four times.
The lack of a separate test set means the validation set is being used for both tuning and evaluation. This can lead to biased performance estimates, as the model may perform well on the validation set but poorly on new, unseen data.
You get only one point of evaluation for the model. This single validation score might not reflect the true performance, as it could be too high or too low purely by chance.
As the model is tuned using the validation set, there is a risk that the model may overfit to this specific data subset, making it less effective on truly unseen data.
- You’re tailoring your model choice to the idiosyncrasies (noise) of one specific subset of data.

(Goodhart’s law) is an adage often stated as, “When a measure becomes a target, it ceases to be a good measure”.

Overall, these issues make the validation set approach less reliable compared to more robust methods like cross-validation or the train-validation-test split.

Multiple are needed to ensure that differences between models are not due to random variation in the test data.

Cross-Validation: The Solution

k-fold cross-validation

Split data into k folds.
Train on k−1 folds, validate on the remaining one.
Repeat k times and average errors.

K-fold Cross-Validation

Training\(_1\)

valid.\(_1\)

Training\(_2\)

valid.\(_2\)

Train\(_2\)

Training\(_3\)

valid.\(_3\)

Training\(_3\)

Training\(_4\)

valid.\(_4\)

Training\(_4\)

Training\(_5\)

valid.\(_5\)

Training\(_5\)

Training\(_6\)

valid.\(_6\)

Training\(_6\)

Train\(_7\)

valid.\(_7\)

Training\(_7\)

valid.\(_8\)

Training\(_8\)

we do not want to evaluate the candidate models once. Instead, we prefer to evaluate each model multiple times with different dataset and take the average score for our decision.

So in this case, we’d have 8 different runs, and each data point will have a turn at belonging to the validation set.

The trick of \(k\)-fold (k = 8 here) to resample the same dataset multiple times and pretend they are different.

K Iterations (K-Folds): The dataset is divided into k equal-sized “folds” (in this example, k = 8). Each fold is used as a validation set once, while the remaining folds are combined to form the training set.

1st iteration: The first fold (in blue) is used as the validation set, while the remaining folds are used for training the model. The performance of the model on this validation fold is recorded as Performance₁.
2nd iteration: The second fold is used as the validation set, and the others for training. This gives Performance₂.
The process repeats for all folds (3rd, 4th, … 8th), each time using a different fold as the validation set, while the rest are used for training.

Key Takeaways:

Each fold gets a chance to be used as both training and validation data, which provides a better estimate of the model’s ability to generalize to unseen data.
This method helps reduce bias from a single split and variance by considering multiple data splits.

Steps of K-fold CV

Step 0: Divide dataset into \(k\) equal-sized “folds”; each fold is used as a validation set once, while the remaining folds form the training set

1st iteration: The first fold (in blue) is used as validation set\(_1\), while the remaining folds are used for training the model. The performance of the model on this validation fold is recorded as Performance₁.
2nd iteration: The second fold is used as validation set\(_2\), and the others for training. This gives Performance₂.

The process repeats for all \(k\) folds, each time using a different fold as the validation set, while the rest are used for training.

CV Performance

After completing all \(k\) iterations, the performances from each fold are averaged to get a more reliable estimate of the model’s performance across different subsets of the data. Source of image: [Zitao’s Web]

Typical Workflow

Split data into training and test sets.
Use cross-validation on training data to:
- Compare models (e.g., logistic vs KNN vs tree)
- Tune hyperparameters (e.g., \(k\), \(\lambda\), depth)
Choose the model with lowest CV error.
Retrain on all training data.

Final model testing

Once we have refitted the selected model on the non-test examples, we use the test dataset for unbiased final evaluation (e.g. \(MSE_{test}\) or test error)
The test dataset, untouched until now, estimates model performance on unseen data and checks for overfitting¹.

The test dataset used for evaluation is independent from the validation sets we used in cross validation.

Summary of steps

Training Phase: Use Resampling Methods to obtain multiple training sets to train candidate models.
Model Evaluation: Instead of evaluating candidates once, evaluate each model multiple times with different validation sets.
Model Selection: Choose the best performing model based on the average validation scores.
Retrain: Combine training and validation sets (non-test set) for final training.
Final Model Testing: Evaluate the model’s performance on the test dataset

Resampling Methods

\(k\)-fold cross-validation (CV) is one of the most commonly used resampling methods
Resampling methods involve repeatedly drawing samples from a dataset and training the model on different subsets of data to evaluate its performance.
Another popular choose is Leave-one-out CV
Next lecture we’ll discuss another resampling method called the bootstrap

Leave One Out Cross-Validation

Leave One Out Cross-Validation (LOOCV) is a special case of \(k\)-fold CV; namely, where \(k\) equals the number of data points in the dataset.

The dataset is split into \(n\) folds, where \(n\) is the number of data points.
In each iteration, 1 data point is used as the validation set, and the remaining \(n-1\) data points are used as the training set.

LOOCV approach

Full sample index (1, 2, … , 200) is systematically divided into:

	Index
CV Set	Training	Validation*
1	2, 3, 4, ..., 200	1
2	1, 3, 4, ..., 200	2
3	1, 2, 4, ..., 200	3
...	...	...
200	1, 2, 3, ..., 199	200

LOOCV Schematic

A visualization of LOOCV. Each row represents one LOOCV training/validation set. Observations are depicted as areas on a rectangle. Red areas are our training sets (made up of all but one observation), blue areas correspond to the validation set comprised of the single observation that was left out of our training set.

A schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in red) containing all but one observation, and a validation set that contains only that observation (shown in blue). The test error is then estimated by averaging the n resulting MSE’s. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.

LOOCV Steps

Steps:

Create \(n\) training/validation sets, then for each set:

Fit the model to the training set
Use the fitted model to predict the responses for the (single) observation in the validation set.
Record the validation set error.

The final performance is the average of the performance scores from each iteration.

CV Estimation

If our response is quantitative, our validation set error can be assessed using the MSE.
For LOOCV the MSE corresponds to a single left out observation \(x_i\) will be defined as \(MSE_i = (y_i-\hat y_i)^2\) where \(\hat y_i\) is the prediction found from the \(i^{th}\) fitted model.
We can then estimate the MSE of the model using \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^n MSE_i\]

LOOCV Steps Visualized

Training Set	Prediction	Metric \(MSE_i\)
2,3,4, …, 199, 200	\(\hat y_1 = \hat f_{1}(x_1)\)	\((y_1 - \hat y_1)^2\)
1,3,4, …, 199, 200	\(\hat y_2 = \hat f_{2}(x_2)\)	\((y_2 - \hat y_2)^2\)
1,2,4, …, 199, 200	\(\hat y_3 = \hat f_{3}(x_3)\)	\((y_3 - \hat y_3)^2\)
\(\vdots\)	\(\vdots\)	\(\vdots\)
1,2,3, …, 198, 199	\(\hat y_{200} = \hat f_{200}(x_{200})\)	\((y_{200} -\hat y_{200})^2\)

Pros to LOOCV

Less bias: LOOCV uses almost all of the data to fit the model whereas the validation approach may only use 50% (which tends to lead to an overestimate of the test error).
Deterministic estimates: LOOCV always gives the same estimate because the partitions are not chosen randomly.
Very general: LOOCV¹ can be used with any kind of predictive modeling; logistic regression, LDA, you name it!

Cons to LOOCV

Computationally demanding¹: LOOCV requires the model to be fitted \(n\) times which can be very costly if \(n\) is large and/or the model is slow to fit.
More variance since LOOCV estimates are highly correlated² \[\begin{align} \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\cdot\text{Cov}(X,Y) \end{align}\]

Comments on k-Fold CV

As mentioned earlier, \(k\)-fold CV is more general than LOOCV (with the added benefit of less variance)
While \(k\) can be anything from 1 to \(n\), typically we use 5- or 10-fold CV.
While we try to subdivide the data into equally-sized non-overlapping sets, sometimes that is impossible so we approximate¹
Each set (or “fold”) takes a turn as the validation set, with the remainder of the data used to train the model.
We usually shuffle the data before making the folds

5-fold Visualization

A visualization of 5-fold cross validation (CV). There are 5 rows, each row represents one CV training/validation set. Observations are depicted as areas on a rectangle. Envision that we fold the rectangle into 5 equal(ish) parts. In the first row, the first 4 rectangles are used for training and the 5th is used for testing/validating. In the second row, the first, second, third, and fifth rectangles are used for training and the 4th is used for testing/validating. In the third row, the first, second, foruth, and fifth rectangles are used for training and the third is used for testing/validating. In the forth row, the first, third, fourth and fifth rectangles are used for training and the second is used for testing/validating. In the firth row, the second, third, fourth and fifth rectangles are used for training and the first rectangle is used for testing/validating.

A schematic display of 5-fold CV. A set of n = 20 observations is randomly split into five non-overlapping groups (or folds). Each of these folds takes a turn a validation set (shown in blue), and the remainder as a training set (shown in red). The test error is estimated by averaging the five resulting MSE estimates.

knitr::include_graphics("img/11/k-fold-rearranged.png")

Same image as on previous slide with the observations rearranged such that all of the validation sets appear in the right-most column.

\(k\)-Fold split

Full Sample Index: 1, 2, … , 20, with \(k = 5\)

Training Set	Validation*
1, 2, 4, 5, 6, 8, 9, 11, 13, 14, 15, 16, 17, 18, 19, 20	3, 7, 10, 12
1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 20	2, 4, 14, 19
1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 19, 20	5, 13, 16, 18
1, 2, 3, 4, 5, 7, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20	6, 8, 9, 15
2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19	1, 11, 17, 20

5-fold CV Calculations

Where \(\hat f_j\) is the model fitted using all observations *not* in validation set (i.e. \(i \notin V_j\)).
Validation Set \(V\)	Prediction	Metric \(MSE_j\)
\(V_1\) = 3, 7, 10, 12	\(\hat f_1 (x_3, x_7, x_{10}, x_{12})\)	\(\frac{1}{4}\sum_{i\in V_1} (y_i - \hat y_i)^2\)
\(V_2\) = 2, 4, 14, 19	\(\hat f_2 (x_2, x_4, x_{14}, x_{19})\)	\(\frac{1}{4}\sum_{i\in V_2} (y_i - \hat y_i)^2\)
\(V_3\) =5, 13, 16, 18	\(\hat f_3 (x_5, x_{13}, x_{16}, x_{18})\)	\(\frac{1}{4}\sum_{i\in V_3} (y_i - \hat y_i)^2\)
\(V_4\) =6, 8, 9, 15	\(\hat f_4 (x_6, x_8, x_{9}, x_{15})\)	\(\frac{1}{4}\sum_{i\in V_4} (y_i - \hat y_i)^2\)
\(V_5\) =1, 11, 17, 20	\(\hat f_5 (x_1, x_{11}, x_{17}, x_{20})\)	\(\frac{1}{4}\sum_{i\in V_5} (y_i - \hat y_i)^2\)

\[\begin{align} \text{CV}_{(5)} = \frac{1}{5} \sum_{j=1}^5 MSE_j \end{align}\]

\(k\)-Fold Estimation

More generally, we can calculate the MSE for each validation set \(j\), for \(j = 1, \dots, k\) \[MSE_j = \frac{1}{n_j} \sum_{i \in j}(y_i-\hat y_i)^2\] where \(n_j\) is the number of observations in the \(j\)th validation set. A cross-validated estimate the MSE given by \[ CV_{(k)} = \frac{1}{k}\sum_{j=1}^k MSE_j \]

ISLR Cross Validation Figure

jump back to validation approach

ILSR Fig 5.4 Cross-validation was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Left: The LOOCV error curve. Right: 10-fold CV was run nine separate times, each with a different random split of the data into ten parts. The figure shows the nine slightly different CV error curves. These are much less variable than the Validation Approach

\(k\)-fold vs. LOOCV

Pros of \(k\)-fold

Less variance than LOOCV
More computationally feasible than LOOCV (\(k\) vs. \(n\) model fits)
Often gives more accurate estimates of test error than LOOCV.

Cons of \(k\)-fold

more bias than LOOCV when estimating MSE
Non deterministic estimate of the MSE (randomly spliting the data into \(k\) folds)

Pros 1. Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV. 3. this has to do with bias-var tradeoff

why (next slide?) Bias-Variance Tradeoff: LOOCV has lower bias but higher variance, whereas k-fold CV strikes a better balance between bias and variance.

Cons 1. k-fold is trained using less data so more bias (since it uses less data) (LOOCV is almost unbiased becuase it uses almost all available data in our training set)

In practice, k-fold CV (usually k = 5 or 10) is preferred, because:

It achieves a better bias–variance balance.
It is computationally cheaper (fit only k models, not n).
It gives a more stable estimate of test error.

Bias-Variance trade-off

A weighing scale depicting the fact that k-fold is more bias than LOOCV

A weighing scale depicting the fact that k-fold has less variance than LOOCV

LOOCV will have lower bias than does \(k\)-fold CV with \(k < n\), but LOOCV has higher variance.

k-fold or LOOCV?

k-fold or LOOCV

k-fold or LOOCV?

~~k-fold or LOOCV?~~ ¹ What value of \(k\) is best?

The ISLR authors argue that for intermediate values of \(k\), say 5 or 10, \(k\)-fold CV provides a better estimate of test error rate than LOOCV²
In other words, the decrease in variance tends to outweigh the increase in bias when we chose 5/10-fold validation over LOOCV.

CV and Classification (LOO)

CV (either \(k\)-fold or LOO) can be applied easily in a classification context as well.
For instance, in LOOCV we can calculate cross-validated misclassification rates by replacing \(MSE_i\) with the validation misclassification rate (aka error rate \(\text{Err}_i\)) : \[ CV_{(n)} = \frac{1}{n} \sum_i^n \text{Err}_i = \frac{1}{n} \sum_i^n I(y_i \neq \hat y_i) \] where \(I(y_i \neq \hat y_i)\) is 1 if the left-out observation was misclassified using the \(i\)th model, and 0 otherwise.

CV and Classification (k-fold)

We divide the data into \(k\) roughly equal-sized parts \(C_1, C_2, \ldots, C_k\) where \(n_j\) is the number of observations in part \(j\): if \(n\) is a multiple of \(k\), then \(n_j = n/k\).

\[\begin{align} MC_j &= \frac{1}{n_j} \sum_{i \in j} I(y_i \neq \hat y_i) \\ CV_{(k)} &= \sum_{j=1}^k \frac{n_j}{n} MC_j \end{align}\]

Model Selection

We have just seen how we could use these methods to to assess model performance.
Now let’s switch gears and see how these methods can be used for model selection.
Now we can use cross-validation (CV) to compare and select the best-performing model among several candidates.
In this case, we use CV to choose which value of k¹ for KNN performs the best in terms of generalization.

iClicker

iClicker:

What is the primary purpose of the training dataset in machine learning?

To tune the model’s hyperparameters
To evaluate the final performance of the model
To train the model on a given dataset
To prevent overfitting during training

iClicker

iClicker:

What is the key difference between a validation set and a test set?

The validation set is used for final model evaluation
The test set is used for hyperparameter tuning
The validation set is used for hyperparameter tuning, while the test set is for evaluating generalization
The validation set is larger than the test set

iClicker

iClicker:

Which of the following best describes the concept of data leakage

Including irrelevant features in the training set
Sharing test data with the training data
Training on a small portion of data
Using a non-random dataset split

Cars Example

The cars data give the speed of 1920s cars (speed) and the distances taken to stop (dist).

Scatterplot of cars data with speed on the x-axis and distance on the y-axis. These variables are positively correlated

KNN (from FNN package)

library(FNN)
knn.reg(train, test = NULL, y, k)

train matrix or data frame of training set cases.
test matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case. If not supplied, LOOCV will be done.
y reponse of each observation in the training set.
k number of neighbours considered.

Cross-validated estimates

We fit all possible k values (1 to 49 - the number of samples)
If we don’t supply a test test, knn.reg performs LOOCV
We plot the cross-validated estimates of MSE across all values of k each calculated using: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^n MSE_i\]

`k` vs MSE plot

library(FNN)
data(cars, package="datasets")
attach(cars)
kr <- list()
predmse <- NA
for(i in 1:49){
  # LOOCV (default if no test set is supplied)
  kr[[i]] <- knn.reg(speed, test=NULL, dist, k=i)
  predmse[i] <- kr[[i]]$PRESS/nrow(cars) # LOOCV estimate for k = i
}
par(mar = c(4, 4.2, 1.1, 0.1))
plot(predmse, type="b", lwd=3, col="red", 
     ylab = expression(CV[(n)]), 
     xlab = "k (nearest neighbours)",
     main ="LOOCV MSE Estimates")

`k` vs MSE plot

Zoomed in

The minimum of that plot suggests that the best value of k is at 2.

`k` = 1

Plot speed (x-axis) and distance (y-axis) for the 1920 cars data. The points are positively correlated. The fit superimposed on the scatterplot corresponds to knn regression with k = 1. It is a connect-the-dots model. Hence it is overfitting.

k = 10

`k` = 20

`k` = 2 (winner)

Cars: using 10-fold

set.seed(123)  # for reproducibility

# We'll predict dist from speed
X <- cars$speed
y <- cars$dist
n <- length(y)

# number of folds
Kfold <- 10

# create fold assignments (roughly equal sizes)
fold_id <- sample(rep(1:Kfold, length.out = n))

# function to get 10-fold CV MSE for a single k
cv_mse_for_k <- function(k_val) {
  sq_errs <- numeric(n)  # store squared errors for each point
  
  for (fold in 1:Kfold) {
    test_idx  <- which(fold_id == fold)      # indices in this fold
    train_idx <- setdiff(1:n, test_idx)      # everything else

    # training data for this fold
    train_X <- matrix(X[train_idx], ncol = 1)
    train_y <- y[train_idx]

    # test data for this fold
    test_X <- matrix(X[test_idx], ncol = 1)

    # fit KNN regression using training fold, predict on test fold
    pred <- knn.reg(train = train_X,
                    test  = test_X,
                    y     = train_y,
                    k     = k_val)$pred

    # store squared errors for this fold's test points
    sq_errs[test_idx] <- (y[test_idx] - pred)^2
  }

  # average MSE across all points
  mean(sq_errs)
}

# try a grid of k values
k_grid <- 1:15  # (you can extend if you like)
cv_mse <- sapply(k_grid, cv_mse_for_k)

plot(data.frame(k = k_grid, mse = cv_mse), type="b")

`k` = 5 (winner)

Comments

LOOCV chose k = 2, but the fitted curve looked too wiggly;
a clear sign of overfitting.

LOOCV refits the model n times (once per obs).
Each training set differs by only one point →
estimates are highly correlated, leading to high variance.

10-fold CV chose k = 5, producing a smoother, more generalizable fit.

10-Fold CV uses larger, distinct test folds →
less correlation, lower variance, and more stable error estimates.

Comparing Models

Not only can we select k using CV, but we can furthermore compare that best KNNreg model’s predictive performance with any other potential model.
Since CV estimates the MSE, there’s no reason we cannot provide that estimate in the context of, say, simple linear regression…

Cars example (regression)

Car data, simple linear regression on full sample

cars.line <- lm(dist~speed)
summary(cars.line)

...
Call:
lm(formula = dist ~ speed)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
...

Regression MSE estimate

Under the inferential assumptions for SLR, we have a theoretical (unbiased) estimate of the test MSE via \[ \hat{\text{MSE}} = \frac{\text{RSS}}{n-2} = \frac{\sum (y_i - f(x_i))^2}{n-2}\]
For the cars data, that gives us: \(\frac{11353.52}{48}= 236.53\)
In this case k = 2 KNN model gets a CV MSE estimate of 178.54 hence is suggested as the better model.

LOOCV Regression Shortcut

While we could compare these estimates of MSE directly, we could go one step further and get the CV estimate for MSE by applying LOOCV to the linear model!
Notably, there is a mathematical shortcut for LOOCV with the least squares linear or polynomial regression that calculate \(CV_{(n)}\) using a single fit!
The details of this shortcut are discussed in Section 5.1.2 of the ILSR textbook (we won’t go over them here).

LOO Regression Animation

LOO regression fits

Plots the \(n\) (highly correlated) LOOCV regression fits produce a cross-validated estimates MSE = \(CV_{(n)} = 246.405416\)

5-fold CV Animation

Scatter plot of speed vs distance for the 1920s car data. An annimation of 5-fold CV on 50 observations. Each image shows which 10 observations were left out of the fit.

5-fold estimates MSE as \(CV_{(5)} = 254.0391118\)

Scatter plot of speed vs distance for the 1920s car data. Superimpossed are the 10 fits produced using 5-fold CV on 50 observations.

10-fold CV Animation

Scatter plot of speed vs distance for the 1920s car data. An annimation of 10-fold CV on 50 observations. Each image shows which 5 observations were left out of the fit.

10-fold estimates MSE as \(CV_{(10)} = 243.7798735\)

Scatter plot of speed vs distance for the 1920s car data. Superimpossed are the 5 fits produced using 10-fold CV on 50 observations.

Summary

\(k\)	\(CV_{(k)}\)
5	254.04
10	243.78
\(n\) (LOOCV)	246.41

The LOOCV 5-fol and 10-fold CV fits described on the previous three slide all plotted in a single row.

Wine example

We now consider the wine data set from the pgmm package
27 measurements on 178 samples of red wine.
Samples originate from the same region of Italy (Piedmont), but are of different varietals (Barolo, Barbera, Grignolino).
We can do KNN classification, using LOOCV¹ to choose k, the number of neighbours…

Best `k`

In this case the minimum is at k = 8

Wine Example

We can provide a cross-validated classification table,

	Predicted
	1	2	3
Barolo	52	3	2
Grignolino	2	59	11
Barbera	5	9	35

or LOOCV Error …

\[ \begin{align} CV_{(n)} &= \frac{1}{n} \sum_i^n \text{Err}_i\\ & = \frac{1}{n} \sum_i^n I(y_i \neq \hat y_i)\\ & = \frac{32}{178} = 0.18 \end{align} \]

… or cross-validated versions of any of our other classification performance indices exist!

Standard Errors

More generally: \(CV_K = \sum_{k=1}^K \frac{n_k}{n} \text{Err}_k\)
In addition to computing \(CV_K\) estimate of the error rate, we can also estimate standard deviation of \(CV_K\).
The estimated standard deviation of \(CV_K\) is given by:

\[\begin{equation} \hat{\text{SE}}(CV_K) = \sqrt{ \sum_{k=1}^{K} \left(\text{Err}_k - \overline{\text{Err}_k}\right)^2/ (K-1)} \end{equation}\]

Final Remark

Now we have \(n\) different fits of the model, which model should we report?

None of them

While CV is used to help select a model, the actual reported model would be the one run with all the data.
That is, we would need to run knn() with k = 8 on the full data set with no observations removed.

Lecture 8: Cross Validation and Data Splitting

Introduction

Metrics for Model Assessment

Recall

Training vs Test-set Performance

Validation Approach

Train/Test Split

Validation Approach

Train/Test Split

Validation Approach Schematic

ISLR Train Test Figure

Validation Approach: some problems

Cross-Validation: The Solution

K-fold Cross-Validation

Steps of K-fold CV

CV Performance

Typical Workflow

Final model testing

Summary of steps

Resampling Methods

Leave One Out Cross-Validation

LOOCV approach

LOOCV Schematic

LOOCV Steps

Steps:

CV Estimation

LOOCV Steps Visualized

Pros to LOOCV

Cons to LOOCV

Comments on k-Fold CV

5-fold Visualization

\(k\)-Fold split

5-fold CV Calculations

\(k\)-Fold Estimation

ISLR Cross Validation Figure

\(k\)-fold vs. LOOCV

Pros of \(k\)-fold

Cons of \(k\)-fold

Bias-Variance trade-off

k-fold or LOOCV?

k-fold or LOOCV?

CV and Classification (LOO)

CV and Classification (k-fold)

Model Selection

iClicker

iClicker

iClicker

Cars Example

KNN (from FNN package)

Cross-validated estimates

k vs MSE plot

k vs MSE plot

Zoomed in

k = 1

k = 10

k = 20

k = 2 (winner)

Cars: using 10-fold

k = 5 (winner)

Comments

Comparing Models

Cars example (regression)

Regression MSE estimate

LOOCV Regression Shortcut

LOO Regression Animation

LOO regression fits

5-fold CV Animation

10-fold CV Animation

Summary

Wine example

Best k

Wine Example

Standard Errors

Final Remark

`k` vs MSE plot

`k` vs MSE plot

`k` = 1

`k` = 20

`k` = 2 (winner)

`k` = 5 (winner)

Best `k`