Data 311: Machine Learning

Lecture 3: Assessing Regression Models

Dr. Irene Vrbik

University of British Columbia Okanagan

Recap

  • Last lecture we introduced some notation and terminology that we will be using throughout the course.

  • We discussed how different tasks, namely inference vs. prediction, may lead us to favour certain models over others.

  • Today we discuss the topic of model assessment to address the challenging task of model selection

Motivation

  • There is no one-size-fits-all best model.

  • This course aims to introduce you to a small subset of the available machine learning approaches, each with its own set of limitations.

  • Selecting the best approach for a particular problem can be one of the most challenging tasks for a data scientist.

  • Today’s lecture will add to this list of considerations when choosing a model in the supervised regression setting …

Introduction

  • If our data is labelled, we may want to investigate how our predictions for some response/output \(y\) (denoted \(\hat y\) and based on the fitted model \(\hat f\)) compare to the observed values \(y\).

  • Naturally, the closer our predicted values are to the true response value, the better.

  • Our response variable will generally be one of two forms: categorical or numeric. Today we focus on assessing models with a numeric response.

Outline

Statistical Learning Model

Recall our general model for statistical learning \[Y = f(X) + \epsilon\]

  • where \(X\) are our inputs,
  • \(Y\) is the numeric output (at least in this setting)
  • \(\epsilon\) is the error term (independent of \(X\) and with mean 0),
  • \(f\) is the systematic information \(X\) provides about \(Y\).

Goal: find an \(\hat f(X)\) that approximates the true function \(f(X)\) as well as possible.

Workflow

Regression Problem

  • When the response variable is numeric, we fall into the category of regression (the topic for the next three lectures)

  • In the context of regression, the most commonly-used metric for assessing performance is the mean squared error (or MSE)

  • In words, the MSE represents the average squared difference of a prediction \(\hat y = \hat f(x)\) from its true value \(y\).

MSE

\[\begin{align} \text{MSE} &= \frac{1}{n} \sum_{i=1}^n (y_i-\hat y_i)^2 \end{align}\]

where \(\hat f(x_i) = \hat y_i\) is the prediction for the \(i\)th input \(x_i\), and \(y_i\) is the response actually observed (ie “truth”).

In the simple linear regression context, the sum \(\sum_{i=1}^n (y_i - \hat y_i)^2\) was known as the Residual Sum of Squares (or RSS); the MSE is simply RSS\(/n\).
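As a quick illustration, here is a minimal sketch in R of computing the MSE from a vector of observed responses and a vector of predictions (both vectors are hypothetical):

```r
# Hypothetical observed responses and model predictions
y     <- c(3.1, 2.4, 5.0, 4.2)
y_hat <- c(2.9, 2.7, 4.6, 4.5)

# MSE: average squared difference between observed and predicted values
mse <- mean((y - y_hat)^2)
mse
```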

MSE Visualized

Given some data: \((X, Y)\)

Fit a model \(\hat f\) (plotted in orange)

For each \(x_i\) we have a true value \(y_i\), …

… and predicted value \(\hat f(x_i) = \hat y_i\)

We average the squared differences to get MSE

Properties of MSE

\[MSE = \frac{1}{n} \sum_{i=1}^n (y_i-\hat f(x_i))^2 \]

Notice that this has some desirable properties:

  • MSE is small when the predicted values, \(\hat y_i = \hat f(x_i)\), are close to the true responses, \(y_i\)

  • MSE is large when the predicted values, \(\hat y_i = \hat f(x_i)\), are far from the true responses, \(y_i\)

Motivating Question

Question: Would this be a good model?

Splitting

Notation

We denote the training set, i.e. the collection of observations we use to fit our model by:

\[\textbf{X}_{Tr} = \{(x_1, y_1),(x_2, y_2),\dots,(x_n, y_n)\} = \{(x_i, y_i)\}_{i=1}^n\]

We will denote the testing set, i.e. the collection of observations that we keep separate from the fitting process and reserve for assessing the model by:

\[\textbf{X}_{Te} = \{(x_{n+1}, y_{n+1}),\dots,(x_{n+m}, y_{n+m})\} = \{(x_i, y_i)\}_{i=n+1}^{n+m}\]

Model Fitting

  • We might fit \(\hat f\) based on the training data \(\textbf{X}_{Tr}\) and see how closely \(\hat f(x_{n+1}), \dots, \hat f(x_{n+m})\) predict \(y_{n+1}, \dots, y_{n+m}\)

  • Recall that \(\textbf{X}_{Te}\) is not used to train the statistical learning method and has never been “seen” by the algorithm.

Testing vs. Training MSE

When MSE is calculated using \(\textbf{X}_{Tr}\) we call it the training MSE

\[MSE_{Tr} = \frac{1}{n} \sum_{i=1}^n (y_i-\hat f(x_i))^2\]

When MSE is calculated using \(\textbf{X}_{Te}\) we call it the test MSE:

\[MSE_{Te} = \frac{1}{m} \sum_{i=n+1}^{n+m} (y_i-\hat f(x_i))^2\]

MSE train = average squared differences using training data

MSE test = average squared differences using the test data
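The sketch below illustrates, in R, one way these two quantities might be computed after a train/test split. The simulated data, the split, and the linear model are illustrative choices, not the lecture's code:

```r
set.seed(311)                       # seed chosen for reproducibility
n <- 100; m <- 50
x <- runif(n + m, 0, 2 * pi)
y <- sin(x) + rnorm(n + m)
dat <- data.frame(x = x, y = y)

train <- dat[1:n, ]                 # X_Tr: used to fit the model
test  <- dat[(n + 1):(n + m), ]     # X_Te: held out for assessment

fit <- lm(y ~ x, data = train)      # f-hat is fit on the training set only

mse_tr <- mean((train$y - predict(fit, newdata = train))^2)  # training MSE
mse_te <- mean((test$y  - predict(fit, newdata = test))^2)   # test MSE
c(train = mse_tr, test = mse_te)
```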

Goal

  • Plainly speaking, we do not really care how well the method works on the training data.
  • We are interested in the accuracy of predictions on data the method has never seen before (e.g. new patients, future stock prices).

  • \(\text{MSE}_{Te}\) is a metric of how well \(\hat f\) generalizes to unseen data; thus, we want this (rather than \(\text{MSE}_{Tr}\)) to be as low as possible.

Error types

In general, \(\hat f\) will not match \(f\) perfectly and the discrepancy can be broken down into two types of errors…

  1. reducible error which we can reduce by picking a better statistical learning technique; and

  2. irreducible error which we can never improve even if we estimate \(f\) perfectly

Math

Suppose, for notational convenience, we have a training set \(\textbf{X}_{Tr} = S\) that was used to fit our model \(\hat f_S\). Given a test observation \((x, y)\) from \(\textbf{X}_{Te}\), we can write:

\[\begin{align*} E[(y - \hat y)^2] &= E[(y - \hat f_S(x))^2]\\ &= E[(f(x) + \epsilon - \hat f_S(x))^2]\\ &= E[\left((f(x) - \hat f_S(x)) + \epsilon \right)^2]\\ &= E[(f(x) - \hat f_S(x))^2 + 2(f(x) - \hat f_S(x))\epsilon + \epsilon^2]\\ \end{align*}\]

Here \(x\) and \(\hat f_S(x)\) are treated as fixed, and \(f\) is fixed (but unknown).

\[\begin{align*} &= \left(f(x) - \hat f_S(x)\right)^2 + 2\left(f(x) - \hat f_S(x)\right)E[\epsilon] + E[\epsilon^2]\\ &= (f(x) - \hat f_S(x))^2 + E[\epsilon^2] \quad \quad {\scriptsize \text{since $E[\epsilon] = 0$ }}\\ \end{align*}\]

Recall \(\text{Var}(X) = E[X^2] - (E[X])^2\).
Hence \(\text{Var}(\epsilon) = E[\epsilon^2] - (E[\epsilon])^2 = E[\epsilon^2] \text{ (since } E[\epsilon] = 0)\)

\[\begin{align*} &= \underbrace{(f(x) - \hat f_S(x))^2}_{\text{reducible error}} + \underbrace{\text{Var}[\epsilon]}_{\text{irreducible error}}\\ \end{align*}\]

Math (cont’d)

The expected reducible error can be further decomposed into squared bias and variance.

\[\begin{align*} \mathbb{E}\left[\text{reducible error}\right] &=\mathbb{E}\left[ (f(x) - \hat f_S(x))^2 \right]\\ &= \underbrace{ (f(x) - \mathbb{E}[\hat f_S(x)])^2 }_{\text{Bias}^2(\hat f(x))} + \underbrace{ \mathbb{E}\left[ (\hat f_S(x) - \mathbb{E}[\hat f_S(x)])^2 \right]}_{\text{Var}(\hat f(x))} \end{align*}\]

Reducible error

  • The reducible error, as its name suggests, is the portion of the error in a model that can be reduced or eliminated through improvements in the modeling process.

  • Our goal can now be rephrased as minimizing the reducible error (also known as model error) as much as possible, for example by:

    • Using a more sophisticated or appropriate model.
    • Collecting more relevant data or improving data quality.
    • Tuning model hyperparameters for better performance.

Decomposition of MSE

Altogether we have the expected test MSE at a new point \(x\):
\[\begin{align*} \text{E}_{Y|x, S}\left[(y - \hat f_S (x))^2\right] &=\underbrace{\text{Var}(\hat f(x)) + \text{Bias}^2(\hat f(x))}_{\text{reducible error}} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible error}} \end{align*}\] It’s the average \(MSE_{Te}\) we’d get if we repeatedly estimated \(f\) using a large number of training sets, and tested each at \(x\).
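To make the decomposition concrete, here is a simulation sketch in R that estimates each piece at a single test point. It assumes the \(Y = \sin(X) + \epsilon\) generative model used in the simulations later in this lecture; the choice of simple linear regression as \(\hat f\) and of the test point \(x_0\) are illustrative:

```r
set.seed(311)
x0    <- pi / 2     # a single test point (illustrative choice)
sigma <- 1          # sd of the noise, so Var(eps) = 1
B     <- 2000       # number of simulated training sets
n     <- 100        # size of each training set

fhat_x0 <- numeric(B)
for (b in 1:B) {
  x <- runif(n, 0, 2 * pi)
  y <- sin(x) + rnorm(n, 0, sigma)
  fit <- lm(y ~ x)                                   # f-hat for this training set
  fhat_x0[b] <- predict(fit, newdata = data.frame(x = x0))
}

bias2    <- (sin(x0) - mean(fhat_x0))^2   # squared bias of f-hat at x0
variance <- var(fhat_x0)                  # variance of f-hat at x0
irreduc  <- sigma^2                       # irreducible error Var(eps)

# The three pieces should (approximately) sum to the expected test MSE at x0
c(bias2 = bias2, variance = variance, irreducible = irreduc,
  total = bias2 + variance + irreduc)
```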

Irreducible Error

  • The variance of the error term \(\epsilon\) is our irreducible error

  • The irreducible error is a measure of the amount of noise in our data.

  • Even if we estimate \(f\) perfectly, our data will always have some noise that can not be reduced.

Important: no matter how good we make our model, our data will contain a certain amount of noise, or irreducible error, that cannot be removed.

Bias-Variance

Definitions

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

  • High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting)

Variance refers to the amount by which \(\hat f\) would change if we estimated it using a different training data set.

  • High variance may result from an algorithm modeling the random noise in the training data (overfitting).

Target Visualizations

Inspired by Essays by Scott Fortmann-Roe

Target Explained

  • Low Bias (first row): models that approximate the real-life problem well will have low bias (hits will be centered around the bullseye)

  • High Bias (second row): models will systematically be “off the mark”

  • Low variance (first column) indicates that \(\hat f\) would not change much even if we estimated it using a different training data set (so the hits will all be close together)

  • High variance (second column) indicates that \(\hat f\) is very sensitive to small fluctuations in the training set (so the hits will be more spread out)

Simulations

Let’s simulate a training set of size \(n = 100\) by uniformly generating \(X\) values between 0 and \(2\pi\), and generating \(Y\) values according to this formula (\(f(x) = \sin(x)\)): \[\begin{align*} Y &= \text{sin}(X) + \epsilon \end{align*}\] where \(\epsilon \sim \text{Normal}(\mu =0, \sigma = 1)\).
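A minimal sketch of this simulation in R (the seed is an arbitrary choice, added only for reproducibility):

```r
set.seed(1)
n <- 100
x <- runif(n, min = 0, max = 2 * pi)        # X ~ Uniform(0, 2*pi)
y <- sin(x) + rnorm(n, mean = 0, sd = 1)    # Y = sin(X) + eps, eps ~ N(0, 1)
train1 <- data.frame(x = x, y = y)

plot(train1$x, train1$y, xlab = "x", ylab = "y")
curve(sin(x), add = TRUE)                   # the true f, known here by construction
```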

Training set 1

Training set #1: This is one example of a potential training set

Training set 2

Training set #2: Here’s another…

Generative Model

Since we know \(f\), we can plot it as well

Low Variance High Bias

Training set 1

Let’s start by fitting a simple linear regression (SLR) model to training set #1. The resulting fitted line \(\hat f\) is plotted below
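A minimal sketch of this fit in R, assuming the `train1` data frame simulated in the earlier sketch (the plot settings are illustrative):

```r
slr_fit <- lm(y ~ x, data = train1)   # simple linear regression on training set #1

plot(train1$x, train1$y, xlab = "x", ylab = "y")
abline(slr_fit, lwd = 2)              # the fitted line f-hat
```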

Training set 2

Fitting a SLR to training set #2 will produce this (slightly different) fitted line \(\hat f\)

10 training sets

The fitted line for 10 different fits using 10 different training sets.

Low Variance

  • If we do this repeatedly on different training sets simulated from the same generative model described earlier, you will notice that we don’t get very much variation in our fitted model.

  • This model is therefore said to have low variance.

  • That is to say, it is not sensitive to small fluctuations in the training set.

High Bias

High bias refers to a situation where a model makes strong simplifying assumptions about the data, resulting in systematic errors in its predictions.

  • this straight line is not flexible enough to capture the curved relationship between these two variables

Underfitting is a consequence of high bias. It refers to the poor performance of a model on both the training and test data due to its inability to capture the data’s underlying patterns.

Low Bias High Variance

Green Model

  • To demonstrate this concept we’ll use the local polynomial model (using the loess function in R).

  • We will call this the “green model”

  • Don’t worry if you don’t know what this model is; we are unlikely to cover it in this course. Just consider it a relatively flexible model.

  • The span argument controls the level of flexibility. The “green model” fits a local polynomial with a lot of flexibility (see the sketch below).
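A sketch of the “green model” in R, again using the `train1` data frame from the earlier simulation. The lecture does not state the exact span value, so span = 0.1 is an illustrative choice for a highly flexible fit:

```r
green_fit <- loess(y ~ x, data = train1, span = 0.1)   # very flexible local polynomial

x_grid <- seq(min(train1$x), max(train1$x), length.out = 200)
plot(train1$x, train1$y, xlab = "x", ylab = "y")
lines(x_grid, predict(green_fit, newdata = data.frame(x = x_grid)),
      col = "green", lwd = 2)
```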

Training set 1

Fitting a highly flexible loess model to training set #1 will produce this fitted curve \(\hat f\)

Training set 2

Fitting a highly flexible loess model to training set #2 will produce this (very different) fitted curve \(\hat f\)

10 training sets

The fitted model for 10 different fits using 10 different training sets.

High Variance

High Variance models tend to capture noise and random fluctuations in the training data rather than the underlying patterns.

  • This “green” model has high variance since it is very sensitive to small fluctuations in the training set and highly variable from training set to training set.

Models like these are said to be overfitting to the training data. Overfitting tends to result in very low training error and comparatively high testing error (poor generalization).

Low Bias

\[\text{Bias}^2(\hat f(x)) = (f(x) - \mathbb{E}[\hat f_S(x)])^2 \]

  • While small changes in the training set cause the estimate \(\hat f\) to change considerably, on average it captures the general sine-wave nature of this data.

  • Thus the green model has low bias.

  • Even though a single fitted model corresponds too closely to the training data, on average this model is close to the truth.

Low Bias and Low Variance

Blue Model

  • As before, we will use the local polynomial model (via the loess function in R).

  • We will call this the “blue model”

  • We will adjust the span argument to decrease the level of flexibility (see the sketch below).
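A sketch of the “blue model” in R, using the same `train1` data. The lecture’s exact span value is not given; span = 0.75 (the loess default) is an illustrative, less flexible choice:

```r
blue_fit <- loess(y ~ x, data = train1, span = 0.75)   # medium-flexibility local polynomial

x_grid <- seq(min(train1$x), max(train1$x), length.out = 200)
plot(train1$x, train1$y, xlab = "x", ylab = "y")
lines(x_grid, predict(blue_fit, newdata = data.frame(x = x_grid)),
      col = "blue", lwd = 2)
```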

Training set 1

Fitting a loess model with medium flexibility to training set #1 will produce this fitted curve \(\hat f\) plotted below

Training set 2

Fitting a loess model with medium flexibility to training set #2 will produce this (different) fitted curve \(\hat f\)

10 training sets

The fitted model for 10 different fits using 10 different training sets.

Low Variance

  • If we do this repeatedly on different training sets simulated from the same generative model described earlier, you will notice that we don’t get very much variation in our fitted model.

  • This model is therefore said to have low variance.

  • That is to say, it is not sensitive to small fluctuations in the training set.

Low Bias

  • The blue model has low bias.

  • That is, on average the estimate is close to the truth.

  • Unlike the green model (which also has low bias), this model is not overfitting to the data.

  • The blue model strikes a nice balance between low variance and low bias.

Average model across simulations

Let’s explore what the average model looks like for each of these scenarios …
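One way to approximate the average model \(\mathbb{E}[\hat f_S(x)]\) is to refit each model on many simulated training sets and average the predictions over a grid, as in this R sketch (the span values and the number of simulations are illustrative):

```r
set.seed(311)
B <- 200                                             # number of simulated training sets
n <- 100
x_grid <- seq(0.2, 2 * pi - 0.2, length.out = 200)   # stay inside the data range for loess
newd   <- data.frame(x = x_grid)

pred_slr <- pred_green <- pred_blue <- matrix(NA, nrow = B, ncol = length(x_grid))
for (b in 1:B) {
  d <- data.frame(x = runif(n, 0, 2 * pi))
  d$y <- sin(d$x) + rnorm(n)
  pred_slr[b, ]   <- predict(lm(y ~ x, data = d), newdata = newd)
  pred_green[b, ] <- predict(loess(y ~ x, data = d, span = 0.1),  newdata = newd)
  pred_blue[b, ]  <- predict(loess(y ~ x, data = d, span = 0.75), newdata = newd)
}

# Average fitted curve for each model, compared with the true f(x) = sin(x)
plot(x_grid, sin(x_grid), type = "l", lwd = 2, xlab = "x", ylab = "y")
lines(x_grid, colMeans(pred_slr,   na.rm = TRUE), col = "orange", lwd = 2)  # SLR
lines(x_grid, colMeans(pred_green, na.rm = TRUE), col = "green",  lwd = 2)  # green model
lines(x_grid, colMeans(pred_blue,  na.rm = TRUE), col = "blue",   lwd = 2)  # blue model
```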

Goldilocks Principle

Image adapted from MollyMooCrafts

Bias Variance Tradeoff

The bias–variance tradeoff is the conflict in trying to simultaneously minimize these two sources of error, which prevent supervised learning algorithms from generalizing beyond their training set.

  • Since the test MSE comprises both (squared) bias and variance, we want to try to reduce both of them.

  • However, as we decrease bias, we tend to increase variance and vice versa.

Looking Forward

  • To manage the bias–variance tradeoff effectively, you can employ techniques such as cross-validation, regularization, feature selection, and ensemble methods (e.g., bagging and boosting), all of which will be discussed throughout the course.

  • These techniques help you find the optimal model complexity and reduce overfitting or underfitting, ultimately improving a model’s generalization performance.

ISLR Simulations

Section 2.2.2

  • The following simulations are taken from your ISLR2 textbook.

  • Looking at a variety of different data sets, we can gain some general insights on \(\text{MSE}_{Te}\) and \(\text{MSE}_{Tr}\)

  • Again, the particulars about the specific models (linear regression and smoothing splines) are not important.

  • What is important is to understand how the level of flexibility impacts the bias and variance of the models (and therefore the MSE).

Figure 2.9 (left plot)

ISLR Figure 2.9 (left plot): Data simulated from \(f\), shown in black. Three estimates of \(f\) are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves).

Figure 2.9 (left plot).

Orange curve

  • High bias, low variance

  • Low Variance: the orange fit would not have much variability from training set to training set

  • High Bias: it systematically underestimates between 40–80 and overestimates towards the boundaries, for example.

Figure 2.9 (left plot).

Green curve

  • Low bias, high variance

  • As the green curve is the most flexible, it matches the training data very closely

  • However, it is much more wiggly than \(f\) (ie the “true” generating black curve)

Figure 2.9 (left plot).

Blue curve

  • Low bias, low variance

  • The blue curve strikes the balance between low variance and low bias

  • As one may expect, the average fitted curve is quite similar to \(f\) (ie the “true” generating black curve)

Estimated test MSE

  • To estimate the expected \(\text{MSE}_{Tr}\) and \(\text{MSE}_{Te}\), we average their values over a very large collection of data sets.

  • The average training MSE and testing MSE is plotted in gray and red, respectively, as a function of flexibility.

  • The flexibility of these models is mapped to numeric values on the \(x\)-axis (lower values indicate less flexibility).

  • Squares (🟧, 🟦, 🟩) indicate the MSEs associated with the corresponding orange, blue, and green models.

Figure 2.9 (right)

ISLR Figure 2.9 (right plot): Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

Test MSE

The orange and green models have high \(\text{MSE}_{Te}\) but for different reasons

  • orange is underfitting
  • green is overfitting

Blue is close to optimal

Minimizing test MSE

  • The horizontal dashed line indicates \(\text{Var}(\epsilon)\), the irreducible error

  • This line corresponds to the lowest achievable test MSE among all possible methods.

  • Hence, the “blue model” is close to optimal.

Training MSE

The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel

Training MSE

The orange curve has the highest training MSE of all three methods, since even on the training set, it is not flexible enough to approximate the underlying relationship

Training MSE

The blue curve obtains a similar training and testing MSE.