stat205 – Bootstrap CIs

Introduction

To do statistical inference, we need the sampling distribution of a statistic.
Already we’ve seen how parametric methods rely on assumptions and theory to obtain exact or approximate sampling distributions for quantities such as:
- \(\bar{X}\) (sample mean)
- \(\hat{p}\) (sample proportion)
- \(\bar{X}_1 - \bar{X}_2\) (difference in means)
- \(\hat{p}_1 - \hat{p}_2\) (difference in proportions)

Problem

What if the assumptions are questionable/fail?

This tends to happen when:
- The data are highly skewed or non-normal
- The sample size is small
- The data contain outliers

What if it is too complicated to derive the theoretical sampling distribution?

e.g. the median has no simple sampling distribution

Two Solutions

Nonparametric methods
- Focus on order information (signs, ranks)
- Fewer assumptions about the population

Bootstrap (resampling)
- Recreate the sampling process using the observed data
- Approximate the sampling distribution directly

Last class, we introduced some nonparametric methods, today we focus on the Bootstrap as a resampling technique.

Main Idea

We approximate the sampling distribution directly from the data.

Resampling methods do exactly this by repeatedly sampling from the observed data.

This idea mirrors what we did earlier in the course when we built the empirical sampling distribution of the sample mean. Recall \(\dots\)

Sampling Distribution of \(\bar X\)

Under certain conditions¹, the CLT tells us the sample mean can be approximated by the normal distribution, namely

\[ \bar X \sim \text{Normal}(\mu, \sigma/\sqrt{n}) \]

While we know this theoretical result, we could (in theory) approximate the empirical sampling distribution by repeatedly resampling from the population and plotting the statistic.

Empirical Sampling Distribution of \(\bar X\)

To approximate the sampling distribution of \(\bar X\):

For \((i = 1, 2, \dots , B)\):

Draw a sample of size \(n\)
Compute \(\bar x_i\)

The distribution of \(\left\{ \bar x_1, \bar x_2, \dots, \bar x_B \right\}\) approximates the sampling distribution of the mean.

Empirical sampling distribution of the sample mean based on repeated samples (1000) of size \(n=\) 30 with the theoretical distribution overlaid (solid red curve).

When the CLT fails

When the underlying population is highly skewed, even a “large” sample of size \(n=30\) can result in a sampling distribution of \(\bar X\) that is still skewed.

Normal Approximation Fails

Underlying population (heavily right skewed)

A poor normal approximation of the sample mean based on 100 samples of size \(n =\) 30.

Normal Approximation Fails

If we base inference on the poor normal approximation (solid red curve on the previous slide):

Confidence intervals may be inaccurate
Hypothesis tests (\(z\)- and \(t\)-tests) may give misleading results
Standard errors may not reflect true variability

A Better Target

The empirical sampling distribution (histogram) would provide a more accurate basis for inference
It captures the correct shape and true variability of the statistic.

Problem: the empirical distribution relies on many samples from the population but we usually only have one.

But hat if we could approximate the empirical sampling distribution using just one sample?

Bootstrap

In 1979, Brad Efron introduced a revolutionary resampling method called the bootstrap.
Bootstrapping involves repeatedly sampling with replacement from a (single) dataset to create multiple bootstrap samples.
Purpose: It estimates the distribution of a statistic (e.g., sample mean) by simulating many possible outcomes.

The name originates from the phrase “to pull oneself up by one’s bootstraps,” which refers to achieving something seemingly impossible or without external help. Source of image: Significance Magazine (2010)

Steps

Consider some estimator \(\hat \alpha\) for \(\alpha\). For \(j = 1, 2, \dots B\) ¹

take a random sample of size \(n\) from your original data set with replacement. This is the \(j\)th bootstrap sample, \(Z^{*j}\)
calculate the estimated value of your parameter from your bootstrap sample, \(Z^{*j}\), call this value \(\hat \alpha^{*j}\).

Estimate the standard error and/or bias of the estimator, \(\hat \alpha\), using the equations given on the upcoming slides.

Sample with replacement

Bootstrap/full Estimates

Bootstrap Standard Error

Bootstrap estimates the standard error of estimator \(\hat \alpha\) using¹

\[\begin{equation} \hat{\text{SE}}(\hat \alpha) = \sqrt{\frac{\sum_{i=1}^{B} (\hat \alpha^{*i} - \overline{\hat \alpha})^2}{B -1}} \end{equation}\]

where \(\overline{\hat \alpha} = \sum_{j = 1}^B \hat \alpha^{*j}\) is the average estimated value over all \(B\) bootstrap samples.

Bootstrap Bias

The bootstrap estimates the Bias of estimator \(\hat \alpha\) using:

\[\begin{align} \hat{\text{Bias}}(\hat \alpha) = \overline {\hat \alpha} - \hat \alpha_f \end{align}\]

where \(\overline{\hat \alpha} = \sum_{j = 1}^B \hat \alpha^{*j}\) and \(\hat \alpha_f\) is the estimate obtained on the original data set.

Example: Skewed Right

Returning to our skewed-right population from before, consider this single sample of size \(n\) = 30.

Code

set.seed(123)
n <- 30
B <- 1000

# log-normal parameters
mu_log <- 0
sd_log <- 1.2

# one observed sample
x <- rlnorm(n, meanlog = mu_log, sdlog = sd_log)

# observed sample mean
xbar_obs <- mean(x)

 [1] 0.51039478 0.75865133 6.49121664 1.08829263 1.16782761 7.83090422
 [7] 1.73863348 0.21913365 0.43857545 0.58578974 4.34444561 1.53999103
[13] 1.61757116 1.14204357 0.51324120 8.53602774 1.81742484 0.09442583
[19] 2.32013896 0.56702624 0.27765153 0.76984206 0.29194048 0.41699983
[25] 0.47234429 0.13212168 2.73284845 1.20207321 0.25518556 4.50225285

This sample has a mean of mean(x) = 1.8125005

Bootstrap sample

For our first bootstrap sample, we resample¹ from x with replacement and recompute the sample mean.

# first boostrap sample
(b1 <- sample(x, replace = TRUE))

 [1] 1.16782761 0.21913365 1.53999103 1.61757116 0.09442583 0.51039478
 [7] 0.47234429 2.73284845 0.47234429 7.83090422 0.27765153 0.51324120
[13] 0.43857545 0.51324120 0.13212168 1.20207321 8.53602774 0.56702624
[19] 4.50225285 7.83090422 4.34444561 0.21913365 0.76984206 0.76984206
[25] 1.73863348 8.53602774 1.81742484 0.76984206 0.09442583 1.81742484

This bootstrap sample has a mean of mean(b1) = 2.0682648.

Bootstrap sample

For our second bootstrap sample, we resample¹ from x with replacement and recompute the sample mean.

# first boostrap sample
(b2 <- sample(x, replace = TRUE))

 [1] 0.7586513 1.0882926 1.6175712 1.1678276 0.7698421 2.3201390 0.4723443
 [8] 0.5670262 0.7698421 0.4723443 1.1420436 0.4723443 0.2919405 6.4912166
[15] 0.2191336 8.5360277 4.5022529 1.5399910 0.4723443 1.1420436 4.5022529
[22] 6.4912166 1.1420436 0.2551856 1.7386335 6.4912166 0.2919405 0.7698421
[29] 0.1321217 0.5132412

This bootstrap sample has a mean of mean(b1) = 1.9046971.

Bootstrap sample

For our thirds bootstrap sample, we resample¹ from x with replacement and recompute the sample mean.

# first boostrap sample
(b2 <- sample(x, replace = TRUE))

 [1] 0.27765153 1.16782761 0.21913365 2.32013896 0.58578974 0.09442583
 [7] 0.58578974 1.53999103 0.75865133 0.58578974 0.76984206 1.53999103
[13] 0.56702624 1.14204357 1.81742484 4.50225285 1.14204357 0.76984206
[19] 6.49121664 0.21913365 1.14204357 2.32013896 4.50225285 0.51324120
[25] 0.41699983 1.81742484 0.29194048 4.34444561 1.73863348 0.25518556

This bootstrap sample has a mean of mean(b1) = 1.4812771.

… and so on …

In R

We could write this in a for loop, or use:

B = 1000
n = length(x)

# bootstrap sample means
boot_xbar <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

This gives us a collection of bootstrap estimates:

\[ \hat \alpha^*_{1} , \hat \alpha^*_{2},... \hat \alpha^*_{B}\]

We can now plot them to get a bootstrap approximation of the sampling distribution of \(\hat \alpha\)

Compare:

Code

df_boot <- data.frame(xbar = boot_xbar)
binwdth_boot <- round(diff(range(boot_xbar)) / 30, 2)

# true sampling distribution from repeated samples from the population
xbar_vals <- replicate(B, mean(rlnorm(n, meanlog = mu_log, sdlog = sd_log)))

df_compare <- data.frame(
  value = c(xbar_vals, boot_xbar),
  type = rep(c("True sampling distribution", "Bootstrap distribution"), each = B)
)

ggplot(df_compare, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30,
                 fill = "gray",
                 color = "white") +
  facet_wrap(~type, scales = "free_y") +
  labs(
    x = "Sample Mean",
    y = "Density"
  ) +
  theme_minimal(base_size = 16)

Computing Bootstrap CI

We alluded to the potential of computing confidence intervals (CI) with the bootstrap.
We already saw how the \(B\) bootstrap estimates of parameter \(\alpha\) provide an empirical estimate of the sampling distribution of estimator \(\hat \alpha\).
To obtain confidence intervals from this empirical distribution, we would simply finding appropriate percentiles.

Ordered Bootstrap Samples

We can now sort these estimates from least to largest. We denote these ordered bootstrap estimates by

\[ \hat \alpha^*_{(1)} , \hat \alpha^*_{(2)},... \hat \alpha^*_{(B)}\]

where

\(\hat \alpha^*_{(1)}\) is the smallest estimate of the \(\alpha\) found in one of the 1000 bootstrap samples, and
\(\hat \alpha^*_{(B)}\) is the largest.

Bootstrap Confidence Intervals

A 95% confidence interval is given by:

\[(\hat \alpha^*_{(0.025)}, \hat \alpha^*_{(0.975)})\] where

\(\hat \alpha^*_{(0.025)}\) is the 2.5th percentile, and
\(\hat \alpha^*_{(0.975)}\) is the 97.5th percentile

This interval captures the middle 95% of the bootstrap distribution

Bootstrap CI

Code

# percentile CI
ci <- quantile(boot_xbar, c(0.025, 0.975))

# histogram
hist(boot_xbar,
     probability = TRUE,
     col = "gray80",
     border = "white",
     main = "Bootstrap Distribution with 95% CI",
     xlab = "Sample Mean")

# density estimate
d <- density(boot_xbar)
lines(d, lwd = 2)

# indices for middle 95%
idx <- d$x >= ci[1] & d$x <= ci[2]

# shade only the middle 95% under the curve
polygon(c(ci[1], d$x[idx], ci[2]),
        c(0,     d$y[idx], 0),
        density = 50,
        angle = 45,
        col = rgb(0.2, 0.5, 0.9, 0.5),   # soft blue
        border = NA)

# redraw density curve on top
lines(d, lwd = 2)

# CI bounds
abline(v = ci, lwd = 2, lty = 2)

# observed sample mean
abline(v = mean(x), lwd = 2)

# optional labels
text(ci[1], par("usr")[4] * 0.92, "2.5%", pos = 4)
text(ci[2], par("usr")[4] * 0.92, "97.5%", pos = 2)

Regression Example

We simulate 30 observations from a simple linear regression model with intercept \(\beta_0 = 0\) and slope \(\beta_1 = 2\). That is: \[Y = 2X + \epsilon\] where \(\epsilon \sim N(0, 0.25^2)\).

Data Generation

We generate a single sample from our population here:

set.seed(123978)
n = 30
beta1 = 2
sigma = 0.25 
x <- runif(n, 0, 1)
y <- beta1*x + rnorm(n, sd=sigma)
xy <- cbind(x,y)

Summary of Fit

And fit a simple linear regression model:

fullfit <- lm(y~x) 
(sfit <- summary(fullfit))

...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.05798    0.07773   0.746    0.462    
x            1.86641    0.14348  13.008 2.17e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2213 on 28 degrees of freedom
Multiple R-squared:  0.858, Adjusted R-squared:  0.8529 
F-statistic: 169.2 on 1 and 28 DF,  p-value: 2.172e-13
...

Sampling Distribution of \(\hat \beta_1\)

It can be shown that the sampling distribution of \(\hat \beta_1\) is given by a normal distribution centered at the true value, and standard error given below

\[\begin{align} \hat\beta_1 &\sim \text{Normal}\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right) & \text{where } S_{xx} = \sum_{i=1}^n (x_i - \overline x)^2 \end{align}\]

Note that this is an unbiased with known standard error!

Theoretical Bias and SE

In other words our we are using an unbiased estimator for slope (bias = 0) with the following theoretical standard error:

sxx <- sum((x - mean(x))^2)
(true.seb = sqrt(sigma^2/sxx))

[1] 0.162064

\[\begin{align} SE(\hat\beta_1)& = \sqrt{\dfrac{\sigma^2}{S_{xx}}} = \sqrt{\dfrac{0.25^2}{2.38}} = 0.1621 \end{align}\]

Estimates for a single fit

Notice that our summary output provides an estimate of the standard error of \(\hat \beta_1\) from the summary of the lm output (calculated using):

\[\hat{SE}(\hat\beta_1)=\sqrt{\frac{\sum_i\hat\epsilon_i^2}{(n-2)\sum_i(x_i-\bar x)^2}} = 0.1435\]

Since we’ll need this later, note the following estimate on the original data: \(\hat \beta_1 = 1.8664\) (we’ll call this \(\hat\beta_f\) later)

Bootstrap Simulation

While it is a fairly routine to arrive at these equations, this might not be the case for more complicated scenarios.
As an exercise, let’s pretend that we do not know what the standard error and bias of our estimator \(\hat \beta_1\) is.
Instead, let’s estimate the bias (which we know should be 0 since \(\hat \beta_1\) is and unbiased estimator) and standard error using the bootstrap method.¹

R code

Code

# Set a random seed for reproducibility
set.seed(311)

# Initialize lists to store bootstrap samples and fitted models
bootsamp <- list()   # bootstrap samples
bootsmod <- list()   # lm fits for each bootstrap sample
bootcoef <- NA       # slope coefficients (beta_1) from each fitted model

# Define the number of bootstrap samples to be taken
B <- 1000            

# Perform the bootstrap procedure
for (i in 1:B) {
  
  # Step 1: Resample with replacement from the observed data
  Zj <- xy[sample(1:n, n, replace = TRUE), ] 
  
  # Step 2: Store the bootstrap sample
  bootsamp[[i]] <- Zj 
  
  # Step 3: Fit a linear regression model to the bootstrap sample
  bootsmod[[i]] <- lm(Zj[, 2] ~ Zj[, 1]) 
  
  # Step 4: Extract and store the slope coefficient (beta_1) from the model
  bootcoef[i] <- bootsmod[[i]]$coefficient[2] 
}

At the end of this loop:

bootsamp contains the 1000 bootstrap samples
bootsmod contains the fitted models for each sample
bootcoef contains the 1000 bootstrap estimates of the slope coefficient

Code

# Create a histogram of the bootstrap coefficients
hist(bootcoef, main = "Bootstrap Distribution of Slope Coefficients", 
     xlab = "Slope Coefficients")

# Add a vertical line for the mean of the bootstrap coefficients
abline(v = mean(bootcoef), col = 2, lwd = 2) # Red line for the mean

# Calculate the standard error of the bootstrap coefficients
boot_se <- sd(bootcoef) # Standard deviation of the bootstrap estimates

# Add arrows to indicate the standard error
arrows(mean(bootcoef) - boot_se, 0, mean(bootcoef) + boot_se, 0, 
       angle = 90, code = 3, length = 0.1, col = 4, lwd = 2)

# Add labels for clarity
text(mean(bootcoef) - boot_se, max(hist(bootcoef, plot = FALSE)$counts)/2, 
     labels = "-1 SE", col = 4, pos = 3)
text(mean(bootcoef) + boot_se, max(hist(bootcoef, plot = FALSE)$counts)/2, 
     labels = "+1 SE", col = 4, pos = 3)

Bootstrap Estimate for Bias

We now have \(B = 1000\) estimates for \(\beta\) stored in bootcoef = \(\{\hat \beta^{*1},\hat \beta^{*2}, \dots, \hat \beta^{*1000}\}\) and the following values

\[\begin{gather} \overline{\hat \beta} = \sum_{j = 1}^{B = 1000} \hat \beta^{*j} = 1.8704 \quad \hat \beta_f = 1.86641 \\ \text{Bias}(\hat \beta) = \overline {\hat \beta} - \hat \beta_f = 1.8704 - 1.8664 = 0.004 \end{gather}\]

Bootstrap Estimate for SE

And the estimate for SE is given by this:

(seb = sqrt(sum((bootcoef - mean(bootcoef))^2)/(B-1)))

[1] 0.1543347

sd(bootcoef) # equivalent to the above

[1] 0.1543347

\[\begin{align} \hat{\text{SE}}(\hat \beta) &= \sqrt{\frac{\sum_{i=1}^{B} (\hat \beta^{*i} - \overline{\hat \beta})^2}{B -1}} = 0.1543 \end{align}\]

Bootstrap vs Theoretical SE(\(\hat\beta_1\))

The bootstrap estimate (=0.1543) is closer to the theoretical value (=0.1621) that the estimate obtained in the lm summary table from the (single) original data set (= 0.1435).

Important

This estimate was created with no knowledge of the theoretical sampling distribution!!

Had we not known the theoretical sampling distribution for \(\hat \beta_1\) the bootstrap enables us to construct confidence intervals, perform hypothesis test (calculate \(p\)-values!)

Bootstrap vs Simulation-based SE(\(\hat\beta_1\))

For fun, let’s compare this estimate to the simulation-based method which allows us to resample from this population many many times, …

for(i in 1:1000){
  newx <- runif(30, 0, 1)                     # generate a new sample
  newy <- 2*newx + rnorm(30, sd=0.25)         # for each iteration
  sample.i <- cbind(newx, newy)
  modnew[[i]] <- lm(sample.i[,2]~sample.i[,1])
  coefs[i] <- modnew[[i]]$coefficients[2]     # save estimate for beta1
}

Results (Visual)

Results (Table)

SE estimates from the simulation and the bootstrap simulation are very similar to each other and the theoretical value (=0.162)

Estimate/ R Code	Description
0.143 = from output table `summary(fullfit)`	Estimate from a single fit on the full data
0.154 = `sd(bootcoef)`	Bootstrap estimate obtained from 1000 bootstrap samples
0.160 = `sd(coefs)`	Standard errors obtained from resampling from the population 1000 times.

Comments

Note that the center of the bootstrap histogram is centered at \(\hat \beta_f\), the estimate obtained using the full data, rather than at \(\beta_1 = 2\) (the true simulated value).
While we will use the bootstrap to estimate the SE of our estimator, typically the point estimate will be given by the estimate obtained using the full data, rather than say the mean of the bootstrap estimates.

iClicker

Goal of bootstrapping

What is the primary purpose of bootstrap resampling?

To estimate parameters by randomly selecting subsamples without replacement
To create artificial datasets that mimic the population and estimate confidence intervals
To increase the sample size by duplicating original observations
To reduce bias in parameter estimates from large samples

iClicker

Bootstrap Concept Check

Which statement best describes the bootstrap?

It assumes the population is normal
It resamples from the population
It resamples from the sample with replacement
It resamples from the sample without replacement

Bootstrap Confidence Intervals (CI)

For our example, the 95% bootstrap CI for \(\beta_1\) would correspond to the 25th value and the 975th value of the sorted bootstrap estimates.

quantile

These values can be found using the quantile() function:

quantile(bootcoef, .025) ; quantile(bootcoef, 1-.025)

    2.5% 
1.567437

   97.5% 
2.182977

Estimated CI

So we have a 95% CI for true \(\beta_1\) lying within the interval: \[(1.5674, 2.183)\]

Quote

Statistical ‘rock star’ wins coveted international prize

The bootstrap is rarely the star of statistics, but it is the best supporting actor (Dr. Bradley Efron)

You can hear Dr. Brad Efron speak about the bootstrap here.

Dr. Efron gives a nice interview with the ILSR authors here.

“inventer” of the bootstrap dr. bradley efron a statistician at Stanford
wins the statistics version of the “nobel prize”: Internation Prize in statistics in 2018
SE/CIs are ubigquitous in applied papers across natural science and for complex models with no closed-form soltuions, they’ll use the bootstraped SE/CIs

Ensemble Methods (e.g., Bagging)

Bootstrap Aggregating (Bagging): One of the most common applications of bootstrapping is in bagging (Bootstrap Aggregating). In bagging, multiple bootstrap samples are drawn from the training data, and separate models are trained on each bootstrap sample. The predictions from these models are then averaged (for regression) or majority-voted (for classification) to create a more stable and accurate prediction.
- Example: Random Forests is a popular bagging algorithm where decision trees are trained on different bootstrap samples, and predictions are averaged over all trees. This reduces variance and prevents overfitting.
- Benefit: Bootstrapping helps ensure that each model in the ensemble is trained on slightly different data, which increases the diversity of the models and reduces the model’s variance.

Comment

The bootstrap is a tremendously versatile method for estimating the standard errors and bias of parameter estimates.
Can handle more complex scenarios (e.g., small sample sizes, non-i.i.d. data) and provides more flexibility in constructing confidence intervals and testing hypotheses.