To do statistical inference, we need the sampling distribution of a statistic.
Already we’ve seen how parametric methods rely on assumptions and theory to obtain exact or approximate sampling distributions for quantities such as:
\(\bar{X}\) (sample mean)
\(\hat{p}\) (sample proportion)
\(\bar{X}_1 - \bar{X}_2\) (difference in means)
\(\hat{p}_1 - \hat{p}_2\) (difference in proportions)
Problem
What if the assumptions are questionable/fail?
This tends to happen when:
The data are highly skewed or non-normal
The sample size is small
The data contain outliers
What if it is too complicated to derive the theoretical sampling distribution?
e.g. the median has no simple sampling distribution
Two Solutions
Nonparametric methods
Focus on order information (signs, ranks)
Fewer assumptions about the population
Bootstrap (resampling)
Recreate the sampling process using the observed data
Approximate the sampling distribution directly
Last class, we introduced some nonparametric methods; today we focus on the bootstrap as a resampling technique.
Main Idea
We approximate the sampling distribution directly from the data.
Resampling methods do exactly this by repeatedly sampling from the observed data.
This idea mirrors what we did earlier in the course when we built the empirical sampling distribution of the sample mean. Recall \(\dots\)
Sampling Distribution of \(\bar X\)
Under certain conditions, the CLT tells us that the sampling distribution of the sample mean is approximately normal, namely
\[
\bar X \sim \text{Normal}(\mu, \sigma/\sqrt{n})
\]
While we know this theoretical result, we could (in theory) approximate the sampling distribution empirically by repeatedly sampling from the population and plotting the statistic.
Empirical Sampling Distribution of \(\bar X\)
To approximate the sampling distribution of \(\bar X\):
For \((i = 1, 2, \dots , B)\):
Draw a sample of size \(n\)
Compute \(\bar x_i\)
The distribution of \(\left\{ \bar x_1, \bar x_2, \dots, \bar x_B \right\}\) approximates the sampling distribution of the mean.
Empirical sampling distribution of the sample mean based on repeated samples (1000) of size \(n=\) 30 with the theoretical distribution overlaid (solid red curve).
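The loop above can be sketched in a few lines of R. This is a minimal illustration, not the code behind the figure: the normal population, its parameters `mu` and `sigma`, and the seed are assumptions chosen only to make the example run.

```r
set.seed(1)            # assumed seed, for reproducibility only
mu <- 10; sigma <- 2   # hypothetical population parameters
n <- 30                # sample size
B <- 1000              # number of repeated samples

# For i = 1, ..., B: draw a sample of size n and compute its mean
xbars <- replicate(B, mean(rnorm(n, mean = mu, sd = sigma)))

# Empirical sampling distribution with the theoretical curve overlaid
hist(xbars, probability = TRUE,
     main = "Empirical sampling distribution of the mean",
     xlab = "Sample mean")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)),
      add = TRUE, col = "red", lwd = 2)
```

Here `sd(xbars)` should land close to `sigma / sqrt(n)`, the standard deviation the CLT predicts.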
When the CLT fails
When the underlying population is highly skewed, even a “large” sample of size \(n=30\) can result in a sampling distribution of \(\bar X\) that is still skewed.
Normal Approximation Fails
Underlying population (heavily right skewed)
A poor normal approximation of the sample mean based on 100 samples of size \(n =\) 30.
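One way to see this numerically is to repeat the simulation with a skewed population and measure the skewness of the resulting sample means. The sketch below assumes a log-normal population with `sdlog = 1.5`; the population used for the figures is not specified here, and `skewness` is a small helper defined only for this example.

```r
set.seed(2)            # assumed seed
n <- 30
B <- 1000

# Hypothetical heavily right-skewed population: log-normal with sdlog = 1.5
xbars_skew <- replicate(B, mean(rlnorm(n, meanlog = 0, sdlog = 1.5)))

# Sample skewness (third standardized moment); near 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(xbars_skew)   # clearly positive: the sample means are still right-skewed
hist(xbars_skew, probability = TRUE, main = "Sample means, skewed population",
     xlab = "Sample mean")
```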
Normal Approximation Fails
If we base inference on the poor normal approximation (solid red curve on the previous slide):
Confidence intervals may be inaccurate
Hypothesis tests (\(z\)- and \(t\)-tests) may give misleading results
Standard errors may not reflect true variability
A Better Target
The empirical sampling distribution (histogram) would provide a more accurate basis for inference
It captures the correct shape and true variability of the statistic.
Problem: the empirical distribution relies on many samples from the population but we usually only have one.
But what if we could approximate the empirical sampling distribution using just one sample?
Bootstrap
In 1979, Brad Efron introduced a revolutionary resampling method called the bootstrap.
Bootstrapping involves repeatedly sampling with replacement from a (single) dataset to create multiple bootstrap samples.
Purpose: It estimates the distribution of a statistic (e.g., sample mean) by simulating many possible outcomes.
The name originates from the phrase “to pull oneself up by one’s bootstraps,” which refers to achieving something seemingly impossible or without external help. Source of image: Significance Magazine (2010)
Steps
Consider some estimator \(\hat \alpha\) for \(\alpha\). For \(j = 1, 2, \dots, B\):
take a random sample of size \(n\) from your original data set with replacement. This is the \(j\)th bootstrap sample, \(Z^{*j}\)
calculate the estimated value of your parameter from your bootstrap sample, \(Z^{*j}\), call this value \(\hat \alpha^{*j}\).
Estimate the standard error and/or bias of the estimator, \(\hat \alpha\), using the equations given on the upcoming slides.
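The steps above can be sketched as a small generic function. The helper name `bootstrap_stat`, the exponential sample, and the choice of the median as the statistic are all assumptions made for illustration. The bias line uses the common definition (mean of the bootstrap estimates minus the full-data estimate).

```r
# Hypothetical helper: returns B bootstrap estimates of `statistic`
bootstrap_stat <- function(data, statistic, B = 1000) {
  n <- length(data)
  replicate(B, {
    z_star <- sample(data, size = n, replace = TRUE)  # j-th bootstrap sample Z^{*j}
    statistic(z_star)                                 # estimate alpha-hat^{*j}
  })
}

set.seed(3)                      # assumed seed
x <- rexp(30, rate = 1)          # hypothetical observed sample
alpha_star <- bootstrap_stat(x, median, B = 1000)

se_boot   <- sd(alpha_star)                 # bootstrap standard error
bias_boot <- mean(alpha_star) - median(x)   # bootstrap estimate of bias
```

Passing the statistic as a function means the same loop works for the mean, the median, or any other estimator computed from a single sample.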
Sample with replacement
Bootstrap/full Estimates
Bootstrap Standard Error
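The bootstrap estimates the standard error of the estimator \(\hat \alpha\) using
\[
\text{SE}_B(\hat \alpha) = \sqrt{\frac{1}{B-1} \sum_{j=1}^{B} \left( \hat \alpha^{*j} - \bar \alpha^{*} \right)^2}, \qquad \bar \alpha^{*} = \frac{1}{B} \sum_{j=1}^{B} \hat \alpha^{*j}
\]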
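In practice the bootstrap standard error is usually computed as the sample standard deviation of the \(B\) bootstrap estimates, which is how `sd(bootcoef)` is used in the regression example later on. The check below, on hypothetical normal data with the sample mean as the estimator, confirms that the explicit sum-of-squares formula with divisor \(B - 1\) agrees with `sd()`.

```r
set.seed(4)                          # assumed seed
x <- rnorm(25, mean = 5, sd = 2)     # hypothetical data
B <- 500

# B bootstrap estimates of the mean
alpha_star <- replicate(B, mean(sample(x, replace = TRUE)))

# Standard error written out explicitly (divisor B - 1)
alpha_bar  <- mean(alpha_star)
se_formula <- sqrt(sum((alpha_star - alpha_bar)^2) / (B - 1))

# The same quantity via sd()
se_sd <- sd(alpha_star)

all.equal(se_formula, se_sd)
```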
We alluded to the potential of computing confidence intervals (CI) with the bootstrap.
We already saw how the \(B\) bootstrap estimates of parameter \(\alpha\) provide an empirical estimate of the sampling distribution of estimator \(\hat \alpha\).
To obtain confidence intervals from this empirical distribution, we simply find the appropriate percentiles.
Ordered Bootstrap Samples
We can now sort these estimates from smallest to largest. We denote the ordered bootstrap estimates by
\[\hat \alpha^*_{(1)} \le \hat \alpha^*_{(2)} \le \dots \le \hat \alpha^*_{(B)}\]
where
\(\hat \alpha^*_{(1)}\) is the smallest estimate of \(\alpha\) found among the \(B = 1000\) bootstrap samples, and
\(\hat \alpha^*_{(B)}\) is the largest.
Bootstrap Confidence Intervals
A 95% confidence interval is given by:
\[(\hat \alpha^*_{(0.025)}, \hat \alpha^*_{(0.975)})\] where
\(\hat \alpha^*_{(0.025)}\) is the 2.5th percentile, and
\(\hat \alpha^*_{(0.975)}\) is the 97.5th percentile
This interval captures the middle 95% of the bootstrap distribution
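A minimal sketch of the percentile interval for a sample mean follows. The data are hypothetical (a right-skewed exponential sample). The names `x`, `boot_xbar`, and `ci` are chosen to match the plotting code in this section, but this is not necessarily how those objects were originally created.

```r
set.seed(6)                        # assumed seed
x <- rexp(40, rate = 1 / 5)        # hypothetical right-skewed sample
B <- 1000

# Bootstrap distribution of the sample mean
boot_xbar <- replicate(B, mean(sample(x, replace = TRUE)))

# 95% percentile interval: 2.5th and 97.5th percentiles
ci <- quantile(boot_xbar, c(0.025, 0.975))
ci

# Proportion of bootstrap estimates inside the interval (about 0.95)
mean(boot_xbar >= ci[1] & boot_xbar <= ci[2])
```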
Bootstrap CI
Code
```r
# percentile CI
ci <- quantile(boot_xbar, c(0.025, 0.975))

# histogram
hist(boot_xbar,
     probability = TRUE,
     col = "gray80",
     border = "white",
     main = "Bootstrap Distribution with 95% CI",
     xlab = "Sample Mean")

# density estimate
d <- density(boot_xbar)
lines(d, lwd = 2)

# indices for middle 95%
idx <- d$x >= ci[1] & d$x <= ci[2]

# shade only the middle 95% under the curve
polygon(c(ci[1], d$x[idx], ci[2]),
        c(0, d$y[idx], 0),
        density = 50,
        angle = 45,
        col = rgb(0.2, 0.5, 0.9, 0.5),  # soft blue
        border = NA)

# redraw density curve on top
lines(d, lwd = 2)

# CI bounds
abline(v = ci, lwd = 2, lty = 2)

# observed sample mean
abline(v = mean(x), lwd = 2)

# optional labels
text(ci[1], par("usr")[4] * 0.92, "2.5%", pos = 4)
text(ci[2], par("usr")[4] * 0.92, "97.5%", pos = 2)
```
Regression Example
We simulate 30 observations from a simple linear regression model with intercept \(\beta_0 = 0\) and slope \(\beta_1 = 2\). That is: \[Y = 2X + \epsilon\] where \(\epsilon \sim N(0, 0.25^2)\).
Data Generation
We generate a single sample from our population here:
```
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05798    0.07773   0.746    0.462
x            1.86641    0.14348  13.008 2.17e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2213 on 28 degrees of freedom
Multiple R-squared:  0.858, Adjusted R-squared:  0.8529
F-statistic: 169.2 on 1 and 28 DF,  p-value: 2.172e-13
...
```
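The code that generated these data is not shown. A sketch consistent with the stated model could look like the following. The seed and the Uniform(0, 1) predictor are assumptions (the predictor range mirrors the simulation code later in this section), so the fitted numbers will not match the output above. The names `xy`, `n`, and `fullfit` follow the way they are used elsewhere in this section.

```r
set.seed(2024)                     # assumed seed; the original seed is not shown
n <- 30
x <- runif(n, 0, 1)                # assumed predictor distribution
y <- 2 * x + rnorm(n, sd = 0.25)   # beta_0 = 0, beta_1 = 2, sigma = 0.25

xy <- cbind(x, y)                  # matrix form, as used by the bootstrap loop
fullfit <- lm(y ~ x)               # fit on the full data
summary(fullfit)
```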
Sampling Distribution of \(\hat \beta_1\)
It can be shown that the sampling distribution of \(\hat \beta_1\) is a normal distribution centered at the true value \(\beta_1\), with standard error
\[
\text{SE}(\hat \beta_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}}
\]
Since we’ll need this later, note the following estimate on the original data: \(\hat \beta_1 = 1.8664\) (we’ll call this \(\hat\beta_f\) later)
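For simple linear regression with fixed predictor values, the standard error of the slope is \(\sigma / \sqrt{\sum_i (x_i - \bar x)^2}\). The sketch below computes it with the true \(\sigma = 0.25\) and compares it to the value reported by `lm`, which plugs in the residual standard error instead. The data are regenerated under assumed settings (seed, Uniform(0, 1) predictor), so the numbers will differ from those quoted in this section.

```r
set.seed(2024)                     # assumed seed
n <- 30
sigma <- 0.25
x <- runif(n, 0, 1)                # assumed predictor distribution
y <- 2 * x + rnorm(n, sd = sigma)

# Theoretical SE of the slope, using the true sigma
se_theory <- sigma / sqrt(sum((x - mean(x))^2))

# SE reported by lm: same formula with sigma replaced by the residual SE
fit <- lm(y ~ x)
se_lm     <- summary(fit)$coefficients["x", "Std. Error"]
se_manual <- summary(fit)$sigma / sqrt(sum((x - mean(x))^2))

c(theory = se_theory, lm = se_lm, manual = se_manual)
```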
Bootstrap Simulation
While it is fairly routine to derive these equations, this might not be the case for more complicated scenarios.
As an exercise, let’s pretend that we do not know what the standard error and bias of our estimator \(\hat \beta_1\) are.
Instead, let’s estimate the bias (which we know should be 0 since \(\hat \beta_1\) is an unbiased estimator) and the standard error using the bootstrap method.
R code
Code
```r
# Set a random seed for reproducibility
set.seed(311)

# Initialize lists to store bootstrap samples and fitted models
bootsamp <- list()  # bootstrap samples
bootsmod <- list()  # lm fits for each bootstrap sample
bootcoef <- NA      # slope coefficients (beta_1) from each fitted model

# Define the number of bootstrap samples to be taken
B <- 1000

# Perform the bootstrap procedure
for (i in 1:B) {
  # Step 1: Resample with replacement from the observed data
  Zj <- xy[sample(1:n, n, replace = TRUE), ]
  # Step 2: Store the bootstrap sample
  bootsamp[[i]] <- Zj
  # Step 3: Fit a linear regression model to the bootstrap sample
  bootsmod[[i]] <- lm(Zj[, 2] ~ Zj[, 1])
  # Step 4: Extract and store the slope coefficient (beta_1) from the model
  bootcoef[i] <- bootsmod[[i]]$coefficients[2]
}
```
At the end of this loop:
bootsamp contains the 1000 bootstrap samples
bootsmod contains the fitted models for each sample
bootcoef contains the 1000 bootstrap estimates of the slope coefficient
Code
```r
# Create a histogram of the bootstrap coefficients
hist(bootcoef,
     main = "Bootstrap Distribution of Slope Coefficients",
     xlab = "Slope Coefficients")

# Add a vertical line for the mean of the bootstrap coefficients
abline(v = mean(bootcoef), col = 2, lwd = 2)  # Red line for the mean

# Calculate the standard error of the bootstrap coefficients
boot_se <- sd(bootcoef)  # Standard deviation of the bootstrap estimates

# Add arrows to indicate the standard error
arrows(mean(bootcoef) - boot_se, 0,
       mean(bootcoef) + boot_se, 0,
       angle = 90, code = 3, length = 0.1, col = 4, lwd = 2)

# Add labels for clarity
text(mean(bootcoef) - boot_se, max(hist(bootcoef, plot = FALSE)$counts) / 2,
     labels = "-1 SE", col = 4, pos = 3)
text(mean(bootcoef) + boot_se, max(hist(bootcoef, plot = FALSE)$counts) / 2,
     labels = "+1 SE", col = 4, pos = 3)
```
Bootstrap Estimate for Bias
We now have \(B = 1000\) estimates for \(\beta_1\) stored in bootcoef = \(\{\hat \beta^{*1},\hat \beta^{*2}, \dots, \hat \beta^{*1000}\}\) and the following values
The bootstrap estimate of the standard error (= 0.1543) is closer to the theoretical value (= 0.1621) than the estimate obtained in the lm summary table from the (single) original data set (= 0.1435).
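The bootstrap estimate of bias is commonly taken to be the mean of the bootstrap estimates minus the full-data estimate \(\hat \beta_f\). The self-contained sketch below computes both the bias and the standard error for the slope. The data are regenerated under assumed settings (seed, Uniform(0, 1) predictor), so the values will not reproduce the numbers quoted above.

```r
set.seed(311)                      # assumed seed for this sketch
n <- 30
x <- runif(n, 0, 1)                # assumed predictor distribution
y <- 2 * x + rnorm(n, sd = 0.25)
xy <- cbind(x, y)

beta_f <- unname(coef(lm(y ~ x))[2])   # full-data slope estimate

B <- 1000
bootcoef <- numeric(B)
for (j in 1:B) {
  Zj <- xy[sample(1:n, n, replace = TRUE), ]          # resample rows with replacement
  bootcoef[j] <- coef(lm(Zj[, 2] ~ Zj[, 1]))[[2]]     # slope from bootstrap sample
}

bias_boot <- mean(bootcoef) - beta_f   # bootstrap estimate of bias (expected near 0)
se_boot   <- sd(bootcoef)              # bootstrap estimate of the standard error
c(bias = bias_boot, se = se_boot)
```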
Important
This estimate was created with no knowledge of the theoretical sampling distribution!
Even without knowing the theoretical sampling distribution of \(\hat \beta_1\), the bootstrap enables us to construct confidence intervals and perform hypothesis tests (calculate \(p\)-values!).
Bootstrap vs Simulation-based SE(\(\hat\beta_1\))
For fun, let’s compare this estimate to the simulation-based method which allows us to resample from this population many many times, …
```r
modnew <- list()        # lm fits for each new sample
coefs <- numeric(1000)  # slope estimates from each fit

for (i in 1:1000) {
  newx <- runif(30, 0, 1)                  # generate a new sample
  newy <- 2 * newx + rnorm(30, sd = 0.25)  # for each iteration
  sample.i <- cbind(newx, newy)
  modnew[[i]] <- lm(sample.i[, 2] ~ sample.i[, 1])
  coefs[i] <- modnew[[i]]$coefficients[2]  # save estimate for beta1
}
```
Results (Visual)
Results (Table)
The SE estimates from the simulation and from the bootstrap are very similar to each other and to the theoretical value (= 0.162).
| Estimate / R code | Description |
|---|---|
| 0.143 = from output table `summary(fullfit)` | Estimate from a single fit on the full data |
| 0.154 = `sd(bootcoef)` | Bootstrap estimate obtained from 1000 bootstrap samples |
| 0.160 = `sd(coefs)` | Standard error obtained by resampling from the population 1000 times |
Comments
Note that the bootstrap histogram is centered at \(\hat \beta_f\), the estimate obtained using the full data, rather than at \(\beta_1 = 2\) (the true simulated value).
While we will use the bootstrap to estimate the SE of our estimator, typically the point estimate will be given by the estimate obtained using the full data, rather than say the mean of the bootstrap estimates.
iClicker
Goal of bootstrapping
What is the primary purpose of bootstrap resampling?
To estimate parameters by randomly selecting subsamples without replacement
To create artificial datasets that mimic the population and estimate confidence intervals
To increase the sample size by duplicating original observations
To reduce bias in parameter estimates from large samples
iClicker
Bootstrap Concept Check
Which statement best describes the bootstrap?
It assumes the population is normal
It resamples from the population
It resamples from the sample with replacement
It resamples from the sample without replacement
Bootstrap Confidence Intervals (CI)
For our example, the 95% bootstrap CI for \(\beta_1\) would correspond to the 25th value and the 975th value of the sorted bootstrap estimates.
quantile
These values can be found using the quantile() function:
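A sketch of that call follows. The data are regenerated under assumed settings (seed, Uniform(0, 1) predictor), so the interval will not match the one from the original run. Note that `quantile()` interpolates between order statistics by default, so its output can differ slightly from the 25th and 975th sorted values.

```r
set.seed(311)                      # assumed seed for this sketch
n <- 30
x <- runif(n, 0, 1)                # assumed predictor distribution
y <- 2 * x + rnorm(n, sd = 0.25)

B <- 1000
bootcoef <- replicate(B, {
  idx <- sample(n, replace = TRUE)       # resample row indices with replacement
  coef(lm(y[idx] ~ x[idx]))[[2]]         # slope from the bootstrap sample
})

# 95% percentile interval via quantile()
ci_quant <- unname(quantile(bootcoef, c(0.025, 0.975)))

# The same idea via the sorted estimates
sorted <- sort(bootcoef)
ci_sorted <- c(sorted[25], sorted[975])

rbind(quantile = ci_quant, sorted = ci_sorted)
```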
The bootstrap is rarely the star of statistics, but it is the best supporting actor (Dr. Bradley Efron)
You can hear Dr. Brad Efron speak about the bootstrap here.
Dr. Efron gives a nice interview with the ISLR authors here.
Comment
The bootstrap is a tremendously versatile method for estimating the standard errors and bias of parameter estimates.
It can handle more complex scenarios (e.g., small sample sizes, non-i.i.d. data) and provides more flexibility in constructing confidence intervals and testing hypotheses.
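As one illustration of this versatility, the same recipe applies to a statistic such as the correlation coefficient, whose sampling distribution is awkward to work with directly. The sketch below uses hypothetical paired data. The key detail is that whole pairs (rows) are resampled together, so the dependence between the two variables is preserved.

```r
set.seed(8)                          # assumed seed
n <- 50
u <- rnorm(n)
v <- 0.6 * u + rnorm(n, sd = 0.8)    # hypothetical paired data

B <- 1000
boot_r <- replicate(B, {
  idx <- sample(n, replace = TRUE)   # resample pairs, not each variable separately
  cor(u[idx], v[idx])
})

r_full <- cor(u, v)                          # point estimate from the full data
se_r   <- sd(boot_r)                         # bootstrap standard error
ci_r   <- quantile(boot_r, c(0.025, 0.975))  # 95% percentile interval
```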