Lecture 3: Sampling Distributions

STAT 205: Introduction to Mathematical Statistics

Dr. Irene Vrbik

University of British Columbia Okanagan

January 30, 2024

Outline

\(\newcommand{\Var}[1]{\text{Var}({#1})}\)

In this lecture we will be covering:

  • random samples, statistics, and sampling distributions
  • the Central Limit Theorem (CLT)
  • the sampling distributions of the sample mean, the sample proportion, and the sample variance

Introduction

Building on the probabilistic foundations covered in STAT 203, this lecture covers a fundamental concept in statistics: the sampling distribution, which describes the behavior of sample statistics across repeated samples.

Statistical inference aims to draw meaningful and reliable conclusions about a population based on a sample of data drawn from that population.

Example: loan* data

  • Let’s return to our loan50 example from previous lectures.

  • This data set represents 50 loans made through the Lending Club platform, a peer-to-peer platform that allows individuals to lend money to other individuals.

library(openintro)
data("loan50")             # data from last lecture
data("loans_full_schema")  # full data 

Population data

  • The loans_full_schema data set comprises 10,000 loans made through the Lending Club platform.

  • Let’s pretend for demonstration purposes that this data set makes up all the loans made through the Lending Club and thus represents the population.

  • loan50 is therefore a sample from that population.

Histogram: Population vs Sample

hist(loans_full_schema$loan_amount, xlab = "Loan Amount",
     main = "loans_full_schema Population")

hist(loan50$loan_amount,  xlab = "Loan Amount",
     main = "loan50 Sample", breaks = 15)

For the population we might be interested in some parameter, for example the average loan amount, i.e. the mean value \(\mu\), and its standard deviation \(\sigma\).

Parameters

While it is usually impossible to compute the population mean, here we have the data for the entire (pretend) population, so we can calculate it directly:

mu = mean(loans_full_schema$loan_amount)
sigma = sd(loans_full_schema$loan_amount)

The average loan amount is $16,361.92

The standard deviation is $10,301.96

If we did not have access to the population information (as is often the case) and only had access to a sample, we would instead compute the sample mean:

xbar = mean(loan50$loan_amount)
sample_sd = sd(loan50$loan_amount)

The sample average loan amount is $17,083.00

The sample standard deviation is $10,455.46

Random Samples in R

Of course, a different random sample of 50 loans will produce a different sample mean:

N = nrow(loans_full_schema) # 10,000 observations
sampled50_loans <- loans_full_schema[sample(N, 50), ]
(mean_loan_amount <- mean(sampled50_loans$loan_amount))
[1] 15745

The sample average loan amount is $15,745.00

If we do it again, we get yet another (different) sample mean:

sampled50_loans <- loans_full_schema[sample(N, 50), ]
(mean_loan_amount <- mean(sampled50_loans$loan_amount))
[1] 16183

The sample average loan amount is $16,183.00

Sampling Distribution of the mean

Let’s repeat this process many times and keep track of all of the sample mean calculations (one for each sample).

mean_vec <- numeric(1000)   # pre-allocate storage for 1000 sample means
for (i in 1:1000) {
  sampled50_loans <- loans_full_schema[sample(N, 50), ]   # draw a new sample of 50
  mean_vec[i] <- mean(sampled50_loans$loan_amount)        # record its sample mean
}

The distribution of the sample means is an example of a sampling distribution.

Empirical Sampling Distribution

To get a so-called empirical estimate of that distribution, we can plot a histogram of its values.
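A minimal sketch of such a plot in base R (using the mean_vec and mu objects computed above; the plotting choices are ours):

hist(mean_vec, breaks = 30, xlab = "Sample Mean Loan Amount",
     main = "Empirical Sampling Distribution of the Mean")
abline(v = mu, col = "red", lwd = 2)  # population mean for reference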

Theoretical Sampling Distribution

Thanks to the Central Limit Theorem we actually know what the theoretical probability distribution of \(\overline{X}\) is …


CLT for the Sample Mean

This phenomenon is known as the Central Limit Theorem.

It says that the probability distribution of \(\overline{X}\) approaches a Normal distribution as long as the sample size is large enough.

The mean (expected value) of the sampling distribution of the sample mean is equal to the population mean (\(\mu\)), and its standard deviation (more commonly referred to as the standard error) is \(\sigma_\overline{X} = \frac{\sigma}{\sqrt{n}}\). We often write:

\[ \begin{equation} \overline{X} \sim N\left(\mu_\overline{X} = \mu, \ \sigma_\overline{X} = \sigma/\sqrt{n}\right) \end{equation} \]
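As a quick sanity check (a sketch using the sigma and mean_vec objects computed earlier), the standard deviation of the simulated sample means should be close to the theoretical standard error:

sigma / sqrt(50)  # theoretical standard error, approximately 1457
sd(mean_vec)      # empirical standard error from the 1000 simulated means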

Random Sample

Definition: Sample

If \(X_1, X_2, \dots, X_n\) are independent random variables (RVs) having a common distribution \(F\), then we say that they constitute a sample (sometimes called a random sample) from the distribution \(F\).

  • Oftentimes, \(F\) is not known and we attempt to make inferences about it based on a sample.
  • Parametric inference problems assume \(F\) follows a named distribution (e.g. normal, Poisson, etc) with unknown parameters.

Statistic

Definition: (Sample) Statistic

Let’s denote the sample data as \(X = (X_1, X_2, \dots, X_n)\), where \(n\) is the sample size. A statistic, denoted \(T(X)\), is a function of a sample of observations.

\[ T(X) = g(X_1, X_2, \dots, X_n) \]

  • Prime examples of statistics are the sample mean and the sample variance.
  • Understanding that these are both RVs, we can calculate quantities like their expected values.

Sampling Distribution

Definition: Sampling Distribution

A sampling distribution is the probability distribution of a given sample statistic.

In other words, if we were to repeatedly draw samples from the population and compute the value of the statistic (e.g. the sample mean or variance), the sampling distribution is the probability distribution of the values that the statistic takes on.

Note

In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Review of CLT

Central Limit Theorem

Let \(X_1, X_2, \dots, X_n\) be a sequence of independent and identically distributed (i.i.d.) RVs, each having mean \(\mu\) and variance \(\sigma^2\). Then for \(n\) large, the distribution of their sum is approximately normal with mean \(n\mu\) and variance \(n\sigma^2\):

\[ \begin{align*} X_1 + X_2 + \dots + X_n &\sim N(n\mu, n\sigma^2) \end{align*} \]

Equivalently, dividing by \(n\):

\[ \begin{align*} \dfrac{X_1 + X_2 + \dots + X_n}{n} &\sim N\left(\mu, \dfrac{\sigma^2}{n}\right) \end{align*} \]

\[ Z_n = \dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}} \rightarrow N(0,1) \text{ as } n \rightarrow \infty \]

Sample Mean

A prime example of a sample statistic is the sample mean:

\[ \overline{X} = \dfrac{X_1 + X_2 + \dots + X_n}{n} \]

This is simply another RV having its own:

  • expected value \(\mathbb{E}[\overline{X}]\) (i.e. the “mean of the sample mean”)
  • variance \(\Var{\overline{X}}\)
  • and distribution (i.e. sampling distribution)

Expected Value of the Sample Mean

Assuming \(X_i \overset{\mathrm{iid}}{\sim} F\) where \(F\) has a mean of \(\mu\) and variance \(\sigma^2\):

\[ \begin{align*} \mathbb{E}[\overline{X}] &= \mathbb{E}\left[\dfrac{X_1 + \dots + X_n}{n}\right] \\ &= \dfrac{\mathbb{E}[X_1] + \dots + \mathbb{E}[X_n]}{n} \\ &= \dfrac{\sum_{i=1}^n \mathbb{E}[X_i]}{n} \\ &= \dfrac{n\mu}{n} = \mu \end{align*} \]

Variance of the Sample Mean

\[ \begin{align*} \text{Var}\left(\overline{X}\right) &= \text{Var}\left(\frac{X_1 + \dots + X_n}{n}\right) \\ &= \frac{1}{n^2} \left[\text{Var}(X_1) + \dots + \text{Var}(X_n)\right] && \text{(by independence)} \\ &= \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n} \end{align*} \]
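For example, for the loan population above (\(\sigma \approx 10301.96\), \(n = 50\)) this gives

\[ \sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10301.96}{\sqrt{50}} \approx 1456.92, \]

which is the spread we should expect to see in the simulated sampling distribution of the mean.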

Rolling a die

Example: Rolling a die

What is the sampling distribution of the mean of 5 rolls of a die? \(\overline{X} \sim ?\)

Q: What distribution does \(X_i\) follow? That is, \(X_i \sim ?\)

\(X_i\) follows a discrete uniform distribution Unif\((a = 1, b = 6)\) with probability mass function (PMF) given by \[P(X = i) = \frac{1}{6} \quad \text{for } i = 1, 2, \dots, 6\]

Mean and standard error

\[\begin{align} \mathbb{E}(X_i) = \frac{a + b}{2} = \frac{1 + 6}{2} = 3.5 \end{align}\] \[\begin{align} \text{Var}(X_i) &= \frac{(b-a+1)^2-1}{12} \\ &= \frac{(6 - 1 + 1)^2 - 1}{12} = 2.916667 \end{align}\]

Theoretical Result

Let \(X_i\) be the number of dots facing up when rolling a fair die.

Let \(\overline{X} = \frac{X_1 + X_2 + \dots + X_5}{5}\) be the mean of five rolls.

Then from the CLT we know: \[ \begin{align*} \overline{X} &\sim N\left(\mu_{\overline{X}} = 3.5, \ \sigma_{\overline{X}} = \frac{\sqrt{((6-1+1)^2 - 1)/12}}{\sqrt{5}}\right)\\ &\sim N(\mu_{\overline{X}} = 3.5, \ \sigma_{\overline{X}} = 0.7637626) \end{align*} \]

Empirical Result

[Histograms of the empirical sampling distribution of \(\overline{X}\) for samples of n = 1, 10, and 30 die rolls.]
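A minimal sketch of how such an empirical result can be simulated in base R (the object names are ours; we use the five-roll setup from the example):

n_rolls <- 5   # rolls per sample
xbar_die <- replicate(10000, mean(sample(1:6, n_rolls, replace = TRUE)))
mean(xbar_die)  # should be close to 3.5
sd(xbar_die)    # should be close to 0.7637626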

Different Sample Sizes

Solar Energy Example

Example: Solar Energy

Suppose we are interested in the proportion of American adults who support the expansion of solar energy. Suppose we don’t have access to the population of all American adults (quite a likely scenario) and instead sample 1000 American adults, of whom 895 approve of solar energy. What is the sampling distribution of the proportion?

Q: To answer this, let’s first focus on \(X_i\), defined to be 1 if the \(i\)th randomly selected American within our sample of 1000 approves of solar energy, and 0 otherwise. What distribution does \(X_i\) follow?

\(X_i\) follows a Bernoulli(\(p\)) distribution with PMF given by \[P(X_i = 1) = p, \qquad P(X_i = 0) = 1 - p = q\] with \(\mathbb{E}[X_i] = p\) and \(\text{Var}(X_i) = p(1-p)\).

Sampling Distribution for Proportions

CLT for proportions

When observations are independent and the sample size is sufficiently large, the sampling distribution of the sample proportion \(\hat p\) is approximately normal:

\[ \hat p = \frac{X_1 + X_2 + \dots + X_n}{n} \rightarrow N\left(\mu_{\hat p} = p, \ \sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}\right) \]

Success-failure condition

In order for the Central Limit Theorem to hold, the sample size is typically considered sufficiently large when \(np \geq 10\) and \(n(1-p) \geq 10\); this is called the success-failure condition.
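For the solar energy example, a quick check in R (substituting \(\hat p\) for the unknown \(p\); object names are ours):

n <- 1000
phat <- 895 / n
n * phat        # 895 >= 10
n * (1 - phat)  # 105 >= 10, so the success-failure condition is met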

Standard error of proportion

\[\begin{align*} \sigma_{\hat p} &= \sqrt{\frac{p(1-p)}{n}} \end{align*}\]

To approximate this we substitute \(p\) (the unknown population parameter) with the point estimate \(\hat p\): \[\begin{align*} \hat{\sigma}_{\hat p} &\approx \sqrt{\frac{\hat p(1-\hat p)}{n}} \end{align*}\]

This is sometimes called the plug-in principle.
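Continuing the example (a sketch, with phat and n as defined above):

sqrt(phat * (1 - phat) / n)  # plug-in estimate of the standard error, approx 0.0097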

Empirical Sampling distribution of Proportion

Because I have simulated this data, we can compare the empirical sampling distribution to the theoretical sampling distribution (red) and to the sampling distribution using the estimated standard error (blue).
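A minimal sketch of such a simulation, treating \(p = 0.895\) as the true proportion (object names and plotting choices are ours):

p <- 0.895
n <- 1000
phat_vec <- replicate(10000, mean(rbinom(n, 1, p)))  # 10000 simulated sample proportions
hist(phat_vec, freq = FALSE, breaks = 30, xlab = "Sample Proportion",
     main = "Empirical Sampling Distribution of the Proportion")
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)),
      add = TRUE, col = "red")  # theoretical sampling distribution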

Sampling Distribution of Variance

The sample variance, \(S^2\), is yet another sample statistic which we define as \[S^2 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{n-1}\]

It can be shown that this is an unbiased estimator of \(\sigma^2\), i.e. \[\mathbb{E}[S^2] = \sigma^2\]

Expected value of Variance

Using the following algebraic result (see Proof):

\[ \sum_{i=1}^{n} \left( x_i - \overline{x} \right)^2 = \sum_{i=1}^{n} x_i^2 - n\overline{x}^2 \] we can rewrite the statistic as:

\[ (n-1) S^2 = \sum_{i=1}^{n} (X_i - \overline{X})^2 = \sum_{i=1}^{n}X_i^2 - n \overline{X}^2 \]

Proof

\[\begin{align*} \sum_{i=1}^{n} \left( x_i - \overline{x} \right)^2 &= \sum_{i=1}^{n} \left( x_i^2 - 2x_i\overline{x} + \overline{x}^2 \right) \\ &= \sum_{i=1}^{n} x_i^2 - 2\overline{x} \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \overline{x}^2 \\ &= \sum_{i=1}^{n} x_i^2 - 2\overline{x} (n\cdot \overline{x}) + n\cdot \overline{x}^2 \\ &= \sum_{i=1}^{n} x_i^2 + \overline{x}^2(n - 2n) \\ &= \sum_{i=1}^{n} x_i^2 - n\overline{x}^2 \end{align*}\]

Expected value of Variance (cont’d)

Taking the expectation of both sides … \[ (n-1)\mathbb{E}[S^2] = \mathbb{E}\left[\sum_{i=1}^{n} X_i^2\right] - n\mathbb{E}[\overline{X}^2] \] Recall that for any random variable \(W\), \(\mathbb{E}[W^2] = \text{Var}(W) + (\mathbb{E}[W])^2\), so \[\begin{align*} (n-1)\mathbb{E}[S^2] &= n\text{Var}(X_i) + n(\mathbb{E}[X_i])^2 - n\left( \text{Var}(\overline{X}) + (\mathbb{E}[\overline{X}])^2 \right)\\ &= n\sigma^2 + n\mu^2 - n \text{Var}(\overline{X}) - n\mu^2 \\ &= n\sigma^2 - n \text{Var}(\overline{X}) \\ &= n\sigma^2 - n \frac{\sigma^2}{n} = (n-1)\sigma^2 \end{align*}\]

Dividing both sides by \(n-1\) gives \(\mathbb{E}[S^2] = \frac{(n-1)\sigma^2}{n-1} = \sigma^2\).

Sampling Distribution of Variance

Normal Population

Let \(X_1, X_2, \dots, X_n\) be a random sample from a normal distribution with mean \(\mu\) and variance \(\sigma^2\). It can be shown that

\[\begin{align*} \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(n-1)} \end{align*}\] where \(\chi^2_{(n-1)}\) denotes a chi-squared distribution with \(n-1\) degrees of freedom.

Comment

Warning

It’s important to note that this result holds under the assumption of normality for the underlying population. If the population is not normal, \(\frac{(n-1)S^2}{\sigma^2}\) need not follow a chi-squared distribution, even approximately; the uniform-population simulation below illustrates how the two can differ.

Simulation 1: Normal Population

Example: Chi-square simulation (Normal population)

Suppose we take 10000 samples of size \(n = 5\) from a normal population with \(\sigma^2 = 1\). Plot the empirical distribution from 10000 runs alongside the theoretical distribution of \(\frac{(n-1)S^2}{\sigma^2}\).
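A minimal sketch of this simulation in base R (object names and plotting choices are ours):

n <- 5
sigma2 <- 1
stat <- replicate(10000, (n - 1) * var(rnorm(n, mean = 0, sd = sqrt(sigma2))) / sigma2)
hist(stat, freq = FALSE, breaks = 40, xlab = "(n-1)S^2/sigma^2",
     main = "Normal Population, n = 5")
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")  # theoretical chi-squared density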

[Histograms of \(\frac{(n-1)S^2}{\sigma^2}\) with the \(\chi^2_{(n-1)}\) density overlaid, for n = 10, 20, and 100.]

Simulation 2: Uniform Population

Example: Chi-square simulation (Uniform population)

Suppose we take 10000 samples of size \(n = 5\) from a uniform population with \(\sigma^2 = 1\). Plot the empirical distribution of \(\frac{(n-1)S^2}{\sigma^2}\) from 10000 runs alongside the theoretical distribution of \(\frac{(n-1)S^2}{\sigma^2}\) if the population were normal.
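The same sketch with uniform data; Unif\((0, \sqrt{12})\) has variance 1 (object names and plotting choices are ours):

n <- 5
stat_unif <- replicate(10000, (n - 1) * var(runif(n, min = 0, max = sqrt(12))))  # sigma^2 = 1
hist(stat_unif, freq = FALSE, breaks = 40, xlab = "(n-1)S^2/sigma^2",
     main = "Uniform Population, n = 5")
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")  # chi-squared density under normality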

[Histograms of \(\frac{(n-1)S^2}{\sigma^2}\) for uniform data, with the normal-theory \(\chi^2_{(n-1)}\) density overlaid, for n = 10, 20, and 100.]
