STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
\(\newcommand{\Var}[1]{\text{Var}({#1})}\)
In this lecture we will be covering
🤔 What is a “statistic”?
Definition: (Sample) Statistic
Let’s denote the sample data as \(X = (X_1, X_2, \dots, X_n)\), where \(n\) is the sample size. A statistic, denoted \(T(X)\), is a function of a sample of observations.
\[ T(X) = g(X_1, X_2, \dots, X_n) \]
Probability Distribution
A probability distribution is a function that describes how likely each possible outcome (or range of outcomes) of a random variable is.
Definition: Sample
If \(X_1, X_2, \dots, X_n\) are independent random variables (RVs) having a common distribution \(F\), then we say that they constitute a sample (sometimes called a random sample) from the distribution \(F\).
A prime example of a statistic is the sample mean
\[ \bar X = \dfrac{X_1 + X_2 + \dots + X_n}{n} \]
At the end of the day, \(\bar X\) is just another random variable, and therefore has its own probability distribution.
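To make the definition concrete, here is a minimal R sketch. The data vector `x` is invented for illustration (it is not the loan data); each quantity below is a function of the sample alone, so each one is a statistic.

```r
# Hypothetical sample of n = 5 observations (values invented for illustration)
x <- c(12, 7, 9, 15, 10)

n <- length(x)        # sample size
x_bar <- sum(x) / n   # sample mean, computed straight from the definition
s <- sd(x)            # sample standard deviation
x_max <- max(x)       # sample maximum, also a statistic

all.equal(x_bar, mean(x))  # the built-in mean() agrees with the definition
```

A different sample would give different values of `x_bar`, `s`, and `x_max`, which is why each statistic is itself a random variable.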
Rather than recalling the theorem right away, let’s explore it empirically using a simulation.
loan* data
Let’s return to our loan50 example from previous lectures.
This data set represents 50 loans made through the Lending Club platform, which is a platform that allows individuals to lend to other individuals.
The loans_full_schema data set comprises 10,000 loans made through the Lending Club platform.
Let’s pretend for demonstration purposes that this data set makes up all the loans made through the Lending Club and thus represents the population.
loan50 is therefore a sample from that population.


For the population we might be interested in some parameter, for example, the average loan amount, i.e. the mean value \(\mu\), and its standard deviation \(\sigma\).
The population mean is not usually possible to compute, but since we have the data for the entire population we can easily calculate it using:
The average loan amount is
\(\quad\mu\) = $16,361.92
The standard deviation is
\(\quad\sigma\) = $10,301.96
If we did not have access to the population information (as is often the case) and only had access to a sample, we would instead compute the sample mean:
The sample average loan amount is
\(\quad\bar x\) = $17,083.00
The sample standard deviation is
\(\quad s\) = $10,455.46
🎲 A different sample of 50 will produce a different estimate:
[1] $15,745.00
Let’s repeat this process many times and keep track of all of the sample mean calculations (one for each sample).
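N <- nrow(loans_full_schema)  # population size
mean_vec <- numeric(1000)     # one slot per sample mean
for (i in 1:1000) {
  # draw 50 rows at random (without replacement) and record their mean loan amount
  sampled50_loans <- loans_full_schema[sample(N, 50), ]
  mean_vec[i] <- mean(sampled50_loans$loan_amount)
}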
mean_vec[1:20]
 [1] 15564.0 14651.5 16365.5 17454.5 15201.0 16241.0 17368.0 15622.5 17349.5
[10] 16319.0 15689.5 17992.0 17486.5 16188.5 15035.0 15782.0 16204.0 18856.5
[19] 12639.0 15587.0
...
The distribution of the sample means is an example of a sampling distribution.
For the first sample, \(\bar x_1\) = 15564
For the second sample, \(\bar x_2\) = 14651.5
For the third sample, \(\bar x_3\) = 16365.5
For the fourth sample, \(\bar x_4\) = 17454.5
For the fifth sample, \(\bar x_5\) = 15201
For the sixth sample, \(\bar x_6\) = 16241

In simulation land, I can take as many random samples as I like.
👈 Here I take 500 samples of size \(n=50\) and keep track of the sample mean, \(\bar x\), for each.
⏩ The collection of \(\bar x\)’s forms a rough empirical approximation of the sampling distribution of the sample mean \(\bar X\).
To get an empirical estimate of the distribution, we plot a histogram of sample means we have observed.
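The sketch below runs this kind of experiment end to end with a stand-in population, so it works without the loan data. The vector `pop` is simulated (right-skewed, via `rgamma`) purely for illustration; the shape and scale values, the seed, and the variable names are arbitrary assumptions, not part of the lecture's data.

```r
set.seed(205)  # arbitrary seed so the run is reproducible

# Stand-in population of 10,000 right-skewed values (NOT the real loan data)
pop <- rgamma(10000, shape = 2.5, scale = 6500)
N <- length(pop)

# Draw 1000 samples of size 50 and store each sample mean
n_samples <- 1000
mean_vec <- numeric(n_samples)
for (i in seq_len(n_samples)) {
  mean_vec[i] <- mean(pop[sample(N, 50)])
}

# Histogram of the sample means: an empirical sampling distribution
hist(mean_vec, breaks = 30,
     main = "Sample means (n = 50)", xlab = "Sample mean")
abline(v = mean(pop), lwd = 2)  # population mean for reference
```

Even though the stand-in population is skewed, the histogram of sample means should look roughly symmetric and centred near the population mean.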
Thanks to the Central Limit Theorem we actually know what the theoretical probability distribution of \(\overline{X}\) is …
Theorem 1 (Central Limit Theorem (CLT)) Under appropriate conditions, the distribution of the sample mean \(\overline{X}\) is approximately normal, with mean \(\mu_{\overline{X}} = \mu\) (the population mean) and standard deviation, also known as the standard error (SE), \(\sigma_{\overline{X}} = \frac{\sigma}{\sqrt{n}}\). We often write:
\[ \overline{X} \;\approx\; N\!\left(\mu,\; \frac{\sigma}{\sqrt{n}}\right). \]
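As a quick numerical check, we can plug the loan population values quoted earlier (\(\mu\) = 16,361.92 and \(\sigma\) = 10,301.96) and \(n = 50\) into the theorem. The variable names are illustrative choices.

```r
mu <- 16361.92     # population mean loan amount (quoted earlier)
sigma <- 10301.96  # population standard deviation (quoted earlier)
n <- 50            # sample size

se <- sigma / sqrt(n)  # standard error of the sample mean
se
```

Under the CLT, means of samples of 50 loans should therefore be centred near $16,362 with a spread of roughly $1,457.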
Understanding Sampling Distributions
A sampling distribution represents:
The distribution of a sample.
The distribution of a population.
The distribution of a statistic, like the sample mean, across repeated samples.
The distribution of a parameter, like the population mean.
Central Limit Theorem
According to the Central Limit Theorem, the sampling distribution of the sample mean:
Always follows a normal distribution regardless of sample size.
Becomes approximately normal as the sample size increases.
Becomes approximately normal as the number of samples increases.
Has a standard deviation equal to the population standard deviation.
Is skewed when the population distribution is skewed.
Variability of the Sample Mean
As the sample size increases, the variability of the sample mean:
Increases.
Decreases.
Remains constant.
Depends on the population distribution.
Definition: Sampling Distribution
A sampling distribution is the probability distribution of a statistic.
In other words, if we were to repeatedly draw samples from the population and compute the value of the statistic (e.g. the sample mean or variance), the sampling distribution is the probability distribution of the values that the statistic takes on.
Important
In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.
Recall that we can standardize any normal RV, say \(X\sim\)Normal(\(\mu\), \(\sigma\)), to the standard normal distribution, i.e. a Normal(0, 1).
The standardization formula is
\[ Z = \dfrac{X -\mu}{\sigma} \]
where \(Z \sim\) Normal(0,1)
If we take a sample of size \(n\), then under the CLT conditions the sample mean \(\bar{X}\) approximately follows:
\[ \bar{X} \sim \text{Normal}\left(\mu_{\bar X} = \mu, \sigma_{\bar X} = {\sigma}/{\sqrt{n}}\right) \]
To standardize the sample mean:
\[ Z = \frac{\bar{X} - \mu_{\bar X}}{\sigma_{\bar X}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{Normal}(0,1) \]
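A minimal R sketch of this standardization, applied to the loan50 sample mean and the population values quoted earlier. The helper name `z_for_mean` is hypothetical, introduced only for this illustration.

```r
# Standardize a sample mean: z = (xbar - mu) / (sigma / sqrt(n))
z_for_mean <- function(xbar, mu, sigma, n) {
  (xbar - mu) / (sigma / sqrt(n))
}

# loan50 sample mean against the population values quoted earlier
z <- z_for_mean(xbar = 17083, mu = 16361.92, sigma = 10301.96, n = 50)
z
```

The result is about 0.49, so the observed sample mean sits roughly half a standard error above the population mean.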
We standardize so that different problems can be compared using the same reference distribution.
Exam scores
Example 1 Suppose we are interested in exam scores from two different courses:
A student scores 75 in both courses.
Without standardization: A score of 75 looks “the same” in both courses.
After standardizing:
Interpretation:


There are two ways we’ll be answering inferential questions:
Important
You should be able to answer inference questions by hand and in R.
✍️ Pros for by-hand ✍️
🧠 Build understanding
🔍 Develops critical thinking
📝 Essential for exams
💻 Pros for R 💻
⚡ Fast and accurate calculations
📊 Handles complex computations easily
🛠️ Standard tool in practice
To find \(\Pr(X < 180)\) for \(X \sim N(\mu = 170, \sigma = 8)\), we standardize \(X\) and use the standard normal distribution:
\(\phantom{\Pr(X < 180)}=\Pr\left(Z < \dfrac{180 -170}{8}\right)\)
\(\phantom{\Pr(X < 180)}=\Pr\left(Z < \dfrac{10}{8}\right)\)
\(\phantom{\Pr(X < 180)}=\Pr\left(Z < 1.25\right)\)
\(\phantom{\Pr(X < 180)}=\quad ?\)
At this point we can consult our Z-table
Use this table to find probabilities for positive Z-scores!
Use this table to find probabilities for negative Z-scores!
Note
Notice how \(Z \sim N(0,1)\) is the default
q: vector of quantiles
mean: mean (default is 0)
sd: standard deviation (default is 1)
lower.tail: logical; if TRUE (default), probabilities are \(\Pr(X \leq q)\), otherwise \(\Pr(X > q)\)
For the standard normal we use the defaults
For some \(X \sim N(\mu = \texttt{mu}, \sigma = \texttt{sig})\)…
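As a sketch of both cases, the calls below reuse the by-hand example, taking \(\mu = 170\) and \(\sigma = 8\) from that calculation. The variable names `p_z` and `p_x` are illustrative.

```r
# Standard normal: mean = 0 and sd = 1 are the defaults
p_z <- pnorm(1.25)

# Same probability without standardizing by hand: X ~ N(mu = 170, sigma = 8)
p_x <- pnorm(180, mean = 170, sd = 8)

c(p_z, p_x)  # both are about 0.894
```

Either route gives the same answer; passing `mean` and `sd` simply lets R do the standardization.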
To find probabilities for \(X\sim N(\mu, \sigma)\)
Convert \(X\) to a \(Z\)-score:
\[ Z = \frac{x - \mu}{\sigma} = z \]
Use the standard normal Z-table or R
| Probability | R code |
|---|---|
| \(P(Z < z)\) | pnorm(z) |
| \(P(Z > z)\) | pnorm(z, lower.tail = FALSE) |
| \(P(a < Z < b)\) | pnorm(b) - pnorm(a) |
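The upper-tail and interval cases can be sketched the same way. The example reuses \(\mu = 170\) and \(\sigma = 8\) from the by-hand calculation; the interval endpoints 160 and 180 are chosen here only for illustration.

```r
mu <- 170; sigma <- 8   # values from the by-hand example

# Pr(X > 180): upper tail
p_upper <- pnorm(180, mean = mu, sd = sigma, lower.tail = FALSE)

# Pr(160 < X < 180): difference of two lower-tail probabilities
p_between <- pnorm(180, mu, sigma) - pnorm(160, mu, sigma)

c(p_upper, p_between)
```

These come out to roughly 0.106 and 0.789 respectively.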
Household Groceries (iClicker)
Exercise 1 Weekly Grocery Expenses The weekly grocery expenses for households in a certain region follow the distribution given in Figure 1. According to a national consumer survey, the average grocery expense for this region is $107 with a standard deviation of $38. A random sample of 30 households is selected from this population.
What is the sampling distribution of \(\bar X\)?
✏️ Household Groceries
Exercise 2 Weekly Grocery Expenses The weekly grocery expenses for households in a certain region follow the distribution given in Figure 1. According to a national consumer survey, the average grocery expense for this region is $107 with a standard deviation of $38. A random sample of 30 households is selected from this population.

What is the probability that the average weekly grocery expense for a randomly selected sample of 30 households exceeds $120?
🔹 Although the population distribution is skewed, the CLT tells us that the sampling distribution of \(\bar{X}\) can be approximated by a normal distribution when the sample size is sufficiently large.
🔹 We standardize \(\bar X\) to find probabilities on the standard normal curve.
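One way to carry out these two steps in R, using only the numbers given in the exercise (variable names are illustrative):

```r
mu <- 107; sigma <- 38; n <- 30   # values given in the exercise

se <- sigma / sqrt(n)             # standard error of the sample mean
z <- (120 - mu) / se              # z-score for a sample mean of 120

# Pr(Xbar > 120) under the normal approximation
p <- pnorm(z, lower.tail = FALSE)
c(se = se, z = z, p = p)
```

This gives a standard error of about 6.94, a z-score of about 1.87, and a probability of roughly 0.03. Because it relies on the CLT, the answer is an approximation.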

Rolling a die
Example 2 What is the sampling distribution of the average number of dots facing up when rolling 5 dice?
We first need to identify the distribution of \(X_i\)
Let \(X_i\) be the number of dots facing up when rolling a fair die.
Let \(\overline{X} = \frac{X_1 + X_2 + \dots + X_5}{5}\) be the mean of five rolls.
Then from the CLT we know, approximately: \[ \begin{align*} \overline{X} &\sim N\left(\mu_{\overline{X}} = 3.5, \sigma_{\overline{X}} = \frac{\sqrt{((6-1+1)^2 -1)/12}}{\sqrt{5}}\right)\\ &\sim N(\mu_{\overline{X}} = 3.5, \sigma_{\overline{X}} = 0.7637626) \end{align*} \]
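A short simulation can be used to sanity-check these values. This is a sketch: the seed and the number of replications (10,000) are arbitrary choices.

```r
set.seed(205)  # arbitrary seed for reproducibility

# Theoretical values for one fair die (discrete uniform on 1..6)
mu <- (1 + 6) / 2                        # 3.5
sigma <- sqrt(((6 - 1 + 1)^2 - 1) / 12)  # about 1.708
se <- sigma / sqrt(5)                    # about 0.764

# Simulate 10,000 averages of 5 rolls
xbar <- replicate(10000, mean(sample(1:6, size = 5, replace = TRUE)))

c(theory_mean = mu, sim_mean = mean(xbar), theory_se = se, sim_sd = sd(xbar))
```

The simulated mean and standard deviation should land close to 3.5 and 0.764. With only five dice the normal shape is a rough approximation, but the mean and standard error formulas hold for any \(n\).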
Each observation provides new information and does not influence the others.
Common violations:
All observations come from the same population with the same distribution.
Common violations:
The population has a well-defined average and spread.
When this fails:
The CLT is an approximation that improves as \(n\) increases.
Rule of thumb:
Important
The more non-normal the population, the larger the sample size needed.
Special case: If the population distribution is normal, then the sampling distribution of \(\bar X\) is normal for any sample size \(n\).
Comments
The sampling distribution gets “pointier” as the sample size increases.
Hence as \(n\) increases the SE decreases.
Hence as \(n\) decreases the SE increases.
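To see the effect numerically, the sketch below evaluates \(\sigma/\sqrt{n}\) for a few sample sizes, borrowing \(\sigma = 38\) from the grocery exercise. The particular sample sizes are arbitrary.

```r
sigma <- 38                      # population SD from the grocery exercise
n <- c(10, 30, 100, 1000)        # a few sample sizes to compare
se <- sigma / sqrt(n)            # standard error for each n

data.frame(n = n, se = round(se, 2))
```

Note that the SE shrinks with \(\sqrt{n}\), so halving the SE requires four times as many observations.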