
Unknown Population Variance \(\sigma^2\)
University of British Columbia Okanagan
Previously we constructed confidence intervals (CI) for \(\mu\) assuming the population standard deviation \(\sigma\) was known.
While this serves as a useful stepping stone, in most practical situations, if we don’t know \(\mu\), chances are we don’t know \(\sigma\) either.
Today we will cover how to construct a CI for the population mean when \(\sigma\) is unknown.
Conditions for the \(t\)-based methods
Random Sampling\(^\ddagger\): The data should be collected through a process of random sampling or a process that mimics random sampling.
Independence\(^\ddagger\): The observations in the sample must be independent of each other.
Approximate Normality or Sufficiently Large Sample: One of the following:
Warning
The \(t\) method is NOT robust to strong skewness or outliers when \(n\) is small.
Exclusive Relationships
Exercise 1 A random sample of 50 college students were asked how many exclusive relationships they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74. Construct a 95% confidence interval for the true mean number of exclusive relationships among college students.
\[ \bar{x} = 3.2,\quad s = 1.74 \quad \text{and }n>30 \]
The approximate 95% confidence interval is defined as:
\[ \text{point estimate} \pm 1.96 \times SE \]
\[ \begin{align} SE = \sigma/\sqrt{n} = {?}/\sqrt{n} \end{align} \]
You might be tempted to simply use \(s\) in place of \(\sigma\),
\[ \dfrac{\bar{X} - \mu}{s/\sqrt{n}} \]
⚠️ BUT ☝️this does NOT follow a standard normal distribution
\(t\)-Statistic
Under the aforementioned conditions, \(\frac{\bar{X} - \mu}{s/\sqrt{n}}\) follows a Student's \(t\)-distribution with \(\nu = n-1\) degrees of freedom. We write this as:
\[ \dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \]
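A small simulation can make the difference from the standard normal concrete. The sketch below is illustrative only: the sample size, seed, number of replications, and the normal population are assumptions chosen for the demonstration, not values from the lecture. It draws many small samples, computes \((\bar{X} - \mu)/(s/\sqrt{n})\) for each, and compares how often the statistic lands beyond \(\pm 1.96\) with what the standard normal and the \(t_{n-1}\) distribution each predict.

```r
set.seed(2024)   # arbitrary seed, used only for reproducibility
n    <- 5        # small sample size (illustrative assumption)
mu   <- 0        # true mean of the simulated normal population
reps <- 10000    # number of simulated samples

# For each simulated sample, compute (xbar - mu) / (s / sqrt(n))
t_stats <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = 1)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

# Share of simulated statistics falling outside +/- 1.96
observed_tail <- mean(abs(t_stats) > 1.96)

# What each candidate distribution predicts for that share
normal_tail <- 2 * pnorm(-1.96)               # standard normal
t_tail      <- 2 * pt(-1.96, df = n - 1)      # t with n - 1 df

print(c(observed = observed_tail, normal = normal_tail, t = t_tail))
```

With a sample this small, the observed share should sit well above the normal prediction and close to the \(t_{4}\) prediction, which is the extra variability that comes from estimating \(\sigma\) with \(s\).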
The (Student) \(t\)-distribution was developed by William Sealy Gosset, a statistician working at the Guinness Brewery in Dublin.
🍺 Guinness relied on small samples for quality control and \(\sigma\) was not known.
🍺 He developed a new distribution that accounted for the extra variability introduced by replacing \(\sigma\) with \(s\).
🍺 Since Guinness prohibited employees from publishing under their real names, he used the pseudonym “Student”.
iClicker: t vs z
Compared to a confidence interval constructed using the z method, a confidence interval for the mean constructed using the \(t\)-method (with the same confidence level and sample size) will generally be:
Narrower
The same width
Wider
It depends
There are two ways we’ll be answering inferential questions:
rows:
the degrees of freedom \(\nu\)
columns:
right-tail probabilities
Note
Probabilities are limited to only
\(p\) = 0.10, 0.05, 0.025, 0.01, 0.005
body cell values:
\(t\)-values (akin to Z-scores)
\(\Pr(t_{2}>1.886) = 0.1\)

\(\Pr(t_{2}>2.920) = 0.05\)

\(\Pr(t_{2}>4.303) = 0.025\)

\(\Pr(t_{2}>6.965) = 0.01\)

\(\Pr(t_{2}>9.925) = 0.005\)
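The \(\nu = 2\) row of the table can be reproduced in R with `qt()`. This is a minimal sketch; the variable names are illustrative. Setting `lower.tail = FALSE` asks for right-tail probabilities, which matches the column headings of the table.

```r
# Right-tail probabilities used as the table's column headings
tail_probs <- c(0.10, 0.05, 0.025, 0.01, 0.005)

# t-values with the given right-tail area, for 2 degrees of freedom
t_values <- qt(tail_probs, df = 2, lower.tail = FALSE)

# Round to three decimals, as in a printed table
table_row <- round(t_values, 3)
names(table_row) <- tail_probs
print(table_row)
```

Changing `df = 2` to another value gives the corresponding row for that number of degrees of freedom.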

Approximate Probability Ranges
If your \(t\)-value is not listed in the table (which is common), report the probability as a range using the one/two nearest table value(s).
Consider a \(t_{10}\) random variable…
\(\Pr(t_{10}>2) = ??\) We know:
\(\Pr(t_{10}>2.228) = 0.025\)
\(\Pr(t_{10}>1.812) = 0.05\)
Visually we can see that

\(\textcolor{purple}{\text{Area in Purple}} < \textcolor{orange}{\text{Area in Orange} = 0.05}\) \(\textcolor{purple}{\text{Area in Purple}} > \textcolor{deepskyblue}{\text{Area in Blue} = 0.025}\)
In probability statements
\(\textcolor{purple}{\Pr(t_{10}>2)} < \textcolor{orange}{\Pr(t_{10}>1.812) = 0.05}\) \(\textcolor{purple}{\Pr(t_{10}>2)} > \textcolor{deepskyblue}{\Pr(t_{10}>2.228) = 0.025}\)
Hence …
\[\textcolor{deepskyblue}{0.025} < \textcolor{purple}{\Pr(t_{10}>2)} < \textcolor{orange}{0.05}\]
# Probability
pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
# Quantiles (t-values)
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
where
x, q: vector of quantiles.
p: vector of probabilities.
df: degrees of freedom (\(>0\); non-integer values and Inf are allowed).
lower.tail: if TRUE (default), probabilities are \(\Pr(X \leq x)\), else \(\Pr(X > x)\).
Approximate from tables:
\(0.025 < \Pr(t_{10} > 2) < 0.050\)
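The table-based range can be checked against the exact tail probability from `pt()`. The sketch below is a sanity check rather than a required step; the variable names are illustrative. It computes \(\Pr(t_{10} > 2)\) directly and confirms that it falls between the tail areas at the two neighbouring table values.

```r
# Exact right-tail probability for t = 2 with 10 degrees of freedom
p_exact <- pt(2, df = 10, lower.tail = FALSE)

# Tail areas at the neighbouring table values
lower_bound <- pt(2.228, df = 10, lower.tail = FALSE)  # close to 0.025
upper_bound <- pt(1.812, df = 10, lower.tail = FALSE)  # close to 0.05

print(c(lower = lower_bound, exact = p_exact, upper = upper_bound))

# TRUE if the exact value lies inside the table-based range
print(lower_bound < p_exact && p_exact < upper_bound)
```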
iClicker
Exercise 2 Using the \(t\)-table, approximate the following probability:
\[\Pr(t_7 > 1)\]
(1-\(\alpha\))100% Confidence Interval (unknown \(\sigma\))
Under the aforementioned conditions, a \(100(1-\alpha)\)% confidence interval for \(\mu\) is \[ \begin{align} \text{point estimate} &\pm \textcolor{red}{\boxed{\text{Margin of Error}}}\\ \bar{x} \ &\pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}} \end{align} \] where \(t_{\nu, \alpha/2}\) is the value such that \(\Pr(-t_{\nu, \alpha/2} < t_{\nu} < t_{\nu, \alpha/2}) = 1 - \alpha\), \(\nu = n-1\), and \(n\) is the sample size. Alternatively, we could express our CI as:
\[ \left(\bar{x} - \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}}, \bar{x} + \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}}\right) \]
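The formula translates almost line by line into R. The helper `t_ci()` below is a hypothetical function written for illustration (it is not part of base R or of the lecture code). It takes summary statistics and a confidence level and returns the interval endpoints, shown here with the summary statistics from the exclusive-relationships exercise.

```r
# Hypothetical helper: t-based CI for a mean from summary statistics
t_ci <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  nu     <- n - 1                        # degrees of freedom
  t_crit <- qt(1 - alpha / 2, df = nu)   # positive critical value t_{nu, alpha/2}
  se     <- s / sqrt(n)                  # estimated standard error
  moe    <- t_crit * se                  # margin of error
  c(lower = xbar - moe, upper = xbar + moe)
}

# Summary statistics from the exclusive-relationships exercise
ci <- t_ci(xbar = 3.2, s = 1.74, n = 50, conf_level = 0.95)
print(round(ci, 4))
```

When the raw data are available, the built-in `t.test(x, conf.level = ...)` reports the same kind of interval in its `conf.int` component.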
Example: Exclusive Relationships
A random sample of 50 college students were asked how many exclusive relationships they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74. Construct a 95% confidence interval for the true mean number of exclusive relationships among college students.
\[ \bar{x} = 3.2,\quad s = 1.74 \quad \text{and }n>30 \]
The approximate 95% confidence interval is defined as:
\[ \text{point estimate} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \cdot \text{SE}}} \]
\[ \begin{align} \text{SE} = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} = 0.2460732 \end{align} \]
\(t_{\nu, \alpha/2}\) = qt(0.975, 50-1) = 2.0095752 (by symmetry, qt(0.025, 50-1) = -2.0095752 gives the same value with a negative sign)
\[ \begin{align} \bar{x} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times SE}} & \rightarrow 3.2 \pm \textcolor{red}{\boxed{2.01 \times 0.246}} \\ & \rightarrow 3.2 \pm \textcolor{red}{\boxed{0.495}} \end{align} \]
Since \(\nu = 49\) is not in our table, we round down to the nearest df that is listed (in this case 38).
We need to find the \(t_{\nu, \alpha/2}\) that satisfies:
\[ \begin{align} \Pr(t_{38} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 2.024 \end{align} \]
Important
Always round down to the nearest available df in the table to ensure the confidence level is at least as high as desired. Rounding down gives a larger \(t_{\nu, \alpha/2}\) than needed and therefore produces a more conservative confidence interval.

Using the tables with conservative \(\nu\) = 38:
\[ \begin{align} \Pr(t_{38} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 2.024\\ [2.7019&,3.6981]\\ \textcolor{green}{\text{Width}} &= \textcolor{green}{0.9962} \end{align} \]
Using exact R calculations (\(\nu = 49\)) \[ \begin{align} t_{\nu, \alpha/2} &= 2.0095752 \\ [2.7055&,3.6945]\\ \textcolor{purple}{\text{Width}} &= \textcolor{purple}{0.989} \end{align} \]

If we incorrectly round up (to \(\nu = \infty\), i.e. the standard normal) \[ \begin{align} \Pr(t_{\infty} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 1.96\\ [2.7177&,3.6823]\\ \textcolor{red}{\text{Width}} &= \textcolor{red}{0.9646} \end{align} \]
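The three choices of degrees of freedom can be compared side by side in R. This is a minimal sketch using the summary statistics from the example; the object names are illustrative. Because R works with unrounded critical values, the figures for 38 and infinite degrees of freedom may differ in the fourth decimal from those based on the rounded table values 2.024 and 1.96.

```r
xbar <- 3.2
se   <- 1.74 / sqrt(50)   # estimated standard error from the example

# Degrees of freedom: table value (rounded down), exact, and normal limit
dfs <- c(table_df_38 = 38, exact_df_49 = 49, normal_df_Inf = Inf)

# Critical values for a 95% interval (right-tail area 0.025)
t_crit <- qt(0.975, df = dfs)
names(t_crit) <- names(dfs)

# Interval endpoints and widths for each choice
comparison <- cbind(
  t_crit = t_crit,
  lower  = xbar - t_crit * se,
  upper  = xbar + t_crit * se,
  width  = 2 * t_crit * se
)
print(round(comparison, 4))
```

The widths shrink as the degrees of freedom grow, so rounding down errs on the conservative side while rounding up produces an interval that is too narrow.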

Interpretation of Confidence intervals
Which of the following is the correct interpretation of this confidence interval? We are 95% confident that…
Use the clipboard to copy the data into R:
length(data) = 20
mean(data) = 77.35
sd(data) = 11.3521387
library(car)      # provides qqPlot()
par(mfrow=c(2,2)) ; par(mar=c(2,4,2,0)+0.1)
# install once if needed
# install.packages("vioplot")
library(vioplot)  # provides vioplot()
boxplot(data, main = "Boxplot")
hist(data, main = "Histogram", col = "skyblue", border = "white")
# qqnorm(data); qqline(data)
this = qqPlot(data, main = "QQ-plot")
vioplot(data, main = "Violin plot", col = "skyblue", horizontal = FALSE)
Plots for assessing normality. This is necessary to determine whether using a t-distribution for constructing the confidence interval is appropriate.
The sample size is 20, which yields \(n - 1 = 19\) degrees of freedom. For a 98% confidence interval, \(\alpha/2 = 0.01\), and from the table or software we find that \(t_{19, 0.01} = 2.539\).
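The critical value can be confirmed in R. The sketch below assumes the 98% confidence level used in this example; the variable names are illustrative. It shows how the confidence level maps to the right-tail area passed to `qt()`.

```r
conf_level <- 0.98            # confidence level for this example
alpha      <- 1 - conf_level  # total area in both tails
n          <- 20              # sample size

# Critical value with right-tail area alpha / 2 and n - 1 degrees of freedom
t_crit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)
print(round(t_crit, 3))
```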
The estimate of the standard error is \[ \hat \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{11.3521387}{\sqrt{20}} \approx 2.5384 \]
Putting these values into the confidence interval formula:
\[ \begin{align} &\bar{X} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2}\cdot SE(\bar{X})}} \\ &= 77.35 \pm \textcolor{red}{\boxed{2.539 \times 2.5384}} \\ &= 77.35 \pm \textcolor{red}{\boxed{6.445}} \end{align} \]
We are 98% confident that the true mean fasting blood glucose level of women (after a 12-hour fast) lies between 70.905 and 83.795 mg/100 mL.