Confidence Intervals for the Mean

Unknown Population Variance \(\sigma^2\)

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • Previously we constructed confidence intervals (CI) for \(\mu\) assuming the population standard deviation \(\sigma\) was known.

  • While this serves as a useful stepping-stone, in most practical situations, if we don’t know \(\mu\), chances are we don’t know \(\sigma\) either.

  • Today we will cover how to construct a CI for the population mean when \(\sigma\) is unknown.

Conditions

Conditions for the \(t\)-based methods

  1. Random Sampling\(^\ddagger\): The data should be collected through a process of random sampling or a process that mimics random sampling.

  2. Independence\(^\ddagger\): The observations in the sample must be independent of each other.

  3. Approximate Normality or Sufficiently Large Sample: One of the following:

    1. the population distribution is approximately normal, or
    2. the sample size is large (typically \(n \geq 30\))

Warning

The \(t\) method is NOT robust to strong skewness or outliers when \(n\) is small.

Exclusive Relationships

Exercise 1 A random sample of 50 college students were asked how many exclusive relationships they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74. Construct a 95% confidence interval for the true mean number of exclusive relationships among college students.

\[ \bar{x} = 3.2,\quad s = 1.74 \quad \text{and }n>30 \]

The approximate 95% confidence interval is defined as:

\[ \text{point estimate} \pm 1.96 \times SE \]

\[ \begin{align} SE = \sigma/\sqrt{n} = {?}/\sqrt{n} \end{align} \]

CI for unknown \(\sigma\)

You might be tempted to simply use \(s\) in place of \(\sigma\),

\[ \dfrac{\bar{X} - \mu}{s/\sqrt{n}} \]

⚠️ BUT ☝️this does NOT follow a standard normal distribution

  • Estimating \(\sigma\) by \(s\) adds some more uncertainty
  • Consequently, our statistic will have greater variability.

\(t\)-Statistic

Under the aforementioned conditions, \(\frac{\bar{X} - \mu}{s/\sqrt{n}}\) follows a Student’s \(t\)-distribution with \(\nu = n-1\) degrees of freedom. We write this as:

\[ \dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1} \]
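A quick simulation can make the extra variability visible. The sketch below is only illustrative: the sample size, the population values `mu` and `sigma`, the seed, and the number of replications are all assumptions chosen for the demonstration. It compares how often the studentized mean lands beyond \(\pm 1.96\) with what the \(t_{n-1}\) and standard normal distributions predict.

```r
set.seed(1)                 # arbitrary seed, for reproducibility
n     <- 5                  # small sample size (assumed for illustration)
mu    <- 10                 # assumed true population mean
sigma <- 2                  # assumed true population sd
reps  <- 10000              # number of simulated samples

# studentized mean (x-bar - mu) / (s / sqrt(n)) for each simulated normal sample
t_stats <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

tail_sim <- mean(abs(t_stats) > 1.96)                      # simulated two-sided tail area
tail_t   <- 2 * pt(1.96, df = n - 1, lower.tail = FALSE)   # tail area under t_{n-1}
tail_z   <- 2 * pnorm(1.96, lower.tail = FALSE)            # tail area under N(0, 1)

round(c(simulated = tail_sim, t = tail_t, z = tail_z), 3)
```

The simulated tail area should sit close to the \(t_{n-1}\) value and well above the 0.05 that the standard normal would suggest.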

Student \(t\) distribution

The (Student) \(t\)-distribution was developed by William Sealy Gosset, a statistician working at the Guinness Brewery in Dublin.

🍺 Guinness relied on small samples for quality control and \(\sigma\) was not known.

🍺 He developed a new distribution that accounted for the extra variability introduced by replacing \(\sigma\) with \(s\).

🍺 Since Guinness prohibited employees from publishing under their real names, he used the pseudonym “Student”.

Standard Normal vs \(t\)-Distribution

Properties of the \(t\)-distribution

  • The \(t\) distribution is symmetric and always centered at 0
  • It has only one parameter: \(\nu\), its degrees of freedom (df)
    • Smaller \(\nu \implies\) “heavier tails”
    • Larger \(\nu \implies\) thinner tails
    • As \(\nu \rightarrow \infty\) it approaches the standard normal
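The effect of \(\nu\) on the tails can be seen in the critical values themselves. The following sketch (the particular df values are an arbitrary choice) prints the 97.5th percentile of the \(t\)-distribution for increasing degrees of freedom next to the standard normal value.

```r
df_values <- c(1, 2, 5, 10, 30, 100, Inf)   # illustrative degrees of freedom
crit      <- qt(0.975, df = df_values)      # upper 2.5% critical values
z_crit    <- qnorm(0.975)                   # standard normal counterpart

data.frame(df     = df_values,
           t_crit = round(crit, 3),
           z_crit = round(z_crit, 3))
```

The critical values shrink toward the normal value as the degrees of freedom grow, which is the heavier-tails property stated above.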

iClicker

iClicker: t vs z

Compared to a confidence interval constructed using the z method, a confidence interval for the mean constructed using the \(t\)-method (with the same confidence level and sample size) will generally be:

  1. Narrower

  2. The same width

  3. Wider

  4. It depends

R and Tables

There are two ways we’ll be answering inferential questions:

  1. \(t\)-tables
    • Commonly used when working by hand
  2. R
    • Computes probabilities, quantiles, test statistics, confidence intervals directly
    • Useful for checking work; commonly used in practice.

Reading the t-table

Rows of the t-table

rows:

the degrees of freedom \(\nu\)

Columns of the t-table

columns:

right-tail probabilities

Note

The table lists only the right-tail probabilities \(p\) = 0.10, 0.05, 0.025, 0.01, 0.005.

Body of the t-table

body cell values:

\(t\)-values (akin to Z-scores)

  • They are the values of the \(t\)-distribution (for the row's \(\nu\)) that cut off the right-tail probability given in the column header.

Columns of the t-table

Reading across the row for \(\nu = 2\):

\(\Pr(t_{2}>1.886) = 0.1\)

\(\Pr(t_{2}>2.920) = 0.05\)

\(\Pr(t_{2}>4.303) = 0.025\)

\(\Pr(t_{2}>6.965) = 0.01\)

\(\Pr(t_{2}>9.925) = 0.005\)
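A row of the table can be reproduced in R, which is a handy way to check a look-up. This sketch assumes the table uses the five right-tail probabilities noted earlier and rebuilds the row for \(\nu = 2\).

```r
p <- c(0.10, 0.05, 0.025, 0.01, 0.005)         # right-tail probabilities (column headers)
row_df2 <- qt(p, df = 2, lower.tail = FALSE)   # t-values for nu = 2 (table body)

round(row_df2, 3)
# [1] 1.886 2.920 4.303 6.965 9.925
```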

Probability Ranges

Approximate Probability Ranges

If your \(t\)-value is not listed in the table (which is common), report the probability as a range using the one/two nearest table value(s).

Consider a \(t_{10}\) random variable…

\(\Pr(t_{10}>2) = ??\) We know:

\(\Pr(t_{10}>2.228) = 0.025\)

\(\Pr(t_{10}>1.812) = 0.05\)

Visually we can see that

\(\textcolor{purple}{\text{Area in Purple}} < \textcolor{orange}{\text{Area in Orange} = 0.05}\) \(\textcolor{purple}{\text{Area in Purple}} > \textcolor{deepskyblue}{\text{Area in Blue} = 0.025}\)

In probability statements

\(\textcolor{purple}{\Pr(t_{10}>2)} < \textcolor{orange}{\Pr(t_{10}>1.812) = 0.05}\) \(\textcolor{purple}{\Pr(t_{10}>2)} > \textcolor{deepskyblue}{\Pr(t_{10}>2.228) = 0.025}\)

Hence …

\[\textcolor{deepskyblue}{0.025} < \textcolor{purple}{\Pr(t_{10}>2)} < \textcolor{orange}{0.05}\]

t-distribution in R

# Probability 
pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
# Quantiles (t-values)
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)

where

  • x, q vector of quantiles.
  • p vector of probabilities.
  • df degrees of freedom (\(>0\); non-integer values and Inf are allowed).
  • lower.tail if TRUE (default), probabilities are \(\Pr(X \leq x)\) else \(\Pr(X > x)\)
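To see how `lower.tail` and the two functions fit together, here is a small sketch with arbitrary illustrative values (`q = 1.5`, `df = 12`): the two tail areas sum to 1, and `qt()` inverts `pt()` as long as the same `lower.tail` setting is used.

```r
q  <- 1.5   # arbitrary illustrative t-value
df <- 12    # arbitrary illustrative degrees of freedom

upper <- pt(q, df = df, lower.tail = FALSE)   # Pr(T > q), as in the t-table
lower <- pt(q, df = df)                       # Pr(T <= q), the default

total     <- upper + lower                           # tail areas sum to 1
q_from_up <- qt(upper, df = df, lower.tail = FALSE)  # recovers q from the right tail
q_from_lo <- qt(lower, df = df)                      # recovers q from the left tail

c(total = total, q_from_up = q_from_up, q_from_lo = q_from_lo)
```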

Example 1: with tables and R

Examples:

\[ \begin{align} \Pr(t_8 > 1.397) &= 0.10 \end{align} \]

# returns probs
pt(1.397, df = 8, 
lower.tail = FALSE)
[1] 0.09997325
# returns t-values
qt(0.100, df = 8, 
lower.tail = FALSE)
[1] 1.396815

Example 2

Approximate from tables:

\[ \Pr(t_5 > 4.5) < 0.005 \]

Exactly in R:

# returns probs
pt(4.5, df = 5, 
lower.tail = FALSE)
[1] 0.003199768

Example 3

Approximate from tables:

\(0.025 < \Pr(t_{10} > 2) < 0.050\)

Exact in R:

# returns probs
pt(2, df = 10, 
lower.tail = FALSE)
[1] 0.03669402
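The table look-ups in Examples 2 and 3 can be mimicked in code. The helper below, `bracket_tail_prob()`, is a hypothetical function written for this illustration (it is not part of R); it assumes a table with the five right-tail probabilities used in these slides and returns the tightest range the table can give.

```r
table_probs <- c(0.10, 0.05, 0.025, 0.01, 0.005)   # assumed table columns

# Hypothetical helper: bracket Pr(T > t_value) using only table entries
bracket_tail_prob <- function(t_value, df) {
  crit <- qt(table_probs, df = df, lower.tail = FALSE)  # the table row for this df
  below <- table_probs[crit > t_value]    # entries to the right of t_value: smaller areas
  above <- table_probs[crit <= t_value]   # entries to the left of t_value: larger areas
  c(lower = if (length(below) > 0) max(below) else 0,
    upper = if (length(above) > 0) min(above) else 1)
}

ex2 <- bracket_tail_prob(4.5, df = 5)    # Example 2: lower = 0,     upper = 0.005
ex3 <- bracket_tail_prob(2,   df = 10)   # Example 3: lower = 0.025, upper = 0.05
ex2
ex3
```

A lower bound of 0 or an upper bound of 1 means the \(t\)-value falls off the end of the table row, so only a one-sided statement is possible.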

iClicker: Approximating t-probabilities

iClicker

Exercise 2 Using the t-table approximate the following probability

\[\Pr(t_7 > 1)\]

  1. \(\Pr(t_7 > 1) > 0.10\)
  2. \(0.05 < \Pr(t_7 > 1) < 0.10\)
  3. \(0.025 < \Pr(t_7 > 1) < 0.05\)
  4. \(0.01 < \Pr(t_7 > 1) < 0.025\)
  5. \(0.005 < \Pr(t_7 > 1) < 0.01\)
  6. None of the above

CI for \(\mu\) (\(\sigma\) unknown)

(1-\(\alpha\))100% Confidence Interval (unknown \(\sigma\))

Under the aforementioned conditions, a \(100(1-\alpha)\)% confidence interval for \(\mu\) is

\[ \begin{align} \text{point estimate} &\pm \textcolor{red}{\boxed{\text{Margin of Error}}}\\ \bar{x} \ &\pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}} \end{align} \]

where \(t_{\nu, \alpha/2}\) is the value such that \(\Pr(-t_{\nu, \alpha/2} < t_{\nu} < t_{\nu, \alpha/2}) = 1 - \alpha\), \(\nu = n-1\), and \(n\) is the sample size. Alternatively, we could express our CI as:

\[ \left(\bar{x} - \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}}, \bar{x} + \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times \frac{s}{\sqrt{n}}}}\right) \]
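The formula translates almost line for line into R. The function name `t_ci()` and its arguments are hypothetical, introduced here only as a sketch for working from summary statistics; it is not a built-in R function.

```r
# Hypothetical helper: t-based CI for the mean from summary statistics
t_ci <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)   # t_{nu, alpha/2} with nu = n - 1
  me     <- t_crit * s / sqrt(n)            # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

ci <- t_ci(xbar = 3.2, s = 1.74, n = 50)    # summary statistics from Exercise 1
ci
```

For the Exercise 1 summary statistics this gives an interval of roughly 2.705 to 3.695.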

Returning to our example…

Example: Exclusive Relationships

A random sample of 50 college students were asked how many exclusive relationships they have been in so far. This sample yielded a mean of 3.2 and a standard deviation of 1.74. Construct a 95% confidence interval for the true mean number of exclusive relationships among college students.

\[ \bar{x} = 3.2,\quad s = 1.74 \quad \text{and }n>30 \]

The approximate 95% confidence interval is defined as:

\[ \text{point estimate} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \cdot \text{SE}}} \]

\[ \begin{align} \text{SE} = \frac{s}{\sqrt{n}} = \frac{1.74}{\sqrt{50}} = 0.2460732 \end{align} \]

\(t_{\nu, \alpha/2}\) = qt(0.975, 50-1) = 2.0095752 (equivalently, -qt(0.025, 50-1), since qt(0.025, 50-1) returns the lower quantile -2.0095752)

\[ \begin{align} \bar{x} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2} \times SE}} & \rightarrow 3.2 \pm \textcolor{red}{\boxed{2.01 \times 0.246}} \\ & \rightarrow 3.2 \pm \textcolor{red}{\boxed{0.495}} \end{align} \]

We could stop there or express our answer as an interval

3.2 + c(1,-1)*qt(0.025, 50-1)*1.74/sqrt(50)
[1] 2.705497 3.694503

\(t_{\nu, \alpha/2}\) using the tables

Since \(\nu = 49\) is not in our table, we round down to the nearest df that is listed (in this case 38).

We need to find the \(t_{\nu, \alpha/2}\) that satisfies:

\[ \begin{align} \Pr(t_{38} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 2.024 \end{align} \]

Important

Always round down to the nearest available df in the table to ensure the confidence level is at least as high as desired. Rounding down gives a larger \(t_{\nu, \alpha/2}\) than needed and therefore produces a more conservative confidence interval.

\(t_{\nu, \alpha/2}\) Comparison

Using the tables with conservative \(\nu\) = 38:

\[ \begin{align} \Pr(t_{38} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 2.024\\ [2.7019&,3.6981]\\ \textcolor{green}{\text{Width}} &= \textcolor{green}{0.9962} \end{align} \]

Using exact R calculations (\(\nu = 49\)) \[ \begin{align} t_{\nu, \alpha/2} &= 2.0095752 \\ [2.7055&,3.6945]\\ \textcolor{purple}{\text{Width}} &= \textcolor{purple}{0.989} \end{align} \]

\(t_{\nu, \alpha/2}\) Comparison

If we incorrectly round up (to \(\infty\)) \[ \begin{align} \Pr(t_{\infty} > t_{\nu, \alpha/2}) &= 0.025\\ \implies t_{\nu, \alpha/2} &\approx 1.96\\ [2.7177&,3.6823]\\ \textcolor{red}{\text{Width}} &= \textcolor{red}{0.9646} \end{align} \]

Using R with the \(\nu = 50-1 = 49\)

\[ \begin{align} t_{\nu, \alpha/2} &= 2.0095752 \\ [2.7055&,3.6945]\\ \textcolor{purple}{\text{Width}} &= \textcolor{purple}{0.989} \end{align} \]
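The three choices of df can be lined up side by side in R. This is a sketch using the summary statistics of the example (\(s = 1.74\), \(n = 50\)); the labels in `choice` are just descriptive names added for the illustration.

```r
se      <- 1.74 / sqrt(50)                 # standard error from the example
choice  <- c("rounded down", "exact", "rounded up")
df_used <- c(38, 49, Inf)

t_crit <- qt(0.975, df = df_used)          # critical value for each df
widths <- 2 * t_crit * se                  # CI width = 2 x margin of error

data.frame(choice = choice, df = df_used,
           t_crit = round(t_crit, 3), width = round(widths, 4))
```

Rounding down widens the interval slightly (conservative), while rounding up to \(\infty\) makes it narrower than the stated confidence level justifies.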

iClicker

Interpretation of Confidence intervals

Which of the following is the correct interpretation of this confidence interval? We are 95% confident that…

  1. the average number of exclusive relationships college students in this sample have been in is between 2.7 and 3.7.
  2. college students on average have been in between 2.7 and 3.7 exclusive relationships.
  3. a randomly chosen college student has been in 2.7 to 3.7 exclusive relationships.
  4. 95% of college students have been in 2.7 to 3.7 exclusive relationships.
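The phrase "95% confident" refers to the long-run behaviour of the procedure, which a simulation can illustrate. In the sketch below the population mean, standard deviation, sample size, seed, and number of replications are all assumed values chosen for illustration; it repeatedly draws samples, builds a 95% \(t\)-interval from each, and records how often the interval captures the true mean.

```r
set.seed(2)     # arbitrary seed, for reproducibility
mu    <- 3      # assumed true population mean
sigma <- 1.5    # assumed true population sd
n     <- 10     # assumed sample size
reps  <- 5000   # number of simulated samples

# TRUE when the 95% t-interval from a simulated sample contains mu
covered <- replicate(reps, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  me <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  (mean(x) - me < mu) && (mu < mean(x) + me)
})

coverage <- mean(covered)   # proportion of intervals that captured mu
coverage
```

The proportion should land near 0.95, up to simulation noise.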

Example: fasting glucose

The following data represent fasting blood glucose level (mg/100 mL) after a 12-hour fast for a random sample of 20 women. Calculate the appropriate 98% CI.

data = c(85, 109, 66, 67, 71, 84, 63, 73, 78, 83, 80, 73, 77, 59,
  89, 80, 76, 83, 87, 64)

Use the clipboard to copy the data into R:

  • \(n\) = length(data) = 20
  • \(\bar{x}\) = mean(data) = 77.35
  • \(s\) = sd(data) = 11.3521387

Code
library(car)      # provides qqPlot()
library(vioplot)  # provides vioplot(); install once with install.packages("vioplot")

# 2 x 2 grid of plots with tighter margins
par(mfrow = c(2, 2), mar = c(2, 4, 2, 0) + 0.1)

boxplot(data, main = "Boxplot")
hist(data, main = "Histogram", col = "skyblue", border = "white")
# base-R alternative: qqnorm(data); qqline(data)
this = qqPlot(data, main = "QQ-plot")  # assignment suppresses the printed indices
vioplot(data, main = "Violin plot", col = "skyblue", horizontal = FALSE)

Plots for assessing normality. This is necessary to determine whether using a t-distribution for constructing the confidence interval is appropriate.

Solution

The sample size is 20, which yields \(n - 1 = 19\) degrees of freedom. For a 98% CI, \(\alpha/2 = 0.01\); from the table or software, we find that \(t_{19, 0.01} = 2.539\).

The estimate of the standard error is \[ \hat \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{11.3521387}{\sqrt{20}} \approx 2.5384 \]

Solution (cont’d)

Putting these values into the confidence interval formula:

\[ \begin{align} &\bar{x} \pm \textcolor{red}{\boxed{t_{\nu, \alpha/2}\cdot SE(\bar{X})}} \\ &= 77.35 \pm \textcolor{red}{\boxed{2.539 \times 2.5384}} \\ &= 77.35 \pm \textcolor{red}{\boxed{6.445}} \end{align} \]

We are 98% confident that the true mean fasting blood glucose level of women (after a 12-hour fast) lies between 70.905 and 83.795 mg/100 mL.
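As a cross-check, the interval can be computed directly from the raw observations. The sketch below re-enters the 20 glucose values under the name `glucose` (a name chosen here to avoid clashing with R's `data()` function) and compares the by-hand formula with the interval reported by R's built-in `t.test()`.

```r
glucose <- c(85, 109, 66, 67, 71, 84, 63, 73, 78, 83, 80, 73, 77, 59,
             89, 80, 76, 83, 87, 64)

n      <- length(glucose)
t_crit <- qt(0.99, df = n - 1)   # 98% CI, so alpha/2 = 0.01 in each tail

# by-hand interval: x-bar -/+ t * s / sqrt(n)
manual <- mean(glucose) + c(-1, 1) * t_crit * sd(glucose) / sqrt(n)

# built-in interval from the one-sample t procedure
builtin <- t.test(glucose, conf.level = 0.98)$conf.int

manual
builtin
```

The two intervals should agree, since `t.test()` uses the same \(t_{n-1}\) critical value and standard error; small differences from a table-based answer come from rounding the critical value to three decimals.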