Inference for Two Proportions

STAT 205: Introduction to Mathematical Statistics

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

Many real-world comparisons involve proportions, not just means. Examples:

  • Do men and women differ in the proportion who vote in elections?
  • Is a new COVID-19 vaccine more effective than an existing one?
  • Are urban and rural drivers equally likely to wear seat belts?

To answer these, we use statistical inference for two proportions.

Comparing Two Proportions

  • In these cases we analyze the difference between two population proportions (\(p_1, p_2\)).
  • Common summary statistics:
    • Sample proportions: \(\hat{p}_1, \hat{p}_2\)
    • Sample sizes: \(n_1, n_2\)

Inferential Methods

By the end of this lecture you should be able to

  • construct confidence intervals for estimating the difference \(p_1 - p_2\).
  • perform formal hypothesis tests for checking if
    • \(H_0: p_1 = p_2\) (or equivalently \(H_0: p_1 - p_2 = 0\))

Normal Approximation

  • These inferential procedures rely on knowledge of the sampling distribution of \(\hat p_1 - \hat p_2\).

  • Recall that for a single population and a large sample size, we have the following approximation:

    \[ \hat p \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right) \]

Sampling Distribution

Let \(\hat p_1\) and \(\hat p_2\) represent the sample proportions from two independent samples (of size \(n_1\) and \(n_2\), respectively). Then for large sample sizes, we have the following approximation

\[ \hat p_1 - \hat p_2 \sim N\left(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right) \]
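As a quick sanity check (not from the lecture), this approximation can be verified by simulation; the values of `p1`, `p2`, `n1`, and `n2` below are arbitrary illustrative choices:

```r
# Simulate the sampling distribution of p1_hat - p2_hat
set.seed(205)
p1 <- 0.60; p2 <- 0.45   # hypothetical population proportions
n1 <- 200;  n2 <- 200    # sample sizes
diffs <- rbinom(10000, n1, p1) / n1 - rbinom(10000, n2, p2) / n2

# Empirical SD of the simulated differences vs. the theoretical SE
sd(diffs)
sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # approximately 0.049
```

The two values should agree closely, supporting the normal approximation.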

Assumptions for Two-Proportion Inference

  • Independence: Samples must be random and independent.

  • Success-Failure Condition: Each group should have at least 10 successes and 10 failures:

    \[ n_1 \hat{p}_1 \geq 10, \quad n_1 (1 - \hat{p}_1) \geq 10 \]

    \[ n_2 \hat{p}_2 \geq 10, \quad n_2 (1 - \hat{p}_2) \geq 10 \]

  • When these conditions hold, we can use the normal approximation.
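The success-failure condition is easy to script; a small sketch (the counts below are hypothetical):

```r
# Success-failure check for two independent samples (hypothetical counts)
x <- c(120, 90)   # observed successes in each group
n <- c(200, 200)  # sample sizes
successes <- x        # n * p_hat
failures  <- n - x    # n * (1 - p_hat)
all(successes >= 10, failures >= 10)  # TRUE when the condition holds
```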

Confidence Interval

If the conditions outlined previously are met, a \(100(1-\alpha)\%\) CI for \(p_1 - p_2\) is given by:

\[ \begin{align} \hat p_1 - \hat p_2 \pm z_{\alpha/2} SE(\hat p_1 - \hat p_2) \end{align} \]

\[ \begin{align} SE(\hat p_1 - \hat p_2) &= \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\\ &\approx \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}} \end{align} \]

Hypothesis Test

  • Null hypothesis (\(H_0\)): No difference in population proportions. \[ H_0: p_1 = p_2 \]

  • Alternative hypothesis (\(H_A\)):

    • Two-tailed: \(H_A: p_1 \neq p_2\) (any difference)
    • One-tailed: \(H_A: p_1 > p_2\) or \(H_A: p_1 < p_2\)

Test Statistic

  • The test statistic follows a normal distribution:

    \[ Z = \frac{(\hat p_1 - \hat p_2) -0}{SE_0(\hat p_1 - \hat p_2)} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} \]

    where \(\hat{p}\) is the pooled proportion:

    \[ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \]

Comment

  • We introduce new notation here, \(SE_0\), to indicate that this is the standard error under the null hypothesis, i.e., assuming \(H_0: p_1 = p_2\) is true.

  • Since we assume that the two populations have the same proportion, it makes sense to combine (or “pool”) the data from both samples to get a better estimate of the common proportion \(p\).

Note

The standard error used for hypothesis testing is different from the standard error used for constructing confidence intervals.

Online Learning Resources

Are college students more likely to use online learning resources than high school students?

  • Sample Data:
    • College students: 120 out of 200 use online resources (\(\hat{p}_1 = 0.60\)).
    • High school students: 90 out of 200 use online resources (\(\hat{p}_2 = 0.45\)).

iClicker: Standard Error for Confidence Interval

Exercise 1 What is the standard error used to compute the 95% confidence interval for the difference in proportions?

  1. \(\sqrt{\frac{0.525(1 - 0.525)}{200} + \frac{0.525(1 - 0.525)}{200}}\)
  2. \(\sqrt{\frac{0.60(1 - 0.60)}{200} + \frac{0.45(1 - 0.45)}{200}}\)
  3. \(\frac{(0.60 - 0.45)}{\sqrt{\frac{0.60(1 - 0.60)}{200} + \frac{0.45(1 - 0.45)}{200}}}\)
  4. None of the above

Confidence Interval
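As a sketch, the 95% two-sided CI for this example can be computed by hand in R using the unpooled standard error:

```r
# 95% CI for p1 - p2 (online learning example)
x1 <- 120; n1 <- 200  # college students
x2 <- 90;  n2 <- 200  # high school students
p1_hat <- x1 / n1     # 0.60
p2_hat <- x2 / n2     # 0.45

# Unpooled standard error (the CI version)
SE <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# Two-sided 95% CI
CI <- (p1_hat - p2_hat) + c(-1, 1) * qnorm(0.975) * SE
round(CI, 3)  # approximately (0.053, 0.247)
```

Since 0 is not inside the interval, the data are consistent with a real difference between the two groups.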

iClicker: Standard Error for Hypothesis Test

Exercise 2 What is the standard error used to compute the test statistic for the hypothesis test?

  1. \(\sqrt{\frac{0.60(1 - 0.60)}{200} + \frac{0.45(1 - 0.45)}{200}}\)
  2. \(\sqrt{\frac{0.60(1 - 0.60)}{n_1} + \frac{0.45(1 - 0.45)}{n_2}}\)
  3. \(\sqrt{\frac{0.525(1 - 0.525)}{200} + \frac{0.525(1 - 0.525)}{200}}\)
  4. None of the above

Hypothesis Test
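A sketch of the test statistic calculation for this example, using the pooled standard error:

```r
# Pooled test statistic for H_0: p1 = p2 (online learning example)
x1 <- 120; n1 <- 200
x2 <- 90;  n2 <- 200

p_pool <- (x1 + x2) / (n1 + n2)                     # 0.525
SE0 <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # pooled SE
Z <- (x1/n1 - x2/n2) / SE0
round(Z, 3)  # approximately 3.004
```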

p-value
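Continuing the example, the one-sided p-value for \(H_A: p_1 > p_2\) is the upper-tail normal probability of the test statistic:

```r
# One-sided p-value for H_A: p1 > p2 (online learning example)
p_pool <- (120 + 90) / (200 + 200)
SE0 <- sqrt(p_pool * (1 - p_pool) * (1/200 + 1/200))
Z <- (0.60 - 0.45) / SE0
p_value <- pnorm(Z, lower.tail = FALSE)  # upper tail
round(p_value, 5)  # approximately 0.00133
```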

Conclusion

  • p-value = 0.00133 (very small).
  • Since the p-value \(< 0.05\), we reject \(H_0\).

Conclusion: There is strong evidence that college students are more likely to use online learning resources than high school students.

In R

  • prop.test() is an R function used to compare two proportions.
  • It performs a two-sample z-test for proportions.
  • Can be used for:
    • Hypothesis testing (\(H_0: p_1 = p_2\)).
    • Confidence intervals for the difference in proportions.

Syntax of prop.test()

prop.test(x, n, alternative = "two.sided", correct = FALSE)
  • x: Vector of success counts for each group.

  • n: Vector of sample sizes for each group.

  • alternative: "two.sided", "greater", or "less".

  • correct = FALSE: Disables Yates’ continuity correction (optional).

Redo in R

Code
# Given data
x1 <- 120  # College students using resources
n1 <- 200  # Total college students
x2 <- 90   # High school students using resources
n2 <- 200  # Total high school students

# Sample proportions
p1_hat <- x1 / n1
p2_hat <- x2 / n2

# Two-proportion Z-test in R (built-in function)
prop.test(c(x1, x2), c(n1, n2), alternative = "greater", correct = FALSE)

    2-sample test for equality of proportions without continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 9.0226, df = 1, p-value = 0.001333
alternative hypothesis: greater
95 percent confidence interval:
 0.06879186 1.00000000
sample estimates:
prop 1 prop 2 
  0.60   0.45 

Comments

  • The X-squared test statistic is reported because the two-sample test for proportions is mathematically equivalent to a Chi-Square test for two independent groups, i.e.,

    \[ \texttt{X-squared} = Z^2 \]

  • Even though R reports X-squared, the results are the same as the two-proportion z-test.

  • The p-value computed from \(\chi^2\) is identical to the one obtained from a two-tailed z-test; for one-sided alternatives, R converts the statistic back to \(Z = \pm\sqrt{\texttt{X-squared}}\) and uses the appropriate tail.
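This equivalence can be checked numerically with the example data:

```r
# Verify that X-squared equals Z^2 for the example
p_pool <- (120 + 90) / 400
SE0 <- sqrt(p_pool * (1 - p_pool) * (1/200 + 1/200))
Z <- (0.60 - 0.45) / SE0
round(Z^2, 4)  # approximately 9.0226, the X-squared value reported by prop.test
```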

Comments (cont’d)

  • Notice that the one-sided confidence interval produced by the R output is bounded by 1 instead of infinity, since a difference of two proportions can never exceed 1.

  • A one-sided confidence interval for the difference in proportions (\(p_1 - p_2\)) provides a lower or upper bound rather than a range.

  • Recall that one-sided hypothesis tests do NOT always match a two-sided CI.

One-Sided Confidence Interval

For a one-sided \((1 - \alpha) \times 100\%\) confidence interval, we replace \(z_{\alpha/2}\) in the two-sided formula

\[(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \times SE(\hat{p}_1 - \hat{p}_2)\]

with \(z_{\alpha}\) and keep only one bound. For a lower bound (an interval of the form \((L, 1)\)):

\[ L = (\hat{p}_1 - \hat{p}_2) - z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2) \]

For an upper bound (an interval of the form \((-1, U)\)):

\[ U = (\hat{p}_1 - \hat{p}_2) + z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2) \]

One-sided CI by hand

Code
# Standard error (unpooled for CI)
SE_unpooled <- sqrt((p1_hat * (1 - p1_hat) / n1) + (p2_hat * (1 - p2_hat) / n2))

# One-sided z* value for a 95% CI
z_onesided <- qnorm(0.95)  # 1.645
ME <- z_onesided * SE_unpooled  # margin of error

# Lower bound (interval of the form (L, 1))
CI_lower <- (p1_hat - p2_hat) - ME

# Upper bound (interval of the form (-1, U))
CI_upper <- (p1_hat - p2_hat) + ME

cat("Lower Bound:", round(CI_lower, 3), "\n")
Lower Bound: 0.069 
Code
cat("Upper Bound:", round(CI_upper, 3), "\n")
Upper Bound: 0.231 

For our one-sided CI, \(z_{\alpha} = z_{0.05} = 1.645\). To get our lower bound:

\[ \begin{align} L &= (\hat{p}_1 - \hat{p}_2) - z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2)\\ &= (0.6 - 0.45) - 1.645 \times \sqrt{\frac{0.6(1 - 0.6)}{200} + \frac{0.45(1 - 0.45)}{200}}\\ &= 0.15 - 0.081\\ &= 0.069 \end{align} \]