STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
Many real-world comparisons involve proportions, not just means: for example, comparing the proportion of college students who use online learning resources with the corresponding proportion of high school students (the running example in this lecture).
By the end of this lecture you should be able to construct a confidence interval for the difference of two population proportions, carry out a two-proportion hypothesis test by hand and with prop.test() in R, and state the assumptions these procedures require.
These inferential procedures rely on knowledge of the sampling distribution of \(\hat p_1 - \hat p_2\).
Recall that for one population and a large sample size we have the following approximation:
\[ \hat p \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right) \]
Let \(\hat p_1\) and \(\hat p_2\) represent the sample proportions from two independent samples (of size \(n_1\) and \(n_2\), respectively). Then for large sample sizes, we have the following approximation
\[ \hat p_1 - \hat p_2 \sim N\left(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right) \]
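This approximation can be checked by simulation. A minimal sketch (the true proportions and sample sizes below are hypothetical, chosen to mirror the example later in this lecture):

set.seed(205)
p1 <- 0.60; p2 <- 0.45   # hypothetical true proportions
n1 <- 200;  n2 <- 200    # hypothetical sample sizes
# Simulate 10,000 values of p1_hat - p2_hat
diffs <- rbinom(10000, n1, p1) / n1 - rbinom(10000, n2, p2) / n2
# Empirical mean and SD vs. the theoretical values from the formula above
c(mean(diffs), sd(diffs))
c(p1 - p2, sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2))

The simulated mean and standard deviation land very close to \(p_1 - p_2\) and the square-root term above, and a histogram of diffs is approximately bell-shaped.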
Assumptions for Two-Proportion Inference
Independence: Samples must be random and independent.
Success-Failure Condition: Each group should have at least 10 successes and 10 failures:
\[ n_1 \hat{p}_1 \geq 10, \quad n_1 (1 - \hat{p}_1) \geq 10 \]
\[ n_2 \hat{p}_2 \geq 10, \quad n_2 (1 - \hat{p}_2) \geq 10 \]
When these conditions hold, we can use the normal approximation.
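As a quick illustration, the condition can be checked in R. The counts below come from the online-learning example used later in this lecture (120 of 200 college students and 90 of 200 high school students):

x1 <- 120; n1 <- 200   # college students using resources
x2 <- 90;  n2 <- 200   # high school students using resources
p1_hat <- x1 / n1
p2_hat <- x2 / n2
# All four counts must be at least 10
c(n1 * p1_hat, n1 * (1 - p1_hat), n2 * p2_hat, n2 * (1 - p2_hat)) >= 10

Here the four counts are 120, 80, 90, and 110, so the success-failure condition comfortably holds.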
If the conditions outlined previously are met, a 100(1 - \(\alpha\))% CI for \(p_1 - p_2\) is given by:
\[ \begin{align} \hat p_1 - \hat p_2 \pm z_{\alpha/2} SE(\hat p_1 - \hat p_2) \end{align} \]
\[ \begin{align} SE(\hat p_1 - \hat p_2) &= \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\\ &\approx \sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}} \end{align} \]
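As a worked sketch, a two-sided 95% CI for the online-learning example later in this lecture can be computed directly from this formula:

x1 <- 120; n1 <- 200; x2 <- 90; n2 <- 200
p1_hat <- x1 / n1; p2_hat <- x2 / n2
# Unpooled standard error
SE <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z  <- qnorm(0.975)   # z_{alpha/2} for a 95% CI, about 1.96
(p1_hat - p2_hat) + c(-1, 1) * z * SE   # roughly (0.053, 0.247)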
Null hypothesis (\(H_0\)): No difference in population proportions. \[ H_0: p_1 = p_2 \]
Alternative hypothesis (\(H_A\)): depending on the research question, \(H_A: p_1 \neq p_2\) (two-sided), \(H_A: p_1 > p_2\), or \(H_A: p_1 < p_2\) (one-sided).
Under \(H_0\), the test statistic approximately follows a standard normal distribution:
\[ Z = \frac{(\hat p_1 - \hat p_2) -0}{SE_0(\hat p_1 - \hat p_2)} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} \]
where \(\hat{p}\) is the pooled proportion:
\[ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \]
We introduce new notation here, \(SE_0\), to indicate that this is the standard error under the null hypothesis, i.e. assuming \(H_0: p_1 = p_2\) is true.
Since we assume that the two populations have the same proportion, it makes sense to combine (or “pool”) the data from both samples to get a better estimate of the common proportion \(p\).
Note
The standard error we are using for hypothesis testing is different than the standard error we are using for constructing confidence intervals.
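To make the distinction concrete, here is a sketch computing both standard errors for the online-learning example below; they are close for these data but not equal:

x1 <- 120; n1 <- 200; x2 <- 90; n2 <- 200
p1_hat <- x1 / n1; p2_hat <- x2 / n2
# Unpooled SE (used for confidence intervals)
SE <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
# Pooled SE (used for the hypothesis test, which assumes p1 = p2)
p_pool <- (x1 + x2) / (n1 + n2)
SE0 <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
c(SE = SE, SE0 = SE0)   # about 0.0494 vs 0.0499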
Online Learning Resources
Are college students more likely to use online learning resources than high school students?
iClicker: Standard Error for Confidence Interval
Exercise 1 What is the standard error used to compute the 95% confidence interval for the difference in proportions?
iClicker: Standard Error for Hypothesis Test
Exercise 2 What is the standard error used to compute the test statistic for the hypothesis test?
Conclusion: There is strong evidence that college students are more likely to use online learning resources than high school students.
prop.test() is an R function used to compare two proportions. Its main arguments are:

x: Vector of success counts for each group.
n: Vector of sample sizes for each group.
alternative: "two.sided", "greater", or "less".
correct = FALSE: Disables Yates' continuity correction (optional).
# Given data
x1 <- 120 # College students using resources
n1 <- 200 # Total college students
x2 <- 90 # High school students using resources
n2 <- 200 # Total high school students
# Sample proportions
p1_hat <- x1 / n1
p2_hat <- x2 / n2
# Two-proportion Z-test in R (built-in function)
prop.test(c(x1, x2), c(n1, n2), alternative = "greater", correct = FALSE)
2-sample test for equality of proportions without continuity correction
data: c(x1, x2) out of c(n1, n2)
X-squared = 9.0226, df = 1, p-value = 0.001333
alternative hypothesis: greater
95 percent confidence interval:
0.06879186 1.00000000
sample estimates:
prop 1 prop 2
0.60 0.45
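prop.test() returns a standard htest object, so individual components of this output can also be extracted by name:

res <- prop.test(c(x1, x2), c(n1, n2), alternative = "greater", correct = FALSE)
res$p.value    # 0.001333
res$conf.int   # one-sided 95% CI: (0.0688, 1)
res$estimate   # the two sample proportions, 0.60 and 0.45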
Notice that the one-sided confidence interval produced by the R output is bounded above by 1 rather than by infinity, since a difference of two proportions can never exceed 1.
A one-sided confidence interval for the difference in proportions (\(p_1 - p_2\)) provides a lower or upper bound rather than a range.
Recall that a one-sided hypothesis test does NOT always match a two-sided CI.
For a one-sided \((1 - \alpha) \times 100\%\) confidence interval, we replace \(z_{\alpha/2}\) in the two-sided formula \[(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \times SE(\hat{p}_1 - \hat{p}_2)\] with \(z_{\alpha}\).

For a lower bound (e.g., we want an interval like \((L, 1)\)):
\[ L = (\hat{p}_1 - \hat{p}_2) - z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2) \]
For an upper bound (e.g., we want an interval like \((-1, U)\)):
\[ U = (\hat{p}_1 - \hat{p}_2) + z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2) \]
# Standard error (unpooled, for confidence intervals)
SE_unpooled <- sqrt((p1_hat * (1 - p1_hat) / n1) + (p2_hat * (1 - p2_hat) / n2))
# One-sided critical value for a 95% CI
z_onesided <- qnorm(0.95) # 1.645
# Margin of error
ME <- z_onesided * SE_unpooled
# Upper bound U (interval of the form (-1, U))
CI_upper <- (p1_hat - p2_hat) + ME
# Lower bound L (interval of the form (L, 1))
CI_lower <- (p1_hat - p2_hat) - ME
cat("95% One-Sided Confidence Interval:\n")
cat("Lower Bound:", round(CI_lower, 3), "\n")
95% One-Sided Confidence Interval:
Lower Bound: 0.069
For our one-sided CI, \(z_{\alpha} = z_{0.05} = 1.6448536\). To get our lower bound:
\[ \begin{align} L &= (\hat{p}_1 - \hat{p}_2) - z_{\alpha} \times SE(\hat{p}_1 - \hat{p}_2)\\ &= (0.6 - 0.45) - 1.645 \times \sqrt{\frac{0.6(1 - 0.6)}{200} + \frac{0.45(1 - 0.45)}{200}}\\ &= 0.15 - 0.081\\ &= 0.069 \end{align} \]
Comments

The X-squared test statistic is reported because the two-sample test for proportions is mathematically equivalent to a chi-square test for two independent groups, i.e.

\[ \texttt{X-squared} = Z^2 \]

Even though R reports X-squared, the results are the same as those of the two-proportion z-test: the p-value computed from \(\chi^2\) is identical to the one obtained from the corresponding one-tailed or two-tailed z-test.
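We can verify this equivalence numerically for our data; the hand-computed pooled z statistic squares to the X-squared value reported by prop.test():

p_pool <- (x1 + x2) / (n1 + n2)   # 210 / 400 = 0.525
SE0 <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
Z <- (p1_hat - p2_hat) / SE0      # about 3.004
Z^2                               # 9.0226, matching X-squared
pnorm(Z, lower.tail = FALSE)      # 0.001333, matching the reported p-value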