STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
April 8, 2024
Contingency Table Analysis is a statistical tool used to analyze the relationship between two or more categorical variables.
It is often used to understand if the variables are independent of each other or if there is a significant association between them.
This analysis is fundamental in various fields such as epidemiology, marketing, sociology, and psychology.
A contingency table presents the frequency distribution of variables in a matrix format. The rows represent the categories of one variable, while the columns represent the categories of another.
| | Level 1 | Level 2 | Total |
|---|---|---|---|
| Group A | \(A\) | \(B\) | \(A + B\) |
| Group B | \(C\) | \(D\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | \(A\) | \(B\) | \(A + B\) |
| Group 2 | \(C\) | \(D\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
Let \(G_1\) denote the event of belonging to ‘Group 1.’
Let \(S\) denote the event ‘Success.’
Recall that if two events are independent, then their intersection is the product of their respective probabilities.
\[\begin{align} \Pr(A \cap B) &= \Pr(A) \Pr(B) \end{align}\]

In our case, if \(G_1\) and \(S\) are independent, we would expect:

\[\begin{align} \Pr(G_1 \cap S) &= \dfrac{A+B}{A+B+C+D} \times \dfrac{A+C}{A+B+C+D} \\ &= \dfrac{(A+B)(A+C)}{(A+B+C+D)^2} \end{align}\]

Multiplying this probability by the total sample size \(T = A + B + C + D\) gives the expected count for the (Group 1, Success) cell, namely \(\frac{(A+B)(A+C)}{T}\).
We can calculate the expected counts in each cell:
| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | \(\frac{(A+B)(A+C)}{T}\) | \(\frac{(A+B)(B+D)}{T}\) | \(A + B\) |
| Group 2 | \(\frac{(C+D)(A+C)}{T}\) | \(\frac{(C+D)(B+D)}{T}\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
Expected cell count
The expected count for each cell under the null hypothesis is:
\[ \dfrac{(\text{row total})(\text{column total})}{(\text{total sample size})} \]
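In R, the whole table of expected counts can be computed at once from the row and column totals using an outer product. A minimal sketch, using a hypothetical 2 \(\times\) 2 table of observed counts `obs`:

# Hypothetical 2 x 2 table of observed counts
obs <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)
# Expected counts under independence:
# (row total * column total) / (total sample size)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected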
Example 1: Political Affiliation and Opinion
A random sample of 500 U.S. adults is questioned regarding their political affiliation and opinion on a tax reform bill. The results of this survey are summarized in the following contingency table:
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
We want to determine if an association (relationship) exists between Political Party Affiliation and Opinion on Tax Reform Bill. That is, are the two variables dependent?
What is the size of the table? 2 \(\times\) 3
Independence: Variables are independent if the distribution of one variable is the same for all categories of the other.
Association/Dependence: If the variables are not independent (i.e., dependent), then there is an association between them.
In Example 1: If political affiliation is independent of opinion on a tax reform bill, this implies that the proportion of Democrats and Republicans who favor, are indifferent to, or oppose the bill should be roughly the same, reflecting no preference trend influenced by their political group.
Independence is tested using a Chi-square test for Independence.
There are several ways to phrase these hypotheses, including:

Null Hypothesis (\(H_0\)): In the population, the two variables are independent (there is no association between them).

Alternative Hypothesis (\(H_A\)): In the population, the two variables are dependent (there is an association between them).
The Observed Counts Table (or simply Observed Table) represents the observed counts obtained from our sample.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
This table represents the expected counts under the null hypothesis, i.e., the frequency of observations that would be expected if the two variables (political affiliation and opinion) were independent.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | \(\frac{285 \cdot 202}{500} = 115.14\) | \(\frac{285 \cdot 150}{500} = 85.5\) | \(\frac{285 \cdot 148}{500} = 84.36\) | 285 |
| Republican | \(\frac{215 \cdot 202}{500} = 86.86\) | \(\frac{215 \cdot 150}{500} = 64.5\) | \(\frac{215 \cdot 148}{500} = 63.64\) | 215 |
| Total | 202 | 150 | 148 | 500 |
The Observed Proportions: the proportion in each cell, computed as the cell’s observed count divided by its row total.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 0.48 | 0.29 | 0.22 | 285 |
| Republican | 0.30 | 0.31 | 0.39 | 215 |
| Total | 0.404 | 0.300 | 0.296 | 500 |
The Expected Proportions: the proportion in each cell, computed as the cell’s expected count divided by its row total.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 0.404 | 0.300 | 0.296 | 285 |
| Republican | 0.404 | 0.300 | 0.296 | 215 |
| Total | 0.404 | 0.300 | 0.296 | 500 |
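These row proportions are quick to compute in R with `prop.table()`. A minimal sketch using the Example 1 counts (the object name `observed` is ours):

# Observed counts from Example 1
observed <- matrix(c(138, 83, 64, 64, 67, 84), nrow = 2, byrow = TRUE,
    dimnames = list(c("Democrat", "Republican"),
                    c("Favor", "Indifferent", "Opposed")))
# Observed row proportions: each cell divided by its row total
round(prop.table(observed, margin = 1), 2)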
🤔 Question: Is it reasonable to conclude that the differences between the observed and expected counts are merely a result of random chance, or does there exist substantial evidence to question our null hypothesis?
Chi-Square Test Statistic
The Chi-square test statistic measures the difference between observed and expected frequencies. Under the null hypothesis, the test statistic will have (approximately) a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns of the table.
\[ \begin{equation} X^{2} = \sum_{i = 1}^{rc} \dfrac{(O_i - E_i)^2}{E_i} \end{equation} \]
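Once the observed and expected matrices are in hand, this sum is a one-liner in R. A minimal sketch, reusing the hypothetical `obs` and `expected` from the earlier snippet:

# Chi-square statistic: sum of (O - E)^2 / E over all cells
X2 <- sum((obs - expected)^2 / expected)
X2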
Conditions for the chi-square test
There are two conditions that must be checked before performing a chi-square test:

- Independence: each case is independent of the others (e.g., the data come from a random sample) and contributes to only one cell of the table.
- Sample size: each expected cell count must be at least 5.

Failing to check these conditions may affect the test’s error rates.
The Observed Counts Table
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
The Expected Counts Table
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 115.14 | 85.5 | 84.36 | 285 |
| Republican | 86.86 | 64.5 | 63.64 | 215 |
| Total | 202 | 150 | 148 | 500 |
\[ \begin{align} X^{2}_{obs} &= \frac{(138 - 115.14)^2}{115.14} + \frac{(83 - 85.5)^2}{85.5} + \frac{(64 - 84.36)^2}{84.36} + \dots \\ & \quad \dots + \frac{(64 - 86.86)^2}{86.86} + \frac{(67 - 64.5)^2}{64.5} + \frac{(84 - 63.64)^2}{63.64} \\ &= 22.1524686 \end{align} \]
We now compare this to a chi-square distribution with \((r-1)(c-1) = (2-1)(3-1) = 2\) degrees of freedom. At \(\alpha = 0.05\), the critical value is \(\chi^2_{0.05,\,2} = 5.99\).
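In R, the critical value and \(p\)-value come from `qchisq()` and `pchisq()`:

# Critical value for alpha = 0.05 with 2 degrees of freedom
qchisq(0.95, df = 2)
# p-value: area to the right of the observed test statistic
# (approximately 1.5e-05 here)
pchisq(22.1524686, df = 2, lower.tail = FALSE)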
Decision: Since \(\chi^2 = 22.15\) falls in the rejection region (or equivalently, since the \(p\)-value \(< \alpha\)), we reject \(H_0\) in favour of the alternative.
Note
As with our ANOVA tests, there will be functions in R to calculate all the details about this test for us …
The example we just covered was a so-called test for two-way tables or \(\chi^2\) test for independence.
Another variant is the test for one-way tables, or the goodness-of-fit test.
In these tests we evaluate if the observed frequencies significantly deviate from the expected frequencies under a specific hypothesis about the distribution of the variable.
Application: used with one-way tables where there’s a single categorical variable
Goal: To evaluate if the observed frequencies deviate from a hypothesized distribution. \[ \begin{equation} \sum_{i = 1}^{k} \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_{df = k-1} \end{equation} \] A high \(\chi^2\) value indicates a poor fit; the data do not follow the hypothesized distribution.
Application: used with two-way contingency tables having two categorical variables
Goal: To determine if there is a significant association between the two variables. \[ \begin{equation} \sum_{i = 1}^{rc} \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_{df = (r-1)(c-1)} \end{equation} \] A high \(\chi^2\) value indicates an association between the variables.
Example: NHL Hockey Birthdays
In Malcolm Gladwell’s book Outliers, he discusses a pattern regarding the birthdays of professional hockey players. Gladwell claims that an overwhelming majority of elite hockey players have birthdays in the first few months of the year (January, February, and March). This observation supports his argument about the relative age effect (RAE), which suggests that individuals born closer to the beginning of a calendar year have a significant advantage in sports.
Based on the data supplied here, let’s randomly sample some hockey players to see if birth month is related to making it to the NHL.
# Total sample size
n <- 80
# Create vectors for months and players
months <- c("February", "January", "July", "March", "October", "May", "June", "April", "September", "December", "August", "November")
players <- c(824, 885, 704, 833, 624, 783, 699, 796, 638, 552, 607, 563)
# Create a vector repeating each month according to the number of players
month_vector <- unlist(mapply(rep, months, players))
# Draw a random sample of n birth months
set.seed(2024)
hockey_sample <- sample(month_vector, n)
# Convert the month names to numbers: 1 = Jan, ..., 12 = Dec
hockey_sample <- match(hockey_sample, month.name)
# Create bins based on months (quarters of the year)
bins <- cut(hockey_sample, breaks = c(0, 3, 6, 9, 12),
    labels = c("Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"))
hockey_tab <- table(bins)
knitr::kable(t(hockey_tab))
| Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec |
|---|---|---|---|
| 24 | 20 | 19 | 17 |
A one-way table summarizing birthday months for 80 randomly sampled players from the NHL.
We want to investigate if each of the four quarters of the year is equally likely for the birth of a hockey player, which would be the expected scenario if birth month had no influence on becoming a hockey player.
\(H_0\): The distribution of hockey players’ birth months follows a uniform distribution across all four quarters.
\(H_A\): The distribution of hockey players’ birth months is not uniformly distributed across all four quarters.
| | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec |
|---|---|---|---|---|
| Observed | 24 | 20 | 19 | 17 |
| Expected | 20 | 20 | 20 | 20 |
\[ \begin{align} X^2_{obs} &= \frac{(24-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(17-20)^2}{20}\\ &= 1.3 \end{align} \]
Now let’s find the corresponding \(p\)-value on the null distribution (a chi-squared distribution with \(k-1 = 3\) degrees of freedom)
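In R, `pchisq()` gives this upper-tail area directly:

# p-value for X^2 = 1.3 on k - 1 = 3 degrees of freedom
pchisq(1.3, df = 3, lower.tail = FALSE)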
Since the \(p\)-value (0.729) is larger than \(\alpha = 0.05\), we fail to reject the null hypothesis that the distribution of hockey players’ birth months follows a uniform distribution across all four quarters.
As we have seen with other hypothesis tests, rather than computing these tedious formulas “by hand”, we rely on software like R to perform these calculations for us.
The key arguments of `chisq.test()` are:

- `x`: a numeric vector or matrix. `x` and `y` can also both be factors.
- `y`: a numeric vector; ignored if `x` is a matrix. If `x` is a factor, `y` should be a factor of the same length.
- `p`: a vector of probabilities of the same length as `x`.
Alternatively, you could specify a contingency table, …
Returning to Example 1, we could perform the test in R using the following code:
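A minimal sketch, reusing the `observed` matrix of counts defined in the earlier proportions snippet:

# Chi-square test for independence on the 2 x 3 table
chisq.test(observed)

This should reproduce the statistic computed by hand above (\(X^2 = 22.15\) on 2 degrees of freedom).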
Returning to Example 2, we could perform the test in R using the following code:
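A minimal sketch, assuming the `hockey_tab` table from the sampling code above is still in memory:

# Goodness-of-fit test: are the four quarters equally likely?
chisq.test(hockey_tab, p = rep(1/4, 4))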
Example: Type 2 Diabetes
The following contingency table summarizes the results of an experiment evaluating three treatments for Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The three treatments considered were continued treatment with metformin (`met`), treatment with metformin combined with rosiglitazone (`rosi`), or a lifestyle intervention program (`lifestyle`). Each patient had a primary outcome, which was either lacked glycemic control (`failure`) or did not lack that control (`success`). Perform the appropriate hypothesis test.
\(H_0\): There is no difference in the effectiveness of the three treatments.
\(H_A\): There is some difference in effectiveness between the three treatments, e.g. perhaps the rosi treatment performed better than lifestyle.
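A minimal sketch, assuming the data live in a data frame called `diabetes2` with columns `treatment` and `outcome` (a data set by this name is included in the `openintro` package):

# Load the openintro package, which provides the diabetes2 data (assumed)
library(openintro)
# Chi-square test for independence between treatment and outcome
chisq.test(diabetes2$treatment, diabetes2$outcome)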
Pearson's Chi-squared test
data: diabetes2$treatment and diabetes2$outcome
X-squared = 8.1645, df = 2, p-value = 0.01687
Since the \(p\)-value is less than \(\alpha = 0.05\) we reject the null hypothesis in favour of the alternative. That is to say, there is sufficient evidence to suggest that there is some difference in effectiveness between the three treatments.
Comments on the Test Statistic
When the observed counts are close to the expected counts, our test statistic will be small.
Conversely, if there’s a big difference between the observed and expected counts, the test statistic will be large.
Hence a higher test statistic value implies stronger evidence against the null hypothesis (i.e., smaller \(p\)-values).