STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
April 8, 2024
Contingency Table Analysis is a statistical tool used to analyze the relationship between two or more categorical variables.
It is often used to understand if the variables are independent of each other or if there is a significant association between them.
This analysis is fundamental in various fields such as epidemiology, marketing, sociology, and psychology.
A contingency table presents the frequency distribution of variables in a matrix format. The rows represent the categories of one variable, while the columns represent the categories of another.
| | Level 1 | Level 2 | Total |
|---|---|---|---|
| Group A | \(A\) | \(B\) | \(A + B\) |
| Group B | \(C\) | \(D\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | \(A\) | \(B\) | \(A + B\) |
| Group 2 | \(C\) | \(D\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
Let \(G_1\) denote the event of belonging to ‘Group 1.’
Let \(S\) denote the event ‘Success.’
Recall that if two events are independent, then their intersection is the product of their respective probabilities.
\[\begin{align} \Pr(A \cap B) &= \Pr(A) \Pr(B) \end{align}\]

In our case, if \(G_1\) and \(S\) are independent, we would expect:

\[\begin{align} \Pr(G_1 \cap S) &= \dfrac{A+B}{A+B+C+D} \times \dfrac{A+C}{A+B+C+D} \\ &= \dfrac{(A+B)(A+C)}{(A+B+C+D)^2} \end{align}\]

Multiplying this probability by the total sample size \(T = A + B + C + D\) gives the expected count for the (Group 1, Success) cell, namely \(\frac{(A+B)(A+C)}{T}\).
We can calculate the expected counts in each cell:
| | Success | Failure | Total |
|---|---|---|---|
| Group 1 | \(\frac{(A+B)(A+C)}{T}\) | \(\frac{(A+B)(B+D)}{T}\) | \(A + B\) |
| Group 2 | \(\frac{(C+D)(A+C)}{T}\) | \(\frac{(C+D)(B+D)}{T}\) | \(C + D\) |
| Total | \(A + C\) | \(B + D\) | \(T\) |
Expected cell count
The expected count for each cell under the null hypothesis is:
\[ \dfrac{(\text{row total})(\text{column total})}{(\text{total sample size})} \]
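In R, the whole table of expected counts can be computed at once from the row and column totals using an outer product. A minimal sketch, using a hypothetical 2 \(\times\) 2 table of observed counts `obs`:

# Hypothetical 2 x 2 table of observed counts
obs <- matrix(c(30, 10, 20, 40), nrow = 2, byrow = TRUE)
# Expected counts under independence:
# (row total * column total) / (total sample size)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected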
Example 1: Political Affiliation and Opinion
A random sample of 500 U.S. adults is questioned regarding their political affiliation and opinion on a tax reform bill. The results of this survey are summarized in the following contingency table:
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
We want to determine if an association (relationship) exists between Political Party Affiliation and Opinion on Tax Reform Bill. That is, are the two variables dependent?
What is the size of the table? 2 \(\times\) 3
Independence: Variables are independent if the distribution of one variable is the same for all categories of the other.
Association/Dependence: If the variables are not independent (i.e., dependent), then there is an association between them.
In Example 1: If political affiliation is independent of opinion on a tax reform bill, this implies that the proportion of Democrats and Republicans who favor, are indifferent to, or oppose the bill should be roughly the same, reflecting no preference trend influenced by their political group.
Independence is tested using a Chi-square test for Independence.
There are several ways to phrase these hypotheses, including:

Null Hypothesis (\(H_0\)): In the population, the two variables are independent (there is no association between them).

Alternative Hypothesis (\(H_A\)): In the population, the two variables are dependent (there is an association between them).
The Observed Counts Table (or simply Observed Table) represents the observed counts obtained from our sample.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
This table represents the expected counts under the null hypothesis, i.e., the frequency of observations that would be expected if the two variables (political affiliation and opinion) were independent.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | \(\frac{285 \cdot 202}{500} = 115.14\) | \(\frac{285 \cdot 150}{500} = 85.5\) | \(\frac{285 \cdot 148}{500} = 84.36\) | 285 |
| Republican | \(\frac{215 \cdot 202}{500} = 86.86\) | \(\frac{215 \cdot 150}{500} = 64.5\) | \(\frac{215 \cdot 148}{500} = 63.64\) | 215 |
| Total | 202 | 150 | 148 | 500 |
The Observed Proportions: the proportion in each cell, computed as the cell’s observed count divided by its row total.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 0.48 | 0.29 | 0.22 | 285 |
| Republican | 0.30 | 0.31 | 0.39 | 215 |
| Total | 0.404 | 0.300 | 0.296 | 500 |
The Expected Proportions: the proportion in each cell, computed as the cell’s expected count divided by its row total.
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 0.404 | 0.300 | 0.296 | 285 |
| Republican | 0.404 | 0.300 | 0.296 | 215 |
| Total | 0.404 | 0.300 | 0.296 | 500 |
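These row proportions are quick to compute in R with `prop.table()`. A minimal sketch using the Example 1 counts (the object name `observed` is ours):

# Observed counts from Example 1
observed <- matrix(c(138, 83, 64, 64, 67, 84), nrow = 2, byrow = TRUE,
    dimnames = list(c("Democrat", "Republican"),
                    c("Favor", "Indifferent", "Opposed")))
# Observed row proportions: each cell divided by its row total
round(prop.table(observed, margin = 1), 2)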
🤔 Question: Is it reasonable to conclude that the differences between the observed and expected counts are merely a result of random chance, or does there exist substantial evidence to question our null hypothesis?
Chi-Square Test Statistic
The Chi-square test statistic measures the difference between observed and expected frequencies. Under the null hypothesis, the test statistic will have (approximately) a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns of the table.
\[ \begin{equation} X^{2} = \sum_{i = 1}^{rc} \dfrac{(O_i - E_i)^2}{E_i} \end{equation} \]
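Once the observed and expected matrices are in hand, this sum is a one-liner in R. A minimal sketch, reusing the hypothetical `obs` and `expected` from the earlier snippet:

# Chi-square statistic: sum of (O - E)^2 / E over all cells
X2 <- sum((obs - expected)^2 / expected)
X2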
Conditions for the chi-square test
There are two conditions that must be checked before performing a chi-square test:

- Independence: each case is independent of the others (e.g., the data come from a random sample) and contributes to only one cell of the table.
- Sample size: each expected cell count must be at least 5.

Failing to check these conditions may affect the test’s error rates.
The Observed Counts Table
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 138 | 83 | 64 | 285 |
| Republican | 64 | 67 | 84 | 215 |
| Total | 202 | 150 | 148 | 500 |
The Expected Counts Table
| | Favor | Indifferent | Opposed | Total |
|---|---|---|---|---|
| Democrat | 115.14 | 85.5 | 84.36 | 285 |
| Republican | 86.86 | 64.5 | 63.64 | 215 |
| Total | 202 | 150 | 148 | 500 |
\[ \begin{align} X^{2}_{obs} &= \frac{(138 - 115.14)^2}{115.14} + \frac{(83 - 85.5)^2}{85.5} + \frac{(64 - 84.36)^2}{84.36} + \dots \\ & \quad \dots + \frac{(64 - 86.86)^2}{86.86} + \frac{(67 - 64.5)^2}{64.5} + \frac{(84 - 63.64)^2}{63.64} \\ &= 22.1524686 \end{align} \]
We now compare this to a chi-square distribution with \((r-1)(c-1) = (2-1)(3-1) = 2\) degrees of freedom. At \(\alpha = 0.05\), the critical value is \(\chi^2_{0.05,\,2} = 5.99\).
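In R, the critical value and \(p\)-value come from `qchisq()` and `pchisq()`:

# Critical value for alpha = 0.05 with 2 degrees of freedom
qchisq(0.95, df = 2)
# p-value: area to the right of the observed test statistic
# (approximately 1.5e-05 here)
pchisq(22.1524686, df = 2, lower.tail = FALSE)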
Decision: Since \(\chi^2 = 22.15\) falls in the rejection region (or equivalently, since the \(p\)-value \(< \alpha\)), we reject \(H_0\) in favour of the alternative.
Note
As with our ANOVA tests, there will be functions in R to calculate all the details about this test for us …
The example we just covered was a so-called test for two-way tables or \(\chi^2\) test for independence.
Another variant is the test for one-way tables, or the goodness-of-fit test.
In these tests we evaluate if the observed frequencies significantly deviate from the expected frequencies under a specific hypothesis about the distribution of the variable.
Application: used with one-way tables where there’s a single categorical variable
Goal: To evaluate if the observed frequencies deviate from a hypothesized distribution. \[ \begin{equation} \sum_{i = 1}^{k} \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_{df = k-1} \end{equation} \] A high \(\chi^2\) value indicates a poor fit; the data do not follow the hypothesized distribution.
Application: used with two-way contingency tables having two categorical variables
Goal: To determine if there is a significant association between the two variables. \[ \begin{equation} \sum_{i = 1}^{rc} \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2_{df = (r-1)(c-1)} \end{equation} \] A high \(\chi^2\) value indicates an association between the variables.
Example: NHL Hockey Birthdays
In Malcolm Gladwell’s book Outliers, he discusses a pattern regarding the birthdays of professional hockey players. Gladwell claims that an overwhelming majority of elite hockey players have birthdays in the first few months of the year (January, February, and March). This observation supports his argument about the relative age effect (RAE), which suggests that individuals born closer to the beginning of a calendar year have a significant advantage in sports.
Based on the data supplied here, let’s randomly sample some hockey players to see if birth month is related to making it to the NHL.
# Total sample size
n <- 80
# Create vectors for months and players
months <- c("February", "January", "July", "March", "October", "May", "June", "April", "September", "December", "August", "November")
players <- c(824, 885, 704, 833, 624, 783, 699, 796, 638, 552, 607, 563)
# Create a vector repeating each month according to the number of players
month_vector <- unlist(mapply(rep, months, players))
# Draw a random sample of n birth months
set.seed(2024)
hockey_sample <- sample(month_vector, n)
# Convert the month names to numbers: 1 = Jan, ..., 12 = Dec
hockey_sample <- match(hockey_sample, month.name)
# Create bins based on months (quarters of the year)
bins <- cut(hockey_sample, breaks = c(0, 3, 6, 9, 12),
    labels = c("Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"))
hockey_tab <- table(bins)
knitr::kable(t(hockey_tab))
| Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec |
|---|---|---|---|
| 24 | 20 | 19 | 17 |
A one-way table summarizing birthday months for 80 randomly sampled players from the NHL.
We want to investigate if each of the four quarters of the year is equally likely for the birth of a hockey player, which would be the expected scenario if birth month had no influence on becoming a hockey player.
\(H_0\): The distribution of hockey players’ birth months follows a uniform distribution across all four quarters.
\(H_A\): The distribution of hockey players’ birth months is not uniformly distributed across all four quarters.
| | Jan-Mar | Apr-Jun | Jul-Sep | Oct-Dec |
|---|---|---|---|---|
| Observed | 24 | 20 | 19 | 17 |
| Expected | 20 | 20 | 20 | 20 |
\[ \begin{align} X^2_{obs} &= \frac{(24-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(17-20)^2}{20}\\ &= 1.3 \end{align} \]
Now let’s find the corresponding \(p\)-value on the null distribution (a chi-squared distribution with \(k-1 = 3\) degrees of freedom)
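In R, `pchisq()` gives this upper-tail area directly:

# p-value for X^2 = 1.3 on k - 1 = 3 degrees of freedom
pchisq(1.3, df = 3, lower.tail = FALSE)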
Since the \(p\)-value (0.729) is larger than \(\alpha = 0.05\), we fail to reject the null hypothesis that the distribution of hockey players’ birth months follows a uniform distribution across all four quarters.
As we have seen with other hypothesis tests, rather than computing these tedious formulas “by hand”, we rely on software like R to perform these calculations for us.
The key arguments of `chisq.test()` are:

- `x`: a numeric vector or matrix. `x` and `y` can also both be factors.
- `y`: a numeric vector; ignored if `x` is a matrix. If `x` is a factor, `y` should be a factor of the same length.
- `p`: a vector of probabilities of the same length as `x`.
Alternatively, you could specify a contingency table, …
Returning to Example 1, we could perform the test in R using the following code:
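A minimal sketch, reusing the `observed` matrix of counts defined in the earlier proportions snippet:

# Chi-square test for independence on the 2 x 3 table
chisq.test(observed)

This should reproduce the statistic computed by hand above (\(X^2 = 22.15\) on 2 degrees of freedom).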
Returning to Example 2, we could perform the test in R using the following code:
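A minimal sketch, assuming the `hockey_tab` table from the sampling code above is still in memory:

# Goodness-of-fit test: are the four quarters equally likely?
chisq.test(hockey_tab, p = rep(1/4, 4))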
Example: Type 2 Diabetes
The following contingency table summarizes the results of an experiment evaluating three treatments for Type 2 Diabetes in patients aged 10-17 who were being treated with metformin. The three treatments considered were continued treatment with metformin (`met`), treatment with metformin combined with rosiglitazone (`rosi`), or a lifestyle intervention program (`lifestyle`). Each patient had a primary outcome, which was either lacked glycemic control (`failure`) or did not lack that control (`success`). Perform the appropriate hypothesis test.
\(H_0\): There is no difference in the effectiveness of the three treatments.
\(H_A\): There is some difference in effectiveness between the three treatments, e.g. perhaps the rosi treatment performed better than lifestyle.
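A minimal sketch, assuming the data live in a data frame called `diabetes2` with columns `treatment` and `outcome` (a data set by this name is included in the `openintro` package):

# Load the openintro package, which provides the diabetes2 data (assumed)
library(openintro)
# Chi-square test for independence between treatment and outcome
chisq.test(diabetes2$treatment, diabetes2$outcome)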
Pearson's Chi-squared test
data: diabetes2$treatment and diabetes2$outcome
X-squared = 8.1645, df = 2, p-value = 0.01687
Since the \(p\)-value is less than \(\alpha = 0.05\) we reject the null hypothesis in favour of the alternative. That is to say, there is sufficient evidence to suggest that there is some difference in effectiveness between the three treatments.
Comments on the Test Statistic
When the observed counts are close to the expected counts, our test statistic will be small.
Conversely, if there’s a big difference between the observed and expected counts, the test statistic will be large.
Hence a higher test statistic value implies stronger evidence against the null hypothesis (i.e., smaller \(p\)-values).