Post Midterm 2 Practice Problems

Nonparametric Statistics

Sign Test

Exercise 1 (Sign Test) A physiotherapist wants to test whether a new stretching routine improves flexibility. She records the change in forward reach (cm) for 10 patients (after − before):

x <- c(2, -1, 3, 0, 4, -2, 1, 5, -3, 2)

State the hypotheses

Solution

Let $d$ denote the population median change in forward reach (after $-$ before). Since “improves flexibility” corresponds to a positive change, the hypotheses are

\[ H_0: d = 0 \qquad\text{vs}\qquad H_A: d > 0 \]

Under $H_0$, positive and negative differences are equally likely.

Remove any observations if necessary.

Solution

For a sign test, any difference equal to 0 is removed because it is neither positive nor negative.

x_nz <- x[x != 0]
x_nz

[1] 2 -1 3 4 -2 1 5 -3 2

The code above removes the single 0. Among the 9 remaining observations,
- 6 are positive
- 3 are negative
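The counts above can be checked directly in R. This is a small sketch that re-creates `x` so it runs on its own; the names `n_pos` and `n_neg` are introduced here only for illustration.

```r
# Changes in forward reach (after - before) from Exercise 1
x <- c(2, -1, 3, 0, 4, -2, 1, 5, -3, 2)

# Drop zero differences, then count the signs
x_nz  <- x[x != 0]
n_pos <- sum(x_nz > 0)   # number of positive differences
n_neg <- sum(x_nz < 0)   # number of negative differences

c(n = length(x_nz), positive = n_pos, negative = n_neg)
```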
Calculate the $p$-value for the sign test at $\alpha = 0.05$
Solution
Let $X$ be the number of positive differences among the 9 nonzero differences. Under $H_0$,

\[ X \sim \text{Binomial}(n=9, p=0.5) \]

The observed test statistic is

\[ x = 6 \]

Since the alternative is $H_A:d>0$, this is a right-tailed test. The p-value is

\[ P(X \ge 6) \]

Using R we get:
pbinom(5, size = 9, prob = 0.5, lower.tail = FALSE)

[1] 0.2539063
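As a cross-check, the same right-tailed p-value can be obtained in two other ways: by summing the binomial probabilities for 6, 7, 8, and 9 successes, or with base R's `binom.test()`. The variable names below are illustrative.

```r
# P(X >= 6) for X ~ Binomial(9, 0.5), summed term by term
p_manual <- sum(dbinom(6:9, size = 9, prob = 0.5))

# Same test via binom.test(): 6 positives out of 9 nonzero differences
p_binom_test <- binom.test(6, 9, p = 0.5, alternative = "greater")$p.value

c(manual = p_manual, binom_test = p_binom_test)
```

Both values should agree with the `pbinom()` result above.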
State a conclusion in context

Solution

Since the $p$-value (0.2539) is greater than the significance level of $\alpha$ = 0.05, we fail to reject the null hypothesis. Hence there is not enough evidence at the 5% significance level to conclude that the new stretching routine improves median forward reach.

Wilcoxon Signed-Rank Test

Exercise 2 (Wilcoxon Signed-Rank Test) A study measures reaction time (ms) before and after caffeine consumption:

before <- c(250, 300, 275, 290, 310, 260)
after  <- c(240, 310, 260, 280, 295, 255)
dat <- data.frame(subject = 1:6, before, after)
dat

Define the differences (After − Before)
Solution

d <- after - before
d

[1] -10 10 -15 -10 -15 -5
Rank the absolute differences
Solution

First compute the absolute differences:

abs_d <- abs(d)
abs_d

[1] 10 10 15 10 15 5
Then rank them:
rank(abs_d, ties.method = "average")

[1] 3.0 3.0 5.5 3.0 5.5 1.0
- Since 5 is the smallest absolute difference, observation 6 receives rank 1.
- Since 10 is the next smallest, observations 1, 2, and 4 share rank 3 (obtained by averaging the ranks 2, 3, 4).
- Since 15 is the largest, observations 3 and 5 share rank 5.5 (obtained by averaging the ranks 5, 6).
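To see where the averaged ranks come from, one can first give every observation its own position (breaking ties by order of appearance) and then average the positions within each group of tied values. This is only a sketch of the idea; `pos` and `avg_rank` are names made up for this example.

```r
abs_d <- c(10, 10, 15, 10, 15, 5)

# Positions 1..6 with ties broken by order of appearance
pos <- rank(abs_d, ties.method = "first")

# Average the positions within each group of tied values
avg_rank <- ave(pos, abs_d)

data.frame(abs_d, pos, avg_rank)

# Matches R's built-in average-rank method
all.equal(avg_rank, rank(abs_d, ties.method = "average"))
```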

Compute the test statistic

Solution

Attach the signs and sum the positive ranks.

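# ranks of the absolute differences (average ranks for ties)
r <- rank(abs_d, ties.method = "average")

# compute signed ranks
signed_ranks <- sign(d) * r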

# test statistic (R reports sum of positive ranks)
V <- sum(r[d > 0])
V

[1] 3
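A quick sanity check: the positive and negative rank sums must add up to $n(n+1)/2$, the sum of the ranks 1 through $n$. The sketch below re-creates the differences so it is self-contained; `V_pos` and `V_neg` are illustrative names.

```r
d <- c(-10, 10, -15, -10, -15, -5)   # after - before
n <- length(d)
r <- rank(abs(d), ties.method = "average")

V_pos <- sum(r[d > 0])   # sum of ranks of positive differences
V_neg <- sum(r[d < 0])   # sum of ranks of negative differences

c(V_pos = V_pos, V_neg = V_neg, total = V_pos + V_neg, expected = n * (n + 1) / 2)
```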

Use the appropriate R function to find a $p$-value for this test.

Solution

wilcox.test(after, before, paired = TRUE)

Warning in wilcox.test.default(after, before, paired = TRUE): cannot compute
exact p-value with ties


    Wilcoxon signed rank test with continuity correction

data:  after and before
V = 3, p-value = 0.1367
alternative hypothesis: true location shift is not equal to 0

Alternatively we could have done:

wilcox.test(d)

Warning in wilcox.test.default(d): cannot compute exact p-value with ties


    Wilcoxon signed rank test with continuity correction

data:  d
V = 3, p-value = 0.1367
alternative hypothesis: true location is not equal to 0

Note to students: Since our data have ties, R cannot compute an exact $p$-value and instead uses a normal approximation with a continuity correction; the details of this correction were not covered in lecture, so just concern yourself with interpreting the output.
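For the curious, here is a sketch of how the reported p-value can be reproduced by hand. It assumes the usual normal approximation with a tie-adjusted variance and a continuity correction of 0.5, and that there are no zero differences (true for these data). This is optional material, and the variable names are illustrative.

```r
d <- c(-10, 10, -15, -10, -15, -5)   # after - before
n <- length(d)
r <- rank(abs(d), ties.method = "average")
V <- sum(r[d > 0])                   # observed statistic

# Mean and tie-adjusted standard deviation of V under H0
mu        <- n * (n + 1) / 4
tie_sizes <- as.vector(table(r))
sigma     <- sqrt(n * (n + 1) * (2 * n + 1) / 24 - sum(tie_sizes^3 - tie_sizes) / 48)

# Continuity correction: move V half a unit toward the mean
z <- (V - mu - sign(V - mu) * 0.5) / sigma

# Two-sided p-value
p_approx <- 2 * min(pnorm(z), pnorm(z, lower.tail = FALSE))
p_approx
```

The result should round to the 0.1367 shown in the `wilcox.test()` output.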

State the Decision and Conclusion in context.
Solution
Since the $p$-value (0.1367) is greater than the significance level of $\alpha$ = 0.05, we fail to reject the null hypothesis.

Hence there is insufficient evidence to suggest that caffeine has an effect on reaction time.
Why might this test be preferred over the sign test?

Solution

The signed-rank test uses both the magnitude and the direction of each difference, so it is generally more powerful than the sign test.

Mann-Whitney U Test (aka Wilcoxon Rank-Sum)

Exercise 3 (Mann–Whitney Test) A tennis coach compares serve speeds (km/h) between two training programs:

A <- c(110, 115, 120, 125, 130) # Program A
B <- c(105, 108, 112, 118, 122) # Program B

Assuming these groups are independent, perform the appropriate nonparametric test to determine whether there is a difference in serve speeds between the two training programs.

State the hypotheses

Solution

\[ H_0:\text{distributions equal} \quad H_A:\text{distributions differ} \]
Rank the observations
Solution

Combine the two samples and rank them:

# combine data
values <- c(A, B)
nA <- length(A)
nB <- length(B)
group <- c(rep("A", nA), rep("B", nB))

# compute ranks
ranks <- rank(values, ties.method = "average")
(df <- data.frame(values, group, ranks))

Compute the test statistic

Solution

wilcox.test(A, B)


    Wilcoxon rank sum exact test

data:  A and B
W = 19, p-value = 0.2222
alternative hypothesis: true location shift is not equal to 0

If we wanted to do this by hand:

# sum of ranks for group A
(W_A <- sum(ranks[group == "A"]))

[1] 34

(W_B <- sum(ranks[group == "B"]))

[1] 21

(U_A = W_A - (nA*(nA+1))/2)

[1] 19

(U_B = W_B - (nB*(nB+1))/2)

[1] 6

Hence the test statistic is:

min(U_A, U_B)

[1] 6

Hmm, but this does not agree with the test statistic reported by wilcox.test(A, B), which was W = 19.

NOTE: R reports the statistic for the first sample you pass, in this case $U = U_A$. If we want to match the test statistic we got by hand we’d do:

wilcox.test(B, A)


    Wilcoxon rank sum exact test

data:  B and A
W = 6, p-value = 0.2222
alternative hypothesis: true location shift is not equal to 0

Notice how either way we get the same $p$-value.
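The p-value is the same because the two statistics carry the same information: $U_A + U_B = n_A n_B$. The sketch below checks this identity and also reproduces the exact p-value by enumerating every way of assigning 5 of the 10 ranks to Program A. The enumeration approach and the names `U_all` and `p_exact` are illustrative, not something covered in lecture.

```r
A <- c(110, 115, 120, 125, 130)   # Program A
B <- c(105, 108, 112, 118, 122)   # Program B

values <- c(A, B)
nA <- length(A)
nB <- length(B)
ranks <- rank(values)

# Observed U statistics
U_A <- sum(ranks[1:nA]) - nA * (nA + 1) / 2
U_B <- sum(ranks[(nA + 1):(nA + nB)]) - nB * (nB + 1) / 2
U_A + U_B == nA * nB               # identity linking the two statistics

# Null distribution: every choice of nA ranks for group A is equally likely
idx   <- combn(nA + nB, nA)        # each column is one possible assignment
U_all <- colSums(matrix(ranks[idx], nrow = nA)) - nA * (nA + 1) / 2

# Two-sided exact p-value: double the smaller tail probability
p_exact <- 2 * min(mean(U_all <= U_A), mean(U_all >= U_A))
p_exact
```

The result should round to the 0.2222 reported by `wilcox.test()`.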

Decision and conclusion in context.
Solution
Since the $p$-value (0.2222) is greater than the significance level of $\alpha$ = 0.05, we fail to reject the null hypothesis.

Hence there is insufficient evidence to suggest that one training program produces different serve speeds than the other.
Why is this test preferred?
Solution
The appropriate test is the Mann–Whitney test (also called the Wilcoxon rank-sum test), which compares the distributions of the two groups, since:
- The two groups (Program A and Program B) are independent
- The sample sizes are small
- We are not comfortable assuming normality

Kruskal–Wallis Test

Exercise 4 (Kruskal–Wallis Test) A researcher compares pain scores (lower = better) across three treatments:

A <- c(3, 4, 5) # Treatment A
B <- c(2, 3, 4) # Treatment B
C <- c(6, 7, 8) # Treatment C

State hypotheses

Solution

The Kruskal–Wallis test compares 3 or more independent groups. The hypotheses are:

\[ \begin{align} H_0&:\text{all population distributions are the same}\\ H_A&:\text{at least one population distribution differs} \end{align} \]

Note: If the group distributions have similar shape, this is often interpreted as a test for whether at least one population median differs.
Rank all data
Solution

Combine all the data and rank them:

group <- c(rep("A", 3), rep("B", 3), rep("C", 3))
score <- c(A, B, C)
kw_dat <- data.frame(group, score, rank = rank(score, ties.method = "average"))
kw_dat[order(kw_dat$score), ]

Rank sums:

aggregate(rank ~ group, data = kw_dat, sum)
So,
- $R_A = 2.5 + 4.5 + 6 = 13$
- $R_B = 1 + 2.5 + 4.5 = 8$
- $R_C = 7 + 8 + 9 = 24$
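A useful check on hand-computed rank sums is that they must total $N(N+1)/2$. The sketch below recomputes the rank sums with `tapply()` (an alternative to the `aggregate()` call above) and verifies the total; `rk` and `R` are illustrative names.

```r
A <- c(3, 4, 5)   # Treatment A
B <- c(2, 3, 4)   # Treatment B
C <- c(6, 7, 8)   # Treatment C

score <- c(A, B, C)
group <- rep(c("A", "B", "C"), each = 3)

rk <- rank(score, ties.method = "average")
R  <- tapply(rk, group, sum)   # rank sum for each treatment
R

N <- length(score)
sum(R) == N * (N + 1) / 2      # rank sums must add up to 45
```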
Compute the Kruskal–Wallis statistic
Solution
The test statistic is

\[ H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1) \]

Here:
- $N=9$
- $k=3$
- $n_1=n_2=n_3=3$
- $R_A=13,\ R_B=8,\ R_C=24$
So,

\[ H = \frac{12}{9(10)}\left(\frac{13^2}{3} + \frac{8^2}{3} + \frac{24^2}{3}\right) - 3(10) \]
N <- 9
R <- c(13, 8, 24)
n <- c(3, 3, 3)
H <- 12/(N*(N+1)) * sum(R^2/n) - 3*(N+1)
H

[1] 5.955556
So the Kruskal–Wallis statistic is approximately

\[ H \approx 5.96 \]

But wait: we have ties, so we need to apply the tie correction.

tie_sizes <- table(score)
tie_sizes <- tie_sizes[tie_sizes > 1]
# don't name this "C" (that would overwrite the data vector C above!)
Correction <- 1 - sum(tie_sizes^3 - tie_sizes)/(N^3 - N)
H_corrected <- H / Correction
H_corrected

[1] 6.056497
Compare to $\chi^2$ with appropriate df
Solution
For a Kruskal–Wallis test with $k=3$ groups, the degrees of freedom are

\[ df = k - 1 = 2 \]

Using the chi-square approximation, the p-value is
pchisq(H_corrected, df = 2, lower.tail = FALSE)

[1] 0.04840033
This gives a p-value of about 0.048.

Since this is slightly less than 0.05, we reject $H_0$ at the 5% significance level.
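Equivalently, the tie-corrected statistic can be compared with the chi-square critical value for $\alpha = 0.05$ and 2 degrees of freedom. In this sketch the value of `H_corrected` is copied from the output above, and `crit` is an illustrative name.

```r
H_corrected <- 6.056497          # tie-corrected statistic from above
crit <- qchisq(0.95, df = 2)     # upper 5% critical value
crit

# The statistic exceeds the critical value, which agrees with p < 0.05
H_corrected > crit
```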

Perform the test in R

Solution

group <- c(rep("A", length(A)), rep("B", length(B)), rep("C", length(C)))
values <- c(A, B, C)

ktest = kruskal.test(values ~ group)
ktest


    Kruskal-Wallis rank sum test

data:  values by group
Kruskal-Wallis chi-squared = 6.0565, df = 2, p-value = 0.0484

State the conclusion in context

Solution

Since the $p$-value (0.0484) is less than the significance level of $\alpha$ = 0.05, we reject the null hypothesis in favour of the alternative.

Hence there is evidence at the 5% significance level that at least one of the treatment groups has a different distribution of pain scores.

If we additionally assume the population distributions have similar shape, this suggests that at least one treatment has a different population median pain score.

Miscellaneous

Exercise 5 (Conceptual Multiple Choice) Which of the following is TRUE?

A. The sign test uses magnitudes of differences
B. The Wilcoxon signed-rank test assumes normality
C. The Mann–Whitney test compares medians (under similar shapes)
D. The Kruskal–Wallis test requires equal variances

Solution

❌ A is false because the sign test only uses the direction of the differences, not their magnitudes.
❌ B is false because the Wilcoxon signed-rank test does not assume normality.
✅ C is true because the Mann–Whitney test is often interpreted as comparing medians when the two population distributions have similar shape.
❌ D is false because the Kruskal–Wallis test does not require equal variances in the same way ANOVA does.

Exercise 6 (Interpretation (non-significant result)) A Mann–Whitney test comparing two diets gives:

p = 0.18

State the statistical decision

Solution

Since $p=0.18 > 0.05$, we fail to reject $H_0$ at the 5% significance level.
Write a proper contextual conclusion

Solution

At the 5% significance level, there is not enough evidence to conclude that the two diets differ in the population distribution of the outcome being measured.

If the two population distributions are assumed to have similar shape, then we would say there is not enough evidence to conclude that the population medians differ.
Why is “no difference exists” incorrect?

Solution

A non-significant result does not prove that the two populations are the same. It only means that this sample did not provide strong enough evidence against $H_0$.

There may still be a real difference, but the study may have lacked enough data or power to detect it.
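A small simulation can make this concrete. The sketch below assumes two normal populations that truly differ by half a standard deviation, draws 10 observations per group, and estimates how often a Mann–Whitney test at $\alpha = 0.05$ detects the difference. All settings (`n_per_group`, `shift`, `n_sims`, the seed) are arbitrary choices for illustration.

```r
set.seed(1)

n_per_group <- 10     # small samples, as in a low-powered study
shift       <- 0.5    # true difference in location (in SD units)
n_sims      <- 500    # number of simulated studies

# For each simulated study, record whether the test rejects H0
reject <- replicate(n_sims, {
  g1 <- rnorm(n_per_group, mean = 0)
  g2 <- rnorm(n_per_group, mean = shift)
  wilcox.test(g1, g2)$p.value < 0.05
})

# Estimated power: proportion of studies that detect the real difference
power_est <- mean(reject)
power_est
```

With samples this small, the estimated power should be well below one half, so most of the simulated studies fail to reject $H_0$ even though a real difference exists.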

Exercise 7 (Choosing the Right Test) For each scenario, choose the most appropriate test:

Paired data, only direction of change matters
Solution
I would accept either the sign test or the Wilcoxon signed-rank test here. Here is some extra nuance:

We might prefer the sign test because we are only interested in whether the treatment tends to produce a change in a particular direction, not in how large the changes are. In addition, the sign test is more appropriate when:
- the sample size is small
- the distribution of differences may be skewed
- or there may be outliers that could distort magnitude-based methods
By reducing each observation to its sign (+ or −), the sign test provides a robust method for testing whether positive differences occur more frequently than would be expected by chance.
Two independent groups, skewed data

Solution

Use the Mann–Whitney test (also called the Wilcoxon rank-sum test).
Three independent groups, ordinal outcome

Solution

Use the Kruskal–Wallis test.
Paired data, want to account for magnitude of change

Solution

Use the Wilcoxon signed-rank test.

Exercise 8 (Assumptions (important conceptual)) The Mann–Whitney test is often interpreted as a test of medians.

What assumption is required for this interpretation?

Solution

To interpret the Mann–Whitney test as a test of medians, the two population distributions should have the same general shape and spread, differing only possibly in location.
What does the test actually test without that assumption?

Solution

Without the similar-shape assumption, the Mann–Whitney test is more generally a test of whether the two population distributions are the same.

A significant result could be caused by differences in center, spread, shape, or some combination of these.
Give an example where this interpretation would fail

Solution

Suppose two populations have the same median, but one is much more spread out than the other. Then the Mann–Whitney test might detect a difference even though the medians are equal.

In that case, interpreting the test as a “median test” would be misleading.
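Here is one way to illustrate the problem by simulation. Note that this sketch uses a variant of the example above: the two populations have the same median (0) but opposite skew rather than different spread. Both are built from an exponential distribution, whose median is $\ln 2$; the construction, sample size, and seed are assumptions made for this illustration.

```r
# Population X: right-skewed, median 0.  Population Y: mirror image, median 0.
# Theoretical P(X > Y): here X - Y = E1 + E2 - 2*log(2), with E1 + E2 ~ Gamma(shape = 2)
p_x_gt_y <- pgamma(2 * log(2), shape = 2, lower.tail = FALSE)
p_x_gt_y   # not 0.5, even though the medians are equal

set.seed(1)
n <- 500
x <- rexp(n) - log(2)        # right-skewed sample
y <- -(rexp(n) - log(2))     # left-skewed sample

# Sample medians are both close to 0
c(median_x = median(x), median_y = median(y))

# Yet the Mann-Whitney test strongly rejects H0
p_value <- wilcox.test(x, y)$p.value
p_value
```

The test responds to $P(X > Y) \ne 0.5$, not to a difference in medians, so reading a significant result here as "the medians differ" would be wrong.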

Exercise 9 (Small Data, Exact Thinking) You observe paired differences:

+, +, +, -, +

Perform an exact sign test to determine whether the median difference between pairs is positive.

Solution

There are 5 nonzero paired differences, of which 4 are positive.

Let $X$ be the number of positive signs. Under $H_0$,

\[ X \sim \text{Binomial}(n=5, p=0.5) \]

The observed value is $x=4$.

Since we are testing whether the median difference is positive, this is a right-tailed test and the p-value is

\[P(X \ge 4)\]
Compute the exact p-value
Solution
$P(X \ge 4) = P(X=4) + P(X=5)$

# Pr(X > 3) = Pr(X >= 4)
pval <- pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)
pval

[1] 0.1875
or equivalently:
# Pr(X >= 4) = 1 - Pr(X < 4) = 1 - Pr(X <= 3)
1 - pbinom(3, size = 5, prob = 0.5)

[1] 0.1875
Or directly from the binomial distribution:

\[ \begin{align} P(X=k)= {n \choose k} (p)^k (1-p)^{n-k} \end{align} \]

we get:

\[ \begin{align}
P(X\geq 4) &= {5 \choose 4} (0.5)^4 (1-0.5)^{5-4} + {5 \choose 5} (0.5)^5 (1-0.5)^{5-5}\\
&= {5 \choose 4} (0.5)^4 (1-0.5)^{1} + {5 \choose 5} (0.5)^5\\
&= {5 \choose 4} (1/2)^5 + {5 \choose 5} (1/2)^5\\
&= \frac{{5 \choose 4} + {5 \choose 5}}{2^5} \\
&= \frac{5 + 1}{32} = \frac{6}{32} = 0.1875
\end{align} \]
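The hand calculation can be mirrored in R with `dbinom()` for the individual probabilities and `choose()` for the binomial coefficients. The variable names are illustrative.

```r
# P(X = 4) and P(X = 5) for X ~ Binomial(5, 0.5)
p_terms <- dbinom(4:5, size = 5, prob = 0.5)
p_sum   <- sum(p_terms)

# Same value from the counting argument: (5 + 1) / 32
p_count <- (choose(5, 4) + choose(5, 5)) / 2^5

c(p_sum = p_sum, p_count = p_count)
```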
Would you reject at $\alpha = 0.05$?

Solution

No. Since $0.1875 > 0.05$, we fail to reject $H_0$.

There is not enough evidence at the 5% level to conclude that the median difference is positive.

Exercise 10 (Extension / Critical Thinking) Explain:

Why might a nonparametric test be less powerful than a parametric test when assumptions do hold?

Solution

When the assumptions of a parametric test are satisfied, the parametric test uses more of the information in the data.

For example, a $t$-test uses the actual numerical values of the observations, while many nonparametric tests replace the data with signs or ranks. Since signs and ranks contain less information than the original measurements, the nonparametric test may be less sensitive to real differences.

As a result, when parametric assumptions hold, the parametric test often has greater power to detect an effect.
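The loss of power can be seen in a small simulation. The sketch below assumes normally distributed paired differences with a true mean of 0.5 standard deviations and compares how often a one-sample $t$-test and a two-sided sign test reject $H_0$ at $\alpha = 0.05$. The helper `sign_test_p()` and all settings are made up for this illustration.

```r
set.seed(1)

n      <- 20     # number of paired differences per study
shift  <- 0.5    # true mean (and median) difference, in SD units
n_sims <- 1000   # number of simulated studies

# Two-sided exact sign test p-value (hypothetical helper for this sketch)
sign_test_p <- function(d) {
  d <- d[d != 0]                 # drop zero differences
  k <- sum(d > 0)                # number of positive signs
  m <- length(d)
  lower <- pbinom(k, m, 0.5)                          # P(X <= k)
  upper <- pbinom(k - 1, m, 0.5, lower.tail = FALSE)  # P(X >= k)
  min(1, 2 * min(lower, upper))
}

# For each simulated study, record whether each test rejects H0
res <- replicate(n_sims, {
  d <- rnorm(n, mean = shift)
  c(t = t.test(d)$p.value < 0.05, sign = sign_test_p(d) < 0.05)
})

# Estimated power of each test
power <- rowMeans(res)
power
```

Because the normality assumption holds here, the $t$-test should reject noticeably more often than the sign test.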