Post Midterm 2 Practice Problems

Nonparametric Statistics

Sign Test

Exercise 1 (Sign Test) A physiotherapist wants to test whether a new stretching routine improves flexibility. She records the change in forward reach (cm) for 10 patients (after − before):

x <- c(2, -1, 3, 0, 4, -2, 1, 5, -3, 2)
  1. State the hypotheses

    Let \(d\) denote the population median change in forward reach (after \(-\) before). Since “improves flexibility” corresponds to a positive change, the hypotheses are

    \[ H_0: d = 0 \qquad\text{vs}\qquad H_A: d > 0 \]

    Under \(H_0\), positive and negative differences are equally likely.

  2. Remove any observations if necessary.

    For a sign test, any difference equal to 0 is removed because it is neither positive nor negative.

    x_nz <- x[x != 0]
    x_nz
    [1]  2 -1  3  4 -2  1  5 -3  2

    In the above code, the single 0 is removed. Among the 9 remaining observations,

    • 6 are positive
    • 3 are negative
  3. Calculate the \(p\)-value for the sign test at \(\alpha = 0.05\)

    Let \(X\) be the number of positive differences among the 9 nonzero differences. Under \(H_0\),

    \[ X \sim \text{Binomial}(n=9, p=0.5) \]

    The observed test statistic is

    \[ x = 6 \]

    Since the alternative is \(H_A:d>0\), this is a right-tailed test. The p-value is

    \[ P(X \ge 6) \]

    Using R we get:

    pbinom(5, size = 9, prob = 0.5, lower.tail = FALSE)
    [1] 0.2539063
  4. State a conclusion in context

    Since the \(p\)-value (0.2539) is greater than the significance level of \(\alpha\) = 0.05, we fail to reject the null hypothesis. Hence there is not enough evidence at the 5% significance level to conclude that the new stretching routine improves median forward reach.
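
As a cross-check, the same right-tailed sign test can be run with `binom.test()` from base R, which performs an exact binomial test. This is a minimal sketch, assuming the data vector `x` from the exercise; the names `n_pos`, `n_nz` and `sign_test` are introduced here for illustration.

```r
# Data from Exercise 1 (after - before)
x <- c(2, -1, 3, 0, 4, -2, 1, 5, -3, 2)

n_pos <- sum(x > 0)    # number of positive differences
n_nz  <- sum(x != 0)   # number of nonzero differences

# Exact right-tailed binomial test of H0: p = 0.5
sign_test <- binom.test(n_pos, n_nz, p = 0.5, alternative = "greater")
sign_test$p.value
```

The p-value should agree with the `pbinom()` calculation in step 3.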

Wilcoxon Signed-Rank Test

Exercise 2 (Wilcoxon Signed-Rank Test) A study measures reaction time (ms) before and after caffeine consumption:

before <- c(250, 300, 275, 290, 310, 260)
after  <- c(240, 310, 260, 280, 295, 255)
dat <- data.frame(subject = 1:6, before, after)
dat
  1. Define the differences (After − Before)

    d <- after - before
    d
    [1] -10  10 -15 -10 -15  -5
  2. Rank the absolute differences

    First compute the absolute differences:

    abs_d <- abs(d)
    abs_d
    [1] 10 10 15 10 15  5

    Then rank them:

    rank(abs_d, ties.method = "average")
    [1] 3.0 3.0 5.5 3.0 5.5 1.0
    • Since 5 is the smallest number, observation 6 receives the rank: 1
    • Since 10 is the next smallest number, observations 1, 2, 4 receive the rank: 3 (obtained by averaging the ranks: 2, 3, 4)
    • Since 15 is the next smallest number, observations 3, 5 receive the rank: 5.5 (obtained by averaging the ranks: 5, 6)
  3. Compute the test statistic

    Attach the signs and sum the positive ranks.

    # ranks of the absolute differences (from step 2)
    r <- rank(abs_d, ties.method = "average")

    # compute signed ranks
    signed_ranks <- sign(d) * r
    
    # test statistic (R reports sum of positive ranks)
    V <- sum(r[d > 0])
    V
    [1] 3
  4. Use the appropriate R function to find a \(p\)-value for this test.

    wilcox.test(after, before, paired = TRUE)
    Warning in wilcox.test.default(after, before, paired = TRUE): cannot compute
    exact p-value with ties
    
        Wilcoxon signed rank test with continuity correction
    
    data:  after and before
    V = 3, p-value = 0.1367
    alternative hypothesis: true location shift is not equal to 0

    Alternatively we could have done:

    wilcox.test(d) 
    Warning in wilcox.test.default(d): cannot compute exact p-value with ties
    
        Wilcoxon signed rank test with continuity correction
    
    data:  d
    V = 3, p-value = 0.1367
    alternative hypothesis: true location is not equal to 0

    Note to students: Since our data contain ties, R cannot compute an exact \(p\)-value and instead uses a normal approximation with a continuity correction; the details of this correction were not covered in lecture, so just concern yourself with interpreting the output.
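
For readers curious about what R does here, the reported p-value can be reproduced with a normal approximation to \(V\) that adjusts the variance for ties and applies a continuity correction of 0.5. This goes beyond what was covered in lecture; it is a sketch of the standard large-sample formula, and the names `mu_V`, `sigma_V`, `z` and `p_approx` are introduced here for illustration.

```r
before <- c(250, 300, 275, 290, 310, 260)
after  <- c(240, 310, 260, 280, 295, 255)
d <- after - before
d <- d[d != 0]                       # zero differences would be dropped (none here)

n <- length(d)
r <- rank(abs(d), ties.method = "average")
V <- sum(r[d > 0])                   # sum of positive ranks

tie_sizes <- table(r)                # sizes of the tied groups among the ranks
mu_V    <- n * (n + 1) / 4           # mean of V under H0
sigma_V <- sqrt(n * (n + 1) * (2 * n + 1) / 24 -
                sum(tie_sizes^3 - tie_sizes) / 48)

# continuity correction: move V half a unit toward its null mean
z <- (V - mu_V - sign(V - mu_V) * 0.5) / sigma_V
p_approx <- 2 * pnorm(-abs(z))       # two-sided p-value
p_approx
```

The result should match the 0.1367 shown in the `wilcox.test()` output.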

  5. State the Decision and Conclusion in context.

    Since the \(p\)-value (0.1367) is greater than the significance level of \(\alpha\) = 0.05, we fail to reject the null hypothesis.

    Hence there is insufficient evidence to suggest that caffeine has an effect on reaction time.

  6. Why might this test be preferred over the sign-test?

    The signed-rank test uses both the magnitude and the direction of the differences, so it is generally more powerful than the sign test.
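
To make the comparison concrete, here is the sign test applied to the same six differences. This is only an illustration: a single data set cannot demonstrate power, which is a property of a test over repeated samples. The sketch uses `binom.test()` from base R with its default two-sided alternative, to match the `wilcox.test()` output above; `n_pos`, `n_nz` and `p_sign` are names introduced here.

```r
before <- c(250, 300, 275, 290, 310, 260)
after  <- c(240, 310, 260, 280, 295, 255)
d <- after - before

n_pos <- sum(d > 0)    # number of positive differences
n_nz  <- sum(d != 0)   # number of nonzero differences

# Two-sided exact sign test
p_sign <- binom.test(n_pos, n_nz, p = 0.5)$p.value
p_sign
```

The sign test sees only one positive and five negative differences and gives a p-value of 14/64 ≈ 0.219, which is larger than the 0.1367 reported by the signed-rank test on the same data.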

Mann-Whitney U Test (aka Wilcoxon Rank-Sum)

Exercise 3 (Mann–Whitney Test) A tennis coach compares serve speeds (km/h) between two training programs:

A <- c(110, 115, 120, 125, 130) # Program A
B <- c(105, 108, 112, 118, 122) # Program B

Assuming these groups are independent, perform the appropriate nonparametric test to determine whether there is a difference in serve speeds between the two training programs.

  1. State the hypotheses

    \[ H_0:\text{distributions equal} \quad H_A:\text{distributions differ} \]

  2. Rank the observations

    Combine the two samples and rank all of the observations together:

    # combine data
    values <- c(A, B)
    nA = length(A)
    nB = length(B)
    group  <- c(rep("A", nA), rep("B", nB))
    
    # compute ranks
    ranks <- rank(values, ties.method = "average")
    
    (df <- data.frame(values, group, ranks))
  3. Compute the test statistic

    wilcox.test(A, B)
    
        Wilcoxon rank sum exact test
    
    data:  A and B
    W = 19, p-value = 0.2222
    alternative hypothesis: true location shift is not equal to 0

    If we wanted to do this by hand:

    # sum of ranks for group A
    (W_A <- sum(ranks[group == "A"]))
    [1] 34
    (W_B <- sum(ranks[group == "B"]))
    [1] 21
    (U_A = W_A - (nA*(nA+1))/2)
    [1] 19
    (U_B = W_B - (nB*(nB+1))/2)
    [1] 6

    Hence the test statistic is:

    min(U_A, U_B)
    [1] 6

    Hmmm, but this does not agree with the test statistic (W = 19) produced in the output of wilcox.test(A, B)!

    NOTE: R reports the statistic for the first sample you pass, in this case \(U = U_A\). If we want to match the test statistic we got by hand we’d do:

    wilcox.test(B, A)
    
        Wilcoxon rank sum exact test
    
    data:  B and A
    W = 6, p-value = 0.2222
    alternative hypothesis: true location shift is not equal to 0

    Notice how either way we get the same \(p\)-value.
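
Another way to see what \(U\) measures: \(U_A\) is the number of (A, B) pairs in which the observation from A is larger than the one from B, with any tied pair counting one half. A minimal sketch using `outer()`; the names `U_A_pairs` and `U_B_pairs` are introduced here for illustration.

```r
A <- c(110, 115, 120, 125, 130) # Program A
B <- c(105, 108, 112, 118, 122) # Program B

# Compare every A value with every B value (5 x 5 = 25 pairs)
U_A_pairs <- sum(outer(A, B, ">")) + 0.5 * sum(outer(A, B, "=="))
U_B_pairs <- sum(outer(B, A, ">")) + 0.5 * sum(outer(B, A, "=="))

c(U_A = U_A_pairs, U_B = U_B_pairs)

# The two statistics always add up to the number of pairs
U_A_pairs + U_B_pairs == length(A) * length(B)
```

This gives \(U_A = 19\) and \(U_B = 6\), matching the rank-sum calculation. Since the two values always sum to \(n_A n_B = 25\), reporting one is equivalent to reporting the other, which is why swapping the samples changes W but not the p-value.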

  4. Decision and conclusion in context.

    wtest = wilcox.test(B, A)

    Since the \(p\)-value (0.2222) is greater than the significance level of \(\alpha\) = 0.05, we fail to reject the null hypothesis.

    Hence there is insufficient evidence to suggest that one training program produces different serve speeds than the other.
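
Because there are no ties and the samples are small, R reports an exact p-value. Under \(H_0\), every choice of 5 of the 10 ranks for Program A is equally likely, so the null distribution of \(U\) can be enumerated directly. A minimal sketch, assuming the by-hand value \(U_A = 19\) from step 3; `rank_sets`, `U_null`, `p_upper` and `p_exact` are names introduced here.

```r
nA <- 5
nB <- 5
U_obs <- 19                                  # U_A from step 3

# Each column is one possible set of ranks for group A: choose(10, 5) = 252 sets
rank_sets <- combn(nA + nB, nA)
U_null <- colSums(rank_sets) - nA * (nA + 1) / 2

# Two-sided p-value: double the upper-tail probability (capped at 1)
p_upper <- mean(U_null >= U_obs)
p_exact <- min(1, 2 * p_upper)
p_exact
```

Here 28 of the 252 assignments give \(U \ge 19\), so the p-value is \(2 \times 28/252 \approx 0.2222\), in agreement with `wilcox.test()`.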

  5. Why is this test preferred?

    The appropriate test is the Mann–Whitney test (also called the Wilcoxon rank-sum test), which compares the distributions of the two groups, since:

    • The two groups (Program A and Program B) are independent
    • The sample sizes are small
    • We are not comfortable assuming normality

Kruskal–Wallis Test

Exercise 4 (Kruskal–Wallis Test) A researcher compares pain scores (lower = better) across three treatments:

A <- c(3, 4, 5) # Treatment A
B <- c(2, 3, 4) # Treatment B
C <- c(6, 7, 8) # Treatment C
  1. State hypotheses

    The Kruskal–Wallis test compares 3 or more independent groups. The hypotheses are:

    \[ \begin{align} H_0&:\text{all population distributions are the same}\\ H_A&:\text{at least one population distribution differs} \end{align} \]

    Note: If the group distributions have similar shape, this is often interpreted as a test for whether at least one population median differs.

  2. Rank all data

    Combine all data and rank them

    group <- c(rep("A", 3), rep("B", 3), rep("C", 3))
    score <- c(A, B, C)
    kw_dat <- data.frame(group, score, rank = rank(score, ties.method = "average"))
    kw_dat[order(kw_dat$score), ]

    Rank sums:

    aggregate(rank ~ group, data = kw_dat, sum)

    So,

    • \(R_A = 2.5 + 4.5 + 6 = 13\)

    • \(R_B = 1 + 2.5 + 4.5 = 8\)

    • \(R_C = 7 + 8 + 9 = 24\)

  3. Compute the Kruskal–Wallis statistic

    The test statistic is

    \[ H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1) \]

    Here:

    • \(N=9\)
    • \(k=3\)
    • \(n_1=n_2=n_3=3\)
    • \(R_A=13,\ R_B=8,\ R_C=24\)

    So,

    \[ H = \frac{12}{9(10)}\left(\frac{13^2}{3} + \frac{8^2}{3} + \frac{24^2}{3}\right) - 3(10) \]

    N <- 9
    R <- c(13, 8, 24)
    n <- c(3, 3, 3)
    H <- 12/(N*(N+1)) * sum(R^2/n) - 3*(N+1)
    H
    [1] 5.955556

    So the Kruskal–Wallis statistic is approximately

    \[ H \approx 5.96 \]

    But WAIT!! We have ties. We need to use the tie correction

    tie_sizes <- table(score)
    tie_sizes <- tie_sizes[tie_sizes > 1]
    
    # don't name this "C" (that would overwrite your data vector above!)
    Correction <- 1 - sum(tie_sizes^3 - tie_sizes)/(N^3 - N)
    
    
    H_corrected <- H / Correction
    H_corrected
    [1] 6.056497
  4. Compare to \(\chi^2\) with appropriate df

    For a Kruskal–Wallis test with \(k=3\) groups, the degrees of freedom are

    \[ df = k - 1 = 2 \]

    Using the chi-square approximation, the p-value is

    pchisq(H_corrected, df = 2, lower.tail = FALSE)
    [1] 0.04840033

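    This gives a p-value of about 0.048. (Without the tie correction, \(H \approx 5.96\) would give a p-value of about 0.051, so the correction matters here.)

    Since 0.048 is slightly less than 0.05, we reject \(H_0\) at the 5% significance level.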

  5. Perform the test in R

    group <- c(rep("A", length(A)), rep("B", length(B)), rep("C", length(C)))
    values <- c(A, B, C)
    
    ktest = kruskal.test(values ~ group)
    ktest
    
        Kruskal-Wallis rank sum test
    
    data:  values by group
    Kruskal-Wallis chi-squared = 6.0565, df = 2, p-value = 0.0484
  6. State the conclusion in context

    Since the \(p\)-value (0.0484) is less than the significance level of \(\alpha\) = 0.05, we reject the null hypothesis in favour of the alternative.

    There is evidence that at least one of the treatment groups has a different distribution of pain scores.

    If we additionally assume the population distributions have similar shape, this suggests that at least one treatment has a different population median pain score.
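
Steps 2 to 4 can be collected into a small helper. This is a sketch rather than a replacement for `kruskal.test()`; the function name `kw_by_hand` and the variable names `scores`, `trt` and `res` are hypothetical, introduced here for illustration.

```r
# Hypothetical helper: tie-corrected Kruskal-Wallis statistic and chi-square p-value
kw_by_hand <- function(values, group) {
  N  <- length(values)
  rk <- rank(values, ties.method = "average")
  R  <- tapply(rk, group, sum)        # rank sum per group
  n  <- tapply(rk, group, length)     # sample size per group

  # uncorrected statistic
  H <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)

  # tie correction (untied values contribute 1^3 - 1 = 0)
  t_sizes <- table(values)
  correction <- 1 - sum(t_sizes^3 - t_sizes) / (N^3 - N)
  H_corr <- H / correction

  df <- length(R) - 1
  list(statistic = H_corr, df = df,
       p.value = pchisq(H_corr, df = df, lower.tail = FALSE))
}

scores <- c(3, 4, 5,  2, 3, 4,  6, 7, 8)      # Treatments A, B, C
trt    <- rep(c("A", "B", "C"), each = 3)
res <- kw_by_hand(scores, trt)
res
```

The statistic and p-value should agree with the `kruskal.test()` output in step 5.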

Miscellaneous

Exercise 5 (Conceptual Multiple Choice) Which of the following is TRUE?

  1. The sign test uses magnitudes of differences
  2. The Wilcoxon signed-rank test assumes normality
  3. The Mann–Whitney test compares medians (under similar shapes)
  4. The Kruskal–Wallis test requires equal variances
  • ❌ Statement 1 is false because the sign test only uses the direction of the differences, not their magnitudes.
  • ❌ Statement 2 is false because the Wilcoxon signed-rank test does not assume normality.
  • ✅ Statement 3 is true because the Mann–Whitney test is often interpreted as comparing medians when the two population distributions have similar shape.
  • ❌ Statement 4 is false because the Kruskal–Wallis test does not require equal variances in the same way ANOVA does.

Exercise 6 (Interpretation (non-significant result)) A Mann–Whitney test comparing two diets gives:

p = 0.18

  1. State the statistical decision

    Since \(p=0.18 > 0.05\), we fail to reject \(H_0\) at the 5% significance level.

  2. Write a proper contextual conclusion

    At the 5% significance level, there is not enough evidence to conclude that the two diets differ in the population distribution of the outcome being measured.

    If the two population distributions are assumed to have similar shape, then we would say there is not enough evidence to conclude that the population medians differ.

  3. Why is “no difference exists” incorrect?

    A non-significant result does not prove that the two populations are the same. It only means that this sample did not provide strong enough evidence against \(H_0\).

    There may still be a real difference, but the study may have lacked enough data or power to detect it.

Exercise 7 (Choosing the Right Test) For each scenario, choose the most appropriate test:

  1. Paired data, only direction of change matters

    I would accept the sign test or the signed-rank test here. Here is some extra nuance:

    We might prefer the sign test here because we are only interested in whether the changes tend to go in one direction, not in how large the changes are. In addition, the sign test is more appropriate when:

    • the sample size is small
    • the distribution of differences may be skewed
    • or there may be outliers that could distort magnitude-based methods

    By reducing each observation to its sign (+ or −), the sign test provides a robust method for testing whether positive differences occur more frequently than would be expected by chance.

  2. Two independent groups, skewed data

    Use the Mann–Whitney test (also called the Wilcoxon rank-sum test).

  3. Three independent groups, ordinal outcome

    Use the Kruskal–Wallis test.

  4. Paired data, want to account for magnitude of change

    Use the Wilcoxon signed-rank test.

Exercise 8 (Assumptions (important conceptual)) The Mann–Whitney test is often interpreted as a test of medians.

  1. What assumption is required for this interpretation?

    To interpret the Mann–Whitney test as a test of medians, the two population distributions should have the same general shape and spread, differing only possibly in location.

  2. What does the test actually test without that assumption?

    Without the similar-shape assumption, the Mann–Whitney test is more generally a test of whether the two population distributions are the same.

    A significant result could be caused by differences in center, spread, shape, or some combination of these.

  3. Give an example where this interpretation would fail

    Suppose two populations have the same median, but one is much more spread out than the other. Then the Mann–Whitney test might detect a difference even though the medians are equal.

    In that case, interpreting the test as a “median test” would be misleading.
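
A related failure, driven by shape rather than spread, can be illustrated by simulation. In this sketch, which is an illustrative construction and not part of the exercise, the two populations are mirror images: X is a right-skewed exponential shifted to have median 0, and Y is its reflection, so both population medians are exactly 0. The seed, sample size and variable names are arbitrary choices.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
n <- 1000                         # arbitrary (large) sample size per group

x <- rexp(n) - log(2)             # right-skewed, population median 0
y <- -(rexp(n) - log(2))          # left-skewed mirror image, population median 0

# Theoretical P(X > Y) = (1 + 2*log(2)) / 4, about 0.597 rather than 0.5
p_theory <- (1 + 2 * log(2)) / 4

# Sample estimate of P(X > Y), and the Mann-Whitney test
p_hat <- mean(outer(x, y, ">"))
mw <- wilcox.test(x, y)

c(p_theory = p_theory, p_hat = p_hat, p_value = mw$p.value)
```

With samples this large the test should reject \(H_0\) decisively even though the population medians are equal, because what it responds to is \(P(X > Y)\) differing from 0.5.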

Exercise 9 (Small Data, Exact Thinking) You observe paired differences:

+, +, +, -, +

  1. Perform an exact sign test to determine whether the median difference between pairs is positive.

    There are 5 nonzero paired differences, of which 4 are positive.

    Let \(X\) be the number of positive signs. Under \(H_0\),

    \[ X \sim \text{Binomial}(n=5, p=0.5) \]

    The observed value is \(x=4\).

    Since we are testing whether the median difference is positive, then the p-value would be

    \[P(X \ge 4)\]

  2. Compute the exact p-value

    \(P(X \ge 4) = P(X=4) + P(X=5)\)

    # Pr(X > 3) = Pr(X >= 4)
    pval = pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)
    pval
    [1] 0.1875

    or equivalently:

    # Pr(X>=4) = 1 - Pr(X < 4)  = 1 - Pr(X <= 3)
    1 - pbinom(3, size = 5, prob = 0.5)                     
    [1] 0.1875

    Or directly from the binomial distribution:

    \[ P(X=k)= {n \choose k} p^k (1-p)^{n-k} \]

    we get:

    \[ \begin{align} P(X\geq 4) &= {5 \choose 4} (0.5)^4 (1-0.5)^{5-4} + {5 \choose 5} (0.5)^5 (1-0.5)^{5-5}\\ &= {5 \choose 4} (0.5)^4 (0.5)^{1} + {5 \choose 5} (0.5)^5\\ &= {5 \choose 4} (1/2)^5 + {5 \choose 5} (1/2)^5\\ &= \frac{{5 \choose 4} + {5 \choose 5}}{2^5} \\ &= \frac{5 + 1}{32} = \frac{6}{32} = 0.1875 \end{align} \]

  3. Would you reject at \(\alpha = 0.05\)?

    No. Since \(0.1875 > 0.05\), we fail to reject \(H_0\).

    There is not enough evidence at the 5% level to conclude that the median difference is positive.
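
With only five nonzero differences, the exact test has very few possible outcomes, so it is worth listing the right-tailed p-value for every possible count of positive signs. A minimal sketch; `k` and `p_right` are names introduced here.

```r
n <- 5
k <- 0:n                                  # possible numbers of positive signs

# P(X >= k) under X ~ Binomial(5, 0.5)
p_right <- pbinom(k - 1, size = n, prob = 0.5, lower.tail = FALSE)

data.frame(positives = k, p_value = p_right, reject = p_right < 0.05)
```

Only the outcome with all five differences positive (p-value \(1/32 \approx 0.031\)) would lead to rejection at \(\alpha = 0.05\); the observed 4 out of 5 cannot.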

Exercise 10 (Extension / Critical Thinking) Explain:

Why might a nonparametric test be less powerful than a parametric test when assumptions do hold?

When the assumptions of a parametric test are satisfied, the parametric test uses more of the information in the data.

For example, a \(t\)-test uses the actual numerical values of the observations, while many nonparametric tests replace the data with signs or ranks. Since signs and ranks contain less information than the original measurements, the nonparametric test may be less sensitive to real differences.

As a result, when parametric assumptions hold, the parametric test often has greater power to detect an effect.
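
A small simulation makes this concrete. The sketch below draws paired differences from a normal distribution, where the one-sample \(t\)-test's assumptions hold, and estimates how often the \(t\)-test, the signed-rank test, and the sign test reject at the 5% level. The sample size, effect size, number of replications and seed are arbitrary choices made for illustration.

```r
set.seed(2024)      # arbitrary seed
n_rep <- 2000       # number of simulated data sets
n     <- 20         # differences per data set
shift <- 0.5        # true mean (= median) difference, in SD units

# Simulate one data set and return the two-sided p-value of each test
one_run <- function() {
  d <- rnorm(n, mean = shift, sd = 1)
  c(t_test      = t.test(d)$p.value,
    signed_rank = wilcox.test(d)$p.value,
    sign_test   = binom.test(sum(d > 0), n)$p.value)
}

p_vals <- replicate(n_rep, one_run())   # 3 x n_rep matrix of p-values
power  <- rowMeans(p_vals < 0.05)       # estimated rejection rates
power
```

Under these settings the sign test should trail the \(t\)-test clearly, while the signed-rank test, which keeps the rank information, typically sits much closer to the \(t\)-test.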