Practice Problems

STAT 205: Introduction to Mathematical Statistics

Author: Dr. Irene Vrbik

Affiliation: University of British Columbia Okanagan

1 Statistical Inference

Confidence Intervals

Exercise 1.1 Which of the following best describes the purpose of a confidence interval?

  1. It provides a single best estimate of a parameter.
  2. It describes the variability of a sample statistic.
  3. It gives a range of plausible values for a population parameter.
  4. It proves a hypothesis to be true or false.
  Answer: option 3. It gives a range of plausible values for a population parameter.

Sample Mean Known \(\sigma\)

Sample Mean unknown \(\sigma\)

Exercise 1.2 As part of an investigation from Union Carbide Corporation, the following data represent naturally occurring amounts of sulfate SO4 (in parts per million) in well water. The data are from a random sample of 24 water wells in Northwest Texas.

  1. Do the plots in Figure 1.1 indicate a departure from the assumption of normality in the data?
Figure 1.1

No, there is no indication of a violation of the normality assumption. The boxplot appears roughly symmetric, and there is no systematic curvature or evident outliers in the normal QQ plot.

  1. Based on the following summary statistics, estimate the standard error for \(\bar{X}\) (assume that the 24 observations are stored in a vector called data)

    c(mean(data), median(data), sd(data), var(data))
    [1]   1412.9167   1272.5000    636.4592 405080.2536

\[ \sigma_{\bar X} = \text{StError}(\bar X) = \frac{\sigma}{\sqrt{n}} \]

Since we don’t have \(\sigma\) we will estimate this using:

\[ \hat \sigma_{\bar X} = \frac{s}{\sqrt{n}} = \frac{636.4592}{\sqrt{24}} = 129.9166902 \approx 129.917 \]
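As a quick check of the arithmetic, the estimated standard error can be reproduced in R from the reported summary statistics. This is only a sketch: the raw `data` vector is not reproduced here, so the values are copied from the printed output, and the names `s`, `n`, and `se_hat` are illustrative.

```r
# Summary values copied from the printed output (the raw data vector is not shown)
s <- 636.4592   # sample standard deviation
n <- 24         # number of wells sampled

# Estimated standard error of the sample mean: s / sqrt(n)
se_hat <- s / sqrt(n)
round(se_hat, 3)
```

With the raw observations available, `sd(data) / sqrt(length(data))` would give the same quantity.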

  1. Construct a 90% confidence interval for \(\mu\).

Using the \(t\)-table with \(n - 1 = 23\) degrees of freedom we want:

\[\begin{align*} \Pr(t_{23} > t^*) &= 0.05\\ \implies t^* &= 1.714 \end{align*}\]

The 90% confidence interval for \(\mu\) is computed using

\[\begin{align*} \bar x &\pm t^* \hat\sigma_{\bar X}\\ 1412.9167 &\pm 1.714 \times 129.917\\ 1412.9167 &\pm 222.677738\\ (1190.239 &,1635.594 ) \end{align*}\]

In words, we are 90% confident that the mean sulfate (SO4) concentration in Northwest Texas well water is between 1190.2 and 1635.6 parts per million.
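The same interval can be sketched in R using `qt()` in place of the \(t\)-table. The summary values are copied from the printed output, and the variable names are illustrative. Because `qt()` returns an unrounded critical value, the endpoints may differ slightly in the last digit from the table-based answer.

```r
# Summary values copied from the printed output
xbar <- 1412.9167   # sample mean
s    <- 636.4592    # sample standard deviation
n    <- 24          # sample size
conf_level <- 0.90

# Critical value: upper alpha/2 point of the t distribution with n - 1 df
alpha  <- 1 - conf_level
t_star <- qt(1 - alpha / 2, df = n - 1)

# Margin of error and confidence interval
margin <- t_star * s / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)
round(ci, 1)
```

If the raw observations were available in `data`, `t.test(data, conf.level = 0.90)$conf.int` should return the same interval.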

Sample Proportion

Exercise 1.3 In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come into contact with an Ebola patient.” This poll included responses from 1042 New York adults between October 26th and 28th, 2014.

  1. What is the point estimate in this case, and is it reasonable to use a normal distribution to model that point estimate?


The point estimate, based on a sample size of \(n = 1042\), is \(\hat{p} = 0.82\).

To check whether \(\hat{p}\) can be reasonably modeled using a normal distribution, we check:

  • Independence (poll is based on a simple random sample)
  • Success-failure condition:
    • \(n \hat{p} = 1042 \times 0.82 = 854.44 \geq 10\) (condition met)
    • \(n (1 - \hat{p}) = 1042 \times 0.18 = 187.56 \geq 10\) (condition met)

Since both conditions are met, we can assume the sampling distribution of \(\hat{p}\) is approximately normal.
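The success-failure check is easy to script. The sketch below assumes only the reported \(n\) and \(\hat{p}\); the threshold of 10 is the one used in the solution, and the variable names are illustrative.

```r
n     <- 1042   # poll sample size
p_hat <- 0.82   # reported sample proportion

# Expected counts of "successes" and "failures"
successes <- n * p_hat
failures  <- n * (1 - p_hat)
c(successes = successes, failures = failures)

# Both counts must be at least 10 for the normal approximation
conditions_met <- successes >= 10 && failures >= 10
conditions_met
```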

  1. Estimate the standard error of \(\hat{p}\).


The standard error is given by:

\[ SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Substituting values:

\[ SE_{\hat{p}} = \sqrt{\frac{0.82(1 - 0.82)}{1042}} = \sqrt{\frac{0.82 \times 0.18}{1042}} \approx 0.012 \]

  1. Construct a 95% confidence interval for \(p\).


Using \(SE = 0.012\), \(\hat{p} = 0.82\), and the critical value \(z^* = 1.96\) for a 95% confidence level:

\[\begin{align*} \text{Confidence Interval} &= \hat{p} \pm z^* \times SE \\ &= 0.82 \pm 1.96 \times 0.012 \\ &= 0.82 \pm 0.02352 &\text{ans option 1}\\ &= (0.797, 0.843)&\text{ans option 2} \end{align*}\]

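Either the point estimate plus/minus the margin of error or the confidence interval is an acceptable final answer. Note that the interval endpoints shown use the unrounded standard error (about 0.0119); with the rounded value 0.012 they would be 0.796 and 0.844.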
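A minimal R sketch of this normal-approximation (Wald) interval, keeping the standard error unrounded until the final step. Only the reported \(n\) and \(\hat{p}\) are assumed; the variable names are illustrative.

```r
n     <- 1042   # poll sample size
p_hat <- 0.82   # reported sample proportion

# Standard error of the sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a 95% confidence level
z_star <- qnorm(0.975)

# Margin of error and confidence interval
margin <- z_star * se
ci <- c(lower = p_hat - margin, upper = p_hat + margin)

round(c(se = se, margin = margin), 4)
round(ci, 3)
```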

  1. Interpret the confidence interval.


We are 95% confident that the proportion of New York adults in October 2014 who supported a quarantine for anyone who had come into contact with an Ebola patient was between 0.797 and 0.843.

  1. If we instead wanted an 85% confidence interval, the interval would get
    1. wider
    2. narrower
    3. stay the same
    4. There is not enough information to say


narrower

A lower confidence level corresponds to a smaller \(z^*\) value, which results in a narrower confidence interval.

Figure 1.2: (a) Left: the critical z* values for a 95% CI. (b) Right: the critical z* values for an 85% CI.
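The effect of the confidence level on the interval width can be checked numerically with `qnorm()`. This sketch reuses the \(n\) and \(\hat{p}\) from this exercise; the variable names are illustrative.

```r
conf_levels <- c(0.95, 0.85)

# Two-sided critical values: the upper (1 - level) / 2 point of N(0, 1)
z_star <- qnorm(1 - (1 - conf_levels) / 2)
names(z_star) <- c("95%", "85%")
round(z_star, 3)

# Corresponding margins of error for the poll in this exercise
se <- sqrt(0.82 * (1 - 0.82) / 1042)
margins <- z_star * se
round(margins, 4)
```

The 85% critical value is the smaller of the two, so its margin of error, and hence the interval, is narrower.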

Sample Variance

Hypothesis Testing

Significance Level

Unless otherwise specified, assume that the significance level \(\alpha\) is equal to 0.05.

General Concepts

Exercise 1.4 When conducting a hypothesis test, what does a small p-value indicate?

  1. Strong evidence against the null hypothesis.
  2. Strong evidence in favor of the null hypothesis.
  3. No evidence to support either hypothesis.
  4. The null hypothesis is proven false.
  Answer: option 1. Strong evidence against the null hypothesis.

One-Sample

Mean

Proportions

Exercise 1.5 A basketball analyst claims that professional players hit free throws at an average rate of 75%. However, some argue that elite players perform better. To investigate, a random sample of 20 elite players was selected. Their individual free throw percentages are given below:

free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
                    77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
                    77.3, 81.2, 76.6, 64.7, 65.8, 84.6)

Rounding to 1 decimal place to simplify hand calculations, their mean free throw percentage was 79%, with a standard deviation of 6.6%.

Test whether the mean free throw percentage exceeds the league average of 75 percent.

  1. Assumptions: Check the assumptions for the appropriate test

Since the population standard deviation is unknown and there are fewer than 30 observations, the appropriate test is the one-sample t-test (recall our flowchart for making this decision).

The one-sample t-test relies on the following assumptions:

  1. Independence of Observations: The sample of elite players must be randomly selected, and each player’s free throw percentage should be independent of the others. This is primarily determined by study design. If the data were collected randomly and each observation represents a different player, this assumption is likely satisfied.
  2. Normality of the Population Distribution: The test statistic follows a t-distribution only if the underlying population is approximately normal; when the sample size is large, the Central Limit Theorem (CLT) makes the test approximately valid even without normality. With only 20 observations we should check normality, and there are several ways to do this:
    1. Visual checks:
      1. Histogram (Look for a roughly symmetric, bell-shaped distribution)

        hist(free_throw_perc, breaks = 20)

        ⚠️ Asymmetry: The distribution appears somewhat left-skewed (two low values near 65% stretch the lower tail), particularly in the second histogram. We should be wary of this assumption.

      2. Q-Q Plot (Quantile-Quantile Plot): Compare sample quantiles to a theoretical normal distribution.

        I like to use the qqPlot() function from the car package since it provides guiding confidence bands, which make it easy to detect deviations.

        # Load necessary library
        library(car)
        # Generate Q-Q plot with confidence bands
        qqPlot(free_throw_perc, main = "Q-Q Plot with Confidence Bands")

        [1] 18 19

        The Q-Q plot with confidence bands shows that most points fall within the bands, indicating approximate normality. Some deviation in the lower tail (left side) suggests a slight departure from normality, but it isn’t extreme. The indices printed by qqPlot() (observations 18 and 19) identify the two most extreme points, that is, the points that deviate most from the normality assumption; they can be considered potential outliers.

      3. Summary:

        • The Q-Q plot suggests that the data is roughly normal, though there are slight deviations at the lower end.

        • ⚠️ Observations 18 and 19 are flagged as potential outliers, but they are not extreme enough to invalidate the normality assumption.

      4. Shapiro-Wilk Test: A statistical test that checks for normality. If the p-value is greater than 0.05, we do not reject normality.

        shapiro.test(free_throw_perc)
        
            Shapiro-Wilk normality test
        
        data:  free_throw_perc
        W = 0.94389, p-value = 0.2837
        • \(H_0\) : The data follows a normal distribution.

        • \(H_A\) : The data does not follow a normal distribution.

        • p-value: 0.2837 \(> \alpha = 0.05\) \(\implies\) fail to reject

        Conclusion: There is insufficient evidence against normality, and we can reasonably assume that the data is approximately normally distributed.

        Hence we carry on with the one-sample \(t\)-test…

  1. Define Hypotheses: State the null and alternative hypothesis:

We are testing whether the mean free throw percentage of elite players is significantly greater than the league average of 75%.

\[ \begin{align} H_0: &\mu = 75 && H_A: &\mu > 75 \end{align} \]

Hence this is a one-tailed (more specifically, an upper-tailed) t-test.

  1. Test Statistic: identify the sampling distribution of the test statistic (random variable) and compute the observed test statistic (no longer a random variable).

Test statistic:

\[ \dfrac{\bar X - \mu_0}{s/\sqrt{n}} \sim t_{\nu = n-1} \]

Hence our test statistic follows a Student-\(t\) distribution with \(\nu = n - 1 = 20 - 1 = 19\) degrees of freedom.

Observed Test Statistic:

\[ \begin{align} t_{obs} &= \dfrac{\bar x - \mu_0}{s/\sqrt{n}}\\ &= \dfrac{79 - 75}{6.6/\sqrt{20}} = 2.71 \end{align} \]
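The hand calculation can be reproduced in R from the rounded summary statistics. This is a sketch using the rounded values quoted above, not the raw data; the variable names are illustrative.

```r
# Rounded summary statistics from the exercise
xbar <- 79    # sample mean (%)
s    <- 6.6   # sample standard deviation (%)
n    <- 20    # number of players
mu0  <- 75    # hypothesized mean under H0

# Standard error, observed test statistic, and degrees of freedom
se    <- s / sqrt(n)
t_obs <- (xbar - mu0) / se
nu    <- n - 1

round(t_obs, 2)
```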

  1. Decision:

    1. Decide whether or not to reject \(H_0\) using the critical value method.

      Critical Value Approach: to use the critical value approach we first need to find the critical value on the null distribution. Hence we need to find \(t_\alpha^*\) such that

      \[ \Pr(t_{\nu = 19} > t_\alpha^*) = \alpha \]

      Note that we don’t need to divide \(\alpha\) by two since we are doing an upper-tailed test. From the \(t\)-table with 19 degrees of freedom and \(\alpha = 0.05\), \(t_\alpha^* = 1.729\). Since \(t_{obs} = 2.71 > 1.729\), the observed test statistic falls in the rejection region and we reject \(H_0\). The critical value is visualized in Figure 1.3.

      Figure 1.3: The critical value for an upper-tailed test.
    2. Decide whether or not to reject \(H_0\) using the \(p\)-value approach (this should always agree with the critical value method).

      p-value approach

      We need to find

      \[ \Pr(t_{\nu = 19} > 2.71) \]

      Using the \(t\)-tables:

      From the t-tables we can deduce that the \(p\)-value is

      \[ \begin{align} \Pr(t_{\nu = 19} > 2.861) < &\Pr(t_{\nu = 19} > 2.7103854) < \Pr(t_{\nu = 19} > 2.539)\\ \text{area in gray }< & p\text{-value} < \text{area in blue}\\ 0.005< & p\text{-value} < 0.01\\ \end{align} \]

      Using R:

      pt(2.7103854, df = 19, lower.tail = FALSE)
      [1] 0.006937378

      Since the \(p\)-value is less than the significance level of \(\alpha\) = 0.05 we reject the null hypothesis.

  2. State the appropriate conclusion. Is there evidence that elite players perform better than the league average?

    There is strong statistical evidence to suggest that the mean free throw percentage of elite players exceeds the league average of 75%.
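For comparison, the whole test can be run on the raw data with `t.test()`. This is a sketch rather than part of the original solution: because it uses the unrounded sample mean and standard deviation, the test statistic and \(p\)-value will differ slightly from the hand calculation based on 79 and 6.6. The names `res`, `t_crit`, `reject_cv`, and `reject_p` are illustrative.

```r
free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
                     77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
                     77.3, 81.2, 76.6, 64.7, 65.8, 84.6)
mu0   <- 75     # hypothesized league average
alpha <- 0.05   # significance level

# Upper-tailed one-sample t-test of H0: mu = 75 vs HA: mu > 75
res <- t.test(free_throw_perc, mu = mu0, alternative = "greater")

# Critical value for the upper-tailed test with n - 1 degrees of freedom
t_crit <- qt(1 - alpha, df = length(free_throw_perc) - 1)

c(t_obs = unname(res$statistic), t_crit = t_crit, p_value = res$p.value)

# The two decision rules should agree
reject_cv <- unname(res$statistic) > t_crit
reject_p  <- res$p.value < alpha
c(critical_value = reject_cv, p_value = reject_p)
```

Both decision rules lead to rejecting \(H_0\), in agreement with the hand calculation.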

Two-Sample

Mean

Paired

Pooled

Welch

Proportions