c(mean(data), median(data), sd(data), var(data))
[1] 1412.9167 1272.5000 636.4592 405080.2536
STAT 205: Introduction to Mathematical Statistics
Dr. Irene Vrbik
University of British Columbia Okanagan
Exercise 1.1 Which of the following best describes the purpose of a confidence interval?
Exercise 1.2 As part of an investigation by Union Carbide Corporation, the following data represent naturally occurring amounts of sulfate (\(\text{SO}_4\), in parts per million) in well water. The data are from a random sample of 24 water wells in Northwest Texas.
No, there is no indication of a violation of the normality assumption. The boxplot appears roughly symmetric, and there is no systematic curvature or evident outliers in the normal QQ plot.
Based on the following summary statistics, estimate the standard error for \(\bar{X}\) (assume that the 24 observations are stored in a vector called data).
\[ \sigma_{\bar X} = \text{StError}(\bar X) = \frac{\sigma}{\sqrt{n}} \]
Since we don’t have \(\sigma\) we will estimate this using:
\[ \hat \sigma_{\bar X} = \frac{s}{\sqrt{n}} = \frac{636.4592}{\sqrt{24}} = 129.9166902 \approx 129.917 \]
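The same estimate can be reproduced in R. This is a minimal sketch that works from the reported summary statistics, since the raw observations are not listed here; the names `s`, `n`, and `se_hat` are illustrative choices, not part of the exercise.

```r
# Summary statistics reported for the sulfate data
# (the raw `data` vector is not reproduced in this section)
s <- 636.4592   # sample standard deviation
n <- 24         # number of wells sampled

# Estimated standard error of the sample mean: s / sqrt(n)
se_hat <- s / sqrt(n)
round(se_hat, 3)
```

With the raw observations available, `sd(data) / sqrt(length(data))` should give the same value.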
Using the \(t\)-table with \(n - 1 = 23\) degrees of freedom we want:
\[\begin{align*} \Pr(t_{23} > t^*) &= 0.05\\ \implies t^* &= 1.714 \end{align*}\]
The 90% confidence interval for \(\mu\) is computed using
\[\begin{align*} \bar x &\pm t^* \hat \sigma_{\bar X}\\ 1412.9167 &\pm 1.714 \times 129.917\\ 1412.9167 &\pm 222.677738\\ (1190.239 &,1635.594 ) \end{align*}\]
In words, we are 90% confident that the mean sulfate (\(\text{SO}_4\)) concentration in well water is between 1190.2 and 1635.6 parts per million.
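As a check on the hand calculation, the interval can be sketched in R using `qt()` in place of the \(t\)-table. The variable names are illustrative, and the summary statistics are the ones reported for this exercise.

```r
# Reported summary statistics for the sulfate sample
x_bar <- 1412.9167   # sample mean
s     <- 636.4592    # sample standard deviation
n     <- 24          # sample size

# 90% confidence leaves 5% in each tail
conf_level <- 0.90
alpha      <- 1 - conf_level
t_star     <- qt(1 - alpha / 2, df = n - 1)   # critical value on 23 df

# Margin of error and interval endpoints
margin <- t_star * s / sqrt(n)
ci     <- c(lower = x_bar - margin, upper = x_bar + margin)
round(ci, 1)
```

Because `qt()` returns an unrounded critical value, the endpoints may differ slightly from the table-based hand calculation.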
Exercise 1.3 In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll found that 82% of New Yorkers favored a “mandatory 21-day quarantine for anyone who has come into contact with an Ebola patient.” This poll included responses from 1042 New York adults between October 26th and 28th, 2014.
The point estimate, based on a sample size of \(n = 1042\), is \(\hat{p} = 0.82\).
To check whether \(\hat{p}\) can be reasonably modeled using a normal distribution, we check:
Since both conditions are met, we can assume the sampling distribution of \(\hat{p}\) is approximately normal.
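The two conditions are not listed in this section. A common choice is the success-failure check, which requires at least 10 expected successes and 10 expected failures; that threshold is an assumption here (some texts use 5), and the names below are illustrative.

```r
# Poll results from the exercise
n     <- 1042
p_hat <- 0.82

# Assumed cutoff for the success-failure check (not stated in the exercise)
threshold <- 10

# Expected counts of "successes" and "failures" under the point estimate
successes <- n * p_hat
failures  <- n * (1 - p_hat)
c(successes = successes, failures = failures)

# Both counts must reach the threshold for the normal approximation
conditions_met <- successes >= threshold && failures >= threshold
conditions_met
```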
The standard error is given by:
\[ SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Substituting values:
\[
SE_{\hat{p}} = \sqrt{\frac{0.82(1 - 0.82)}{1042}} = \sqrt{\frac{0.82 \times 0.18}{1042}} \approx 0.012
\]
Using \(SE = 0.012\), \(\hat{p} = 0.82\), and the critical value \(z^* = 1.96\) for a 95% confidence level:
\[\begin{align*} \text{Confidence Interval} &= \hat{p} \pm z^* \times SE \\ &= 0.82 \pm 1.96 \times 0.012 \\ &= 0.82 \pm 0.02352 &\text{ans option 1}\\ &= (0.797, 0.843)&\text{ans option 2} \end{align*}\]
Either form is an acceptable final answer: the point estimate plus/minus the margin of error, or the confidence interval.
We are 95% confident that the proportion of New York adults in October 2014 who supported a quarantine for anyone who had come into contact with an Ebola patient was between 0.797 and 0.843.
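The interval can also be reproduced in R. The sketch below uses `qnorm()` for the critical value and carries the unrounded standard error through the calculation; variable names are illustrative.

```r
# Poll results from the exercise
n     <- 1042
p_hat <- 0.82

# Standard error of the sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for 95% confidence (2.5% in each tail)
z_star <- qnorm(0.975)

# Margin of error and interval endpoints
margin <- z_star * se
ci     <- c(lower = p_hat - margin, upper = p_hat + margin)
round(ci, 3)
```

Carrying the unrounded standard error gives a margin of about 0.0233, slightly smaller than the 0.02352 obtained from the rounded value 0.012. That is why the endpoints come out as 0.797 and 0.843.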
narrower
A lower confidence level corresponds to a smaller \(z^*\) value, which results in a narrower confidence interval.
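To see this numerically, the sketch below compares critical values and interval widths at three confidence levels. It reuses the rounded standard error of 0.012 from this exercise; the particular levels compared are an illustrative choice.

```r
# Rounded standard error from the poll example
se <- 0.012

# Confidence levels to compare (illustrative choice)
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values: upper tail area is (1 - level) / 2
z_star <- qnorm(1 - (1 - conf_levels) / 2)

# Full width of each interval is twice the margin of error
widths <- 2 * z_star * se

data.frame(conf_level = conf_levels,
           z_star     = round(z_star, 3),
           width      = round(widths, 4))
```

Both the critical value and the width increase with the confidence level, so lowering the level narrows the interval.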
Unless otherwise specified, assume that the significance level \(\alpha\) is equal to 0.05.
Exercise 1.4 When conducting a hypothesis test, what does a small p-value indicate?
Exercise 1.5 A basketball analyst claims that professional players hit free throws at an average rate of 75%. However, some argue that elite players perform better. To investigate, a random sample of 20 elite players was selected. Their individual free throw percentages are given below:
free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
77.3, 81.2, 76.6, 64.7, 65.8, 84.6)
Rounded to 1 decimal place to simplify hand calculations, their mean free throw percentage was 79%, with a standard deviation of 6.6%.
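These rounded summaries can be checked against the data. The sketch below repeats the vector from the exercise so that it runs on its own; `x_bar` and `s` are illustrative names.

```r
# Free throw percentages for the 20 sampled elite players
free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
                     77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
                     77.3, 81.2, 76.6, 64.7, 65.8, 84.6)

x_bar <- mean(free_throw_perc)   # sample mean
s     <- sd(free_throw_perc)     # sample standard deviation (n - 1 denominator)

# Round to one decimal place, as in the hand calculation
round(c(mean = x_bar, sd = s), 1)
```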
Test whether the mean free throw percentage exceeds the league average of 75 percent.
Since there are fewer than 30 observations, the appropriate test is the one-sample \(t\)-test. Recall our flowchart for making this decision:
The one-sample \(t\)-test relies on the following assumptions:
Histogram (Look for a roughly symmetric, bell-shaped distribution)
⚠️ Non-symmetry: The distribution appears somewhat left-skewed (the two lowest values, in the mid-60s, form a longer lower tail), particularly in the second histogram. We should be wary of this assumption.
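The histograms themselves are not reproduced here. A comparable one can be drawn with base R, as in the sketch below; the bin edges, title, and axis label are assumptions and need not match the original figures.

```r
# Free throw percentages for the 20 sampled elite players
free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
                     77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
                     77.3, 81.2, 76.6, 64.7, 65.8, 84.6)

# Histogram with 5-point-wide bins from 60 to 90 (assumed bin choice)
h <- hist(free_throw_perc,
          breaks = seq(60, 90, by = 5),
          main   = "Free throw percentage (n = 20)",
          xlab   = "Free throw %")

# Counts per bin, useful for judging symmetry without the picture
h$counts
```

Changing `breaks` can noticeably alter the apparent shape with only 20 observations, which is one reason to look at more than one histogram.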
Q-Q Plot (Quantile-Quantile Plot): Compare sample quantiles to a theoretical normal distribution.
I like to use the qqPlot() function from the car package since it provides guiding confidence bands, which make it easy to detect deviations.
# Load necessary library
library(car)
# Generate Q-Q plot with confidence bands
qqPlot(free_throw_perc, main = "Q-Q Plot with Confidence Bands")
[1] 18 19
The Q-Q plot with confidence bands shows that most points fall within the bands, indicating approximate normality. Some deviation in the lower tail (left side) suggests a slight departure from normality, but it isn’t extreme. The indices printed by qqPlot (observations 18 and 19) mark potential outliers. That is to say, these are the two most extreme points that deviate from the normality assumption.
Summary:
✅ The Q-Q plot suggests that the data is roughly normal, though there are slight deviations at the lower end.
⚠️ Observations 18 and 19 are flagged as potential outliers, but they are not extreme enough to invalidate the normality assumption.
Shapiro-Wilk Test: A statistical test that checks for normality. If the p-value is greater than 0.05, we do not reject normality.
Shapiro-Wilk normality test
data: free_throw_perc
W = 0.94389, p-value = 0.2837
\(H_0\) : The data follows a normal distribution.
\(H_A\) : The data does not follow a normal distribution.
p-value: 0.2837 \(> \alpha = 0.05\) \(\implies\) fail to reject \(H_0\)
Conclusion: There is insufficient evidence against normality, and we can reasonably assume that the data is approximately normally distributed.
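Output of this form is what `shapiro.test()` from base R prints. The call is not shown in this section, so the sketch below is an assumed reconstruction that reruns the test on the exercise data and applies the decision rule; `sw` and `reject_normality` are illustrative names.

```r
# Free throw percentages for the 20 sampled elite players
free_throw_perc <- c(84.9, 75.2, 79.8, 81.2, 80, 77.5, 85.6,
                     77.5, 88.1, 77.7, 84.5, 89.4, 71.1, 76.6,
                     77.3, 81.2, 76.6, 64.7, 65.8, 84.6)

alpha <- 0.05

# Shapiro-Wilk test: H0 is that the data come from a normal distribution
sw <- shapiro.test(free_throw_perc)
sw

# Reject normality only if the p-value falls below alpha
reject_normality <- sw$p.value < alpha
reject_normality
```

Failing to reject does not prove normality. With only 20 observations the test has limited power, so it is best read alongside the histogram and Q-Q plot.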
Hence we carry on with the one-sample \(t\)-test…
We are testing whether the mean free throw percentage of elite players is significantly greater than the league average of 75%.
\[ \begin{align} H_0: &\mu = 75 && H_A: &\mu > 75 \end{align} \]
Hence this is a one-tailed (more specifically, an upper-tailed) \(t\)-test.
Test statistic:
\[ \dfrac{\bar X - \mu_0}{s/\sqrt{n}} \sim t_{\nu = n-1} \]
Hence our test statistic follows a Student-\(t\) distribution with \(\nu = n - 1 = 20 - 1 = 19\) degrees of freedom.
Observed Test Statistic:
\[
\begin{align}
t_{obs} &= \dfrac{\bar x - \mu_0}{s/\sqrt{n}}\\
&= \dfrac{79 - 75}{6.6/\sqrt{20}} =
2.71
\end{align}
\]
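The observed statistic can be reproduced in R from the rounded summary values used in the hand calculation. This is a minimal sketch with illustrative variable names.

```r
# Rounded summary statistics used in the hand calculation
x_bar <- 79    # sample mean (%)
s     <- 6.6   # sample standard deviation (%)
n     <- 20    # sample size
mu_0  <- 75    # hypothesized league average (%)

# One-sample t statistic: (x_bar - mu_0) / (s / sqrt(n))
t_obs <- (x_bar - mu_0) / (s / sqrt(n))
round(t_obs, 2)
```

Using the unrounded mean and standard deviation of the raw data would give a slightly different statistic, since the value above is built from inputs rounded to one decimal place.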
Decision:
Decide whether or not to reject \(H_0\) using the critical value method.
Critical Value Approach: to use the critical value approach we first need to find the critical values on the null distribution. Hence we need to find \(t^*\) such that
\[ \Pr(t_{\nu = 19} > t_\alpha^*) = \alpha \]
Note that we don’t need to divide \(\alpha\) by two since we are doing the upper-tailed test. The critical value is visualized in Figure 1.3.
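The critical value and the resulting decision can be sketched in R, using `qt()` instead of the \(t\)-table. The variable names are illustrative, and the test statistic is recomputed from the rounded summary values so that the block runs on its own.

```r
# Rounded summary statistics and hypothesized mean
x_bar <- 79
s     <- 6.6
n     <- 20
mu_0  <- 75
alpha <- 0.05

# Observed test statistic
t_obs <- (x_bar - mu_0) / (s / sqrt(n))

# Upper-tailed critical value: the point with area alpha to its right on 19 df
t_crit <- qt(1 - alpha, df = n - 1)
round(t_crit, 3)

# Reject H0 when the observed statistic exceeds the critical value
reject_h0 <- t_obs > t_crit
reject_h0
```

The critical value is roughly 1.729, and the observed statistic of about 2.71 lies beyond it in the rejection region, so \(H_0\) is rejected at the 5% level.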
Decide whether or not to reject \(H_0\) using the \(p\)-value approach (this should always agree with the critical value method).
p-value approach
We need to find
\[ \Pr(t_{\nu = 19} > 2.71) \]
Using the \(t\)-tables:
From the t-tables we can deduce that the \(p\)-value is
\[ \begin{align} \Pr(t_{\nu = 19} > 2.861) < &\Pr(t_{\nu = 19} > 2.7103854) < \Pr(t_{\nu = 19} > 2.539)\\ \text{area in gray }< & p\text{-value} < \text{area in blue}\\ 0.005< & p\text{-value} < 0.01\\ \end{align} \]
Using R:
[1] 0.006937378
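The call that produced this output is not shown. One way to get an upper-tail probability of this kind is `pt()` with `lower.tail = FALSE`, as sketched below; it assumes the rounded summary statistics from the hand calculation, and the variable names are illustrative.

```r
# Rounded summary statistics and hypothesized mean
x_bar <- 79
s     <- 6.6
n     <- 20
mu_0  <- 75

# Observed test statistic
t_obs <- (x_bar - mu_0) / (s / sqrt(n))

# Upper-tail p-value: Pr(t_19 > t_obs)
p_value <- pt(t_obs, df = n - 1, lower.tail = FALSE)
p_value
```

The result should fall between 0.005 and 0.01, in line with the bounds read from the \(t\)-table. Running `t.test(free_throw_perc, mu = 75, alternative = "greater")` on the raw data is an alternative, though it uses unrounded summaries and so gives a slightly different value.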
Since the \(p\)-value is less than the significance level \(\alpha = 0.05\), we reject the null hypothesis.
State the appropriate conclusion. Is there evidence that elite players perform better than the league average?
There is strong statistical evidence to suggest that the mean free throw percentage of elite players exceeds the league average of 75%.