STAT 205 Learning Outcomes

Author

Dr. Irene Vrbik

Legend

🔴 Very Important
🟠 Important
🟡 Kind of Important
🟢 Not Very Important

Anything with a ~~strikeout~~ will not be tested on the final exam.

1: Introduction to Statistics

Descriptive statistics

🟠 LO: able to accurately calculate and interpret key descriptive statistics, including:
- measures of central tendency (mean, median, mode) and
- measures of variability (range, variance, standard deviation).
- measures of location of data (quartiles and percentiles)
🔴 LO: proficient in creating and analyzing graphical representations of data, such as
- histograms, box plots, and scatter plots, to summarize and describe data distributions effectively.

Principles of random sampling

🟠 LO: Understand the principles of probability sampling and how they form the basis for making statistical inferences from a sample to a population.
🔴 LO: Understand difference between sample (of size $n$) and a population (either “infinite” or size $N$)
🔴 LO: Understand difference between sample statistics (e.g. $\bar x$, $s$, $\hat p$) and a population parameters (e.g. $\mu$, $\sigma$, $p$)
🟢 LO: Distinguish between different sampling designs (e.g. simple random, stratified, cluster)
🟡 LO: Identify a target population

Types of data

🔴 LO: Identify variables as numerical (continuous or discrete) and categorical (nominal or ordinal)
🟠 LO: Differentiate between variables that are associated (positive or negative) and those that are independent.
🟠 LO: Detect outliers in various types of data set using graphical methods (box plots, scatter plots)
🟢 LO: Understand the difference between observational studies and experiments

2: Summarizing Data

Storing Data

🟡 LO: Describe and identify the basic data types in R (vectors, factors (ordered or unordered), lists, and data frames (with observations typically in rows, and variables in columns)
🟡 LO: Data indexing: index vectors using [] and columns in a data frame using $ ; see note¹
🟢 LO: Understand the character data type (character or string) and logical data type in R (TRUE or FALSE):
🟡 LO: Apply coercion in R to coerce data types in R to the appropriate data type (e.g. the as.factor() to convert a numeric vector to a factor.
🟢 LO: Displaying data (using str(), head(), or View()
🟠 LO: construct and interpret contingency tables (along with marginals) to organize and two categorical variables.
🟡 LO: Define a robust statistic (e.g. median, IQR) as measures that are not heavily affected by skewness and extreme outliers, and determine when they are more appropriate measured of center and spread compared to other similar statistics.

Plotting Data

🟡 LO: create simple plots using functions like plot(), hist(), boxplot(), etc
🟢 LO: create advanced plots like stacked/side-by-side bar plots, side-by-side boxplots
🟢 LO: customize plot appearance by modifying attributes such as colors, labels, titles, axis limits, line types, etc.
🟢 LO: explore themes and packages (like ggplot2) for more advanced and polished visualizations.
🟡 LO: recognize and describe the common shapes of data distributions, including normal, skewed (right or left), uniform, and bimodal distributions.
🟠 LO: visually identify and interpret and estimate key statistical metrics such as the mean, mode, and interquartile range (IQR) from graphical representations including histograms and box plots.

3: Sampling Distributions

🔴 LO: Explain the concept of a sampling distribution and its importance in statistical inference.
🔴 LO: Define the Central Limit Theorem and its significance in statistical theory.
🔴 LO: Understand the conditions under which the CLT applies and its implications for sample means and proportions
🔴 LO: Define standard error as the standard deviation of a sampling distribution, representing the variability of sample statistics around the population parameter.
🔴 LO: Explain the conceptual difference between standard error and standard deviation, emphasizing their respective roles in describing variability in populations and samples.
🟢 LO: Derive the sampling distributions for the sample mean, proportion, and variance
🟠 LO: Use the sampling distribution of sample statistic to create point estimates and confidence intervals.
🟠 LO: Apply knowledge of sampling distributions to the practical applications of hypothesis testing and constructing confidence intervals.

4: Getting Started with Quarto

🔴 LO: Understand the advantages of using Quarto for reproducible document generation
🔴 LO: Create .qmd documents using RStudio and demonstrate the ability to integrate:
- executable code chunks, in-line code,
- embedded figures and images
- basic Markdown syntax (e.g. tables, headers, bold, italics, lists)
🟡 LO: describe the key features of the YAML
🟢 LO: identify and use common keyboard shortcuts
🟡 LO: navigate the RStudio interface proficiently and explain its major components, including the script editor, console, environment pane, and visualization tools.
🟡 LO: understand and customize code chunk options (eg. echo option controls whether the code within a code chunk is displayed in the output document.)
🟢 LO: LaTeX equations
🟠 LO: explain the importance of setting a seed in random number generation for reproducibility
🟡 LO: demonstrate the ability to use the set.seed()

5: Likelihood and Parameter Estimation

🔴 LO: Define, calculate, and identify point estimators
🔴 LO: Define, construct, and interpret confidence intervals
🟠 LO: define and describe the Method of Moments for parameter estimation.
🟠 LO: use sample moments (such as sample means, variances, and higher moments) to estimate the parameters of a specified distribution.
🟢 LO: derive moment equations (moments will be provided on an exam if needed)
🟠 LO: Define likelihood (and log-likelihood) in the context of statistical inference.
🟠 LO: interpret the likelihood function as a tool for statistical inference and its difference from probability.
🔴 LO: Derive a maximum likelihood estimator (MLE)
🟡 LO: Define and explain common considerations in statistical estimation, including bias, consistency, efficiency, sufficiency and asymptotic normality.

6/7: Confidence Intervals for Means, Proportions, and Variance

🟠 LO: Understand what a pivotal quantity is and explain its role in statistical inference,
🔴 LO: Construct a confidence interval given a particular confidence level; either in (a, b) form or in the form point estimate $\pm$ margin of error.
🔴 LO: Identify and compute a margin of error
🔴 LO: Calculate the standard error for sample statistics using appropriate formulas.
🔴 LO: Interpret a given confidence interval as the plausible range of values for a population parameter (e.g. $\mu$, $p$, or $\sigma^2$) in the context of probability and uncertainty “We are XX% confident that the true population parameter is in this interval”, where XX% is the desired confidence level
🟠 LO: Understand and describe why we use “confidence” instead of the term “probability”
🟠 LO: Determine appropriate sample sizes based on desired confidence levels of precision (margin of error).
🟠 LO: Check the assumptions required for using this method (e.g. success-failure check: $np \geq 10$ and $n(1-p) \geq 10$)
🟠 LO: Identify factors that influence the margin of error (effectively the width) of a confidence intervals, including sample size, confidence level, and population variability.
🟠 LO: Calculate the required minimum sample size for a given margin of error at a given confidence level
🟢 LO: Derive confidence intervals from the knowledge of sampling distributions and probability statements.
🟠 LO: Understand the relationship between confidence intervals and hypothesis testing.
🟠 LO: Use confidence intervals to make inferences about population parameters
🔴 LO: Interpret QQ plots to assess the normality assumption
🔴 LO: Extract critical values from Z-tables, t-tables and chi-square tables.
🔴 LO: Approximate probabilities from Z-tables, t-tables and chi-square tables.
🔴 LO: Demonstrate the ability to use functions like qnorm(), qt(), and qchisq() to find critical values

Non-parameter confidence intervals

🟠 LO: Explain what nonparametric tests are and identify situations where they are more appropriate than parametric tests.
LO: ~~Construct a non-parameter confidence interval for the median~~
LO: ~~Construct a non-parameter confidence interval for the variance using resampling (bootstrap) methods~~

8: Sampling Distribution Theory

🟠 LO: Define sampling distribution and its significance in inferential statistics.
🟢 ~~LO: Develop an understanding of how the CDF, PDF and MGF can be used to derive relationships between different types of random variables~~
🟢 ~~LO: Gain hands-on experience in proving the distribution characteristics of pivotal quantities~~

9: Sampling from Finite Populations

🟡 LO: Define finite population sampling and its significance in survey research and applied statistics.
🟠 LO: Define a simple random sample (SRS) from a population
🔴 LO: Explain the importance of a simple random sampling (SRS) for statistical inference
🟠 LO: Describe the difference between a simple random sample without replacement (SRS) vs simple random sample with replacement (SRSWR)
🟡 LO: Explain the differences between sampling from finite populations and sampling from infinite populations.
🟡 LO: Understand how finite population characteristics influence sampling design, estimation, and inference.
🟠 LO: Define the finite population correction factor and its role in adjusting variance estimates for SRSWOR samples from finite populations
🟠 LO: ~~apply the finite population correction factor to correct standard errors and confidence intervals for finite populations.~~

10: Properties of Parameter Estimators

🔴 LO: Define an unbiased estimator
🔴 LO: Determine whether a given estimator is unbiased
🟠 LO: Define and interpret² the Mean Squared Error (MSE);
🟡 LO: ~~Understanding the decomposition of MSE and its significance in evaluating the performance of estimators.~~
🟠 LO: Calculate and compare the relative efficiency of two estimators
🟠 LO: understanding how to determine which estimator provides more precise estimates under given conditions.
🟡 LO: Understand the concept of consistency in estimators (know that MLE’s are consistent under certain conditions)
🟡 LO: Define the Minimum Variance Unbiased Estimator (MVUE) within the context of statistical estimation.
🟡 LO: ~~Apply CRLB to identify if a unbiased estimator achieves is MVUE~~
🟠 LO: ~~Identify and derive MVUEs using the Cramér-Rao Lower bound (CRLB) theorem and understand the conditions under which an unbiased estimator achieves minimum variance.~~
🟠 LO: ~~Use the Cramér-Rao Lower Bound to find the lower bound of the variance of unbiased estimators~~
🟢 LO: ~~Define Fisher’s information~~
🟢 LO: ~~Explain the significance of Fisher’s information in statistical inference and parameter estimation.~~

11/12/13: Hypothesis Testing for one-sample

Steps involved in hypothesis testing:

1. State the Hypotheses:

Null Hypothesis ($H_0$): Represents the default assumption, often stating that there is no effect or no difference (i.e. no-change or status-quo).
Alternative Hypothesis($H_A$): Represents the claim or research question you are testing for, stating that there is an effect or a difference.
Always construct hypotheses about population parameters (e.g. population mean $\mu$) and not the sample statistics (e.g. sample mean, $\bar x$)

2. Identify/choose the Significance Level ($\alpha$)

This is the probability of rejecting the null hypothesis when it is actually true.
Common choices for $\alpha$ include 5% (most common) or 1%, 2%, or 10%
Unless otherwise specified, you can assume $\alpha = 0.05$

3. Select a Statistical Test

Choose an appropriate statistical test based on the type of data and the hypothesis being tested. Common tests include t-tests, z-tests, chi-square tests, ANOVA, etc.

4. Collect Data and Calculate Test Statistic

Collect a sample of data from the population of interest.
Calculate the appropriate test statistic based on the chosen test and the sample data.

5. Determine the Distribution of the Test Statistic

Identify the null distribution, i.e. the distribution of the test statistic “under the null hypothesis” (i.e. assume $H_0$ is true).

6. Calculate Critical Value or p-value

Based on the assumed sampling distribution of the test statistic, calculate either the critical value or the $p$-value.
- You should be able to do this using the appropriate table as well as in R.

7. Make a Decision

For the critical value approach: if the test statistic falls with the rejection region, reject $H_0$ in favour of $H_A$, otherwise, fail to reject $H_0$.
For the $p$-value approach: if the $p$-value is less than the significance level ($\alpha$), reject $H_0$ in favour of $H_A$, otherwise, fail to reject $H_0$.
Note that we can never “accept” the null hypothesis since the hypothesis testing framework does not allow us to confirm it.

8. Draw Conclusions

Based on the decision in step 7, draw conclusions about the population parameter(s) being tested.
Note that your interpretation must always be in context of the data – mention what the population is and what the parameter is (e.g. mean or proportion) and be sure to make reference to your alternative hypothesis.

🔴 LO: Explain the concepts of null ($H_0)$and alternative hypotheses ($H_A$), test statistics, significance levels, and p-values.
🔴 LO: Formulate the appropriate null hypothesis (either in symbols or in words) and and alternative hypotheses given a word problem (determine the appropriate direction of the alternative: it is upper-tailed, lower-tailed, or a two-sided hypothesis test)
🟠 LO: Define Type I and Type II errors. Note that the conclusion of a hypothesis test might be erroneous regardless of the decision we make.
- Type 1 error is the probability of rejecting the null hypothesis when the null hypothesis is actually true.
- Type 2 error is the probability of failing to reject the null hypothesis when the alternative hypothesis is actually true.
🔴 LO: Define the significance level (alpha) and explain its role in hypothesis testing
🔴 LO: Identify the appropriate test statistic for a given problem
🟠 LO: Define the null distribution and explain its role in hypothesis testing.
🔴 LO: Calculate and identify an observed test statistic given some data
🟠 LO: Understand how sample size impacts the SE of point estimators
🔴 LO: Define sample statistic as a point estimate for a population parameter, (e.g. the sample mean $\bar x$ is used to estimate the population mean $\mu$) and note that point estimate and sample statistic are synonymous.
🔴 LO: Explain the theoretical foundations of z-tests and t-tests, including their assumptions, conditions, and applicability
🔴 LO: Explain why the t-distribution helps make up for the additional variability introduced by using s (sample standard deviation) in calculation of the standard error, in place of σ (population standard deviation).
🔴 LO: Identify and check the assumptions and conditions necessary for valid z-tests and t-tests
- Note: the independence of observations in a sample is provided by a simple random sampling design.
- 🔴 LO: Use graphical methods (e.g., Q-Q plots, histograms) to assess the normality assumption
🔴 LO: know when to use a $t$-test vs. $z$-test (refer to flowchart)
🔴 LO: Describe the different characteristics of the standard normal (i.e. $Z$-distribution) as compared to the Student $t$ distribution.
- e.g. the $t$-distribution has a single parameter, degrees of freedom, and as the degrees of freedom increases this distribution approaches the normal distribution.
🔴 LO: Perform one-sample z-tests and t-tests including assumptions checking, and steps outlined above.
🟠 LO: Understand the connection between 100*(1- $\alpha$)% CI and two-sided hypothesis tests.
🔴 LO: Define a $p$-value as the conditional probability of obtaining a sample statistic at least as extreme as the one observed given that the null hypothesis is true, i.e. $\Pr(\text{observed or more extreme sample statistic | } H_0 \text{ true})$
🟠 LO: Visualize on a null distribution the rejection region (values of the test statistic for which $H_0$ will be rejected) and/or $p$-values (area under the curves)

15/16: Inference for Two Samples

Many of the Learning Outcomes from 11/12/13: Hypothesis Testing for one-sample will carry over to this unit. In addition we have:

🟠 LO: Explain the objectives and applications of inference for two samples in research and data analysis.
🔴 LO: Differentiate between independent and dependent populations.
🔴 LO: Differentiate between the three types of $t$-tests: paired $t$-tests, Welch procedure and Pooled $t$-tests (refer to flowchart)
🔴 LO: Perform two-sample $t$-tests, including assumptions checking, and steps outlined above.
🔴 LO: Perform the two-sample $t$-tests in R using the t.test() function
- Specify the appropriate arguments: alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95 .
- Specify the data either using x, y, or formulas with data = ...
🔴 LO: Interpret the output of t.test() function including test statistics, degrees of freedom, $p$-values, and confidence intervals.
🔴 LO: Identify the degrees of freedom associated with different statistical tests.
🟠 LO: Define pooled variance and explain the conceptual basis in the context of two-sample hypothesis testing

17: ANOVA

🟡 LO: Explain the conceptual basis of ANOVA, including the partitioning of variance and the F-test for assessing group differences.
🔴 LO: Define Analysis of Variance (ANOVA) as a statistical method used to compare means across multiple groups or treatments.
🔴 LO: State the null and alternative hypothesis for a one-way anova (either in words or symbols)
🔴 LO: State and check the assumptions underlying the model.
- for checking normality we use visual aids (e.g. QQ plot)
- for equal variance we use the rule of thumb $0.5 < \frac{s_a}{s_b} < 2$ where $s_a$ and $s_b$ are the smallest and largest sample standard deviation, respectively.
🔴 LO: interpret side-by-side boxplots to assess it the equal variance assumption is reasonable
🔴 LO: Identify and calculate the appropriate degrees of freedoms in an one-way ANOVA
🟢 LO: Explain the conceptual basis of degrees of freedom and its importance in hypothesis testing and parameter estimation.
🟡 LO: Explain the meaning and importance of balanced study designs
🔴 LO: Identify and describe the main components of an ANOVA table, including the sources of variation, degrees of freedom, sum of squares, mean squares, and the F-statistic.
🔴 LO: Calculate missing values in an ANOVA table based on the relationships between cells
🟢 LO: Interpret the sources of variation presented in the ANOVA table, such as between-groups variation, within-groups variation, and total variation.
🟠 LO: Perform a one-way ANOVA in R using the aov() function
- know the formula notation y~x with data specification
🔴 LO: Interpret the results of hypothesis testing (either in an ANOVA table or R aov() output) based on the F-statistic and $p$-value including decisions to reject or fail to reject the null hypothesis.
🟡 LO: Describe why calculation of the p-value for ANOVA is always “one sided”.
🟠 LO: Explain the purpose and rationale for conducting post-hoc tests to identify specific group differences.
🟠 LO: Identify and applying appropriate post-hoc tests when necessary (e.g. pairwise pooled $t$-tests to discover which group means are different after a significant ANOVA result using the Bonferroni correction)
🟠 LO: Explain why multiple comparison procedures like Bonferonni correction are necessary to control for Type I errors in hypothesis testing.

18: Linear Regression and Correlation

🟠 LO: Define and identify the explanatory variable (aka independent variable or predictor), and the response variable (aka the dependent/response variable).
🟠 LO: Use scatter plots (explanatory variable ($x$) on the x-axis and the response variable ($y$) on the y-axis) to describe the strength and direction (positive or negative) of the linear relationship
🟠 LO: Define simple linear regression (SLR) as a statistical method used to model the relationship between a single independent variable (predictor) and a continuous dependent variable (outcome or response variable).
🔴 LO: State and check the assumptions for using SLR, i.e. linearity, nearly normal residuals, constant variability (homoscedasticity).
🔴 LO: Identify and interpret the parameters of a SLR model ($\beta_0$ and $\beta_1$)
- Interpret the slope as
  - “For each unit increase in x, we would expect y to increase/decrease on average by $\mid \hat \beta_1 \mid$ units”
  - Note that whether the response variable increases or decreases is determined by the sign of $\hat \beta_1$
- Interpret the intercept as
  - “When x = 0, we would expect y to equal, on average, $\hat \beta_0$”
  - Explain why the intercept often does not have any practical significance
🟠 LO: Plot the and fitted SLR line and understand the graphical representation of the slope ($\hat \beta_1$) and intercept ($\hat \beta_0$)
🟡 LO: Define and identify residuals $e_i$ as the difference between the observed $y$ and predicted $\hat y$ values of the response variable.
🟡 LO: Explain how parameters are estimated using ordinary least squares (OLS), i.e. the OLS estimators are those that minimize the sum of the squared residuals
LO: ~~Derive the OLS estimators~~
🔴 LO: Make predictions based on the fitted line using $\hat y = \hat \beta_0 + \hat \beta_1 x$
🟡 LO: Interpret the values of the Pearson’s correlation coefficient ($r$)
🟡 LO: Describe the relationship between Pearson’s correlation coefficient ($r$) and the coefficient of determination, denoted $R^2$ (AKA R-squared value) in SLR:
- This value is calculated as the square of the correlation coefficient, and is between 0 and 1, inclusive.
- An R-squared value of 1 indicates a perfect fit of the regression model to the observed data
🟠 LO: Use residual plots to identify potential outliers (any unusual observations that stand out)
🟠 LO: Assess the Residual vs. Fitted plots to check the assumptions
🟠 LO: Assess the QQ Residuals plot to check the normality assumptions
🟡 LO: Define extrapolation and distinguish it from interpolation (predicting for values of $x$ that are in the range of the observed data).
🟡 ~~LO: Perform hypothesis tests on the slope~~³ ~~coefficient in a simple linear regression model.~~
🔴 LO: Fit a linear model in R using the lm(formula, data) function.
🔴 LO: Identify (from the summary output of an lm() model in R) the parameter estimates (and therefore the fitted OLS line)
🔴 LO: Interpret the summary output of an lm() model in R to determine if a significant linear relationship exists (by interpreting the $p$-value associated with the slope parameter)
🔴 LO: Identify (from the summary output of an lm() model in R) the $R^2$ value and interpret its meaning as the percentage of the variability in the response variable explained by the explanatory variable.

19: Chi-square tests

🔴 LO: construct and interpret contingency tables (along with marginals) to organize and two categorical variables.
🔴 LO: Use side-by-side box plots for assessing the relationship between a numerical and a categorical variable
🟠 LO: Determine the size of a contingency table ( $r \times c$ )
🔴 LO: Conduct goodness-of-fit tests to determine how well observed data fit a specific distribution.
🔴 LO: perform the two types of chi-square tests:
1. tests for one-way tables (AKA goodness of fit tests) and
2. test for two-way tables (AKA tests for independence)
following the steps outlined above. Note you should be able to ….
- calculated an expected cell count
- compute the chi-squared test statistic by hand
- find critical values/approximate $p$-values for this tests using the Chi-square table
🟠 LO: State and check the assumptions for these tests
🟠 LO: Use the chisq.test() to conduct the appropriate chi-squared tests
🟠 LO: Interpret the output of chisq.test()

Footnotes

You can alternative attach() your data frame which makes the columns of the data frame available as if they were named vector objects in R.↩︎
Understand how larger MSE values indicate greater discrepancy between estimated and true values, while smaller MSE values indicate better performance.↩︎
Note that a hypothesis test for the intercept is often irrelevant since it’s usually out of the range of the data, and hence it is usually an extrapolation.↩︎