Midterm 1 review

Author

Dr. Irene Vrbik

Midterm format:

Coverage: Lectures 1–10 (inclusive) and Assignments 1 and 2

Location: In class on Wed Feb 28

Duration: Entire class time (80 minutes)

The midterm will be a mix of

  • Multiple choice,
  • True/False (with supporting statements required),
  • Short Answer and
  • Long Answer

Some practice problems have been uploaded to Canvas to provide examples of the structure and difficulty level.

Material provided:

  1. Standard normal table
  2. Student \(t\)-distribution table
  3. Chi-squared distribution table

Other permitted material:

  • Non-graphing, non-programmable calculator (e.g. Casio FX-991, Sharp EL-W516, Sharp EL-520, Texas Instruments TI-36X Pro, CASIO fx-991ES PLUS C*). If you are unsure whether your calculator is permitted, please ask me ASAP.

  • One standard page cheat sheet (front and back). You will be required to hand this in at the end of the test.

Review

Describing Data

  • What is statistics (descriptive vs inferential)

    • descriptive: involves describing, summarizing, and displaying data
    • inferential: involves using a sample of data to draw conclusions about a population
  • Summary statistics:

    • Measures of central tendency: mean, median (more robust¹ than the mean), mode
    • Measures of dispersion: variance, standard deviation (same units as the data), IQR (more robust than sd/var)
  • Summary plots: boxplot, histogram, scatterplots, dotplots, …

    • know your functions: hist, ggplot, labels, colors, pch, legends, …
    • Note: you will not be asked to produce code from scratch, but you may be asked about the functions and their arguments.
    library(openintro) # load the library
    table(loan50$interest_rate)
    
     5.31  5.32  6.08  6.71  7.34  7.35  7.96  7.97  9.43  9.44  9.92  9.93 10.42 
        2     1     3     2     1     2     3     1     2     3     4     2     2 
     10.9 10.91 11.98 12.62 14.08 15.04 16.02 17.09 18.06 18.45 19.42    20 21.45 
        2     3     1     3     1     1     1     3     1     1     1     1     1 
    24.85  26.3 
        1     1 
    # Define a function to calculate the mode (base R has no built-in one)
    Mode <- function(x) {
      ux <- unique(x)                          # the unique values
      ux[which.max(tabulate(match(x, ux)))]    # the most frequent unique value
    }
    
    mean_value   <- mean(loan50$interest_rate)
    median_value <- median(loan50$interest_rate)
    mode_value   <- Mode(loan50$interest_rate)
       Mean   Median   Mode
    11.5672     9.93   9.92
# Create a dot plot with mean, median, and mode using ggplot2

library(ggplot2)

ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() +
  labs(x = "Interest Rate (in %)") +
  # annotate() adds one-off layers; a length-1 value inside aes() would be
  # recycled across all 50 rows and trigger a ggplot2 warning
  annotate("point", x = mean_value, y = 0, color = "red", size = 5, shape = 17) +
  annotate("point", x = median_value, y = 0, color = "green", size = 5, shape = 15) +
  geom_vline(xintercept = mode_value, color = "blue", linetype = "dashed") +
  annotate("text", x = mode_value, y = 0.6, label = "Mode", vjust = 1.5, color = "blue") +
  coord_cartesian(ylim = c(0, 0.25))  # adjust the y-axis limits here

A dotplot for the interest rates in the loan50 dataset. The mean is plotted as a red triangle, and the median is plotted as a green square. The mode (the value that appears most often in the data) is 9.92 and is displayed as a blue dashed line.
boxplot(loan50$interest_rate)

A boxplot for the interest rates in the loan50 dataset. The box in the middle of the plot represents the interquartile range (IQR), which spans from the first quartile (Q1) to the third quartile (Q3). The length of the box shows the spread of the middle 50% of the data. The horizontal line inside the box represents the median (Q2), which is the middle value of the dataset when it is sorted in ascending order. Whiskers extend from the edges of the box to the minimum and maximum values within the range [Q1 - 1.5×IQR, Q3 + 1.5×IQR]. Observations outside this range are considered potential outliers and are plotted individually (here there are two potential outliers, as indicated by the two open circles). Notice how the skewed shape becomes apparent in this boxplot.
  • The shape of a distribution can be described as

    • symmetric: where the left side of the distribution mirrors the right side.
      • The mean, median, and mode are all located at the center of the distribution.
      • Examples include the normal distribution and the uniform² distribution.
    • skewed: asymmetrical, with the data clustering more toward one end of the distribution.
      • either right (positive skew) or left (negative skew).
      • The mean will be greater (resp. less) than the median for a right- (resp. left-) skewed distribution (see the simulation sketch below).
  • The modality of a distribution describes the number of peaks:

    • unimodal (one peak), bimodal (two peaks), multimodal (more than two peaks), uniform (no peaks).
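
To see the mean/median relationship under skew, here is a quick simulation sketch (any right-skewed distribution would do; the exponential is used here):

set.seed(205)                  # arbitrary seed for reproducibility
x <- rexp(1000, rate = 1)      # exponential data are right (positively) skewed
mean(x)                        # close to 1; pulled up by the long right tail
median(x)                      # close to log(2), about 0.69; below the mean
hist(x, main = "Right (positive) skew: mean > median")
abline(v = c(mean(x), median(x)), col = c("red", "green"), lty = 2)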

Sampling from Target Populations

  • Reasons for sampling vs. census: save time and money, logistics, feasibility

  • Sampling techniques: simple, stratified, and cluster sampling

  • Types of bias: non-response bias (e.g. surveys that illegal immigrants may be unwilling to answer), selection bias (e.g. technology-based health interventions that only reach people with access to the technology).

    Difference in notation between a population and a sample:

    Population                                                           Sample
    size \(N\)                                                           sample size \(n\)
    mean \(\mu = \dfrac{x_1 + \dots + x_N}{N}\)                          sample mean \(\bar{x} = \dfrac{x_1 + \dots + x_n}{n}\)
    variance \(\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\)     sample variance \(S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\)
    proportion \(p\)                                                     sample proportion \(\hat p\)
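
In R, mean(), var(), and sd() compute the sample versions (note the \(n - 1\) divisor in var()); a small sketch with made-up numbers:

x <- c(1, 9, -6, 3, 5)              # a small hypothetical sample
n <- length(x)
mean(x)                             # sample mean
var(x)                              # sample variance (divides by n - 1)
sum((x - mean(x))^2) / (n - 1)      # the same thing computed by hand
sum((x - mean(x))^2) / n            # the population-style divisor gives this instead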

Data types

  • Numeric variables: variables that represent quantities or numerical values
    • continuous (numeric): can take on any value within a certain range
    • discrete (integer): can only take on specific, distinct values
  • categorical: variables that represent categories or groups
    • nominal/unordered categorical (factor): represent categories with no inherent order or ranking.
    • ordinal/ordered categorical (ordered factor): represent categories with a natural order or ranking, e.g. satisfaction rating (poor, fair, good, excellent)
    • can be coded as numbers but are treated differently than integers
  • logical (0/1 or boolean) variables.
    • check with class or is.logical; coerce with as.logical
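
A quick sketch of how these types look in base R:

num <- c(2.5, 3.1)                        # numeric (continuous)
int <- c(2L, 5L)                          # integer (discrete)
fac <- factor(c("dog", "cat"))            # nominal categorical (factor)
ord <- factor(c("poor", "good", "excellent"),
              levels = c("poor", "fair", "good", "excellent"),
              ordered = TRUE)             # ordinal categorical (ordered factor)
lgl <- c(TRUE, FALSE)                     # logical
class(ord)                                # "ordered" "factor"
is.logical(lgl)                           # TRUE
as.logical(c(0, 1))                       # coerces 0/1 to FALSE TRUE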

Relationships between variables

  • When two variables show some connection/relationship with one another, they are called associated (or dependent) variables.
  • If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be independent.

Outliers

  • Outliers: data points that significantly differ from the rest of the data
    • potential causes: measurement errors, data entry (human error), natural variation, or genuine anomalies (novelties in data).
    • potential ramifications: can distort statistical analyses and models, influencing results and interpretations.
    • potential treatment: removal, transformation, employing robust methods

One way that we have identified outliers is through boxplots.
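
The same 1.5×IQR fences can be computed directly; a sketch using the interest rates from above (boxplot() uses Tukey's hinges, so its fences can differ very slightly):

x <- loan50$interest_rate
q <- quantile(x, c(0.25, 0.75))            # Q1 and Q3
fence <- 1.5 * IQR(x)                      # IQR(x) is Q3 - Q1
x[x < q[1] - fence | x > q[2] + fence]     # the potential outliers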

Types of studies

  1. Observational: collects data on variables of interest without intervening or manipulating any aspect of the study participants.
    • often used to describe patterns, associations, or relationships between variables in their natural settings.
  2. Experiments: researcher deliberately manipulates one or more independent variables and observes the effect on one or more dependent variables while controlling for other factors.
    • often used to establish causality by determining whether changes in one variable (the independent/explanatory variable) cause changes in another variable (the dependent/response variable).

RStudio and Quarto docs

R basics

  • Types of data: numeric, factor, character, …

  • data structures: vector, matrix, data frame, tibble, list, …

    • vectors can only store one data type
    • data frames/tibbles can store different data types (same within each column, however)
  • vectorization

    z = c(1,9,-6,3)
    z + 2
    [1]  3 11 -4  5
  • indexing using [] and $ (see the sketch after this list)

  • how to read in data (e.g. read.csv)

  • coercing data types (e.g. as.numeric, factor, …)

  • examining data (e.g. head, str, View, …)

  • help function/operator: help(), ?<name of thing you need help manual for>
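
A few of these basics in one place (df is a small hypothetical data frame):

z <- c(1, 9, -6, 3)
z[2]                        # [] indexing: the second element (9)
z[z > 0]                    # logical indexing: the positive elements

df <- data.frame(id = 1:3, score = c(80, 95, 70))
df$score                    # $ extracts one column as a vector
df[1, ]                     # first row, all columns
str(df)                     # examine the structure
head(df, 2)                 # peek at the first two rows
as.numeric(c("3", "7"))     # coercion: character to numeric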

Basic Rmarkdown:


## Level 2 header

*italic*, **bold**, ***bold and italic***

- unordered list item 1
- unordered list item 2
- unordered list item 3

1. ordered list item 1
1. ordered list item 2
1. ordered list item 3
```{r}
x = 4
```
My favourite number is `r x`

![](https://irene.vrbik.ok.ubc.ca/img/stat-chart.png)

Which renders to:

Level 2 header

italic, bold, bold and italic

  • unordered list item 1
  • unordered list item 2
  • unordered list item 3
  1. ordered list item 1
  2. ordered list item 2
  3. ordered list item 3
x = 4

My favourite number is 4

Code chunk options

  • echo: to show or hide code chunks
  • eval: to evaluate a code chunk or not
  • fig-: options for figure rendering (e.g. fig-width, fig-height, fig-cap)
  • out-width: controls how much space the outputted figure will take up
  • label: for labeling R code chunks

Authoring Quarto documents.

  • YAML

    ---
    title: "My Document"
    subtitle: "Stat 205"
    author: "Dr. Irene Vrbik"
    format: 
      html:
        toc: true
        code-fold: true
        df-print: paged
    ---
  • tables

  • LaTeX equations (will not be tested)

  • how to include images

  • setting seeds

Sampling Distributions

  • A statistic is a numerical quantity calculated from a sample of data: \(T(X) = g(X_1, X_2, \dots, X_n)\)

  • \(T(X)\) is a random variable for which you can find expected values.

  • A sampling distribution is the probability distribution of a given sample statistic.

    Central Limit Theorem

    Let \(X_1, X_2, \dots, X_n\) be a sequence of independent and identically distributed (i.i.d) RVs each having mean \(\mu\) and variance \(\sigma^2\). Then for \(n\) large, the distribution of the sum of those random variables is approximately normal with mean \(n\mu\) and variance \(n\sigma^2\)

    \[ \begin{align*} X_1 + X_2 + \dots + X_n &\sim N(n\mu, n\sigma^2) \end{align*} \]

    \[ \begin{align*} \dfrac{X_1 + X_2 + \dots + X_n}{n} &\sim N\left(\mu, \dfrac{\sigma^2}{n}\right) \end{align*} \]

    \[ Z_n = \dfrac{\overline{X} - \mu}{\sigma/\sqrt{n}} \rightarrow N(0,1) \text{ as } n \rightarrow \infty \]

    Rule of thumb

    In order for the Central Limit Theorem to hold, the sample size is typically considered sufficiently large when \(n \geq 30\).
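
    The CLT is easy to check by simulation; a sketch drawing many sample means from a skewed (exponential) population:

    set.seed(205)                    # arbitrary seed for reproducibility
    n <- 30                          # "large enough" by the rule of thumb
    xbars <- replicate(5000, mean(rexp(n, rate = 1)))
    hist(xbars, main = "Sampling distribution of the sample mean (n = 30)")
    mean(xbars)                      # close to mu = 1
    sd(xbars)                        # close to sigma/sqrt(n) = 1/sqrt(30)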

    CLT for proportions

    When observations are independent and the sample size is sufficiently large, the sample proportion \(\hat p\) is approximately normally distributed:

    \[ \hat p = \frac{X_1 + X_2 + \dots + X_n}{n} \rightarrow N\left(\mu_{\hat p} = p, \sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}\right) \]

    Success-failure condition

    In order for the Central Limit Theorem to hold, the sample size is typically considered sufficiently large when \(np \geq 10\) and \(n(1-p) \geq 10\), which is called the success-failure condition.

  • Plug-in principle: to approximate this, we substitute \(p\) (the unknown population parameter) with the point estimate \(\hat p\):

    \[\begin{align*} \hat{\sigma_{\hat p}} &\approx \sqrt{\frac{\hat p(1-\hat p)}{n}} \\ \end{align*}\]
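
    For example, with hypothetical counts of 42 successes in \(n = 100\) trials:

    x_success <- 42; n <- 100           # hypothetical data
    p_hat <- x_success / n
    sqrt(p_hat * (1 - p_hat) / n)       # plug-in standard error, about 0.049
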
    Sampling distribution of Sample Variance

    Let \(X_1, X_2, \dots, X_n\) be a random sample from a normal distribution with mean \(\mu\) and variance \(\sigma^2\). It can be shown that

    \[\begin{align*} \dfrac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{(n-1)} \end{align*}\] where \(\chi^2_{(n-1)}\) denotes a chi-squared distribution with \(n-1\) degrees of freedom.

    Sampling distribution of Sample Mean with unknown population variance

    Let \(X_1, \ldots, X_n\) be a sample from a normal population with mean \(\mu\). If \(\bar{X}\) denotes the sample mean and \(S\) the sample standard deviation, then \[ \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \] That is, \(\frac{\sqrt{n}(\bar{X} - \mu)}{S}\) follows a \(t\)-distribution with \(n - 1\) degrees of freedom.
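
    Both results can be checked by simulation from a normal population; a sketch for the chi-squared result:

    set.seed(205)                    # arbitrary seed for reproducibility
    n <- 10; sigma <- 2
    stat <- replicate(5000, (n - 1) * var(rnorm(n, sd = sigma)) / sigma^2)
    hist(stat, freq = FALSE, main = "(n-1)S^2/sigma^2 vs chi-squared(n-1)")
    curve(dchisq(x, df = n - 1), add = TRUE, col = "red")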

Parameter Estimation

  • Point estimators provide a single “best guess” for the parameter.

  • The sample statistic \(\hat \theta (X_1, \dots, X_n)\) is an estimator for the parameter; the observed value \(\hat \theta (x_1, \dots, x_n)\) (calculated using a particular sample of data) is called an estimate of the parameter.

Desirable properties of estimators

  1. An estimator, \(\hat \theta\), is unbiased if the mean of its sampling distribution is the parameter \(\theta\). The bias of an estimator, \(\hat \theta\), is given by \[\text{Bias}(\hat \theta) = B = \mathbb{E}[\hat \theta] - \theta\] (see the simulation sketch after this list).

  2. An efficient estimator achieves the smallest possible variance among a class of unbiased estimators.

    • The mean square error of the estimator \(\hat\theta\), denoted by \(\text{MSE}(\hat\theta)\), is defined as: \[ \text{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] \]

    • For unbiased estimators \(\hat\theta\) (where the bias is 0) it is clear that

      \[ \text{MSE}(\hat\theta) = \text{Var}(\hat\theta) \]

    • Given two estimators, \(\hat\theta_1\) and \(\hat\theta_2\), of a parameter \(\theta\), the efficiency of \(\hat\theta_1\) relative to \(\hat\theta_2\) is defined to be

      \[ \text{eff}(\hat\theta_1, \hat\theta_2) = \frac{\text{Var}(\hat\theta_1)}{\text{Var}(\hat\theta_2)} \]

      Thus, if the efficiency is smaller than 1, \(\hat\theta_2\) has a larger variance than \(\hat\theta_1\) has.

  3. Minimum variance unbiased estimator (MVUE) is an unbiased estimator whose variance is smaller than or equal to the variance of any other unbiased estimator for all potential values of \(\theta\).

    • Cramér-Rao Lower Bound (CRLB) Suppose that \(X_1, \dots, X_n\) is a random sample from a population having a common density function \(f(x; \theta)\) depending on a parameter \(\theta \in \Omega\). Assume that the regularity conditions \((R0)–(R4)\) hold and let \(\hat\theta\) be an unbiased estimator of \(\theta\). The variance of any such estimator \(\hat{\theta}\) must satisfy the inequality \[\begin{equation} \text{Var}(\hat{\theta})\geq \dfrac{1}{I_n(\theta)} = \dfrac{1}{nI(\theta)}\label{C-R} \end{equation}\] where \(I_n(\theta)\) denotes the Fisher information in the sample, and \(I(\theta)\) denotes the Fisher information in a single observation from \(f(x; \theta)\).

    • If CRV is the theoretical variance bound given by the CRLB theorem, then the efficiency of an unbiased estimator \(\hat\theta\) is given by:

      \[ \text{efficiency} = \dfrac{CRV}{\text{Var}(\hat\theta)} \]

      Efficiency cannot be bigger than 1 and is usually expressed in percent. An estimator which reaches 100% efficiency when \(n \rightarrow \infty\) is called asymptotically efficient.

  4. An estimator is sufficient if “no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.”

  5. Minimal sufficient statistics are those that are sufficient for the parameter and are functions of every other set of sufficient statistics for those same parameters.

  6. A consistent estimator is a statistical estimator whose accuracy improves as the sample size increases. A consistent estimator must have two properties:

    \[\mathbb{E}[\hat \theta] \underset{n\rightarrow \infty }{\longrightarrow } \theta\] i.e. be asymptotically unbiased, and \[ \text{Var}(\hat{\theta})\underset{n\rightarrow \infty }{\longrightarrow }0 \] meaning that its variance must tend to zero with increasing sample size.
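
    To make unbiasedness (property 1) concrete, a simulation sketch comparing the \((n-1)\)-divisor and \(n\)-divisor variance estimators:

    set.seed(205)                                   # arbitrary seed for reproducibility
    n <- 5                                          # true variance is 4
    s2 <- replicate(10000, var(rnorm(n, sd = 2)))   # var() divides by n - 1
    mean(s2)                                        # about 4: unbiased
    mean(s2 * (n - 1) / n)                          # about 4(n-1)/n = 3.2: biased downward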

Methods of Finding Point Estimators

  • Method of Moments: sample moments are equated to their corresponding population moments to obtain estimates for the parameters.

    The \(k\)th (population) moment (about the origin) of a random variable \(X\), denoted \(\mu_k'\), is the expected value of \(X^k\).

    \[\mu_k' = \mathbb{E}[X^k]\]

    The \(k\)th sample moment (about the origin) \(m_k'\) is defined as the average of the \(k\)th powers of the observed data points.

    \[m_k' = \frac{1}{n} \sum_{i=1}^n X_i^k\]

    Suppose there are \(l\) parameters to be estimated. From the system of equations

    \[\mu_k' = m_k' = \frac{1}{n} \sum_{i=1}^n X_i^k\]

    for \(k = 1, 2, \dots, l\), solve for \(\Theta = (\theta_1, \dots, \theta_l)\).
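
    For example (a standard one-parameter case): for an Exponential(\(\lambda\)) sample, \(\mu_1' = \mathbb{E}[X] = 1/\lambda\), so matching the first population and sample moments gives

    \[\frac{1}{\hat\lambda} = m_1' = \bar{X} \quad \Longrightarrow \quad \hat\lambda_{\text{MoM}} = \frac{1}{\bar{X}}\]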

  • Maximum likelihood estimation (MLE): seeks to find the values of model parameters that maximize the so-called likelihood function.

    Likelihood (definition)

    Let \(f(x_1, \ldots, x_n; \theta)\), \(\theta \in \Theta \subseteq \mathbb{R}^k\), be the joint probability (or density) function of \(n\) random variables (\(X_1, \ldots, X_n\)) with sample values (\(x_1, \ldots, x_n\)). The likelihood function of the sample is given by: \[\begin{equation} L(\theta; x_1, \ldots, x_n) = f(x_1, \ldots, x_n; \theta) \end{equation}\] Note: \(L\) is a function of \(\theta\) for fixed sample values.

    If \((X_1, \ldots, X_n)\) are independent and identically distributed (iid) discrete random variables with probability mass function (PMF) \(p(x, \theta)\), then the likelihood function is given by: \[ \begin{align*} L(\theta) &= P(X_1 = x_1, \dots, X_n = x_n) \\ &= \prod_{i=1}^{n} P(X_i = x_i) \text{ from independence}\\ &= \prod_{i=1}^{n} p(x_i; \theta) \end{align*} \]

  • And in the continuous case, if the density is \(f(x, \theta)\), then the likelihood function is:

    \[ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) \]

    Maximum likelihood estimators (MLEs)

    Maximum likelihood estimators or MLEs are those values of the parameters that maximize the likelihood function with respect to the parameter \(\theta\). That is, \[ \hat{\theta}_{\text{MLE}} = \underset{\theta \in \Theta}{\arg\max} \, L(\theta) \] where \(\Theta\) is the set of possible values of the parameter \(\theta\).

    It is often easier to work with the log-likelihood function denoted by \(\ell(\theta)\).
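
    A numerical sketch (assuming exponential data, where the closed-form MLE is \(\hat\lambda = 1/\bar{x}\)): maximize the log-likelihood with optimize() and compare:

    set.seed(205)                              # arbitrary seed for reproducibility
    x <- rexp(50, rate = 2)                    # simulated sample; true lambda = 2
    loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
    optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
    1 / mean(x)                                # closed-form MLE; the two agree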

Confidence Interval

  • An interval estimator, also known as a confidence interval (CI), provides a range of values within which the true value of the parameter is expected to fall, along with a level of confidence.

  • A CI provides some prescribed degree of confidence of capturing the true parameter (typically 90%, 95%, or 99%).

    \[ \begin{align} \text{point estimate} & \pm \text{margin of error}\\ & \pm (z^* \text{ or } t^*) \times \text{SE(estimate)} \end{align} \]

  • In the following CIs, we have:

    • point estimates: \(\bar{x}\) (sample mean), \(\hat p\) (sample proportion), \(s^2\) (sample variance)
    • significance level \(\alpha\), and corresponding confidence level \(C = 1 - \alpha\)
    • \(z^*\) the \(z\)-score corresponding to the desired level of confidence \(C = 1-\alpha\)
    • \(t^*_{n-1}\) is the \(t\)-score corresponding to the desired level of confidence
    • \(n\) sample size
    CI for \(\mu\) (known \(\sigma^2\))

    CI for population mean (\(\mu\)) with known population standard deviation (\(\sigma\)):

    \[ \begin{equation} \bar{x} \pm z^* \left(\sigma/\sqrt{n}\right) \end{equation} \]

    Assumptions: When the sample data are an SRS from the population of interest, this CI is exact when the population is normal, and approximate otherwise (\(n > 30\) as a rough guideline).
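
    A sketch in R with hypothetical values (\(\bar{x} = 11.6\), \(\sigma = 5\), \(n = 50\), 95% confidence):

    xbar <- 11.6; sigma <- 5; n <- 50     # hypothetical values
    z_star <- qnorm(0.975)                # z* for 95% confidence (alpha = 0.05)
    xbar + c(-1, 1) * z_star * sigma / sqrt(n)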

    CI for \(\mu\) (unknown \(\sigma^2\))

    CI for population mean (\(\mu\)) with unknown population standard deviation (\(\sigma\)):

    \[ \begin{align} \bar{x} & \pm t^*_{n-1} \left(s/\sqrt{n}\right) \end{align} \]

    Assumptions: When the sample data are an SRS from the population of interest, this CI is exact when the population is normal, and approximate otherwise (\(n > 30\) as a rough guideline).
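
    In R, build the \(t\)-interval by hand or let t.test() do it; a sketch using the interest-rate data from above:

    x <- loan50$interest_rate
    n <- length(x)
    t_star <- qt(0.975, df = n - 1)                  # t* for 95% confidence
    mean(x) + c(-1, 1) * t_star * sd(x) / sqrt(n)    # by hand
    t.test(x)$conf.int                               # the same interval via t.test()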

    CI for \(p\)

    A \((1-\alpha) \times 100\%\) CI for the population proportion (\(p\)) is given by:

    \[ \begin{align} \hat p & \pm z^* \sqrt{\dfrac{\hat p(1 - \hat p)}{n}} \end{align} \]

    Assumptions: The sample is drawn from a binomial distribution and we satisfy the success-failure condition for the normal approximation to be valid (\(np \geq 10\) and \(n(1-p) \geq 10\), with \(p\) replaced by \(\hat p\) in practice).
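
    A sketch with hypothetical counts (42 successes in \(n = 100\) trials, 95% confidence):

    x_success <- 42; n <- 100          # hypothetical counts
    p_hat <- x_success / n
    z_star <- qnorm(0.975)
    p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)

    For reference, prop.test() gives a related but not identical interval (it uses the Wilson score method rather than this normal approximation).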

    CI for \(\sigma^2\)

    We are \((1 - \alpha) \times 100\%\) confident that the population variance \(\sigma^2\) falls in the interval \[ \left(\frac{(n - 1)S^2}{\chi_{\frac{\alpha}{2}}^2}, \frac{(n - 1)S^2}{\chi_{1 - \frac{\alpha}{2}}^2}\right), \]

    where the \(\chi^2\) distribution has \((n - 1)\) degrees of freedom.
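
    A sketch in R with hypothetical values (\(S^2 = 9\), \(n = 20\), 95% confidence); note that the larger \(\chi^2\) quantile produces the lower limit:

    s2 <- 9; n <- 20; alpha <- 0.05       # hypothetical values
    (n - 1) * s2 / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)   # (lower, upper)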

    If the assumptions of the above procedures are not justified, we can use nonparametric procedures instead.

    Nonparametric Confidence Intervals
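
    A sketch, assuming the bootstrap percentile interval is the intended nonparametric procedure: resample the data with replacement many times and take empirical quantiles of the resampled statistic.

    set.seed(205)                              # arbitrary seed for reproducibility
    x <- loan50$interest_rate
    boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975))      # 95% bootstrap percentile CI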

Footnotes

  1. not as sensitive to extreme values

  2. a special type of symmetric distribution where all values have the same frequency, resulting in a flat shape.