
STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
Many applied statistical analysis problems involve estimating population parameters.
Direct observation of these numerical characteristics may not be possible, so random variables are observed instead.
Objective: Develop methods using sample data to gain information about unknown population characteristics.
Population
[Illustration: a population of individuals]
Important
Taking an SRS minimizes bias, enables valid statistical inferences, and makes results generalizable.
\[\downarrow\]
Sample
[Illustration: a sample of six individuals]
SRS (Simple Random Sample)
A Simple Random Sample (SRS) is a sampling method in which every element in the population has an equal probability of being selected, and each subset of the population of a given size has an equal chance of being chosen.
Goal: to use the sample to obtain the best possible estimates of these population parameters.
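As a rough illustration of the SRS definition, the sketch below draws a sample of size 6 without replacement. The population of 48 ID numbers and the seed are assumptions made only for this example.

```r
# Hypothetical population: ID numbers 1..48 standing in for individuals
population <- 1:48

set.seed(205)  # arbitrary seed so the draw is reproducible

# sample() without replacement gives every subset of size 6 the same chance
srs <- sample(population, size = 6, replace = FALSE)
srs
```

Rerunning without the seed would generally produce a different sample, which is why quantities computed from a sample are treated as random variables.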
Population
[Illustration: a population of individuals]
\[\downarrow \text{SRS}\]
Sample
[Illustration: a sample of six individuals]
Population Distribution
\(f(x \mid\mu, \sigma)= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}\)

Population Parameters
\[ \begin{align} \mu &= ? & \sigma^2 &= ? \end{align} \]
Population Distribution \(f(x \mid \lambda) = \lambda e^{-\lambda x}, \quad \text{ for }x \geq 0\)

Population Parameters
\[ \begin{align} \lambda &= ? \end{align} \]
Population Distribution
\(f(x \mid \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{\text{B}(\alpha, \beta)} \quad 0 \leq x \leq 1\)

Population Parameters
\[ \begin{align} \alpha &= ? & \beta &= ? \end{align} \]
Population Distribution
\[ \begin{align} f(x\mid \theta) \end{align} \]
Population Parameters
\[ \begin{align} \theta_1 = ? \quad \theta_2 &= ? \quad \dots \quad \theta_l = ? \end{align} \]
Population Assumption
Assume the population follows some distribution with parameters \(\theta_1\) through \(\theta_l\), which are to be estimated.
\(f(x \mid \theta)\) is the probability density function (PDF) for continuous data, or the probability mass function (PMF) for discrete data.
Note that \(\theta\) may be a vector of parameters.
The model is the set of possible distributions, each determined by a different value of \(\theta\).
Sampling Assumptions
Let \(X = (X_1, \dots, X_n)\) be independent and identically distributed (i.i.d.) random variables (RVs) with PDF or PMF \(f(x \mid \theta)\), where \(\theta = (\theta_1, \dots, \theta_l)\) are the unknown population parameters.
RIS (Random Independent Sample)
A Random Independent Sample (RIS) of size \(n\) involves a sampling method where individuals are chosen randomly and independently from the population.
Estimator
An estimator, \(\hat \Theta(X_1, X_2, \dots, X_n)\), is a rule, formula, or function used to calculate an estimate from sample data. We denote the estimator by \(\hat \Theta\) (capital theta) to emphasize that it is a random variable that depends on the random sample.
Point Estimate
A point estimate (or simply, estimate), \(\hat \theta(x_1, x_2, \dots, x_n)\), is the numerical value produced by applying the estimator to a specific sample. It is the realized value of the estimator after data collection, and is typically written simply as \(\hat \theta\).
Population
[Illustration: a population of individuals]
\[\downarrow \text{RIS}\]
Sample
[Illustration: a sample of six individuals]
Population Distribution
\[ \begin{align} & f(x \mid\mu, \sigma) \end{align} \]
Sample Statistic (RV)
\[ \hat \mu = \frac{X_1 + \dots + X_n}{n} \]
The sample mean is our estimator (an RV) of the population parameter \(\mu\).
Population Distribution \[ \begin{align} & f(x \mid\mu, \sigma) \end{align} \]
Point Estimate
\[ \begin{align} {\hat \mu} &= \frac{x_1 + \dots + x_n}{n} \\ &= \frac{170 + 192 + \dots + 155}{6} \\ &= 167.7 \end{align} \]
The value 167.7 is the estimate of the population parameter \(\mu\).
Population Distribution
\[ \begin{align} & f(x \mid\theta) \end{align} \]
Sample Statistic
\[ \hat \Theta = g(X_1, \dots, X_n) \]
Point Estimate
\[ \hat \theta = g(x_1, \dots, x_n) \]
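The distinction between the estimator \(g(X_1, \dots, X_n)\) and the estimate \(g(x_1, \dots, x_n)\) can be sketched in R. The function name `g`, the simulated heights, and the values `mean = 170` and `sd = 10` are assumptions for illustration, not data from the lecture.

```r
# Estimator: a rule g() that can be applied to any sample (here, the sample mean)
g <- function(x) sum(x) / length(x)

# Hypothetical population of heights (cm); these parameter values are illustrative only
set.seed(205)
sample_1 <- rnorm(6, mean = 170, sd = 10)
sample_2 <- rnorm(6, mean = 170, sd = 10)

# Point estimates: the realized values of the estimator on each observed sample
theta_hat_1 <- g(sample_1)
theta_hat_2 <- g(sample_2)
c(theta_hat_1, theta_hat_2)
```

The same rule gives a different number for each sample, which is the sense in which the estimator is a random variable while each estimate is a fixed value.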

Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model.
The basic idea behind MLE is to find the values of the parameters that maximize the likelihood function, which measures how well the model explains the observed data.
Maximum Likelihood Estimation is one of several methods used for parameter estimation.
In this paradigm, we treat the data as fixed and ask:
Which parameter value makes the observed data most plausible?
Suppose our model has parameter \(\theta\). The likelihood function is
\[ L(\theta \mid x) = f(x \mid \theta) \]
where \(x\) is the observed data.
Important
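Same formula, different point of view: as a function of \(x\) with \(\theta\) fixed, \(f(x \mid \theta)\) is a PMF or PDF; as a function of \(\theta\) with \(x\) fixed, it is the likelihood.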
Coin Flip Example
Exercise 1 Suppose:
iClicker: Coin toss model
What is the appropriate model for Exercise 1?
For \(x=7\) successes out of \(n=10\),
\[ L(p \mid x=7) \propto p^7(1-p)^3, \qquad 0 \le p \le 1 \]
More fully,
\[ L(p \mid n=10, x=7) = \binom{10}{7} p^7 (1-p)^3 \]
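A quick numerical check, offered as a sketch, that the kernel \(p^7(1-p)^3\) and the full binomial likelihood differ only by the constant \(\binom{10}{7} = 120\), so both peak at the same \(p\). The grid of \(p\) values is an arbitrary choice for this example.

```r
# Arbitrary grid of candidate values for p
p <- seq(0.05, 0.95, by = 0.05)

# Full binomial likelihood and its kernel for x = 7 successes in n = 10 trials
full   <- dbinom(7, size = 10, prob = p)
kernel <- p^7 * (1 - p)^3

# The ratio is the binomial coefficient, which does not depend on p
ratio <- full / kernel
range(ratio)

# Both versions are maximized at the same grid point
p[which.max(full)]
p[which.max(kernel)]
```

Because the constant does not involve \(p\), dropping it changes the height of the likelihood curve but not the location of its maximum.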
If we assumed the coin was fair, i.e. \(p=0.5\), the PMF is:
# support and number of trials for Binomial(n = 10, p)
x <- 0:10
n <- 10
# success probability for a fair coin
p <- 0.5
# PMF evaluated at every value in the support
probs <- dbinom(x, size = n, prob = p)
barplot(
  probs,
  names.arg = x,
  col = "gray75",
  border = "gray35",
  ylim = c(0, max(probs) * 1.1),
  main = paste0("PMF for binomial with n=10, p=", p),
  xlab = "x",
  ylab = paste0("f(x|n=10,p=", p, ")")
)
If the coin was biased towards heads, say \(p=0.8\), the PMF is:
If the coin was biased towards tails, say \(p=0.2\), the PMF is:
iClicker
We observed \(x = 7\) heads. Of the values of \(\theta = p\) tried in the previous animation, which is most plausible?
Figure 1: Binomial PMF for \(n=10\) evaluated at the MLE, \(\hat\theta = 0.7\). The highlighted bar is \(P(X=7\mid \theta=0.7)\), the likelihood of the observed outcome at this parameter value.
Alternatively, we could plot the likelihood, \(\Pr(X = 7)\), for a range of values of \(p\):
With \(p = 0.1\), \(\Pr(X = 7) = 8.748 \times 10^{-6}\).
With \(p = 0.15\), \(\Pr(X = 7) = 0.00012591\).
With \(p = 0.2\), \(\Pr(X = 7) = 0.00078643\).
With \(p = 0.25\), \(\Pr(X = 7) = 0.0030899\).
With \(p = 0.3\), \(\Pr(X = 7) = 0.0090017\).
With \(p = 0.35\), \(\Pr(X = 7) = 0.021203\).
With \(p = 0.4\), \(\Pr(X = 7) = 0.042467\).
With \(p = 0.45\), \(\Pr(X = 7) = 0.074603\).
With \(p = 0.5\), \(\Pr(X = 7) = 0.11719\).
With \(p = 0.55\), \(\Pr(X = 7) = 0.16648\).
With \(p = 0.6\), \(\Pr(X = 7) = 0.21499\).
With \(p = 0.65\), \(\Pr(X = 7) = 0.25222\).
With \(p = 0.7\), \(\Pr(X = 7) = 0.26683\).
With \(p = 0.75\), \(\Pr(X = 7) = 0.25028\).
With \(p = 0.8\), \(\Pr(X = 7) = 0.20133\).
With \(p = 0.85\), \(\Pr(X = 7) = 0.12983\).
With \(p = 0.9\), \(\Pr(X = 7) = 0.057396\).
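The probabilities listed above can be reproduced with a few lines of R. This is a minimal sketch; the variable names `p_grid`, `lik`, and `p_best` are chosen here for illustration.

```r
# Candidate values of p, matching the grid 0.1, 0.15, ..., 0.9
p_grid <- seq(0.1, 0.9, by = 0.05)

# Likelihood of each candidate: Pr(X = 7) under Binomial(n = 10, p)
lik <- dbinom(7, size = 10, prob = p_grid)

# Tabulate to five significant digits
data.frame(p = p_grid, likelihood = signif(lik, 5))

# The grid value with the largest likelihood
p_best <- p_grid[which.max(lik)]
p_best
```

Among these candidates the likelihood is largest at \(p = 0.7\), the observed proportion of heads.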


Plotting the likelihood for all values of \(p\):
# evaluate the likelihood of x = 7 successes in n = 10 trials on a fine grid of p
p <- seq(0, 1, length.out = 400)
lik <- dbinom(7, size = 10, prob = p)
plot(
  p, lik, type = "l", lwd = 3, bty = "n",
  xlab = "p",
  ylab = "Likelihood",
  main = "Likelihood for x = 7 out of n = 10"
)
# mark the maximizer p = 0.7 with a dashed vertical line
abline(v = 0.7, col = 2, lwd = 2, lty = 2)
# allow the label to extend past the plot region, then restore the old setting
old_par <- par(xpd = TRUE)
text(0.7, max(lik), labels = expression(hat(p) == 0.7), pos = 3, col = 2, cex = 1.2)
par(old_par)
The likelihood function for observing 7 successes out of 10 trials, with the peak indicating the MLE of \(\hat p = 0.7\).
Likelihood (definition)
Let \(f(x \mid \theta)\) be the joint probability (or density) function of random variables \(X\), evaluated at observed values \(x\). The likelihood function is defined as \[\begin{equation} L(\theta\mid x) = f(x \mid \theta) \end{equation}\]
Key idea: We treat the data \(x\) as fixed and view \(L(\theta)\) as a function of \(\theta\).
If \((X_1, \ldots, X_n)\) are i.i.d. discrete random variables with probability mass function (PMF) \(p(x \mid \theta)\), then the likelihood function is given by: \[ \begin{align*} L(\theta) &= P(X_1 = x_1, \dots, X_n = x_n \mid \theta) \\ &= \prod_{i=1}^{n} P(X_i = x_i \mid \theta) && \text{(multiplication rule for independent RVs)}\\ &= \prod_{i=1}^{n} p(x_i \mid \theta) \end{align*} \]
And in the continuous case, if the density is \(f(x \mid\theta)\), then the likelihood function is: \[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]
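To make the product form concrete, the sketch below evaluates the likelihood of a small sample under an exponential model with rate \(\lambda\). The four data values are hypothetical, and the function names are assumptions for this example.

```r
# Hypothetical observed sample, assumed i.i.d. Exponential(lambda)
x <- c(0.8, 2.1, 0.4, 1.3)

# Likelihood as a product of the individual densities f(x_i | lambda)
lik_product <- function(lambda) prod(dexp(x, rate = lambda))

# The same likelihood after simplifying the product by hand:
# lambda^n * exp(-lambda * sum(x))
lik_closed <- function(lambda) lambda^length(x) * exp(-lambda * sum(x))

# The two versions agree at any lambda > 0, for example lambda = 0.5
lik_product(0.5)
lik_closed(0.5)
```

Here the data `x` stay fixed and only `lambda` varies, which matches the key idea that the likelihood is a function of the parameter.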
Maximum likelihood estimators (MLEs)
Maximum likelihood estimators or MLEs are those values of the parameters that maximize the likelihood function with respect to the parameter \(\theta\). That is, \[ \hat{\theta}_{\text{MLE}} = \underset{\theta \in \Theta}{\arg\max} \, L(\theta) \] where \(\Theta\) is the set of possible values of the parameter \(\theta\).
There are three ways of finding the MLE:
Analytically: use calculus to solve for the parameter value(s) at which the likelihood peaks. At a maximum:
\[ \frac{\partial L}{\partial \theta} = 0 \qquad \frac{\partial^2 L}{\partial \theta^2} < 0 \]
Grid search: exhaustive search through the parameter space.
Numerically: use non-linear optimization (e.g. gradient descent on the negative log-likelihood) to iteratively find the peak.
In this course, we will only discuss analytic solutions.
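For comparison, here is a sketch of all three approaches applied to the coin example (\(x = 7\), \(n = 10\)). The grid spacing and the use of base R's `optimize()` as the numerical optimizer are choices made for this illustration; the analytic answer \(x/n\) comes from setting the derivative of the binomial log-likelihood to zero.

```r
x <- 7
n <- 10

# Log-likelihood of p for the observed data
loglik <- function(p) dbinom(x, size = n, prob = p, log = TRUE)

# 1. Analytic: solving 7/p - 3/(1 - p) = 0 gives p = x / n
p_analytic <- x / n

# 2. Grid search: evaluate on a fine grid and keep the best candidate
grid <- seq(0.001, 0.999, by = 0.001)
p_grid <- grid[which.max(loglik(grid))]

# 3. Numerical: one-dimensional optimizer over the interval (0, 1)
p_numeric <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum

c(analytic = p_analytic, grid = p_grid, numeric = p_numeric)
```

The grid answer is limited by the grid spacing and the numerical answer by the optimizer's tolerance, so both only approximate the analytic value.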
Example: Normal Likelihood
Let \(X_1, \ldots, X_n\) be independent and identically distributed random variables following a normal distribution \(N(\mu, \sigma^2)\). Let \(x_1, \ldots, x_n\) be the corresponding sample values. Find the likelihood function.
Recall the pdf of Normal distribution:
\[\begin{equation} f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{{-\frac{(x - \mu)^2}{2\sigma^2}}\right\} \end{equation}\]
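One way to write out the requested likelihood, using independence to multiply the \(n\) individual densities:

\[ \begin{align*} L(\mu, \sigma^2) &= \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\} \\ &= (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2\right\} \end{align*} \]

The last line collects the \(n\) identical constants and turns the product of exponentials into the exponential of a sum.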
Example: MLE for geometric distribution
Suppose \(X_1, \ldots, X_n\) is a random sample from a geometric distribution with parameter \(p\), \(0 \leq p \leq 1\). Find the MLE.
Recall the probability mass function (PMF) of a geometric distribution with parameter \(p\), denoted as \(X \sim \text{Geometric}(p)\), is given by: \[ P(X = x) = p(1 - p)^{x-1} \quad \text{for } x = 1, 2, 3, \ldots \]
The plotted likelihood for simulated data from a geometric distribution with \(p = 0.3\).
For this simulation the MLE is \(\hat p = 0.33\).
The plotted log-likelihood for simulated data from a geometric distribution with \(p = 0.3\).
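A simulation along these lines could be sketched as follows. The sample size of 50 and the seed are assumptions, so the resulting estimate will generally differ from the \(\hat p = 0.33\) reported for the lecture's own simulation.

```r
set.seed(205)  # arbitrary seed for reproducibility
p_true <- 0.3

# rgeom() counts failures before the first success (support 0, 1, 2, ...),
# so add 1 to match the PMF with support x = 1, 2, 3, ...
x <- rgeom(50, prob = p_true) + 1

# Log-likelihood: sum over the sample of log{ p (1 - p)^(x_i - 1) }
loglik <- function(p) sum(log(p) + (x - 1) * log(1 - p))

# Numerical maximization over (0, 1)
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
p_hat <- fit$maximum
p_hat
```

Plotting `loglik` over a grid of `p` values would give a curve like the log-likelihood plot described above, with its peak at `p_hat`.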


Because the natural logarithm function is increasing, the maximum value of the likelihood function, if it exists, will occur at the same point as the maximum value of the log-likelihood function.
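Applied to the geometric example, the log-likelihood route can be worked through as follows (a sketch of the standard derivation, with \(\ell\) denoting the log-likelihood):

\[ \begin{align*} L(p) &= \prod_{i=1}^{n} p(1-p)^{x_i - 1} = p^{n}(1-p)^{\sum_{i=1}^{n} x_i - n} \\ \ell(p) &= \log L(p) = n \log p + \left(\sum_{i=1}^{n} x_i - n\right)\log(1-p) \\ \frac{d\ell}{dp} &= \frac{n}{p} - \frac{\sum_{i=1}^{n} x_i - n}{1-p} = 0 \quad\Longrightarrow\quad \hat p = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \end{align*} \]

The second derivative, \(-\frac{n}{p^2} - \frac{\sum_{i=1}^{n} x_i - n}{(1-p)^2}\), is negative for \(0 < p < 1\), so this critical point is a maximum.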
MLEs for multiple parameters
Let \((X_1, \ldots, X_n)\) be a random sample with joint probability mass function (if discrete) or probability density function (if continuous) \(f\). The likelihood function is:
\[ L(\theta_1, \dots, \theta_m; x_1, \dots, x_n) = f(x_1, \ldots, x_n; \theta_1, \ldots, \theta_m) \]
where the values of the parameters \((\theta_1, \ldots, \theta_m)\) are unknown and \((x_1, \ldots, x_n)\) are the observed sample values.
Then, the maximum likelihood estimates \((\hat{\theta}_1, \ldots, \hat{\theta}_m)\) are those values of the parameters that maximize the likelihood function, so that:
\[ f(x_1, \ldots, x_n; \hat\theta_1, \ldots, \hat\theta_m) \geq f(x_1, \ldots, x_n; \theta_1, \ldots, \theta_m) \] for all allowable \(\theta_1, \dots, \theta_m\).
MLEs for the gamma distribution
Let \(X_1, \ldots, X_n\) be a random sample from a population following a gamma distribution with shape parameter \(\alpha > 0\) and rate parameter \(\beta > 0\), with PDF given by:
\[ f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0. \]
Find the Maximum Likelihood Estimators (MLEs) for the unknown parameters \(\alpha\) and \(\beta\).
Given a sample \(X_1, X_2, \dots, X_n\), the likelihood function is:
\[ \mathcal{L}(\alpha, \beta) = \prod_{i=1}^{n} \frac{\beta^\alpha}{\Gamma(\alpha)} X_i^{\alpha-1} e^{-\beta X_i}. \]
Taking logs gives the log-likelihood:
\[ \ell(\alpha, \beta) = \log \mathcal{L}(\alpha, \beta) = n \alpha \log \beta - n \log \Gamma(\alpha) + (\alpha - 1) \sum_{i=1}^{n} \log X_i - \beta \sum_{i=1}^{n} X_i \]
Taking the derivative with respect to \(\beta\) and setting it to zero,
\[ \frac{\partial \ell}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^{n} X_i = 0. \]
Solving for \(\beta\),
\[ \hat{\beta} = \frac{n\alpha}{\sum_{i=1}^{n} X_i} = \frac{\alpha}{\bar{X}}, \]
where \(\bar{X}\) is the sample mean.
Taking the derivative with respect to \(\alpha\),
\[ \frac{\partial \ell}{\partial \alpha} = n \log \beta - n \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^{n} \log X_i. \]
Substituting \(\beta = \frac{\alpha}{\bar{X}}\),
\[ n \log \alpha - n \log \bar{X} - n \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^{n} \log X_i = 0. \]
This equation does not have a closed-form solution for \(\alpha\), so it is typically solved numerically. A common approach is to use the method of moments to get an initial estimate and then refine it using numerical optimization (e.g., Newton-Raphson). This is beyond the scope of STAT 205.
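For the curious, a numerical solution could be sketched as below. This uses base R's `uniroot()` root finder rather than Newton-Raphson, and the simulated data (200 draws with shape 3 and rate 2) and the search bracket are assumptions made purely for illustration.

```r
set.seed(205)  # arbitrary seed for reproducibility

# Hypothetical data: a simulated gamma sample (shape = 3, rate = 2 are illustrative)
x <- rgamma(200, shape = 3, rate = 2)
xbar <- mean(x)
mean_log_x <- mean(log(x))

# Score equation for alpha after substituting beta = alpha / xbar, divided by n:
#   log(alpha) - log(xbar) - digamma(alpha) + mean(log(x)) = 0
# digamma(alpha) is Gamma'(alpha) / Gamma(alpha)
score_alpha <- function(alpha) {
  log(alpha) - log(xbar) - digamma(alpha) + mean_log_x
}

# Root finding on a wide bracket that is assumed to contain the solution
alpha_hat <- uniroot(score_alpha, interval = c(1e-3, 1e3), tol = 1e-10)$root

# Plug alpha_hat into the closed-form expression for beta
beta_hat <- alpha_hat / xbar
c(alpha_hat = alpha_hat, beta_hat = beta_hat)
```

With a sample of this size the estimates should land reasonably close to the values used to simulate the data, though they will not match exactly.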
Maximum likelihood estimation is one of the most versatile methods for fitting parametric statistical models to data.
For most cases of practical interest, the performance of MLEs is optimal for large enough data.
Parts of this lecture were inspired by Myung (2003) and the Gribble Lab MLE slides.
Comment about Likelihood
Although the likelihood depends on the observed sample values \(x = (x_1, \ldots, x_n)\), it is to be regarded as a function of the parameter \(\theta\).
In the discrete case, \(L(\theta; x_1, \ldots, x_n)\) gives the probability of observing \(x = (x_1, \ldots, x_n)\) for a given \(\theta\).
Thus, the likelihood function is a statistic, depending on the observed sample \(x = (x_1, \ldots, x_n)\).