Maximum Likelihood Estimation

STAT 205: Introduction to Mathematical Statistics

Dr. Irene Vrbik

University of British Columbia Okanagan

Recap

  • Many applied statistical analysis problems involve estimating population parameters

    • e.g. mean \(\mu\), proportion \(p\), variance \(\sigma^2\)
  • Direct observation of these numerical characteristics may not be possible, so random variables are observed

    • e.g. sample mean/proportion/variance \(\overline{X}\)/\(\hat p\)/\(S^2\)

Objective: Develop methods using sample data to gain information about unknown population characteristics.

Simple Random Samples

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

Important

Taking an SRS minimizes bias, enables valid statistical inferences, and makes results generalizable.

\[\downarrow\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

SRS (Simple Random Sample)

A Simple Random Sample (SRS) is a sampling method in which every element in the population has an equal probability of being selected, and each subset of the population of a given size has an equal chance of being chosen.
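
As a minimal sketch of the definition, the R code below draws an SRS from a hypothetical population of 48 labelled individuals. The population size, the sample size of 6, and the seed are assumptions chosen for illustration. Sampling without replacement via `sample()` gives every subset of size \(n\) the same chance of being chosen.

```r
# Hypothetical population: 48 individuals labelled 1..48 (illustrative only)
population <- 1:48
n <- 6

set.seed(205)  # arbitrary seed, for reproducibility only
srs <- sample(population, size = n, replace = FALSE)

# Number of distinct subsets of size n; under an SRS each one
# is selected with probability 1 / n_subsets
n_subsets <- choose(length(population), n)
```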

Setup

  • We begin with a simple random sample of size \(n\) from a population.
  • The population is assumed to follow a known distribution (e.g. Normal, binomial), but its parameters are unknown.

Goal: to use the sample to obtain the best possible estimates of these population parameters.

Example: Normal Population

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{SRS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\(f(x \mid\mu, \sigma)= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}\)

Population Parameters

\[ \begin{align} \mu &= ? & \sigma^2 &= ? \end{align} \]

Example: Exponential Distribution

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{SRS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\(f(x \mid \lambda) = \lambda e^{-\lambda x}, \quad \text{ for }x \geq 0\)

Population Parameters

\[ \begin{align} \lambda &= ? \end{align} \]

Example: Beta Distribution

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{SRS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\(f(x \mid \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{\text{B}(\alpha, \beta)} \quad 0 \leq x \leq 1\)

Population Parameters

\[ \begin{align} \alpha &= ? & \beta &= ? \end{align} \]

More Generally

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{SRS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\[ \begin{align} f(x\mid \theta) \end{align} \]

Population Parameters

\[ \begin{align} \theta_1 = ? \quad \theta_2 &= ? \quad \dots \quad \theta_l = ? \end{align} \]

Population Assumption

Assume the population follows some distribution with parameters \(\theta_1\) through \(\theta_l\), which are to be estimated.

Notation

  • \(f(x \mid \theta)\) is the probability density function (PDF) for continuous data, or the probability mass function (PMF) for discrete data.

  • Note that \(\theta\) may be a vector of parameters

  • The model is the set of possible distributions, each determined by a different value of \(\theta\).

Setup

Sampling Assumptions

Let \(X = (X_1, \dots, X_n)\) be independent and identically distributed (i.i.d) random variables (RVs) with a probability density function (pdf) or probability mass function (pmf) \(f(x \mid \theta)\), where \(\theta = (\theta_1, \dots, \theta_l)\) are the unknown population parameters.

RIS (Random Independent Sample)

A Random Independent Sample (RIS) of size \(n\) involves a sampling method where individuals are chosen randomly and independently from the population.

Notation

Estimator

An estimator, \(\hat \Theta(X_1, X_2, \dots, X_n)\), is a rule, formula, or function used to calculate an estimate based on sample data. We will denote the estimator by \(\hat \Theta\) (capital theta) to emphasize that it represents a random variable that depends on the random sample.

Point Estimate

A point estimate (or simply, an estimate), \(\hat \theta(x_1, x_2, \dots, x_n)\), is the numerical value produced by applying the estimator to a specific sample. It is the realized value of the estimator after data collection. It is typically written simply as \(\hat \theta\).

Normal Population

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{RIS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\[ \begin{align} & f(x \mid\mu, \sigma) \end{align} \]

Sample Statistic (RV)

\[ \hat \mu = \frac{X_1 + \dots + X_n}{n} \]

The sample mean is our estimator (an RV) of the population parameter \(\mu\).

Normal Population

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{RIS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\[ \begin{align} & f(x \mid\mu, \sigma) \end{align} \]

Point Estimate

\[ \begin{align} {\hat \mu} &= \frac{x_1 + \dots + x_n}{n} \\ &= \frac{170 + 192 + \dots + 155}{6} \\ &= 167.7 \end{align} \]

The value 167.7 is the estimate of the population parameter \(\mu\).

More Generally

Population

🧍🧍🏻🧍🏽🧍🏾🧍🏿🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍🏾‍♀️🧍🏿‍♀️🧍‍♂️

🧍‍♂️🧍🏽‍♂️🧍🏾‍♂️🧍🏿‍♂️🧍🏾‍♀️🧍🏿‍♀️🧍🏽‍♂️🧍🏿🧍‍♂️🧍🏽‍♂️🧍🏿🧍🏼‍♀️

🧍🏿‍♂️🧍🧍🏻🧍‍♂️🧍🏽🧍🏾🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️

🧍🏿‍♀️🧍🧍🏻🧍‍♂️🧍🏽🧍🏿‍♂️🧍🏾🧍🏾‍♀️🧍🏾‍♂️🧍‍♀️🧍🏻‍♀️🧍🏽‍♀️

\[\downarrow \text{RIS}\]

Sample

🧍🧍🏾🧍🏼‍♀️🧍🏽‍♀️🧍‍♂️🧍🏽‍♂️

Population Distribution

\[ \begin{align} & f(x \mid\theta) \end{align} \]

Sample Statistic

\[ \hat \Theta = g(X_1, \dots, X_n) \]

Point Estimate

\[ \hat \theta = g(x_1, \dots, x_n) \]
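
The distinction between the estimator \(\hat\Theta\) (a random variable) and the estimate \(\hat\theta\) (a number) can be sketched in R. The Normal population with \(\mu = 168\) and \(\sigma = 10\), the sample size, and the seed are assumptions made for illustration; they are not the data behind the 167.7 shown earlier.

```r
# Estimator: a function g() of the sample (here, the sample mean)
g <- function(x) mean(x)

set.seed(1)  # arbitrary seed
# Hypothetical Normal population (assumed values, not course data)
mu <- 168
sigma <- 10
n <- 6

# One observed sample gives one point estimate: a fixed number
x_obs <- rnorm(n, mean = mu, sd = sigma)
theta_hat <- g(x_obs)

# Repeating the sampling shows that the estimator varies from sample to sample
estimates <- replicate(1000, g(rnorm(n, mean = mu, sd = sigma)))
c(mean = mean(estimates), sd = sd(estimates))  # sd should be near sigma / sqrt(n)
```

The single number `theta_hat` plays the role of \(\hat\theta\), while the spread of `estimates` reflects the sampling distribution of \(\hat\Theta\).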

Maximum Likelihood Estimation

Sir Ronald A. Fisher
  • Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model.

  • The basic idea behind MLE is to find the values of the parameters that maximize the likelihood function, which measures how well the model explains the observed data.

Maximum Likelihood Estimation

  • Maximum Likelihood Estimation is one of several methods used for parameter estimation.

  • In this paradigm, we treat the data as fixed and ask:

Which parameter value makes the observed data most plausible?

  • This plausibility is quantified by the likelihood function.

Big shift

Suppose our model has parameter \(\theta\). The likelihood function is

\[ L(\theta \mid x) = f(x \mid \theta) \]

where \(x\) is the observed data.

Important

Same formula, different point of view:

  • \(f(x\mid \theta)\): \(\leftarrow\) function of \(x\) (\(\theta\) fixed)
  • \(L(\theta \mid x)\): \(\leftarrow\) function of \(\theta\) (\(x\) fixed)
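
A short R sketch of the two points of view, assuming a Binomial model with \(n = 10\) (the specific values \(p = 0.5\) and \(x = 7\) are chosen for illustration). As a function of \(x\) the formula is a PMF and sums to 1; as a function of \(p\) it is not a density and need not integrate to 1.

```r
n <- 10

# f(x | p): p fixed at 0.5, viewed as a function of x -> a PMF that sums to 1
pmf_sum <- sum(dbinom(0:n, size = n, prob = 0.5))

# L(p | x): x fixed at 7, viewed as a function of p -> not a density in p
lik <- function(p) dbinom(7, size = n, prob = p)
lik_area <- integrate(lik, lower = 0, upper = 1)$value

pmf_sum   # 1
lik_area  # 1 / (n + 1), about 0.0909, not 1
```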

Coin Flip example

Coin Flip Example

Exercise 1 Suppose:

  • we flip a coin \(n=10\) times
  • we observe \(x=7\) heads
  • \(p\) = probability of heads

iClicker

iClicker: Coin toss model

What is the appropriate model for Exercise 1?

  1. Binomial(\(n = 10, p\))
  2. Binomial(\(n = 7, p\))
  3. Binomial(\(n = 10, x = 7\))
  4. Bernoulli(\(p\))

Likelihood for the coin toss example

For \(x=7\) successes out of \(n=10\),

\[ L(p \mid x=7) \propto p^7(1-p)^3, \qquad 0 \le p \le 1 \]

More fully,

\[ L(p \mid n=10, x=7) = \binom{10}{7} p^7 (1-p)^3 \]

Fair coin

If we assumed the coin was fair, i.e. \(p=0.5\), the PMF is:

Code
# support for Binomial(n = 10, p)
x <- 0:10
n <- 10

# probability of heads used for this plot
p <- 0.5
probs <- dbinom(x, size = n, prob = p)

barplot(
  probs,
  names.arg = x,
  col = "gray75",
  border = "gray35",
  ylim = c(0, max(probs) * 1.1),
  main = paste0("PMF for binomial with n=10, p=", p),
  xlab = "x",
  ylab = paste0("f(x|n=10,p=", p, ")")
)

Biased Heads

If the coin was biased towards heads, say \(p=0.8\), the PMF is:

Code
p <- 0.8
probs <- dbinom(x, size = n, prob = p)

barplot(
  probs,
  names.arg = x,
  col = "gray75",
  border = "gray35",
  ylim = c(0, max(probs) * 1.1),
  main = paste0("PMF for binomial with n=10, p=", p),
  xlab = "x",
  ylab = paste0("f(x|n=10,p=", p, ")")
)

Biased Tails

If the coin was biased towards tails, say \(p=0.2\), the PMF is:

Code
p <- 0.2
probs <- dbinom(x, size = n, prob = p)

barplot(
  probs,
  names.arg = x,
  col = "gray75",
  border = "gray35",
  ylim = c(0, max(probs) * 1.1),
  main = paste0("PMF for binomial with n=10, p=", p),
  xlab = "x",
  ylab = paste0("f(x|n=10,p=", p, ")")
)

Maximum Likelihood Intuition

iClicker

We observed \(x = 7\) heads. Of the values of \(p\) that we tried in the previous animation, which is most plausible?

  1. \(p = 0.1\)
  2. \(p = 0.3\)
  3. \(p = 0.5\)
  4. \(p = 0.7\)
  5. Something else

Figure 1: Binomial PMF for \(n=10\) evaluated at the MLE, \(\hat p = 0.7\). The highlighted bar is \(\Pr(X=7\mid p=0.7)\), the likelihood of the observed outcome at this parameter value.

Plot the likelihood

Alternatively, we could plot the likelihood

Evaluating \(\Pr(X = 7)\) over a grid of values of \(p\):

  • \(p = 0.1\): \(\Pr(X = 7)\) = 8.748e-06
  • \(p = 0.15\): \(\Pr(X = 7)\) = 0.00012591
  • \(p = 0.2\): \(\Pr(X = 7)\) = 0.00078643
  • \(p = 0.25\): \(\Pr(X = 7)\) = 0.0030899
  • \(p = 0.3\): \(\Pr(X = 7)\) = 0.0090017
  • \(p = 0.35\): \(\Pr(X = 7)\) = 0.021203
  • \(p = 0.4\): \(\Pr(X = 7)\) = 0.042467
  • \(p = 0.45\): \(\Pr(X = 7)\) = 0.074603
  • \(p = 0.5\): \(\Pr(X = 7)\) = 0.11719
  • \(p = 0.55\): \(\Pr(X = 7)\) = 0.16648
  • \(p = 0.6\): \(\Pr(X = 7)\) = 0.21499
  • \(p = 0.65\): \(\Pr(X = 7)\) = 0.25222
  • \(p = 0.7\): \(\Pr(X = 7)\) = 0.26683
  • \(p = 0.75\): \(\Pr(X = 7)\) = 0.25028
  • \(p = 0.8\): \(\Pr(X = 7)\) = 0.20133
  • \(p = 0.85\): \(\Pr(X = 7)\) = 0.12983
  • \(p = 0.9\): \(\Pr(X = 7)\) = 0.057396

Plot the likelihood

Code
p <- seq(0, 1, length.out = 400)
lik <- dbinom(7, size = 10, prob = p)

plot(
  p, lik, type = "l", lwd = 3, bty = "n",
  xlab = "p",
  ylab = "Likelihood",
  main = "Likelihood for x = 7 out of n = 10"
)

Plotting the likelihood for all values of p

Plot the likelihood

Code
p <- seq(0, 1, length.out = 400)
lik <- dbinom(7, size = 10, prob = p)

plot(
  p, lik, type = "l", lwd = 3, bty = "n",
  xlab = "p",
  ylab = "Likelihood",
  main = "Likelihood for x = 7 out of n = 10"
)

abline(v = 0.7, col = 2, lwd = 2, lty = 2)
par(xpd = TRUE)
text(0.7, max(lik), labels = expression(hat(p)==0.7), pos = 3, col = 2, cex = 1.2)

The likelihood function for observing 7 successes out of 10 trials, with the peak indicating the MLE of \(\hat p = 0.7\).

Likelihood

Likelihood (definition)

Let \(f(x \mid \theta)\) be the joint probability (or density) function of random variables \(X\), evaluated at observed values \(x\). The likelihood function is defined as \[\begin{equation} L(\theta\mid x) = f(x \mid \theta) \end{equation}\]

Key idea: We treat the data \(x\) as fixed and view \(L(\theta)\) as a function of \(\theta\).

Likelihood Formula

If \((X_1, \ldots, X_n)\) are iid discrete random variables with probability mass function (PMF) \(p(x \mid \theta)\), then the likelihood function is given by: \[ \begin{align*} L(\theta) &= P(X_1 = x_1, \dots, X_n = x_n) \\ &= \prod_{i=1}^{n} P(X_i = x_i) \quad \text{(by the multiplication rule for independent RVs)}\\ &= \prod_{i=1}^{n} p(x_i \mid \theta) \end{align*} \]

And in the continuous case, if the density is \(f(x \mid\theta)\), then the likelihood function is: \[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]
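
The product formula can be sketched directly in R. The example assumes an Exponential(\(\lambda\)) model and a small hypothetical sample; the data values are illustrative and not from the course. The log-likelihood carries the same information as a sum, which is usually easier to work with.

```r
# Hypothetical observed sample (illustrative values, not course data)
x <- c(0.8, 2.1, 0.3, 1.7, 1.1)

# Likelihood: product of the individual densities, as a function of lambda
lik <- function(lambda) prod(dexp(x, rate = lambda))

# Log-likelihood: the same information written as a sum
loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))

lik(1)          # equals exp(-sum(x)) when lambda = 1
exp(loglik(1))  # same value
```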

Comment about Likelihood

  • Although the likelihood depends on the observed sample values \(x = (x_1, \ldots, x_n)\), it is to be regarded as a function of the parameter \(\theta\).

  • In the discrete case, \(L(\theta; x_1, \ldots, x_n)\) gives the probability of observing \(x = (x_1, \ldots, x_n)\) for a given \(\theta\).

  • Thus, the likelihood function is a statistic, depending on the observed sample \(x = (x_1, \ldots, x_n)\).

MLEs

Maximum likelihood estimators (MLEs)

Maximum likelihood estimators or MLEs are those values of the parameters that maximize the likelihood function with respect to the parameter \(\theta\). That is, \[ \hat{\theta}_{\text{MLE}} = \underset{\theta \in \Theta}{\arg\max} \, L(\theta) \] where \(\Theta\) is the set of possible values of the parameter \(\theta\).

Finding MLE

There are three ways of finding the MLE

  1. Analytically: use calculus to solve for the parameter value(s) that result in a peak.

At a maximum:

\[ \frac{\partial L}{\partial \theta} = 0 \qquad \frac{\partial^2 L}{\partial \theta^2} < 0 \]

  2. Grid search: exhaustive search through parameter space

  • this is computationally expensive, esp. in high dimensions

  3. Numerically: use non-linear optimization (e.g. gradient descent) to iteratively find the peak

In this course, we will only discuss analytic solutions.
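
For comparison with the analytic approach, here is a minimal R sketch of the grid-search and numerical approaches applied to the coin flip data (\(x = 7\), \(n = 10\)). The grid spacing is an arbitrary choice. Note that `optimize()` uses golden-section search with parabolic interpolation rather than gradient descent; it stands in here for numerical optimization in general.

```r
# Log-likelihood for x = 7 heads in n = 10 flips
loglik <- function(p) dbinom(7, size = 10, prob = p, log = TRUE)

# Grid search: evaluate on a fine grid and keep the best value
grid <- seq(0.001, 0.999, by = 0.001)
p_grid <- grid[which.max(loglik(grid))]

# Numerical optimization: optimize() searches the interval for the maximum
p_num <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum

c(grid = p_grid, numerical = p_num)  # both close to 0.7
```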

Procedure for finding the MLE

  1. Define the likelihood function, \(L(\theta)\).
  2. Often it is easier to take the natural logarithm (ln) of \(L(\theta)\).
  3. When applicable, differentiate \(\ln L(\theta)\) with respect to \(\theta\), and then equate the derivative to zero.
  4. Solve for the parameter \(\theta\), and we will obtain \(\hat{\theta}\).
  5. Check that \(\hat{\theta}\) is a maximizer (e.g. the second derivative is negative), and whether it is the global maximizer.
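
As an illustration of these steps, here is a sketch of the procedure applied to the coin flip data (\(x = 7\) heads in \(n = 10\) flips):

\[ \begin{align*} L(p) &= \binom{10}{7} p^7 (1-p)^3 \\ \ln L(p) &= \ln\binom{10}{7} + 7 \ln p + 3 \ln(1-p) \\ \frac{d}{dp} \ln L(p) &= \frac{7}{p} - \frac{3}{1-p} = 0 \;\implies\; 7(1-p) = 3p \;\implies\; \hat p = \frac{7}{10} \\ \frac{d^2}{dp^2} \ln L(p) &= -\frac{7}{p^2} - \frac{3}{(1-p)^2} < 0 \quad \text{for } 0 < p < 1 \end{align*} \]

Since the second derivative is negative everywhere on \((0, 1)\), the log-likelihood is concave and \(\hat p = 0.7\) is the global maximizer.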

Example: Normal Likelihood

Let \(X_1, \ldots, X_n\) be independent and identically distributed random variables following a normal distribution \(N(\mu, \sigma^2)\). Let \(x_1, \ldots, x_n\) be the corresponding sample values. Find the likelihood function.

Recall the pdf of Normal distribution:

\[\begin{equation} f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{{-\frac{(x - \mu)^2}{2\sigma^2}}\right\} \end{equation}\]

Solution
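
The slide leaves the solution to be worked out; one way to write it, using independence and the pdf above, is:

\[ \begin{align*} L(\mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\} \\ &= (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right\} \end{align*} \]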

Example: MLE for geometric distribution

Suppose \(X_1, \ldots, X_n\) is a random sample from a geometric distribution with parameter \(p\), \(0 \leq p \leq 1\). Find the MLE.

Recall the probability mass function (PMF) of a geometric distribution with parameter \(p\), denoted as \(X \sim \text{Geometric}(p)\), is given by: \[ P(X = x) = p(1 - p)^{x-1} \quad \text{for } x = 1, 2, 3, \ldots \]

Solution
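
The slide leaves the solution to be worked out; a sketch following the procedure for finding the MLE, for \(0 < p < 1\), is:

\[ \begin{align*} L(p) &= \prod_{i=1}^{n} p(1-p)^{x_i - 1} = p^n (1-p)^{\sum_{i=1}^{n} x_i - n} \\ \ln L(p) &= n \ln p + \left(\sum_{i=1}^{n} x_i - n\right) \ln(1-p) \\ \frac{d}{dp} \ln L(p) &= \frac{n}{p} - \frac{\sum_{i=1}^{n} x_i - n}{1-p} = 0 \;\implies\; n = p \sum_{i=1}^{n} x_i \;\implies\; \hat p = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \\ \frac{d^2}{dp^2} \ln L(p) &= -\frac{n}{p^2} - \frac{\sum_{i=1}^{n} x_i - n}{(1-p)^2} < 0 \end{align*} \]

The second derivative is negative because every \(x_i \geq 1\), so \(\sum x_i \geq n\); hence \(\hat p = 1/\bar{x}\) is a maximizer.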

Plot the likelihood

The plotted likelihood for simulated data from a geometric distribution with \(p\) = 0.3.

Plot the MLE

For this simulation the MLE is \(\hat p = 0.33\).

Plot the log-likelihood

The plotted log-likelihood for simulated data from a geometric distribution with \(p\) = 0.3.

Plot the MLEs

Because the natural logarithm function is increasing, the maximum value of the likelihood function, if it exists, will occur at the same point as the maximum value of the log-likelihood function.
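
A minimal R sketch of this fact, using simulated geometric data with \(p = 0.3\). The sample size of 50, the seed, and the grid are assumptions, so the resulting estimate will not match the lecture's simulation exactly. Note that R's `rgeom()` and `dgeom()` count failures before the first success (support \(0, 1, 2, \ldots\)), so the data are shifted by 1 to match the PMF \(p(1-p)^{x-1}\).

```r
set.seed(205)  # arbitrary seed; the lecture's simulated data are not reproduced
p_true <- 0.3

# Shift by 1 so the support is 1, 2, 3, ... as in the PMF p(1 - p)^(x - 1)
x <- rgeom(50, prob = p_true) + 1

# Log-likelihood of the shifted geometric model
loglik <- function(p) sum(dgeom(x - 1, prob = p, log = TRUE))

p_grid <- seq(0.01, 0.99, by = 0.001)
ll <- sapply(p_grid, loglik)

# exp() is increasing, so the likelihood and log-likelihood peak at the same p
p_hat_loglik <- p_grid[which.max(ll)]
p_hat_lik    <- p_grid[which.max(exp(ll))]
p_hat_closed <- 1 / mean(x)  # closed-form geometric MLE

c(loglik = p_hat_loglik, lik = p_hat_lik, closed_form = p_hat_closed)
```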

MLEs for multiple parameters

Let \((X_1, \ldots, X_n)\) be a random sample with joint probability mass function (if discrete) or probability density function (if continuous):

\[ L(\theta_1, \dots, \theta_m; x_1, \dots, x_n) = f(x_1, \ldots, x_n; \theta_1, \ldots, \theta_m) \]

where the values of the parameters \((\theta_1, \ldots, \theta_m)\) are unknown and \((x_1, \ldots, x_n)\) are the observed sample values.

Then, the maximum likelihood estimates \((\hat{\theta}_1, \ldots, \hat{\theta}_m)\) are those values of the parameters that maximize the likelihood function, so that:

\[ f(x_1, \ldots, x_n; \hat\theta_1, \ldots, \hat\theta_m) \geq f(x_1, \ldots, x_n; \theta_1, \ldots, \theta_m) \] for all allowable \(\theta_1, \dots, \theta_m\)

MLEs for the gamma distribution

Let \(X_1, \ldots, X_n\) be a random sample from a population with a gamma distribution and shape parameter \(\alpha > 0\) and rate parameter \(\beta > 0\), with PDF given by:

\[ f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0. \]

Find the Maximum Likelihood Estimators (MLEs) for the unknown parameters \(\alpha\) and \(\beta\)

Solution

Given a sample \(X_1, X_2, \dots, X_n\), the likelihood function is:

\[ \mathcal{L}(\alpha, \beta) = \prod_{i=1}^{n} \frac{\beta^\alpha}{\Gamma(\alpha)} X_i^{\alpha-1} e^{-\beta X_i}. \]

Taking the log-likelihood: \(\ell(\alpha, \beta)\) = \(\log[\mathcal{L}(\alpha, \beta)]\) =

\[ n \alpha \log \beta - n \log \Gamma(\alpha) + (\alpha - 1) \sum_{i=1}^{n} \log X_i - \beta \sum_{i=1}^{n} X_i \]

Estimator for \(\beta\) (given \(\alpha\)):

Taking the derivative with respect to \(\beta\) and setting it to zero,

\[ \frac{\partial \ell}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^{n} X_i = 0. \] Solving for \(\beta\),

\[ \hat{\beta} = \frac{n\alpha}{\sum_{i=1}^{n} X_i} = \frac{\alpha}{\bar{X}}, \]

where \(\bar{X}\) is the sample mean.

Estimator for \(\alpha\):

Taking the derivative with respect to \(\alpha\),

\[ \frac{\partial \ell}{\partial \alpha} = n \log \beta - n \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^{n} \log X_i. \]

Substituting \(\beta = \frac{\alpha}{\bar{X}}\),

\[ n \log \alpha - n \log \bar{X} - n \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^{n} \log X_i = 0. \]

This equation does not have a closed-form solution for \(\alpha\), so it is typically solved numerically. A common approach is to use the method of moments to get an initial estimate and then refine it using numerical optimization (e.g., Newton-Raphson). This is beyond the scope of STAT 205.
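
For the curious, a minimal R sketch of the numerical step. It uses simulated data with assumed true values \(\alpha = 3\) and \(\beta = 2\), and solves the equation above (divided by \(n\)) with `uniroot()`, a bracketing root finder, rather than the Newton-Raphson approach mentioned. The bracketing interval is an assumption that happens to contain the root for these data.

```r
set.seed(205)  # arbitrary seed
# Simulated data with assumed true values alpha = 3, beta = 2 (illustrative)
x <- rgamma(500, shape = 3, rate = 2)

# Score equation for alpha after substituting beta = alpha / xbar, divided by n:
#   log(alpha) - log(xbar) - digamma(alpha) + mean(log(x)) = 0
score_alpha <- function(a) log(a) - log(mean(x)) - digamma(a) + mean(log(x))

# uniroot() finds the root numerically on a bracketing interval
alpha_hat <- uniroot(score_alpha, interval = c(0.01, 100), tol = 1e-10)$root
beta_hat  <- alpha_hat / mean(x)

c(alpha_hat = alpha_hat, beta_hat = beta_hat)
```

Here `digamma(a)` is R's built-in \(\Gamma'(\alpha)/\Gamma(\alpha)\).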

Comment on MLEs

  • Maximum likelihood estimation is one of the most versatile methods for fitting parametric statistical models to data.

  • For most cases of practical interest, MLEs perform optimally in large samples: under standard regularity conditions they are consistent and asymptotically efficient.

References

Parts of this lecture were inspired by Myung (2003) and Gribble Lab MLE slides.

Myung, In Jae. 2003. “Tutorial on Maximum Likelihood Estimation.” Journal of Mathematical Psychology 47 (1): 90–100.
Ramachandran, K. M., and C. P. Tsokos. 2020. Mathematical Statistics with Applications in R. Elsevier Science. https://books.google.ca/books?id=t3bLDwAAQBAJ.