STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
March 15, 2024
Two different methods of finding estimators for population parameters have been introduced: the maximum likelihood estimator (MLE) and the method of moments estimator (MME).
Neither MLEs nor moment estimators are, in general, unbiased, and the MME is not always unique.
So how do we go about selecting a “good” estimator among a collection of candidate sample statistics?
This lecture will discuss some desirable properties of estimators in more detail.
In this lecture we will be covering: unbiasedness, mean squared error, relative efficiency, consistency, Fisher information, the Cramér-Rao lower bound, and minimum variance unbiased estimators.
Definition 1: RIS
A random independent sample (RIS) of size \(n\) from a specific distribution is a collection of \(n\) independent RVs \(X_{1}, X_{2}, \dots, X_{n}\), each having this distribution (this is achieved by performing the corresponding experiment, independently, that many times).
Definition 2: Estimator
An estimator of unknown population parameter \(\theta\) is any sample statistic, say \(\hat\theta(X_1, \dots, X_n)\), of the \(n\) observations.
To narrow down our choices, we will first insist that our estimators be unbiased, meaning \[ \mathbb{E}[\hat\theta]= \theta \]
or at least asymptotically unbiased, i.e. \[ \mathbb{E}[\hat\theta] \underset{n \rightarrow \infty}{\rightarrow} \theta \]
Theorem 1: Unbiased Estimator for the mean
The mean of a random sample \(\bar{X}\) is always an unbiased estimator of the population mean \(\mu\) (assuming the mean exists).
Proof: Let \(X_1, \dots, X_n\) be random variables with mean \(\mu\). Then the expected value of the sample mean is
\[ \mathbb{E}[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n\mu = \mu. \]
Hence, \(\bar{X}\) is an unbiased estimator of \(\mu\).
Consider estimating the mean \(\mu\) of a distribution by taking \(\hat \theta = X_1\) (the first observation only).
This is a fully unbiased estimator, yet it is evidently unacceptable, since it wastes nearly all of the available information.
Making an estimator unbiased (or at least asymptotically so) is not enough to make it even acceptable (let alone ‘good’).
Exercise 1:
Let \(X_1, \dots, X_n\) be a random sample from a population with finite mean \(\mu\). Show that the sample mean \(\bar{X}\) and \(\frac{1}{3} \bar{X} + \frac{2}{3} X_1\) are both unbiased estimators of \(\mu\).
We’ve already shown \(\bar{X}\) is unbiased, so \[ \begin{align} \mathbb{E}\left[\frac{1}{3} \bar{X} + \frac{2}{3} X_1\right] &= \frac{1}{3} \mu + \frac{2}{3} \mu = \mu \end{align} \]
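As a quick numerical check (a minimal NumPy simulation sketch, not part of the exercise; the normal population \(\mathcal{N}(5, 2^2)\), sample size \(n = 10\), number of replications, and seed are arbitrary illustrative choices), both estimators centre on \(\mu\), but \(\bar{X}\) is far less variable:

```python
# Simulate many samples and compare the two unbiased estimators of mu.
import numpy as np

rng = np.random.default_rng(205)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
theta1 = xbar                                # X-bar
theta2 = xbar / 3 + 2 * samples[:, 0] / 3    # (1/3) X-bar + (2/3) X_1

print(theta1.mean(), theta2.mean())   # both close to mu = 5 (unbiased)
print(theta1.var(), theta2.var())     # X-bar has a much smaller variance
```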
🤓 In fact, if we have two unbiased estimators, there are infinitely many unbiased estimators. (See Example 5.3.3 in Ramachandran and Tsokos (2020) for proof).
In many cases (as we’ve seen already), it is possible to transform a biased estimator into an unbiased estimator.
This can be achieved by applying a mathematical operation to the biased estimator, typically involving a constant term or a function of the sample size, such that the resulting estimator has zero bias.
For example, under simple random sampling with replacement (SRSWR), \(S^2 = \frac{n}{n-1}\hat\sigma^2 =\dfrac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}\) is an unbiased estimator for \(\sigma^2\).
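A minimal simulation sketch of this correction (the normal population with \(\sigma^2 = 4\), sample size \(n = 5\), replication count, and seed are arbitrary illustrative choices): the "divide by \(n\)" estimator is biased downward by the factor \((n-1)/n\), while \(S^2\) is unbiased.

```python
# Compare the divide-by-n and divide-by-(n-1) variance estimators.
import numpy as np

rng = np.random.default_rng(205)
sigma2, n, reps = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)   # divide by n   (biased)
S2 = samples.var(axis=1, ddof=1)           # divide by n-1 (unbiased)

print(sigma2_hat.mean())   # ~ (n-1)/n * sigma^2 = 3.2
print(S2.mean())           # ~ sigma^2 = 4
```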
Warning
Unbiasedness may not be retained under functional transformations.
That is, if \(\hat\theta\) is an unbiased estimator of \(\theta\), it does not follow that \(f(\hat\theta)\) is an unbiased estimator of \(f(\theta)\).
\(S^2\) is an unbiased estimator of \(\sigma^2\); however, \(S\) is NOT an unbiased estimator of \(\sigma\).
Since \(\dfrac{(n-1)S^{2}}{\sigma ^{2}} \sim \chi _{n-1}^{2}\),
\[\begin{equation*} \mathbb{E}\left( \sqrt{\chi _{n-1}^{2}}\right) =\dfrac{1}{\Gamma (\frac{n-1}{2})\,2^{\frac{n-1}{2}}}\int\limits_{0}^{\infty }\sqrt{x}\cdot x^{\frac{n-3}{2}}e^{-\frac{x}{2}}\,dx=\dfrac{\sqrt{2}\,\Gamma (\frac{n}{2})}{\Gamma (\frac{n-1}{2})} \end{equation*}\] and therefore \[\begin{equation*} \mathbb{E}(S)=\dfrac{\sigma }{\sqrt{n-1}}\cdot \dfrac{\sqrt{2}\,\Gamma (\frac{n}{2})}{\Gamma (\frac{n-1}{2})}. \end{equation*}\] But we know how to fix this and create the following fully unbiased estimator of \(\sigma\): \[\begin{equation*} \sqrt{\frac{n-1}{2}}\,\frac{\Gamma (\frac{n-1}{2})}{\Gamma (\frac{n}{2})}\,S \end{equation*}\]
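A minimal simulation sketch of the bias of \(S\) and of the gamma-function correction above (the normal population with \(\sigma = 2\), sample size \(n = 5\), replication count, and seed are arbitrary illustrative choices; SciPy's log-gamma is used to evaluate the correction factor stably):

```python
# Check that S underestimates sigma, while the corrected estimator does not.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(205)
sigma, n, reps = 2.0, 5, 200_000

samples = rng.normal(0.0, sigma, size=(reps, n))
S = samples.std(axis=1, ddof=1)

# correction factor sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2)
c = np.sqrt((n - 1) / 2) * np.exp(gammaln((n - 1) / 2) - gammaln(n / 2))

print(S.mean())        # noticeably below sigma = 2
print((c * S).mean())  # close to sigma = 2
```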
Being unbiased is only one essential ingredient of a good estimator
The other one is its variance, which we would like to keep as small as possible.
Generally, among candidate estimators (biased or not), we prefer one with both low bias and low variance.
Definition 3: MSE
The mean square error of the estimator \(\hat\theta\), denoted by \(\text{MSE}(\hat\theta)\), is defined as: \[ \text{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] \]
The MSE measures, on average, how close an estimator comes to the true value of the parameter.
Through the following calculations, we will now show that the MSE is a measure that combines both bias and variance…
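Writing \(\hat\theta - \theta = (\hat\theta - \mathbb{E}[\hat\theta]) + (\mathbb{E}[\hat\theta] - \theta)\) and expanding the square (the cross term vanishes since \(\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big] = 0\)):
\[ \begin{align} \text{MSE}(\hat\theta) &= \mathbb{E}\left[\big(\hat\theta - \mathbb{E}[\hat\theta]\big)^2\right] + \big(\mathbb{E}[\hat\theta] - \theta\big)^2 \\ &= \text{Var}(\hat\theta) + \big[\text{Bias}(\hat\theta)\big]^2 \end{align} \]
where \(\text{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta\). In particular, for an unbiased estimator the MSE reduces to the variance.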
Definition 4: Relative Efficiency
Given two estimators, \(\hat\theta_1\) and \(\hat\theta_2\), of a parameter \(\theta\), the efficiency of \(\hat\theta_1\) relative to \(\hat\theta_2\), is defined to be
\[ \text{eff}(\hat\theta_1, \hat\theta_2) = \frac{\text{Var}(\hat\theta_1)}{\text{Var}(\hat\theta_2)} \]
Thus, if the efficiency is smaller than 1, \(\hat\theta_2\) has a larger variance than \(\hat\theta_1\).
This leads to two new definitions …
Definition 5: Consistency
A consistent estimator must have two properties:
\[\mathbb{E}[\hat \theta] \underset{n\rightarrow \infty }{\longrightarrow } \theta\] i.e. be asymptotically unbiased, and \[ \text{Var}(\hat{\theta})\underset{n\rightarrow \infty }{\longrightarrow }0 \] meaning that its variance must tend to zero with increasing sample size.
Consistency implies that we can converge on the exact value of \(\theta\) by indefinitely increasing the sample size.
Nice as it sounds, this represents only the minimal standard (or even less) to be expected of an estimator; some estimators may still be so wasteful as to be unacceptable.
e.g. averaging every second observation (i.e. wasting half of our sample) still yields a consistent (but rather silly) estimator of \(\mu\).
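A minimal simulation sketch of this point (the normal population \(\mathcal{N}(5, 2^2)\), replication count, and seed are arbitrary illustrative choices): both \(\bar{X}\) and the "every second observation" average have variances that shrink toward zero as \(n\) grows, so both are consistent, but the wasteful estimator always has roughly twice the variance.

```python
# Compare the full-sample mean with the mean of every second observation.
import numpy as np

rng = np.random.default_rng(205)
mu, sigma, reps = 5.0, 2.0, 10_000

for n in (10, 100, 1000):
    samples = rng.normal(mu, sigma, size=(reps, n))
    full = samples.mean(axis=1)            # uses all n observations
    half = samples[:, ::2].mean(axis=1)    # uses every second observation
    print(n, full.var(), half.var())       # both -> 0; half is ~ twice as large
```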
For an i.i.d. sample of size (\(n\)), the log likelihood is \[ \ell(\theta) = \sum_{i=1}^{n} \log{f(x_i \mid \theta )} \]
We denote the true value of \(\theta\) by \(\theta_0\).
It can be shown that under reasonable conditions the MLE, \(\hat{\theta}_{MLE}\), is a consistent estimate of \(\theta_0\); that is, \(\hat{\theta}_{MLE}\) converges to \(\theta_0\) in probability as \(n\) approaches infinity.
Exercise 2:
Let \(X_1, X_2, X_3\) be a sample of size \(n=3\) from a distribution with unknown mean \(\mu\), where the variance \(\sigma^2\) is a known positive number. Show that both \(\hat\theta_1 = \bar{X}\) and \(\hat\theta_2 = \frac{1}{8}\left[2X_1 + X_2 + 5X_3\right]\) are unbiased estimators for \(\mu\). Compare the variances of \(\hat\theta_1\) and \(\hat\theta_2\).
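Solution sketch: by linearity of expectation and independence of the \(X_i\),
\[ \begin{align} \mathbb{E}[\hat\theta_2] &= \tfrac{1}{8}(2\mu + \mu + 5\mu) = \mu, & \text{Var}(\hat\theta_1) &= \frac{\sigma^2}{3}, & \text{Var}(\hat\theta_2) &= \frac{2^2 + 1^2 + 5^2}{64}\,\sigma^2 = \frac{15}{32}\,\sigma^2, \end{align} \]
so both are unbiased, but \(\hat\theta_1 = \bar{X}\) has the smaller variance (\(\tfrac{1}{3} < \tfrac{15}{32}\)).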
The unbiased estimator that minimizes the MSE is called the MVUE of \(\theta\).
Definition 6: Minimum variance unbiased estimator (MVUE)
A minimum variance unbiased estimator (MVUE) is an unbiased estimator whose variance is smaller than or equal to the variance of any other unbiased estimator, for all potential values of \(\theta\).
Having such an estimator would of course be ideal, but we run into several difficulties …
Generally the variance of an estimator is a function of \(\theta\), which means we are now comparing functions, not values.
It may easily happen that two unbiased estimators have variances such that one is smaller in some range of \(\theta\) values and bigger in another.
Neither estimator is then (uniformly) better than the other, and the MVUE may therefore not exist.
Even when the MVUE does exist, how do we know that it does, and how do we find it?
To partially answer the second point: luckily, there is a theoretical Cramér-Rao lower bound on the variance of all unbiased estimators.
When an unbiased estimator achieves this bound, it is automatically MVUE (details in an upcoming theorem).
Before stating the theorem, we first give a definition of Fisher information and note that we will be assuming certain “regularity conditions” wherein the parameter \(\theta\) does not affect the boundaries of the distribution’s support.
Definition 7: Regularity Assumptions (Set 1)
(R0). The cdfs are distinct; i.e., \(\theta \neq \theta' \Rightarrow F(x_i; \theta) \neq F(x_i; \theta')\).
(R1). The pdfs have common support for all \(\theta\).
(R2). The point \(\theta_0\) is an interior point in \(\Omega\).
Definition 8: Additional Regularity Assumptions for CRLB (Set 2)
(R3). The pdf \(f(x;\theta)\) is twice differentiable as a function of \(\theta\).
(R4). The integral \(\int f(x;\theta) \, dx\) can be differentiated twice under the integral sign as a function of \(\theta\).
Definition 9: Fisher Information
Consider a random variable \(X\) whose pdf \(f(x; \theta)\) depends on an unknown parameter \(\theta\) which is in a set \(\Omega\). The Fisher Information is defined by:
\[ \begin{align} I(\theta) &= \mathbb{E}\left[ \left(\frac{ \partial \ln f(x; \theta )}{\partial \theta } \right)^2 \right] \end{align} \]
where \(\frac{\partial \ln f(x; \theta )}{\partial \theta}\) is the so-called score function. It can be shown that \(I(\theta)\) can be found from the following expression (which is usually easier to compute):
\[ I(\theta) = - \mathbb{E}\left[ \frac{ \partial^2 \ln f(x; \theta )}{\partial \theta^2 }\right] \]
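A sketch of why the two expressions agree (the regularity assumptions allow us to differentiate \(\int f(x;\theta)\,dx = 1\) twice under the integral sign): differentiating once gives \(\int \frac{\partial \ln f}{\partial \theta}\, f\,dx = 0\), and differentiating this identity again gives
\[ \begin{align} \int \frac{\partial^2 \ln f(x;\theta)}{\partial \theta^2}\, f(x;\theta)\,dx + \int \left(\frac{\partial \ln f(x;\theta)}{\partial \theta}\right)^{2} f(x;\theta)\,dx = 0, \end{align} \]
which is exactly \(\mathbb{E}\left[\left(\frac{\partial \ln f}{\partial \theta}\right)^{2}\right] = -\,\mathbb{E}\left[\frac{\partial^2 \ln f}{\partial \theta^2}\right]\).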
Fisher’s Information quantifies the curvature of the log-likelihood function with respect to the parameter.
It measures the amount of information that an observed data set carries about the parameter being estimated.
Higher values of \(I(\theta)\) indicate that the data provide more information about the parameter.
Exercise 3: Information for a Bernoulli RV
Let \(X\) be a Bernoulli random variable, so \(f(x; p) = p^x(1-p)^{1-x}\), \(x = 0, 1\). Find the Fisher information \(I(p)\).
The first derivative of the log-likelihood function is \[ \begin{align} \ell'(p) &=\frac{\partial}{\partial p} \left[ x \ln p + (1-x) \ln (1- p) \right] \\ &= \frac{x}{p} - \frac{1-x}{1-p} \end{align} \]
The second derivative of the log-likelihood function is
\[ \begin{align} \ell''(p) &= \frac{ \partial}{\partial p } \left[ \frac{x}{p} - \frac{1-x}{1-p} \right] = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \end{align} \]
So the Fisher Information is:
\[ \begin{align} I(p) &= -\mathbb{E}\left[ -\frac{X}{p^2} - \frac{1 - X}{(1-p)^2} \right]\\ &= \frac{\mathbb{E}[X]}{p^2} + \frac{1 - \mathbb{E}[X]}{(1-p)^2} \\ &= \frac{p}{p^2} + \frac{1 - p}{(1-p)^2}\\ &= \frac{1}{p(1-p)} \end{align} \]
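A minimal simulation sketch of this result (the value \(p = 0.3\), replication count, and seed are arbitrary illustrative choices): averaging the squared score over simulated Bernoulli draws should recover \(I(p) = \frac{1}{p(1-p)}\).

```python
# Estimate E[score^2] for Bernoulli(p) and compare with 1/(p(1-p)).
import numpy as np

rng = np.random.default_rng(205)
p, reps = 0.3, 1_000_000

x = rng.binomial(1, p, size=reps)
score = x / p - (1 - x) / (1 - p)   # d/dp log f(x; p)

print((score**2).mean())            # empirical E[score^2], i.e. I(p)
print(1 / (p * (1 - p)))            # theoretical I(p) ~ 4.76
```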
Then the Fisher information in \(X_1, X_2,\dots X_n\) is simply \(n\) times the Fisher information in a single observation.
Continuing with our example, let \(X_1, X_2, \dots, X_n\) be a random sample from a Bernoulli distribution. The Fisher information in the random sample is
\[\begin{align} I_n(p) = nI(p) = \frac{n}{p(1-p)} \end{align}\]
Theorem 2: Cramér-Rao Lower Bound (CRLB)
Suppose that \(X_1, \dots, X_n\) is a random sample from a population having a common density function \(f(x; \theta)\) depending on a parameter \(\theta \in \Omega\). Assume that the regularity conditions \((R0)–(R4)\) hold and let \(\hat\theta\) be an unbiased estimator of \(\theta\). Then its variance must satisfy the inequality: \[\begin{equation} \text{Var}(\hat{\theta})\geq \dfrac{1}{I_n(\theta)} = \dfrac{1}{nI(\theta)}\label{C-R} \end{equation}\] where \(I_n(\theta)\) denotes the Fisher information in the sample, and \(I(\theta)\) denotes the Fisher information in a single observation from \(f(x; \theta)\).
Based on this CRLB we define the so-called efficiency of an unbiased estimator \(\hat\theta\) as the ratio of the theoretical variance bound, denoted CRV, to the actual variance of \(\hat\theta\), thus:
Definition 10: Efficiency
If CRV is the theoretical variance bound given by the CRLB theorem, then the efficiency of an unbiased estimator \(\hat\theta\) is given by:
\[ \text{efficiency} = \dfrac{CRV}{\text{Var}(\hat\theta)} \]
Efficiency cannot be bigger than 1 and is usually expressed in percent. An estimator which reaches 100% efficiency when \(n \rightarrow \infty\) is called asymptotically efficient.
Exercise 4: Bernoulli CRLB
Find the CRLB for estimating \(p\) for a Bernoulli(\(p\)) random variable.
The CRLB for \(p\) is \([nI(p)]^{-1}\): \[\begin{align} \frac{1}{nI(p)} = \frac{1}{I_n(p)}= \frac{p(1-p)}{n} \end{align}\]
Does this seem familiar?
Recall that the MLE for \(p\) was given by \(\hat p = \frac{1}{n} \sum\limits_{i=1}^{n}X_{i} = \bar X\), which is unbiased with \(\text{Var}(\hat p) = \dfrac{\text{Var}(X_1)}{n} = \dfrac{p(1-p)}{n}\).
Therefore \(\hat p\) attains the CRLB and thus is the MVUE for \(p\).
More generally, an unbiased estimator \(\hat \theta\) with \(\text{Var}(\hat\theta) = [nI(\theta)]^{-1}\) is the “best” unbiased estimator.
In other words, an unbiased estimator whose variance attains the lower bound of the Cramér-Rao inequality is the MVUE of \(\theta\).
In many cases, the maximum likelihood estimator (MLE) achieves the CRLB asymptotically.
Theorem 3: Large Sample Properties of the MLE
Let \(X_1, \ldots, X_n\) be a random sample from a distribution whose support does not depend on \(\theta\). Then for large \(n\) the maximum likelihood estimator \(\hat{\theta}\) has approximately a normal distribution with mean \(\theta\) and variance \(\dfrac{1}{nI(\theta)}\).
A proof of this result appears in the appendix to chapter 7 in Devore, Berk, and Carlton (2021).
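A minimal simulation sketch of the theorem for the Bernoulli MLE \(\hat p = \bar X\) (the values \(p = 0.3\), \(n = 200\), replication count, and seed are arbitrary illustrative choices): its sampling distribution should be approximately \(\mathcal{N}\!\big(p, \tfrac{1}{nI(p)}\big) = \mathcal{N}\!\big(p, \tfrac{p(1-p)}{n}\big)\).

```python
# Check the normal approximation to the sampling distribution of p-hat.
import numpy as np

rng = np.random.default_rng(205)
p, n, reps = 0.3, 200, 100_000

p_hat = rng.binomial(n, p, size=reps) / n     # MLE from each simulated sample

print(p_hat.mean(), p)                        # mean ~ p
print(p_hat.std(), np.sqrt(p * (1 - p) / n))  # sd ~ sqrt(p(1-p)/n)

z = (p_hat - p) / np.sqrt(p * (1 - p) / n)
print(np.mean(np.abs(z) < 1.96))              # ~ 0.95 under normality
```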
Exercise 5: Normal Mean
Suppose that \(X_1, \dots, X_n\) is a random sample from a \(\mathcal{N}(\theta, \sigma^2)\) where \(\sigma^2\) is known but \(\theta\) is our unknown parameter of interest. Prove that \(\hat \theta := \bar{X}\) is the MVUE of \(\theta\).
We know that the variance of \(\bar{X}\) is \(\frac{\sigma ^{2}}{n}.\)
The likelihood function for a \(\mathcal{N}(\mu, \sigma^2)\) sample is given by:
\[ \mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \] Hence the log-likelihood is given by
\[ \ell(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]
Find the 2nd partial of the log-likelihood w.r.t. \(\mu\): \[ \begin{align} \frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) & \frac{\partial^2 \ell}{\partial \mu^2} &= -\frac{n}{\sigma^2} \end{align} \]
Calculate the Fisher information: \(I_n(\mu) = -\mathbb{E}\left[\dfrac{\partial^2 \ell}{\partial \mu^2}\right] = \dfrac{n}{\sigma^2}\)
Thus, the CRLB equals \(\dfrac{1}{n/\sigma^{2}}=\dfrac{\sigma^{2}}{n} = \text{Var}(\bar{X})\), implying that \(\bar{X}\) is the MVUE of \(\mu\).
Exercise 6: Exponential Distribution
Suppose that \(X_1, \dots, X_n\) is a random sample from an exponential distribution depending on a scale parameter \(\theta > 0\). Prove that \(\hat\theta = \bar{X}\) is the minimum variance unbiased estimator (MVUE) of \(\theta\).
Since \(X_i \sim \text{Exp}(\theta)\), we know that \[\begin{align} \mathbb{E}(X_i) &= \theta & \text{Var}(X_i) &= \theta^2 \end{align}\] so \(\bar{X}\) is an unbiased estimator of \(\theta\).
Moreover, since the \(X_i\) are independent, if \(\hat \theta = \bar{X}\) then \(\text{Var}(\hat \theta) = \frac{\theta^2}{n}\)
Let’s calculate the Information/CRLB ….
Write the log-likelihood as a function of \(\theta\).
Since the density of Exp(\(\theta\)) is
\[ \begin{equation} f(x;\theta) = \frac{1}{\theta}\exp\{-x/\theta\} \end{equation} \]
The log-likelihood for a single observation is therefore:
\[ \begin{equation} \ell(\theta) = \ln f(x;\theta) = -\ln(\theta) - \frac{x}{\theta} \end{equation} \]
Find the 2nd partial of the log-likelihood w.r.t \(\theta\):
\[ \begin{align} \ell'(\theta) &= \frac{\partial \ln f(x;\theta)}{\partial \theta} = -\frac{1}{\theta} + \frac{x}{\theta^2} \\ \ell''(\theta) &= \frac{\partial^2 \ln f(x;\theta)}{\partial \theta^2} = \frac{1}{\theta^2} - 2 \frac{x}{\theta^3} \\ \end{align} \]
Calculate the Fisher information:
\[ \begin{align} I(\theta) = -\mathbb{E}\left[\frac{1}{\theta^2} - 2 \frac{X}{\theta^3}\right] &= - \frac{1}{\theta^2} + 2 \frac{\theta}{\theta^3} = \frac{1}{\theta^2} \end{align} \]
Find the CRLB:
\[ \begin{align} [nI(\theta)]^{-1} &= \frac{1}{n(1/\theta^2)} = \frac{\theta^2}{n} = \text{Var}(\hat \theta) \end{align} \]
Since \(\hat \theta\) attains the CRLB, we conclude that \(\bar{X}\) must be the minimum variance unbiased estimator (MVUE) of \(\theta\).
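A minimal simulation sketch of this conclusion (the scale \(\theta = 2\), sample size \(n = 20\), replication count, and seed are arbitrary illustrative choices): the sample mean of Exp(\(\theta\)) data should have variance very close to the CRLB \(\theta^2/n\).

```python
# Verify empirically that Var(X-bar) matches the CRLB theta^2 / n.
import numpy as np

rng = np.random.default_rng(205)
theta, n, reps = 2.0, 20, 200_000

samples = rng.exponential(scale=theta, size=(reps, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.mean(), theta)          # unbiased: mean ~ theta
print(theta_hat.var(), theta**2 / n)    # variance ~ CRLB = theta^2/n = 0.2
```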
Comment
Because the bias is 0 for unbiased estimators, it is clear that
\[ \text{MSE}(\hat\theta) = \text{Var}(\hat\theta) \]
where \(\hat\theta\) is unbiased.
While MSE can serve as a good criterion for determining when one estimator is “better” than another, it is often hard to find the \(\hat\theta\) that minimizes the MSE.
For this reason, we simplify the search by looking for unbiased estimators to minimize \(\text{Var}(\hat\theta)\).