STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
March 15, 2024
Two different methods of finding estimators for population parameters have been introduced: the maximum likelihood estimator (MLE) and the method of moments estimator (MME).
Neither MLEs nor moment estimators are, in general, unbiased, and the MME is not always unique.
So how do we go about selecting a “good” estimator among a collection of candidate sample statistics?
This lecture will discuss some desirable properties of estimators in more detail.
In this lecture we will be covering: unbiasedness, mean squared error, relative efficiency, consistency, Fisher information, the Cramér-Rao lower bound, and minimum variance unbiased estimators.
Definition 1: RIS
A random independent sample (RIS) of size \(n\) from a specific distribution is a collection of \(n\) independent RVs \(X_{1}, X_{2}, \dots, X_{n}\), each having this distribution (this is achieved by performing the corresponding experiment, independently, that many times).
Definition 2: Estimator
An estimator of unknown population parameter \(\theta\) is any sample statistic, say \(\hat\theta(X_1, \dots, X_n)\), of the \(n\) observations.
To narrow down our choices, we will first insist that our estimators be unbiased, meaning \[ \mathbb{E}[\hat\theta]= \theta \]
or at least asymptotically unbiased, i.e. \[ \mathbb{E}[\hat\theta] \underset{n \rightarrow \infty}{\rightarrow} \theta \]
Theorem 1: Unbiased Estimator for the mean
The mean of a random sample \(\bar{X}\) is always an unbiased estimator of the population mean \(\mu\) (assuming the mean exists).
Proof: Let \(X_1, \dots, X_n\) be random variables with mean \(\mu\). Then the expected value of the sample mean is
\[ \mathbb{E}[\bar{X}] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n\mu = \mu. \]
Hence, \(\bar{X}\) is an unbiased estimator of \(\mu\).
Consider estimating the mean \(\mu\) of a distribution by taking \(\hat \theta = X_1\) (the first observation only).
This is a fully unbiased estimator, yet it is evidently unacceptable, since it wastes nearly all of the available information.
Making an estimator unbiased (or at least asymptotically so) is not enough to make it even acceptable (let alone ‘good’).
Exercise 1:
Let \(X_1, \dots, X_n\) be a random sample from a population with finite mean \(\mu\). Show that the sample mean \(\bar{X}\) and \(\frac{1}{3} \bar{X} + \frac{2}{3} X_1\) are both unbiased estimators of \(\mu\).
We’ve already shown \(\bar{X}\) is unbiased, so \[ \begin{align} \mathbb{E}\left[\frac{1}{3} \bar{X} + \frac{2}{3} X_1\right] &= \frac{1}{3} \mu + \frac{2}{3} \mu = \mu \end{align} \]
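As a quick numerical check (a minimal NumPy simulation sketch, not part of the exercise; the normal population \(\mathcal{N}(5, 2^2)\), sample size \(n = 10\), number of replications, and seed are arbitrary illustrative choices), both estimators centre on \(\mu\), but \(\bar{X}\) is far less variable:

```python
# Simulate many samples and compare the two unbiased estimators of mu.
import numpy as np

rng = np.random.default_rng(205)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
theta1 = xbar                                # X-bar
theta2 = xbar / 3 + 2 * samples[:, 0] / 3    # (1/3) X-bar + (2/3) X_1

print(theta1.mean(), theta2.mean())   # both close to mu = 5 (unbiased)
print(theta1.var(), theta2.var())     # X-bar has a much smaller variance
```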
🤓 In fact, if we have two unbiased estimators, there are infinitely many unbiased estimators. (See Example 5.3.3 in Ramachandran and Tsokos (2020) for proof).
In many cases (as we’ve seen already), it is possible to transform a biased estimator into an unbiased estimator.
This can be achieved by applying a mathematical operation to the biased estimator, typically involving a constant term or a function of the sample size, such that the resulting estimator has zero bias.
For example, under simple random sampling with replacement (SRSWR), \(S^2 = \frac{n}{n-1}\hat\sigma^2 =\dfrac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}\) is an unbiased estimator for \(\sigma^2\).
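A minimal simulation sketch of this correction (the normal population with \(\sigma^2 = 4\), sample size \(n = 5\), replication count, and seed are arbitrary illustrative choices): the "divide by \(n\)" estimator is biased downward by the factor \((n-1)/n\), while \(S^2\) is unbiased.

```python
# Compare the divide-by-n and divide-by-(n-1) variance estimators.
import numpy as np

rng = np.random.default_rng(205)
sigma2, n, reps = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = samples.var(axis=1, ddof=0)   # divide by n   (biased)
S2 = samples.var(axis=1, ddof=1)           # divide by n-1 (unbiased)

print(sigma2_hat.mean())   # ~ (n-1)/n * sigma^2 = 3.2
print(S2.mean())           # ~ sigma^2 = 4
```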
Warning
Unbiasedness may not be retained under functional transformations.
That is, if \(\hat\theta\) is an unbiased estimator of \(\theta\), it does not follow that \(f(\hat\theta)\) is an unbiased estimator of \(f(\theta)\).
\(S^2\) is an unbiased estimator of \(\sigma^2\); however, \(S\) is NOT an unbiased estimator of \(\sigma\).
Since \(\dfrac{(n-1)S^{2}}{\sigma ^{2}} \sim \chi _{n-1}^{2}\),
\[\begin{equation*} \mathbb{E}\left( \sqrt{\chi _{n-1}^{2}}\right) =\dfrac{1}{\Gamma (\frac{n-1}{2})\,2^{\frac{n-1}{2}}}\int\limits_{0}^{\infty }\sqrt{x}\cdot x^{\frac{n-3}{2}}e^{-\frac{x}{2}}\,dx=\dfrac{\sqrt{2}\,\Gamma (\frac{n}{2})}{\Gamma (\frac{n-1}{2})} \end{equation*}\] and therefore \[\begin{equation*} \mathbb{E}(S)=\dfrac{\sigma }{\sqrt{n-1}}\cdot \dfrac{\sqrt{2}\,\Gamma (\frac{n}{2})}{\Gamma (\frac{n-1}{2})}. \end{equation*}\] But we know how to fix this and create the following fully unbiased estimator of \(\sigma\): \[\begin{equation*} \sqrt{\frac{n-1}{2}}\,\frac{\Gamma (\frac{n-1}{2})}{\Gamma (\frac{n}{2})}\,S \end{equation*}\]
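A minimal simulation sketch of the bias of \(S\) and of the gamma-function correction above (the normal population with \(\sigma = 2\), sample size \(n = 5\), replication count, and seed are arbitrary illustrative choices; SciPy's log-gamma is used to evaluate the correction factor stably):

```python
# Check that S underestimates sigma, while the corrected estimator does not.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(205)
sigma, n, reps = 2.0, 5, 200_000

samples = rng.normal(0.0, sigma, size=(reps, n))
S = samples.std(axis=1, ddof=1)

# correction factor sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2)
c = np.sqrt((n - 1) / 2) * np.exp(gammaln((n - 1) / 2) - gammaln(n / 2))

print(S.mean())        # noticeably below sigma = 2
print((c * S).mean())  # close to sigma = 2
```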
Being unbiased is only one essential ingredient of a good estimator
The other one is its variance, which we would like to keep as small as possible.
Generally, among candidate estimators (biased or not), we prefer one with both low bias and low variance.
Definition 3: MSE
The mean square error of the estimator \(\hat\theta\), denoted by \(\text{MSE}(\hat\theta)\), is defined as: \[ \text{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] \]
The MSE measures, on average, how close an estimator comes to the true value of the parameter.
Through the following calculations, we will now show that the MSE is a measure that combines both bias and variance…
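Writing \(\hat\theta - \theta = (\hat\theta - \mathbb{E}[\hat\theta]) + (\mathbb{E}[\hat\theta] - \theta)\) and expanding the square (the cross term vanishes since \(\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big] = 0\)):
\[ \begin{align} \text{MSE}(\hat\theta) &= \mathbb{E}\left[\big(\hat\theta - \mathbb{E}[\hat\theta]\big)^2\right] + \big(\mathbb{E}[\hat\theta] - \theta\big)^2 \\ &= \text{Var}(\hat\theta) + \big[\text{Bias}(\hat\theta)\big]^2 \end{align} \]
where \(\text{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta\). In particular, for an unbiased estimator the MSE reduces to the variance.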
Definition 4: Relative Efficiency
Given two estimators, \(\hat\theta_1\) and \(\hat\theta_2\), of a parameter \(\theta\), the efficiency of \(\hat\theta_1\) relative to \(\hat\theta_2\), is defined to be
\[ \text{eff}(\hat\theta_1, \hat\theta_2) = \frac{\text{Var}(\hat\theta_1)}{\text{Var}(\hat\theta_2)} \]
Thus, if the efficiency is smaller than 1, \(\hat\theta_2\) has a larger variance than \(\hat\theta_1\).
This leads to two new definitions …
Definition 5: Consistency
A consistent estimator must have two properties:
\[\mathbb{E}[\hat \theta] \underset{n\rightarrow \infty }{\longrightarrow } \theta\] i.e. be asymptotically unbiased, and \[ \text{Var}(\hat{\theta})\underset{n\rightarrow \infty }{\longrightarrow }0 \] meaning that its variance must tend to zero with increasing sample size.
Consistency implies that we can converge on the exact value of \(\theta\) by indefinitely increasing the sample size.
Nice as it sounds, this represents only the minimal standard (or even less) to be expected of an estimator; some estimators may still be so wasteful as to be unacceptable.
e.g. averaging every second observation (i.e. wasting half of our sample) still yields a consistent (but rather silly) estimator of \(\mu\).
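A minimal simulation sketch of this point (the normal population \(\mathcal{N}(5, 2^2)\), replication count, and seed are arbitrary illustrative choices): both \(\bar{X}\) and the "every second observation" average have variances that shrink toward zero as \(n\) grows, so both are consistent, but the wasteful estimator always has roughly twice the variance.

```python
# Compare the full-sample mean with the mean of every second observation.
import numpy as np

rng = np.random.default_rng(205)
mu, sigma, reps = 5.0, 2.0, 10_000

for n in (10, 100, 1000):
    samples = rng.normal(mu, sigma, size=(reps, n))
    full = samples.mean(axis=1)            # uses all n observations
    half = samples[:, ::2].mean(axis=1)    # uses every second observation
    print(n, full.var(), half.var())       # both -> 0; half is ~ twice as large
```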
For an i.i.d. sample of size (\(n\)), the log likelihood is \[ \ell(\theta) = \sum_{i=1}^{n} \log{f(x_i \mid \theta )} \]
We denote the true value of \(\theta\) by \(\theta_0\).
It can be shown that under reasonable conditions the MLE, \(\hat{\theta}_{MLE}\), is a consistent estimate of \(\theta_0\); that is, \(\hat{\theta}_{MLE}\) converges to \(\theta_0\) in probability as \(n\) approaches infinity.
Exercise 2:
Let \(X_1, X_2, X_3\) be a sample of size \(n=3\) from a distribution with unknown mean \(\mu\), where the variance \(\sigma^2\) is a known positive number. Show that both \(\hat\theta_1 = \bar{X}\) and \(\hat\theta_2 = \frac{1}{8}\left[2X_1 + X_2 + 5X_3\right]\) are unbiased estimators for \(\mu\). Compare the variances of \(\hat\theta_1\) and \(\hat\theta_2\).
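Solution sketch: by linearity of expectation and independence of the \(X_i\),
\[ \begin{align} \mathbb{E}[\hat\theta_2] &= \tfrac{1}{8}(2\mu + \mu + 5\mu) = \mu, & \text{Var}(\hat\theta_1) &= \frac{\sigma^2}{3}, & \text{Var}(\hat\theta_2) &= \frac{2^2 + 1^2 + 5^2}{64}\,\sigma^2 = \frac{15}{32}\,\sigma^2, \end{align} \]
so both are unbiased, but \(\hat\theta_1 = \bar{X}\) has the smaller variance (\(\tfrac{1}{3} < \tfrac{15}{32}\)).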
The unbiased estimator that minimizes the MSE is called the MVUE of \(\theta\).
Definition 6: Minimum variance unbiased estimator (MVUE)
A minimum variance unbiased estimator (MVUE) is an unbiased estimator whose variance is smaller than or equal to the variance of any other unbiased estimator, for all potential values of \(\theta\).
Having such an estimator would of course be ideal, but we run into several difficulties …
Generally the variance of an estimator is a function of \(\theta\), which means we are now comparing functions, not values.
It may easily happen that two unbiased estimators have variances such that one is smaller in some range of \(\theta\) values and bigger in another.
Neither estimator is then (uniformly) better than the other, and the MVUE may therefore not exist.
Even when the MVUE does exist, how do we know that it does, and how do we find it?
To partially answer the second point: luckily, there is a theoretical Cramér-Rao lower bound on the variance of all unbiased estimators.
When an unbiased estimator achieves this bound, it is automatically MVUE (details in an upcoming theorem).
Before stating the theorem, we first give a definition of Fisher information and note that we will be assuming certain “regularity conditions” wherein the parameter \(\theta\) does not affect the boundaries of the distribution’s support.
Definition 7: Regularity Assumptions (Set 1)
(R0). The cdfs are distinct; i.e., \(\theta \neq \theta' \Rightarrow F(x_i; \theta) \neq F(x_i; \theta')\).
(R1). The pdfs have common support for all \(\theta\).
(R2). The point \(\theta_0\) is an interior point in \(\Omega\).
Definition 8: Additional Regularity Assumptions for CRLB (Set 2)
(R3). The pdf \(f(x;\theta)\) is twice differentiable as a function of \(\theta\).
(R4). The integral \(\int f(x;\theta) \, dx\) can be differentiated twice under the integral sign as a function of \(\theta\).
Definition 9: Fisher Information
Consider a random variable \(X\) whose pdf \(f(x; \theta)\) depends on an unknown parameter \(\theta\) which is in a set \(\Omega\). The Fisher Information is defined by:
\[ \begin{align} I(\theta) &= \mathbb{E}\left[ \left(\frac{ \partial \ln f(x; \theta )}{\partial \theta } \right)^2 \right] \end{align} \]
where \(\frac{\partial \ln f(x; \theta )}{\partial \theta}\) is the so-called score function. It can be shown that \(I(\theta)\) can be found from the following expression (which is usually easier to compute):
\[ I(\theta) = - \mathbb{E}\left[ \frac{ \partial^2 \ln f(x; \theta )}{\partial \theta^2 }\right] \]
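A sketch of why the two expressions agree (the regularity assumptions allow us to differentiate \(\int f(x;\theta)\,dx = 1\) twice under the integral sign): differentiating once gives \(\int \frac{\partial \ln f}{\partial \theta}\, f\,dx = 0\), and differentiating this identity again gives
\[ \begin{align} \int \frac{\partial^2 \ln f(x;\theta)}{\partial \theta^2}\, f(x;\theta)\,dx + \int \left(\frac{\partial \ln f(x;\theta)}{\partial \theta}\right)^{2} f(x;\theta)\,dx = 0, \end{align} \]
which is exactly \(\mathbb{E}\left[\left(\frac{\partial \ln f}{\partial \theta}\right)^{2}\right] = -\,\mathbb{E}\left[\frac{\partial^2 \ln f}{\partial \theta^2}\right]\).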
Fisher’s Information quantifies the curvature of the log-likelihood function with respect to the parameter.
It measures the amount of information that an observed data set carries about the parameter being estimated.
Higher values of \(I(\theta)\) indicate that the data provide more information about the parameter.
Exercise 3: Information for a Bernoulli RV
Let \(X\) be a Bernoulli random variable, so \(f(x; p) = p^x(1-p)^{1-x}\), \(x = 0, 1\). Find the Fisher information \(I(p)\).
The first derivative of the log-likelihood function is \[ \begin{align} \ell'(p) &=\frac{\partial}{\partial p} \left[ x \ln p + (1-x) \ln (1- p) \right] \\ &= \frac{x}{p} - \frac{1-x}{1-p} \end{align} \]
The second derivative of the log-likelihood function is
\[ \begin{align} \ell''(p) &= \frac{ \partial}{\partial p } \left[ \frac{x}{p} - \frac{1-x}{1-p} \right] = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \end{align} \]
So the Fisher Information is:
\[ \begin{align} I(p) &= -\mathbb{E}\left[ -\frac{X}{p^2} - \frac{1 - X}{(1-p)^2} \right]\\ &= \frac{\mathbb{E}[X]}{p^2} + \frac{1 - \mathbb{E}[X]}{(1-p)^2} \\ &= \frac{p}{p^2} + \frac{1 - p}{(1-p)^2}\\ &= \frac{1}{p(1-p)} \end{align} \]
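A minimal simulation sketch of this result (the value \(p = 0.3\), replication count, and seed are arbitrary illustrative choices): averaging the squared score over simulated Bernoulli draws should recover \(I(p) = \frac{1}{p(1-p)}\).

```python
# Estimate E[score^2] for Bernoulli(p) and compare with 1/(p(1-p)).
import numpy as np

rng = np.random.default_rng(205)
p, reps = 0.3, 1_000_000

x = rng.binomial(1, p, size=reps)
score = x / p - (1 - x) / (1 - p)   # d/dp log f(x; p)

print((score**2).mean())            # empirical E[score^2], i.e. I(p)
print(1 / (p * (1 - p)))            # theoretical I(p) ~ 4.76
```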
Then the Fisher information in \(X_1, X_2,\dots X_n\) is simply \(n\) times the Fisher information in a single observation.
Continuing with our example, let \(X_1, X_2, \dots, X_n\) be a random sample from a Bernoulli distribution. The Fisher information in the random sample is
\[\begin{align} I_n(p) = nI(p) = \frac{n}{p(1-p)} \end{align}\]
Theorem 2: Cramér-Rao Lower Bound (CRLB)
Suppose that \(X_1, \dots, X_n\) is a random sample from a population having a common density function \(f(x; \theta)\) depending on a parameter \(\theta \in \Omega\). Assume that the regularity conditions \((R0)–(R4)\) hold and let \(\hat\theta\) be an unbiased estimator of \(\theta\). Then its variance must satisfy the inequality: \[\begin{equation} \text{Var}(\hat{\theta})\geq \dfrac{1}{I_n(\theta)} = \dfrac{1}{nI(\theta)}\label{C-R} \end{equation}\] where \(I_n(\theta)\) denotes the Fisher information in the sample, and \(I(\theta)\) denotes the Fisher information in a single observation from \(f(x; \theta)\).
Based on this CRLB we define the so-called efficiency of an unbiased estimator \(\hat\theta\) as the ratio of the theoretical variance bound, denoted CRV, to the actual variance of \(\hat\theta\), thus:
Definition 10: Efficiency
If CRV is the theoretical variance bound given by the CRLB theorem, then the efficiency of an unbiased estimator \(\hat\theta\) is given by:
\[ \text{efficiency} = \dfrac{CRV}{\text{Var}(\hat\theta)} \]
Efficiency cannot be bigger than 1 and is usually expressed in percent. An estimator which reaches 100% efficiency when \(n \rightarrow \infty\) is called asymptotically efficient.
Exercise 4: Bernoulli CRLB
Find the CRLB for estimating \(p\) for a Bernoulli(\(p\)) random variable.
The CRLB for \(p\) is \([nI(p)]^{-1}\): \[\begin{align} \frac{1}{nI(p)} = \frac{1}{I_n(p)}= \frac{p(1-p)}{n} \end{align}\]
Does this seem familiar?
Recall that the MLE for \(p\) was given by \(\hat p = \frac{1}{n} \sum\limits_{i=1}^{n}X_{i} = \bar X\), which is unbiased with \(\text{Var}(\hat p) = \dfrac{\text{Var}(X_1)}{n} = \dfrac{p(1-p)}{n}\).
Therefore \(\hat p\) attains the CRLB and thus is the MVUE for \(p\).
More generally, an unbiased estimator \(\hat \theta\) with \(\text{Var}(\hat\theta) = [nI(\theta)]^{-1}\) is the “best” unbiased estimator.
In other words, an unbiased estimator whose variance attains the lower bound of the Cramér-Rao inequality is the MVUE of \(\theta\).
In many cases, the maximum likelihood estimator (MLE) achieves the CRLB asymptotically.
Theorem 3: Large Sample Properties of the MLE
Let \(X_1, \ldots, X_n\) be a random sample from a distribution whose support does not depend on \(\theta\). Then for large \(n\) the maximum likelihood estimator \(\hat{\theta}\) has approximately a normal distribution with mean \(\theta\) and variance \(\dfrac{1}{nI(\theta)}\).
A proof of this result appears in the appendix to chapter 7 in Devore, Berk, and Carlton (2021).
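A minimal simulation sketch of the theorem for the Bernoulli MLE \(\hat p = \bar X\) (the values \(p = 0.3\), \(n = 200\), replication count, and seed are arbitrary illustrative choices): its sampling distribution should be approximately \(\mathcal{N}\!\big(p, \tfrac{1}{nI(p)}\big) = \mathcal{N}\!\big(p, \tfrac{p(1-p)}{n}\big)\).

```python
# Check the normal approximation to the sampling distribution of p-hat.
import numpy as np

rng = np.random.default_rng(205)
p, n, reps = 0.3, 200, 100_000

p_hat = rng.binomial(n, p, size=reps) / n     # MLE from each simulated sample

print(p_hat.mean(), p)                        # mean ~ p
print(p_hat.std(), np.sqrt(p * (1 - p) / n))  # sd ~ sqrt(p(1-p)/n)

z = (p_hat - p) / np.sqrt(p * (1 - p) / n)
print(np.mean(np.abs(z) < 1.96))              # ~ 0.95 under normality
```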
Exercise 5: Normal Mean
Suppose that \(X_1, \dots, X_n\) is a random sample from a \(\mathcal{N}(\theta, \sigma^2)\) where \(\sigma^2\) is known but \(\theta\) is our unknown parameter of interest. Prove that \(\hat \theta := \bar{X}\) is the MVUE of \(\theta\).
We know that the variance of \(\bar{X}\) is \(\frac{\sigma ^{2}}{n}.\)
The likelihood function for a \(\mathcal{N}(\mu, \sigma^2)\) sample is given by:
\[ \mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \] Hence the log-likelihood is given by
\[ \ell(\mu, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]
Find the 2nd partial of the log-likelihood w.r.t. \(\mu\): \[ \begin{align} \frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) & \frac{\partial^2 \ell}{\partial \mu^2} &= -\frac{n}{\sigma^2} \end{align} \]
Calculate the Fisher information: \(I_n(\mu) = -\mathbb{E}\left[\dfrac{\partial^2 \ell}{\partial \mu^2}\right] = \dfrac{n}{\sigma^2}\)
Thus, the CRLB equals \(\dfrac{1}{n/\sigma^{2}}=\dfrac{\sigma^{2}}{n} = \text{Var}(\bar{X})\), implying that \(\bar{X}\) is the MVUE of \(\mu\).
Exercise 6: Exponential Distribution
Suppose that \(X_1, \dots, X_n\) is a random sample from an exponential distribution depending on a scale parameter \(\theta > 0\). Prove that \(\hat\theta = \bar{X}\) is the minimum variance unbiased estimator (MVUE) of \(\theta\).
Since \(X_i \sim \text{Exp}(\theta)\), we know that \[\begin{align} \mathbb{E}(X_i) &= \theta & \text{Var}(X_i) &= \theta^2 \end{align}\] so \(\bar{X}\) is an unbiased estimator of \(\theta\).
Moreover, since the \(X_i\) are independent, if \(\hat \theta = \bar{X}\) then \(\text{Var}(\hat \theta) = \frac{\theta^2}{n}\)
Let’s calculate the Information/CRLB ….
Write the log-likelihood as a function of \(\theta\).
Since the density of Exp(\(\theta\)) is
\[ \begin{equation} f(x;\theta) = \frac{1}{\theta}\exp\{-x/\theta\} \end{equation} \]
The log-likelihood for a single observation is therefore:
\[ \begin{equation} \ell(\theta) = \ln f(x;\theta) = -\ln(\theta) - \frac{x}{\theta} \end{equation} \]
Find the 2nd partial of the log-likelihood w.r.t \(\theta\):
\[ \begin{align} \ell'(\theta) &= \frac{\partial \ln f(x;\theta)}{\partial \theta} = -\frac{1}{\theta} + \frac{x}{\theta^2} \\ \ell''(\theta) &= \frac{\partial^2 \ln f(x;\theta)}{\partial \theta^2} = \frac{1}{\theta^2} - 2 \frac{x}{\theta^3} \\ \end{align} \]
Calculate the Fisher information:
\[ \begin{align} I(\theta) = -\mathbb{E}\left[\frac{1}{\theta^2} - 2 \frac{X}{\theta^3}\right] &= - \frac{1}{\theta^2} + 2 \frac{\theta}{\theta^3} = \frac{1}{\theta^2} \end{align} \]
Find the CRLB:
\[ \begin{align} [nI(\theta)]^{-1} &= \frac{1}{n(1/\theta^2)} = \frac{\theta^2}{n} = \text{Var}(\hat \theta) \end{align} \]
Since \(\hat \theta\) attains the CRLB, we conclude that \(\bar{X}\) must be the minimum variance unbiased estimator (MVUE) of \(\theta\).
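A minimal simulation sketch of this conclusion (the scale \(\theta = 2\), sample size \(n = 20\), replication count, and seed are arbitrary illustrative choices): the sample mean of Exp(\(\theta\)) data should have variance very close to the CRLB \(\theta^2/n\).

```python
# Verify empirically that Var(X-bar) matches the CRLB theta^2 / n.
import numpy as np

rng = np.random.default_rng(205)
theta, n, reps = 2.0, 20, 200_000

samples = rng.exponential(scale=theta, size=(reps, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.mean(), theta)          # unbiased: mean ~ theta
print(theta_hat.var(), theta**2 / n)    # variance ~ CRLB = theta^2/n = 0.2
```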
Comment
Because the bias is 0 for unbiased estimators, it is clear that
\[ \text{MSE}(\hat\theta) = \text{Var}(\hat\theta) \]
where \(\hat\theta\) is unbiased.
While MSE can serve as a good criterion for determining when one estimator is “better” than another, it is often hard to find the \(\hat\theta\) that minimizes the MSE.
For this reason, we simplify the search by looking for unbiased estimators to minimize \(\text{Var}(\hat\theta)\).