In many theoretical discussions in statistics and probability theory, we often assume a “random independent sample”:
Random Independent Sample (RIS)
The random variables \(X_1, X_2, \dots, X_n\) are called a random independent sample of size \(n\) from the population \(f(x)\) if \(X_1, \dots, X_n\) are mutually independent RVs and the marginal pdf or pmf of each \(X_i\) is the same function \(f(x)\). Alternatively, \(X_1, \dots, X_n\) are called independent and identically distributed (i.i.d.) RVs with pdf or pmf \(f(x)\).
To put it another way, we assume each \(X_i\) is an observation on the same variable and each \(X_i\) has marginal distribution given by \(f(x)\).
In addition, we assume mutual independence, meaning that the value of one observation has no effect on any other observation.
This often equates to taking a simple random sample with replacement (SRSWR) from an “infinite” population.
Simple Random Sampling
Simple Random Sampling Without Replacement (SRSWOR / SRS)
SRSWOR of size \(n\) is the probability sampling design for which a fixed number of \(n\) units are selected from a population of \(N\) units without replacement such that every possible sample of \(n\) units has equal probability of being selected.
Simple Random Sampling With Replacement (SRSWR)
SRSWR is a method of selecting \(n\) units from a population of \(N\) units with replacement such that at each stage of selection, each unit has an equal chance of being selected, i.e., \(1/N\).
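As a rough illustration, both designs can be sketched with base R's `sample()`. The population size `N`, sample size `n`, and seed below are made-up values chosen only for the example.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
N <- 20       # hypothetical population size
n <- 8        # hypothetical sample size

# SRSWOR: each unit can appear at most once in the sample
srswor <- sample(seq_len(N), size = n, replace = FALSE)

# SRSWR: every draw picks each unit with probability 1/N, so repeats can occur
srswr <- sample(seq_len(N), size = n, replace = TRUE)

# Number of distinct units in each sample
length(unique(srswor))  # always n
length(unique(srswr))   # at most n
```

Under SRSWOR the sample always contains \(n\) distinct units, while under SRSWR the same unit may be drawn more than once.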
Background
From the surveying standpoint, it of course makes more sense to sample without replacement.
e.g. you might perform a telephone survey of 10,000 people; once a person has been called, they won’t be called again.
When working with very large (infinite) populations, the difference between SRSWR and SRSWOR is negligible.
e.g. the chance of calling the same person in the above telephone survey is 1/10,000 = 0.01%
Finite population
Unlike infinite populations, “finite” populations \(\{x_1, x_2, \dots, x_N\}\) have a known and fixed size \(N < \infty\).
In other words, a finite population refers to a specific, countable number of units.
When dealing with a finite population, the difference between SRSWR and SRSWOR may not be negligible.
Hence adjustments may be necessary to account for the finite nature of the population.
SRSWOR from Finite populations
Under SRSWOR from a finite population \(\{x_1, x_2, \dots, x_N\}\), the value \(x_i\) is not replaced after being selected, so it is impossible to sample \(x_i\) again.
As a consequence \(X_1\) and \(X_2\) are NOT independent since:
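A sketch of the argument, assuming for simplicity that the \(N\) population values are distinct: each draw is marginally equally likely to be any unit, but once \(x_i\) has been drawn it cannot be drawn again, so
\[
P(X_2 = x_i) = \frac{1}{N} \quad \text{but} \quad P(X_2 = x_i \mid X_1 = x_i) = 0
\]
The conditional and marginal distributions of \(X_2\) differ, so \(X_1\) and \(X_2\) cannot be independent.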
SRSWR from a finite population is the sampling model that forms the basis for the statistical (re)sampling technique known as bootstrapping (see Lecture 7).
SRSWOR from a finite population does not satisfy all the conditions of a “random sample”; i.e. \(X_i\) and \(X_j\) are not independent.
Covariance
If \(X\) and \(Y\) are jointly distributed random variables with expectations \(\mu_X\) and \(\mu_Y\), respectively, the covariance of \(X\) and \(Y\) is \[
\text{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right]
\]
Variance of a linear combination of random variables.
For the variance of a linear combination of random variables we have: \[
\text{Var}(a + \sum_{i=1}^n b_i X_i) = \sum_{i=1}^n \sum_{j=1}^n b_i b_j \text{Cov}(X_i, X_j)
\]
When sampling with replacement, \(X_i\) and \(X_j\) would be independent for \(i\neq j\). Hence we would have \(\text{Cov}(X_i,X_j)=0\), whereas \(\text{Cov}(X_i,X_i)= \text{Var}(X_i) = \sigma^2\), and \[\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{\sigma^2}{n}
\]
When sampling without replacement, the variance of the sample mean is a little bit more involved.
It can be shown that \(\text{Cov}(X_i, X_j) = -\dfrac{\sigma^2}{N-1}\) for \(i \neq j\), which yields the following result …
SRSWOR from a finite population
When taking a simple random sample without replacement (SRSWOR) of size \(n\) from a finite population \(\{x_1, x_2, \dots, x_N\}\), the variance of the sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) is given by \[
\text{Var}(\bar{X}) = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right) = \frac{\sigma^2}{n}\frac{N-n}{N-1}
\]
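One way to see where this comes from, sketched using the covariance \(-\sigma^2/(N-1)\) quoted above and the variance formula for a linear combination with \(a = 0\) and \(b_i = 1/n\): the double sum contains \(n\) variance terms and \(n(n-1)\) covariance terms, so
\[
\text{Var}(\bar{X}) = \frac{1}{n^2}\left[ n\sigma^2 - n(n-1)\frac{\sigma^2}{N-1} \right] = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right)
\]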
Notice that the variance of the sample mean in sampling without replacement differs from that in sampling with replacement by the finite population correction (FPC) \[
\text{FPC} = \left( 1 - {\frac{n - 1}{N - 1}} \right) = \frac{N-n}{N-1}
\]
FPC will be small when we have large \(n\) relative to \(N\)
FPC will be close to 1 when \(n\) is small relative to \(N\)
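The variance formula can be checked numerically on a toy population by enumerating every possible SRSWOR sample. This is only a sketch: the population values are invented, and \(\sigma^2\) is taken to be the population variance with divisor \(N\).

```r
pop <- c(2, 4, 4, 7, 9, 10)      # hypothetical finite population
N <- length(pop)
n <- 3

mu     <- mean(pop)
sigma2 <- mean((pop - mu)^2)     # population variance (divisor N)

# Sample mean of each of the choose(N, n) equally likely SRSWOR samples
xbar <- combn(pop, n, FUN = mean)

var_enumerated <- mean((xbar - mu)^2)   # exact variance of the sample mean
fpc            <- (N - n) / (N - 1)
var_formula    <- sigma2 / n * fpc

# Compare the enumerated variance with sigma^2 / n * FPC
all.equal(var_enumerated, var_formula)
```

Because every sample is listed, the comparison involves no simulation error; changing `n` toward `N` shows the FPC shrinking toward zero.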
Comments
In practice, surveys often sample a small portion of a large population, and this finite population correction is close to one.
In other words, the sampling fraction \(n/N\) is frequently very small, in which case the standard error of \(\bar{X}\) is
\[
\sigma_{\bar{X}} \approx \sigma/\sqrt{n}
\]
Small FPCs
In situations where the sampling fraction \(n/N\) is relatively high, the FPC will be small.
As a consequence, the standard errors will decrease and the respective confidence intervals will become narrower to account for this reduction in variability.
Notice that the larger the sampling fraction the more benefit there is (which makes sense …)
The benefit becomes noticeable when the sample size is at least 10% of the population.
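For a rough sense of scale, take a hypothetical population of \(N = 1000\) and a sample of \(n = 100\), i.e. a 10% sampling fraction (both numbers are made up for illustration):
\[
\text{FPC} = \frac{1000 - 100}{1000 - 1} \approx 0.901, \qquad \sqrt{\text{FPC}} \approx 0.949
\]
so the standard error is about 5% smaller than \(\sigma/\sqrt{n}\).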
Hospital Example
Description: The data comprise the numbers of discharges for 393 short-stay hospitals during January 1968.
Format: This data frame contains the following columns:
beds - number of beds
discharges - number of discharges during January 1968
# in base:
par(mfrow = c(1, 2))
hist(hospital$beds, main = "", xlab = "Number of beds")
hist(hospital$discharges, breaks = 15, main = "", xlab = "Number of discharges")
A histogram of the population values
Population Parameters
The population consists of \(N = 393\) short-stay hospitals.
We let \(x_i\) denote the number of patients discharged from the \(i\)th hospital during January 1968.
Thirty people from a population of 300 were asked how much they had in savings. The sample mean (\(\bar{x}\)) was $1,500, with a sample standard deviation of $89.55. Construct a 95% confidence interval estimate for the population mean.
Using \(n - 1 = 29\) degrees of freedom and \(t_{\alpha/2} = 2.045\), a 95% confidence interval for the population mean is:
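A sketch of the calculation, assuming the sample standard deviation \(s = 89.55\) is plugged in for \(\sigma\) and the FPC \((N-n)/(N-1)\) is applied with \(N = 300\) and \(n = 30\):
\[
\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = 1500 \pm 2.045 \cdot \frac{89.55}{\sqrt{30}}\sqrt{\frac{270}{299}} \approx 1500 \pm 31.77
\]
that is, roughly ($1,468.23, $1,531.77). Without the FPC the margin would be about 33.43, so the correction narrows the interval slightly.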
For a SRSWOR on a finite population, the sample variance \(\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\) is a biased estimator for the population variance \(\sigma^2\), with \[
\mathbb{E}[\hat\sigma^2] = \sigma^2
\left(
\frac{n-1}{n}
\right)
\left( {\frac{N}{N - 1}} \right)
\]
Because \(N > n\), it follows with a little algebra that
\[
\frac{n-1}{n} \frac{N}{N-1} < 1
\] and \(\mathbb{E}[\hat\sigma^2] < \sigma^2\). In other words, \(\hat\sigma^2\) tends to underestimate \(\sigma^2\).
As we’ve seen before, an unbiased estimate of \(\sigma^2\) can easily be found …
Unbiased Estimator of Variance
Unbiased estimator for population variance
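For SRSWOR from a finite population, an unbiased estimator for \(\sigma^2\) is obtained by dividing \(\hat\sigma^2\) by the bias factor above: \[
\frac{n}{n-1}\,\frac{N-1}{N}\,\hat\sigma^2
\]
Since \(N\) is usually quite large, the factor \((N-1)/N\) is close to 1, so that part of the correction is rather negligible.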
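The bias factor and its correction can be checked exactly on a toy population by averaging the estimator over every possible SRSWOR sample. This is a sketch under stated assumptions: the population values are invented, \(\hat\sigma^2\) is taken with divisor \(n\), and \(\sigma^2\) with divisor \(N\).

```r
pop <- c(2, 4, 4, 7, 9, 10)      # hypothetical finite population
N <- length(pop)
n <- 3
sigma2 <- mean((pop - mean(pop))^2)   # population variance (divisor N)

# Divisor-n sample variance for a single sample
sigma2_hat <- function(x) mean((x - mean(x))^2)

# Averaging over all equally likely SRSWOR samples gives the exact expectation
est          <- combn(pop, n, FUN = sigma2_hat)
expected_hat <- mean(est)

bias_factor <- (n - 1) / n * N / (N - 1)

# Expectation of the divisor-n estimator versus sigma^2 times the bias factor
all.equal(expected_hat, sigma2 * bias_factor)

# Dividing by the bias factor removes the bias
all.equal(expected_hat / bias_factor, sigma2)
```

With a population this small the factor \((N-1)/N\) is clearly visible; for large \(N\) it is close to 1.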
References
Ramachandran and Tsokos (2020), Rice (2007), Casella and Berger (2002)
Casella, G., and R. L. Berger. 2002. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning. https://books.google.ca/books?id=0x_vAAAAMAAJ.