Lecture: Sampling from Finite Populations

STAT 205: Introduction to Mathematical Statistics

Dr. Irene Vrbik

University of British Columbia Okanagan

March 15, 2024

Outline

In this lecture we will be covering

Random sample (formal definition)

In many theoretical discussions in statistics and probability theory, we often assume a “random independent sample”:

Random Independent Sample (RIS)

The random variables \(X_1, X_2, \dots, X_n\) are called a random independent sample of size \(n\) from the population \(f(x)\) if \(X_1, \dots, X_n\) are mutually independent RVs and the marginal pdf or pmf of each \(X_i\) is the same function \(f(x)\). Alternatively, \(X_1, \dots, X_n\) are called independent and identically distributed (i.i.d.) RVs with pdf or pmf \(f(x)\).

Source: Casella and Berger (2002)

Random Sample

  • To put it another way, we assume each \(X_i\) is an observation on the same variable and each \(X_i\) has marginal distribution given by \(f(x)\).

  • In addition, we assume mutual independence, meaning the value of one observation has no effect on any other observation.

  • This often equates to taking a simple random sample (SRS) with replacement from an “infinite” population

Simple Random Sampling

Simple Random Sampling Without Replacement (SRSWOR / SRS)

SRSWOR of size \(n\) is the probability sampling design for which a fixed number of \(n\) units are selected from a population of \(N\) units without replacement such that every possible sample of \(n\) units has equal probability of being selected.

Simple Random Sampling With Replacement (SRSWR)

SRSWR is a method of selecting \(n\) units from a population of \(N\) units with replacement such that at each stage of selection, each unit has an equal chance of being selected, i.e., \(1/N\).
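Both designs can be illustrated with R's built-in sample() function; the following is a toy sketch with an assumed population of five labelled units:

```r
# Toy illustration (assumed population of N = 5 labelled units)
population <- c("a", "b", "c", "d", "e")
set.seed(1)

# SRSWOR: a unit can be selected at most once
srswor <- sample(population, size = 3, replace = FALSE)

# SRSWR: units may repeat; at every draw each unit has probability 1/N
srswr <- sample(population, size = 3, replace = TRUE)

srswor  # three distinct units
srswr   # may contain repeats
```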

Background

From a surveying standpoint, it of course makes more sense to sample without replacement.

  • e.g. you might perform a telephone survey of 10,000 people; once a person has been called, they won’t be called again.

When working with very large (“infinite”) populations, the difference between SRSWR and SRSWOR is negligible

  • e.g. the chance of calling the same person twice in the above telephone survey is roughly 1/10,000 = 0.01%

Finite population

  • Unlike infinite populations, “finite” populations \(\{x_1, x_2, \dots, x_N\}\) have a known and fixed size \(N < \infty\).

  • In other words a finite population refers to a specific, countable number of units.

  • When dealing with a finite population, the difference between SRSWR and SRSWOR may not be negligible

  • Hence adjustments may be necessary to account for the finite nature of the population.

SRSWOR from Finite populations

  • When sampling without replacement (SRSWOR) from a finite population \(\{x_1, x_2, \dots, x_N\}\), the value \(x_i\) is not replaced after being selected, and it is therefore impossible to sample \(x_i\) again.

  • As a consequence \(X_1\) and \(X_2\) are NOT independent since:

    \[\begin{align} P(X_1 = x_1) &= \frac{1}{N}\\ P(X_2 = x_1 \mid X_1 = x_1) &= 0\\ P(X_2 = x_j \mid X_1 = x_1) &= \frac{1}{N-1} \quad \text{for } j \neq 1 \end{align}\]
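These probabilities can be checked by simulation; the following is a sketch assuming a population of \(N = 10\) labelled units:

```r
# Monte Carlo sketch of the SRSWOR draw probabilities (assumed N = 10)
set.seed(42)
N <- 10
draws <- replicate(20000, sample(1:N, size = 2, replace = FALSE))
x2_given_x1 <- draws[2, draws[1, ] == 1]  # X2 draws on which X1 = x1

any(x2_given_x1 == 1)    # FALSE: x1 can never be drawn again
mean(x2_given_x1 == 2)   # relative frequency close to 1/(N-1) = 1/9
```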

SRSWR from Finite populations

  • When SRSWR from a finite population \(\{x_1, x_2, \dots, x_N\}\), the value \(x_i\) is replaced after selection such that:

\[\begin{align} P(X_1 = x_1) &= \frac{1}{N}\\ P(X_2 = x_1 \mid X_1 = x_1) &= \frac{1}{N} \end{align}\]

Comment

  • SRSWR from a finite population is the sampling model that forms the basis for the statistical (re)sampling technique known as bootstrapping (see Lecture 7)

  • SRSWOR from a finite population does not satisfy all the conditions of a “random sample”; i.e. \(X_i\) and \(X_j\) are not independent.

Covariance

If \(X\) and \(Y\) are jointly distributed random variables with expectations \(\mu_X\) and \(\mu_Y\) , respectively, the covariance of \(X\) and \(Y\) is

\[ \text{Cov}(X, Y)= \mathbb{E}[(X −\mu_X)(Y −\mu_Y)] \]

provided that the expectation exists.

Variance of a linear combination of random variables

For the variance of a linear combination of random variables we have: \[ \text{Var}(a + \sum_{i=1}^n b_i X_i) = \sum_{i=1}^n \sum_{j=1}^n b_i b_j \text{Cov}(X_i, X_j) \]

It follows that

\[ \text{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \text{Cov}(X_i, X_j) \]

When sampling with replacement, \(X_i\) and \(X_j\) would be independent for \(i\neq j\). Hence we would have \(\text{Cov}(X_i,X_j)=0\), whereas \(\text{Cov}(X_i,X_i)= \text{Var}(X_i) = \sigma^2\) and \[\text{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(X_i) = \frac{\sigma^2}{n} \]
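A quick numerical check of this with-replacement result (a sketch using an assumed toy population of size \(N = 30\)):

```r
# Check Var(Xbar) = sigma^2 / n under sampling WITH replacement
set.seed(99)
N <- 30; n <- 5
pop <- runif(N, 0, 100)
sigma2 <- mean((pop - mean(pop))^2)  # population variance (divide by N)
xbars <- replicate(50000, mean(sample(pop, n, replace = TRUE)))
var(xbars) / (sigma2 / n)            # ratio close to 1
```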


When sampling without replacement, the variance of the sample mean is a little bit more involved.

It can be shown that \(\text{Cov}(X_i , X_j) = -\dfrac{\sigma^2}{N-1}\), which yields the following result …

SRSWOR from a finite population

When taking a simple random sample without replacement (SRSWOR) of size \(n\) from a finite population \(\{x_1, x_2, \dots, x_N\}\), the variance of the sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) is given by \[ \text{Var}(\bar{X}) = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right) = \frac{\sigma^2}{n}\frac{N-n}{N-1} \]

Proof: \[ \begin{align} \text{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) + \frac{1}{n^2}\sum_{i=1}^n \sum_{j\neq i} \text{Cov}(X_i, X_j)\\ &= \frac{\sigma^2}{n} - \frac{1}{n^2}\cdot n (n-1) \frac{\sigma^2}{N-1} = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right) \end{align} \]
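The without-replacement formula can likewise be checked by simulation (again with an assumed toy population; note sample() draws without replacement by default):

```r
# Check Var(Xbar) = sigma^2/n * (N-n)/(N-1) under SRSWOR
set.seed(7)
N <- 30; n <- 5
pop <- runif(N, 0, 100)
sigma2 <- mean((pop - mean(pop))^2)          # population variance (divide by N)
xbars <- replicate(50000, mean(sample(pop, n)))  # replace = FALSE by default
var(xbars) / (sigma2 / n * (N - n) / (N - 1))    # ratio close to 1
```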

Finite Population Correction

Notice that the variance of the sample mean in sampling without replacement differs from that in sampling with replacement by the finite population correction (FPC) \[ \text{FPC} = \left( 1 - {\frac{n - 1}{N - 1}} \right) = \frac{N-n}{N-1} \]
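For later computations it can be handy to wrap this as a small helper; the function name fpc below is ours, not from any package:

```r
# Finite population correction as a function of n and N (helper name is ours)
fpc <- function(n, N) (N - n) / (N - 1)

fpc(7, 393)    # near 1: small sampling fraction
fpc(88, 393)   # well below 1: large sampling fraction
fpc(393, 393)  # 0: a census has no sampling variability
```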

  • FPC will be small when we have large \(n\) relative to \(N\)
  • FPC will be close to 1 when \(n\) is small relative to \(N\)

Comments

  • In practice, surveys often sample a small portion of a large population, and this finite population correction is close to one.

  • In other words, the sampling fraction is frequently very small, in which case the standard error of \(\bar{X}\) is

\[ \sigma_{\bar{X}} \approx \sigma/\sqrt{n} \]

Small FPCs

  • In situations where the sampling fraction \(n/N\) is relatively high, the FPC will be small.

  • As a consequence, the standard errors will decrease and the respective confidence intervals will become narrower to account for this reduction in variability.

  • Notice that the larger the sampling fraction, the more benefit there is (which makes sense …)

  • The benefit becomes noticeable when the sample size is at least 10% of the population.

Hospital Example

Description: The data comprise the numbers of discharges for 393 short-stay hospitals during January 1968.

Format: This data frame contains the following columns:

  • beds - number of beds

  • discharges - number of discharges during January 1968

  • region - unspecified geographical region

Source: Herkson (1976) and used in Rice (2007)

Load data

Code
hospital <- data.frame(
  beds=c(10,16,20,24,25,26,30,34,38,40,43,50,50,57,62,64,67,69,70,73,81,91,96,100,100,103,110,111,116,120,122,127,130,137,142,145,150,154,160,163,170,180,184,192,200,206,214,224,228,233,241,247,252,257,263,268,273,276,284,289,297,303,307,310,322,327,388,347,352,359,365,370,378,390,400,408,418,437,450,467,474,490,498,506,523,534,541,550,558,577,591,606,635,670,719,816,838,918,986,14,18,20,25,25,27,30,35,38,40,47,50,50,59,62,64,67,69,70,74,86,95,98,100,100,104,110,111,118,120,123,128,134,138,143,145,151,155,160,165,175,180,185,193,204,207,214,225,229,235,242,247,254,260,264,269,275,279,285,291,300,304,308,312,324,330,339,347,354,361,366,373,380,391,400,411,419,437,451,469,478,492,500,509,524,536,543,550,562,579,592,613,650,684,760,817,857,936,15,19,20,25,25,28,32,35,39,41,48,50,56,61,63,65,67,70,70,80,88,95,99,100,102,106,110,111,119,120,125,128,135,139,144,145,151,156,160,169,178,181,187,195,204,207,224,227,229,235,244,248,255,261,265,269,275,279,286,295,300,307,309,313,325,330,340,348,357,365,367,374,385,393,400,411,422,438,461,470,479,493,500,510,524,538,543,551,566,583,600,625,652,712,774,829,860,937,15,19,24,25,26,29,32,37,40,43,49,50,57,61,63,65,68,70,70,80,90,96,100,100,102,108,110,113,119,121,126,129,135,141,145,150,152,159,161,170,180,184,188,196,205,210,224,227,231,235,244,252,256,261,268,270,275,282,287,297,302,307,310,318,327,332,340,350,358,365,368,374,386,394,401,417,425,445,463,472,480,496,505,517,530,540,549,556,573,584,606,631,658,712,785,830,904,957), 
  discharges=c(57,35,23,120,92,98,118,66,95,87,141,260,229,247,91,240,255,233,315,200,266,120,228,362,414,518,389,273,440,431,534,535,426,505,322,475,621,486,637,690,665,479,652,625,695,703,918,558,955,684,610,1084,995,974,811,773,956,852,456,233,539,715,935,1226,876,946,1425,1166,1098,1029,1766,1156,896,1373,2190,1719,1040,1632,1705,1665,1258,1390,1657,1527,1232,2034,1376,986,1152,1509,999,1218,1606,1707,2058,1239,2135,1678,2268,64,56,90,36,64,89,79,100,78,121,174,186,194,297,182,225,321,253,233,297,270,243,308,383,265,309,439,498,134,373,416,323,381,592,384,337,376,467,590,360,592,531,635,504,697,814,726,1186,1175,669,601,1028,956,1076,1009,884,1201,754,1007,824,778,1036,1031,912,1232,471,885,906,876,889,1225,787,1009,1389,1219,989,1315,1346,1584,1012,1835,1126,1785,2031,1350,1418,1093,1287,2116,1415,1648,1463,1620,1504,1283,1706,1624,1394,41,42,22,49,103,85,52,75,153,213,81,296,115,308,242,239,215,209,258,297,310,243,346,318,227,311,368,95,345,414,231,557,411,244,453,283,543,778,402,662,446,573,1011,744,670,726,590,410,439,629,858,928,1160,788,609,887,1063,861,941,764,557,1153,985,1016,1049,872,1133,1219,915,966,1453,1137,1150,926,1095,808,1089,1370,1948,1322,1534,1355,1744,2051,1805,1522,1780,1645,1828,1583,2154,2240,2150,1893,2844,2766,2818,1894,76,48,170,14,100,109,125,108,124,210,173,87,220,58,222,231,259,216,244,301,326,377,444,373,371,327,298,594,360,467,427,707,362,355,828,470,538,487,611,689,84,481,713,586,622,670,587,732,931,925,490,810,705,795,1106,951,632,767,1097,842,958,855,1042,944,1210,1400,1097,1173,976,956,1413,1231,1272,1060,1634,1369,1347,1105,1617,1239,1149,1301,1669,834,1420,1386,1547,1478,1789,1326,1785,1684,1376,2089,2171,1715,2700,1765), 
  region=factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4))
)

Histogram

Code
# in base:
par(mfrow=c(1,2))
hist(hospital$beds, main = "", xlab = "Number of beds")
hist(hospital$discharges, breaks = 15, main = "", xlab = "Number of discharges")

A histogram of the population values

Population Parameters

  • The population consists of \(N = 393\) short-stay hospitals.
  • We let \(x_i\) denote the number of patients discharged from the \(i\)th hospital during January 1968.

Population mean:

\[ \begin{align} \mu &= \frac{1}{N}\sum_{i=1}^N x_i \\ &= 814.6030534 \end{align} \]

Population variance:

\[ \begin{align} \sigma^2 &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2 \\ &= 346881.5 \end{align} \]

Exercise 1:

Suppose the population of \(N\) = 393 hospitals (with the population variance computed above) is sampled without replacement with \(n\) = 7.

Code
x = hospital$discharges
N = length(x)   # 393
n = 7
mu = mean(x)
sig2 = sum((x - mu)^2)/N
sig = sqrt(sig2)
fpc = (N-n)/(N-1)
sterr = sig/sqrt(n)*sqrt(fpc)

Since the sampling fraction \(\frac{n}{N}\) = 0.0178117 is small, the FPC makes only a slight difference on \(\sigma_{\bar{X}}\)

\[ \begin{align} \sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n-1}{N-1}}\\ &= \frac{588.9665}{\sqrt{7}}\sqrt{1 - \frac{6}{392}}\\ &= 222.6084 \times 0.9923174\\ &= 220.8982 \end{align} \]

Simulated sample

Code
set.seed(2024)
small_samp <- replicate(1000, mean(sample(x, size = n)))
hist(small_samp, breaks = 20, prob = TRUE, main = "")
fpc = (N-n)/(N-1)
sterr = sqrt(fpc*sig2/n)
sterr_nofpc = sqrt(sig2/n)
curve(dnorm(x, mean = mu, sd = sterr), add = TRUE, lwd = 2, col = 2)
curve(dnorm(x, mean = mu, sd = sterr_nofpc), add = TRUE, col = 4, lty = 2, lwd = 2)
legend("topleft", legend = c("with FPC", "without FPC"), col = c(2,4), lty = c(1,2))

Exercise 2:

Suppose the population of \(N\) = 393 hospitals (with the population variance computed above) is sampled without replacement with \(n\) = 88.

Code
x = hospital$discharges
N = length(x)   # 393
n = 88
mu = mean(x)
sig2 = sum((x - mu)^2)/N
sig = sqrt(sig2)
fpc = (N-n)/(N-1)
sterr = sig/sqrt(n)*sqrt(fpc)
sterr_nofpc = sqrt(sig2/n)

Since the sampling fraction \(\frac{n}{N}\) = 0.2239186 is relatively large, the FPC makes a larger difference on \(\sigma_{\bar{X}}\)

\[ \begin{align} \sigma_{\bar{X}} &= \frac{\sigma}{\sqrt{n}}\sqrt{1 - \frac{n-1}{N-1}}\\ &= \frac{588.9665}{\sqrt{88}}\sqrt{1 - \frac{87}{392}}\\ &= 62.78404 \times 0.8820778\\ &= 55.38041 \end{align} \]

Simulated sample

Code
set.seed(4456456)
small_samp <- replicate(1000, mean(sample(x, size = n)))
hist(small_samp, breaks = 20, prob = TRUE, main = "")
curve(dnorm(x, mean = mu, sd = sterr), lwd = 2, add = TRUE, col = 2)
curve(dnorm(x, mean = mu, sd = sterr_nofpc), lwd = 2, add = TRUE, col = 4, lty =2)
legend("topleft", legend = c("with FPC", "without FPC"), col = c(2,4), lty =c(1,2))

Exercise 3:

In the population of 393 hospitals, a proportion \(p = .654\) had fewer than 1000 discharges.

If this proportion were estimated from a sample of size 32 as the sample proportion \(\hat p\) , the standard error of \(\hat p\) is

\[ \begin{align} \sigma_{\hat p} &= \sqrt{\frac{p(1-p)}{n}}\sqrt{ \left({\frac{N-n}{N - 1}} \right)}\\ &= \sqrt{\frac{0.654 × 0.346}{32}}\sqrt{\frac{393-32}{392}}\\ &= 0.08409147\times 0.9596449 = 0.08 \end{align} \]
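The arithmetic above can be reproduced in R (values taken from the exercise):

```r
# Standard error of a sample proportion under SRSWOR (Exercise 3 values)
p <- 0.654; n <- 32; N <- 393
se_p <- sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
round(se_p, 2)  # 0.08
```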

FPC as a function of \(n\)

With \(N\) = 300

  Sample Size (\(n\))   \(\sqrt{FPC}\)
  1                     1
  10                    0.9848348
  25                    0.9590268
  50                    0.9143962
  100                   0.8178608
  150                   0.7082882
  200                   0.5783149
  250                   0.4089304
  300                   0

With \(N\) = 10,000

  Sample Size (\(n\))   \(\sqrt{FPC}\)
  1                     1
  10                    0.9995499
  25                    0.9987992
  50                    0.9975467
  100                   0.9950372
  500                   0.9747282
  1000                  0.9487307
  5000                  0.7071421
  8000                  0.447236
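Both tables can be regenerated in R (the helper name sqrt_fpc is ours):

```r
# Square-root finite population correction for a vector of sample sizes
sqrt_fpc <- function(n, N) sqrt((N - n) / (N - 1))

round(sqrt_fpc(c(1, 10, 25, 50, 100, 150, 200, 250, 300), N = 300), 7)
round(sqrt_fpc(c(1, 10, 25, 50, 100, 500, 1000, 5000, 8000), N = 10000), 7)
```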

Exercise 4:

Thirty people from a population of 300 were asked how much they had in savings. The sample mean (\(\bar{x}\)) was $1,500, with a sample standard deviation of $89.55. Construct a 95% confidence interval estimate for the population mean.

Using degrees of freedom 29 (= \(n-1\)) and \(t_{\alpha/2}\) = 2.045, a 95% confidence interval for the population mean is:

\[ \bar{x} \pm t_{\alpha/2}\frac{S}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \$1{,}500 \pm 31.776, \] i.e., \(\$1{,}468.22 \leq \mu \leq \$1{,}531.78\).
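The same interval can be computed in R, using qt() for the critical value (values from the exercise):

```r
# 95% CI for the mean savings with the finite population correction
xbar <- 1500; s <- 89.55; n <- 30; N <- 300
tcrit <- qt(0.975, df = n - 1)                       # about 2.045
me <- tcrit * s / sqrt(n) * sqrt((N - n) / (N - 1))  # margin of error
c(lower = xbar - me, upper = xbar + me)              # roughly (1468, 1532)
```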

Estimating Population Variance

Unbiased Sample Variance

For a SRSWOR on a finite population, the sample variance \(\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\) is a biased estimator for the population variance \(\sigma^2\) with \[ \mathbb{E}[\hat\sigma^2] = \sigma^2 \left( \frac{n-1}{n} \right) \left( {\frac{N}{N - 1}} \right) \]

Proof

Comments

Because \(N > n\), it follows with a little algebra that

\[ \frac{n-1}{n} \frac{N}{N-1} < 1 \] and \(\mathbb{E}[\hat\sigma^2] < \sigma^2\). In other words, \(\hat\sigma^2\) tends to underestimate \(\sigma^2\).

As we’ve seen before, an unbiased estimate of \(\sigma^2\) can easily be found …

Unbiased Estimator of Variance

Unbiased estimator for population variance

For SRSWOR from a finite population an unbiased estimator for \(\sigma^2\) is given by:

\[ \frac{N-1}{N} S^2 \] where \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\).

Since \(N\) is usually quite large, this is a rather negligible correction.
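A simulation sketch (with an assumed artificial population) confirming that \(\frac{N-1}{N} S^2\) is unbiased for \(\sigma^2\) under SRSWOR:

```r
# Check E[(N-1)/N * S^2] = sigma^2 for SRSWOR from a finite population
set.seed(205)
N <- 50; n <- 10
pop <- rnorm(N, mean = 10, sd = 3)
sigma2 <- mean((pop - mean(pop))^2)          # population variance (divide by N)
s2 <- replicate(50000, var(sample(pop, n)))  # var() uses the 1/(n-1) divisor
mean((N - 1) / N * s2) / sigma2              # ratio close to 1
```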

References

Ramachandran and Tsokos (2020), Rice (2007), Casella and Berger (2002)

Casella, G., and R. L. Berger. 2002. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning. https://books.google.ca/books?id=0x_vAAAAMAAJ.
Ramachandran, K. M., and C. P. Tsokos. 2020. Mathematical Statistics with Applications in R. Elsevier Science. https://books.google.ca/books?id=t3bLDwAAQBAJ.
Rice, J. A. 2007. Mathematical Statistics and Data Analysis. Advanced Series. Cengage Learning. https://books.google.ca/books?id=KfkYAQAAIAAJ.