Hypergeometric Distribution

Posted by Beetle B. on Thu 18 May 2017

The hypergeometric experiment requires:

1. The population is finite, with $$N$$ individuals.
2. The outcome of each trial is binary (success or failure).
3. The population has $$M$$ successes.
4. A sample of size $$n$$ is selected without replacement, such that every subset of size $$n$$ is equally likely to be selected.

The random variable $$X$$ is the number of successes in the sample. It is essentially the binomial distribution, but with the sample drawn without replacement.

\begin{equation*} h(x;n,M,N)=\frac{\dbinom{M}{x}\dbinom{N-M}{n-x}}{\dbinom{N}{n}} \end{equation*}

for $$\max(0,n-N+M)\le x\le \min(n,M)$$. The probabilities sum to 1 by Vandermonde’s identity, $$\sum_{x}\dbinom{M}{x}\dbinom{N-M}{n-x}=\dbinom{N}{n}$$.
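As a quick sanity check, here is a minimal Python sketch of the pmf. The function name `hypergeom_pmf` and the example values of $$N$$, $$M$$, and $$n$$ are my own choices for illustration. It uses exact fractions so that the sum over the support can be compared to 1 without rounding error.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, M, N):
    """h(x; n, M, N): probability of x successes in a sample of size n
    drawn without replacement from N individuals, M of them successes."""
    lo, hi = max(0, n - N + M), min(n, M)
    if x < lo or x > hi:
        return Fraction(0)  # outside the support
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


# Illustrative values (assumed, not from any data set).
N, M, n = 20, 7, 5
support = range(max(0, n - N + M), min(n, M) + 1)
total = sum(hypergeom_pmf(x, n, M, N) for x in support)
print(total)  # exact sum over the support
```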

Let $$p=M/N$$. Then $$E(X)=np$$ and $$V(X)=\left(\frac{N-n}{N-1}\right)np(1-p)$$.

$$\left(\frac{N-n}{N-1}\right)$$ is called the finite population correction factor.

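Note that for $$n>1$$ the variance is smaller than that of the binomial distribution with the same $$n$$ and $$p$$, because the correction factor is then less than 1.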
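The mean and variance formulas can be checked numerically by computing the moments directly from the pmf. This is only a sketch: the helper names `pmf` and `moments` and the parameter values are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def pmf(x, n, M, N):
    """Hypergeometric pmf h(x; n, M, N) for x inside the support."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


def moments(n, M, N):
    """Mean and variance computed directly from the pmf."""
    support = range(max(0, n - N + M), min(n, M) + 1)
    mean = sum(x * pmf(x, n, M, N) for x in support)
    var = sum((x - mean) ** 2 * pmf(x, n, M, N) for x in support)
    return mean, var


# Illustrative values (assumed).
N, M, n = 20, 7, 5
p = Fraction(M, N)
mean, var = moments(n, M, N)

fpc = Fraction(N - n, N - 1)      # finite population correction factor
binomial_var = n * p * (1 - p)    # variance of Binomial(n, p)

print(mean, n * p)                # both are 7/4
print(var, fpc * binomial_var)    # both are 273/304
print(var < binomial_var)         # True, since fpc < 1 here
```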

We have various symmetries:

• Swapping the roles of successes and failures: $$h(x;n,M,N)=h(n-x;n,N-M,N)$$
• Swapping the roles of sampled and not sampled: $$h(x;n,M,N)=h(M-x;N-n,M,N)$$
• Swapping the roles of population successes and sample drawn: $$h(x;n,M,N)=h(x;M,n,N)$$ (I can show this formally, but it is not intuitive.)
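The three symmetries above can be verified by brute force on small populations. The sketch below is not a proof, only a check over every parameter combination with $$N\le 8$$. The function names `h` and `symmetries_hold` are hypothetical.

```python
from fractions import Fraction
from math import comb


def h(x, n, M, N):
    """Hypergeometric pmf h(x; n, M, N), zero outside the support."""
    if x < max(0, n - N + M) or x > min(n, M):
        return Fraction(0)
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


def symmetries_hold(n, M, N):
    """Check all three symmetries for every x from 0 to n."""
    for x in range(n + 1):
        base = h(x, n, M, N)
        if base != h(n - x, n, N - M, N):   # successes <-> failures
            return False
        if base != h(M - x, N - n, M, N):   # sampled <-> not sampled
            return False
        if base != h(x, M, n, N):           # roles of M and n swapped
            return False
    return True


all_ok = all(
    symmetries_hold(n, M, N)
    for N in range(1, 9)
    for M in range(N + 1)
    for n in range(N + 1)
)
print(all_ok)
```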

There is a multivariate hypergeometric distribution where you have more than two choices for the outcome of each trial.

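• There are multiple variables. Let $$X_{i}$$ be the number of outcomes of type $$i$$ in the sample.
• The pmf is $$\frac{\prod_{i=1}^{c}\dbinom{M_{i}}{x_{i}}}{\dbinom{N}{n}}$$ where $$c$$ is the total number of types, $$M_{i}$$ is the number of individuals of type $$i$$ in the population (so $$\sum_{i}M_{i}=N$$), and $$\sum_{i}x_{i}=n$$.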
• $$E(X_{i})=\frac{nM_{i}}{N}$$
• $$V(X_{i})=\frac{M_{i}}{N}\left(1-\frac{M_{i}}{N}\right)n\frac{N-n}{N-1}$$
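A small sketch of the multivariate case, under assumed values: a population with three types of sizes 5, 3, and 2, and a sample of size 4. The names `multi_hypergeom_pmf` and `outcomes` are hypothetical. It enumerates every possible sample composition and compares the moments of $$X_{1}$$ with the formulas above.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod


def multi_hypergeom_pmf(xs, Ms):
    """P(X_1 = xs[0], ..., X_c = xs[c-1]) with type counts Ms in the population."""
    n, N = sum(xs), sum(Ms)
    return Fraction(prod(comb(M, x) for M, x in zip(Ms, xs)), comb(N, n))


def outcomes(n, Ms):
    """All count vectors with 0 <= x_i <= M_i that sum to n."""
    return [xs for xs in product(*(range(M + 1) for M in Ms)) if sum(xs) == n]


# Illustrative values (assumed).
Ms, n = (5, 3, 2), 4
N = sum(Ms)
table = {xs: multi_hypergeom_pmf(xs, Ms) for xs in outcomes(n, Ms)}

total = sum(table.values())
mean_1 = sum(xs[0] * pr for xs, pr in table.items())
var_1 = sum((xs[0] - mean_1) ** 2 * pr for xs, pr in table.items())

p1 = Fraction(Ms[0], N)
print(total)                                               # 1
print(mean_1, n * p1)                                      # both are 2
print(var_1, p1 * (1 - p1) * n * Fraction(N - n, N - 1))   # both are 2/3
```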