Hypergeometric Distribution

Posted by Beetle B. on Thu 18 May 2017

The hypergeometric experiment requires:

  1. The population is finite, with \(N\) individuals.
  2. The outcome of each trial is binary (success or failure).
  3. The population has \(M\) successes.
  4. A sample of size \(n\) is selected without replacement, such that each subset of size \(n\) is equally likely to be selected.

The random variable \(X\) is the number of successes in the sample. It is the counterpart of the binomial distribution for sampling without replacement.

\begin{equation*} h(x;n,M,N)=\frac{\dbinom{M}{x}\dbinom{N-M}{n-x}}{\dbinom{N}{n}} \end{equation*}

for \(\max(0,n-N+M)\le x\le \min(n,M)\). The probabilities sum to 1 by Vandermonde’s identity.
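As a quick numerical check of the pmf and its support, here is a minimal sketch using only Python's standard library. The function name `hypergeom_pmf` and the values \(N=20\), \(M=7\), \(n=5\) are illustrative assumptions, not part of the definition above.

```python
from math import comb

def hypergeom_pmf(x, n, M, N):
    """h(x; n, M, N): probability of x successes in a sample of size n."""
    lo, hi = max(0, n - N + M), min(n, M)
    if x < lo or x > hi:
        return 0.0  # outside the support
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Hypothetical example values
N, M, n = 20, 7, 5
support = range(max(0, n - N + M), min(n, M) + 1)
probs = [hypergeom_pmf(x, n, M, N) for x in support]
total = sum(probs)
print(total)  # should be 1 up to floating-point rounding
```

Using `math.comb` keeps the binomial coefficients exact integers until the final division, which avoids overflow or rounding problems in the intermediate terms.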

Let \(p=M/N\). Then \(E(X)=np\) and \(V(X)=\left(\frac{N-n}{N-1}\right)np(1-p)\).

\(\left(\frac{N-n}{N-1}\right)\) is called the finite population correction factor.

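Note that for \(n>1\) the variance is smaller than that of the binomial distribution with the same \(n\) and \(p\), because the correction factor is then less than 1.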
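The mean and variance formulas can be checked by summing directly over the pmf. The sketch below uses exact rational arithmetic; the helper name `pmf` and the values \(N=20\), \(M=7\), \(n=5\) are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def pmf(x, n, M, N):
    """Exact hypergeometric pmf as a Fraction (x assumed to be in the support)."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# Hypothetical example values
N, M, n = 20, 7, 5
xs = range(max(0, n - N + M), min(n, M) + 1)

# Moments computed directly from the definition
mean = sum(x * pmf(x, n, M, N) for x in xs)
var = sum((x - mean) ** 2 * pmf(x, n, M, N) for x in xs)

# Closed-form expressions
p = Fraction(M, N)
fpc = Fraction(N - n, N - 1)   # finite population correction factor
binom_var = n * p * (1 - p)    # variance of the binomial with the same n and p

print(mean == n * p)           # True
print(var == fpc * binom_var)  # True
print(var < binom_var)         # True, since fpc < 1 when n > 1
```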

We have various symmetries:

  • Swapping the roles of successes and failures: \(h(x;n,M,N)=h(n-x;n,N-M,N)\)
  • Swapping the roles of sampled and not sampled: \(h(x;n,M,N)=h(M-x;N-n,M,N)\)
  • Swapping the roles of population successes and sample drawn: \(h(x;n,M,N)=h(x;M,n,N)\) (I can show this formally, but it is not intuitive.)
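The three symmetries above can be verified exactly for a small case. This is only a sketch: the helper `h` mirrors the argument order of \(h(x;n,M,N)\), and the values \(N=20\), \(M=7\), \(n=5\) are arbitrary assumptions.

```python
from fractions import Fraction
from math import comb

def h(x, n, M, N):
    """Exact hypergeometric pmf h(x; n, M, N) for x in the support."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# Hypothetical example values
N, M, n = 20, 7, 5
support = range(max(0, n - N + M), min(n, M) + 1)

# Successes <-> failures
sym_success_failure = all(h(x, n, M, N) == h(n - x, n, N - M, N) for x in support)
# Sampled <-> not sampled
sym_sampled_unsampled = all(h(x, n, M, N) == h(M - x, N - n, M, N) for x in support)
# Population successes <-> sample size
sym_swap_n_M = all(h(x, n, M, N) == h(x, M, n, N) for x in support)

print(sym_success_failure, sym_sampled_unsampled, sym_swap_n_M)
```

One way to read the third symmetry: the count \(x\) is the size of the overlap between two subsets of the population (the successes and the sample), and the overlap does not depend on which subset is held fixed and which is drawn at random.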

There is also a multivariate hypergeometric distribution, in which each trial has more than two possible outcomes.

  • There are multiple variables. Let \(X_{i}\) be the number of outcomes of type \(i\).
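  • The pmf is \(\frac{\prod_{i=1}^{c}\dbinom{M_{i}}{x_{i}}}{\dbinom{N}{n}}\) where \(c\) is the total number of types, \(M_{i}\) is the number of individuals of type \(i\) in the population, \(\sum_{i}M_{i}=N\) and \(\sum_{i}x_{i}=n\).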
  • \(E(X_{i})=\frac{nM_{i}}{N}\)
  • \(V(X_{i})=\frac{M_{i}}{N}\left(1-\frac{M_{i}}{N}\right)n\frac{N-n}{N-1}\)
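The multivariate formulas can be checked by brute-force enumeration for a small population. In this sketch the function name `mv_pmf`, the type counts \((5,3,2)\) and the sample size \(n=4\) are all assumed for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb, prod

def mv_pmf(xs, Ms):
    """Exact multivariate hypergeometric pmf; n and N are implied by the sums."""
    n, N = sum(xs), sum(Ms)
    return Fraction(prod(comb(M, x) for M, x in zip(Ms, xs)), comb(N, n))

# Hypothetical example: three types with counts 5, 3, 2 and a sample of size 4
Ms = (5, 3, 2)
N, n = sum(Ms), 4

# All count vectors (x_1, ..., x_c) with 0 <= x_i <= M_i and sum equal to n
outcomes = [xs for xs in product(*(range(M + 1) for M in Ms)) if sum(xs) == n]

total = sum(mv_pmf(xs, Ms) for xs in outcomes)
mean_1 = sum(xs[0] * mv_pmf(xs, Ms) for xs in outcomes)
var_1 = sum((xs[0] - mean_1) ** 2 * mv_pmf(xs, Ms) for xs in outcomes)

p1 = Fraction(Ms[0], N)
print(total == 1)
print(mean_1 == n * p1)
print(var_1 == p1 * (1 - p1) * n * Fraction(N - n, N - 1))
```

The enumeration grows quickly with the number of types, so this approach is only meant as a sanity check on small examples.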