We often want to store an object for quick retrieval, or see if a given object has been observed before in \(O(1)\) time. Let \(U\) be the universe of such objects, and assume we are going to sample a set \(S\) from it. Usually, \(|S|<<|U|\) (e.g. the universe could be the set of all phone numbers).
A hash function is a function that maps \(U\) to \(\{0,1,\dots,N\}\). The idea is that we maintain an array of size roughly equal to our sample set, apply the hash function to an object, and place the object in the array at the index our hash function dictates.
Collisions
A collision occurs if two distinct objects hash to the same value. There are two ways to handle this.
- Chaining Via Linked Lists: Let each entry in the array be a pointer to
a linked list. If we have a collision, we add to the front of the
linked list. When testing to see if an object is seen, we’ll have to
traverse the linked list if collisions had occurred. There is some
overhead to this, so avoid if memory is limited.
- Under uniform hashing, the number of entries in each linked list, with high probability, is within a constant factor of \(N/M\). This implies that, with high probability, search is \(O(1)\).
- Two Probe Hashing: We have two hash functions. We insert the key into the shorter of the two chains. The expected length of the longest chain becomes \(\log\log N\).
- Open Addressing: In open addressing, we allow only one object per
bucket. When a collision occurs, we use some form of probing to
determine where to place it:
- Linear Probing: If a collision occurs, we simply place in the next
empty slot.
- This is like the car parking problem: Assume a straight street has \(M\) parking spots. If we can’t park right in front of the shop we want to be, then we park on the next adjacent one. Knuth showed that with \(\frac{1}{2}M\) cars, the average displacement is \(\frac{3}{2}M\). With \(M\) cars, the average displacement is \(\sqrt{\frac{\pi M}{8}}\) (not sure what this means).
- Under linear probing, the average number of probes in linear
probing of size \(M\) with \(N=\alpha M\) keys is:
- \(\frac{1}{2}\left(1+\frac{1}{1-\alpha}\right)\) for a hit.
- \(\frac{1}{2}\left(1+\frac{1}{\left(1-\alpha\right)^{2}}\right)\)
- for a miss.
- We typically pick \(\alpha=\frac{1}{2}\)
- Double Hashing: We have two hash functions. The first points into the array. The second is used if we have a collision to decide what offset to look at. For example, if the first hash function dictates position 10, and the second hash function gives 6, then if we have a collision, we’ll check index 16. If we still have a collision, we’ll check index 22, etc. This eliminates clustering around a given position.
- Cuckoo Hashing: See the Wikipedia page for details.
- Linear Probing: If a collision occurs, we simply place in the next
empty slot.
Deletion in open addressing is nontrivial.
Qualities of Good Hash Functions
- We’d like \(O(1)\) evaluation for all objects in \(U\). So we cannot have array lookups as part of the evaluation. Nor should it depend heavily on, say, the length of the string (if \(U\) is the set of all strings).
- Relatively Uniform Binning: This is to minimize the amount of collisions, and to ensure all bins are utilized (e.g if a hash function can never return 7, then we’re wasting space).
- Some people desire \(N\) to be a prime of roughly the size of the sample set. Not too close to some power of 2, and not too close to a power of 10.
Load Factor
The load factor \(\alpha\) is the number of items being stored divided by the number of buckets.
For chaining, \(\alpha\ge 1\). We want it as close to 1 as possible.
For open addressing, \(\alpha\le 1\) We want it to be much less than 1.
Random Notes
Collision Properties
Assume the array has \(M\) entries and we use open chaining.
- It takes on average \(\sqrt{\frac{\pi M}{2}}\) entries before a collision occurs.
- After roughly \(M\ln M\) entries, every “bin” is expected to have at least one entry.
- After \(M\) insertions, the most loaded bin has \(\Theta\left(\frac{\log M}{\log\log M}\right)\) entries in its linked list.
The last 2 properties imply that collisions are roughly uniform: No “bin” is preferred.
Attacks
Given any hash function, there always exists some pathological data set that causes everything to fall into one or very few buckets. To see this, use the pigeon hole principle: There are \(N\) buckets, and \(|U|\) objects. At least one bucket will have more than \(|U|/N\) entries (note that usually this is a huge number). Let your data set be those entries.
This has serious implications. If your service is using a hash function, and an attacker knows which hash function you’re using, he/she can devise a dataset to bring your service down.
To mitigate this:
- Use a cryptographic hash function. Even if it’s known, it is nontrivial to design a dataset that will cause it to explode.
- Randomly pick from a family of hash functions.
Universal Hashing
A collection of functions \(H\) is universal if and only if for every \(h\in H\), and for every pair of distinct objects \(x,y\), the probability of a collision is bounded above by \(1/n\) where \(n\) is the size of your array.
Theorem: If you have a universal set of hashing functions, all operations take \(O(1)\) on average with chaining if \(|S|=O(n)\).
Proof: Consider the worst case scenario of not finding an object. Each array is a linked list. Let the size of the linked list at a given location be \(L\). Note that \(L\) is actually a random variable. Conveniently, \(O(1)+L=L\) is also the expense of not finding an object in the array (evaluate hash function and traverse the linked list). So the problem boils down to calculating the expected value of \(L\).
The usual strategy: Define \(L\) as the sum of indicator random variables, and utilize the linearity of expectation.
Fix \(x\) to be an object. Define \(Z_{y}\) to be an indicator random variable: It is 1 if there is a collision, 0 otherwise. Then \(L=\sum_{y\in S}Z_{y}\). Now \(E[Z_{y}]\) is the probability that a collision occurs, which is bounded above by \(1/n\). Summing over all \(S\), we get \(E[L]\le|S|/n\). But this is simply \(O(1)\) as \(|S|=O(n)\).