Huffman Encoding

Posted by Beetle B. on Sat 06 December 2014

The problem with run-length encoding was that our chunk size (e.g. 4 bits) is fixed. This is suboptimal.

With Huffman encoding, we can use variable-length bit sequences to represent each character. However, we need a prefix-free encoding: no codeword may be a prefix of another. Suppose we represent A with 11, D with 110, F with 01, and T with 0. Since 11 (A) is a prefix of 110 (D), we have an ambiguity: if we encounter 1100, is this DT, or ATT? A prefix-free encoding ensures such an ambiguity cannot arise.

We can represent any prefix-free encoding with a binary trie. All the characters are stored in leaves. The left branch represents a 0, and the right represents a 1.

As an example:

[Figure: an example Huffman encoding trie]

So while decoding, if we encounter 011, we know it represents R.

This trie is created for all letters/symbols of the alphabet.
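
Here is a minimal decoding sketch in Python. The Node class and the tiny hand-built trie (with codes T = 0, A = 10, R = 11) are hypothetical stand-ins for illustration, not the trie from the figure above.

```python
class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char = char      # set only for leaves
        self.left = left      # 0 branch
        self.right = right    # 1 branch

    def is_leaf(self):
        return self.left is None and self.right is None

# Hand-built example trie: T = 0, A = 10, R = 11 (illustration only)
root = Node(left=Node('T'), right=Node(left=Node('A'), right=Node('R')))

def decode(root, bits):
    out = []
    node = root
    for bit in bits:
        # follow the 0/1 branches until we land on a leaf
        node = node.left if bit == '0' else node.right
        if node.is_leaf():
            out.append(node.char)
            node = root   # restart at the root for the next character
    return ''.join(out)

print(decode(root, "01011"))  # -> "TAR"
```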

This is for decoding. How do you encode? A simple way is to create a symbol table keyed by character, with the values being the bit strings.

Again, note that for a given character, the size of the bitstring is variable.
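
Here is a minimal sketch of building that symbol table by walking the trie and then using it to encode. It reuses the hypothetical Node class and example trie from the decoding sketch above.

```python
def build_code_table(node, prefix="", table=None):
    if table is None:
        table = {}
    if node.is_leaf():
        table[node.char] = prefix            # path from root = codeword
    else:
        build_code_table(node.left, prefix + "0", table)
        build_code_table(node.right, prefix + "1", table)
    return table

def encode(table, text):
    return ''.join(table[ch] for ch in text)

table = build_code_table(root)   # e.g. {'T': '0', 'A': '10', 'R': '11'}
print(encode(table, "RAT"))      # -> "11100"
```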

Constructing the Prefix-Free Encoding

  • Scan through the whole text and note down the frequency counts for each character.
  • Construct a node for each character: Make each of them a single-node trie. Give them a weight equal to the frequency.
  • Select the two tries with minimum weights, and merge them into a single trie whose weight is the sum of the two. By merging, I mean form a new root node, put one trie as its left child, and the other trie as its right child.
  • Repeat until only one trie remains (see the sketch after this list).
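
Below is a minimal sketch of this construction using a binary heap as the priority queue of tries. It assumes the hypothetical Node class from the decoding sketch above; the input string is just an example.

```python
import heapq
from collections import Counter
from itertools import count

def build_trie(text):
    freq = Counter(text)              # frequency count for each character
    tiebreak = count()                # heapq can't compare Node objects
    heap = [(w, next(tiebreak), Node(ch)) for ch, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two tries with minimum weight
        w2, _, t2 = heapq.heappop(heap)
        merged = Node(left=t1, right=t2)  # new root; weight is the sum
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

huffman_root = build_trie("ABRACADABRA")
codes = build_code_table(huffman_root)    # table from the encoding sketch
```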

Properties

  • Each bitstring encodes a fixed-length piece of text (usually a single character).
  • The bitstring used to encode a given character is of variable length.
  • The cost to build this encoding is \(N + R\log R\), where \(N\) is the length of the text and \(R\) is the alphabet size.
  • The compression is optimal: no prefix-free code that encodes one character at a time uses fewer bits.
  • You need the whole text before you begin encoding, so it’s not a good idea for live communications.
    • However, if we use the algorithm primarily to encode English text, we could just use a well known trie that utilizes the expected frequency counts of each letter.
  • You need to communicate the prefix-free encoding. It is different for each text you compress, and the decompression side needs the trie you used to encode it (one common approach is sketched below).
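
One common way to transmit the trie (a sketch under that assumption, not necessarily the scheme used in the course) is to serialize it with a preorder traversal: emit a marker for a leaf followed by its character, and a different marker for an internal node. This reuses the hypothetical Node class and the huffman_root built in the construction sketch above.

```python
def write_trie(node, out):
    if node.is_leaf():
        out.append('1')          # leaf marker, then the character
        out.append(node.char)
    else:
        out.append('0')          # internal node marker
        write_trie(node.left, out)
        write_trie(node.right, out)

def read_trie(symbols):
    if next(symbols) == '1':
        return Node(next(symbols))
    left = read_trie(symbols)
    right = read_trie(symbols)
    return Node(left=left, right=right)

serialized = []
write_trie(huffman_root, serialized)
rebuilt = read_trie(iter(serialized))   # decompression side rebuilds the trie
```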