Run Length Encoding

Posted by Beetle B. on Sat 06 December 2014

We have a large string (e.g. text file). Usually, ASCII is not the optimal space allocation for the file. There are various schemes for compressing the data.

Run length encoding uses a very basic principle: If you have a long sequence of characters, simply store the character and the number of times it appears. So AAAAAAAAAABBCCCCCDDAAA would be stored as A10B2C5D2A3.

In practice, the actual scheme varies. When looking at data at the bit level, all you have is contiguous 1’s and 0’s. So one scheme would be to split into, say, 4 bit chunks. The first 4 bits is a number that states the number of contiguous 1’s, the next is the number that states how many contiguous 0’s, and so on. Concretely, 1111111111000111110000000 would be stored as 1010 0011 1001 0111 which stands for 10 1’s 3 0’s, 5 1’s, and 7 0’s.

Of course, this scheme suffers if you have more than 15 contiguous 1’s or 0’s, but it’s still manageable. For example, if you have 20 contiguous 1’s, that would be represented as 1111 0000 0101 (15 1’s, 0 0’s, 5 1’s).