Due to patent issues with newer schemes , Huffman Coding remains to be the go-to compression algorithm of the day, being applied in ZIP, JPEG, PNG… and much more.
The core idea of Huffman Coding, is to use shorter codes to represent more frequent characters. As for why we always see a mysterious tree of zeros and ones, well, that’s essentially how computers store anything: with binary codes. Essentially the goal of constructing the Huffman binary tree is: to assign the most frequent characters to the shortest zero/one path, walking from tree root to end of a branch.
How to create such a tree?
That is where Huffman meanders in.
Say we have 5 numbers as characters:
- 1 appearing 1 time
- 2 appearing 2 times
- 3 appearing 3 times
- 4 appearing 4 times
- 5 appearing 5 times
Ideally we’d want 5 appearing on higher positions in the tree, because then its path (zeros and ones representing the character) would be shorter. To construct this tree in a most efficient way so that the more frequent characters would have shorter paths, first we can combine least frequent characters into a new branch, to get them out of the way:
Now we have:
- a new branch A with an appearing frequency of 3
- character 3 with an appearing frequency of 3
- character 4
- character 5.
To make sure 5 ended up closest to tree root, we can yet again, combine the least frequent characters (or branches) into a new branch:
In this case, combining new branch A with a frequency of 3 with character 3, resulting in a new branch B with a frequency of 6. Now we’re left with:
- Branch B, appearing with frequency 6
- character 4, frequency 4
- character 5, frequency 5
Repeating the process of combining the 2 least frequent character/branch into a new branch:
Now we only have two branches left:
- Branch B with an appearing frequency of 6
- Branch C with an appearing frequency of 9
Combining the two, we finally get a tree:
Now comes the easy part. To get the code representing a character, we only need to trace the tree from Root to the Character. In this case, character 3 would be represented by the path/code 01, character 2 would be 001.
Now that we get a table of character to code:
We can encode the text “122333444455555” with the shortest zero/one combinations. Well done Huffman.