Problem
Input: alphabet and the frequencies at which each letter occurs.
Output: a prefix-free (no two codes share the same code sequence) code made up of corresponding to each character, where .
Solution space is given by the set of possible encodings for the alphabet. We are trying to minimize the cost of the encoding by considering the different frequencies in the alphabet. This is captured by the following objective function:
Optimizing over binary trees
This is related to creating an Optimal static binary search tree.
Assume we have a coding for the alphabet. We can construct a binary tree where the left branch corresponds to a and the right branch corresponds to a . Keep adding nodes for every digit. The coding is prefix-free, so the leaves of this tree correspond to unique characters in the alphabet.
Therefore, we can reframe the problem as such: given the frequencies of each character , we want to construct a binary tree such that we minimize the following objective function:
This is because the depth of a character in the tree corresponds to the length of the encoding for it. To optimize the tree, it should be “full,” meaning all internal nodes have 2 children (otherwise, we can replace the leaf with its parent and still have a prefix-free coding).
Two properties about the optimal solutions
Here, we break our proof of correctness down into two parts. We want to prove the following two facts:
- Any full binary tree with leaves must have 2 leaves such that they are siblings of the same parent at the deepest level.
- Suppose is a full binary tree that minimizes . Then, there must exist a such that and (the frequencies of two characters) are the smallest and their corresponding leaves appear as siblings in the deepest node. Furthermore, .
Todo
I did not take notes for the proofs. Oops
We can construct a by swapping the two nodes at the deepest level with and , which are guaranteed to exist by fact 1, thereby proving fact 2.
Narrowing the solution space
Proving these two facts allows us to restrict the solution space that our exchange argument needs to travel through to get from to , our greedy solution.
In particular, let be the set of solutions that are full binary trees where the leaves have characters as siblings in the deepest node. We will only need to look at these solutions, simplifying our argument.
Then we can reframe the optimization problem again: we want to output a full binary tree that minimizes
Algorithm
Note that , where is the parent of node .
- Set aside nodes , where and have the smallest frequencies. Consider , which is the parent of the two nodes.
- Calculate .
- Recurse on the nodes , as well as the frequencies .