If you want to make money online : Register now

# [Answers] What is a recursion for Huffman encoding?

, ,
Problem Detail:

If I had to write the Huffman encoding as a recursion in terms of the cost $C_{ij}$, given weights/frequencies $W = [w_1,\ldots,w_n]$, how would I do it? (where this is creating a minimum cost tree, $C = \sum_i w_i \cdot \mathrm{depth}_i$).

I am going off of Dasgupta's Huffman procedure:

procedure Huffman(f)

Input: An array f[1···n] of frequencies

Output: An encoding tree with n leaves

let H be a priority queue of integers, ordered by f

for i = 1 to n:   insert(H,i)

for k = n+1 to 2n–1:

i = deletemin(H),   j = deletemin(H)

create a node numbered k with children i,j

f[k] = f[i] + f[j]

insert(H,k)

It should be something like the matrix multiplication order minimization algorithm, which seeks to minimize the total number of components calculated when multiplying a string of matrices,

$$C_{ij} = \min(C_{ik} + C_{k+1,j}+(m_i-1) \cdot m_k \cdot m_j)$$

Where $m_i,m_j$ are the dimensions of the matrix multiplication at the level directly below the node...

I wrote something like:

$$aC_{ij} = \min((a+1)C_{ik} + (a+1)C_{k+1,j}),\text{ for all } i,j,k$$

With base case: $C_{ii} = w_i$

Knowing full well that this is probably incorrect. But I don't understand how I should write it. I mean, I have the algorithm fine—just Huffman + DFS afterward (which always turns left, then right) to find out the order of the elements. But I don't understand how the recursion notation works in this situation.

#### Answered By : Yuval Filmus

Huffman's algorithm uses the greedy heuristic, which is different from the dynamic programming approach in the other problem you mention. However, we can still define the cost recursively.

There are actually several options. In all cases, given a distribution $p_1,\ldots,p_n$, we will find a formula for $C(p_1,\ldots,p_n)$, which is the optimal cost of a prefix code.

1. $C(1) = 0$.
2. $C(p_1,\ldots,p_n) = p_i + p_j + C(p_1,\ldots,\widehat{p_i},\ldots,\widehat{p_j},\ldots,p_n,p_i+p_j)$, where $p_i,p_j$ (for $i < j$) are the two smallest probabilities (i.e. $\max(p_i,p_j) \leq p_k$ for all $k \neq i,j$), and $\widehat{p_i}$ signifies that $p_i$ is removed.

The proof that this recurrence is correct is not immediate.

If we want a Huffman-like recurrence whose correctness is immediate, we need to tweak this only slightly:

1. $C(1) = 0$.
2. $C(p_1,\ldots,p_n) = \min_{i<j} [p_i + p_j + C(p_1,\ldots,\widehat{p_i},\ldots,\widehat{p_j},\ldots,p_n,p_i+p_j)]$.

This recurrence holds since (i) given a code for $p_1,\ldots,\widehat{p_i},\ldots,\widehat{p_j},\ldots,p_n,p_i+p_j$ we can get a code for $p_1,\ldots,p_n$ which is worse by $p_i+p_j$ by splitting the node corresponding to $p_i+p_j$ into two, and (ii) given a code for $p_1,\ldots,p_n$, if we choose any two sibling leaves $p_i,p_j$ then $C(p_1,\ldots,p_n) = p_i + p_j + C(p_1,\ldots,\widehat{p_i},\ldots,\widehat{p_j},\ldots,p_n,p_i+p_j)$.

A short reflection reveals that we can always ensure $j = n$ (say), and this leads to a simpler recurrence:

1. $C(1) = 0$.
2. $C(p_1,\ldots,p_n) = \min_{i<n} [p_i + p_n + C(p_1,\ldots,\widehat{p_i},\ldots,p_{n-1},p_i+p_n)]$.

The last two recurrences are bottom-up. There is also a top-down recurrence:

1. $C(1) = 0$.

2. $C(p_1,\ldots,p_n) = 1 + \min_{S \sqcup T = [n]} [C(p_S) + C(p_T)]$, where the minimization is over all partitions of $\{1,\ldots,n\}$ into two non-empty sets, and $p_S = (p_i : i \in S)$.

In this recurrence we are splitting the root instead of merging two sibling leaves. I leave to correctness proof for the reader.

Only the first recurrence above can be implemented efficiently.