17. The Huffman code problem

The principle behind this technique is to design a binary character code for each character, using a variable number of bits to represent each character, so as to reduce the total length of the document.

To allow easy, unambiguous decoding, we require the code to be a prefix code: no code is a prefix of another

For example, if the codewords are {0, 01, 11, 001}, the decoding of string like 001 is ambiguous. You can interpret it as 0-01, or 001.

Solution ---- insisting on the prefix-free property: no codeword can be a prefix of another codeword, where codeword is a string of digits to represent a character in the encoding.

Details

Given a character set C and the frequency f (c) of each character c belongs to C in the document A, the problem is to find a prefix code that minimizes the total length of the document.
The tree representing such an optimal code is called a Huffman tree T . We aim to minimize

B(T,A) = sum of f(c)d(c)

——d (c) is the depth of character c in the tree T (numbers of bits used to encode character c), or the length of the codeword of character c.
——B(T,A) is the total number of bits for document A based on the Huffman tree T.

The problem is to design a tree T to minimize B(T,A).

  1. There are no nodes with only a single child.
  2. Rearranging characters within the same level of the tree T does not change B(T,A).
  3. Swapping a character at a higher level with another less-frequent character reduces B(T,A), where the level of the tree root is the lowest level

We have a greedy strategy for the Huffman coding problem:

17. The Huffman code problem_第1张图片
Process

For efficient implementation of a Huffman tree, we can use a priority queue data structure - a minimum heap

The running time is O(n) +O(n) + O(nlogn) = O(nlogn)

你可能感兴趣的:(17. The Huffman code problem)