In 1952, David A. Huffman published his paper “A Method for the Construction of Minimum-Redundancy Codes”, and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big problem: the Huffman codes are NOT unique. For example, given a string “aaaxuaxz”, we can observe that the frequencies of the characters ‘a’, ‘x’, ‘u’ and ‘z’ are 4, 2, 1 and 1, respectively. We may either encode the symbols as {‘a’=0, ‘x’=10, ‘u’=110, ‘z’=111}, or in another way as {‘a’=1, ‘x’=01, ‘u’=001, ‘z’=000}; both compress the string into 14 bits. Another set of codes can be given as {‘a’=0, ‘x’=11, ‘u’=100, ‘z’=101}, but {‘a’=0, ‘x’=01, ‘u’=011, ‘z’=001} is NOT correct since “aaaxuaxz” and “aazuaxax” can both be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.
Each input file contains one test case. For each case, the first line gives an integer N ( 2≤N≤63 ), then followed by a line that contains all the N distinct characters and their frequencies in the following format:
c[1] f[1] c[2] f[2] ... c[N] f[N]
where c[i] is a character chosen from {‘0’ - ‘9’, ‘a’ - ‘z’, ‘A’ - ‘Z’, ‘_’}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M ( ≤1000 ), followed by M student submissions. Each student submission consists of N lines, each in the format:
c[i] code[i]
where c[i] is the i-th character and code[i] is a non-empty string of no more than 63 ‘0’s and ‘1’s.
For each test case, print in each line either “Yes” if the student’s submission is correct, or “No” if not.
Note: The optimal solution is not necessarily generated by Huffman algorithm. Any prefix code with code length being optimal is considered correct.
7
A 1 B 1 C 1 D 3 E 3 F 6 G 6
4
A 00000
B 00001
C 0001
D 001
E 01
F 10
G 11
A 01010
B 01011
C 0100
D 011
E 10
F 11
G 00
A 000
B 001
C 010
D 011
E 100
F 101
G 110
A 00000
B 00001
C 0001
D 001
E 00
F 10
G 11
Yes
Yes
No
No
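Given a list of character frequencies, decide whether each of several submitted encodings is equivalent to a Huffman code.
There are two requirements: the encoding must be unambiguous (no code is a prefix of another), and the total encoded length must be the minimum possible.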
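The second requirement compares lengths, so it helps to see how the total length of one submission is computed: each character contributes its frequency times the length of its code. A minimal sketch, assuming the frequencies and codes are stored in two parallel vectors; the name `encodedLength` is chosen here for illustration and does not appear in the original solution:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Total number of bits a submission needs: sum of f[i] * len(code[i]).
// Assumes freq and codes have the same size and the same character order.
long long encodedLength(const std::vector<int>& freq,
                        const std::vector<std::string>& codes) {
    long long total = 0;
    for (std::size_t i = 0; i < freq.size(); ++i) {
        total += static_cast<long long>(freq[i]) *
                 static_cast<long long>(codes[i].size());
    }
    return total;
}
```

For the example in the problem statement, the frequencies 4, 2, 1, 1 with the codes 0, 10, 110, 111 give 4·1 + 2·2 + 1·3 + 1·3 = 14 bits.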
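The frequencies alone determine a unique Huffman encoded length, even though the codes themselves are not unique, and this length is the optimum.
Following the Huffman algorithm, repeatedly take the two smallest weights and merge them; the sum of all the merged weights is the minimum total length.
Insertion sort works here, or a priority queue can be used directly to speed things up.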
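The priority-queue variant can be sketched as follows. This is an illustrative version, not necessarily the author's code; the function name `optimalLength` is an assumption. It never builds the Huffman tree, because only the total length is needed: each merge of weights a and b adds one bit to every character beneath the merged node, which costs a + b bits in total.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Minimum total encoded length for the given frequencies.
// Assumes at least two symbols, as the problem guarantees (N >= 2).
int optimalLength(const std::vector<int>& freq) {
    // Min-heap: the smallest remaining weight is always on top.
    std::priority_queue<int, std::vector<int>, std::greater<int>>
        heap(freq.begin(), freq.end());
    int total = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        total += a + b;    // cost of this merge
        heap.push(a + b);  // the merged node competes with the rest
    }
    return total;
}
```

On the sample frequencies 1, 1, 1, 3, 3, 6, 6 the merges cost 2, 3, 6, 9, 12 and 21, for a total of 53. The first two sample submissions need exactly 53 bits, while the third needs 21 · 3 = 63 and is rejected.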
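For each submitted encoding, build a Trie Tree (prefix tree). Each node of the trie stores:
bool isVisited; // whether some code has passed through or ended at this node
bool isMarked; // whether some code ends at this node
Trie *next[2]; // pointers to the child nodes for '0' and '1'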
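While inserting each code string, set the following flags along its path:
Every node the path reaches gets isVisited = true;
The node where the code ends gets isMarked = true;
That is:
A node that is already isVisited must not be the end of a new code; otherwise the new code is a prefix of some existing code.
If the path reaches a node that is isMarked, stop and reject; otherwise some existing code would be a prefix of the new code.
This guarantees that every code ends at a leaf node.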
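The two rules above can be sketched as a single insertion routine. This is a hedged illustration under a few assumptions: the function names `insertCode` and `isPrefixFree` are made up here, `std::unique_ptr` replaces the raw `Trie *next[2]` pointers so the nodes free themselves, and every code is assumed to be a non-empty string of '0' and '1', as the problem guarantees. Each flag is checked before it is updated, so a code does not collide with its own path.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Trie {
    bool isVisited = false;         // some code passes through or ends here
    bool isMarked = false;          // some code ends here
    std::unique_ptr<Trie> next[2];  // children for '0' and '1'
};

// Inserts one code; returns false as soon as a prefix conflict is found.
// A failed insert may leave flags behind, so the trie should be discarded.
bool insertCode(Trie& root, const std::string& code) {
    Trie* node = &root;
    for (std::size_t i = 0; i < code.size(); ++i) {
        std::unique_ptr<Trie>& child = node->next[code[i] - '0'];
        if (!child) child.reset(new Trie());
        node = child.get();
        // An earlier code ends here, so it is a prefix of (or equal to) this one.
        if (node->isMarked) return false;
        // This code would end on a node an earlier code passed through,
        // so this code is a prefix of that earlier one.
        bool isLast = (i + 1 == code.size());
        if (isLast && node->isVisited) return false;
        node->isVisited = true;
    }
    node->isMarked = true;
    return true;
}

// Checks a whole submission with a fresh trie.
bool isPrefixFree(const std::vector<std::string>& codes) {
    Trie root;
    for (const std::string& code : codes) {
        if (!insertCode(root, code)) return false;
    }
    return true;
}
```

A submission would then be answered with “Yes” only when `isPrefixFree` holds and its total length equals the optimal length computed from the frequencies. In the fourth sample submission, E = 00 ends on a node that A, B, C and D already passed through, so it is rejected.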
PTA Huffman Codes With Trie Tree