Only a small part is finished so far.
On the code side, only plain character-based Huffman compression and decompression are implemented.
For project management I tried building with CMake, which turned out to be very convenient.
For testing I tried Google Test 1.4, which is also easy to use (a minimal sketch follows this list).
For documentation I tried LaTeX + CJK with latex2html, which works remarkably well :)
Next steps: first, try embedding Python in C++ so that pygraphviz can draw the shape of the generated binary tree, visualizing the Huffman tree that is built.
Then continue to polish the implementation document, e.g. drawing explanatory figures with dia.
Implement canonical Huffman, and compare the memory footprint of different approaches and the speed of different decoding methods.
Implement word-based Huffman and word-based canonical Huffman and examine the compression ratio; consider a word-based approach for Chinese text.
Implement other compression methods for comparison.
Finally, and most importantly, study the application of compression to indexing.
For now I am posting the current document; more to come.
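As promised above, here is a minimal sketch of how such a Google Test performance case could look. The header name, the glzip class name, and the compress() signature are assumptions made for illustration, not the library's actual API; only the test names match the real utest output quoted at the end of this document.

#include <gtest/gtest.h>
#include "huff_char.h"  // hypothetical header for the char-based coder

TEST(huff_char_compress, perf) {
  // Compress big.log into big.log.crs; gtest reports the elapsed time.
  glzip::HuffCharCompressor compressor;           // assumed class name
  compressor.compress("big.log", "big.log.crs");  // assumed signature
}

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}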
November 14, 2009
This document mainly follows the second chapter, Text Compression, of «深入搜索引擎». It introduces the principles of various compression algorithms and connects them to their implementations in glzip, my C++ template library, giving experimental results and comparative analysis. The book contains the detailed theory; here the focus is on implementing the algorithms and analyzing the experiments. All experiments use two test files: one is the trivial simple.log, whose complete content is shown below; the other is big.log, a 24 MB text file.
I love nba and cba
and ...
The tests below were run under Windows XP + VMware + Ubuntu 8.04 on a 1.86 GHz CPU with 1 GB RAM, compiled with g++ 4.2.4 and -O2. Note that with g++ 4.4.2 the running time drops noticeably: total compression plus decompression time for the 24 MB file decreases by roughly 1 s.
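Before the results, a minimal self-contained sketch of the core construction (not the glzip code itself, which is template-based; written in C++03 to match the g++ 4.2.4 toolchain above): count byte frequencies, then repeatedly merge the two least-frequent nodes via a priority queue.

#include <cstdio>
#include <queue>
#include <vector>

struct Node {
  long long freq;
  int       symbol;   // 0..255 for leaves, -1 for internal nodes
  Node *left, *right;
  Node(long long f, int s, Node* l = NULL, Node* r = NULL)
      : freq(f), symbol(s), left(l), right(r) {}
};

struct ByFreq {   // orders the priority queue as a min-heap on frequency
  bool operator()(const Node* a, const Node* b) const {
    return a->freq > b->freq;
  }
};

Node* build_tree(const long long (&freq)[256]) {
  std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
  for (int c = 0; c < 256; ++c)
    if (freq[c] > 0) pq.push(new Node(freq[c], c));
  while (pq.size() > 1) {   // merge the two least-frequent subtrees
    Node* a = pq.top(); pq.pop();
    Node* b = pq.top(); pq.pop();
    pq.push(new Node(a->freq + b->freq, -1, a, b));
  }
  return pq.top();          // tree root; assumes a non-empty input
}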
The input file is simple.log
The total bytes num is 27
The total number of different characters in the file is 13
The average encoding length per character is 3.48148
So not consider the header the approximate compressing ration should be 0.435185
The encoding map is as below
Character Times Frequence EncodeLength Encode
\n 2 0.0741 4 1011
5 0.185 2 00
. 3 0.111 3 010
I 1 0.037 5 11010
a 4 0.148 3 100
b 2 0.0741 4 1100
c 1 0.037 5 11011
d 2 0.0741 4 1010
e 1 0.037 5 11110
l 1 0.037 5 11111
n 3 0.111 3 011
o 1 0.037 5 11100
v 1 0.037 5 11101
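The two summary numbers above follow directly from the table: the average code length is the count-weighted sum of the code lengths divided by the total bytes, 94/27 = 3.48148 bits, and dividing by the 8 bits of a raw character gives the header-less ratio 0.435185. A small sketch of that arithmetic, with the counts and lengths copied from the table:

#include <cstdio>

int main() {
  // (count, code length) pairs copied from the simple.log table above:
  // \n, space, ., I, a, b, c, d, e, l, n, o, v
  const int counts[]  = {2, 5, 3, 1, 4, 2, 1, 2, 1, 1, 3, 1, 1};
  const int lengths[] = {4, 2, 3, 5, 3, 4, 5, 4, 5, 5, 3, 5, 5};
  const int n = sizeof(counts) / sizeof(counts[0]);

  long long total_chars = 0, total_bits = 0;
  for (int i = 0; i < n; ++i) {
    total_chars += counts[i];
    total_bits  += counts[i] * lengths[i];
  }
  double avg = double(total_bits) / total_chars;            // 94/27 = 3.48148
  std::printf("avg bits/char = %g, ratio = %g\n", avg, avg / 8);  // 0.435185
  return 0;
}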
In the utest run quoted at the end of this document, the first test compresses big.log and only measures speed: 1.445 s. The second test, run right after it, decodes and decompresses big.log.crs, again only measuring speed: 1.984 s.
The last test compares the original big.log with big.log.crs.de, the file obtained by compressing and then decompressing; they are identical, so the original content is recovered correctly.
The compressed big.log.crs is 13 MB, a compression ratio of 13/24 = 0.541, consistent with our expectation.
Part of the encoding table for this file is shown below. The number of distinct characters, i.e. the size of the encoding table, can never exceed 256, since the encoding is character-based with 8 bits per character.
The input file is big.log
The total bytes num is 24292128
The total number of different characters in the file is 99
The average encoding length per character is 4.3231
So not consider the header the approximate compressing ration should be 0.540388
The encoding map:
Character Times Frequence EncodeLength Encode
\n 456549 0.0188 6 110100
70 2.88e-06 18 011001100000111001
6731035 0.277 2 10
! 2362 9.72e-05 13 0110011000000
" 59322 0.00244 9 110101110
# 335 1.38e-05 16 0000111110000111
$ 1549 6.38e-05 14 11001101111000
% 1423 5.86e-05 14 01100110000101
& 1745 7.18e-05 14 11001101111001
' 50639 0.00208 9 110011001
( 17318 0.000713 10 0000101010
) 17288 0.000712 10 0000100011
* 12959 0.000533 11 11001101001
+ 671 2.76e-05 15 011001100000110
, 169157 0.00696 7 0101111
- 149271 0.00614 7 0000110
. 243628 0.01 7 1111110
/ 12387 0.00051 11 11001101000
0 20424 0.000841 10 0101110100
1 59394 0.00244 9 110101111
2 19924 0.00082 10 0000111111
3 14963 0.000616 11 11010110111
4 14220 0.000585 11 11010110000
5 21064 0.000867 10 0101110111
6 27586 0.00114 10 1100110110
7 14629 0.000602 11 11010110100
8 16253 0.000669 10 0000100000
9 35422 0.00146 9 000010100
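The decompression time above is for the simplest decoding strategy: walking the Huffman tree one bit at a time (whether glzip does exactly this is an assumption here; the table-based and canonical decoders mentioned in the plan are meant to be compared against this baseline). A sketch, reusing the Node struct from the earlier sketch and assuming 0 = left, 1 = right, MSB-first bit packing:

#include <cstdio>

void decode(const Node* root, const unsigned char* buf, long long nbits,
            std::FILE* out) {
  const Node* cur = root;
  for (long long i = 0; i < nbits; ++i) {
    int bit = (buf[i >> 3] >> (7 - (i & 7))) & 1;  // MSB-first bit order
    cur = bit ? cur->right : cur->left;            // 0 = left, 1 = right
    if (cur->symbol >= 0) {                        // reached a leaf
      std::fputc(cur->symbol, out);                // emit decoded byte
      cur = root;                                  // restart at the root
    }
  }
}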
allen:~/study/data_structure/huffman/huffman_c++/final/final/build/bin$ ./utest
[==========] Running 3 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 1 test from huff_char_compress
[ RUN ] huff_char_compress.perf
[ OK ] huff_char_compress.perf (1445 ms)
[----------] 1 test from huff_char_compress (1445 ms total)
[----------] 1 test from huff_char_decompress
[ RUN ] huff_char_decompress.perf
[ OK ] huff_char_decompress.perf (1984 ms)
[----------] 1 test from huff_char_decompress (1984 ms total)
[----------] 1 test from huff_char
[ RUN ] huff_char.func
[ OK ] huff_char.func (555 ms)
[----------] 1 test from huff_char (555 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 3 test cases ran. (3984 ms total)
[ PASSED ] 3 tests.