Document Compression Based on Huffman Coding

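Algorithm idea:

Count the number of distinct characters in the file and how often each one occurs, build a Huffman tree from these frequencies, and derive each character's code from the tree. Compress the file by replacing every character with its code. To decompress, rebuild the Huffman tree and use it to decode the binary stream stored in the compressed file.

The key part is building the Huffman tree and deriving the codes. Start with n binary trees, each consisting of a single node that carries a weight; together they form a forest. Pick the two trees whose roots have the smallest weights and join them under a new root (the smaller tree becomes the left subtree), set the new root's weight to the sum of the two roots' weights, and remove the two trees from the forest. Repeat until only one tree remains; that tree is the Huffman tree. Traversing from the root, append 0 when moving to a left child and 1 when moving to a right child; the 0/1 string along the path from the root to a leaf is that leaf's Huffman code.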
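The merge procedure can be sketched with a min-heap. This is only an illustration under assumptions: the name huffmanCost is hypothetical, and the heap replaces the array scan used in the implementation further down. It returns the total number of bits the encoded data would occupy, which equals the sum of the weights of all roots created by merging.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: total encoded length in bits
// (sum over all symbols of weight * code length).
std::uint64_t huffmanCost(const std::vector<std::uint64_t>& weights) {
    // Min-heap holding the root weight of every tree in the forest.
    std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>>
        forest(weights.begin(), weights.end());

    std::uint64_t totalBits = 0;
    while (forest.size() > 1) {
        // Remove the two lightest trees from the forest.
        std::uint64_t a = forest.top(); forest.pop();
        std::uint64_t b = forest.top(); forest.pop();
        // Joining them puts every symbol below one level deeper,
        // which costs one extra bit per occurrence: a + b bits in total.
        totalBits += a + b;
        forest.push(a + b);
    }
    return totalBits;
}
```

For the weights 5, 9, 12, 13, 16, 45 the merges create roots of weight 14, 25, 30, 55 and 100, so the function returns 224.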

 

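Implementation:

Compression: read the original file, build the Huffman tree and derive the codes, then write the compressed file.

Decompression: read the compressed file, rebuild the Huffman tree and decode, then write the original file.

The tree is implemented as a linked structure, which saves memory but may be slower than a matrix-based implementation.

The file is read with ifstream.get(), which returns one byte at a time. A byte has 8 bits, so as an unsigned integer it lies in the range 0-255, which means at most 256 distinct characters can occur in a file. Create an integer array of size 257, initialised to 0, to hold the frequency of each character. Read the file byte by byte in order, use each byte's unsigned value as the array index, and increment the entry at that index; this yields the frequency of every character in the original file. Finally, increment the entry at index 256, which represents the end marker that every file has exactly once.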

File reading and character counting

while  (ch = infile.get()) != EOF
     index = (unsigned char) ch
     weightOfChar[index]++
weightOfChar[256]++  //end marker

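Count the number n of array entries with nonzero frequency and create 2n-1 tree nodes. Each node stores the value of the character it represents, the indices of its parent and of its left and right children, its frequency, and its code. The first n nodes of the array hold the characters present in the file; the remaining nodes are parents of earlier nodes. Repeatedly selecting the two trees of smallest weight and joining them into a new tree eventually produces the Huffman tree.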


Building the Huffman tree

for  i = n to 2n-2
    selectMin(&leftChild, &rightChild, i)  //find the two lightest roots among nodes 0..i-1

    //link parent and children

    huffmannode[i].lChild = leftChild and huffmannode[i].rChild = rightChild

    huffmannode[leftChild].parent = i and huffmannode[rightChild].parent = i

    huffmannode[i].weight=huffmannode[leftChild].weight+huffmannode[rightChild].weight
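The array layout is easier to follow on a small input. The sketch below builds the 2n-1 node array from a list of leaf weights; Node and buildTree are names invented for this example, and the two smallest roots are found in a single scan rather than with a separate selectMin function.

```cpp
#include <vector>

struct Node {
    long long weight = 0;
    int parent = -1, lChild = -1, rChild = -1;  // -1 means "none"
};

// Hypothetical helper: leaves occupy slots 0..n-1, internal nodes n..2n-2,
// and the root ends up in the last slot.
std::vector<Node> buildTree(const std::vector<long long>& leafWeights) {
    const int n = static_cast<int>(leafWeights.size());
    if (n == 0) return {};
    std::vector<Node> nodes(2 * n - 1);
    for (int i = 0; i < n; ++i) nodes[i].weight = leafWeights[i];

    for (int i = n; i < 2 * n - 1; ++i) {
        // Find the two lightest nodes among nodes[0..i) that have no parent yet.
        int first = -1, second = -1;
        for (int k = 0; k < i; ++k) {
            if (nodes[k].parent != -1) continue;
            if (first == -1 || nodes[k].weight < nodes[first].weight) {
                second = first;
                first = k;
            } else if (second == -1 || nodes[k].weight < nodes[second].weight) {
                second = k;
            }
        }
        // Join them under the new node i; the lighter one becomes the left child.
        nodes[i].lChild = first;
        nodes[i].rChild = second;
        nodes[i].weight = nodes[first].weight + nodes[second].weight;
        nodes[first].parent = i;
        nodes[second].parent = i;
    }
    return nodes;
}
```

With the leaf weights 5, 9, 12, 13, 16, 45 the array has 11 nodes, the first merge joins slots 0 and 1 under slot 6, and the root in slot 10 has weight 100.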

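Huffman codes can be generated starting from the leaves. Give each leaf an empty string; whenever the current node is the left child of its parent append 0, and whenever it is the right child append 1, moving up until the root is reached. Reversing the resulting string gives the leaf's Huffman code.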


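Huffman coding

for  i = 0 to num-1

	current node = leaf i and huffmanCode = ""

	while current node is not the root

		if  current node is the left child of its parent

			huffmanCode += '0'

		else

			huffmanCode += '1'

		current node = parent of current node

	reverse huffmanCode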
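The leaf-to-root walk only needs each node's parent and left-child index. A minimal sketch, assuming a stripped-down node type Link and a helper codeForLeaf, both of which are names made up for this example:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Link {
    int parent;  // index of the parent, -1 for the root
    int lChild;  // index of the left child, -1 for a leaf
};

// Hypothetical helper: walks from the leaf up to the root, collecting one bit
// per edge, then reverses the bits so the code reads from the root downwards.
std::string codeForLeaf(const std::vector<Link>& tree, int leaf) {
    std::string code;
    for (int cur = leaf, p = tree[cur].parent; p != -1;
         cur = p, p = tree[p].parent) {
        code += (tree[p].lChild == cur) ? '0' : '1';
    }
    std::reverse(code.begin(), code.end());
    return code;
}
```

For a five-node tree in which node 4 is the root with children 2 and 3, and node 3 has children 0 and 1, the leaves 0, 1 and 2 receive the codes 10, 11 and 0.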

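These steps yield a Huffman code dictionary. The dictionary generated for cacm.all is provided separately in the attachment.

The compressed file begins with the number of distinct characters, followed by each character and its frequency, which are needed later for decoding; the compressed content comes after that.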
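In the listing below, each header line has the form character value, a dot, the weight, a colon, and the code string. The following sketch writes and parses one such line with string streams; HeaderEntry, formatEntry and parseEntry are hypothetical names used only here.

```cpp
#include <sstream>
#include <string>

struct HeaderEntry {
    int symbol;        // byte value 0..255, or 256 for the end marker
    int weight;        // number of occurrences
    std::string code;  // Huffman code as a string of '0' and '1'
};

// Produces a line such as "97.3:010" (without the trailing newline).
std::string formatEntry(const HeaderEntry& e) {
    std::ostringstream os;
    os << e.symbol << '.' << e.weight << ':' << e.code;
    return os.str();
}

// Returns false if the line does not match "symbol.weight:code".
bool parseEntry(const std::string& line, HeaderEntry& out) {
    std::istringstream is(line);
    char dot = 0, colon = 0;
    if (!(is >> out.symbol >> dot >> out.weight >> colon)) return false;
    if (dot != '.' || colon != ':') return false;
    std::getline(is, out.code);  // the rest of the line is the code string
    return true;
}
```

Because the character is stored as a decimal number rather than as a raw byte, a newline character in the input (value 10) cannot break the line-oriented header.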


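Writing the encoded file

buf = 0 and offset = 0

while  (ch = infile.get()) != EOF //read and write in order

	charToBit = codeOfChar[(unsigned char) ch]

	for i = 0 to charToBit.length-1

		if  charToBit[i] == '1'

			buf++

		if offset == 7	//seven shifts done, so buf now holds a full 8 bits of code

			outfile << buf and offset = 0 and buf = 0

		else

			buf = buf << 1 and offset++

after the whole file has been read and written, append the code of the end marker in the same way

finally, shift the remaining bits (fewer than 8) left to fill a byte and write it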
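The bit-packing step can be isolated in a small class. This is a sketch under assumptions, not the code used in the listing: BitPacker is a hypothetical name, it collects bytes in memory instead of writing to a file, and unlike the listing it emits no extra byte when the bit count is already a multiple of 8.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: packs strings of '0'/'1' into bytes, first bit in the
// most significant position.
class BitPacker {
public:
    void append(const std::string& bits) {
        for (char c : bits) {
            current_ = static_cast<unsigned char>((current_ << 1) | (c == '1'));
            if (++filled_ == 8) flushByte();
        }
    }

    // Pads the last partial byte with zero bits on the right and returns all bytes.
    std::vector<unsigned char> finish() {
        if (filled_ > 0) {
            current_ = static_cast<unsigned char>(current_ << (8 - filled_));
            flushByte();
        }
        return bytes_;
    }

private:
    void flushByte() {
        bytes_.push_back(current_);
        current_ = 0;
        filled_ = 0;
    }

    std::vector<unsigned char> bytes_;
    unsigned char current_ = 0;  // bits collected so far for the next byte
    int filled_ = 0;             // how many bits of current_ are in use
};
```

Appending 101 and then 1000001 gives the bit string 1011000001, which is stored as the bytes 0xB0 and 0x40; the last six bits of the second byte are padding.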

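To decompress, read the first line of the file to get the number of distinct characters n, then read n more lines to get every character and its weight. With this information, the tree-building and coding functions written earlier can be called directly to produce the same Huffman tree that was used for compression. The file content is then restored from the compressed bits, read in order, and the Huffman tree.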


Restoring and writing the decompressed file

p = huffmanTree.head  and  isOver = false

while ( ! isOver)

	buf = (unsigned char)in.get()

	for  i = 0 to 7

		if  (unsigned int)buf >= 128  //buf >= 128 means the leftmost bit is 1

			p = p->rChild

		else

			p = p->lChild

		buf = buf << 1   //shift buf left by one bit to test the next bit

		if  p points to a leaf

			if  p points to the end marker	//the file has been fully decoded; leave both loops

				isOver = true  and	break

			else

				out << (character represented by p)  and	p = huffmanTree.head
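The decoding loop can be tried on a hand-built tree. In this sketch DNode, kEndMarker and decodeBits are names chosen for the example, and the function also stops when the input runs out, which guards against a truncated file.

```cpp
#include <string>
#include <vector>

struct DNode {
    int symbol;  // byte value, 256 for the end marker, -1 for internal nodes
    int lChild;
    int rChild;
};

constexpr int kEndMarker = 256;

// Hypothetical helper: reads bits from the most significant end of each byte,
// follows the tree from the root, and emits a character at every leaf.
std::string decodeBits(const std::vector<DNode>& tree, int root,
                       const std::vector<unsigned char>& packed) {
    std::string out;
    int p = root;
    for (unsigned char byte : packed) {
        for (int bit = 7; bit >= 0; --bit) {
            p = ((byte >> bit) & 1) ? tree[p].rChild : tree[p].lChild;
            if (tree[p].symbol == -1) continue;            // not at a leaf yet
            if (tree[p].symbol == kEndMarker) return out;  // end of data
            out += static_cast<char>(tree[p].symbol);
            p = root;                                      // start the next code
        }
    }
    return out;  // input ended without an end marker
}
```

With the codes a = 0, b = 10 and end marker = 11, the text abba followed by the end marker is the bit string 01010011, that is the single byte 0x53. Decoding that byte yields abba, and any bytes after the end marker are ignored.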


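Proof that the greedy algorithm yields an optimal solution

To show that the greedy construction of the Huffman tree is optimal, we need to prove that the optimal prefix code problem has the greedy-choice property and the optimal-substructure property.

(1) Greedy-choice property

Let C be the alphabet, and let x and y be the two characters of C with the lowest frequencies. We show that there exists an optimal prefix code for C in which the codewords of x and y have the same length and differ only in the last bit.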

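Proof: Let T be a tree representing an optimal prefix code, and let a and b be two sibling leaves of maximum depth in T, with a.freq <= b.freq and x.freq <= y.freq. Because x and y have the two lowest frequencies, x.freq <= a.freq and y.freq <= b.freq. Exchange x and a in T to obtain T', then exchange y and b in T' to obtain T''. Writing B(T) for the cost of T (the frequency-weighted sum of code lengths) and dT(c) for the depth of c in T:

B(T) - B(T') = x.freq*dT(x) + a.freq*dT(a) - x.freq*dT'(x) - a.freq*dT'(a)

= x.freq*dT(x) + a.freq*dT(a) - x.freq*dT(a) - a.freq*dT(x)

= (a.freq - x.freq)(dT(a) - dT(x)) >= 0

Both factors are nonnegative: a.freq >= x.freq because x has minimum frequency, and dT(a) >= dT(x) because a is a leaf of maximum depth. Similarly B(T') - B(T'') >= 0, so B(T) - B(T'') >= 0. Since T is optimal, B(T) - B(T'') <= 0, hence B(T) = B(T''). So T'' is also an optimal tree, and in it x and y are sibling leaves, so their codewords have the same length and differ only in the last bit.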
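The exchange step can be checked numerically: swapping two leaves changes the cost by (frequency of a minus frequency of x) times (depth of a minus depth of x). The sketch below computes the cost directly from frequency and depth arrays; cost and exchangeGain are hypothetical names for this worked example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// B(T): sum of freq[c] * depth[c] over all characters c.
long long cost(const std::vector<long long>& freq, const std::vector<int>& depth) {
    long long total = 0;
    for (std::size_t c = 0; c < freq.size(); ++c) total += freq[c] * depth[c];
    return total;
}

// Returns B(T) - B(T'), where T' is T with the leaves x and a exchanged.
long long exchangeGain(const std::vector<long long>& freq, std::vector<int> depth,
                       std::size_t x, std::size_t a) {
    const long long before = cost(freq, depth);
    std::swap(depth[x], depth[a]);  // exchanging two leaves swaps their depths
    return before - cost(freq, depth);
}
```

Take frequencies 1, 2, 5, 7 at depths 1, 2, 3, 3, so the rarest character sits closest to the root. The cost is 41; after exchanging the first and third characters it is 33, a gain of 8 = (5 - 1)(3 - 1), as the formula predicts.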

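(2) Optimal substructure

Let T be a tree representing an optimal prefix code for the alphabet C. Let x and y be a pair of sibling leaves in T, and let z be their parent. Treat z as a character with frequency z.freq = x.freq + y.freq. The optimal-substructure property then states that the tree T' = T - {x,y} represents an optimal prefix code for the alphabet C' = (C - {x,y}) ∪ {z}.

Proof: Express the average code length (cost) B(T) of T in terms of the cost B(T') of T'. For every c in C - {x,y} we have dT(c) = dT'(c), and therefore c.freq*dT(c) = c.freq*dT'(c). Since dT(x) = dT(y) = dT'(z) + 1, we have x.freq*dT(x) + y.freq*dT(y) = (x.freq + y.freq)(dT'(z) + 1) = z.freq*dT'(z) + (x.freq + y.freq), and so B(T) = B(T') + x.freq + y.freq. Suppose the prefix code represented by T is not optimal for C; then there exists an optimal coding tree T'' with B(T'') < B(T) ...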


Full implementation:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include <ctime>
#include <windows.h>

using namespace std;

#define maxCodeLen 200
#define maxASCIIKinds 257
LARGE_INTEGER StartTime;	//timing variables
LARGE_INTEGER EndTime;
LARGE_INTEGER Freq;

void startTime();	//start timing
double endTime();	//stop timing and return the elapsed time

int readFile(string filePath);
void createHuffmanTree(int num);
void selectMin(int *lc,int *rc,int num);
void huffmanCoding(int num);
void compressedFile(int num, string filePathBeforeCompress, string filePathAfterCompress);
void decompressionFile(string filePathAfterCompress, string filePathDeCompress);
void getWeight(string codeStr);
void displayHuffmanCode(int num);
void Encode(string filePathBeforeCompress, string filePathAfterCompress);
void Decode(string filePathAfterCompress, string filePathDeCompress);

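//number of occurrences of each character (index 256 is the end marker)
int weightOfChar[maxASCIIKinds];
//Huffman code of each character as a string of '0' and '1'
string codeOfChar[maxASCIIKinds];

struct huffmanNode {
	int charValue = -1;	//character value; -1 for internal nodes
	int weight = 0;
	int parent = -1;	//array indices; -1 means none
	int lChild = -1;
	int rChild = -1;
	char code[maxCodeLen];
	int codeLen = 0;
};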

huffmanNode *huffmannode;


int main()
{
	string filePathBeforeCompress = "path of the file to compress";
	string filePathAfterCompress = "path of the compressed file";
	string filePathDeCompress = "path of the decompressed file";
	int t;

	cout << "1. Compress file      2. Decompress file   " << endl;
	while (cin >> t) {
		if (t == 1)
			Encode(filePathBeforeCompress, filePathAfterCompress);
		else if (t == 2)
			Decode(filePathAfterCompress, filePathDeCompress);
		else
			break;
	}
    return 0;
}

void Encode(string filePathBeforeCompress, string filePathAfterCompress) {

	startTime();
	int charKindsLen;
	string huffmanCode;

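	//read the file, count character frequencies, and return the number of distinct characters
	charKindsLen = readFile(filePathBeforeCompress);
	//build the Huffman tree from the character frequencies
	createHuffmanTree(charKindsLen);
	//derive the codes from the Huffman tree
	huffmanCoding(charKindsLen);
	//compress the file
	compressedFile(charKindsLen, filePathBeforeCompress, filePathAfterCompress);

	cout << "Compression time: " << endTime() << "us" << endl;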
}

void Decode(string filePathAfterCompress, string filePathDeCompress) {
	
	startTime();
	decompressionFile(filePathAfterCompress, filePathDeCompress);
	cout << "Decompression time: " << endTime() << "us" << endl;
}

//read the file, count character frequencies, and return the number of distinct characters
int readFile(string filePath) {

	int singleChar;
	int charKindsLen = 0;
	int ASCIIKinds = 0;
	ifstream infile;

	infile.open(filePath, ios::in | ios::binary);
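	if (!infile) {
		cout << "File not found\n";
		exit(-1);
	}
	
	//reset all weights to 0 and clear codes left over from a previous run
	for (int i = 0; i < maxASCIIKinds; i++) {
		weightOfChar[i] = 0;
		codeOfChar[i].clear();
	}
	//the end marker (index 256) occurs exactly once
	weightOfChar[maxASCIIKinds - 1] = 1;

	//count occurrences, using the byte value as the index;
	//get() returns 0-255 for a byte, or EOF at the end of the file
	while ((singleChar = infile.get()) != EOF) {
		weightOfChar[singleChar]++;
	}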
	
	infile.close();

	//count the characters with nonzero weight
	for (int j = 0; j < maxASCIIKinds; j++)
		if (weightOfChar[j] != 0)
			charKindsLen++;
	
	return charKindsLen;
}

//select the two parentless nodes of smallest weight among nodes 0..num-1
void selectMin(int *lc, int *rc, int num) {

	int i;
	int min = -1;
	for (i = 0; i < num; i++) {
		if (huffmannode[i].parent == -1) {
			if (min == -1 || huffmannode[i].weight < huffmannode[min].weight)
				min = i;
		}
	}
	*lc = min;
	min = -1;
	for (i = 0; i < num; i++) {
		if (huffmannode[i].parent == -1 && i!=*lc) {
			if (min == -1 || huffmannode[i].weight < huffmannode[min].weight)
				min = i;
		}
	}
	*rc = min;
}

//build the Huffman tree
void createHuffmanTree(int num) {

	int allNum = 2 * num - 1;
	int i, j, leftChild, rightChild;

	huffmannode = new huffmanNode[allNum];

	for (i = 0, j = 0; i < maxASCIIKinds; i++) {
		if (weightOfChar[i] != 0) {
			huffmannode[j].charValue = i;
			huffmannode[j].weight = weightOfChar[i];
			huffmannode[j++].code[0] = 0;
		}
	}

	for (i = num; i < allNum; i++) {
		selectMin(&leftChild, &rightChild, i);
		huffmannode[i].lChild = leftChild;
		huffmannode[i].rChild = rightChild;
		huffmannode[i].weight = huffmannode[leftChild].weight + huffmannode[rightChild].weight;
		huffmannode[leftChild].parent = i;
		huffmannode[rightChild].parent = i;
	}
}

//generate the Huffman code of each leaf node
void huffmanCoding(int num) {

	int currentNode;
	int parentNode;
	int i, j, index;
	char huffmanCode[maxCodeLen];

	for (i = 0; i < num; i++) {
		index = 0;
		for (currentNode = i, parentNode = huffmannode[currentNode].parent; parentNode != -1; currentNode = parentNode, parentNode = huffmannode[parentNode].parent) {
			if (currentNode == huffmannode[parentNode].lChild)
				huffmanCode[index++] = '0';
			else
				huffmanCode[index++] = '1';
		}
		huffmanCode[index] = '\0';

		for (j = index - 1; j >= 0; j--) 
			huffmannode[i].code[index - j - 1] = huffmanCode[j];
		huffmannode[i].code[index] = '\0';
		huffmannode[i].codeLen = index;

		codeOfChar[huffmannode[i].charValue] = huffmannode[i].code;
	}
}

//compress and write the result as a compressed file with a specific suffix
void compressedFile(int num, string filePathBeforeCompress, string filePathAfterCompress) {

	ifstream infile;
	ofstream outfile;
	int singleChar;
	string charToBit;
	unsigned char buf = 0;
	int bufLen = sizeof(unsigned char) * 8;
	int offset = 0;
	int i, len;
	infile.open(filePathBeforeCompress, ios::in | ios::binary);
	outfile.open(filePathAfterCompress, ios::out | ios::binary);

	//store the Huffman tree information
	outfile << num;
	outfile << (char)10;
	for (i = 0; i < maxASCIIKinds; i++) {
		if (codeOfChar[i] != "") {
			outfile << i;
			outfile << ".";
			outfile << weightOfChar[i];
			outfile << ":";
			outfile << codeOfChar[i];
			outfile << (char)10;
		}
	}

	//store the encoded file content
	while ((singleChar = infile.get()) != EOF) {
		charToBit = codeOfChar[singleChar];
		len = charToBit.length();
		for (i = 0; i < len; i++) {
			if (charToBit[i] == '1')
				buf++;
			if (offset == bufLen - 1) {
				outfile << buf;
				offset = 0;
				buf = 0;
			}else {
				buf = buf << 1;
				offset++;
			}
		}
	}
	//store the code of the end marker
	charToBit = codeOfChar[maxASCIIKinds - 1];
	for (i = 0; i < charToBit.length(); i++) {
		if (charToBit[i] == '1')
			buf++;
		if (offset == bufLen - 1) {
			outfile << buf;
			offset = 0;
			buf = 0;
		}
		else {
			buf = buf << 1;
			offset++;
		}
	}
	//pad the last partial byte to 8 bits by shifting left
	while (offset < bufLen - 1) {
		offset++;
		buf = buf << 1;
	}
	outfile << buf;

	infile.close();
	outfile.close();
}

//decompress and write the decompressed file
void decompressionFile(string filePathAfterCompress, string filePathDeCompress) {

	ifstream in;
	ofstream out;

	in.open(filePathAfterCompress, ios::in | ios::binary);
	out.open(filePathDeCompress, ios::out | ios::binary);

	int i, j, k;
	int singleChar;
	int nodeLen, len;
	string charLen;
	string codeStr;
	int weight;
	string code;
	unsigned char buf;
	int bufLen = 8;

	//read the number of distinct characters
	getline(in,charLen);
	nodeLen = atoi(charLen.c_str());

	//initialise the weights
	for (i = 0; i < maxASCIIKinds; i++)
		weightOfChar[i] = 0;
	weightOfChar[maxASCIIKinds - 1] = 1;

	//read the weight of each character from the header
	for (i = 0; i < nodeLen; i++) {
		getline(in, codeStr);
		getWeight(codeStr);
	}
	
	//build the Huffman tree
	createHuffmanTree(nodeLen);
	//derive the codes from the Huffman tree
	huffmanCoding(nodeLen);

	//displayHuffmanCode(nodeLen);

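	//loop flag: set to false once the end marker has been decoded
	bool isOver = true;
	//p points to the root of the tree
	huffmanNode *p = &huffmannode[nodeLen * 2 - 2];
	while (isOver) {
		//stop if the file ends before the end marker is found
		singleChar = in.get();
		if (singleChar == EOF)
			break;
		buf = (unsigned char)singleChar;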
		for (i = 0; i < bufLen; i++) {
			if ((int)buf >= 128) {
				p = &huffmannode[p->rChild];
			}
			else {
				p = &huffmannode[p->lChild];
			}
			buf = buf << 1;
			if (p->charValue != -1) {
				if (p->charValue == maxASCIIKinds - 1) {
					isOver = false;
					break;
				}else{
					out << (unsigned char)(p->charValue);
					p = &huffmannode[nodeLen * 2 - 2];
				}
			}
		}
	}
	in.close();
	out.close();
}

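//parse one header line of the form "char.weight:code" and record the weight
void getWeight(string codeStr) {

	size_t pos1 = codeStr.find(".");	//separator after the character value
	size_t pos2 = codeStr.find(":");	//separator after the weight
	int singleChar = atoi(codeStr.substr(0, pos1).c_str());
	weightOfChar[singleChar] = atoi(codeStr.substr(pos1 + 1, pos2 - pos1 - 1).c_str());
}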

//print every leaf node and its code (debugging aid)
void displayHuffmanCode(int num) {
	for (int i = 0; i < num; i++){
		cout << huffmannode[i].charValue << " " << huffmannode[i].weight << " " << huffmannode[i].parent << " " << huffmannode[i].lChild << " " << huffmannode[i].rChild << " ";
		for (int j = 0; j < huffmannode[i].codeLen;j++) 
			cout << huffmannode[i].code[j];
		cout << endl;
	}
}

/*void getTimeRAR() {

	clock_t start_time = clock();
	system("rar a cacm E:/suanfa/cacm.all -ad E:/suanfa/");
	clock_t end_time = clock();
	printf("RAR compression: %lu ms\n", end_time - start_time);

	system("del E:/suanfa/cacm.all");
	start_time = clock();
	system("unrar x E:/suanfa/cacm -ad E:/suanfa/");
	end_time = clock();
	printf("RAR decompression: %lu ms\n", end_time - start_time);

}*/

void startTime()
{
	QueryPerformanceFrequency(&Freq);
	QueryPerformanceCounter(&StartTime);
}

double endTime()
{
	QueryPerformanceCounter(&EndTime);
	return (double)((EndTime.QuadPart - StartTime.QuadPart) * 1000000 / Freq.QuadPart);	//elapsed time in microseconds
}


