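Huffman coding compresses data very effectively, typically saving 20% to 90% of the space (Introduction to Algorithms). The actual compression ratio depends on the characteristics of the data, namely its character frequencies. Building the code from a standard corpus generally gives satisfactory results; it is a compromise, because different files would otherwise each produce a different compression ratio.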
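As a rough illustration of where the savings come from, take the textbook example from Introduction to Algorithms: a 100,000-character file over six characters with frequencies (in thousands) a:45, b:13, c:12, d:16, e:9, f:5. A fixed-length code needs 3 bits per character, while the Huffman code assigns code lengths 1, 3, 3, 3, 4 and 4 bits respectively:

```latex
\begin{align*}
\text{fixed-length:}\quad & 3 \times 100{,}000 = 300{,}000 \text{ bits} \\
\text{Huffman:}\quad & (45\cdot 1 + 13\cdot 3 + 12\cdot 3 + 16\cdot 3 + 9\cdot 4 + 5\cdot 4) \times 1{,}000 = 224{,}000 \text{ bits}
\end{align*}
```

That is a saving of roughly 25%. The more skewed the frequencies, the larger the saving.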
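This article demonstrates the algorithm by encoding a single file on its own.
The process breaks down into the following steps:
1. Count the character frequencies.
2. Convert the frequency data into a type the Huffman algorithm can work with (here, HuffmanNode, which stores the frequency together with the tree-node links).
(1) Build a min-priority queue from the input HuffmanNode[] array.
(2) Repeatedly extract the two lowest-frequency nodes from the queue, build a new node with those two as its children, and insert the new node back into the queue, until only one node is left. That node is the root of the coding tree.
(3) Walk over each of the original input Huffman nodes to obtain the code for each character (used for compression).
(4) To decode, feed the 0/1 characters to the algorithm one at a time; it walks the coding tree, and whenever it arrives at a character, that character is the decoded output.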
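The steps above can be sketched in a compact, self-contained form. This is only an illustration, not the implementation used in this article: it assumes .NET 6 or later for the built-in PriorityQueue<TElement, TPriority> class, and the names HuffmanSketch, Node, BuildCodes and Assign are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HuffmanSketch
{
    // Hypothetical node type for this sketch only (not the article's HuffmanNode).
    private sealed class Node
    {
        public char? Symbol;
        public int Weight;
        public Node Left, Right;
    }

    // Builds a character -> bit-string table for the given text.
    public static Dictionary<char, string> BuildCodes(string text)
    {
        var codes = new Dictionary<char, string>();
        if (string.IsNullOrEmpty(text)) return codes;

        // Step 1 and step (1): count frequencies and load a min-priority queue.
        var queue = new PriorityQueue<Node, int>();
        foreach (var g in text.GroupBy(c => c))
            queue.Enqueue(new Node { Symbol = g.Key, Weight = g.Count() }, g.Count());

        // Step (2): merge the two lightest nodes until only the root is left.
        while (queue.Count > 1)
        {
            Node left = queue.Dequeue();
            Node right = queue.Dequeue();
            var parent = new Node { Weight = left.Weight + right.Weight, Left = left, Right = right };
            queue.Enqueue(parent, parent.Weight);
        }

        // Step (3): walk the tree; a left edge appends '0', a right edge '1'.
        Assign(queue.Dequeue(), "", codes);
        return codes;
    }

    private static void Assign(Node node, string prefix, Dictionary<char, string> codes)
    {
        if (node.Left == null && node.Right == null)
        {
            // A lone symbol would otherwise get an empty code, so give it "0".
            codes[node.Symbol.Value] = prefix.Length == 0 ? "0" : prefix;
            return;
        }
        Assign(node.Left, prefix + "0", codes);
        Assign(node.Right, prefix + "1", codes);
    }
}
```

For the input "abracadabra", the individual codes depend on how ties between equal frequencies are broken, but the total encoded length is 23 bits either way.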
Frequency counting implementation:
using System.Collections.Generic;
using System.Linq;

public class FrequencyCounter
{
    public IEnumerable<KeyValuePair<char, int>> MapReduce(string str)
    {
        // GroupBy acts as the map step, while Select reduces each
        // intermediate group to its final count.
        var charOccurrences = str
            .GroupBy(c => c)
            .Select(group => new { Key = group.Key, Value = group.Count() })
            .OrderBy(kvp => kvp.Value);

        return from co in charOccurrences
               select new KeyValuePair<char, int>(co.Key, co.Value);
    }
}
Huffman coding class implementation:
using System.Collections.Generic;
using System.Linq;
using System.Text;

// MinPriorityQueue<T> and IHeapValue are helper types that are not listed in this article.
public class Huffman
{
    private List<HuffmanNode> originalNodes;
    private HuffmanNode rootNode;

    public Huffman(IEnumerable<KeyValuePair<char, int>> kvps)
    {
        // Keep the original (leaf) nodes
        var tmpOriginalNodes = from kvp in kvps
                               select new HuffmanNode(kvp.Key, kvp.Value);
        // Create the min-priority queue and load the data
        MinPriorityQueue<HuffmanNode> minQueue = new MinPriorityQueue<HuffmanNode>();
        originalNodes = new List<HuffmanNode>();
        foreach (var node in tmpOriginalNodes)
        {
            originalNodes.Add(node);
            minQueue.Insert(node);
        }
        // Build the coding tree and keep its root node
        while (!minQueue.IsEmpty)
        {
            HuffmanNode left = minQueue.ExtractMin();
            if (minQueue.IsEmpty)
            {
                rootNode = left;
                break;
            }
            HuffmanNode right = minQueue.ExtractMin();
            HuffmanNode newNode = new HuffmanNode(null, left.Value + right.Value, left, right);
            left.Parent = newNode;
            right.Parent = newNode;
            minQueue.Insert(newNode);
        }
    }

    // Encodes a single char; returns null if the char was not in the input data.
    public string Encode(char sourceChar)
    {
        HuffmanNode hn = originalNodes.FirstOrDefault(n => n.Key == sourceChar);
        if (hn == null) return null;
        // With only one distinct character the root is itself a leaf;
        // return "0" so that the code is not empty.
        if (hn.Parent == null) return "0";
        HuffmanNode parent = hn.Parent;
        StringBuilder rtn = new StringBuilder();
        while (parent != null)
        {
            // A left child is coded 0 and a right child 1. The walk goes from
            // leaf to root, so each bit is prepended.
            rtn.Insert(0, object.ReferenceEquals(parent.Left, hn) ? '0' : '1');
            hn = parent;
            parent = parent.Parent;
        }
        return rtn.ToString();
    }

    // Decodes the 0/1 string of a single character.
    public bool Decode(string string01, out char? output)
    {
        output = null;
        string bits = string01.Trim();
        // Single-leaf tree: the only valid code is "0" (see Encode).
        if (rootNode != null && rootNode.Left == null && rootNode.Right == null)
        {
            if (bits != "0") return false;
            output = rootNode.Key;
            return true;
        }
        HuffmanNode tmpNode = rootNode;
        foreach (char bit in bits)
        {
            if (tmpNode == null) return false; // walked past a leaf: not a valid code
            if (bit == '0') tmpNode = tmpNode.Left;
            else if (bit == '1') tmpNode = tmpNode.Right;
        }
        if (tmpNode != null && tmpNode.Left == null && tmpNode.Right == null)
        {
            output = tmpNode.Key;
            return true;
        }
        return false;
    }

    class HuffmanNode : IHeapValue
    {
        public HuffmanNode(char? key, int value, HuffmanNode left = null, HuffmanNode right = null)
        {
            this.Left = left;
            this.Right = right;
            this.Key = key;
            this.Value = value;
        }
        public HuffmanNode Left { get; private set; }
        public HuffmanNode Right { get; private set; }
        public HuffmanNode Parent { get; set; }
        public char? Key { get; private set; }
        public int Value { get; set; }
    }
}
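The Huffman class relies on MinPriorityQueue and IHeapValue, which this article does not list. Below is a minimal sketch of what they might look like, assuming an array-backed binary min-heap ordered by an integer Value property. The member names Insert, ExtractMin and IsEmpty are taken from the calls in the Huffman class; everything else, including the HeapDemo helper, is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the interface: anything that exposes an integer priority.
public interface IHeapValue
{
    int Value { get; }
}

// A minimal binary min-heap keyed on IHeapValue.Value (a sketch, not the original type).
public class MinPriorityQueue<T> where T : IHeapValue
{
    private readonly List<T> items = new List<T>();

    public bool IsEmpty => items.Count == 0;

    public void Insert(T item)
    {
        items.Add(item);
        int i = items.Count - 1;
        // Sift up while the new item is smaller than its parent.
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (items[parent].Value <= items[i].Value) break;
            (items[parent], items[i]) = (items[i], items[parent]);
            i = parent;
        }
    }

    public T ExtractMin()
    {
        if (IsEmpty) throw new InvalidOperationException("The queue is empty.");
        T min = items[0];
        int last = items.Count - 1;
        items[0] = items[last];
        items.RemoveAt(last);
        // Sift down: swap with the smaller child until heap order is restored.
        int i = 0;
        while (true)
        {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < items.Count && items[left].Value < items[smallest].Value) smallest = left;
            if (right < items.Count && items[right].Value < items[smallest].Value) smallest = right;
            if (smallest == i) break;
            (items[smallest], items[i]) = (items[i], items[smallest]);
            i = smallest;
        }
        return min;
    }
}

// Hypothetical usage helper: draining the queue yields values in ascending order.
public static class HeapDemo
{
    private sealed class Item : IHeapValue
    {
        public int Value { get; set; }
    }

    public static List<int> Drain(params int[] values)
    {
        var queue = new MinPriorityQueue<Item>();
        foreach (int v in values) queue.Insert(new Item { Value = v });
        var result = new List<int>();
        while (!queue.IsEmpty) result.Add(queue.ExtractMin().Value);
        return result;
    }
}
```

Both Insert and ExtractMin run in O(log n), so building the coding tree for n distinct characters takes O(n log n) overall.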
Usage for encoding a text:
string ori = "your text here";
FrequencyCounter fc = new FrequencyCounter();
var kvps = fc.MapReduce(ori);
Huffman hm = new Huffman(kvps);
StringBuilder sb = new StringBuilder();
foreach (char c in ori)
{
    sb.Append(hm.Encode(c));
}
string encodedText = sb.ToString(); // the text as a string of '0' and '1' characters
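Note that the encoded result is a string of '0' and '1' characters, each of which still occupies a full char in memory, so the saving only materializes once the bits are packed into bytes. Here is a minimal sketch of that packing step; the helper names BitPacker, Pack and Unpack are hypothetical and not part of the original code.

```csharp
using System;

public static class BitPacker
{
    // Packs a string of '0'/'1' characters into bytes, most significant bit first.
    // The final byte is zero-padded, so the caller must store the bit count too.
    public static byte[] Pack(string bits)
    {
        var bytes = new byte[(bits.Length + 7) / 8];
        for (int i = 0; i < bits.Length; i++)
        {
            if (bits[i] == '1')
                bytes[i / 8] |= (byte)(0x80 >> (i % 8));
            else if (bits[i] != '0')
                throw new ArgumentException("Expected only '0' and '1'.", nameof(bits));
        }
        return bytes;
    }

    // Reverses Pack: expands the first bitCount bits back into a '0'/'1' string.
    public static string Unpack(byte[] bytes, int bitCount)
    {
        var chars = new char[bitCount];
        for (int i = 0; i < bitCount; i++)
            chars[i] = (bytes[i / 8] & (0x80 >> (i % 8))) != 0 ? '1' : '0';
        return new string(chars);
    }
}
```

For instance, the nine bits "101100011" pack into the two bytes 0xB1 and 0x80, and Unpack with a bit count of 9 restores the original string.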
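Decoding the encoded text:
string bstr = encodedText; // the string of '0' and '1' characters produced by encoding
StringBuilder decoded = new StringBuilder();
char? outchar = null;
string tmpStr = "";
for (int i = 0; i < bstr.Length; i++)
{
    // Extend the candidate code one bit at a time until it matches a leaf.
    tmpStr = tmpStr + bstr[i];
    if (hm.Decode(tmpStr, out outchar))
    {
        tmpStr = "";
        decoded.Append(outchar.Value);
    }
}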
In testing, the compression effect is clearly noticeable.
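One rough way to quantify the effect is to compare the packed size of the encoded bits with the size of the original text. The sketch below assumes the original is stored as UTF-8 and ignores the cost of storing the code table alongside the compressed data; CompressionStats and SpaceSaved are hypothetical names.

```csharp
using System;
using System.Text;

public static class CompressionStats
{
    // Fraction of space saved: 1 - (packed size / original size), both in bytes.
    // encodedBits is the '0'/'1' string produced by the encoding step.
    public static double SpaceSaved(string original, string encodedBits)
    {
        int originalBytes = Encoding.UTF8.GetByteCount(original);
        if (originalBytes == 0) return 0.0;
        int packedBytes = (encodedBits.Length + 7) / 8; // round up to whole bytes
        return 1.0 - (double)packedBytes / originalBytes;
    }
}
```

For example, an 11-character ASCII text encoded in 23 bits packs into 3 bytes, a saving of about 73% before the code table is counted.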
That's all.
Author: Andy Zeng
Reposting in any form is welcome, but please be sure to credit the source.
http://www.cnblogs.com/andyzeng/p/3703321.html