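Like a Huffman tree, Shannon-Fano coding uses a binary tree to encode symbols. In practice, however, Shannon-Fano coding sees little use, because it is less efficient than Huffman coding (that is, the average code word length it produces can be larger). Its basic idea is still worth a look.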
Following the explanation on Wikipedia, let's look at how the Shannon-Fano algorithm works:
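A Shannon-Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple:

1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts.
2. Sort the list of symbols by frequency, with the most frequent symbols on the left.
3. Divide the list into two parts so that the total count of the left part is as close as possible to the total count of the right part.
4. Assign the binary digit 0 to the left part and the digit 1 to the right part.
5. Apply steps 3 and 4 recursively to each part until every symbol has become a leaf of the tree.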
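This example shows the construction of a Shannon-Fano code for a small alphabet (as shown in Figure a). The five symbols to be encoded occur with the following counts: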
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
Count | 15 | 7 | 6 | 6 | 5 |
Probabilities | 0.38461538 | 0.17948718 | 0.15384615 | 0.15384615 | 0.12820513 |
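All symbols are sorted by count, from left to right. Placing the dividing line between B and C yields two groups with total counts of 22 and 17, which minimizes the difference between the two groups. With this division, the code words of A and B both start with 0, and those of C, D, and E start with 1, as shown in Figure b. Next, the left half of the tree gets a new division between A and B, which makes A a leaf with code 00 and B a leaf with code 01. After four divisions, a tree of codes results. As the table below shows, in the final tree the three symbols with the highest frequencies all get 2-bit codes, and the two symbols with lower counts get 3-bit codes.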
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
编码 | 00 | 01 | 10 | 110 | 111 |
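
The division step described above can be sketched in isolation. This is a minimal illustration, not the implementation used later in this post: `bestSplit` is a hypothetical helper name, and it assumes the counts are positive and already sorted in descending order.

```cpp
#include <cassert>
#include <cstdlib>
#include <numeric>
#include <vector>

// Hypothetical helper: returns how many leading symbols go into the first
// group so that the two group totals are as close as possible.
// Assumes `counts` holds at least two positive values, sorted descending.
std::size_t bestSplit(const std::vector<int>& counts)
{
    assert(counts.size() >= 2);
    const int total = std::accumulate(counts.begin(), counts.end(), 0);
    int left = 0;
    int bestDiff = total;       // any real split of positive counts differs by less
    std::size_t bestK = 1;
    for (std::size_t k = 1; k < counts.size(); ++k) {
        left += counts[k - 1];  // total of the first k symbols
        const int diff = std::abs(left - (total - left));
        if (diff < bestDiff) {
            bestDiff = diff;
            bestK = k;
        }
    }
    return bestK;
}
```

For the counts 15, 7, 6, 6, 5 the function returns 2 (groups A, B and C, D, E with totals 22 and 17). Applied to the right group 6, 6, 5 it returns 1, which separates C from D and E.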
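Average code word length (for comparison, the entropy of this source is about 2.186 bits per symbol):

$$\frac{2 \cdot (15 + 7 + 6) + 3 \cdot (6 + 5)}{39} = \frac{89}{39} \approx 2.28 \text{ bits per symbol}$$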
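
The average code word length and the entropy for the example can be checked numerically. The sketch below is only an illustration: `averageLength` and `entropy` are hypothetical helper names, and the inputs are the counts and code lengths from the tables above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Average code length in bits per symbol:
// sum(count_i * length_i) / sum(count_i).
double averageLength(const std::vector<int>& counts,
                     const std::vector<int>& lengths)
{
    assert(counts.size() == lengths.size());
    long weighted = 0, total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        weighted += static_cast<long>(counts[i]) * lengths[i];
        total += counts[i];
    }
    assert(total > 0);
    return static_cast<double>(weighted) / total;
}

// Shannon entropy in bits per symbol: -sum(p_i * log2(p_i)).
double entropy(const std::vector<int>& counts)
{
    long total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
        total += counts[i];
    assert(total > 0);
    double h = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i] == 0)
            continue;           // a symbol that never occurs contributes 0
        const double p = static_cast<double>(counts[i]) / total;
        h -= p * std::log(p) / std::log(2.0);
    }
    return h;
}
```

With the counts 15, 7, 6, 6, 5 and the Shannon-Fano code lengths 2, 2, 2, 3, 3, `averageLength` gives 89/39, about 2.28 bits per symbol. A Huffman code for the same counts has lengths 1, 3, 3, 3, 3, which gives 87/39, about 2.23, and this illustrates why Huffman coding is usually preferred. Both values stay above the entropy of roughly 2.186.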
Pseudocode:

```
begin
  count source units
  sort source units into non-decreasing order
  SF-Split(S)
  output(count of symbols, encoded tree, symbols)
  write output
end

procedure SF-Split(S)
begin
  if (|S| > 1) then
  begin
    divide S into S1 and S2 with total counts as equal as possible
    append 1 to the codes in S1
    append 0 to the codes in S2
    SF-Split(S1)
    SF-Split(S2)
  end
end
```
```cpp
/************************************************************************/
/* File Name:    Shanno-Fano.cpp
 * @Function:    Lossless Compression
 * @Author:      Sophia Zhang
 * @Create Time: 2012-9-26 20:20
 * @Last Modify: 2012-9-26 20:57
 */
/************************************************************************/
#include <algorithm>
#include <climits>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <iterator>
#include <map>
#include <vector>
using namespace std;

#define NChar 8                 // suppose 8 bits are used to describe every symbol
#define Nsymbols (1 << NChar)   // 256 distinct symbols in total (includes a-z, A-Z)
#define INF INT_MAX

typedef vector<bool> SF_Code;   // code word of one char, one bool per bit
map<char, SF_Code> SF_Dic;      // Shannon-Fano coding dictionary
int Sumvec[Nsymbols + 1];       // prefix sums of the sorted symbol counts

class HTree
{
public:
    HTree* left;
    HTree* right;
    char ch;
    int weight;

    HTree() { left = right = NULL; weight = 0; ch = '\0'; }
    HTree(HTree* l, HTree* r, int w, char c) { left = l; right = r; weight = w; ch = c; }
    ~HTree() { delete left; delete right; }
    bool Isleaf() { return !left && !right; }
};

bool comp(const HTree* t1, const HTree* t2) // sort by descending weight
{
    return t1->weight > t2->weight;
}

typedef vector<HTree*> TreeVector;
TreeVector TreeArr; // leaf nodes, sorted by descending count

// Build the subtree for the sorted symbols a..b (1-based, inclusive):
// find the best split point, then recurse on both halves.
HTree* Optimize_Tree(int a, int b)
{
    if (a == b) // a single symbol: reuse its leaf node
        return TreeArr[a - 1];

    // Find the split point x that minimizes |sum(a..x) - sum(x+1..b)|.
    // This could also be implemented with a binary search.
    int x = a, minn = INF, curdiff;
    for (int i = a; i < b; i++)
    {
        curdiff = Sumvec[i] * 2 - Sumvec[a - 1] - Sumvec[b];
        if (abs(curdiff) < minn)
        {
            x = i;
            minn = abs(curdiff);
        }
        else
            break; // curdiff is monotonic in i, so it cannot improve again
    }

    HTree* lc = Optimize_Tree(a, x);
    HTree* rc = Optimize_Tree(x + 1, b);
    return new HTree(lc, rc, Sumvec[b] - Sumvec[a - 1], '\0');
}

HTree* BuildTree(int* frequency) // create the tree with Optimize_Tree
{
    for (int i = 0; i < Nsymbols; i++) // one leaf per symbol that occurs
    {
        if (frequency[i])
            TreeArr.push_back(new HTree(NULL, NULL, frequency[i], (char)i));
    }
    if (TreeArr.empty())
        return NULL;

    // stable_sort keeps symbols with equal counts in a fixed order
    stable_sort(TreeArr.begin(), TreeArr.end(), comp);

    memset(Sumvec, 0, sizeof(Sumvec));
    for (size_t i = 1; i <= TreeArr.size(); i++)
        Sumvec[i] = Sumvec[i - 1] + TreeArr[i - 1]->weight;

    return Optimize_Tree(1, (int)TreeArr.size());
}

/************************************************************************/
/* Assign Shannon-Fano codes to the Shannon-Fano tree.
 * PS: this generation step is the same as for Huffman coding.
 */
/************************************************************************/
void Generate_Coding(HTree* root, SF_Code& curcode)
{
    if (!root)
        return;
    if (root->Isleaf())
    {
        SF_Dic[root->ch] = curcode;
        return;
    }
    SF_Code lcode = curcode;
    SF_Code rcode = curcode;
    lcode.push_back(false); // left branch appends 0
    rcode.push_back(true);  // right branch appends 1
    Generate_Coding(root->left, lcode);
    Generate_Coding(root->right, rcode);
}

int main()
{
    int freq[Nsymbols] = {0};
    const char* str = "bbbbbbbccccccaaaaaaaaaaaaaaaeeeeedddddd"; // 15 a, 7 b, 6 c, 6 d, 5 e

    // count character frequencies
    while (*str != '\0')
        freq[(unsigned char)*str++]++;

    // build tree
    HTree* r = BuildTree(freq);
    SF_Code nullcode;
    Generate_Coding(r, nullcode);

    for (map<char, SF_Code>::iterator it = SF_Dic.begin(); it != SF_Dic.end(); it++)
    {
        cout << it->first << '\t';
        std::copy(it->second.begin(), it->second.end(), std::ostream_iterator<bool>(cout));
        cout << endl;
    }

    delete r; // frees the whole tree, including the leaves held in TreeArr
    return 0;
}
```
Result:
The program encodes the counts from the figure above:
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
Count | 15 | 7 | 6 | 6 | 5 |
Reference: