Green hand
A text-compression model has to predict (or count) the probability with which each character occurs; the model supplies this probability distribution, and the decoder applies the same distribution when decoding. Below is a first character-level implementation.
Equation (1): Entropy = Sum_i( -P[i] * log2(P[i]) )
Semi-static modeling
In a first pass over the text we compute the probability P[i] of each character i; the term -log2(P[i]) from Equation (1) then gives the (ideal) length of that character's code.
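A minimal sketch of this first pass (the helper ideal_code_lengths and its names are my own, not part of the implementation further below): count the characters once, then Equation (1) gives both the entropy and the ideal length -log2(P[i]) of each character's code.
import math
from collections import Counter
def ideal_code_lengths(text):
    #count only the 26 lowercase English letters, as in the rest of this post
    counts = Counter(c for c in text.lower() if 'a' <= c <= 'z')
    total = sum(counts.values())
    lengths = {}
    entropy = 0.0
    for c, n in counts.items():
        p = float(n) / total
        lengths[c] = -math.log(p, 2)      #ideal length of c's code, in bits
        entropy += -p * math.log(p, 2)    #Equation (1)
    return entropy, lengths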
Adaptive modeling
We start from a smooth (uniform) probability distribution over the characters and update it using only the text received so far. For example, in a 1000-character passage, if we are encoding or decoding the 400th character and 'u' has occurred 20 times in the 400 characters already read, we set P['u'] = 20.0/400. Built this way, encoder and decoder share exactly the same model. To avoid the zero-frequency problem, every character's count is initialised to 1, as if it had already appeared once.
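A small sketch of this shared update (adaptive_stream is a hypothetical helper, not the adaModel function shown later): the probability used to code each character is built only from the characters before it, with every letter given an initial count of 1.
from collections import defaultdict
def adaptive_stream(chars):
    #every letter starts with a pseudo-count of 1 (the zero-frequency fix)
    counts = defaultdict(lambda: 1)
    seen = 0
    for c in chars:
        #the model for c uses only the characters before it, so encoder
        #and decoder stay perfectly in sync
        p = float(counts[c]) / (seen + 26)
        yield c, p
        counts[c] += 1
        seen += 1
With the add-one counts, the estimate in the example above becomes (20+1)/(400+26) instead of exactly 20/400, but encoder and decoder still agree.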
Canonical Huffman modeling
Take this case for instance: with an ordinary Huffman tree, decoding n symbols requires n-1 internal nodes plus n leaves, and each internal node holds 2 pointers. Altogether about 4n machine words are needed to decode n symbols; with 4-byte words, a vocabulary of one million words can therefore cost up to about 16 MB of memory just for decoding.
With a canonical Huffman code, by contrast, we need only about n + 100 words of memory.
Canonical Huffman codes are a subset of Huffman codes.
First, we provide the principles and some parameters:
Principles:
Principle 1: codewords of the same length are consecutive values, e.g. 3, 4, 5 (decimal).
Principle 2: the first code of length i can be calculated from the last code of length i-1 using Equation (2).
Principle 3: the first code of the minimal length is 0.
Parameters:
firstcode[i]: the first code of length i, computed with Equation (2); it is an actual binary codeword;
numl[i]: the number of codes of length i;
index[i]: the index in the dictionary of the first code of length i.
Equation (2): firstcode[i] = 2*(lastcode[i-1]+1), firstcode[min_len] = 0
Second, construct the code words:
e.g.
characters 'a'~'u' have code lengths 3 for 'a', 4 for 'b'~'i' and 5 for 'j'~'u'. By Principle 3, 'a' gets the code 000b. By Principle 2 (Equation (2)), firstcode[4] = 2*(0+1) = 0010b, so 'b' is 0010b; by Principle 1, 'c' is 0011b, 'd' is 0100b, and so on.
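The construction can be written down directly; the sketch below (build_canonical and its variable names are mine, not part of the code further down) reproduces the tables and the codewords of this example from nothing but the code lengths.
def build_canonical(lengths):
    #lengths: {character: code length in bits}
    min_len, max_len = min(lengths.values()), max(lengths.values())
    numl = [0] * (max_len + 1)
    for l in lengths.values():
        numl[l] += 1
    firstcode = [0] * (max_len + 1)
    index = [0] * (max_len + 1)
    #Equation (2): firstcode[i] = 2*(lastcode[i-1]+1) = 2*(firstcode[i-1]+numl[i-1])
    for i in range(min_len + 1, max_len + 1):
        firstcode[i] = 2 * (firstcode[i - 1] + numl[i - 1])
        index[i] = index[i - 1] + numl[i - 1]
    #the dictionary: characters sorted by code length, then alphabetically;
    #codes of equal length are consecutive values (Principle 1)
    table = [c for c, l in sorted(lengths.items(), key=lambda x: (x[1], x[0]))]
    codes, nextcode = {}, list(firstcode)
    for c in table:
        l = lengths[c]
        codes[c] = format(nextcode[l], '0%db' % l)
        nextcode[l] += 1
    return firstcode, numl, index, table, codes
lengths = {'a': 3}
lengths.update((chr(c), 4) for c in range(ord('b'), ord('i') + 1))
lengths.update((chr(c), 5) for c in range(ord('j'), ord('u') + 1))
firstcode, numl, index, table, codes = build_canonical(lengths)
print codes['a'], codes['b'], codes['c']   #-> 000 0010 0011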
Finally, decoding algorithm:
The value of the first j bits of a codeword of length i is greater than the value of any codeword of length j (for j < i).
We therefore first determine the actual length i of the next pending codeword; the difference between its value and firstcode[i] then gives its offset from index[i] in the dictionary.
Python code:
#!/usr/bin/env python
'''
Character-level example using only the 26 English letters; the encoding side of
the canonical Huffman model is not implemented here.
'''
import re
def lines(file):
'''
    yield the text one character at a time, appending '\n' at the end
'''
for i in file: yield i
yield '\n'
def blocks(file):
'''
    group consecutive non-whitespace characters into words and yield them one by one
'''
b=[]
for i in lines(file):
if i.strip():
b.append(i)
elif b:
yield ''.join(b).strip()
b=[]
def word_index():
'''
    decode the utf-8 bytes to unicode before comparing, ignoring errors we cannot handle
    case is ignored as well, so 'A' and 'a' count as the same word
    finally the word list is sorted
'''
vocabulary=[]
total=0
with open('./casual/te.txt') as f:
for i in blocks(f.read()):
if i.lower() not in vocabulary:
flag=True
for j in i:
jc=unicode(j,'utf-8',errors='ignore')
                    #if any character in the word is not an English letter, skip the word
if not ((jc>=u'\u0041' and jc<=u'\u005a') or (jc>=u'\u0061' and jc<=u'\u007a')):
flag=False
if flag:
vocabulary.append(i.lower())
total+=1
vocabulary.sort()
print vocabulary
print total
def semiStaticModeling():
'''
    build a semi-static model for Huffman coding
    the whole passage is read once to build a static model,
    and the probability of each character is calculated from it
'''
#chars=[]
total=0
chars=[0 for i in range(26)]
with open('./casual/te.txt') as f:
for i in lines(f.read()):
if i.strip():
#chars.append(unicode(i.lower(), errors='ignore'))
total+=1
                #count only the lowercase letters a-z
                if 'a' <= i <= 'z':
                    chars[ord(i) - ord('a')] += 1
for i in chars: print float(i)/total
return chars
def adaModel(file):
'''
adaptive modeling
    zero-order, character-level model
    to avoid the 'zero-frequency' problem, each of the 26 characters starts with a count of 1
    ONLY the 26 English characters are modelled here, as an example
    attributes:
    filedes: number of characters read from the passage so far
'''
filedes=0
chars=[]
chars_a=1
pa=1.0/26
with open(file) as f:
for i in lines(f.read()):
#if 'i' is not a whitespace character, append it
if i.strip():
chars.append(i.lower())
filedes+=1
                if i == 'a':
                    #the initial count of 1 already covers the zero-frequency case
                    chars_a+=1
                #adaptive estimate with add-one smoothing over the 26 letters
                pa=float(chars_a)/(filedes+26)
    print filedes, pa
def canonicalM(bits, firstcode, numl, index, table):
    '''
    DECODE
    Canonical Huffman decoding method
    Attributes:
    bits: iterable over the 0/1 bits of the encoded stream
    firstcode[i]: the first code of length i, computed with Equation (2)
    numl[i]: number of codes of length i
    index[i]: index in the dictionary of the first code of length i
    table: the dictionary of characters, sorted by code length then code value
    Equation (2): firstcode[i]=2*(lastcode[i-1]+1), firstcode[min_len]=0
    e.g. --http://blog.csdn.net/goncely/article/details/616589
    firstcode[3:5] = 000b, 0010b, 10100b
    numl[3:5] = 1(a), 8(b~i), 12(j~u)
    index[3:5] = 0, 1, 9
    '''
    code=0
    l=0
    for b in bits:
        #append the next bit and extend the candidate length
        code=(code<<1)|b
        l+=1
        #the first l bits form a valid codeword exactly when their value lies in
        #[firstcode[l], firstcode[l]+numl[l]); a prefix of a longer codeword is
        #always numerically larger than every codeword of its own length
        if firstcode[l] <= code < firstcode[l]+numl[l]:
            print table[index[l]+code-firstcode[l]]
            return
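A hypothetical usage of this decoder with the tables from the example above (the lists are indexed by code length; entries for lengths 0-2 are unused):
firstcode = [0, 0, 0, 0, 2, 20]
numl      = [0, 0, 0, 1, 8, 12]
index     = [0, 0, 0, 0, 1, 9]
table     = [chr(c) for c in range(ord('a'), ord('u') + 1)]
bits = [0, 0, 1, 1]          #the codeword 0011b, i.e. 'c'
canonicalM(iter(bits), firstcode, numl, index, table)   #prints 'c'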
Reference: http://blog.csdn.net/goncely/article/details/616589