Text Compression 1

Green hand (beginner's notes)
A text-compression model predicts (or counts) the probability with which each character occurs: the model supplies this probability distribution function (PDF) over the characters, and the decoder applies the same distribution function to decode. Below we implement some first character-level models.
Equation (*1): Entropy = Sum_i ( -P[i] * log2(P[i]) )
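
For a quick sense of what equation (*1) gives: the entropy is the average ideal code length in bits per character, and -log2(P[i]) is the ideal code length for character i. A minimal sketch (added here for illustration, with a made-up toy distribution):

import math

def entropy(probs):
    #average ideal code length, in bits per character
    return sum(-p*math.log(p, 2) for p in probs if p > 0)

#a toy distribution over four characters (made-up numbers)
probs=[0.5, 0.25, 0.125, 0.125]
print entropy(probs)                                      #1.75 bits per character
print [int(math.ceil(-math.log(p, 2))) for p in probs]   #ideal code lengths [1, 2, 3, 3]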

  1. Semi-static modeling
    In a first pass over the text, we count the probability of each character (i.e. P[i] for character i), then we use equation (*1) to set the code length of each character.

  2. Adaptive modeling
    We start with a smooth (uniform) PDF over the characters, then update the probability of each character from just the text received so far. E.g. in a 1000-character passage, if we are encoding or decoding at the 400th character and the character 'u' has appeared 20 times in those 400 characters read so far, we set P['u'] = 20.0/400. In this way, encoder and decoder share the same PDF model. To avoid the 'zero-frequency' problem, we initialize each character as having appeared 1 time.

  3. Canonical Huffman modeling
    For motivation, consider decoding with an ordinary Huffman model: decoding over n characters requires a tree with n-1 internal nodes and n leaves, and each node holds 2 pointers, so the tree costs roughly 4n words of memory. In practice, decoding over a vocabulary of 1M words can cost up to 16MB of memory (4n pointers of 4 bytes each).

Compared with the ordinary Huffman tree, a canonical Huffman code needs only about n+100 words of memory: the n-entry symbol table plus a few small arrays indexed by code length.

A canonical Huffman tree is a subset of the Huffman trees.
First, we give the principles and some parameters:
Principles:
(Principle-1). codes of the same length are consecutive binary numbers, e.g. 3D, 4D, 5D
(Principle-2). the first code of length i can be calculated from the last code of length i-1 using equation (*2)
(Principle-3). the first code of the minimal length is 0D
Parameters:
firstcode[i]: the first code of length i; we can calculate it with equation (*2); it is a real binary codeword;
numl[i]: the total number of codes of length i;
index[i]: the index in the dictionary of the first code of length i.
Equation (*2): firstcode[i] = 2*(last_code[i-1]+1), firstcode[min_len] = 0
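
A minimal sketch (added for illustration; build_firstcode is a hypothetical helper) of deriving firstcode and index from the per-length counts numl via equation (*2):

def build_firstcode(numl, min_len, max_len):
    #firstcode[i]=2*(last_code[i-1]+1), where
    #last_code[i-1]=firstcode[i-1]+numl[i-1]-1, and firstcode[min_len]=0
    firstcode={min_len: 0}
    index={min_len: 0}
    for i in range(min_len+1, max_len+1):
        firstcode[i]=2*(firstcode[i-1]+numl[i-1])
        index[i]=index[i-1]+numl[i-1]
    return firstcode, index

#the 'a'~'u' example below: 1 code of length 3, 8 of length 4, 12 of length 5
numl={3: 1, 4: 8, 5: 12}
firstcode, index=build_firstcode(numl, 3, 5)
print firstcode   #{3: 0, 4: 2, 5: 20}, i.e. 000b, 0010b, 10100b
print index       #{3: 0, 4: 1, 5: 9}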

Second, construct the code words:
e.g.
characters 'a'~'u' have the code lengths 'a'-3, ('b':'i')-4, ('j':'u')-5. With Principle-3 we get 'a' with code '000b'. With Principle-2 (and Principle-1 within each length) we then get 'b' as '0010b', 'c' as '0011b', etc., as the sketch below confirms.
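
Continuing the sketch (assign_codes is again a hypothetical helper, reusing firstcode from the previous snippet), the full codeword table for 'a'~'u' comes out as:

def assign_codes(symbols_by_length, firstcode):
    #codes of the same length are consecutive values starting at firstcode[l]
    codes={}
    for l, symbols in symbols_by_length.items():
        for offset, s in enumerate(symbols):
            codes[s]=format(firstcode[l]+offset, '0%db' % l)
    return codes

codes=assign_codes({3: 'a', 4: 'bcdefghi', 5: 'jklmnopqrstu'}, firstcode)
print codes['a'], codes['b'], codes['c'], codes['j']   #000 0010 0011 10100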

Finally, the decoding algorithm:
Key property: the value of the first j bits of a codeword of length i > j is greater than the value of any codeword of length j.
So we first find the actual length of the next pending codeword; then the offset between the code value and firstcode[i] locates the symbol in the dictionary (see canonicalM below).

Python code:

#!/usr/bin/env python
'''
character-level only, with the 26 English letters as an example;
the encoding side of the Canonical Huffman model is not implemented
'''

def lines(file):
    '''
    yield the text character by character, with a newline appended at the end
    '''

    for i in file: yield i
    yield '\n'

def blocks(file):
    '''
    group consecutive non-whitespace characters into words and yield them one by one
    '''

    b=[]
    for i in lines(file):
        if i.strip():
            b.append(i)
        elif b: 
            yield ''.join(b).strip()
            b=[]

def word_index():
    '''
    decode the 'utf-8' bytes to unicode before comparing, ignoring
    undecodable bytes; also fold case so that 'A' and 'a' match;
    finally, sort the word list
    '''

    vocabulary=[]
    total=0
    with open('./casual/te.txt') as f:
        for i in blocks(f.read()):
            if i.lower() not in vocabulary:
                flag=True
                for j in i:

                    jc=unicode(j,'utf-8',errors='ignore')
                    #if any char in the single word is not an English character, throw it
                    if not ((jc>=u'\u0041' and jc<=u'\u005a') or (jc>=u'\u0061' and jc<=u'\u007a')):
                        flag=False
                if flag:
                    vocabulary.append(i.lower())
            total+=1

    vocabulary.sort()
    print vocabulary
    print total


def semiStaticModeling():
    '''
    build up a semi-static model here for Huffman codes:
    in this model, we first read through the whole passage and build a
    static model, i.e. calculate the probability of each character
    '''

    total=0
    chars=[0 for i in range(26)]

    with open('./casual/te.txt') as f:
        for i in lines(f.read()):
            if i.strip():
                total+=1
                #count only the lowercase letters 'a'..'z'
                if 'a' <= i <= 'z':
                    chars[ord(i)-ord('a')]+=1

    for i in chars: print float(i)/total
    return chars

def adaModel(file):
    '''
    adaptive modeling
    zero-order, character-level model
    to avoid the 'zero-frequency' problem, each of the 26 characters
    is initialized as having appeared 1 time

    ONLY the probability of 'a' is tracked here, as an example

    attributes:
        filedes: the current position (character count) in the passage
    '''

    filedes=0
    chars=[]
    chars_a=1
    pa=1.0/26
    with open(file) as f:
        for i in lines(f.read()):
            #if 'i' is not a whitespace character, count it
            if i.strip():
                chars.append(i.lower())
                filedes+=1
                if unicode(i, errors='ignore') == u'a':
                    chars_a+=1
                    #count of 'a' over all characters read so far, including
                    #the 26 initial pseudo-counts in the denominator
                    pa=float(chars_a)/(filedes+26)
                    print filedes, pa


def canonicalM(bits, firstcode, numl, index, table, min_len):
    '''
    DECODE
    Canonical Huffman modeling decoding method

    Attributes:
        bits: an iterator over the input bits (0 or 1)
        firstcode[i]: the first code of length i, calculated with equation (*2)
        numl[i]: total number of codes of length i
        index[i]: the index in the dictionary of the first code of length i
        table: the characters, stored in dictionary order
        min_len: the minimal code length

    (*2): firstcode[i]=2*(last_code[i-1]+1), firstcode[min_len]=0
    e.g.  --http://blog.csdn.net/goncely/article/details/616589
        firstcode[3:5] = 000b, 0010b, 10100b
        numl[3:5] = 1(a), 8(b~i), 12(j~u)
        index[3:5] = 0, 1, 9
    '''

    code=0
    l=0
    for bit in bits:
        code=(code<<1)|bit
        l+=1
        #a valid codeword of length l lies in [firstcode[l], firstcode[l]+numl[l]-1];
        #any l-bit prefix of a longer codeword is numerically greater than that
        #range, so the first hit is the right length
        if l>=min_len and code<firstcode[l]+numl[l]:
            print table[index[l]+code-firstcode[l]]
            return
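
A quick check of the decoder, using the example tables from the docstring above (the same values the earlier sketch computes); the concatenated bits of 'a', 'b', 'c', 'j' decode back correctly:

firstcode={3: 0, 4: 2, 5: 20}
numl={3: 1, 4: 8, 5: 12}
index={3: 0, 4: 1, 5: 9}
table='abcdefghijklmnopqrstu'

#'a'=000b, 'b'=0010b, 'c'=0011b, 'j'=10100b
bits=iter([int(b) for b in '000'+'0010'+'0011'+'10100'])
for _ in range(4):
    canonicalM(bits, firstcode, numl, index, table, 3)   #prints a, b, c, j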

Reference: http://blog.csdn.net/goncely/article/details/616589
