N-Gram Data Structures

An n-gram language model in ARPA format looks like this:

\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445

\1-grams:
-5.24036        'cause  -0.2084827
-4.675221       'em     -0.221857
-4.989297       'n      -0.05809768
-5.365303       'til    -0.1855581
-2.111539       </s>    0.0
-99     <s>     -0.7736475
-1.128404       <unk>   -0.8049794
-2.271447       a       -0.6163939
-5.174762       a's     -0.03869072
-3.384722       a.      -0.1877073
-5.789208       a.'s    0.0
-6.000091       aachen  0.0
-4.707208       aaron   -0.2046838
-5.580914       aaron's -0.06230035
-5.789208       aarons  -0.07077657
-5.881973       aaronson        -0.2173971
For details, see: ARPA n-gram language model format
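
Each line of an \N-grams: section in the sample above has three parts: a log10 probability, the N words of the n-gram, and a log10 backoff weight, which is absent for n-grams of the highest order. The following is a minimal sketch of parsing one such line. ParsedLine and parse_ngram_line are hypothetical names chosen for this illustration and do not come from the original code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch (not part of the original code).
struct ParsedLine {
    double log_prob = 0.0;           // first column: log10 probability
    std::vector<std::string> words;  // the `order` words of the n-gram
    double log_bo = 0.0;             // last column: log10 backoff weight, 0.0 if absent
};

// Parses one whitespace-separated line of an "\N-grams:" section.
// Returns false (leaving `out` untouched) if the probability or any of
// the `order` words is missing.
bool parse_ngram_line(const std::string &line, int order, ParsedLine &out) {
    assert(order >= 1);
    std::istringstream in(line);
    ParsedLine tmp;
    if (!(in >> tmp.log_prob)) return false;
    for (int i = 0; i < order; ++i) {
        std::string w;
        if (!(in >> w)) return false;
        tmp.words.push_back(w);
    }
    // The backoff weight is optional; treat a missing one as 0.0.
    if (!(in >> tmp.log_bo)) tmp.log_bo = 0.0;
    out = tmp;
    return true;
}
```

For the first 1-gram line of the sample, this would yield the word 'cause with a log probability of about -5.24 and a backoff weight of about -0.208.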


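A complete ARPA-LM is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.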


I. The n-gram data structure


A single n-gram entry is stored as follows:
    typedef struct
    {
        real        log_prob ;
        real        log_bo ;
        int         *words ;
    } ARPALMEntry ;
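  1. words: the words that make up this n-gram. A 1-gram has a single word; for a 2-gram, words holds the indices of its two words, and so on.
  2. log_bo: the backoff weight of the n-gram, in log form (the last column of an ARPA line).
  3. log_prob: the log probability of the n-gram, that is, of its last word given the preceding words (the first column of an ARPA line).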
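
To make the layout concrete, here is a minimal sketch of filling one entry. It assumes that real is a typedef for float (the original does not show its definition); init_entry and free_entry are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdlib>

typedef float real;  // assumption: the definition of `real` is not shown in the original

typedef struct
{
    real        log_prob ;
    real        log_bo ;
    int         *words ;
} ARPALMEntry ;

// Hypothetical helper: stores the two log values and a private copy of
// the `order` word indices that identify the n-gram.
void init_entry(ARPALMEntry *entry, int order, const int *word_ids,
                real log_prob, real log_bo)
{
    assert(entry != NULL && word_ids != NULL && order >= 1);
    entry->log_prob = log_prob;
    entry->log_bo = log_bo;
    entry->words = (int *)std::malloc(order * sizeof(int));
    assert(entry->words != NULL);
    for (int i = 0; i < order; ++i)
        entry->words[i] = word_ids[i];
}

// Hypothetical helper: releases the word index array owned by the entry.
void free_entry(ARPALMEntry *entry)
{
    std::free(entry->words);
    entry->words = NULL;
}
```

Note that the entry itself does not record its order, so the length of words has to be known from context, for example from which per-order array the entry is stored in.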

II. The ARPA-LM data structure

The whole n-gram language model, made up of many entries, is stored as follows:
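class ARPALM
 {
    public:
        Vocabulary *vocab ;

        int            order ;
        ARPALMEntry    **entries ; // all entries of the language model, one array per order
        int            *n_ngrams ; // one slot per order (1-gram, 2-gram, 3-gram, ...); each slot holds the number of entries of that order

        char           *unk_wrd ; // the word used for vocabulary words that are not in the language model
        int            unk_id ;// ID of that word; it is assigned the last index in the vocabulary

        int            n_unk_words ;
        int            *unk_words ;
    private: 
        bool           *words_in_lm ; // boolean array marking whether each vocabulary word is in the language model
} ;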
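  1. vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: in-memory storage model of the vocabulary.
  2. entries: all n-gram entries of the language model, as a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] the 2-grams, and so on.
  3. n_ngrams: integer array holding, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
  4. unk_wrd: the unknown word, which stands for vocabulary words that are allowed to be absent from the language model.
  5. unk_id: the ID of the unknown word; it is assigned the last word index in the vocabulary.
  6. n_unk_words: counted after the language model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, which means every vocabulary word must appear in the language model.
  7. unk_words: the indices of the words counted in item 6.
  8. words_in_lm: marks, for each vocabulary word, whether it appears in the language model.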
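
The relationship between words_in_lm, unk_words and n_unk_words described in items 6 to 8 can be sketched as follows. This is only an illustration under assumptions: it uses std::vector instead of the raw arrays of the class, and find_unk_words is a hypothetical function that is not part of the original code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper. `n_vocab` is the vocabulary size and
// `unigram_word_ids` holds the vocabulary index of every word seen in
// the 1-gram section. On return, words_in_lm[i] is true if word i
// occurred in the language model; the result lists the indices of the
// vocabulary words that did not (its size plays the role of n_unk_words).
std::vector<int> find_unk_words(int n_vocab,
                                const std::vector<int> &unigram_word_ids,
                                std::vector<bool> &words_in_lm)
{
    assert(n_vocab >= 0);
    words_in_lm.assign(n_vocab, false);
    for (int id : unigram_word_ids) {
        assert(id >= 0 && id < n_vocab);
        words_in_lm[id] = true;
    }
    std::vector<int> unk_words;
    for (int id = 0; id < n_vocab; ++id) {
        if (!words_in_lm[id])
            unk_words.push_back(id);
    }
    return unk_words;
}
```

With a five-word vocabulary in which only words 0, 2 and 3 occur as 1-grams, the result contains the indices 1 and 4. Following item 6, a loader would reject a non-empty result when no unk_wrd has been specified.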
