First, let's look at the format of the language model file.
\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445

\1-grams:
-5.24036 'cause -0.2084827
-4.675221 'em -0.221857
-4.989297 'n -0.05809768
-5.365303 'til -0.1855581
-2.111539 </s> 0.0
-99 <s> -0.7736475
-1.128404 <unk> -0.8049794
-2.271447 a -0.6163939
-5.174762 a's -0.03869072
-3.384722 a. -0.1877073
-5.789208 a.'s 0.0
-6.000091 aachen 0.0
-4.707208 aaron -0.2046838
-5.580914 aaron's -0.06230035
-5.789208 aarons -0.07077657
-5.881973 aaronson -0.2173971

(Note: all of the values above are base-10 logarithms.)
The listing above is an excerpt from a language model. The general layout of a trigram language model file is:
\data\
ngram 1=nr    # number of unigrams
ngram 2=nr    # number of bigrams
ngram 3=nr    # number of trigrams

\1-grams:
pro_1 word1 back_pro1

\2-grams:
pro_2 word1 word2 back_pro2

\3-grams:
pro_3 word1 word2 word3

\end\

The first field of each entry is the conditional probability of the n-gram, that is, P(wordN | word1, word2, ..., wordN-1).
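The second field is the word sequence of the n-gram.
The last field is the backoff weight. Entries of the highest order (here, the 3-grams) have no backoff weight.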
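As a rough illustration of the three fields, here is a minimal Python sketch that reads a file in this layout into nested dictionaries. The function name `parse_arpa` and the variable `SAMPLE` are made up for this example, `SAMPLE` is just a few unigram lines trimmed from the excerpt above, and treating a missing backoff weight as 0.0 (log10 of 1) is an assumption of the sketch, not something the file format states.

```python
def parse_arpa(lines):
    """Parse ARPA-style lines into {order: {words: (log10_prob, log10_backoff)}}."""
    models = {}
    order = None  # n-gram order of the section currently being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\2-grams:".
            order = int(line[1:line.index("-")])
            models[order] = {}
            continue
        if order is None:
            # Still inside the "\data\" header with its "ngram N=count" lines.
            continue
        fields = line.split()
        log_prob = float(fields[0])            # first field: log10 probability
        words = tuple(fields[1:1 + order])     # second field: the n-gram words
        # Last field: backoff weight, absent for the highest order.
        # Assumption: a missing weight is treated as 0.0, i.e. log10(1).
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        models[order][words] = (log_prob, backoff)
    return models


# A trimmed excerpt of the listing shown earlier (unigrams only).
SAMPLE = r"""\data\
ngram 1=3

\1-grams:
-2.111539 </s> 0.0
-99 <s> -0.7736475
-2.271447 a -0.6163939

\end\
"""

models = parse_arpa(SAMPLE.splitlines())
print(models[1][("a",)])  # (-2.271447, -0.6163939)
```

Keying each table by a tuple of words keeps the lookup identical for every order, which is convenient when the backoff computation has to drop words from the front of the context.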
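As an example, take three consecutive words and compute the probability of the third word given the first two.
P(word3 | word1, word2) is the probability that word3 appears given that word1 and word2 have already appeared. For instance, P(平 | 习, 进) is the probability that the next character is 平 given that the two characters 习进 have already appeared. It is computed as follows: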
if (the trigram (word1, word2, word3) exists) {
    return pro_3(word1, word2, word3);
} else if (the bigram (word1, word2) exists) {
    return back_pro2(word1, word2) * P(word3 | word2);
} else {
    return P(word3 | word2);
}
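The bigram probability P(word2 | word1) is computed the same way, one order lower (P(word3 | word2) above is this rule applied to word2 and word3):

if (the bigram (word1, word2) exists) {
    return pro_2(word1, word2);
} else {
    return back_pro1(word1) * pro_1(word2);
}

Because the file stores base-10 logarithms, each product in these rules becomes a sum when you work directly with the stored values.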
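The backoff rules can be sketched in Python as a single recursive function that works in the log10 domain, so the products turn into sums. This is only a sketch under stated assumptions: the function name `log10_prob`, the table layout `{order: {words: (log10_prob, log10_backoff)}}`, the fallback to `<unk>` for unknown words, and every number in `toy` are invented for illustration and do not come from a real model.

```python
def log10_prob(models, context, word):
    """Return log10 P(word | context), backing off to shorter contexts."""
    context = tuple(context)
    ngram = context + (word,)
    order = len(ngram)
    table = models.get(order, {})
    if ngram in table:
        # The full n-gram is listed: use its stored log10 probability.
        return table[ngram][0]
    if order == 1:
        # Assumption of this sketch: unknown words map to the <unk> entry.
        return models[1][("<unk>",)][0]
    # Backoff weight of the context if it is listed, otherwise 0.0 (log10 of 1).
    backoff = models.get(order - 1, {}).get(context, (0.0, 0.0))[1]
    # Drop the oldest context word and retry one order lower.
    return backoff + log10_prob(models, context[1:], word)


# Toy tables with made-up values, chosen so the sums are easy to check by hand.
toy = {
    1: {("a",): (-1.0, -0.5), ("b",): (-2.0, -0.25),
        ("c",): (-3.0, 0.0), ("<unk>",): (-4.0, 0.0)},
    2: {("a", "b"): (-0.5, -0.125), ("b", "c"): (-0.75, 0.0)},
    3: {("a", "b", "c"): (-0.25, 0.0)},
}

print(log10_prob(toy, ("a", "b"), "c"))  # -0.25: the trigram is listed
print(log10_prob(toy, ("a", "b"), "a"))  # -1.375 = -0.125 + -0.25 + -1.0
print(log10_prob(toy, ("c", "b"), "c"))  # -0.75: context unlisted, bigram (b, c) used
```

To get an ordinary probability back, raise 10 to the returned value, for example `10 ** log10_prob(toy, ("a", "b"), "c")`.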