1-grams: (note: the unigram entries begin here)
-4.891179 (note: log10 probability) ! -1.361815
-6.482389 !) -0.1282758
-6.482389 !’ -0.1282758
-5.254417 “ (note: the unigram token) -0.1470514
-6.482389 “‘ -0.1282758 (note: log10 backoff weight)
…
2-grams:
-0.02140159 !
-2.266701 ! –
-0.5719482 !)
-0.5719482 !’
-2.023553 ” ‘Biomass’
-2.023553 ” ‘vertical’
…
3-grams:
-0.01154674 the !
-0.01154674 urgent !
-0.01154674 us’ !
-1.075004 the “.EU” Top
-0.827616 the “.EU” domain
-0.9724987 the “.EU” top-level …
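Each ARPA entry above is a log10 probability, the n-gram itself, and (for all but the highest order) a log10 backoff weight. To make the lookup rule concrete, here is a minimal sketch of backoff scoring over tables in this format; the entries and values below are toy/hypothetical stand-ins, not taken from the actual europarl.en.lm file:

```python
# Toy ARPA-style tables (hypothetical values, for illustration only).
# Format: n-gram -> (log10 prob, log10 backoff weight); highest order has no bow.
uni = {"!": (-4.891179, -1.361815), "the": (-1.5, -0.5)}   # word -> (logp, bow)
bi  = {("the", "!"): (-3.0, -0.2)}                          # (w1, w2) -> (logp, bow)
tri = {("urgent", "the", "!"): -0.01154674}                 # (w1, w2, w3) -> logp

def logp_trigram(w1, w2, w3):
    """log10 P(w3 | w1, w2): use the trigram if present, otherwise
    back off to the bigram (adding the context's backoff weight),
    and finally to the unigram."""
    if (w1, w2, w3) in tri:
        return tri[(w1, w2, w3)]
    # contexts absent from the model contribute a backoff weight of 0 (log10 1)
    bow12 = bi[(w1, w2)][1] if (w1, w2) in bi else 0.0
    if (w2, w3) in bi:
        return bow12 + bi[(w2, w3)][0]
    bow2 = uni[w2][1] if w2 in uni else 0.0
    return bow12 + bow2 + uni[w3][0]
```

For example, `logp_trigram("urgent", "the", "!")` hits the trigram table directly, while an unseen context falls through to the bigram or unigram score plus the accumulated backoff weights.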
3. Compute the perplexity of a test set with the language model generated in the previous step:
ngram -ppl devtest2006.en -order 3 -lm europarl.en.lm > europarl.en.lm.ppl
The test set is devtest2006.en (2,000 sentences), taken from the WMT08 machine translation test data. The -ppl option scores the test-set sentences (computing logP(T), where P(T) is the product of the probabilities of all sentences) and reports the test-set perplexity; europarl.en.lm.ppl is the output file; the other options are as above. The output file looks like this:
file devtest2006.en: 2000 sentences, 52388 words, 249 OOVs
0 zeroprobs, logprob= -105980 ppl= 90.6875 ppl1= 107.805
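If you need these numbers programmatically (e.g. to compare several models), the two-line summary is easy to scrape. A small sketch, assuming exactly the layout shown above:

```python
import re

# Sample output copied from the ngram -ppl run above.
sample = """file devtest2006.en: 2000 sentences, 52388 words, 249 OOVs
0 zeroprobs, logprob= -105980 ppl= 90.6875 ppl1= 107.805"""

def parse_ppl(text):
    """Parse the two-line summary produced by `ngram -ppl`."""
    m1 = re.search(r"(\d+) sentences, (\d+) words, (\d+) OOVs", text)
    m2 = re.search(r"logprob= (-?[\d.]+) ppl= ([\d.]+) ppl1= ([\d.]+)", text)
    return {
        "sentences": int(m1.group(1)),
        "words": int(m1.group(2)),
        "oovs": int(m1.group(3)),
        "logprob": float(m2.group(1)),
        "ppl": float(m2.group(2)),
        "ppl1": float(m2.group(3)),
    }
```

`parse_ppl(sample)` returns the sentence/word/OOV counts alongside logprob, ppl, and ppl1 as numbers.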
The first line gives basic information about devtest2006.en: 2000 sentences, 52388 words, and 249 out-of-vocabulary (OOV) words;
The second line summarizes the scoring: no zero-probability words; logP(T) = -105980, ppl = 90.6875, ppl1 = 107.805. Both ppl and ppl1 are perplexities, with slightly different formulas:
ppl = 10^{-logP(T) / (Sen + Word)};  ppl1 = 10^{-logP(T) / Word}
where Sen and Word denote the numbers of sentences and words (with OOV words excluded from the word count), respectively.
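We can check these formulas against the numbers reported above; since the printed logprob is rounded, only the first few digits are expected to agree:

```python
# Numbers from the ngram -ppl output above.
logprob = -105980.0
sentences = 2000
words = 52388
oovs = 249

word = words - oovs                       # OOVs carry no probability mass
ppl  = 10 ** (-logprob / (sentences + word))
ppl1 = 10 ** (-logprob / word)

print(round(ppl, 2), round(ppl1, 2))      # ≈ 90.69 and ≈ 107.81
```

Both values match the reported ppl = 90.6875 and ppl1 = 107.805 to within rounding, confirming the two denominators (Sen + Word versus Word alone).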
Appendix: books and papers recommended on the SRILM homepage.
Introductory reading — book chapters on language models, especially n-gram models:
• Speech and Language Processing by Dan Jurafsky and Jim Martin (chapter 6 in the 1st edition, chapter 4 in the 2nd edition)
• Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze (chapter 6)
Papers for further study:
• A. Stolcke, SRILM – An Extensible Language Modeling Toolkit, in Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, September 2002. Gives an overview of SRILM design and functionality.
• D. Jurafsky, Language Modeling, Lecture 11 of his course on “Speech Recognition and Synthesis” at Stanford. Excellent introduction to the basic concepts in LM.
• J. Goodman, The State of The Art in Language Modeling, presented at the 6th Conference of the Association for Machine Translation in the Americas (AMTA), Tiburon, CA, October 2002. Tutorial presentation and overview of current LM techniques (with emphasis on machine translation).
• K. Kirchhoff, J. Bilmes, and K. Duh, Factored Language Models Tutorial, Tech. Report UWEETR-2007-0003, Dept. of EE, U. Washington, June 2007. This report serves as both a tutorial and reference manual on FLMs.
• S. F. Chen and J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, Tech. Report TR-10-98, Computer Science Group, Harvard U., Cambridge, MA, August 1998 (original postscript document). Excellent overview and comparative study of smoothing methods. Served as a reference for many of the methods implemented in SRILM.
Note: this is an original article; when reposting, please credit the source "我爱自然语言处理" (52nlp): www.52nlp.cn
Permalink: http://www.52nlp.cn/language-model-training-tools-srilm-details/