----------------------------------------------------------------------
The Records of Running Moses
University of Macau
2010-1-27
----------------------------------------------------------------------
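This is a record of running the statistical machine translation system Moses. The first part lists the files that must be prepared before running, the second part gives the installation process and steps, the third part covers building the language model and the translation process, and the fourth part gives the translation results and evaluation. This is the first part: preparation before running Moses.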
The purpose of this guide is to offer a step-by-step example of downloading, compiling, and running the Moses decoder and related support tools. I am recording the steps so that I can reuse them later. I make no claims that all of the steps here will work perfectly on every machine you try them on, or that things will stay the same as the software changes. Please remember that Moses is research software under active development.
This record has six parts. Its structure is as follows:
1. Structure of this passage
2. Working Environment and dealing with corpus
PART I - Download Tools and Data
In this part, I download the tools that will be needed: the translation model training tools GIZA++ and MKCLS,
the language modeling toolkit SRILM or IRSTLM, the corpus processing tools SCRIPTS, and the evaluation tools
NIST and BLEU. I also give the URLs where they can be downloaded.
PART II – Support Tools Installation
In this part, I describe how to install the tools. Honestly speaking, the installation process is complex,
because the package contains many folders, including even a test folder. The first time you run it, you may be
confused! The hardest part is that the installation depends on many additional tools, so be careful to install
all of them first.
PART III – Build language model
In this part, I build a language model using a Chinese corpus as an example. The commands differ from those of
the CMU-Cam_Toolkit_v2 language modeling toolkit that I have used before. I think SRILM is more powerful.
The result is a language model file called chinese.gz.
PART IV – Build Translation model
In this part, I build the translation model using the Perl script train-factored-phrase-model.perl, which calls
GIZA++ and mkcls. The result is four folders: corpus, giza.chn-eng, giza.eng-chn, and model.
The most important one is the model folder.
PART V – Get the Result
In this part, I do the translation in the model folder. I copy the moses decoder into this folder and then use the
configuration file moses.ini in the model folder to get the translation result.
Evaluation
In this part, I use the evaluation tools to evaluate the translation.
(1) The system is Ubuntu 9.10 and the GCC version is 4.4.1.
(2) The working directory is /home/tianliang/mosesdecoder.
(3) The corpus I use is 1500 Chinese-English sentence pairs, stored in raw.chn and raw.eng.
(4) The training corpus is put in /home/tianliang/mosesdecoder/corpus.
(5) I process the corpus in /home/tianliang/mosesdecoder/corpus. After downloading the SCRIPTS package (introduced
in the next part), I use its tools to tokenize and lowercase the corpus. In the end I get the Chinese
corpus files clean.chn and clean.lowercased.chn, and the English corpus files clean.eng and clean.lowercased.eng.
You can refer to the next part to learn how to use these tools.
Note:
- In this report, I do not use the lowercased corpus. Moses suggests training on lowercased
text, but I use only the tokenized files. The files are called clean.chn and
clean.eng because I ran clean-corpus-n.perl to keep only sentences of 1 to 100 words, that is,
the corpus has been cleaned of overly long sentences. The command is:
$ ./clean-corpus-n.perl raw chn eng clean 1 100
- The Chinese corpus is tokenized with tools from ICTCLAS. After tokenization we call it clean.chn.
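The length filtering described in the note can be illustrated with a minimal Python sketch. This is not the code of clean-corpus-n.perl; the function name clean_corpus and the sample sentences are hypothetical, and it assumes only the two length bounds given on the command line (1 and 100). The real script may apply further checks.

```python
def clean_corpus(src_lines, tgt_lines, min_len=1, max_len=100):
    """Keep sentence pairs whose token counts lie in [min_len, max_len] on both sides.

    Assumption: tokens are separated by whitespace, as in a tokenized corpus.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if min_len <= src_len <= max_len and min_len <= tgt_len <= max_len:
            kept.append((src.strip(), tgt.strip()))
    return kept


if __name__ == "__main__":
    # Hypothetical sample data; the second pair has an empty source side and is dropped.
    chn = ["我 喜欢 机器 翻译 。", "", "你 好"]
    eng = ["i like machine translation .", "an orphan line", "hello"]
    for c, e in clean_corpus(chn, eng):
        print(c, "|||", e)
```

A pair is dropped as a whole, so the two output files stay aligned line by line.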
Support Tools
1. GIZA++ and mkcls.
Moses has a number of scripts designed to aid training, and they rely on GIZA++ and mkcls to function. More information on the origins of these tools is available at:
- http://www.fjoch.com/GIZA++.html
- http://www.fjoch.com/mkcls.html
A Google Code project has been set up, and the code is being maintained:
- http://giza-pp.googlecode.com/
2. SRILM
Moses uses SRILM-style language models. SRILM is available from:
- http://www.speech.sri.com/projects/srilm/download.html
3. IRSTLM
The IRSTLM tools provide the ability to use quantized and disk memory-mapped language models.
IRSTLM is optional, and I do not use it in this record:
- http://sourceforge.net/projects/irstlm
4. SCRIPTS
- http://www.statmt.org/wmt07/scripts.tgz
This scripts package includes many tools for processing the corpus data. They are described below:
Detokenizer
===========
Usage ./detokenizer.perl -l [en|de|...] < tokenizedfile > detokenizedfile
Used after decoding, removes most spaces inserted by tokenizer.perl.
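As a rough illustration of what detokenization means, here is a minimal Python sketch. It is not the logic of detokenizer.perl; the function name detokenize and its rules are assumptions that cover only common punctuation and paired double quotes.

```python
def detokenize(line):
    """Re-attach punctuation and paired double quotes to neighbouring words.

    Assumption: quotes alternate strictly between opening and closing.
    """
    out = ""
    attach_next = False  # True right after an opening quote
    quote_open = False
    for tok in line.split():
        if tok in {",", ".", "!", "?", ";", ":"}:
            out += tok  # punctuation attaches to the word on its left
        elif tok == '"':
            if quote_open:
                out += tok  # closing quote attaches to the left
                quote_open = False
            else:
                out += (" " if out else "") + tok  # opening quote attaches to the right
                quote_open = True
                attach_next = True
                continue
        else:
            out += tok if (attach_next or not out) else " " + tok
        attach_next = False
    return out


if __name__ == "__main__":
    print(detokenize('This E.U. treaty is , to use the words of Mr. Smith , " awesome . "'))
```

The real script is language-aware (the -l option), which this sketch does not attempt.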
Lowercaser
==========
Usage ./lowercase.perl < tokenizedfile > lowercasedfile
Converts the input sentences to lowercase.
Reuse Weights
=============
./reuse-weights.perl weights.ini < moses.ini > weighted.ini
Combines feature weights in weights.ini with phrase-tables, LMs
and reordering-tables specified in moses.ini to make weighted.ini
Sentence Splitter
=================
Usage ./split-sentences.perl -l [en|de|...] < textfile > splitfile
Uses punctuation and capitalization clues to split paragraphs of sentences into files with one sentence per line.
For example:
This is a paragraph. "But why," you ask?
goes to:
This is a paragraph.
"But why," you ask?
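The idea of splitting on punctuation and capitalization clues can be sketched in Python as follows. This is a simplified illustration, not the rules of split-sentences.pl; the function name split_sentences and the single regular expression are assumptions. In particular it has no abbreviation list, so it would wrongly split after "Mr." in "Mr. Smith".

```python
import re

# A boundary is whitespace that follows ., ! or ? and precedes an
# optional opening quote plus an uppercase ASCII letter.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=["\']?[A-Z])')


def split_sentences(paragraph):
    """Return the sentences of a paragraph, one string per sentence."""
    return [s.strip() for s in _BOUNDARY.split(paragraph) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences('This is a paragraph. "But why," you ask?'):
        print(sentence)
```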
Tokenizer
=========
Usage ./tokenizer.perl -l [en|de|...] < textfile > tokenizedfile
Splits out most punctuation from words. Special cases where splits do not occur are documented in the code. For
example:
This E.U. treaty is, to use the words of Mr. Smith, "awesome."
goes to:
This E.U. treaty is , to use the words of Mr. Smith , " awesome . "
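The behaviour in this example can be approximated with a short Python sketch. It is not the code of tokenizer.perl; the function name tokenize, the tiny prefix list NONBREAKING_PREFIXES, and the rules are assumptions chosen only to show how a final period can be split off while abbreviations such as "Mr." and "E.U." are left intact.

```python
import re

# Hypothetical, deliberately tiny list of prefixes that keep their period.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}


def tokenize(line):
    """Separate common punctuation from words with spaces."""
    # Commas, quotes and similar marks are always split off.
    spaced = re.sub(r'([,"!?;:()])', r' \1 ', line)
    tokens = []
    for word in spaced.split():
        stem = word[:-1]
        if word.endswith(".") and stem:
            # Keep the period on known prefixes and on dotted abbreviations like E.U.
            is_abbreviation = stem in NONBREAKING_PREFIXES or (
                "." in stem and any(ch.isalpha() for ch in stem)
            )
            if is_abbreviation:
                tokens.append(word)
            else:
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return " ".join(tokens)


if __name__ == "__main__":
    print(tokenize('This E.U. treaty is, to use the words of Mr. Smith, "awesome."'))
```

This sketch only applies to space-delimited languages such as English; the Chinese side of the corpus needs a word segmenter instead.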
XML Wrapper
===========
Usage ./wrap-xml.perl xml-frame language [system-name] < translatedfile > wrappedfile.sgm
Using the doc, sent, and other tags specified in the xml-frame, creates a NIST-compatible SGM file tagged with the
specified language and system whose contents are from translatedfile.
5. NIST, BLEU
The evaluation tools can be downloaded from:
- ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
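To give an idea of what a BLEU score measures, here is a minimal Python sketch of the metric for one hypothesis and one reference. It is not the mteval script: the function names ngrams and bleu are hypothetical, the score is computed per sentence rather than over a whole test set, there is no smoothing, and SGM input is not handled.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total)
    # Hypotheses shorter than the reference are penalized.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on the mat", "the cat sat on the mat"))
```

Because of the missing smoothing, any hypothesis with no matching 4-gram scores zero here, which is one reason scores are normally reported over a whole test set.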
Next, the detailed installation process of Moses is described; please see Moses Running Record (Part 2).