Moses Installation, Training, and Optimisation
I installed and ran Moses with root privileges on an Ubuntu server. This post pulls together advice found around the web plus the problems I ran into myself and how I solved them. The version combination that worked for me is: Ubuntu 12.04.01 LTS + Boost 1.55.0 + irstlm-5.80.08 + giza-pp-v1.0.7.
I. Installing Boost, GIZA++, and IRSTLM
Before starting, install the following packages (listed in the official Moses documentation):
sudo apt-get install build-essential git-core pkg-config automake libtool wget zlib1g-dev python-dev libbz2-dev libsoap-lite-perl
1. Installing Boost
The Moses documentation says that Ubuntu 12.04 ships broken versions of Boost, so you have to download and build it yourself (I installed it into /home/lty/Moses/boost_1_55_0; Moses is a directory I created to hold everything related to running Moses):
wget http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.tar.gz
tar zxvf boost_1_55_0.tar.gz
cd boost_1_55_0/
./bootstrap.sh
./b2 -j4 --prefix=/home/lty/Moses/boost_1_55_0 --libdir=/home/lty/Moses/boost_1_55_0/lib64 --layout=system link=static install || echo FAILURE
2. Installing GIZA++
wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
tar xzvf giza-pp-v1.0.7.tar.gz
cd giza-pp
make
Compilation produces three executables:
giza-pp/GIZA++-v2/GIZA++
giza-pp/GIZA++-v2/snt2cooc.out
giza-pp/mkcls-v2/mkcls
These three files need to sit together in one directory; I put them in /home/lty/Moses/giza-pp.
These days, however, most people use mgiza instead, for speed:
git clone https://github.com/moses-smt/mgiza
The only difference from the GIZA++ setup is that one extra file, merge_alignment.py, has to be copied over as well; I put all of these files in /home/lty/Moses/mgiza/mgizapp/bin. A build sketch follows below.
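A minimal build sketch, assuming mgiza's usual cmake-based build and that merge_alignment.py ships in mgizapp/scripts (check your checkout if the layout differs):
cd mgiza/mgizapp
cmake .                               # generate makefiles in-tree
make
make install                          # on my setup the binaries ended up in mgizapp/bin; verify the prefix on yours
cp scripts/merge_alignment.py bin/    # keep the extra script next to the binaries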
3. Installing IRSTLM
Download it from http://sourceforge.net/projects/irstlm/ and then:
tar zxvf irstlm-5.80.08.tgz
cd irstlm-5.80.08
./regenerate-makefiles.sh
./configure --prefix=/home/lty/Moses/irstlm-5.80.08
make install
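As a quick sanity check, the install prefix's bin directory should now contain the IRSTLM tools (names like build-lm.sh and compile-lm in the versions I've seen):
ls /home/lty/Moses/irstlm-5.80.08/bin   # should list the IRSTLM binaries and scripts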
Make a note of the three directories above; on my machine they are:
/home/lty/Moses/irstlm-5.80.08
/home/lty/Moses/boost_1_55_0 (may be passed explicitly or omitted)
/home/lty/Moses/mgiza/mgizapp/bin
4. Downloading and Building Moses
First install the following packages:
sudo apt-get install git build-essential libz-dev libbz2-dev
Then clone Moses:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
./bjam -j4 --with-irstlm=/home/lty/Moses/irstlm-5.80.08 --with-giza=/home/lty/Moses/mgiza/mgizapp/bin --with-boost=/home/lty/Moses/boost_1_55_0
-j4 tells bjam to build with four parallel jobs, matching a 4-core CPU.
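To check that the build works, the Moses site provides a tiny sample model you can decode with; this smoke test follows the official getting-started steps (the translation lands in a file named out):
wget http://www.statmt.org/moses/download/sample-models.tgz
tar xzf sample-models.tgz
cd sample-models
/home/lty/Moses/mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out   # decode the sample input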
II. Corpus Preprocessing
1. Preparing the corpus. Create a corpus directory under /home/lty/Moses to hold the training set, then download the training data from the official site:
cd /home/lty/Moses
mkdir corpus
cd corpus
wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
tar zxvf training-parallel-nc-v8.tgz
1) Tokenisation: insert spaces between words and between words and punctuation, so that the later steps can operate on tokens.
/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/news-commentary-v8.fr-en.en > news-commentary-v8.fr-en.tok.en
/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < training/news-commentary-v8.fr-en.fr > news-commentary-v8.fr-en.tok.fr
2) Truecaser: train a truecasing model, which collects casing statistics from the text.
/home/lty/Moses/mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus news-commentary-v8.fr-en.tok.en
/home/lty/Moses/mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.fr --corpus news-commentary-v8.fr-en.tok.fr
3) Truecasing: convert the words and phrases in each sentence to their most likely casing, which reduces data sparsity.
/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < news-commentary-v8.fr-en.tok.en > news-commentary-v8.fr-en.true.en
/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr < news-commentary-v8.fr-en.tok.fr > news-commentary-v8.fr-en.true.fr
4) Cleaning: remove empty and overly long sentences and drop badly mis-aligned sentence pairs. The trailing 1 and 80 below are the minimum and maximum sentence lengths to keep.
/home/lty/Moses/mosesdecoder/scripts/training/clean-corpus-n.perl news-commentary-v8.fr-en.true fr en news-commentary-v8.fr-en.clean 1 80
III. Language Model Training, Translation Model Training, Tuning, and Testing
1. Train the language model
The language model (LM) is what makes the output fluent; this step uses IRSTLM.
/home/lty/Moses/mosesdecoder/scripts/generic/trainlm-irst2.perl -cores 4 -irst-dir /home/lty/Moses/irstlm-5.80.08/bin -p 0 -order 5 -text news-commentary-v8.fr-en.true.en -lm small-news-commentary-v8.fr-en.blm.en
echo "is this an Englishsentence" | /home/lty/Moses/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en
2. Train the translation model
cd ..
mkdir working
cd working
/home/lty/Moses/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus /home/lty/Moses/corpus/small-news-commentary-v8.fr-en.clean -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/lty/Moses/corpus/small-news-commentary-v8.fr-en.blm.en:8 -external-bin-dir /home/lty/Moses/giza-pp
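If you built mgiza rather than GIZA++ (as above), point -external-bin-dir at the mgiza bin directory instead and add the -mgiza flag; train-model.perl also accepts -mgiza-cpus to parallelise alignment (double-check both flags against your Moses version). A sketch:
/home/lty/Moses/mosesdecoder/scripts/training/train-model.perl -root-dir train \
  -corpus /home/lty/Moses/corpus/small-news-commentary-v8.fr-en.clean -f fr -e en \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:/home/lty/Moses/corpus/small-news-commentary-v8.fr-en.blm.en:8 \
  -mgiza -mgiza-cpus 4 -external-bin-dir /home/lty/Moses/mgiza/mgizapp/bin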
3. Tuning the translation model
Go back to the corpus directory and download the development set:
wget http://www.statmt.org/wmt12/dev.tgz
tar zxvf dev.tgz
/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < dev/news-test2008.en > news-test2008.tok.en
/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < dev/news-test2008.fr > news-test2008.tok.fr
/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < news-test2008.tok.en > news-test2008.true.en
/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr < news-test2008.tok.fr > news-test2008.true.fr
Once the dev set has been given the same preprocessing as the training set, use it to tune the initial moses.ini.
Go into the working directory and run the command below (mert-moses.pl expects the source-language dev file first, then the reference):
nohup /home/lty/Moses/mosesdecoder/scripts/training/mert-moses.pl /home/lty/Moses/corpus/news-test2008.true.fr /home/lty/Moses/corpus/news-test2008.true.en /home/lty/Moses/mosesdecoder/bin/moses train/model/moses.ini --mertdir /home/lty/Moses/mosesdecoder/bin/ --decoder-flags="-threads 8" > Result20160120 2>&1 &
4. Testing the model
If you have a separate test set, give it the same preprocessing as the dev and training sets; here I simply ran a quick test on the dev set directly:
/home/lty/Moses/mosesdecoder/bin/moses -f /home/lty/Moses/working/mert-work/moses.ini < small-news-test2008.true.fr > newstest2011.out
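To put a number on the output, multi-bleu.perl in the Moses scripts directory scores it against a reference; a quick check against the English side of the dev set (adjust the reference file if you used a small- subset; -lc scores case-insensitively):
/home/lty/Moses/mosesdecoder/scripts/generic/multi-bleu.perl -lc /home/lty/Moses/corpus/news-test2008.true.en < newstest2011.out   # prints BLEU and n-gram precisions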
To speed up decoding further, the phrase-table and reordering-table should be binarised. For the phrase-table, see the On-Disk Phrase table section of http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3. For the reordering-table, run the following commands:
cd /home/lty/Moses/working/mert-work
mkdir binarised-model
/home/lty/Moses/mosesdecoder/bin/CreateOnDiskPt 1 1 4 100 2 /home/lty/Moses/working/train/model/phrase-table.gz /home/lty/Moses/working/mert-work/binarised-model/phrase-table.1.folder
/home/lty/Moses/mosesdecoder/bin/processLexicalTable -in /home/lty/Moses/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out /home/lty/Moses/working/mert-work/binarised-model/reordering-table
Then edit the moses.ini in /home/lty/Moses/working/mert-work, replacing the phrase-table and reordering-table entries with:
PhraseDictionaryOnDisk name=TranslationModel0 num-features=4 path=/home/lty/Moses/working/mert-work/binarised-model/phrase-table.1.folder input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/lty/Moses/working/mert-work/binarised-model/reordering-table
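To confirm the binarised tables load correctly, you can decode the same test input once more with the edited ini (the output name here is just an illustrative choice):
/home/lty/Moses/mosesdecoder/bin/moses -f /home/lty/Moses/working/mert-work/moses.ini < small-news-test2008.true.fr > newstest2011.binarised.out   # should start translating almost immediately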
After this change, translation speed went up about 45-fold for me ^_^.