Installing, Training and Optimizing Moses

I installed and ran Moses with root privileges on an Ubuntu server. Below I summarise tips collected from the web together with the problems I ran into and how I solved them. The version combination that worked for me is: Ubuntu 12.04.1 LTS + Boost 1.55.0 + irstlm-5.80.08 + giza-pp-v1.0.7.

I. Installing Boost, GIZA++ and IRSTLM

Before installing, install the following packages (as listed on the Moses website):

sudo apt-get install build-essential git-core pkg-config automake libtool wget zlib1g-dev python-dev libbz2-dev libsoap-lite-perl

1. Installing Boost

The Moses documentation notes that the Boost packages shipped with Ubuntu 12.04 are broken versions, so you have to download and build Boost yourself (I installed it under /home/lty/Moses/boost_1_55_0; Moses is a directory I created to hold everything related to running Moses):

wget http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.tar.gz

tar zxvf boost_1_55_0.tar.gz

cd boost_1_55_0/

./bootstrap.sh

./b2 -j4 --prefix=/home/lty/Moses/boost_1_55_0 --libdir=/home/lty/Moses/boost_1_55_0/lib64 --layout=system link=static install || echo FAILURE

2. Installing GIZA++

wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz 

tar xzvf giza-pp-v1.0.7.tar.gz 

cd giza-pp 

make

Compilation produces three executables:

giza-pp/GIZA++-v2/GIZA++ 

giza-pp/GIZA++-v2/snt2cooc.out 

giza-pp/mkcls-v2/mkcls 

These three files need to be collected into a single folder; I put them in /home/lty/Moses/giza-pp, for example as shown below.
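For instance, assuming you are still inside the extracted giza-pp source tree and /home/lty/Moses/giza-pp is the folder you want the word-alignment tools collected in (my own commands, not from the original notes):

mkdir -p /home/lty/Moses/giza-pp

cp GIZA++-v2/GIZA++ GIZA++-v2/snt2cooc.out mkcls-v2/mkcls /home/lty/Moses/giza-pp/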

However, to speed up word alignment, most people nowadays use mgiza instead:

git clone https://github.com/moses-smt/mgiza

The only difference from GIZA++ is that one extra file, merge_alignment.py, must be copied as well; I put all of these files in /home/lty/Moses/mgiza/mgizapp/bin (a build sketch follows below).
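A rough sketch of how mgiza is typically built (based on the Moses external-tools instructions, not the original notes; it requires CMake and the Boost headers, and the exact install layout may differ):

cd mgiza/mgizapp

cmake .

make

make install

cp scripts/merge_alignment.py bin/   # the merge script must sit next to the mgiza binaries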

3. Installing IRSTLM

Download it from http://sourceforge.net/projects/irstlm/, then:

tar zxvf irstlm-5.80.08.tgz 

cd irstlm-5.80.08

./regenerate-makefiles.sh 

./configure --prefix=/home/lty/Moses/irstlm-5.80.08 

make install

At this point, make a note of the three directories above; on my machine they are:

/home/lty/Moses/irstlm-5.80.08 

/home/lty/Moses/boost_1_55_0 (optional; it does not have to be specified)

/home/lty/Moses/mgiza/mgizapp/bin

4. Downloading and building Moses

First install the following packages:

sudo apt-get install git build-essential libz-dev libbz2-dev 

Then download Moses:

git clone https://github.com/moses-smt/mosesdecoder.git 
cd mosesdecoder

./bjam -j4 --with-irstlm=/home/lty/Moses/irstlm-5.80.08 --with-giza=/home/lty/Moses/mgiza/mgizapp/bin --with-boost=/home/lty/Moses/boost_1_55_0

-j4 tells the build to use 4 CPU cores.
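Once bjam finishes, a quick sanity check (my own addition) is to confirm that the decoder binary was actually produced:

ls -l /home/lty/Moses/mosesdecoder/bin/moses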

 

II. Corpus Preprocessing

1. Preparing the corpus. Create a corpus directory under /home/lty/Moses to hold the training data, and download the training material from the official site:

cd /home/lty/Moses

mkdir corpus

cd corpus 

wget http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz

tar zxvf training-parallel-nc-v8.tgz
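The archive unpacks into a training/ directory; a quick check (my own addition) that both sides of the French-English corpus are present and have the same number of lines:

wc -l training/news-commentary-v8.fr-en.en training/news-commentary-v8.fr-en.fr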

 

1) Tokenisation: insert whitespace between words, and between words and punctuation, so that the later steps operate on tokens.

 

/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < training/news-commentary-v8.fr-en.en > news-commentary-v8.fr-en.tok.en

 

/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < training/news-commentary-v8.fr-en.fr > news-commentary-v8.fr-en.tok.fr

 

2) Truecaser training: extract casing statistics from the text, which are used in the next step.

 

/home/lty/Moses/mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus news-commentary-v8.fr-en.tok.en

 

/home/lty/Moses/mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.fr --corpus news-commentary-v8.fr-en.tok.fr

 

3) Truecasing: convert the words and phrases in every sentence of the corpus to a normalised casing, which reduces data sparsity.

 

/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < news-commentary-v8.fr-en.tok.en > news-commentary-v8.fr-en.true.en

 

/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr < news-commentary-v8.fr-en.tok.fr > news-commentary-v8.fr-en.true.fr

 

4) Cleaning: remove overly long and empty sentences, and drop misaligned sentence pairs.

 

/home/lty/Moses/mosesdecoder/scripts/training/clean-corpus-n.perl news-commentary-v8.fr-en.true fr en news-commentary-v8.fr-en.clean 1 80
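To confirm that cleaning kept the two sides aligned (my own check, not part of the original steps), the .clean.fr and .clean.en outputs should have exactly the same number of lines:

wc -l news-commentary-v8.fr-en.clean.fr news-commentary-v8.fr-en.clean.en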

III. Language Model Training, Translation Model Training, Tuning and Testing

1. Training the language model

The language model (LM) is what keeps the output fluent; this step uses IRSTLM.

/home/lty/Moses/mosesdecoder/scripts/generic/trainlm-irst2.perl -cores 4 -irst-dir /home/lty/Moses/irstlm-5.80.08/bin -p 0 -order 5 -text news-commentary-v8.fr-en.true.en -lm small-news-commentary-v8.fr-en.blm.en

echo "is this an Englishsentence" | /home/lty/Moses/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en

2. Training the translation model

cd ..

mkdir working

cd working

/home/lty/Moses/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus /home/lty/Moses/corpus/small-news-commentary-v8.fr-en.clean -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/lty/Moses/corpus/small-news-commentary-v8.fr-en.blm.en:8 -external-bin-dir /home/lty/Moses/giza-pp
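The command above points -external-bin-dir at the plain GIZA++ binaries. If you built mgiza as suggested earlier, a variant I would expect to work (my assumption, using train-model.perl's -mgiza switches) points at the mgizapp bin directory and runs the aligner multi-threaded:

/home/lty/Moses/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus /home/lty/Moses/corpus/small-news-commentary-v8.fr-en.clean -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/lty/Moses/corpus/small-news-commentary-v8.fr-en.blm.en:8 -mgiza -mgiza-cpus 4 -external-bin-dir /home/lty/Moses/mgiza/mgizapp/bin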

3. Tuning the translation model

Go back to the corpus directory and download the development set:

wget http://www.statmt.org/wmt12/dev.tgz

tar zxvf dev.tgz

/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < dev/news-test2008.en > news-test2008.tok.en

/home/lty/Moses/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < dev/news-test2008.fr > news-test2008.tok.fr

/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < news-test2008.tok.en > news-test2008.true.en

/home/lty/Moses/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.fr < news-test2008.tok.fr > news-test2008.true.fr

After the development set has been preprocessed in the same way as the training set, use it to tune the original moses.ini.

Go into the working folder and run:

nohup /home/lty/Moses/mosesdecoder/scripts/training/mert-moses.pl /home/lty/Moses/corpus/news-test2008.true.fr /home/lty/Moses/corpus/news-test2008.true.en /home/lty/Moses/mosesdecoder/bin/moses train/model/moses.ini --mertdir /home/lty/Moses/mosesdecoder/bin/ --decoder-flags="-threads 8" > Result20160120 2>&1 &
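Tuning can take hours. A simple way to keep an eye on it (my own habit, not part of the original notes) is to follow the log and, when it finishes, confirm that the tuned configuration was written:

tail -f Result20160120

ls -l mert-work/moses.ini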

4. Testing the model

If you have a separate test set, it needs the same preprocessing as the development and training sets; here the development set is simply reused for a quick test.

/home/lty/Moses/mosesdecoder/bin/moses -f /home/lty/Moses/working/mert-work/moses.ini< small-news-test2008.true.fr > ewstest2011.out
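To put a number on the output, the usual next step (not in the original notes) is to score it against the English reference with multi-bleu.perl; the reference filename here is my assumption (the truecased English side of the same set):

/home/lty/Moses/mosesdecoder/scripts/generic/multi-bleu.perl -lc small-news-test2008.true.en < newstest2011.out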

To speed up decoding, the phrase-table and the reordering-table should be binarised. For the phrase-table, see the On-Disk Phrase table section at http://www.statmt.org/moses/?n=Advanced.RuleTables#ntoc3; the commands below convert both tables:

cd /home/lty/Moses/corpus/mert-work

mkdir binarised-model

/home/lty/Moses/mosesdecoder/bin/CreateOnDiskPt 1 1 4 100 2 /home/lty/Moses/corpus/train/model/phrase-table.gz /home/lty/Moses/corpus/mert-work/binarised-model/phrase-table.1.folder

/home/lty/Moses/mosesdecoder/bin/processLexicalTable -in /home/lty/Moses/corpus/train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out /home/lty/Moses/corpus/mert-work/binarised-model/reordering-table

Then edit the moses.ini file in /home/lty/Moses/corpus/mert-work, changing the corresponding lines to:

PhraseDictionaryOnDisk name=TranslationModel0 num-features=4 path=/home/lty/Moses/corpus/mert-work/binarised-model/phrase-table.1.folder input-factor=0 output-factor=0

LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/lty/Moses/corpus/mert-work/binarised-model/reordering-table
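After editing, it is worth a quick smoke test (my own suggestion) to confirm that the binarised tables load and the decoder still produces output; any short French sentence will do:

echo "c'est un test ." | /home/lty/Moses/mosesdecoder/bin/moses -f /home/lty/Moses/corpus/mert-work/moses.ini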

After this change, decoding became about 45 times faster. ^_^
