The first step is tokenisation, done with mosesdecoder/scripts/tokenizer/tokenizer.perl. It tokenises every sentence of the parallel corpus (which is really just two text files, one per language):
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/corpus/training/news-commentary-v8.fr-en.en \
    > ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < ~/corpus/training/news-commentary-v8.fr-en.fr \
    > ~/corpus/news-commentary-v8.fr-en.tok.fr
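Notes: -l specifies the language of the text. From a quick look at the script's files, it seems to support only en, de, fr and it, so tokenising Chinese would probably need extra configuration. The script reads standard input and writes standard output, so the input and output files must be given with "<" and ">". Do not forget them!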
Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile
Options:
  -q ... quiet.
  -a ... aggressive hyphen splitting.
  -b ... disable Perl buffering.
  -time ... enable processing time calculation.
  -penn ... use Penn treebank-like tokenization.
  -protected FILE ... specify file with patters to be protected in tokenisation.
  -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.
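To get a feel for what tokenisation does to a sentence, here is a minimal sketch in plain shell. It is not how tokenizer.perl works internally: the function name toy_tokenize and the "split off every punctuation character" rule are assumptions made for illustration, and the real script additionally handles abbreviations, numbers, escaping and language-specific rules.

```shell
#!/usr/bin/env bash
# Toy tokeniser (illustration only, NOT the Moses algorithm):
# put spaces around every punctuation character, squeeze runs of
# spaces, and trim the ends of the line.
toy_tokenize() {
  sed -E -e 's/([[:punct:]])/ \1 /g' \
         -e 's/ +/ /g' \
         -e 's/^ //' \
         -e 's/ $//'
}

# Like tokenizer.perl, the function reads stdin and writes stdout.
echo 'Hello, world! Is this tokenised?' | toy_tokenize
# prints: Hello , world ! Is this tokenised ?
```

Because it works on stdin and stdout, the sketch is used with "<" and ">" in the same way as the real script.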
Truecasing: the initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity. The scripts used are mosesdecoder/scripts/recaser/train-truecaser.perl (to train the truecasing model) and mosesdecoder/scripts/recaser/truecase.perl (to apply it). For example:
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
    --model ~/corpus/truecase-model.en --corpus \
    ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
    --model ~/corpus/truecase-model.fr --corpus \
    ~/corpus/news-commentary-v8.fr-en.tok.fr
~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.en \
    < ~/corpus/news-commentary-v8.fr-en.tok.en \
    > ~/corpus/news-commentary-v8.fr-en.true.en
~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.fr \
    < ~/corpus/news-commentary-v8.fr-en.tok.fr \
    > ~/corpus/news-commentary-v8.fr-en.true.fr
The User Guide section of the manual says:
Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of their sentence to their most frequent form. This is what we mean by truecasing. Again, this requires first the training of a truecasing model, which is a list of words and the frequency of their different forms.
truecase.perl --model MODEL [-b] < in > out
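-b stands for unbuffered. I am not yet sure what it is for; judging by the tokenizer's -b option ("disable Perl buffering"), it presumably turns off Perl's output buffering.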
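The idea in the quoted passage (a model that lists words with the frequency of their forms, then applied to sentence-initial words) can be sketched with two small awk-based shell functions. This is a toy under stated assumptions, not the Moses implementation: the function names, the "count word" model format and the decision to ignore the first token of each line during training are all made up for illustration.

```shell
#!/usr/bin/env bash
# Toy truecaser (illustration only, NOT train-truecaser.perl/truecase.perl).

# "Train": count how often each surface form occurs, skipping the first
# token of every line because its casing is not informative.
# Writes one "count form" pair per line (hypothetical model format).
train_toy_truecaser() {
  awk '{ for (i = 2; i <= NF; i++) count[$i]++ }
       END { for (w in count) print count[w], w }'
}

# "Apply": replace the first token of each line with the most frequent
# known form of the same word (matched case-insensitively). Unknown
# words are left unchanged.
apply_toy_truecaser() {
  local model=$1
  awk -v model="$model" '
    BEGIN {
      while ((getline line < model) > 0) {
        split(line, f, " ")
        key = tolower(f[2])
        if (f[1] + 0 > best[key] + 0) { best[key] = f[1]; form[key] = f[2] }
      }
    }
    { key = tolower($1); if (key in form) $1 = form[key]; print }
  '
}
```

Usage mirrors the two real scripts: pipe tokenised text into train_toy_truecaser and save the result as a model file, then run apply_toy_truecaser MODEL with the text on stdin. With a model in which "the" is more frequent than "The", a line starting with "The" comes out starting with "the", while a sentence-initial "Paris" stays capitalised.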
Finally we clean, limiting sentence length to 80:
~/mosesdecoder/scripts/training/clean-corpus-n.perl \
    ~/corpus/news-commentary-v8.fr-en.true fr en \
    ~/corpus/news-commentary-v8.fr-en.clean 1 80
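The trailing "1 80" in the cleaning command gives the minimum and maximum sentence length in tokens. A minimal sketch of that length filter, assuming sentences contain no tab characters, could look like the following. The function name filter_pairs_by_length and its tab-separated output are assumptions for illustration; the real clean-corpus-n.perl writes separate per-language files and applies further checks that are not reproduced here.

```shell
#!/usr/bin/env bash
# Toy corpus cleaner (illustration only, NOT clean-corpus-n.perl).
# Usage: filter_pairs_by_length SRC_FILE TGT_FILE MIN MAX
# Prints the surviving sentence pairs as "source<TAB>target".
filter_pairs_by_length() {
  local src=$1 tgt=$2 min=$3 max=$4
  # paste joins line N of both files with a tab, keeping pairs aligned.
  paste "$src" "$tgt" | awk -F '\t' -v min="$min" -v max="$max" '
    {
      ns = split($1, a, " ")   # token count, source side
      nt = split($2, b, " ")   # token count, target side
      # Keep the pair only if BOTH sides are within [min, max].
      if (ns >= min && ns <= max && nt >= min && nt <= max) print
    }'
}
```

The important design point is that a pair is dropped as a whole: removing a line from only one side would break the sentence alignment of everything after it.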
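The most basic job of the language model is to make the output more fluent and more native-sounding. It is trained on monolingual text in the target language (English here), so no sentence-aligned data is needed for this step. The commands below reuse the English side of the training corpus, although I suspect that training the language model on the same corpus as the translation model is less effective than using additional text. The first step adds sentence-boundary symbols: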
~/irstlm/bin/add-start-end.sh \
    < ~/corpus/news-commentary-v8.fr-en.true.en \
    > news-commentary-v8.fr-en.sb.en
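The add-start-end.sh step marks where each sentence begins and ends, so that the language model can learn which words tend to open or close a sentence. Roughly, and only as a sketch (the function name add_boundaries is an assumption, and the real script may treat special cases differently), it does something like this:

```shell
#!/usr/bin/env bash
# Toy version of the sentence-boundary step (illustration only):
# wrap every input line in <s> ... </s>.
add_boundaries() {
  sed -e 's/^/<s> /' -e 's/$/ <\/s>/'
}

echo 'this is a sentence .' | add_boundaries
# prints: <s> this is a sentence . </s>
```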
Then use build-lm.sh to build the language model; its output is the language model source file:
export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh \
    -i news-commentary-v8.fr-en.sb.en \
    -t ./tmp -p -s improved-kneser-ney -o news-commentary-v8.fr-en.lm.en
~/irstlm/bin/compile-lm \
    --text \
    news-commentary-v8.fr-en.lm.en.gz \
    news-commentary-v8.fr-en.arpa.en
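Note: do not put "yes" after --text here; the manual gets this wrong (at least for the IRSTLM version I used). This produces an ARPA file, which can be queried and can be converted into either a binary IRSTLM model or a KenLM model. I have always gone the KenLM route, because for reasons I have not worked out IRSTLM did not work well for me.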
You can directly create an IRSTLM binary LM (for faster loading in Moses) by replacing the last command with the following:
~/irstlm/bin/compile-lm news-commentary-v8.fr-en.lm.en.gz \
    news-commentary-v8.fr-en.blm.en
You can transform an arpa LM (*.arpa.en file) into an IRSTLM binary LM as follows:
~/irstlm/bin/compile-lm \
    news-commentary-v8.fr-en.arpa.en \
    news-commentary-v8.fr-en.blm.en
or vice versa, you can transform an IRSTLM binary LM into an arpa LM as follows:
~/irstlm/bin/compile-lm \
    --text yes \
    news-commentary-v8.fr-en.blm.en \
    news-commentary-v8.fr-en.arpa.en
This instead binarises (for faster loading) the *.arpa.en file using KenLM:
~/mosesdecoder/bin/build_binary \
    news-commentary-v8.fr-en.arpa.en \
    news-commentary-v8.fr-en.blm.en
You can check the language model by querying it, e.g.
echo "is this an English sentence ?" \
    | ~/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en
The training script (train-model.perl) accepts the following options:
--root-dir -- root directory, where output files are stored
--corpus -- corpus file name (full pathname), excluding extension
--e -- extension of the English corpus file
--f -- extension of the foreign corpus file
--lm -- language model: <factor>:<order>:<filename> (option can be repeated)
--first-step -- first step in the training process (default 1)
--last-step -- last step in the training process (default 7)
--parts -- break up corpus in smaller parts before GIZA++ training
--corpus-dir -- corpus directory (default $ROOT/corpus)
--lexical-dir -- lexical translation probability directory (default $ROOT/model)
--model-dir -- model directory (default $ROOT/model)
--extract-file -- extraction file (default $ROOT/model/extract)
--giza-f2e -- GIZA++ directory (default $ROOT/giza.$F-$E)
--giza-e2f -- inverse GIZA++ directory (default $ROOT/giza.$E-$F)
--alignment -- heuristic used for word alignment: intersect, union, grow, grow-final, grow-diag, grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
--max-phrase-length -- maximum length of phrases entered into phrase table (default 7)
--giza-option -- additional options for GIZA++ training
--verbose -- prints additional word alignment information
--no-lexical-weighting -- only use conditional probabilities for the phrase table, not lexical weighting
--parts -- prepare data for GIZA++ by running snt2cooc in parts
--direction -- run training step 2 only in direction 1 or 2 (for parallelization)
--reordering -- specifies which reordering models to train using a comma-separated list of config-strings, see FactoredTraining.BuildReorderingModel. (default distance)
--reordering-smooth -- specifies the smoothing constant to be used for training lexicalized reordering models. If the letter "u" follows the constant, smoothing is based on actual counts. (default 0.5)
--alignment-factors