The preprocess code that ships with fairseq only binarizes a single language pair. For [机器翻译] 记一次多语言机器翻译模型的训练 ("Notes on training a multilingual machine translation model") I wanted to binarize several language pairs in one go and build a shared dictionary along the way.
After discussing this with a senior labmate, there are two ways to get there: 1. after learning BPE you already have a shared vocabulary; with a few modifications it can be passed to binarize as the dictionary; 2. ignore the BPE vocabulary and binarize twice: first binarize each language pair separately to obtain per-pair dictionaries, then merge those dictionaries and use the merged dictionary as the parameter for a second binarize pass.
This post records how, after working through the fairseq-preprocess pipeline and referring to https://github.com/RayeRen/multilingual-kd-pytorch/blob/master/preprocess_universal.py, I implemented a simpler, one-step binarization of multiple language pairs.
After that, the post also walks through the first preprocessing approach mentioned above. (The second one may be added later.)
This method starts from an understanding of fairseq_cli/preprocess.py and modifies it into a script for multilingual fairseq-preprocess.
Looking at fairseq_cli/preprocess.py:
If srcdict or tgtdict is provided, the dictionary is read via task.load_dictionary(args.srcdict). The call chain of task.load_dictionary is [fairseq.tasks.translation.TranslationTask] -> [fairseq.tasks.fairseq_task.FairseqTask.load_dictionary] -> [fairseq.data.dictionary.Dictionary.load] -> [fairseq.data.dictionary.Dictionary.add_from_file].
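Schematically, the source-side branch in main() looks like this (paraphrased, not the verbatim fairseq code; the exact details differ between fairseq versions):
# paraphrased from fairseq_cli/preprocess.py
if args.srcdict:
    # an existing dictionary file is simply loaded
    src_dict = task.load_dictionary(args.srcdict)
else:
    # otherwise a dictionary is built from the training data,
    # where _build_dictionary forwards to task.build_dictionary
    src_dict = _build_dictionary(
        [train_path(args.source_lang)], task=task, args=args, src=True
    )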
If no dict is provided, the dictionary is created via task.build_dictionary; the code of fairseq.tasks.fairseq_task.FairseqTask.build_dictionary is as follows:
d = Dictionary()
for filename in filenames:
    Dictionary.add_file_to_dictionary(
        filename, d, tokenizer.tokenize_line, workers
    )
d.finalize(threshold=threshold, nwords=nwords, padding_factor=padding_factor)
return d
So if we put the train files of all language pairs into filenames, build_dictionary directly produces one shared vocabulary, and we can then binarize every language pair with it. The modification goes as follows:
I first copied fairseq_cli/preprocess.py into the current directory and then modified the following three functions:
# note: the copied preprocess.py also needs "import glob" added to its imports
def cli_main():
    parser = options.get_preprocessing_parser()
    parser.add_argument('--pref', metavar='FP', default=None, help='data prefix')
    args = parser.parse_args()
    main(args)


def main(args):
    # setup some basic things
    utils.import_user_module(args)

    os.makedirs(args.destdir, exist_ok=True)

    logger.addHandler(
        logging.FileHandler(
            filename=os.path.join(args.destdir, "preprocess.log"),
        )
    )
    logger.info(args)

    assert (
        args.dataset_impl != "huffman"
    ), "preprocessing.py doesn't support Huffman yet, use HuffmanCodeBuilder directly."

    # build shared dictionaries
    # target = not args.only_source
    train_files = glob.glob('{}/train.*-*.*'.format(args.pref))
    train_files = [f for f in train_files if len(f.split('.')) in [3, 4, 5]]
    test_files = glob.glob('{}/test.*-*.*'.format(args.pref))
    test_files = [f for f in test_files if len(f.split('.')) in [3, 4, 5]]
    valid_files = glob.glob('{}/valid.*-*.*'.format(args.pref))
    valid_files = [f for f in valid_files if len(f.split('.')) in [3, 4, 5]]
    lng_pairs = set([f.split('/')[-1].split(".")[1] for f in (train_files + test_files + valid_files)])

    task = tasks.get_task(args.task)

    # one dictionary built over the training data of all language pairs
    shared_dictionary = _build_dictionary(
        train_files,
        task=task,
        args=args,
        src=True,
    )

    # save dictionaries
    if args.joined_dictionary:
        shared_dictionary.save(os.path.join(args.destdir, "dict.txt"))
    else:
        for lng_pair in lng_pairs:
            src, tgt = lng_pair.split('-')
            tmp_src_dict_path = os.path.join(args.destdir, f'dict.{src}.txt')
            tmp_tgt_dict_path = os.path.join(args.destdir, f'dict.{tgt}.txt')
            if not os.path.exists(tmp_src_dict_path):
                shared_dictionary.save(tmp_src_dict_path)
            if not os.path.exists(tmp_tgt_dict_path):
                shared_dictionary.save(tmp_tgt_dict_path)

    if args.dict_only:
        return

    # binarize every language pair with the shared dictionary
    for lng_pair in lng_pairs:
        src_and_tgt = lng_pair.split('-')
        if len(src_and_tgt) != 2:
            continue
        src, tgt = src_and_tgt
        print("| building: ", src, tgt)
        args.source_lang = src
        args.target_lang = tgt
        _make_all(src, shared_dictionary, args)
        _make_all(tgt, shared_dictionary, args)

    logger.info("Wrote preprocessed data to {}".format(args.destdir))


def _make_all(lang, vocab, args):
    lng_pair = f"{args.source_lang}-{args.target_lang}"
    _make_dataset(  # e.g. iwslt14.tokenized/train.en-ar
        vocab, os.path.join(args.pref, f"train.{lng_pair}"), "train", lang, args=args, num_workers=args.workers
    )
    _make_dataset(
        vocab, os.path.join(args.pref, f"valid.{lng_pair}"), "valid", lang, args=args, num_workers=args.workers
    )
    _make_dataset(
        vocab, os.path.join(args.pref, f"test.{lng_pair}"), "test", lang, args=args, num_workers=args.workers
    )
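The remaining helpers in the copied file (_build_dictionary, _make_dataset, and so on) are used unchanged; _build_dictionary simply forwards to task.build_dictionary with the threshold/nwords/padding arguments, which is exactly the code path shown earlier. A typical invocation of the modified script might then look like this (--pref is the argument added above and should point at the directory holding the train/valid/test.<src>-<tgt>.<lang> files; the destination directory name is only an example):
# hypothetical invocation of the modified preprocess.py
python preprocess.py \
    --task translation \
    --pref tmp \
    --destdir data-bin-multi \
    --joined-dictionary \
    --workers 8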
The other route is the first approach mentioned at the top: learning BPE already yields a shared vocabulary, which only needs a few modifications before it can be passed to binarize as the dictionary.
Starting from the raw IWSLT14 downloads in the current directory, run the script below to split the data; the resulting file layout is summarized after the script, and the train.all file among them is used to learn sentencepiece:
#!/usr/bin/env bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
###
# just generate train/valid/test data for iwslt14
# with the same simple preprocessing steps and without tokenization, because the next step is learning spm
###
SCRIPTS=mosesdecoder/scripts
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
tmp=tmp
orig=orig
tgt=en
rm -r $orig
rm -r $tmp
mkdir -p $orig $tmp
for src in ar de es fa he it nl pl; do
lang=$src-en
echo "pre-processing train data..."
for l in $src $tgt; do
if [[ ! -f $src-en.tgz ]]; then
wget https://wit3.fbk.eu/archive/2014-01//texts/$src/en/$src-en.tgz
fi
cd $orig
tar zxvf ../$src-en.tgz
cd ..
f=train.tags.$lang.$l
cat $orig/$lang/$f | \
grep -v '<url>' | \
grep -v '<talkid>' | \
grep -v '<keywords>' | \
grep -v '<speaker>' | \
sed -e 's/<title>//g' | \
sed -e 's/<\/title>//g' | \
sed -e 's/<description>//g' | \
sed -e 's/<\/description>//g' > $tmp/$f
echo ""
done
for l in $src $tgt; do
perl $LC < $tmp/train.tags.$lang.$l > $tmp/train.$lang.$l
rm $tmp/train.tags.$lang.$l
done
echo "pre-processing valid/test data..."
for l in $src $tgt; do
for o in `ls $orig/$lang/IWSLT14.TED*.$l.xml`; do
fname=${o##*/}
f=$tmp/${fname%.*}
echo $o $f
grep '<seg id' $o | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $LC > $f
echo ""
done
done
echo "creating train, valid, test..."
for l in $src $tgt; do
mv $tmp/train.$src-$tgt.$l $tmp/train-valid.$src-$tgt.$l
awk '{if (NR%23 == 0) print $0; }' $tmp/train-valid.$src-$tgt.$l > $tmp/valid.en-$src.$l
awk '{if (NR%23 != 0) print $0; }' $tmp/train-valid.$src-$tgt.$l > $tmp/train.en-$src.$l
rm $tmp/train-valid.$src-$tgt.$l
cat $tmp/IWSLT14.TED.dev2010.$src-$tgt.$l \
$tmp/IWSLT14.TEDX.dev2012.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2010.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2011.$src-$tgt.$l \
$tmp/IWSLT14.TED.tst2012.$src-$tgt.$l \
> $tmp/test.en-$src.$l
rm $tmp/IWSLT14.TED*.$src-$tgt.$l
done
TRAIN=$tmp/train.all
for l in $src $tgt; do
cat $tmp/train.en-$src.$l >> $TRAIN
done
done
echo "counting..."
for src in ar de es fa he it nl pl; do
for split in train valid test; do
for l in $src $tgt; do
wc -l $tmp/$split.en-$src.$l
done
done
done
echo "done"
Next, learn the sentencepiece model and apply it; the resulting files under bpe/ are what will be binarized.
#!/usr/bin/env bash
echo 'Cloning fairseq repository...'
git clone git@github.com:facebookresearch/fairseq.git
# learn bpe
bpe=bpe
tmp=tmp
tgt=en
SCRIPTS=mosesdecoder/scripts
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
rm -r $bpe
mkdir -p $bpe
python -u fairseq/scripts/spm_train.py \
--input=$tmp/train.all \
--model_prefix=spm.bpe \
--vocab_size=30000 \
--character_coverage=1.0 \
--model_type=bpe \
--num_threads=45 \
--shuffle_input_sentence
# apply bpe
for split in train valid test; do
for src in ar de es fa he it nl pl; do
echo ${split} en-${src}
python fairseq/scripts/spm_encode.py \
--model spm.bpe.model \
--output_format=piece \
--inputs ${tmp}/${split}.en-${src}.${src} ${tmp}/${split}.en-${src}.en \
--outputs ${bpe}/${split}.en-${src}.bpe.unclean.${src} ${bpe}/${split}.en-${src}.bpe.unclean.en
# restrict length ratio
perl $CLEAN -ratio 1.5 ${bpe}/${split}.en-${src}.bpe.unclean ${src} en ${bpe}/${split}.en-${src}.bpe 1 256
rm ${bpe}/${split}.en-${src}.bpe.unclean.*
done
done
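Each line of the resulting *.bpe files is simply a space-separated sequence of sentencepiece pieces, roughly like the illustrative (made-up) line below, where ▁ marks a word boundary:
▁this ▁is ▁an ▁il lust rative ▁sent ence
The last step builds the shared dictionary from spm.bpe.vocab and binarizes every language pair with the stock fairseq-preprocess: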
#!/usr/bin/env bash
# create shared dict
path=data-bin
rm -r $path
mkdir -p $path
cut -f1 spm.bpe.vocab | tail -n +4 | sed "s/$/ 100/g" > $path/dict.txt
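# Note: spm.bpe.vocab is tab-separated "piece<TAB>score"; its first three rows are
# <unk>, <s> and </s>, which fairseq adds on its own, hence "tail -n +4".
# The sed appends a dummy frequency so every line matches fairseq's "token count"
# dictionary format, e.g. "▁the 100".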
#for lang in ar de es fa he it nl pl en; do
# cp $path/dict.txt $path/dict.${lang}.txt
#done
for src in ar de es fa he it nl pl; do
echo en-${src}
fairseq-preprocess \
--source-lang $src --target-lang en \
--trainpref bpe/train.en-${src}.bpe \
--validpref bpe/valid.en-${src}.bpe \
--testpref bpe/test.en-${src}.bpe \
--destdir $path \
--srcdict $path/dict.txt \
--tgtdict $path/dict.txt
done
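If everything ran through, data-bin should contain the shared dictionaries plus one pair of .bin/.idx files per split and language. The names below follow fairseq-preprocess's usual output convention (shown for ar-en; check preprocess.log if anything looks off):
data-bin/dict.txt
data-bin/dict.ar.txt  data-bin/dict.en.txt
data-bin/train.ar-en.ar.bin  data-bin/train.ar-en.ar.idx
data-bin/train.ar-en.en.bin  data-bin/train.ar-en.en.idx
data-bin/valid.ar-en.*  data-bin/test.ar-en.*
data-bin/preprocess.log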
https://github.com/RayeRen/multilingual-kd-pytorch/blob/master/data/iwslt/raw/prepare-iwslt14.sh
https://github.com/facebookresearch/fairseq/issues/2110#issue-614837309
https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100