A record of recent (since 2017) WMT14 English-French baselines
1. GNMT
https://arxiv.org/pdf/1609.08144.pdf
Corpus processing: a shared source and target vocabulary of 32K wordpieces
For the wordpiece models, we train 3 different models with vocabulary sizes of 8K, 16K, and 32K. Table 4 summarizes our results on the WMT En→Fr dataset. In this table, we also compare against other strong baselines without model ensembling. As can be seen from the table, “WPM-32K”, a wordpiece model with a shared source and target vocabulary of 32K wordpieces, performs well on this dataset and achieves the best quality as well as the fastest inference speed.
On WMT En→Fr, the training set contains 36M sentence pairs. In both cases, we use newstest2014 as the test sets to compare against previous work. The combination of newstest2012 and newstest2013 is used as the development set.
Results: Table 4 on page 16: En→Fr WPM-32K 38.95
or Table 6 on page 17: En→Fr trained with log-likelihood 38.95
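For reference, here is a minimal sketch of building such a shared 32K source/target subword vocabulary. GNMT's wordpiece implementation (Schuster and Nakajima, 2012) is Google-internal, so SentencePiece is used purely as a public stand-in, and the file names are hypothetical:

```python
import sentencepiece as spm

# Hypothetical file names: train.en / train.fr are the two sides of the WMT14 En-Fr
# training corpus. Passing both files builds a single vocabulary shared by source
# and target, analogous to GNMT's shared 32K wordpiece vocabulary.
spm.SentencePieceTrainer.train(
    input="train.en,train.fr",
    model_prefix="wpm32k",
    vocab_size=32000,
    model_type="bpe",          # wordpiece-style segmentation; "unigram" is another option
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="wpm32k.model")
print(sp.encode("Resumption of the session.", out_type=str))
```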
2. Transformer
https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Corpus processing: a 32000 joint word-piece vocabulary
For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.
Results: Table 2 on page 8: Transformer (base model) 38.1, Transformer (big) 41.0
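A side note on scoring: these baselines report tokenized BLEU on newstest2014 (multi-bleu.perl style). The sketch below uses sacrebleu only as a rough modern substitute; because sacrebleu scores detokenized output, its numbers are not directly comparable to the table entries, and the file names are hypothetical.

```python
import sacrebleu

# Hypothetical files: detokenized French system outputs for newstest2014 and the
# matching French references, one sentence per line.
with open("newstest2014.hyp.fr", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("newstest2014.ref.fr", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

# sacrebleu scores detokenized output; the papers above report tokenized BLEU,
# so this number is indicative only and not directly comparable to the tables.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"BLEU = {bleu.score:.2f}")
```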
3. RNMT+
http://aclweb.org/anthology/P18-1008
Corpus processing: 32K joint sub-word units (these are in fact 32K wordpieces)
We train our models on the standard WMT’14 En→Fr and En→De datasets that comprise 36.3M and 4.5M sentence pairs, respectively. Each sentence was encoded into a sequence of sub-word units obtained by first tokenizing the sentence with the Moses tokenizer, then splitting tokens into subword units (also known as “wordpieces”) using the approach described in (Schuster and Nakajima, 2012). We use a shared vocabulary of 32K sub-word units for each source-target language pair.
Results: Table 1 on page 81: RNMT+ 41.00 ± 0.05
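A sketch of the two-stage preprocessing described above (Moses tokenization, then wordpiece splitting), using sacremoses and SentencePiece as public stand-ins for the original tooling; the model file name and sentence are placeholders:

```python
import sentencepiece as spm
from sacremoses import MosesTokenizer

# Stand-ins for the pipeline quoted above: sacremoses for the Moses tokenizer,
# SentencePiece for the shared 32K wordpiece vocabulary (e.g. the model trained in
# the earlier GNMT sketch).
mt_en = MosesTokenizer(lang="en")
sp = spm.SentencePieceProcessor(model_file="wpm32k.model")

sentence = "Resumption of the session."
moses_tokens = mt_en.tokenize(sentence, return_str=True)  # step 1: Moses tokenization
pieces = sp.encode(moses_tokens, out_type=str)            # step 2: split into wordpieces
print(pieces)
```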
4. ConvS2S
https://arxiv.org/pdf/1705.03122.pdf
GitHub: https://github.com/facebookresearch/fairseq/
https://github.com/facebookresearch/fairseq/issues/59 (corpus processing)
Corpus processing: 40K joint BPE
We use the full training set of 36M sentence pairs, and remove sentences longer than 175 words as well as pairs with a source/target length ratio exceeding 1.5. This results in 35.5M sentence-pairs for training. Results are reported on newstest2014. We use a source and target vocabulary with 40K BPE types.
Note the validation set setup: In all setups a small subset of the training data serves as validation set (about 0.5-1% for each dataset) for early stopping and learning rate annealing.
Results: Table 1: ConvS2S (BPE 40K) 40.51
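The filtering and held-out validation setup quoted above is simple to reproduce. Below is a sketch under the assumption that whitespace tokens count as "words", with hypothetical file names:

```python
import random

# Drop pairs where either side exceeds 175 words or the source/target length ratio
# exceeds 1.5, then hold out roughly 0.5% of the surviving pairs as a validation set.
def keep(src, tgt, max_len=175, max_ratio=1.5):
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0 or n_src > max_len or n_tgt > max_len:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

random.seed(0)
with open("train.en") as f_en, open("train.fr") as f_fr, \
     open("train.filt.en", "w") as t_en, open("train.filt.fr", "w") as t_fr, \
     open("valid.en", "w") as v_en, open("valid.fr", "w") as v_fr:
    for en, fr in zip(f_en, f_fr):
        if not keep(en, fr):
            continue
        out_en, out_fr = (v_en, v_fr) if random.random() < 0.005 else (t_en, t_fr)
        out_en.write(en)
        out_fr.write(fr)
```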
5. Fairseq
https://arxiv.org/pdf/1806.00187.pdf
GitHub: https://github.com/pytorch/fairseq
Corpus processing: 40K joint BPE
For En–Fr, we train on WMT’14 and borrow the setup of Gehring et al. (2017) with 36M training sentence pairs. We use newstest12+13 for validation and newstest14 for test. The 40K vocabulary is based on a joint source and target BPE factorization.
Validation set: newstest12+13
Results: Table 2: Our result 43.2
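A sketch of learning the 40K joint BPE described above with subword-nmt (Sennrich et al.'s BPE implementation); file names are hypothetical, and 40000 merge operations is taken as roughly equivalent to 40K BPE types:

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merges on the concatenation of the tokenized English and French training
# sides so that source and target share one code table ("joint BPE").
with open("train.all", "w") as joint:
    for path in ("train.tok.en", "train.tok.fr"):
        with open(path) as f:
            joint.writelines(f)

with open("train.all") as infile, open("bpe.codes.40k", "w") as outfile:
    learn_bpe(infile, outfile, num_symbols=40000)

with open("bpe.codes.40k") as codes:
    bpe = BPE(codes)
print(bpe.process_line("resumption of the session"))
```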