Goal: read through the NMT domain adaptation papers during July.
Scenario: imagine a large but inexperienced translation company that has a large stock of news-domain bilingual corpora, a small amount of tech-domain bilingual corpora (or none at all), and tech-domain monolingual corpora (plentiful, scarce, or absent). It takes on a translation project in the tech domain. How can it use the resources at hand to do the tech-domain translation as well as possible?
Problem: here the tech domain is the in-domain and the news domain is the out-of-domain. The question is how to use the limited in-domain parallel corpus together with the relatively abundant out-of-domain parallel corpus to improve in-domain translation performance.
Papers read:
1. A Survey of Domain Adaptation for Neural Machine Translation
https://arxiv.org/pdf/1806.00258.pdf
The survey groups the approaches into two broad categories:
1. Data Centric
2. Model Centric
On the Data Centric side, the third sub-category, Using Out-of-Domain Parallel Corpora, is easy to understand: when using the out-of-domain parallel corpus, if all of the out-of-domain data is used but in a domain-aware way, that is Multi-Domain; if instead some criterion is used to pick a subset of the out-of-domain parallel data, that is Data Selection.
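One classic selection criterion in this family is cross-entropy difference (in the spirit of Moore & Lewis 2010): score each out-of-domain sentence by H_in(s) − H_out(s) under an in-domain and an out-of-domain language model, and keep the sentences that look most in-domain. Below is a minimal sketch under strong simplifying assumptions — add-one-smoothed unigram LMs and pre-tokenized sentences — not the setup of any specific paper, which would typically use n-gram or neural LMs on both source and target sides.

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Build an add-one-smoothed unigram model from tokenized sentences
    (a toy stand-in for the n-gram/neural LMs used in practice)."""
    counts = Counter(tok for s in sentences for tok in s)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-token negative log-probability under a model."""
    return -sum(logprob(t) for t in sentence) / len(sentence)

def select(out_domain, in_domain, k):
    """Keep the k out-of-domain sentences with the lowest
    H_in(s) - H_out(s), i.e. the most in-domain-looking ones."""
    lm_in = unigram_lm(in_domain)
    lm_out = unigram_lm(out_domain)
    scored = sorted(
        out_domain,
        key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_out))
    return scored[:k]

# Hypothetical toy corpora: "tech" is in-domain, "news" is out-of-domain.
in_dom = [["the", "gpu", "accelerates", "training"],
          ["neural", "networks", "run", "on", "the", "gpu"]]
out_dom = [["the", "election", "results", "surprised", "everyone"],
           ["the", "gpu", "market", "is", "growing"],
           ["stocks", "fell", "on", "the", "news"]]
picked = select(out_dom, in_dom, 1)
print(picked)  # → [['the', 'gpu', 'market', 'is', 'growing']]
```

The score rewards sentences that are likely under the in-domain LM *and* unlikely under the out-of-domain LM, which filters out generic sentences that every LM assigns high probability to.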
On the Model Centric side, adjustments are made at training time (Training), at decoding time (Decoding), or to the model architecture itself (Architecture Centric). The idea I find most interesting is the Domain Discriminator: the encoder is followed both by a decoder responsible for target-sentence generation and by a discriminator responsible for predicting the source domain. The discriminator helps the encoder capture domain information.
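The structure of that multi-task objective can be sketched as follows. Everything here is a hypothetical simplification for illustration — the "encoder" is just averaged token embeddings, the discriminator is a single logistic unit, and the translation loss is a placeholder scalar — whereas a real system would use a neural encoder and backpropagate both losses into its shared parameters.

```python
import math

DIM = 4  # toy embedding size

def encode(tokens, emb):
    """Toy 'encoder': average the token embeddings (stand-in for a
    recurrent/Transformer encoder shared by decoder and discriminator)."""
    vec = [0.0] * DIM
    for t in tokens:
        for i, v in enumerate(emb[t]):
            vec[i] += v
    return [v / len(tokens) for v in vec]

def discriminate(h, w, b):
    """Logistic domain classifier on the encoder output:
    returns P(domain = in-domain | h)."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, label):
    """Binary cross-entropy for the domain prediction."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def joint_loss(translation_loss, p, domain_label, lam=0.1):
    """Multi-task objective: translation loss plus a weighted
    domain-discrimination loss (lam is an assumed hyperparameter)."""
    return translation_loss + lam * bce(p, domain_label)

# Hypothetical parameters and a single in-domain example.
emb = {"the": [0.1, 0.1, 0.1, 0.1], "gpu": [0.9, 0.2, 0.1, 0.0]}
h = encode(["the", "gpu"], emb)
p = discriminate(h, [0.5, -0.2, 0.3, 0.1], 0.0)
loss = joint_loss(translation_loss=2.3, p=p, domain_label=1)
```

Because both losses flow through the shared encoder output `h`, minimizing the joint objective pushes the encoder to produce representations that are useful for translation while also encoding which domain the input came from.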
2. Sentence Selection and Weighting for Neural Machine Translation Domain Adaptation
https://ieeexplore.ieee.org/abstract/document/8360031/
3. Document-Level Adaptation for Neural Machine Translation
http://www.aclweb.org/anthology/W18-2708
4. Instance Weighting for Neural Machine Translation Domain Adaptation
http://www.aclweb.org/anthology/D/D17/D17-1155.pdf
5. An Empirical Comparison of Simple Domain Adaptation Methods for Neural Machine Translation
https://arxiv.org/pdf/1701.03214.pdf
6. Multi-Domain Neural Machine Translation through Unsupervised Adaptation
http://www.aclweb.org/anthology/W/W17/W17-4713.pdf
7. Dynamic Data Selection for Neural Machine Translation
https://arxiv.org/pdf/1708.00712.pdf
8. Cost Weighting for Neural Machine Translation Domain Adaptation
http://www.aclweb.org/anthology/W17-3205