[Paper Notes] Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages.


What "universal" means here:

  • Insensitive to the particular language
  • Trained on one language, then applied directly to the same task in other languages

.

Compared to similar efforts such as Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019), three new cross-lingual pre-training tasks are proposed, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives.


  • Building on Multilingual BERT and XLM, the paper proposes three new cross-lingual pre-training tasks.

.

Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline.


Two evaluation tasks: cross-lingual natural language inference and cross-lingual question answering.

Introduction

Multilingual BERT trains a BERT model based on multilingual Wikipedia,
which covers 104 languages. As its vocabulary contains tokens from all languages, Multilingual BERT can be used for cross-lingual tasks directly.


A brief overview of Multilingual BERT:

  • Covers 104 languages
  • Uses a shared vocabulary across languages

.

XLM further improves Multilingual BERT by introducing a translation language model (TLM). TLM takes a concatenation of a bilingual sentence pair as input and performs masked language modeling on it. By doing this, it learns the mappings among different languages and performs well on the XNLI dataset.
However, XLM only uses a single cross-lingual task during pre-training.


A brief overview of XLM:

  • Introduces the TLM pre-training task
  • Feeds bilingual sentence pairs during training and runs MLM on them (a minimal sketch of this input format follows below)
  • Uses only one cross-lingual pre-training task (TLM)

.
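To make the TLM input format concrete, here is a minimal sketch (not the authors' code) of how a bilingual sentence pair could be concatenated and randomly masked for masked language modeling. The special tokens, the 15% masking rate, and the helper name are assumptions borrowed from common BERT/XLM practice; XLM additionally adds language embeddings and resets position ids for the second sentence, which is omitted here.

```python
import random

MASK, BOS, SEP = "[MASK]", "<s>", "</s>"  # placeholder special tokens (assumed)

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Concatenate a bilingual sentence pair and mask tokens in both halves.

    Returns (input_tokens, labels), where labels hold the original token at
    masked positions and None elsewhere, in the spirit of the TLM objective.
    """
    rng = random.Random(seed)
    tokens = [BOS] + src_tokens + [SEP] + tgt_tokens + [SEP]
    inputs, labels = [], []
    for tok in tokens:
        if tok not in (BOS, SEP) and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # predict the original token at this position
        else:
            inputs.append(tok)
            labels.append(None)     # position not used in the MLM loss
    return inputs, labels

if __name__ == "__main__":
    en = "the cat sat on the mat".split()
    fr = "le chat était assis sur le tapis".split()
    x, y = make_tlm_example(en, fr)
    print(x)
    print(y)
```

Because both halves are visible at once, the model can use the French context to recover a masked English word and vice versa, which is what makes TLM cross-lingual.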

At the same time, Liu et al. (2019) have shown that multi-task learning can further improve a BERT-style pre-trained model. So we think more cross-lingual tasks could further improve the resulting pre-trained model for cross-lingual tasks. To verify this, we propose Unicoder, a universal language encoder that is insensitive to different languages and pre-trained based on 5 pre-training tasks.


The central idea of the paper: multi-task learning can effectively improve the performance of pre-trained models.

.

In short, our contributions are 4-fold. First, 3 new cross-lingual pre-training tasks are proposed, which can help to learn a better language-independent encoder. Second, a cross-lingual question answering (XQA) dataset is built, which can be used as a new cross-lingual benchmark dataset. Third, we verify that by fine-tuning multiple languages together, significant improvements can be obtained. Fourth, on the XNLI dataset, new state-of-the-art results are achieved.


A summary of the points the authors keep emphasizing, starting from the abstract:

  • Three new cross-lingual pre-training tasks
  • A newly built cross-lingual question answering (XQA) dataset (from the earlier text, it appears machine translation was used for data augmentation)
  • The finding that fine-tuning on multiple languages together clearly improves results
  • New state of the art on the inference task (XNLI)

.

Method Details

Model Architecture

[Figure: model structure]

Unicoder follows the network structure of XLM (Lample and Conneau, 2019). A shared vocabulary is constructed by running the Byte Pair Encoding (BPE) algorithm (Sennrich et al., 2016) on the corpora of all languages. We also down-sample the rich-resource language corpora, to prevent words of the target languages from being split too much at the character level.


  • Unicoder has the same architecture as XLM
  • A shared vocabulary is used across languages
  • The corpora are tokenized with BPE
  • Rich-resource corpora are down-sampled so that words of low-resource languages are not split into character-level pieces; without down-sampling, the BPE merges would be dominated by the high-resource languages (see the sketch below)
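One way to picture the down-sampling in the last bullet is the multinomial language sampling used by XLM: sentences from language i are drawn with probability proportional to its corpus share raised to a power alpha, which flattens the distribution; alpha = 0.5 is the value XLM reports. The helper below is only an illustrative sketch of that rescaling, not the paper's code.

```python
def language_sampling_probs(corpus_sizes, alpha=0.5):
    """Rescale per-language corpus shares p_i to q_i proportional to p_i ** alpha.

    With alpha < 1 the distribution is flattened: rich-resource languages are
    down-sampled and low-resource languages are up-sampled, so the shared BPE
    vocabulary is not dominated by a few big corpora.
    """
    total = sum(corpus_sizes.values())
    shares = {lang: n / total for lang, n in corpus_sizes.items()}
    unnorm = {lang: p ** alpha for lang, p in shares.items()}
    z = sum(unnorm.values())
    return {lang: q / z for lang, q in unnorm.items()}

if __name__ == "__main__":
    sizes = {"en": 1_000_000, "fr": 200_000, "ur": 5_000}  # toy corpus sizes
    print(language_sampling_probs(sizes))
```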

Pre-training Tasks

Cross-lingual Word Recovery

[Figure: cross-lingual word recovery]

Formally, given a bilingual sentence pair (X, Y), where X = (x1, x2, …, xm) is a sentence with m words from language s, Y = (y1, y2, …, yn) is a sentence with n words from language t, this task first represents each xi as xti ∈ Rh by all word embeddings of Y:

xti = Σj softmax(Ai)j · e(yj)

where e(yj) is the word embedding of yj and A ∈ Rm×n is a cross-attention matrix calculated by:

Aij = W [e(xi); e(yj); e(xi) ⊙ e(yj)]

W ∈ R3h is a trainable weight and ⊙ is element-wise multiplication.
.
Then, Unicoder takes Xt = (xt1, xt2, …, xtm) as input, and tries to predict the original word sequence X.
.
This task also aims to let the pre-trained model learn the underlying word alignments between two languages.


  • Bidirectional cross-attention: each sentence is re-represented from the other language's word embeddings, and the model then predicts the original words of its own language (no labels needed beyond parallel sentence pairs)
  • The goal is for the pre-trained model to learn word alignments between the two languages during pre-training (a PyTorch sketch of the attention step follows below)

.
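A minimal PyTorch sketch of the attention step above: each source word embedding is re-expressed as a softmax-weighted sum of target word embeddings, with scores computed from the concatenation [xi; yj; xi ⊙ yj] through the weight W ∈ R3h. The module and tensor names are mine, and the encoder that then recovers X from Xt is omitted.

```python
import torch
import torch.nn as nn

class CrossLingualWordRecoveryAttention(nn.Module):
    """Re-express source word embeddings via attention over target embeddings.

    Scores follow A_ij = W [x_i ; y_j ; x_i * y_j] with W in R^{3h}, then
    x_i^t = sum_j softmax(A_i)_j * y_j.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.w = nn.Linear(3 * hidden_size, 1, bias=False)  # W in R^{3h}

    def forward(self, x_emb: torch.Tensor, y_emb: torch.Tensor) -> torch.Tensor:
        # x_emb: (m, h) source embeddings, y_emb: (n, h) target embeddings
        m, h = x_emb.shape
        n = y_emb.shape[0]
        x_exp = x_emb.unsqueeze(1).expand(m, n, h)                 # (m, n, h)
        y_exp = y_emb.unsqueeze(0).expand(m, n, h)                 # (m, n, h)
        feats = torch.cat([x_exp, y_exp, x_exp * y_exp], dim=-1)   # (m, n, 3h)
        scores = self.w(feats).squeeze(-1)                         # A in R^{m x n}
        attn = scores.softmax(dim=-1)                              # normalize over target words
        return attn @ y_emb                                        # X^t in R^{m x h}

if __name__ == "__main__":
    attn = CrossLingualWordRecoveryAttention(hidden_size=8)
    x_t = attn(torch.randn(5, 8), torch.randn(7, 8))
    print(x_t.shape)  # torch.Size([5, 8])
```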

Cross-lingual Paraphrase Classification

[Figure: cross-lingual paraphrase classification]

This task takes two sentences from different languages as input and classifies whether they have the same meaning.


  • This task is straightforward: concatenate a bilingual sentence pair and run text classification to decide whether the two sentences are paraphrases of each other (see the sketch below).
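As a quick illustration, here is a minimal sketch of how such a training example might be assembled for a BERT/XLM-style encoder; the special-token names and the (tokens, label) layout are assumptions, not the paper's exact format.

```python
BOS, SEP = "<s>", "</s>"  # placeholder special tokens (assumed)

def make_paraphrase_example(sent_a_tokens, sent_b_tokens, is_paraphrase: bool):
    """Build a single cross-lingual paraphrase-classification example.

    The classifier reads the whole concatenated sequence (typically through the
    encoder's first-token representation) and predicts the binary label.
    """
    tokens = [BOS] + sent_a_tokens + [SEP] + sent_b_tokens + [SEP]
    label = 1 if is_paraphrase else 0
    return tokens, label

if __name__ == "__main__":
    en = "how old are you".split()
    zh = list("你多大了")  # character-level toy tokenization, just for the example
    print(make_paraphrase_example(en, zh, True))
```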

Cross-lingual Masked Language Model

[Figure: cross-lingual masked language model]

We construct a cross-lingual document by replacing the sentences at even indices with their translations.


How this differs from TLM:

  • TLM performs MLM on concatenated parallel sentence pairs
  • This task performs MLM on documents whose sentences alternate between the two languages (easy to see from the data in the figure)

.
Data source: take a corpus with sentence-level alignment, then replace every other sentence of the source-language document (the even-indexed ones, in the paper's wording) with its target-language translation.
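A minimal sketch of that document construction, assuming sentence-aligned source and target documents; whether "every other sentence" starts counting from 0 or 1 is just a convention, and the sketch follows the paper's wording (even indices).

```python
def make_cross_lingual_document(src_sentences, tgt_sentences):
    """Replace even-indexed source sentences with their aligned translations.

    The output alternates between the two languages, and masked language
    modeling is then applied to it exactly as to a monolingual document.
    """
    assert len(src_sentences) == len(tgt_sentences), "documents must be sentence-aligned"
    return [
        tgt if i % 2 == 0 else src
        for i, (src, tgt) in enumerate(zip(src_sentences, tgt_sentences))
    ]

if __name__ == "__main__":
    en_doc = ["Sentence one .", "Sentence two .", "Sentence three ."]
    fr_doc = ["Phrase un .", "Phrase deux .", "Phrase trois ."]
    print(make_cross_lingual_document(en_doc, fr_doc))
    # ['Phrase un .', 'Sentence two .', 'Phrase trois .']
```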

Multi-language Fine-tuning

[Figure: multi-language fine-tuning]

Taking Chinese as the test-time language, the fine-tuning settings are:

  • Translate-train: only English training data is available; it is machine-translated into Chinese, and fine-tuning is done on the translated pseudo corpus.
  • Translate-test: only English training data is available; fine-tuning is done on the English corpus, and at test time the Chinese inputs are first translated into English before anything else is done.
  • Cross-lingual test: the most standard setting, with no tricks at all; train on English, test on Chinese.
  • **Multi-language fine-tuning**: only English training data is available; it is machine-translated into Chinese, fine-tuning uses the genuine English corpus and the Chinese pseudo corpus together, and only Chinese performance is measured at test time.

.
The authors report that the fourth setting greatly improves performance on most datasets.
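A minimal sketch of the multi-language fine-tuning data mixing, assuming a hypothetical machine-translation helper translate() from English to the target language: the genuine English examples and their translated copies are simply pooled for fine-tuning, and evaluation then uses target-language data only.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)

def build_multilanguage_finetuning_set(
    english_examples: List[Example],
    translate: Callable[[str], str],   # hypothetical MT helper: English -> target language
    seed: int = 0,
) -> List[Example]:
    """Mix genuine English data with machine-translated pseudo data.

    Labels are copied unchanged onto the translated copies; the two corpora are
    shuffled together and used jointly for fine-tuning.
    """
    pseudo = [(translate(text), label) for text, label in english_examples]
    mixed = english_examples + pseudo
    random.Random(seed).shuffle(mixed)
    return mixed

if __name__ == "__main__":
    toy_translate = lambda s: f"<zh translation of: {s}>"  # stand-in for a real MT system
    en = [("a man is playing guitar", 0), ("two dogs are running", 1)]
    for text, label in build_multilanguage_finetuning_set(en, toy_translate):
        print(label, text)
```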

Experiments

Data Processing

  • Data source: fifteen languages…
  • Tokenization: follows the line of Koehn et al. (2007) and Chang et al. (2008) for each language (see the sketch below)
  • BPE
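As an illustration of the per-language tokenization step, here is a small sketch using the sacremoses package, a Python port of the Moses tokenizer from Koehn et al. (2007). Which exact tools the authors ran per language is not spelled out here, and languages such as Chinese would additionally need a dedicated word segmenter (Chang et al., 2008) before BPE is applied.

```python
from sacremoses import MosesTokenizer  # pip install sacremoses

def moses_tokenize(text: str, lang: str):
    """Tokenize one sentence with the Moses rules for the given language code."""
    return MosesTokenizer(lang=lang).tokenize(text)

if __name__ == "__main__":
    print(moses_tokenize("Unicoder isn't language-specific.", "en"))
    print(moses_tokenize("Unicoder n'est pas spécifique à une langue.", "fr"))
```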

Training Details

Results Compared with the Current SOTA

Performance on the XNLI Dataset

[Figure: XNLI results]

Performance on the XQA Dataset

[Figure: XQA results]

Model Analysis

Ablation Study

Relationship Between the Number of Languages and Performance

Relationship Between English and the Other Languages (Rich-resource vs. Other Corpora)

Relationships Among the Different Languages
