Unified Language Model Pre-training for Natural Language

Unified Language Model Pre-training for Natural Language Understanding and Generation
对于语言理解和生成的统一语言模型预训练
Abstract
This paper presents a new Unified pre-trained Language Model(UNILM) that can be fine-tuned for both natural language understanding and generation tasks.
这篇论文提出了一个新的统一的预训练语言模型(UNILM)可以对自然语言理解和生成任务进行微调。
The model is pre-trained using three types of language modeling tasks:unidirectional,bidirectional,and sequence-to-sequence prediction.
这个模型使用三种不同的语言模型任务进行预训练:单向,双向和序列到序列的预测。
The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on.
统一模型通过应用一个共享的Transformer网络获得,并且使用一个特殊的自注意力任务去控制下一个预测条件在什么样的语境下。
UNILM compares favorably with BERT on the GLUE benchmark,and the SQuAD2.0 and CoQA question answering tasks.
UNILM在GLUE基准上,以及SQuAD2.0和CoQA问答任务上与BERT任务进行比较,发现相应的效果有所提升。
Moreover,UNILM achieves new state-of -the art results on five natural language generation datasets,including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51(2.04 absolute improvement).
此外,UNILM达到最好的结果在五项自然语言生成数据集上,包括提升CNN/DailyMail的抽象总结ROUGE-L达到40.51的分数(2.04绝对提升)
the Gigaword abstractive summarization ROUGE-L to 35.75(0.86 absolute improvement),
Gigaword抽象总结ROUGE-L达到35.75(0.86绝对提升)
the CoQA generative question answering F score to 82.5(37.1 absolute improvement),
CoQA生成问答F分数达到82.5(37.1绝对提升)
the SQuAD question generation BLEU-4 to 22.12(3.75 absolute improvement)
SQuAD问题生成BLEU-4达到22.12(3.75绝对提升)
and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67(human performance is 2.65)
以及DSTC7基于文本对话反应生成NIST-4达到2.67(人类表现是2.65).
The code and pre-trained models are available at https://github.com/microsoft/unilm.
代码和预训练模型在https://github.com/microsoft/unilm上面可以获得。
1.Introduction
Language model(LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks.
语言预训练模型通过一系列自然语言处理任务过程已经大量地推动了最好的结果。
Pre-trained LMs learn contextualized text representations by predicting words based on their context using large amounts of text data,and can be fine-tuned to adapt to downstream tasks.
contextualized:将…置于背景,置于上下文考虑
预训练语言模型学习,通过学习上下文单词表示可以基于他们的文本使用大量的文本数据预测单词,并且可以微调去适应下游任务。
In this work we propose a new Unified pre-trained Language Model(UNILM) that can be applied to both natural language understanding(NLU) and natural language generation(NLG) tasks.
在这个工作中我们提出了一个新的统一预训练模型(UNILM)可以应用于自然语言理解和自然语言生成任务之中。
UNILM is a multi-layer Transformer network,jointly pre-trained on large amounts of text,optimized for three types of unsupervised language modeling objectives as shown in Table2.
UNILM是一个多层Transformer网络,共同在大语料上进行预训练,使用三种类型的无监督语言模型对象像Table2之中的那样显示的进行优化。
Unified Language Model Pre-training for Natural Language_第1张图片
Unified Language Model Pre-training for Natural Language_第2张图片Table 2:The unified LM is jointly pre-trained by multiple language modeling objectives,sharing the same parameters.
unified语言模型被多种语言模型对象进行共同的预训练,并且使用相同的参数。
We fine-tune and evaluate the pre-trained unified LM on various datasets,including both language understanding and generation tasks.
我们微调并且预测预训练unified语言模型在不同的数据集上,包括同样语言理解和生成任务。
In particular,we design a set of cloze tasks where a masked word is predicted based on its context.
特别的,我们设计一个克隆任务的数据集,被遮盖的单词在基于它的内容上被预测。
These cloze tasks differ in how the context is defined.这些克隆任务的区别在于文本是如何被定义的。
For a left-to-right unidirectional LM,the context of the masked word to be predicted consists of all the words on its left.
作为一个从左到右的单向语言模型,被预测的遮盖单词内容包括所有在它左边的单词。
For a right-to-left unidirectional LM,the context consists of all the words on the right.
作为一个从右到左的单向语言模型,被预测的遮盖单词内容包括所有在它右边的单词。
For a bidirectional LM,the context consists of the words on both the right and the left.
作为一个双向语言模型,被预测单词的内容包括所有在左边和右边的内容。
For a sequence-to-sequence LM,the context of the to-be-predicted word in the second(target) sequence consists of all the words in the first(source) sequence and the words on the its left in the target sequence.
作为一个序列到序列的语言模型,在第二个序列中被预测的单词的内容包含所有在第一个序列中单词的内容以及所有它左边的目标序列的单词内容。

Similar to BERT,the pre-trained UNILM can be fine-tuned(with additional task-specific layers if necessary) to adapt to various downstram tasks.
类似于BERT,预训练的UNILM可以被微调(如果有必要的话使用额外的特定任务的网络层)去适应不同的下游任务。
But unlike BERT which is used mainly for NLU tasks,UNILM can be configured,using different self-attention masks(Section 2),to aggregate context for different types of language models,and thus can be used for both NLU and NLG tasks.
但是不像BERT主要用于NLU任务,UNLIM可以使用不同的自注意力掩码配置(篇章2),去使用不同类型的语言模型增强文本,这可以被用作同样的NLU和NLG任务之中。

The proposed UNILM has three main advantages.
上述的UNLIM有三个主要的长处。
First,the unified pre-training procedure leads to a single Transformer LM that uses the shared parameters and architecture for different types of LMs,alleviating the need of separately training and hosting multiple LMs.
第一,统一的预训练过程导致一个单独的Transformer语言模型对于不同的参数和结构使用不同类型的语言模型,减轻单独的训练和使用不同语言模型的需求。
Second,the parameter sharing makes the learned text representations more general because they are jointly optimized for different language modeling objectives where context is utilized in different ways,mitigating overfitting to any single LM task.
第二,参数共享使得已学习的文本表示更加普遍,因为在文本被不同的方向上使用的时候他们对不同语言模型目标进行共同的优化,减轻对于任何单独语言模型任务的过拟合。
Third,in addition to its application to NLU tasks,the use of UNILM as a sequence-to-sequence LM(Section 2.3),makes it a natural choice for NLG,such as abstractive summarization and question generation.
第三,除了它的NLU任务之中的应用,UNILM作为一个序列到序列的语言模型,使得它作为一个NLG自然的选项,例如抽象总结和问题生成。

Experimental results show that our model,used as a bidirectional encoder,compares favorably with BERT on the GLUE benchmark and two extractive questioin answering tasks(i.e.,SQuAD 2.0 and CoQA).
实验结果显示我们的模型被用作双向编码模型,比BERT在GLUE基准上的结果以及两个提取问答任务的效果要好(即SQuAD 2.0和CoQA)。
In addition,we demonstrate the effectiveness of UNILM on five NLG datasets,where it is used as a sequence-to-sequence model,creating new state-of-the-art results on CNN/DailyMail and Gigaword abstractive summarization,SQuAD question generation,CoQA generative question answering,and DSTC7 dialog response generation.
除此之外,我们论述了UNILM在五个NLG数据集上面的有效性,在这里它被用作一个序列到序列的模型,在CNN/DailyMail和Gigaword抽象总结,SQuAD问答,CoQA生成问答和DSTC7对话回应生成之中创建了新的最好的结果。
2.Unified Language Model Pre-training
统一预训练语言模型
Given an input sequence x = x 1 , . . . x ∣ x ∣ x = x_{1},...x_{|x|} x=x1,...xx,UNILM obtains a contextualized vector representation for each token.
给定一个输入序列 x = x 1 , . . . x ∣ x ∣ x = x_{1},...x_{|x|} x=x1,...xx,UNILM获得了对于每一个标记的上下文向量表示。
As shown in Figure 1,the pre-training optimizes the shared Transformer network with respect to several unsupervised language modeling objectives,namely,unidirectional LM,bidirectional LM,and sequence-to-sequence LM.
像表1所示的那样,预训练模型最优化共享与几个无监督语言模型目标有关的共享Transformer网络,即,单向语言模型,双向语言模型和序列到序列的语言模型。
In order to control the access to the context of the word token to be predicted,we employ different masks for self-attention.
为了去控制去预测的单词内容标记,我们对应self-attention使用不同的masks。
In other words,we use masking to control how much context the token should attend to when computing its contextualized representation.
换句话说,我们使用mask去控制当计算它的上下文表示的时候,多少内容的标记需要被考虑到。
Once UNILM is pretrained,we can fine-tune it using task-specific data for downstream tasks.
一旦UNILM被预训练,我们可以为了下游任务使用特定任务的数据去微调它。
Unified Language Model Pre-training for Natural Language_第3张图片Figure 1:Overview of unified LM pre-training.
图片1:unified语言模型预训练总览。
The model parameters are shared across the LM objectives(i.e.,bidirectional LM,unidirectional LM,and sequence-to-sequence LM).
模型参数在语言目标之间共享(即,双向语言模型,单向语言模型和序列到序列的语言模型)
We use different self-attention masks to control the access to context for each word token.
我们使用不同的自注意力掩盖去控制对于每一个单词标志到达的内容。
The right-to-left LM is similar to the left-to-right one,which is omitted in the figure for brevity.
右到左的语言模型与左到右的语言模型相似,经常在图片中为了简洁而省略。
2.1 Input Representation
输入表示
The input x is a word sequence,which is either a text segment for unidirectional LMs or a pair of segments pached together for bidirectional LM and sequence-to-sequence LM.
输入x是一个单词序列,要么是一个对于单向语言模型的文本片段,要么是一对片段堆积起来的双向语言模型和序列到序列的语言模型。
We always add a special start-of-sequence([SOS]) token at the beginning of input,and a special end-of-sequence([EOS]) token at the end of each segment.
我们经常添加一个开始标志([SOS])在每段输入的开始,和一个特殊的结束标志([EOS])在每段输入的结束。
[EOS] not only marks the sentence boundary in NLU tasks,but also is used for the model to learn when to terminate the decoding process in NLG tasks.
[EOS]不仅标记着NLU任务的边界,并且也用于模型去学习在NLG任务中何时结束解码的过程。
The input representation follows that of BERT.
输入标记在BERT标记之后。
Texts are tokenized to subword units by WordPiece.
文本被WordPiece进行分词为子词单元。
For each input token,its vector representation is computed by summing the corresponding token embedding,position embedding,and segment embedding.
对于每一个输入序列,它的向量表示通过对于相应的token embedding,position embedding和segment embedding求和得到。
Since UNILM is trained using multiple LM tasks,segment embeddings also play a role of LM identifier in that we use different segment embeddings for different LM objectives.
既然UNILM使用不同的语言模型任务被训练,segment embeddings同样扮演一个语言模型鉴别者的角色因为我们对于不同的语言目标使用不同的segment embeddings。

2.2 Backbone Network:Multi-Layer Transformer
网络层的支柱:多层Transformer
The input vectors x i i = 1 ∣ x ∣ { x_{i}}_{i=1}^{|x|} xii=1x is first packed into H 0 = [ x 1 , . . . x ∣ x ∣ ] H^{0} = [x_{1},...x_{|x|}] H0=[x1,...xx],and then encoded into contextual representations at different levels of abstract H l = [ h 1 l , . . . h ∣ x ∣ l ] H^{l} = [{h_{1}}^{l},...h_{|x|}^{l}] Hl=[h1l,...hxl] using an L-layer Transformer H l = T r a n s f o r m e r l ( H l − 1 ) , l ∈ [ 1 , L ] H^{l} = Transformer_{l}(H^{l-1}),l \in [1,L] Hl=Transformerl(Hl1),l[1,L]
输入向量 x i i = 1 ∣ x ∣ { x_{i}}_{i=1}^{|x|} xii=1x是第一个被打包进 H 0 = [ x 1 , . . . x ∣ x ∣ ] H^{0} = [x_{1},...x_{|x|}] H0=[x1,...xx],接着被编码进上下文本表示不同抽象级别的 H l = [ h 1 l , . . . h ∣ x ∣ l ] H^{l} = [{h_{1}}^{l},...h_{|x|}^{l}] Hl=[h1l,...hxl]使用一个L层Transformer H l = T r a n s f o r m e r l ( H l − 1 ) , l ∈ [ 1 , L ] H^{l} = Transformer_{l}(H^{l-1}),l \in [1,L] Hl=Transformerl(Hl1),l[1,L]
In each Transformer block,multiple self-attention heads are used to aggregate the output vectors of the previous layer.
对于每个Transformer块,不同self-attention头被用来集合之前层的输出向量。
For the i-th Transformer layer,the output of a self-attention head A l A_{l} Al is computed via:
对于每一个i层的Transformer层,自注意力头 A l A_{l} Al的输出通过计算
Q = H l − 1 W l Q , K = H l − 1 W l K , V = H l − 1 W l V ( 1 ) Q = H^{l-1}W_{l}^{Q},K=H^{l-1}W_{l}^{K},V=H^{l-1}W_{l}^{V}(1) Q=Hl1WlQ,K=Hl1WlK,V=Hl1WlV(1)
M i j = 0 ( 2 ) M_{ij} = 0(2) Mij=0(2),allow to attend
M i j = − ∞ ( 2 ) M_{ij} = -\infty(2) Mij=(2),prevent from attending
A l = s o f t m a x ( Q K T d k + M ) V l ( 3 ) A_{l} = softmax(\frac{QK^{T}}{\sqrt{d_{k}}}+M)V_{l}(3) Al=softmax(dk QKT+M)Vl(3)
where the previous layer’s output H l − 1 ∈ R ∣ x ∣ × d h H^{l-1} \in R^{|x|\times{d_{h}}} Hl1Rx×dh is linearly projected to a triple of queries,keys and values using parameter matrices W l Q , W l K , W l V ∈ R d h × d k W_{l}^{Q},W_{l}^{K},W_{l}^{V} \in R^{d_{h}\times d{k}} WlQ,WlK,WlVRdh×dk,respectively.
之前的层的输出 H l − 1 ∈ R ∣ x ∣ × d h H^{l-1} \in R^{|x|\times{d_{h}}} Hl1Rx×dh分别地被线性地映射到一个三元组queries,keys和values之中使用参数矩阵 W l Q , W l K , W l V ∈ R ∣ x ∣ × d h W_{l}^{Q},W_{l}^{K},W_{l}^{V} \in R^{|x|\times{d_{h}}} WlQ,WlK,WlVRx×dh
and the mask matrix M ∈ R ∣ x ∣ × ∣ x ∣ M \in R^{|x|\times|x|} MRx×x决定是否一对标志应该被相互注意到。
We use different mask matrices M to control what context a token can attend to when computing its contextualized representatioin,as illustrated in Figure1.
我们使用不同的遮盖矩阵M去控制当计算它的上下文表示的时候,什么内容标志可以被注意到,如图片1所论述的那样。
Unified Language Model Pre-training for Natural Language_第4张图片Take bidirectional LM as an example.The elements of the mask matrix are all 0s,indicating that all the tokens have access to each other.
将双向语言模型作为一个例子。遮盖矩阵元素都是0,暗示所有的标志都有到达相应内容的途径。
2.3 Pre-training Objectives
预训练目标
We pretrain UNILM using four cloze tasks designed for different language modeling objectives.
我们预训练UNILM使用四个克隆任务用于不同的语言模型目标之中。
In a cloze task,we randomly choose some WordPiece tokens in the input,and replace them with special token[MASK].
在一个克隆的任务之中,我们随机在输入中选择一些WordPiece标志,并且将他们使用一些特殊的标志[MASK]进行替换。
Then,we feed their corresponding output vectors computed by the Transformer network into a softmax classifier to predict the masked token.
接下来,我们将他们相应的输出相应被相应的Transformer网络使用softmax分类器计算去预测掩盖的标志。
The parameters of UNILM are learned to minimize the cross-entropy loss computed using the predicted tokens and the original tokens.
UNILM参数被学习来最小化计算交叉熵计算使用预测的标志和原来的标志。
It is worth noting that the use of cloze tasks makes it possible to use the same training procedure for all the LMs,unidirectional and bidirectional alike.
值得注意到的是克隆任务让它对于所有的语言模型,例如单向的和双向的能够可能使用相同的训练过程。
Unidirectional LM
单向的语言模型
We use both left-to-right and right-to-left LM objectives.Take the left-to-right LM as an example.
我们使用同样的从左到右和从右到左的语言模型目标。将从左到右语言模型作为一个例子。
The representation of each token encodes only the leftward context tokens and itself.
每个标志代表编码只有左边的标志和它自身。
For instance,to predict the masked token of “x1x2[MASK]x4”,only tokens x1,x2 and itself can be used.
例如,为了预测掩盖标志"x1x2[MASK]x4",只有x1,x2和它自身可以被使用。
This is done by using a triangular matrix for the self-attention mask M(as in Equation(2)),where the upper triangular part of the self-attention mask is set to − ∞ -\infty ,and the other elements to 0,as shown in Figure1.
这个通过对自我注意力掩盖M使用三角矩阵(像在等式(2)之中的那样),最上方的三角部分的自注意力掩码被设置到 − ∞ -\infty ,同时其他元素被设置为0,像在图1中显示的那样。
Bidirectional LM
Following [9],a bidirectional LM allows all tokens to attend to each other in prediction.
接着[9],一个双向的语言模型允许所有的标志在预测的时候注意到其他位置。
It encodes contextual information from both directions,and can generate better contextual representations of text than its unidirectional counterpart.
它编码上下文信息从不同的方向,并且可以比它的单向相对部分产生更好的上下文文本的表示。
As indicated in Equation(2),the self-attention mask M is a zero matrix,so that every token is allowed to attend across all positions in the input sequence.
正如等式(2)中显示的那样,自注意力掩码M是一个零矩阵,所以每一个标志都可以关注到输入序列之中的所有的位置。
Sequence-to-Sequence LM
序列到序列的语言模型
As shown in Figure1,for prediction,the tokens in the first(source) segment can attend to each other from both directions within the segment,
正如Figure1中显示的那样,为了预测,在第一个段落(资源)中的标志可以照料到在本段落中的每一个来自不同方向的段落的标志。
whilie the tokens of the second(target)segment can only attend to the leftward context in the target segment and itself,as well as all the tokens in the source segment.
然而第二个段落标志只能照料到目标段落左边的文本和它自身,以及所有目标段落中的标志。
For example,given source segment t 1 t 2 t_{1}t_{2} t1t2 and its target segment t 3 t 4 t 5 t_{3}t_{4}t_{5} t3t4t5,we feed input “[SOS] t 1 t 2 t_{1} t_{2} t1t2 [EOS] t 3 t 4 t 5 t_{3} t_{4} t_{5} t3t4t5 [EOS]” into the model.
例如,给出源段落 t 1 t 2 t_{1}t{2} t1t2和它的目标段落 t 3 t 4 t 5 t_{3}t_{4}t_{5} t3t4t5,我们将输入"[SOS] t 1 t 2 t_{1} t_{2} t1t2 [EOS] t 3 t 4 t 5 t_{3} t_{4} t{5} t3t4t5 [EOS]"放入到模型之中。
While both t 1 t_{1} t1 and t 2 t_{2} t2 have access to the first four tokens,including [SOS] and [EOS], t 4 t_{4} t4 can only attend to the first six tokens.
尽管 t 1 t_{1} t1 t 2 t_{2} t2有到达第一个四个标志的途径,包括[SOS]和[EOS], t 4 t_{4} t4只可以关注到第一个六个标志。
Unified Language Model Pre-training for Natural Language_第5张图片Figure 1 shows the self-attention mask M used for the sequence-to-sequence LM objective.
表1显示的是自注意力掩码用于序列到序列语言目标。
The left part of M is set to 0 so that all tokens can attend to the first segment.
M的左边部分被设置为0所以所以的标志可以注意到第一个段落。
The upper right part is set to − ∞ -\infty to block attentions from the source segment to the target segment.
最上右方的部分被设置为 − ∞ -\infty 对于从源段落到目标段落遮盖的注意力。
Moreover,for the lower rights part,we set its upper triangular part to − ∞ -\infty ,and the other elemetns to 0,which prevents tokens in the target segment from attending their future(right) positions.
此外,对于最低的右边部分,我们设置它的最上面的三角部分为 − ∞ -\infty ,以及其他的元素设置为0,组织在目标段落中的标志关注他们的将来(右边)位置。

During training,we randomly choose tokens in both segments,and replace them with the special token[MASK].
在训练中,我们随即选择在同样段落之中的标记,并且将他们使用特殊的标记[MASK]进行替换。
The model is learned to recover the masked tokens.
模型去学习恢复掩盖的标记。
Since the pair of source and target texts are packed as a contiguous input text sequence in training,we implicitly encourage the model to learn the relationship between the two segments.
既然在训练的过程中来源和目标文本被打包成为一个连续输入序列,我们隐蔽地鼓励模型去学习两个段落之间的关系。
In order to better predict tokens in the target segment,UNILM learns to effectively encode the source segment.
为了更好地学习目标段落,UNILM学习有效地将目标段落进行编码。
Thus,the cloze task designed for the sequence-to-sequence LM,also known as the encoder-decoder model,simultaneously pre-trains a bidirectional encoder and an unidirectional decoder.
因此,克隆任务用于序列到序列的语言模型,同时被视作编码-解码模型,同时与训练一个双向编码和一个单向解码的过程。
The pre-trained model,used as an encoder-decoder model,can be easily adapted to a wide range of conditional text generation tasks,such as abstractive summarization.
预训练模型使用编码解码模型,可以被很容易地适应一个大范围的有条件的文本生成任务,例如抽象总结。
Next Sentence Prediction
下一个句子预测
For the bidirectional LM,we also include the next sentence prediction task for pre-training,as in [9].
对于双向语言模型,我们也包括了预训练中的下一个句子预测任务,正如[9]中所示的那样。
2.4 Pre-training Setup
预训练设置
The overall training objective the sum of different types of LM objectives described above.
不同类型语言模型的总的训练目标在上面的内容中描述过了。
Specifically,within one training batch,1/3 of the time we use the bidirectional LM objective,1/3 of the time we employ the sequence-to-sequence LM objective,and both left-to-right and right-to-left LM objectives are sampled with rate of 1/6.
特别地,在一个训练批次之中,1/3的时间我们使用双向语言模型目标,1/3时间我们使用序列到序列的语言模型目标,从左到右和从右到左的语言模型目标采用1/6的比例进行采样。
The model architecture of UNILM follows that of BERT_LARGE[9] for a fair comparison.
UNILM追随BERT_LARGE为了一个更加公平的比较。
The gelu activation is used as GPT.
gelu激活函数跟在GPT之中使用的一样。
Specifically,we use a 24-layer Transformer with 1,024 hidden size,and 16 attention heads,which contains about 340M parameters.
特别地,我们使用24层Transformer有1024个隐藏层,以及16个注意力头,包含大概340M参数。
The weight matrix of the softmax classifier is tied with token embeddings.
权重矩阵的softmax分类器与token embeddings绑定在一起。
UNILM is initialized by BERTLARGE,and then pre-trained using English Wikipedia and BookCorpus,which have been processed in the same way as [9].The vocabulary size is 28,996.
UNILM使用BERTLARGE的参数初始化,接着使用英语Wikipedia和书语料,进行的过程与[9]一样。单词的数量是28,996。
The maximum length of input sequence is 512.The token masking probability is 15%.
最大输入序列是512。单词被遮盖的概率是15%。
Among masked positions,80% of the time we replace the token with [MASK],10% of the time with a random token,and keeping the original token for the rest.
在被掩盖的位置中,80%的时间我们使用[MASK]替代标记,10%的时间使用随机标记进行替换,剩余的标记将原来的标记保持。
In addition,80% of the time we randomly mask one token each time,and 20% of the time we mask a bigram or a trigram.
除此之外,80%的时间我们每次随机掩盖一个标记,20%的时间我们掩盖两个或三个单词。
Adam[22] with β 1 = 0.9 \beta_{1} = 0.9 β1=0.9, β 2 = 0.999 \beta_{2} = 0.999 β2=0.999 is used for optimization.
Adam使用 β 1 = 0.9 \beta_{1} = 0.9 β1=0.9, β 2 = 0.999 \beta_{2} = 0.999 β2=0.999用于优化。
The learning rate is 3e-5,with linear warmup over the first 40000 steps and linear decay.
学习率是3e-5,在最初的40000步使用线性热身接下来按权重进行缩减。
The dropout rate is 0.1.The weight decay is 0.01.The batch size is 330.
dropout率是0.1,权重衰减为0.01,批次大小为330.
The pre-training procedure runs for about 770,000 steps.It takes about 7 hours for 10,000 steps using 8 Nvidia Telsa V100 32GB GPU cards with mixed precision training.
预训练过程跑了大概770000步。使用混合精度装着8个Nidia Telsa V100 32GB GPU卡跑10000步大概需要7个小时。
2.5 Fine-tuning on Downstream NLU and NLG Tasks
在下游NLU和NLG任务之中的微调
For NLU tasks,we fine-tune UNILM as a bidirectional Transformer encoder,like BERT.
在NLU任务中,我们微调UNILM作为一个双向Transformer的编码器,类似于BERT。
Take text classification as an example.We use the encoding vector of [SOS] as the representation of input,denoted as h 1 L h_{1}^{L} h1L,and feed it to a randomly initialized softmax classifier(i.e.,the task-specific output layer),where W C ∈ R d h × C W^{C} \in R^{d_{h} \times C} WCRdh×C is a parameter matrix,and C the number of categories.
将文本分类作为一个例子。我们使用编码向量[SOS]作为输入的代表,视作 h 1 L h_{1}^{L} h1L,接着将它喂进初始化的softmax分类器, W C ∈ R d h × C W^{C} \in R^{d_{h} \times C} WCRdh×C属于一个参数矩阵,并且C是分类的数量。
We maximize the likelihood of the labeled training data by updating the parameters of the pre-trained LM and the added softmax classifier.
我们通过更新预训练语言模型的参数和增加的softmax分类器来最大化标签训练数据的可能性。

For NLG tasks,we take the sequence-to-sequence task as an example.
对于NLG任务,我们使用序列到序列的任务作为一个例子。
The fine-tuning procedure is similar to pretraining using the self-attention masks as in Section 2.3.
微调过程类似于预训练过程使用自我注意力掩码像段落2.3一样。
Let S1 and S2 denote source and target sequences,respectively.We pach them together with special tokens,to form the input"[SOS] S1 [EOS] S2 [EOS]".
将S1和S2分别代表来源和目标序列。我们将他们使用特殊的标志打包在一起,形成输入"[SOS] S1 [EOS] S2 [EOS]"。
The model is fine-tuned by masking some percentage of tokens in the target sequence at random,and learning to recover the masked words.
模型微调的过程中通过随机掩盖一些目标序列的标志,并且学习去恢复掩盖的单词。
The training objective is to maximize the likelihood of masked tokens given context.
训练目标是去最大化给定文本掩盖标志的可能性。
It is worth noting that [EOS],which marks the end of the target sequence,can also be masked during fine-tuning,thus when this happens,the model learns when to emit [EOS] to terminate the generation process of the target sequence.
值得注意的是[EOS]标记着目标序列的终止,在微调的过程中也可以被标记着,因此当这发生的时候,模型需要学会什么时候省略[EOS]去终止目标序列生成的过程。

3.Experiments
We have conducted experiments on both NLU(i.e.,the GLUE benchmark,and extractive question answering) and NLG tasks(i.e.,abstractive summarization,question generation,generative question answering,and dialog response generation).
我们已经做实验在NLU(即,GLUE基准和抽取问答)以及NLG任务(即,抽象总结,问题生成,生成问答,对话回应生成)。
3.1 Abstrative Summarization
抽象总结
Automatic text summarization produces a concise and fluent summary conveying the key information in the input(e.g.,a news article).
自动文本总结产生了一个简单流利的总结将关键信息传递到输入之中。
We focus on abstractive summarization,a generation task where the summary is not constrained to reusing the phrases or sentences in the input text.
当总结不止局限于重复使用输入文本中的词组和句子的时候,我们专注于抽象总结。
We use the non-anonymized version of the CNN/DailyMail dataset[37] and Gigaword[36] for model fine-tuning and evaluation.
我们使用CNN/DailyMail数据集中非匿名的版本和Gigaword为了模型的微调和预测。
We fine-tune UNILM as a sequence-to-sequence model following the procedure described in Section 2.5 by concatenating document(the first segment) and summary(the second segment) as input which is truncated according to a pre-defined maximum length.
我们使用序列到序列模型遵循片段2.5的流程通过连接文本(第一个段落)和总结(第二个段落)并且根据预先定义的最大长度进行切断对UNILM进行微调。

(!!!将论文和总结连接起来并进行一波微调!!!)

We fine-tune our model on the training set for 30 epochs.We reuse most hyper-parameters from pre-training.
我们在训练集上使用30个迭代次数微调我们的模型。我们重新使用与训练中最多的超参数。
The masking probability is 0.7.We also use label smoothing with rate of 0.1.
被遮盖的可能性是0.7。我们同样使用0.1比例的标签平滑。
(标签平滑是一种在分类之中防止过拟合的方法。)
For CNN/DailyMail,we set batch size to 32,and maximum length to 768.For Gigaword,we set batch size to 64,and maximum length to 256.
对于CNN/DailyMail,我们设置批次大小为32,以及最大长度为768。对于Gigaword,我们设置批次大小为64,以及最大长度为256。
During decoding,we use beam search with beam size of 5.The input document is truncated to the first 640 and 192 tokens for CNN/DailyMail and Gigaword,respectively.
在解码中,我们使用束搜索使用束的大小为5.输入文档被切断为前面的640个单词和192个单词分别对于CNN/DailyMail和Gigaword。
We remove duplicated trigrams in beam search,and tweak the maximum sumary length on the development set.
我们除去在束搜索中重复的三元数组,并且轻微调整训练集上面的最大总结长度。

We use the F1 version of ROUGE[25] as the evaluation metric for both datasets.In Table 3,we compare UNILM against the baseline and several state-of-the-art models on CNN/DailyMail.
我们使用F1版本的ROUGE[25]作为所有数据集的测试基准。在表3,我们将UNILM与baseline进行比较以及和几个目前最好的在CNN/DailyMail之中的模型进行比较。
LEAD-3 is a baseline model that extracts the first three sentences in a document as its summary.PGNet[37] is a sequence-to-sequence model based on the pointer-generator network.
LEAD-3是一个基准模型从第一个文中抽取三个句子作为它的总结。PGNet是一个序列到序列的模型基于指针生成的网络。
S2S-ELMo uses a sequence-to-sequence model augmented with pre-trained ELMo representations,which is termed as SRC-ELMO+SHEMB in [13].
S2S-ELMo使用一个序列到序列的模型对ELMo表示进行增强,被视作为在[13]中的SRC-ELMO+SHEMB。
Bottom-Up is a sequence-to-sequence model augmented with a bottom-up content selector for selecting salient phrases.
Bottom-Up是一个序列到序列的模型为了选择最主要的词组使用一个bottom-up内容选择器。
We also include in Table 3 the best reported extractive summarization result[27] on the dataset.As shown in Table 3,our model outperforms all previous abstractive systems,creating a new state-of-the-art abstractive summarization result on the dataset.
我们同样在Table3包含最好报道的在数据集上的提取总结结果。正如Table 3之中显示的那样,我们的模型比之前所有的抽取系统都要好,创建了一个新的最好的数据集上的抽取总结结果
Our model also outperforms the best extractive model by 0.88 point in ROUGE-L.
我们的模型在ROUGE-L上也比最好提取模型要表现得好。

In Table 4,we evaluate the models on Gigaword with different scales(10K and 3.8M).Both Transformer[43] and OpenNMT[23] implement standard attentional sequence-to-sequence models.
在表4,我们在Gigaword上面使用不同的规格(10K和3.8M)去测量模型。同样Transformer和OpenNMT执行标准的注意力序列到序列的模型。
Re3Sum[4] retrieves summaries as candidate templates,and then use an extended sequence-to-sequence model to generate summaries.Experimental results show that UNILM achieves better performance than previous work.
Re3Sum提取总结作为备选模板,接着使用一个延伸的序列到序列的模型去产生总结。实验的结果显示UNILM比之前的工作获得了更好的表现。
Besides,in the low-resource setting(i.e.,only 10,000 examples are used as training data),our model outperforms MASS by 7.08 point in ROUGE-L.
除此之外,在低资源配置(即,只有10,000例子被用作训练数据),我们的模型在ROUGE-L上面超出MASS7.08分。
3.2 Question Answering(QA)
The task is to answer a question given a passage[33,34,15].There are two settings.The first is called extractive QA,where the answer is assumed to be a text span in the passage.The other is called generative QA,where the answer needs to be generated on the fly.
任务是去给定一篇文章回答一个问题。有两种设置方式,第一种是提取QA,答案被视为在文章中的文本间隔。第二种是生成QA,答案需要在产生并及时生效。
(on the fly:产生并及时生效)
Extractive QA
This task can be formulated as a NLU task where we need to predict the start and end positions of the answer spans within the passage.We fine-tune the pre-trained UNILM as a bidirectional encoder for the task.
当我们需要文章中回答间隔的开始标志和结束标志的时候,任务可以被构建为一个NLU任务。我们微调预训练UNILM作为一个任务的双向编码
We conduct experiments on the Stanford Question Answering Dataset(SQuAD) 2.0[34],and Conversational Question Answering(CoQA)[35] datasets.
我们在斯坦福问答数据集(SQuAD)和争议问答(CoQA)数据集上面做实验。

The results on SQuAD 2.0 are reported in Table 5,where we compare two models in Exact Match(EM) and F1 score.RMR+ELMo[20] is an LSTM-based question answering model augmented with pre-trained language representation.
SQuAD 2.0的结果在Table 5上面发布,我们使用EM和F1分数进行比较两个模型。RMR+ELMo是一个使用预训练语言表示加强的基于LSTM的问答模型。
BERT_LARGE is a cased model,fine-tuned on the SQuAD training data for 3 epochs,with batch size 24,and maximum length 384.UNILM is fine-tuned in the same way as BERT_LARGE.We see that UNILM outperforms BERT_LARGE.
BERT_LARGE是一个大写的模型,在SQuAD训练数据上面微调了3个循环,使用批次大小为24和最大长度为384。UNILM和BERT_LARGE微调的方法一样。我们可以看出UNILM比BERT_LARGE的效果要好。

CoQA is a conversational question answering dataset.Compared with SQuAD,CoQA has serveral unique characteristics.First,the examples in CoQA are conversational,so we need to answer the input question based on conversation histories.
CoQA是一个谈话问答数据集。与SQuAD相比,CoQA有几个独特的特点。首先,CoQA中的例子是对话形式的,所以我们需要回答输入的问题基于对话的历史。
Second,the answer in CoQA can be free-form texts,including a large portion is of yes/no answers.
第二,CoQA的回答可以是任何形式的文本,包括一大部分yes/no的回答。

We modify the model used for SQuAD as follows.Firstly,in addition to the asked question,we concatenate the question-answer histories to the first segment,so that the model can capture conversational information.
我们按照如下步骤为SQuAD修改模型。首先,除了被问到的问题,我们将问答历史拼接在一起形成第一个段落,使得模型可以捕获对话信息。
Secondly,for yes/no questions,we use the final hidden vector of the [SOS] token to predict whether the input is a yes/no question,and whether the answer is yes or no.For other example,we select a passage subspan with the highest F1 score for training.
第二,对于是/否问题,我们使用最终隐藏向量[SOS]标志去预测是否输入是一个yes/no问题,是否回答是yes或者no。对于其他例子,我们选自一个文章的子间隔使用最高的F1分数进行训练。

The results on CoQA are reported in Table 6,where we compare two models in F1 scores.DrQA+ELMo[35] is an LSTM-based question answering model augmented with pre-trained ELMo representation.
CoQA的结果在Table 6中被报道出来,我们使用了F1分数比较了两个模型。DrQA+ELMo是一个使用预训练ELMo表示进行加强的基于LSTM的问答模型。

BERT_LARGE is a cased model,fine-tuned on the CoQA training data for 2 epochs,with batch size 16,and maximum length 512.UNILM is fine-tuned with the same hyper-parameters as BERT_LARGE.We see that UNILM outperforms BERT_LARGE.
BERT_LARGE是一个大写的模型,在CoQA训练数据上微调了两个迭代,使用批次大小为16以及最大长度为512。UNILM是一个使用相同的超参数与BERT_LARGE进行微调的模型。我们可以看出UNILM比BERT_LARGE效果要好。

Generative QA
Generative question answering generates free-form answers for the input question and passage,which is a NLG task.In contrast,extractive methods can only predict subspans of the input passage as answers.
生成问答对于输入的问题和文章产生任何形式的回答,这作为一个NLG任务。相反,抽取方法只能预测将输入文章作为回答的子跨度。
On the CoQA dataset(as described above),Reddy et al.[2019] show that vanilla sequence-to-sequence models still underperforms extractive methods by a wide margin.
在CoQA数据集上(正如上面所描述的那样),Reddy et al显示香草序列到序列的模型仍然比抽取方式有一个明显的优势。

We adapt UNILM to generative question answering as a sequence-to-sequence model.The first segment(i.e.,the input sequence) is the concatenation of conversational histories,the input question and the passage.
我们使用UNILM取生成问答作为一个序列到序列的模型。第一个段落(即输入序列)是一个连接的对话历史,输入问题的文章。
The second segment(i.e.,the output sequence) is the answer.We fine-tune the pre-trained UNILM on the CoQA training set for 10 epochs.
第二个段落(即输出序列)作为回答。我们使用预训练的UNILM在CoQA训练集上微调了10个批次。
We set the batch size to 32,the mask probability to 0.5,and the maximum length to 512.We also use label smoothing with rate of 0.1.The other hyper-parameters are kept the same as pre-training.
我们设置批次的大小为32,掩盖的概率为0.5,以及最大长度为512.我们同样使用标签顺滑为0.1。其他的参数和预训练之中设置的一样。
During decoding,we use beam search with beam size of 3.The maximum length of input question and passage is 470.For passages that are longer than the maximum length,we split the passage into several chunks with a sliding window approach,and select a chunk with the highest word overlap over the question.
在解码阶段,我们使用束搜索并且束的大小为3.输入问题和文章的最大长度为470.对于超过最大长度的文章,我们将文章拆分为多个模块使用滑动窗口的方法,以及选择问题之中的最大单词重叠过程。
We compare our method with the generative question answering models Seq2Seq and PGNet as described in [35].The Seq2Seq baseline is a sequence-to-sequence model with an attention mechanism.
我们比较我们的方法使用生成问答模型Seq2Seq和PGNet像[35]中所示的那样。Seq2Seq基准是一个序列到序列的使用attention机制的模型。
The PGNet model arguments Seq2Seq with a copy mechanism.As shown in Table 7,our generative question answering model outperforms previous generative methods by a wide margin,with significantly closes the gap between generative method and extractive method.
PGNet模型参数Seq2Seq有一个复制机制。正如Table7中显示的那样,我们的生成问答模型比之前的生成方法有明显的提升,明显地缩进了生成方法和提取方法的距离。
3.3 Question Generation
We conduct experiments for the answer-aware question generation task[52].Given an input passage and an answer span,our goal is to generate a question that asks for the answer.
我们运行一个答案问题生成任务。给出一个输入文章和一个回答答案选项,我们的目标是生成一个问题去索要答案。
The SQuAD 1.1 dataset[33] is used for evaluation.Following[12],we split the original training set into training and test sets,and keep the original development set.
SQuAD 1.1数据集用于测试。遵循[12],我们将原先的训练集划分为训练集和测试集,并且保持原先的开发集。
We also conduct experiments following the data split as in [51],which uses the reversed dev-test split.
我们同时做实验按照[51]的数据划分,使用相反的开发测试数据划分。

The question generation task is formulated as a sequence-to-sequence problem.The first segment is the concatenation of input passage and answer,while the second segment is the generated question.
问题生成任务使用序列到序列被制定。第一个段落是输入文章和问题的连接,然而第二个段落是生成的问题。
We fine-tune UNILM on the training set for 10 epochs.We set batch size to 32,masking probability to 0.7,and learning rate to 2e-5.The rate of label smoothing is 0.1.The other hyper-parameters are the same as pre-training.
我们使用10个迭代次数微调UNILM。我们将批次大小设置为32,将概率设置为0.7,学习率设置为2e-5。标签平滑率被设置为0.1。其他的超参数和预训练的保持一致。
During decoding,we truncate the input to 464 tokens by selecting a passage chunk which contains the answer.The evaluation metrics BLEU-4,METEOR,and ROUGE-L are computed by the same scripts as in [12].
在解码的过程中,我们将输入切成464个标志通过选择一个文章的板块包括回答。测试基准BLEU-4,METEOR和ROUGE-L通过[12]中的相同的脚本被计算出来。
Generated Questions Improve QA
生成问题提升问答
The question generation model can automatically harvest a large number of question-passage-answer examples from a text corpus.We show that the augmented data generated by question generation improves the question answering model.
问题生成模型可以自动地从文本语料之中收获一大堆问题-文章-答案例子。我们显示从问题生成中获得的增强数据提升了问答模型的效率。
We generate five million answerable examples,and four million unanswerable examples by modifying the answerable ones.We fine-tune our question answering model on the generated data for one epoch.The the model is fine-tuned on the SQuAD 2.0 data for two more epochs.
我们通过修改回答部分生成五百万回答的例子和四百万答案的例子。我们在我们的问答模型中以一个单元微调生成的数据。模型在SQuAD 2.0数据上通过两个或者更多的迭代次数进行微调。
As shown in Table 9,the augmented data generated by UNILM improves question answering model introduced in Section 3.2.
正如表9中显示的那样,由UNILM产生的增强数据提升了在篇章3.2中介绍的问答模型的效果。
Note that we use bidirectional masked language modeling as an auxiliary task for both the generated and SQuAD 2.0 datasets during fine-tuning,which brings 2.3 absolute improvement compared to directly using automatically generated examples.
注意到我们使用双向掩码模型对于生成和SQuAD 2.0数据集在微调的时候作为辅助任务,与直接使用自动生成例子相比带来了百分之2.3的绝对提升。
A possible reason is that the auxiliary task alleviates catastrophic forgetting[49] when fine-tuning on augmented data.
一个可能的原因在于在增强数据上进行微调的时候,辅助任务减轻了灾难性遗忘。
3.4 Response Generation
回答生成
We evaluate UNILM on the document-grounded dialog response generation task[30,15].
我们在文本对话回答生成任务中测试UNILM。
Given a multi-turn conversation history and a web document as the knowledge source,the system needs to generate a natural language response that is both conversationally appropriate and reflective of the contents of the web document.
给出一个多轮对话历史和一个网络文档作为知识来源,系统需要生成一个既是符合对话的又是网络文档内容的反映的自然语言回应。
We fine-tune the UNILM to the task as a sequence-to-sequence model.The first segment(input sequence)is the concatenation of the web document and the conversation history.The second segment(output sequence) is the response.
我们微调UNILM任务作为一个序列到序列的模型。第一个段落(输入序列)是网络文档和回答历史的连接。第二个段落(输出序列)是相应的回答。
We fine-tune UNILM on the DSTC7 training data for 20 epochs,with batch size 64.The masking probability is set to 0.5.The maximum length is 512.
我们在DSTC7训练数据上微调UNILM20个批次,使用批次大小64。掩盖概率被设置为0.5,最大的长度为512。
During decoding,we use beam search with size of 10.The maximum length of generated response is set to 40.As shown in Table 10,UNILM outperforms the best system[41] in the DSTC7 shared task [14] across all evalution metrics.
在解码的过程中,我们把束搜索的大小设置为10,生成回应的最大长度设置为40。正如Table10中显示的那样,UNILM比DSTC7中最好系统的共享任务的所有评测指标都要好。
3.5 GLUE Benchmark
We evaluate UNILM on the General Language Understanding Evaluation(GLUE) benchmark[45].
我们测试UNILM在普通语言理解测试(GLUE)基准中。
GLUE is a collection of nine language understanding task,including question answering[33],linguistic acceptability[46],sentiment analysis[38],text similarity[5],paraphrase detection[10],and natural language inference(NLI)[7,2,17,3,24,47].
GLUE是一个九个自然语言理解任务的集合,包括问答,语言接受度,情感分析,文本相似度,段落检测和自然语言推理(NLI)。

Our model is fine-tuned as a bidirectional LM.We use Adamax[21] as our optimizer with a learning rate of 5e-5 and a batch size of 32.The maximum number of epochs is set to 5.
我们模型被使用双向语言模型进行微调。我们使用Adamax作为我们的优化器使用学习率5e-5和一个32的批次,最大数量的批次被设定为5.
A linear learning rate decay schedule with warmup of 0.1 is used.The dropout rate of the last linear projection for each task is set to 0.1,except 0.3 for MNLI and 0.05 for CoLA/SST-2。
一个线性学习率调度计划使用0.1的热身被使用。对于每个任务的最后一个线性映射的dropout率被设置为0.1,除去MNLI的0.3和CoLA/SST-2的0.05。
To avoid the gradient explosion issue,the gradient norm was clipped within 1.We truncated the tokens no longer than 512.
为了避免梯度爆炸问题,梯度正则化需要被修建到1之内。我们将标志切到不大于512.
Table 11 presents the GLUE test results obtained from the benchmark evaluation server.The results show that UNILM obtains comparable performance on the GLUE tasks in comparison with BERT_LARGE.
Table11展示了从基准测试服务中获得的GLUE测试的结果。结果显示与BERT_LARGE相比,UNILM在GLUE任务之中获得了无与伦比的表现。
4 Conclusion and Future Work
We propose a unified pre-training model,UNILM,which is jointly optimized for several LM objectives with shared parameters.
我们提出了一个联合预训练模型,UNILM,对于几个语言目标使用共享参数进行共同的优化。
The unification of bidirectional,unidirectional,and sequence-to-sequence LMs enables us to straightforwardly fine-tune the pre-trained UNILM for both NLU and NLG tasks.
单向,双向和序列到序列的语言模型使得我们能够直接微调预训练的UNILM进行语言理解任务和语言生成任务。
Experimental results demonstrate that our model compares favorably with BERT on the GLUE benchmark and two question answering datasets.
实验结果显示我们的模型比BERT在GLUE基准之上以及两个问答数据集上的效果要好。
In addition,UNILM outperforms previous state-of-the-art models on five NLG datasets:CNN/DailyMail and Gigaword abstractive summarization,SQuAD question generation,CoQA generation question anwering,and DSTC7 dialog response generation.
除此之外,UNILM在五个自然语言生成数据集上的五个最好的模型要好:CNN/DailyMail和Gigaword抽象总结,SQuAD问题生成,CoQA生成问题的回答和DSTC7对话回应的生成。
The work can be advanced from the following perspectives:
还可以从以下角度提升工作的效果:
1.We will push the limit of the current method by training more epochs and larger models on web-scale text corpora.
我们将推动目前方法的极限通过在网络规模的文本语料之中训练更多的迭代次数和更大的模型。
At the same time,we will also conduct more experiments on end applications as well as ablation experiments to investigate the model capability and the benefits of pre-training multiple language modeling tasks with the same network.
同时,我们将同样执行更多的实验在终端应用和消融实验中去调查模型的能力以及使用相同的网络去预训练多个语言模型的任务。
2.We are focusing on monolingual NLP tasks in our current experiments.We are also interested in extending UNILM to support cross-lingual tasks.
我们关注于在我们当前实验中的多种语言任务。我们还对延伸UNILM去支持交叉语言任务感兴趣。
3.We will conduct multi-task fine-tuning on both NLU and NLG tasks,which is a natural extension on Multi-Task Deep Neural Network(MT-DNN)(26).
我们还在NLU和NLG任务之上进行多种任务的微调,是一个在深度神经网络上的多种任务的自然延伸。
Acknowledgement
We would like to acknowledgement Shiyue Zhang for the help discussions about the question generation experiments.
我们想要承认zhagnshiyue对于问题生成实验的讨论
全部校对完毕

你可能感兴趣的:(论文翻译)