Paper: https://arxiv.org/abs/1905.03197
Code: https://github.com/microsoft/unilm
This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding (NLU) and natural language generation (NLG) tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction.
The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on.
Pre-trained LMs learn contextualized text representations by predicting words based on their context using large amounts of text data, and can be fine-tuned to adapt to downstream tasks. Different prediction tasks and training objectives have been used for pre-training LMs of different types, as shown in Table 1.
UNILM is a multi-layer Transformer network, jointly pre-trained on large amounts of text and optimized for three types of unsupervised language modeling objectives, as shown in Table 2.
Unlike BERT, which is used mainly for NLU tasks, UNILM can be configured, using different self-attention masks (Section 2), to aggregate context for different types of language models, and thus can be used for both NLU and NLG tasks.
The proposed UNILM has three main advantages.
First, the unified pre-training procedure leads to a single Transformer LM that uses shared parameters and architecture for the different types of LMs, alleviating the need to separately train and host multiple LMs.
Second, the parameter sharing makes the learned text representations more general, because they are jointly optimized for different language modeling objectives in which context is utilized in different ways, mitigating overfitting to any single LM task.
Third, in addition to its application to NLU tasks, the use of UNILM as a sequence-to-sequence LM (Section 2.3) makes it a natural choice for NLG tasks, such as abstractive summarization and question generation.
As shown in Figure 1, the pre-training optimizes the shared Transformer [43] network with respect to several unsupervised language modeling objectives, namely unidirectional LM, bidirectional LM, and sequence-to-sequence LM. In order to control the access to the context of the word token to be predicted, we employ different masks for self-attention.
Texts are tokenized into subword units by WordPiece [48]. For each input token, its vector representation is computed by summing the corresponding token embedding, position embedding, and segment embedding.
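A minimal sketch of this input representation, assuming the vocabulary size (28,996), maximum sequence length (512), and hidden size (1,024) reported later in these notes; the number of segment types, LayerNorm, and dropout follow the usual BERT-style recipe and are assumptions here:

```python
import torch
import torch.nn as nn

class UniLMEmbeddings(nn.Module):
    """Input representation: sum of token, position, and segment embeddings (sketch)."""

    def __init__(self, vocab_size=28996, max_len=512, num_segments=2, hidden=1024):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(num_segments, hidden)   # assumed number of segment types
        self.norm = nn.LayerNorm(hidden)                     # BERT-style post-sum normalization (assumption)
        self.drop = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
        return self.drop(self.norm(x))
```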
The output of the previous Transformer layer is projected into queries, keys, and values with different parameter matrices W, and the mask matrix M determines whether a pair of tokens can attend to each other. We use different mask matrices M to control what context a token can attend to when computing its contextualized representation, as illustrated in Figure 1.
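A sketch of this masked self-attention, where M is added to the attention scores before the softmax; the tensor shapes are hypothetical and the multi-head projections that produce q, k, and v are omitted:

```python
import math
import torch

def masked_self_attention(q, k, v, mask_m):
    """Scaled dot-product attention with an additive mask M (sketch).

    q, k, v: (batch, heads, seq_len, head_dim) projections of the previous layer's output.
    mask_m:  (seq_len, seq_len) with 0 where attention is allowed and -inf where it is blocked.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + mask_m                      # blocked positions get ~zero weight after softmax
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```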
We pre-train UNILM using four cloze tasks designed for the different language modeling objectives. In a cloze task, we randomly choose some WordPiece tokens in the input and replace them with the special token [MASK]. Then, we feed their corresponding output vectors computed by the Transformer network into a softmax classifier to predict the masked token. The parameters of UNILM are learned to minimize the cross-entropy loss computed using the predicted tokens and the original tokens. It is worth noting that the use of cloze tasks makes it possible to use the same training procedure for all LMs, unidirectional and bidirectional alike.
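A simplified sketch of one cloze-task loss computation, assuming a hypothetical `model` that returns per-token vocabulary logits and an assumed id for [MASK]; it omits the 80%/10%/10% replacement rule and n-gram masking described in the pre-training setup below, and does not protect special tokens from masking:

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # assumed id of the [MASK] token

def cloze_loss(model, token_ids, mask_prob=0.15):
    """Randomly replace tokens with [MASK] and compute cross-entropy
    only at the masked positions (simplified sketch)."""
    masked = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    inputs = token_ids.clone()
    inputs[masked] = MASK_ID
    labels = token_ids.clone()
    labels[~masked] = -100                       # ignore unmasked positions in the loss
    logits = model(inputs)                       # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
```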
Unidirectional LM: Take the left-to-right LM as an example. The representation of each token encodes only the leftward context tokens and itself. For instance, to predict the masked token in “x1 x2 [MASK] x4”, only tokens x1, x2 and the token itself can be used. This is done by using a triangular matrix for the self-attention mask M (as in Equation (2)), where the upper triangular part of the self-attention mask is set to −∞ and the other elements to 0, as shown in Figure 1.
Bidirectional LM: Following [9], a bidirectional LM allows all tokens to attend to each other in prediction. It encodes contextual information from both directions and can generate better contextual representations of text than its unidirectional counterpart.
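A sketch of the mask matrices M for these settings, following the descriptions above; the sequence-to-sequence mask follows the paper's design, which is only mentioned in passing in this excerpt (source tokens attend to the whole source segment, target tokens attend to the source and to their leftward target tokens):

```python
import torch

NEG_INF = float("-inf")

def left_to_right_mask(seq_len):
    """Unidirectional LM: upper triangle set to -inf, so each token
    attends only to itself and the tokens on its left."""
    m = torch.zeros(seq_len, seq_len)
    m[torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()] = NEG_INF
    return m

def bidirectional_mask(seq_len):
    """Bidirectional LM: all zeros, every token attends to every token."""
    return torch.zeros(seq_len, seq_len)

def seq2seq_mask(src_len, tgt_len):
    """Sequence-to-sequence LM (per the paper's design): the source attends
    bidirectionally within itself; the target attends to the source and
    causally to its leftward target tokens."""
    total = src_len + tgt_len
    m = torch.full((total, total), NEG_INF)
    m[:, :src_len] = 0.0                              # every token can see the source segment
    tgt_block = torch.zeros(tgt_len, tgt_len)
    tgt_block[torch.triu(torch.ones(tgt_len, tgt_len), diagonal=1).bool()] = NEG_INF
    m[src_len:, src_len:] = tgt_block                 # causal mask within the target segment
    return m
```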
Next Sentence Prediction: For the bidirectional LM, we also include the next sentence prediction task for pre-training, as in [9].
The overall training objective is the sum of the different types of LM objectives described above. Specifically, within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both the left-to-right and right-to-left LM objectives are sampled with a rate of 1/6.
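A sketch of how the per-batch objective could be sampled according to these rates (the actual scheduling in the released code may differ):

```python
import random

def sample_objective():
    """Pick the LM objective for one training batch:
    1/3 bidirectional, 1/3 sequence-to-sequence,
    1/6 left-to-right, 1/6 right-to-left."""
    r = random.random()
    if r < 1 / 3:
        return "bidirectional"
    elif r < 2 / 3:
        return "seq2seq"
    elif r < 5 / 6:
        return "left_to_right"
    else:
        return "right_to_left"
```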
Specifically, we use a 24-layer Transformer with 1,024 hidden size and 16 attention heads, which contains about 340M parameters. UNILM is initialized with BERT-large, and then pre-trained using English Wikipedia and BookCorpus [53], which have been processed in the same way as [9]. The vocabulary size is 28,996. The maximum length of the input sequence is 512. The token masking probability is 15%. Among masked positions, 80% of the time we replace the token with [MASK], 10% of the time with a random token, and we keep the original token for the rest. In addition, 80% of the time we randomly mask one token at a time, and 20% of the time we mask a bigram or a trigram. Adam [22] with β1 = 0.9, β2 = 0.999 is used for optimization. The learning rate is 3e-5, with linear warmup over the first 40,000 steps and linear decay. The dropout rate is 0.1. The weight decay is 0.01. The batch size is 330. The pre-training procedure runs for about 770,000 steps. It takes about 7 hours per 10,000 steps using 8 Nvidia Tesla V100 32GB GPU cards with mixed precision training.
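A sketch of the reported optimization setup; using PyTorch's AdamW for the decoupled weight decay and decaying linearly to zero over the 770,000 steps are both assumptions (the text only states Adam with β1 = 0.9, β2 = 0.999, weight decay 0.01, and linear warmup/decay), and `model` is assumed to be the 24-layer Transformer described above:

```python
import torch

TOTAL_STEPS = 770_000
WARMUP_STEPS = 40_000

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5,
                              betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    """Linear warmup over the first 40,000 steps, then linear decay (assumed to zero)."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```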
UNILM surpasses the results of previous models on multiple tasks.
We propose a unified pre-training model, UNILM, which is jointly optimized for several LM objectives with shared parameters.
The unification of bidirectional, unidirectional, and sequence-to-sequence LMs enables us to straightforwardly fine-tune the pre-trained UNILM for both NLU and NLG tasks.