Facebook paper: massively multilingual sentence embeddings for zero-shot cross-lingual transfer

Mikel Artetxe
Holger Schwenk (Facebook)
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Abstract

This paper introduces a method for learning joint multilingual sentence representations for 93 languages, belonging to more than 30 language families and written in 28 different scripts.
The system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, coupled with an auxiliary decoder, and is trained on parallel corpora.
This allows a classifier to be trained on top of the resulting sentence embeddings using English annotated data only and then transferred to any of the 93 languages without any modification.

The system consists of two parts, an encoder and a decoder. The encoder is a language-agnostic BiLSTM that builds the sentence embeddings; these embeddings are then passed through a linear transformation to initialize an LSTM decoder. For a single encoder-decoder pair to handle every language, there is one requirement: the encoder should ideally not know which language the input is in, so that it learns language-independent representations. To that end, a joint byte-pair encoding (BPE) vocabulary is learned on the concatenation of all input corpora.
The decoder has exactly the opposite requirement: it must know which language to generate in order to produce the right output. Facebook therefore gives the decoder an extra input, a language ID embedding (Lid in the figure above).
To train the system, Facebook used 16 NVIDIA V100 GPUs with a batch size of 128,000 tokens, training for 17 epochs over about 5 days.
Evaluated on the cross-lingual natural language inference dataset (XNLI), which covers 14 non-English languages, the zero-shot transfer results of this multilingual sentence embedding (the "Proposed method" in the figure above) set a new state of the art on 13 of the 14 languages, Spanish being the only exception. Facebook also evaluated the system on other tasks, including classification on the MLDoc dataset and bitext mining on BUCC. In addition, they built a test set of aligned sentences in 122 languages based on the Tatoeba corpus, a collection of example sentences translated by language learners, to demonstrate the method's ability at multilingual similarity search.
http://www.sohu.com/a/2854308...
BPE vocabulary (Byte Pair Encoding: byte pair encoding is a simple data compression technique that iteratively replaces the most frequent pair of symbols in a sequence with a single new symbol that does not occur in the data.)
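A minimal sketch of the BPE learning loop described above, in pure Python over a hypothetical toy vocabulary (the real system learns 50k merge operations on the concatenation of all training corpora; a production implementation would use boundary-aware regex replacement rather than plain string replace):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a word-frequency vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol (simplified)."""
    merged, new_symbol = ' '.join(pair), ''.join(pair)
    return {word.replace(merged, new_symbol): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10   # the paper uses 50k merge operations
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```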

A new state-of-the-art on zero-shot cross-lingual natural language inference for all but one of the 14 languages in the XNLI dataset.

The Cross-lingual Natural Language Inference ( XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark.
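To make the zero-shot setup concrete, here is a hedged PyTorch sketch of a classifier trained on top of frozen English sentence embeddings and then applied unchanged to the other languages. The feature combination [p, h, |p−h|, p·h] is a common choice for NLI on top of sentence encoders; the tensors and layer sizes below are placeholders, not the paper's exact classifier.

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Feed-forward classifier over a pair of (frozen) sentence embeddings."""
    def __init__(self, embed_dim=1024, hidden_dim=512, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, p, h):
        # Combine premise and hypothesis embeddings: [p, h, |p - h|, p * h]
        features = torch.cat([p, h, (p - h).abs(), p * h], dim=-1)
        return self.mlp(features)

# Zero-shot recipe (random tensors standing in for multilingual sentence embeddings):
# 1) embed English premise/hypothesis pairs with the frozen shared encoder,
# 2) train the classifier on English labels only,
# 3) embed e.g. Swahili test pairs with the *same* encoder and reuse the classifier unchanged.
clf = NLIClassifier()
p_en, h_en = torch.randn(32, 1024), torch.randn(32, 1024)   # English training batch
labels = torch.randint(0, 3, (32,))                         # entailment / contradiction / neutral
loss = nn.CrossEntropyLoss()(clf(p_en, h_en), labels)
loss.backward()
```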

It also achieves very competitive results in cross-lingual document classification (MLDoc dataset).
Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs.
We also build a new test set of aligned sentences in 122 languages based on the Tatoeba corpus and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
Our PyTorch implementation, the pre-trained encoder and the multilingual test set will be freely available.

Natural language inference
Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
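An illustrative example (invented here, not taken from any of the corpora below):

```python
# One NLI instance: the label describes the relation of the hypothesis to the premise.
example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "label": "entailment",   # would be "contradiction" for "Nobody is playing music."
}
```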

MultiNLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.
SciTail
The SciTail entailment dataset consists of 27k sentence pairs. In contrast to SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.

State-of-the-art results can be seen on the SNLI website.

SNLI:The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.

Introduction

Advanced techniques in NLP are known to be particularly data hungry, which limits their applicability in many practical scenarios.
An increasingly popular approach to alleviate this issue is to first learn general language representations on unlabeled data, which are then integrated into task-specific downstream systems.
This approach was first popularized by word embeddings, but has recently been superseded by sentence-level representations.
Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiting their potential performance for low-resource languages.
Universal language-agnostic sentence embeddings are vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task.

Because annotated corpora are scarce, the field has turned to unsupervised representation learning, first with word embeddings and now increasingly with sentence embeddings; there is also growing interest in cross-lingual and multi-task learning.

  • The hope that languages with limited resources benefit from joint training over many languages;
  • the desire to perform zero-shot transfer of an NLP model from one language (e.g. English) to another;
  • and the possibility of handling code-switching.

We achieve this by using a single encoder that can handle multiple languages, so that semantically similar sentences in different languages are close in the resulting embedding space.

Code-switching
Code-switching is a common linguistic phenomenon in which a speaker alternates between two or more languages (or language varieties) within a single conversation. It is one of many language-contact phenomena and occurs frequently in the everyday speech of multilinguals; besides conversation, it also appears in writing. Any discussion of code-switching necessarily involves bilingualism, and code-switched data shows the mutual influence of the languages involved at the phonological, syntactic and other levels.

Contributions

  • We learn one shared encoder that can handle 93 different languages. All languages are jointly embedded in a shared space, in contrast to most other works, which usually consider separate English/foreign alignments.
  • We evaluate on cross-lingual 1) natural language inference (XNLI dataset), 2) classification (MLDoc dataset), 3) bitext mining (BUCC dataset) and 4) multilingual similarity search (Tatoeba dataset).

That is: inference, classification, bitext mining, and multilingual similarity search.

  • We define a new test set based on the freely available Tatoeba corpus and provide baseline results for 122 languages. We report accuracy for multilingual similarity search on this test set, but the corpus could also be used for MT evaluation.
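A small sketch of how similarity-search accuracy on such an aligned test set can be computed (cosine similarity plus nearest-neighbour retrieval; the embedding matrices below are random placeholders standing in for real encoder outputs):

```python
import numpy as np

def similarity_search_accuracy(src_emb, tgt_emb):
    """src_emb[i] and tgt_emb[i] are embeddings of aligned sentences.

    For each source sentence, retrieve the nearest target sentence by cosine
    similarity and count how often the retrieved index is the correct one.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                        # cosine similarity matrix
    nearest = sim.argmax(axis=1)             # index of the most similar target sentence
    return (nearest == np.arange(len(src))).mean()

# Hypothetical 1000 aligned sentence pairs embedded into a 1024-d space.
xx = np.random.randn(1000, 1024)
yy = xx + 0.1 * np.random.randn(1000, 1024)  # stand-in for "translations"
print(similarity_search_accuracy(xx, yy))
```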
Tatoeba
English-German Sentence Translation Database (Manythings/Tatoeba). The Tatoeba Project is also run by volunteers and is set to make the most bilingual sentence translations available between many different languages. Manythings.org compiles the data and makes it accessible. http://www.manythings.org/cor...
The Bitext API is another deep language analysis tool, providing data that is easy to export into various data-management tools. The platform's products can be used for chatbots and intelligent assistants, customer service and sentiment analysis, as well as several other core NLP tasks. The API focuses on semantics, syntax, lexicons and corpora, and supports more than 80 languages. The company also claims insight accuracy of up to 90% for automated customer-feedback analysis.
Documentation: https://docs.api.bitext.com/
Demo: http://parser.bitext.com/

Related Work

  • Word Embeddings (Distributed Representations of Words and Phrases and their Compositionality)
  • Glove GloVe: Global Vectors for Word Representation

There has been increasing interest in learning continuous vector representations of longer linguistic units such as sentences.
These sentence embeddings are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora.

Supplementary background

1. Text representation and a comparison of word vectors
1) What methods exist for representing text?
Below is a summary of text representation, i.e. how a piece of text can be expressed in mathematical form:
bag-of-words based on one-hot, tf-idf, TextRank, etc.;
topic models: LSA (SVD), pLSA, LDA;
static word-vector representations: word2vec, fastText, GloVe;
contextual (dynamic) word-vector representations: ELMo, GPT, BERT.
2) How should word vectors be understood from the language-model perspective? What is the distributional hypothesis?
The four categories above are the most commonly used text representations in NLP. Text is made up of words, and one-hot can be regarded as the simplest word vector, but it suffers from the curse of dimensionality and the semantic gap. Building a co-occurrence matrix and factorizing it with SVD also yields word vectors, but the computation is expensive. Early word-vector research usually grew out of language models such as NNLM and RNNLM, whose main goal was language modeling; the word vectors were merely a by-product.

The distributional hypothesis can be stated in one sentence: words that occur in the same contexts have similar meanings. This gave rise to word2vec and fastText. Although these models are essentially still language models, their objective is not language modeling itself but the word vectors, and all of their optimizations aim at obtaining word vectors faster and better. GloVe instead builds word vectors from global corpus statistics combined with local context, combining the advantages of LSA and word2vec.
3) What problems do traditional word vectors have, how are they addressed, and what are the characteristics of each type?
The word vectors produced by the methods above are static representations and cannot handle polysemy (e.g. the Chinese word "川普", which can mean either Trump or Sichuan-accented Mandarin). This motivates dynamic, language-model-based representations: ELMo, GPT and BERT.

Characteristics of each type of word vector:
(1) One-hot representation: curse of dimensionality, semantic gap;
(2) Distributed representations:

Matrix factorization (LSA): uses global corpus statistics, but SVD is computationally expensive;
NNLM/RNNLM-based word vectors: the vectors are only a by-product and training is inefficient;
word2vec, fastText: efficient to optimize, but based on local context windows;
GloVe: based on global corpus statistics, combining the advantages of LSA and word2vec;
ELMo, GPT, BERT: contextual (dynamic) features.

5) What are the differences between word2vec and fastText? (word2vec vs fastText)
1) Both can learn word vectors in an unsupervised way; fastText additionally considers subword information when training word vectors;
2) fastText can also be trained in a supervised way for text classification. Its main characteristics:
the architecture is similar to CBOW, but the training target is the human-annotated class label;
it uses hierarchical softmax, building a Huffman tree over the output labels so that frequent labels get shorter search paths;
it introduces n-grams to capture word-order features;
it introduces subwords to handle long words and out-of-vocabulary words (see the sketch below).
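A toy illustration of the subword idea mentioned in the list above: fastText represents each word by the bag of its character n-grams (with "<" and ">" as boundary markers) plus the word itself. This is a plain-Python sketch, not the library's API:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, fastText-style, with boundary markers."""
    marked = f"<{word}>"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    return grams + [marked]   # the full word is kept as an extra feature

print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', '<where>']
```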

6) What are the differences between GloVe, word2vec and LSA? (word2vec vs GloVe vs LSA)
1) GloVe vs LSA
LSA (Latent Semantic Analysis) builds word vectors from a co-occurrence matrix, essentially factorizing global corpus statistics with SVD, which is computationally expensive;
GloVe can be seen as an efficient matrix-factorization alternative to LSA, optimizing a weighted least-squares loss with AdaGrad;
2) word2vec vs GloVe
word2vec is trained on local context windows sliding over the corpus; GloVe's sliding window is only used to build the global co-occurrence matrix, so GloVe must count co-occurrence statistics in advance. As a result, word2vec can be trained online, while GloVe requires fixed corpus statistics.
word2vec is unsupervised in the sense that it needs no manual labels; GloVe is usually also considered unsupervised, but it does in fact have a label: the co-occurrence count $\log(X_{ij})$.
The word2vec loss is essentially a weighted cross-entropy with fixed weights; the GloVe loss is a weighted least-squares loss whose weighting function can be reshaped.
Overall, GloVe can be regarded as a global version of word2vec with a different objective and weighting function (see the formula below).
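For reference, the GloVe objective referred to above is the weighted least-squares loss

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases} (x/x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & \text{otherwise} \end{cases}
$$

where $X_{ij}$ is the co-occurrence count, $w_i$ and $\tilde{w}_j$ are word and context vectors with biases $b_i$, $\tilde{b}_j$, and the weighting function $f$ (with $x_{\max}=100$, $\alpha=0.75$ in the original GloVe paper) down-weights rare co-occurrences.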

What are the differences between ELMo, GPT and BERT? (ELMo vs GPT vs BERT)
The word vectors above are all static and cannot handle polysemy. ELMo, GPT and BERT all produce dynamic, language-model-based word vectors. They can be compared along the following dimensions:
(1) Feature extractor: ELMo uses LSTMs, while GPT and BERT use Transformers. Many tasks have shown the Transformer to be a stronger feature extractor than the LSTM. ELMo uses one static embedding layer plus two LSTM layers, so its multi-layer extraction capacity is limited, whereas the Transformer layers in GPT and BERT can be stacked much deeper and parallelize well.
(2) Unidirectional vs bidirectional language models:
GPT uses a unidirectional language model, while ELMo and BERT use bidirectional ones. However, ELMo is really a concatenation of two unidirectional language models (running in opposite directions), and this way of fusing features is weaker than BERT's single, jointly bidirectional model.
GPT and BERT both use the Transformer, which is originally an encoder-decoder architecture: GPT's unidirectional language model uses the decoder side, which only ever sees an incomplete prefix of the sentence, while BERT's bidirectional language model uses the encoder side, which sees the complete sentence.

A deeper look at word2vec

See the Zhihu article: "A comparison of word vectors in NLP: word2vec / GloVe / fastText / ELMo / GPT / BERT"

Motivation

  • The skip-thought model (2015) couples the encoder with an auxiliary decoder and trains the entire system end-to-end to predict the surrounding sentences over a large collection of books.
  • It was later shown (2017) that more competitive results could be obtained by training the encoder over labeled Natural Language Inference (NLI) data.
  • This was recently extended to multitask learning (2018), combining different training objectives such as skip-thought, NLI and machine translation.
We introduce auxiliary decoders: separate decoder models that are only used to provide a learning signal to the encoders.
Related reference: Hierarchical Autoregressive Image Models with Auxiliary Decoders.

While the previous methods consider a single language at a time, multilingual representations have attracted a lot of attention in recent times.

  • Most research focuses on cross-lingual word embeddings (2017), which are commonly learned jointly from parallel corpora (2015).
  • An alternative approach that is becoming increasingly popular is to train word embeddings independently for each language over monolingual corpora, and then map them to a shared space based on a bilingual dictionary (2013, 2018).
  • Cross-lingual word embeddings are often used to build bag-of-words representations of longer linguistic units by taking their centroid (2012).

While this approach has the advantage of requiring a weak (or even no) cross-lingual signal, it has been shown that the resulting sentence embeddings work rather poorly in practical cross-lingual transfer settings (2018).

  • A more competitive approach that we follow here is to use a sequence-to-sequence encoder-decoder architecture.

The full system is trained end-to-end on parallel corpora akin to neural machine translation: the encoder maps the source sequence into a fixed-length vector representation, which is used by the decoder to create the target sequence.
The decoder is then discarded, and the encoder is kept to embed sentences in any of the training languages.
While some proposals use a separate encoder for each language (2018), sharing a single encoder for all languages also gives strong results.

  • Nevertheless, most existing work is either limited to a few rather close languages or, more commonly, considers pairwise joint embeddings with English and one foreign language only.
  • To the best of our knowledge, all existing work on learning multilingual representations for a large number of languages is limited to word embeddings, ours being the first paper exploring massively multilingual sentence representations.
  • While all the previous approaches learn a fixed-length representation for each sentence, a recent line of research has obtained very strong results using variable-length representations instead, consisting of contextualized embeddings of the words in the sentence.

For that purpose, these methods train either an RNN or a self-attentional encoder over unannotated corpora using some form of language modeling. A classifier can then be learned on top of the resulting encoder,
which is commonly fine-tuned further during this supervised training.
Despite the strong performance of these approaches in monolingual settings, we argue that fixed-length approaches provide a more generic, flexible and compatible representation form for our multilingual scenario,
and our model indeed outperforms the multilingual BERT model in zero-shot transfer.

Proposed method

The authors use a single, language-agnostic BiLSTM encoder to build the sentence embeddings, together with an auxiliary decoder, trained jointly on parallel corpora.

How LASER works
The core idea of LASER is to encode sentences from all languages with a shared multi-layer BiLSTM, take its output states, and max-pool them into a fixed-dimensional vector. This vector is then used to decode, i.e. translate, the sentence into the target languages during training. The paper reports that a single target language does not work well, while translating into two target languages is enough; not every sentence has to be translated into both targets, as long as most of the corpus is covered. For downstream applications the encoder is kept and the decoder is discarded.
They also find that low-resource languages benefit from being trained jointly with high-resource ones.
Zhihu: a comparison of Google BERT and Facebook LASER


As can be seen, sentence embeddings are obtained by applying a max-pooling operation over the output of a BiLSTM encoder.
These sentence embeddings are used to initialize the decoder LSTM through a linear transformation, and are also concatenated to its input embeddings at every time step.
Note that there is no other connection between the encoder and the decoder, as we want all relevant information of the input sequence to be captured by the sentence embedding.
For that purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatenation of all training corpora.
This way, the encoder has no explicit signal on what the input language is, encouraging it to learn language-independent representations.
In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step.

  • In this paper, we limit our study to a stacked BiLSTM with 1 to 5 layers, each 512-dimensional.
  • The resulting sentence representations (after concatenating both directions) are 1024-dimensional.
  • The decoder always has one layer of dimension 2048. The input embedding size is set to 320, while the language ID embedding has 32 dimensions. (A rough sketch of the encoder follows below.)
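A hedged PyTorch sketch of the encoder just described (stacked BiLSTM over shared BPE tokens, max-pooled over time into a 1024-dimensional sentence embedding). The class and variable names, and the simplified padding handling, are mine rather than the official fairseq/LASER implementation:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Language-agnostic encoder: shared BPE embeddings -> stacked BiLSTM -> max-pooling."""
    def __init__(self, vocab_size=50_000, embed_dim=320, hidden_dim=512, num_layers=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)

    def forward(self, tokens, mask):
        # tokens: (batch, seq_len) BPE ids shared across all languages
        # mask:   (batch, seq_len) 1 for real tokens, 0 for padding
        out, _ = self.bilstm(self.embed(tokens))              # (batch, seq_len, 2 * hidden)
        out = out.masked_fill(mask.unsqueeze(-1) == 0, float('-inf'))
        return out.max(dim=1).values                          # (batch, 1024) sentence embedding

encoder = SentenceEncoder()
tokens = torch.randint(1, 50_000, (2, 7))    # a toy batch of BPE ids
mask = torch.ones(2, 7, dtype=torch.long)
emb = encoder(tokens, mask)                  # used to initialize the decoder via a linear layer;
print(emb.shape)                             # the decoder also gets a 32-d language ID embedding
```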

In preceding work, each sentence at the input was jointly translated into all other languages. While this approach was shown to learn high-quality representations,
it poses two obvious drawbacks when trying to scale to a large number of languages.

  • First, it requires an N-way parallel corpus, which is difficult to obtain for all languages.
  • Second, it has a quadratic cost with respect to the number of languages, making training prohibitively slow as the number of languages is increased.

In our preliminary experiments, we observed that similar results can be obtained by using fewer target languages; two seem to be enough. (Note that, if we had a single target language, the only way to train the encoder for that language would be auto-encoding, which we observe to work poorly. Having two target languages avoids this problem.)

At the same time, we relax the requirement for N-way parallel corpora by considering independent alignments with the two target languages, i.e. we do not require each source sentence to be translated into both target languages.
Training minimizes the cross-entropy loss on the training corpus, alternating over all combinations of the languages involved.
For that purpose, we use Adam with a constant learning rate of 0.001 and dropout set to 0.1, and train for a fixed number of epochs (implementation based on fairseq),
with a total batch size of 128,000 tokens. Unless otherwise specified, we train our model for 17 epochs, which takes about 5 days. Stopping training early decreases the overall performance only slightly.
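A rough, self-contained sketch of the optimization setup described above (Adam with a constant learning rate of 0.001, dropout 0.1, token-level cross-entropy). The tiny modules and random batches are placeholders; the real system is the encoder-decoder sketched earlier, implemented on top of fairseq:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the full seq2seq model.
embed = nn.Embedding(50_000, 320, padding_idx=0)      # shared BPE embeddings
lstm = nn.LSTM(320, 2048, batch_first=True)           # stand-in for the decoder LSTM
proj = nn.Linear(2048, 50_000)                        # output projection onto the BPE vocabulary
dropout = nn.Dropout(p=0.1)                           # dropout 0.1, as in the paper

params = list(embed.parameters()) + list(lstm.parameters()) + list(proj.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)         # constant learning rate 0.001
criterion = nn.CrossEntropyLoss(ignore_index=0)       # token-level cross-entropy, padding ignored

# One illustrative update on a dummy batch (real batches hold ~128k tokens and
# alternate over all source/target language combinations).
tokens = torch.randint(1, 50_000, (4, 12))
targets = torch.randint(1, 50_000, (4, 12))           # shifted target tokens in practice
hidden, _ = lstm(dropout(embed(tokens)))
loss = criterion(proj(hidden).reshape(-1, 50_000), targets.reshape(-1))
loss.backward()
optimizer.step()
```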
