The Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce the Interactive Inference Network (IIN), a novel class of neural network architectures that achieves high-level understanding of a sentence pair by hierarchically extracting semantic features from the interaction space. We show that an interaction tensor (attention weight) contains the semantic information needed to solve natural language inference, and that a denser interaction tensor contains richer semantic information. One instance of this architecture, the Densely Interactive Inference Network (DIIN), demonstrates state-of-the-art performance on large-scale NLI corpora and a large-scale NLI-like corpus. It is noteworthy that DIIN achieves a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI; Williams et al. 2017) dataset with respect to the strongest published system.
The Natural Language Inference (NLI, also known as recognizing textual entailment, or RTE) task requires one to determine whether the logical relationship between two sentences is entailment (if the premise is true, then the hypothesis must be true), contradiction (if the premise is true, then the hypothesis must be false) or neutral (neither entailment nor contradiction). NLI is known as a fundamental yet challenging task for natural language understanding (Williams et al., 2017), not only because it requires one to identify language patterns, but also because it requires an understanding of certain common-sense knowledge. In Table 1, three samples from the MultiNLI corpus show that solving the task requires one to handle the full complexity of lexical and compositional semantics. Previous work on NLI (or RTE) has extensively studied conventional approaches (Fyodorov et al., 2000; Bos & Markert, 2005; MacCartney & Manning, 2009). Recent progress on NLI has been enabled by the availability of a 570k human-annotated dataset (Bowman et al., 2015) and the advancement of representation learning techniques.
Among the core representation learning techniques, the attention mechanism has been broadly applied in many NLU tasks since its introduction: machine translation (Bahdanau et al., 2014), abstractive summarization (Rush et al., 2015), reading comprehension (Hermann et al., 2015), dialogue systems (Mei et al., 2016), etc. As described by Vaswani et al. (2017), "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key". The attention mechanism is known for its alignment between representations, for focusing one part of a representation over another, and for modeling dependencies regardless of sequence length. Observing attention's powerful capability, we hypothesize that the attention weight can assist a machine in understanding the text.
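As a minimal illustration of the query/key/value description quoted above (a toy numpy sketch, not the paper's model; all names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Dot-product attention: compatibility scores between the query and
    each key are normalized and used as weights over the values."""
    scores = keys @ query        # (n,) compatibility of the query with each key
    weights = softmax(scores)    # normalize scores into a distribution
    return weights @ values      # weighted sum of the values

# toy example: 4 key/value pairs of dimension 3
rng = np.random.default_rng(0)
out = attention(rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
```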
A regular attention weight, the core component of the attention mechanism, encodes the cross-sentence word relationship into an alignment matrix. However, a multi-head attention weight (Vaswani et al., 2017) can encode such interaction into multiple alignment matrices, which yields a more powerful alignment. In this work, we push multi-head attention to an extreme by building a word-by-word, dimension-wise alignment tensor which we call the interaction tensor. The interaction tensor encodes the high-order alignment relationship between the sentence pair. Our experiments demonstrate that by capturing the rich semantic features in the interaction tensor, we are able to solve the natural language inference task well, especially in cases with paraphrases, antonyms and overlapping words.
We dub the general framework the Interactive Inference Network (IIN). To the best of our knowledge, it is the first attempt to solve the natural language inference task in the interaction space. We further explore one instance of the Interactive Inference Network, the Densely Interactive Inference Network (DIIN), which achieves new state-of-the-art performance on both the SNLI and MultiNLI corpora. To test the generality of the architecture, we interpret the paraphrase identification task as a natural language inference task where matching corresponds to entailment and not-matching to neutral. We test the model on the Quora Question Pair dataset, which contains over 400k real-world question pairs, and achieve new state-of-the-art performance.
We introduce the related work in Section 2, and discuss the general framework of IIN along with a specific instance that enjoys state-of-the-art performance on multiple datasets in Section 3. We describe experiments and analysis in Section 4. Finally, we conclude and discuss future work in Section 5.
Early exploration of NLI mainly relied on conventional methods and small-scale datasets (Marelli et al., 2014). The availability of the SNLI dataset with 570k human-annotated sentence pairs has enabled a good deal of progress on natural language understanding. The essential representation learning techniques for NLU, such as attention (Wang & Jiang, 2015), memory (Munkhdalai & Yu, 2016) and the use of parse structure (Bowman et al., 2016; Mou et al., 2015), have been studied on SNLI, which serves as an important benchmark for sentence understanding. The models trained on the NLI task can be divided into two categories:
(i) sentence encoding-based models, which aim to find a vector representation for each sentence and classify the relation by using the concatenation of the two vector representations along with their absolute element-wise difference and element-wise product (Bowman et al., 2016; Vendrov et al., 2015; Mou et al., 2015; Liu et al., 2016; Munkhdalai & Yu, 2016);
(ii) joint feature models, which use cross-sentence features or attention from one sentence to another (Rocktaschel et al., 2015; Wang & Jiang, 2015; Cheng et al., 2016; Parikh et al., 2016; Wang et al., 2017; Yu & Munkhdalai, 2017; Sha et al., 2016).
After the neural attention mechanism was successfully applied to the machine translation task, the technique became widely used in both the natural language processing and computer vision domains. Many variants of the attention technique, such as hard attention (Xu et al., 2015), self-attention (Parikh et al., 2016), multi-hop attention (Gong & Bowman, 2017), bidirectional attention (Seo et al., 2016) and multi-head attention (Vaswani et al., 2017), have also been introduced to tackle more complicated tasks. Before this work, the neural attention mechanism was mainly used for alignment, focusing on a specific part of the representation. In this work, we want to show that the attention weight contains rich semantic information required for understanding the logical relationship between a sentence pair.
Though RNNs or LSTMs are very good at variable-length sequence modeling, using convolutional neural networks in NLU tasks is desirable because of their computational parallelism. Convolutional structures have been successfully applied in various domains such as machine translation (Gehring et al., 2017), sentence classification (Kim, 2014), text matching (Hu et al., 2014) and sentiment analysis (Kalchbrenner et al., 2014). The convolutional structure has also been applied at different levels of granularity, such as the byte (Zhang & LeCun, 2017), character (Zhang et al., 2015), word (Gehring et al., 2017) and sentence (Mou et al., 2015) levels.
The Interactive Inference Network (IIN) is a hierarchical multi-stage process and consists of five components. Each of the components is compatible with different types of implementations. Potentially, all existing approaches in machine learning, such as decision trees, support vector machines and neural network approaches, can be transferred in to replace certain components in this architecture. We focus on neural network approaches below. Figure 1 provides a visual illustration of the Interactive Inference Network.
Embedding Layer converts each word or phrase to a vector representation and constructs the representation matrix for each sentence. In the embedding layer, a model can map tokens to vectors with pre-trained word representations such as GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013) and fastText (Joulin et al., 2016). It can also utilize preprocessing tools, e.g. a named entity recognizer, part-of-speech tagger, lexical parser and coreference identifier, to incorporate more lexical and syntactic information into the feature vector.
Encoding Layer encodes the representations by incorporating context information or enriching the representation with desirable features for later use. For instance, a model can adopt a bidirectional recurrent neural network to model the temporal interaction in both directions, a recursive neural network (Socher et al., 2011) (also known as a TreeRNN) to model the compositionality and the recursive structure of language, or self-attention to model long-term dependencies within a sentence. Different components of the encoder can be combined to obtain a better sentence matrix representation.
Interaction Layer creates a word-by-word interaction tensor from the premise and hypothesis representation matrices. The interaction can be modeled in different ways. A common approach is to compute the cosine similarity or dot product between each pair of feature vectors. Alternatively, a high-order interaction tensor can be constructed with the outer product between the two matrix representations.
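As a sketch of the interaction options listed above, assuming toy premise and hypothesis matrices of shape (p, d) and (h, d) (shapes and names are ours):

```python
import numpy as np

# Assumed toy shapes: premise P is (p, d), hypothesis H is (h, d).
p, h, d = 5, 4, 8
rng = np.random.default_rng(0)
P, H = rng.normal(size=(p, d)), rng.normal(size=(h, d))

# (a) scalar interaction: dot product between every premise/hypothesis vector pair
dot = P @ H.T                                         # (p, h)

# (b) scalar interaction: cosine similarity
cos = dot / (np.linalg.norm(P, axis=1, keepdims=True)
             * np.linalg.norm(H, axis=1)[None, :])    # (p, h)

# (c) high-order interaction: outer product for every word pair
outer = P[:, None, :, None] * H[None, :, None, :]     # (p, h, d, d)
```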
Feature Extraction Layer adopts a feature extractor to extract semantic features from the interaction tensor. Convolutional feature extractors such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014), Inception (Szegedy et al., 2014), ResNet (He et al., 2016) and DenseNet (Huang et al., 2016), which have proven to work well on image recognition, are completely compatible with this architecture. Unlike the works (Kim, 2014; Zhang et al., 2015) that employ a 1-D sliding window, our CNN architecture allows a 2-D kernel to extract semantic interaction features from the word-by-word interaction between n-gram pairs. Sequential or tree-like feature extractors are also applicable in the feature extraction layer.
Output Layer decodes the acquired features to give a prediction. Under the NLI setting, the output layer predicts the confidence for each class.
Here we introduce Densely Interactive Inference Network (DIIN), which is a relatively simple instantiation of IIN but produces state-of-the-art performance on multiple datasets.
Embedding Layer:
For DIIN, we use the concatenation of word embedding, character features and syntactical features. The word embedding is obtained by mapping tokens to a high-dimensional vector space via pre-trained word vectors (840B GloVe), and is updated during training. As in (Kim et al., 2016; Lee et al., 2016), we filter the character embedding with a 1D convolution kernel. The character convolutional feature maps are then max-pooled over the time dimension for each token to obtain a vector. The character features supply extra information for some out-of-vocabulary (OOV) words. Syntactical features include a one-hot part-of-speech (POS) tagging feature and a binary exact match (EM) feature. The EM value is activated if the other sentence contains a token with the same stem or lemma as the corresponding token. The EM feature is simple yet has been found useful in the reading comprehension task (Chen et al., 2017a). In the analysis section, we study how the EM feature helps text understanding. We now have a premise representation $P \in R^{p \times d}$ and a hypothesis representation $H \in R^{h \times d}$, where $p$ refers to the sequence length of the premise, $h$ refers to the sequence length of the hypothesis and $d$ is the dimension of both representations. The 1-D convolutional neural network and character feature weights share the same set of parameters between premise and hypothesis.
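As a rough illustration of the character and exact-match features, here is a small numpy sketch; the helper names, shapes, and the simple lower-casing stand-in for stemming/lemmatization are our assumptions, not the paper's code:

```python
import numpy as np

def char_conv_feature(char_embs, kernel, width=5):
    """1-D convolution over a token's character embeddings, followed by
    max pooling over the time (character) dimension.
    char_embs: (n_chars, char_dim); kernel: (width, char_dim, n_filters)."""
    n_chars, _ = char_embs.shape
    windows = np.stack([char_embs[i:i + width] for i in range(n_chars - width + 1)])
    conv = np.einsum('twc,wcf->tf', windows, kernel)   # (n_windows, n_filters)
    return conv.max(axis=0)                            # max over time -> (n_filters,)

def exact_match(tokens_a, tokens_b, normalize=lambda t: t.lower()):
    """Binary EM feature: 1 if the (normalized) token also appears in the
    other sentence. Real stemming/lemmatization is assumed to live in `normalize`."""
    other = {normalize(t) for t in tokens_b}
    return np.array([[1.0] if normalize(t) in other else [0.0] for t in tokens_a])

# toy usage: a token padded to 16 characters, 100-D char embeddings, 60 filters
rng = np.random.default_rng(0)
vec = char_conv_feature(rng.normal(size=(16, 100)), rng.normal(size=(5, 100, 60)))
em = exact_match(["A", "dog", "runs"], ["The", "dog", "sleeps"])
```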
Encoding Layer:
In the encoding layer, the premise representation $P$ and the hypothesis representation $H$ are passed through a two-layer highway network, yielding a new premise representation $\hat P \in R^{p \times d}$ and a new hypothesis representation $\hat H \in R^{h \times d}$. These new representations are then passed to a self-attention layer to take word order and context information into account. Taking the premise as an example, we model self-attention as
$$A_{ij}=\alpha(\hat P_i, \hat P_j) \in R \tag{1}$$
$$\overline P_i = \sum^p_{j=1}\frac{\exp(A_{ij})}{\sum^p_{k=1}\exp(A_{kj})} \hat P_j, \quad \forall i, j \in [1, ..., p] \tag{2}$$
where $\overline P_i$ is a weighted summation of $\hat P$. We choose $\alpha(a, b)=w_a^T[a; b; a \circ b]$, where $w_a \in R^{3d}$ is a trainable weight, $\circ$ is element-wise multiplication, $[;]$ is vector concatenation across rows, and the implicit multiplication is matrix multiplication. Then both $\hat P$ and $\overline P$ are fed into a semantic composite fuse gate (fuse gate for short), which acts as a skip connection. The fuse gate is implemented as
$$z_i=\tanh(W^{1T}[\hat P_i;\overline P_i]+b^1) \tag{3}$$
$$r_i=\sigma(W^{2T}[\hat P_i; \overline P_i]+b^2) \tag{4}$$
$$f_i=\sigma(W^{3T}[\hat P_i;\overline P_i]+b^3) \tag{5}$$
$$\tilde P_i=r_i \circ \hat P_i+f_i \circ z_i \tag{6}$$
where $W^1$, $W^2$, $W^3 \in R^{2d \times d}$ and $b^1$, $b^2$, $b^3 \in R^d$ are trainable weights, and $\sigma$ is the sigmoid nonlinearity.
We do the same operation on the hypothesis representation, thus obtaining $\tilde H$. The weights of intra-attention and fuse gate for premise and hypothesis are not shared, but the difference between the two sets of weights is penalized. The penalization aims to ensure that the parallel structure learns similar functionality while remaining aware of the subtle semantic difference between premise and hypothesis.
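To make the encoding layer concrete, the following is a minimal numpy sketch of equations (1)-(6) for a single sentence; function and variable names are ours, and randomly initialized weights stand in for trained parameters:

```python
import numpy as np

def column_softmax(A):
    """exp(A_ij) / sum_k exp(A_kj), matching the normalization in eq. (2)."""
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(P_hat, w_a, W1, b1, W2, b2, W3, b3):
    """Self-attention (eqs. 1-2) followed by the fuse gate (eqs. 3-6).
    P_hat: (p, d) output of the highway network for one sentence."""
    p, d = P_hat.shape
    # eq. (1): alpha(a, b) = w_a^T [a; b; a o b] for every pair of positions
    A = np.array([[w_a @ np.concatenate([P_hat[i], P_hat[j], P_hat[i] * P_hat[j]])
                   for j in range(p)] for i in range(p)])       # (p, p)
    P_bar = column_softmax(A) @ P_hat                           # eq. (2), (p, d)
    PC = np.concatenate([P_hat, P_bar], axis=1)                 # [P_hat_i; P_bar_i], (p, 2d)
    z = np.tanh(PC @ W1 + b1)                                   # eq. (3)
    r = sigmoid(PC @ W2 + b2)                                   # eq. (4)
    f = sigmoid(PC @ W3 + b3)                                   # eq. (5)
    return r * P_hat + f * z                                    # eq. (6): P_tilde

# toy usage with random weights standing in for trained parameters
rng = np.random.default_rng(0)
p, d = 6, 4
P_tilde = encode(rng.normal(size=(p, d)), rng.normal(size=3 * d),
                 rng.normal(size=(2 * d, d)), np.zeros(d),
                 rng.normal(size=(2 * d, d)), np.zeros(d),
                 rng.normal(size=(2 * d, d)), np.zeros(d))
```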
Interaction Layer:
The interaction layer models the interaction between the encoded premise representation $P^{enc}$ and the encoded hypothesis representation $H^{enc}$ as follows:
$$I_{ij}=\beta (\tilde P_i, \tilde H_j) \in R^d, \quad \forall i \in [1, ..., p], \forall j \in [1, ..., h] \tag{7}$$
where $\tilde P_i$ is the $i$-th row vector of $\tilde P$, and $\tilde H_j$ is the $j$-th row vector of $\tilde H$. Though there are many possible implementations of the interaction, we find $\beta(a, b)=a \circ b$ very useful.
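A minimal sketch of this choice of $\beta$, using broadcasting to form the $(p, h, d)$ interaction tensor (the function name is ours):

```python
import numpy as np

def interaction_tensor(P_tilde, H_tilde):
    """beta(a, b) = a * b (element-wise) for every premise/hypothesis word pair.
    P_tilde: (p, d), H_tilde: (h, d) -> interaction tensor I of shape (p, h, d)."""
    return P_tilde[:, None, :] * H_tilde[None, :, :]

rng = np.random.default_rng(0)
I = interaction_tensor(rng.normal(size=(6, 4)), rng.normal(size=(5, 4)))  # (6, 5, 4)
```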
Feature Extraction Layer:
We adopt DenseNet (Huang et al., 2016) as the convolutional feature extractor in DIIN. Though our experiments show that ResNet (He et al., 2016) works well in the architecture, we choose DenseNet because it is effective at saving parameters. One interesting observation with ResNet is that if we remove the skip connection in the residual structure, the model does not converge at all. We found that batch normalization delays convergence without contributing to accuracy, therefore we do not use it in our case. A ReLU activation function is applied after all convolutions unless otherwise noted. Once we have the interaction tensor $I$, we use a convolution with a $1 \times 1$ kernel to scale down the tensor by a ratio $\eta$, without a following ReLU. If the input channel is $k$ then the output channel is $floor(k \times \eta)$. The generated feature map is then fed into three pairs of dense blocks (Huang et al., 2016) and transition blocks. Each dense block contains $n$ layers of $3 \times 3$ convolution with growth rate $g$. The transition layer has a convolution layer with a $1 \times 1$ kernel for scaling down, followed by a max pooling layer with stride 2. The scale-down ratio in the transition layer is $\theta$.
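Below is a rough Keras-style sketch of this feature extractor, using the hyperparameter names from the text ($n$, $g$, $\eta$, $\theta$); it illustrates the DenseNet-style connectivity under our own assumptions (shapes, defaults, layer choices) and is not the authors' implementation:

```python
import tensorflow as tf

def dense_block(x, n_layers=8, growth_rate=20):
    """Each 3x3 conv adds `growth_rate` feature maps, concatenated with
    everything produced so far (DenseNet-style connectivity)."""
    for _ in range(n_layers):
        y = tf.keras.layers.Conv2D(growth_rate, 3, padding='same', activation='relu')(x)
        x = tf.keras.layers.Concatenate(axis=-1)([x, y])
    return x

def transition(x, theta=0.5):
    """1x1 conv scales channels down by `theta`, then 2x2 max pooling halves p and h."""
    channels = int(x.shape[-1] * theta)
    x = tf.keras.layers.Conv2D(channels, 1, padding='same')(x)
    return tf.keras.layers.MaxPool2D(pool_size=2, strides=2)(x)

def feature_extractor(interaction, eta=0.3, n_blocks=3):
    """interaction: (batch, p, h, k). A 1x1 conv first scales k down by `eta`
    (no ReLU), then three dense block / transition pairs are applied."""
    x = tf.keras.layers.Conv2D(int(interaction.shape[-1] * eta), 1, padding='same')(interaction)
    for _ in range(n_blocks):
        x = transition(dense_block(x))
    return tf.keras.layers.Flatten()(x)

# toy usage: interaction tensor with p = h = 48 and an assumed k = 440 channels
inter = tf.keras.Input(shape=(48, 48, 440))
feats = feature_extractor(inter)
```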
Output Layer:
DIIN uses a linear layer to classify the final flattened feature representation into three classes.
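A minimal sketch of this output layer, with a hypothetical flattened feature size:

```python
import tensorflow as tf

# Hypothetical flattened feature vector from the feature extraction layer.
feats = tf.keras.Input(shape=(5616,))
# Linear layer producing confidences for entailment / neutral / contradiction.
logits = tf.keras.layers.Dense(3)(feats)
model = tf.keras.Model(feats, logits)
```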
In this section, we present the evaluation of our model. We first perform a quantitative evaluation, comparing our model with other competitive models. We then conduct some qualitative analyses to understand how DIIN achieves high-level understanding through interaction.
Here we introduce the three datasets on which we evaluate our model. The evaluation metric for all datasets is accuracy.
SNLI
Stanford Natural Language Inference (SNLI; Bowman et al. 2015) has 570k human-annotated sentence pairs. The premise data is drawn from the captions of the Flickr30k corpus, and the hypothesis data is manually composed. The labels provided are "entailment", "neutral", "contradiction" and "-". "-" indicates that the annotators could not reach consensus, so these pairs are removed during training and testing as in other works. We use the same data split as in Bowman et al. (2015).
MultiNLI
The Multi-Genre NLI Corpus (MultiNLI; Williams et al. 2017) has 433k sentence pairs, whose collection process and task details are modeled closely on SNLI. The premise data is collected from a maximally broad range of genres of American English, such as written non-fiction genres (SLATE, OUP, GOVERNMENT, VERBATIM, TRAVEL), spoken genres (TELEPHONE, FACE-TO-FACE), less formal written genres (FICTION, LETTERS) and a specialized one for 9/11. Half of these selected genres appear in the training set while the rest do not, creating in-domain (matched) and cross-domain (mismatched) development/test sets. We use the same data split as provided by Williams et al. (2017). Since test set labels are not provided, the test performance is obtained through submission on Kaggle.com. Each team is limited to two submissions per day.
Quora question pair
The Quora question pair dataset contains over 400k real-world question pairs selected from Quora.com. A binary annotation which stands for match (duplicate) or not match (not duplicate) is provided for each question pair. In our case, a duplicate question pair can be interpreted as entailment and a non-duplicate pair as neutral. We use the same split ratio as mentioned in Wang et al. (2017).
We implement our algorithm with the Tensorflow (Abadi et al., 2016) framework. An Adadelta optimizer (Zeiler, 2012) with $\rho$ as 0.95 and $\epsilon$ as 1e-8 is used to optimize all trainable weights. The initial learning rate is set to 0.5 and the batch size to 70. When the model does not improve the best in-domain performance for 30,000 steps, an SGD optimizer with a learning rate of 3e-4 is used to help the model find a better local optimum. Dropout layers are applied before all linear layers and after the word-embedding layer. We use an exponentially decayed keep rate during training, where the initial keep rate is 1.0 and the decay rate is 0.977 for every 10,000 steps. We initialize our word embeddings with pre-trained 300D GloVe 840B vectors (Pennington et al., 2014), while out-of-vocabulary words are randomly initialized with a uniform distribution. The character embeddings are randomly initialized with 100D. We crop or pad each token to 16 characters. The 1D convolution kernel size for character embedding is 5. All weights are constrained by L2 regularization, and the L2 regularization at step $t$ is calculated as follows:
$$L2Ratio_t=\sigma\left(\frac{(t-L2FullStep/2)*8}{L2FullStep/2}\right)*L2FullRatio \tag{8}$$
where $L2FullRatio$ determines the maximum L2 regularization ratio, and $L2FullStep$ determines at which step the maximum L2 regularization ratio is applied. We choose $L2FullRatio$ as 0.9e-5 and $L2FullStep$ as 100,000. The ratio of the L2 penalty on the difference between the two encoder weights is set to 1e-3. For a dense block in the feature extraction layer, the number of layers $n$ is set to 8 and the growth rate $g$ is set to 20. The first scale-down ratio $\eta$ in the feature extraction layer is set to 0.3 and the transition scale-down ratio $\theta$ is set to 0.5. The sequence length is set as a hard cutoff in all experiments: 48 for MultiNLI, 32 for SNLI and 24 for the Quora Question Pair dataset. During the experiments on MultiNLI, we use 15% of the data from SNLI as in Williams et al. (2017). We select parameters based on the best development accuracy. Our ensembling approach takes the majority vote of the predictions given by multiple runs of the same model under different random parameter initializations.
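For concreteness, a small sketch of the two training schedules described above (function names are ours, and whether the keep-rate decay is applied continuously or in 10,000-step stairs is our assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2_ratio(step, l2_full_ratio=0.9e-5, l2_full_step=100000):
    """Equation (8): a sigmoid ramp that reaches roughly l2_full_ratio
    around step l2_full_step."""
    return sigmoid((step - l2_full_step / 2) * 8 / (l2_full_step / 2)) * l2_full_ratio

def keep_rate(step, initial=1.0, decay=0.977, every=10000):
    """Exponentially decayed dropout keep rate (continuous form assumed)."""
    return initial * decay ** (step / every)

print(l2_ratio(100000), keep_rate(100000))
```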
We compare our results with all other published systems in Table 2. Besides ESIM, the state-of-the-art model on SNLI, all other models appeared at the RepEval 2017 workshop. The RepEval 2017 workshop requires all submitted models to be sentence encoding-based, therefore alignment between sentences and memory modules are not eligible for the competition. All models except ours share one common feature: they use LSTM as an essential building block of the encoder. Our approach, without using any recurrent structure, achieves a new state-of-the-art performance of 80.0%, exceeding the previous state of the art by more than 5%. Unlike the observation of Nangia et al. (2017), we find the out-of-domain test performance is consistently lower than the in-domain test performance. Selecting parameters based on the best in-domain development accuracy partially contributes to this result.
In Table 3, we compare our model to other models on SNLI. Experiments (2-7) are sentence encoding-based models. Bowman et al. (2016) provide a BiLSTM baseline. Vendrov et al. (2015) adopt a two-layer GRU encoder with pre-trained "skip-thoughts" vectors. To capture sentence-level semantics, Mou et al. (2015) use a tree-based CNN and Bowman et al. (2016) propose a stack-augmented parser-interpreter neural network (SPINN) which incorporates parsing information in a sequential manner. Liu et al. (2016) use intra-attention on top of a BiLSTM to generate sentence representations, and Munkhdalai & Yu (2016) propose a memory-augmented neural network to encode the sentence. The next group of models, experiments (8-18), use cross-sentence features. Rocktaschel et al. (2015) align each sentence word-by-word with attention on top of LSTMs. Wang & Jiang (2015) enforce cross-sentence attention word-by-word matching with their proposed mLSTM model. Cheng et al. (2016) propose a long short-term memory-network (LSTMN) with deep attention fusion that links the current word to previous words stored in memory. Parikh et al. (2016) decompose the task into sub-problems and conquer them respectively. Yu & Munkhdalai (2017) propose a neural tree indexer, a full n-ary tree whose subtrees can be overlapped. The re-read LSTM proposed by Sha et al. (2016) treats the attention vector of one sentence as the inner state of the LSTM for the other sentence. Chen et al. (2016) propose a sequential model that infers locally, and an ensemble with a tree-like inference module that further improves performance. We show that our model, DIIN, achieves state-of-the-art performance on the competitive leaderboard.
4.5 EXPERIMENT ON QUORA QUESTION PAIR DATASET
In this subsection, we evaluate the effectiveness of our model on paraphrase identification cast as a natural language inference task. Other than our baselines, we compare with Wang et al. (2017) and Tomar et al. (2017). BiMPM models different perspectives of matching between the sentence pair in both directions, then aggregates the matching vectors with an LSTM. DECATTword and DECATTchar use automatically collected in-domain paraphrase data to noisily pretrain n-gram word embeddings and n-gram subword embeddings, respectively, on the decomposable attention model proposed by Parikh et al. (2016). In Table 4, our experiment shows that DIIN performs better than all other models, and the ensemble score exceeds the previous best result by more than 1 percent.