Paper Notes: Enhanced LSTM for Natural Language Inference

Enhanced LSTM for Natural Language Inference

https://arxiv.org/pdf/1609.06038v3.pdf

Related Work

  • Enhancing sequential inference models based on chain networks
  • Further, considering recursive architectures to encode syntactic parsing information

Hybrid Neural Inference Models

Major components

  • input encoding, local inference modeling, inference composition
  • ESIM (sequential NLI model), tree-LSTM (incorporates syntactic parsing information)
Notation
  • Two sentences:
    • $a = (a_1, ..., a_{l_a})$
    • $b = (b_1, ..., b_{l_b})$
  • Embedding of each word as an $l$-dimensional vector: $a_i, b_j \in \mathbb{R}^l$
  • $\bar a_i$: hidden state generated by the BiLSTM at time $i$ over the input sequence $a$
Goal
  • Predict a label $y$ that indicates the logic relationship between $a$ and $b$

Input Encoding

  • Use BiLSTM to encode the input premise and hypothesis

  • The hidden states generated by the two LSTMs at each time step are concatenated to represent that time step and its context

  • Encode the syntactic parse trees of the premise and hypothesis with a tree-LSTM

  • Each tree node is deployed with a tree-LSTM memory block

    • At each node, an input vector $x_t$ and the hidden vectors of its two children ($h^L_{t-1}$ and $h^R_{t-1}$) are taken as input to calculate the current node's hidden vector $h_t$
  • Detailed computation (a code sketch follows this list):

    • $h_t = \mathrm{TrLSTM}(x_t, h^L_{t-1}, h^R_{t-1})$
    • $h_t = o_t \odot \tanh(c_t)$
    • $o_t = \sigma(W_o x_t + U^L_o h^L_{t-1} + U^R_o h^R_{t-1})$
    • $c_t = f^L_t \odot c^L_{t-1} + f^R_t \odot c^R_{t-1} + i_t \odot u_t$
    • $f^L_t = \sigma(W_f x_t + U^{LL}_f h^L_{t-1} + U^{LR}_f h^R_{t-1})$
    • $f^R_t = \sigma(W_f x_t + U^{RL}_f h^L_{t-1} + U^{RR}_f h^R_{t-1})$
    • $i_t = \sigma(W_i x_t + U^L_i h^L_{t-1} + U^R_i h^R_{t-1})$
    • $u_t = \tanh(W_c x_t + U^L_c h^L_{t-1} + U^R_c h^R_{t-1})$
  • All $W \in \mathbb{R}^{d \times l}$ and $U \in \mathbb{R}^{d \times d}$ are weight matrices to be learned
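
A minimal NumPy sketch of this node update, assuming the weight-dictionary layout below (the function name `tree_lstm_node` and the dictionary keys are illustrative, not from the paper's released code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(x_t, h_L, h_R, c_L, c_R, W, U):
    """One tree-LSTM node update following the equations above.

    x_t: input vector of shape (l,); h_L/h_R, c_L/c_R: children hidden/cell states of shape (d,).
    W: dict of (d, l) input weight matrices; U: dict of (d, d) recurrent weight matrices.
    """
    i_t = sigmoid(W["i"] @ x_t + U["iL"] @ h_L + U["iR"] @ h_R)    # input gate
    o_t = sigmoid(W["o"] @ x_t + U["oL"] @ h_L + U["oR"] @ h_R)    # output gate
    f_L = sigmoid(W["f"] @ x_t + U["fLL"] @ h_L + U["fLR"] @ h_R)  # forget gate for the left child
    f_R = sigmoid(W["f"] @ x_t + U["fRL"] @ h_L + U["fRR"] @ h_R)  # forget gate for the right child
    u_t = np.tanh(W["c"] @ x_t + U["cL"] @ h_L + U["cR"] @ h_R)    # candidate cell state
    c_t = f_L * c_L + f_R * c_R + i_t * u_t                        # new cell state
    h_t = o_t * np.tanh(c_t)                                       # new hidden state
    return h_t, c_t
```

In a bottom-up traversal, leaf nodes can use zero vectors in place of the missing child states.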

Local Inference Modeling

Locality of inference
  • Employ some forms of hard or soft alignment to associate the relevant subcomponents between a premise and a hypothesis
  • Argue for leveraging attention over the bidirectional sequential encoding of the input
  • The soft alignment layer computes the attention weights as the similarity of a hidden state tuple $\langle \bar a_i, \bar b_j \rangle$ between a premise and a hypothesis: $e_{ij} = \bar a_i^T \bar b_j$
  • use bidirectional LSTM and tree-LSTM to encode the premise and hypothesis
  • In sequential inference model, use BiLSTM
Local inference collected over sequences
  • Local inference is determined by the attention weight $e_{ij}$, which is used to obtain the local relevance between the premise and the hypothesis
  • The content in $\{\bar b_j\}_{j=1}^{l_b}$ that is relevant to $\bar a_i$ is selected and represented as $\tilde a_i$

$\tilde a_i = \sum\limits_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b}\exp(e_{ik})} \bar b_j, \quad \forall i \in [1, ..., l_a]$

$\tilde b_j = \sum\limits_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a}\exp(e_{kj})} \bar a_i, \quad \forall j \in [1, ..., l_b]$
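
A minimal NumPy sketch of this soft alignment, assuming `a_bar` (shape $(l_a, d)$) and `b_bar` (shape $(l_b, d)$) hold the encoder hidden states as rows (the function name is illustrative):

```python
import numpy as np

def soft_align(a_bar, b_bar):
    """Soft alignment between premise and hypothesis hidden states.

    a_bar: (l_a, d) premise hidden states; b_bar: (l_b, d) hypothesis hidden states.
    Returns a_tilde (l_a, d) and b_tilde (l_b, d).
    """
    e = a_bar @ b_bar.T                                # e_ij = a_bar_i . b_bar_j, shape (l_a, l_b)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))   # softmax over j for each premise position i
    alpha /= alpha.sum(axis=1, keepdims=True)
    beta = np.exp(e - e.max(axis=0, keepdims=True))    # softmax over i for each hypothesis position j
    beta /= beta.sum(axis=0, keepdims=True)
    a_tilde = alpha @ b_bar                            # content of b relevant to each a_i
    b_tilde = beta.T @ a_bar                           # content of a relevant to each b_j
    return a_tilde, b_tilde
```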

Local inference collected over parse trees
  • The same soft alignment (attention weights $e_{ij}$ and weighted summation) is applied over the tree-LSTM hidden states of the premise and hypothesis parse trees
Enhancement of local inference information
  • Compute the difference and the element-wise product for the tuple $\langle \bar a, \tilde a \rangle$ as well as for $\langle \bar b, \tilde b \rangle$
  • The difference and element-wise product are then concatenated with the original vectors

$m_a = [\bar a; \tilde a; \bar a - \tilde a; \bar a \odot \tilde a]$

$m_b = [\bar b; \tilde b; \bar b - \tilde b; \bar b \odot \tilde b]$
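
Continuing the sketch, the enhancement is a concatenation along the feature axis (using the same hypothetical `a_bar`/`a_tilde` arrays as above):

```python
import numpy as np

def enhance(x_bar, x_tilde):
    """m = [x_bar; x_tilde; x_bar - x_tilde; x_bar * x_tilde] along the hidden dimension."""
    return np.concatenate([x_bar, x_tilde, x_bar - x_tilde, x_bar * x_tilde], axis=-1)

# m_a = enhance(a_bar, a_tilde)   # shape (l_a, 4d)
# m_b = enhance(b_bar, b_tilde)   # shape (l_b, 4d)
```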

Inference Composition

  • Explore a composition layer to compose the enhanced local inference information $m_a$ and $m_b$
The composition layer
  • In the sequential inference model, a BiLSTM is used to compose local inference information sequentially
  • The same BiLSTM formulation is used here to capture the local inference information $m_a$ and $m_b$ and their context for inference composition
  • In the tree model, the composition is performed by tree-LSTM node updates

$v_{a,t} = \mathrm{TrLSTM}(F(m_{a,t}), h^L_{t-1}, h^R_{t-1})$

$v_{b,t} = \mathrm{TrLSTM}(F(m_{b,t}), h^L_{t-1}, h^R_{t-1})$

  • $F$ is a 1-layer feedforward neural network with ReLU activation that reduces the dimension of $m_a$ and $m_b$ to control model complexity; it is applied before the BiLSTM in the sequential inference composition as well (see the sketch below)
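
A minimal sketch of the projection $F$ described above, assuming it maps the $4d$-dimensional enhanced vectors back to $d$ dimensions (the weight names `W_p` and `b_p` are illustrative):

```python
import numpy as np

def project_F(m, W_p, b_p):
    """1-layer feedforward with ReLU, applied row-wise: (len, 4d) -> (len, d)."""
    return np.maximum(0.0, m @ W_p + b_p)

# Sequential model: feed project_F(m_a), project_F(m_b) to the composition BiLSTM.
# Tree model: feed each projected row into the tree-LSTM node update sketched earlier.
```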
Pooling
  • Convert the resulting vectors obtained above into a fixed-length vector with pooling and feed it to the final classifier to determine the overall inference relationship
  • Compute both average and max pooling, and concatenate all these vectors to form the final fixed-length vector $v$

$v_{a,\mathrm{ave}} = \sum\limits_{i=1}^{l_a} \frac{v_{a,i}}{l_a}, \quad v_{a,\max} = \max\limits_{i=1}^{l_a} v_{a,i}$

$v_{b,\mathrm{ave}} = \sum\limits_{j=1}^{l_b} \frac{v_{b,j}}{l_b}, \quad v_{b,\max} = \max\limits_{j=1}^{l_b} v_{b,j}$

$v = [v_{a,\mathrm{ave}}; v_{a,\max}; v_{b,\mathrm{ave}}; v_{b,\max}]$
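
A minimal NumPy sketch of the pooling step, assuming `v_a` (shape $(l_a, d)$) and `v_b` (shape $(l_b, d)$) are the composition-layer outputs:

```python
import numpy as np

def pool(v_a, v_b):
    """Average- and max-pool over time, then concatenate into the fixed-length vector v."""
    return np.concatenate([v_a.mean(axis=0), v_a.max(axis=0),
                           v_b.mean(axis=0), v_b.max(axis=0)])   # shape (4d,)
```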

  • Put $v$ into a final multilayer perceptron (MLP) classifier
  • Use multi-class cross-entropy loss
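
A minimal sketch of the final classifier and loss, assuming a one-hidden-layer MLP with tanh activation and a softmax output over the three NLI labels (layer sizes and weight names are illustrative):

```python
import numpy as np

def mlp_classify(v, W1, b1, W2, b2):
    """MLP with one tanh hidden layer and a softmax output over the NLI labels."""
    h = np.tanh(v @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()

def cross_entropy(p, y):
    """Multi-class cross-entropy loss for a gold label index y."""
    return -np.log(p[y] + 1e-12)
```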
