Paper title: A Convolutional Neural Network for Modelling Sentences
Authors: Nal Kalchbrenner et al.
Affiliation: University of Oxford
a. Sequence problems
b. The meaning of representations
c. Word representations
Speech | Image | Text |
Aim: represent the semantic content of a sentence.
Problem: individual sentences are rarely observed, so we start from words.
Problem with words as discrete symbols
Example: in web search, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”
motel=[0 0 0 0 0 0 0 1 0 0 0 0]
hotel=[0 1 0 0 0 0 0 0 0 0 0 0]
These two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors!
Distributional semantics: a word’s meaning is given by the words that frequently appear close by.
When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
Use the many contexts of w to build up a representation of w
How to make neighbors represent words?
Answer: with a co-occurrence matrix X
Window-based co-occurrence matrix
·Window around each word captures both syntactic (POS) and semantic information
·Window length 1 (more common: 5-10)
·Symmetric (irrelevant whether left or right context)
Example corpus:
·I like deep learning.
·I like NLP.
·I enjoy flying.
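As a rough illustration, the window-based counts for the example corpus above can be computed with a few lines of Python. This is only a sketch: the function name `cooccurrence`, the whitespace tokenisation, and the choice to treat the final period as its own token are assumptions, not something the notes specify.

```python
from collections import defaultdict

def cooccurrence(sentences, window=1):
    """Count symmetric co-occurrences within `window` tokens on either side."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Assumption: the period is kept as a separate token.
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
counts = cooccurrence(corpus, window=1)

# Dense matrix X with one row and one column per vocabulary word.
vocab = sorted({w for s in corpus for w in s.split()})
X = [[counts[(a, b)] for b in vocab] for a in vocab]
```

Each row of `X` is the count vector for one word, so the matrix has as many rows and columns as there are vocabulary words.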
·Increases in size with vocabulary
·Very high dimensional: requires a lot of storage
·Subsequent classification models have sparsity issues
·Models are less robust
·Idea: store “most” of the important information in a fixed, small number of dimensions: a dense vector
·Usually around 25-1000 dimensions
·How to reduce the dimensionality?
Singular Value Decomposition of the co-occurrence matrix X.
Two examples follow (from “An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence”, Rohde et al., 2005):
·Computational cost scales quadratically for an n × m matrix: $O(mn^2)$ flops (when n < m)
·Hard to incorporate new words or documents.
Solution:Directly learn low-dimensional word vectors.
·Instead of capturing co-occurrence counts directly.
·Predict surrounding words of every word.
·Faster and can easily incorporate a new sentence/document or add a word to the vocabulary.
Word2Vec Overview:
Predict surrounding words
·Predict surrounding words in a window of length m of every word.
·Objective function: Maximize the log probability of any context word given the current center word.
·Likelihood: $L(\theta)=\prod_{t=1}^{T}\prod_{\substack{-m\le j\le m\\ j\neq 0}}P(w_{t+j}\mid w_t;\theta)$
·Objective function: $J(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m\le j\le m\\ j\neq 0}}\log P(w_{t+j}\mid w_t;\theta)$
·$\theta$ represents all variables we optimize.
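The objective above can be evaluated directly on a toy sequence. The sketch below assumes the usual softmax parameterisation of $P(w_{t+j}\mid w_t)$ over dot products of two vector tables; the notes do not state this form, and the names `V` (center vectors), `U` (context vectors), `softmax_prob` and `objective` are hypothetical.

```python
import math

def softmax_prob(center, context, V, U):
    """P(context | center) as a softmax over dot products (assumed form)."""
    scores = {w: sum(a * b for a, b in zip(U[w], V[center])) for w in U}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[context]) / z

def objective(tokens, V, U, m):
    """J(theta) = -(1/T) * sum_t sum_{-m<=j<=m, j!=0} log P(w_{t+j} | w_t)."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the text.
            if j == 0 or not 0 <= t + j < T:
                continue
            total += math.log(softmax_prob(tokens[t], tokens[t + j], V, U))
    return -total / T

tokens = "I like deep learning".split()
vocab = sorted(set(tokens))
# All-zero vectors make every context word equally likely (P = 1/4).
V = {w: [0.0, 0.0] for w in vocab}
U = {w: [0.0, 0.0] for w in vocab}
J = objective(tokens, V, U, m=1)
```

With all-zero vectors the model is uniform over the four words, so `J` is simply the average number of context pairs per position (6/4) times $\log 4$. Training would lower `J` by adjusting `V` and `U`.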
These representations are very good at encoding dimensions of similarity!
Convolutional Neural Networks with Dynamic k-Max Pooling
Properties of the Sentence Model
·Bag-of-words / bag-of-n-grams
·Time-Delay Neural Networks (TDNN)
·Recurrent Neural Networks (RNN)
·Recursive Neural Networks
what do the vectors represent?
· feature combinations(a node in the second layer might be “feature 1 AND feature 5 are active”)
·e.g. capture things such as “not” AND “hate”
·BUT! Cannot handle “not hate”
Why bag of n-grams?
·allow us to capture combination features in a simple way “don’t love”,“not the best”
·works pretty well
What problems with bag of n-grams?
·parameter explosion
·No sharing between similar words/n-grams
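A small sketch of n-gram extraction makes both points above concrete. The helper name `ngrams` and the example sentence are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not the best movie".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Upper bound on distinct n-gram features for a vocabulary of size v: v ** n.
v = 10_000
possible_bigrams = v ** 2
possible_trigrams = v ** 3
```

The trigram `("not", "the", "best")` becomes a single feature, which is how the combination is captured. The `v ** n` bound shows why the number of parameters explodes, and each n-gram is an unrelated symbol, so nothing is shared between similar ones.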
Generally based on 1D convolutions
but often uses terminology/functions borrowed from computer vision for historical reasons
Two main paradigms
Context window modeling: for tagging etc., get the surrounding context before tagging
Sentence modeling: do convolution to extract n-grams, then pooling to combine over the whole sentence
RNN: output the result from the last hidden state, which biases the representation toward the last hidden state
·Extended the previous state-of-the-art in sentiment analysis by a large margin.
·Best performing out of a family of recursive networks (Recursive Autoencoders, Socher et al., 2011; Matrix-Vector Recursive Neural Networks, Socher et al., 2012).
·Composition function is expressed as a tensor; each slice of the tensor encodes a different composition.
·Can discern negation at different scopes.
Figure above from: Socher et al., “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”, EMNLP 2013
·Need parse trees to be computed beforehand.
·Phrase-level classification is expensive to obtain.
·Hard to adapt to other domains (e.g. Twitter).
$c_j=m^{T}s_{j-m+1:j}$
·$m\in\mathbb{R}^{m}$: the convolutional filter (kernel)
·$s\in\mathbb{R}^{n}$: a sequence of length n
·$s_i\in\mathbb{R}$: the single feature value of the i-th word in the sequence s
Narrow convolution: $c_{length}=s_{length}-m_{length}+1$, which is just the ordinary convolution
Wide convolution: $c_{length}=s_{length}+m_{length}-1$
In the paper's example, s has 7 words, so $s_{length}=7$, and the kernels used all have size 5.
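The two output lengths can be checked with a minimal one-dimensional sketch. It assumes wide convolution is implemented by zero-padding the sequence with $m_{length}-1$ zeros on each side; the function name `conv1d` and the toy values are illustrative.

```python
def conv1d(s, m, wide=False):
    """Dot the filter m with every window of s (c_j = m^T s_{j-m+1:j}).

    Narrow: only windows that lie fully inside s.
    Wide: s is zero-padded with len(m) - 1 zeros on both sides.
    """
    k = len(m)
    if wide:
        pad = [0.0] * (k - 1)
        s = pad + list(s) + pad
    return [sum(w * x for w, x in zip(m, s[j:j + k]))
            for j in range(len(s) - k + 1)]

s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # 7 "words", one feature each
m = [1.0, 1.0, 1.0, 1.0, 1.0]             # filter of width 5

narrow = conv1d(s, m)             # length 7 - 5 + 1 = 3
wide = conv1d(s, m, wide=True)    # length 7 + 5 - 1 = 11
```

The narrow result is the middle slice of the wide one (`wide[4:7]`). The extra wide outputs come from windows that overlap the padding, which is how the words at the margins get seen by every filter weight.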
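The sentence matrix is built from word embeddings and has size d×s, where d is the embedding dimension and s is the number of words in the sentence. A kernel spanning m-grams then extracts features (the first arrow). Sentences differ in length, but the max pooling stage keeps only the most important feature in each row, which removes the dependence on sentence length.
$w_i\in\mathbb{R}^{d}$: a vector of dimension d
$c_i\in\mathbb{R}^{d}$: a vector of dimension d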
·Sensitive to the order of the words
·Does not depend on external language-specific features such as dependency or constituency parse trees
·Gives largely uniform importance to the signal coming from each word in the sentence (except words at the margins)
·Range of the feature detectors is limited to the span of the weights
·Increasing m makes the detector larger but exacerbates the neglect of the margins
·Increasing m also raises the minimum size of the input sentence required by the convolution
Max pooling
·Cannot distinguish whether a relevant feature in one of the rows occurs just one time or more
·Forget the order in which the features occur
Wide convolution: all the weights in the filter reach the entire sentence (including the words at the margins)
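k-max pooling takes the k largest values and keeps them in the order in which they occur. Taking only the single maximum ignores the relations between words in the sentence (a single feature has no relations to speak of), and keeping the k largest fixes this. Preserving the order also fixes the drawback that max pooling in the Max-TDNN cannot retain word order.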
·Given a value k and a sequence $p\in\mathbb{R}^{p}$ of length $p\ge k$
·Select the subsequence $p_{max}^{k}$ of the k highest values of p
·Keep the original order
·After the topmost convolutional layer
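The selection rule in the list above can be sketched on a single row of a feature map. The function name `k_max_pooling` is an assumption, and ties are broken in favour of the earlier position, which the notes do not specify.

```python
def k_max_pooling(p, k):
    """Return the k largest values of p, kept in their original order."""
    if k > len(p):
        raise ValueError("k-max pooling requires len(p) >= k")
    # Indices of the k largest values (stable sort: earlier position wins ties).
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    # Re-sort the chosen indices so the original order is preserved.
    return [p[i] for i in sorted(top)]

row = [1, 9, 3, 7, 2, 8]
pooled = k_max_pooling(row, 3)
single_max = k_max_pooling(row, 1)
```

Here `pooled` keeps 9, 7 and 8 in the order they appear in `row`, while `k = 1` reduces to ordinary max pooling and retains no information about order or multiplicity.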
The paper also describes a method called dynamic k-max pooling, in which the value of k varies instead of staying fixed.
$k_l=\max\left(k_{top},\left\lceil\frac{L-l}{L}s\right\rceil\right)$
·$l$ is the number of the current convolutional layer
·$L$ is the total number of convolutional layers in the network
·$k_{top}$ is the fixed topmost pooling parameter
e.g. a network with 3 convolutional layers, $k_{top}=3$ and sentence length $s=18$
the pooling parameter in the first layer is $k_1=12$
the pooling parameter in the second layer is $k_2=6$
the pooling parameter in the third layer is $k_3=k_{top}=3$
Detect the $l$-th order feature occurring at most $k_l$ times
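The worked example above can be reproduced with a short helper. The name `dynamic_k` is hypothetical, and the ceiling is computed with integer arithmetic to avoid floating-point rounding.

```python
def dynamic_k(l, L, s, k_top):
    """k_l = max(k_top, ceil((L - l) / L * s)) for layer l of L, sentence length s."""
    # Integer ceiling division: ceil(a / b) == (a + b - 1) // b for b > 0.
    scaled = ((L - l) * s + L - 1) // L
    return max(k_top, scaled)

# 3 convolutional layers, k_top = 3, sentence length s = 18.
ks = [dynamic_k(l, L=3, s=18, k_top=3) for l in (1, 2, 3)]
```

For short sentences the `max` with `k_top` takes over early. With `s = 4`, for example, every layer would pool to 3 values.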
$M=[\mathrm{diag}(m_{:,1}),\dots,\mathrm{diag}(m_{:,m})]$
$a=g\left(M\begin{bmatrix}w_j\\ \vdots\\ w_{j+m-1}\end{bmatrix}+b\right)$
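One way to read the block-diagonal form of $M$ is that it applies a separate one-dimensional filter to each embedding row. The sketch below builds $M$ explicitly and compares the result with a row-by-row computation. The function names and the toy numbers are assumptions, and `tanh` is used for $g$ only as an example nonlinearity.

```python
import math

def diag_block_matrix(filt):
    """Build M = [diag(filt[:, 0]), ..., diag(filt[:, m-1])], shape d x (d*m)."""
    d, m = len(filt), len(filt[0])
    M = [[0.0] * (d * m) for _ in range(d)]
    for i in range(m):
        for r in range(d):
            M[r][i * d + r] = filt[r][i]
    return M

def layer_via_M(filt, window, b, g=math.tanh):
    """a = g(M [w_j; ...; w_{j+m-1}] + b) with the window's vectors stacked."""
    M = diag_block_matrix(filt)
    stacked = [x for w in window for x in w]
    return [g(sum(M[r][c] * stacked[c] for c in range(len(stacked))) + b[r])
            for r in range(len(M))]

def layer_rowwise(filt, window, b, g=math.tanh):
    """Same value computed as one independent 1D filter per embedding row."""
    return [g(sum(filt[r][i] * window[i][r] for i in range(len(window))) + b[r])
            for r in range(len(filt))]

filt = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]        # d = 2 rows, width m = 3
window = [[1.0, 2.0], [0.0, 1.0], [-1.0, 3.0]]    # w_j, w_{j+1}, w_{j+2}
b = [0.1, -0.2]

a_matrix = layer_via_M(filt, window, b)
a_rows = layer_rowwise(filt, window, b)
```

Both computations give the same activations, because every off-diagonal entry of each block is zero. Filling in those zeros would let one embedding dimension influence another.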
Convolve multiple feature maps into the next layer to form higher-order features.
$F_j^{i+1}=\sum_{k=1}^{n}m_{j,k}^{i+1}*F_k^{i}$
$F_1^{i},\dots,F_n^{i}$: the $i$-th order feature maps in the same layer
$m_{j,k}^{i+1}$ forms an order-4 tensor
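The sum over feature maps can be sketched as follows, taking $*$ to be a row-wise wide convolution and leaving out the bias, nonlinearity and pooling. The function names are hypothetical, and the tiny maps have a single row only to keep the numbers easy to follow.

```python
def wide_conv(row, filt):
    """Wide 1D convolution of one row: zero-pad by len(filt) - 1 on each side."""
    k = len(filt)
    pad = [0.0] * (k - 1)
    x = pad + list(row) + pad
    return [sum(w * v for w, v in zip(filt, x[j:j + k]))
            for j in range(len(x) - k + 1)]

def conv_map(F, filt):
    """Row-wise wide convolution of a d x s feature map with a d x m filter."""
    return [wide_conv(F_row, f_row) for F_row, f_row in zip(F, filt)]

def next_layer_map(prev_maps, filters_j):
    """F_j^{i+1} = sum_k m_{j,k}^{i+1} * F_k^i for one output map j."""
    convolved = [conv_map(F, filt) for F, filt in zip(prev_maps, filters_j)]
    d, width = len(convolved[0]), len(convolved[0][0])
    return [[sum(c[r][t] for c in convolved) for t in range(width)]
            for r in range(d)]

# Two lower-order maps (d = 1, s = 3) and the filters feeding output map j.
prev_maps = [[[1.0, 2.0, 3.0]], [[1.0, 1.0, 1.0]]]
filters_j = [[[1.0, 0.0]], [[0.0, 1.0]]]
F_next = next_layer_map(prev_maps, filters_j)
```

Indexing the filters by output map j, input map k, embedding row and filter position gives the four axes of the order-4 tensor.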
·individual row can have many orders
·Different rows are independent of each other
·Sum every two rows in a feature map component-wise
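The last bullet describes the folding operation. A minimal sketch follows; the function name `fold` and the requirement of an even number of rows are assumptions.

```python
def fold(F):
    """Sum every two consecutive rows of a d x s feature map component-wise."""
    if len(F) % 2:
        raise ValueError("folding assumes an even number of rows")
    return [[a + b for a, b in zip(F[r], F[r + 1])]
            for r in range(0, len(F), 2)]

F = [[1, 2], [3, 4], [5, 6], [7, 8]]   # d = 4 rows, s = 2 columns
folded = fold(F)                       # d / 2 = 2 rows remain
```

Folding halves the number of rows without adding parameters, and it is the step that mixes information between rows that are otherwise independent.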
·Sensitive to the order of the words in the input sentence
·Discriminate whether a specific n-gram occurs in the input(wide convolution)
·Tell the relative position of the most relevant n-grams(dynamic k-max pooling)
·NBoW models are insensitive to word order
·RNN is sensitive to word order but has a bias towards the latest words it takes as input
·Recursive Neural Networks are sensitive to word order but have a bias towards the topmost nodes in the tree
·Max-TDNN is sensitive to word order but max pooling only picks a single n-gram feature in each row of the sentence matrix
·DCNN induces an internal complex feature graph over the input
·NBoW is a shallow model
·RNN has a linear chain structure
·Max-TDNN induces a single fixed-range feature subgraph
·Recursive Neural Networks follow the structure of an external parse tree (clear hierarchy of feature orders)
1.Sentiment Prediction in Movie Reviews
2.Question Type Classification
3.Twitter Sentiment Prediction with Distant Supervision
·Somewhat positive
·Somewhat negative
·Six question types
·Numeric information
·Automatically labelled as positive or negative depending on the emoticon
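How to widen the scanning range of feature detection: wide convolution
How to capture more complex feature relations in a sentence: dynamic k-max pooling
a. What does Equation 5 in the paper mean, why are diagonal matrices concatenated, and what are the derivation steps?
b. Why is it said that filling in the matrix in Equation 5 lets the convolution take relations between different dimensions into account?
c. If a residual connection were used here, how would it be applied, and how exactly would the formulas change?
2. Put away all reference material and write the Dynamic CNN network from scratch.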