Comparing ConvLSTM in Two Papers

[This post focuses on analyzing the modified network architecture; the other parts are not covered thoroughly.]

1、《Deep Learning Approach for Sentiment Analysis of Short Texts》

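Learning long-term dependencies with gradient descent is difficult in neural network language models because of the vanishing gradient problem.

In other words, gradient-based training struggles to learn long-range dependencies because the gradient shrinks as it is propagated back through many steps.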
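
A small numeric illustration of the vanishing gradient problem. It assumes a scalar tanh RNN, h_t = tanh(w * h_{t-1} + x_t), which is a toy stand-in and not the model from the paper; the weight 0.5 and the inputs are made up for the example.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return grad

# Hypothetical weight and inputs: the gradient shrinks as the sequence gets longer.
short = gradient_through_time(0.5, [0.1] * 5)
long = gradient_through_time(0.5, [0.1] * 50)
print(short, long)
```

Every factor in the product has magnitude at most |w|, so with |w| < 1 the gradient decays at least geometrically with the sequence length.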

In our experiments, ConvLstm exploits LSTM as a substitute for the pooling layer in CNN to reduce the loss of detailed local information and capture long-term dependencies in sequences of sentences.

That is, the LSTM takes the place of the CNN's pooling layer, so fine-grained local features are kept and long-range dependencies across the sentence sequence can still be captured.

Datasets: IMDB, Stanford Sentiment Treebank (SSTb)

Experimental results:
Empirical results show that ConvLstm achieved comparable performance with fewer parameters on sentiment analysis tasks.
In short: comparable accuracy on sentiment analysis, with a smaller parameter count.

Common approach:
Recent work by [10] consists of multiple layers of CNN and max pooling, similar to the architecture proposed by [11] in computer vision.
That is, a stack of CNN layers combined with max pooling.

The result is flattened to form a vector, which feeds into a small number of fully connected layers followed by a classification layer.

Shortcomings of the common approach:

We observed that employing convolutions to capture long-term dependencies requires many layers, because of the locality of convolution and pooling.
Each convolution or pooling step only sees a small local window, so covering a long span takes many stacked layers.
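
To make "requires many layers" concrete, here is a rough sketch that computes the receptive field of stacked 1D layers with the standard receptive-field recurrence. The kernel size 3 and the sequence length 100 are illustrative assumptions, not values taken from the paper.

```python
def receptive_field(layers):
    """Receptive field of stacked 1D layers given as (kernel_size, stride) pairs."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the window by (kernel - 1) input steps
        jump *= stride                # strides (e.g. pooling) enlarge the step between outputs
    return field

def conv_layers_needed(seq_len, kernel=3):
    """Smallest number of stride-1 conv layers whose receptive field covers seq_len tokens."""
    n = 0
    while receptive_field([(kernel, 1)] * n) < seq_len:
        n += 1
    return n

# Hypothetical setting: width-3 convolutions over a 100-token input.
print(conv_layers_needed(100))
# Two blocks of conv(3) followed by pool(2, stride 2) still see only a few tokens.
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))
```

Under these assumptions a convolution-only stack needs dozens of layers before one output position can see the whole input, whereas a recurrent layer reads the full sequence in a single layer.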

As the length of the input grows, this becomes crucial;

It confirmed that RNN is able to capture long-term dependencies even in the case of a single layer.
That is, a single recurrent layer is already enough to capture long-term dependencies.

Our work is also inspired by the fact that recurrent layers are able to capture long-term dependencies with one single layer [14].

What the model does:
In our model, we utilize a recurrent LSTM layer as a substitute for the pooling layer in order to reduce the loss of detailed local information and capture long-term dependencies. Surprisingly, our model achieved comparable results on two sentiment analysis benchmarks with fewer parameters. We show that it is possible to use a much smaller model to achieve the same level of classification performance when a recurrent layer is combined with a convolutional layer.

Summary: the LSTM recurrent layer replaces the pooling layer to limit the loss of detailed local information and to capture long-term dependencies. The model reaches comparable results on two sentiment analysis benchmarks with fewer parameters, which suggests that a much smaller model can match the same classification performance once a recurrent layer is combined with a convolutional layer.

Network architecture:

The convolutional layer can extract high-level features from the input sequence efficiently.

The recurrent LSTM layer has the ability to remember important information across long stretches of time and is also able to capture long-term dependencies.

Our work is inspired by [3, 10, 12]

Then we employ a recurrent layer as an alternative to pooling layers to efficiently capture long-term dependencies for text
classification tasks. Our model achieved competitive results on multiple benchmarks with fewer parameters.

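A plain RNN is a biased model, because more recent inputs carry more weight than earlier ones,
while the key components can appear anywhere in the sequence, not only among the most recent tokens and not only at the end,
so the LSTM model is introduced to overcome these difficulties.

LSTM is a more complex unit: it controls the flow of information, guards against vanishing gradients, and lets the recurrent layer capture long-term dependencies more easily.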
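
How the gates control the flow of information can be sketched with one step of a textbook LSTM cell. This is a scalar toy version with made-up parameters, not the paper's implementation; the dictionary layout and the gate names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Hypothetical parameters: forget gate held open, input gate held shut.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 0.0),
}
h, c = 0.0, 0.7
for x in [1.0] * 100:
    h, c = lstm_step(x, h, c, params)
print(c)  # the cell state is carried almost unchanged across 100 steps
```

Because the cell state is updated additively (f * c + i * g), an open forget gate lets information and gradients pass through many time steps without being repeatedly squashed.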

Use a CNN to produce the feature maps,
then feed them into a single LSTM layer.

We devoted extra time to tuning the learning rate, dropout, and the number of units in the convolutional layer,
since these hyper-parameters have a large impact on the prediction performance.
In other words, extra effort went into tuning the learning rate, dropout, and the number of units in the convolutional layer.

The number of epochs varies between (5, 10) for both datasets.
Number of epochs: between 5 and 10.

We believe that by adding a recurrent layer as a substitute to the pooling layer it can effectively reduce the number of the convolutional layers needed in the model in order to capture long-term dependencies.

Network architecture:
Therefore, we consider merging a convolutional and a recurrent layer in one single model, ConvLstm, with multiple filter widths (3, 4, 5) and feature maps = 256. For the activation functions in the convolutional layer we used rectified linear units (ReLUs) for nonlinearity; padding was set to zero.
All elements that would fall outside the matrix are taken to be zero.
To reduce overfitting we applied dropout 0.5 only before the recurrent layer.
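
The zero-padding rule ("elements that would fall outside the matrix are taken to be zero") can be sketched with a single-filter 1D convolution followed by ReLU. This is a minimal pure-Python sketch; the function name, the left-padding choice of (width - 1) // 2, and the toy one-dimensional embeddings are assumptions, not details from the paper.

```python
def conv1d_same(seq, kernel, bias=0.0):
    """Slide one filter over seq (a list of embedding vectors) with zero padding and ReLU.

    kernel is a list of `width` weight vectors. Positions outside seq count as zeros,
    so the output has exactly one value per input token.
    """
    width = len(kernel)
    left = (width - 1) // 2  # assumed split of the padding between left and right
    out = []
    for t in range(len(seq)):
        total = bias
        for j in range(width):
            pos = t + j - left
            if 0 <= pos < len(seq):  # out-of-range positions contribute zero
                total += sum(w * x for w, x in zip(kernel[j], seq[pos]))
        out.append(max(0.0, total))  # ReLU nonlinearity
    return out

# Toy input: four tokens with one-dimensional embeddings.
tokens = [[1.0], [2.0], [3.0], [4.0]]
for width in (3, 4, 5):
    feature_map = conv1d_same(tokens, [[1.0]] * width)
    print(width, feature_map)
```

With this padding every filter width yields a feature map as long as the input, so the feature maps for widths 3, 4, and 5 could be stacked per position and read by the recurrent layer as one sequence.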

The main contribution of our work is to exploit recurrent layers as substitutes for the pooling layer;
a key aspect of CNNs is their pooling layers, which are typically applied after the convolutional layers to subsample their input. A max operation is the most common way to do the pooling operation.
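
What max pooling throws away can be seen on a toy feature map. The sketch below shows max-over-time pooling and windowed max pooling in plain Python; the feature values are made up for the example.

```python
def max_over_time(feature_map):
    """Collapse a whole feature map to its single largest activation."""
    return max(feature_map)

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [max(feature_map[i:i + size]) for i in range(0, len(feature_map), size)]

forward = [0.1, 0.9, 0.3, 0.2]
backward = list(reversed(forward))

# Both orderings give the same pooled value, so the position of the feature is lost.
print(max_over_time(forward), max_over_time(backward))
# Windowed pooling halves the sequence and drops the smaller value in each pair.
print(max_pool([1, 5, 2, 2, 7, 3]))
```

An LSTM that reads the feature map position by position keeps the ordering, which is the motivation for using it in place of pooling.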

Our LSTM has input, forget, and output gates; the hidden state dimension is set to 128.

In our model we set the number of filters in the convolutional layers to be 2x the dimension of the hidden states in the recurrent layer, which adds 3%-6% relative performance.
That is, with a hidden state dimension of 128 the convolutional layer uses 256 filters, and this 2:1 ratio is reported to add 3%-6% relative performance.
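
A back-of-the-envelope parameter count for these sizes, using the standard formulas for an LSTM layer and a 1D convolution. The embedding dimension of 300 is an assumption (the notes do not state it), and real implementations may add extra bias terms.

```python
def lstm_params(input_dim, hidden_dim):
    """Weights of a standard LSTM layer: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def conv1d_params(width, embed_dim, num_filters):
    """Weights of a 1D convolution over word embeddings, one bias per filter."""
    return width * embed_dim * num_filters + num_filters

hidden_dim = 128
num_filters = 2 * hidden_dim   # the 2x ratio described in the text
embed_dim = 300                # assumed embedding size, not given in the notes

print(lstm_params(num_filters, hidden_dim))
print([conv1d_params(w, embed_dim, num_filters) for w in (3, 4, 5)])
```

Note that widening the convolution only grows the LSTM's input weights (the `hidden_dim * input_dim` term), while the recurrent weights stay fixed by the hidden dimension.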

Dropout is an effective way to regularize deep neural networks [3].
We observe that applying dropout both before and after the recurrent layer decreases the performance of the model by 2%-5%; therefore, we only apply dropout after the recurrent layer, and we set the dropout to 0.5.
Dropout prevents co-adaptation of hidden units by randomly dropping out

Add one Dropout layer.
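
A minimal sketch of what a dropout layer with rate 0.5 does, using the common "inverted dropout" formulation. The paper does not say which formulation it uses, so the rescaling by 1 / (1 - rate) is an assumption, as are the function name and the seeded random generator.

```python
import random

def dropout(values, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(values)  # at evaluation time the layer is the identity
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Seeded generator so the example is repeatable.
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
print(sum(1 for v in dropped if v != 0.0), "units kept out of", len(dropped))
print(dropout([1.0, 2.0], training=False))
```

Rescaling the surviving units keeps the expected activation unchanged, so nothing needs to be adjusted at test time.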

We train the model by minimizing the negative log-likelihood or cross entropy loss.
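That is, the training objective is the negative log-likelihood, which for a softmax classifier is the same thing as the cross-entropy loss.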
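
The loss for one example can be written in a few lines. This is a generic sketch of softmax cross-entropy, not code from the paper; the logits below are made-up scores for a two-class sentiment problem.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def nll_loss(logits, target):
    """Negative log-likelihood of the target class under the softmax distribution."""
    return -log_softmax(logits)[target]

# Hypothetical scores for (negative, positive).
print(nll_loss([0.0, 0.0], target=0))   # uniform prediction gives log(2)
print(nll_loss([5.0, 0.0], target=0))   # confident and correct gives a small loss
print(nll_loss([5.0, 0.0], target=1))   # confident and wrong gives a large loss
```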

The gradient of the cost function is computed with backpropagation through time (BPTT).
That is, the recurrent layer is unrolled over the time steps and the gradient is backpropagated through them.

The accuracy of the model does not increase with an increase in the number of convolutional layers; one layer is enough to peak the model [3], and more pooling layers mostly lead to the loss of long-term dependencies [12].
Thus, in our model we omitted the pooling layer in the convolutional network and replaced it with a single LSTM layer to reduce the loss of local information; a single recurrent layer is enough to capture long-term dependencies in the model.

Summary: stacking more convolutional layers does not raise accuracy (one layer is enough [3]), and extra pooling layers mostly destroy long-term dependencies [12].
The model therefore drops the pooling layer in favor of a single LSTM layer, which limits the loss of local information and is enough on its own to capture long-term dependencies.

We perform several experiments to offer a fair comparison to recently presented deep learning and traditional methods, as shown in Table II and III. For the IMDB dataset the previous baselines are bag-of-words [41] and Paragraph Vectors [39].

Our ConvLstm model achieves comparable performance with significantly fewer parameters. We achieved better results compared to convolutional-only models;
those likely lose detailed local features because of the number of pooling layers.
We assumed that the proposed model is more compact because of the small number of parameters and less disposed to overfitting.
Hence, it generalizes better when the training size is limited.
It is possible to use more filters in the convolutional layer without changing the dimension of the recurrent layer, which potentially increases the performance by 2%-4% without sacrificing the number of parameters.
We observed that many factors affect the performance of deep learning methods, such as the dataset size and the vanishing and exploding of the gradient; choosing the best feature extractors and classifiers is still an open research area.
However, there is no specific model that fits all types of datasets.

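Summary of the results discussion:

The experiments compare ConvLstm against recent deep learning methods and traditional methods (Tables II and III); on IMDB the earlier baselines are bag-of-words [41] and Paragraph Vectors [39].
ConvLstm matches their performance with far fewer parameters and beats convolution-only models,
which probably lose detailed local features across their pooling layers.
With fewer parameters the model is assumed to be more compact and less prone to overfitting,
so it generalizes better when training data is limited.
Adding filters to the convolutional layer while keeping the recurrent layer's dimension fixed may raise performance by 2%-4%, reportedly without sacrificing the parameter count.
Dataset size, vanishing and exploding gradients, and the choice of feature extractors and classifiers all affect performance, and the latter remains an open research area.
No single model fits every type of dataset.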

In this paper, we proposed a neural language model to overcome the shortcomings of traditional and deep learning methods.
We propose to combine the convolutional and recurrent layers into a single model on top of pre-trained word vectors, to capture long-term dependencies in short texts more efficiently.
We validated the proposed model on the SSTb and IMDB datasets.
We achieved comparable results with fewer convolutional layers compared to the convolutional-only architecture, and our results confirm that unsupervised pre-training of word vectors is a significant feature in deep learning for NLP. Also, using LSTM as an alternative to the pooling layers in CNN enhances the model's ability to capture long-term dependencies.
It would be worthwhile for future research to apply the architecture to other NLP applications such as spam filtering and web search. Using other variants of recurrent neural networks as substitutes for pooling layers is also an area worth exploring.
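
Summary of the conclusion:
The paper proposes a neural language model meant to overcome shortcomings of traditional and deep learning methods.
It merges a convolutional layer and a recurrent layer into one model on top of pre-trained word vectors, to capture long-term dependencies in short texts more efficiently.
The model is validated on the SSTb and IMDB datasets.
It reaches comparable results with fewer convolutional layers than convolution-only architectures. The results also confirm that unsupervised pre-training of word vectors matters for deep learning in NLP, and that replacing CNN pooling layers with an LSTM improves the capture of long-term dependencies.
Future work: apply the architecture to other NLP applications such as spam filtering and web search, and try other recurrent network variants in place of the pooling layers.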

2、《》
