【Paper】CNN-LSTM:Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Venue: CVPR 2015 (oral)
Citations: 3673 (as of 04/24/20)


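This paper is the origin of the CNN-LSTM architecture, used mainly for generating image descriptions. The first draft was released in 2014 and was accepted as a CVPR oral; the fourth version (the one covered here) was released in 2016.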


Contents

  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  • Abstract
  • 1 INTRODUCTION
  • 2 BACKGROUND: RECURRENT NETWORKS
  • 3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN) MODEL
  • 5 IMAGE CAPTIONING (omitted)
  • 6 VIDEO DESCRIPTION (omitted)
  • 7 RELATED WORK
    • 7.1 Prior Work
      • 7.1.1 Activity Recognition
      • 7.1.2 Image Captioning
      • 7.1.3 Video Description
    • 7.2 Contemporaneous and Subsequent Work
      • 7.2.1 Activity Recognition
      • 7.2.2 Image Captioning
      • 7.2.3 Video Description
      • 7.2.4 Visual Grounding
      • 7.2.5 Natural Language Object Retrieval
  • 8 CONCLUSION


Long-term Recurrent Convolutional Networks for Visual Recognition and Description

用于视觉识别和描述的长期递归卷积网络


Abstract

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

基于深度卷积网络的模型主导了最近的图像解释任务;我们研究了同时具有循环结构的模型是否对涉及序列(视觉序列及其他序列)的任务有效。我们描述了一类循环卷积体系结构,它是端到端可训练的,适合大规模的视觉理解任务,并展示了这些模型对活动识别,图像字幕和视频描述的价值。与先前假定固定视觉表示或对序列处理仅执行简单时间平均的模型相比,循环卷积模型是“双重深度”的,因为它们在空间和时间上都学习组合表示。当非线性被合并到网络状态更新中时,学习长期依赖性是可能的。可微的循环模型之所以吸引人,是因为它们可以将可变长度的输入(例如视频)直接映射到可变长度的输出(例如自然语言文本),并且可以对复杂的时间动态建模,同时仍可以通过反向传播进行优化。我们的循环序列模型直接连接到现代视觉卷积网络模型,可以联合训练以学习时间动态和卷积感知表示。我们的结果表明,与单独定义或优化的识别或生成的最新模型相比,此类模型具有明显的优势。


1 INTRODUCTION

Recognition and description of images and videos is a fundamental challenge of computer vision. Dramatic progress has been achieved by supervised convolutional neural network (CNN) models on image recognition tasks, and a number of extensions to process video have been recently proposed. Ideally, a video model should allow processing of variable length input sequences, and also provide for variable length outputs, including generation of full length sentence descriptions that go beyond conventional one-versus-all prediction tasks. In this paper we propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable (Figure 1). We instantiate our architecture for specific video activity recognition, image caption generation, and video description tasks as described below.

图像和视频的识别和描述是计算机视觉的基本挑战。监督卷积神经网络(CNN)模型在图像识别任务上已经取得了令人瞩目的进展,最近还提出了许多扩展的视频处理方法。理想情况下,视频模型应允许处理可变长度的输入序列,并且还应提供可变长度的输出,包括生成超越传统“一对多”预测任务的完整句子描述。在本文中,我们提出了长期递归卷积网络(LRCN),这是一类用于视觉识别和描述的体系结构,该体系结构将卷积层和远程时间递归相结合,并且是端到端可训练的(图1)。我们为特定的视频活动识别,图像标题生成和视频描述任务实例化我们的体系结构,如下所述。
Fig. 1. We propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs. LRCN processes the (possibly) variable-length visual input (left) with a CNN (middle left), whose outputs are fed into a stack of recurrent sequence models (LSTMs, middle-right), which finally produce a variable-length prediction (right). Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.

图1.我们提出了长期递归卷积网络(LRCN),这是一类利用CNN的快速发展优势解决视觉识别问题的体系结构,并且人们越来越希望将这种模型应用于时变的输入和输出。 LRCN使用CNN(中间 左)处理(可能)可变长度视觉输入(左),其输出被馈入一堆递归序列模型(LSTM,右中),最终产生可变长度预测(右)。 CNN和LSTM权重均在时间上共享,因此表示形式可缩放为任意长序列。

Research on CNN models for video processing has considered learning 3D spatio-temporal filters over raw sequence data [1], [2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [3], [4]. Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling. Following the same inspiration that motivates current deep convolutional models, we advocate for video recognition and description models which are also deep over temporal dimensions; i.e., have temporal recurrence of latent variables. Recurrent Neural Network (RNN) models are “deep in time” – explicitly so when unrolled – and form implicit compositional representations in the time domain. Such “deep” models predated deep spatial convolution models in the literature [5], [6].

对用于视频处理的CNN模型的研究已经考虑了对原始序列数据[1],[2]的3D时空滤波器的学习,以及对帧到帧表示的学习,其中结合了在固定窗口上聚合的基于瞬时光流或基于轨迹的模型或视频镜头片段[3],[4]。这样的模型探索了感知时间序列表示学习的两个极端:要么学习完全通用的时变加权(fully general time-varying weighting),要么应用简单的时序汇合(temporal pooling)。遵循激发当前深度卷积模型的相同灵感,我们提倡视频识别和描述模型,它们在时间维度上也很深。即具有潜在变量的时间重复性。递归神经网络(RNN)模型是“深度时间”的-展开时明确如此-并在时域中形成隐式组成表示。这种“深度”模型早于文献[5],[6]中的深度空间卷积模型。

The use of RNNs in perceptual applications has been explored for many decades, with varying results. A significant limitation of simple RNN models which strictly integrate state information over time is known as the “vanishing gradient” effect: the ability to backpropagate an error signal through a long-range temporal interval becomes increasingly difficult in practice. Long Short-Term Memory (LSTM) units, first proposed in [7], are recurrent modules which enable long-range learning. LSTM units have hidden state augmented with nonlinear mechanisms to allow state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs have recently been demonstrated to be capable of large-scale learning of speech recognition [8] and language translation models [9], [10].

几十年来,人们一直在探索RNNs在感知应用中的应用,结果各不相同。随着时间的推移,严格整合状态信息的简单RNN模型的一个显著局限性被称为“消失梯度”效应:通过长时间间隔反向传播错误信号的能力在实践中变得越来越困难。长-短期记忆单元(LSTM)是[7]中首次提出的一种能够实现长距离学习的递归模块。LSTM单元具有非线性机制增强的隐藏状态,允许状态传播而无需修改、更新或重置,使用简单的学习门控函数。LSTMs最近被证明能够大规模学习语音识别[8]和语言翻译模型[9],[10]。

We show here that convolutional networks with recurrent units are generally applicable to visual time-series modeling, and argue that in visual tasks where static or flat temporal models have previously been employed, LSTM style RNNs can provide significant improvement when ample training data are available to learn or refine the representation. Specifically, we show that LSTM type models provide for improved recognition on conventional video activity challenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language descriptions. We also show that these models improve generation of descriptions from intermediate visual representations derived from conventional visual models.

我们在这里证明了具有递归单元的卷积网络通常适用于视觉时间序列建模,并且认为在以前使用静态或平坦时间模型的视觉任务中,当有足够的训练数据来学习或改进表示时,LSTM风格的RNNs可以提供显著的改进。具体来说,我们表明LSTM类型的模型提供了对传统视频活动挑战的改进识别,并实现了从图像像素到句子级自然语言描述的端到端优化映射。我们还表明,这些模型改进了从传统视觉模型派生的中间视觉表示的描述生成。

We instantiate our proposed architecture in three experimental settings (Figure 3). First, we show that directly connecting a visual convolutional model to deep LSTM networks, we are able to train video recognition models that capture temporal state dependencies (Figure 3 left; Section 4). While existing labeled video activity datasets may not have actions or activities with particularly complex temporal dynamics, we nonetheless observe significant improvements on conventional benchmarks.

我们在三个实验环境中实例化了我们提出的架构(图3)。首先,我们展示了将视觉卷积模型直接连接到深层LSTM网络,我们能够训练捕获时间状态依赖的视频识别模型(图3左;第4节)。虽然现有的标记视频活动数据集可能没有具有特别复杂的时间动态的动作或活动,但我们仍然观察到对传统基准的显著改进。

Second, we explore end-to-end trainable image to sentence mappings. Strong results for machine translation tasks have recently been reported [9], [10]; such models are encoder-decoder pairs based on LSTM networks. We propose a multimodal analog of this model, and describe an architecture which uses a visual convnet to encode a deep state vector, and an LSTM to decode the vector into a natural language string (Figure 3 middle; Section 5). The resulting model can be trained end-to-end on large-scale image and text datasets, and even with modest training provides competitive generation results compared to existing methods.

其次,我们探索端到端可训练的图像到句子映射。机器翻译任务的强结果最近被报道[9],[10];这类模型是基于LSTM网络的编码器-解码器对。我们提出了该模型的一种多模式模拟,并描述了一种使用visual convnet对深状态向量进行编码,使用LSTM将向量解码为自然语言字符串的体系结构(图3中间;第5节)。所得到的模型可以在大规模的图像和文本数据集上进行端到端的训练,即使是适度的训练,也能提供与现有方法相比具有竞争力的生成结果。

Finally, we show that LSTM decoders can be driven directly from conventional computer vision methods which predict higher-level discriminative labels, such as the semantic video role tuple predictors in [11] (Figure 3, right; Section 6). While not end-to-end trainable, such models offer architectural and performance advantages over previous statistical machine translation-based approaches.

最后,我们展示了LSTM解码器可以直接从预测更高级别区分标签的传统计算机视觉方法驱动,例如[11]中的语义视频角色元组预测(图3,右;第6节)。虽然这种模型不是端到端可培训的,但与以前基于统计机器翻译的方法相比,它具有架构和性能优势。

We have realized a generic framework for recurrent models in the widely adopted deep learning framework Caffe [12], including ready-to-use implementations of RNN and LSTM units. (See http://jeffdonahue.com/lrcn/.)

在广泛采用的深度学习框架Caffe[12]中,我们实现了一个用于递归模型的通用框架,包括RNN和LSTM单元的现成实现。(见http://jeffdonahue.com/lrcn/)

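Fig. 2. A diagram of a basic RNN cell (left) and an LSTM memory cell (right) used in this paper (from [13], a slight simplification of the architecture described in [14], which was derived from the LSTM initially proposed in [7]).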


2 BACKGROUND: RECURRENT NETWORKS

Traditional recurrent neural networks (RNNs, Figure 2, left) model temporal dynamics by mapping input sequences to hidden states, and hidden states to outputs via the following recurrence equations (Figure 2, left):

传统的递归神经网络(RNN,图2,左)通过以下顺序方程将输入序列映射到隐藏状态,将隐藏状态映射到输出,从而对时间结构进行建模(图2,左):

$$h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$z_t = g(W_{hz} h_t + b_z)$$

where $g$ is an element-wise non-linearity, such as a sigmoid or hyperbolic tangent, $x_t$ is the input, $h_t \in \mathbb{R}^N$ is the hidden state with $N$ hidden units, and $z_t$ is the output at time $t$. For a length $T$ input sequence $(x_1, x_2, ..., x_T)$, the updates above are computed sequentially as $h_1$ (letting $h_0 = 0$), $z_1, h_2, z_2, ..., h_T, z_T$.

其中 $g$ 是逐元素的非线性函数(激活函数),例如 sigmoid 或双曲正切,$x_t$ 是输入,$h_t \in \mathbb{R}^N$ 是具有 $N$ 个隐藏单元的隐藏状态,$z_t$ 是在时间 $t$ 的输出。对于长度为 $T$ 的输入序列 $(x_1, x_2, ..., x_T)$,以上更新按顺序依次计算为 $h_1$(令 $h_0 = 0$),$z_1, h_2, z_2, ..., h_T, z_T$。
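The sequential order of the updates can be made concrete with a small sketch. It assumes the standard recurrence $h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$, $z_t = g(W_{hz} h_t + b_z)$ with $g = \tanh$; the helper names (`matvec`, `rnn_forward`) and the toy weights are hypothetical and chosen only for illustration.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, b_h, W_hz, b_z, g=math.tanh):
    """Run the recurrence over a whole sequence, starting from h_0 = 0."""
    h = [0.0] * len(b_h)
    hs, zs = [], []
    for x in xs:  # strictly sequential: h_t needs h_{t-1}
        pre_h = [a + b + c for a, b, c in
                 zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [g(p) for p in pre_h]
        z = [g(p + b) for p, b in zip(matvec(W_hz, h), b_z)]
        hs.append(h)
        zs.append(z)
    return hs, zs

# Toy sizes (assumed): 2-d inputs, N = 3 hidden units, 1-d output, T = 3.
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
b_h = [0.0, 0.0, 0.0]
W_hz = [[1.0, -1.0, 0.5]]
b_z = [0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hs, zs = rnn_forward(xs, W_xh, W_hh, b_h, W_hz, b_z)
```

The loop produces $h_1, z_1, h_2, z_2, ..., h_T, z_T$ in exactly that order; the same weights are reused at every step, so the code works unchanged for any sequence length.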

Though RNNs have proven successful on tasks such as speech recognition [15] and text generation [16], it can be difficult to train them to learn long-term dynamics, likely due in part to the vanishing and exploding gradients problem [7] that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular time step. LSTMs provide a solution by incorporating memory units that explicitly allow the network to learn when to “forget” previous hidden states and when to update hidden states given new information. As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [13] (Figure 2, right), a slight simplification of the one described in [8], which was derived from the original LSTM unit proposed in [7].

尽管事实证明RNN在诸如语音识别[15]和文本生成[16]等任务上是成功的,但可能很难训练他们学习长期动态,这可能部分是由于梯度消失和爆炸问题[7]所致。可以通过在递归网络的多个层中向下传播梯度来获得结果,每个梯度对应于特定的时间步长。 LSTM通过合并内存单元提供了一种解决方案,该内存单元明确允许网络学习何时“忘记”先前的隐藏状态以及何时在给定新信息的情况下更新隐藏状态。随着对LSTM的研究的发展,已经提出了在存储单元内具有不同连接的隐藏单元。我们使用[13]中描述的LSTM单元(图2,右),对[8]中描述的LSTM单元进行了略微的简化,它源自[7]中提出的原始LSTM单元。

Letting $\sigma(x) = (1 + e^{-x})^{-1}$ be the sigmoid non-linearity which squashes real-valued inputs to a $[0,1]$ range, and letting $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ be the hyperbolic tangent non-linearity, similarly squashing its inputs to a $[-1,1]$ range, the LSTM updates for time step $t$ given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ are:

令 $\sigma(x) = (1 + e^{-x})^{-1}$ 为 sigmoid 非线性函数,它将实值输入压缩到 $[0,1]$ 范围,并令 $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$ 为双曲正切非线性函数,类似地将其输入压缩到 $[-1,1]$ 范围,则在给定输入 $x_t$、$h_{t-1}$ 和 $c_{t-1}$ 的情况下,LSTM 在时间步 $t$ 的更新为:
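$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$

$$g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh(c_t)$$
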
$x \odot y$ denotes the element-wise product of vectors $x$ and $y$.

$x \odot y$ 表示向量 $x$ 和 $y$ 的逐元素乘积。
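The identity $\tanh(x) = 2\sigma(2x) - 1$ quoted above follows from $2\sigma(2x) - 1 = \frac{1 - e^{-2x}}{1 + e^{-2x}}$, which becomes the usual $\tanh$ quotient after multiplying numerator and denominator by $e^{x}$. A quick numerical check, as a minimal sketch (the sample points are arbitrary):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Largest deviation between tanh(x) and 2*sigma(2x) - 1 over a few points.
points = (-3.0, -0.5, 0.0, 0.5, 3.0)
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
              for x in points)
```

The gap is at the level of floating-point rounding, which is why the two non-linearities differ only by a rescaling of input and output.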

In addition to a hidden unit $h_t \in \mathbb{R}^N$, the LSTM includes an input gate $i_t \in \mathbb{R}^N$, forget gate $f_t \in \mathbb{R}^N$, output gate $o_t \in \mathbb{R}^N$, input modulation gate $g_t \in \mathbb{R}^N$, and memory cell $c_t \in \mathbb{R}^N$. The memory cell unit $c_t$ is a sum of two terms: the previous memory cell unit $c_{t-1}$, which is modulated by $f_t$, and $g_t$, a function of the current input and previous hidden state, modulated by the input gate $i_t$. Because $i_t$ and $f_t$ are sigmoidal, their values lie within the range $[0,1]$, and $i_t$ and $f_t$ can be thought of as knobs that the LSTM learns to selectively forget its previous memory or consider its current input. Likewise, the output gate $o_t$ learns how much of the memory cell to transfer to the hidden state. These additional cells seem to enable the LSTM to learn complex and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state $h_t^{(\ell-1)}$ of the LSTM in layer $\ell - 1$ as the input to the LSTM in layer $\ell$.

除了隐藏单元 $h_t \in \mathbb{R}^N$,LSTM 还包括输入门 $i_t \in \mathbb{R}^N$,遗忘门 $f_t \in \mathbb{R}^N$,输出门 $o_t \in \mathbb{R}^N$,输入调制门 $g_t \in \mathbb{R}^N$ 和记忆单元 $c_t \in \mathbb{R}^N$。记忆单元 $c_t$ 是两项之和:由 $f_t$ 调制的前一记忆单元 $c_{t-1}$,以及由输入门 $i_t$ 调制的 $g_t$($g_t$ 是当前输入和前一隐藏状态的函数)。因为 $i_t$ 和 $f_t$ 是 sigmoid 型的,它们的值位于 $[0,1]$ 范围内,可以将 $i_t$ 和 $f_t$ 视为旋钮,LSTM 通过学习它们来选择性地遗忘先前的记忆或考虑当前输入。同样,输出门 $o_t$ 学习将记忆单元的多少内容传递给隐藏状态。这些额外的单元似乎使 LSTM 能够为各种序列学习和预测任务学习复杂的长期时间动态。通过将第 $\ell-1$ 层 LSTM 的隐藏状态 $h_t^{(\ell-1)}$ 用作第 $\ell$ 层 LSTM 的输入,可以将 LSTM 堆叠起来,从而增加深度。
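The gate description above can be turned into a single-step update. This is a minimal sketch, assuming the usual form of the LSTM equations ($i_t, f_t, o_t$ sigmoidal, $g_t$ a $\tanh$, $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, $h_t = o_t \odot \tanh(c_t)$); the `params` layout and the toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b for list-of-rows matrices."""
    return [sum(w * v for w, v in zip(rx, x)) +
            sum(w * v for w, v in zip(rh, h)) + bi
            for rx, rh, bi in zip(W_x, W_h, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps gate name -> (W_x, W_h, b)."""
    i = [sigmoid(a) for a in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(a) for a in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(a) for a in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(a) for a in affine(*params["g"], x, h_prev)]  # input modulation
    # Memory cell: gated old memory plus gated new candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: the output gate decides how much of the cell is exposed.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy setup (assumed): 1-d input, N = 2. Zero weights, with biases chosen so
# the forget gate is ~1 and the input gate is ~0: the cell is carried over.
zx = [[0.0], [0.0]]
zh = [[0.0, 0.0], [0.0, 0.0]]
params = {"i": (zx, zh, [-50.0, -50.0]), "f": (zx, zh, [50.0, 50.0]),
          "o": (zx, zh, [0.0, 0.0]), "g": (zx, zh, [0.0, 0.0])}
h, c = lstm_step([1.0], [0.0, 0.0], [0.7, -0.3], params)
```

With these settings the memory cell passes through the step essentially unchanged, which is the "propagate without modification" behaviour the gates are meant to allow. Stacking would feed the returned `h` of one layer as `x` to the next layer's `lstm_step`.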

Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [8] and machine translation [9], [10]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [8] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text translations. [9] and [10] translate sentences from English to French with a multilayer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows an input sequence of arbitrary length to be mapped to an output sequence of different length. The sequence-to-sequence architecture for machine translation circumvents the need for language models.

最近,LSTM在语言任务(例如语音识别[8]和机器翻译[9],[10])上取得了令人印象深刻的结果。与CNN相似,LSTM具有吸引力,因为它们允许端到端的微调。例如,[8]通过训练将频谱图输入映射到文本的深度双向LSTM,消除了语音识别中对复杂的多步流水线的需求。即使没有语言模型或发音词典,该模型也会产生令人信服的文本翻译。 [9]和[10]使用多层LSTM编码器和解码器将句子从英语翻译为法语。使用编码LSTM将源语言中的句子映射到隐藏状态,然后通过解码LSTM将隐藏状态映射到目标语言中的序列。这种编码器-解码器方案允许将任意长度的输入序列映射到不同长度的输出序列。机器翻译的序列到序列体系结构避免了对语言模型的需求。

The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed length inputs or outputs allowing simple modeling for sequential data of varying lengths, such as text or video. We next describe a unified framework to combine recurrent models such as LSTMs with deep convolutional networks to form end-to-end trainable networks capable of complex visual and sequence prediction tasks.

LSTM在视觉问题中对顺序数据进行建模的优点是双重的。首先,当与当前的视觉系统集成时,LSTM模型可以直接端到端微调。其次,LSTM不限于固定长度的输入或输出,而是允许对长度可变的顺序数据(例如文本或视频)进行简单建模。接下来,我们将描述一个统一的框架,以将诸如LSTM的循环模型与深度卷积网络相结合,以形成能够执行复杂的视觉和序列预测任务的端到端可训练网络。


3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN) MODEL

This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. LRCN works by passing each visual input $x_t$ (an image in isolation, or a frame from a video) through a feature transformation $\phi_V(\cdot)$ with parameters $V$, usually a CNN, to produce a fixed-length vector representation $\phi_V(x_t)$. The outputs of $\phi_V$ are then passed into a recurrent sequence learning module.

这项工作提出了长期递归卷积网络(LRCN)模型,该模型将深层次的视觉特征提取器(例如 CNN)与一个序列模型相结合,后者可以学习识别和合成时间动态,适用于涉及序列数据(输入或输出)的视觉,语言或其他任务。图1描绘了我们方法的核心。LRCN 的工作方式是,将每个视觉输入 $x_t$(独立的图像或视频中的一帧)通过带有参数 $V$ 的特征变换 $\phi_V(\cdot)$(通常是 CNN),以生成固定长度的向量表示 $\phi_V(x_t)$。然后将 $\phi_V$ 的输出传递到递归序列学习模块。

In its most general form, a recurrent model has parameters $W$, and maps an input $x_t$ and a previous time step hidden state $h_{t-1}$ to an output $z_t$ and updated hidden state $h_t$. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: $h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$, then $h_2 = f_W(x_2, h_1)$, etc., up to $h_T$. Some of our models stack multiple LSTMs atop one another as described in Section 2.

在最一般的形式下,循环模型具有参数 $W$,并将输入 $x_t$ 和上一时间步的隐藏状态 $h_{t-1}$ 映射到输出 $z_t$ 和更新后的隐藏状态 $h_t$。因此,推断必须按顺序进行(即在图1的“序列学习”框中从上到下),依次计算:$h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$,然后是 $h_2 = f_W(x_2, h_1)$,依此类推,直到 $h_T$。我们的某些模型将多个 LSTM 堆叠在一起,如第2节所述。
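The two-stage structure (a per-frame feature transformation followed by a sequential recurrence) can be sketched as follows. Both `phi_V` and `f_W` here are toy stand-ins, not the CNN or LSTM used in the paper, and all weights are made up for illustration.

```python
import math

def phi_V(frame, V):
    """Stand-in for the CNN: maps one frame to a fixed-length feature vector."""
    return [math.tanh(sum(v * p for v, p in zip(row, frame))) for row in V]

def f_W(x, h_prev, W):
    """Stand-in recurrent update h_t = f_W(x_t, h_{t-1})."""
    W_x, W_h = W
    return [math.tanh(sum(a * b for a, b in zip(rx, x)) +
                      sum(a * b for a, b in zip(rh, h_prev)))
            for rx, rh in zip(W_x, W_h)]

def lrcn_forward(frames, V, W, hidden_size):
    # Stage 1: features are computed independently for every frame.
    feats = [phi_V(x_t, V) for x_t in frames]
    # Stage 2: the recurrence must run in order, starting from h_0 = 0.
    h = [0.0] * hidden_size
    hs = []
    for feat in feats:
        h = f_W(feat, h, W)
        hs.append(h)
    return hs

# Toy data (assumed): frames of 4 "pixels", 2-d features, 3 hidden units.
V = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]
W = ([[0.5, -0.5], [0.3, 0.3], [-0.2, 0.1]],
     [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0]])
short_clip = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
long_clip = short_clip + [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
                          [1.0, 1.0, 1.0, 1.0]]
hs_short = lrcn_forward(short_clip, V, W, hidden_size=3)
hs_long = lrcn_forward(long_clip, V, W, hidden_size=3)
```

The same `V` and `W` handle clips of 2 and 5 frames, and because each $h_t$ depends only on frames up to $t$, the first two hidden states of the long clip equal those of the short one.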

To predict a distribution $P(y_t)$ over outcomes $y_t \in C$ (where $C$ is a discrete, finite set of outcomes) at time step $t$, the outputs $z_t \in \mathbb{R}^{d_z}$ of the sequential model are passed through a linear prediction layer $\hat{y}_t = W_z z_t + b_z$, where $W_z \in \mathbb{R}^{|C| \times d_z}$ and $b_z \in \mathbb{R}^{|C|}$ are learned parameters. Finally, the predicted distribution $P(y_t)$ is computed by taking the softmax of $\hat{y}_t$: $P(y_t = c) = \mathrm{softmax}(\hat{y}_t) = \frac{\exp(\hat{y}_{t,c})}{\sum_{c' \in C} \exp(\hat{y}_{t,c'})}$

为了在时间步 $t$ 预测结果 $y_t \in C$(其中 $C$ 是离散的有限结果集)上的分布 $P(y_t)$,序列模型的输出 $z_t \in \mathbb{R}^{d_z}$ 经过线性预测层 $\hat{y}_t = W_z z_t + b_z$,其中 $W_z \in \mathbb{R}^{|C| \times d_z}$ 和 $b_z \in \mathbb{R}^{|C|}$ 是学习得到的参数。最后,通过对 $\hat{y}_t$ 取 softmax 来计算预测分布 $P(y_t)$:$P(y_t = c) = \mathrm{softmax}(\hat{y}_t) = \frac{\exp(\hat{y}_{t,c})}{\sum_{c' \in C} \exp(\hat{y}_{t,c'})}$
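The prediction layer is a linear map followed by a softmax. A minimal sketch with toy sizes ($d_z = 2$, $|C| = 3$); the function name and the weights are hypothetical, and subtracting the maximum logit is a standard numerical-stability step that is not mentioned in the text and does not change the result.

```python
import math

def predict_distribution(z_t, W_z, b_z):
    """Return P(y_t = c) for every class c, given the sequence model output z_t."""
    # Linear prediction layer: y_hat = W_z z_t + b_z, one logit per class.
    y_hat = [sum(w * z for w, z in zip(row, z_t)) + b
             for row, b in zip(W_z, b_z)]
    # Softmax, shifted by the max logit so exp() cannot overflow.
    m = max(y_hat)
    exps = [math.exp(v - m) for v in y_hat]
    total = sum(exps)
    return [e / total for e in exps]

W_z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # |C| x d_z
b_z = [0.0, 0.0, 0.0]
probs = predict_distribution([0.5, -0.5], W_z, b_z)  # logits: 0.5, -0.5, 0.0
```

The probabilities sum to one and preserve the ordering of the logits, so the class with the largest $\hat{y}_{t,c}$ is also the most probable.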

The success of recent deep models for object recognition [17], [18], [19] suggests that strategically composing many “layers” of non-linear functions can result in powerful models for perceptual problems. For large $T$, the above recurrence indicates that the last few predictions from a recurrent network with $T$ time steps are computed by a very “deep” ($T$ layer) non-linear function, suggesting that the resulting recurrent model may have similar representational power to a $T$ layer deep network. Critically, however, the sequence model’s weights $W$ are reused at every time step, forcing the model to learn generic time step-to-time step dynamics (as opposed to dynamics conditioned on $t$, the sequence index) and preventing the parameter size from growing in proportion to the maximum sequence length.

最近用于对象识别的深层模型的成功[17],[18],[19]表明,有策略地组合许多“层”非线性函数可以为感知问题带来强大的模型。对于较大的 $T$,上述递归表明,具有 $T$ 个时间步的递归网络的最后几个预测是由一个非常“深”($T$ 层)的非线性函数计算的,这表明所得的递归模型可能具有与 $T$ 层深度网络相似的表示能力。但至关重要的是,序列模型的权重 $W$ 在每个时间步都被重用,从而迫使模型学习通用的时间步到时间步的动态(而不是以序列索引 $t$ 为条件的动态),并防止参数规模随最大序列长度成比例增长。

In most of our experiments, the visual feature transformation $\phi$ corresponds to the activations in some layer of a deep CNN. Using a visual transformation $\phi_V(\cdot)$ which is time-invariant and independent at each time step has the important advantage of making the expensive convolutional inference and training parallelizable over all time steps of the input, facilitating the use of fast contemporary CNN implementations whose efficiency relies on independent batch processing, and end-to-end optimization of the visual and sequential model parameters $V$ and $W$.

在我们的大多数实验中,视觉特征变换 $\phi$ 对应于深层 CNN 某一层中的激活。使用不随时间变化且在每个时间步相互独立的视觉变换 $\phi_V(\cdot)$ 具有重要优势:昂贵的卷积推理和训练可以在输入的所有时间步上并行化,从而便于使用效率依赖于独立批处理的快速现代 CNN 实现,并便于对视觉和序列模型参数 $V$ 和 $W$ 进行端到端优化。

We consider three vision problems (activity recognition, image description and video description), each of which instantiates one of the following broad classes of sequential learning tasks:

我们考虑了三个视觉问题(活动识别,图像描述和视频描述),每个问题都实例化了以下广泛的顺序学习任务之一:

  1. Sequential input, static output (Figure 3, left): $(x_1, x_2, ..., x_T) \to y$. The visual activity recognition problem can fall under this umbrella, with videos of arbitrary length $T$ as input, but with the goal of predicting a single label like running or jumping drawn from a fixed vocabulary.

1)顺序输入,静态输出(图3,左):$(x_1, x_2, ..., x_T) \to y$。视觉活动识别问题可以归入这一类:以任意长度 $T$ 的视频作为输入,但目标是预测单个标签,例如取自固定词汇表的“跑步”或“跳跃”。

  2. Static input, sequential output (Figure 3, middle): $x \to (y_1, y_2, ..., y_T)$. The image captioning problem fits in this category, with a static (non-time-varying) image as input, but a much larger and richer label space consisting of sentences of any length.

2)静态输入,顺序输出(图3,中间):$x \to (y_1, y_2, ..., y_T)$。图像字幕问题属于此类,输入的是静态(非时变)图像,但是标签空间更大,更丰富,由任意长度的句子组成。

  3. Sequential input and output (Figure 3, right): $(x_1, x_2, ..., x_T) \to (y_1, y_2, ..., y_{T'})$. In tasks such as video description, both the visual input and output are time-varying, and in general the number of input and output time steps may differ (i.e., we may have $T \neq T'$). In video description, for example, the number of frames in the video should not constrain the length of (number of words in) the natural language description.

3)顺序输入和输出(图3,右):$(x_1, x_2, ..., x_T) \to (y_1, y_2, ..., y_{T'})$。在诸如视频描述之类的任务中,视觉输入和输出都是随时间变化的,并且通常输入和输出时间步的数量可能不同(即,我们可能有 $T \neq T'$)。例如,在视频描述中,视频中的帧数不应限制自然语言描述的长度(其中的单词数)。
Fig. 3. Task-specific instantiations of our LRCN model for activity recognition, image description, and video description.

In the previously described generic formulation of recurrent models, each instance has $T$ inputs $(x_1, x_2, ..., x_T)$ and $T$ outputs $(y_1, y_2, ..., y_T)$. Note that this formulation does not align cleanly with any of the three problem classes described above: in the first two classes, either the input or output is static, and in the third class, the input length $T$ need not match the output length $T'$. Hence, we describe how we adapt this formulation in our hybrid model to each of the above three problem settings.

在先前描述的递归模型的一般表述中,每个实例具有 $T$ 个输入 $(x_1, x_2, ..., x_T)$ 和 $T$ 个输出 $(y_1, y_2, ..., y_T)$。请注意,此表述与上述三个问题类别中的任何一个都不完全吻合:在前两个类别中,输入或输出是静态的;在第三个类别中,输入长度 $T$ 不必与输出长度 $T'$ 匹配。因此,我们描述了如何在我们的混合模型中使此表述适应上述三种问题设置中的每一种。

With sequential inputs and static outputs (class 1), we take a late-fusion approach to merging the per-time step predictions $(y_1, y_2, ..., y_T)$ into a single prediction $y$ for the full sequence. With static inputs $x$ and sequential outputs (class 2), we simply duplicate the input $x$ at all $T$ time steps: $\forall t \in \{1, 2, ..., T\}: x_t := x$. Finally, for a sequence-to-sequence problem with (in general) different input and output lengths (class 3), we take an “encoder-decoder” approach, as proposed for machine translation by [9], [20]. In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector, and another sequence model, the decoder, unrolls this vector to a sequential output of arbitrary length. Under this type of model, a run of the full system on one instance occurs over $T + T' - 1$ time steps. For the first $T$ time steps, the encoder processes the input $x_1, x_2, ..., x_T$, and the decoder is inactive until time step $T$, when the encoder’s output is passed to the decoder, which in turn predicts the first output $y_1$. For the latter $T' - 1$ time steps, the decoder predicts the remainder of the output $y_2, y_3, ..., y_{T'}$ with the encoder inactive. This encoder-decoder approach, as applied to the video description task, is depicted in Section 6, Figure 5 (left).

对于顺序输入和静态输出(类别1),我们采用后融合方法将每个时间步的预测 $(y_1, y_2, ..., y_T)$ 合并为整个序列的单个预测 $y$。对于静态输入 $x$ 和顺序输出(类别2),我们只需在所有 $T$ 个时间步上复制输入 $x$:$\forall t \in \{1, 2, ..., T\}: x_t := x$。最后,对于(通常)具有不同输入和输出长度的序列到序列问题(类别3),我们采用“编码器-解码器”方法,如[9],[20]针对机器翻译所提出的那样。在这种方法中,一个序列模型(编码器)将输入序列映射到固定长度的向量,而另一个序列模型(解码器)将该向量展开为任意长度的顺序输出。在这种类型的模型下,整个系统在一个实例上的运行共经历 $T + T' - 1$ 个时间步。在前 $T$ 个时间步中,编码器处理输入 $x_1, x_2, ..., x_T$,解码器在时间步 $T$ 之前不活动;在时间步 $T$,编码器的输出被传递到解码器,解码器进而预测第一个输出 $y_1$。在其后的 $T' - 1$ 个时间步中,解码器在编码器不活动的情况下预测输出的其余部分 $y_2, y_3, ..., y_{T'}$。应用于视频描述任务的这种编码器-解码器方法在第6节,图5(左)中进行了描述。
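The three adaptations can be sketched side by side. This is an illustration under stated assumptions: late fusion is implemented here as a plain average of the per-step distributions (the passage above only says "late fusion", so averaging is an assumed choice), and all function names are hypothetical.

```python
def late_fusion(per_step_probs):
    """Class 1: merge T per-step distributions into one (assumed: averaging)."""
    T = len(per_step_probs)
    n_classes = len(per_step_probs[0])
    return [sum(p[c] for p in per_step_probs) / T for c in range(n_classes)]

def duplicate_input(x, T):
    """Class 2: feed the same static input x at every one of T time steps."""
    return [x for _ in range(T)]

def encoder_decoder_schedule(T, T_prime):
    """Class 3: what is consumed/produced at each of the T + T' - 1 steps."""
    schedule = []
    for step in range(1, T + T_prime):
        encoder_in = f"x_{step}" if step <= T else None         # encoder active
        decoder_out = f"y_{step - T + 1}" if step >= T else None  # decoder active
        schedule.append((step, encoder_in, decoder_out))
    return schedule

fused = late_fusion([[0.2, 0.8], [0.6, 0.4]])
copies = duplicate_input("image", 3)
schedule = encoder_decoder_schedule(T=3, T_prime=2)
```

For `T=3, T_prime=2` the schedule has four entries: the encoder reads $x_1, x_2, x_3$ in steps 1 to 3, the decoder emits $y_1$ at step 3 (the only step where both are active), and $y_2$ at step 4 with the encoder idle.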

Under the proposed system, the parameters $(V, W)$ of the model’s visual and sequential components can be jointly optimized by maximizing the likelihood of the ground truth outputs $y_t$ at each time step $t$, conditioned on the input data and labels up to that point $(x_{1:t}, y_{1:t-1})$. In particular, for a training set $D$ of labeled sequences $(x_t, y_t)_{t=1}^T \in D$, we optimize parameters $(V, W)$ to minimize the expected negative log likelihood of a sequence sampled from the training set $L(V, W, D) = -\frac{1}{|D|} \sum_{(x_t, y_t)_{t=1}^T \in D} \sum_{t=1}^T \log P(y_t \mid x_{1:t}, y_{1:t-1}, V, W)$. One of the most appealing aspects of the described system is the ability to learn the parameters “end-to-end,” such that the parameters $V$ of the visual feature extractor learn to pick out the aspects of the visual input that are relevant to the sequential classification problem. We train our LRCN models using stochastic gradient descent, with backpropagation used to compute the gradient $\nabla_{V,W} L(V, W, \widetilde{D})$ of the objective $L$ with respect to all parameters $(V, W)$ over minibatches $\widetilde{D} \subset D$ sampled from the training dataset $D$.

在所提出的系统下,可以通过最大化每个时间步 $t$ 的真实标注(ground truth)输出 $y_t$ 的似然来联合优化模型视觉部分和序列部分的参数 $(V, W)$,该似然以截至该时刻的输入数据和标签 $(x_{1:t}, y_{1:t-1})$ 为条件。特别是,对于由带标注序列 $(x_t, y_t)_{t=1}^T \in D$ 组成的训练集 $D$,我们优化参数 $(V, W)$ 以最小化从训练集中采样的序列的期望负对数似然 $L(V, W, D) = -\frac{1}{|D|} \sum_{(x_t, y_t)_{t=1}^T \in D} \sum_{t=1}^T \log P(y_t \mid x_{1:t}, y_{1:t-1}, V, W)$。所描述系统最吸引人的方面之一是能够“端到端”地学习参数,从而使视觉特征提取器的参数 $V$ 学会挑选出视觉输入中与序列分类问题相关的方面。我们使用随机梯度下降训练 LRCN 模型,并使用反向传播计算目标函数 $L$ 在从训练数据集 $D$ 中采样的小批量 $\widetilde{D} \subset D$ 上相对于所有参数 $(V, W)$ 的梯度 $\nabla_{V,W} L(V, W, \widetilde{D})$。
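The objective is the per-step negative log likelihood, summed over time and averaged over sequences. A minimal sketch of just the loss computation: the per-step distributions below are made-up numbers standing in for the model outputs $P(y_t \mid x_{1:t}, y_{1:t-1}, V, W)$, and the gradient step via backpropagation is not shown.

```python
import math

def sequence_nll(step_probs, labels):
    """-sum_t log P(y_t | ...) for one sequence of per-step distributions."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, labels))

def dataset_loss(dataset):
    """L = (1/|D|) * sum of per-sequence negative log likelihoods."""
    return sum(sequence_nll(p, y) for p, y in dataset) / len(dataset)

# Two toy sequences of different lengths over 2 classes (hypothetical values).
dataset = [
    ([[0.5, 0.5], [0.25, 0.75]], [0, 1]),  # T = 2, labels y_1 = 0, y_2 = 1
    ([[1.0, 0.0]], [0]),                   # T = 1, predicted with certainty
]
loss = dataset_loss(dataset)
```

Only the probability assigned to the ground truth label at each step enters the loss, so a sequence predicted with certainty contributes zero and longer sequences contribute more terms.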

We next demonstrate the power of end-to-end trainable hybrid convolutional and recurrent networks by exploring three applications: activity recognition, image captioning, and video description.

接下来,我们将通过探索三种应用来证明端到端可训练混合卷积和递归网络的功能:活动识别,图像字幕和视频描述。


5 IMAGE CAPTIONING (omitted)

6 VIDEO DESCRIPTION (omitted)


7 RELATED WORK

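We present previous literature pertaining to the three tasks discussed in this work. Additionally, we discuss subsequent extensions which combine convolutional and recurrent networks to achieve improved results on activity recognition, image captioning, and video description, as well as related new tasks such as visual question answering.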


7.1 Prior Work

7.1.1 Activity Recognition

State-of-the-art shallow models combine spatio-temporal features along dense trajectories [50] and encode features as bags of words or Fisher vectors for classification. Such shallow features track how low level features change through time but cannot track higher level features. Furthermore, by encoding features as bags of words or Fisher vectors, temporal relationships are lost.

最新的浅层模型结合了沿密集轨迹的时空特征[50],并将特征编码为单词袋或Fisher向量袋进行分类。这样的浅层特征跟踪低层特征如何随时间变化,但不能跟踪高层特征。此外,通过将特征编码为单词袋或Fisher向量,就失去了时间关系。

Many deep architectures proposed for activity recognition stack a fixed number of video frames for input to a deep network. [3] propose a fusion convolutional network which fuses layers which correspond to different input frames at various levels of a deep network. [4] proposes a two stream CNN which combines one CNN trained on RGB frames and one CNN trained on a stack of 10 flow frames. When combining RGB and flow by averaging softmax scores, results are comparable to state-of-the-art shallow models on UCF101 [25] and HMDB51 [51]. Results are further improved by using an SVM to fuse RGB and flow as opposed to simply averaging scores. Alternatively, [1] and [2] propose learning deep spatio-temporal features with 3D convolutional neural networks. [2], [52] propose extracting visual and motion features and modeling temporal dependencies with recurrent networks. This architecture most closely resembles our proposed architecture for activity classification, though it differs in two key ways. First, we integrate 2D CNNs that can be pre-trained on large image datasets. Second, we combine the CNN and LSTM into a single model to enable end-to-end fine-tuning.


7.1.2 Image Captioning

Several early works [53], [54], [55], [56] on image captioning combine object and scene recognition with template or tree based approaches to generate captions. Such sentences are typically simple and are easily distinguished from more fluent human generated descriptions. [46], [57] address this by composing new sentences from existing caption fragments which, though more human like, are not necessarily accurate or correct.

More recently, a variety of deep and multi-modal models [27], [29], [30], [58] have been proposed for image and caption retrieval, as well as caption generation. Though some of these models rely on deep convolutional nets for image feature extraction [30], [58], recently researchers have realized the importance of also including temporally deep networks to model text. [29] propose an RNN to map sentences into a multi-modal embedding space. By mapping images and language into the same embedding space, they are able to compare images and descriptions for image and annotation retrieval tasks. [27] propose a model for caption generation that is more similar to the model proposed in this work: predictions for the next word are based on previous words in a sentence and image features. [58] propose an encoder-decoder model for image caption retrieval which relies on both a CNN and LSTM encoder to learn an embedding of image-caption pairs. Their model uses a neural language decoder to enable sentence generation. As evidenced by the rapid growth of image captioning, visual sequence models like LRCN are increasingly important for describing the visual world using natural language.


7.1.3 Video Description

Recent approaches to describing video with natural language have made use of templates, retrieval, or language models [11], [59], [60], [61], [62], [63], [64]. To our knowledge, we present the first application of deep models to the video description task. Most similar to our work is [11], which uses phrase-based SMT [47] to generate a sentence. In Section 6 we show that phrase-based SMT can be replaced with LSTMs for video description as has been shown previously for language translation [9], [65].


7.2 Contemporaneous and Subsequent Work

Similar work in activity recognition and visual description was conducted contemporaneously with our work, and a variety of subsequent work has combined convolutional and recurrent networks to both improve upon our results and achieve exciting results on other sequential visual tasks.


7.2.1 Activity Recognition

Contemporaneous with our work, [66] train a network which combines CNNs and LSTMs for activity recognition. Because activity recognition datasets like UCF101 are relatively small in comparison to image recognition datasets, [66] pretrain their network using the Sports-1M [3] dataset which includes over a million videos mined from YouTube. By training a much larger network (four stacked LSTMs) and pretraining on a large video dataset, [66] achieve 88.6% accuracy on the UCF101 dataset.

[67] also combines a convolutional network with an LSTM to predict multiple activities per frame. Unlike LRCN, [67] focuses on frame-level (rather than video-level) predictions, which allows their system to label multiple activities that occur in different temporal locations of a video clip. As we show for activity recognition, [67] demonstrates that including temporal information improves upon a single-frame baseline. Additionally, [67] employs an attention mechanism to further improve results.
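
The difference between video-level and frame-level prediction can be made concrete with a small sketch. It assumes per-frame scores are already available as independent per-class probabilities; averaging over frames and thresholding at 0.5 are illustrative choices, not the procedure of [67], and all names and numbers are hypothetical.

```python
def video_level_label(frame_probs):
    """One label per clip: average class probabilities over frames, then argmax."""
    n = len(frame_probs)
    avg = [sum(col) / n for col in zip(*frame_probs)]
    return max(range(len(avg)), key=avg.__getitem__)

def frame_level_labels(frame_probs, threshold=0.5):
    """Several labels per frame: keep every class above the threshold, frame by frame."""
    return [[c for c, p in enumerate(probs) if p > threshold]
            for probs in frame_probs]

# Hypothetical scores for 3 frames and 3 activity classes (assumed values).
frame_probs = [
    [0.9, 0.1, 0.2],  # frame 0: only activity 0
    [0.8, 0.7, 0.1],  # frame 1: activities 0 and 1 overlap
    [0.2, 0.9, 0.1],  # frame 2: only activity 1
]

clip_label = video_level_label(frame_probs)
per_frame = frame_level_labels(frame_probs)
```

The clip-level output collapses the sequence to a single class, while the frame-level output keeps the information that activity 1 starts while activity 0 is still in progress.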


7.2.2 Image Captioning

[45] and [38] also propose models which combine a CNN with a recurrent network for image captioning. Though similar to LRCN, the architectures proposed in [45] and [38] differ in how image features are input into the sequence model. In contrast to our system, in which image features are input at each time step, [45] and [38] only input image features at the first time step. Furthermore, they do not explore a “factored” representation (Figure 4). Subsequent work [44] has proposed attention to focus on which portion of the image is observed during sequence generation. By including attention, [44] aim to visually focus on the current word generated by the model. Other works aim to address specific limitations of captioning models based on combining convolutional and recurrent architectures. For example, methods have been proposed to integrate new vocabulary with limited [40] or no [68] examples of images and corresponding captions.
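
The two ways of feeding image features to the sequence model can be sketched as input-building functions. This is a simplified illustration with plain Python lists; the function names are hypothetical, and the first-step variant assumes the image feature has already been projected to the word-embedding size.

```python
def inputs_every_step(image_feat, word_embs):
    """Image feature at each time step: concatenate it with every word embedding."""
    # `+` on lists is concatenation here, not element-wise addition.
    return [image_feat + w for w in word_embs]

def inputs_first_step_only(image_feat, word_embs):
    """Image feature at the first time step only; later steps see only words."""
    return [image_feat] + list(word_embs)

# Hypothetical 2-d image feature and three 2-d word embeddings (assumed values).
image_feat = [0.2, 0.7]
word_embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

every = inputs_every_step(image_feat, word_embs)       # 3 steps, each of size 4
first = inputs_first_step_only(image_feat, word_embs)  # 4 steps, each of size 2
```

With per-step input the recurrent network does not have to carry the image content in its hidden state across the whole sentence; with first-step input the sequence is one step longer and the state alone must remember the image.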


7.2.3 Video Description

In this work, we rely on intermediate features for video description, but end-to-end trainable models for visual captioning have since been proposed. [69] propose creating a video feature by pooling high-level CNN features across frames. The video feature is then used to generate descriptions in the same way an image is used to generate a description in LRCN. Though this approach achieves good results, pooling CNN features discards temporal information from the video. Consequently, [70] propose an LSTM to encode video frames into a fixed-length vector before sentence generation with an LSTM. Using an end-to-end trainable “sequence-to-sequence” model which can exploit temporal structure in video, [70] improve upon results for video description. [71] propose a similar model, adding a temporal attention mechanism which weights video frames differently when generating each word in a sentence.
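
A short sketch shows why pooling across frames loses temporal order, and how a temporal attention weighting differs. It assumes per-frame CNN features are given as lists of floats and that attention scores come from elsewhere (in a real model they would depend on the decoder state); the names and values are hypothetical and do not reproduce [69] or [71].

```python
import math

def mean_pool(frame_feats):
    """Average each feature dimension over all frames."""
    n = len(frame_feats)
    return [sum(col) / n for col in zip(*frame_feats)]

def attention_pool(frame_feats, scores):
    """Weight frames by a softmax over per-frame scores, then sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = [sum(w * x for w, x in zip(weights, col))
              for col in zip(*frame_feats)]
    return pooled, weights

# Hypothetical 2-d features for 3 frames (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

forward = mean_pool(frames)
backward = mean_pool(list(reversed(frames)))  # same clip played in reverse

# Assumed scores that favour the last frame for the word being generated.
pooled, weights = attention_pool(frames, [0.0, 0.0, 5.0])
```

`forward` and `backward` are identical, so a mean-pooled feature cannot distinguish a clip from its reverse. The attention weights are recomputed for each generated word, so different words can draw on different frames.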


7.2.4 Visual Grounding

[72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase. In order to reconstruct the phrase, the model must learn to visually ground the input phrase to the appropriate location in the image.


7.2.5 Natural Language Object Retrieval

In this work, we present methods for image retrieval based on a natural language description. In contrast, [73] use a model based on LRCN for object retrieval, which returns the bounding box around a given object as opposed to an entire image. In order to adapt LRCN to the task of object retrieval, [73] include local convolutional features which are extracted from object proposals and the spatial configuration of object proposals in addition to a global image feature. By including local features, [73] effectively adapt LRCN for object retrieval.


8 CONCLUSION

We’ve presented LRCN, a class of models that is both spatially and temporally deep, and flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs. Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve upon previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.

As the field of computer vision matures beyond tasks with static input and predictions, deep sequence modeling tools like LRCN are increasingly central to vision systems for problems with sequential structure. The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to handle with little input preprocessing and no hand-designed features.


Fig. 6. Image description: images with corresponding captions generated by our finetuned LRCN model. These are images 1-12 of our randomly chosen validation set from COCO 2014 [33]. We used beam search with a beam size of 5 to generate the sentences, and display the top (highest likelihood) result above.
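
The beam search mentioned in the caption can be sketched as follows. This is a generic beam search over a toy next-word table, not the authors' decoding code: `toy_model`, its vocabulary, and its probabilities are invented for illustration, and no length normalization is applied.

```python
import math

def beam_search(next_log_probs, beam_size=5, max_len=10, eos="<eos>"):
    """Return hypotheses as (log-probability, tokens), best first."""
    beams = [(0.0, [])]  # unfinished hypotheses
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without <eos>
    return sorted(finished, key=lambda c: c[0], reverse=True)

def toy_model(prefix):
    """Hypothetical next-word log-probabilities given the words so far."""
    table = {
        (): {"a": math.log(0.6), "the": math.log(0.4)},
        ("a",): {"dog": math.log(0.5), "cat": math.log(0.5)},
        ("the",): {"dog": math.log(0.9), "cat": math.log(0.1)},
    }
    return table.get(tuple(prefix), {"<eos>": 0.0})

results = beam_search(toy_model, beam_size=5)
best_score, best_tokens = results[0]
```

In this toy table the greedy choice of the first word is "a", which leads to a sentence of probability 0.3, whereas keeping several hypotheses alive finds "the dog" with probability 0.36. The top hypothesis corresponds to the highest-likelihood result that the caption refers to.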

