Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., in speech, language), recurrent operations [38, 23] are the dominant solution to long-range dependency modeling. For image data, long-distance dependencies are modeled by the large receptive fields formed by deep stacks of convolutional operations [14, 30].
Convolutional and recurrent operations both process a local neighborhood, either in space or time; thus long-range dependencies can only be captured when these operations are applied repeatedly, propagating signals progressively through the data. Repeating local operations has several limitations. First, it is computationally inefficient. Second, it causes optimization difficulties that need to be carefully addressed [23, 21]. Finally, these challenges make multihop dependency modeling, e.g., when messages need to be delivered back and forth between distant positions, difficult.
In this paper, we present non-local operations as an efficient, simple, and generic component for capturing longrange dependencies with deep neural networks. Our proposed non-local operation is a generalization of the classical non-local mean operation [4] in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps (Figure 1). The set of positions can be in space, time, or spacetime, implying that our operations are applicable for image, sequence, and video problems.
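For reference, the weighted sum described above is formalized in the paper as a generic non-local operation (Eq.(1), restated here in the paper's notation):

y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)

where x is the input signal (image, sequence, video, or their features), y is the output signal of the same size, i is the index of an output position, j enumerates all positions, the pairwise function f computes a scalar affinity between i and j, the unary function g computes a representation of the input at position j, and \mathcal{C}(x) is a normalization factor.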
There are several advantages of using non-local operations: (a) In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance; (b) As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers (e.g., 5); (c) Finally, our non-local operations maintain the variable input sizes and can be easily combined with other operations (e.g., convolutions as we will use).
We showcase the effectiveness of non-local operations in the application of video classification. In videos, long-range interactions occur between distant pixels in space as well as time. A single non-local block, which is our basic unit, can directly capture these spacetime dependencies in a feedforward fashion. With a few non-local blocks, our architectures, called non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks [48] (including the inflated variant [7]). In addition, non-local neural networks are more computationally economical than their 3D convolutional counterparts. Comprehensive ablation studies are presented on the Kinetics [27] and Charades [44] datasets. Using RGB only and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the latest competition winners on both datasets.
To demonstrate the generality of non-local operations, we further present object detection/segmentation and pose estimation experiments on the COCO dataset [33]. On top of the strong Mask R-CNN baseline [19], our non-local blocks can increase accuracy on all three tasks at a small extra computational cost. Together with the evidence on videos, these image experiments show that non-local operations are generally useful and can become a basic building block in designing deep neural networks.
Non-local image processing. Non-local means [4] is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D) [10], which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even compared with deep neural networks [5]. Block matching was used with neural networks for image denoising [6, 31]. Non-local matching is also the essence of successful texture synthesis [12], super-resolution [16], and inpainting [1] algorithms.
Graphical models. Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF) [29, 28]. In the context of deep neural networks, a CRF can be exploited to post-process semantic segmentation predictions of a network [9]. The iterative mean-field inference of CRF can be turned into a recurrent network and trained [56, 42, 8, 18, 34]. In contrast, our method is a simpler feedforward block for computing non-local filtering. Unlike these methods that were developed for segmentation, our general-purpose component is applied for classification and detection. These methods and ours are also related to a more abstract model called graph neural networks [41].
Feedforward modeling for sequences. Recently there emerged a trend of using feedforward (i.e., non-recurrent) networks for modeling sequences in speech and language [36, 54, 15]. In these methods, long-term dependencies are captured by the large receptive fields contributed by very deep 1-D convolutions. These feedforward models are amenable to parallelized implementations and can be more efficient than widely used recurrent models.
Self-attention. Our work is related to the recent self-attention [49] method for machine translation. A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. As we will discuss next, self-attention can be viewed as a form of the non-local mean [4], and in this sense our work bridges self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.
Interaction networks. Interaction Networks (IN) [2, 52] were proposed recently for modeling physical systems. They operate on graphs of objects involved in pairwise interactions. Hoshen [24] presented the more efficient Vertex Attention IN (VAIN) in the context of multi-agent predictive modeling. Another variant, named Relation Networks [40], computes a function on the feature embeddings at all pairs of positions in its input. Our method also processes all pairs, as we will explain (f(xi, xj) in Eq.(1)). While our non-local networks are connected to these approaches, our experiments indicate that the non-locality of the model, which is orthogonal to the ideas of attention/interaction/relation (e.g., a network can attend to a local region), is the key to their empirical success. Non-local modeling, a long-time crucial element of image processing (e.g., [12, 4]), has been largely overlooked in recent neural networks for computer vision.
Video classification architectures. A natural solution to video classification is to combine the success of CNNs for images and RNNs for sequences [55, 11]. In contrast, feedforward models are achieved by 3D convolutions (C3D) [26, 48] in spacetime, and the 3D filters can be formed by "inflating" [13, 7] pre-trained 2D filters. In addition to end-to-end modeling on raw video inputs, it has been found that optical flow [45] and trajectories [50, 51] can be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependency. A systematic comparison of video architectures can be found in [7].
We presented a new class of neural networks which capture long-range dependencies via non-local operations. Our non-local blocks can be combined with any existing architectures. We show the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation. On all tasks, a simple addition of non-local blocks provides solid improvement over baselines. We hope non-local layers will become an essential component of future network architectures.
Figure 1. A spacetime non-local operation in our network trained for video classification. A position xi's response is computed by the weighted average of the features of all positions xj (only the highest weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.
Figure 2. A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). "⊗" denotes matrix multiplication, and "⊕" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be done by removing θ and φ, and the dot-product version can be done by replacing softmax with scaling by 1/N.
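To make the caption concrete, the following is a minimal PyTorch sketch of the embedded Gaussian block described above. The class and variable names (NonLocalBlock, theta, phi, g, w_z) are ours, not those of the paper's released code, and implementation details (e.g., normalization layers or subsampling tricks) may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Spacetime non-local block, embedded Gaussian version, with a channel bottleneck."""
    def __init__(self, in_channels=1024, bottleneck_channels=512):
        super().__init__()
        # theta, phi, g are 1x1x1 convolutions (the blue boxes in Figure 2).
        self.theta = nn.Conv3d(in_channels, bottleneck_channels, kernel_size=1)
        self.phi = nn.Conv3d(in_channels, bottleneck_channels, kernel_size=1)
        self.g = nn.Conv3d(in_channels, bottleneck_channels, kernel_size=1)
        # Final 1x1x1 convolution projects back to in_channels before the residual sum.
        self.w_z = nn.Conv3d(bottleneck_channels, in_channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        n = t * h * w  # number of spacetime positions

        theta_x = self.theta(x).view(b, -1, n)  # (B, C/2, N)
        phi_x = self.phi(x).view(b, -1, n)      # (B, C/2, N)
        g_x = self.g(x).view(b, -1, n)          # (B, C/2, N)

        # Pairwise affinities between all spacetime positions: (B, N, N).
        f = torch.bmm(theta_x.transpose(1, 2), phi_x)
        # Softmax over each row (embedded Gaussian). For the dot-product version,
        # replace this with f / n; the vanilla Gaussian version drops theta and phi.
        f = F.softmax(f, dim=-1)

        # Weighted sum of g features over all positions: column i is sum_j f[i, j] * g_j.
        y = torch.bmm(g_x, f.transpose(1, 2))   # (B, C/2, N)
        y = y.view(b, -1, t, h, w)

        # Residual connection, so the block can be inserted into existing architectures.
        return x + self.w_z(y)

One natural design choice (and, to our understanding, the one the paper adopts) is to initialize w_z to zero, so that a newly inserted block starts as an identity mapping and does not disturb a pretrained backbone.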
Figure 3. Examples of the behavior of a non-local block in res3 computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of arrows represents one xi, and the ending points represent xj. The 20 highest weighted arrows for each xi are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to support its prediction.
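As a small, purely illustrative sketch of how the arrows in Figures 1 and 3 could be selected, assuming access to the (N, N) row-wise softmax weight matrix produced by a non-local block for one video (the function and argument names here are hypothetical):

import torch

def top_weighted_positions(f: torch.Tensor, i: int, k: int = 20):
    """f: (N, N) row-stochastic affinity matrix from a non-local block.
    Returns the k spacetime positions j with the largest weights f[i, j],
    i.e., the endpoints of the k highest weighted arrows drawn from position i."""
    weights, indices = torch.topk(f[i], k)
    return indices.tolist(), weights.tolist()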