文章作者:Tyan
博客:noahsnail.com | CSDN | 简书
声明:作者翻译论文仅为学习,如有侵权请联系作者删除博文,谢谢!
翻译论文汇总:https://github.com/SnailTyan/deep-learning-papers-translation
Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the performance of our approach. For the first time, we show that learning dense spatial transformation in deep CNNs is effective for sophisticated vision tasks such as object detection and semantic segmentation. The code is released at https://github.com/msracver/Deformable-ConvNets.
卷积神经网络(CNN)由于其构建模块中固定的几何结构,天然地局限于建模几何变换。在这项工作中,我们引入了两个新的模块来提高CNN的几何变换建模能力,即可变形卷积和可变形RoI池化。两者都基于这样的想法:用额外的偏移量来增强模块中的空间采样位置,并从目标任务中学习这些偏移量,而不需要额外的监督。新模块可以很容易地替换现有CNN中的普通模块,并且可以通过标准的反向传播轻松地进行端到端训练,从而产生可变形卷积网络。大量的实验验证了我们方法的性能。我们首次证明了在深度CNN中学习密集空间变换对于复杂的视觉任务(如目标检测和语义分割)是有效的。代码发布在https://github.com/msracver/Deformable-ConvNets。
A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation. In general, there are two ways. The first is to build the training datasets with sufficient desired variations. This is usually realized by augmenting the existing data samples, e.g., by affine transformation. Robust representations can be learned from the data, but usually at the cost of expensive training and complex model parameters. The second is to use transformation-invariant features and algorithms. This category subsumes many well known techniques, such as SIFT (scale invariant feature transform) [42] and sliding window based object detection paradigm.
视觉识别中的一个关键挑战是如何适应目标尺度、姿态、视点和部件变形中的几何变化,或如何建模几何变换。一般来说,有两种方法。第一种是构建具有足够多所需变化的训练数据集。这通常通过扩充现有的数据样本来实现,例如通过仿射变换。鲁棒的表示可以从数据中学习到,但通常以昂贵的训练和复杂的模型参数为代价。第二种是使用具有变换不变性的特征和算法。这一类包含了许多众所周知的技术,如SIFT(尺度不变特征变换)[42]和基于滑动窗口的目标检测范例。
There are two drawbacks in the above ways. First, the geometric transformations are assumed fixed and known. Such prior knowledge is used to augment the data, and design the features and algorithms. This assumption prevents generalization to new tasks possessing unknown geometric transformations, which are not properly modeled. Second, hand-crafted design of invariant features and algorithms could be difficult or infeasible for overly complex transformations, even when they are known.
上述方法有两个缺点。首先,几何变换被假定是固定且已知的。这样的先验知识被用来扩充数据,并设计特征和算法。这个假设阻碍了向具有未知几何变换的新任务的泛化,因为这些变换没有被恰当地建模。其次,对于过于复杂的变换,即使它们是已知的,手工设计不变特征和算法也可能是困难的或不可行的。
Recently, convolutional neural networks (CNNs) [35] have achieved significant success for visual recognition tasks, such as image classification [31], semantic segmentation [41], and object detection [16]. Nevertheless, they still share the above two drawbacks. Their capability of modeling geometric transformations mostly comes from the extensive data augmentation, the large model capacity, and some simple hand-crafted modules (e.g., max-pooling [1] for small translation-invariance).
最近,卷积神经网络(CNN)[35]在图像分类[31]、语义分割[41]和目标检测[16]等视觉识别任务中取得了显著的成功。不过,它们仍然有上述两个缺点。它们建模几何变换的能力主要来自大量的数据增强、大的模型容量以及一些简单的手工设计模块(例如,对小的平移具有不变性的最大池化[1])。
In short, CNNs are inherently limited to model large, unknown transformations. The limitation originates from the fixed geometric structures of CNN modules: a convolution unit samples the input feature map at fixed locations; a pooling layer reduces the spatial resolution at a fixed ratio; a RoI (region-of-interest) pooling layer separates a RoI into fixed spatial bins, etc. There lacks internal mechanisms to handle the geometric transformations. This causes noticeable problems. For one example, the receptive field sizes of all activation units in the same CNN layer are the same. This is undesirable for high level CNN layers that encode the semantics over spatial locations. Because different locations may correspond to objects with different scales or deformation, adaptive determination of scales or receptive field sizes is desirable for visual recognition with fine localization, e.g., semantic segmentation using fully convolutional networks [41]. For another example, while object detection has seen significant and rapid progress [16, 52, 15, 47, 46, 40, 7] recently, all approaches still rely on the primitive bounding box based feature extraction. This is clearly sub-optimal, especially for non-rigid objects.
简而言之,CNN本质上局限于建模大的、未知的变换。该限制源于CNN模块固定的几何结构:卷积单元在固定位置对输入特征图进行采样;池化层以固定的比例降低空间分辨率;RoI(感兴趣区域)池化层把RoI分成固定的空间组块,等等。网络内部缺乏处理几何变换的机制。这会导致明显的问题。举一个例子,同一CNN层中所有激活单元的感受野大小是相同的。对于在空间位置上编码语义的高层CNN层来说,这是不可取的。由于不同的位置可能对应具有不同尺度或形变的目标,所以对于需要精细定位的视觉识别来说,例如使用全卷积网络的语义分割[41],尺度或感受野大小的自适应确定是理想的。又如,尽管最近目标检测已经取得了显著而迅速的进展[16,52,15,47,46,40,7],但所有方法仍然依赖于基于原始边界框的特征提取。这显然是次优的,特别是对于非刚性目标。
In this work, we introduce two new modules that greatly enhance CNNs’ capability of modeling geometric transformations. The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations in the standard convolution. It enables free form deformation of the sampling grid. It is illustrated in Figure 1. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.
Figure 1: Illustration of the sampling locations in 3 × 3 standard and deformable convolutions. (a) regular sampling grid (green points) of standard convolution. (b) deformed sampling locations (dark blue points) with augmented offsets (light blue arrows) in deformable convolution. (c)(d) are special cases of (b), showing that the deformable convolution generalizes various transformations for scale, (anisotropic) aspect ratio and rotation.
在这项工作中,我们引入了两个新的模块,大大提高了CNN建模几何变换的能力。第一个是可变形卷积。它将2D偏移量添加到标准卷积的常规网格采样位置上,使采样网格可以自由形变,如图1所示。偏移量通过附加的卷积层从前面的特征图中学习。因此,变形以局部的、密集的和自适应的方式以输入特征为条件。
图1:3×3标准卷积和可变形卷积中采样位置的示意图。(a)标准卷积的规则采样网格(绿点)。(b)可变形卷积中变形后的采样位置(深蓝色点)和附加的偏移量(浅蓝色箭头)。(c)(d)是(b)的特例,表明可变形卷积推广了尺度、(各向异性)长宽比和旋转等各种变换。
The second is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the previous RoI pooling [15, 7]. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.
第二个是可变形RoI池化。它为先前RoI池化[15,7]的常规组块(bin)划分中的每个组块位置添加一个偏移量。类似地,偏移量从前面的特征映射和RoI中学习,使得具有不同形状的目标能够自适应地进行部件定位。
Both modules are light weight. They add small amount of parameters and computation for the offset learning. They can readily replace their plain counterparts in deep CNNs and can be easily trained end-to-end with standard back-propagation. The resulting CNNs are called deformable convolutional networks, or deformable ConvNets.
两个模块都是轻量的。它们只为偏移学习增加了少量的参数和计算。它们可以很容易地取代深层CNN中的普通对应模块,并且可以很容易地通过标准的反向传播进行端到端的训练。所得到的CNN被称为可变形卷积网络,或可变形ConvNets。
Our approach shares similar high level spirit with spatial transform networks [26] and deformable part models [11]. They all have internal transformation parameters and learn such parameters purely from data. A key difference in deformable ConvNets is that they deal with dense spatial transformations in a simple, efficient, deep and end-to-end manner. In Section 3.1, we discuss in details the relation of our work to previous works and analyze the superiority of deformable ConvNets.
我们的方法与空间变换网络[26]和可变形部件模型[11]在高层思想上是类似的。它们都有内部的变换参数,并且纯粹从数据中学习这些参数。可变形ConvNets的一个关键区别在于它以简单、高效、深度且端到端的方式处理密集的空间变换。在3.1节中,我们详细讨论了我们的工作与以前工作的关系,并分析了可变形ConvNets的优越性。
The feature maps and convolution in CNNs are 3D. Both deformable convolution and RoI pooling modules operate on the 2D spatial domain. The operation remains the same across the channel dimension. Without loss of generality, the modules are described in 2D here for notation clarity. Extension to 3D is straightforward.
CNN中的特征映射和卷积是3D的。可变形卷积和RoI池化模块都在2D空间域上进行操作,在整个通道维度上的操作保持不变。不失一般性,为了符号清晰,这些模块在2D中描述。扩展到3D很简单。
The 2D convolution consists of two steps: 1) sampling using a regular grid $\mathcal{R}$ over the input feature map $\mathbf{x}$; 2) summation of sampled values weighted by $\mathbf{w}$. The grid $\mathcal{R}$ defines the receptive field size and dilation. For example,
2D卷积包含两步:1)用规则的网格 $\mathcal{R}$ 在输入特征映射 $\mathbf{x}$ 上采样;2)对 $\mathbf{w}$ 加权的采样值求和。网格 $\mathcal{R}$ 定义了感受野的大小和扩张。例如,
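The example grid was rendered as an image in the original post and is lost here; reconstructed from the paper, a 3 × 3 kernel with dilation 1 corresponds to

$$\mathcal{R} = \{(-1,-1),\,(-1,0),\,\ldots,\,(0,1),\,(1,1)\}.$$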
For each location $\mathbf{p}_0$ on the output feature map $\mathbf{y}$, we have
对于输出特征映射 $\mathbf{y}$ 上的每个位置 $\mathbf{p}_0$,我们有
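The equation itself did not survive extraction; reconstructed from the definitions above, Eq. (1) reads

$$\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in \mathcal{R}} \mathbf{w}(\mathbf{p}_n) \cdot \mathbf{x}(\mathbf{p}_0 + \mathbf{p}_n). \tag{1}$$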
In deformable convolution, the regular grid $\mathcal{R}$ is augmented with offsets $\{\Delta \mathbf{p}_n \mid n = 1, \ldots, N\}$, where $N = |\mathcal{R}|$. Eq.(1) becomes
在可变形卷积中,规则的网格 $\mathcal{R}$ 通过偏移 $\{\Delta \mathbf{p}_n \mid n = 1, \ldots, N\}$ 增强,其中 $N = |\mathcal{R}|$。方程(1)变为
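Again the displayed equation is missing; a reconstruction consistent with Eq. (1):

$$\mathbf{y}(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in \mathcal{R}} \mathbf{w}(\mathbf{p}_n) \cdot \mathbf{x}(\mathbf{p}_0 + \mathbf{p}_n + \Delta \mathbf{p}_n). \tag{2}$$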
Now, the sampling is on the irregular and offset locations $\mathbf{p}_n + \Delta \mathbf{p}_n$. As the offset $\Delta \mathbf{p}_n$ is typically fractional, Eq.(2) is implemented via bilinear interpolation as
现在,采样是在不规则且有偏移的位置 $\mathbf{p}_n + \Delta \mathbf{p}_n$ 上。由于偏移 $\Delta \mathbf{p}_n$ 通常是小数,方程(2)通过双线性插值实现为
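Eqs. (3) and (4) also do not survive in the extracted text; a reconstruction based on the paper:

$$\mathbf{x}(\mathbf{p}) = \sum_{\mathbf{q}} G(\mathbf{q}, \mathbf{p}) \cdot \mathbf{x}(\mathbf{q}), \tag{3}$$

where $\mathbf{p} = \mathbf{p}_0 + \mathbf{p}_n + \Delta \mathbf{p}_n$ is an arbitrary (fractional) location, $\mathbf{q}$ enumerates all integral spatial locations in the feature map $\mathbf{x}$, and $G(\cdot,\cdot)$ is the bilinear interpolation kernel, separable into two one-dimensional kernels:

$$G(\mathbf{q}, \mathbf{p}) = g(q_x, p_x) \cdot g(q_y, p_y), \qquad g(a, b) = \max(0,\, 1 - |a - b|). \tag{4}$$

$G$ is non-zero only for the (at most) four integral locations surrounding $\mathbf{p}$, so Eq. (3) is fast to compute.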
As illustrated in Figure 2, the offsets are obtained by applying a convolutional layer over the same input feature map. The convolution kernel is of the same spatial resolution and dilation as those of the current convolutional layer (e.g., also $3 \times 3$ with dilation 1 in Figure 2). The output offset fields have the same spatial resolution as the input feature map. The channel dimension $2N$ corresponds to $N$ 2D offsets. During training, both the convolutional kernels for generating the output features and the offsets are learned simultaneously. To learn the offsets, the gradients are back-propagated through the bilinear operations in Eq.(3) and Eq.(4). It is detailed in appendix A.
Figure 2: Illustration of 3 × 3 deformable convolution.
如图2所示,通过在相同的输入特征映射上应用卷积层来获得偏移。卷积核具有与当前卷积层相同的空间分辨率和扩张(例如,在图2中也是扩张为1的 $3 \times 3$ 卷积核)。输出偏移域与输入特征映射具有相同的空间分辨率。通道维度 $2N$(注释:偏移的通道维度,包括 $x$ 方向和 $y$ 方向两部分)对应于 $N$ 个2D偏移量。在训练过程中,同时学习用于生成输出特征的卷积核和偏移量。为了学习偏移量,梯度通过方程(3)和(4)中的双线性运算进行反向传播。详见附录A。
图2:3×3可变形卷积的说明。
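To make the mechanism concrete, below is a minimal NumPy sketch of a single-channel 3×3 deformable convolution (Eqs. (1)–(4)). It is for illustration only: it assumes a single input channel, "same" output size with out-of-range samples evaluating to zero, and takes the offset field as a given array (in a real network it would be predicted by the parallel offset conv layer of Figure 2). All names are ours.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinear interpolation of the 2D map x at the fractional location (py, px),
    i.e. Eqs. (3) and (4); locations outside the map contribute zero."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:
                g = max(0.0, 1 - abs(py - qy)) * max(0.0, 1 - abs(px - qx))
                val += g * x[qy, qx]
    return val

def deformable_conv3x3(x, w, offsets):
    """Naive single-channel 3x3 deformable convolution (Eq. (2)), 'same' output size.

    x:       (H, W) input feature map
    w:       (3, 3) kernel weights
    offsets: (H, W, 18) per-location (dy, dx) offsets for the 9 samples;
             an all-zero array recovers the standard convolution of Eq. (1).
    """
    H, W = x.shape
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    y = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            for n, (dy, dx) in enumerate(grid):
                ofy, ofx = offsets[i, j, 2 * n], offsets[i, j, 2 * n + 1]
                y[i, j] += w[dy + 1, dx + 1] * bilinear_sample(x, i + dy + ofy, j + dx + ofx)
    return y
```

A production implementation would of course vectorize this and handle multiple input/output channels; the point here is only the sampling-with-learned-offsets idea.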
RoI pooling is used in all region proposal based object detection methods [16, 15, 47, 7]. It converts an input rectangular region of arbitrary size into fixed size features.
在所有基于区域提出的目标检测方法[16,15,47,7]中都使用了RoI池化。它将任意大小的输入矩形区域转换为固定大小的特征。
RoI Pooling [15] Given the input feature map $\mathbf{x}$ and a RoI of size $w \times h$ and top-left corner $\mathbf{p}_0$, RoI pooling divides the RoI into $k \times k$ ($k$ is a free parameter) bins and outputs a $k \times k$ feature map $\mathbf{y}$. For the $(i, j)$-th bin ($0 \le i, j < k$), we have
RoI池化[15]。给定输入特征映射 $\mathbf{x}$、大小为 $w \times h$ 且左上角为 $\mathbf{p}_0$ 的RoI,RoI池化将RoI分成 $k \times k$($k$ 是一个自由参数)个组块(bin),并输出 $k \times k$ 的特征映射 $\mathbf{y}$。对于第 $(i, j)$ 个组块($0 \le i, j < k$),我们有
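Eq. (5) is missing from the extracted text; reconstructed from the paper, average pooling over the bin gives

$$\mathbf{y}(i, j) = \sum_{\mathbf{p} \in \mathrm{bin}(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p}) \,/\, n_{ij}, \tag{5}$$

where $n_{ij}$ is the number of pixels in the bin.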
Similarly as in Eq.(2), in deformable RoI pooling, offsets $\{\Delta \mathbf{p}_{ij} \mid 0 \le i, j < k\}$ are added to the spatial binning positions. Eq.(5) becomes
类似于方程(2),在可变形RoI池化中,将偏移 $\{\Delta \mathbf{p}_{ij} \mid 0 \le i, j < k\}$ 加到空间组块的位置上。方程(5)变为
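A reconstruction of the missing Eq. (6), obtained by shifting every bin by its learned offset:

$$\mathbf{y}(i, j) = \sum_{\mathbf{p} \in \mathrm{bin}(i, j)} \mathbf{x}(\mathbf{p}_0 + \mathbf{p} + \Delta \mathbf{p}_{ij}) \,/\, n_{ij}. \tag{6}$$

As $\Delta \mathbf{p}_{ij}$ is typically fractional, Eq. (6) is again implemented via the bilinear interpolation of Eqs. (3) and (4).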
Figure 3 illustrates how to obtain the offsets. Firstly, RoI pooling (Eq.(5)) generates the pooled feature maps. From the maps, a fc layer generates the normalized offsets $\Delta \hat{\mathbf{p}}_{ij}$, which are then transformed to the offsets $\Delta \mathbf{p}_{ij}$ in Eq.(6) by element-wise product with the RoI's width and height, as $\Delta \mathbf{p}_{ij} = \gamma \cdot \Delta \hat{\mathbf{p}}_{ij} \circ (w, h)$. Here $\gamma$ is a pre-defined scalar to modulate the magnitude of the offsets. It is empirically set to $\gamma = 0.1$. The offset normalization is necessary to make the offset learning invariant to RoI size. The fc layer is learned by back-propagation, as detailed in appendix A.
Figure 3: Illustration of 3 × 3 deformable RoI pooling.
图3说明了如何获得偏移量。首先,RoI池化(方程(5))生成池化后的特征映射。然后,一个fc层从这些特征映射中产生归一化偏移量 $\Delta \hat{\mathbf{p}}_{ij}$,再通过与RoI的宽和高进行逐元素相乘,将其转换为方程(6)中的偏移量 $\Delta \mathbf{p}_{ij}$,即 $\Delta \mathbf{p}_{ij} = \gamma \cdot \Delta \hat{\mathbf{p}}_{ij} \circ (w, h)$。这里 $\gamma$ 是一个预定义的标量,用来调节偏移量的大小,经验上设定为 $\gamma = 0.1$。为了使偏移学习对RoI大小具有不变性,偏移归一化是必要的。fc层通过反向传播学习,详见附录A。
图3:阐述3×3的可变形RoI池化。
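A tiny sketch of the scaling step just described, with made-up numbers for the RoI size and the fc output (variable names are ours):

```python
import numpy as np

gamma = 0.1                           # pre-defined scalar from the paper
w, h = 200.0, 120.0                   # RoI width and height (example values)
delta_p_hat = np.array([0.3, -0.2])   # normalized offset predicted by the fc layer
delta_p = gamma * delta_p_hat * np.array([w, h])   # Δp_ij = γ · Δp̂_ij ∘ (w, h)
print(delta_p)                        # -> [ 6.  -2.4], an offset in pixels
```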
Position-Sensitive (PS) RoI Pooling [7] It is fully convolutional and different from RoI pooling. Through a conv layer, all the input feature maps are firstly converted to $k^2$ score maps for each object class (totally $C + 1$ for $C$ object classes), as illustrated in the bottom branch in Figure 4. Without need to distinguish between classes, such score maps are denoted as $\{\mathbf{x}_{i,j}\}$, where $(i, j)$ enumerates all bins. Pooling is performed on these score maps. The output value for the $(i, j)$-th bin is obtained by summation from the one score map $\mathbf{x}_{i,j}$ corresponding to that bin. In short, the difference from RoI pooling in Eq.(5) is that the general feature map $\mathbf{x}$ is replaced by a specific position-sensitive score map $\mathbf{x}_{i,j}$.
Figure 4: Illustration of 3 × 3 deformable PS RoI pooling.
位置敏感(PS)的RoI池化[7]。它是全卷积的,不同于RoI池化。通过一个卷积层,所有的输入特征映射首先被转换为每个目标类的 $k^2$ 个分数映射(对于 $C$ 个目标类,总共 $C + 1$ 类),如图4的底部分支所示。在不需要区分类别的情况下,这样的分数映射被表示为 $\{\mathbf{x}_{i,j}\}$,其中 $(i, j)$ 枚举所有的组块。池化是在这些分数映射上进行的。第 $(i, j)$ 个组块的输出值是通过对该组块所对应的那个分数映射 $\mathbf{x}_{i,j}$ 求和得到的。简而言之,与方程(5)中RoI池化的区别在于,通用特征映射 $\mathbf{x}$ 被特定的位置敏感分数映射 $\mathbf{x}_{i,j}$ 所取代。
图4:阐述3×3的可变形PS RoI池化。
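Written out, the substitution the text describes amounts to

$$\mathbf{y}(i, j) = \sum_{\mathbf{p} \in \mathrm{bin}(i, j)} \mathbf{x}_{i,j}(\mathbf{p}_0 + \mathbf{p}) \,/\, n_{ij},$$

i.e. Eq. (5) with the general feature map replaced by the bin-specific score map.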
In deformable PS RoI pooling, the only change in Eq.(6) is that $\mathbf{x}$ is also modified to $\mathbf{x}_{i,j}$. However, the offset learning is different. It follows the fully convolutional
spirit in [7], as illustrated in Figure 4. In the top branch, a conv layer generates the full spatial resolution offset fields. For each RoI (also for each class), PS RoI pooling is applied on such fields to obtain normalized offsets $\Delta \hat{\mathbf{p}}_{ij}$, which are then transformed to the real offsets $\Delta \mathbf{p}_{ij}$ in the same way as in deformable RoI pooling described above.
在可变形PS RoI池化中,方程(6)中唯一的变化是 $\mathbf{x}$ 也被修改为 $\mathbf{x}_{i,j}$。但是,偏移学习是不同的。它遵循[7]中的全卷积精神,如图4所示。在顶部分支中,一个卷积层生成完整空间分辨率的偏移域。对于每个RoI(对每个类别也一样),在这些偏移域上应用PS RoI池化以获得归一化偏移量 $\Delta \hat{\mathbf{p}}_{ij}$,然后以上述可变形RoI池化中相同的方式将其转换为实际偏移量 $\Delta \mathbf{p}_{ij}$。
Both deformable convolution and RoI pooling modules have the same input and output as their plain versions. Hence, they can readily replace their plain counterparts in existing CNNs. In the training, these added conv and fc layers for offset learning are initialized with zero weights. Their learning rates are set to $\beta$ times ($\beta = 1$ by default, and $\beta = 0.01$ for the fc layer in Faster R-CNN) the learning rate for the existing layers. They are trained via back propagation through the bilinear interpolation operations in Eq.(3) and Eq.(4). The resulting CNNs are called deformable ConvNets.
可变形卷积和RoI池化模块都具有与普通版本相同的输入和输出。因此,它们可以很容易地取代现有CNN中的普通版本。在训练中,这些添加的用于偏移学习的conv和fc层的权重被初始化为零。它们的学习率设置为现有层学习率的 $\beta$ 倍(默认 $\beta = 1$,Faster R-CNN中的fc层为 $\beta = 0.01$)。它们通过方程(3)和方程(4)中双线性插值运算的反向传播进行训练。由此产生的CNN称为可变形ConvNets。
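A minimal PyTorch sketch of this training convention (zero-initialized offset layer, learning rate scaled by β). The layer shapes and the stand-in backbone are our own placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

backbone = nn.Conv2d(3, 256, kernel_size=3, padding=1)   # stand-in for the feature network

# Offset branch of a 3x3 deformable conv: 2N = 18 output channels (one (y, x) pair per sample).
offset_conv = nn.Conv2d(256, 18, kernel_size=3, padding=1)
nn.init.zeros_(offset_conv.weight)   # zero init: training starts from the plain convolution
nn.init.zeros_(offset_conv.bias)

base_lr, beta = 1e-3, 1.0            # beta = 0.01 for the offset fc layer in Faster R-CNN
optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": base_lr},
        {"params": offset_conv.parameters(), "lr": beta * base_lr},
    ],
    momentum=0.9,
)
```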
To integrate deformable ConvNets with the state-of-the-art CNN architectures, we note that these architectures consist of two stages. First, a deep fully convolutional network generates feature maps over the whole input image. Second, a shallow task specific network generates results from the feature maps. We elaborate the two steps below.
为了将可变形ConvNets与最先进的CNN架构集成,我们注意到这些架构由两个阶段组成。首先,深度全卷积网络在整个输入图像上生成特征映射。其次,浅层的任务专用网络根据特征映射生成结果。下面我们详细说明这两个步骤。
Deformable Convolution for Feature Extraction We adopt two state-of-the-art architectures for feature extraction: ResNet-101 [22] and a modified version of Inception-ResNet [51]. Both are pre-trained on the ImageNet [8] classification dataset.
特征提取的可变形卷积。我们采用两种最先进的架构进行特征提取:ResNet-101[22]和Inception-ResNet[51]的修改版本。两者都在ImageNet[8]分类数据集上进行预训练。
The original Inception-ResNet is designed for image recognition. It has a feature misalignment issue and is problematic for dense prediction tasks. It is modified to fix the alignment problem [20]. The modified version is dubbed "Aligned-Inception-ResNet" and is detailed in appendix B.
最初的Inception-ResNet是为图像识别设计的。它存在特征不对齐的问题,对于密集预测任务来说是有问题的。它被修改以解决对齐问题[20]。修改后的版本被称为“Aligned-Inception-ResNet”,详见附录B。
Both models consist of several convolutional blocks, an average pooling and a 1000-way fc layer for ImageNet classification. The average pooling and the fc layers are removed. A randomly initialized 1 × 1 convolution is added at last to reduce the channel dimension to 1024. As in common practice [4, 7], the effective stride in the last convolutional block is reduced from 32 pixels to 16 pixels to increase the feature map resolution. Specifically, at the beginning of the last block, stride is changed from 2 to 1 (“conv5” for both ResNet-101 and Aligned-Inception-ResNet). To compensate, the dilation of all the convolution filters in this block (with kernel size > 1) is changed from 1 to 2.
两种模型都由若干卷积块、一个平均池化层和一个用于ImageNet分类的1000类全连接层组成。平均池化层和全连接层被移除。最后加入一个随机初始化的1×1卷积,以将通道维数减少到1024。与通常的做法[4,7]一样,最后一个卷积块的有效步长从32个像素减少到16个像素,以增加特征映射的分辨率。具体来说,在最后一个块的开始,步长从2变为1(ResNet-101和Aligned-Inception-ResNet的“conv5”)。为了进行补偿,该块中所有卷积滤波器(核大小>1)的扩张从1改为2。
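A rough PyTorch/torchvision sketch of this backbone surgery, using torchvision's ResNet-101 as a stand-in (its `layer4` plays the role of "conv5"; the `replace_stride_with_dilation` flag performs the stride/dilation change described above). This is an approximation of the setup, not the released implementation:

```python
import torch.nn as nn
import torchvision

# conv5 (layer4): stride 2 -> 1, dilation 1 -> 2, so the effective stride stays at 16 pixels.
backbone = torchvision.models.resnet101(
    replace_stride_with_dilation=[False, False, True]
)

# Remove the average pooling and the 1000-way fc layer, keep only the convolutional blocks.
features = nn.Sequential(*list(backbone.children())[:-2])

# Randomly initialized 1x1 convolution reducing the channel dimension to 1024.
reduce_dim = nn.Conv2d(2048, 1024, kernel_size=1)
```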
Optionally, deformable convolution is applied to the last few convolutional layers (with kernel size > 1). We experimented with different numbers of such layers and found 3 as a good trade-off for different tasks, as reported in Table 1.
Table 1: Results of using deformable convolution in the last 1, 2, 3, and 6 convolutional layers (of 3 × 3 filter) in ResNet-101 feature extraction network. For class-aware RPN, Faster R-CNN, and R-FCN, we report result on VOC 2007 test.
可选地,可变形卷积应用于最后的几个卷积层(核大小>1)。我们尝试了不同数量的这样的层,发现3是不同任务的一个很好的权衡,如表1所示。
表1:在ResNet-101特征提取网络中的最后1个,2个,3个和6个卷积层上(3×3滤波器)应用可变形卷积的结果。对于class-aware RPN,Faster R-CNN和R-FCN,我们报告了在VOC 2007测试集上的结果。
Segmentation and Detection Networks A task specific network is built upon the output feature maps from the feature extraction network mentioned above.
分割和检测网络。根据上述特征提取网络的输出特征映射构建特定任务的网络。
In the below, $C$ denotes the number of object classes.
在下面,$C$ 表示目标类别的数量。
DeepLab [5] is a state-of-the-art method for semantic segmentation. It adds a 1 × 1 convolutional layer over the feature maps to generate (C + 1) maps that represent the per-pixel classification scores. A following softmax layer then outputs the per-pixel probabilities.
DeepLab[5]是最先进的语义分割方法。它在特征映射上添加1×1卷积层以生成表示每个像素分类分数的(C+1)个映射。然后随后的softmax层输出每个像素的概率。
Category-Aware RPN is almost the same as the region proposal network in [47], except that the 2-class (object or not) convolutional classifier is replaced by a (C + 1)-class convolutional classifier. It can be considered as a simplified version of SSD [40].
除了用(C+1)类卷积分类器代替2类(目标或非目标)卷积分类器外,Category-Aware RPN与[47]中的区域提出网络几乎是相同的。它可以被认为是SSD的简化版本[40]。
Faster R-CNN [47] is the state-of-the-art detector. In our implementation, the RPN branch is added on the top of the conv4 block, following [47]. In the previous practice [22, 24], the RoI pooling layer is inserted between the conv4 and the conv5 blocks in ResNet-101, leaving 10 layers for each RoI. This design achieves good accuracy but has high per-RoI computation. Instead, we adopt a simplified design as in [38]. The RoI pooling layer is added at last. On top of the pooled RoI features, two fc layers of dimension 1024 are added, followed by the bounding box regression and the classification branches. Although such simplification (from 10 layer conv5 block to 2 fc layers) would slightly decrease the accuracy, it still makes a strong enough baseline and is not a concern in this work.
Faster R-CNN[47]是最先进的检测器。在我们的实现中,遵循[47],RPN分支被添加在conv4块的顶部。在以前的实践中[22,24],RoI池化层被插入在ResNet-101的conv4和conv5块之间,每个RoI要经过10层。这个设计实现了很好的精确度,但是每个RoI的计算量很高。相反,我们采用[38]中的简化设计:RoI池化层添加在最后。在池化的RoI特征之上,添加两个1024维的全连接层,接着是边界框回归和分类分支。虽然这样的简化(从10层的conv5块变为2个全连接层)会稍微降低精确度,但它仍然是一个足够强的基准,在这项工作中不是问题。
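As an illustration of the simplified head (pooled 7×7 RoI features → two 1024-d fc layers → classification and box regression), here is a schematic PyTorch module; the class and argument names are ours (21 classes assumes 20 VOC categories plus background), and the RoI pooling itself is assumed to happen upstream:

```python
import torch
import torch.nn as nn

class SimplifiedDetectionHead(nn.Module):
    """Rough sketch of the simplified head: pooled 7x7 RoI features -> two 1024-d
    fc layers -> classification and per-class bounding box regression."""

    def __init__(self, in_channels=1024, num_classes=21, pooled_size=7):
        super().__init__()
        self.fc1 = nn.Linear(in_channels * pooled_size * pooled_size, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls_score = nn.Linear(1024, num_classes)       # C + 1 classes
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)   # per-class box deltas

    def forward(self, pooled_rois):                         # (num_rois, C_in, 7, 7)
        x = pooled_rois.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.cls_score(x), self.bbox_pred(x)
```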
Optionally, the RoI pooling layer can be changed to deformable RoI pooling.
可选地,可以将RoI池化层更改为可变形的RoI池化。
R-FCN [7] is another state-of-the-art detector. It has negligible per-RoI computation cost. We follow the original implementation. Optionally, its RoI pooling layer can be changed to deformable position-sensitive RoI pooling.
R-FCN[7]是另一种最先进的检测器。它的每个RoI计算成本可以忽略不计。我们遵循原来的实现。可选地,其RoI池化层可以改变为可变形的位置敏感的RoI池化。
This work is built on the idea of augmenting the spatial sampling locations in convolution and RoI pooling with additional offsets and learning the offsets from target tasks.
这项工作基于这样一个想法:用额外的偏移量增强卷积和RoI池化中的空间采样位置,并从目标任务中学习这些偏移量。
When the deformable convolution are stacked, the effect of composited deformation is profound. This is exemplified in Figure 5. The receptive field and the sampling locations in the standard convolution are fixed all over the top feature map (left). They are adaptively adjusted according to the objects’ scale and shape in deformable convolution (right). More examples are shown in Figure 6. Table 2 provides quantitative evidence of such adaptive deformation.
Figure 5: Illustration of the fixed receptive field in standard convolution (a) and the adaptive receptive field in deformable convolution (b), using two layers. Top: two activation units on the top feature map, on two objects of different scales and shapes. The activation is from a 3 × 3 filter. Middle: the sampling locations of the 3 × 3 filter on the preceding feature map. Another two activation units are highlighted. Bottom: the sampling locations of two levels of 3 × 3 filters on the preceding feature map. Two sets of locations are highlighted, corresponding to the highlighted units above.
Figure 6: Each image triplet shows the sampling locations ($9^3 = 729$ red points in each image) in three levels of 3 × 3 deformable filters (see Figure 5 as a reference) for three activation units (green points) on the background (left), a small object (middle), and a large object (right), respectively.
Table 2: Statistics of effective dilation values of deformable convolutional filters on three layers and four categories. Similar as in COCO [39], we divide the objects into three categories equally according to the bounding box area. Small: area < $96^2$ pixels; medium: $96^2$ < area < $224^2$; large: area > $224^2$ pixels.
当可变形卷积叠加时,复合变形的影响是深远的。这在图5中举例说明。标准卷积中的感受野和采样位置在顶部特征映射上是固定的(左)。它们在可变形卷积中(右)根据目标的尺寸和形状进行自适应调整。图6中显示了更多的例子。表2提供了这种自适应变形的量化证据。
图5:标准卷积(a)中的固定感受野和可变形卷积(b)中的自适应感受野的示意图,使用两层。顶部:顶部特征映射上的两个激活单元,分别位于两个不同尺度和形状的目标上。激活来自一个3×3滤波器。中间:前一个特征映射上该3×3滤波器的采样位置,其中另外两个激活单元被突出显示。底部:前一个特征映射上两级3×3滤波器的采样位置。突出显示的两组位置对应于上面突出显示的单元。
图6:每个图像三元组显示了三级3×3可变形滤波器(参见图5)的采样位置(每张图像中 $9^3 = 729$ 个红点),三个激活单元(绿点)分别位于背景(左)、小目标(中)和大目标(右)上。
表2:可变形卷积滤波器在三个层和四个类别上的有效扩张值的统计。与COCO[39]中类似,我们根据边界框面积将目标平均分为三类。小:面积 < $96^2$ 个像素;中等:$96^2$ < 面积 < $224^2$;大:面积 > $224^2$ 个像素。
The effect of deformable RoI pooling is similar, as illustrated in Figure 7. The regularity of the grid structure in standard RoI pooling no longer holds. Instead, parts deviate from the RoI bins and move onto the nearby object foreground regions. The localization capability is enhanced, especially for non-rigid objects.
Figure 7: Illustration of offset parts in deformable (position-sensitive) RoI pooling in R-FCN [7] and 3 × 3 bins (red) for an input RoI (yellow). Note how the parts are offset to cover the non-rigid objects.
可变形RoI池化的效果是类似的,如图7所示。标准RoI池化中网格结构的规律不再成立。相反,部件偏离RoI组块并移动到附近的目标前景区域上。定位能力得到增强,特别是对于非刚性物体。
图7:R-FCN[7]中可变形(位置敏感)RoI池化的偏移部件的示意图,以及输入RoI(黄色)的3×3个组块(红色)。请注意部件如何偏移以覆盖非刚性物体。
Our work is related to previous works in different aspects. We discuss the relations and differences in details.
我们的工作与以前的工作在不同的方面有联系。我们详细讨论联系和差异。
Spatial Transform Networks (STN) [26] It is the first work to learn spatial transformation from data in a deep learning framework. It warps the feature map via a global parametric transformation such as affine transformation. Such warping is expensive and learning the transformation parameters is known to be difficult. STN has shown successes in small scale image classification problems. The inverse STN method [37] replaces the expensive feature warping by efficient transformation parameter propagation.
空间变换网络(STN)[26]。这是在深度学习框架下从数据中学习空间变换的第一个工作。它通过全局参数化变换(例如仿射变换)来扭曲特征映射。这种扭曲是昂贵的,而且众所周知,学习变换参数是困难的。STN在小规模图像分类问题上取得了成功。逆STN方法[37]用高效的变换参数传播来代替昂贵的特征扭曲。
The offset learning in deformable convolution can be considered as an extremely light-weight spatial transformer in STN [26]. However, deformable convolution does not adopt a global parametric transformation and feature warping. Instead, it samples the feature map in a local and dense manner. To generate new feature maps, it has a weighted summation step, which is absent in STN.
可变形卷积中的偏移学习可以被认为是STN[26]中一种极其轻量的空间变换器。然而,可变形卷积不采用全局参数变换和特征扭曲。相反,它以局部且密集的方式对特征映射进行采样。为了生成新的特征映射,它还有一个加权求和步骤,这是STN中所没有的。
Deformable convolution is easy to integrate into any CNN architectures. Its training is easy. It is shown effective for complex vision tasks that require dense (e.g., semantic segmentation) or semi-dense (e.g., object detection) predictions. These tasks are difficult (if not infeasible) for STN [26, 37].
可变形卷积很容易集成到任何CNN架构中。它的训练很简单。对于要求密集(例如语义分割)或半密集(例如目标检测)预测的复杂视觉任务来说,它是有效的。这些任务对于STN来说是困难的(如果不是不可行的话)[26,37]。
Active Convolution [27] This work is contemporary. It also augments the sampling locations in the convolution with offsets and learns the offsets via back-propagation end-to-end. It is shown effective on image classification tasks.
主动卷积[27]。这是同期的工作。它也用偏移量增强卷积中的采样位置,并通过端到端的反向传播学习偏移量。它被证明在图像分类任务上是有效的。
Two crucial differences from deformable convolution make this work less general and adaptive. First, it shares the offsets all over the different spatial locations. Second, the offsets are static model parameters that are learnt per task or per training. In contrast, the offsets in deformable convolution are dynamic model outputs that vary per image location. They model the dense spatial transformations in the images and are effective for (semi-)dense prediction tasks such as object detection and semantic segmentation.
与可变形卷积相比,两个关键区别使得这项工作不那么通用和自适应。首先,它在所有不同的空间位置上共享偏移量。其次,偏移量是每个任务或每次训练学习到的静态模型参数。相反,可变形卷积中的偏移量是随图像位置变化的动态模型输出。它们对图像中的密集空间变换进行建模,对于(半)密集的预测任务(如目标检测和语义分割)是有效的。
Effective Receptive Field [43] It finds that not all pixels in a receptive field contribute equally to an output response. The pixels near the center have much larger impact. The effective receptive field only occupies a small fraction of the theoretical receptive field and has a Gaussian distribution. Although the theoretical receptive field size increases linearly with the number of convolutional layers, a surprising result is that, the effective receptive field size increases linearly with the square root of the number, therefore, at a much slower rate than what we would expect.
有效感受野[43]。它发现,感受野中并不是所有像素对输出响应的贡献都相同,靠近中心的像素影响要大得多。有效感受野只占据理论感受野的一小部分,并具有高斯分布。虽然理论感受野的大小随卷积层数量线性增加,但一个令人惊讶的结果是,有效感受野的大小随层数的平方根线性增加,因此其增长速度比我们预期的要慢得多。
This finding indicates that even the top layer’s unit in deep CNNs may not have large enough receptive field. This partially explains why atrous convolution [23] is widely used in vision tasks (see below). It indicates the needs of adaptive receptive field learning.
这一发现表明,即使是深层CNN的顶层单元也可能没有足够大的感受野。这部分解释了为什么空洞卷积[23]被广泛用于视觉任务(见下文)。它也表明了自适应感受野学习的必要性。
Deformable convolution is capable of learning receptive fields adaptively, as shown in Figure 5, 6 and Table 2.
可变形卷积能够自适应地学习感受野,如图5,6和表2所示。
Atrous convolution [23] It increases a normal filter’s stride to be larger than 1 and keeps the original weights at sparsified sampling locations. This increases the receptive field size and retains the same complexity in parameters and computation. It has been widely used for semantic segmentation [41, 5, 54] (also called dilated convolution in [54]), object detection [7], and image classification [55].
空洞卷积[23]。它将普通滤波器的采样步长增加到大于1,并在稀疏化的采样位置上保留原始权重。这增加了感受野的大小,同时保持参数量和计算量不变。它已被广泛用于语义分割[41,5,54](在[54]中也称为扩张卷积)、目标检测[7]和图像分类[55]。
Deformable convolution is a generalization of atrous convolution, as easily seen in Figure 1 (c). Extensive comparison to atrous convolution is presented in Table 3.
Table 3: Evaluation of our deformable modules and atrous convolution, using ResNet-101.
可变形卷积是空洞卷积的推广,从图1(c)中很容易看出这一点。表3给出了与空洞卷积的大量比较。
表3:我们的可变形模块与空洞卷积的评估,使用ResNet-101。
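One way to make the generalization explicit: with constant offsets $\Delta \mathbf{p}_n = (d - 1)\,\mathbf{p}_n$, the deformable convolution of Eq. (2) samples at $\mathbf{p}_0 + d\,\mathbf{p}_n$, which is exactly atrous convolution with dilation $d$; learning a free offset per location then strictly subsumes this special case.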
Deformable Part Models (DPM) [11] Deformable RoI pooling is similar to DPM because both methods learn the spatial deformation of object parts to maximize the classification score. Deformable RoI pooling is simpler since no spatial relations between the parts are considered.
可变形部件模型(DPM)[11]。可变形RoI池化与DPM类似,因为两种方法都可以学习目标部件的空间变形,以最大化分类得分。由于不考虑部件之间的空间关系,所以可变形RoI池化更简单。
DPM is a shallow model and has limited capability of modeling deformation. While its inference algorithm can be converted to CNNs [17] by treating the distance transform as a special pooling operation, its training is not end-to-end and involves heuristic choices such as selection of components and part sizes. In contrast, deformable ConvNets are deep and perform end-to-end training. When multiple deformable modules are stacked, the capability of modeling deformation becomes stronger.
DPM是一个浅层模型,其建模变形能力有限。虽然其推理算法可以通过将距离变换视为一个特殊的池化操作转换为CNN[17],但是它的训练不是端到端的,而是涉及启发式选择,例如选择组件和部件尺寸。相比之下,可变形ConvNets是深层的并进行端到端的训练。当多个可变形模块堆叠时,建模变形的能力变得更强。
DeepID-Net [44] It introduces a deformation constrained pooling layer which also considers part deformation for object detection. It therefore shares a similar spirit with deformable RoI pooling, but is much more complex. This work is highly engineered and based on RCNN [16]. It is unclear how to adapt it to the recent state-of-the-art object detection methods [47, 7] in an end-to-end manner.
DeepID-Net[44]。它引入了一个变形约束池化层,同样考虑了目标检测中的部件变形。因此,它与可变形RoI池化有相似的精神,但要复杂得多。这项工作是高度工程化的,并且基于R-CNN[16]。目前尚不清楚如何以端到端的方式使其适配于最近最先进的目标检测方法[47,7]。
Spatial manipulation in RoI pooling Spatial pyramid pooling [34] uses hand crafted pooling regions over scales. It is the predominant approach in computer vision and also used in deep learning based object detection [21, 15].
RoI池化中的空间操作。空间金字塔池化[34]在不同尺度上使用手工设计的池化区域。它是计算机视觉中的主流方法,也用于基于深度学习的目标检测[21,15]。
Learning the spatial layout of pooling regions has received little study. The work in [28] learns a sparse subset of pooling regions from a large over-complete set. The large set is hand engineered and the learning is not end-to-end.
学习池化区域的空间布局很少受到研究。[28]中的工作从一个大型的超完备集合中学习池化区域的一个稀疏子集。这个大集合是手工设计的,并且学习不是端到端的。
Deformable RoI pooling is the first to learn pooling regions end-to-end in CNNs. While the regions are of the same size currently, extension to multiple sizes as in spatial pyramid pooling [34] is straightforward.
可变形RoI池化是第一个在CNN中端到端地学习池化区域的方法。虽然目前这些区域的尺寸相同,但像空间金字塔池化[34]那样扩展到多种尺寸是很直接的。
Transformation invariant features and their learning There have been tremendous efforts on designing transformation invariant features. Notable examples include scale invariant feature transform (SIFT) [42] and ORB [49] (O for orientation). There is a large body of such works in the context of CNNs. The invariance and equivalence of CNN representations to image transformations are studied in [36]. Some works learn invariant CNN representations with respect to different types of transformations such as [50], scattering networks [3], convolutional jungles [32], and TI-pooling [33]. Some works are devoted for specific transformations such as symmetry [13, 9], scale [29], and rotation [53].
变换不变特征及其学习。在设计变换不变特征方面已经进行了巨大的努力。值得注意的例子包括尺度不变特征变换(SIFT)[42]和ORB[49](O为方向)。在CNN的背景下有大量这样的工作。CNN表示对图像变换的不变性和等价性在[36]中被研究。一些工作学习关于不同类型的变换(如[50],散射网络[3],卷积森林[32]和TI池化[33])的不变CNN表示。有些工作专门用于对称性[13,9],尺度[29]和旋转[53]等特定转换。
As analyzed in Section 1, in these works the transformations are known a priori. The knowledge (such as parameterization) is used to hand craft the structure of feature extraction algorithm, either fixed in such as SIFT, or with learnable parameters such as those based on CNNs. They cannot handle unknown transformations in the new tasks.
如第1节所分析的,在这些工作中,变换是先验已知的。这些知识(比如参数化形式)被用来手工设计特征提取算法的结构:要么是固定的(如SIFT),要么带有可学习的参数(如基于CNN的方法)。它们无法处理新任务中的未知变换。
In contrast, our deformable modules generalize various transformations (see Figure 1). The transformation invariance is learned from the target task.
相反,我们的可变形模块可以推广各种变换(见图1)。变换的不变性是从目标任务中学习到的。
Dynamic Filter [2] Similar to deformable convolution, the dynamic filters are also conditioned on the input features and change over samples. Differently, only the filter weights are learned, not the sampling locations like ours. This work is applied for video and stereo prediction.
动态滤波器[2]。与可变形卷积类似,动态滤波器也以输入特征为条件,并随样本变化。不同的是,它只学习滤波器权重,而不像我们这样学习采样位置。这项工作应用于视频和立体视觉预测。
Combination of low level filters Gaussian filters and its smooth derivatives [30] are widely used to extract low level image structures such as corners, edges, T-junctions, etc. Under certain conditions, such filters form a set of basis and their linear combination forms new filters within the same group of geometric transformations, such as multiple orientations in Steerable Filters [12] and multiple scales in [45]. We note that although the term deformable kernels is used in [45], its meaning is different from ours in this work.
低级滤波器的组合。高斯滤波器及其平滑导数[30]被广泛用于提取低级图像结构,如角点、边缘、T型交叉点等。在某些条件下,这些滤波器形成一组基,它们的线性组合在同一几何变换群内形成新的滤波器,例如Steerable Filters[12]中的多个方向和[45]中的多个尺度。我们注意到,尽管[45]中使用了可变形核(deformable kernels)这个术语,但它的含义与我们在本文中的含义不同。
Most CNNs learn all their convolution filters from scratch. The recent work [25] shows that it could be unnecessary. It replaces the free form filters by weighted combination of low level filters (Gaussian derivatives up to 4-th order) and learns the weight coefficients. The regularization over the filter function space is shown to improve the generalization ability when training data are small.
大多数CNN从零开始学习所有的卷积滤波器。最近的工作[25]表明,这可能是没必要的。它用低级滤波器(最高4阶的高斯导数)的加权组合来代替自由形式的滤波器,并学习权重系数。当训练数据较少时,对滤波器函数空间的正则化被证明可以提高泛化能力。
The above works are related to ours in that, when multiple filters, especially with different scales, are combined, the resulting filter could have complex weights and resemble our deformable convolution filter. However, deformable convolution learns sampling locations instead of filter weights.
上面的工作与我们的工作相关之处在于:当组合多个滤波器(尤其是不同尺度的滤波器)时,所得到的滤波器可能具有复杂的权重,并与我们的可变形卷积滤波器相似。但是,可变形卷积学习的是采样位置而不是滤波器权重。
Semantic Segmentation We use PASCAL VOC [10] and CityScapes [6]. For PASCAL VOC, there are 20 semantic categories. Following the protocols in [19, 41, 4], we use the VOC 2012 dataset and the additional mask annotations in [18]. The training set includes 10,582 images. Evaluation is performed on 1,449 images in the validation set. For CityScapes, following the protocols in [5], training and evaluation are performed on 2,975 images in the train set and 500 images in the validation set, respectively. There are 19 semantic categories plus a background category.
语义分割。我们使用PASCAL VOC[10]和CityScapes[6]。对于PASCAL VOC,有20个语义类别。遵循[19,41,4]中的协议,我们使用VOC 2012数据集和[18]中的附加掩模注释。训练集包含10,582张图像。评估在验证集中的1,449张图像上进行。对于CityScapes,按照[5]中的协议,对训练数据集中的2,975张图像和验证集中的500张图像分别进行训练和评估。有19个语义类别加上一个背景类别。
For evaluation, we use the mean intersection-over-union (mIoU) metric defined over image pixels, following the standard protocols [10, 6]. We use mIoU@V and mIoU@C for PASCAL VOC and Cityscapes, respectively.
为了评估,我们使用在图像像素上定义的平均交并比(mIoU)度量,遵循标准协议[10,6]。对于PASCAL VOC和Cityscapes,我们分别使用mIoU@V和mIoU@C。
In training and inference, the images are resized to have a shorter side of 360 pixels for PASCAL VOC and 1,024 pixels for Cityscapes. In SGD training, one image is randomly sampled in each mini-batch. A total of 30k and 45k iterations are performed for PASCAL VOC and Cityscapes, respectively, with 8 GPUs and one mini-batch on each. The learning rates are $10^{-3}$ and $10^{-4}$ in the first $\frac{2}{3}$ and the last $\frac{1}{3}$ iterations, respectively.
在训练和推断中,图像的大小被调整为:PASCAL VOC较短边为360像素,Cityscapes较短边为1,024像素。在SGD训练中,每个小批次数据中随机抽取一张图像。对PASCAL VOC和Cityscapes分别执行30k和45k次迭代,使用8个GPU,每个GPU处理一个小批次。前 $\frac{2}{3}$ 的迭代和后 $\frac{1}{3}$ 的迭代的学习率分别设为 $10^{-3}$ 和 $10^{-4}$。
Object Detection We use PASCAL VOC and COCO [39] datasets. For PASCAL VOC, following the protocol in [15], training is performed on the union of VOC 2007 trainval and VOC 2012 trainval. Evaluation is on VOC 2007 test. For COCO, following the standard protocol [39], training and evaluation are performed on the 120k images in the trainval and the 20k images in the test-dev, respectively.
目标检测。我们使用PASCAL VOC和COCO[39]数据集。对于PASCAL VOC,按照[15]中的协议,在VOC 2007 trainval和VOC 2012 trainval的并集上进行训练,在VOC 2007测试集上进行评估。对于COCO,遵循标准协议[39],分别在trainval中的120k张图像和test-dev中的20k张图像上进行训练和评估。
For evaluation, we use the standard mean average precision (mAP) scores [10, 39]. For PASCAL VOC, we report mAP scores using IoU thresholds at 0.5 and 0.7. For COCO, we use the standard COCO metric of mAP@[0.5:0.95], as well as [email protected].
为了评估,我们使用标准的平均精度均值(mAP)得分[10,39]。对于PASCAL VOC,我们报告使用0.5和0.7的IoU阈值的mAP分数。对于COCO,我们使用标准的COCO度量mAP@[0.5:0.95],以及[email protected]。
In training and inference, the images are resized to have a shorter side of 600 pixels. In SGD training, one image is randomly sampled in each mini-batch. For class-aware RPN, 256 RoIs are sampled from the image. For Faster R-CNN and R-FCN, 256 and 128 RoIs are sampled for the region proposal and the object detection networks, respectively. $7 \times 7$ bins are adopted in RoI pooling. To facilitate the ablation experiments on VOC, we follow [38] and utilize pre-trained and fixed RPN proposals for the training of Faster R-CNN and R-FCN, without feature sharing between the region proposal and the object detection networks. The RPN network is trained separately as in the first stage of the procedure in [47]. For COCO, joint training as in [48] is performed and feature sharing is enabled for training. A total of 30k and 240k iterations are performed for PASCAL VOC and COCO, respectively, on 8 GPUs. The learning rates are set as $10^{-3}$ and $10^{-4}$ in the first $\frac{2}{3}$ and the last $\frac{1}{3}$ iterations, respectively.
在训练和推断中,图像被调整为较短边为600像素。在SGD训练中,每个小批次中随机抽取一张图像。对于class-aware RPN,从图像中采样256个RoI。对于Faster R-CNN和R-FCN,分别为区域提出网络和目标检测网络采样256个和128个RoI。RoI池化中采用 $7 \times 7$ 的组块。为了便于在VOC上进行消融实验,我们遵循[38],利用预训练且固定的RPN提出来训练Faster R-CNN和R-FCN,区域提出网络和目标检测网络之间没有特征共享。RPN网络按照[47]中过程的第一阶段单独训练。对于COCO,执行[48]中的联合训练,并在训练时启用特征共享。在8个GPU上对PASCAL VOC和COCO分别执行30k和240k次迭代。前 $\frac{2}{3}$ 的迭代和后 $\frac{1}{3}$ 的迭代的学习率分别设为 $10^{-3}$ 和 $10^{-4}$。
Extensive ablation studies are performed to validate the efficacy and efficiency of our approach.
我们进行了大量的消融研究来验证我们方法的有效性和效率。
Deformable Convolution Table 1 evaluates the effect of deformable convolution using ResNet-101 feature extraction network. Accuracy steadily improves when more deformable convolution layers are used, especially for DeepLab and class-aware RPN. The improvement saturates when using 3 deformable layers for DeepLab, and 6 for others. In the remaining experiments, we use 3 in the feature extraction networks.
可变形卷积。表1使用ResNet-101特征提取网络评估了可变形卷积的效果。当使用更多可变形卷积层时,精度稳步提高,对于DeepLab和class-aware RPN尤其如此。对于DeepLab,使用3个可变形层时改进达到饱和;对于其它网络,则是6个。在其余的实验中,我们在特征提取网络中使用3个可变形层。
We empirically observed that the learned offsets in the deformable convolution layers are highly adaptive to the image content, as illustrated in Figure 5 and Figure 6. To better understand the mechanism of deformable convolution, we define a metric called effective dilation for a deformable convolution filter. It is the mean of the distances between all adjacent pairs of sampling locations in the filter. It is a rough measure of the receptive field size of the filter.
我们通过实验观察到,可变形卷积层中学习到的偏移量对图像内容具有高度的自适应性,如图5和图6所示。为了更好地理解可变形卷积的机制,我们为可变形卷积滤波器定义了一个称为有效扩张的度量。它是滤波器中所有相邻采样位置对之间距离的平均值,是对滤波器感受野大小的粗略测量。
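A small NumPy sketch of this metric as we read it; the paper does not spell out which pairs count as "adjacent", so the code below assumes the 4-neighbour pairs on the 3×3 grid:

```python
import numpy as np

def effective_dilation(points):
    """Mean distance between adjacent pairs of sampling locations of a 3x3 filter.

    points: (3, 3, 2) array of (y, x) sampling locations after the learned offsets
    are added. For a plain 3x3 filter with dilation d, this returns d.
    """
    dists = []
    for i in range(3):
        for j in range(3):
            if j + 1 < 3:   # horizontally adjacent pair
                dists.append(np.linalg.norm(points[i, j + 1] - points[i, j]))
            if i + 1 < 3:   # vertically adjacent pair
                dists.append(np.linalg.norm(points[i + 1, j] - points[i, j]))
    return float(np.mean(dists))
```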
We apply the R-FCN network with 3 deformable layers (as in Table 1) on VOC 2007 test images. We categorize the deformable convolution filters into four classes: small, medium, large, and background, according to the ground truth bounding box annotation and where the filter center is. Table 2 reports the statistics (mean and std) of the effective dilation values. It clearly shows that: 1) the receptive field sizes of deformable filters are correlated with object sizes, indicating that the deformation is effectively learned from image content; 2) the filter sizes on the background region are between those on medium and large objects, indicating that a relatively large receptive field is necessary for recognizing the background regions. These observations are consistent in different layers.
我们将具有3个可变形层的R-FCN网络(如表1)应用于VOC 2007测试图像。根据真实边界框标注以及滤波器中心所处的位置,我们将可变形卷积滤波器分为四类:小、中、大和背景。表2报告了有效扩张值的统计(均值和标准差)。它清楚地表明:1)可变形滤波器的感受野大小与目标大小相关,表明变形是从图像内容中有效学习到的;2)背景区域上的滤波器大小介于中等和大目标上的滤波器之间,表明识别背景区域需要相对较大的感受野。这些观察结果在不同层上是一致的。
The default ResNet-101 model uses atrous convolution with dilation 2 for the last three 3 × 3 convolutional layers (see Section 2.3). We further tried dilation values 4, 6, and 8 and reported the results in Table 3. It shows that: 1) accuracy increases for all tasks when using larger dilation values, indicating that the default networks have too small receptive fields; 2) the optimal dilation values vary for different tasks, e.g., 6 for DeepLab but 4 for Faster R-CNN; 3) deformable convolution has the best accuracy. These observations verify that adaptive learning of filter deformation is effective and necessary.
默认的ResNet-101模型在最后的3个3×3卷积层使用扩张为2的空洞卷积(见2.3节)。我们进一步尝试了扩张值4、6和8,并在表3中报告了结果。它表明:1)当使用较大的扩张值时,所有任务的准确度都会增加,表明默认网络的感受野太小;2)对于不同的任务,最佳扩张值是不同的,例如,DeepLab为6,Faster R-CNN为4;3)可变形卷积具有最好的精度。这些观察结果证明了滤波器变形的自适应学习是有效且必要的。
Deformable RoI Pooling It is applicable to Faster R-CNN and R-FCN. As shown in Table 3, using it alone already produces noticeable performance gains, especially at the strict [email protected] metric. When both deformable convolution and RoI pooling are used, significant accuracy improvements are obtained.
可变形RoI池化。它适用于Faster R-CNN和R-FCN。如表3所示,单独使用它已经产生了显著的性能收益,特别是在严格的[email protected]度量标准下。当同时使用可变形卷积和可变形RoI池化时,会获得显著的准确性改进。
Model Complexity and Runtime Table 4 reports the model complexity and runtime of the proposed deformable ConvNets and their plain versions. Deformable ConvNets only add small overhead over model parameters and computation. This indicates that the significant performance improvement is from the capability of modeling geometric transformations, rather than from increased model parameters.
Table 4: Model complexity and runtime comparison of deformable ConvNets and the plain counterparts, using ResNet-101. The overall runtime in the last column includes image resizing, network forward, and post-processing (e.g., NMS for object detection). Runtime is counted on a workstation with Intel E5-2650 v2 CPU and Nvidia K40 GPU.
模型复杂性和运行时间。表4报告了所提出的可变形ConvNets及其普通版本的模型复杂度和运行时间。可变形ConvNets仅增加了少量的模型参数和计算量。这表明显著的性能改进来自于建模几何变换的能力,而不是来自于增加模型参数。
表4:使用ResNet-101的可变形ConvNets和对应普通版本的模型复杂性和运行时比较。最后一列中的整体运行时间包括图像大小调整,网络前馈传播和后处理(例如,用于目标检测的NMS)。运行时间计算是在一台配备了Intel E5-2650 v2 CPU和Nvidia K40 GPU的工作站上。
In Table 5, we perform extensive comparison between the deformable ConvNets and the plain ConvNets for object detection on the COCO test-dev set. We first experiment using the ResNet-101 model. The deformable versions of class-aware RPN, Faster R-CNN and R-FCN achieve mAP@[0.5:0.95] scores of 25.8%, 33.1%, and 34.5% respectively, which are 11%, 13%, and 12% relatively higher than their plain-ConvNets counterparts. By replacing ResNet-101 by Aligned-Inception-ResNet in Faster R-CNN and R-FCN, their plain-ConvNet baselines both improve thanks to the more powerful feature representations. And the effective performance gains brought by deformable ConvNets also hold. By further testing on multiple image scales (the image shorter side is in [480, 576, 688, 864, 1200, 1400]) and performing iterative bounding box average [14], the mAP@[0.5:0.95] score is increased to 37.5% for the deformable version of R-FCN. Note that the performance gain of deformable ConvNets is complementary to these bells and whistles.
Table 5: Object detection results of deformable ConvNets v.s. plain ConvNets on COCO test-dev set. M denotes multi-scale testing, and B denotes iterative bounding box average in the table.
在表5中,我们在COCO test-dev数据集上对用于目标检测的可变形ConvNets和普通ConvNets进行了广泛的比较。我们首先使用ResNet-101模型进行实验。class-aware RPN、Faster R-CNN和R-FCN的可变形版本分别获得了25.8%、33.1%和34.5%的mAP@[0.5:0.95]分数,分别比对应的普通ConvNets相对高出11%、13%和12%。通过在Faster R-CNN和R-FCN中用Aligned-Inception-ResNet取代ResNet-101,由于更强大的特征表示,它们的普通ConvNet基线都得到了提高,而可变形ConvNets带来的有效性能收益同样成立。通过在多个图像尺度上(图像较短边在[480,576,688,864,1200,1400]内)进一步测试,并执行迭代边界框平均[14],R-FCN的可变形版本的mAP@[0.5:0.95]分数提高到了37.5%。请注意,可变形ConvNets的性能增益与这些附加技巧是互补的。
表5:可变形ConvNets和普通ConvNets在COCO test-dev数据集上的目标检测结果。在表中M表示多尺度测试,B表示迭代边界框平均值。
This paper presents deformable ConvNets, which is a simple, efficient, deep, and end-to-end solution to model dense spatial transformations. For the first time, we show that it is feasible and effective to learn dense spatial transformation in CNNs for sophisticated vision tasks, such as object detection and semantic segmentation.
本文提出了可变形ConvNets,它是一个简单、高效、深度且端到端的建模密集空间变换的解决方案。我们首次证明了,在CNN中为复杂视觉任务(如目标检测和语义分割)学习密集空间变换是可行且有效的。
The Aligned-Inception-ResNet model was trained and investigated by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in unpublished work.
Aligned-Inception-ResNet模型由Kaiming He,Xiangyu Zhang,Shaoqing Ren和Jian Sun在未发表的工作中进行了研究和训练。
[1] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, 2010. 1
[2] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016. 6
[3] J. Bruna and S. Mallat. Invariant scattering convolution networks. TPAMI, 2013. 6
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 4, 7
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016. 4, 6, 7
[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 7
[7] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016. 1, 2, 3, 4, 5, 6
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 4, 10
[9] S. Dieleman, J. D. Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016. 6
[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010. 7
[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010. 2, 6
[12] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. TPAMI, 1991. 6
[13] R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, 2014. 6
[14] S. Gidaris and N. Komodakis. Object detection via a multiregion & semantic segmentation-aware cnn model. In ICCV, 2015. 9
[15] R. Girshick. Fast R-CNN. In ICCV, 2015. 1, 2, 3, 6, 7
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3, 6
[17] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. arXiv preprint arXiv:1409.5403, 2014.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Aligned-inceptionresnet model, unpublished work. 4, 10
[21] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014. 6
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 4, 10
[23] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. Wavelets: Time-Frequency Methods and Phase Space, pages 289–297, 1989. 6
[24] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 4
[25] J.-H. Jacobsen, J. van Gemert, Z. Lou, and A. W.M.Smeulders. Structured receptive fields in cnns. In CVPR, 2016. 6
[26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015. 2, 5
[27] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017. 5
[28] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012. 6
[29] A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale-invariant convolutional neural networks. In NIPS, 2014. 6
[30] J. J. Koenderink and A. J. van Doorn. Representation of local geometry in the visual system. Biological Cybernetics, 55(6):367–375, Mar. 1987. 6
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1
[32] D. Laptev and J. M. Buhmann. Transformation-invariant convolutional jungles. In CVPR, 2015. 6
[33] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. arXiv preprint arXiv:1604.06318, 2016. 6
[34] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006. 6
[35] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 1995. 1
[36] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015. 6
[37] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. arXiv preprint arXiv:1612.03897, 2016. 6
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 7
[19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014. 7
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 4, 7
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV. 2014. 7
[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. In ECCV, 2016. 1, 4
[41] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 6, 7
[42] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999. 1, 6
[43] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. arXiv preprint arXiv:1701.04128, 2017. 6
[44] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, and X. Tang. Deepid-net: Deformable deep convolutional neural networks for object detection. In CVPR, 2015. 6
[45] P. Perona. Deformable kernels for early vision. TPAMI, 1995. 6
[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 1
[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 1, 3, 4, 6, 7
[48] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2016. 7
[49] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: an efficient alternative to sift or surf. In ICCV, 2011. 6
[50] K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012. 6
[51] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016. 4, 10
[52] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov. Scalable, high-quality object detection. arXiv:1412.1441v2, 2014. 1
[53] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. arXiv preprint arXiv:1612.04642, 2016. 6
[54] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. 6
[55] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In CVPR, 2017. 6