FSOD Paper Translation


Figure 4. Our network architecture uses ResNet-50 as the backbone. The support image (green) and the query image (blue) are fed into the weight-shared backbone. The RPN uses the attention feature, which is generated by the depth-wise cross correlation between the compact 1×1×C support feature and the H×W×C query feature. The class scores produced by the patch-relation head (top head), the global-relation head (middle head), and the local-correlation head (bottom head) are summed as the final matching score, and the bounding-box prediction is generated by the patch-relation head.

In an R-CNN framework, the RPN module is followed by a detector which plays the important role of re-scoring the proposals and recognizing classes. Therefore, we want the detector to have a strong discriminative ability to distinguish different categories. To this aim, we propose a novel multi-relation detector to effectively measure the similarity between proposal boxes from the query and the support objects. The detector includes three attention modules: the patch-relation head, which learns a deep non-linear metric for patch matching; the global-relation head, which learns a deep embedding for global matching; and the local-correlation head, which learns the pixel-wise and depth-wise cross correlation between support and query proposals. We experimentally show that the three matching modules complement each other and gain higher performance incrementally as they are added one by one. We introduce the details of our multi-relation detector below.


Figure 5. Attention RPN. The support feature is average-pooled to a 1×1×C vector, which is then used to calculate a depth-wise cross correlation with the query feature; the output is used as the attention feature and is fed into the RPN to generate proposals.
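With a 1×1×C support kernel, the depth-wise cross correlation above reduces to scaling each channel of the query feature by the corresponding support channel. A minimal NumPy sketch of this computation (array shapes and names are our own assumptions, not taken from the paper's code):

```python
import numpy as np

def attention_feature(support_feat, query_feat):
    """Attention RPN: depth-wise cross correlation between a pooled
    support feature and the query feature.

    support_feat: (Hs, Ws, C) support feature map
    query_feat:   (H, W, C) query feature map
    returns:      (H, W, C) attention feature fed into the RPN
    """
    # Average-pool the support feature to a compact 1x1xC kernel.
    kernel = support_feat.mean(axis=(0, 1))   # shape (C,)
    # With a 1x1 kernel, depth-wise correlation is a per-channel scaling,
    # broadcast over the spatial dimensions H and W.
    return query_feat * kernel

rng = np.random.default_rng(0)
support = rng.standard_normal((20, 20, 256))
query = rng.standard_normal((38, 50, 256))
att = attention_feature(support, query)
print(att.shape)  # (38, 50, 256)
```

The attention feature keeps the query's spatial layout, so it can be fed directly into the standard RPN head.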

• In the patch-relation head, we first concatenate the support and query proposal feature maps in depth. The combined feature map is then fed into the patch-relation module, whose structure is shown in Table 2. All the convolution and pooling layers in this module have no padding, reducing the feature map from 7 × 7 to 1 × 1, which is used as the input for the binary classification and regression heads. This module is compact and efficient. We experimented a bit with the structure of the module and found that replacing the two average pooling layers with convolutions does not further improve the model.

• The global-relation head extends the patch relation to model the global-embedding relation between the support and the query proposals. Given the concatenated feature of a support and its query proposal, we average-pool the feature to a vector of size 1 × 1 × 2C. We then use an MLP with two fully connected (fc) layers followed by ReLU and a final fc layer to generate matching scores.
• The local-correlation head computes the pixel-wise and depth-wise similarity between the object ROI feature and the proposal feature, as in Eq. 1. Different from Eq. 1, we perform the dot product on feature pairs at the same depth. In particular, we first use a weight-shared 1×1×C convolution to process the support and query features individually, then calculate the depth-wise similarity feature of size 1 × 1 × C. Finally, a subsequent fc layer is used to generate matching scores.
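The global-relation and local-correlation matching computations can be sketched as follows in NumPy. This is an illustrative sketch only: the weight matrices are random placeholders standing in for learned layers, the feature depth C = 256 is our assumption, and the patch-relation head's conv/pool stack (Table 2) is omitted.

```python
import numpy as np

C = 256  # ROI feature depth (assumed)

def global_relation(support_roi, query_roi, w1, w2):
    """Concatenate in depth, average-pool to 1x1x2C, then a small MLP."""
    concat = np.concatenate([support_roi, query_roi], axis=-1)  # (7, 7, 2C)
    vec = concat.mean(axis=(0, 1))                              # (2C,)
    hidden = np.maximum(vec @ w1, 0.0)                          # fc + ReLU
    return hidden @ w2                                          # matching score

def local_correlation(support_roi, query_roi, w_shared, w_fc):
    """Weight-shared 1x1 conv, depth-wise dot product, then an fc layer."""
    s = support_roi @ w_shared                                  # (7, 7, C)
    q = query_roi @ w_shared                                    # (7, 7, C)
    # Dot product per channel: pixel-wise products summed over space,
    # kept separate along depth -> a 1x1xC similarity feature.
    sim = (s * q).sum(axis=(0, 1))                              # (C,)
    return sim @ w_fc                                           # matching score

rng = np.random.default_rng(1)
s_roi = rng.standard_normal((7, 7, C))
q_roi = rng.standard_normal((7, 7, C))
# Final matching score is the sum over the heads (patch-relation omitted).
score = (global_relation(s_roi, q_roi,
                         rng.standard_normal((2 * C, 64)),
                         rng.standard_normal(64))
         + local_correlation(s_roi, q_roi,
                             rng.standard_normal((C, C)),
                             rng.standard_normal(C)))
print(float(score))
```

In the real detector each head is trained end-to-end, and the three head scores are summed as described below.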
We only use the patch-relation head to generate bounding box predictions, i.e. regression on box coordinates, and use the sum of all matching scores from the three heads as the final matching score. Intra-class variance and imperfect proposals make the relation between proposals and support objects complex. Our three relation heads have different attributes and can handle this complexity well: the patch-relation head generates a flexible embedding that is able to match across intra-class variance, the global-relation head performs stable and general matching, and the local-correlation head requires matching on parts.

Training Details

The model is trained end-to-end on 4 Tesla P40 GPUs using SGD with a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.002 for the first 56,000 iterations and 0.0002 for the final 4,000 iterations. We take advantage of a pretrained model whose backbone, i.e. ResNet-50, is trained on [14, 9]. As our test set has no overlap with these datasets, it is safe to use the pretrained weights. During training, we find that more training iterations damage performance; we suppose that too many training iterations make the model over-fit the training set. We fix the Res1-3 blocks and only train the high-level layers, which utilizes low-level basic features and avoids over-fitting. The query image is resized so that its shorter edge is 600 pixels, with the longer edge capped at 1000 pixels. The support image is cropped around the target object with 16 pixels of image context, then resized and zero-padded to a square image of 320×320. For few-shot training and testing, we fuse features by averaging the object features and then feed them into the RPN attention module and the multi-relation detector.
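The support-image preprocessing above (crop with 16 pixels of context, resize, zero-pad to a 320×320 square) might be sketched as below. Function and variable names are our own, and nearest-neighbor resizing is used for simplicity; the paper does not specify the interpolation method.

```python
import numpy as np

def preprocess_support(image, box, context=16, out_size=320):
    """Crop the target box with `context` pixels of image context,
    resize so the longer side equals `out_size`, and zero-pad to a square.

    image: (H, W, 3) uint8 array
    box:   (x1, y1, x2, y2) target object box in pixel coordinates
    """
    H, W = image.shape[:2]
    x1, y1, x2, y2 = box
    # Expand the box by the context margin, clipped to the image bounds.
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(W, x2 + context), min(H, y2 + context)
    crop = image[y1:y2, x1:x2]

    # Resize so the longer edge equals out_size (nearest-neighbor here).
    h, w = crop.shape[:2]
    scale = out_size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = crop[ys][:, xs]

    # Zero-pad to an out_size x out_size square image.
    out = np.zeros((out_size, out_size, 3), dtype=image.dtype)
    out[:nh, :nw] = resized
    return out

img = np.full((480, 640, 3), 128, dtype=np.uint8)
sq = preprocess_support(img, (100, 120, 300, 260))
print(sq.shape)  # (320, 320, 3)
```

For K-shot settings, each support object would be preprocessed this way before its features are averaged.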
