Original note: https://www.yuque.com/lart/papers/lr11g7
Accepted at ICCV 2019.
The main goal is a semi-supervised learning strategy for video: through the network's design, spatiotemporal feature information is exploited more fully and effectively to improve performance.
Here, the ground-truth-annotated i-th and j-th frames are used to estimate a pseudo-label for the unannotated k-th frame lying between i and j.
This yields the warped ground truth, but as the figure shows, these maps are too noisy and still need refinement before they can serve as supervision.
The paper does not detail how the warp operation is implemented, but it is in fact a standard operation in optical-flow computation (used in the FlowNet series): a frame is warped with the estimated flow field to obtain an adjusted result at another time step. See reference link 1. The original RCRNet is reused in this part, with its input channels modified.
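For reference, below is a minimal sketch of flow-based backward warping built on PyTorch's `grid_sample`; the function name `flow_warp` and the normalization details are my assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """Backward-warp `frame` to another time step using a dense flow field.

    frame: (B, C, H, W) image or saliency map to warp.
    flow:  (B, 2, H, W) per-pixel displacement in pixels (dx, dy).
    """
    B, _, H, W = frame.shape
    # Build a base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
    grid = grid.unsqueeze(0) + flow  # shift each pixel by its flow vector
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=3)  # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```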
Note the following structural points:
- To preserve spatial detail, the downsampling in the fifth convolutional stage is removed
- To keep the receptive field unchanged, dilated convolutions with a dilation rate of 2 are used in the fifth convolutional stage
- An ASPP module is attached after the fifth convolutional stage to capture image-level global context and multi-scale spatial context
- The feature extractor finally outputs 256 channels at 1/16 of the input resolution, i.e., output stride OS = 16
- Residual connections link low-level and high-level features while adjusting the channel counts; N is 96 throughout, providing the refinement blocks with more spatial information
- Each refinement block first concatenates its two input feature maps and then feeds them to a subsequent 3x3, 128-channel convolutional layer; bilinear upsampling is also used here (see the sketch after this list)
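A minimal sketch of what such a refinement block might look like in PyTorch. The channel sizes follow the text (N = 96 low-level channels, a 3x3 conv producing 128 channels); the class name and exact layer ordering are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementBlock(nn.Module):
    # Hypothetical sketch: concatenate a low-level and a high-level feature
    # map, fuse them with a 3x3 conv producing 128 channels.
    def __init__(self, low_ch=96, high_ch=128, out_ch=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(low_ch + high_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        # Bilinearly upsample the high-level map to the low-level
        # map's resolution before fusing.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.fuse(torch.cat([low, high], dim=1))
```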
By inserting non-local blocks and a ConvGRU structure into RCRNet, the model is given sufficient capacity to exploit spatiotemporal information, improving the spatiotemporal coherence of the high-level features.
To apply the non-local block, T frames are taken as input and their spatial features are extracted individually; the T feature maps are then concatenated and fed into the non-local block, which computes the response at each position as a weighted sum of the features at all positions of the input. This builds up spatiotemporal correlations across the input video frames.
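The paper does not spell out the non-local formulation, so below is a sketch of a standard embedded-Gaussian non-local block (in the style of Wang et al., 2018) for a 2D feature map; concatenating the T frames' features, as the paper does, simply enlarges the set of positions attended over. All names here are illustrative:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    # Minimal embedded-Gaussian non-local block sketch: the response at each
    # position is a weighted sum over the features at all positions.
    def __init__(self, in_ch=256):
        super().__init__()
        inner = in_ch // 2
        self.theta = nn.Conv2d(in_ch, inner, 1)
        self.phi = nn.Conv2d(in_ch, inner, 1)
        self.g = nn.Conv2d(in_ch, inner, 1)
        self.out = nn.Conv2d(inner, in_ch, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return x + self.out(y)                         # residual connection
```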
The main structure of RCRNet is unchanged. The non-local block used is not described in detail; the focus is on the proposed DB-ConvGRU structure.
Since a video sequence contains a series of scenes in temporal order, the sequential evolution of appearance contrast also needs to be modeled in the temporal domain. A ConvGRU module is used here to model the evolution of sequential features; it replaces the fully connected operations of a traditional GRU with convolutions. The basic ConvGRU operations are as follows (reconstructed here as the standard ConvGRU update), where $*$ denotes convolution, $\circ$ the Hadamard product, $\sigma$ the sigmoid function, and $W$, $U$ learnable weights; bias terms are omitted for brevity:

$$
\begin{aligned}
z_t &= \sigma(W_z * X_t + U_z * H_{t-1}) \\
r_t &= \sigma(W_r * X_t + U_r * H_{t-1}) \\
\tilde{H}_t &= \tanh(W * X_t + U * (r_t \circ H_{t-1})) \\
H_t &= (1 - z_t) \circ H_{t-1} + z_t \circ \tilde{H}_t
\end{aligned}
$$
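A hypothetical PyTorch sketch of a ConvGRU cell implementing the update above (biases are kept here even though the formulas omit them):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    # Convolutional GRU cell: gates z and r are computed jointly, then the
    # candidate state uses the reset-gated previous hidden state.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h_prev):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h_prev], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.h(torch.cat([x, r * h_prev], 1)))
        return (1 - z) * h_prev + z * h_tilde
```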
Inspired by [Pyramid dilated deeper ConvLSTM for video salient object detection], two ConvGRU modules are stacked here, one processing the forward direction and one the backward direction, to further strengthen the exchange of spatiotemporal information in both directions. In this way the deeper bidirectional ConvGRU can memorize both past and future information.
H denotes the final output of the DB-ConvGRU, and X denotes the output features of the non-local block.
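Assuming the hidden state has the same channel count as the input features (so `zeros_like` works as an initial state), a hypothetical sketch of the deeper-bidirectional arrangement, reusing the `ConvGRUCell` above; as in the DB-ConvLSTM that inspired it, the backward cell consumes the forward cell's outputs:

```python
import torch

def db_convgru(fwd_cell, bwd_cell, xs):
    # xs: list of T feature maps (B, C, H, W), e.g., the non-local outputs X.
    # Forward pass: accumulate past context.
    h = torch.zeros_like(xs[0])
    fwd = []
    for x in xs:
        h = fwd_cell(x, h)
        fwd.append(h)
    # Backward pass over the forward outputs: adds future context, so each
    # output H_t has seen both past and future frames.
    h = torch.zeros_like(xs[0])
    out = [None] * len(xs)
    for t in reversed(range(len(xs))):
        h = bwd_cell(fwd[t], h)
        out[t] = h
    return out
```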
Inspired by [Non-local neural networks], where more non-local blocks generally yield better results, another non-local block is added after the ConvGRU module to further enhance spatiotemporal coherence.
It is worth noting that our proposed method uses only approximately 20% of the ground truth maps in the training process to outperform the best-performing fully supervised video-based method (PDB), even though both models are based on the same backbone network (ResNet-50).
The proportion of ground-truth annotations used is evaluated here.
To demonstrate the effectiveness of our proposed semi-supervised framework, we explore the sensitivity to different amounts of GT and pseudo-label usage on the VOS dataset.
By repeating the above experiment with different fixed intervals, we show the performance of RCRNet+NER trained with different numbers of GT labels in Fig. 7.
As shown in the figure, when the number of GT labels is severely insufficient (e.g., 5% of the original training set), RCRNet+NER can benefit substantially from the increase in GT label usage. An interesting phenomenon is that when the training set is large enough, applying denser label data does not necessarily lead to better performance. Considering that adjacent densely annotated frames share small differences, ambiguity is usually inevitable during the manual labeling procedure, which may lead to overfitting and hurt the generalization performance of the model. Then, we further use the proposed FGPLG to generate different numbers of pseudo-labels under different numbers of GT labels.
Some representative quantitative results are shown in Table 2, where we find that when there are insufficient GT labels, adding an appropriate number of generated pseudo-labels for training can effectively improve the performance.
Furthermore, when we use 20% of annotations and 20% of pseudo-labels (column '1/5' in the table) to train RCRNet+NER, it reaches maxF = 0.861 and S-measure = 0.874 on the test set of VOS, surpassing the model trained with all GT labels. Even when trained with 5% of annotations and 35% of pseudo-labels (column '7/20' in the table), our model can produce comparable results. This interesting phenomenon demonstrates that pseudo-labels can overcome labeling ambiguity to some extent. Moreover, it indicates that, given the redundancy between adjacent frames, it is not necessary to densely annotate all video frames manually.
Under the premise of the same labeling effort, choosing a sparse labeling strategy that covers more varied video content, assisted by generated pseudo-labels during training, brings a greater performance gain.
Table 5 shows performance on unsupervised video object segmentation, since that task is the one most closely related to semi-supervised salient object detection.
Semi-supervised video object segmentation aims at tracking a target mask, given in the first annotated frame, through the subsequent frames, while unsupervised video object segmentation aims at detecting the primary objects throughout the whole video sequence automatically. It should be noted that "supervised" or "semi-supervised" for the video segmentation methods mentioned here refers to the test phase; the training process of both tasks is fully supervised. The semi-supervised video salient object detection considered in this paper instead aims at reducing the labeling dependence of training samples during the training process. Unsupervised video object segmentation is the task most related to ours, as both require no annotations during the inference phase.
Unsupervised video object segmentation aims at automatically separating primary objects from input video sequences. As described, its problem setting is quite similar to video salient object detection, except that it seeks to perform a binary classification instead of computing a saliency probability for each pixel.
In the table, J denotes the Jaccard index (intersection over union).
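For reference, a minimal NumPy sketch of how J (region similarity) is computed for a pair of binary masks; the function name and edge-case handling are my choices:

```python
import numpy as np

def jaccard(pred, gt):
    # J (region similarity): intersection-over-union of two binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```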