1. First author: Shihao Jiang
2. Year: 2021
3. Venue: ICCV
4. Keywords: optical flow, cost volume, occluded regions, global aggregation, GRU
5. Motivation: occlusion is one of the biggest challenges in optical flow.
Defining occlusion: We first define what we mean by occlusion in the context of optical flow estimation. In this paper, an occluded point is defined as a 3D point that is imaged in the reference frame but is not visible in the matching frame. This definition incorporates several different scenarios, such as the query point moving out-of-frame or behind another object (or itself), or another object moving in front of the query point, in the active sense. One particular case of occlusion is shown in Figure 1, where part of the blade moves out-of-frame.
Occlusion violates the brightness constancy constraint: Traditional optical flow algorithms apply the brightness constancy constraint, where pixels related by the flow field are assumed to have the same intensities. It is clear that occlusions are a direct violation of such a constraint.
The correlation volume gives occluded regions no useful learning signal: In the deep learning era, correlation (cost) volumes are used to give a matching cost for each potential displacement of a pixel. However, correlations of appearance features are unable to give meaningful guidance for learning the motion of occluded regions.
Existing learning methods struggle to recover heavily occluded regions: Most existing approaches use smoothness terms in an MRF to interpolate occluded motions or use CNNs to directly learn the neighbouring relationships, hoping to learn to estimate occluded motions based on the neighbouring pixels. However, state-of-the-art methods still fail to estimate occluded motions correctly when occlusions are more significant and local evidence is insufficient to resolve the ambiguity.
6. Goal: apply a reasonable motion model to accurately estimate occluded motions.
Let us consider how to estimate these hidden motions for the two-frame case. When direct (local) matching information is absent, the motion information has to be propagated from other pixels. Using convolutions to propagate this information has the drawback of limited range since convolution is a local operation. We propose to aggregate the motion features with a non-local approach. Our design is based on the assumption that the motions of a single object (in the foreground or background) are often homogeneous. One source of information that is overlooked by existing works is self-similarities in the reference frame. For each pixel, understanding which other pixels are related to it, or which object it belongs to, is an important cue for accurate optical flow predictions. That is, the motion information of non-occluded self-similar points can be propagated to the occluded points.
7. Core idea: propose the Global Motion Aggregation (GMA) module, which can be added to RAFT.
- We show that long-range connections, implemented using the attention mechanism of transformer networks, are highly beneficial for optical flow estimation, particularly for resolving the motion of occluded pixels where local information is insufficient.
- We show that self-similarities in the reference frame provide an important cue for selecting the long-range connections to prioritise.
- We demonstrate that our global motion feature aggregation strategy leads to a significant improvement in optical flow accuracy in occluded regions, without damaging the performance in non-occluded regions, and analyse this extensively.
8. Results: SOTA
We improve the average end-point error (EPE) by 13.6% (2.86 → 2.47) on Sintel Final and 13.7% (1.61→1.39) on Sintel Clean, compared to the strong baseline of RAFT. Our approach ranks first on both datasets at the time of submission.
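As a quick sanity check of the quoted percentages, the relative EPE reduction is simply (baseline − new) / baseline; a minimal sketch (the function name is illustrative):

```python
def relative_improvement(baseline_epe, new_epe):
    """Relative reduction in average end-point error (EPE), in percent."""
    return 100.0 * (baseline_epe - new_epe) / baseline_epe

print(round(relative_improvement(2.86, 2.47), 1))  # Sintel Final: 13.6
print(round(relative_improvement(1.61, 1.39), 1))  # Sintel Clean: 13.7
```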
9. Paper & code:
https://openaccess.thecvf.com/content/ICCV2021/papers/Jiang_Learning_To_Estimate_Hidden_Motions_With_Global_Motion_Aggregation_ICCV_2021_paper.pdf
https://github.com/zacjiang/GMA
The network design is based on RAFT; the overall architecture is shown in the figure below. The main contributions of RAFT are:
The Global Motion Aggregation (GMA) module, contained in the shaded box, is a self-contained addition to RAFT with low computational overhead that significantly improves performance. It takes the visual context features and the motion features as input, and outputs aggregated motion features that share information across the whole image. The aggregated global motion features are then concatenated with the local motion features and the visual context features, and decoded by the GRU into a residual flow. This gives the network the flexibility to select or combine the local and global motion features as needed at each pixel location. For example, locations with poor local image evidence caused by occlusion can preferentially use the global motion features.
Geoffrey Hinton wrote in his first paper, in 1976, that "local ambiguities have to be resolved by finding the best global interpretation". This view still holds in the modern deep learning era. To resolve the ambiguities caused by occlusion, our key idea is to let the network reason at a higher level: it globally aggregates the motion features of similar pixels, implicitly reasoning about which pixels are similar in appearance feature space. We hypothesise that the network can find points with similar motion by looking for points with similar appearance in the reference frame. This follows from the observation that the motions of points on a single object are often homogeneous. For example, the motion vectors of a person running to the right are biased to the right, and this holds even if, due to occlusion, most of the person is not visible in the matching frame. We can use this statistical bias to propagate motion information from non-occluded pixels with high (implicit) confidence to occluded pixels with low confidence. Confidence here can be interpreted as whether a clear match exists, i.e. whether there is a high correlation value at the correct displacement.
With these ideas in mind, we draw inspiration from transformer networks, known for their ability to model long-range dependencies. Unlike the self-attention mechanism in transformers, where the queries, keys and values come from the same feature vectors, we use a generalised variant of attention.
The context feature map is first projected into a query feature map and a key feature map, which model the appearance self-similarity of the first frame. Taking the dot product of the two feature maps followed by a softmax yields the attention matrix, which encodes self-similarity in appearance feature space. As in transformer networks, the attention matrix can also be augmented with positional information by taking the dot product of the query feature map with a set of positional embedding vectors. Separately, the motion feature map, itself an encoding of the 4D correlation volume, is projected with a learned value projection. Its weighted sum under the attention matrix just computed produces the aggregated global motion features (GMA). The aggregated motion features are concatenated with the local motion features and the context features, and decoded by the GRU. The detailed GMA architecture is shown in the figure below.
Let x ∈ ℝ^(N×D_c) denote the context (appearance) features and y ∈ ℝ^(N×D_m) the motion features, where N = HW, H and W are the height and width of the feature map, and D denotes a channel dimension. The i-th feature vector is written x_i ∈ ℝ^(D_c). The GMA module updates each feature vector by computing an attention-weighted sum of the projected motion features. The aggregated motion features are given by

ŷ_i = y_i + α Σ_{j=1..N} f(θ(x_i), φ(x_j)) σ(y_j)
where α is a learned scalar parameter initialised to zero, θ, φ and σ are the projection functions for the query, key and value vectors, and f is the similarity attention function, given by

f(a_i, b_j) = exp(a_iᵀ b_j / √D_in) / Σ_{k=1..N} exp(a_iᵀ b_k / √D_in)
The projection functions for the query, key and value vectors are given by

θ(x_i) = W_qry x_i,   φ(x_j) = W_key x_j,   σ(y_j) = W_val y_j
where W_qry, W_key ∈ ℝ^(D_in×D_c) and W_val ∈ ℝ^(D_m×D_m). The learnable parameters in the GMA module are W_qry, W_key, W_val and α.
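As a concrete illustration, here is a minimal NumPy sketch of this attention-weighted aggregation (projections, scaled dot-product softmax, and the α-gated residual update). The function and variable names are my own, the sketch omits batching and the positional-embedding variants, and it is not the authors' implementation (see the linked repository for that):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gma_aggregate(x, y, W_qry, W_key, W_val, alpha):
    """Attention-weighted aggregation of motion features.

    x: (N, Dc) context features, y: (N, Dm) motion features, N = H*W.
    Queries and keys both come from the context features x (self-similarity);
    values come from the motion features y.
    """
    q = x @ W_qry.T                                 # (N, Din) query vectors
    k = x @ W_key.T                                 # (N, Din) key vectors
    v = y @ W_val.T                                 # (N, Dm)  value vectors
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) appearance self-similarity
    return y + alpha * (attn @ v)                   # alpha: learned scalar, init 0
```

Because α is initialised to zero, the module starts as an identity mapping on the motion features and only gradually learns how much global information to mix in.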
The final output is [y | ŷ | x], the concatenation of the three feature maps, which the GRU decodes to obtain the residual flow. The concatenation lets the network choose or combine motion vectors judiciously, modulated by the global context features, without prescribing exactly how to do so. The network appears to learn to encode some notion of uncertainty, decoding the aggregated motion vectors only when the flow cannot be determined from local information.
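The channel-wise concatenation can be sketched in NumPy (the shapes here are illustrative toy values, not the paper's):

```python
import numpy as np

# Illustrative shapes: N = H*W flattened pixels.
N, Dm, Dc = 6, 4, 5
y     = np.zeros((N, Dm))   # local motion features (encoding of the correlation volume)
y_hat = np.zeros((N, Dm))   # globally aggregated motion features from GMA
x     = np.zeros((N, Dc))   # context (appearance) features

# [y | y_hat | x]: the GRU decodes this concatenation into a residual flow.
gru_input = np.concatenate([y, y_hat, x], axis=-1)
assert gru_input.shape == (N, 2 * Dm + Dc)
```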
We also explored the use of 2D relative positional embeddings, letting the attention map depend on both feature self-similarity and the relative position from the query point. To do this, we compute the aggregated motion vector as

ŷ_i = y_i + α Σ_j f(θ(x_i), φ(x_j) + p_{j−i}) σ(y_j)    (6)
where p_{j−i} denotes the relative positional embedding vector indexed by the pixel offset j−i. Embedding vectors are learned separately for vertical and horizontal offsets and summed to obtain p_{j−i}. If it is useful to suppress pixels that are very close to or very far from the query point when aggregating motion vectors, the positional embeddings have the capacity to learn such behaviour.
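A sketch of how such a table of relative embeddings can be built, with one learned vector per vertical offset and per horizontal offset, summed per pixel pair (function name and random initialisation are mine; in the actual module the embeddings would be learned parameters):

```python
import numpy as np

def relative_position_table(H, W, D, rng):
    """Build p_{j-i} for every pixel pair on an H x W grid by summing
    separately learned vertical and horizontal offset embeddings."""
    # One embedding per possible offset along each axis: -(H-1)..(H-1), -(W-1)..(W-1).
    emb_v = rng.normal(size=(2 * H - 1, D))
    emb_h = rng.normal(size=(2 * W - 1, D))
    coords = np.array([(r, c) for r in range(H) for c in range(W)])  # (N, 2)
    dv = coords[None, :, 0] - coords[:, None, 0]  # vertical offsets j - i
    dh = coords[None, :, 1] - coords[:, None, 1]  # horizontal offsets j - i
    # Shift offsets to non-negative indices and sum the two embeddings: (N, N, D).
    return emb_v[dv + (H - 1)] + emb_h[dh + (W - 1)]
```

Note that p_{j−i} depends only on the offset, so the table is translation-invariant: any two pixel pairs with the same offset share the same embedding.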
We also investigated computing the attention map from the query vector and the positional embedding vectors alone, without any notion of self-similarity. That is,

ŷ_i = y_i + α Σ_j f(θ(x_i), p_{j−i}) σ(y_j)    (7)
This can be viewed as learning long-range aggregation without reasoning about image content; such a scheme would exploit positional biases in the dataset. In the table below, the results of (6) and (7) are denoted Ours (+p) and Ours (only p), respectively.
4.1. Datasets
FlyingThings, Sintel, KITTI-2015, HD1K
4.2. Implementation
Implemented in PyTorch; the training schedule follows RAFT. For GMA, the channel dimensions are chosen as D_in = D_c = D_m = 128.
4.3. Benchmark results: SOTA
4.4. Memory, parameters, timing
4.5. Discussion
However, additive aggregation of this kind is only helpful when the flow field of the attended locations is approximately homogeneous. This does not hold exactly for general object and camera motions, where the flow fields may be far from homogeneous, even on the same rigid object. An example is an object that is directly in front of the camera and rotating about the optical axis, where the flow vectors are in opposite directions. To deal with such scenarios, one possible future work is to first transform the motion features based on the relative positions and perform aggregation afterwards.