【BEV】Cross-view Semantic Segmentation for Sensing Surroundings——View Parsing Network

Cross-view Semantic Segmentation for Sensing Surroundings

Cross-view semantic segmentation
View Parsing Network (VPN) framework

prior research

Although semantic segmentation networks can recognize the semantic content of static images, this is still far from sufficient for a robot to perceive an unknown environment and navigate it freely. One important reason is that the parsed first-view semantic masks remain at the pure image level and do not provide any spatial information about the surroundings.
In contrast, from a top-down-view semantic map we can infer the position coordinates and functional attributes of the surrounding regions and objects.

main issue

  1. We lack real-world annotations of top-down-view data. To mitigate this, we train the VPN in a 3D graphics environment and utilize a domain adaptation technique to transfer it to real-world data.
  2. We evaluate the model on both synthetic and real-world agents. To reduce the domain gap between the synthetic scenes and the real-world scenes, we transfer the models trained in the simulation environment to the real-world scenes through domain adaptation.
  3. The experimental results show that our model can effectively make use of the information from different views and modalities to understand spatial information.

In this work

In this work, we propose a novel framework with a View Parsing Network (VPN) for cross-view semantic segmentation, trained in simulation environments and then transferred to real-world environments.

In VPN, a view transformer module is designed to aggregate the information from multiple first-view observations taken at different angles and in different modalities. It outputs a top-down-view semantic map with the spatial layout of objects.

Our main contributions

(1) We introduce a novel task named cross-view semantic segmentation to help robots flexibly sense their surrounding environment.
(2) We propose a framework with a View Parsing Network which effectively learns and aggregates features across first-view observations with multiple angles and modalities.
(3) We further apply a domain adaptation technique to transfer our model so that it works on real-world data without any extra annotations.

II. RELATED WORK

A. Semantic Segmentation and Semantic Mapping
This part covers semantic segmentation and semantic mapping.

B. Layout estimation and view synthesis
(i.e. room layout estimation [10], free space estimation [11], and road layout estimation [12], [13]).
Most of the previous methods use annotations of the layout or geometric constraints for the estimation, while our proposed framework estimates the top-down-view map directly from the image, without the intermediate step of estimating the 3D structure of the scene.

C. Learning in Simulation Environments
Rather than working on the task of visual navigation directly, our work aims at parsing the top-down-view semantic map from the first-view observations. The resulting top-down-view map will further facilitate visual navigation.

III. CROSS-VIEW SEMANTIC SEGMENTATION

Given first-view observations as input, the algorithm must produce a top-down-view semantic map. The top-down-view semantic map is the map a camera would capture from some height in the top-down view, annotated with a semantic label for every pixel. The input first-view observations are a set of images with different modalities, captured by the robot's camera at N different angles (360/N degrees apart).

pipeline

In cross-view semantic segmentation, the top-down-view semantics are predicted from first-view real-world observations, and the input observations from multiple angles are fused. Note that the result in this figure is generated without training on real-world data.

We first sample N × M first-view observations from N angles and M modalities (here N = 6 and M = 2 in Fig. 2). CNN-based encoders extract N × M spatial feature maps from these first-view inputs. Then all of these feature maps are fed into the View Transformer Module (VTM).

The VTM transforms these view feature maps from the first-view space into the top-down-view feature space and fuses them into a final feature map that already contains sufficient spatial information. Finally, we decode it with a convolutional decoder to predict the top-down-view semantic map.

View Transformer Module

We design the View Transformer Module (VTM) to learn the dependencies across all spatial positions between the first-view feature map and the top-down-view feature map. The VTM does not change the shape of the input feature map, so it can be plugged into any existing encoder-decoder network architecture for classical semantic segmentation.

It consists of two parts: View Relation Module (VRM) and View Fusion Module (VFM).

VRM

The first-view feature map is first flattened while the channel dimension remains unchanged. Then we use a view relation module R to learn the relations between any two pixel positions in the flattened first-view feature map and the flattened top-down-view feature map.
$$t^{i} = R_i\left(f^{0}, f^{1}, \ldots, f^{HW-1}\right)$$
where $i, j \in [0, HW)$ index the top-down-view feature map $t \in \mathbb{R}^{HW \times C}$ and the first-view feature map $f \in \mathbb{R}^{HW \times C}$, respectively, along the flattened dimension.
$R_i$ models the relations between the $i^{th}$ pixel on the top-down-view feature map and every pixel on the first-view feature map.
Here, we simply use a multi-layer perceptron (MLP) as the view relation module R.
Afterwards, the top-down-view feature map is reshaped back to h × w × c.
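Based on the description above, here is a minimal PyTorch sketch of a view relation module: the feature map is flattened along its spatial dimensions, an MLP mixes all flattened positions so that every top-down position can depend on every first-view position, and the result is reshaped back to h × w × c. The layer sizes and two-layer MLP depth are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ViewRelationModule(nn.Module):
    """Sketch of a VRM: an MLP over flattened spatial positions that maps a
    first-view feature map (B, C, h, w) to a top-down-view feature map of the
    same shape. The MLP acts on the HW dimension, so the channel dimension
    and the overall feature-map shape are left unchanged."""

    def __init__(self, height, width, hidden=256):
        super().__init__()
        hw = height * width
        self.height, self.width = height, width
        # two-layer MLP over the flattened spatial dimension (channels untouched)
        self.relation = nn.Sequential(
            nn.Linear(hw, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hw),
        )

    def forward(self, first_view_feat):           # (B, C, h, w)
        b, c, h, w = first_view_feat.shape
        flat = first_view_feat.view(b, c, h * w)  # flatten spatial dims, keep channels
        top_down = self.relation(flat)            # each top-down position mixes all first-view positions
        return top_down.view(b, c, h, w)          # reshape back to h x w x c
```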

VFM

To aggregate the information from all observation inputs, we fuse these top-down-view feature maps $t_i$ with the VFM.
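The summary above does not spell out the fusion operator, so the sketch below simply averages the per-input top-down feature maps; a summation or a learned weighting would be a drop-in replacement. This is an assumption for illustration rather than the paper's exact choice.

```python
import torch

def view_fusion(top_down_feats):
    """Sketch of a VFM: fuse the per-input top-down-view feature maps t_i
    (one per view/modality, each of shape (B, C, h, w)) into a single map
    by an element-wise mean."""
    return torch.stack(top_down_feats, dim=0).mean(dim=0)  # (B, C, h, w)
```

With the VRM sketch from the previous subsection, the whole VTM amounts to applying one view relation module per first-view feature map and then fusing the results, after which the fused map goes to the convolutional decoder.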

Sim-to-real Adaptation

To generalize our VPN to real-world data without real-world ground truth, we implement the sim-to-real domain adaptation scheme shown in Fig. 2 to close this gap. The scheme consists of the following pixel-level adaptation and output-space adaptation.

Pixel-level adaptation. To mitigate the domain shift, we adopt the pixel-level adaptation on the real-world inputs to make them look more like the style of the simulation data.

The semantic mask is an ideal mid-level representation: it has no texture gap, still contains sufficient information, and is easy to transfer.
$$I_S = M_{Real \rightarrow Synthetic}\big(P_{RGB \rightarrow Mask}(I_R)\big)$$
where $I_R$ and $I_S$ are the real RGB image and the synthetic-style semantic mask, respectively,
$P_{RGB \rightarrow Mask}$ is an existing semantic segmentation model which parses the real-world RGB image into a semantic mask, and
$M_{Real \rightarrow Synthetic}$ is the semantic category mapping process in which we construct the concept mappings between the real world and the simulation environment.
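As a rough illustration of $M_{Real \rightarrow Synthetic}$, the sketch below remaps the label ids produced by an existing segmentation model ($P_{RGB \rightarrow Mask}$) onto the simulator's category ids via a lookup table. The specific ids in the table are hypothetical placeholders; the real concept mapping is constructed by hand between the two label sets.

```python
import numpy as np

# Hypothetical mapping from real-world label ids (as predicted by an existing
# segmentation model) to the simulator's category ids.
REAL_TO_SYNTHETIC = {0: 0, 3: 1, 5: 2}   # e.g. wall->wall, floor->floor, chair->chair
UNMAPPED = 255                            # ignore label for categories with no counterpart

def real_to_synthetic_mask(real_label_map: np.ndarray) -> np.ndarray:
    """M_{Real->Synthetic}: remap a per-pixel label map predicted on a real
    RGB image into the simulator's semantic categories."""
    out = np.full_like(real_label_map, UNMAPPED)
    for real_id, sim_id in REAL_TO_SYNTHETIC.items():
        out[real_label_map == real_id] = sim_id
    return out
```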

Output space adaptation. Beyond the pixel-level transfer on the input data, we also devise an adversarial training scheme in the structured output space.
Here, the generator G is the View Parsing Network that produces the top-down-view prediction P.
During the training phase:

  • We first forward a set of input images from the source domain {Is} through G with the normal segmentation loss Lseg.
  • Then we use G to extract the feature maps Ft (taken after the softmax layer) of images from the target domain {It}.
  • A discriminator is used to distinguish whether Ft comes from the source domain.
  • The loss function for optimizing G can be written as follows:

$$\mathcal{L}(I_S, I_T) = \mathcal{L}_{seg}(I_S) + \lambda_{adv}\,\mathcal{L}_{adv}(I_T)$$

where Lseg is the cross-entropy loss for semantic segmentation and Ladv is designed to train G to fool the discriminator D. The loss function of the discriminator, Ld, is the binary cross-entropy loss for source-vs-target classification.
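The following is a compact PyTorch sketch of the output-space adversarial objective described above: $\mathcal{L}_{seg}$ is a cross-entropy loss on source predictions and $\mathcal{L}_{adv}$ pushes target predictions to fool the discriminator D. The weighting $\lambda_{adv}$ and the shapes handled by D are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def generator_loss(G, D, src_images, src_labels, tgt_images, lambda_adv=0.001):
    """L(I_S, I_T) = L_seg(I_S) + lambda_adv * L_adv(I_T) for the view parsing
    network G; D classifies whether a softmax output map comes from the source domain."""
    # segmentation loss on source-domain (simulation) data
    src_pred = G(src_images)                      # (B, classes, h, w) logits
    loss_seg = F.cross_entropy(src_pred, src_labels)

    # adversarial loss: make target-domain outputs look source-like to D
    tgt_prob = F.softmax(G(tgt_images), dim=1)    # F_t, taken after the softmax layer
    tgt_logits = D(tgt_prob)
    loss_adv = bce(tgt_logits, torch.ones_like(tgt_logits))  # "pretend" target outputs are source

    return loss_seg + lambda_adv * loss_adv

def discriminator_loss(D, src_prob, tgt_prob):
    """L_d: binary cross-entropy for source-vs-target classification of output maps."""
    src_logits, tgt_logits = D(src_prob.detach()), D(tgt_prob.detach())
    return bce(src_logits, torch.ones_like(src_logits)) + \
           bce(tgt_logits, torch.zeros_like(tgt_logits))
```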

Architecture diagram

[Figure: overall VPN architecture]
