【BEV】Cross-view Semantic Segmentation for Sensing Surroundings——View Parsing Network

Cross-view Semantic Segmentation for Sensing Surroundings

Task: Cross-view Semantic Segmentation
Framework: View Parsing Network (VPN)

prior research


main issue

  1. We lack real-world annotations of top-down-view data. To mitigate this, we train the VPN in a 3D graphics environment and use domain adaptation to transfer it to real-world data.
  2. We evaluate the model on both synthetic and real-world agents. To reduce the domain gap between synthetic and real-world scenes, we transfer the models trained in the simulation environment to real-world scenes through domain adaptation.
  3. The experimental results show that our model can effectively use the information from different views and multiple modalities to understand spatial information.

In this work

In this work, we propose a novel framework built on the View Parsing Network (VPN) for cross-view semantic segmentation. The models are trained in simulation environments and then transferred to real-world environments.

In VPN, a view transformer module is designed to aggregate the information from multiple first-view observations with different angles and different modalities. It outputs the top-down-view semantic map with a spatial layout of objects.

Our main contributions

(1) We introduce a novel task named cross-view semantic segmentation to facilitate robots to flexibly sense the surrounding environment.
(2) We propose a framework with View Parsing Network which effectively learns and aggregates features across first-view observations with multiple angles and modalities.
(3) We further apply the domain adaptation technique to transfer our model so that it works on real-world data without any extra annotations.


A. Semantic Segmentation and Semantic Mapping

B. Layout estimation and view synthesis
(i.e. room layout estimation [10], free space estimation [11], and road layout estimation [12], [13]).
Most of the previous methods use annotations of the layout or geometric constraints for the estimation, while our proposed framework estimates the top-down-view map directly from the image, without the intermediate step of estimating the 3D structure of the scene.

C. Learning in Simulation Environments
Rather than working on the task of visual navigation directly, our work aims at parsing the top-down-view semantic map from the first-view observations. The resulting top-down-view map will further facilitate visual navigation.


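Given first-view observations as input, the algorithm must generate a top-down-view semantic map. The top-down-view semantic map is a map captured by a camera looking down from a certain height, annotated with a semantic label for every pixel. The input first-view observations are a set of images of different modalities. They are captured by the robot's camera at N different angles (360/N degrees apart).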


Top-down-view semantics is predicted from the first-view real-world observations in the cross-view semantic segmentation. Input observations from multiple angles are fused. Notice that the result in this figure is generated without training on real-world data.

We first sample $N \times M$ first-view observations from N angles and M modalities (here N = 6 and M = 2 in Figure 2). CNN-based encoders then extract $N \times M$ spatial feature maps from these first-view inputs. All of these feature maps are then fed into the View Transformer Module (VTM).
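To make the sampling concrete, here is a minimal sketch that enumerates the observations for N = 6 angles and M = 2 modalities. The function name `camera_yaws` and the modality names are assumptions for illustration; the notes only state the values of N and M.

```python
from itertools import product


def camera_yaws(n_views):
    """Return n_views yaw angles in degrees, spaced 360 / n_views apart."""
    if n_views <= 0:
        raise ValueError("n_views must be positive")
    step = 360.0 / n_views
    return [k * step for k in range(n_views)]


# Hypothetical modality names; the notes only state that M = 2.
MODALITIES = ["rgb", "depth"]

yaws = camera_yaws(6)
# One observation per (angle, modality) pair, N * M in total.
observations = list(product(yaws, MODALITIES))

print(yaws)
print(len(observations))
```

Each pair would be passed through its own encoder before the resulting feature maps reach the VTM.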


View Transformer Module

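We design the View Transformer Module (VTM) to learn the dependencies across all spatial locations between the first-view feature map and the top-down-view feature map. VTM does not change the shape of the input feature map, so it can be plugged into any existing encoder-decoder network architecture for classic semantic segmentation.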

It consists of two parts: View Relation Module (VRM) and View Fusion Module (VFM).


The first-view feature map is first flattened while the channel dimension remains unchanged. Then we use a view relation module R to learn the relations between any two pixel positions in the flattened first-view feature map and the flattened top-down-view feature map:

$$t[i] = R_i(f[0], \dots, f[j], \dots, f[HW-1])$$

where $i, j \in [0, HW)$ are the indices of the top-down-view feature map $t \in \mathbb{R}^{HW \times C}$ and the first-view feature map $f \in \mathbb{R}^{HW \times C}$, respectively, along the flattened dimension. $R_i$ models the relations between the $i$-th pixel on the top-down-view feature map and every pixel on the first-view feature map.
Here we simply use a multilayer perceptron (MLP) in the view relation module R.
Afterwards, the top-down-view feature map is reshaped back to $H \times W \times C$.
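The flatten, relate, and reshape steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-layer MLP, its hidden width of HW, the ReLU, and the random weights are all choices made here. The MLP acts along the flattened spatial dimension, so row i of the weights plays the role of R_i.

```python
import numpy as np


def view_relation(f, w1, b1, w2, b2):
    """Map a first-view feature map (H, W, C) to a top-down-view map of the same shape."""
    h, w, c = f.shape
    f_flat = f.reshape(h * w, c)  # flatten spatial dims, keep channels
    # The MLP mixes spatial positions; each channel is treated independently.
    hidden = np.maximum(w1 @ f_flat + b1[:, None], 0.0)  # ReLU, shape (HW, C)
    t_flat = w2 @ hidden + b2[:, None]  # t[i] depends on every f[j]
    return t_flat.reshape(h, w, c)  # reshape back to H x W x C


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
hw = H * W
# Randomly initialised weights stand in for learned parameters (assumption).
w1 = rng.normal(scale=0.1, size=(hw, hw))
b1 = np.zeros(hw)
w2 = rng.normal(scale=0.1, size=(hw, hw))
b2 = np.zeros(hw)

f = rng.normal(size=(H, W, C))
t = view_relation(f, w1, b1, w2, b2)
print(t.shape)
```

Because the weight matrices are dense over all HW positions, every output pixel can draw on every input pixel, which is what lets the module relate positions across the two views.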



Sim-to-real Adaptation


Pixel-level adaptation. To mitigate the domain shift, we adopt the pixel-level adaptation on the real-world inputs to make them look more like the style of the simulation data.

A semantic mask is an ideal mid-level representation: it has no texture gap, still carries sufficient information, and is easy to transfer.
$P_{RGB \to Mask}$ is an existing semantic segmentation model that parses real-world RGB images into semantic masks.
$M_{Real \to Synthetic}$ is the semantic category mapping process, in which we construct concept mappings between the real world and the simulation environment.
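The category mapping step can be sketched as a lookup table applied to a mask that a segmentation model has already produced. All label IDs and the `unknown` value below are hypothetical; the notes do not list the actual category sets.

```python
import numpy as np

# Hypothetical label IDs: real-world classes 1 and 2 (say, "car" and "truck")
# both map to synthetic class 2 (say, "vehicle").
REAL_TO_SYNTHETIC = {0: 0, 1: 2, 2: 2, 3: 1}


def map_categories(real_mask, mapping, unknown=255):
    """Relabel a uint8 mask of real-world class IDs with synthetic class IDs."""
    lut = np.full(256, unknown, dtype=np.uint8)  # unmapped classes become `unknown`
    for real_id, synthetic_id in mapping.items():
        lut[real_id] = synthetic_id
    return lut[real_mask]


# Stand-in for the output of the RGB-to-mask segmentation model.
real_mask = np.array([[0, 1], [2, 3]], dtype=np.uint8)
synthetic_mask = map_categories(real_mask, REAL_TO_SYNTHETIC)
print(synthetic_mask)
```

A lookup table keeps the mapping explicit and makes unmapped real-world classes easy to spot, since they all end up with the `unknown` label.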

Output space adaptation. Beyond the pixel-level transfer on input data, we also devise an adversarial training scheme in the structured output space.

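  • We first forward a set of input images from the source domain $\{I_s\}$ to G and optimize it with the normal segmentation loss $L_{seg}$.
  • Then we use G to extract the feature maps $F_t$ (after the softmax layer) of images from the target domain $\{I_t\}$.
  • A discriminator is used to distinguish whether $F_t$ comes from the source domain.
  • The loss function for optimizing G can be written as follows: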
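The formula itself is not reproduced in these notes. As a hedged sketch, the code below assumes the form commonly used for output-space adversarial adaptation: the source-domain segmentation loss plus a weighted adversarial term on target-domain outputs. The weight `lambda_adv`, its default value, and all function names are assumptions, not values from the source.

```python
import numpy as np


def segmentation_loss(probs, labels):
    """Mean pixel-wise cross-entropy; probs is (H, W, K) after softmax, labels is (H, W)."""
    h, w, _ = probs.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = probs[rows, cols, labels]  # probability of the true class per pixel
    return float(-np.mean(np.log(picked + 1e-12)))


def adversarial_loss(d_out_target):
    """Generator-side term: push the discriminator's 'source' probability on target outputs towards 1."""
    return float(-np.mean(np.log(d_out_target + 1e-12)))


def generator_loss(probs_src, labels_src, d_out_target, lambda_adv=0.001):
    """Assumed total loss for G: L_seg on source images + lambda_adv * L_adv on target images."""
    seg = segmentation_loss(probs_src, labels_src)
    adv = adversarial_loss(d_out_target)
    return seg + lambda_adv * adv


# Toy inputs: a 2x2 source prediction over 3 classes and discriminator scores on target outputs.
labels_src = np.array([[0, 1], [2, 0]])
probs_src = np.eye(3)[labels_src]  # perfect one-hot predictions
d_out_target = np.full((2, 2), 0.5)  # discriminator is undecided
print(generator_loss(probs_src, labels_src, d_out_target, lambda_adv=1.0))
```

In training, the discriminator would be updated in alternation with G using its own classification loss, which is omitted here.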




