Instead of generating proposals from RGB image or projecting point cloud to bird's view or voxels as previous methods do, our stage-1 sub-network directly generates a small number of high-quality 3D proposals from point cloud in a bottom-up manner via segmenting the point cloud of the whole scene into foreground points and background.
Stage 1 segments the point cloud into foreground and background, instead of generating proposals from RGB images, BEV projections, or voxels.
AVOD places 80-100k anchor boxes in the 3D space and pools features for each anchor in multiple views for generating proposals. F-PointNet generates 2D proposals from 2D images, and estimates 3D boxes based on the 3D points cropped from the 2D regions, which might miss difficult objects that could only be clearly observed from 3D space.
The author argues that AVOD and F-PointNet each have drawbacks. AVOD places 80-100k anchors in 3D space and pools multi-view features for each anchor to generate proposals (presumably memory-hungry); F-PointNet generates 2D proposals from images and estimates 3D boxes from the points cropped by those 2D regions, which may miss hard objects that can only be observed clearly in 3D space.
Instead, the author proposes extracting 3D proposals in stage 1 via whole-scene point cloud segmentation. Since objects in a 3D scene do not overlap, the points inside a 3D GT box directly provide the segmentation mask: they are the foreground points.
We learn point-wise features to segment the raw point cloud and to generate 3D proposals from the segmented foreground points simultaneously.
Our method avoids using a large set of predefined 3D boxes in the 3D space and significantly constrains the search space for 3D proposal generation.
Approach: segment the point cloud, and generate proposals from the segmented foreground points. This avoids using a large set of predefined 3D boxes and significantly constrains the search space for 3D proposal generation.
..., we utilize the PointNet++ with multi-scale grouping as our backbone network.
PointNet++ is used as the backbone to extract point cloud features.
By learning to segment the foreground points, the point-cloud network is forced to capture contextual information for making accurate point-wise prediction, which is also beneficial for 3D box generation.
Learning to segment forces the network to pay more attention to semantic/contextual information.
... we append one segmentation head for estimating the foreground mask and one box regression head for generating 3D proposals. For point segmentation, the ground-truth segmentation mask is naturally provided by the 3D ground-truth boxes. The number of foreground points is generally much smaller than that of the background points for a large-scale outdoor scene. Thus we use the focal loss to handle the class imbalance problem.
The features from PointNet++ go into two heads: a foreground segmentation head and a 3D proposal box regression head, as in Fig. 1. The segmentation GT comes directly from the 3D GT boxes. In large-scale outdoor scenes, foreground points are far fewer than background points, so focal loss is used for the class imbalance.
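A minimal sketch of the binary focal loss used here, assuming per-point sigmoid classification; α = 0.25 and γ = 2 are the focal loss paper's defaults, and the function name and shapes are my own:

```python
import torch

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    # Binary focal loss for foreground/background point segmentation.
    # logits: (N,) raw scores per point; labels: (N,) in {0, 1},
    # 1 = foreground (point inside a GT box).
    probs = torch.sigmoid(logits)
    # p_t: probability the model assigns to each point's true class
    p_t = torch.where(labels > 0.5, probs, 1.0 - probs)
    alpha_t = torch.where(labels > 0.5,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1.0 - alpha))
    # (1 - p_t)^gamma down-weights the many easy background points
    return (-alpha_t * (1.0 - p_t) ** gamma
            * torch.log(p_t.clamp(min=1e-8))).mean()
```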
During training, we only require the box regression head to regress 3D bounding box locations from foreground points. Note that although boxes are not regressed from the background points, those points also provide supporting information for generating boxes because of the receptive field of the point-cloud network.
During training, only foreground points regress 3D bboxes. Background points are not used for regression, but they still contribute indirectly to box generation because the backbone's receptive field covers them.
For estimating center location of an object, as shown in Fig. 3 of the paper (Fig. 2 in this note), we split the surrounding area of each foreground point into a series of discrete bins along the X and Z axes. Specifically, we set a search range S for each X and Z axis of the current foreground point, and each 1D search range is divided into bins of uniform length δ to represent different object centers (x, z) on the X-Z plane.
(I don't fully understand this part.) To estimate the object center, a search range S is set along the X and Z axes of the current foreground point; each 1D search range is divided into bins of uniform length δ, and each bin represents a candidate center coordinate (x, z) on the X-Z plane.
The localization loss for the X or Z axis consists of two terms, one term for bin classification along each X and Z axis, and the other term for residual regression within the classified bin.
So the x and z localization losses each combine bin classification with within-bin residual regression, while y is regressed directly with smooth L1 loss (the vertical coordinate varies over a small enough range that direct regression suffices).
The localization targets (Eq. (1) in the paper):

$$\text{bin}_x^{(p)} = \left\lfloor \frac{x^p - x^{(p)} + \mathcal{S}}{\delta} \right\rfloor,\quad \text{bin}_z^{(p)} = \left\lfloor \frac{z^p - z^{(p)} + \mathcal{S}}{\delta} \right\rfloor$$

$$\text{res}_u^{(p)} = \frac{1}{\mathcal{C}} \left( u^p - u^{(p)} + \mathcal{S} - \left( \text{bin}_u^{(p)} \cdot \delta + \frac{\delta}{2} \right) \right),\quad u \in \{x, z\}$$

$$\text{res}_y^{(p)} = y^p - y^{(p)}$$

The superscript (p) marks the current foreground point, and the plain superscript p marks the center of the object it belongs to; bin is the ground-truth bin index, res is the residual within that bin, C is a normalization constant, S is the search range, and δ is the bin length.
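A small sketch of how these targets could be computed; the function name and dict keys are mine, and S = 3.0 m / δ = 0.5 m are illustrative values, not necessarily the paper's exact settings:

```python
import numpy as np

def encode_center_targets(pts, ctrs, S=3.0, delta=0.5):
    # Bin/residual targets from Eq. (1). pts: (N, 3) foreground points;
    # ctrs: (N, 3) center of the GT box each point belongs to.
    # C = delta / 2 normalizes residuals to roughly [-1, 1].
    C = delta / 2.0
    t = {}
    for axis, name in ((0, "x"), (2, "z")):
        offset = ctrs[:, axis] - pts[:, axis] + S      # shift into [0, 2S]
        t["bin_" + name] = np.floor(offset / delta).astype(np.int64)
        # residual of the true center from its bin's center, normalized
        t["res_" + name] = (offset - (t["bin_" + name] * delta + delta / 2.0)) / C
    t["res_y"] = ctrs[:, 1] - pts[:, 1]                # y: direct regression
    return t
```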
Orientation and size prediction follows F-PointNet. 2π is divided into n bins, and the network predicts which bin the orientation falls into (classification) plus the residual regression within that bin, just as for x and z. The object size (h, w, l) is regressed directly.
At inference, the bin-based parameters x, z, θ are decoded by first picking the bin with the highest score and then adding the predicted residual. The directly regressed parameters y, h, w, l are decoded by adding the predicted residual to the initial value.
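A sketch of this decoding step for one bin-based parameter (helper name and default values are mine):

```python
import numpy as np

def decode_binned(base, bin_logits, res_pred, S=3.0, delta=0.5):
    # Decode one bin-based parameter (x or z; theta works the same with
    # its own range/bin size): pick the highest-scoring bin, take that
    # bin's center, then add the de-normalized predicted residual.
    best = int(np.argmax(bin_logits))
    bin_center = best * delta + delta / 2.0 - S        # undo the +S shift
    return base + bin_center + res_pred[best] * (delta / 2.0)

# Directly regressed parameters (y, h, w, l) are simply
# initial_value + predicted_residual.
```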
The overall regression loss (Eq. (2) in the paper):

$$\mathcal{L}_{\text{bin}}^{(p)} = \sum_{u \in \{x, z, \theta\}} \left( \mathcal{F}_{\text{cls}}(\widehat{\text{bin}}_u^{(p)}, \text{bin}_u^{(p)}) + \mathcal{F}_{\text{reg}}(\widehat{\text{res}}_u^{(p)}, \text{res}_u^{(p)}) \right)$$

$$\mathcal{L}_{\text{res}}^{(p)} = \sum_{v \in \{y, h, w, l\}} \mathcal{F}_{\text{reg}}(\widehat{\text{res}}_v^{(p)}, \text{res}_v^{(p)})$$

$$\mathcal{L}_{\text{reg}} = \frac{1}{N_{\text{pos}}} \sum_{p \in \text{pos}} \left( \mathcal{L}_{\text{bin}}^{(p)} + \mathcal{L}_{\text{res}}^{(p)} \right)$$

N_pos is the number of foreground points; the hatted bin and res are the predicted bin classification and residuals, while the unhatted ones are the GT targets computed with Eq. (1). F_cls is cross-entropy and F_reg is smooth L1 loss.
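How this loss could be assembled, assuming residual predictions have already been gathered at the GT bins (dict keys and shapes follow my convention from the encoding sketch above):

```python
import torch
import torch.nn.functional as F

def stage1_reg_loss(pred, tgt, fg_mask):
    # Stage-1 regression loss over foreground points only. Bin
    # predictions are (N, n_bins) logits; residual predictions are (N,).
    # Mean reduction of the F losses supplies the 1/N_pos averaging.
    loss = 0.0
    for u in ("x", "z", "theta"):                      # bin-based params
        loss = loss + F.cross_entropy(pred["bin_" + u][fg_mask],
                                      tgt["bin_" + u][fg_mask])
        loss = loss + F.smooth_l1_loss(pred["res_" + u][fg_mask],
                                       tgt["res_" + u][fg_mask])
    for v in ("y", "h", "w", "l"):                     # direct regression
        loss = loss + F.smooth_l1_loss(pred["res_" + v][fg_mask],
                                       tgt["res_" + v][fg_mask])
    return loss
```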
To remove the redundant proposals, we conduct NMS based on the oriented IoU from BEV to generate a small number of high-quality proposals.
NMS is done in BEV with oriented IoU. During training, the threshold is 0.85 and the top 300 proposals are kept for training stage 2; at inference, the threshold is 0.8 and the top 100 are kept for stage-2 refinement.
To learn more specific local features of each proposal, we propose to pool 3D points and their corresponding point features from stage-1 according to the location of each 3D proposal.
Pooling is used to gather local information per proposal.
Each 3D proposal is first slightly enlarged; then every point is tested for being inside or outside the enlarged box. Points inside are kept for refining the original (un-enlarged) proposal. Each kept point carries its 3D coordinates, laser reflection intensity, predicted segmentation mask, and the C-dimensional feature learned by the backbone; the segmentation mask tells foreground from background. Proposals with no interior points are discarded at this step.
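A sketch of the pooling test, assuming boxes are (cx, cy, cz, h, w, l, ry) with the center taken as the geometric center and ry the heading around the vertical axis; the 0.2 m margin is an assumed enlargement, not necessarily the paper's value:

```python
import numpy as np

def points_in_enlarged_box(pts, box, margin=0.2):
    # pts: (N, 3) in LiDAR coords. Rotate points by -ry in the ground
    # (X-Z) plane so the test becomes axis-aligned against the box
    # enlarged by `margin` on every side.
    cx, cy, cz, h, w, l, ry = box
    dx, dy, dz = pts[:, 0] - cx, pts[:, 1] - cy, pts[:, 2] - cz
    c, s = np.cos(ry), np.sin(ry)
    x_local = c * dx + s * dz                          # rotate by -ry
    z_local = -s * dx + c * dz
    return ((np.abs(x_local) <= l / 2 + margin) &
            (np.abs(z_local) <= w / 2 + margin) &
            (np.abs(dy) <= h / 2 + margin))
```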
To take advantage of our high-recall box proposals from stage-1 and to estimate only the residuals of the box parameters of proposals, we transform the pooled points belonging to each proposal to the canonical coordinate system of the corresponding 3D proposal.
(I don't understand why this yields better features.) The canonical coordinate system is defined as:
1. The origin is at the center of the box proposal;
2. The local X′ and Z′ axes are roughly parallel to the ground plane, with X′ pointing along the heading direction of the proposal and Z′ perpendicular to X′;
3. The Y′ axis stays the same as in the LiDAR coordinate system.
All pooled points are transformed into this canonical system by rotation and translation, which lets the box refinement stage learn better local spatial features.
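The transform itself, under the same box convention as the pooling sketch above (a sketch, not the paper's code):

```python
import numpy as np

def to_canonical(pts, box):
    # Canonical transform for one proposal: translate so the box center
    # is the origin, then rotate by -ry so X' points along the heading.
    # Same ground-plane rotation as in the pooling test; Y' keeps the
    # LiDAR convention.
    cx, cy, cz, _, _, _, ry = box
    shifted = pts[:, :3] - np.array([cx, cy, cz])
    c, s = np.cos(ry), np.sin(ry)
    canonical = shifted.copy()
    canonical[:, 0] = c * shifted[:, 0] + s * shifted[:, 2]
    canonical[:, 2] = -s * shifted[:, 0] + c * shifted[:, 2]
    return canonical
```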
Stage 2 thus uses both the canonically transformed local spatial point features and the global semantic features from stage 1.
The canonical transformation makes the learned local spatial features more robust, but it discards depth information: for example, far-away objects generally contain fewer points than nearby ones. To compensate, the author appends the distance-to-sensor d(p) to the point features.
For each proposal, the local features (canonically transformed coordinates) are concatenated with the reflection intensity, segmentation mask, and distance, passed through fully connected layers to match the dimension of the global features, then concatenated with the global features and fed into a PointNet++-like network whose outputs are used for the losses.
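A rough sketch of this merge in PyTorch; the 6-dim local input (canonical xyz + intensity + seg mask + distance) and the 128-dim global size are assumptions:

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    # Per-point local features are lifted by FC layers to the global
    # feature dimension, then concatenated with the stage-1 global
    # features before the PointNet++-like refinement network.
    def __init__(self, c_global=128):
        super().__init__()
        self.lift = nn.Sequential(nn.Linear(6, c_global), nn.ReLU(),
                                  nn.Linear(c_global, c_global), nn.ReLU())

    def forward(self, local_feats, global_feats):
        # local_feats: (N, 6); global_feats: (N, c_global) from stage 1
        return torch.cat([self.lift(local_feats), global_feats], dim=1)
```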
When the 3D IoU between a GT box and a proposal exceeds 0.55, the GT is assigned as the refinement target; both the GT and the predicted box are transformed into the canonical coordinate system, and smooth L1 loss is computed on the regression targets.
While refining the locations of the 3D proposals, a smaller search range S is used than in stage 1, since the proposals are already close to the GT boxes.
For refining the orientation, the angle difference is assumed to lie in [-π/4, π/4], because the 3D IoU between proposal and GT is at least 0.55 (my reading: a typical elongated box rotated by more than 45° against its GT could not keep that much overlap). π/2 is then divided into bins of size ω, and the orientation is predicted bin-wise, just like x and z.
The second res formula (Eq. (3) in the paper) took me a while:

$$\text{bin}_\theta^{(i)} = \left\lfloor \frac{\theta^{gt}_i - \theta_i + \frac{\pi}{4}}{\omega} \right\rfloor,\quad \text{res}_\theta^{(i)} = \frac{2}{\omega} \left( \theta^{gt}_i - \theta_i + \frac{\pi}{4} - \left( \text{bin}_\theta^{(i)} \cdot \omega + \frac{\omega}{2} \right) \right)$$

Unpacking it: the angle offset is shifted by π/4 so it lands in [0, π/2], the bin index comes from integer division by ω, and the residual is the offset from the bin center, normalized by ω/2 so it falls in [-1, 1].
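A sketch of the corresponding target encoding; ω = π/36 (5°) is an assumed bin size, and the function name is mine:

```python
import numpy as np

def encode_refine_angle(theta_gt, theta_prop, omega=np.pi / 36):
    # Orientation targets for refinement (Eq. (3)): the GT-minus-proposal
    # angle is assumed to be within [-pi/4, pi/4]; shift it by +pi/4 into
    # [0, pi/2], bin it with size omega, and normalize the within-bin
    # residual by omega/2 into [-1, 1].
    diff = theta_gt - theta_prop + np.pi / 4
    bin_id = int(np.floor(diff / omega))
    res = (diff - (bin_id * omega + omega / 2.0)) * 2.0 / omega
    return bin_id, res
```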
The overall refinement loss:

$$\mathcal{L}_{\text{refine}} = \frac{1}{\|\mathcal{B}\|} \sum_{i \in \mathcal{B}} \mathcal{F}_{\text{cls}}(\text{prob}_i, \text{label}_i) + \frac{1}{\|\mathcal{B}_{\text{pos}}\|} \sum_{i \in \mathcal{B}_{\text{pos}}} \left( \tilde{\mathcal{L}}_{\text{bin}}^{(i)} + \tilde{\mathcal{L}}_{\text{res}}^{(i)} \right)$$

B is the set of 3D proposals from stage 1 and B_pos contains the positive proposals; prob_i is the predicted confidence of b̂_i and label_i its corresponding label. F_cls is a supervised cross-entropy loss, and L̃_bin and L̃_res are analogous to L_bin^(p) and L_res^(p) above. Finally, oriented NMS with a BEV IoU threshold of 0.01 removes the overlapping boxes.