Spatial As Deep: Spatial CNN for Traffic Scene Understanding (Paper Translation)



Abstract

Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer by layer. Although CNNs have shown a strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance clues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and to the Cityscapes dataset (see footnote 1). The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.

Introduction

In recent years, autonomous driving has received much attention in both academia and industry. One of the most challenging tasks of autonomous driving is traffic scene understanding, which comprises computer vision tasks like lane detection and semantic segmentation. Lane detection helps to guide vehicles and could be used in driving assistance systems (Urmson et al. 2008), while semantic segmentation provides more detailed positions of surrounding objects like vehicles or pedestrians. In real applications, however, these tasks can be very challenging considering the many harsh scenarios, including bad weather conditions, dim or dazzling light, etc. Another challenge of traffic scene understanding is that in many cases, especially in lane detection, we need to tackle objects with a strong structure prior but few appearance clues, like lane markings and poles, which have a long continuous shape and might be occluded. For instance, in the first example in Fig. 1 (a), the car on the right side fully occludes the rightmost lane marking.

Footnote 1: Code is available at https://github.com/XingangPan/SCNN

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Comparison between CNN and SCNN in (a) lane detection and (b) semantic segmentation. For each example, from left to right: input image, output of CNN, output of SCNN. SCNN better captures the long continuous shape prior of lane markings and poles and fixes the disconnected parts in the CNN output.

Although CNN based methods (Krizhevsky, Sutskever, and Hinton 2012; Long, Shelhamer, and Darrell 2015) have pushed scene understanding to a new level thanks to their strong representation learning ability, they still do not perform well for objects that have long structures and may be occluded, such as the lane markings and poles shown in the red bounding boxes in Fig. 1. Humans, however, can easily infer the positions of such objects and fill in the occluded part from the context, i.e., the visible part.

To address this issue, we propose Spatial CNN (SCNN), a generalization of deep convolutional neural networks to a rich spatial level. In a layer-by-layer CNN, a convolution layer receives input from the former layer, applies a convolution operation and a nonlinear activation, and sends the result to the next layer. This process is done sequentially. Similarly, SCNN views rows or columns of feature maps as layers and applies convolution, nonlinear activation, and sum operations sequentially, which forms a deep neural network. In this way information can be propagated between neurons in the same layer. This is particularly useful for structured objects such as lanes, poles, or trucks with occlusions, since the spatial information can be reinforced via inter-layer propagation. As shown in Fig. 1, in cases where the CNN output is discontinuous or messy, SCNN preserves the smoothness and continuity of lane markings and poles well. In our experiments, SCNN significantly outperforms other RNN or MRF/CRF based methods, and also gives better results than the much deeper ResNet-101 (He et al. 2016).

Figure 2: (a) Dataset examples for different scenarios. (b) Proportion of each scenario.

Related Work. For lane detection, most existing algorithms are based on hand-crafted low-level features (Aly 2008; Son et al. 2015; Jung, Youn, and Sull 2016), which limits their capability to deal with harsh conditions. Only Huval et al. (2015) made a first attempt at adopting deep learning in lane detection, but without a large and general dataset. For semantic segmentation, in contrast, CNN based methods have become mainstream and achieved great success (Long, Shelhamer, and Darrell 2015; Chen et al. 2017).

There have been some other attempts to utilize spatial information in neural networks. Visin et al. (2015) and Bell et al. (2016) used recurrent neural networks to pass information along each row or column, so in one RNN layer each pixel position can only receive information from the same row or column. Liang et al. (2016a; 2016b) proposed variants of LSTM to exploit contextual information in semantic object parsing, but such models are computationally expensive. Researchers have also attempted to combine CNN with graphical models like MRF or CRF, in which message passing is realized by convolution with large kernels (Liu et al. 2015; Tompson et al. 2014; Chu et al. 2016). SCNN has three advantages over these methods: (1) the sequential message passing scheme is much more computationally efficient than traditional dense MRF/CRF, (2) the messages are propagated as residuals, making SCNN easy to train, and (3) SCNN is flexible and can be applied to any level of a deep neural network.

Spatial Convolutional Neural Network
Lane Detection Dataset
In this paper, we present a large-scale challenging dataset for traffic lane detection. Despite the importance and difficulty of traffic lane detection, existing datasets are either too small or too simple, and a large public annotated benchmark is needed to compare different methods (Bar Hillel et al. 2014). KITTI (Fritsch, Kuhnl, and Geiger 2013) and CamVid (Brostow et al. 2008) contain pixel-level annotations for lanes/lane markings, but have merely hundreds of images, too few for deep learning methods. The Caltech Lanes Dataset (Aly 2008) and the recently released TuSimple Benchmark Dataset (TuSimple 2017) consist of 1224 and 6408 images with annotated lane markings respectively, but the traffic is in a constrained scenario with light traffic and clear lane markings. Besides, none of these datasets annotates the lane markings that are occluded or unseen because of abrasion, although such lane markings can be inferred by humans and are of high value in real applications.

To collect data, we mounted cameras on six different vehicles driven by different drivers and recorded videos while driving in Beijing on different days. More than 55 hours of video were collected and 133,235 frames were extracted, which is more than 20 times the size of the TuSimple Dataset. We divided the dataset into 88,880 frames for the training set, 9,675 for the validation set, and 34,680 for the test set. These images were undistorted using the tools in (Scaramuzza, Martinelli, and Siegwart 2006) and have a resolution of $1640 \times 590$. Fig. 2 (a) shows some examples, which comprise urban, rural, and highway scenes. As one of the largest and most crowded cities in the world, Beijing provides many challenging traffic scenarios for lane detection. We divided the test set into a normal category and 8 challenging categories, which correspond to the 9 examples in Fig. 2 (a). Fig. 2 (b) shows the proportion of each scenario. It can be seen that the 8 challenging scenarios account for most (72.3%) of the dataset.

For each frame, we manually annotate the traffic lanes with cubic splines. As mentioned earlier, in many cases lane markings are occluded by vehicles or are unseen. In real applications it is important that lane detection algorithms can estimate lane positions from the context even in these challenging scenarios, which occur frequently. Therefore, for these cases we still annotate the lanes according to the context, as shown in Fig. 2 (a) (2)(4). We also hope that our algorithm can distinguish barriers on the road, like the one in Fig. 2 (a) (1). Thus the lanes on the other side of the barrier are not annotated. In this paper we focus on the detection of four lane markings, which receive the most attention in real applications. Other lane markings are not annotated.

Spatial CNN

Traditional methods to model spatial relationships are based on Markov Random Fields (MRF) or Conditional Random Fields (CRF) (Krähenbühl and Koltun 2011). Recent works (Zheng et al. 2015; Liu et al. 2015; Chen et al. 2017) that combine them with CNN all follow the pipeline of Fig. 3 (a), where the mean field algorithm can be implemented with neural networks. Specifically, the procedure is (1) Normalize: the output of the CNN is viewed as unary potentials and is normalized by the Softmax operation, (2) Message Passing, which can be realized by channel-wise convolution with large kernels (for dense CRF, the kernel size would cover the whole image and the kernel weights depend on the input image), (3) Compatibility Transform, which can be implemented with a convolution layer, and (4) Adding unary potentials. This process is iterated N times to give the final output.

Figure 3: (a) MRF/CRF based method. (b) Our implementation of Spatial CNN. MRF/CRF are theoretically applied to unary potentials whose channel number equals the number of classes to be classified, while SCNN can be applied to the top hidden layers with richer information.

It can be seen that in the message passing process of traditional methods, each pixel receives information from all other pixels, which is computationally very expensive and hard to use in real-time tasks such as autonomous driving. For MRF, the large convolution kernel is hard to learn and usually requires careful initialization (Tompson et al. 2014; Liu et al. 2015). Moreover, these methods are applied to the output of the CNN, while the top hidden layer, which comprises richer information, might be a better place to model spatial relationships.

To address these issues, and to learn more efficiently the spatial relationships and the smooth, continuous prior of lane markings and other structured objects in the driving scenario, we propose Spatial CNN. Note that 'spatial' here does not have the same meaning as in 'spatial convolution'; it denotes propagating spatial information via a specially designed CNN structure.

As shown in the 'SCNN_D' module of Fig. 3 (b), consider an SCNN applied to a 3-D tensor of size $C \times H \times W$, where $C$, $H$, and $W$ denote the number of channels, rows, and columns respectively. The tensor is split into $H$ slices, and the first slice is then sent into a convolution layer with $C$ kernels of size $C \times w$, where $w$ is the kernel width. In a traditional CNN the output of a convolution layer is fed into the next layer, while here the output is added to the next slice to produce a new slice. The new slice is then sent to the next convolution layer, and this process continues until the last slice is updated.

Specifically, assume we have a 3-D kernel tensor $K$ with element $K_{i,j,k}$ denoting the weight between an element in channel $i$ of the last slice and an element in channel $j$ of the current slice, with an offset of $k$ columns between the two elements. Also denote the element of the input 3-D tensor $X$ as $X_{i,j,k}$, where $i$, $j$, and $k$ indicate the indexes of channel, row, and column respectively. Then the forward computation of SCNN is:

$$
X'_{i,j,k} =
\begin{cases}
X_{i,j,k}, & j = 1 \\
X_{i,j,k} + f\left( \sum_{m} \sum_{n} X'_{m,j-1,k+n-1} \times K_{m,i,n} \right), & j = 2, 3, \dots, H
\end{cases}
\qquad (1)
$$

where $f$ is a nonlinear activation function such as ReLU. The $X$ with superscript $'$ denotes an element that has been updated. Note that the convolution kernel weights are shared across all slices, thus SCNN is a kind of recurrent neural network. Also note that SCNN has directions. In Fig. 3 (b), the four 'SCNN' modules with suffixes 'D', 'U', 'R', and 'L' denote SCNN that is downward, upward, rightward, and leftward respectively.
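
The downward pass of Eq. (1) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' Torch7 implementation: the function name `scnn_downward`, the nested-list tensor layout, the zero padding at the left and right borders, and the centring of the kernel on column k are all choices made for this example.

```python
def relu(v):
    """ReLU, used here as the nonlinearity f of Eq. (1)."""
    return v if v > 0.0 else 0.0


def scnn_downward(X, K):
    """Downward SCNN pass over a C x H x W tensor given as nested lists.

    X[i][j][k] : input element in channel i, row j, column k.
    K[m][i][n] : weight from channel m of the previous (updated) row to
                 channel i of the current row, at kernel position n.
    Assumptions for this sketch: odd kernel width w, zero padding, and a
    kernel centred on column k.
    """
    C, H, W = len(X), len(X[0]), len(X[0][0])
    w = len(K[0][0])
    pad = w // 2
    # Copy the input so that the caller's tensor is left unchanged.
    out = [[row[:] for row in X[c]] for c in range(C)]
    # Row 0 is kept as is; every later row receives a residual message
    # computed from the already-updated row above it.
    for j in range(1, H):
        for i in range(C):
            for k in range(W):
                s = 0.0
                for m in range(C):
                    for n in range(w):
                        col = k + n - pad
                        if 0 <= col < W:
                            s += out[m][j - 1][col] * K[m][i][n]
                out[i][j][k] = X[i][j][k] + relu(s)
    return out


# One channel, 3 x 3 map, a single active pixel in the top row.
X = [[[0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0]]]
K_ones = [[[1.0, 1.0, 1.0]]]
print(scnn_downward(X, K_ones))
```

With a kernel of all ones, the single active pixel spreads into a widening cone in the rows below, because each row is computed from the updated row above it. The upward, leftward, and rightward passes follow the same pattern with the slicing direction changed.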

Analysis

Spatial CNN has three main advantages over traditional methods, summarized as follows.

(1) Computational efficiency. As shown in Fig. 4, in dense MRF/CRF each pixel receives messages from all other pixels directly, which can involve much redundancy, while in SCNN message passing is realized in a sequential propagation scheme. Specifically, assume a tensor with $H$ rows and $W$ columns; then in dense MRF/CRF there is message passing between every two of the $WH$ pixels. For $n_{iter}$ iterations, the number of message passings is $n_{iter} W^2 H^2$. In SCNN, each pixel only receives information from $w$ pixels, thus the number of message passings is $n_{dir} W H w$, where $n_{dir}$ and $w$ denote the number of propagation directions in SCNN and the kernel width of SCNN respectively. $n_{iter}$ could range from 10 to 100, while $n_{dir}$ in this paper is set to 4, corresponding to 4 directions, and $w$ is usually no larger than 10 (see the example in Fig. 4 (b)). It can be seen that for images with hundreds of rows and columns, SCNN saves much computation, while each pixel can still receive messages from all other pixels through message propagation along 4 directions.
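
To make the difference in scale concrete, the two message counts from the paragraph above, n_iter · W² · H² for dense MRF/CRF and n_dir · W · H · w for SCNN, can be compared numerically. The function names, the symbol name `n_dir`, and the 36 × 100 feature-map size below are assumptions chosen only for illustration.

```python
def dense_crf_messages(H, W, n_iter):
    """Message passings in dense MRF/CRF: every pixel pair, every iteration."""
    return n_iter * (W * H) ** 2


def scnn_messages(H, W, n_dir=4, w=9):
    """Message passings in SCNN: each pixel hears from w pixels per direction."""
    return n_dir * W * H * w


# Hypothetical 36 x 100 feature map, 10 mean-field iterations,
# 4 SCNN directions, kernel width 9.
dense = dense_crf_messages(36, 100, n_iter=10)
scnn = scnn_messages(36, 100, n_dir=4, w=9)
print(dense, scnn, dense // scnn)
```

Under these assumed sizes the dense scheme needs three orders of magnitude more message passings, and the gap grows linearly with the number of pixels.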
Figure 4: Message passing directions in (a) dense MRF/CRF and (b) Spatial CNN (rightward). For (a), only message passing to the inner 4 pixels is shown for clarity.

(2) Message as residual. In MRF/CRF, message passing is achieved via a weighted sum over all pixels, which, as discussed in the previous paragraph, is computationally expensive. Recurrent neural network based methods, in turn, might suffer from vanishing or exploding gradients (Pascanu, Mikolov, and Bengio 2013), considering so many rows or columns. However, deep residual learning (He et al. 2016) has shown its capability to ease the training of very deep neural networks. Similarly, in our deep SCNN, messages are propagated as residuals, namely the output of ReLU in Eq. (1). Such a residual can also be viewed as a kind of modification to the original neuron. As our experiments will show, this message passing scheme achieves better results than LSTM based methods.

(3) Flexibility. Thanks to the computational efficiency of SCNN, it can easily be incorporated into any part of a CNN, rather than just the output. Usually, the top hidden layer contains information that is both rich and of high semantics, and is thus an ideal place to apply SCNN. As an example, Fig. 3 shows our implementation of SCNN on the LargeFOV (Chen et al. 2017) model. SCNNs in four spatial directions are added sequentially right after the top hidden layer ('fc7' layer) to introduce spatial message propagation.

Experiment

We evaluate SCNN on our lane detection dataset and Cityscapes (Cordts et al. 2016). In both tasks, we train the models using standard SGD with batch size 12, base learning rate 0.01, momentum 0.9, and weight decay 0.0001. The learning rate policy is "poly" with power and iteration number set to 0.9 and 60K respectively. Our models are modified from the LargeFOV model in (Chen et al. 2017). The initial weights of the first 13 convolution layers are copied from VGG16 (Simonyan and Zisserman 2015) trained on ImageNet (Deng et al. 2009). All experiments are implemented on the Torch7 (Collobert, Kavukcuoglu, and Farabet 2011) framework.
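
For readers unfamiliar with the "poly" policy, the sketch below shows the form it commonly takes, lr = base_lr · (1 − iter / max_iter)^power. It is an assumption that this is exactly the variant used in the experiments; only the base learning rate (0.01), the power (0.9), and the iteration number (60K) come from the text, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(iteration, base_lr=0.01, max_iter=60000, power=0.9):
    """Polynomial learning-rate decay from base_lr at iteration 0 to 0 at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Learning rate at the start, the midpoint, and the end of training.
for it in (0, 30000, 60000):
    print(it, poly_lr(it))
```

With a power slightly below 1, the decay is close to linear but keeps the learning rate a little higher through most of training before dropping to zero at the final iteration.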
Figure 5: (a) Training model, (b) Lane prediction process. 'Conv', 'HConv', and 'FC' denote convolution layer, atrous convolution layer (Chen et al. 2017), and fully connected layer respectively. 'c', 'w', and 'h' denote the number of output channels, the kernel width, and the 'rate' for atrous convolution.

Lane Detection

Lane detection model. Unlike common object detection tasks that only require bounding boxes, lane detection requires precise prediction of curves. A natural idea is that the model should output probability maps (probmaps) of these curves, so we generate pixel-level targets to train the networks, as in semantic segmentation tasks. Instead of viewing different lane markings as one class and doing clustering afterwards, we want the neural network to distinguish different lane markings by itself, which could be more robust. Thus the four lanes are viewed as different classes. Moreover, the probmaps are sent to a small network that predicts the existence of each lane marking.

During testing, we still need to go from probmaps to curves. As shown in Fig. 5 (b), for each lane marking whose existence value is larger than 0.5, we search the corresponding probmap every 20 rows for the position with the highest response. These positions are then connected by cubic splines, which are the final predictions.
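
The point-extraction step described above can be sketched as follows. This is a hedged illustration only: the function name `lane_points`, the list-of-lists probmap layout, and the top-to-bottom row order are assumptions, and the final cubic-spline fitting through the returned points is left out.

```python
def lane_points(probmap, existence, row_step=20, exist_thresh=0.5):
    """Pick one (row, column) point every `row_step` rows for a single lane.

    probmap   : H x W list of lists with per-pixel lane probabilities.
    existence : predicted existence value for this lane marking.
    Returns an empty list when the lane is judged absent.
    """
    if existence <= exist_thresh:
        return []
    points = []
    for r in range(0, len(probmap), row_step):
        row = probmap[r]
        # Column with the highest response in this row.
        col = max(range(len(row)), key=row.__getitem__)
        points.append((r, col))
    return points


# Toy 40 x 5 probmap with peaks at (0, 2) and (20, 4).
pm = [[0.0] * 5 for _ in range(40)]
pm[0][2] = 0.9
pm[20][4] = 0.8
print(lane_points(pm, existence=0.7))
```

A practical version would probably also discard rows whose best response is very low, so that the spline is not pulled toward background pixels; the text does not specify such a threshold, so none is assumed here.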

As shown in Fig. 5 (a), the detailed differences between our baseline model and LargeFOV are: (1) the output channel number of the 'fc7' layer is set to 128, (2) the 'rate' for the atrous convolution layer of 'fc6' is set to 4, (3) batch normalization (Ioffe and Szegedy 2015) is added before each ReLU layer, and (4) a small network is added to predict the existence of lane markings. During training, the line width of the targets is set to 16 pixels, and the input and target images are rescaled to $800 \times 288$. Considering the imbalanced labels between background and lane markings, the loss of the background is multiplied by 0.4.
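
The background down-weighting mentioned above can be illustrated with a per-pixel weighted cross-entropy. This is a sketch under assumptions: the text only states that the background loss is multiplied by 0.4, so the use of softmax cross-entropy, the class layout (index 0 for background, 1 to 4 for the four lanes), and the function name `weighted_pixel_ce` are all hypothetical.

```python
import math


def weighted_pixel_ce(probs, label, background_weight=0.4):
    """Cross-entropy for one pixel, scaled by 0.4 when the label is background.

    probs : softmax probabilities over the classes (assumed: 0 = background,
            1..4 = the four lane markings).
    label : ground-truth class index for this pixel.
    """
    weight = background_weight if label == 0 else 1.0
    return -weight * math.log(probs[label])


probs = [0.5, 0.5, 0.0, 0.0, 0.0]
print(weighted_pixel_ce(probs, label=0))  # background pixel, down-weighted
print(weighted_pixel_ce(probs, label=1))  # lane pixel, full weight
```

Because background pixels vastly outnumber lane pixels, scaling their loss keeps the thin lane regions from being drowned out during training.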
Figure 6: Evaluation based on IoU. Green lines denote ground truth, while blue and red lines denote TP and FP respectively.

Evaluation. In order to judge whether a lane marking is successfully detected, we view lane markings as lines with a width of 30 pixels and calculate the intersection-over-union (IoU) between the ground truth and the prediction. Predictions whose IoUs are larger than a certain threshold are viewed as true positives (TP), as shown in Fig. 6. Here we consider thresholds of 0.3 and 0.5, corresponding to loose and strict evaluations. Then we employ the F-measure $F = (1 + \beta^2) \frac{Precision \cdot Recall}{\beta^2 \, Precision + Recall}$ as the final evaluation index, where $Precision = \frac{TP}{TP + FP}$ and $Recall = \frac{TP}{TP + FN}$. Here $\beta$ is set to 1, corresponding to the harmonic mean (F1-measure).
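
As a small worked example of the evaluation index, the snippet below computes the F-measure from counts of true positives, false positives, and false negatives. The function name `f_measure` and the convention of returning 0 when there are no true positives are assumptions; rendering the 30-pixel-wide lines and computing their IoU is not shown.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from detection counts; beta = 1 gives the F1-measure."""
    if tp == 0:
        return 0.0  # convention assumed here to avoid division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# 8 lanes detected correctly, 2 false detections, 2 missed lanes.
print(f_measure(tp=8, fp=2, fn=2))
```

In this example precision and recall are both 0.8, so their harmonic mean, the F1-measure, is also 0.8.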

Ablation Study. In the Spatial CNN section we propose Spatial CNN to enable spatial message propagation. To verify our method, we conduct detailed ablation studies in this subsection. Our implementation of SCNN follows that shown in Fig. 3.

(1) Effectiveness of multidirectional SCNN. First, we investigate the effects of directions in SCNN. We try SCNN with different direction implementations; the results are shown in Table 1. Here the kernel width $w$ of SCNN is set to 5. It can be seen that the performance increases as more directions are added. To prove that the improvement does not result from more parameters but from the message passing scheme brought about by SCNN, we add an extra convolution layer after the top hidden layer of the baseline model and compare it with our method. The results show that the extra convolution layer brings only a small improvement, which verifies the effectiveness of SCNN.
(2) Effects of kernel width w. We further try SCNN with different kernel widths based on the "SCNN_DURL" model, as shown in Table 2. Here the kernel width denotes the number of pixels that a pixel can receive messages from, and the $w = 1$ case is similar to the methods in (Visin et al. 2015; Bell et al. 2016). The results show that a larger $w$ is beneficial, and $w = 9$ gives a satisfactory result, which surpasses the baseline by a significant margin of 8.4% and 3.2% for the two IoU thresholds.
(3) Spatial CNN on different positions. As mentioned earlier, SCNN can be added at any place in a neural network. Here we consider the SCNN_DURL model applied to (1) the output and (2) the top hidden layer, which correspond to Fig. 3. The results in Table 3 indicate that the top hidden layer, which comprises richer information than the output, turns out to be a better position to apply SCNN.
(4) Effectiveness of sequential propagation. In our SCNN, information is propagated in a sequential way, i.e., a slice does not pass information to the next slice until it has received information from the former slices. To verify the effectiveness of this scheme, we compare it with parallel propagation, i.e., each slice passes information to the next slice simultaneously before being updated. For this parallel case, the $'$ on the right-hand side of Eq. (1) is removed. As Table 4 shows, the sequential message passing scheme outperforms the parallel scheme significantly. This result indicates that in SCNN, a pixel is not merely affected by nearby pixels, but does receive information from farther positions.
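
The difference between the two schemes can be seen on a toy 1-D example, where a single column of values stands in for the slices. This is a simplified sketch, not the experimental setup: the scalar weight, the function name `propagate`, and the one-element "kernel" are assumptions made to keep the example small.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def propagate(column, weight=1.0, sequential=True):
    """Pass residual messages down a column of scalar 'slices'.

    sequential=True  : each slice uses the already-updated slice above it.
    sequential=False : each slice uses the original slice above it
                       (the primed X on the right of Eq. (1) is replaced by X).
    """
    out = list(column)
    for j in range(1, len(column)):
        prev = out[j - 1] if sequential else column[j - 1]
        out[j] = column[j] + relu(weight * prev)
    return out


column = [1.0, 0.0, 0.0, 0.0]
print(propagate(column, sequential=True))   # signal reaches every slice
print(propagate(column, sequential=False))  # signal stops after one slice
```

In the sequential case the signal in the first slice reaches the last one, whereas in the parallel case it only reaches its immediate neighbour, which matches the observation that sequential propagation lets a pixel receive information from distant positions.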
(5) Comparison with state-of-the-art methods. To further verify the effectiveness of SCNN in lane detection, we compare it with several methods: the RNN based ReNet (Visin et al. 2015), the MRF based MRFNet, the DenseCRF (Krähenbühl and Koltun 2011), and the very deep residual network (He et al. 2016). For ReNet based on LSTM, we replace the "SCNN" layers in Fig. 3 with two ReNet layers: one layer to pass horizontal information and the other to pass vertical information. For DenseCRF, we use dense CRF as post-processing and employ 10 mean field iterations as in (Chen et al. 2017). For MRFNet, we use the implementation in Fig. 3 (a), with the number of iterations and the message passing kernel size set to 10 and 20 respectively. The main difference between the MRF here and CRF is that the weights of the message passing kernels are learned during training rather than depending on the image. For ResNet, our implementation is the same as in (Chen et al. 2017) except that we do not use the ASPP module. For SCNN, we add the SCNN_DULR module to the baseline, and the kernel width $w$ is 9. The test results on different scenarios are shown in Table 5, and visualizations are given in Fig. 7.

Figure 7: Comparison between probmaps of baseline, ReNet, MRFNet, ResNet-101, and SCNN.

From the results, we can see that the performance of ReNet is not even comparable with that of SCNN DULR with w = 1, indicating the effectiveness of our residual message passing scheme. Interestingly, DenseCRF leads to a worse result here: lane markings usually have few appearance clues, so dense CRF cannot distinguish lane markings from the background. In contrast, with kernel weights learned from data, MRFNet can to some extent smooth the results and improve performance, as Fig. 7 shows, but its results are still not very satisfactory. Furthermore, our method even outperforms the much deeper ResNet-50 and ResNet-101. Despite its more than one hundred layers and very large receptive field, ResNet-101 still gives messy or discontinuous outputs in challenging cases, while our method, with only 16 convolution layers plus 4 SCNN layers, preserves the smoothness and continuity of lane lines better. This demonstrates that SCNN has a much stronger capability than a traditional CNN to capture the structural prior of objects.
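To make the "residual message passing" mentioned above more concrete, here is a minimal pure-Python sketch of one downward pass, as I understand the update rule from the earlier part of the paper: each row receives a ReLU-activated 1-D convolution of the already-updated row above it, added to its own values. The function name `scnn_down`, the nested-list layout `[C][H][W]`, and the zero padding at the borders are assumptions made for illustration; this is not the authors' implementation, and it is written for readability rather than speed.

```python
def scnn_down(x, kernel):
    """One downward SCNN-style pass (illustrative sketch).

    x:      feature map as nested lists, shape [C][H][W]
    kernel: weights as nested lists, shape [C_out][C_in][w], with odd width w
    Returns a new feature map of the same shape.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    w = len(kernel[0][0])
    half = w // 2
    # Copy the input; the top row (i = 0) is left unchanged.
    out = [[row[:] for row in ch] for ch in x]
    for i in range(1, height):
        for co in range(channels):
            for j in range(width):
                msg = 0.0
                for ci in range(channels):
                    for k in range(w):
                        jj = j + k - half
                        if 0 <= jj < width:  # assumed zero padding
                            # Read the *updated* previous row, so information
                            # keeps flowing downward slice by slice.
                            msg += kernel[co][ci][k] * out[ci][i - 1][jj]
                # Residual update: add the activated message to the row.
                out[co][i][j] += max(msg, 0.0)
    return out


# Tiny demo: one channel, identity kernel of width 3.
x = [[[1.0, 2.0, 3.0],
      [0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0]]]
identity = [[[0.0, 1.0, 0.0]]]
out = scnn_down(x, identity)
print(out[0][2])  # [1.0, 2.0, 3.0]
```

With the identity kernel, the top row is carried down through every row below it, which is the sequential propagation that a single ordinary convolution layer does not provide. A full DULR module would repeat the same idea upward, leftward, and rightward, each direction with its own kernel.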

(6) Computational efficiency over other methods. In the Analysis section we give a theoretical analysis of the computational efficiency of SCNN over dense CRF. To verify this, we compare their runtime experimentally. The results are shown in Table 6, where the runtime of the LSTM in ReNet is also given. Here the runtime does not include the runtime of the backbone network. For SCNN, we test both the practical case and the case with the same setting as dense CRF. In the practical case, SCNN is applied on the top hidden layer, so the input has more channels but smaller height and width. In the fair comparison case, the input size is modified to be the same as that in dense CRF, and both methods are tested on CPU. The results show that even in the fair comparison case, SCNN is over 4 times faster than dense CRF, despite the efficient implementation of dense CRF in (Krähenbühl and Koltun 2011). This is because SCNN significantly reduces redundancy in message passing, as in Fig. 4. Also, SCNN is more efficient than LSTM, whose gate mechanism requires more computation.
Footnote 2: Intel Core i7-4790K CPU. Footnote 3: GeForce GTX TITAN Black.
Figure 8: Visual improvements on the Cityscapes validation set. For each example, from left to right: input image, ground truth, result of LargeFOV, result of LargeFOV+SCNN.

Semantic Segmentation on Cityscapes

To demonstrate the generality of our method, we also evaluate Spatial CNN on Cityscapes (Cordts et al. 2016). Cityscapes is a standard benchmark dataset for semantic segmentation of urban traffic scenes. It contains 5000 finely annotated images: 2975 for training, 500 for validation, and 1525 for testing. 19 categories are defined, including both stuff and objects. We use two classic models, LargeFOV and ResNet-101 in DeepLab (Chen et al. 2017), as the baselines. Batch normalization layers (Ioffe and Szegedy 2015) are added to LargeFOV to enable faster convergence. For both models, the channel numbers of the top hidden layers are reduced to 128 to make them more compact.

We add SCNN to the baseline models in the same way as in lane detection. The comparisons between the baselines and those combined with the SCNN DULR models with kernel width w = 9 are shown in Table 7. It can be seen that SCNN also improves semantic segmentation results. With SCNN added, the IoUs for all classes are at least comparable to the baselines, while the "wall", "pole", "truck", "bus", "train", and "motor" categories improve significantly. This is because for long objects like trains and poles, SCNN can capture their continuous structure and connect the disconnected parts, as shown in Fig. 8. For walls, trucks, and buses, which can occupy a large image area, the diffusion effect of SCNN can correct the misclassified parts according to the context. This shows that SCNN is useful not only for long thin structures, but also for large objects that require global information to be classified correctly. Another interesting phenomenon is that the head of the vehicle at the bottom of the images, whose label is ignored during training, is messy in LargeFOV, while with SCNN added it is classified as road. This is also due to the diffusion effect of SCNN, which passes the road information to the vehicle head area.
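Since the comparison above is stated in terms of per-class IoU, and notes that some pixels carry a label that is ignored, a small sketch of how such a metric is commonly computed may help. The function names `per_class_iou` and `mean_iou`, the flat-list input format, and the ignore value 255 are assumptions chosen for illustration; the paper does not specify its evaluation code.

```python
def per_class_iou(pred, target, num_classes, ignore_label=255):
    """Per-class intersection-over-union from flat lists of class ids.

    Pixels whose target equals ignore_label (assumed to be 255 here) are
    skipped. Classes that never appear in pred or target get None.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            # Pixel is in both the predicted and the true region of class t.
            inter[t] += 1
            union[t] += 1
        else:
            # Pixel counts toward the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [inter[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(ious):
    """Average over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else None


# Five pixels, three possible classes; the fourth pixel is ignored.
pred = [0, 0, 1, 1, 1]
target = [0, 1, 1, 255, 0]
ious = per_class_iou(pred, target, num_classes=3)
print(ious, mean_iou(ious))
```

Because ignored pixels never enter the counts, a region such as the ego-vehicle hood can be predicted as anything without affecting the score, which is consistent with the observation above that its predicted class differs between LargeFOV and LargeFOV+SCNN.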

To compare our method with other MRF/CRF based methods, we evaluate LargeFOV+SCNN on the Cityscapes test set and compare it with methods that also use VGG16 (Simonyan and Zisserman 2015) as the backbone network. The results are shown in Table 8. Here LargeFOV, DPN, and our method use dense CRF, dense MRF, and SCNN respectively, and share nearly the same base CNN part. The results show that our method achieves significantly better performance.
Conclusion

In this paper, we propose Spatial CNN, a CNN-like scheme to achieve effective information propagation at the spatial level. SCNN can be easily incorporated into deep neural networks and trained end-to-end. It is evaluated on two tasks in traffic scene understanding: lane detection and semantic segmentation. The results show that SCNN effectively preserves the continuity of long thin structures, while in semantic segmentation its diffusion effect also proves beneficial for large objects. Specifically, by introducing SCNN into the LargeFOV model, our 20-layer network outperforms ReNet, MRF, and the very deep ResNet-101 in lane detection. Last but not least, we believe that the large, challenging lane detection dataset we presented will push forward research on autonomous driving.

This translation is taken from
http://tongtianta.site/
