The network has four task modules that cooperate in a complementary way: grid box regression, object detection, multi-label classification, and vanishing point prediction. This structure allows us to detect and classify lanes and road markings, and to predict the vanishing region, simultaneously in a single forward pass.
We have designed a Vanishing Point Prediction (VPP) task that guides robust lane and road marking detection, similar to human vision. In this paper, the “Vanishing Point (VP)” is defined as the nearest point on the horizon where lanes converge and disappear, predicted around the farthest point of the visible lane. This VP can be used to provide a global geometric context of a scene, which is important for inferring the location of lanes and road markings.
We use a quadrant mask that divides the whole image into four sections. The intersection of these four sections is the VP. In this way, we can infer the VP using four quadrant sections that cover the structures of the global scene. To implement this, we define five channels for the output of the VPP task: one absence channel and four quadrant channels. Every pixel in the output image is assigned to one of the five channels. The absence channel represents a pixel with no VP, while each of the four quadrant channels stands for one quadrant section of the image. If the VP is hard to identify (e.g. intersection roads, occlusions), every pixel will tend to be classified into the absence channel. In this case, the average confidence of the absence channel will be high.
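The five-channel target described above can be sketched as a per-pixel label grid. This is only an illustration: the function name `quadrant_labels`, the channel ordering, and the rule that every pixel receives a quadrant label whenever a VP is annotated are assumptions, not details given in the notes.

```python
# Channel indices; the ordering is an assumption made for this sketch.
ABSENCE, UPPER_LEFT, UPPER_RIGHT, LOWER_LEFT, LOWER_RIGHT = 0, 1, 2, 3, 4


def quadrant_labels(height, width, vp=None):
    """Return a height x width grid of channel indices for the VPP target.

    vp is an (x, y) pixel coordinate, or None when no VP is annotated.
    """
    if vp is None:
        # No VP: every pixel belongs to the absence channel.
        return [[ABSENCE] * width for _ in range(height)]

    vx, vy = vp
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            if y < vy:
                row.append(UPPER_LEFT if x < vx else UPPER_RIGHT)
            else:
                row.append(LOWER_LEFT if x < vx else LOWER_RIGHT)
        labels.append(row)
    return labels


# The four quadrants meet at the VP, here at x=3, y=2 in a 4 x 6 image.
example = quadrant_labels(4, 6, vp=(3, 2))
```

With this layout the VP is the single point where all four labels meet, which is why the quadrant output can be used to locate it.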
We noticed that during the training stage the VPP task became dependent on the lane detection task. The dependency between lanes and the VP implies a strong information correlation. In this case, the VP provides redundant information to the network, leading to only marginal improvement in lane detection. To prevent this side effect, we train the network in two phases so that the tasks stay balanced.
In the first phase, we train only the VPP task. The training of this phase stops when the VP detection task converges. Although we train only the VPP task, the weight updates of the mutually shared layers also decrease the losses of the other detection tasks by about 20%. This shows that the lane and road marking detection tasks and the VPP task share some common characteristics in the feature representation layers.
In the second phase, we further train all the tasks using the initialized kernels from the first phase.
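The total loss is a weighted sum of the four task losses, Loss = w1 Lreg + w2 Lom + w3 Lml + w4 Lvp, where Lreg is a grid regression L1 loss, and Lom, Lml, and Lvp are cross-entropy losses in each branch of the network.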
Balance weight trick
First, w1 to w4 are all set to 1, and the starting losses are observed. Then, we set each loss weight to the reciprocal of its initial loss value so that the weighted losses are uniform. If the scale difference between losses becomes large in the middle of training, this process is repeated to rebalance the loss values. The second phase stops when the validation accuracy converges.
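The balancing step above can be sketched in a few lines. The helper names `balance_weights` and `total_loss` and the starting loss values are hypothetical and chosen only to make the arithmetic visible; they are not taken from the paper.

```python
def balance_weights(observed_losses):
    """Set each loss weight to the reciprocal of its observed loss value."""
    weights = {}
    for name, value in observed_losses.items():
        if value <= 0:
            raise ValueError(f"loss '{name}' must be positive, got {value}")
        weights[name] = 1.0 / value
    return weights


def total_loss(losses, weights):
    """Weighted sum of the task losses: w1*Lreg + w2*Lom + w3*Lml + w4*Lvp."""
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical starting losses observed with all weights set to 1.
observed = {"reg": 4.0, "om": 0.5, "ml": 2.0, "vp": 1.0}

weights = balance_weights(observed)
# After balancing, every weighted loss term has the same scale.
weighted = {name: weights[name] * observed[name] for name in observed}
```

Calling `balance_weights` again on the current losses corresponds to repeating the process when the loss scales drift apart during training.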
Note: Some road markings, such as crosswalks or safety zones, are difficult to define with a single box, so they are localized by grid sampling without subsequent merging.
At the first training phase, we initialize the network only by the VPP task. After the initialization, all four tasks are trained simultaneously. For every task, we use Stochastic Gradient Descent optimization with a momentum of 0.9 and a mini-batch size of 20. Since multiple tasks must converge proportionally, we tune the learning rate of each task.
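As a reminder of what the momentum setting does, here is a minimal sketch of one common form of the SGD-with-momentum update for a single scalar parameter. The function name `sgd_momentum_step` and the learning rate of 0.1 are assumptions for illustration; the notes do not state the learning rates that were actually used.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD update with classical momentum.

    velocity accumulates a decaying sum of past gradient steps;
    returns the updated (param, velocity) pair.
    """
    velocity = momentum * velocity - lr * grad
    param = param + velocity
    return param, velocity


# Two steps with a constant gradient; lr=0.1 is a hypothetical value.
p, v = 1.0, 0.0
p, v = sgd_momentum_step(p, grad=2.0, velocity=v, lr=0.1)
first_step = (p, v)
p, v = sgd_momentum_step(p, grad=2.0, velocity=v, lr=0.1)
second_step = (p, v)
```

Tuning the learning rate per task, as described above, would amount to passing a different `lr` for the parameters of each branch.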
We train three models of the network, divided by task: 2-Task (revised [16]), 3-Task (revised [35]), and 4-Task (VPGNet). The 2-Task network includes the regression and binary classification tasks. The 3-Task network includes the 2-Task network and a multi-label classification task. The 4-Task network includes the 3-Task network and a VPP task, which is the VPGNet. Since the lane detection in [16] (An empirical evaluation of deep learning on highway driving) is not fully reproducible, we modify the data layer to handle the grid mask and move one convolutional layer from the shared layers to the branch layers, as in the 3- and 4-Task networks. The 3-Task network is similar to [35] (Traffic-sign detection and classification in the wild. In CVPR, 2016. Notes: https://segmentfault.com/a/1190000009438113), but we modify the data layer to handle the grid mask.
We test our models on an NVIDIA GTX Titan X and achieve a speed of 20 Hz using only a single forward pass. Specifically, the single forward pass takes about 30 ms and the post-processing takes about 20 ms or less, for a total of roughly 50 ms per frame.
As the results show, if we use more tasks, more neurons respond, especially around the boundaries of roadways.
In order to see what happens if the VP does not exist, we conducted an additional test on images without the VP (e.g. intersection roads or occlusions). Table 4 shows the results of the experiment, demonstrating that the enhancement of feature representation through the VPP task helps to find lanes even when there is no VP.