Most current license plate (LP) detection and recognition approaches
are evaluated on a small and usually unrepresentative dataset
since there are no publicly available large diverse datasets. In this paper,
we introduce CCPD, a large and comprehensive LP dataset. All images
are taken manually by workers of a roadside parking management company
and are annotated carefully. To the best of our knowledge, CCPD is the
largest publicly available LP dataset to date, with over 250k unique car
images, and the only one that provides vertex location annotations. With
CCPD, we present a novel network model which can predict the bounding
box and recognize the corresponding LP number simultaneously with
high speed and accuracy. Through comparative experiments, we demonstrate
our model outperforms current object detection and recognition
approaches in both accuracy and speed. In real-world applications, our
model recognizes LP numbers directly from relatively high-resolution
images at over 61 fps and 98.5% accuracy.
License plate detection and recognition (LPDR) is essential in Intelligent Transport Systems and is applied widely in many real-world surveillance systems, such as traffic monitoring, highway toll stations, and car park entrance and exit management. Extensive research has been carried out to make LPDR faster or more accurate.
However, challenges for license plate (LP) detection and recognition still exist in uncontrolled conditions, such as rotation (about 20° onwards), snow or fog, distortions, uneven illumination, and vagueness. Most papers concerning LPDR [1,2,3,4,5,6,7,8,9,10] validate their approaches on extremely limited datasets (fewer than 3,000 unique images) and thus might work well only under some controlled conditions. Current datasets for LPDR (see Tables 1 and 2) lack either quantity (fewer than 10k images) or diversity (collected from fixed surveillance cameras), because collecting LP pictures manually requires a lot of manpower. However, uncontrolled conditions are common in the real world, and a truly reliable LPDR system should function well in these cases. To aid in better benchmarking of LPDR approaches, we present our Chinese City Parking Dataset (CCPD).
CCPD collects data from roadside parking in all the streets of one provincial capital in China, where residents own millions of cars. Each parking fee collector (PFC) works on one street from 07:30 AM to 10:00 PM every day regardless of weather conditions. For each parking bill, the collector is required to take a picture of the car with an Android handheld POS machine and manually annotate the exact LP number. It is worth noting that images from handheld devices exhibit strong variations due to the uncertain position and shooting angle of handheld devices, as well as varying illuminations and different backgrounds at different hours and on different streets (see Figure 1). Each image in CCPD has detailed annotations in several aspects concerning the LP:
(i) LP number.
(ii) LP bounding box.
(iii) Four vertices locations.
(iv) Horizontal tilt degree and vertical tilt degree [11].
(v) Other relevant information like the LP area, the degree of brightness, the degree of vagueness and so on.
Details about those annotations are explained in Section 3.
Most papers [3,4,5,6,7] separate LPDR into two stages (detection · recognition) or three stages (detection · segmentation · character recognition) and process the LP image step by step. However, separating detection from recognition is detrimental to the accuracy and efficiency of the entire recognition process. An imperfect bounding box predicted by a detection method might cut off part of the LP and thus cause the subsequent recognition to fail. Moreover, operations between different stages, such as extracting and resizing the LP region for recognition, are usually carried out on the less efficient CPU, making LP recognition slower. Given these two observations, we come to the intuition that the LP recognition stage can exploit convolutional features extracted in the LP detection stage for recognizing LP characters. Following that, we design a novel architecture named Roadside Parking net (RPnet) for accomplishing LP detection and recognition in a single forward pass.
It is worth noting that we are not the first to design an end-to-end deep neural network which can localize LPs and recognize the LP number simultaneously. However, the end-to-end model put forward by Li et al. [12], which exploits a Region Proposal Network and Bi-directional Recurrent Neural Networks, is not efficient, as it needs 0.3 seconds to accomplish the recognition process on a Titan X GPU. By contrast, based on a simpler and more elegant architecture, RPnet can run at more than 60 fps on a weaker NVIDIA Quadro P4000.
Both CCPD and the code for training and evaluating RPnet are available under the open-source MIT License at:
https://github.com/detectRecog/CCPD.
To summarize, this paper makes the following contributions:
– We introduce CCPD, the largest and the most diverse publicly available dataset for LPDR to date. CCPD provides over 250k unique car images with detailed annotations, nearly two orders of magnitude more images than other diverse LP datasets.
– We propose a novel network architecture for unified LPDR named RPnet which can be trained end-to-end. As feature maps are shared for detection and recognition and losses are optimized jointly, RPnet can detect and recognize LPs more accurately and at a faster speed.
– By evaluating state-of-the-art detection and recognition models on CCPD, we demonstrate that our proposed model outperforms other approaches in both accuracy and speed.
Our work is related to prior art in two aspects: publicly available datasets (as shown in Table 1,2), and existing algorithms on LPDR. Except for [12] which proposed a unified deep neural network to accomplish LPDR in one step, most works separate LP detection from LP recognition.
Most datasets for LPDR [13,14,15] collect images from traffic monitoring systems, highway toll stations or parking lots. These images are always taken under even sunlight or supplementary light sources, and the tilt angle of LPs does not exceed 20°. Caltech [13] and Zemris [14] collected fewer than 700 images from high-resolution cameras on roads or freeways and thus show little variation in distance and tilt degree. Such a small volume of images is not sufficient to cover various conditions.
Therefore, those datasets are not convincing for evaluating LP detection algorithms. Unlike previous datasets, Azam et al. [10] and Hsu et al. [16] pointed out that research on LP detection under hazardous conditions was scarce and specifically looked for images in various conditions like great tilt angles, blurriness, weak illumination, and bad weather. Compared with CCPD, the shooting distance of these images varies little and the number of images is limited.
Current datasets for LP recognition usually collect extracted LP images and annotate their corresponding LP numbers. As shown in Table 2, SSIG [17] and UFPR [3] captured images with cameras on the road. These images were collected on a sunny day and rarely contain tilted LPs. Before CCPD was introduced, ReId [15] was the largest dataset for LP recognition, with 76k extracted LPs and annotations.
However, as they were gathered from surveillance cameras at highway toll gates, images in ReId are relatively invariant in tilt angle, distance, and illumination. Lacking either quantity or variance, current datasets are not convincing enough to comprehensively evaluate LP recognition algorithms.
LP detection algorithms can be roughly divided into traditional methods and neural network models. Traditional LP detection methods always exploit the abundant edge information [18,19,20,21,22,23,24] or the background color features [25,26]. Hsieh et al. [19] utilized a morphology method to reduce the number of candidates significantly and thus sped up the plate detection process. Yu et al. [21] proposed a robust method based on wavelet transform and empirical mode decomposition analysis to locate an LP. In [22] the authors analyzed vertical edge gradients to select true plate regions. Wang et al. [23] exploited a cascade AdaBoost classifier and a voting mechanism to elect plate candidates. In [27] a new pattern named Local Structure Patterns was introduced to detect plate regions. Moreover, based on the observation that the LP background always exhibits a regular color appearance, many works utilize the HSI (Hue, Saturation, Intensity) color space to filter out the LP area. Deb et al. [25] applied the HSI color model to detect candidate regions and achieved 89% accuracy on 90 images. In [26] the authors also exploited a color checking module to help find LP regions.
Recent progress on Region-based Convolutional Neural Network [28] stimulates wide applications [3,4,12,16] of popular object detection models on LP detection problem. Faster-RCNN [29] utilizes a region proposal network which can generate high-quality region proposals for detection and thus detects objects more accurately and quickly. SSD [30] completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. YOLO [31] and its improved version [32] frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
LP recognition methods can be classified into two categories: (i) segmentation-free methods, and (ii) methods that segment first and then recognize the segmented pictures. The former [33,34] usually utilize LP character features to extract plate characters directly, avoiding segmentation, or deliver the LP to an optical character recognition (OCR) system [35] or a convolutional neural network [15] to perform the recognition task. For the latter, the LP bounding box should be determined and shape correction applied before segmentation. Various features of LP characters can be utilized for segmentation, such as connected components analysis (CCA) [36] and character-specific extremal regions [37]. After segmentation, current high-performance methods always train a deep convolutional neural network [38] or utilize features around LP characters like SIFT [39].
In this section, we introduce CCPD – a large, diverse and carefully annotated LP dataset.
CCPD collects images from a city parking management company in one provincial capital in China where car owners own millions of vehicles. The company employs over 800 PFCs, each of whom charges the parking fee on a specific street. Each parking fee order not only records the LP number, cost, parking time and so on, but also requires the PFC to take a picture of the car from the front or the rear as proof. PFCs basically have no holidays and usually work from early morning (07:30 AM) to almost midnight (10:00 PM). Therefore, CCPD has images under diverse illuminations and environments in different weather. Moreover, as the only requirement for taking photos is that they contain the LP, a PFC may shoot from various positions and angles, and the hand may even shake slightly. As a result, images in CCPD are taken from different positions and angles and are sometimes blurred.
Apart from the LP number, each image in CCPD has many other annotations. The most difficult part is annotating the four vertices locations. To accomplish this task, we first manually labelled the four vertices locations over 10k images. Then we designed a network for locating vertices from a small image of LP regions and exploited the 10k images and some data augmentation strategies to train it. After training this network well, we combined a detection module and this network to automatically annotate the four vertices locations of each image. Finally, we hired seven part-time workers to correct these annotations within two weeks. Details about the annotation process are provided in the supplementary material.
In order to avoid leaking residents' privacy, CCPD removes records other than the LP number of each image and selects images from discrete days and different streets. In addition, all image metadata, including device information, GPS location, etc., is cleared, and privacy regions like human faces are blurred.
CCPD provides over 250k unique LP images with detailed annotations. The resolution of each image is 720 (Width) × 1160 (Height) × 3 (Channels). In practice, this resolution is enough to guarantee that the LP in each image is legible. The average size of each file is about 200 kilobytes (a total of over 48.0 Gigabytes for the entire dataset). Each image in CCPD is labelled in the following aspects:
– LP number. Each image in CCPD has only one LP. Each LP number is comprised of a Chinese character, a letter, and five letters or numbers. The LP number is an important metric for recognition accuracy.
– LP bounding box. The bounding box label contains the (x, y) coordinates of the top left and bottom right corners of the bounding box. These two points can be utilized to locate the minimum bounding rectangle of the LP.
– Four vertices locations. This annotation contains the exact (x, y) coordinates of the four vertices of the LP in the whole image. As the shape of the LP is basically a quadrilateral, these vertex locations can accurately represent the borders of the LP for object segmentation.
– Horizontal tilt degree and vertical tilt degree. As explained in [11], the horizontal tilt degree is the angle between the LP and the horizontal line. After the 2D rotation, the vertical tilt degree is the angle between the left border line of the LP and the horizontal line.
– Other information concerning the LP, like the area, the degree of brightness and the degree of vagueness.
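The relationship between the vertex annotation and the bounding box annotation can be illustrated with a short sketch. It assumes the four vertices arrive as (x, y) pixel pairs in the order bottom-right, bottom-left, top-left, top-right. That ordering, the function names, and the use of the bottom edge to estimate the horizontal tilt are assumptions made for illustration; they are not taken from the dataset description.

```python
import math

def bounding_box_from_vertices(vertices):
    """Return the top-left and bottom-right corners of the axis-aligned
    rectangle that encloses all four LP vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys)), (max(xs), max(ys))

def horizontal_tilt_degrees(bottom_left, bottom_right):
    """Angle between the LP bottom edge and the horizontal line, in degrees.

    Image y coordinates grow downwards, so dy is negated to make a
    counter-clockwise tilt positive (an assumed sign convention)."""
    dx = bottom_right[0] - bottom_left[0]
    dy = bottom_right[1] - bottom_left[1]
    return math.degrees(math.atan2(-dy, dx))

# Hypothetical vertices: bottom-right, bottom-left, top-left, top-right.
vertices = [(400, 520), (200, 500), (200, 440), (400, 460)]
top_left, bottom_right = bounding_box_from_vertices(vertices)
tilt = horizontal_tilt_degrees(vertices[1], vertices[0])
print(top_left, bottom_right, round(tilt, 2))
```

The bounding box is fully determined by the vertices, which is why the vertex annotation is the more informative of the two.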
Current diverse LPDR datasets [10,16,40] usually contain fewer than 5k images. After dividing these challenging images into different categories [40], some categories contain fewer than 100 images. Based on this observation, we select images under different conditions from millions of LP images to build several sub-datasets for CCPD. The distribution of sub-datasets in CCPD is shown in Figure 2. Descriptions of these sub-datasets are shown in Table 3. Statistics and samples of these sub-datasets are provided in the supplementary material. We further add CCPD-Characters, which contains at least 1000 extracted images for each possible LP character. CCPD-Characters is designed for training neural networks to recognize segmented character images. More character images can be automatically extracted by utilizing the annotations of images in CCPD.
In this section, we introduce our proposed LP detection and recognition framework, called RPnet, and discuss the associated training methodology.
Fig. 3. The overall structure of our RPnet. It consists of ten convolutional layers with ReLU and Batch Normalization, several MaxPooling layers with Dropout and several components composed of fully connected layers. Given an input RGB image, in a single forward computation, RPnet predicts the LP bounding box and the corresponding LP number at the same time. RPnet first exploits the Box Regression layer to predict the bounding box. Then, referring to the relative position of the bounding box in each feature map, RPnet extracts ROIs from several already generated feature maps, combines them after pooling them to the same width and height (16*8), and feeds the combined feature maps to the subsequent Classifiers.
RPnet, as shown in Figure 3, is composed of two modules. The first module is a deep convolutional neural network with ten convolutional layers that extracts feature maps at different levels from the input LP image. We name this module ‘the detection module’. The detection module feeds the feature map output by the last convolutional layer to three sibling fully-connected layers, which we name ‘the box predictor’, for bounding box prediction. The second module, named ‘the recognition module’, exploits region-of-interest (ROI) pooling layers [28] to extract feature maps of interest and several classifiers to predict the LP number of the LP in the input image. The entire module is a single, unified network for LP detection and recognition. In terms of the popular notion of ‘attention’ [41] in neural networks, the detection module serves as the ‘attention’ of this unified network: it tells the recognition module where to look. The recognition module then extracts the ROI from the shared feature maps and predicts the LP number.
Feature Extraction. RPnet extracts features from the input image with all the convolutional layers in the detection module. As the number of layers increases, the number of channels increases and the size of the feature map decreases progressively. Later feature maps contain higher-level features and thus are more beneficial for recognizing the LP and predicting its bounding box. Suppose the center point x-coordinate, the center point y-coordinate, the width, and the height of the bounding box are bx, by, bw, bh respectively. Let W and H be the width and the height of the input image. The bounding box location cx, cy, w, h satisfies:
cx = bx / W, cy = by / H, w = bw / W, h = bh / H, with 0 < cx, cy, w, h < 1.
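A minimal sketch of this box parameterisation follows. It assumes the normalised location is obtained by dividing the pixel-space centre, width and height by the image width W and height H, which is what the definitions of bx, by, bw, bh, W and H suggest. The function names and the example box are hypothetical.

```python
def normalize_box(bx, by, bw, bh, img_w, img_h):
    """Convert a pixel-space box (centre x, centre y, width, height)
    into the normalised location (cx, cy, w, h)."""
    return bx / img_w, by / img_h, bw / img_w, bh / img_h

def to_corners(cx, cy, w, h, img_w, img_h):
    """Convert a normalised location back to pixel-space corners
    (left, top, right, bottom)."""
    left = (cx - w / 2) * img_w
    top = (cy - h / 2) * img_h
    right = (cx + w / 2) * img_w
    bottom = (cy + h / 2) * img_h
    return left, top, right, bottom

# Hypothetical LP box on a 720 x 1160 image (the CCPD image resolution).
cx, cy, w, h = normalize_box(360, 580, 180, 58, img_w=720, img_h=1160)
corners = to_corners(cx, cy, w, h, img_w=720, img_h=1160)
print((cx, cy, w, h), corners)
```

Working in normalised coordinates lets the same (cx, cy, w, h) address the LP region in feature maps of any spatial size.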
Feature maps from different layers within a network are empirically known to have different receptive field sizes [42]. Moreover, previous works such as [43] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture finer details of the input objects. Similarly, feature maps from relatively lower layers also matter for recognizing LP characters because, just like the object borders in semantic segmentation, the area of the LP is expected to be very small relative to the entire image. After the detection module completes the computation of all convolutional layers, the box predictor outputs the bounding box location (cx, cy, w, h). For a feature layer of size m ∗ n with p channels, as shown in Figure 3, the recognition module extracts feature maps in the bounding box region of size (m ∗ h) ∗ (n ∗ w) with p channels. By default, RPnet extracts feature maps at the end of three low-level layers: the second, fourth and sixth convolutional layers. The sizes of the extracted feature maps are (122 ∗ h) ∗ (122 ∗ w) ∗ 64, (63 ∗ h) ∗ (63 ∗ w) ∗ 160 and (33 ∗ h) ∗ (33 ∗ w) ∗ 192. In practice, extracting feature maps from higher convolutional layers makes the recognition process slower and offers little help in improving the recognition accuracy. After these feature maps are extracted, RPnet exploits ROI Pooling layers to convert each extracted feature into a feature map with a fixed spatial extent of PH ∗ PW (e.g., 8 ∗ 16 in this paper). Afterwards, the three resized feature maps of size 8 ∗ 16 ∗ 64, 8 ∗ 16 ∗ 160 and 8 ∗ 16 ∗ 192 are concatenated into one feature map of size 8 ∗ 16 ∗ 416 for LP number classification.
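The ROI extraction and pooling step can be sketched in plain Python for a single channel. This is only an illustration under stated assumptions: the rounding of the box onto feature-map cells and the adaptive bin boundaries follow a common ROI max-pooling scheme, not necessarily the exact one used by RPnet, and all function names are hypothetical.

```python
import math

def roi_bounds(map_h, map_w, cx, cy, w, h):
    """Map a normalised box (cx, cy, w, h) onto feature-map cells.
    Returns half-open ranges (row0, row1, col0, col1), clamped to the map."""
    row0 = max(0, math.floor((cy - h / 2) * map_h))
    row1 = min(map_h, math.ceil((cy + h / 2) * map_h))
    col0 = max(0, math.floor((cx - w / 2) * map_w))
    col1 = min(map_w, math.ceil((cx + w / 2) * map_w))
    # Keep at least one cell in each direction.
    return row0, max(row1, row0 + 1), col0, max(col1, col0 + 1)

def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(channel, bounds, out_h=8, out_w=16):
    """Max-pool one channel (a 2D list) inside `bounds` to out_h x out_w.
    Bins overlap when the ROI has fewer cells than output positions."""
    row0, row1, col0, col1 = bounds
    roi_h, roi_w = row1 - row0, col1 - col0
    pooled = []
    for i in range(out_h):
        r_start = row0 + (i * roi_h) // out_h
        r_end = row0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c_start = col0 + (j * roi_w) // out_w
            c_end = col0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(channel[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled

# Toy 4 x 4 channel with values 0..15 and a centred box covering half of it.
channel = [[r * 4 + c for c in range(4)] for r in range(4)]
bounds = roi_bounds(4, 4, cx=0.5, cy=0.5, w=0.5, h=0.5)
pooled_2x2 = roi_max_pool(channel, bounds, out_h=2, out_w=2)
pooled_1x1 = roi_max_pool(channel, bounds, out_h=1, out_w=1)

# Concatenating the three pooled maps along the channel axis adds channel counts.
concat_channels = sum([64, 160, 192])
print(bounds, pooled_2x2, pooled_1x1, concat_channels)
```

Because every ROI is pooled to the same PH ∗ PW grid regardless of the layer it came from, the three maps can be stacked channel-wise into a single tensor.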
4.2 Training
RPnet can be trained end-to-end on CCPD and accomplishes LP bounding box detection and LP number recognition in a single forward pass. The training involves choosing suitable loss functions for detection performance and recognition performance, as well as pre-training the detection module before training RPnet end-to-end.
Training objective. The RPnet training objective can be divided into two parts: the localization loss (loc) and the classification loss (cls). Let N be the size of a mini-batch in training. The localization loss (see Equation (1)) is a Smooth L1 loss [28] between the predicted box (pb) and the ground truth box (gb). Let the seven ground-truth LP characters be gn_{i} (1 ≤ i ≤ 7). pn_{i} (1 ≤ i ≤ 7) denotes the predictions for the seven LP characters, and each LP character prediction pn_{i} contains nc_{i} floating-point numbers, each representing the likelihood of belonging to a specific character class. The classification loss (see Equation (2)) is a cross-entropy loss. With the joint optimization of both localization and classification losses, the extracted features carry richer information about LP characters.
Experiments show that both detection and recognition performance can be enhanced by jointly optimizing these two losses.
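The two loss terms can be sketched as follows. The Smooth L1 form is the standard one from [28], and the cross-entropy is computed from raw class scores. How the two terms are weighted and averaged over the mini-batch is an assumption here (a plain sum divided by N), since Equations (1) and (2) are not reproduced in this text. All function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Standard Smooth L1 penalty on a single difference x."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def localization_loss(pred_box, gt_box):
    """Sum of Smooth L1 penalties over (cx, cy, w, h)."""
    return sum(smooth_l1(p - g) for p, g in zip(pred_box, gt_box))

def cross_entropy(scores, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

def classification_loss(char_scores, gt_chars):
    """Sum of cross-entropy terms over the LP characters."""
    return sum(cross_entropy(s, t) for s, t in zip(char_scores, gt_chars))

def joint_loss(batch):
    """Average of (loc + cls) over a mini-batch.
    Each item is (pred_box, gt_box, char_scores, gt_chars).
    Equal weighting of the two terms is an assumption."""
    total = 0.0
    for pred_box, gt_box, char_scores, gt_chars in batch:
        total += localization_loss(pred_box, gt_box)
        total += classification_loss(char_scores, gt_chars)
    return total / len(batch)

# Toy sample: a perfect box and two characters with uninformative scores.
sample = ((0.5, 0.5, 0.25, 0.05), (0.5, 0.5, 0.25, 0.05),
          [[0.0, 0.0], [0.0, 0.0]], [0, 1])
loss = joint_loss([sample])
print(loss)
```

In the toy sample the box term is zero, so the whole loss comes from the two uniform character predictions, each contributing log 2.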
Pre-training detection module. Before training RPnet end-to-end, the detection module must provide a reasonable bounding box prediction (cx, cy, w, h). A reasonable prediction (cx, cy, w, h) must meet 0 < cx, cy, w, h < 1 and should ideally keep the whole box inside the image (w/2 ≤ cx ≤ 1 − w/2 and h/2 ≤ cy ≤ 1 − h/2), so that it represents a valid ROI and can guide the recognition module to extract feature maps. Unlike most object detection papers [29,31], which pre-train their convolutional layers on ImageNet [44] to make these layers more representative, we pre-train the detection module from scratch on CCPD, because the data volume of CCPD is large enough and, for locating a single object such as a license plate, parameters pre-trained on ImageNet are not necessarily better than training from scratch. In practice, the detection module always gives a reasonable bounding box prediction after being trained for 300 epochs on the training set.
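A small check for what counts as a reasonable prediction might look like the sketch below. The first condition, 0 < cx, cy, w, h < 1, comes from the text. The second condition, that the box lies entirely inside the image, is an assumption about what a valid ROI requires. Both function names are hypothetical.

```python
def in_unit_range(cx, cy, w, h):
    """Required condition: every component lies strictly between 0 and 1."""
    return all(0.0 < v < 1.0 for v in (cx, cy, w, h))

def lies_inside_image(cx, cy, w, h):
    """Assumed extra condition: the box does not cross the image border."""
    return (w / 2 <= cx <= 1 - w / 2) and (h / 2 <= cy <= 1 - h / 2)

def is_reasonable_prediction(cx, cy, w, h):
    """True when the prediction can be used directly as an ROI."""
    return in_unit_range(cx, cy, w, h) and lies_inside_image(cx, cy, w, h)

centred = is_reasonable_prediction(0.5, 0.5, 0.25, 0.05)
off_left_edge = is_reasonable_prediction(0.05, 0.5, 0.25, 0.05)
too_wide = is_reasonable_prediction(0.5, 0.5, 1.2, 0.05)
print(centred, off_left_edge, too_wide)
```

A check like this could be used to decide when pre-training has progressed far enough for the recognition module to receive usable ROIs.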
As mentioned in Section 3, CCPD-Base consists of approximately 200k unique images. We divide CCPD-Base into two equal parts: one as the default training set and the other as the default evaluation set. In addition, several sub-datasets (CCPD-DB, CCPD-FN, CCPD-Rotate, CCPD-Tilt, CCPD-Weather, CCPD-Challenge) in CCPD are also exploited for detection and recognition performance evaluation. Apart from the Cascade classifier [45], all models used in the experiments rely on GPU and are fine-tuned on the training set. For models without default data augmentation strategies, we augment the training data by randomly sampling four times on each image to increase the training set fivefold. More details are provided in the supplementary material. We did not reproduce our experiments on other datasets because most currently available LP datasets [13,14,15] are not as diverse as CCPD and their data volume is far smaller than that of CCPD. Thus, detection accuracy or recognition accuracy on other datasets might not be as convincing as on CCPD. Moreover, we did not implement approaches that do not rely on machine learning, like [8], because in practice, when evaluated on a large-scale dataset, methods based on machine learning always perform better.
5.2 Detection
Detection accuracy metric. We follow the standard protocol in object detection, Intersection-over-Union (IoU) [12]. The bounding box is considered correct if and only if its IoU with the ground-truth bounding box is more than 70% (IoU > 0.7). All models are fine-tuned on the same 100k training set. We set a higher IoU threshold in the detection accuracy metric than TE2E [12] because a higher threshold filters out imperfect bounding boxes and thus better evaluates the detection performance. The results are shown in Table 4. The Cascade classifier has difficulty in precisely locating LPs and thus performs badly under a high IoU threshold, and it is not robust when dealing with tilted LPs. Judging from its low detection accuracy of 77.3% on CCPD-FN, YOLO performs relatively poorly on relatively small or large objects. Benefiting from the joint optimization of detection and recognition, both RPnet and TE2E surpass Faster-RCNN and YOLO9000. However, RPnet can recognize twenty times faster than TE2E. Moreover, by analysing the bounding boxes predicted by SSD, we found that these boxes wrap around LPs very tightly. In fact, when the IoU threshold is set higher than 0.7, SSD achieves the highest accuracy. The reason might be that the detection loss is not the only training objective of RPnet. For example, a slightly imperfect bounding box (slightly smaller than the ground-truth one) might be beneficial for more correct LP recognition.
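The detection accuracy metric can be sketched as below. Boxes are assumed to be given as (x1, y1, x2, y2) corner tuples with one predicted box per image. The representation, the function names and the example boxes are hypothetical; only the IoU > 0.7 criterion comes from the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_accuracy(pred_boxes, gt_boxes, threshold=0.7):
    """Fraction of images whose predicted box has IoU > threshold."""
    correct = sum(iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return correct / len(gt_boxes)

gt = (0, 0, 10, 10)
exact = iou(gt, gt)
shifted = iou(gt, (5, 0, 15, 10))      # half-overlapping box
shrunk = iou(gt, (1, 1, 9, 9))         # 10% smaller on every side
accuracy = detection_accuracy([gt, (1, 1, 9, 9)], [gt, gt])
print(exact, shifted, shrunk, accuracy)
```

The shrunk box illustrates how strict the threshold is: trimming 10% from every side already drops the IoU to 0.64, below the 0.7 cut-off.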