1. Notes
1.1. Key ideas of the paper
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.
seamless: adj. having no seams; with no gaps between the parts
This allows the CTPN to explore rich context information of the image, making it powerful in detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin. The CTPN is computationally efficient with 0.14s/image, by using the very deep VGG16 model [27]. Online demo is available at: http://textdet.com/.
Keywords:
Scene text detection, convolutional network, recurrent neural network, anchor mechanism
Reading text in natural images has recently attracted increasing attention in computer vision [8,14,15,10,35,11,9,1,28,32]. This is due to its numerous practical applications such as image OCR, multi-language translation, image retrieval, etc. It includes two sub-tasks: text detection and recognition. This work focuses on the detection task [14,1,28,32], which is more challenging than the recognition task carried out on a well-cropped word image [15,9]. The large variance of text patterns and highly cluttered backgrounds pose the main challenges for accurate text localization.
clutter: n. an untidy mess of things (especially unwanted or useless ones); v. to fill or cover something untidily
Current approaches for text detection mostly employ a bottom-up pipeline [28,1,14,32,33]. They commonly start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification. These multi-step bottom-up approaches are generally complicated, with less robustness and reliability. Their performance relies heavily on the results of character detection, and connected-components methods or sliding-window methods have been proposed. These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background. However, they are not robust, since they identify individual strokes or characters separately, without using context information.
For example, people can identify a sequence of characters more confidently than an individual one, especially when a character is extremely ambiguous. These limitations often result in a large number of non-text components in character detection, causing main difficulties for handling them in the following steps. Furthermore, these false detections are easily accumulated sequentially in a bottom-up pipeline, as pointed out in [28]. To address these problems, we exploit strong deep features for detecting text information directly in convolutional maps. We develop a text anchor mechanism that accurately predicts text locations at a fine scale. Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.
Deep Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. The state-of-the-art method is the Faster Region-CNN (R-CNN) system [25], where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps. Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection. However, it is difficult to apply these general object detection systems directly to scene text detection, which generally requires a higher localization accuracy.
In generic object detection, each object has a well-defined closed boundary [2], while such a well-defined boundary may not exist in text, since a text line or word is composed of a number of separate characters or strokes. For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from a major part of it. By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers the full region of a text line or word. Therefore, text detection generally requires a more accurate localization, leading to a different evaluation standard, e.g., the Wolf's standard [30] which is commonly employed by text benchmarks [19,21].
In this work, we fill this gap by extending the RPN architecture [25] to accurate text line localization. We present several technical developments that tailor a generic object detection model elegantly towards our problem. We strive for a further step by proposing an in-network recurrent mechanism that allows our model to detect text sequences directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.
We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers. This overcomes a number of main limitations raised by previous bottom-up approaches built on character detection. We leverage the advantages of strong deep convolutional features and the computation-sharing mechanism, and propose the CTPN architecture described in Fig. 1. It makes the following major contributions:
First, we cast the problem of text detection into localizing a sequence of fine-scale text proposals. We develop an anchor regression mechanism that jointly predicts the vertical location and text/non-text score of each text proposal, resulting in excellent localization accuracy. This departs from the RPN prediction of a whole object, which has difficulty providing a satisfactory localization accuracy.
Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps. This connection allows our detector to explore meaningful context information of text line, making it powerful to detect extremely challenging text reliably.
Third, both methods are integrated seamlessly to meet the sequential nature of text, resulting in a unified end-to-end trainable model. Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post filtering or refinement.
Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015). Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].
Text detection.
Past works in scene text detection have been dominated by bottom-up approaches which are generally built on stroke or character detection. They can be roughly grouped into two categories: connected-components (CCs) based approaches and sliding-window based methods. The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3].
The sliding-window based methods detect character candidates by densely moving a multi-scale window through an image. The character or non-character window is discriminated by a pre-trained classifier, by using manually-designed features [28,29], or recent CNN features [16]. However, both groups of methods commonly suffer from the poor performance of character detection, causing accumulated errors in the following component filtering and text line construction steps. Furthermore, robustly filtering out non-character components or confidently verifying detected text lines are difficult tasks in themselves [1,33,14]. Another limitation is that the sliding-window methods are computationally expensive, by running a classifier on a huge number of sliding windows.
Object detection.
Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6]. A common strategy is to generate a number of object proposals by employing inexpensive low-level features, and then a strong CNN classifier is applied to further classify and refine the generated proposals. Selective Search (SS) [4], which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].
Recently, Ren et al. [25] proposed a Faster R-CNN system for object detection. They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps. The RPN is fast by sharing convolutional computation. However, the RPN proposals are not discriminative, and require further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5]. More importantly, text differs significantly from general objects, making it difficult to directly apply a general object detection system to this highly domain-specific task.
This section presents details of the Connectionist Text Proposal Network (CTPN). It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.
Similar to Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size. It detects a text line by densely sliding a small window in the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel width) text proposals, as shown in Fig. 1 (b).
We take the very deep 16-layer vggNet (VGG16) [27] as an example to describe our approach, which is readily applicable to other deep models. The architecture of the CTPN is presented in Fig. 1 (a). We use a small spatial window, 3×3, to slide over the feature maps of the last convolutional layer (e.g., the conv5 of the VGG16). The size of the conv5 feature maps is determined by the size of the input image, while the total stride and receptive field are fixed as 16 and 228 pixels, respectively; both are fixed by the network architecture. Using a sliding window in the convolutional layer allows it to share convolutional computation, which is the key to reducing the computation of costly sliding-window based methods.
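The stride and receptive-field numbers quoted above can be checked with a minimal Python sketch that walks through the standard VGG16 configuration up to conv5 and then appends the 3×3 sliding window; the layer list and helper names are assumptions for illustration, not part of the paper.

```python
def stride_and_receptive_field(layers):
    rf, stride = 1, 1
    for kernel, step in layers:          # each layer as (kernel_size, stride)
        rf += (kernel - 1) * stride      # grow the receptive field
        stride *= step                   # accumulate the total stride
    return stride, rf

VGG16_TO_CONV5 = (
    [(3, 1)] * 2 + [(2, 2)] +            # conv1_1, conv1_2, pool1
    [(3, 1)] * 2 + [(2, 2)] +            # conv2_x, pool2
    [(3, 1)] * 3 + [(2, 2)] +            # conv3_x, pool3
    [(3, 1)] * 3 + [(2, 2)] +            # conv4_x, pool4
    [(3, 1)] * 3                         # conv5_1 .. conv5_3
)

if __name__ == "__main__":
    # Append the 3x3 sliding window used by the detector on conv5.
    print(stride_and_receptive_field(VGG16_TO_CONV5 + [(3, 1)]))   # -> (16, 228)
```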
Generally, sliding-window methods adopt multi-scale windows to detect objects of different sizes, where one window scale is fixed to objects of similar size. In [25], Ren et al. proposed an efficient anchor regression mechanism that allows the RPN to detect multi-scale objects with a single-scale window. The key insight is that a single window is able to predict objects in a wide range of scales and aspect ratios, by using a number of flexible anchors. We wish to extend this efficient anchor mechanism to our text task.
However, text differs from generic objects substantially, which generally have a well-defined enclosed boundary and center, allowing whole objects to be inferred from even a part of them [2]. Text is a sequence which does not have an obvious closed boundary. It may include multi-level components, such as stroke, character, word, text line and text region, which are not distinguished clearly from each other. Text detection is defined at the word or text line level, so that it may be easy to make an incorrect detection by defining it as a single object, e.g., detecting part of a word. Therefore, directly predicting the location of a text line or word may be difficult or unreliable, making it hard to get a satisfactory accuracy. An example is shown in Fig. 2, where the RPN is directly trained for localizing text lines in an image.
We look for a unique property of text that is able to generalize well to text components at all levels. We observed that it is difficult for RPN word detection to accurately predict the horizontal sides of words, since each character within a word is isolated or separated, making it confusing to find the start and end locations of a word.
Obviously, a text line is a sequence which is the main difference between text and generic objects. It is natural to consider a text line as a sequence of fine-scale text proposals, where each proposal generally represents a small part of a text line, e.g., a text piece with 16-pixel width. Each proposal may include a single or multiple strokes, a part of a character, a single or multiple characters, etc.
We believe that it would be more accurate to just predict the vertical location of each proposal, by fixing its horizontal location, which may be more difficult to predict. This reduces the search space, compared to the RPN which predicts 4 coordinates of an object. We develop a vertical anchor mechanism that simultaneously predicts a text/non-text score and the y-axis location of each fine-scale proposal. It is also more reliable to detect a general fixed-width text proposal than to identify an isolated character, which is easily confused with part of a character or multiple characters. Furthermore, detecting a text line in a sequence of fixed-width text proposals also works reliably on text of multiple scales and multiple aspect ratios.
To this end, we design the fine-scale text proposal as follows. Our detector investigates each spatial location in the conv5 densely. A text proposal is defined to have a fixed width of 16 pixels (in the input image). This is equivalent to moving the detector densely through the conv5 maps, where the total stride is exactly 16 pixels. Then we design k vertical anchors to predict y-coordinates for each proposal. The k anchors share the same horizontal location and fixed width of 16 pixels, but their vertical locations vary over k different heights. In our experiments, we use ten anchors for each proposal, k = 10, whose heights vary from 11 to 273 pixels (by ÷0.7 each time) in the input image. The explicit vertical coordinates are measured by the height and y-axis center of a proposal bounding box. We compute relative predicted vertical coordinates (v) with respect to the bounding box location of an anchor as,
$$v_c = (c_y - c_y^a)/h^a, \quad v_h = \log(h/h^a) \qquad (1)$$
$$v_c^* = (c_y^* - c_y^a)/h^a, \quad v_h^* = \log(h^*/h^a) \qquad (2)$$
where $v = \{v_c, v_h\}$ and $v^* = \{v_c^*, v_h^*\}$ are the relative predicted coordinates and ground truth coordinates, respectively. $c_y^a$ and $h^a$ are the center (y-axis) and height of the anchor box, which can be pre-computed from an input image. $c_y$ and $h$ are the predicted y-axis coordinates in the input image, while $c_y^*$ and $h^*$ are the ground truth coordinates. Therefore, each predicted text proposal has a bounding box with a size of $h \times 16$ (in the input image), as shown in Fig. 1 (b) and Fig. 2 (right). Generally, a text proposal is largely smaller than its effective receptive field, which is 228×228.
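As an illustration of Eq. (1)-(2), the minimal NumPy sketch below generates the k = 10 anchor heights (11 to about 273 pixels, dividing by 0.7 each step) and encodes/decodes the vertical coordinates; the helper names are assumptions, not the released implementation.

```python
import numpy as np

def anchor_heights(k=10, h_min=11.0, ratio=0.7):
    """k anchor heights: 11, 11/0.7, 11/0.7^2, ... (about 273 px for k = 10)."""
    return np.array([h_min / (ratio ** i) for i in range(k)])

def encode_vertical(cy, h, cy_a, h_a):
    """Eq. (1)/(2): relative vertical targets of a box w.r.t. an anchor."""
    return (cy - cy_a) / h_a, np.log(h / h_a)

def decode_vertical(v_c, v_h, cy_a, h_a):
    """Invert Eq. (1): recover the absolute center-y and height."""
    return v_c * h_a + cy_a, np.exp(v_h) * h_a

if __name__ == "__main__":
    heights = anchor_heights()
    print(np.round(heights, 1))                      # 11.0, 15.7, ..., ~272.6
    v = encode_vertical(cy=120.0, h=48.0, cy_a=100.0, h_a=heights[3])
    print(decode_vertical(*v, cy_a=100.0, h_a=heights[3]))   # -> (120.0, 48.0)
```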
The detection processing is summarised as follows. Given an input image, we have W × H × C conv5 feature maps (by using the VGG16 model), where C is the number of feature maps or channels, and W × H is the spatial arrangement. When our detector slides a 3×3 window densely through the conv5, each sliding window takes a convolutional feature of 3 × 3 × C for producing the prediction. For each prediction, the horizontal location (x-coordinates) and the k anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image. Our detector outputs the text/non-text scores and the predicted y-coordinates (v) for the k anchors at each window location. The detected text proposals are generated from the anchors having a text/non-text score of > 0.7 (with non-maximum suppression). By the designed vertical anchor and fine-scale detection strategy, our detector is able to handle text lines in a wide range of scales and aspect ratios by using a single-scale image. This further reduces its computation, while at the same time predicting accurate localizations of the text lines. Compared to the RPN or Faster R-CNN system [25], our fine-scale detection provides more detailed supervised information that naturally leads to a more accurate detection.
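The detection step described above can be sketched as follows; the (H, W, k)-shaped score/coordinate maps, the anchor centers placed at stride-16 grid points and the helper names are illustrative assumptions (non-maximum suppression is omitted).

```python
import numpy as np

STRIDE = 16      # total stride of conv5
WIDTH = 16       # fixed proposal width in the input image

def generate_proposals(scores, v_pred, heights, score_thresh=0.7):
    """scores: (H, W, k) text probabilities; v_pred: (H, W, k, 2) = (v_c, v_h)."""
    H, W, k = scores.shape
    proposals = []
    for i in range(H):
        for j in range(W):
            cx_a = j * STRIDE + STRIDE // 2          # anchor center in the image
            cy_a = i * STRIDE + STRIDE // 2
            for a in range(k):
                if scores[i, j, a] <= score_thresh:
                    continue
                v_c, v_h = v_pred[i, j, a]
                cy = v_c * heights[a] + cy_a         # invert Eq. (1)
                h = np.exp(v_h) * heights[a]
                proposals.append((cx_a - WIDTH / 2, cy - h / 2,
                                  cx_a + WIDTH / 2, cy + h / 2,
                                  float(scores[i, j, a])))
    return proposals    # non-maximum suppression would be applied afterwards
```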
To improve localization accuracy, we split a text line into a sequence of fine-scale text proposals, and predict each of them separately. Obviously, it is not robust to regard each isolated proposal independently. This may lead to a number of false detections on non-text objects which have a similar structure to text patterns, such as windows, bricks, leaves, etc. (referred to as text-like outliers in [13]). It is also possible to discard some ambiguous patterns which contain weak text information. Several examples are presented in Fig. 3 (top). Text has strong sequential characteristics, and the sequential context information is crucial for making a reliable decision. This has been verified by recent work [9] where a recurrent neural network (RNN) is applied to encode this context information for text recognition. Their results have shown that the sequential context information greatly facilitates the recognition task on cropped word images.
Motivated by this work, we believe that this context information may also be of importance for our detection task. Our detector should be able to explore this important context information to make a more reliable decision, when it works on each individual proposal. Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals. RNNs provide a natural choice for encoding this information recurrently using their hidden layers. To this end, we propose to design an RNN layer upon the conv5, which takes the convolutional feature of each window as sequential inputs, and updates its internal state recurrently in the hidden layer, $H_t$,
$$H_t = \varphi(H_{t-1}, X_t), \quad t = 1, 2, \ldots, W \qquad (3)$$
where $X_t \in R^{3 \times 3 \times C}$ is the input conv5 feature from the t-th sliding window (3×3). The sliding window moves densely from left to right, resulting in $t = 1, 2, ..., W$ sequential features for each row. $W$ is the width of the conv5. $H_t$ is a recurrent internal state that is computed jointly from both the current input ($X_t$) and the previous states encoded in $H_{t-1}$. The recurrence is computed by using a non-linear function $\varphi$, which defines the exact form of the recurrent model. We exploit the long short-term memory (LSTM) architecture [12] for our RNN layer. The LSTM was proposed specially to address the vanishing gradient problem, by introducing three additional multiplicative gates: the input gate, forget gate and output gate. Details can be found in [12]. Hence the internal state in the RNN hidden layer accesses the sequential context information scanned by all previous windows through the recurrent connection. We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., 228 × width. We use a 128D hidden layer for each LSTM, resulting in a 256D RNN hidden layer, $H_t \in R^{256}$.
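A minimal PyTorch sketch of this in-network recurrent layer (Eq. (3)): each row of conv5 is unrolled into a sequence of 3×3×C window features and fed to a bidirectional LSTM with a 128-d hidden state per direction. The module and variable names are assumptions for illustration, not the released (Caffe) implementation.

```python
import torch
import torch.nn as nn

class RecurrentTextLayer(nn.Module):
    def __init__(self, in_channels=512, hidden=128):
        super().__init__()
        # Gather the 3x3xC feature of every sliding-window position (X_t).
        self.unfold = nn.Unfold(kernel_size=3, padding=1)
        self.rnn = nn.LSTM(9 * in_channels, hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, conv5):                 # conv5: (N, C, H, W)
        n, c, h, w = conv5.shape
        x = self.unfold(conv5)                # (N, 9C, H*W)
        x = x.view(n, 9 * c, h, w).permute(0, 2, 3, 1)   # (N, H, W, 9C)
        x = x.reshape(n * h, w, 9 * c)        # each row is a sequence of length W
        x, _ = self.rnn(x)                    # H_t for t = 1..W, 256-d
        return x.reshape(n, h, w, 2 * self.rnn.hidden_size)

if __name__ == "__main__":
    feats = torch.randn(1, 512, 14, 38)       # a fake conv5 map
    print(RecurrentTextLayer()(feats).shape)  # torch.Size([1, 14, 38, 256])
```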
The internal state in $H_t$ is mapped to the following FC layer, and the output layer for computing the predictions of the t-th proposal. Therefore, our integration with the RNN layer is elegant, resulting in an efficient model that is end-to-end trainable without additional cost. The efficiency of the RNN connection is demonstrated in Fig. 3. Obviously, it reduces false detections considerably, and at the same time, recovers many missed text proposals which contain very weak text information.
The fine-scale text proposals are detected accurately and reliably by our CTPN. Text line construction is straightforward, by connecting continuous text proposals whose text/non-text score is > 0.7. Text lines are constructed as follows. First, we define a paired neighbour ($B_j$) for a proposal $B_i$ as $B_j \rightarrow B_i$, when (i) $B_j$ is the nearest horizontal distance to $B_i$, and (ii) this distance is less than 50 pixels, and (iii) their vertical overlap is > 0.7. Second, two proposals are grouped into a pair, if $B_j \rightarrow B_i$ and $B_i \rightarrow B_j$. Then a text line is constructed by sequentially connecting the pairs that share a common proposal.
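One way to implement this grouping rule is sketched below, interpreting the paired neighbour directionally (searching to the right of $B_i$ and to the left of $B_j$) and chaining the mutual pairs that share a proposal; the helper names and the use of the left box edge as the horizontal reference are assumptions, not the authors' exact procedure.

```python
def vertical_overlap(a, b):
    """Overlap ratio of the y-intervals of two boxes (x1, y1, x2, y2)."""
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = max(a[3], b[3]) - min(a[1], b[1])
    return inter / union if union > 0 else 0.0

def nearest(i, boxes, direction, max_gap=50, min_overlap=0.7):
    """Nearest neighbour of boxes[i] to the right (+1) or left (-1), or None."""
    best, best_gap = None, max_gap
    for j, b in enumerate(boxes):
        gap = (b[0] - boxes[i][0]) * direction   # signed horizontal distance
        if j == i or gap <= 0 or gap >= best_gap:
            continue
        if vertical_overlap(boxes[i], b) > min_overlap:
            best, best_gap = j, gap
    return best

def build_text_lines(boxes):
    """Chain mutually paired proposals into text lines (lists of box indices)."""
    n = len(boxes)
    parent = list(range(n))
    def find(i):                                 # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        j = nearest(i, boxes, +1)                # candidate pair B_j -> B_i
        if j is not None and nearest(j, boxes, -1) == i:   # and B_i -> B_j
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```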
The fine-scale detection and RNN connection are able to predict accurate localizations in the vertical direction. In the horizontal direction, the image is divided into a sequence of equal 16-pixel width proposals. This may lead to an inaccurate localization when the text proposals at both horizontal sides are not exactly covered by a ground truth text line area, or some side proposals are discarded (e.g., having a low text score), as shown in Fig. 4. This inaccuracy may not be crucial in generic object detection, but should not be ignored in text detection, particularly for small-scale text lines or words. To address this problem, we propose a side-refinement approach that accurately estimates the offset for each anchor/proposal on both the left and right horizontal sides (referred to as a side-anchor or side-proposal). Similar to the y-coordinate prediction, we compute the relative offset as,
$$o = (x_{side} - c_x^a)/w^a, \quad o^* = (x_{side}^* - c_x^a)/w^a \qquad (4)$$
where $x_{side}$ is the predicted $x$-coordinate of the nearest horizontal side (e.g., left or right side) to the current anchor. $x_{side}^*$ is the ground truth (GT) side coordinate in the x-axis, which is pre-computed from the GT bounding box and the anchor location. $c_x^a$ is the center of the anchor in the $x$-axis. $w^a$ is the width of the anchor, which is fixed, $w^a = 16$. The side-proposals are defined as the start and end proposals when we connect a sequence of detected fine-scale text proposals into a text line. We only use the offsets of the side-proposals to refine the final text line bounding box. Several detection examples improved by side-refinement are presented in Fig. 4. The side-refinement further improves the localization accuracy, leading to about 2% performance improvements on the SWT and Multi-Lingual datasets. Notice that the offset for side-refinement is predicted simultaneously by our model, as shown in Fig. 1. It is not computed from an additional post-processing step.
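A small sketch of Eq. (4), encoding and decoding the horizontal side offset and using only the two side proposals to refine a text-line box; the names are illustrative, with $w^a$ fixed to 16 pixels.

```python
W_A = 16.0    # fixed anchor width w^a

def encode_side(x_side, cx_a):
    return (x_side - cx_a) / W_A               # o (or o*) in Eq. (4)

def decode_side(o, cx_a):
    return o * W_A + cx_a                      # predicted x_side

def refine_line(line_box, o_left, cx_a_left, o_right, cx_a_right):
    """Replace the left/right edges of a text-line box using the side offsets."""
    _, y1, _, y2 = line_box
    return (decode_side(o_left, cx_a_left), y1,
            decode_side(o_right, cx_a_right), y2)
```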
The proposed CTPN has three outputs which are jointly connected to the last FC layer, as shown in Fig. 1 (a). The three outputs simultaneously predict the text/non-text scores (s), vertical coordinates ($v = \{v_c, v_h\}$ in Eq. (2)) and side-refinement offsets (o). We explore $k$ anchors to predict them on each spatial location in the conv5, resulting in $2k$, $2k$ and $k$ parameters in the output layer, respectively.
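A possible shape of these prediction heads is sketched below: a shared FC layer on top of the 256-d recurrent state feeding 2k scores, 2k vertical coordinates and k side offsets per spatial location. The 512-d FC width and the layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CTPNHeads(nn.Module):
    def __init__(self, in_dim=256, fc_dim=512, k=10):
        super().__init__()
        self.fc = nn.Linear(in_dim, fc_dim)      # fc_dim is an assumed width
        self.score = nn.Linear(fc_dim, 2 * k)    # text / non-text per anchor
        self.vert = nn.Linear(fc_dim, 2 * k)     # (v_c, v_h) per anchor
        self.side = nn.Linear(fc_dim, k)         # side-refinement offset o

    def forward(self, h_t):                      # h_t: (N, H, W, 256)
        x = torch.relu(self.fc(h_t))
        return self.score(x), self.vert(x), self.side(x)
```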
We employ multi-task learning to jointly optimize the model parameters. We introduce three loss functions, $L_s^{cl}$, $L_v^{re}$ and $L_o^{re}$, which compute the errors of the text/non-text score, the coordinates and the side-refinement, respectively. With these considerations, we follow the multi-task loss applied in [5,25], and minimize an overall objective function ($L$) for an image as,
$$L(s_i, v_j, o_k) = \frac{1}{N_s}\sum_i L_s^{cl}(s_i, s_i^*) + \frac{\lambda_1}{N_v}\sum_j L_v^{re}(v_j, v_j^*) + \frac{\lambda_2}{N_o}\sum_k L_o^{re}(o_k, o_k^*) \qquad (5)$$
where each anchor is a training sample, and $i$ is the index of an anchor in a mini-batch. $s_i$ is the predicted probability of anchor $i$ being a true text. $s_i^* = \{0, 1\}$ is the ground truth. $j$ is the index of an anchor in the set of valid anchors for y-coordinate regression, which are defined as follows. A valid anchor is a defined positive anchor ($s_j^* = 1$, described below), or has an Intersection-over-Union (IoU) > 0.5 overlap with a ground truth text proposal. $v_j$ and $v_j^*$ are the prediction and ground truth $y$-coordinates associated with the $j$-th anchor. $k$ is the index of a side-anchor, which is defined as a set of anchors within a horizontal distance (e.g., 32 pixels) to the left or right side of a ground truth text line bounding box. $o_k$ and $o_k^*$ are the predicted and ground truth offsets in the $x$-axis associated with the $k$-th anchor. $L_s^{cl}$ is the classification loss, for which we use the Softmax loss to distinguish text and non-text. $L_v^{re}$ and $L_o^{re}$ are the regression losses. We follow previous work by using the smooth L1 function to compute them [5,25]. $\lambda_1$ and $\lambda_2$ are loss weights to balance the different tasks, which are empirically set to 1.0 and 2.0. $N_s$, $N_v$ and $N_o$ are normalization parameters, denoting the total number of anchors used by $L_s^{cl}$, $L_v^{re}$ and $L_o^{re}$, respectively.
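A hedged PyTorch sketch of Eq. (5): softmax cross-entropy for the text/non-text scores and smooth-L1 for the vertical coordinates and side offsets, weighted by $\lambda_1 = 1.0$ and $\lambda_2 = 2.0$. The tensor shapes are assumptions, and the default mean reduction only approximates the $1/N_v$ and $1/N_o$ normalisation.

```python
import torch
import torch.nn.functional as F

def ctpn_loss(scores, s_star, v, v_star, o, o_star, lam1=1.0, lam2=2.0):
    """
    scores: (Ns, 2) logits for sampled anchors, s_star: (Ns,) labels in {0, 1}
    v:      (Nv, 2) predictions for valid anchors, v_star: (Nv, 2) targets
    o:      (No,)  offsets for side anchors,       o_star: (No,)  targets
    """
    l_cls = F.cross_entropy(scores, s_star)      # softmax loss, averaged over Ns
    l_v = F.smooth_l1_loss(v, v_star)            # smooth-L1 regression loss
    l_o = F.smooth_l1_loss(o, o_star)
    return l_cls + lam1 * l_v + lam2 * l_o
```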
The CTPN can be trained end-to-end by using standard back-propagation and stochastic gradient descent (SGD). Similar to RPN [25], the training samples are the anchors, whose locations can be pre-computed in the input image, so that the training labels of each anchor can be computed from the corresponding GT box.
Training labels.
For text/non-text classification, a binary label is assigned to each positive (text) or negative (non-text) anchor. It is defined by computing the IoU overlap with the GT bounding box (divided by anchor location). A positive anchor is defined as: (i) an anchor that has an > 0.7 IoU overlap with any GT box; or (ii) the anchor with the highest IoU overlap with a GT box. By condition (ii), even a very small text pattern can be assigned a positive anchor. This is crucial for detecting small-scale text patterns, which is one of the key advantages of the CTPN. This is different from generic object detection, where the impact of condition (ii) may not be significant. The negative anchors are defined as having < 0.5 IoU overlap with all GT boxes. The training labels for the y-coordinate regression ($v^*$) and offset regression ($o^*$) are computed as Eq. (2) and (4), respectively.
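The label assignment above can be sketched as follows; anchors that are neither positive nor negative are simply left unused here (an assumption), and the helper names are illustrative.

```python
import numpy as np

def iou_matrix(anchors, gts):
    """anchors: (A, 4), gts: (G, 4) as (x1, y1, x2, y2); returns (A, G) IoUs."""
    ious = np.zeros((len(anchors), len(gts)))
    for i, a in enumerate(anchors):
        for j, g in enumerate(gts):
            iw = max(0.0, min(a[2], g[2]) - max(a[0], g[0]))
            ih = max(0.0, min(a[3], g[3]) - max(a[1], g[1]))
            inter = iw * ih
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (g[2] - g[0]) * (g[3] - g[1]) - inter)
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious

def label_anchors(anchors, gts, pos_thresh=0.7, neg_thresh=0.5):
    ious = iou_matrix(anchors, gts)
    labels = -np.ones(len(anchors), dtype=int)    # -1 = unused
    labels[ious.max(axis=1) < neg_thresh] = 0     # negatives
    labels[ious.max(axis=1) > pos_thresh] = 1     # condition (i)
    labels[ious.argmax(axis=0)] = 1               # condition (ii): best per GT box
    return labels
```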
Training data.
In the training process, the samples of each mini-batch are collected randomly from a single image. The number of anchors for each mini-batch is fixed to $N_s = 128$, with a 1:1 ratio of positive and negative samples. A mini-batch is padded with negative samples if the number of positive ones is fewer than 64. Our model was trained on 3,000 natural images, including 229 images from the ICDAR 2013 training set. We collected the other images ourselves and manually labelled them with text line bounding boxes. None of the self-collected training images overlap with any test image in all benchmarks. The input image is resized by setting its short side to 600 for training, while keeping its original aspect ratio.
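The sampling rule can be sketched as below (128 anchors per image, a 1:1 positive/negative ratio, padded with negatives when positives are scarce); the function name and label convention follow the labeling sketch above and are assumptions.

```python
import numpy as np

def sample_minibatch(labels, batch_size=128, rng=np.random.default_rng(0)):
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)             # at most 64 positives
    pos = rng.choice(pos, size=n_pos, replace=False)
    n_neg = min(len(neg), batch_size - n_pos)          # pad with negatives
    neg = rng.choice(neg, size=n_neg, replace=False)
    return np.concatenate([pos, neg])
```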
Implementation Details.
We follow the standard practice, and explore the very deep VGG16 model [27] pre-trained on the ImageNet data [26]. We initialize the new layers (e.g., the RNN and output layers) by using random weights drawn from a Gaussian distribution with zero mean and 0.01 standard deviation. The model was trained end-to-end by fixing the parameters in the first two convolutional layers. We used 0.9 momentum and 0.0005 weight decay. The learning rate was set to 0.001 in the first 16K iterations, followed by another 4K iterations with a 0.0001 learning rate. Our model was implemented in the Caffe framework [17].
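For reference, the schedule above would look roughly like this in PyTorch (the paper's implementation is in Caffe, so `model`, the frozen layer prefixes and the scheduler choice are assumptions):

```python
import torch

def configure_training(model, frozen_prefixes=("conv1", "conv2")):
    for name, p in model.named_parameters():
        if name.startswith(frozen_prefixes):      # fix the first two conv blocks
            p.requires_grad = False
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    # 0.001 for the first 16K iterations, then 0.0001 for another 4K.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[16000], gamma=0.1)
    return optimizer, scheduler    # call scheduler.step() once per iteration
```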
We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual dataset [24]. In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or the in-network recurrent connection. The ICDAR 2013 is used for this component evaluation.
The ICDAR 2011 dataset [21] consists of 229 training images and 255 testing ones, where the images are labelled at word level. The ICDAR 2013 [19] is similar to the ICDAR 2011, and has in total 462 images, including 229 images and 233 images for training and testing, respectively. The ICDAR 2015 (Incidental Scene Text - Challenge 4) [18] includes 1,500 images which were collected by using Google Glass. The training set has 1,000 images, and the remaining 500 images are used for testing. This dataset is more challenging than previous ones by including text with arbitrary orientations, very small scales and low resolution. The Multilingual scene text dataset is collected by [24]. It contains 248 images for training and 239 for testing. The images include multi-language text, and the ground truth is labelled at text line level. Epshtein et al. [3] introduced the SWT dataset containing 307 images, which include many extremely small-scale texts.
We follow previous work by using standard evaluation protocols which are provided by the dataset creators or competition organizers. For the ICDAR 2011 we use the standard protocol proposed by [30], the evaluation on the ICDAR 2013 follows the standard in [19]. For the ICDAR 2015, we used the online evaluation system provided by the organizers as in [18]. The evaluations on the SWT and Multilingual datasets follow the protocols defined in [3] and [24] respectively.
We first discuss our fine-scale detection strategy against the RPN and Faster R-CNN system [25]. As can be found in Table 1 (left), the RPN alone has difficulty performing accurate text localization, as it generates a large amount of false detections (low precision). By refining the RPN proposals with a Fast R-CNN detection model [5], the Faster R-CNN system improves localization accuracy considerably, with an F-measure of 0.75. One observation is that the Faster R-CNN also increases the recall of the original RPN. This may benefit from the joint bounding box regression mechanism of the Fast R-CNN, which improves the accuracy of a predicted bounding box. The RPN proposals may roughly localize a major part of a text line or word, but they are not accurate enough by the ICDAR 2013 standard. Obviously, the proposed fine-scale text proposal network (FTPN) improves the Faster R-CNN remarkably in both precision and recall, suggesting that the FTPN is more accurate and reliable, by predicting a sequence of fine-scale text proposals rather than a whole text line.
We discuss impact of recurrent connection on our CTPN. As shown in Fig. 3, the context information is greatly helpful to reduce false detections, such as textlike outliers. It is of great importance for recovering highly ambiguous text (e.g., extremely small-scale ones), which is one of main advantages of our CTPN, as demonstrated in Fig. 6. These appealing properties result in a significant performance boost. As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially from a F-measure of 0.80 to 0.88.
Running time.
The running time of our CTPN (for the whole detection processing) is about 0.14s per image with a fixed short side of 600, by using a single GPU. The CTPN without the RNN connection takes about 0.13s/image GPU time. Therefore, the proposed in-network recurrent mechanism increases model computation only marginally, while a considerable performance gain is obtained.
Our detection results on several challenging images are presented in Fig. 5. As can be found, the CTPN works perfectly on these challenging cases, some of which are difficult for many previous methods. It is able to handle multi-scale and multi-language efficiently (e.g., Chinese and Korean).
The full evaluation was conducted on five benchmarks. Image resolution varies significantly across the different datasets. We set the short side of images to 2000 for the SWT and ICDAR 2015, and 600 for the other three. We compare our performance against recently published results in [1,28,34]. As shown in Tables 1 and 2, our CTPN achieves the best performance on all five datasets. On the SWT, our improvements are significant on both recall and F-measure, with a marginal gain on precision. Our detector performs favourably against TextFlow on the Multilingual dataset, suggesting that our method generalizes well to various languages. On the ICDAR 2013, it outperforms the recent TextFlow [28] and FASText [1] remarkably, by improving the F-measure from 0.80 to 0.88. The gains are considerable in both precision and recall, with more than +5% and +7% improvements, respectively. In addition, we further compare our method against [8,11,35], which were published after our initial submission. It consistently obtains substantial improvements on F-measure and recall. This may be due to the strong capability of the CTPN for detecting extremely challenging text, e.g., very small-scale text, some of which is even difficult for humans. As shown in Fig. 6, those challenging cases are detected correctly by our detector, but some of them are even missed by the GT labelling, which may reduce our precision in the evaluation.
We further investigate the running time of various methods, as compared in Table 2. FASText [1] achieves 0.15s/image CPU time. Our method is slightly faster, at 0.14s/image, but in GPU time. Though it is not fair to compare them directly, GPU computation has become mainstream with the recent great success of deep learning approaches on object detection [25,5,6]. Regardless of running time, our method outperforms FASText substantially with an 11% improvement on F-measure. Our time can be reduced by using a smaller image scale. By using the scale of 450, it is reduced to 0.09s/image, while obtaining P/R/F of 0.92/0.77/0.84 on the ICDAR 2013, which compares competitively against Gupta et al.'s approach [8] using 0.07s/image with a GPU.
We have presented a Connectionist Text Proposal Network (CTPN) - an efficient text detector that is end-to-end trainable. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional maps. We develop a vertical anchor mechanism that jointly predicts a precise location and text/non-text score for each proposal, which is the key to realizing accurate localization of text. We propose an in-network RNN layer that connects sequential text proposals elegantly, allowing it to explore meaningful context information. These key technical developments result in a powerful ability to detect highly challenging text, with fewer false detections. The CTPN is efficient, achieving new state-of-the-art performance on five benchmarks, with a 0.14s/image running time.