(ASSTD) Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation

Abstract

Scene text detection attracts much attention in computer vision, because it can be widely used in many applications such as real-time text translation, automatic information entry, blind person assistance, robot sensing and so on. Though many methods have been proposed for horizontal and oriented texts, detecting irregular shape texts such as curved texts is still a challenging problem. To solve the problem, we propose a robust scene text detection method with adaptive text region representation. Given an input image, a text region proposal network is first used for extracting text proposals. Then, these proposals are verified and refined with a refinement network. Here, recurrent neural network based adaptive text region representation is proposed for text region refinement, where a pair of boundary points are predicted at each time step until no new points are found. In this way, text regions of arbitrary shapes are detected and represented with an adaptive number of boundary points, which gives a more accurate description of text regions. Experimental results on five benchmarks, namely CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500, show that the proposed method achieves state-of-the-art performance in scene text detection.

  • First, an RPN extracts candidate text regions
  • A refinement network then verifies and refines these candidates so they become more precise
  • Boundary points are found adaptively

1. Introduction

Text is the most fundamental medium for communicating semantic information. It appears everywhere in daily life: on street nameplates, store signs, product packages, restaurant menus and so on. Such texts in natural environments are known as scene texts. Automatically detecting and recognizing scene texts can be very rewarding, with numerous applications such as real-time text translation, blind person assistance, shopping, robots, smart cars and education. An end-to-end text recognition system usually consists of two steps: text detection and text recognition. In text detection, text regions are detected and labeled with their bounding boxes. In text recognition, text information is retrieved from the detected text regions. Text detection is an important step for end-to-end text recognition, without which texts cannot be recognized from scene images. Therefore, scene text detection has attracted much attention in recent years.
While traditional optical character reader (OCR) techniques can only deal with texts on printed documents or business cards, scene text detection tries to detect various texts in complex scenes. Due to complex backgrounds and variations of font, size, color, language, illumination condition and orientation, scene text detection is a very challenging task. Its performance was poor when hand-designed features and traditional classifiers were used, before deep learning methods became popular. However, the performance has been much improved in recent years, benefiting significantly from the development of deep learning. Meanwhile, the research focus of text detection has shifted from horizontal scene texts [10] to multi-oriented scene texts [9] and the more challenging curved or arbitrary shape scene texts [19]. Therefore, this paper focuses on arbitrary shape scene text detection.
In this paper, we propose an arbitrary shape scene text detection method using adaptive text region representation, as shown in Figure 1. Given an input image, a text region proposal network (Text-RPN) is first used for obtaining text proposals. The Convolutional Neural Network (CNN) feature maps of the input image are also obtained in this step. Then, text proposals are verified and refined with a refinement network, whose input is the text proposal features obtained by applying region of interest (ROI) pooling to the CNN feature maps. Here, three branches exist in the refinement network: text/non-text classification, bounding box refinement and recurrent neural network (RNN) based adaptive text region representation. In the RNN, a pair of boundary points are predicted at each time step until the stop label is predicted. In this way, arbitrary shape text regions can be represented with an adaptive number of boundary points. For performance evaluation, the proposed method is tested on five benchmarks, namely CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500. Experimental results show that the proposed method can process not only multi-oriented scene texts but also arbitrary shape scene texts including curved texts. Moreover, it achieves state-of-the-art performance on the five datasets.

2. Related Work

Traditional sliding window based and Connected component (CC) based scene text detection methods had been widely used before deep learning became the most promising machine learning tool. Sliding window based methods [27, 32] move a multi-scale window over an image and classify the current patch as text or non-text. CC based methods, especially the Maximally Stable Extremal Regions (MSER) based methods [26, 30], get character candidates by extracting CCs. And then, these candidate CCs are classified as text or non-text. These methods usually adopt a bottom-up strategy and often need several steps to detect texts (e.g., character detection, text line construction and text line classification). As each step may lead to misclassification, the performances of these traditional text detection methods are poor.
Recently, deep learning based methods have become popular in scene text detection. These methods can be divided into three groups: bounding box regression based methods, segmentation based methods, and combined methods. Bounding box regression based methods [5, 8, 11, 12, 13, 16], which are inspired by general object detection methods such as SSD [14] and Faster R-CNN [23], treat text as a kind of object and directly estimate its bounding box as the detection result. Segmentation based methods [3, 19, 33] try to solve the problem by segmenting text regions from the background, and an additional step is needed to get the final bounding boxes. Combined methods [20] use a similar strategy as Mask R-CNN [4], in which both segmentation and bounding box regression are used for better performance. However, their processing time is increased because more steps are needed than in previous methods. Among the three kinds of methods, bounding box regression based methods are the most popular in scene text detection, benefiting from the development of general object detection.

  • Deep learning based scene text detection methods fall roughly into three groups
  • Direct bounding box regression methods, inspired by general object detection algorithms (SSD, Faster R-CNN)
  • Segmentation based methods
  • Segmentation + bounding box regression methods, which perform best but take longer to process

Bounding box regression based methods can be further divided into one-stage methods and two-stage methods. One-stage methods, including Deep Direct Regression [5], TextBoxes [12], TextBoxes++ [11], DMPNet [16], SegLink [24] and EAST [34], directly estimate bounding boxes of text regions in one step. Two-stage methods include R2CNN [8], RRD [13], RRPN [22], IncepText [28] and FEN [31]. They consist of a text proposal generation stage, in which candidate text regions are generated, and a bounding box refinement stage, in which candidate text regions are verified and refined to generate the final detection result. Two-stage methods usually achieve higher performance than one-stage methods. Therefore, the idea of two-stage detection is used in this paper.


Bounding box regression

While most existing scene text detection methods can only deal with horizontal or oriented texts, detecting arbitrary shape texts such as curved text has attracted more attention recently. In CTD [17], a polygon of a fixed 14 points is used to represent a text region. Meanwhile, recurrent transverse and longitudinal offset connection (TLOC) is proposed for accurate curved text detection. Though a polygon of 14 points is enough for most text regions, it is not enough for some long curved text lines. Besides, 14 points are too many for most horizontal and oriented texts, for which 4 points are enough. In TextSnake [19], a text instance is described as a sequence of ordered, overlapping disks centered at the symmetric axis of the text region. Each disk is associated with a potentially variable radius and orientation, which are estimated via a Fully Convolutional Network (FCN) model. Moreover, Mask TextSpotter [20], which is inspired by Mask R-CNN, can handle text instances of irregular shapes via semantic segmentation. Though TextSnake and Mask TextSpotter can both deal with text of arbitrary shapes, they both need pixel-wise predictions, which require heavy computation.

Detecting arbitrary shape text

Considering that a polygon with a fixed number of points is not suitable for representing text regions of different shapes, an adaptive text region representation using different numbers of points for texts of different shapes is proposed in this paper. Meanwhile, an RNN is employed to learn the adaptive representation of each text region, with which text regions can be directly labeled and pixel-wise segmentation is not needed.

Proposed method

3. Methodology

Flowchart

Figure 1 shows the flowchart of the proposed method for arbitrary shape text detection, which is a two-stage detection method. It consists of two steps: text proposal and proposal refinement. In text proposal, a Text-RPN is used to generate text proposals for an input image. Meanwhile, the CNN feature maps of the input image are obtained here and reused in the following step. Then, text proposals are verified and refined through a refinement network. This step includes text/non-text classification, bounding box regression and RNN based adaptive text region representation. Finally, text regions labeled with polygons of an adaptive number of points are output as the detection result.
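The two-stage flow described above can be summarized in a short sketch. All component names below (`backbone`, `text_rpn`, `roi_pool`, `refine`, `polygon_nms`) and the score threshold are illustrative placeholders, not the authors' Caffe implementation:

```python
from typing import Callable, List, Sequence, Tuple

def detect_text(image,
                backbone: Callable,        # image -> shared CNN feature maps (SE-VGG16 in the paper)
                text_rpn: Callable,        # feature maps -> list of proposal boxes
                roi_pool: Callable,        # (features, box) -> fixed-size ROI feature
                refine: Callable,          # ROI feature -> (text_score, refined_box, polygon)
                polygon_nms: Callable,     # list of (score, polygon) -> kept detections
                score_thresh: float = 0.5  # assumed threshold, not given in the paper
                ) -> List[Tuple[float, Sequence[Tuple[float, float]]]]:
    """Illustrative two-stage flow of the detector; every component is injected."""
    feats = backbone(image)                          # shared features for both stages
    detections = []
    for box in text_rpn(feats):                      # stage 1: text proposals
        score, refined_box, polygon = refine(roi_pool(feats, box))   # stage 2: three branches
        if score >= score_thresh:
            detections.append((score, polygon))
    return polygon_nms(detections)                   # polygon-based NMS (Sec. 3.3)
```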

3.1. Adaptive text region representation

The existing scene text detection methods use polygons with a fixed number of points to represent text regions. For horizontal texts, 2 points (the top-left and bottom-right points) are used to represent the text regions. For multi-oriented texts, the 4 points of their bounding boxes are used to represent these regions. Moreover, for curved texts, 14 points are adopted in CTW1500 [17] for text region representation. However, for some very complex scene texts, such as long curved text, even 14 points may not be enough to represent them well, while for most scene texts such as horizontal and oriented texts, fewer than 14 points are enough, and using 14 points to represent these text regions is a waste.

  • Horizontal text regions: 2 points suffice
  • Multi-oriented text regions: 4 points suffice
  • Curved text regions: 14 points
  • Long curved text regions: 14 points may not be enough
  • Using a uniform 14 points would therefore be wasteful for horizontal and multi-oriented text, so this paper adaptively determines the number of points representing a text region according to its shape (see the sketch below)
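As a toy illustration of the adaptive representation (all coordinates are invented), the same point-pair format covers every case and only the number of pairs changes:

```python
# Hypothetical annotations: a region is a list of (top_point, bottom_point) pairs,
# ordered from one end of the text to the other. Coordinates are invented for illustration.
horizontal_word = [((10, 10), (10, 30)), ((90, 10), (90, 30))]              # 2 pairs -> 4 points
oriented_word   = [((10, 40), (22, 60)), ((80, 12), (92, 32))]              # 2 pairs, rotated box
curved_line     = [((10, 50), (14, 70)), ((45, 30), (48, 52)),
                   ((80, 30), (83, 52)), ((115, 50), (111, 70))]            # 4 pairs follow the bend
long_curved     = curved_line + [((150, 62), (145, 82)),
                                 ((185, 80), (176, 98))]                    # 6 pairs for a longer bend

for name, region in [("horizontal", horizontal_word), ("oriented", oriented_word),
                     ("curved", curved_line), ("long curved", long_curved)]:
    print(f"{name}: {2 * len(region)} boundary points")
```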

Therefore, it is reasonable to consider using polygons with adaptive numbers of points to represent text regions. Naturally, we can imagine that corner points on the boundary of a text region can be used for region representation, as shown in Figure 2 (a). This is similar to the method for annotating general objects [1]. However, the points obtained in this way are not arranged along a consistent direction and it may be difficult to learn the representation. In the method for annotating general objects, human correction may be needed for accurate segmentation. Considering that text regions usually have approximately symmetric top and bottom boundaries, as shown in Figure 3, using pairwise points from the two boundaries for text region representation may be more suitable. It is much easier to learn the pairwise boundary points from one end of the text region to the other, as shown in Figure 2 (b). In this way, different scene text regions can be represented precisely by different numbers of points, as shown in Figure 3. Moreover, to our knowledge, we are the first to use adaptive numbers of pairwise points for text region representation.

Figure 2

Figure 3
  • Adaptive annotation: pairwise points on the top and bottom boundaries

3.2. Text proposal

When an input image is given, the first step of the proposed method is text proposal, in which text region candidates called text proposals are generated by a Text-RPN. The Text-RPN is similar to the RPN in Faster R-CNN [23] except for its backbone network and anchor sizes. In the proposed method, the backbone network is SE-VGG16 as shown in Table 1, which is obtained by adding Squeeze-and-Excitation (SE) blocks [7] to VGG16 [25]. As shown in Figure 4, SE blocks adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, which can produce significant performance improvement. Here, FC means fully connected layer and ReLU means Rectified Linear Unit function. Moreover, because scene texts usually have different sizes, anchor sizes are set as {32, 64, 128, 256, 512} for covering more texts while the aspect ratios {0.5, 1, 2} are kept.

Figure 4

SE-VGG16
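A minimal PyTorch-style sketch of one SE block as described in Figure 4 is given below; the reduction ratio of 16 follows the original SE-Net paper and is an assumption here, and the authors' actual implementation is in Caffe:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: recalibrates channel responses (sketch, not the paper's code)."""
    def __init__(self, channels: int, reduction: int = 16):   # reduction=16 is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling per channel
        self.fc = nn.Sequential(                     # excitation: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # channel-wise rescaling of the VGG16 features
```

Where exactly these blocks are inserted into VGG16 to form SE-VGG16 is specified by Table 1 of the paper.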

3.3. Proposal refinement

After text proposal, text region candidates of the input image have been generated, and they are verified and refined in this step. As shown in Figure 1, a refinement network is employed for proposal refinement, which consists of several branches: text/non-text classification, bounding box regression and RNN based adaptive text region representation. Here, text/non-text classification and bounding box regression are similar to those in other two-stage text detection methods, while the last branch is proposed for arbitrary shape text representation.
For the proposed branch, the input is the features of each text proposal, which are obtained by applying ROI pooling to the CNN feature maps generated with SE-VGG16. The output target of this branch is the adaptive number of boundary points for each text region. Because the output length changes for different text regions, it is reasonable to use an RNN to predict these points. Therefore, Long Short-Term Memory (LSTM) [6] is used here, which is a kind of RNN popular for sequence learning problems such as machine translation, speech recognition, image captioning and text recognition.
Though it is proposed that pairwise boundary points are used for text region representation, different ways can be used to parameterize a point pair. Naturally, we can imagine using the coordinates (x1, y1, x2, y2) of the two pairwise points to represent them. In this way, the coordinates of pairwise points are used as the regression targets, as shown in Figure 5. However, a point pair can also be represented in a different way, using the coordinates (x, y) of its center point, the distance d from the center point to the two points, and their orientation θ. However, the angle target is not stable in some special situations. For example, an angle near 90° is very similar to an angle near −90° spatially, but the two angle values are quite different. This makes it hard for the network to learn the angle target well. Alternatively, the orientation can be represented by sin θ and cos θ, which can be predicted stably, but more parameters are needed. Therefore, the coordinates (x1, y1, x2, y2) of the points are used as the regression targets in the proposed method.

  • One option: directly regress the coordinates of the two points
  • The other option: center point + distance from the center + orientation angle
  • The second option either suffers from angle ambiguity or needs extra parameters (sin/cos) to make the angle unambiguous, so the first option is used (see the sketch below)
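The two parameterizations can be compared with a few lines of arithmetic; the vertical point pairs below are illustrative only:

```python
import math

def pair_to_center_form(x1, y1, x2, y2):
    """(x1, y1, x2, y2) -> (cx, cy, d, theta): center, half-distance and orientation."""
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    d = math.hypot(x2 - x1, y2 - y1) / 2
    theta = math.degrees(math.atan2(y2 - y1, x2 - x1))
    return cx, cy, d, theta

# Two nearly identical vertical point pairs, traversed in opposite order:
print(pair_to_center_form(0, 0, 0, 10))   # theta =  90 degrees
print(pair_to_center_form(0, 10, 0, 0))   # theta = -90 degrees: spatially similar, target far apart
# Using (sin, cos) removes the discontinuity but needs one extra regression value per pair,
# which is why the raw coordinates (x1, y1, x2, y2) are regressed instead.
```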


Figure 5

The inputs at all time steps of the LSTM used here are the same: the ROI pooling features of the corresponding text proposal. The output of each time step is the coordinates of a pair of points on the text region boundary. Meanwhile, as adaptive numbers of points are used for different text regions, a stop label is needed to indicate when the predicting network stops. Because stop label prediction is a classification problem while coordinate prediction is a regression problem, it is not appropriate to put them in the same branch. Therefore, there are two branches at each time step of the LSTM: one for point coordinate regression and one for stop label prediction. At each time step, the coordinates of two pairwise boundary points of the text region and the stop/continue label are predicted. If the label is continue, the coordinates of another two points and a new label are predicted at the next time step. Otherwise, the prediction stops and the text region is represented with the points predicted so far. In this way, text regions in the input image can be detected and represented with different polygons made up of the predicted pairwise points.
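A PyTorch-style sketch of such a recurrent decoder is shown below. The feature dimension, hidden size, step cap, use of `nn.LSTMCell` and the meaning of label index 0 are all assumptions; the paper only fixes that every step receives the same ROI feature and has a coordinate branch and a stop/continue branch:

```python
import torch
import torch.nn as nn

class BoundaryPointDecoder(nn.Module):
    """LSTM decoder that emits pairwise boundary points until a stop label (illustrative sketch)."""
    def __init__(self, feat_dim: int = 1024, hidden: int = 512, max_steps: int = 20):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.point_head = nn.Linear(hidden, 4)    # branch 1: (x1, y1, x2, y2) of one point pair
        self.stop_head = nn.Linear(hidden, 2)     # branch 2: stop / continue logits
        self.max_steps = max_steps                # hard cap is an assumption; the paper uses a stop label

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        # roi_feat: (1, feat_dim) pooled feature of a single text proposal
        h = roi_feat.new_zeros(1, self.cell.hidden_size)
        c = roi_feat.new_zeros(1, self.cell.hidden_size)
        pairs = []
        for _ in range(self.max_steps):
            h, c = self.cell(roi_feat, (h, c))    # the same ROI feature is fed at every step
            pairs.append(self.point_head(h))      # coordinates of one pair of boundary points
            if self.stop_head(h).argmax(dim=1).item() == 0:   # index 0 = "stop" (assumed)
                break
        return torch.cat(pairs, dim=0)            # (n_pairs, 4)
```

The sketch only covers inference; at training time the number of steps is given by the ground-truth point pairs, and the stop/continue labels are supervised by the classification loss of Section 3.4.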
While Non-Maximum Suppression (NMS) is extensively used to post-process detection candidates by general object detection methods, it is also needed in the proposed method. As the detected text regions are represented with polygons, normal NMS which is computed based on the area of horizontal bounding box is not suitable here. Instead, a polygon NMS is used, which is computed based on the area of the polygon of text region. After NMS, the remaining text regions are output as the detection result.
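A simple polygon NMS can be sketched with `shapely` for the polygon intersection areas (the library choice and the IoU threshold of 0.3 are assumptions, not details from the paper):

```python
from shapely.geometry import Polygon

def polygon_iou(pts_a, pts_b):
    """IoU of two polygons given as lists of (x, y) vertices."""
    a, b = Polygon(pts_a), Polygon(pts_b)
    inter = a.intersection(b).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0

def polygon_nms(detections, iou_thresh=0.3):
    """detections: list of (score, polygon_points); keep high-scoring, non-overlapping polygons."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for score, poly in detections:
        if all(polygon_iou(poly, kept_poly) < iou_thresh for _, kept_poly in kept):
            kept.append((score, poly))
    return kept
```

Sorting by score and greedily keeping non-overlapping candidates mirrors standard NMS; only the overlap computation changes from rectangles to general polygons.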

3.4. Training objective

As the Text-RPN in the proposed method is similar to the RPN in Faster R-CNN [23], the training loss of the Text-RPN is also computed in a similar way. Therefore, in this section, we only focus on the loss function of the refinement network in proposal refinement. The loss defined on each proposal is the sum of a text/non-text classification loss, a bounding box regression loss, a boundary points regression loss and a stop/continue label classification loss. The multi-task loss function on each proposal is defined as:
L_{sum} = L_{cls}(p, t) + \lambda_{1} t \sum_{i\in\{x, y, w, h\}} L_{reg}(v_{i}, v_{i}^{*}) + \lambda_{2} t \sum_{i\in\{x_{1}, y_{1}, x_{2}, y_{2}, \ldots, x_{n}, y_{n}\}} L_{reg}(u_{i}, u_{i}^{*}) + \lambda_{3} t \sum_{i\in\{l_{1}, l_{2}, \ldots, l_{n/2}\}} L_{cls}(l_{i}, l_{i}^{*})
λ1, λ2 and λ3 are balancing parameters that control the trade-off between these terms, and they are all set to 1 in the proposed method.
For the text/non-text classification loss term, t is the indicator of the class label. Text is labeled as 1 (t = 1), and background is labeled as 0 (t = 0). The parameter p is the probability over the text and background classes computed after softmax. Then, L_{cls}(p, t) = −log p_t is the log loss for the true class t.

For the bounding box regression loss term, v = (v_x, v_y, v_w, v_h) is a tuple of true bounding box regression targets including the coordinates of the center point and its width and height, and v^* = (v_x^*, v_y^*, v_w^*, v_h^*) is the predicted tuple for each text proposal. We use the parameterization for v and v^* given in Faster R-CNN [23], in which v and v^* specify a scale-invariant translation and log-space height/width shift relative to an object proposal.
For the boundary points regression loss term, u = (u_{x_1}, u_{y_1}, ..., u_{x_n}, u_{y_n}) is a tuple of true coordinates of the boundary points, and u^* = (u_{x_1}^*, u_{y_1}^*, ..., u_{x_n}^*, u_{y_n}^*) is the predicted tuple for the text label. To make the learned points suitable for texts of different scales, the learning targets should also be processed to make them scale invariant. The parameters are processed as follows:

u_{x_i} = (x_i - x_a) / w_a, \quad u_{y_i} = (y_i - y_a) / h_a

where x_i and y_i denote the coordinates of a boundary point, x_a and y_a denote the coordinates of the center point of the corresponding text proposal, and w_a and h_a denote the width and height of this proposal.

Let z indicate v_i − v_i^* or u_i − u_i^*; then L_{reg}(z) is defined as the smooth L1 loss as in Faster R-CNN [23]:

smooth_{L1}(z) = \begin{cases} 0.5 z^{2} & \text{if } |z| < 1 \\ |z| - 0.5 & \text{otherwise} \end{cases}
The stop/continue label classification loss term is also a binary classification loss and is formulated similarly to the text/non-text classification loss.
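Putting the four terms together, a per-proposal loss might be sketched as follows (PyTorch-style; the helper names and tensor shapes are assumptions, while the λ weights of 1 match the paper):

```python
import torch
import torch.nn.functional as F

def normalized_point_targets(points, proposal):
    """u_xi = (x_i - x_a) / w_a, u_yi = (y_i - y_a) / h_a for a proposal (x_a, y_a, w_a, h_a)."""
    xa, ya, wa, ha = proposal
    return [((x - xa) / wa, (y - ya) / ha) for x, y in points]

def proposal_loss(cls_logits,                 # (1, 2) text / non-text logits
                  box_pred, box_target,       # (4,) bounding-box regression targets (Faster R-CNN form)
                  point_pred, point_target,   # (2n,) normalized boundary-point coordinates
                  stop_logits, stop_target,   # (n/2, 2) logits and (n/2,) stop/continue labels
                  is_text: bool, lam1=1.0, lam2=1.0, lam3=1.0):
    """Per-proposal multi-task loss of the refinement network (illustrative sketch)."""
    cls_target = torch.tensor([1 if is_text else 0])
    loss = F.cross_entropy(cls_logits, cls_target)          # L_cls(p, t): log loss on the true class
    if is_text:                                             # t = 0 switches the remaining terms off
        loss = loss + lam1 * F.smooth_l1_loss(box_pred, box_target, reduction="sum")
        loss = loss + lam2 * F.smooth_l1_loss(point_pred, point_target, reduction="sum")
        loss = loss + lam3 * F.cross_entropy(stop_logits, stop_target, reduction="sum")
    return loss
```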

4. Experiments

4.1. Datasets
Five benchmarks are used in this paper for performance evaluation, which are introduced in the following:

  • CTW1500: The CTW1500 dataset [17] contains 500 test images and 1000 training images, which contain multi-oriented text, curved text and irregular shape text. Text regions in this dataset are labeled with 14 scene text boundary points at sentence level.
  • TotalText: The TotalText dataset [2] consists of 300 test images and 1255 training images with more than 3 different text orientations: horizontal, multi-oriented, and curved. The texts in these images are labeled at word level with adaptive number of corner points.
  • ICDAR2013: The ICDAR2013 dataset [10] contains focused scene texts for ICDAR Robust Reading Competition. It includes 233 test images and 229 training images. The scene texts are horizontal and labeled with horizontal bounding boxes made up by 2 points at word level.
  • ICDAR2015: The ICDAR2015 dataset [9] focuses on incidental scene text in ICDAR Robust Reading Competition. It includes 500 testing images and 1000 training images. The scene texts have different orientations, which are labeled with inclined boxes made up by 4 points at word level.
  • MSRA-TD500: The MSRA-TD500 dataset [29] contains 200 test images and 300 training images, that contain arbitrarily-oriented texts in both Chinese and English. The texts are labeled with inclined boxes made up by 4 points at sentence level. Some long straight text lines exist in the dataset.

The evaluation for text detection follows the ICDAR evaluation protocol in terms of Recall, Precision and Hmean. Recall represents the ratio of the number of correctly detected text regions to the total number of text regions in the dataset, while Precision represents the ratio of the number of correctly detected text regions to the total number of detected text regions. Hmean is a single measure of quality that combines recall and precision. A detected text region is considered correct if its overlap with the ground truth text region is larger than a given threshold. The computation of the three evaluation terms usually differs between datasets. While the results on ICDAR2013 and ICDAR2015 can be evaluated through the ICDAR robust reading competition platform, the results on the other three datasets can be evaluated with their corresponding evaluation methods.
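Hmean is simply the harmonic mean (F-measure) of precision and recall, e.g.:

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the ICDAR Hmean / F-measure)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(round(hmean(0.85, 0.80), 3))   # 0.824 (illustrative numbers, not results from the paper)
```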

4.2. Implementation details

Our scene text detection network is initialized with a VGG16 model pretrained for ImageNet classification. When the proposed method is tested on the five datasets, a different model is used for each of them, trained using only the training images of that dataset with data augmentation. All models are trained for 10 × 10^4 iterations in total. The learning rate starts from 10^-3 and is multiplied by 1/10 after 2 × 10^4, 6 × 10^4 and 8 × 10^4 iterations. We use 0.0005 weight decay and 0.9 momentum. We use multi-scale training, setting the short side of training images to {400, 600, 720, 1000, 1200} while maintaining the long side at 2000.
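The multi-scale training setting can be written as a small resize rule (a sketch; the rounding and the interpretation of "maintaining the long side at 2000" as a cap are assumptions):

```python
import random

TRAIN_SHORT_SIDES = [400, 600, 720, 1000, 1200]
MAX_LONG_SIDE = 2000

def training_scale(height: int, width: int):
    """Pick a random short-side target and cap the long side, as in the training setup."""
    scale = random.choice(TRAIN_SHORT_SIDES) / min(height, width)
    if scale * max(height, width) > MAX_LONG_SIDE:      # keep the long side at (at most) 2000
        scale = MAX_LONG_SIDE / max(height, width)
    return round(height * scale), round(width * scale)
```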
Because adaptive text region representation is used in the proposed method, it can readily be applied to datasets whose text regions are labeled with different numbers of points. As ICDAR2013, ICDAR2015 and MSRA-TD500 are labeled with quadrilateral boxes, they are easily transformed into pairwise points. However, for the CTW1500 and TotalText datasets, some operations are needed to transform the ground truths into the form we need.
Text regions in CTW1500 are labeled with 14 points, which need to be transformed into an adaptive number of pairwise points. First, the 14 points are grouped into 7 point pairs. Then, we compute the intersection angle for each point, which is the angle between the two vectors from the current point to its two neighboring points. For each point pair, its angle is the smaller of its two points' angles. Next, point pairs are sorted by their angles in descending order and we try to remove each point pair in that order. If the ratio of the polygon area after removal to the original area is larger than 0.93, the point pair is removed. Otherwise, the operation stops and the remaining points are used in training for text region representation.

  • The CTW1500 conversion is not entirely clear; revisit when implementing (one possible interpretation is sketched below)
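One possible reading of the pruning procedure above is sketched below; this is an interpretation, not the authors' code, and the exclusion of the two end pairs plus the use of `shapely` for areas are assumptions:

```python
import math
from shapely.geometry import Polygon

def corner_angle(prev_pt, pt, next_pt):
    """Angle (degrees) at `pt` between the two vectors pointing to its neighbouring points."""
    v1 = (prev_pt[0] - pt[0], prev_pt[1] - pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2) + 1e-6)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def simplify_ctw1500_label(top, bottom, keep_ratio=0.93):
    """top, bottom: the 7 points on each boundary -> reduced list of (top, bottom) point pairs."""
    def area(pairs):
        # polygon built from the top boundary followed by the reversed bottom boundary
        return Polygon([p[0] for p in pairs] + [p[1] for p in pairs][::-1]).area

    pairs = list(zip(top, bottom))
    original = area(pairs)

    # angle of a pair = the smaller corner angle of its two points; end pairs are kept (assumption)
    def pair_angle(i):
        return min(corner_angle(top[i - 1], top[i], top[i + 1]),
                   corner_angle(bottom[i - 1], bottom[i], bottom[i + 1]))

    kept = list(range(len(pairs)))
    for i in sorted(range(1, len(pairs) - 1), key=pair_angle, reverse=True):
        candidate = [j for j in kept if j != i]
        if area([pairs[j] for j in candidate]) / original > keep_ratio:
            kept = candidate            # flattest pair removed: the polygon barely changes
        else:
            break                       # stop at the first pair whose removal loses too much area
    return [pairs[j] for j in kept]
```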

Moreover, text regions in TotalText are labeled with adaptive numbers of points, but these points are not pairwise. Text regions labeled with an even number of points are easy to process by grouping the points into pairs. For text regions labeled with an odd number of points, the two start points and the two end points are found first, and then the correspondences for the remaining points are found based on their distances to the start points along the boundary.
The results of the proposed method are obtained on single scale input images with one trained model. Because the test image scale has a deep impact on the detection results (for example, FOTS [15] uses different scales for different datasets), we also use different test scales for different datasets for best performance. In our experiments, the scale for ICDAR2013 is 960 × 1400, the scale for ICDAR2015 is 1200 × 2000, and the scales for the other datasets are all 720 × 1280.
The proposed method is implemented in Caffe and the experiments are run on an Nvidia P40 GPU.

4.3. Ablation study

In the proposed method the backbone network is SE-VGG16, while VGG16 is usually used by other state-of-the-art methods. To verify the effectiveness of the backbone network, we test the proposed method with different backbone networks (SE-VGG16 vs VGG16) on the CTW1500 and ICDAR2015 datasets, as shown in Table 2. The results show that SE-VGG16 is better than VGG16, achieving higher performance on both datasets.


Meanwhile, an adaptive text region representation is proposed for texts of arbitrary shapes in this paper. To validate its effectiveness for scene text detection, we add an ablation study on text region representation on the CTW1500 dataset. For comparison, the fixed text region representation directly uses the fixed 14 points as the regression targets in the experiment. Table 3 shows the experimental results of the different text region representation methods on CTW1500. The recall of the method with adaptive representation is much higher than with the fixed representation (80.2% vs 76.4%). This justifies that the adaptive text region representation is more suitable for texts of arbitrary shapes.

4.4. Comparison with State-of-the-arts

To show the performance of the proposed method on texts of different shapes, we test it on several benchmarks. We first compare its performance with state-of-the-art methods on CTW1500 and TotalText, which both contain challenging multi-oriented and curved texts. Then we compare the methods on the two most widely used benchmarks: ICDAR2013 and ICDAR2015. Finally, we compare them on MSRA-TD500, which contains long straight text lines and multi-language texts (Chinese + English).
Table 4 and Table 5 compare the proposed method with state-of-the-art methods on CTW1500 and TotalText, respectively. The proposed method is much better than all other methods on CTW1500, including the methods designed for curved texts such as CTD, CTD+TLOC and TextSnake (Hmean: 80.1% vs 69.5%, 73.4% and 75.6%). Meanwhile, it also achieves better performance (Hmean: 78.5%) than all other methods on TotalText. The performance on these two datasets containing challenging multi-oriented and curved texts shows that the proposed method can detect scene texts of arbitrary shapes.




Table 6 shows the experimental results on the ICDAR2013 dataset. The proposed method achieves the best performance, tied with Mask TextSpotter, both with an Hmean of 91.7%. Because the proposed method is tested on single scale input images with a single model, only the results generated in this setting are used here. The results show that the proposed method can also process horizontal text well.



Table 7 shows the experimental results on the ICDAR2015 dataset, where the proposed method achieves the second best performance, only slightly lower than FOTS (Hmean: 87.6% vs 88.0%). While FOTS is trained end-to-end by combining text detection and recognition, the proposed method is trained for text detection only, which is much easier to train than FOTS. The results tested on single scale input images with a single model are used here. The results show that the proposed method achieves comparable performance with the state-of-the-art, which means it can also process multi-oriented text well.

Table 8 shows the results on MSRA-TD500 dataset and it shows that our detection method can support long straight text line detection and Chinese+English detection well. It achieves Hmean of 83.6% and is better than all other methods.

4.5. Speed

The speed of the proposed method is compared with two other methods in Table 9, all of which are able to deal with arbitrary shape scene text. From the results, we can see that the proposed method is much faster than the other two. While pixel-wise prediction is needed in Mask TextSpotter and TextSnake, it is not needed in the proposed method, so less computation is required.


4.6. Qualitative results

Figure 6 illustrates qualitative results on CTW1500, TotalText, ICDAR2013, ICDAR2015 and MSRA-TD500. It shows that the proposed method can deal with various texts that are arbitrarily oriented or curved, in different languages, under non-uniform illumination and of different lengths, at word level or sentence level.

5. Conclusion

In this paper, we propose a robust arbitrary shape scene text detection method with adaptive text region representation. After text proposal using a Text-RPN, each text region is verified and refined using an RNN that predicts an adaptive number of boundary points. Experiments on five benchmarks show that the proposed method can not only detect horizontal and oriented scene texts but also works well for arbitrary shape scene texts. In particular, it outperforms existing methods significantly on CTW1500 and MSRA-TD500, which are typical of curved texts and multi-oriented texts, respectively. In the future, the proposed method can be improved in several aspects. First, arbitrary shape scene text detection may be improved by using corner point detection, which would require easier annotations for training images. Second, to fulfill the final goal of text recognition, end-to-end text recognition for arbitrary shape scene texts will be considered.

