An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

 

Abstract

Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which verifies its generality.

 

1. Introduction

Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of recent work related to deep neural networks has been devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a range of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of either 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.

Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] first detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out of the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.

Recurrent neural network (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, so the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.

Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, so that word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.

The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property as RNN, being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior art [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.

 

2. The Proposed Network Architecture

The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.

At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making a prediction for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.


Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.


 

2.1 Feature Sequence Extraction

In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). This component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, and this sequence is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.
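The column-wise extraction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `map_to_sequence` and the nested-list layout (a list of C maps, each with H rows and W columns) are hypothetical choices and not taken from the paper's implementation.

```python
def map_to_sequence(feature_maps):
    """Convert C feature maps (each H rows x W columns) into W feature vectors.

    The i-th vector concatenates the i-th column of every map,
    so each vector has C * H entries.
    """
    if not feature_maps:
        return []
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    sequence = []
    for col in range(width):  # walk the maps from left to right
        vector = []
        for fmap in feature_maps:  # take the same column from every map
            vector.extend(fmap[row][col] for row in range(height))
        sequence.append(vector)
    return sequence


# Two toy feature maps with H = 2 rows and W = 3 columns (hypothetical values).
maps = [
    [[1, 2, 3],
     [4, 5, 6]],
    [[10, 20, 30],
     [40, 50, 60]],
]
frames = map_to_sequence(maps)
print(len(frames), len(frames[0]))
```

Each of the W resulting frames is what the recurrent layers would consume as one time step.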

As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order as their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.

 


Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.


 

Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image by CNN, and then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convert deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.

 

2.2 Sequence Labeling

A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1, \dots, x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.

 


Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of the deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTM results in a bidirectional LSTM. Stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM.

 

A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both current input $x_t$ and past state $h_{t-1}$ as its inputs: $h_t = g(x_t, h_{t-1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $\{x_{t'}\}_{t' < t}$ are captured and utilized for prediction. The traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store, and adds burden to the training process. Long Short-Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.
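To make the recurrence concrete, here is a minimal sketch of a traditional (non-LSTM) RNN with a scalar state. The choice of `tanh` as the non-linearity, the scalar weights `w_in`, `w_rec`, `bias` and the function name `rnn_forward` are all assumptions made for illustration; the text above only states that the state update is some non-linear function of the current input and the past state.

```python
import math


def rnn_forward(xs, w_in, w_rec, bias, h0=0.0):
    """Run a scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)."""
    h = h0
    states = []
    for x in xs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


# A single non-zero input at t = 1 followed by zeros (hypothetical values).
states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0)
print(states)
```

With these toy weights the influence of the first frame shrinks at every later step, which gives a rough intuition for why a plain RNN stores only a limited range of context.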

LSTM is directional: it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].
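The idea of combining a forward and a backward pass can be sketched as follows. As a simplification, a plain `tanh` recurrence stands in for the LSTM unit, and the helper names `run_rnn` and `bidirectional` and the weights are hypothetical; only the forward/backward pairing reflects the text.

```python
import math


def run_rnn(xs, w_in=1.0, w_rec=0.5):
    """Plain tanh recurrence used here as a stand-in for an LSTM."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out


def bidirectional(xs):
    """Pair each frame's left-to-right state with its right-to-left state."""
    forward = run_rnn(xs)
    # Process the reversed sequence, then flip back so indices line up.
    backward = run_rnn(xs[::-1])[::-1]
    return list(zip(forward, backward))


features = bidirectional([0.0, 1.0, 0.0])
print(features)
```

At the first frame the forward state has seen nothing informative yet, while the backward state already carries context from the later frames, which is the complementarity the paragraph refers to.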

 

In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers.

 

2.3 Transcription

Transcription is the process of converting the per-frame predictions made by RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. A lexicon is a set of label sequences that prediction is constrained to, e.g. a spell checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.

 

2.3.1 Probability of label sequence

We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for label sequence l conditioned on the per-frame predictions y = y1, . . . , yT , and it ignores the position where each label in l is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.


The formulation of the conditional probability is briefly described as follows: The input is a sequence $y = y_1, \dots, y_T$, where $T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a probability distribution over the set $L' = L \cup \{\text{blank}\}$, where $L$ contains all labels in the task (e.g. all English characters) and $L'$ additionally contains a ’blank’ label. A sequence-to-sequence mapping function $B$ is defined on sequences $\pi \in L'^{T}$, where $T$ is the length. $B$ maps $\pi$ onto $l$ by first removing the repeated labels, then removing the ’blank’s. For example, $B$ maps “--hh-e-l-ll-oo--” (’-’ represents ’blank’) onto “hello”. Then, the conditional probability is defined as the sum of probabilities of all $\pi$ that are mapped by $B$ onto $l$:

$$p(l \mid y) = \sum_{\pi : B(\pi) = l} p(\pi \mid y) \qquad (1)$$

where the probability of $\pi$ is defined as $p(\pi \mid y) = \prod_{t=1}^{T} y^{t}_{\pi_t}$, and $y^{t}_{\pi_t}$ is the probability of having label $\pi_t$ at time stamp $t$. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation items. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].
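The following sketch illustrates the mapping $B$ and two ways of evaluating Eq. 1: a literal sum over all paths, and the forward half of the forward-backward recursion. It is a toy illustration rather than the paper's C++ transcription layer. The names `collapse`, `brute_force`, `forward`, the character `-` for the blank, and the representation of `y` as a list of per-frame probability lists indexed by `alphabet` are all assumptions.

```python
from itertools import product

BLANK = "-"  # assumed symbol for the 'blank' label


def collapse(path):
    """The mapping B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def brute_force(label_seq, y, alphabet):
    """Eq. 1 computed literally: sum p(pi|y) over every pi with B(pi) = l."""
    total = 0.0
    for path in product(alphabet, repeat=len(y)):
        if collapse(path) == label_seq:
            prob = 1.0
            for t, label in enumerate(path):
                prob *= y[t][alphabet.index(label)]
            total += prob
    return total


def forward(label_seq, y, alphabet):
    """Eq. 1 via the CTC forward recursion over the blank-extended label."""
    ext = [BLANK]
    for ch in label_seq:
        ext += [ch, BLANK]
    idx = [alphabet.index(s) for s in ext]
    alpha = [0.0] * len(ext)
    alpha[0] = y[0][idx[0]]
    if len(ext) > 1:
        alpha[1] = y[0][idx[1]]
    for t in range(1, len(y)):
        new = [0.0] * len(ext)
        for s in range(len(ext)):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * y[t][idx[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


alphabet = ["-", "a", "b"]
y = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # toy per-frame predictions
print(collapse("--hh-e-l-ll-oo--"))
print(brute_force("ab", y, alphabet), forward("ab", y, alphabet))
```

The brute-force version enumerates $|L'|^T$ paths and is only usable for tiny inputs, whereas the recursion visits each frame and each position of the extended label once.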

 

2.3.2 Lexicon-free transcription

In this mode, the sequence $l^*$ that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]. The sequence $l^*$ is approximately found by $l^* \approx B(\arg\max_{\pi} p(\pi \mid y))$, i.e. taking the most probable label $\pi_t$ at each time stamp $t$, and mapping the resulting sequence onto $l^*$.
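This approximate decoding can be sketched as below. The function name `greedy_decode`, the blank symbol `-` and the layout of `y` (one probability list per frame, indexed by `alphabet`) are illustrative assumptions.

```python
BLANK = "-"  # assumed symbol for the 'blank' label


def greedy_decode(y, alphabet):
    """Take the most probable label per frame, then apply the mapping B."""
    best_path = [
        alphabet[max(range(len(dist)), key=dist.__getitem__)] for dist in y
    ]
    decoded, prev = [], None
    for label in best_path:
        # Merge repeats first, then drop blanks.
        if label != prev and label != BLANK:
            decoded.append(label)
        prev = label
    return "".join(decoded)


alphabet = ["-", "a", "b"]
y = [
    [0.1, 0.8, 0.1],    # most probable: a
    [0.1, 0.7, 0.2],    # most probable: a (repeat, merged)
    [0.9, 0.05, 0.05],  # most probable: blank
    [0.2, 0.1, 0.7],    # most probable: b
]
print(greedy_decode(y, alphabet))
```

The result is the label sequence of the single best path, which is not guaranteed to be the sequence with the highest total probability under Eq. 1.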

2.3.3 Lexicon-based transcription

In lexicon-based mode, each test sample is associated with a lexicon $D$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. $l^* = \arg\max_{l \in D} p(l \mid y)$. However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Eq. 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates $N_\delta(l')$, where $\delta$ is the maximal edit distance and $l'$ is the sequence transcribed from $y$ in lexicon-free mode:

$$l^* = \arg\max_{l \in N_\delta(l')} p(l \mid y). \qquad (2)$$

The candidates $N_\delta(l')$ can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of BK-tree is $O(\log |D|)$, where $|D|$ is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences whose edit distance to the query sequence is less than or equal to $\delta$.
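A compact BK-tree over Levenshtein edit distance might look like the sketch below. It only covers the candidate search $N_\delta(l')$; scoring the candidates with Eq. 1 is not shown. The class name `BKTree`, the tuple-based node layout and the toy lexicon are hypothetical and do not reflect the paper's C++ implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), with children keyed by edit distance."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            dist = edit_distance(word, node[0])
            if dist == 0:
                return  # word is already stored
            child = node[1].get(dist)
            if child is None:
                node[1][dist] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return all stored words within max_dist of the query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            dist = edit_distance(query, word)
            if dist <= max_dist:
                results.append(word)
            # Triangle inequality: only these subtrees can contain matches.
            for d, child in children.items():
                if dist - max_dist <= d <= dist + max_dist:
                    stack.append(child)
        return results


lexicon = ["hello", "help", "hell", "world", "word", "held"]
tree = BKTree(lexicon)  # built once, offline
print(sorted(tree.search("helo", 1)))
```

The tree is built once per lexicon, and each query then prunes whole subtrees whose distance key falls outside the window around the query's distance to the current node.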

 

 

2.4 Network Training

Denote the training dataset by $\mathcal{X} = \{I_i, l_i\}_i$, where $I_i$ is the training image and $l_i$ is the ground truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

$$\mathcal{O} = -\sum_{(I_i, l_i) \in \mathcal{X}} \log p(l_i \mid y_i), \qquad (3)$$

where $y_i$ is the sequence produced by the recurrent and convolutional layers from $I_i$. This objective function calculates a cost value directly from an image and its ground truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.

 

The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.


 

For optimization, we use ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.
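For readers unfamiliar with ADADELTA, the sketch below shows its per-dimension update on a toy quadratic. It is a generic illustration of the optimizer and not the Torch7 training code: the function names, the smoothing constant `eps = 1e-6`, the step count and the toy objective are assumptions, while the decay rate `rho = 0.9` matches the value reported in the implementation details.

```python
import math


def adadelta_minimize(grad_fn, x0, rho=0.9, eps=1e-6, steps=200):
    """Run ADADELTA updates; no learning rate is supplied anywhere."""
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)   # running average of squared gradients
    avg_sq_delta = [0.0] * len(x)  # running average of squared updates
    for _ in range(steps):
        grad = grad_fn(x)
        for i, g in enumerate(grad):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
            # Step size is the ratio of the two running RMS values.
            delta = -(math.sqrt(avg_sq_delta[i] + eps)
                      / math.sqrt(avg_sq_grad[i] + eps)) * g
            avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta * delta
            x[i] += delta
    return x


def grad(x):
    """Gradient of the toy objective (x0 - 3)^2 + 10 * (x1 + 1)^2."""
    return [2 * (x[0] - 3.0), 20 * (x[1] + 1.0)]


start = [0.0, 0.0]
result = adadelta_minimize(grad, start)
print(result)
```

Note that the two coordinates receive similar step sizes even though their gradients differ by an order of magnitude, which is the per-dimension scaling mentioned above.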

 

3. Experiments

To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and settings for training and testing are given in Sec. 3.1, the detailed settings of CRNN for scene text images are provided in Sec. 3.2, and the results with comprehensive comparisons are reported in Sec. 3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.

 

3.1 Datasets

For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 million training images and their corresponding ground truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is trained purely with synthetic text data, it works well on real images from standard text recognition benchmarks.

 

Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).


 

The IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-word lexicon which is defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k-word lexicon consisting of the words in the Hunspell spell-checking dictionary [1].


Table 1. Network configuration summary. The first row is the top layer. ‘k’, ‘s’ and ‘p’ stand for kernel size, stride and padding size respectively.

The IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 cropped word images with ground-truth labels.

IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image is associated with a 50-word lexicon and a 1k-word lexicon.

The SVT [34] test dataset consists of 249 street view images collected from Google Street View. From them, 647 word images are cropped. Each word image has a 50-word lexicon defined by Wang et al. [34].


 

3.2 Implementation Details

The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. A tweak is made in order to make it suitable for recognizing English texts. In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional square ones. This tweak yields feature maps with larger width, hence a longer feature sequence. For example, an image containing 10 characters is typically of size 100×32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’.

 

The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training networks of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.

 

We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32. Widths are scaled proportionally with heights, but are at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.
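The test-time scaling rule can be written as a small helper. The function name `scaled_test_size` and the use of rounding to the nearest pixel are assumptions; the target height of 32 and the minimum width of 100 come from the paragraph above.

```python
def scaled_test_size(width, height, target_height=32, min_width=100):
    """Return (new_width, new_height) for a test image.

    The height is fixed to target_height, the width is scaled by the
    same factor, and the result is never narrower than min_width.
    """
    scaled_width = round(width * target_height / height)  # rounding is an assumption
    return max(min_width, scaled_width), target_height


print(scaled_test_size(400, 64))  # wide image: width scales with height
print(scaled_test_size(60, 40))   # narrow image: width is clamped to the minimum
```

Training images, by contrast, are all resized to the fixed size 100 × 32 regardless of aspect ratio.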

 

3.3 Comparative Evaluation

All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-art techniques including the approaches based on deep models [23, 22, 21], are shown in Table 2.

In the constrained-lexicon cases, our method consistently outperforms most state-of-the-art approaches, and on average beats the best text reader, proposed in [22]. Specifically, compared to [22] we obtain superior performance on IIIT5k and SVT, and achieve lower performance only on IC03 with the “Full” lexicon. Note that the model in [22] is trained on a specific dictionary, namely that each word is associated with a class label. Unlike [22], CRNN is not limited to recognizing words in a known dictionary, and is able to handle random strings (e.g. telephone numbers), sentences or other scripts such as Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.

在受限的词典情况下,我们的方法始终优于大多数最新技术,并且平均而言胜过[22]中提出的最佳文本阅读器。 具体来说,我们在IIIT5k上获得了优异的性能,而与[22]相比,SVT仅在使用“完整”词典的IC03上获得了较低的性能。 注意,in [22]中的模型是在特定词典上训练的,即每个单词都与一个类别标签相关联。 与[22]不同,CRNN不仅限于识别已知词典中的单词,还可以处理随机字符串(例如电话号码),句子或其他脚本(如中文单词)。 因此,CRNN的结果在所有测试数据集上都具有竞争力。

In the unconstrained-lexicon cases, our method achieves the best performance on SVT, yet is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the “None” columns of Table 2 denote that such approaches cannot be applied to recognition without a lexicon, or did not report recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word-level labels as training data, very different from PhotoOCR [8], which used 7.9 million real word images with character-level annotations for training. The best performance in the unconstrained-lexicon cases is reported by [22], benefiting from its large dictionary; however, as mentioned before, it is not a model strictly unconstrained by a lexicon. In this sense, our results in the unconstrained-lexicon case are still promising.

在不受约束的词典情况下,我们的方法在SVT上实现了最佳性能,但仍落后于IC03和IC13的某些方法[8,22]。 请注意,表2中“无”列中的空白表示在没有词汇的情况下,此类方法无法应用于识别,或者在无限制的情况下未报告识别准确性。 我们的方法仅使用带有单词级别标签的合成文本作为训练数据,这与PhotoOCR [8]完全不同,后者使用790万个带有字符级别注释的真实单词图像进行训练。 受益于其庞大的字典,[22]在不受约束的词典情况下报告了最佳性能,但是,它并不是如上所述严格不受词典约束的模型。 从这个意义上讲,我们在无约束词典情况下的结果仍然很有希望。

To further understand the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.

为了进一步了解该算法相对于其他文本识别方法的优势,我们对名为E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size的几个属性进行了全面比较,如表3所示。

 


Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).

表3.各种方法之间的比较。 比较的属性包括:1)端到端可培训(E2E培训); 2)使用直接从图像中学习的卷积特征,而不是使用手工的卷积特征(Conv Ftrs); 3)在训练过程中不需要角色的地面真相边界框(无CharGT); 4)不限于预定义的字典(无约束); 5)模型大小(如果使用了端到端可训练模型),由模型参数的数量(模型大小,M代表百万)衡量。

 

E2E Train: This column shows whether a text reading model is end-to-end trainable, without any preprocessing or several separate steps; such approaches are elegant and clean to train. As can be observed from Table 3, only the models based on deep neural networks, including [22, 21] as well as CRNN, have this property.

端到端培训:此列用于显示某种文本阅读模型是否可以进行端到端的培训,而无需任何预处理或通过几个单独的步骤,这表明此类方法对于培训而言是优雅而干净的。 从表3中可以看出,只有基于深度神经网络的模型(包括[22、21]和CRNN)才具有此属性。

Conv Ftrs: This column indicates whether an approach uses convolutional features learned directly from training images, or hand-crafted features, as the basic representations.

Conv Ftrs:此列指示方法是直接使用从训练图像中学到的卷积特征还是手工特征作为基本表示。

CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.

CharGT-Free:此列用于指示字符级注释对于训练模型是否必不可少。 由于CRNN的输入和输出标签可以是一个序列,因此不需要字符级注释。

 

Unconstrained: This column indicates whether the trained model is constrained to a specific dictionary and thus unable to handle out-of-dictionary words or random sequences. Notice that although the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.

Unconstrained:此列用于指示训练后的模型是否仅限于特定词典,无法处理字典外单词或随机序列。请注意,尽管最近的模型是通过标签嵌入[5,14]和增量学习[ 22]取得了极好的竞争表现,它们被限制在特定的词典中。

 


Table 2. Recognition accuracies (%) on four datasets. In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon. (*[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.)

表2.四个数据集的识别准确率(%)。 在第二行中,“ 50”,“ 1k”,“ 50k”和“完整”表示使用的词典,“无”表示不使用词典的识别。 (* [22]在严格意义上不是没有词典的,因为它的输出被限制在一个90k的字典中。

 

Model Size: This column reports the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and fully-connected layers are not needed. Consequently, CRNN has far fewer parameters than the models learned on the variants of CNN [22, 21], resulting in a much smaller model. Our model has 8.3 million parameters, taking only 33MB RAM (using a 4-byte single-precision float for each parameter), so it can be easily ported to mobile devices.

模型大小:此列用于报告学习的模型的存储空间。 在CRNN中,所有层都具有权重共享连接,并且不需要完全连接的层。 因此,CRNN的参数数量远少于从CNN的变体中学习的模型[22,21],因此与[22,21]相比,模型要小得多。 我们的模型具有830万个参数,仅占用33MB RAM(每个参数使用4字节单精度浮点数),因此可以轻松地将其移植到移动设备上。
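The 33MB figure follows directly from the parameter count. The short check below reproduces the arithmetic; the helper name `model_size_mb` is hypothetical, and treating 1 MB as 10^6 bytes is an assumption.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Storage needed for the parameters alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


# 8.3 million single-precision parameters, as stated in the text.
crnn_size = model_size_mb(8.3e6)
print(f"{crnn_size:.1f} MB")
```

This counts only the weights; activations and framework overhead at run time are not included.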

Table 3 clearly shows the differences among the approaches in detail, and demonstrates the advantages of CRNN over other competing methods. In addition, to test the impact of the parameter δ, we experiment with different values of δ in Eq. 2. In Fig. 4 we plot the recognition accuracy as a function of δ. A larger δ results in more candidates, and thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger δ, due to longer BK-tree search time as well as a larger number of candidate sequences for testing. In practice, we choose δ = 3 as a tradeoff between accuracy and speed.

表3清楚地详细显示了不同方法之间的差异,并充分证明了CRNN相对于其他竞争方法的优势。 另外,为了测试参数δ的影响,我们在式中试验了不同的δ值。 2.在图4中,我们将识别精度绘制为δ的函数。 δ越大,候选者越多,因此基于词典的转录更加准确。 另一方面,由于较长的BK树搜索时间以及用于测试的候选序列数量增加,计算成本随着δ的增加而增长。 实际上,我们选择δ= 3作为精度和速度之间的折衷。
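To make the role of δ concrete, here is a minimal Python sketch of a BK-tree queried for all lexicon words within edit distance δ of a predicted string. It is only an illustration under stated assumptions: the paper's implementation is in C++, the class `BKTree`, the function `edit_distance` and the toy lexicon are hypothetical, and the final step of Eq. 2 (scoring each candidate with the network and keeping the most probable one) is omitted.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Metric tree over words; each node is a (word, {distance: child}) pair."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, delta):
        """Return all stored words within edit distance delta of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= delta:
                results.append(word)
            # By the triangle inequality, only subtrees whose edge label lies
            # in [d - delta, d + delta] can contain a match.
            for dist, child in children.items():
                if d - delta <= dist <= d + delta:
                    stack.append(child)
        return sorted(results)


lexicon = BKTree(["hotel", "hostel", "motel", "house", "total"])
tight = lexicon.search("hotl", 1)   # ['hotel']
loose = lexicon.search("hotl", 2)   # ['hostel', 'hotel', 'motel', 'total']
```

The toy query shows the tradeoff discussed above: raising δ from 1 to 2 grows the candidate set from one word to four, and each extra candidate would need to be scored by the network.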


Figure 4. Blue line graph: recognition accuracy as a function of the parameter δ. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.

图4.蓝线图:识别精度作为函数参数δ。 红条:每个样本的词典搜索时间。 使用50k词典在IC03数据集上进行了测试。

 

3.4. Musical Score Recognition

A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff line detection and individual note recognition [29]. We cast OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords and assume the same major scale (C major) for all scores.

乐谱通常由排列在谱线上的音符序列组成。 识别图像中的乐谱被称为光学音乐识别(OMR)问题。 以前的方法通常需要图像预处理(主要是二值化),人员线检测和个人笔记识别[29]。 我们将OMR视为序列识别问题,并使用CRNN直接从图像中预测音符序列。 为简单起见,我们仅识别音高,忽略所有和弦,并为所有乐谱采用相同的大音阶(C大调)。

 

To the best of our knowledge, there exist no public datasets for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a fragment of score containing 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) “Clean”, which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) “Real-World”, which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.

据我们所知,目前尚无用于评估音高识别算法的公共数据集。 为了准备CRNN所需的训练数据,我们从[2]中收集了2650张图像。 每个图像包含一个分数片段,其中包含3至20个音符。 我们为所有图像手动标记地面真相标记序列(非ezpitches序列)。 通过旋转,缩放和受噪声破坏,以及通过将其背景替换为自然图像,可以将收集的图像增强到265k训练样本。 为了进行测试,我们创建了三个数据集:1)“ Clean”,其中包含从[2]中收集的260张图像。 示例如图5.a所示。 2)使用上面提到的扩充策略,从“清洁”创建的“合成”。 它包含200个样本,其中一些如图5.b所示。 3)“真实世界”,其中包含200张使用手机摄像头从乐谱中拍摄的乐谱片段图像。 示例如图5.c.1所示。

 


Figure 5. (a) Clean musical score images collected from [2]. (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.

图5.(a)从[2]收集的干净的乐谱图像。(b)合成的乐谱图像。 (c)用手机相机拍摄的真实分数图像。

Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Unlike the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single-directional LSTM. The network is trained on pairs of images and corresponding label sequences. Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].

由于训练数据有限,因此我们使用简化的CRNN配置以减少模型容量。 与选项卡中指定的配置不同。 如图1所示,删除了第4和第6卷积层,并将2层双向LSTM替换为2层单向LSTM。 在图像对和相应的标签序列对上训练网络。 两种方法可用于评估识别性能:1)片段准确性,即正确识别的得分片段的百分比; 2)平均编辑距离,即预测音高序列与基本事实之间的平均编辑距离。 为了进行比较,我们评估了两种商用OMR引擎,即Capella Scan [3]和PhotoScore [4]。
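The two evaluation measures can be computed as in the sketch below. It assumes that each prediction and each ground truth is a list of pitch labels, that a fragment counts as correct only when the two lists match exactly, and that "edit distance" means Levenshtein distance over pitch labels; the function names and the pitch strings are illustrative, not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of pitch labels."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        cur = [i]
        for j, xb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (xa != xb)))   # substitution
        prev = cur
    return prev[-1]


def evaluate(predictions, ground_truths):
    """Return (fragment accuracy in %, average edit distance)."""
    if not predictions or len(predictions) != len(ground_truths):
        raise ValueError("need equally sized, non-empty lists of sequences")
    n = len(predictions)
    pairs = list(zip(predictions, ground_truths))
    exact = sum(pred == truth for pred, truth in pairs)
    total_distance = sum(edit_distance(pred, truth) for pred, truth in pairs)
    return 100.0 * exact / n, total_distance / n


# Hypothetical pitch sequences for four score fragments.
preds = [["C4", "D4", "E4"], ["G4", "A4"], ["E4", "F4", "G4", "A4"], ["C5"]]
truth = [["C4", "D4", "E4"], ["G4", "B4"], ["E4", "G4", "A4"], ["C5"]]
fragment_accuracy, average_edit_distance = evaluate(preds, truth)
```

In this toy case two of the four fragments match exactly and the other two are each one edit away, so the fragment accuracy is 50% and the average edit distance is 0.5. The example also shows why both measures are reported: a fragment with a single wrong note counts as a complete failure for fragment accuracy but contributes only 1 to the edit distance.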

 

Table 4. Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. Performances are evaluated by fragment accuracies and average edit distance (“fragment accuracy/average edit distance”).

表4.在我们收集的三个数据集上,CRNN和两个商业OMR系统之间的音高识别精度比较。 通过片段精度和平均编辑距离(“片段准确性/平均编辑距离”)评估演奏。

 

Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performance drops significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting conditions, noise corruption and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only by itself, but also with the help of the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.

表4总结了结果。 CRNN大大优于两个商业系统。 Capella Scan和PhotoScore系统在Clean数据集上的表现相当不错,但在合成和真实数据上的性能却大大下降。 主要原因是他们依靠可靠的二值化来检测人员线和便条,但是由于不良的光照条件,噪声破坏和背景混乱,二值化步骤通常无法在合成的和真实的数据上进行。 另一方面,CRNN使用对噪声和失真具有高度鲁棒性的卷积特征。 此外,CRNN中的循环层可以利用分数中的上下文信息。 每个音符不仅可以自己识别,还可以被附近的音符识别。 因此,可以通过将它们与附近的音符进行比较来识别某些音符,例如 对比他们的垂直位置。

 

The results show the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and lacks many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.

结果显示了CRNN的普遍性,因为它可以轻松应用于其他基于图像的序列识别问题,而所需的领域知识最少。 与Capella Scan和PhotoScore相比,我们基于CRNN的系统仍是初步的,缺少许多功能。 但是,它为OMR提供了一种新方案,并且在音高识别方面显示出了令人鼓舞的功能。

 

4. Conclusion

In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produce predictions of different lengths. It runs directly on coarse-level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons the fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.

在本文中,我们提出了一种新颖的神经网络架构,称为卷积递归神经网络(CRNN),它融合了卷积神经网络(CNN)和递归神经网络(RNN)的优点。 CRNN能够拍摄不同尺寸的输入图像,并产生不同长度的预测。 它直接在粗糙级别的标签(例如单词)上运行,在训练阶段无需为每个单独的元素(例如字符)提供详细的注释。 此外,由于CRNN放弃了常规神经网络中使用的完全连接的层,因此它导致了更加紧凑和有效的模型。 所有这些特性使CRNN成为基于图像的序列识别的绝佳方法。

 

The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.

与传统方法以及其他基于CNN和RNN的算法相比,现场文本识别基准上的实验表明CRNN具有优异或极具竞争力的性能。 这证实了所提出算法的优点。 此外,CRNN在光学音乐识别(OMR)的基准上明显优于其他竞争对手,这证明了CRNN的普遍性。

 

Actually, CRNN is a general framework, so it can be applied to other domains and problems that involve sequence prediction in images (such as Chinese character recognition). Further speeding up CRNN and making it more practical in real-world applications is another direction worthy of exploration in the future.

实际上,CRNN是一个通用框架,因此可以应用于涉及图像序列预测的其他领域和问题(例如汉字识别)。 进一步加快CRNN的速度,使其在实际应用中更加实用是另一个值得未来探索的方向。
