Translation of the paper "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR"

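This paper describes the changes made to Tesseract to support multilingual OCR. I happened to be working on something related, so I translated it along the way. Overall, reading it gave me some useful ideas for my own work, so I am sharing it here. So far I have translated only the parts on layout analysis, character preprocessing, and classifier construction. The post-processing part is not translated yet because I do not need it for now; I will add it later if I do. Readers who need it can download and read the paper themselves; the link is below.

Paper download link: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35248.pdf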


 

Abstract

摘要

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.

我们将介绍了如何适应多个脚本和语言的Tesseract开源OCR工具。 我们工作的重点的重点是使其能够通用的多语言而可以忽略不同语言语言必须提供一个文本语料库之外,还需要进行分类。 虽然需要改变各种模块,包括版面分析和语言后处理,除了这几个限制改变之外,没有改变字符分类器。 该分类器易于适应简体中文,英语测试结果,欧洲语言的混合,而对于俄文,从随机的书籍样本中,显示出单词错误率在3.72%至5.78%之间,简体中文的字符错误率仅为3.77%。

Keywords

关键字

Tesseract, Multi-Lingual OCR.

Tesseract,多语言OCR

1. Introduction

1. 介绍

Research interest in Latin-based OCR faded away more than a decade ago, in favor of Chinese, Japanese, and Korean (CJK) [1,2], followed more recently by Arabic [3,4], and then Hindi [5,6]. These languages provide greater challenges specifically to classifiers, and also to the other components of OCR systems. Chinese and Japanese share the Han script, which contains thousands of different character shapes. Korean uses the Hangul script, which has several thousand more of its own, as well as using Han characters. The number of characters is one or two orders of magnitude greater than Latin. Arabic is mostly written with connected characters, and its characters change shape according to the position in a word. Hindi combines a small number of alphabetic letters into thousands of shapes that represent syllables. As the letters combine, they form ligatures whose shape only vaguely resembles the original letters. Hindi then combines the problems of CJK and Arabic, by joining all the symbols in a word with a line called the shiro-reka.

Research approaches have used language-specific work-arounds to avoid the problems in some way, since that is simpler than trying to find a solution that works for all languages. For instance, the large character sets of Han, Hangul, and Hindi are mostly made up of a much smaller number of components, known as radicals in Han, Jamo in Hangul, and letters in Hindi. Since it is much easier to develop a classifier for a small number of classes, one approach has been to recognize the radicals [1, 2, 5] and infer the actual characters from the combination of radicals. This approach is easier for Hangul than for Han or Hindi, since the radicals don't change shape much in Hangul characters, whereas in Han, the radicals often are squashed to fit in the character and mostly touch other radicals. Hindi takes this a step further by changing the shape of the consonants when they form a conjunct consonant ligature. Another example of a more language-specific work-around is for Arabic, where it is difficult to determine the character boundaries to segment connected components into characters. A commonly used method is to classify individual vertical pixel strips, each of which is a partial character, and combine the classifications with a Hidden Markov Model that models the character boundaries [3].

Google is committed to making its services available in as many languages as possible [7], so we are also interested in adapting the Tesseract Open Source OCR Engine [8, 9] to many languages. This paper discusses our efforts so far in fully internationalizing Tesseract, and the surprising ease with which some of it has been possible. Our approach is to use language-generic methods, to minimize the manual effort to cover many languages.

十多年前,对基于拉丁语的OCR的研究兴趣逐渐消退,取而代之的是汉语、日语和韩语(CJK)[1,2],其次是阿拉伯语[3,4],然后是印地语[5,6]。 这些语言对分类器,以及OCR系统的其他组件,提出了更大的挑战。 中文和日文有相似的汉字结构,这些汉字有成千上万种不同的特征形状。韩语使用Hangul结构,这些结构也有几千多个和汉字结构一样的特征。 字符数比拉丁文大一到两个数量级。阿拉伯主要是用连接在一起的字符编写的,其字符根据单词中的位置而改变形状。 印地语将少量字母组合成数千种形状代表音节。 当字母组合在一起时,它们形成了语言,其形状只模糊地类似于原始字母。 印地语结合了CJK和阿拉伯语的问题,加入了所有的符号,用一句话来说叫做“希罗-雷卡”。

 本文研究解决特定语言的方案,使用某种方式避免问题,因为这比试图找到一种适用于所有语言的解决方案要简单。例如,大量的汉字、韩语和印地语的字符集少量的组件组成。 因为对于少数类来说它更容易产生分类器,一种方法是识别自由基[1,2,5],并从自由基的组合中推断出实际的字符。 这种方法对韩语来说比汉文或印地文更容易识别,因为在韩语文字中,激进分子的形状变化不大,而在汉文中,激进分子往往被挤压以适应字符和大多接触其他自由基。 印地语通过改变辅音的形状,在它们形成一个连音辅音连接时,采取了进一步的步骤。另一个具体语言工作的例子,对于阿拉伯语,很难确定字符边界,将连接的组件分割成字符。 一种常用的方法是对每个垂直像素条进行分类其中包含部分字符,并将隐马尔可夫模型分类与建模字符边界的相结合[3]。

 谷歌致力于以尽可能多的语言提供其服务[7],因此我们兴趣是将Tesseract开源OCR引擎[8,9]应用于多种语言。这篇文章讨论了我们到目前为止在Tesseract方面的努力,以及其中包含一些令人吃惊的结果。我们的研究使用语言通用方法,并尽可能涵盖更多的语言。

2. Review Of Tesseract For Latin

2. 识别拉丁文的Tesseract的框架

 


Fig. 1 is a block diagram of the basic components of Tesseract. The new page layout analysis for Tesseract [10] was designed from the beginning to be language-independent, but the rest of the engine was developed for English, without a great deal of thought as to how it might work for other languages. After noting that the commercial engines at the time were strictly for black-on-white text, one of the original design goals of Tesseract was that it should recognize white-on-black (inverse video) text as easily as black-on-white. This led the design (fortuitously as it turned out) in the direction of connected component (CC) analysis and operating on outlines of the components. The first step after CC analysis is to find the blobs in a text region. A blob is a putative classifiable unit, which may be one or more horizontally overlapping CCs, and their inner nested outlines or holes. A problem is detecting inverse text inside a box vs. the holes inside a character. For English, there are very few characters (maybe © and ®) that have more than 2 levels of outline, and it is very rare to have more than 2 holes, so any blob that breaks these rules is "clearly" a box containing inverse characters, or even the inside or outside of a frame around black-on-white characters.

 


Fig. 2 is a block diagram of the word recognizer. In most cases, a blob corresponds to a character, so the word recognizer first classifies each blob, and presents the results to a dictionary search to find a word in the combinations of classifier choices for each blob in the word. If the word result is not good enough, the next step is to chop poorly recognized characters, where this improves the classifier confidence. After the chopping possibilities are exhausted, a best-first search of the resulting segmentation graph puts back together chopped character fragments, or parts of characters that were broken into multiple CCs in the original image. At each step in the best-first search, any new blob combinations are classified, and the classifier results are given to the dictionary again. The output for a word is the character string that had the best overall distance-based rating, after weighting according to whether the word was in a dictionary and/or had a sensible arrangement of punctuation around it. For the English version, most of these punctuation rules were hard-coded.

 

The words in an image are processed twice. On the first pass, successful words, being those that are in a dictionary and are not dangerously ambiguous, are passed to an adaptive classifier for training. As soon as the adaptive classifier has sufficient samples, it can provide classification results, even on the first pass. On the second pass, words that were not good enough on pass 1 are processed for a second time, in case the adaptive classifier has gained more information since the first pass over the word.

From the foregoing description, there are clearly problems with this design for non-Latin languages, and some of the more complex issues will be dealt with in sections 3, 4 and 5, but some of the problems were simply complex engineering. For instance, the one-byte code for the character class was inadequate, but should it be replaced by a UTF-8 string, or by a wider integer code? At first we adapted Tesseract for the Latin languages, and changed the character code to a UTF-8 string, as that was the most flexible, but that turned out to yield problems with the dictionary representation (see section 5), so we ended up using an index into a table of UTF-8 strings as the internal class code.

图1是Tesseract基本组件的框图。 Tesseract[10]新版面分析从一开始就被设计成与语言无关,但引擎的其余部分是专门为英语而发展的,没有大量的考虑如何适用于其他语言。在注意到当时的商业化OCR工具都是是严格的黑白文本,Tesseract其中一个最初的设计目标是,它应该识别黑白(反视频)文本,就像黑白一样容易。 这就导致了设计(偶然的结果)的连接组件(CC)分析和操作的轮廓组件。在CC分析后的第一步是搜索文本块区域。文本块是一个假定的可分类单元,它可能是一个或多个水平重叠的CC,以及它们的内部嵌套轮廓或孔。一个问题是检测盒子里的反文本和包含字符文本的孔洞。 对于英语来说,很少有字符有超过2个层次的轮廓,而且很少有超过2个洞,所以任何文本块打破这些规则是一个“明显”的盒子,包含相反的字符,甚至框架的内部或外部围绕黑白字符。

 

 图2是单词识别器的框图。在大多数情况下,blob对应于一个字符,因此单词识别器首先对每个blob进行分类,并将结果提供给字典搜索在单词,这个单词是在分类器选择组合中找到的。如果单词结果不够好,下一步就是去除识别不好的字符,这样可以改善分类器置信度。在去除所有可能的识别之后,首先搜索得到的分割图可以将切碎的字符片段或字符的部分合并在一起。在最佳优先搜索的每一步,对任何新的BLOB组合进行分类,并将分类器结果再次给出字典。 一个单词的输出是一个字符串,并且它具有最好的基于距离的评分之后根据单词是否在字典中和/或它周围有一个合理的标点排列。 对于英文版,这些标点规则大多是固定的。

 图像中的单词被处理两次。首先,正确的单词,是那些在字典中并且不是模棱两可的词,这个词被传递给一个自适应分类器来训练。自适应分类器一旦有足够的样本,就可以提供正确的分类结果。第二次传递时,在第一次处理上不够好的单词被第二次处理,自适应分类器以获得了更多的信息。

 

 从前面的描述来看,这种拉丁语的设计显然有问题,一些更复杂的问题将在第3、4和5节中处理,但有些是专业的。瑕疵只是复杂的工程。例如,字符类的一个字节代码是不够的,但是应该用UT F-8字符串或更宽的整数代码来替换它吗? 一开始我们 Tesseract用于拉丁语言识别,并将字符代码更改为UT F-8字符串,因为这是最灵活的,但这导致了字典表示的问题因此,我们最终使用一个索引到UT F-8字符串的表中作为内部类代码。
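The last design choice above, an index into a table of UTF-8 strings used as the internal class code, can be sketched in a few lines. This is only an illustration of the idea: `ClassTable`, `add`, and `to_string` are hypothetical names, not Tesseract's actual data structures.

```python
class ClassTable:
    """Maps UTF-8 strings to small integer class ids and back (hypothetical helper)."""

    def __init__(self):
        self._strings = []  # class id -> string
        self._ids = {}      # string -> class id

    def add(self, text: str) -> int:
        """Return the id for `text`, assigning the next free index if it is new."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def to_string(self, class_id: int) -> str:
        """Look up the string that a class id stands for."""
        return self._strings[class_id]

    def __len__(self) -> int:
        return len(self._strings)


table = ClassTable()
# A class may be a multi-byte character or even a ligature of several characters.
ids = [table.add(s) for s in ["a", "é", "中", "fi", "a"]]
```

The classifier and dictionary can then work with fixed-width integers, while the variable-length UTF-8 strings are only touched when output is produced.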

3. Layout Preprocessing

3. 版面分析

Several aspects of the “textord” (text-ordering) module of Tesseract required changes to make it more language-independent. This section discusses these changes.

Tesseract的“Textord”(文本排序)模块的几个方面需要进行更改,以使其更加独立于语言。 本节讨论这些变化。

3.1 Vertical Text Layout

3.1 垂直文本版面

Chinese, Japanese, and Korean, to a varying degree, all read text lines either horizontally or vertically, and often mix directions on a single page. This problem is not unique to CJK, as English language magazine pages often use vertical text at the side of a photograph or article to credit the photographer or author. Vertical text is detected by the page layout analysis. If a majority of the CCs on a tab-stop have both their left side on a left tab and their right side on a right tab, then everything between the tab-stops could be a line of vertical text. To prevent false-positives in tables, a further restriction requires vertical text to have a median vertical gap between CCs that is less than the mean width of the CCs. If the majority of CCs on a page are vertically aligned, the page is rotated by 90 degrees and page layout analysis is run again to reduce the chance of finding false columns in the vertical text. The minority originally horizontal text will then become vertical text in the rotated page, and the body of the text will be horizontal.


As originally designed, Tesseract had no capability to handle vertical text, and there are a lot of places in the code where some assumption is made over characters being arranged on a horizontal text line. Fortunately, Tesseract operates on outlines of CCs in a signed integer coordinate space, which makes rotations by multiples of 90 degrees trivial, and it doesn't care whether the coordinates are positive or negative. The solution is therefore simply to differentially rotate the vertical and horizontal text blocks on a page, and rotate the characters as needed for classification. Fig. 3 shows an example of this for English text. The page in Fig. 3(a) contains vertical text at the lower-right, which is detected in Fig. 3(b), along with the rest of the text. In Fig. 4, the vertical text region is rotated 90 degrees clockwise, (centered at the bottom-left of the image), so it appears well below the original image, but in horizontal orientation.


Fig. 5 shows an example for Chinese text. The mainly-vertical body text is rotated out of the image, to make it horizontal, and the header, which was originally horizontal, stays where it started. The vertical and horizontal text blocks are separated in coordinate space, but all Tesseract cares about is that the text lines are horizontal. The data structure for a text block records the rotations that have been performed on a block, so that the inverse rotation can be applied to the characters as they are passed to the classifier, to make them upright. Automatic orientation detection [12] can be used to ensure that the text is upright when passed to the classifier, as vertical text could have characters that are in at least 3 different orientations relative to the reading direction. After Tesseract processes the rotated text blocks, the coordinate space is re-rotated back to the original image orientation so that reported character bounding boxes are still accurate.

中文、日语和韩语在不同程度上都是水平或垂直阅读文本行,并且通常在一个页面上混合方向。 这个问题不是CJK所独有的,就像英语杂志页面一样通常在照片或文章的侧面使用垂直文本来赞扬摄影师或作者。垂直文本通过页面布局分析检测。如果CC的多数在页面上,他们的左边在左边的标签页上,他们的右边在右边的标签上,那么标签之间的都可以是一行垂直文本。为防止表格中的误报,进一步的限制要求垂直文本之间的中位垂直间隙小于CC的平均宽度。如果页面上的大多数CC是垂直对齐的,则将页面旋转90度,再次执行页面布局分析,这样可以减少在垂直文本中找到错误列的机会。少数原本水平文本将成为垂直文本,在旋转的页面中,主体文本将是水平的。
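The table false-positive check described above (median vertical gap between CCs smaller than their mean width) is easy to state in code. The sketch below assumes bounding boxes given as `(left, top, right, bottom)` with y growing downward; the function name and box layout are illustrative assumptions, not Tesseract's API.

```python
from statistics import mean, median


def looks_like_vertical_text(boxes):
    """Return True when the median vertical gap between consecutive boxes
    is smaller than the mean box width.

    boxes: list of (left, top, right, bottom) tuples, y grows downward.
    """
    if len(boxes) < 2:
        return False
    ordered = sorted(boxes, key=lambda b: b[1])  # top to bottom
    gaps = [max(0, nxt[1] - cur[3]) for cur, nxt in zip(ordered, ordered[1:])]
    widths = [b[2] - b[0] for b in boxes]
    return median(gaps) < mean(widths)


# Tightly stacked characters: gap 2, width 10.
column_of_characters = [(0, y, 10, y + 10) for y in range(0, 60, 12)]
# Cells in a table column: gap 30, width 10.
table_cells = [(0, y, 10, y + 10) for y in range(0, 200, 40)]
```

Here `column_of_characters` passes the test while `table_cells` does not, which is the distinction the restriction is meant to draw.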

 

 正如最初设计的那样,Tesseract没有处理垂直文本的能力,代码中有很多地方对在水平文本上排列的字符做了一些假设。幸运的是,Tesseract在一个有符号整数坐标空间中对CC的轮廓进行操作,这使其按照90度的倍数的旋转,而不在乎坐标是否是一个正或负。因此,解决方案只是差异地旋转页面上的垂直和水平文本块,并根据分类需要旋转字符。图3是英语课文的一个例子。 图中的页面包含右下角的垂直文本,以及案文的其余部分。 在图4中,垂直文本顺时针旋转90度(以图像的左下角为中心),因此它看起来远低于原始图像,但在水平方向。

 

图5 给出了中文文本的一个例子。由垂直为主体的文本被旋转成水平得到的图像,而标题最初是水平的,则停留在它的原始为止。垂直文本块和水平文本块在坐标空间中是分开的,Tesseract关心的是文本行是水平的。文本块的数据结构记录块上执行的旋转,当它们被传递给分类器时,可就使用以逆旋转,使它们变成竖直。自动方向检测[12]可以用于确保文本在传递给分类器时是直立的,因为垂直文本可能具有相对于读取方向至少有3个不同方向的字符。之后镜像处理旋转的文本块,坐标空间被重新旋转回原始图像方向,以便字符的包围框仍然是准确的。
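The reason rotations by multiples of 90 degrees are "trivial" in a signed integer coordinate space is that they only swap and negate coordinates, so they are exact and exactly invertible. A minimal sketch, with hypothetical helper names and rotation about the origin in a y-up coordinate system:

```python
def rotate90(point, quarter_turns):
    """Rotate an integer point about the origin by quarter_turns * 90 degrees
    counter-clockwise. Uses only swaps and negation, so no rounding occurs."""
    x, y = point
    for _ in range(quarter_turns % 4):
        x, y = -y, x
    return (x, y)


def rotate_outline(outline, quarter_turns):
    """Apply the same quarter-turn rotation to every point of an outline."""
    return [rotate90(p, quarter_turns) for p in outline]


outline = [(3, 1), (5, 1), (5, 4), (3, 4)]
rotated = rotate_outline(outline, 1)     # coordinates may become negative
restored = rotate_outline(rotated, -1)   # inverse rotation recovers the input
```

Recording `quarter_turns` per block, as the text describes, is enough to undo the rotation later and report bounding boxes in the original image orientation.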

3.2 Text-line and Word Finding

3.2 文本行和单词查找

The original Tesseract text-line finder [11] assumed that CCs that make up characters mostly vertically overlap the bulk of the text line. The one real exception is i dots. For general languages this is not true, since many languages have diacritics that sit well above and/or below the bulk of the text-line. For Thai for example, the distance from the body of the text line to the diacritics can be quite extreme. The page layout analysis for Tesseract is designed to simplify text-line finding by sub-dividing text regions into blocks of uniform text size and line spacing. This makes it possible to force-fit a line-spacing model, so the text-line finding has been modified to take advantage of this. The page layout analysis also estimates the residual skew of the text regions, which means the text-line finder no longer has to be insensitive to skew.

 

The modified text-line finding algorithm works independently for each text region from layout analysis, and begins by searching the neighborhood of small CCs (relative to the estimated text size) to find the nearest body-text-sized CC. If there is no nearby body-text-sized CC, then a small CC is regarded as likely noise, and discarded. (An exception has to be made for dotted/dashed leaders, as typically found in a table of contents.) Otherwise, a bounding box that contains both the small CC and its larger neighbor is constructed and used in place of the bounding box of the small CC in the following projection.

A "horizontal" projection profile is constructed, parallel to the estimated skewed horizontal, from the bounding boxes of the CCs using the modified boxes for small CCs. A dynamic programming algorithm then chooses the best set of segmentation points in the projection profile. The cost function is the sum of profile entries at the cut points plus a measure of the variance of the spacing between them. For most text, the sum of profile entries is zero, and the variance helps to choose the most regular line-spacing. For more complex situations, the variance and the modified bounding boxes for small CCs combine to help direct the line cuts to maximize the number of diacriticals that stay with their appropriate body characters.

Once the cut lines have been determined, whole connected components are placed in the text-line that they vertically overlap the most, (still using the modified boxes) except where a component strongly overlaps multiple lines. Such CCs are presumed to be either characters from multiple lines that touch, and so need cutting at the cut line, or drop-caps, in which case they are placed in the top overlapped line. This algorithm works well, even for Arabic.

After text lines are extracted, the blobs on a line are organized into recognition units. For Latin languages, the logical recognition units correspond to space-delimited words, which is naturally suited for a dictionary-based language model. For languages that are not space-delimited, such as Chinese, it is less clear what the corresponding recognition unit should be. One possibility is to treat each Chinese symbol as a recognition unit. However, given that Chinese symbols are composed of multiple glyphs (radicals), it would be difficult to get the correct character segmentation without the help of recognition. Considering the limited amount of information that is available at this early stage of processing, the solution is to break up the blob sequence at punctuations, which can be detected quite reliably based on their size and spacing to the next blob. Although this does not completely resolve the issue of a very long blob sequence, which is a crucial factor in determining the efficiency and quality when searching the segmentation graph, this would at least reduce the lengths of recognition units into more manageable sizes.

As described in Section 2, detection of white-on-black text is based on the nesting complexity of outlines. This same process also rejects non-text, including halftone noise, black regions on the side, or large container boxes as in sidebar or reversed-video region. Part of the filtering is based on a measure of the topological complexity of the blobs, estimated based on the number of interior components, layers of nested holes, perimeter to area ratio, and so on. However, the complexity of Traditional Chinese characters, by any measure, often exceeds that of an English word enclosed in a box. The solution is to apply a different complexity threshold for different languages, and rely on subsequent analysis to recover any incorrectly rejected blobs.

原始的Tesseract文本行查找器[11]假设组成字符的CC大部分是垂直重叠文本行。一个真正的例外是I点。对于一般语言却不是这样的,因为许多语言都有远高于和/或低于大部分文本行。以泰语为例,从文本线的主体到解剖的距离可以非常极端。Tesseract进行版面分析主要是通过将文本区域细分为统一文本大小和行距的块来简化文本行查找。这就有可能为了强制拟合行间距模型,对文本行查找进行了修改以利用这一点。页面布局分析还估计了文本区域的剩余倾斜,这意味着文本行查找器不再需要对倾斜不敏感。

修改后的文本线查找算法从布局分析开始对每个文本区域独立工作,首先搜索小CC的邻域(相对于估计的文本大小)到找到最近的正文大小的CC。 如果附近没有正文大小的CC,那么一个小的CC被认为是可能的噪音,并被丢弃。(如在目录中找到的虚线/虚线来说是一个例外)。 否则, 在下面的投影中,构造并使用包含小CC及其较大邻居的包围框来代替小CC的包围框。

利用对小型CC的修改框,从CC的包围框中构造一个平行于估计倾斜水平的“水平”投影轮廓。动态规划算法选择投影轮廓中最佳分割点集。成本函数是切点的轮廓条目之和,加上它们之间间距的方差的度量。对于大多数文本,轮廓条目之和为零,并且方差有助于选择最规则的行间距。 对于更复杂的情况,将方差和修改后的小CC包围框结合起来,以帮助指导线切割,以最大限度地增加与其适当的主体特征保持一致的差异
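The dynamic-programming choice of cut points can be sketched as follows. The paper does not give the exact form of its spacing term, so this version swaps in a simpler penalty: the squared deviation of each gap from an expected line pitch, in the spirit of "force-fitting" a line-spacing model. The function name, the `pitch` and `spacing_weight` parameters, and the rule that the first and last rows are always cuts are all assumptions made for illustration.

```python
def choose_line_cuts(profile, pitch, spacing_weight=1.0):
    """Pick cut rows in a projection profile by dynamic programming.

    cost = sum of profile values at the chosen cuts
         + spacing_weight * sum of (gap - pitch)**2 over consecutive cuts
    The first and last rows are always cuts in this simplified version.
    """
    n = len(profile)
    best = [float("inf")] * n   # best[i]: cheapest cost of a cut sequence ending at row i
    prev = [-1] * n             # back-pointers for recovering the cuts
    best[0] = profile[0]
    for i in range(1, n):
        for j in range(i):
            cost = best[j] + spacing_weight * (i - j - pitch) ** 2 + profile[i]
            if cost < best[i]:
                best[i], prev[i] = cost, j
    cuts, i = [], n - 1
    while i != -1:
        cuts.append(i)
        i = prev[i]
    return cuts[::-1]


# Three text lines (rows with ink) separated by empty rows.
profile = [0, 5, 5, 5, 0, 5, 5, 5, 0, 5, 5, 5, 0]
cuts = choose_line_cuts(profile, pitch=4)
```

With clean text the profile term is zero at every chosen cut, so the spacing term alone decides between candidate cuts, which matches the behaviour the paragraph describes.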

 一旦确定了切割线,所有连接的组件都被放置在文本行中,它们垂直重叠最多,(仍然使用修改后的框),除非组件很强重叠多条线。这种CC被认为是来自多个接触线的字符,因此需要在切割线上切割,或者在这种情况下,它们被放置在顶部中 重叠的线。即使是阿拉伯语,这个算法很好地工作。

提取文本行后,将行上的块被组织成识别单元。对于拉丁语言,逻辑识别单元与空间分隔的单词相对应,这自然是适合的 用于基于字典的语言模型。对于没有空间分隔的语言,如汉语,不太清楚相应的识别单元应该是什么。一种可能性是将每个中文符号作为一个识别单元。然而,由于汉字符号是由多个字形(自由基)组成的,如果没有这些符号,就很难得到正确的字符分割 帮助识别。考虑到在处理的早期阶段可获得的信息量有限,解决方案是在标点时分解BLOB序列,可以根据它们的大小和间距来非常可靠地检测到下一个BLOB。 这可以根据它们的大小和间距来非常可靠地检测到下一个BLOB。 尽管这并不完全解决一个很长的BLOB序列的问题, 这是在搜索分割图时确定效率和质量的关键因素,这至少会将识别单元的长度减少到更易于管理的大小。

 如第2节所述,白对黑文本的检测是基于轮廓的嵌套复杂性。同样的过程也拒绝非文本,包括半色调噪声,侧面的黑色区域,或大容器盒,如侧栏或反向视频区域。 部分滤波是基于BLOB拓扑复杂性的度量,根据内部组件的数量、嵌套孔层、周长与面积比等来估计单倍体后代(代号)。然而,无论如何,繁体字的复杂程度,往往超过一个英语单词的盒子。 解决方案是对不同的语言应用不同的复杂性阈值,并依赖于后续的分析来恢复任何不正确拒绝的BLOB。

 

3.3 Estimating x-height in Cyrillic Text

3.3 评估文本的高度

After completing the text line finding step and organizing blocks of blobs into rows, Tesseract estimates x-height for each text line. The x-height estimation algorithm first determines the bounds on the maximum and minimum acceptable x-height based on the initial line size computed for the block. Then, for each line separately, the heights of the bounding boxes of the blobs occurring on the line are quantized and aggregated into a histogram. From this histogram the x-height finding algorithm looks for the two most commonly occurring height modes that are far enough apart to be the potential x-height and ascender height. In order to achieve robustness against the presence of some noise, the algorithm ensures that the height modes picked to be the x-height and ascender height have a sufficient number of occurrences relative to the total number of blobs on the line.

 

This algorithm works quite well for most Latin fonts. However, when applied as-is to Cyrillic, Tesseract fails to find the correct x-height for most of the lines. As a result, on a data set of Russian books the word error-rate of Tesseract turns out to be 97%. The reason for such high error rate is two-fold. First of all the ascender statistics in Cyrillic fonts differ significantly from Latin ones. Simply lowering the threshold for the expected number of ascenders per line is not an effective solution, since it is not infrequent that a line of text would contain one or no ascender letters. The second reason for such poor performance is a high degree of case ambiguity in Cyrillic fonts. For example, out of 33 upper-case modern Russian letters only 6 have a lower-case shape that is significantly different from the upper-case in most fonts. Thus, when working with Cyrillic, Tesseract can be easily misled by the incorrect x-height information and would readily recognize lower-case letters as upper-case.

 

Our approach to fixing the x-height problem for Cyrillic was to adjust the minimum expected number of ascenders on the line, take into account the descender statistics and use x-height information from the neighboring lines in the same block of text more effectively (a block is a text region identified by the page layout analysis that has a consistent size of text blobs and line spacing, and therefore is likely to contain letters of the same or similar font sizes).

For a given block of text, the improved x-height finding algorithm first tries to find the x-height of each line individually. Based on the result of this computation each line falls into one of the following four categories: (1) the lines where the x-height and ascender modes were found, (2) where descenders were found, (3) where a common blob height that could be used as an estimate of either cap-height or x-height was found, (4) the lines where none of the above were identified (i.e. most likely lines containing noise with blobs that are too small, too large or just inconsistent in size). If any lines from the first category with reliable x-height and ascender height estimates were found in the block, their height estimates are used for the lines in the second category (lines with descenders present) that have a similar x-height estimate. The same x-height estimate is utilized for those lines in the third category (no ascenders or descenders found), whose most common height is within a small margin of the x-height estimate. If the line-by-line approach does not result in finding any reliable x-height and ascender height modes, the statistics for all the blobs in the text block are aggregated and the same search for x-height and ascender height modes is repeated using this cumulative information.


As the result of the improvements described above the word error rate on a test set of Russian books was reduced to 6%. After the improvements the test set still contained some errors due to the failure to estimate the correct x-height of the text line. However, in many of such cases even a human reader would have to use the information from the neighboring blocks of text or knowledge about the common organization of the books to determine whether the given line is upper- or lower-case.

完成文本行查找步骤并将块块组织成行后,Tesseract估计每个文本行的x-高度。 x-高度估计算法首先根据初始线大小决定最大和最小可接受高度并以此来计算块的高度。然后,对于每一行,发生在该行上的的包围盒的高度分别为被量化并聚合成直方图。 从这个直方图中,x高度查找算法寻找两种最常见的高度模式,它们之间的距离足够远,足以成为潜在的x高度和上升高度。为了实现该算法对某些噪声的存在具有鲁棒性,保证了选择为xheight和ascender高度的高度模式相对总有足够的数量或出现。
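The per-line histogram step can be sketched like this. The quantization step, the separation ratio between the two modes, and the minimum share of blobs a mode must have are not given in the text, so `bucket`, `min_ratio`, and `min_fraction` below are assumed values chosen only to make the example concrete.

```python
from collections import Counter


def estimate_xheight(blob_heights, bucket=1, min_ratio=1.2, min_fraction=0.1):
    """Return (x_height, ascender_height), or None if no reliable pair exists.

    Heights are quantized into buckets and counted. Modes with too few blobs
    are ignored as noise, then the most common modes are paired until two
    are found that are far enough apart.
    """
    if not blob_heights:
        return None
    hist = Counter((h // bucket) * bucket for h in blob_heights)
    total = len(blob_heights)
    modes = [h for h, n in hist.most_common() if n >= min_fraction * total]
    for i, first in enumerate(modes):
        for second in modes[i + 1:]:
            low, high = sorted((first, second))
            if low > 0 and high >= min_ratio * low:
                return (low, high)
    return None


# Eight x-height blobs, three ascenders, and one noisy blob.
line_heights = [10] * 8 + [15] * 3 + [11]
estimate = estimate_xheight(line_heights)
```

A line with no second mode returns `None`, which corresponds to the case where the estimate has to come from elsewhere.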

 

 这种算法对大多数拉丁字体非常有效。 然而,当将应用于西里尔时,Tesseract未能为大多数行找到正确的xheight。 因此,在一组俄文数据上书籍的单词错误率结果是97%。出错率如此之高的原因是双重的。首先,西里尔字体的ascender统计数据与拉丁字体的差异很大 简单地降低每行预期提升次数的阈值并不是一个有效的解决办法,因为一行文本包含一个或不包含提升字母并非罕见。  表现不佳的第二个原因是西里尔字体的大小写模糊程度很高例如,在33个大写的现代俄语字母中,只有6个字母的小写形状是有意义的,这与大多数字体的大写截然不同。因此,当与西里尔字母合作时,Tesseract很容易被不正确的x-高度信息误导,并且很容易小写字母识别为大写。

 

 我们解决西里尔字母x高度问题的方法是调整上的最小期望上升次数,考虑到下降统计数据,并使用在同一文本块中x高度信相邻的行更有效(块是由页面布局分析识别的文本区域,具有一致的文本块大小和行行间距,并且在那里。因此可能包含相同或相似字体大小的字母)。 对于给定的文本块,改进的x高查找算法首先尝试单独查找每行的x高。基于计算的结果每一行有以下四种情况: (1)发现x高和ascender模式的线,(2)发现descender的线,(3)发现了一个常见的BLOB高度,可以作为帽高度或x高度的估计,(4)没有发现上述任何一条的线条(即。 最有可能的线条包含噪音与斑点太小,太大或只是不一致的大小)。如果在区块中发现了来自第一类的任何具有可靠x高度和ascender高度估计的线,则它们的高度估计用于第二类的线(d线) 有类似x-高度估计。 类似的x高度估计。同样的x-高度估计也被用于第三类(没有发现提升或下降)的那些线,其最常见的高度在x-高度估计的一小部分之内。 如果逐行方法不能找到任何可靠的x高度和上升高度模式,则对文本块中所有块的统计数据进行聚合,并使用此累积信息重复对x-高度和提升高度模式的相同搜索。

 

 由于上述改进,一套俄罗斯书籍测试的单词错误率降低到6%。 改进后,测试集仍然存在一些错误,因为我们错误的估计文本行的正确x高度。然而,在许多这样的情况下,即使是人类读者也必须使用来自相邻文本块的信息或了解书籍的共同组织,以确定给定是大写还是小写

4. Character / Word Recognition

4.  字符/单词识别

One of the main challenges to overcome in adapting Tesseract for multilingual OCR is extending what is primarily designed for alphabetical languages to handle ideographical languages like Chinese and Japanese. These languages are characterized by having a large set of symbols and lacking clear word boundaries, which pose serious tests for a search strategy and classification engine designed for well delimited words from small alphabets. We will discuss classification of large set of ideographs in the next section, and describe the modifications required to address the search issue first.

在使Tesseract适应多语种OCR方面要克服的主要挑战之一是扩展为字母语言设计的语言,以处理像汉语和日语这样的语言。这些语言的特点是有大量的符号集,且缺乏清晰的单词边界,这对为设计的搜索策略和分类引擎带来严重的考验。我们将在下一节中讨论大量表意文字的分类,并描述解决搜索问题所需的修改。

4.1 Segmentation and Search

4.1 分割和搜索

As mentioned in section 3.2, for non-space delimited languages like Chinese, recognition units that form the equivalent of words in western languages now correspond to punctuation-delimited phrases. Two problems need to be considered to deal with these phrases: they involve deeper search than typical words in Latin and they do not correspond to entries in the dictionary. Tesseract uses a best-first-search strategy over the segmentation graph, which grows exponentially with the length of the blob sequence. While this approach worked on shorter Latin words with fewer segmentation points and a termination condition when the result is found in the dictionary, it often exhausts available resources when classifying a Chinese phrase. To resolve this issue, we need to dramatically reduce the number of segmentation points evaluated in the permutation and devise a termination condition that is easier to meet.

In order to reduce the number of segmentation points, we incorporate the constraint of roughly constant character widths in a mono-spaced language like Chinese and Japanese. In these languages, characters mostly have similar aspect ratios, and are either full-pitch or half-pitch in their positioning. Although the normalized width distribution would vary across fonts, and the spacing would shift due to line justification and inclusion of digits or Latin words, which is not uncommon, by and large these constraints provide a strong guideline for whether a particular segmentation point is compatible with another. Therefore, using the deviation from the segmentation model as a cost, we can eliminate a lot of implausible segmentation states and effectively reduce the search space. We also use this estimate to prune the search space based on the best partial solution, making it effectively a beam search. This also provides a termination condition when no further expansion is likely to produce a better solution.

Another powerful constraint is the consistency of character script within a phrase. As we include shape classes from multiple scripts, confusion errors between characters across different scripts become inevitable. Although we can establish the dominant script or language for the page, we must allow for Latin characters as well, since the occurrence of English words inside foreign language books is so common. Under the assumption that characters within a recognition unit would have the same script, we would promote a character interpretation if it improves the overall script consistency of the whole unit. However, blindly promoting script characters based on prior could actually hurt the performance if the word or phrase is truly mixed script. So we apply the constraint only if over half the characters in the top interpretation belong to the same script, and the adjustment is weighted against the shape recognition score, like any other permutation.

如第3.2节所述,对于汉语等非空间定界语言,构成西方语言单词对等的识别单元对应于标点符号划定短语的界限。处理这些短语需要考虑两个问题:它们涉及比拉丁语中的典型词更深的搜索以及与字典中的条目不对应。在分割图像上,Tesseract使用了一种最优搜索策略,该策略随着BLOB序列的长度呈指数增长。 当在字典中找到结果时,这种方法只能在较短的拉丁词上工作且需要较少的词分割点和终止条件,在对汉语短语进行分类时,往往会耗尽可用的资源。为了解决这个问题,我们需要减少在排列中评估的分割点的数量,并设计一个更容易满足的终止条件。

 为了减少分割点的数量,我们将像中文和日文这样的大致恒定的字符宽度间距的语言中加入了大致恒定字符宽度的约束。在这些语言中,字符大多具有相似的纵横比,并且在它们的定位中要么是全音节,要么是半音节。虽然标准化的宽度分布在不同的字体之间会有所不同,而且由于线条的合理性和数字或拉丁词的包含,间距也会发生变化,这在很大程度上这些约束为特定的分割点是否与另一个分割点兼容提供了强有力的指导方针。因此,利用与分割模型的偏差作为代价,可以消除许多不可信的分割状态,有效地减少搜索空间。 我们还使用这个估计来修剪基于最佳部分解的搜索空间,使其有效地成为波束搜索。这也提供了一个终止条件,当没有进一步的扩展可能产生更好的解决方案。

另一个强大的约束是短语中字符脚本的一致性。由于我们包括来自多个脚本的形状类,不同脚本之间的字符之间的混淆错误变得不可避免。虽然我们可以为页面建立主导的脚本或语言,但我们也必须允许拉丁字符,因为英语单词在外语书籍中的出现是如此普遍。假设识别单元中的字符具有相同的脚本,如果它提高了整个单元的整体脚本一致性,我们将促进字符解释。然而,如果单词或短语是真正混合的脚本,盲目地推广基于先验的脚本字符实际上可能会损害性能。因此,只有当顶部解释中超过一半的字符属于同一个脚本时,我们才应用约束,并且调整与形状识别分数加权,就像每个脚本中的任何其他字符一样。
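The script-consistency rule can be illustrated with a small sketch. It assumes each candidate carries a script label and a shape score where lower means a better match, and that the adjustment is a fixed bonus; the paper only says the adjustment is weighted against the shape recognition score, so `bonus` and both function names are hypothetical.

```python
from collections import Counter


def dominant_script(scripts):
    """Return the script shared by more than half of the characters, else None."""
    if not scripts:
        return None
    script, count = Counter(scripts).most_common(1)[0]
    return script if count > len(scripts) / 2 else None


def pick_choice(choices, top_scripts, bonus=0.1):
    """Choose one candidate for a character position.

    choices: list of (char, script, shape_score), lower score is better.
    top_scripts: script labels of the characters in the current top
    interpretation of the whole unit. The bonus is applied only when one
    script covers over half of that interpretation.
    """
    target = dominant_script(top_scripts)
    if target is None:
        return min(choices, key=lambda c: c[2])
    return min(choices, key=lambda c: c[2] - (bonus if c[1] == target else 0.0))


choices = [("l", "Latin", 0.30), ("丨", "Han", 0.35)]
in_han_phrase = pick_choice(choices, ["Han", "Han", "Latin", "Han"])
in_mixed_phrase = pick_choice(choices, ["Han", "Latin"])
```

In the mostly-Han phrase the slightly worse Han candidate wins; in the evenly mixed phrase no script dominates, so the raw shape score decides.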

4.2 Shape Classification

4.2 形状分类

Classifiers for large numbers of classes are still a research problem, even today, especially when they are required to operate at the speeds needed for OCR [13, 14]. The curse of dimensionality is largely to blame. The Tesseract shape classifier works surprisingly well on 5000 Chinese characters without requiring any major modifications, so it seems to be well suited to large class-size problems. This result deserves some explanation, so in this section we describe the Tesseract shape classifier.

The features are components of a polygonal approximation of the outline of a shape. In training, a 4-dimensional feature vector of (x, y-position, direction, length) is derived from each element of the polygonal approximation, and clustered to form prototypical feature vectors. (Hence the name: Tesseract.) In recognition, the elements of the polygon are broken into shorter pieces of equal length, so that the length dimension is eliminated from the feature vector. Multiple short features are matched against each prototypical feature from training, which makes the classification process more robust against broken characters.


Fig. 7(a) shows an example prototype of the letter ‘h’ for the font Times Roman. The green line-segments represent cluster means of significant clusters that contain samples from almost every sample of ‘h’ in Times Roman. Blue segments are cluster means that were merged with another cluster to form a significant cluster. Magenta segments were not used, as they matched an existing significant cluster. Red segments did not contain enough samples to be significant, and could not be merged with any neighboring cluster to form a significant cluster.

Fig. 7(b) shows how the shorter features of the unknown match against the prototype to achieve insensitivity to broken characters. The short, thick lines are the features of the unknown, being a broken ‘h’ and the longer lines are the prototype features. Colors represent match quality: black -> good, magenta -> reasonable, cyan -> poor, and yellow -> no match. The vertical prototypes are all well matched, despite the fact that the h is broken.

The shape classifier operates in two stages. The first stage, called the class pruner, reduces the character set to a short-list of 1-10 characters, using a method closely related to Locality Sensitive Hashing (LSH) [13]. The final stage computes the distance of the unknown from the prototypes of the characters in the short-list.

Originally designed as a simple and vital time-saving optimization, the class pruner partitions the high-dimensional feature space, by considering each 3-D feature individually. In place of the hash table of LSH, there is a simple look-up table, which returns a vector of integers in the range [0, 3], one for each class in the character set, with the value representing the approximate goodness of match of that feature to a prototype of the character class. The vector results are summed across all features of the unknown, and the classes that have a total score within a fraction of the highest are returned as the shortlist to be classified by the second stage. The class pruner is relatively fast, but its time scales linearly with the number of classes and also with the number of features.

大量类的分类器仍然是一个研究问题; 即使在今天,特别是当他们被要求以OCR所需的速度工作时[13,14]。维度高在很大程度上是限制因素。Tesseract形状分类器在5000个汉字上的工作效果非常好,而不需要任何重大修改,因此它似乎非常适合于大类的分类问题。这一结果值得一些解释,因此在本节中我们描述了Tesseract形状分类器。

这些特征是形状轮廓的多边形近似的组成部分。在训练中,从多边形近似的每个元素导出一个(x,y位置,方向,长度)的四维特征向量,并聚类形成原型特征向量。在识别中,将多边形的元素分解成长度相等的较短的块,从而从特征向量中消除长度维数。多个短特征与训练中的每个原型特征相匹配,这使得分类过程对破碎字符更加健壮。
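The recognition-time step, breaking each polygon element into shorter pieces so that length drops out of the feature vector, might look like the sketch below. Representing each piece by its midpoint and the segment's angle is an assumption made for illustration, as are the function name and the `piece_len` parameter.

```python
import math


def short_features(x0, y0, x1, y1, piece_len):
    """Split one polygon segment into equal pieces of roughly `piece_len`
    and return one (x, y, direction) feature per piece.

    x, y is the midpoint of the piece; direction is the segment angle in
    radians. All pieces of a segment have the same length, so length is
    no longer part of the feature.
    """
    length = math.hypot(x1 - x0, y1 - y0)
    n = max(1, round(length / piece_len))
    direction = math.atan2(y1 - y0, x1 - x0)
    feats = []
    for k in range(n):
        t = (k + 0.5) / n  # fractional position of the k-th midpoint
        feats.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0), direction))
    return feats


features = short_features(0, 0, 10, 0, 2.5)
```

Because a long prototype feature is matched by several short features, losing a few of them to a break in the character still leaves the rest matching, which is where the robustness to broken characters comes from.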

图7(a)显示字体TimesRoman字母‘h’的示例原型。绿线片段表示重要簇的聚类手段,这些簇包含来自几乎每个“h”样本的TimesRoman样本。蓝色段是与另一个集群合并形成一个重要集群的集群手段。未使用Magenta片段,因为它们与现有的重要集群相匹配。红色段不包含足够大的样本,不能与任何相邻的集群合并,形成一个重要的集群。

 图7(b)说明未知的较短特征如何与原型匹配,以实现对破碎字符的不敏感。短而粗的线条是未知的特征,是一个破碎的“h”,而较长的线条是原型特征。颜色代表匹配质量:黑色->好,洋红->合理,青色->差,黄色->没有匹配。垂直原型都很匹配,尽管h被打破了。

 形状分类器分两个阶段工作。第一阶段,称为类修剪,使用与局部敏感散列(LS H)密切相关的方法,将字符集减少到1-10个字符的入围列表[13]。最后阶段计算与入围名单中字符原型的距离。

 最初设计为一个简单而重要的节省时间的优化,类修剪通过单独考虑每个三维特征来划分高维特征空间。代替LSH的哈希表,有一个简单的查找表,它返回范围[0,3]中的整数向量,字符集中每个类一个值表示将该特征的匹配性与字符类的原型进行近似。 向量的结果是在未知的所有特征之间求和的,总分数在最高分数的类被返回作为被分类的结果。类修剪相对较快,但其时间与类数和特征数成线性关系
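The class pruner's sum-and-shortlist step can be sketched as below. How features are quantized into table keys and what fraction defines the shortlist are not specified in the text, so the tuple keys, `keep_fraction`, and the function name are assumptions.

```python
def prune_classes(features, lookup, num_classes, keep_fraction=0.8):
    """Return the ids of classes whose summed score is close to the best.

    features: iterable of quantized 3-D features used as table keys.
    lookup: dict mapping a feature key to a list of num_classes integers
            in the range [0, 3], one per class.
    """
    totals = [0] * num_classes
    for f in features:
        scores = lookup.get(f)
        if scores is None:
            continue  # this feature bucket has no entry in the table
        for c, s in enumerate(scores):
            totals[c] += s
    best = max(totals)
    if best == 0:
        return []
    return [c for c, t in enumerate(totals) if t >= keep_fraction * best]


lookup = {
    (1, 1, 0): [3, 2, 0],
    (1, 2, 0): [3, 3, 0],
    (2, 2, 1): [2, 3, 1],
}
unknown = [(1, 1, 0), (1, 2, 0), (2, 2, 1), (9, 9, 9)]
shortlist = prune_classes(unknown, lookup, num_classes=3)
```

The nested loop over features and classes makes the linear scaling mentioned in the text visible: doubling either the number of classes or the number of features doubles the work.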
