OCR Image Recognition Engine

Deep Learning

With the rapid growth of digitization, digitized content has become crucial for data processing, storage, and transmission. Optical Character Recognition (“OCR”) is the process of converting typed, handwritten, or printed text into a digital format that is editable, searchable, and interpretable, removing the need for manual data entry.

More often than not, scanned documents contain noise that prevents OCR from recognizing the full content of the text. Scanning often introduces noise such as watermarks, background noise, blur from camera motion or shake, faded text, wrinkles, or coffee stains. This noise poses readability challenges for current text recognition algorithms and significantly degrades their performance.

Fig 2. Scanned document converted into a text document using OCR

Basic OCR Pre-processing Methods

Binarization converts a color image into an image consisting of only black and white pixels by applying a fixed threshold.
Skew correction determines the skew angle of the document image and then corrects the image based on that angle.
Noise removal smooths the image by removing small dots or patches whose intensity is higher than that of the rest of the image.
Thinning and skeletonization make the stroke width of handwritten text uniform, since different writers have different writing styles.
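Two of these steps are easy to sketch in a few lines. The snippet below is a minimal illustration, not the pipeline used in this project: the function names, the threshold of 128, and the choice of a 3 × 3 median filter for noise removal are all assumptions made for the example.

```python
import numpy as np

def binarize(gray, threshold=128):
    """Map a grayscale image (0-255) to pure black (0) and white (255)."""
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)

def median_denoise(gray, size=3):
    """Replace each pixel with the median of its size x size neighborhood."""
    pad = size // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    # Collect every shifted view of the image, then take the per-pixel median.
    windows = [padded[dy:dy + h, dx:dx + w]
               for dy in range(size) for dx in range(size)]
    return np.median(np.stack(windows), axis=0).astype(gray.dtype)

# Toy 5x5 "page": light background with one isolated dark speck (assumed data).
page = np.full((5, 5), 230, dtype=np.uint8)
page[2, 2] = 10

cleaned = median_denoise(page)   # the speck is replaced by the background value
binary = binarize(cleaned)       # every remaining pixel maps to white
```

Running the median filter before thresholding matters here: binarizing the raw page first would turn the speck into a solid black pixel.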

CycleGAN as an Advanced OCR Pre-processing Method

Generative Adversarial Networks (“GANs”) are deep learning-based generative models. The GAN architecture involves two sub-models: a generator that produces new examples, and a discriminator that classifies examples as either real (drawn from the domain) or fake (produced by the generator).

Fig 3. Conceptual diagram of the GAN network (source: https://developers.google.com/machine-learning/gan/gan_structure)

CycleGAN, implemented in TensorFlow, was selected as the advanced OCR pre-processing method. An advantage of CycleGAN is that it does not require paired training data. In a paired data set, every data point in one sample is uniquely matched to a data point in the other sample, for example each noisy scan matched to a clean version of the same page.

While input and output data are still required, they do not need to correspond directly to each other. Since paired data is hard to find in most domains, the unsupervised training capability of CycleGAN is very useful.

In the absence of paired training images, CycleGAN learns a mapping from the distribution of noisy images to the distribution of denoised images using unpaired data, achieving image-to-image translation that cleans noisy documents.

Image-to-image translation is the process of transforming an image from one domain (i.e. noisy document images) to another (i.e. clean document images). Only the features that distinguish the two domains, such as the background, should change; other features of the image, such as the text, should stay recognizably the same.

CycleGAN Architecture

The CycleGAN architecture comprises two pairs of generators and discriminators. Each generator has a corresponding discriminator, which attempts to distinguish the generator's synthesized images from real ones. As with any GAN, the generators and discriminators learn adversarially: each generator attempts to “fool” its discriminator, while each discriminator learns not to be “fooled”.

For the generator to preserve the text of the dirty documents, the model computes the cycle consistency loss, which measures how closely an image that has been translated to the other domain and back resembles its original version.

Fig 4(a) Conversion of original dirty input to its translated clean output

Fig 4(b) Conversion of original clean input to its translated dirty output

The first generator, G-x2y, converts an original dirty input into a translated clean output. Its discriminator, D-y, evaluates whether the translated clean output is a real or a generated image, and outputs the probability that the evaluated image is real.

The second generator, G-y2x, converts an original clean input into a translated dirty output. Its discriminator, D-x, tries to tell real dirty images apart from generated ones. The model is trained in both directions, with a set of dirty images and a set of clean images, as illustrated above.
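The two training directions and the round trip behind the cycle consistency loss can be sketched with toy stand-ins for the generators. This is only an illustration under stated assumptions: the real G-x2y and G-y2x are trained neural networks, whereas the functions below simply lighten or darken pixel values in [0, 1], and the two image batches are random arrays rather than document scans.

```python
import numpy as np

def g_x2y(dirty):
    """Toy stand-in for the dirty-to-clean generator: lightens the page."""
    return np.clip(dirty + 0.2, 0.0, 1.0)

def g_y2x(clean):
    """Toy stand-in for the clean-to-dirty generator: darkens the page."""
    return np.clip(clean - 0.2, 0.0, 1.0)

def cycle_consistency_loss(original, reconstructed):
    """Mean absolute error between an image and its round-trip reconstruction."""
    return float(np.mean(np.abs(original - reconstructed)))

rng = np.random.default_rng(0)
dirty = rng.uniform(0.2, 0.8, size=(4, 4))   # sample from the dirty domain
clean = rng.uniform(0.2, 1.0, size=(4, 4))   # unpaired sample from the clean domain

# Direction 1: dirty -> translated clean -> reconstructed dirty.
forward_cycle = cycle_consistency_loss(dirty, g_y2x(g_x2y(dirty)))
# Direction 2: clean -> translated dirty -> reconstructed clean.
backward_cycle = cycle_consistency_loss(clean, g_x2y(g_y2x(clean)))
total_cycle_loss = forward_cycle + backward_cycle
```

Because the two toy generators invert each other exactly on these value ranges, both round trips reproduce their inputs and the loss is essentially zero. Note that `dirty` and `clean` are never compared with each other, which is why unpaired data suffices.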

Methodology and Design

Background noise removal is the process of removing background noise such as uneven contrast, background spots, dog-eared pages, faded sunspots, and wrinkles from documents. Background noise limits OCR performance because it makes the text difficult to differentiate from its background.

The CycleGAN model was trained on the Kaggle Document Denoising Dataset, which consists of documents with various forms of noise such as coffee stains, faded sunspots, dog-eared pages, and wrinkles.

Fig 5. Types of dirty documents in the Kaggle Document Denoising Dataset

To fine-tune the model training, synthetic text generation was performed to introduce more noise in addition to the Kaggle dataset. This was done with DocCreator, an open-source, multi-platform program that can create virtually unlimited amounts of different synthetic document images with ground truth from a small number of real images. Various realistic degradation models were applied to the original corpus to generate the synthetic images.

Fig 6. Types of dirty documents in the synthetic text generated

The training data are grouped under the folders trainA and trainB, which contain the noisy and clean document images. The validation data are grouped under the folders testA and testB, which likewise contain noisy and clean document images. A test dataset of unseen noisy document images under the TEST folder was used to test the trained network and to evaluate how well the models remove background noise from document images.

Fig 7. Breakdown of training, validation and test data

For model training, the Adam optimizer was used with a learning rate of 0.0002 and a momentum of 0.5, with noisy input images of size 256 × 512. Given the hardware constraints, the best results were obtained by training the CycleGAN model for 300 epochs with a batch size of 3.
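To make these optimizer settings concrete, here is a minimal NumPy sketch of a single Adam update. It assumes that the "momentum of 0.5" refers to Adam's first-moment decay rate (usually written beta_1); the second-moment decay rate and epsilon are common defaults that the text does not state, and the quadratic objective is purely illustrative.

```python
import numpy as np

LEARNING_RATE = 0.0002   # from the text
BETA_1 = 0.5             # assumption: the "momentum of 0.5" is Adam's beta_1
BETA_2 = 0.999           # assumption: common default, not stated in the text
EPSILON = 1e-8           # assumption: common default, not stated in the text

def adam_step(param, grad, m, v, t):
    """Apply one Adam update and return the new parameter and moment estimates."""
    m = BETA_1 * m + (1.0 - BETA_1) * grad        # running mean of gradients
    v = BETA_2 * v + (1.0 - BETA_2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - BETA_1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - BETA_2 ** t)
    param = param - LEARNING_RATE * m_hat / (np.sqrt(v_hat) + EPSILON)
    return param, m, v

# Illustrative objective f(w) = w^2, whose gradient is 2w, starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for step in range(1, 101):
    w, m, v = adam_step(w, 2.0 * w, m, v, step)
```

With a learning rate this small, each step moves the parameter by roughly 0.0002, so 100 steps only bring `w` from 1.0 to about 0.98. This is one reason a few hundred epochs were needed.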

Evaluation of Results

A factorial design experiment was performed to examine how multiple factors affect a dependent variable, both independently and together. With training parameters such as the Adam optimizer, input image size, and batch size kept constant, four different factors were evaluated in this project.

Fig 8. Factors considered in this experiment

Due to the complexity of the CycleGAN architecture, an in-depth evaluation of CycleGAN performance was carried out using the following metrics:

The discriminator loss function takes two inputs: real images and generated images. The real loss is the sigmoid cross-entropy between the discriminator's output for the real images and an array of ones, since these are the real images. The generated loss is the sigmoid cross-entropy between the discriminator's output for the generated images and an array of zeros, since these are the fake images. The total discriminator loss is the sum of the mean square error of the real loss and the generated loss.
Accuracy is calculated by expressing the total discriminator loss as a percentage. The lower the discriminator loss, the higher the accuracy.
Generator loss is the sigmoid cross-entropy between the discriminator's output for the generated images and an array of ones. It also includes the L1 loss, the mean absolute error between the generated image and the target image, which pushes the generated image to become structurally similar to the target image.
For cycle consistency loss, the original dirty image is passed through the first generator to yield a generated image. This generated image is then passed through the second generator to yield the reconstructed image. The mean absolute error is calculated between the original dirty image and the reconstructed image. The lower the mean absolute error, the more structurally similar the reconstructed image is to the original dirty image.
Peak signal-to-noise ratio (“PSNR”) is the ratio between the maximum possible power of a signal and the power of the distorting noise that degrades the quality of its representation. PSNR is usually expressed in terms of the mean squared error. The higher the PSNR value, the better the image quality.
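The adversarial losses and PSNR described above can be sketched in NumPy as follows. This is a hedged illustration rather than the project's TensorFlow code: the function names are made up for the example, the discriminator outputs are assumed to be raw logits, the total discriminator loss is taken as the plain sum of the real and generated losses, and the L1 term of the generator loss is left out for brevity.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Numerically stable mean sigmoid cross-entropy between logits and labels."""
    return float(np.mean(np.maximum(logits, 0) - logits * labels
                         + np.log1p(np.exp(-np.abs(logits)))))

def discriminator_loss(real_logits, generated_logits):
    """Real images are compared against ones, generated images against zeros."""
    real_loss = sigmoid_cross_entropy(real_logits, np.ones_like(real_logits))
    generated_loss = sigmoid_cross_entropy(generated_logits,
                                           np.zeros_like(generated_logits))
    return real_loss + generated_loss   # assumption: plain sum of the two terms

def generator_loss(generated_logits):
    """The generator wants its images to be scored as real (ones)."""
    return sigmoid_cross_entropy(generated_logits, np.ones_like(generated_logits))

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in decibels, computed from the MSE."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images: no distorting noise at all
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Hypothetical discriminator outputs (logits) for two real and two generated images.
real_logits = np.array([4.0, 5.0])
confident = discriminator_loss(real_logits, np.array([-4.0, -5.0]))  # fakes spotted
fooled = discriminator_loss(real_logits, np.array([4.0, 5.0]))       # fakes pass as real
```

Here `confident` is much smaller than `fooled`, while `generator_loss` behaves the opposite way on the same logits. That opposition is the adversarial tension between the two networks.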

Fig 9. Average results for quantifiable performance evaluation metrics

Comparing the results across the four factors, the significant improvement in accuracy from 61% to 99% reflects a reduced discriminator loss during the translation from original images to generated images. As the generator loss decreased, the generated image became structurally similar to the original image.

Cycle consistency loss improved considerably with more training epochs and larger datasets. The decrease in mean absolute error meant that the reconstructed image was structurally closer to the original dirty image. No significant difference was noted in the PSNR value, as high image quality was obtained.

Let’s take a look at the learning curves from training the model for 300 epochs on the combined dataset.

Fig 10(a) Generator and Discriminator Losses

Fig 10(b) Training Cycle Consistency Loss

Fig 10(c) Training Accuracy

The learning curves above, derived from the training dataset, give us an idea of how well the model is learning. They were evaluated on the individual batches during each forward pass.

The loss curves reflect a good fit, as the losses decreased to a point of stability. The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the wiggle is relatively high. When the batch size is the full dataset, the wiggle is minimal because every gradient update should improve the loss function monotonically (unless the learning rate is set too high).
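The link between batch size and wiggle can be illustrated with a small simulation. It covers only one part of the picture, namely how noisy the per-batch loss estimate is, and it runs on made-up per-sample losses rather than on the CycleGAN training logs; the dataset size of 300 and the loss distribution are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample losses for a dataset of 300 images.
per_sample_loss = rng.normal(loc=1.0, scale=0.5, size=300)

def batch_loss_spread(batch_size, draws=2000):
    """Standard deviation of the mean loss over randomly drawn batches."""
    means = [rng.choice(per_sample_loss, size=batch_size, replace=False).mean()
             for _ in range(draws)]
    return float(np.std(means))

# Batch size 1, the batch size of 3 used in this project, and the full dataset.
spread = {size: batch_loss_spread(size) for size in (1, 3, 300)}
```

The spread shrinks as the batch grows and vanishes for the full dataset, because every full batch contains the same samples. With a batch size of 3, a visible amount of wiggle in the curves is therefore expected.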

As for the training accuracy plot, there was some volatility given the stochastic nature of the learning algorithm. A low learning rate of 0.0002 was used together with a momentum of 0.5 to ensure that the model learned well.

Template matching is a method for searching for and locating a template image within a larger image. The template image is slid over the input image, and at each position the template is compared with the patch of the input image beneath it. The result is a grayscale image in which each pixel denotes how well the neighborhood of that pixel matches the template.
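The sliding comparison can be written out directly. The sketch below assumes a sum-of-squared-differences score, where lower means a better match; the article does not say which comparison method was used, and the tiny image and template are invented for the example.

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image and score every position.

    Returns a 2-D map of sum-of-squared-differences scores, one entry per
    valid top-left position of the template. Lower scores mean closer matches.
    """
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.empty((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = image[y:y + th, x:x + tw]
            scores[y, x] = np.sum((patch - template) ** 2)
    return scores

# Toy data: a blank 6x8 image with the template planted at row 3, column 4.
image = np.zeros((6, 8))
template = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
image[3:5, 4:6] = template

scores = match_template(image, template)
best_y, best_x = np.unravel_index(np.argmin(scores), scores.shape)
```

The `scores` array plays the role of the grayscale result image, and its minimum marks the detected location. Libraries such as OpenCV provide an optimized version of this operation as `cv2.matchTemplate`.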

Fig 11(a) Template matching for 100 epochs using only the Kaggle Dataset

Fig 11(b) Template matching for 300 epochs using only the Kaggle Dataset

Fig 11(c) Template matching for 100 epochs using only the synthetic text generated

Fig 11(d) Template matching for 300 epochs using the combined dataset

Overall, the CycleGAN model performed well in template matching. The matching result is a grayscale image in which the intensity of each pixel indicates how well the neighborhood of that pixel matched the template image. The detected points show good coverage of the text being detected and matched to the template image.

Mean squared error (“MSE”) measures the average of the squares of the errors, i.e. the average squared difference between the estimated values and the actual values. An MSE of zero indicates perfect similarity; larger values indicate less similarity, and the MSE keeps increasing as the average difference between pixel intensities increases.
Structural similarity index (“SSIM”) attempts to model the perceived change in the structural information of the image, whereas MSE estimates the absolute pixel-wise error. Unlike MSE, the SSIM value ranges from negative one to one, where one indicates perfect similarity.
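Both metrics can be sketched in NumPy. One loud caveat: the SSIM below is a simplified global variant that treats the whole image as a single window, whereas standard SSIM implementations average the index over small local windows, so the values will differ from those reported in this article. The constants follow the usual choice of (0.01 × L)² and (0.03 × L)² for a dynamic range L, and the test images are random arrays.

```python
import numpy as np

def mse(a, b):
    """Mean squared difference between two images of the same shape."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def global_ssim(a, b, max_value=255.0):
    """Simplified SSIM computed over the whole image as one window."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * max_value) ** 2   # stabilizes the luminance term
    c2 = (0.03 * max_value) ** 2   # stabilizes the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16)).astype(np.uint8)
inverted = 255 - image   # same structure with reversed contrast

same_mse, same_ssim = mse(image, image), global_ssim(image, image)
inv_mse, inv_ssim = mse(image, inverted), global_ssim(image, inverted)
```

Comparing an image with itself gives an MSE of zero and an SSIM of one. The inverted image produces a large MSE and a negative SSIM, showing the lower end of the SSIM range.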

Fig 12(a) Output for 100 epochs using only the Kaggle Dataset

Fig 12(b) Output for 300 epochs using only the Kaggle Dataset

Fig 12(c) Output for 100 epochs using only the synthetic text generated

Fig 12(d) Output for 300 epochs using the combined dataset

With more training epochs and larger datasets, the MSE decreased notably and the SSIM value improved. This indicates higher structural similarity and smaller differences in pixel intensity between the test image and the generated output image.

Image subtraction is the process of subtracting the dirty image from the generated image. The purpose is to compare the pixel intensities of both images to ascertain how well the trained model has cleaned the dirty image.
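A minimal sketch of the subtraction step is shown below. It assumes an absolute per-pixel difference on 8-bit grayscale images; the article does not specify whether the subtraction was absolute or saturating, and the toy images are invented for the example.

```python
import numpy as np

def subtract_images(generated, dirty):
    """Absolute per-pixel difference between two uint8 images.

    The arithmetic is done in a signed type first, because subtracting
    uint8 arrays directly would wrap around instead of going negative.
    """
    diff = generated.astype(np.int16) - dirty.astype(np.int16)
    return np.abs(diff).astype(np.uint8)

# Toy data: a white page with one stained pixel, and its cleaned version.
dirty = np.full((4, 4), 255, dtype=np.uint8)
dirty[1, 1] = 180
generated = np.full((4, 4), 255, dtype=np.uint8)

difference = subtract_images(generated, dirty)
```

In this toy case the difference image is black everywhere except at the stained pixel, where it holds the intensity change of 75.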

Fig 13(a) Image subtraction output for 100 epochs (left) and 300 epochs (right) using only the Kaggle Dataset

Fig 13(b) Image subtraction output for 100 epochs using only the synthetic text generated (left), and 300 epochs using the combined dataset (right)

The test image was subtracted from the generated output image to ascertain how well the trained model had cleaned the dirty image. Comparing the pixel intensities of the output images, CycleGAN obtained favorable results (i.e. an almost fully black output image) with a higher number of training epochs on the combined dataset.

The sample output images generated by CycleGAN for the various training parameters and factors are shown below.

Original column: original dirty/clean image;

Translated column: corresponding clean/dirty image produced by translating the original image;

Reconstructed column: image reconstructed back to the original dirty/clean state from the translated image.

Fig 14(a) Output image trained for 100 epochs using only the Kaggle Dataset

Fig 14(b) Missing text in the translated image

The output image trained for 100 epochs using only the Kaggle Dataset was not ideal for OCR. Text was missing in both the translated and reconstructed images.

Fig 15(a) Output image trained for 300 epochs using only the Kaggle Dataset

Fig 15(b) Missing text in the translated image

The output image trained for 300 epochs using only the Kaggle Dataset showed gradual improvement. With more training epochs, missing text occurred less often in both the translated and reconstructed images.

Fig 16(a) Output image trained for 100 epochs using only the synthetic text generated

Fig 16(b) Legibility of text in the translated image

The output image trained for 100 epochs using only the synthetic text generated was illegible. Increasing the number of training epochs would help improve the readability of the output images.

Fig 17. Output image trained for 300 epochs using the combined dataset

As seen in the output image above, trained for 300 epochs on the combined dataset, CycleGAN's performance improved significantly with more training epochs. Adding noise through synthetic text generation increased the amount of training data, thereby improving model performance. Deep networks such as CycleGAN perform better with large amounts of training data. Exposing the model to more training samples also mitigated the risk of over-fitting.

Conclusion

CycleGAN has proven to be an effective engine for denoising and cleaning up documents for OCR.

In the absence of paired training data, CycleGAN’s use of cycle consistency loss addressed the problem of learning meaningful transformations from unpaired data. It allows the generator to produce clean images that preserve the text of the dirty input image through image-to-image translation. More training epochs, combined with a larger dataset obtained by generating synthetic text, also helped to significantly improve the performance of the CycleGAN model.


Translated from: https://medium.com/towards-artificial-intelligence/cyclegan-as-a-denoising-engine-for-ocr-images-8d2a4988f769