Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
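This is a 2014 paper. Its main contribution is a method for synthesizing natural scene text; deep neural networks trained on the highly realistic data it produces recognize scene text with good results. The synthesis method proposed in the paper has since become a commonly used way to generate OCR datasets in bulk.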
considering that much of the text found in natural scenes is computer-generated and only the physical rendering process (e.g. printing, painting) and the imaging process (e.g. camera, viewpoint, illumination, clutter) are not controlled by a computer algorithm.
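The paper argues that most of the text found in natural scenes is computer-generated, and that only the physical rendering process (e.g. printing, painting) and the imaging process (e.g. camera, viewpoint, illumination, clutter) are outside the control of a computer algorithm. A synthetic image sample can therefore be composed from a foreground image layer, a background image layer, and a border/shadow layer. Based on this idea, the synthesis process is divided into six steps: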
1. Font rendering – a font is randomly selected from a catalogue of over 1400 fonts downloaded from Google Fonts. The kerning, weight, underline, and other properties are varied randomly from arbitrarily defined distributions. The word is rendered on to the foreground image-layer’s alpha channel with either a horizontal bottom text line or following a random curve.
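1. Font rendering – a font is chosen at random, and the text is rendered into the foreground image layer along a horizontal text line or a random curve.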
2. Border/shadow rendering – an inset border, outset border or shadow with a random width may be rendered from the foreground.
2. Border/shadow rendering – a border, or a shadow of random width, is derived from the foreground image layer.
3. Base coloring – each of the three image-layers are filled with a different uniform color sampled from clusters over natural images. The clusters are formed by k-means clustering the three color components of each image of the training datasets of [16] into three clusters.
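3. Base coloring – each of the three image layers is filled with a different uniform color sampled from clusters computed over natural images. The clusters come from running k-means on the three color components of each image in the training datasets of [16], grouping them into three clusters.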
4. Projective distortion – the foreground and border/shadow image-layers are distorted with a random, full-projective transformation, simulating the 3D world.
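4. Projective distortion – the foreground and border/shadow layers are warped with a random full projective transformation to simulate a 3D environment.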
5. Natural data blending – each of the image-layers are blended with a randomly-sampled crop of an image from the training datasets of ICDAR 2003 and SVT. The amount of blend and alpha blend mode (e.g. normal, add, multiply, burn, max, etc.) is dictated by a random process, and this creates an eclectic range of textures and compositions. The three image-layers are also blended together in a random manner, to give a single output image.
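5. Natural data blending – each image layer is blended with a randomly sampled image crop from the ICDAR 2003 and SVT training sets. The blend mode and the blend amount are chosen by a random process, which yields a diverse range of textures and compositions. The three image layers are then blended together, also at random, into a single output image.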
6. Noise – Gaussian noise, blur, and JPEG compression artefacts are introduced to the image.
6. Noise – Gaussian noise, blur, and JPEG compression artifacts are added to the image.
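
To make the layered composition concrete, here is a minimal NumPy sketch of steps 2, 3, 5 and 6. It is not the paper's implementation, and everything in it is an assumption made for illustration: the function names (`render_mask`, `shadow_from`, `blend`, `synthesize`), the bar-shaped mask that stands in for real font rendering, the hand-picked `palette` that stands in for the k-means color clusters, and the random array that stands in for a natural-image crop. Step 4 (projective distortion), blur, and JPEG compression are left out, and the final merge uses plain alpha compositing rather than a random blend.

```python
import numpy as np

def render_mask(h=32, w=100):
    """Stand-in for step 1: a placeholder alpha mask (plain bars, no real font)."""
    alpha = np.zeros((h, w), dtype=np.float32)
    for x0 in range(10, 90, 16):            # five bar-shaped placeholder glyphs
        alpha[8:24, x0:x0 + 10] = 1.0
    return alpha

def shadow_from(alpha, dx=2, dy=2):
    """Step 2: derive a drop shadow by shifting the foreground alpha mask."""
    shadow = np.zeros_like(alpha)
    shadow[dy:, dx:] = alpha[:-dy, :-dx]
    return shadow

def blend(base, top, mode, amount):
    """Step 5: a few common blend modes, mixed in by `amount` in [0, 1]."""
    if mode == "normal":
        out = top
    elif mode == "add":
        out = np.clip(base + top, 0.0, 1.0)
    elif mode == "multiply":
        out = base * top
    elif mode == "max":
        out = np.maximum(base, top)
    else:
        raise ValueError(f"unknown blend mode: {mode}")
    return (1.0 - amount) * base + amount * out

def synthesize(rng, natural_crop, palette):
    """Compose one sample; `palette` is a (3, 3) array of RGB colors in [0, 1]."""
    fg_alpha = render_mask()
    sh_alpha = shadow_from(fg_alpha)
    h, w = fg_alpha.shape
    # Step 3: one uniform color per layer (background, shadow, foreground).
    colors = palette[rng.permutation(3)]
    layers = [np.broadcast_to(c, (h, w, 3)).astype(np.float32) for c in colors]
    # Step 5: blend every layer with the crop using a random mode and amount.
    modes = ["normal", "add", "multiply", "max"]
    layers = [
        blend(layer, natural_crop, modes[rng.integers(len(modes))], rng.uniform(0.0, 0.5))
        for layer in layers
    ]
    bg, sh, fg = layers
    # Merge: background, then shadow, then foreground (simple alpha compositing).
    img = bg * (1.0 - sh_alpha[..., None]) + sh * sh_alpha[..., None]
    img = img * (1.0 - fg_alpha[..., None]) + fg * fg_alpha[..., None]
    # Step 6: additive Gaussian noise only.
    img = img + rng.normal(0.0, 0.02, img.shape)
    return np.clip(img, 0.0, 1.0).astype(np.float32)

rng = np.random.default_rng(0)
crop = rng.uniform(0.0, 1.0, (32, 100, 3)).astype(np.float32)  # placeholder crop
palette = np.array([[0.9, 0.9, 0.8], [0.1, 0.1, 0.1], [0.7, 0.2, 0.2]], dtype=np.float32)
sample = synthesize(rng, crop, palette)
```

In a fuller version, `render_mask` would draw a real word with a font library, `palette` would come from clustering the colors of real images, and `crop` would be cut from a dataset image. The layer structure stays the same: two alpha masks decide where the three colored, textured layers show through.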