1 An Overview of the Tesseract OCR Engine

input: a binary image


step-by-step pipeline:

step1: connected component analysis
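
To make step1 concrete, here is a minimal sketch of connected component labeling on a binary image. It is a generic breadth-first flood fill with 8-connectivity, not Tesseract's own implementation; the function name `connected_components` and the list-of-lists image format are assumptions for illustration.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected foreground pixels (value 1) into components.

    `image` is a list of rows of 0/1 values. Returns a list of components,
    each a list of (row, col) pixel coordinates.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood-fill outward from this pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
components = connected_components(image)
```

In this toy image the three separate groups of foreground pixels come back as three components.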


step2: recognition proceeds as a two-pass process

the first pass: attempt to recognize each word in turn

the second pass: re-recognize the words that were not recognized well enough in the first pass
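
The two-pass control flow can be sketched as below. This is only an illustration of the idea in these notes: `two_pass`, `toy_recognize`, the `second_pass` flag and the 0.8 confidence threshold are all hypothetical and are not part of Tesseract's API.

```python
def two_pass(words, recognize, threshold=0.8):
    """Recognize every word once, then retry the low-confidence ones.

    `recognize(word, second_pass)` is a hypothetical callable returning
    a (text, confidence) tuple.
    """
    # First pass: attempt every word.
    results = [recognize(word, second_pass=False) for word in words]
    # Second pass: retry only the words that scored below the threshold,
    # and keep the retry only if it is more confident.
    for i, (_, confidence) in enumerate(results):
        if confidence < threshold:
            retry = recognize(words[i], second_pass=True)
            if retry[1] > confidence:
                results[i] = retry
    return results


def toy_recognize(word, second_pass):
    # Toy stand-in: a '?' marks a character that is hard on the first pass.
    if "?" in word and not second_pass:
        return (word, 0.4)
    return (word.replace("?", "l"), 0.9)


results = two_pass(["hello", "wor?d"], toy_recognize)
```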


2.1 Chopping Joined Characters

Candidate chop points: the concave vertices of a polygonal approximation of the outline of a word (for example, hello).
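
A minimal sketch of finding concave vertices on a polygonal outline, assuming the outline is already given as a simple polygon (a list of (x, y) vertices). The function name `concave_vertices` is hypothetical, and this only shows the geometric test, not how Tesseract ranks or selects chop points.

```python
def concave_vertices(polygon):
    """Return the vertices of a simple polygon where the outline bends inward."""
    n = len(polygon)
    # Twice the signed area (shoelace formula): positive means the
    # vertices are listed counter-clockwise.
    area2 = sum(
        polygon[i][0] * polygon[(i + 1) % n][1]
        - polygon[(i + 1) % n][0] * polygon[i][1]
        for i in range(n)
    )
    orientation = 1 if area2 > 0 else -1
    result = []
    for i in range(n):
        ax, ay = polygon[i - 1]
        bx, by = polygon[i]
        cx, cy = polygon[(i + 1) % n]
        # Cross product of the incoming and outgoing edges at vertex b.
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        # A turn against the polygon's overall orientation is concave.
        if cross * orientation < 0:
            result.append(polygon[i])
    return result


# Outline with a single notch at (2, 1), like two blobs touching.
outline = [(0, 0), (4, 0), (4, 3), (2, 1), (0, 3)]
notches = concave_vertices(outline)
```

The notch at (2, 1) is the only concave vertex, so it is the only candidate chop point in this toy outline.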

2.2 Associating Broken Characters


Combining 2.1 and 2.2, the overall approach is fully-chop-then-associate.
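
The associate half of fully-chop-then-associate can be illustrated with a small search over ways to regroup adjacent pieces. The sketch below uses plain dynamic programming, which is a simplification swapped in for illustration and not Tesseract's actual search strategy; `best_association`, `toy_score`, the confidence table and `max_group` are all hypothetical.

```python
import math


def best_association(pieces, score, max_group=3):
    """Group adjacent pieces into characters to maximize the total score.

    `pieces` are the fragments left after chopping at every candidate point.
    `score(group)` is a hypothetical classifier returning a log-confidence.
    """
    n = len(pieces)
    # best[i] holds (total score, grouping) for the first i pieces.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_group), end):
            group = "".join(pieces[start:end])
            total = best[start][0] + score(group)
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + [group])
    return best[n][1]


# Toy classifier confidences: "c" + "l" together look like one good character.
CONFIDENCE = {"c": 0.5, "l": 0.5, "cl": 0.9, "o": 0.9}


def toy_score(group):
    # Unknown groupings get a very low confidence.
    return math.log(CONFIDENCE.get(group, 0.01))


grouping = best_association(["c", "l", "o"], toy_score)
```

Here the fragments "c" and "l" are re-associated into a single character because the merged shape scores higher than the two pieces on their own.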



step3: resolve the spaces that were marked as ambiguous (fuzzy)



