1 An Overview of the Tesseract OCR Engine

input: a binary image

 

step by step pipeline:

step1: connected component analysis
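
To make step1 concrete, here is a minimal sketch of connected component labeling on a small binary image. It is only an illustration: the function name `connected_components`, the list-of-rows image format, and the choice of 8-connectivity with a breadth-first flood fill are all assumptions of this note. In the paper this step stores the outlines of the components, which this sketch does not do.

```python
from collections import deque


def connected_components(image):
    """Label 8-connected foreground pixels (value 1) of a binary image.

    `image` is a list of rows of 0/1 values (an assumed toy format).
    Returns (labels, components): a label grid (0 = background) and,
    for each component, the list of (row, col) pixels it contains.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue  # background, or already part of a component
            label = len(components) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the 8 neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = label
                            queue.append((ny, nx))
            components.append(pixels)
    return labels, components


image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
labels, components = connected_components(image)
print(len(components))  # number of blobs found in the toy image
```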

 

step2: recognition, which proceeds as a two-pass process

the first pass: try to recognize every word in turn

the second pass: re-recognize the words that were not recognized well enough in the first pass
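
The control flow of the two passes can be sketched as below. Everything here is hypothetical and only imitates the idea: `ToyRecognizer`, its `adapt` method, the character-coverage "confidence", and the 0.7 threshold are assumptions of this note, not Tesseract's API. The toy recognizer learns from words it was confident about in pass 1 (in the paper this role is played by an adaptive classifier), so a word near the top of the page can do better when it is retried in pass 2.

```python
class ToyRecognizer:
    """Hypothetical stand-in for a word recognizer that can adapt."""

    def __init__(self, known_chars):
        self.known_chars = set(known_chars)

    def recognize(self, word):
        # Toy confidence: fraction of characters already known.
        hits = sum(1 for ch in word if ch in self.known_chars)
        return word, hits / len(word)

    def adapt(self, word):
        # Learn the characters of a word recognized with confidence.
        self.known_chars.update(word)


def two_pass(words, recognizer, threshold=0.7):
    """Return a list of (text, confidence) after two passes."""
    results = []
    # Pass 1: try every word; confident results train the recognizer.
    for word in words:
        text, conf = recognizer.recognize(word)
        if conf >= threshold:
            recognizer.adapt(word)
        results.append((text, conf))
    # Pass 2: retry only the words that were not recognized well enough.
    for i, word in enumerate(words):
        if results[i][1] < threshold:
            results[i] = recognizer.recognize(word)
    return results


results = two_pass(["boss", "cabs", "scab"], ToyRecognizer("abc"))
for text, conf in results:
    print(text, conf)
```

In this toy run "boss" is weak in pass 1 because "s" is unknown at that point; "cabs" later teaches the recognizer "s", and the retry in pass 2 scores higher.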

 

2.1 Chopping Joined Characters

Candidate chop points: the concave vertices of a polygonal approximation of the outline of a word (for example, hello).
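
A minimal sketch of finding the concave vertices of a polygon, assuming the outline has already been approximated by a simple polygon given as a list of (x, y) vertices. The function names and the cross-product test are illustrative choices of this note, not Tesseract's implementation; the sketch covers only the "concave vertex" part and does not decide where the cut should go.

```python
def signed_area(poly):
    """Shoelace formula: positive for counter-clockwise vertex order."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)


def concave_vertices(poly):
    """Return the vertices where the polygon turns against its orientation."""
    orientation = 1 if signed_area(poly) > 0 else -1
    n = len(poly)
    result = []
    for i in range(n):
        (ax, ay), (bx, by), (cx, cy) = poly[i - 1], poly[i], poly[(i + 1) % n]
        # z-component of the cross product of edge a->b with edge b->c.
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross * orientation < 0:
            result.append(poly[i])
    return result


# Toy outline: two blobs joined at a narrow waist (notches at x = 2).
outline = [(0, 0), (2, 1), (4, 0), (4, 4), (2, 3), (0, 4)]
print(concave_vertices(outline))  # the two notch vertices at the waist
```

The two notch vertices face each other across the waist, which is the kind of place where a joined pair of characters would be cut.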

2.2 Associating Broken Characters

 

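Taken together, 2.1 and 2.2 follow a fully-chop-then-associate approach: first chop the blobs as far as possible, then associate the fragments back into candidate characters.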
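
The "associate" half can be sketched as a search over ways of grouping adjacent fragments. Assumptions of this note: the fragments are already fully chopped, `TOY_SHAPES` and `toy_classify` are a made-up classifier that returns a (character, cost) pair or None, and the search is a simple dynamic program. The paper describes a best-first search over a segmentation graph instead; the dynamic program is swapped in here only to keep the example short.

```python
# Hypothetical classifier table: joined fragments -> (character, cost).
TOY_SHAPES = {
    "|)": ("D", 0.1),
    "o": ("o", 0.1),
    "c": ("c", 0.1),
    "|<": ("k", 0.1),
    "|": ("l", 0.4),
    ")": (")", 0.6),
    "<": ("<", 0.6),
}


def toy_classify(group):
    """Classify a run of adjacent fragments; None if nothing matches."""
    return TOY_SHAPES.get("".join(group))


def associate(fragments, classify, max_group=3):
    """Pick the cheapest grouping of adjacent fragments into characters."""
    n = len(fragments)
    inf = float("inf")
    # best[i] = (lowest total cost, text) for the first i fragments.
    best = [(inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_group), end):
            prev_cost, prev_text = best[start]
            if prev_cost == inf:
                continue  # no valid grouping reaches `start`
            candidate = classify(fragments[start:end])
            if candidate is None:
                continue
            char, cost = candidate
            if prev_cost + cost < best[end][0]:
                best[end] = (prev_cost + cost, prev_text + char)
    return best[n]


fragments = ["|", ")", "o", "c", "|", "<"]
cost, text = associate(fragments, toy_classify)
print(text)
```

Here the broken pieces "|" + ")" and "|" + "<" are cheaper when read together than when read as separate symbols, so the grouping that joins them wins.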

 

 

step3: resolve the spaces that were marked as fuzzy (ambiguous)

 

 

 
