A Pitfall When Training a tesseract-ocr Language File

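I have read several other blog posts about tesseract-ocr, and they all describe training a language file in much the same way. However, because of small mistakes in those posts, none of them gave me the expected result. One example is the font properties file: it is saved with a .txt extension (for how to set it up, see the article at http://blog.csdn.net/firehood_/article/details/8433077). I followed that article's steps and only hit an error at the last one, "7. Generate the language file". Its batch file contains the following: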

  rem Before running this batch file, create the font_properties file in this directory

  echo Run Tesseract for Training..
  tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

  echo Compute the Character Set..
  unicharset_extractor.exe num.font.exp0.box
  mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

  echo Clustering..
  cntraining.exe num.font.exp0.tr

  echo Rename Files..
  rename normproto num.normproto
  rename inttemp num.inttemp
  rename pffmtable num.pffmtable
  rename shapetable num.shapetable

  echo Create Tessdata..
  combine_tessdata.exe num.

However, running it produced this output:

E:\tesseract\tessdata>test.bat

E:\tesseract\tessdata>rem Before running this batch file, create the font_properties file in this directory

E:\tesseract\tessdata>echo Run Tesseract for Training..
Run Tesseract for Training..

E:\tesseract\tessdata>tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 5
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
TRAINING ... Font name = font
Generated training data for 1 words
Page 2 of 5
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
Generated training data for 4 words
Page 3 of 5
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
Generated training data for 1 words
Page 4 of 5
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
Generated training data for 1 words
Page 5 of 5
APPLY_BOXES:
   Boxes read from boxfile:      10
   Found 10 good blobs.
Generated training data for 1 words

E:\tesseract\tessdata>echo Compute the Character Set..
Compute the Character Set..

E:\tesseract\tessdata>unicharset_extractor.exe num.font.exp0.box
Extracting unicharset from num.font.exp0.box
Wrote unicharset file ./unicharset.

E:\tesseract\tessdata>mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr
Warning: No shape table file present: shapetable
Failed to load font_properties from font_properties

E:\tesseract\tessdata>echo Clustering..
Clustering..


E:\tesseract\tessdata>cntraining.exe num.font.exp0.tr
Reading num.font.exp0.tr ...
Clustering ...

Writing normproto ...

E:\tesseract\tessdata>echo Rename Files..
Rename Files..

E:\tesseract\tessdata>rename normproto num.normproto

E:\tesseract\tessdata>rename inttemp num.inttemp
The system cannot find the file specified.

E:\tesseract\tessdata>rename pffmtable num.pffmtable
The system cannot find the file specified.

E:\tesseract\tessdata>rename shapetable num.shapetable
The system cannot find the file specified.

E:\tesseract\tessdata>echo Create Tessdata..
Create Tessdata..

E:\tesseract\tessdata>combine_tessdata.exe num.
Combining tessdata files
Error opening unicharset file
Error combining tessdata files into num.traineddata

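This was very frustrating. After repeated attempts, I finally found the problem: the batch file passes "font_properties" to mftraining, but the file on disk is named font_properties.txt, so mftraining reports "Failed to load font_properties from font_properties" and the rename and combine_tessdata steps that follow fail in turn. Changing "mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr" in the batch file to "mftraining -F font_properties.txt -U unicharset -O num.unicharset num.font.exp0.tr" solves the problem, and the expected result appears: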

[Image 1: screenshot of the expected output]
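
To avoid hard-coding the wrong file name, the name passed to -F can be chosen from whatever actually exists in the training directory. The following Python sketch is only an illustration of that idea and is not part of the original batch file: the helper names find_font_properties and mftraining_command are hypothetical, and the file names num.unicharset and num.font.exp0.tr are simply the ones used in this post.

```python
import subprocess
from pathlib import Path

# File names to try, in order. Both spellings appear in this post;
# the order is an assumption made for this sketch.
CANDIDATES = ("font_properties", "font_properties.txt")


def find_font_properties(directory="."):
    """Return the name of the font properties file that exists in `directory`."""
    for name in CANDIDATES:
        if (Path(directory) / name).is_file():
            return name
    raise FileNotFoundError(
        "none of %s found in %s" % (", ".join(CANDIDATES), directory)
    )


def mftraining_command(font_properties, lang="num", tr_file="num.font.exp0.tr"):
    """Build the mftraining argument list used in the batch file above."""
    return [
        "mftraining",
        "-F", font_properties,
        "-U", "unicharset",
        "-O", lang + ".unicharset",
        tr_file,
    ]


if __name__ == "__main__":
    # Run from the training directory (E:\tesseract\tessdata in this post).
    name = find_font_properties()
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(mftraining_command(name), check=True)
```

If neither file exists, the script stops with a FileNotFoundError before mftraining is started. Note that the log above shows the batch file carrying on after mftraining failed, so the missing file only surfaced later as rename and combine_tessdata errors.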
