Training Tesseract
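1. Download jTessBoxEditor (the JRE used here is JRE 7). In the TIFF/Box Generator, add the commonly used Chinese characters in the SimSun (宋体) font, set Output: zhong chi_sim.exp0.tif, then click Generate.
This produces two files: zhong.chi_sim.exp0.tif and zhong.chi_sim.exp0.box.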
2. Create a file named font_properties with the content: chi_sim 0 0 0 0 0
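In the Tesseract 3.x training process, each font_properties line has the form <fontname> <italic> <bold> <fixed> <serif> <fraktur>, where every flag is 0 or 1 and the font name matches the middle part of the training file name (chi_sim in zhong.chi_sim.exp0.tif). The following is a minimal Python sketch of that format. The helper names font_properties_line and font_name_from_training_file are hypothetical and are not part of Tesseract.

```python
def font_properties_line(fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one entry: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each style flag must be 0 or 1")
    if not fontname or any(ch.isspace() for ch in fontname):
        raise ValueError("font name must be non-empty and contain no whitespace")
    return " ".join([fontname] + [str(flag) for flag in flags])


def font_name_from_training_file(filename):
    """Extract the font name from a <lang>.<fontname>.exp<N>.<ext> file name."""
    parts = filename.split(".")
    if len(parts) < 4:
        raise ValueError("expected <lang>.<fontname>.exp<N>.<ext>")
    return parts[1]


if __name__ == "__main__":
    font = font_name_from_training_file("zhong.chi_sim.exp0.tif")
    # Print the line that would go into the font_properties file.
    print(font_properties_line(font))
```

Deriving the font name from the training file name keeps the two in sync, since a mismatch between them is a common reason for mftraining to reject the input.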
3. Create a batch file named start.bat with the following content:
rem Create the font_properties file in this directory before running this batch file
echo Run Tesseract for Training..
D:\app\Tesseract-OCR\tesseract.exe zhong.chi_sim.exp0.tif zhong.chi_sim.exp0 nobatch box.train
echo Compute the Character Set..
D:\app\Tesseract-OCR\unicharset_extractor.exe zhong.chi_sim.exp0.box
D:\app\Tesseract-OCR\mftraining.exe -F font_properties -U unicharset -O zhong.unicharset zhong.chi_sim.exp0.tr
echo Clustering..
D:\app\Tesseract-OCR\cntraining.exe zhong.chi_sim.exp0.tr
echo Rename Files..
rename normproto zhong.normproto
rename inttemp zhong.inttemp
rename pffmtable zhong.pffmtable
rename shapetable zhong.shapetable
echo Create Tessdata..
D:\app\Tesseract-OCR\combine_tessdata.exe zhong.
pause
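The final combine_tessdata step only picks up files that carry the zhong. prefix, so it can help to confirm that the five component files exist before that step runs. The following is a rough Python sketch of such a check. The function name missing_training_files and the REQUIRED_SUFFIXES list are assumptions based on the files the script above generates and renames.

```python
from pathlib import Path

# Components the batch script above is expected to leave behind for a prefix.
REQUIRED_SUFFIXES = ("unicharset", "inttemp", "pffmtable", "normproto", "shapetable")


def missing_training_files(directory, prefix="zhong"):
    """Return the expected <prefix>.<component> file names that are absent."""
    base = Path(directory)
    expected = [f"{prefix}.{suffix}" for suffix in REQUIRED_SUFFIXES]
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_training_files(".")
    if missing:
        print("Missing before combine_tessdata:", ", ".join(missing))
    else:
        print("All component files are present.")
```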
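4. Run start.bat and wait for the command-line output. Training succeeded if the offsets for types 1, 3, 4, 5, and 13 (unicharset, inttemp, pffmtable, normproto, and shapetable) are not -1: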
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 509098
Offset for type 4 is 42657207
Offset for type 5 is 42726936
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 43579530
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
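The check on types 1, 3, 4, 5, and 13 can also be automated by parsing the combine_tessdata output. The following is a minimal Python sketch that assumes the output lines look exactly like the ones shown above. The names parse_offsets and missing_required_types are hypothetical.

```python
import re

# Types this note treats as required: 1, 3, 4, 5 and 13.
REQUIRED_TYPES = (1, 3, 4, 5, 13)
OFFSET_LINE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(output):
    """Map each reported type number to its offset."""
    return {int(t): int(o) for t, o in OFFSET_LINE.findall(output)}


def missing_required_types(output, required=REQUIRED_TYPES):
    """Return required types whose offset is -1 or that are not reported."""
    offsets = parse_offsets(output)
    return [t for t in required if offsets.get(t, -1) == -1]


if __name__ == "__main__":
    sample = (
        "Offset for type 0 is -1\n"
        "Offset for type 1 is 140\n"
        "Offset for type 3 is 509098\n"
    )
    # Types 4, 5 and 13 are absent from this short sample, so they are reported.
    print(missing_required_types(sample))
```

An empty list means every required component was packed into the traineddata file.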
5. This generates zhong.traineddata. Copy it into the tessdata folder of the Tesseract installation.
6. Run the command tesseract.exe E:\temp\image\y.jpg E:\temp\image\y -l zhong. The recognition result is written to y.txt.
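The same recognition command can be driven from a script. The following is a minimal Python sketch, assuming tesseract.exe is on the PATH and that the output text file is UTF-8. The function names build_ocr_command and run_ocr are hypothetical.

```python
import subprocess


def build_ocr_command(image_path, output_base, lang="zhong", tesseract_exe="tesseract.exe"):
    """Return the argument list for: tesseract <image> <output_base> -l <lang>."""
    return [tesseract_exe, image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="zhong", tesseract_exe="tesseract.exe"):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    command = build_ocr_command(image_path, output_base, lang, tesseract_exe)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

For example, run_ocr(r"E:\temp\image\y.jpg", r"E:\temp\image\y") would run the command from this step and return the contents of y.txt.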