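Preface

I recently needed to recognize the serial numbers and passwords printed on prepaid recharge cards, so I wrote this tutorial on training Tesseract.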

Download and install Tesseract

Go to the tesseract-ocr project on GitHub and download the Windows installer (I did the training on Windows).

An unofficial installer for windows for Tesseract 3.05-dev and Tesseract 4.00-dev is available from Tesseract at UB Mannheim. This includes the training tools. An installer for the old version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-zip, and copy the .traineddata file into the 'tessdata' directory, probably C:\Program Files\Tesseract-OCR\tessdata

Clicking the download link above takes us to the download page:

Binaries for Windows

4.0.0: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows
3.5.1: https://github.com/parrot-office/tesseract/releases/tag/3.5.1 (3rd party - @parrot-office)
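Choose the version that suits you and download it.

I won't go into installation and environment variable configuration here; a quick search will show you how to set them up.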

Download jTessBoxEditor

Go to the jTessBoxEditor official site and click Download on the left to reach the download page.

Convert the images to be recognized to tif

There are many conversion tools; use whichever one you prefer.

Merge the tif files

Start jTessBoxEditor by double-clicking train.bat.


Press Ctrl+M to open the file-selection dialog for merging and select the tif files to merge.


After you click Open, you are asked where to save the merged file.


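We save it as num.font.exp10, following Tesseract's [lang].[fontname].exp[num] naming convention for training files (the merged image is then num.font.exp10.tif).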
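The file name matters later, because the font name in the middle has to match the entry in font_properties. As a small sanity check, here is a sketch in Python that splits such a name into its parts. The helper name parse_training_name is my own for illustration and is not part of Tesseract.

```python
import re
from typing import NamedTuple


class TrainingName(NamedTuple):
    lang: str   # language code, e.g. "num"
    font: str   # font name, e.g. "font"
    exp: int    # experiment number, e.g. 10


# Matches "<lang>.<fontname>.exp<num>", optionally followed by an extension.
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<exp>\d+)(?:\.\w+)?$"
)


def parse_training_name(filename: str) -> TrainingName:
    """Split a training file name into language, font name, and number.

    Hypothetical helper for illustration only.
    """
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num] name: {filename!r}")
    return TrainingName(match["lang"], match["font"], int(match["exp"]))


if __name__ == "__main__":
    print(parse_training_name("num.font.exp10.tif"))
```

For num.font.exp10.tif the language is num and the font name is font, which is why the font_properties entry further down starts with font.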

Generate the box file

tesseract num.font.exp10.tif num.font.exp10 -psm 10 digits batch.nochop makebox


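Note: each page of my tif contains a single character and all of them are digits, so I added -psm 10 digits. Page segmentation mode 10 treats the image as a single character, and digits is a config file that restricts recognition to digits. Search for the other values of this option if your images are different.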
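If you prefer to script this step, the command above can be assembled in Python. This is only a sketch: the function names are mine, and it assumes tesseract is on your PATH. The psm_flag parameter is there because Tesseract 4.x spells the option --psm, while the 3.x builds used here accept -psm.

```python
import subprocess
from typing import List


def makebox_command(base: str, psm: int = 10, psm_flag: str = "-psm") -> List[str]:
    """Build the makebox command for <base>.tif (hypothetical helper).

    Use psm_flag="--psm" with Tesseract 4.x.
    """
    return [
        "tesseract", f"{base}.tif", base,
        psm_flag, str(psm),
        "digits", "batch.nochop", "makebox",
    ]


def run_makebox(base: str) -> None:
    """Run the command; raises CalledProcessError if tesseract fails."""
    subprocess.run(makebox_command(base), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_makebox() to execute it.
    print(" ".join(makebox_command("num.font.exp10")))
```

Running it with the base name num.font.exp10 writes num.font.exp10.box next to the tif, the same as typing the command by hand.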

Correct misrecognized boxes

Run jTessBoxEditor again and open num.font.exp10.tif to review the generated boxes.


Here a 6 was recognized as a period (.).


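Select the wrong box (the small blue circle in the editor), then correct its coordinates and set the character value to the right one: 6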


After editing, remember to click Save.
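It helps to know what jTessBoxEditor is editing. A box file is plain text with one line per character: the symbol, then left, bottom, right, top (pixels, origin at the bottom-left corner of the image), then the zero-based page number. The sketch below parses such lines and fixes the character of one entry. The helper names and the sample coordinates are made up for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the symbol itself is kept intact.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box(box: Box) -> str:
    """Turn a Box back into a box-file line."""
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def fix_char(lines: List[str], index: int, correct: str) -> List[str]:
    """Return the lines with the symbol of entry `index` replaced."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    boxes[index] = boxes[index]._replace(char=correct)
    return [format_box(box) for box in boxes]


if __name__ == "__main__":
    # Hypothetical box lines; real coordinates come from your own tif.
    sample = ["5 10 8 30 40 0", ". 12 9 31 42 1"]
    print(fix_char(sample, 1, "6"))
```

This only changes the symbol. Adjusting the coordinates is still easier to do visually in jTessBoxEditor.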

Create the font properties file

font 0 0 0 0 0

Save it as: font_properties
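Each line of font_properties has the form fontname italic bold fixed serif fraktur, where the five flags are 0 or 1 and the font name must match the middle part of the training file name (font in num.font.exp10). A minimal Python sketch that produces the same line; the function names are my own.

```python
from pathlib import Path


def font_properties_line(font: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format '<fontname> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(directory: str, font: str) -> Path:
    """Write a one-line font_properties file with all flags set to 0."""
    path = Path(directory) / "font_properties"
    path.write_text(font_properties_line(font) + "\n", encoding="ascii")
    return path


if __name__ == "__main__":
    print(font_properties_line("font"))
```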

Generate the .tr file

tesseract num.font.exp10.tif num.font.exp10 -psm 10 digits nobatch box.train


Generate the font feature data

unicharset_extractor num.font.exp10.box

shapeclustering -F font_properties -U unicharset num.font.exp10.tr
mftraining -F font_properties -U unicharset -O unicharset num.font.exp10.tr
cntraining num.font.exp10.tr
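The four commands above have to run in this order, and each one depends on the output of the previous one, so it can be convenient to drive them from a script. The following is a sketch under the assumption that the training tools are on your PATH and that you run it in the folder containing the .box and .tr files. The function names are hypothetical.

```python
import subprocess
from typing import List


def training_commands(base: str,
                      font_properties: str = "font_properties") -> List[List[str]]:
    """Return the feature-extraction commands for <base>, in order."""
    box, tr = f"{base}.box", f"{base}.tr"
    return [
        ["unicharset_extractor", box],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", tr],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", "unicharset", tr],
        ["cntraining", tr],
    ]


def run_training(base: str) -> None:
    """Run every step and stop at the first one that fails."""
    for command in training_commands(base):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_training() to execute them.
    for command in training_commands("num.font.exp10"):
        print(" ".join(command))
```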

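Rename the resulting files unicharset, inttemp, pffmtable, shapetable, and normproto so that each name starts with num. (for example, unicharset becomes num.unicharset).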
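Renaming five files by hand is easy to get wrong, so here is a small Python sketch of the same step. The helper names are mine, and the script assumes the files sit in the directory you pass in.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# The five files produced by the training tools in the previous step.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def rename_plan(prefix: str,
                names: Sequence[str] = TRAINING_OUTPUTS) -> Dict[str, str]:
    """Map each output file name to '<prefix>.<name>'."""
    return {name: f"{prefix}.{name}" for name in names}


def apply_renames(directory: str, prefix: str) -> List[str]:
    """Rename the files that exist in `directory`; return the new names."""
    folder = Path(directory)
    renamed = []
    for old, new in rename_plan(prefix).items():
        source = folder / old
        if source.exists():
            source.replace(folder / new)
            renamed.append(new)
    return renamed


if __name__ == "__main__":
    print(rename_plan("num"))
```

Calling apply_renames(".", "num") in the training folder leaves num.unicharset, num.inttemp, and so on, which is what combine_tessdata expects.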


Finally, run

combine_tessdata num.

This produces the trained data file num.traineddata.
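To use the result, copy num.traineddata into Tesseract's tessdata directory and pass -l num on the command line. The sketch below shows both steps in Python. The function names, the image name card.tif, and the tessdata path are placeholders of mine, so adjust them to your installation.

```python
import shutil
from pathlib import Path
from typing import List


def install_traineddata(traineddata: str, tessdata_dir: str) -> Path:
    """Copy the .traineddata file into the given tessdata directory."""
    target = Path(tessdata_dir) / Path(traineddata).name
    shutil.copy(traineddata, target)
    return target


def recognize_command(image: str, out_base: str, lang: str = "num",
                      psm: int = 10, psm_flag: str = "-psm") -> List[str]:
    """Build a recognition command; the text is written to <out_base>.txt.

    Use psm_flag="--psm" with Tesseract 4.x.
    """
    return ["tesseract", image, out_base, "-l", lang, psm_flag, str(psm), "digits"]


if __name__ == "__main__":
    # card.tif is a hypothetical single-digit image.
    print(" ".join(recognize_command("card.tif", "card")))
```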
