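Tesseract is a general-purpose OCR (text recognition) library whose development has been sponsored by Google. It can be trained on your own image set to make recognition more accurate. Download links for the related software are at the bottom of this post.
Everything below assumes that Tesseract 4.0 is already installed on the system. It can be downloaded from: https://digi.bib.uni-mannheim.de/tesseract/
Environment setup:
The code is as follows: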
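from PIL import Image
import pytesseract
from pytesseract import Output  # provides Output.BYTES, Output.DATAFRAME, Output.DICT, Output.STRING


def ocr():
    '''
    Arguments: image, output type, language pack, extra config (-- options)
    --psm 7  treat the image as a single text line, which improves accuracy on one-line images
    --oem 3  default engine mode, chosen from what the language data supports
    tessedit_char_whitelist=0123456789R  whitelist of characters to recognize
    '''
    # With Output.DICT the result is a dict whose "text" key holds the recognized string
    result = pytesseract.image_to_string(
        Image.open("./ocr.png"),
        output_type=Output.DICT,
        lang="eng",
        config="--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789R",
    )
    return result["text"]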
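The config argument above is just a string of space-separated command-line options, so it can be assembled from parts. The following is a minimal sketch; build_config is a hypothetical helper name used for illustration and is not part of pytesseract.

```python
def build_config(psm=7, oem=3, whitelist=None):
    """Assemble a Tesseract config string from its parts.

    build_config is an illustrative helper, not a pytesseract function.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        # -c sets one Tesseract variable as name=value
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_config(7, 3, "0123456789R"))
```

The returned string can be passed as config= to pytesseract.image_to_string. A whitelist containing spaces would need extra quoting, which this sketch does not handle.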
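// Before this, set up the matching bin, include, and lib files; what follows targets version 4.0
// (header names below are inferred from what the code uses)
#include <iostream>
#include <cstdlib>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>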
int main()
{
    // Load image
    cv::Mat im = cv::imread("C:/Users/Administrator/Desktop/test-ocr/images/888.png");
    if (im.empty())
    {
        std::cout << "Cannot open source image!" << std::endl;
        system("pause");
        return -1;
    }
    cv::Mat gray;
    cv::cvtColor(im, gray, cv::COLOR_BGR2GRAY);
    // ...other image pre-processing here...
    // Pass it to Tesseract API
    tesseract::TessBaseAPI tess;
    if (tess.Init(NULL, "eng", tesseract::OEM_DEFAULT) != 0) // Init returns 0 on success
    {
        std::cout << "Cannot initialize Tesseract!" << std::endl;
        system("pause");
        return -1;
    }
    // Currently has no effect; the LSTM engine in 4.0 appears to ignore the whitelist
    tess.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
    tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
    // 1 byte per pixel; gray.step is the row stride in bytes
    tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1, (int)gray.step);
    // Get the text
    char* out = tess.GetUTF8Text();
    std::cout << out << std::endl;
    delete[] out; // the caller owns the buffer returned by GetUTF8Text
    tess.End();
    system("pause");
    return 0;
}
About the psm parameter used above:
Page segmentation modes (values for --psm):
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
// The corresponding C++ definition:
enum PageSegMode {
    PSM_OSD_ONLY,               ///< Orientation and script detection only.
    PSM_AUTO_OSD,               ///< Automatic page segmentation with orientation and
                                ///< script detection. (OSD)
    PSM_AUTO_ONLY,              ///< Automatic page segmentation, but no OSD, or OCR.
    PSM_AUTO,                   ///< Fully automatic page segmentation, but no OSD.
    PSM_SINGLE_COLUMN,          ///< Assume a single column of text of variable sizes.
    PSM_SINGLE_BLOCK_VERT_TEXT, ///< Assume a single uniform block of vertically
                                ///< aligned text.
    PSM_SINGLE_BLOCK,           ///< Assume a single uniform block of text. (Default.)
    PSM_SINGLE_LINE,            ///< Treat the image as a single text line.
    PSM_SINGLE_WORD,            ///< Treat the image as a single word.
    PSM_CIRCLE_WORD,            ///< Treat the image as a single word in a circle.
    PSM_SINGLE_CHAR,            ///< Treat the image as a single character.
    PSM_SPARSE_TEXT,            ///< Find as much text as possible in no particular order.
    PSM_SPARSE_TEXT_OSD,        ///< Sparse text with orientation and script det.
    PSM_RAW_LINE,               ///< Treat the image as a single text line, bypassing
                                ///< hacks that are Tesseract-specific.
    PSM_COUNT                   ///< Number of enum entries.
};
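Because the C++ enum starts at 0 and counts up, each member's position is the number passed to --psm. A small Python mirror makes that mapping explicit. This is a sketch for illustration: the class below is written by hand (pytesseract does not ship it), and PSM_COUNT is left out because it is not a usable mode.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Hand-written mirror of the C++ enum; values equal the --psm numbers."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12
    RAW_LINE = 13


def psm_flag(mode):
    """Return the command-line option for a mode, e.g. '--psm 7'."""
    return f"--psm {int(PageSegMode(mode))}"


if __name__ == "__main__":
    print(psm_flag(PageSegMode.SINGLE_LINE))
```

Passing an out-of-range number to psm_flag raises ValueError, which catches typos before Tesseract is ever invoked.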
About the oem parameter:
enum OcrEngineMode {
    OEM_TESSERACT_ONLY,          // Run Tesseract only - fastest; deprecated
    OEM_LSTM_ONLY,               // Run just the LSTM line recognizer.
    OEM_TESSERACT_LSTM_COMBINED, // Run the LSTM recognizer, but allow fallback
                                 // to Tesseract when things get difficult.
                                 // deprecated
    OEM_DEFAULT,                 // Specify this mode when calling init_*(),
                                 // to indicate that any of the above modes
                                 // should be automatically inferred from the
                                 // variables in the language-specific config,
                                 // command-line configs, or if not specified
                                 // in any of the above should be set to the
                                 // default OEM_TESSERACT_ONLY.
    OEM_COUNT                    // Number of OEMs
};
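Haven't started on this part yet! Remind me... [OK, starting now, starting now...]
The URL below is a tutorial on training a character set; it requires Tesseract 5.0.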
https://blog.csdn.net/m0_37693841/article/details/105672637
Link: https://pan.baidu.com/s/13GOAwDxR4e1KHDOulGxjuA
Extraction code: 9nll
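The link above contains the following files:
Install the JDK first, then Tesseract-OCR. jTessBoxEditor only needs to be unzipped into a folder.
Note: after installing Tesseract-OCR, it has to be added to the environment variables.
Then also add TESSDATA_PREFIX -> D:\TESS-OCR\Tesseract-OCR\tessdata so that the directory holding the language data files is known.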
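A quick way to confirm the environment variables took effect is to check them from Python. This is a minimal sketch; check_tesseract_env is a hypothetical helper name, and it only verifies that a tesseract executable is reachable on PATH and that TESSDATA_PREFIX points at an existing directory.

```python
import os
import shutil


def check_tesseract_env(env=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which searches the given PATH string for the executable
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append(f"TESSDATA_PREFIX is not a directory: {prefix}")
    return problems


if __name__ == "__main__":
    for problem in check_tesseract_env():
        print(problem)
```

Open a new terminal before running it, since environment variable changes usually do not reach terminals that were already open.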
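Open jTessBoxEditor, click Tools -> Merge, select the images to train on, and save them as a single TIFF.
When naming the generated TIFF file, follow the format <language name>.<font name>.<version number>, for example: ocr-R47.eng.exp0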
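The three-part name can be built and checked programmatically. The sketch below assumes the last part always has the form exp followed by a number, as in the example; both function names are hypothetical.

```python
import re

# Three dot-separated parts; the last one is "exp" plus a number (assumed form)
_NAME_RE = re.compile(r"^([^.]+)\.([^.]+)\.exp(\d+)$")


def training_basename(lang, font, num=0):
    """Build a base name such as 'eng-ocr.eng.exp0' (without the .tif extension)."""
    return f"{lang}.{font}.exp{num}"


def parse_training_basename(name):
    """Split a base name into (lang, font, num); raise ValueError if malformed."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid training base name: {name}")
    lang, font, num = match.groups()
    return lang, font, int(num)


if __name__ == "__main__":
    print(training_basename("eng-ocr", "eng", 0))
```

The same base name is reused for the .tif, .box, and .lstmf files, so validating it once up front avoids mismatched file names later.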
Generate the box file in LSTM format with the following command:
tesseract eng-ocr.eng.exp0.tif eng-ocr.eng.exp0 -l eng --psm 6 lstmbox
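Go back to jTessBoxEditor.
Click Box Editor -> Open and open the TIFF file just generated; the box file is picked up automatically, so the boxes and characters can be checked and corrected.
Then enter on the command line: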
tesseract eng-ocr.eng.exp0.tif eng-ocr.eng.exp0 -l eng --psm 6 lstm.train
This generates the file eng-ocr.eng.exp0.lstmf, which is used later.
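The two tesseract invocations above differ only in the trailing config name (lstmbox or lstm.train), so they can be generated from one function. This sketch only builds the argument list; tesseract_cmd is a hypothetical name, and actually running the command requires tesseract on PATH.

```python
import subprocess


def tesseract_cmd(basename, config, lang="eng", psm=6):
    """Build: tesseract <basename>.tif <basename> -l <lang> --psm <psm> <config>"""
    return [
        "tesseract",
        f"{basename}.tif",  # input image
        basename,           # output base name
        "-l", lang,
        "--psm", str(psm),
        config,             # "lstmbox" writes a .box file, "lstm.train" a .lstmf file
    ]


def run_step(basename, config):
    """Run one step; raises CalledProcessError if tesseract exits with an error."""
    subprocess.run(tesseract_cmd(basename, config), check=True)


if __name__ == "__main__":
    print(" ".join(tesseract_cmd("eng-ocr.eng.exp0", "lstmbox")))
    print(" ".join(tesseract_cmd("eng-ocr.eng.exp0", "lstm.train")))
```

The two steps are kept as separate calls on purpose, because the box file is corrected by hand in jTessBoxEditor between them.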
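The "eng.traineddata" file downloaded earlier is the English-only character set; if you need a Chinese character set, pick the Chinese one instead.
Download the .traineddata file for the language you need from https://github.com/tesseract-ocr/tessdata_best, put it into the working folder, and enter the following on the command line: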
combine_tessdata -e eng.traineddata eng.lstm
Create a file named eng.training_files.txt and write the absolute path of the eng-ocr.eng.exp0.lstmf file into it.
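Writing the list file by hand makes it easy to end up with a relative path. A minimal sketch that resolves each path to an absolute one before writing, one per line; write_training_list is a hypothetical helper name.

```python
from pathlib import Path


def write_training_list(list_path, lstmf_paths):
    """Write one absolute .lstmf path per line and return the lines written."""
    lines = [str(Path(p).resolve()) for p in lstmf_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    written = write_training_list("eng.training_files.txt", ["eng-ocr.eng.exp0.lstmf"])
    print(written)
```

If more training images are added later, their .lstmf files can go into the same list.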
Enter the following command to start training:
lstmtraining --debug_interval -1 --max_iterations 100 --continue_from="E:\tesseract\eng.lstm" --model_output="E:\tesseract\output" --train_listfile="E:\tesseract\eng.training_files.txt" --traineddata="E:\tesseract\eng.traineddata"
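The training command is long and every path must be right, so it can help to build it from named arguments. This sketch uses only the flags shown in the command above; finetune_cmd is a hypothetical name, and the paths in the usage example are the ones from this post.

```python
def finetune_cmd(continue_from, model_output, train_listfile, traineddata,
                 max_iterations=100, debug_interval=-1):
    """Build the lstmtraining argument list for fine-tuning an existing model."""
    return [
        "lstmtraining",
        "--debug_interval", str(debug_interval),
        "--max_iterations", str(max_iterations),
        f"--continue_from={continue_from}",    # .lstm file extracted by combine_tessdata
        f"--model_output={model_output}",      # prefix for the checkpoint files
        f"--train_listfile={train_listfile}",  # text file listing the .lstmf files
        f"--traineddata={traineddata}",        # base .traineddata file
    ]


if __name__ == "__main__":
    cmd = finetune_cmd(
        r"E:\tesseract\eng.lstm",
        r"E:\tesseract\output",
        r"E:\tesseract\eng.training_files.txt",
        r"E:\tesseract\eng.traineddata",
    )
    print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with backslashes and spaces in Windows paths.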
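The parameters used during training are as follows:
When training finishes, two files are generated, output_checkpoint and output1.667_2.checkpoint, which indicates success.
Enter on the command line: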
lstmtraining --stop_training --continue_from="E:\tesseract\output_checkpoint" --traineddata="E:\tesseract\eng.traineddata" --model_output="E:\tesseract\eng_ocr.traineddata"
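The parameters are as follows:
When the command succeeds, it generates the file eng_ocr.traineddata.
Finally, put the generated file into Tesseract's tessdata folder, where the language data lives, and select it with -l eng_ocr when calling Tesseract. The name given to -l must match the file name without the .traineddata extension.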
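Copying the model and deriving the name for -l can be done in one step, which keeps the two from drifting apart. A minimal sketch; install_traineddata is a hypothetical helper, and the tessdata path in the usage example is the one configured earlier in this post.

```python
import shutil
from pathlib import Path


def install_traineddata(model_file, tessdata_dir):
    """Copy a .traineddata file into tessdata and return the name to pass to -l."""
    model_file = Path(model_file)
    if model_file.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got: {model_file.name}")
    shutil.copyfile(model_file, Path(tessdata_dir) / model_file.name)
    # Tesseract looks up <name>.traineddata, so the language name is the file stem
    return model_file.stem


if __name__ == "__main__":
    lang = install_traineddata(
        r"E:\tesseract\eng_ocr.traineddata",
        r"D:\TESS-OCR\Tesseract-OCR\tessdata",
    )
    print(f"use: -l {lang}")
```

The returned name can also be passed as lang= to pytesseract.image_to_string.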
Baidu Netdisk:
Link: https://pan.baidu.com/s/13GOAwDxR4e1KHDOulGxjuA
Extraction code: 9nll
If you have points to spare, you can also support me through the CSDN download:
https://download.csdn.net/download/qq_42874244/12532423
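Final words: (if a link is dead, contact me at https://abraverman.gitee.io/ )
After a long stretch of experiments and trial and error, I finally got the recognition training working. There are plenty of articles online, some correct and some wrong. Either way, I got there in the end. Happy!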