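Tesseract is an open-source OCR (Optical Character Recognition) engine originally developed at HP Labs and now maintained by Google. Compared with Microsoft Office Document Imaging (MODI), its recognition data can be retrained continually, so its image-to-text accuracy keeps improving. If a team needs to go further, Tesseract can also serve as a base for building an OCR engine tailored to its own needs.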
Installer download:
http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe
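If the download is too slow, you can get it from the network drive instead:
tesseract-ocr-setup-4.00.00dev.exe, extraction code: mo60
Language data (tessdata) download: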
https://codechina.csdn.net/mirrors/tesseract-ocr/tessdata/-/archive/master/tessdata-master.zip
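Installation is simple: keep clicking Next, and remember to choose your own install drive. After installation, a Tesseract-OCR folder appears on that drive.
Unzip the downloaded language data tessdata-master.zip and copy its contents into the tessdata directory under the Tesseract-OCR folder, skipping any files that already exist.
Mine is installed at D:\Program Files (x86)\Tesseract-OCR; add that directory to the Path environment variable.
You also need to set the language data environment variable TESSDATA_PREFIX.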
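Before moving on, it can help to confirm that both variables are actually visible to new processes. The following is a minimal sketch, not part of the original setup: the helper name check_tesseract_env is made up for illustration, and it only checks that a tesseract executable is reachable through PATH and that TESSDATA_PREFIX points at an existing directory.

```python
import os
import shutil


def check_tesseract_env(environ=None):
    """Report whether tesseract is reachable via PATH and TESSDATA_PREFIX is usable.

    `check_tesseract_env` is a hypothetical helper name used for illustration.
    """
    env = os.environ if environ is None else environ
    # shutil.which searches the given PATH string for the executable
    exe = shutil.which("tesseract", path=env.get("PATH", ""))
    prefix = env.get("TESSDATA_PREFIX")
    return {
        "tesseract_on_path": exe is not None,
        "tesseract_path": exe,
        "tessdata_prefix": prefix,
        "tessdata_dir_exists": bool(prefix) and os.path.isdir(prefix),
    }


if __name__ == "__main__":
    # Print one line per check for the current environment
    for key, value in check_tesseract_env().items():
        print(f"{key}: {value}")
```

Run it from a freshly opened terminal, since environment variable changes on Windows are not picked up by windows that were already open.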
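At this point the installation is essentially done. Run tesseract -v in cmd to confirm that the version information is printed.
Use tesseract --list-langs to see which languages Tesseract-OCR supports.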
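The same check can be scripted. The sketch below assumes the usual output shape of tesseract --list-langs (a header line such as "List of available languages (3):" followed by one language code per line); the function names are illustrative, not part of any library.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # 'List of available languages (3):'; the rest are language codes.
    if lines and lines[0].lower().startswith("list of available languages"):
        lines = lines[1:]
    return lines


def installed_languages(tesseract_cmd="tesseract"):
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Fall back to stderr in case a build prints the list there
    return parse_list_langs(result.stdout or result.stderr)
```

For example, "chi_sim" in installed_languages() tells you whether Simplified Chinese data is available before you request it.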
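Write the test code as follows:
import requests
from PIL import Image
import pytesseract
from io import BytesIO

# Uncomment if tesseract.exe cannot be found through PATH:
# pytesseract.pytesseract.tesseract_cmd = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
# Uncomment and pass config=tessdata_dir_config below if TESSDATA_PREFIX is not set:
# tessdata_dir_config = '--tessdata-dir "D:/Program Files (x86)/Tesseract-OCR/tessdata"'

# Download the test image into memory
tempdata = requests.get("https://image1.guazistatic.com/qn200416174110f7bacec292ac6b9f867957cbcf4079eb.jpg")
tempdata.raise_for_status()  # stop early if the download failed
tempIm = BytesIO(tempdata.content)
image = Image.open(tempIm)

# Run OCR with the default language and print the recognized text
code = pytesseract.image_to_string(image)
print(code)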
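If the code does not set pytesseract.pytesseract.tesseract_cmd and the tesseract executable cannot be found through PATH, you need to edit pytesseract.py in the Python library directory and put the tesseract-ocr installation path there.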
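Editing a file inside site-packages is lost whenever the package is reinstalled, so setting tesseract_cmd from your own code is usually the less fragile route. Here is a small sketch of that idea; resolve_tesseract_cmd and configure_pytesseract are hypothetical helper names, and the candidate paths you pass in are whatever applies to your machine.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates=()):
    """Return the first candidate path that exists, else whatever PATH provides."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    # Fall back to a PATH lookup; returns None if nothing is found
    return shutil.which("tesseract")


def configure_pytesseract(candidates=()):
    """Point pytesseract at the resolved executable instead of editing pytesseract.py."""
    import pytesseract  # imported lazily so the helper above works without it

    cmd = resolve_tesseract_cmd(candidates)
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

With the install location used in this post, the call would be configure_pytesseract(["D:/Program Files (x86)/Tesseract-OCR/tesseract.exe"]).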
Finally, run the script; it prints the text recognized from the image.
On Linux (a yum-based distribution such as CentOS), install the build dependencies first:
yum install -y automake autoconf libtool gcc gcc-c++
yum install -y libpng-devel libjpeg-devel libtiff-devel
Download leptonica 1.78 and tesseract-ocr 4.0:
wget -O mirrors-leptonica-1.78.0.zip https://gitee.com/mirrors/leptonica/repository/archive/1.78.0.zip
wget -O MaNongM-tesseract-4.0.0.zip https://gitee.com/MaNongM/tesseract/repository/archive/4.0.0.zip
unzip mirrors-leptonica-1.78.0.zip
cd leptonica/
sh autogen.sh
./configure --prefix=/usr/local/leptonica
make
make install
Configure the leptonica environment variables:
vim /etc/profile
Add the following lines:
PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/leptonica/lib/pkgconfig
export PKG_CONFIG_PATH
CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/leptonica/include/leptonica
export CPLUS_INCLUDE_PATH
C_INCLUDE_PATH=$C_INCLUDE_PATH:/usr/local/leptonica/include/leptonica
export C_INCLUDE_PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/leptonica/lib
export LD_LIBRARY_PATH
LIBRARY_PATH=$LIBRARY_PATH:/usr/local/leptonica/lib
export LIBRARY_PATH
LIBLEPT_HEADERSDIR=/usr/local/leptonica/include/leptonica
export LIBLEPT_HEADERSDIR
Apply the configuration:
source /etc/profile
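Since tesseract's configure step locates leptonica through pkg-config, it is worth checking that PKG_CONFIG_PATH took effect before building. This is a rough sketch under two assumptions that are not from the steps above: that the pkg-config module is named lept, and that tesseract 4.0 wants leptonica 1.74 or newer. Treat both as things to verify for your versions.

```python
import subprocess

# Assumption: minimum leptonica version expected by tesseract 4.0
MIN_LEPTONICA = (1, 74)


def parse_version(text):
    """Turn a version string such as '1.78.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def leptonica_version():
    """Ask pkg-config for the leptonica version; None if it cannot be found."""
    try:
        result = subprocess.run(
            ["pkg-config", "--modversion", "lept"],  # 'lept' is an assumed module name
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_version(result.stdout)


if __name__ == "__main__":
    found = leptonica_version()
    if found is None:
        print("leptonica not visible to pkg-config; check PKG_CONFIG_PATH")
    else:
        print("leptonica", found, "ok" if found >= MIN_LEPTONICA else "too old")
```

If this reports that leptonica is not visible, re-run source /etc/profile in the same shell before running ./configure for tesseract.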
unzip MaNongM-tesseract-4.0.0.zip
cd tesseract-4.0.0/
sh autogen.sh
./configure --prefix=/usr/local/tesseract
make
make install
Configure the tesseract environment variables by opening /etc/profile:
vim /etc/profile
Append the following lines:
PATH=$PATH:/usr/local/tesseract/bin
export PATH
TESSDATA_PREFIX=/usr/local/tesseract/share/tessdata
export TESSDATA_PREFIX
Apply the configuration:
source /etc/profile
Download the tessdata language packs:
wget -O rx-code-tessdata_fast-master.zip https://gitee.com/rx-code/tessdata_fast/repository/archive/master.zip
unzip rx-code-tessdata_fast-master.zip
mv /usr/local/tesseract/share/tessdata/ /usr/local/tesseract/share/tessdata_bak # back up the original data files
mkdir /usr/local/tesseract/share/tessdata/
mv tessdata_fast/* /usr/local/tesseract/share/tessdata/
Note: make a habit of backing up data; otherwise a mistake here means reinstalling tesseract.
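The backup-then-replace steps above can also be scripted so the backup is never skipped. A minimal sketch follows; replace_tessdata is a made-up helper name, and it deliberately refuses to run if a _bak directory already exists so that an earlier backup is not overwritten.

```python
import os
import shutil


def replace_tessdata(tessdata_dir, new_data_dir):
    """Back up tessdata_dir as <name>_bak, then fill a fresh one from new_data_dir."""
    tessdata_dir = tessdata_dir.rstrip("/\\")
    backup_dir = tessdata_dir + "_bak"
    if os.path.exists(backup_dir):
        # Refuse to overwrite an earlier backup
        raise FileExistsError(backup_dir)
    shutil.move(tessdata_dir, backup_dir)
    os.makedirs(tessdata_dir)
    # Move every file from the unpacked language pack into the new directory
    for name in os.listdir(new_data_dir):
        shutil.move(os.path.join(new_data_dir, name),
                    os.path.join(tessdata_dir, name))
    return backup_dir
```

With the paths used in this post, the call would be replace_tessdata("/usr/local/tesseract/share/tessdata/", "tessdata_fast").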
Test it:
tesseract -v
Output:
tesseract 4.0.0
leptonica-1.78.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
Found AVX
Found SSE
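If you want to check the installed versions from a script (for example in a deployment check), the output above is easy to parse. The sketch below is written against the output shown here; parse_tesseract_version is an illustrative name, and other builds may format these lines differently.

```python
import re


def parse_tesseract_version(output):
    """Pull the tesseract and leptonica versions out of `tesseract -v` output."""
    versions = {}
    for line in output.splitlines():
        # Matches lines like 'tesseract 4.0.0' and ' leptonica-1.78.0'
        match = re.match(r"\s*(tesseract|leptonica)[ -]v?([\w.\-]+)", line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


if __name__ == "__main__":
    sample = "tesseract 4.0.0\n leptonica-1.78.0\n Found AVX\n Found SSE\n"
    print(parse_tesseract_version(sample))
```

The text to parse can be obtained with subprocess.run(["tesseract", "-v"], capture_output=True, text=True); some builds write it to stderr rather than stdout, so check both.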
# test.py
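import requests
from PIL import Image
import pytesseract
from io import BytesIO

# Uncomment if tesseract cannot be found through PATH:
# pytesseract.pytesseract.tesseract_cmd = '/usr/local/tesseract/bin/tesseract'
# Uncomment and pass config=tessdata_dir_config below if TESSDATA_PREFIX is not set:
# tessdata_dir_config = '--tessdata-dir "/usr/local/tesseract/share/tessdata"'

# Download the test image into memory
tempdata = requests.get("https://image1.guazistatic.com/qn200416174110f7bacec292ac6b9f867957cbcf4079eb.jpg")
tempdata.raise_for_status()  # stop early if the download failed
tempIm = BytesIO(tempdata.content)
image = Image.open(tempIm)

# Run OCR with the default language and print the recognized text
code = pytesseract.image_to_string(image)
print(code)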
Run the test script:
python3 test.py
Output:
2013-04
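Installation on Win10 and Linux is broadly similar. Linux just needs extra libraries and takes longer to build, and a failed step may mean starting over. The environment variables matter on both systems, and so do the language data files for the images you want to parse: if a required language file is missing, recognition fails with an error.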
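To catch a missing language file before the OCR call fails, you can look for the corresponding .traineddata file in the tessdata directory first. A minimal sketch, with missing_languages as a made-up helper name:

```python
import os


def missing_languages(tessdata_dir, langs):
    """Return the language codes that have no <code>.traineddata file."""
    return [
        lang for lang in langs
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


if __name__ == "__main__":
    # TESSDATA_PREFIX is the variable configured earlier in this post
    tessdata = os.environ.get("TESSDATA_PREFIX", "")
    print(missing_languages(tessdata, ["eng", "chi_sim"]))
```

An empty list means every requested language is present, and you can then pass them to pytesseract.image_to_string(image, lang="chi_sim+eng").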