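When writing crawlers you inevitably run into all kinds of CAPTCHAs. Most of them are still image CAPTCHAs, which can often be read directly with OCR.
tesserocr is an OCR library for Python. It is a Python API wrapper around tesseract, so tesseract is the engine that does the actual work.
1) Installing tesserocr and Pillow
With a Python environment installed through Anaconda, installing tesserocr with pip failed with: tesserocr.cpp(555): fatal error C1083: Cannot open include file: 'leptonica/allheaders.h': No such file or directory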
C:\Users\Administrator>pip install tesserocr pillow
Collecting tesserocr
Using cached https://files.pythonhosted.org/packages/cf/0d/9e554f041962b8dd7ac
d978330535fed879452bb0af257c287ca4ae9c525/tesserocr-2.2.2.tar.gz
Requirement already satisfied: pillow in d:\programdata\anaconda3\lib\site-packa
ges (5.0.0)
Building wheels for collected packages: tesserocr
Running setup.py bdist_wheel for tesserocr ... error
Complete output from command d:\programdata\anaconda3\python.exe -u -c "import
setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-i
nstall-qo9ktk0e\\tesserocr\\setup.py';f=getattr(tokenize, 'open', open)(__file__
);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'ex
ec'))" bdist_wheel -d C:\Users\ADMINI~1\AppData\Local\Temp\pip-wheel-_vovqp88 --
python-tag cp36:
Supporting tesseract v3.05.01
Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_ti
me_env': {'TESSERACT_VERSION': 197889}}
running bdist_wheel
running build
running build_ext
building 'tesserocr' extension
creating build
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c
/nologo /Ox /W3 /GL /DNDEBUG /MD -Id:\programdata\anaconda3\include -Id:\progra
mdata\anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\V
C\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt"
"-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (
x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\inc
lude\winrt" /EHsc /Tptesserocr.cpp /Fobuild\temp.win-amd64-3.6\Release\tesserocr
.obj
tesserocr.cpp
tesserocr.cpp(555): fatal error C1083: Cannot open include file: 'leptonica/al
lheaders.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN
\\x86_amd64\\cl.exe' failed with exit status 2
----------------------------------------
Failed building wheel for tesserocr
Running setup.py clean for tesserocr
Failed to build tesserocr
Installing collected packages: tesserocr
Running setup.py install for tesserocr ... error
Complete output from command d:\programdata\anaconda3\python.exe -u -c "impo
rt setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip
-install-qo9ktk0e\\tesserocr\\setup.py';f=getattr(tokenize, 'open', open)(__file
__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, '
exec'))" install --record C:\Users\ADMINI~1\AppData\Local\Temp\pip-record-hal0op
96\install-record.txt --single-version-externally-managed --compile:
Supporting tesseract v3.05.01
Building with configs: {'libraries': ['tesseract', 'lept'], 'cython_compile_
time_env': {'TESSERACT_VERSION': 197889}}
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe
/c /nologo /Ox /W3 /GL /DNDEBUG /MD -Id:\programdata\anaconda3\include -Id:\prog
ramdata\anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0
\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt
" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files
(x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\i
nclude\winrt" /EHsc /Tptesserocr.cpp /Fobuild\temp.win-amd64-3.6\Release\tessero
cr.obj
tesserocr.cpp
tesserocr.cpp(555): fatal error C1083: Cannot open include file: 'leptonica/
allheaders.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\B
IN\\x86_amd64\\cl.exe' failed with exit status 2
----------------------------------------
Command "d:\programdata\anaconda3\python.exe -u -c "import setuptools, tokenize;
__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-install-qo9ktk0e\\tesse
rocr\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replac
e('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --recor
d C:\Users\ADMINI~1\AppData\Local\Temp\pip-record-hal0op96\install-record.txt --
single-version-externally-managed --compile" failed with error code 1 in C:\User
s\ADMINI~1\AppData\Local\Temp\pip-install-qo9ktk0e\tesserocr\
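Solution:
Run conda install -c simonflueckiger tesserocr in cmd. (This is the method I used at first. It installed tesserocr 2.4, which is built against tesseract 3.05.02, and that caused a lot of trouble later in the installation.)
Alternatively, install from a whl file: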
https://github.com/simonflueckiger/tesserocr-windows_build/releases
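Notes:
a) If the installation fails with the following error:
Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
download and install the following (the installer needs a network connection):
Microsoft Visual C++ Build Tools 2015
b) When I created the project Test in PyCharm, I had it use a new Anaconda environment. The packages installed above therefore went into the default environment, and the project still failed with: ModuleNotFoundError: No module named 'tesserocr'
In that case, name the target environment in the install command:
conda install -n Test -c simonflueckiger tesserocr
where Test is the name of the environment my project uses.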
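A quick way to confirm which environment a script really runs in is to ask the interpreter itself. The sketch below uses only the standard library; describe_environment is a made-up helper name for illustration.

```python
import importlib.util
import sys


def describe_environment(module_name="tesserocr"):
    """Report which interpreter is running and whether a module is importable from it."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,   # path of the running python.exe
        "prefix": sys.prefix,           # root directory of the active environment
        "installed": spec is not None,  # True if the module can be imported here
    }
```

Printing describe_environment() from inside the PyCharm project shows at a glance whether the project interpreter is the environment the package was installed into.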
Writing a test program
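The test program itself is not reproduced here. A minimal sketch of what such a program might look like follows, assuming the CAPTCHA has been saved as captcha.png (a hypothetical file name). The call to tesserocr.image_to_text is the one that appears in the traceback later in this post; recognize_captcha is an illustrative name. The imports sit inside the function only so that the file can still be loaded in an environment where tesserocr is missing.

```python
def recognize_captcha(image_path):
    """Open an image with Pillow and return the text tesserocr reads from it."""
    # Imported here so a missing package surfaces only when OCR is actually attempted.
    import tesserocr
    from PIL import Image

    with Image.open(image_path) as image:
        return tesserocr.image_to_text(image).strip()
```

Usage would be a single call such as print(recognize_captcha("captcha.png")).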
After installation, running the program failed with a Failed to init API error:
result = tesserocr.image_to_text(image)
File "tesserocr.pyx", line 2443, in tesserocr._tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path: D:\Dev\Anaconda3\envs\Test\
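The fixes usually suggested online are:
Method 1: copy Tesseract's tessdata directory into the directory named in the error message.
Method 2: add an environment variable TESSDATA_PREFIX that points to the parent directory of tessdata (usually the Tesseract-OCR installation directory).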
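Whichever method is used, the directory tesseract ends up reading has to contain the language files, which are named <lang>.traineddata. A small check, sketched with the standard library only; missing_languages is a hypothetical helper, not part of tesserocr, and it verifies only that the files exist, not that they match the tesseract version.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages=("eng",)):
    """Return the languages whose <lang>.traineddata file is absent from tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        lang for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]
```

An empty list means every requested language file was found. For the error shown above, the directory to check would be the tessdata folder under D:\Dev\Anaconda3\envs\Test\.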
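Note:
The tesserocr package bundles the tesseract library files it needs, but not the language data. Ways to obtain tessdata:
(1) Download and install the Tesseract release that matches your tesserocr build.
Reference links: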
https://github.com/tesseract-ocr/tesseract
https://tesseract-ocr.github.io/
https://digi.bib.uni-mannheim.de/tesseract/
(2) Or download it from one of the following repositories:
https://github.com/tesseract-ocr/tessdata_best
https://github.com/tesseract-ocr/tessdata_fast
https://github.com/tesseract-ocr/tessdata
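After trying these, the problem was still there. I later found that the tesserocr installed as described above was built against 3.05.02, whereas my tesseract was 4.0.x and the tessdata I was using was also for 4.0.x. So I uninstalled tesserocr and installed it from a whl file instead.
Note that if Anaconda has several Python environments, you need to go into the Scripts directory of the environment you want to install into (the directory that contains pip.exe) and run the following command there: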
pip install D:\Dev\tesserocr-2.4.0-cp37-cp37m-win_amd64.whl
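The reason is that, from any other path, the python found through the system environment variables may be the one in Anaconda's default environment, in which case tesserocr is installed into the default environment rather than the one your project uses. If you do not use multiple Python environments, you will not have this problem.
Note: conda install -n <environment> <package> normally installs a dependency into a given environment, but it does not work for a whl file as above. conda does not recognize whl files; it only searches its own package channels and reports that it cannot find what you asked it to install.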
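An alternative to changing into the Scripts directory, not the approach used in this post, is to call pip through the interpreter of the target environment with python -m pip, which installs into that interpreter's environment. A sketch, where pip_install_command and install_wheel are made-up helpers and the wheel path would be the one from the command above:

```python
import subprocess
import sys


def pip_install_command(wheel_path, python=sys.executable):
    """Build a pip command that installs into the environment of the given interpreter."""
    return [python, "-m", "pip", "install", wheel_path]


def install_wheel(wheel_path, python=sys.executable):
    """Run the install and raise CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(wheel_path, python), check=True)
```

Passing the full path of the target environment's python.exe as the python argument removes any dependence on which interpreter happens to come first in PATH.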
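After installation, print(tesserocr.tesseract_version()) in the program showed that tesserocr was now built against tesseract 4.0.0, matching the version installed locally. Yet the CAPTCHA program above still raised the same Failed to init API error. Careful checking revealed the cause: while fighting this error earlier, I had downloaded several different versions of tesseract. I ended up with tesseract 4.0.0 installed locally, but the tessdata directory I had copied into the location named in the error came from version 5.0.0, so initialization failed. Copying a fresh tessdata directory from the 4.0.0 installation solved the problem.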
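Since each failure here came down to mismatched versions, it can help to compare major versions explicitly. The sketch below pulls the version number out of the text returned by tesserocr.tesseract_version(). The exact format of that text (a first line such as "tesseract 4.0.0" followed by library versions) is an assumption, and both helper names are hypothetical.

```python
import re


def parse_tesseract_version(version_text):
    """Extract (major, minor, patch) from text such as 'tesseract 4.0.0'."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"no tesseract version found in {version_text!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def same_major(version_a, version_b):
    """True when both version tuples share the same major number (e.g. 4.x with 4.x)."""
    return version_a[0] == version_b[0]
```

In practice one tuple would come from parse_tesseract_version(tesserocr.tesseract_version()) and the other from the version of the Tesseract release the tessdata was copied from.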
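To summarize the fixes for Failed to init API:
a) Find out which tesseract version your tesserocr is built against, and use the tessdata that belongs to that tesseract version.
b) By default tesserocr looks for tessdata in the root directory of the current Python environment, that is, next to python.exe.
You can also change the way it is called and specify the path in the program: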
api = tesserocr.PyTessBaseAPI(lang='eng', path='D:\\Dev\\tessdata')
api.SetImage(image)
print(api.GetUTF8Text())
api.End()
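c) Tesseract itself does not actually need to be installed locally; it is only a source for tessdata. Any matching tessdata directory will do, and the language data can be downloaded from the web.
With the problems above solved, I checked the recognition results. The accuracy is not very high; how to improve it is something to look into later.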
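Points b) and c) can be combined into a small path lookup: use an explicitly supplied tessdata directory when there is one, otherwise fall back to the directory next to python.exe described in b). A sketch under that assumption; resolve_tessdata_dir is a hypothetical helper, and the fallback mirrors the behaviour observed in this post rather than documented tesserocr behaviour.

```python
import sys
from pathlib import Path


def resolve_tessdata_dir(explicit_dir=None):
    """Return explicit_dir if given, else the tessdata folder beside the running python.exe."""
    if explicit_dir is not None:
        return Path(explicit_dir)
    return Path(sys.executable).parent / "tessdata"
```

The result, converted with str(), could then be passed as the path argument of tesserocr.PyTessBaseAPI in the call shown under b).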