Recognizing CAPTCHAs in Python with tesseract-ocr

My own take

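If the CAPTCHA can be bypassed at all, bypass it. Try several approaches first, because crawling without a CAPTCHA is far more efficient;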

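Leaving aside sites that change their CAPTCHA scheme every day: search Baidu for "Python 认证码识别" (Python CAPTCHA recognition) and the results are mostly tesserocr, pytesser and the like. Put simply, you install tesseract-ocr on your machine, then use Python's subprocess.Popen to run the corresponding command and capture the result from the terminal or from the output file. The approach is therefore not tied to any one language;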
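The approach described above (run the tesseract executable, then read the text file it writes) can be sketched roughly as follows. This is only an illustration written for Python 3, not the code of any particular wrapper; the names `build_command` and `ocr_image` are made up for this example, and it assumes the `tesseract` executable is on the PATH.

```python
import subprocess


def build_command(image_path, out_base, lang="eng", exe="tesseract"):
    """Return the argv list for: tesseract <image> <out_base> -l <lang>."""
    return [exe, image_path, out_base, "-l", lang]


def ocr_image(image_path, out_base="result", lang="eng"):
    """Run tesseract on one image and return the text it wrote to disk."""
    # tesseract appends ".txt" to the output base name on its own
    subprocess.check_call(build_command(image_path, out_base, lang))
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()
```

Calling `ocr_image("pin.png")` (a hypothetical file name) would run `tesseract pin.png result -l eng` and return the contents of `result.txt`.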

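I strongly recommend a CNN or something similar: the algorithm is well defined and the accuracy is high, and TensorFlow (tf) is quite usable these days. Tesseract copes reasonably well with clean, regular CAPTCHAs; I will write about CNNs when I have time;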

Steps in brief:

1. Install tesseract-ocr;

2. Install pytesser;

3. Train tesseract;

4. Use it

Detailed steps:

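##1. Install tesseract-ocr:

Windows

An installer is easy to find with a web search; the .exe installer is convenient, or you can use a packaged archive;

Linux

My virtual machine runs Arch Linux: pacman -S tesseract (the official package is named tesseract; language data such as tesseract-data-eng is packaged separately);

On Ubuntu, install the following pile of packages (I have not verified this myself...):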

sudo apt-get install autoconf automake libtool

sudo apt-get install libpng12-dev

sudo apt-get install libjpeg62-dev

sudo apt-get install libtiff4-dev

sudo apt-get install zlib1g-dev

sudo apt-get install libleptonica # install leptonica

sudo apt-get install gcc

sudo apt-get install g++

sudo apt-get install automake

tar zxvf tesseract-3.00.tar.gz

cd tesseract-3.00 && ./configure && make && sudo make install

In short: install whatever turns out to be missing

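##2. Install pytesser;

What pytesser does: you pass in an image, it runs tesseract to produce a text file whose name is hard-coded, and it then reads that file to get the recognition result. For multithreaded use you therefore have to modify it yourself, since concurrent calls would overwrite the same file. To use pytesser, Python also needs the PIL dependency;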
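One way to avoid the fixed output file name is to give every OCR call its own scratch location. The sketch below (Python 3, standard library only) is an assumption about how such a change could look, not a patch to pytesser itself; `unique_scratch_base` and `collect_bases` are hypothetical helper names.

```python
import os
import tempfile
import threading


def unique_scratch_base(prefix="ocr_"):
    """Create a private temporary directory and return an output base in it."""
    scratch_dir = tempfile.mkdtemp(prefix=prefix)
    return os.path.join(scratch_dir, "out")


def collect_bases(n_threads=4):
    """Request a scratch base from several threads and return all of them."""
    bases = []
    lock = threading.Lock()

    def worker():
        base = unique_scratch_base()
        with lock:
            bases.append(base)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bases
```

Each thread would pass its own base name to tesseract, read `base + ".txt"`, and finally delete the directory (for example with `shutil.rmtree`), so no two threads ever touch the same file.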

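Prefer installing with pip; if that fails, roll up your sleeves and install manually (I only had to go through this because I had broken something in my Python setup and hit an SSL error during installation...);

Windows:

If your machine is 32-bit, congratulations: 32-bit installers are all over the web;

For 64-bit systems, download from here, everything is precompiled: http://www.lfd.uci.edu/~gohlke/pythonlibs/

Linux:

Default install path: /usr/local/lib

Install libjpeg: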

$ tar zxvf jpegsrc.v7.tar.gz

$ cd jpeg-7

$ ./configure --enable-shared --enable-static

$ make

$ sudo make install

Install zlib:

$ tar zxvf zlib-1.2.8.tar.gz

$ cd zlib-1.2.8

$ ./configure

$ make

$ sudo make install

Install freetype:

$ tar zxf freetype-2.6.1.tar.gz

$ cd freetype-2.6.1

$ ./configure

$ make

$ sudo make install

Install PIL:

$ unzip Imaging-1.1.7.zip

$ cd Imaging-1.1.7

a) Edit setup.py and set the library paths:

JPEG_ROOT = "/usr/local/include"

ZLIB_ROOT = "/usr/local/include"

FREETYPE_ROOT = "/usr/local/include"

b) Build: python setup.py build_ext -i

c) Test the build: python selftest.py

d) Install: python setup.py install

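Install pytesser. There are several ways: pip, or a download from GitHub;

Common error messages:

AttributeError: 'NoneType' object has no attribute 'bands'

Edit /usr/lib/python2.7/site-packages/PIL/Image.py (for example with nano)

At line 1496, add self.load() at the top of split():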

def split(self):
    "Split image into bands"
    self.load()
    if self.im.bands == 1:
        ims = [self.copy()]
    else:
        ims = []
        self.load()
        for i in range(self.im.bands):
            ims.append(self._new(self.im.getband(i)))
    return tuple(ims)

OSError: [Errno 2] No such file or directory

There are many possible causes: tesseract-ocr is not installed; a Python import problem; the image does not exist; and so on;
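Two of the causes listed above (missing executable, missing image) can be checked before calling the OCR at all. The following Python 3 sketch is only an illustration; `diagnose` is a hypothetical helper, and it assumes the executable is looked up on the PATH under the name `tesseract`.

```python
import os
import shutil


def diagnose(image_path, exe="tesseract"):
    """Return a list of likely reasons for 'No such file or directory'."""
    problems = []
    # shutil.which returns None when the executable is not on the PATH
    if shutil.which(exe) is None:
        problems.append("executable not found on PATH: " + exe)
    if not os.path.isfile(image_path):
        problems.append("image file not found: " + image_path)
    return problems
```

An empty list means neither of these two causes applies, so the problem lies elsewhere (for example in how the Python package is imported).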

A problem that shows up when using image_to_string:

Traceback (most recent call last):
  File "C:\Users\TF-2016\Desktop\spider\ruijie\ruijie.py", line 33, in
    print image_file_to_string('11.png', graceful_errors=True)
  File "C:\Python27\lib\site-packages\pytesser\pytesser.py", line 48, in image_file_to_string
    call_tesseract(filename, scratch_text_name_root)
  File "C:\Python27\lib\site-packages\pytesser\pytesser.py", line 23, in call_tesseract
    proc = subprocess.Popen(args)
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
    startupinfo)
WindowsError: [Error 2]

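Either place the script you want to run directly inside the pytesser package directory, or change into that directory first (the raw string keeps the backslashes literal):

>>> import os

>>> os.chdir(r'C:\Python27\Lib\site-packages\pytesser')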
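A global `os.chdir` also changes how every later relative path in the script is resolved. As a hedged alternative, the directory can be switched only around the OCR call and restored afterwards. The sketch below is for Python 3, and `working_directory` is a helper name invented for this example.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the script ends up where it started
        os.chdir(previous)
```

The OCR call would then sit inside `with working_directory(r'C:\Python27\Lib\site-packages\pytesser'):`, with image paths given as absolute paths so they still resolve inside the block.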

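##3. Train tesseract:

The book 《Python网络数据采集》 (Web Scraping with Python) also covers this; here the training is done with jTessBoxEditor, a ready-made tool;

Recognizing CAPTCHAs with the default eng data is basically unreliable and produces all kinds of junk output; training on your own data works much better;

After installing Tesseract-OCR, train your own recognition data: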

1. Create the tiff:

jTessBoxEditor -> Tools -> Merge TIFF, then Shift-select the images to merge

tiff file naming format: [lang].[fontname].exp[num].tif

e.g.: num.font.exp0.tif
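The naming format above is easy to get wrong by hand, so here is a small sketch that builds and checks such names. It is written for Python 3; `tiff_name` and `parse_tiff_name` are hypothetical helpers, and the pattern assumes that neither the language nor the font name contains a dot.

```python
import re

# [lang].[fontname].exp[num].tif
_TIFF_NAME = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def tiff_name(lang, font, num=0):
    """Build a training image name such as num.font.exp0.tif."""
    return "{}.{}.exp{}.tif".format(lang, font, num)


def parse_tiff_name(name):
    """Split a training image name into (lang, font, num)."""
    match = _TIFF_NAME.match(name)
    if match is None:
        raise ValueError("not a [lang].[fontname].exp[num].tif name: " + name)
    return match.group("lang"), match.group("font"), int(match.group("num"))
```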

2. Once the tiff exists, create the .box file

tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox

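3. Correct the characters:

jTessBoxEditor -> Open -> num.font.exp0.tif

In short: when the tif is opened, a first recognition pass has already been run, so what remains is to correct its results. You can adjust the position and size of each box as well as the recognized character; after each adjustment, remember to press Enter or click the set button;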
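The corrections made in jTessBoxEditor end up in the .box file, which is plain text. As far as I know, in Tesseract 3.x each line has the form `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image; treat that layout as an assumption to check against your own file. Under that assumption, a wrong symbol could also be fixed with a small Python 3 script like this one (`parse_box_line`, `format_box_line` and `relabel` are made-up names):

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse one box line into a Box tuple."""
    # Split from the right so the symbol itself is kept whole
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box):
    """Turn a Box tuple back into a box-file line."""
    return "{} {} {} {} {} {}".format(*box)


def relabel(lines, index, new_char):
    """Return the lines with the symbol of one box replaced."""
    boxes = [parse_box_line(line) for line in lines]
    boxes[index] = boxes[index]._replace(char=new_char)
    return [format_box_line(box) for box in boxes]
```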

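4. Define the font properties file:

Create a file named font_properties with the content: font 0 0 0 0 0

e.g.: font 0 0 0 0 0 means the font named "font" is not italic, bold, fixed-pitch, serif or fraktur (the five flags appear in that order)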
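A font_properties line can also be generated rather than typed. The Python 3 sketch below assumes the flag order `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`; `font_properties_line` is a hypothetical helper name.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line; each flag becomes 0 or 1."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])
```

Writing the returned string to a file named `font_properties` reproduces the file from step 4.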

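5. Generate the traineddata file

Run the following in the directory that holds the sample images; you can also put them in a .bat file and run them in one go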

tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

unicharset_extractor.exe num.font.exp0.box

mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

cntraining.exe num.font.exp0.tr

rename normproto num.normproto

rename inttemp num.inttemp

rename pffmtable num.pffmtable

rename shapetable num.shapetable

combine_tessdata.exe num.

Check that the entries Offset 1, 3, 4, 5 and 13 in the printed output are not -1.
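Instead of a .bat file, the same sequence of commands can be driven from Python. The sketch below (Python 3) only mirrors the commands listed in step 5 and assumes the training tools are on the PATH; the function names are invented for this example, and `os.rename` stands in for the shell's `rename`.

```python
import os
import subprocess


def training_steps(lang="num", font="font", exp=0):
    """Return the training commands from step 5 as argv lists."""
    base = "{}.{}.exp{}".format(lang, font, exp)
    return [
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", base + ".tr"],
        ["cntraining", base + ".tr"],
    ]


def rename_pairs(lang="num"):
    """Return (old, new) names for the files that need the lang prefix."""
    names = ["normproto", "inttemp", "pffmtable", "shapetable"]
    return [(name, lang + "." + name) for name in names]


def combine_step(lang="num"):
    """Return the final combine_tessdata command."""
    return ["combine_tessdata", lang + "."]


def run_all(lang="num", font="font", exp=0):
    """Run every step in order, stopping at the first failing command."""
    for cmd in training_steps(lang, font, exp):
        subprocess.check_call(cmd)
    for old, new in rename_pairs(lang):
        os.rename(old, new)
    subprocess.check_call(combine_step(lang))
```

`run_all()` would be called from the sample image directory; the offsets printed by combine_tessdata still need to be checked by eye.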

6. Copy

Finally, copy the traineddata file into the tessdata folder under Tesseract-OCR

7. Test

tesseract.exe pin.png result-eng -l eng

tesseract.exe pin.png result-num -l num

##4. Usage

import pytesser

from PIL import Image

im = Image.open('./pin.png')

im.show()

print pytesser.image_to_string(im)
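The string returned by the OCR usually carries whitespace, line breaks or stray punctuation, so it helps to filter it before submitting it as a CAPTCHA answer. This Python 3 sketch is an addition of mine rather than part of pytesser; `clean_captcha_text` is a hypothetical helper, and it assumes the CAPTCHA contains only ASCII letters and digits.

```python
import re


def clean_captcha_text(raw, allowed="A-Za-z0-9"):
    """Drop every character that is not in the allowed character class."""
    return re.sub("[^{}]".format(allowed), "", raw)
```

For a digits-only CAPTCHA, passing `allowed="0-9"` narrows the filter accordingly.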

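The recognition rate is still not ideal, but it is already good enough for collecting data~~ (what I was wrestling with was the CAPTCHA of Weibo (weibo.com); as for the weibo.cn CAPTCHA, I typed it in by hand a dozen times without passing once, sigh); I am considering trying convolution~~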
