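While first practicing simple CAPTCHA recognition with pytesser3, I ran into a 'gbk' codec decoding error. After some trial and error I got it working, so I am writing the fix down here to share with everyone!
The CAPTCHA recognition code was: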
from PIL import Image
import pytesser3
im = Image.open("captcha.gif")
print(pytesser3.image_file_to_string("captcha.gif"))
print(pytesser3.image_to_string(im))
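Before this, I had already installed PIL and Tesseract-OCR and updated the environment variables. I had also changed tesseract_exe_name in the pytesser3 package's __init__.py to the Tesseract-OCR installation path.
Running the code above produced the following error: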
Traceback (most recent call last):
  File ".../pytesser3_try.py", line 6, in <module>
    print(pytesser3.image_file_to_string("captcha.gif"))
  File "C:\Users\1\AppData\Roaming\Python\Python36\site-packages\pytesser3\__init__.py", line 44, in image_file_to_string
    text = util.retrieve_text(scratch_text_name_root)
  File "C:\Users\1\AppData\Roaming\Python\Python36\site-packages\pytesser3\util.py", line 11, in retrieve_text
    text = inf.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x99 in position 8: illegal multibyte sequence
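After some digging, I found that the fix is to add an encoding argument to the open call inside the retrieve_text function in the util.py file shown in the traceback. Without that argument, open decodes the file with the system's default encoding (gbk on my machine), while the text file written by Tesseract is UTF-8.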
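To make the cause concrete, here is a minimal sketch that reproduces the same kind of failure without pytesser3 or Tesseract. The sample bytes are an assumption for illustration only: the real contents of the scratch file are not shown in the traceback, and the byte 0x99 is merely consistent with the last byte of a UTF-8 curly apostrophe (E2 80 99).

```python
# Hypothetical sample: ASCII text followed by a right single quotation
# mark (U+2019), which is encoded as the bytes E2 80 99 in UTF-8.
raw = "abcdef\u2019\n".encode("utf-8")

# Decoding with the encoding the bytes were written in works.
decoded_utf8 = raw.decode("utf-8")

# Decoding the same bytes as GBK, the usual default on Simplified
# Chinese Windows, fails because they are not a valid GBK sequence.
try:
    raw.decode("gbk")
    gbk_failed = False
except UnicodeDecodeError as exc:
    gbk_failed = True
    print(exc)
```

The explicit "gbk" here stands in for what open does implicitly when no encoding is given on such a system.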
The original code was:
def retrieve_text(scratch_text_name_root):
    inf = open(scratch_text_name_root + '.txt')
    text = inf.read()
    inf.close()
    return text
After the change:
def retrieve_text(scratch_text_name_root):
    inf = open(scratch_text_name_root + '.txt', encoding='utf-8')
    text = inf.read()
    inf.close()
    return text
Running the code above again now works without any problems!
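For reference, the same fix can also be written with a context manager, so the file is closed even if reading raises. This is only a sketch of an equivalent function, not the code shipped with pytesser3: the encoding parameter with its default is my own addition, and the temporary file below is a hypothetical stand-in for Tesseract's output.

```python
import os
import tempfile


def retrieve_text(scratch_text_name_root, encoding="utf-8"):
    """Read the '<root>.txt' file using an explicit encoding."""
    # The 'encoding' parameter is an illustrative addition; the original
    # function takes only scratch_text_name_root.
    with open(scratch_text_name_root + ".txt", encoding=encoding) as inf:
        return inf.read()


# Stand-in for Tesseract's output: a UTF-8 text file with a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "scratch")
    with open(root + ".txt", "w", encoding="utf-8") as out:
        out.write("it\u2019s\n")
    result = retrieve_text(root)

print(repr(result))
```

Since the traceback shows the call going through util.retrieve_text, assigning a function like this to pytesser3.util.retrieve_text at runtime might avoid editing the installed package, but I have not verified that approach.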