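Overview: anyone doing automated testing or web scraping eventually runs into the same problem: how to recognize or bypass a captcha. The approaches below are the ones I have collected.
Preparation (Tesseract-OCR):
(1) Install Tesseract-OCR: https://digi.bib.uni-mannheim.de/tesseract/ (version 4.0 is recommended).
(2) Install pytesseract, a third-party Python library. It can be installed directly from PyCharm (or by running pip install pytesseract).
Recognition:
(1) Capture a screenshot of the captcha image.
(2) Preprocess the image: binarization.
(3) Recognize the text in the image, then delete the saved image.
Code example: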
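# -*- coding:utf-8 -*-
import os
import time
from io import BytesIO

import pytesseract
from PIL import Image
from selenium.webdriver.common.by import By


def screenshot_code(driver, veri_code_xpath):
    """
    Take a screenshot of the captcha image.
    :param driver: Selenium WebDriver instance
    :param veri_code_xpath: XPath of the captcha image element
    :return: the cropped captcha image (PIL Image)
    """
    # find_element(By.XPATH, ...) replaces find_element_by_xpath,
    # which was deprecated and later removed in Selenium 4
    element_screen = driver.find_element(By.XPATH, veri_code_xpath)
    location = element_screen.location
    size = element_screen.size
    # Capture the current window as PNG bytes
    graph_ver_code = driver.get_screenshot_as_png()
    # Open the screenshot and work out the region to crop
    image = Image.open(BytesIO(graph_ver_code))
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    image = image.crop((left, top, right, bottom))
    return image


def edit_picture(image):
    """
    Binarize the image to improve the recognition rate.
    :param image: captcha image (PIL Image)
    :return: path of the saved, processed image
    """
    image = image.convert('L')
    # Image.size is (width, height)
    width, height = image.size
    for x in range(width):
        for y in range(height):
            pixel = image.getpixel((x, y))
            # Pixels with a value from 130 to 150 are left unchanged
            if pixel > 150:
                image.putpixel((x, y), 255)
            elif pixel < 130:
                image.putpixel((x, y), 0)
    pic_name = time.strftime("%Y%m%d%H%M%S", time.localtime())
    image_path = r'G:\project\picture\%s.png' % pic_name
    image.save(image_path)
    return image_path


def recognize_captcha(image_path):
    """
    Recognize the captcha.
    :param image_path: path of the processed image
    :return: the recognized captcha text
    """
    # The with block closes the file so that it can be deleted afterwards
    with Image.open(image_path) as image:
        # strip() drops the trailing whitespace that Tesseract appends
        code = pytesseract.image_to_string(image).strip()
    # Delete the image after recognition (optional).
    # The original called remove_file from the author's own auto_common package.
    os.remove(image_path)
    print(code)
    return code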
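One thing that may be worth checking in screenshot_code: on high-DPI displays the screenshot can be larger than the page measured in CSS pixels, while the element's location and size are reported in CSS pixels, so the crop can miss the captcha. Below is a minimal sketch of the crop-box arithmetic with a scale factor. The helper crop_box and its scale parameter are my own illustration (the factor could, for example, be read from window.devicePixelRatio through driver.execute_script); they are not part of Selenium or Pillow.

```python
def crop_box(location, size, scale=1.0):
    """Return the (left, top, right, bottom) box to pass to Image.crop.

    location and size are the dicts Selenium returns for an element;
    scale is a hypothetical device-pixel-ratio factor (1.0 = no scaling).
    """
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    # Round to whole pixels
    return tuple(int(round(v)) for v in (left, top, right, bottom))


if __name__ == '__main__':
    # Element at (100, 50), 80x30 CSS pixels, on a display with ratio 2
    print(crop_box({'x': 100, 'y': 50}, {'width': 80, 'height': 30}, scale=2))
```

With scale left at 1.0 the result is the same box that screenshot_code computes.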
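Problems you may run into:
(1) Decoding error: UnicodeDecodeError: 'utf-8' codec can't decode....
Possible cause: the path configured in pytesseract.py is wrong. Set the tesseract_cmd variable to the path where Tesseract-OCR is installed, for example:
tesseract_cmd = r'F:\software\OCR\Tesseract-OCR\tesseract.exe'
The same value can also be set from your own script, without editing the library file, by assigning it to pytesseract.pytesseract.tesseract_cmd.
(2) Low recognition rate
Adjust the binarization thresholds, or train Tesseract on a set of sample captcha images. Training is essentially reinventing the wheel, so it is worth looking for an existing open-source solution first.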
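To get a feel for how the threshold changes the result before touching real images, the rule can be tried on plain numbers. Below is a minimal sketch that uses nested lists of grayscale values instead of a PIL image. The functions binarize and dark_ratio and the sample grid are illustrative only, and unlike edit_picture this sketch uses a single threshold, so every pixel ends up as 0 or 255.

```python
def binarize(pixels, threshold=140):
    """Map each grayscale value (0-255) to 0 or 255 using one threshold."""
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


def dark_ratio(pixels):
    """Fraction of pixels that are black after binarization."""
    flat = [value for row in pixels for value in row]
    return sum(1 for value in flat if value == 0) / len(flat)


if __name__ == '__main__':
    sample = [
        [30, 135, 200],
        [145, 90, 250],
    ]
    # A higher threshold turns more pixels black
    for threshold in (100, 140, 180):
        result = binarize(sample, threshold)
        print(threshold, result, round(dark_ratio(result), 2))
```

On a real image, the same single-threshold rule can be applied with Pillow's Image.point, for example image.convert('L').point(lambda p: 255 if p > 140 else 0), which avoids the per-pixel Python loop.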
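If an API response returns the captcha, you can read the captcha straight from that response.
Approach: use Fiddler to capture the traffic and save it to a local file automatically, then read the captcha from that file.
(1) Open the Fiddler menu: Rules > Customize Rules
(2) Add the following JavaScript (JScript.NET) code to the OnBeforeRequest method; replace the login API URLs and the file path with your own: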
//Save the request
if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
{
    var fso;
    var file;
    fso = new ActiveXObject("Scripting.FileSystemObject");
    var timestamp = Date.parse(new Date());
    //File path, change as needed. 8 = open for appending; the file is created if missing and written as Unicode (UTF-16)
    file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
    file.writeLine("Request url: " + oSession.url);
    file.writeLine("Request header:" + "\n" + oSession.oRequest.headers);
    file.writeLine("Request body: " + oSession.GetRequestBodyAsString());
    file.writeLine("\n");
    file.close();
}
(3) Add the following JavaScript (JScript.NET) code to the OnBeforeResponse method; replace the login API URLs and the file path with your own:
//Save the response
if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
{
    oSession.utilDecodeResponse();//Decode the response so that the saved body is not garbled
    var fso;
    var file;
    fso = new ActiveXObject("Scripting.FileSystemObject");
    var timestamp = Date.parse(new Date());
    //File path, change as needed
    file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
    file.writeLine("Response code: " + oSession.responseCode);
    file.writeLine("Response body: " + oSession.GetResponseBodyAsString());
    file.writeLine("\n");
    file.close();
}
(4) Start Fiddler, then open the login page so that the login request file is generated.
(5) Python function that reads the captcha from the saved file:
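# -*- coding:utf-8 -*-
import json


def return_veri_code(response_file):
    """
    Read the captcha from the file saved by Fiddler.
    :param response_file: path of the saved file
    :return: the captcha, or '' if no matching line is found
    """
    # The Fiddler rule writes the file as UTF-16
    with open(response_file, 'r', encoding='utf-16') as fp:
        lines = fp.readlines()
    prefix = 'Response body: '
    expect = prefix + '{"header":{"code'
    code = ''
    for line in lines:
        if expect in line:
            # Drop the "Response body: " prefix to get the JSON text
            real = line[line.index(prefix) + len(prefix):]
            # Parse the JSON text into a dict; json.loads also accepts
            # true/false/null, which ast.literal_eval rejects
            real_dic = json.loads(real)
            # Read the captcha from the response
            code = real_dic['body']['code']
    return code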
(6) Call this function from the login method and log in directly with the captcha it returns.
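Because the Fiddler rule puts a timestamp in every file name, the login method first has to decide which file to read. Below is a minimal sketch, assuming the files are named response*.txt and sit in one directory (as in the rule above), and that the most recently modified file is the one wanted. The helper latest_response_file is hypothetical, and the demo writes throwaway files into a temporary directory.

```python
import glob
import os
import tempfile


def latest_response_file(directory, pattern='response*.txt'):
    """Return the most recently modified file matching pattern, or None."""
    candidates = glob.glob(os.path.join(directory, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == '__main__':
    # Simulate two files written by the Fiddler rule at different times
    with tempfile.TemporaryDirectory() as tmp:
        for name, mtime in (('response1000.txt', 1000), ('response2000.txt', 2000)):
            path = os.path.join(tmp, name)
            with open(path, 'w', encoding='utf-16') as fp:
                fp.write('Response body: {}\n')
            # Set the modification time explicitly so the order is deterministic
            os.utime(path, (mtime, mtime))
        print(os.path.basename(latest_response_file(tmp)))
```

The returned path can then be passed to return_veri_code.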
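Preparation (Baidu OCR):
(1) Open the Baidu AI Cloud console: https://login.bce.baidu.com/?redirect=https%3A%2F%2Fconsole.bce.baidu.com%2F%3Ffromai%3D1#/aip/overview
(2) Register an account and create an application: https://jingyan.baidu.com/article/ab0b563063a586c15bfa7d55.html
(3) Get your own API_KEY and SECRET_KEY. The free quota is 5,000 calls per day.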
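Once the two keys are available, they have to be placed into the token request. Below is a minimal sketch that builds the token URL with urllib.parse.urlencode and reads the keys from environment variables instead of hardcoding them. The variable names BAIDU_API_KEY and BAIDU_SECRET_KEY and the helper build_token_url are my own assumptions for illustration; the endpoint and parameter names are the ones used in this article's code.

```python
import os
from urllib.parse import urlencode

TOKEN_ENDPOINT = 'https://aip.baidubce.com/oauth/2.0/token'


def build_token_url(api_key, secret_key):
    """Build the access_token request URL for the given key pair."""
    query = urlencode({
        'grant_type': 'client_credentials',
        'client_id': api_key,
        'client_secret': secret_key,
    })
    return TOKEN_ENDPOINT + '?' + query


if __name__ == '__main__':
    # Hypothetical environment variable names; the fallbacks are placeholders
    api_key = os.environ.get('BAIDU_API_KEY', 'API_KEY')
    secret_key = os.environ.get('BAIDU_SECRET_KEY', 'SECRET_KEY')
    print(build_token_url(api_key, secret_key))
```

Using urlencode also escapes any special characters in the keys, which plain string concatenation does not.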
Recognition code example:
(1) Function that calls the Baidu OCR open API:
# -*- coding:utf-8 -*-
import base64

import requests


def image_to_words(image_path):
    """
    Call the Baidu OCR open API to recognize the text in an image.
    :param image_path: path of the image
    :return: words: the recognized text ('' on failure)
    """
    # client_id is the API_KEY and client_secret is the SECRET_KEY from the Baidu console
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&' \
           'client_id=API_KEY&client_secret=SECRET_KEY'
    response = requests.get(host)
    # Get the access_token; a Response is truthy only when the HTTP status is below 400
    if response:
        access_token = response.json()['access_token']
        print('access_token obtained:', access_token)
    else:
        print('Failed to get access_token:', response.status_code)
        return ''
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"
    # The image is sent Base64-encoded
    with open(image_path, 'rb') as f:
        img = base64.b64encode(f.read())
    params = {"image": img}
    request_url = request_url + "?access_token=" + access_token
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(request_url, data=params, headers=headers)
    words = ''
    if response:
        # Fall back to an empty list if words_result is missing
        datas = response.json().get('words_result') or []
        for i in datas:
            words = words + i.get('words', '')
    else:
        # There is no active exception here, so report the HTTP status instead of a traceback
        print('Recognition failed:', response.status_code, response.text)
    return words
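The response handling can be separated from the HTTP call so that it is easy to check on its own. Below is a minimal sketch, assuming a successful response carries a words_result list of {"words": ...} items (as the code above reads it) and that a failed call carries error_code and error_msg fields. The error format is my understanding of the Baidu API and should be checked against its documentation; extract_words is a hypothetical helper.

```python
def extract_words(payload):
    """Join the recognized text from a parsed OCR response dict.

    Raises ValueError when the payload looks like an error response.
    """
    if 'error_code' in payload:
        # Assumed error format: {"error_code": ..., "error_msg": ...}
        raise ValueError('OCR error %s: %s' % (payload['error_code'], payload.get('error_msg')))
    results = payload.get('words_result') or []
    # A captcha should not contain spaces, so remove them from each piece
    return ''.join(item.get('words', '').replace(' ', '') for item in results)


if __name__ == '__main__':
    # Made-up sample payload in the assumed success format
    sample = {'words_result': [{'words': 'a B'}, {'words': '12'}]}
    print(extract_words(sample))
```

Raising on an error payload keeps a failed recognition from being mistaken for an empty captcha.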
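Note: if you know of other approaches, please leave a comment. I will add them to the article once I have tried and verified them, in the hope of pooling everyone's experience to solve this problem better.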