Python Automated Testing: A Summary of Solutions for Login CAPTCHA Recognition

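Background: automated tests and crawlers almost always run into the same problem, namely how to recognize or bypass a CAPTCHA. The main approaches are: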

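  • OCR recognition
    • Local OCR recognition
    • Third-party OCR API recognition
  • Capturing the CAPTCHA with Fiddler
  • Bypassing CAPTCHA verification at login via a session
  • Hard-coding the CAPTCHA or disabling CAPTCHA verification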

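1. Local OCR recognition

Preparation:

(1) Install Tesseract-OCR: https://digi.bib.uni-mannheim.de/tesseract/ (version 4.0 is recommended).

(2) Install pytesseract, a third-party Python library; it can be installed directly from PyCharm.

Recognition:

(1) Capture the CAPTCHA image

(2) Process the image: binarization

(3) Recognize the image content and delete the saved image

Code example: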

# -*- coding:utf-8 -*-
import os
import time
from io import BytesIO

import pytesseract
from PIL import Image
from selenium.webdriver.common.by import By


def screenshot_code(driver, veri_code_xpath):
    """
    Take a screenshot of the CAPTCHA image.
    :param driver: Selenium WebDriver instance
    :param veri_code_xpath: XPath of the CAPTCHA image
    :return: the cropped CAPTCHA image (PIL Image)
    """
    element_screen = driver.find_element(By.XPATH, veri_code_xpath)
    location = element_screen.location
    size = element_screen.size
    # Take a screenshot of the current window as PNG bytes
    graph_ver_code = driver.get_screenshot_as_png()

    # Open the screenshot and crop it to the element's bounding box
    # (on scaled/high-DPI displays the coordinates may need to be multiplied by the scale factor)
    image = Image.open(BytesIO(graph_ver_code))
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    image = image.crop((left, top, right, bottom))
    return image


def edit_picture(image):
    """
    Binarize the image to improve the recognition rate.
    :param image: CAPTCHA image (PIL Image)
    :return: path of the saved, processed image
    """
    image = image.convert('L')
    width, height = image.size
    for x in range(width):
        for y in range(height):
            pixel = image.getpixel((x, y))
            # Pixels with a value between 130 and 150 are left unchanged
            if pixel > 150:
                image.putpixel((x, y), 255)
            elif pixel < 130:
                image.putpixel((x, y), 0)
    pic_name = time.strftime("%Y%m%d%H%M%S", time.localtime())
    # Save directory; change it to a directory that exists on your machine
    image_path = r'G:\project\picture\%s.png' % pic_name
    image.save(image_path)
    return image_path


def recognize_captcha(image_path):
    """
    Recognize the CAPTCHA.
    :param image_path: path of the processed image
    :return: the recognized CAPTCHA text
    """
    # Close the file before deleting it (an open handle blocks deletion on Windows)
    with Image.open(image_path) as image:
        code = pytesseract.image_to_string(image).strip()
    # Delete the image after recognition (optional)
    os.remove(image_path)
    print(code)
    return code
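
OCR output often carries stray whitespace or symbols, so it usually needs a little cleanup before it is typed into the login form. The sketch below is one way to normalize it. `clean_captcha_text` and `expected_length` are hypothetical names introduced here, and the assumption that the CAPTCHA consists of 4 ASCII letters and digits has to be adjusted to the site under test.

```python
import re


def clean_captcha_text(raw, expected_length=4):
    """Normalize OCR output; return None if it cannot be a valid CAPTCHA.

    Assumption: the CAPTCHA consists of `expected_length` ASCII letters/digits.
    """
    # Drop everything that is not an ASCII letter or digit (spaces, newlines, noise).
    text = re.sub(r'[^0-9A-Za-z]', '', raw)
    # A wrong length means the OCR result is unusable; the caller can refresh and retry.
    if len(text) != expected_length:
        return None
    return text


if __name__ == '__main__':
    print(clean_captcha_text(' a B3\n7\x0c'))  # aB37
    print(clean_captcha_text('a7'))            # None
```

A result of None is a signal to refresh the CAPTCHA and run screenshot_code, edit_picture and recognize_captcha again instead of submitting a guess.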

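Problems you may run into:

(1) Decoding error: UnicodeDecodeError: 'utf-8' codec can't decode....

Possible cause: the Tesseract path configured in pytesseract.py is wrong. Change the value of the tesseract_cmd variable to the path of the installed OCR executable, for example: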

tesseract_cmd = r'F:\software\OCR\Tesseract-OCR\tesseract.exe'
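
A change made inside the installed pytesseract.py is lost whenever the package is reinstalled, so an alternative is to set the same variable at runtime through `pytesseract.pytesseract.tesseract_cmd`. The sketch below takes that approach. `find_tesseract` and `configure_pytesseract` are hypothetical helpers, and the candidate paths passed in are whatever install locations apply on your machine.

```python
import os
import shutil


def find_tesseract(candidates):
    """Return the first candidate path that exists, else the one on PATH (or None)."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Fall back to a tesseract executable found on PATH, if any.
    return shutil.which('tesseract')


def configure_pytesseract(candidates):
    """Point pytesseract at the Tesseract executable without editing the library."""
    import pytesseract  # imported here so find_tesseract also works without it installed

    cmd = find_tesseract(candidates)
    if cmd is None:
        raise FileNotFoundError('tesseract executable not found')
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_pytesseract([r'F:\software\OCR\Tesseract-OCR\tesseract.exe'])` once at the start of the test run would then replace the manual edit.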

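(2) Low recognition rate

Adjust the binarization thresholds, or train Tesseract with a set of sample images. This amounts to reinventing the wheel, so it is worth looking for existing open-source optimizations first.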
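
Tuning is easier when the threshold is a parameter. The following is a minimal sketch rather than the code used above: it applies a single hard threshold with Pillow's `Image.point`, so every pixel ends up either black or white. The function name `binarize` and the default of 140 are assumptions chosen for illustration.

```python
from PIL import Image


def binarize(image, threshold=140):
    """Return a black-and-white copy: pixels brighter than `threshold` become 255, others 0."""
    gray = image.convert('L')  # 8-bit grayscale
    # point() maps every gray level through the function.
    return gray.point(lambda p: 255 if p > threshold else 0)


if __name__ == '__main__':
    sample = Image.new('L', (2, 1))
    sample.putpixel((0, 0), 100)
    sample.putpixel((1, 0), 200)
    result = binarize(sample)
    print([result.getpixel((x, 0)) for x in range(2)])  # [0, 255]
```

Running a batch of saved CAPTCHA images through several threshold values and comparing the recognition results is a cheap first step before any training.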

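2. Reading the CAPTCHA from Fiddler captures

If an API response contains the CAPTCHA, it can be read straight from that response.

Approach: have Fiddler save the captured traffic to a local file automatically, then read the CAPTCHA from that file.

(1) Open the Fiddler menu: Rules > Customize Rules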


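(2) Add the following code to the OnBeforeRequest method (FiddlerScript is written in JScript.NET, which is JavaScript-like); replace the login API URLs and the file path with your own: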

        // Save the request
        if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
        {
            var fso;
            var file;
            fso = new ActiveXObject("Scripting.FileSystemObject");
            // File path; customize as needed
            var timestamp = Date.parse(new Date());
            // 8 = open for appending; create the file if missing; write as Unicode (UTF-16)
            file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
            file.writeLine("Request url: " + oSession.url);
            file.writeLine("Request header:" + "\n" + oSession.oRequest.headers);
            file.writeLine("Request body: " + oSession.GetRequestBodyAsString());
            file.writeLine("\n");
            file.close();
        }

(3) Add the following code to the OnBeforeResponse method (same JScript.NET syntax); replace the login API URLs and the file path with your own:

        // Save the response
        if (oSession.fullUrl.Contains("login API URL 1") || oSession.fullUrl.Contains("login API URL 2"))
        {
            oSession.utilDecodeResponse(); // decode the response so the saved body is not garbled
            var fso;
            var file;
            fso = new ActiveXObject("Scripting.FileSystemObject");
            // File path; customize as needed
            var timestamp = Date.parse(new Date());
            file = fso.OpenTextFile("G:\\project\\response"+timestamp+".txt",8 ,true, true);
            file.writeLine("Response code: " + oSession.responseCode);
            file.writeLine("Response body: " + oSession.GetResponseBodyAsString());
            file.writeLine("\n");
            file.close();
        }

(4) Start Fiddler, then open the login page so that the login request file is generated

(5) Python method that reads the CAPTCHA from the saved file

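# -*- coding:utf-8 -*-
import json


# Get the CAPTCHA from the file saved by Fiddler
def return_veri_code(response_file):
    """
    Get the CAPTCHA from the file saved by Fiddler.
    :param response_file: path of the saved file
    :return: the CAPTCHA, or '' if no matching response line is found
    """
    expect = 'Response body: {"header":{"code'
    prefix_len = len('Response body: ')
    code = ''
    # The Fiddler rule writes the file as Unicode, so read it as UTF-16
    with open(response_file, 'r', encoding='utf-16') as fp:
        for line in fp:
            if line.startswith(expect):
                # Strip the "Response body: " prefix and parse the JSON body into a dict
                real_dic = json.loads(line[prefix_len:])
                # Take the CAPTCHA from the response
                code = real_dic['body']['code']
    return code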

(6) In the login method, call the method above to read the CAPTCHA from the saved file, then log in directly.
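
Because the Fiddler rule puts a timestamp in each file name, the login method first has to locate the most recent file. A minimal sketch, assuming the files follow the `response<timestamp>.txt` naming used in the rule above; `latest_response_file` is a hypothetical helper.

```python
import glob
import os
import re


def latest_response_file(directory):
    """Return the path of the file with the largest timestamp in its name, or None."""
    newest_path, newest_stamp = None, -1
    for path in glob.glob(os.path.join(directory, 'response*.txt')):
        match = re.fullmatch(r'response(\d+)\.txt', os.path.basename(path))
        if match is None:
            continue  # not a file written by the Fiddler rule
        stamp = int(match.group(1))
        if stamp > newest_stamp:
            newest_path, newest_stamp = path, stamp
    return newest_path
```

With the paths from the rule above, the call would look like `return_veri_code(latest_response_file(r'G:\project'))`, after checking that the helper did not return None.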

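3. Baidu AI general text recognition (OCR) open API

Preparation:

(1) Open the Baidu AI Cloud console: https://login.bce.baidu.com/?redirect=https%3A%2F%2Fconsole.bce.baidu.com%2F%3Ffromai%3D1#/aip/overview

(2) Register an account and create an application: https://jingyan.baidu.com/article/ab0b563063a586c15bfa7d55.html

(3) Get your own API_KEY and SECRET_KEY. At the time of writing, 5,000 calls per day were free.

Recognition code example:

(1) Method that calls the Baidu OCR open API: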

# -*-coding:utf-8 -*-
import base64

import requests


def image_to_words(image_path):
    """
    Recognize the text in an image via the Baidu OCR open API.
    :param image_path: path of the image
    :return: words: the recognized text ('' on failure)
    """
    # client_id is the API_KEY and client_secret is the SECRET_KEY from the console
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&' \
           'client_id=API_KEY&client_secret=SECRET_KEY'
    response = requests.get(host)
    # Get the access_token (a Response object is falsy when the status code is 4xx/5xx)
    access_token = response.json().get('access_token', '') if response else ''
    if not access_token:
        print('Failed to get access_token')
        return ''
    print('Got access_token:', access_token)
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"
    with open(image_path, 'rb') as f:
        img = base64.b64encode(f.read())
    params = {"image": img}
    request_url = request_url + "?access_token=" + access_token
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(request_url, data=params, headers=headers)
    if not response:
        print('Recognition failed, HTTP status:', response.status_code)
        return ''
    # If the body has no 'words_result', fall back to an empty list
    datas = response.json().get('words_result', [])
    return ''.join(i.get('words', '') for i in datas)
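
When the API rejects a call, for example because the token is invalid, the JSON body has no `words_result`. The sketch below separates the parsing from the HTTP call so that this case can be handled explicitly. It assumes that error responses carry `error_code` and `error_msg` fields, which should be checked against the current Baidu OCR documentation; `extract_words` is a hypothetical helper.

```python
def extract_words(payload):
    """Join the recognized lines from a parsed OCR response (a dict).

    Assumption: a failed call returns 'error_code'/'error_msg' instead of 'words_result'.
    """
    if 'error_code' in payload:
        raise ValueError('OCR request failed: %s %s'
                         % (payload.get('error_code'), payload.get('error_msg')))
    return ''.join(item.get('words', '') for item in payload.get('words_result', []))


if __name__ == '__main__':
    ok = {'words_result': [{'words': 'a1'}, {'words': 'B2'}]}
    print(extract_words(ok))  # a1B2
```

Raising on an error payload keeps a failed recognition from being confused with a CAPTCHA that happens to be empty.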

4. Bypassing CAPTCHA verification at login via a session
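
One common way to do this is to reuse a session cookie that is already authenticated, so the login form and its CAPTCHA are never touched. A minimal sketch with `requests`, under stated assumptions: the cookie name `JSESSIONID`, the domain, and the way the cookie value is obtained (for example copied from a browser that was logged in manually) are all hypothetical and depend on the system under test.

```python
import requests


def build_authenticated_session(cookie_name, cookie_value, domain):
    """Return a requests.Session that carries an existing login cookie."""
    session = requests.Session()
    # Assumption: the server identifies a logged-in user by this single cookie.
    session.cookies.set(cookie_name, cookie_value, domain=domain)
    return session


if __name__ == '__main__':
    # Hypothetical values; take the real ones from a manually logged-in browser.
    s = build_authenticated_session('JSESSIONID', 'abc123', 'example.com')
    print(s.cookies.get('JSESSIONID'))  # abc123
```

For UI tests the same idea applies to Selenium: open any page on the target domain, call `driver.add_cookie({'name': ..., 'value': ...})`, then refresh the page.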

5. Hard-coding the CAPTCHA or disabling CAPTCHA verification
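
Both options are changes on the server side of a test environment, usually agreed with the developers. As an illustration only, the server-side check could look like the sketch below. `verify_captcha`, the `CAPTCHA_TEST_MODE` environment variable and the fixed code `0000` are hypothetical and not taken from any real system.

```python
import os

FIXED_TEST_CODE = '0000'  # hypothetical universal code, accepted in test mode only


def verify_captcha(submitted, expected, env=None):
    """Return True if the submitted CAPTCHA is acceptable.

    In test mode (CAPTCHA_TEST_MODE=1) the fixed code is accepted as well.
    """
    env = os.environ if env is None else env
    if env.get('CAPTCHA_TEST_MODE') == '1' and submitted == FIXED_TEST_CODE:
        return True
    # Normal path: case-insensitive comparison with the generated code.
    return submitted.lower() == expected.lower()
```

The automated test then always submits the fixed code, and the switch stays off outside the test environment.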

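Note: if you know of other approaches, please leave a comment. Once I have looked into them and verified them, I will add them to this article, so that together we can find better ways to solve this problem.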

 

 

 

 

 

 
