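Many captcha images nowadays arrive scrambled and have to be reassembled before they can be read. Taking a screenshot with webdriver is a convenient way to handle this, but webdriver uses too much memory, so this post describes a way to reassemble the image directly. It uses 51job (前程无忧) as the example. The overall approach breaks down into the following steps: fetch the original captcha image -----> get the array of CSS offsets ----> create a blank image ----> in order, crop each tile using its CSS offset and the tile size, and paste it onto the blank image.
The HTML source of the captcha is as follows:
As you can see, the captcha is assembled from many small images.
First, fetch the original captcha image. The URL in the red box in the screenshot above points to the original captcha image.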
import requests
from PIL import Image


def get_captcha(url):
    """Download the original (scrambled) captcha image and save it as captcha.jpg."""
    session = requests.Session()
    # Headers copied from a browser session. The Cookie value below has most
    # likely expired, so replace it with one from your own session.
    session.headers = {
        'Host': 'ehire.51job.com',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://ehire.51job.com/',
        'Cookie': 'guid=15257617072656350082; search=jobarea%7E%60180200%7C%21ord_field%7E%600%7C%21recentSearch0%7E%601%A1%FB%A1%FA180200%2C00%A1%FB%A1%FA000000%A1%FB%A1%FA0000%A1%FB%A1%FA26%A1%FB%A1%FA9%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA%A1%FB%A1%FA2%A1%FB%A1%FA%A1%FB%A1%FA-1%A1%FB%A1%FA1525763865%A1%FB%A1%FA0%A1%FB%A1%FA%A1%FB%A1%FA%7C%21recentSearch1%7E%601%A1%FB%A1%FA180200%2C00%A1%FB%A1%FA000000%A1%FB%A1%FA0000%A1%FB%A1%FA26%A1%FB%A1%FA9%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA%B9%A4%B3%CC%CF%EE%C4%BF%A1%FB%A1%FA2%A1%FB%A1%FA%A1%FB%A1%FA-1%A1%FB%A1%FA1525761723%A1%FB%A1%FA0%A1%FB%A1%FA%A1%FB%A1%FA%7C%21; nsearch=jobarea%3D%26%7C%26ord_field%3D%26%7C%26recentSearch0%3D%26%7C%26recentSearch1%3D%26%7C%26recentSearch2%3D%26%7C%26recentSearch3%3D%26%7C%26recentSearch4%3D%26%7C%26collapse_expansion%3D; slife=lowbrowser%3Dnot%26%7C%26; ps=us%3DWmdbOQN%252FCzxdO1szCnFWZAEyATUELFs4VmgBLwoxVWBbYVAzBmULPVw8DWhQPQQ1ATVUbVFlBSwBZAVgCm5RNFob%26%7C%26; EhireGuid=c892eff94eec45e1a6a32afce3c57ed0; LangType=Lang=&Flag=1; adv=adsnew%3D1%26%7C%26adsresume%3D1%26%7C%26adsfrom%3Dhttps%253A%252F%252Fwww.baidu.com%252Fbaidu%253Ftn%253Dmonline_3_dg%2526ie%253Dutf-8%2526wd%253D%2525E5%252589%25258D%2525E7%2525A8%25258B%2525E6%252597%2525A0%2525E5%2525BF%2525A7%26%7C%26adsnum%3D2004282; partner=www_baidu_com; 51job=cenglish%3D0%26%7C%26; ASP.NET_SessionId=kgah40twoauxke1poschvlpl',
        'Connection': 'keep-alive'
    }
    content = session.get(url).content  # raw bytes of the image
    with open('captcha.jpg', 'wb') as f:
        f.write(content)
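The original image that comes back looks like this:
The second step is to get the array of CSS offsets. The position of each tile is determined by a CSS rule, so look up the rule that matches each tile's class:
Extract the background-position values and convert them to an array as shown below. The entries must stay in the same order as the tiles appear in the HTML: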
offset_list = [['66', '40'], ['286', '40'], ['66', '98'], ['44', '40'], ['154', '40'], ['22', '40'], ['88', '98'],
               ['198', '40'], ['198', '98'], ['264', '98'], ['308', '40'], ['176', '40'], ['0', '98'], ['132', '98'],
               ['132', '40'], ['176', '98'], ['88', '40'], ['154', '98'], ['220', '40'], ['264', '40'], ['110', '40'],
               ['242', '98'], ['286', '98'], ['0', '40'], ['242', '40'], ['44', '98'], ['220', '98'], ['22', '98'],
               ['308', '98'], ['110', '98']]
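The article does not show how the array was extracted from the stylesheet, so the following is only a minimal sketch of one way it might be done. It assumes each tile class has a rule of the form `.name { background-position: -66px -40px; }`, that the CSS values are negative pixel offsets whose absolute values are the crop coordinates, and that you already have the class names in page order. The function name `parse_offsets` and the sample class names are hypothetical.

```python
import re

# Matches ".classname { ... background-position: Xpx Ypx" within a single rule.
# Assumption: offsets are whole pixels; the "px" unit may be missing for 0.
_RULE = re.compile(
    r'\.([\w-]+)\s*\{[^}]*?background-position\s*:\s*'
    r'(-?\d+)(?:px)?\s+(-?\d+)(?:px)?'
)


def parse_offsets(css_text, class_names):
    """Return [x, y] string pairs for class_names, in the order given.

    The sign is dropped because the crop box needs positive coordinates
    (assumption: the stylesheet stores the offsets as negative values).
    """
    table = {
        name: [str(abs(int(x))), str(abs(int(y)))]
        for name, x, y in _RULE.findall(css_text)
    }
    # A class without a matching rule raises KeyError, which is deliberate:
    # a silently missing tile would produce a wrong image.
    return [table[name] for name in class_names]
```

For example, `parse_offsets(css_text, ['tile_b', 'tile_a'])` would return the offsets for the hypothetical classes `tile_b` and `tile_a` in that order, in the same shape as `offset_list`.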
Recombining the image:
# Get the paste position of each tile in the recombined image
def convert_index_to_offset(index):
    if index < 15:  # the complete captcha is made of 30 tiles, in 2 rows and 15 columns
        return (index * 22, 0)
    else:
        i = index - 15
        return (i * 22, 58)  # each tile is 22*58


# Get the bounding box of each tile in the original image, used when cropping
def convert_css_to_offset(off):
    # (left, upper)o ----- o
    #              |       |
    #              o ----- o(right, lower)
    return (int(off[0]), int(off[1]), int(off[0]) + 22, int(off[1]) + 58)


# Recombine the image
def recombine_captcha():
    captcha = Image.new('RGB', (22 * 15, 58 * 2))  # create the blank image
    img = Image.open('captcha.jpg')  # open the original image as an Image object
    for i, off in enumerate(offset_list):
        box = convert_css_to_offset(off)  # bounding box of the tile, from its CSS background-position
        region = img.crop(box)  # crop the tile out of the original image
        offset = convert_index_to_offset(i)  # position of this tile in the blank image
        captcha.paste(region, offset)  # paste the tile at that position
    captcha.save('regoin.jpg')
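Because a single wrong or duplicated offset yields a subtly wrong image, it may help to check the offset array before recombining. The sketch below is not part of the original solution. It assumes the grid described above (2 rows of 15 tiles, each 22*58) and a first-row y offset of 40, which is inferred from the values in `offset_list` rather than stated by the site. The names `destination` and `check_offsets` are hypothetical.

```python
from collections import Counter

TILE_W, TILE_H = 22, 58  # tile size stated in the article
COLS, ROWS = 15, 2       # grid layout stated in the article
TOP = 40                 # assumption: y offset of the first row, inferred from offset_list


def destination(index, cols=COLS, tile_w=TILE_W, tile_h=TILE_H):
    """Paste position of tile number index, computed with divmod.

    For indices 0..29 this gives the same result as convert_index_to_offset.
    """
    row, col = divmod(index, cols)
    return (col * tile_w, row * tile_h)


def check_offsets(offsets):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    # Every grid cell of the source image should be used exactly once.
    expected = {(c * TILE_W, TOP + r * TILE_H)
                for r in range(ROWS) for c in range(COLS)}
    counts = Counter((int(x), int(y)) for x, y in offsets)
    total = sum(counts.values())
    if total != COLS * ROWS:
        problems.append(f'expected {COLS * ROWS} offsets, got {total}')
    for pos, n in counts.items():
        if n > 1:
            problems.append(f'offset {pos} appears {n} times')
        if pos not in expected:
            problems.append(f'offset {pos} is not on the tile grid')
    for pos in sorted(expected - set(counts)):
        problems.append(f'no tile uses offset {pos}')
    return problems
```

Calling `check_offsets(offset_list)` before `recombine_captcha()` and stopping when the result is non-empty would catch a badly parsed stylesheet early. If the site changes its tile size or layout, the constants at the top would need to change with it.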
The recombined image looks like this: