Personal WeChat official account: 螺旋编程极客 (follows are welcome)
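My company recently had a new requirement. The rough flow: open the Tianjin Market Entity Credit Information Publicity System (天津市市场主体信用信息公示系统), look up each company's shareholders by the company name or tax number listed in an Excel sheet, collect the tax numbers of those shareholders, query each shareholder's own shareholders in turn, and finally write the results back to Excel.
Read Excel -> query the company's shareholders -> get the shareholders' tax numbers -> query each shareholder's shareholders by tax number -> write the results to Excel. Pretty exasperating, isn't it? In one sentence: look up the shareholders of the shareholders and record the information in Excel. When I first heard the requirement it seemed feasible in theory. The only real difficulty was the slider captcha; once that is cracked, the rest is just extracting data from web pages.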
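The flow above can be sketched as a two-level lookup. This is only an illustration under stated assumptions: `collect_second_level`, `lookup_shareholders`, the `name`/`tax_no` keys, and the fake registry are all hypothetical names, and the real lookup in this post is done by driving the browser, not by a function call.

```python
from typing import Callable, Dict, List


def collect_second_level(
    companies: List[str],
    lookup_shareholders: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Return one row per (company, shareholder, shareholder's shareholder)."""
    rows = []
    for company in companies:
        # First query: shareholders of the company from the Excel sheet.
        for holder in lookup_shareholders(company):
            # Second query: search again using the shareholder's tax number.
            for parent in lookup_shareholders(holder["tax_no"]):
                rows.append({
                    "company": company,
                    "shareholder": holder["name"],
                    "shareholder_of_shareholder": parent["name"],
                })
    return rows


# Hypothetical in-memory registry standing in for the website.
FAKE_REGISTRY = {
    "A Co": [{"name": "B Co", "tax_no": "T-B"}],
    "T-B": [{"name": "C Co", "tax_no": "T-C"}, {"name": "D Co", "tax_no": "T-D"}],
}

result = collect_second_level(["A Co"], lambda key: FAKE_REGISTRY.get(key, []))
```

Each dictionary in `result` corresponds to one row that would be written to Excel.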
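Enough talk, time for a crawler. Because of the slider captcha, the only option is a browser-driven crawler. It is slower, but it can scrape anything, and analyzing captured request data is frankly nauseating. Here I use the "艺赛旗 RPA designer" to help with the work. I have to say it is really handy, and its Python library is powerful: once the flow is designed it generates the Python code automatically, so I only need to care about the core algorithms and business logic. Half the effort, twice the result.
First, here is the captcha image:
It is the fairly common "极验" (Geetest) captcha, which many sites use. Government sites clearly lag a little: Geetest 3.0 is already out, while this one is still on 2.0. The difference is that 2.0 initially shows the complete image and only shows the image with the gap after you click the slider button, whereas 3.0 shows the gapped image from the start. That one can be cracked too.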
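Because 2.0 shows both the complete image and the gapped image, the gap can be located by comparing the two pixel by pixel. Below is a dependency-free sketch of that idea. It assumes the pixels are already available as nested lists indexed `grid[y][x]`; the names `first_diff_column` and `make_grid` and the threshold of 60 are made up for the illustration (the author's own `get_distance`, shown later, reads pixels with PIL and uses a threshold of 150).

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Grid = List[List[Pixel]]  # indexed as grid[y][x]


def first_diff_column(full: Grid, gapped: Grid,
                      threshold: int = 60, start_x: int = 0) -> Optional[int]:
    """Return the first x where any channel differs by at least threshold."""
    height = len(full)
    width = len(full[0])
    for x in range(start_x, width):
        for y in range(height):
            a, b = full[y][x], gapped[y][x]
            if any(abs(ca - cb) >= threshold for ca, cb in zip(a, b)):
                return x
    return None  # the two images look the same


def make_grid(width: int, height: int, color: Pixel) -> Grid:
    """Build a solid-color test image."""
    return [[color for _ in range(width)] for _ in range(height)]


# Synthetic 20x5 images: the "gap" is a dark block covering x = 12..15.
full = make_grid(20, 5, (200, 200, 200))
gapped = make_grid(20, 5, (200, 200, 200))
for y in range(1, 4):
    for x in range(12, 16):
        gapped[y][x] = (40, 40, 40)

gap_x = first_diff_column(full, gapped)
```

On real screenshots the returned column usually needs a small fixed correction, which is what the `-7` in the author's code does.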
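Here we take 2.0 as the example; I will also post the core code for 3.0. First, the steps for cracking 2.0:
Because we use the RPA designer, code for things like mouse clicks and screenshots does not have to be written by hand. Select the relevant component, click the element on the page, and it generates the Python code for us. That code is highly encapsulated, but the source can be inspected at any time, and underneath it is the same old machinery. The only parts we need to write ourselves are the offset calculation and the mouse movement. The designer does have a mouse-drag component, but its drag is too straight and gets detected, with the message "eaten by the monster" ("被怪物吃掉"). So I modified its source slightly and wrapped it in my own method. First, the flow chart for captcha recognition:
Once the flow chart is designed, the designer generates the code for us automatically. The code is as follows: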
# coding=utf-8
# Build date: 2019-08-14 10:09:34
import time
import pdb
from ubpa.ilog import ILog
from ubpa.base_img import *
import getopt
from sys import argv
import sys
from ubpa.itools import rpa_import
GlobalFun = rpa_import.import_global_fun(__file__)
import ubpa.ibox as ibox
import ubpa.iexcel as iexcel
import ubpa.ifile as ifile
import ubpa.iie as iie
import ubpa.iimg as iimg
import ubpa.ikeyboard as ikeyboard

class getTjInfo:

    def __init__(self,**kwargs):
        self.__logger = ILog(__file__)
        self.path = set_img_res_path(__file__)
        self.robot_no = ''
        self.proc_no = ''
        self.job_no = ''
        if('robot_no' in kwargs.keys()):
            self.robot_no = kwargs['robot_no']
        if('proc_no' in kwargs.keys()):
            self.proc_no = kwargs['proc_no']
        if('job_no' in kwargs.keys()):
            self.job_no = kwargs['job_no']
    # Captcha recognition.
    # Returns True when the failure image is found (the caller should retry), False otherwise.
    def checkCode(self):
        existFlg=None
        distance=None
        xy=None
        imageTwo=None
        imageOne=None
        # Screenshot (complete image)
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530184,Note:')
        imageOne = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Mouse click
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530183,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(4)
        # Screenshot (image with the gap)
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530186,Note:')
        imageTwo = iimg.capture_image(win_title=r'天津市市场主体',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530182,Note:')
        distance = GlobalFun.get_distance(imageOne,imageTwo)
        # Get element position
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530181,Note:')
        xy = iie.get_element_rect(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',curson=r'center',waitfor=10)
        # Code block
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530180,Note:')
        print(xy)
        lastxy=(xy[0]+distance,xy[1],xy[2],xy[3])
        print(lastxy)
        # Hard-coded corrections for two specific target positions
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(900.0, 682.0, 44, 44)):
            print('corrected')
            lastxy=(895.0, 682.0, 44, 44)
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(976.0, 682.0, 44, 44)):
            print('corrected')
            lastxy=(868.0, 682.0, 44, 44)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530185,Note:')
        GlobalFun.myDo_drag_to(win_title=r'天津市市场主体', srcpos=xy,distpos=lastxy)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530187,Note:')
        ifile.del_file(file=imageOne)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530188,Note:')
        ifile.del_file(file=imageTwo)
        # Image detection
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140204311151,Note:')
        time.sleep(3.5)
        existFlg = iimg.img_exists(win_title=r'天津市市场主体',img_res_path=self.path,image=r'snapshot_20190813135330024.png',fuzzy=True,confidence=0.85,waitfor=3)
        # IF branch
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140549531176,Note:')
        if existFlg:
            # Message box
            self.__logger.debug('Flow:checkCode,StepNodeTag:13143951406201,Note:')
            ibox.msg_box(msg='Verification failed, retrying!',timeout=1.5)
            time.sleep(1)
            # Mouse click
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140738964184,Note:')
            iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(3) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=100,scroll_view='no')
            time.sleep(1.5)
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140556594179,Note:')
            return True
        else:
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140620186183,Note:')
            return False
        # Code block (unreachable: both branches above return)
        self.__logger.debug('Flow:checkCode,StepNodeTag:13141700326199,Note:')
        print(existFlg)
    # Process table data
    def dealTableData(self,tableData=None):
        currentCom=None
        currentTableData=None
        currentComName=None
        # Code block
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161341316275,Note:')
        columns=tableData.columns
        realDataList=tableData.values.tolist()
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161638010281,Note:')
        ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13164935964357,Note:')
        ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
        # Message box
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161731539284,Note:')
        ibox.msg_box(msg='Start processing second-level company data',timeout=2)
        time.sleep(0.002)
        # For loop
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161910442289,Note:')
        for i in range(len(realDataList)):
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13162051185291,Note:')
            currentList=realDataList[i]
            if(columns[0]=='有限责任公司本年度是否有股权转让 '):
                currentCom=currentList[0]
                currentComName=currentList[0]
            if(columns[0]=='企业是否有股权信息或购买其它公司股权'):
                currentCom=currentList[0]
                currentComName=currentList[1]
            # Skip companies whose name does not contain "天津" (Tianjin)
            if("天津" not in currentCom):
                continue
            time.sleep(1)
            # Sub-flow: finishCheckCode
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161503452279,Note:')
            self.finishCheckCode(comName=currentComName)
            # Sub-flow: goToDetail
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13163507635351,Note:')
            currentTableData=self.goToDetail()
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13170553111368,Note:')
            # DataFrame.drop returns a new frame, so keep the result instead of discarding it
            trimmedTable0=currentTableData[0].drop(['变更后股权比例','股权变更日期'], axis=1)
            trimmedTable1=currentTableData[1].drop(['投资设立企业后购买股权企业名称',r'统一社会信用代码/注册号'], axis=1)
            lastTableData0=trimmedTable0.values.tolist()
            lastTableData1=trimmedTable1.values.tolist()
            # Insert rows
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13171433395371,Note:')
            iexcel.ins_row(path='C:/Users/Administrator/Desktop/testData.xlsx',data=lastTableData1)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165826952364,Note:')
            ikeyboard.key_send_cs(win_title=r'天津市市场主体',text='^{F4}',waitfor=10)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165850702366,Note:')
            ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
    # Complete the verification
    def finishCheckCode(self,comName='911200006630613577'):
        # Open website
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082917,Note:')
        iie.open_url(url=r'http://credit.scjg.tj.gov.cn/gsxt/')
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082912,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=300,scroll_view='no')
        time.sleep(0.5)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082913,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',offsetY=45,times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2)
        # Set text
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108293,Note:')
        iie.set_text(url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#searchName',text=comName,waitfor=10)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108292,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#entSearchLink',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2.5)
        # While loop: keep retrying until checkCode no longer reports a failure
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135310584131,Note:')
        while True:
            # Sub-flow: checkCode
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13140310308161,Note:')
            tvar13140310308161=self.checkCode()
            # IF branch
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135336032135,Note:')
            if tvar13140310308161:
                pass
            else:
                # Break
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135345176138,Note:')
                break
    # Get shareholder company information
    def getChildCom(self):
        tableData2=None
        tableData1=None
        table2Columns=None
        table1Columns=None
        tableDatas=None
        # Sub-flow: goToDetail
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13162921190325,Note:')
        tableDatas=self.goToDetail()
        # IF branch: a second header cell of '否' ("No") means the table has no data
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145633622210,Note:')
        if tableDatas[0].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163242189338,Note:')
            tableData1=tableDatas[0]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163127828333,Note:')
            self.dealTableData(tableData=tableData1)
        # IF branch
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145822725214,Note:')
        if tableDatas[1].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163302199339,Note:')
            tableData2=tableDatas[1]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163131389335,Note:')
            self.dealTableData(tableData=tableData2)
        # Message box
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13151905124229,Note:')
        ibox.msg_box(msg='Current company processed, moving on to the next...',timeout=1.5)
        time.sleep(1.5)
    # Go to the detail page
    def goToDetail(self):
        table2Data=None
        table1Data=None
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929301,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#center_content > DIV:nth-of-type(1) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(1) > H1:nth-of-type(1) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=20,scroll_view='no')
        time.sleep(1.5)
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929300,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tabs > DIV:nth-of-type(1) > DIV:nth-of-type(3) > SPAN:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929299,Note:')
        iie.do_click_pos(win_title=r'天津市市场主体',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tableInfoDiv > DIV:nth-of-type(2) > TABLE:nth-of-type(1) > TBODY:nth-of-type(1) > TR:nth-of-type(3) > TD:nth-of-type(4) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Custom function ('年报详情' is the "annual report details" window title)
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929298,Note:equity transfer')
        table1Data = GlobalFun.getTableData('年报详情','#show_alter')
        # Custom function
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929297,Note:equity purchases')
        table2Data = GlobalFun.getTableData('年报详情','#show_invest')
        # Return
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162803390322,Note:')
        return table1Data,table2Data
    def Main(self):
        # Sub-flow: finishCheckCode
        self.__logger.debug('Flow:Main,StepNodeTag:13165330947360,Note:')
        self.finishCheckCode(comName='911200006630613577')
        # Sub-flow: getChildCom
        self.__logger.debug('Flow:Main,StepNodeTag:13151828292226,Note:')
        self.getChildCom()
        # Message box
        self.__logger.debug('Flow:Main,StepNodeTag:13152147770235,Note:')
        ibox.msg_box(msg='All data processed!')

if __name__ == '__main__':
    robot_no = ''
    proc_no = ''
    job_no = ''
    try:
        argv = sys.argv[1:]
        # Long option names must not contain spaces; a trailing '=' means the option takes a value
        opts, args = getopt.getopt(argv,"hr:p:j:",["robot=","proc=","job="])
    except getopt.GetoptError:
        print ('robot.py -r -p -j ' )
        # Exit here, otherwise opts would be undefined in the loop below
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print ('robot.py -r -p -j ' )
            sys.exit()
        elif opt in ("-r", "--robot"):
            robot_no = arg
        elif opt in ("-p", "--proc"):
            proc_no = arg
        elif opt in ("-j", "--job"):
            job_no = arg
    pro = getTjInfo(robot_no=robot_no,proc_no=proc_no,job_no=job_no)
    pro.Main()
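The `while True` loop in `finishCheckCode` keeps calling `checkCode` until it stops reporting a failure, with no upper bound. A bounded variant might look like the sketch below. `retry_until_passed`, `attempt_failed`, and `max_attempts` are names invented for this illustration, and the scripted outcomes stand in for real captcha attempts.

```python
from typing import Callable


def retry_until_passed(attempt_failed: Callable[[], bool], max_attempts: int = 5) -> int:
    """Call attempt_failed() until it returns False.

    Returns the number of attempts used and raises RuntimeError
    if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if not attempt_failed():
            return attempt
    raise RuntimeError(f"captcha still failing after {max_attempts} attempts")


# Scripted outcomes standing in for checkCode(): fail, fail, then pass.
outcomes = iter([True, True, False])
attempts_used = retry_until_passed(lambda: next(outcomes))
```

A cap like this keeps the robot from looping forever if the failure image is misdetected on every attempt.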
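Below is the code for the global functions. We bring in the PIL library to read images and process pixels (see get_distance), the pyautogui library to operate on the browser page, here mainly to control the mouse drag (see myDo_drag_to), and the pandas library to read tables from the page (see getTableData):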
# Build date: 2019-08-12 10:47:48
# coding=utf-8
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from PIL import Image
import ubpa.ics as ics
import pyautogui
from ubpa import iwin
import math
import ubpa.iie as iie
import re
import pandas as pd

def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    # Both patterns are empty as given here, so the two substitutions leave the HTML unchanged
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header on another row
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data

def get_point_axis(axis_list,distpos,point):
    # Pick up to point-1 intermediate stops along axis_list, then finish at distpos
    pos_val_list = []
    for i in range(1, 10000):
        if i >= point:
            break
        n = len(axis_list) * (i / (point + 1))
        pos_val = axis_list[int(n)]
        pos_val_list.append(pos_val)
    pos_val_list.append(distpos)
    return pos_val_list

def get_axis_list(srcpos=(0, 0), distpos=(0, 0)):
    # List the integer-step points on the straight line from srcpos to distpos
    pos_list = []
    x1 = srcpos[0]
    y1 = srcpos[1]
    x2 = distpos[0]
    y2 = distpos[1]
    if x1 == x2:
        # Vertical move
        if y1 > y2:
            for i in range(math.ceil(y2), int(y1) + 1):
                pos_list.append((x1, i))
            pos_list.reverse()
        elif y1 < y2:
            for i in range(math.ceil(y1), int(y2) + 1):
                pos_list.append((x1, i))
        else:
            pos_list = []
    else:
        if y1 == y2:
            # Horizontal move
            if x1 < x2:
                x1 = math.ceil(x1)
                x2 = int(x2)
                length = x2 - x1
                for i in range(0, length + 1):
                    pos_list.append((x1 + i, y2))
            if x1 > x2:
                x1 = int(x1)
                x2 = math.ceil(x2)
                length = x1 - x2
                for i in range(0, length + 1):
                    # Moving left: step down from x1 towards x2 (the original added i here)
                    pos_list.append((x1 - i, y2))
        else:
            # Diagonal move: interpolate y linearly along x
            if x1 < x2:
                for i in range(math.ceil(x1), int(x2) + 1):
                    if y1 < y2:
                        h = (i - x1) * (y2 - y1) / (x2 - x1)
                        pos_list.append((i, y1 + h))
                    else:
                        h = (i - x1) * (y1 - y2) / (x2 - x1)
                        pos_list.append((i, y1 - h))
            else:
                for i in range(math.ceil(x2), int(x1) + 1):
                    if y1 < y2:
                        h = (i - x2) * (y2 - y1) / (x1 - x2)
                        pos_list.append((i, y2 - h))
                    else:
                        h = (i - x2) * (y1 - y2) / (x1 - x2)
                        pos_list.append((i, y2 + h))
                pos_list.reverse()
    return pos_list

def myDo_drag_to(win_title=None, srcpos=(0,0), distpos=(0,0), point=0, stimes=1, model=pyautogui.easeInOutQuad, waitfor=10):
    '''
    Drag for the captcha.
    srcpos: start position (x, y)
    distpos: end position (x, y)
    point: number of pauses, default 0
    stimes: duration of each move in seconds, default 1
    model: easing function. easeInQuad starts slow and ends fast, easeOutQuad starts fast and ends slow,
           easeInOutQuad is slow at the start and end and fast in the middle,
           easeInBounce and easeInElastic add bounce and rubber-band effects
    '''
    try:
        if win_title != None and win_title.strip() != '':
            # Activate the window if it is not active
            if not iwin.do_win_is_active(win_title):
                iwin.do_win_activate(win_title=win_title, waitfor=2)
        pyautogui.moveTo(srcpos[0], srcpos[1], 0.5)
        pyautogui.mouseDown(button='left', _pause=True)
        axis_list = get_axis_list(srcpos, distpos)
        if len(axis_list) > 0:
            pos_val_list = get_point_axis(axis_list, distpos, point)
            # print(pos_val_list)
            for index in pos_val_list:
                # Overshoot by 20 px, come back to 5 px short, then settle on the target
                pyautogui.dragTo(float(index[0]+20), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]-5), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]), float(index[1]), stimes, model)
                time.sleep(0.5)
        pyautogui.mouseUp(button='left', _pause=True)
    except Exception as e:
        raise e
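
# Geetest 2.0: compute the offset
def get_distance(imageOne,imageTwo):
    '''
    Get the distance the slider needs to move.
    :param imageOne: path of the screenshot without the gap
    :param imageTwo: path of the screenshot with the gap
    :return: distance to move
    '''
    threshold=150
    left=60
    image1 = Image.open(imageOne)
    image2 = Image.open(imageTwo)
    # Load the pixel access objects once instead of once per pixel
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(left,image1.size[0]):
        for j in range(image1.size[1]):
            rgb1=pixels1[i,j]
            rgb2=pixels2[i,j]
            res1=abs(rgb1[0]-rgb2[0])
            res2=abs(rgb1[1]-rgb2[1])
            res3=abs(rgb1[2]-rgb2[2])
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                print(i-7)
                return i-7  # testing showed an error of roughly 7
    # No differing pixel found: fall back to the last column
    print(i-7)
    return i-7  # testing showed an error of roughly 7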
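The drag in `myDo_drag_to` avoids a perfectly straight move by overshooting the target and then backing up. The same idea can be written as a pure function that precomputes the x offsets of a track. The sketch below is only an illustration: `ease_out_quad`, `build_track`, and the default step and overshoot values are made up here, and nothing is claimed about what a particular detector accepts.

```python
from typing import List


def ease_out_quad(t: float) -> float:
    """Quadratic ease-out: fast at the start, slow near the end (t in [0, 1])."""
    return 1 - (1 - t) * (1 - t)


def build_track(distance: int, steps: int = 20, overshoot: int = 6) -> List[int]:
    """Return x offsets that ease past the target, then settle back onto it."""
    peak = distance + overshoot
    # Eased approach from 0 up to the overshoot point.
    track = [round(peak * ease_out_quad(i / steps)) for i in range(1, steps + 1)]
    # Step back one pixel at a time until the offset equals distance.
    for back in range(overshoot - 1, -1, -1):
        track.append(distance + back)
    return track


track = build_track(100)
```

Each value in `track` could then be added to the slider's starting x coordinate and fed to a drag call one step at a time.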
That is the code for the entire flow; I have posted all of it here. The offset calculation for cracking the 3.0 captcha is as follows:
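# Geetest 3.0 approach
def get_gap(image):
    """
    Get the gap offset.
    :param image: image with the gap (a PIL Image)
    :return: x offset of the gap
    """
    # left_list holds every (x, count) candidate that qualifies
    left_list = []
    # Only the x coordinate of the gap is needed, so sample a few evenly spaced rows instead of every y
    for i in [10 * i for i in range(1, int(image.size[1] / 11))]:
        # Start x at image.size[0]/5.16, because the gap never sits within the first 50 px
        for j in range(int(image.size[0] / 5.16), image.size[0] - int(image.size[0] / 8.6)):
            if is_pixel_equal(image, j, i, left_list):
                break
    # In (x, z), x is the left edge of the gap and z is the count: how many consecutive pixels
    # starting at x have R, G and B all below 150. Several rows are scanned, so there are
    # several candidates; the one whose z is closest to 40 fits best
    left_list = sorted(left_list, key=lambda x: abs(x[1]-40))
    # Take x from the first candidate; subtract 7 or 14, 7 is usually enough
    return left_list[0][0] - 7

def is_pixel_equal(image, x, y, left_list):
    """
    Check whether a dark run of gap-like width starts at (x, y).
    :param image: image
    :param x: x position
    :param y: y position
    :param left_list: list that receives (x, count) when a run qualifies
    :return: True if a qualifying run starts here
    """
    # Pixel access object for the image
    pixels = image.load()
    pixel1 = pixels[x, y]
    threshold = 150
    # count records how many consecutive pixels to the right have R, G and B all below 150
    count = 0
    # If this pixel is dark, walk right and count the dark pixels that follow
    if pixel1[0] < threshold and pixel1[1] < threshold and pixel1[2] < threshold:
        for i in range(x + 1, image.size[0]):
            piexl = pixels[i, y]
            if piexl[0] < threshold and piexl[1] < threshold and piexl[2] < threshold:
                count += 1
            else:
                break
    if int(image.size[0]/8.6) < count < int(image.size[0]/4.3):
        left_list.append((x, count))
        return True
    else:
        return False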
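The code is clearly commented; read it calmly and it is easy to follow.
There is also a nice way to handle tables on the page. It is already in the code above; here it is again: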
def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    # Both patterns are empty as given here, so the two substitutions leave the HTML unchanged
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header on another row
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data
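titleStr is the browser title; a window is matched as long as its title contains the argument. selectorStr is a CSS selector string. CSS selectors are natively supported by the designer and matter a lot for crawling in general, so look them up if they are unfamiliar. iie is a component of their own Python library that can read information directly from a page that is already open. Pass this method the location of a table on the page and it converts the table into a DataFrame. I have to say, pandas really is handy!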
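To make the "first row becomes the header" behaviour of `header=0` concrete without needing pandas installed, here is a small stand-in built on the standard library's `html.parser`. It is not how `pd.read_html` is implemented; `TableRows`, `table_to_records`, and the sample HTML are all invented for this illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html: str) -> List[Dict[str, str]]:
    """Treat the first row as the header and return one dict per remaining row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """
<table>
  <tr><th>Shareholder</th><th>Share</th></tr>
  <tr><td>B Co</td><td>60%</td></tr>
  <tr><td>C Co</td><td>40%</td></tr>
</table>
"""
records = table_to_records(SAMPLE)
```

With pandas available, `pd.read_html(..., header=0)` does the same job and hands back a DataFrame directly, which is why the post uses it.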
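The captcha step in action; it retries by itself on failure:
The overall run; in the end the company table data was scraped and written to Excel:
Thanks for reading!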