Crawler project: log in with Selenium, then keep a session alive with a requests Session to scrape data

I haven't blogged in a while because work has been busy. Today I ran into an account problem I couldn't solve right away, so I decided to write up the issues I hit for easy reference.

Recently I hit a really frustrating problem: after simulating a login with Selenium, I needed to fetch data, but this site uses session cookies. As soon as you close the page, the cookies expire and you can't fetch anything. Worse still, even after grabbing the post-login cookies, plain requests made with them would not return any data.

The fix for this problem is to use a requests Session() to keep the logged-in state alive.
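As a minimal, self-contained sketch of that handoff (the cookie names and values below are made up and stand in for what Selenium's `driver.get_cookies()` actually returns), moving the browser cookies into a requests Session looks like this:

```python
import requests

# Made-up stand-in for what selenium's driver.get_cookies() returns:
# a list of dicts, each with at least "name" and "value" keys.
browser_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": "example.com"},
    {"name": "loginTicket", "value": "deadbeef", "domain": "example.com"},
]

# Flatten to {name: value}, then build a CookieJar the Session can use
cookie_dict = {c["name"]: c["value"] for c in browser_cookies}
jar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None,
                                         overwrite=True)

session = requests.Session()
session.cookies = jar  # every later session.get()/post() now sends these cookies

print(session.cookies.get("JSESSIONID"))  # -> abc123
```

Once the jar is attached, the Session also keeps any new cookies the server sets during redirects, which is exactly what plain one-off `requests.get()` calls fail to do.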

Without further ado, here is how I solved it.

  1. First, simulate the login with Selenium
import json
import re
import time
import requests
from lxml import etree
from selenium import webdriver


class IndustrialBank:

    def __init__(self):
        self.session = requests.Session()
        self.driver = webdriver.Chrome()
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
        }

    def __call__(self, *args, **kwargs):
        self.login_simulation()
        # self.active_info()
        # self.credit_bill()
        # self.charge_out()

    def login_simulation(self):

        login_url = 'https://personalbank.cib.com.cn/pers/main/login.do'
        self.driver.get(login_url)

        # Switch to the card-number login tab first
        self.driver.find_element_by_xpath(
            "//ul[@class='login-type fn-clearfix']/li[2]/label[@for='logintype1']").click()

        # Read the card number from the console
        card_num = input("Enter your debit or credit card number: ")

        time.sleep(2)

        # Read the online-banking password from the console
        password = input("Enter your online-banking login password: ")

        self.driver.find_element_by_id("loginNameTemp").send_keys(card_num)

        # Click the "Next" button
        self.driver.find_element_by_id("loginNextBtn").click()

        # Type the password
        self.driver.find_element_by_id("iloginPwd").send_keys(password)

        # Click the "Log in" button
        self.driver.find_element_by_id("loginSubmitBtn").click()

        # Wait for the login to complete, then hand the browser cookies
        # over to get_cookie(), which builds the requests session
        time.sleep(3)
        self.get_cookie(self.driver.get_cookies())

  2. After logging in, the next question is how to keep the session alive so it does not expire; this is where Session() comes in
    def get_cookie(self, cookie_dict):
        """
        Build a cookie jar from the post-login cookies.
        :param cookie_dict: the list of cookie dicts returned by driver.get_cookies()
        :return: cookies
        """
        cookie = {}
        for i in cookie_dict:
            cookie[i["name"]] = i["value"]
        # Convert the cookie dict into a CookieJar
        cookies = requests.utils.cookiejar_from_dict(cookie, cookiejar=None,
                                                     overwrite=True)  # overwrite=True replaces existing cookies
        # Use it to stay logged in and to walk through the redirects
        self.query_account(cookies)

    def query_account(self, cookies):
        welcome_url = "https://personalbank.cib.com.cn/pers/main/welcome.do"

        res = self.session.get(url=welcome_url, headers=self.headers, cookies=cookies)
        # print(res.text)

        # Get the credit-card number from the account-list Ajax endpoint
        url = 'https://personalbank.cib.com.cn/pers/main/welcome!list.do?dataSet.sidx=&dataSet.sord=asc'
        res1 = self.session.get(url=url, headers=self.headers, cookies=cookies)
        json_data = res1.json()
        credit_num = json_data.get("rows")[1].get("cell")[2]
        # print(credit_num)

        # Walk the 302 redirect chain manually, with redirects disabled
        jump_url = "https://personalbank.cib.com.cn/pers/creditCard/query/queryPastBill!jump.do?FUNID=FIN09|FIN09_08|FIN09_08_02|FIN09_08_02_02&credAccountNo=" + credit_num
        response = self.session.get(url=jump_url, headers=self.headers, allow_redirects=False, cookies=cookies)

        location = response.headers['Location']
        # print(location) #https://personalbank.cib.com.cn/pers/main/master/jump?appCode=%2Fpers%2FcreditCard&clientTicket=b6a5722a0436414bb6a5722a0436414b&url=https%3A%2F%2Fpersonalbank.cib.com.cn%3A443%2Fpers%2FcreditCard%2Fquery%2FqueryPastBill%21jump.do%3FFUNID%3DFIN09%257CFIN09_08%257CFIN09_08_02%257CFIN09_08_02_02%26credAccountNo%3D4512891210486184
        resp = self.session.get(url=location, headers=self.headers, allow_redirects=False, cookies=cookies)

        location2 = resp.headers['Location']
        # print(location2) #https://personalbank.cib.com.cn:443/pers/creditCard/query/queryPastBill!jump.do?FUNID=FIN09%7CFIN09_08%7CFIN09_08_02%7CFIN09_08_02_02&credAccountNo=4512891210486184&loginTicket=381b4bea01f71be9dd257cabecbe987b&jumpTicket=e78660fbe62780026272db37e32c954e
        # Drop the explicit :443 port and decode %7C back to |
        location2_url = re.sub(':443', '', str(location2)).replace("%7C", "|")
        # print(location2_url) #https://personalbank.cib.com.cn/pers/creditCard/query/queryPastBill!jump.do?FUNID=FIN09|FIN09_08|FIN09_08_02|FIN09_08_02_02&credAccountNo=4512891210486184&loginTicket=381b4bea01f71be9dd257cabecbe987b&jumpTicket=e78660fbe62780026272db37e32c954e
        self.session.get(url=location2_url, headers=self.headers, cookies=cookies)

        # Finally, fetch the past bills for a given month
        bills_url = "https://personalbank.cib.com.cn/pers/creditCard/query/queryPastBill!list.do?&dataSet.sidx=&dataSet.sord=asc"

        params = {"queryYearMonth": "201907"}
        a = self.session.post(url=bills_url, headers=self.headers, params=params, cookies=cookies)
        print(a.json())

Don't be fooled by how little code there is; the analysis behind it was genuinely complicated.
Because the page is loaded via Ajax, the way to work this out is: click through to the page you need, then open the browser devtools Network tab (under "All") and note the order in which the requests load. Pay special attention to the links that return 302 redirects. Request each of those redirect links one by one with redirects disabled (allow_redirects=False), and only then request the link that actually carries the data you want.
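The `:443` stripping and `%7C` decoding done above with `re.sub`/`replace` can also be written with the standard library's `urllib.parse`, which only touches the parts of the URL you intend to change. A small sketch (the example URL is shortened, not the real redirect target):

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def clean_location(location: str) -> str:
    """Normalize a Location header: drop an explicit :443 port on an
    https URL and percent-decode the query string (%7C -> |)."""
    parts = urlsplit(location)
    netloc = parts.netloc
    if parts.scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]  # :443 is the default https port, safe to drop
    return urlunsplit((parts.scheme, netloc, parts.path,
                       unquote(parts.query), parts.fragment))

print(clean_location(
    "https://personalbank.cib.com.cn:443/pers/a!jump.do?FUNID=A%7CB&x=1"))
# -> https://personalbank.cib.com.cn/pers/a!jump.do?FUNID=A|B&x=1
```

Unlike a blanket `re.sub(':443', '', url)`, this cannot accidentally mangle a `:443` that happens to appear elsewhere in the query string.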
