scrapy cookies

Ways to simulate login with Scrapy

  • Send the request directly to the target URL, carrying the cookies

  • Send a POST request to the target URL, carrying the form data (account and password)

  • Simulate the login with Selenium (find the input tags, switch to the account-login mode, locate the username and password boxes, locate the login button)

This post focuses on the first method: sending the request directly to the target URL with the cookies attached.
Take Qzone as an example: we want to fetch and parse the page source of https://user.qzone.qq.com/<QQ number>. Getting that page normally requires logging in first with a username and password, and since cookies are what record the logged-in state, our requests must carry the cookies.
To find them: refresh the site, open DevTools → Network, click the request named after the QQ number, and copy the cookie string from its request headers.

Method 1: request the target URL directly with the cookies

1. Configure everything in settings.py

  • LOG_LEVEL = 'WARNING'
  • ROBOTSTXT_OBEY = False
  • COOKIES_ENABLED = False — remember, this setting is mandatory here: it disables Scrapy's CookiesMiddleware, which would otherwise override the Cookie header below, and you would not get the logged-in response.
  • Uncomment DEFAULT_REQUEST_HEADERS and add the cookies to it as a dict entry:

    DEFAULT_REQUEST_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'cookie': 'RK=uMy93J/fTg; ptcz=**************************************5e6f05a2a8c4d2951; qz_screen=1280x720; pgv_pvid=4704727674; QZ_FE_WEBP_SUPPORT=1; _Q_w_s__QZN_TodoMsgCnt=1; zzpaneluin=; zzpanelkey=; qpsvr_localtk=0.7101978041494885; pgv_info=ssid=s2835616349; uin=o***********; skey=@fszYKYM5U; p_uin=o0********; pt4_token=TQPtF8ZsI5YhKRXdHCeU753ztK6DL6*mUHVinn06Q; p_skey=328z5QLGVe8-qxZwKtW25qNnIGNEQyqTSVkVXv3KxB8; Loading=Yes; cpu_performance_v8=4',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }
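
To confirm the header is actually attached, a quick debugging sketch: Scrapy keeps the outgoing request on the response, so any callback can print the Cookie header that was sent (with COOKIES_ENABLED = False it should match the string from settings):

def parse(self, response):
    # response.request is the Request object that produced this response;
    # with COOKIES_ENABLED = False the Cookie header set in settings
    # should appear here unchanged
    print(response.request.headers.get('Cookie'))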
2. Then write the spider file. parse() simply saves the response body to disk:
import scrapy


class QzSpider(scrapy.Spider):
    name = 'qz'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/*********/']

    def parse(self, response):
        # The cookie travels via DEFAULT_REQUEST_HEADERS, so parse() only
        # needs to save the logged-in page for inspection
        res = response.text
        with open('qzone.html', 'w', encoding='utf-8') as f:
            f.write(res)
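
To try it, run the spider from the project root of a standard Scrapy project:

    scrapy crawl qz

qzone.html is written to the directory the command is run from, and step 3 below opens it in a browser.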

3. Open qzone.html in Chrome: the page loads already logged in, with no login prompt.

Method 2: request the target URL directly with the cookies, from the spider

1. We can also do without the settings.py setup entirely. Open the spider file: self.parse() only parses the responses for start_urls, so the cookies (and UA, etc.) have to be attached when the requests for start_urls are created. For class CookSpider(scrapy.Spider), if we look up the parent class Spider, we find a start_requests(self) method that builds the initial requests from start_urls; overriding it inside the spider takes over the initialization work that settings.py did above. (Note that for this method COOKIES_ENABLED stays at its default of True, since the cookies argument is handled by the CookiesMiddleware.) A sketch of the inherited behavior follows below.
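For reference, the inherited Spider.start_requests() behaves roughly like this sketch (a simplification, not Scrapy's exact source), which is why overriding it is the natural hook for attaching cookies before the first request goes out:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    # Roughly what scrapy.Spider provides by default:
    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True: the initial request is never deduplicated away
            yield scrapy.Request(url, dont_filter=True)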
2. The cookies parameter of scrapy.Request() is a dict, while the cookies copied from the browser come as a single string, so the string has to be converted back into a dict. Looking at the string, the name=value pairs are separated by '; ', so cook.split('; ') turns it into a list; each element i is one name=value pair, so i.split('=', 1)[0] is the key and i.split('=', 1)[1] is the value (splitting on the first '=' only, so values that themselves contain '=' survive intact):
cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cook.split('; ')}
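
A quick standalone check of that comprehension, using a toy cookie string:

# Demo: turn a raw Cookie header string into the dict scrapy.Request expects
cook = 'uin=o12345; skey=@abcDEF; pgv_info=ssid=s123'

# split('; ') yields 'name=value' pairs; split('=', 1) keeps any '=' inside the value
cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cook.split('; ')}

print(cookies)  # {'uin': 'o12345', 'skey': '@abcDEF', 'pgv_info': 'ssid=s123'}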
3. The signature of scrapy.Request():

scrapy.Request.__init__(self, url, callback=None, method='GET', headers=None, body=None,
                        cookies=None, meta=None, encoding='utf-8', priority=0,
                        dont_filter=False, errback=None, flags=None, cb_kwargs=None)
It accepts a cookies parameter, so:

yield scrapy.Request(
    url=self.start_urls[0],
    callback=self.parse,
    cookies=cookies)

4. The full code:

import scrapy


class CookSpider(scrapy.Spider):
    name = 'cook'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/**********']

    def start_requests(self):
        # Cookie string copied from the browser's DevTools (masked here)
        cookie = 'RK=uMy93J/fTg; ptcz=44ac271d86936267c9755e44b***************5e6f05a2a8c4d2951; qz_screen=1280x720; pgv_pvid=4704727674; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; zzpaneluin=; zzpanelkey=; _qpsvr_localtk=0.5254784523847567; pgv_info=ssid=s9175078990; uin=o************; skey=@keSzZIJQf; p_uin=o0***********; pt4_token=k9qvz-YCV5-kxS3X1dsrlwbfvyWQvWJGJRgDtRNttZM_; p_skey=5SKJsCWzK14EC1mfz9poDp8KQr-QSk9oniPmByEakxA_; Loading=Yes; x-stgw-ssl-info=87ea822f31b5ea36ac0669e3753b2520|0.139|-|25|.|I|TLSv1.2|ECDHE-RSA-AES128-GCM-SHA256|37000|h2|0; cpu_performance_v8=2'
        # Split on the first '=' only, so values that contain '=' stay intact
        cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookie.split('; ')}
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            cookies=cookies
        )

    def parse(self, response):
        # Save the logged-in page so it can be opened in a browser
        res = response.text
        with open('qqzone.html', 'w', encoding='utf-8') as f:
            f.write(res)

5. In the same way, we can also add the User-Agent in start_requests():

import scrapy


class CookSpider(scrapy.Spider):
    name = 'cook'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/**********']

    def start_requests(self):
        # Cookie string copied from the browser's DevTools (masked here)
        cookie = 'RK=uMy93J/fTg; ptcz=44ac271d86936******************************5e6f05a2a8c4d2951; qz_screen=1280x720; pgv_pvid=4704727674; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; zzpaneluin=; zzpanelkey=; _qpsvr_localtk=0.5254784523847567; pgv_info=ssid=s9175078990; uin=o******************; skey=@keSzZIJQf; p_uin=o*************; pt4_token=k9qvz-YCV5-kxS3X1dsrlwbfvyWQvWJGJRgDtRNttZM_; p_skey=5SKJsCWzK14EC1mfz9poDp8KQr-QSk9oniPmByEakxA_; Loading=Yes; x-stgw-ssl-info=87ea822f31b5ea36ac0669e3753b2520|0.139|-|25|.|I|TLSv1.2|ECDHE-RSA-AES128-GCM-SHA256|37000|h2|0; cpu_performance_v8=2'
        cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookie.split('; ')}
        # Per-request headers; here just the User-Agent
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'}
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse,
            cookies=cookies,
            headers=headers
        )

    def parse(self, response):
        # Save the logged-in page so it can be opened in a browser
        res = response.text
        with open('qqzone.html', 'w', encoding='utf-8') as f:
            f.write(res)
