Simulating user login and storing cookies with Scrapy under Python 3 — Basics (Mafengwo)

1. Background

  • Related background reading:
    • Simulating user login with requests under Python 3 (Mafengwo): http://blog.csdn.net/zwq912318834/article/details/79571110

2. Environment

  • OS: Windows 7
  • Python 3.6.1
  • Scrapy 1.4.0

3. Standard steps for a simulated login

  • Step 1: Open the login page and grab any parameters the login needs (for example, the _xsrf field on Zhihu's login page).
  • Step 2: POST those parameters, together with the account name and password, to the server to log in.
  • Step 3: Check whether the login succeeded.
  • Step 4: If the login failed, diagnose the error and restart the login procedure.
  • Step 5: If the login succeeded, crawl the site's pages as usual.
# Using the Mafengwo login as an example, the spider below shows how to simulate a user login
# and keep the logged-in state while visiting other pages


# Spider file: mafengwoSpider.py
# -*- coding: utf-8 -*-

import scrapy
import datetime
import re

class mafengwoSpider(scrapy.Spider):
    # per-spider settings
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',       # log level; the default is the lowest level, DEBUG
        'ROBOTSTXT_OBEY': False,    # whether to obey robots.txt rules (the project default is True)
        'DOWNLOAD_DELAY': 2,        # download delay; defaults to 0
        'COOKIES_ENABLED': True,    # enabled by default; needed when crawling data behind a login. Adds traffic, because requests and responses now carry cookie headers
        'COOKIES_DEBUG': True,      # defaults to False; when enabled, Scrapy logs every cookie sent in requests (Cookie header) and received in responses (Set-Cookie header)
        'DOWNLOAD_TIMEOUT': 25,     # download timeout; can be set globally here, or per request via Request.meta['download_timeout']
    }

    name = 'mafengwo'
    allowed_domains = ['mafengwo.cn']
    host = "http://www.mafengwo.cn/"
    username = "13725168940"            # Mafengwo account
    password = "aaa00000000"          # Mafengwo password
    headerData = {
        "Referer": "https://passport.mafengwo.cn/",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }


    # The spider's entry point
    # Step 1: fetch the Mafengwo login page
    def start_requests(self):
        print("start mafengwo clawer")
        # 马蜂窝登录页面
        mafengwoLoginPage = "https://passport.mafengwo.cn/"
        loginIndexReq = scrapy.Request(
            url = mafengwoLoginPage,
            headers = self.headerData,
            callback = self.parseLoginPage,
            dont_filter = True,     # keep this request even if the dedup filter has already seen the URL
        )
        yield loginIndexReq


    # Step 2: parse the login page, extract the necessary parameters, then send the login POST request
    def parseLoginPage(self, response):
        print(f"parseLoginPage: url = {response.url}")
        # If the login page carries information required for the login, extract it here (from response.text)

        loginPostUrl = "https://passport.mafengwo.cn/login/"
        # FormRequest is Scrapy's method for sending POST requests
        yield scrapy.FormRequest(
            url = loginPostUrl,
            headers = self.headerData,
            method = "POST",
            # the actual POST payload
            formdata = {
                "passport": self.username,
                "password": self.password,
                # "other": "other",
            },
            callback = self.loginResParse,
            dont_filter = True,
        )

    # Step 3: handle the login result, then send a request that verifies the login state
    def loginResParse(self, response):
        print(f"loginResParse: url = {response.url}")

        # Use the status code of a personal-center page to decide whether we are logged in.
        # Only logged-in users can access this page; everyone else is redirected (302) to the login page.
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        # Two things matter below.
        # First, the headers: without them the server returns a 500 error.
        # Second, dont_redirect: when True, redirects are not followed; a logged-out user cannot reach this page, and the server answers 302.
        #       When dont_redirect is False, redirects are followed: the request is bounced to the login page, the login page gets downloaded, and the status code is 200.
        yield scrapy.Request(
            url = routeUrl,
            headers = self.headerData,
            meta={
                'dont_redirect': True,      # forbid the 302 redirect; if the page insists on redirecting anyway, the request errors out
                # 'handle_httpstatus_list': [301, 302]      # which non-2xx statuses to handle ourselves
            },
            callback = self.isLoginStatusParse,
            dont_filter = True,
        )


    # Step 5: examine the login state; if the login succeeded, continue crawling other pages.
    # If the login failed, the spider simply terminates here.
    def isLoginStatusParse(self, response):
        print(f"isLoginStatusParse: url = {response.url}")

        # If we got this far without any error, the pages that follow can be crawled in the logged-in state
        # ………………………………
        # no need to store the cookie
        # crawl other pages
        # ………………………………
        yield scrapy.Request(
            url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
            headers=self.headerData,
            # if no callback is given, parse() is used by default
        )


    # The regular page-parsing callback
    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")


    # Request error handling: print it, write it to a file, or store it in a database
    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")


    # Cleanup when the spider finishes: print a message, send an email, etc.
    def closed(self, reason):
        # an email can be sent once the crawl has ended
        finishTime = datetime.datetime.now()
        subject = f"crawlerName had finished, reason = {reason}, finishedTime = {finishTime}"
        print(f"subject = {subject}")
  • Log of a successful login:
E:\Miniconda\python.exe E:/documentCode/scrapyMafengwo/start.py
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyMafengwo)
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrapyMafengwo', 'NEWSPIDER_MODULE': 'scrapyMafengwo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['scrapyMafengwo.spiders']}
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-19 17:03:54 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:03:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:03:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo crawler
2018-03-19 17:03:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:03:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.mafengwo.cn> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:03:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn> (referer: None)
loginResParse: url = http://www.mafengwo.cn
2018-03-19 17:03:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
isLoginStatusParse: url = http://www.mafengwo.cn/plan/route.php
2018-03-19 17:04:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html> (referer: https://passport.mafengwo.cn/)
parse: url = https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html, meta = {'depth': 3, 'download_timeout': 25.0, 'download_slot': 'www.mafengwo.cn', 'download_latency': 0.2569999694824219}
subject = crawlerName had finished, reason = finished, finishedTime = 2018-03-19 17:04:01.638400
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:04:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3251,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 4,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 38259,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 4,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 19, 9, 4, 1, 638400),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'request_depth_max': 3,
 'response_received_count': 4,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2018, 3, 19, 9, 3, 54, 707400)}
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0
  • Log of a failed login:
2018-03-19 17:05:06 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:05:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:05:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo crawler
2018-03-19 17:05:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:05:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn/> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:05:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2234,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 5044,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 19, 9, 5, 10, 368900),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/302': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 8,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2018, 3, 19, 9, 5, 6, 871900)}
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Spider closed (finished)
subject = crawlerName had finished, reason = finished, finishedTime = 2018-03-19 17:05:10.368900

Process finished with exit code 0
  • Comparing the two logs shows that, at the login-state check, if the user is not logged in and the 302 redirect to the login page is not allowed, the spider terminates right there and crawls nothing further:
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed

4. Caveats

  • Settings
'ROBOTSTXT_OBEY': False,    # defaults to obeying robots.txt; disabled here because many sites use it to shut crawlers out
'DOWNLOAD_DELAY': 2,        # download delay, default 0; slows the crawl so the IP and the account do not get banned
'COOKIES_ENABLED': True,    # enabled by default; needed when crawling data behind a login
  • Header configuration
# required; without it the server rejects the request
headerData = {
    "Referer": "https://passport.mafengwo.cn/",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }
  • Downloader middleware configuration: middleware.py (see the sketch below)
# To keep the login session alive, the User-Agent and the IP address the spider uses must not change.
# Otherwise the account's activity looks anomalous and the account is liable to get banned.
# These settings live in middleware.py, which is why they deserve special attention.
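
  • A minimal sketch of such a middleware follows; the class name FixedIdentityMiddleware and the proxy address are placeholders, not part of the original project:
# middleware.py: pin a fixed User-Agent (and optionally a fixed proxy) on every request
class FixedIdentityMiddleware(object):
    FIXED_USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/63.0.3239.132 Safari/537.36")

    def process_request(self, request, spider):
        # always send the same User-Agent so the session looks consistent
        request.headers['User-Agent'] = self.FIXED_USER_AGENT
        # to pin the exit IP as well, route every request through one fixed proxy:
        # request.meta['proxy'] = "http://127.0.0.1:8888"    # placeholder address
        return None     # let the request continue through the middleware chain

# enabled in settings.py (543 is just a reasonable priority slot):
# DOWNLOADER_MIDDLEWARES = {
#     'scrapyMafengwo.middlewares.FixedIdentityMiddleware': 543,
# }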

5. Storing and reusing the cookie locally

  • Once the login has been verified, you can choose to save the cookie and use it directly for the next login (although this approach is not recommended).

5.1. Saving the cookie to a local file

# file: mafengwoSpider.py

# save the cookie into a file
def convertToCookieFormat(cookieLstInfo, cookieFileName):
    '''
    CookieReq = [b'PHPSESSID=427jcfptrsogeg7onenojvqmp0; mfw_uuid=5ab0adb9-177d-a7d3-a47a-9522417e0652; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+14%3A44%3A09%22%3B%7D; __today_login=1; mafengwo=d336513fb8fc6edd490db9725739bb85_94281374_5ab0adbac4ba51.24002232_5ab0adbac4ba92.98161419; uol_throttle=94281374; mfw_uid=94281374']
    :param cookieLstInfo: list of raw Cookie header values (bytes), e.g. the sample above
    :return: the cookies as a dict of name/value pairs
    '''
    cookieDict = {}
    if len(cookieLstInfo) > 0:
        cookieStr = str(cookieLstInfo[0], encoding="utf8")
        print(f"cookieStr = {cookieStr}")
        for cookieItemStr in cookieStr.split(";"):
            cookieItem = cookieItemStr.strip().split("=")
            print(f"cookieItemStr = {cookieItemStr}, cookieItem = {cookieItem}")
            cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
        print(f"cookieDict = {cookieDict}")

        # write the cookie to a file so it can be reused later
        with open(cookieFileName, 'w') as f:
            for cookieKey, cookieValue in cookieDict.items():
                f.write(str(cookieKey) + ':' + str(cookieValue) + '\n')
        return cookieDict

# Step 5: examine the login state; if the login succeeded, continue crawling other pages.
# If the login failed, the spider simply terminates here.
def isLoginStatusParse(self, response):
    print(f"isLoginStatusParse: url = {response.url}")

    # Inspect the cookies on the request.
    # It is the request cookie that gets stored: once the user has logged in,
    # every subsequent request carries this cookie to the server to prove the user's identity.
    CookieReq = response.request.headers.getlist('Cookie')
    print(f"CookieReq = {CookieReq}")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = convertToCookieFormat(CookieReq, cookieFileName)

    # cookies set by the response
    Cookie = response.headers.getlist('Set-Cookie')
    print(f"Set-Cookie = {Cookie}")

    # If we got this far without any error, the pages that follow can be crawled in the logged-in state
    # ………………………………
    # no need to store the cookie
    # crawl other pages
    # ………………………………
    yield scrapy.Request(
        url = "https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
        headers=self.headerData,
        # if no callback is given, parse() is used by default
    )
  • The stored result looks like this:
# file: mafengwoCookies.txt

PHPSESSID:vperarhkjekdsv5mut4vjk9ri0
mfw_uuid:5ab0bcc6-0279-cbef-673e-15fd2c0b73c5
oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D
__today_login:1
mafengwo:926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262
uol_throttle:94281374
mfw_uid:94281374

5.2. Reading the cookie back and using it

  • For this part you can of course also log in with a browser, take the cookie from the browser, and use it as the login credential (a conversion sketch follows below).
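  • If the cookie is copied out of the browser's developer tools as a raw "name1=value1; name2=value2" header string, it first has to be converted into the dict that scrapy.Request(cookies=...) expects. A minimal sketch; the sample values are made up:
# turn a Cookie header string copied from the browser into a dict
def cookieStrToDict(cookieStr):
    cookieDict = {}
    for item in cookieStr.split(";"):
        item = item.strip()
        if "=" in item:
            name, _, value = item.partition("=")    # split on the first '=' only
            cookieDict[name.strip()] = value.strip()
    return cookieDict

# usage (made-up values):
# cookieDict = cookieStrToDict("PHPSESSID=abc123; mfw_uid=94281374")
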
# read the cookie information back from the file
def getCookieFromFile(cookieFileName):
    '''
        PHPSESSID:nkv0d5g29bde1ni5p9bha8cq04
        mfw_uuid:5ab0b3a3-22ac-61f1-ba72-db5a070c7e5d
        oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A09%3A23%22%3B%7D
        __today_login:1
        mafengwo:7e7cd3cffefcc05d3cbb217172a2d9fa_94281374_5ab0b3a5ac8007.33269268_5ab0b3a5ac8053.87485829
        uol_throttle:94281374
        mfw_uid:94281374
    :param cookieFileName: path of the cookie file, laid out as shown above
    :return: the cookies as a dict of name/value pairs
    '''
    cookieDict = {}
    f = open(cookieFileName, "r")  # open the file
    for line in f.readlines():
        line = line.strip()
        print(f"line = {line}")
        if line != "":
            cookieItem = line.split(":", 1)     # split on the first colon only
            cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
    f.close()  # close the file
    return cookieDict


# The spider's entry point
def start_requests(self):
    print("start mafengwo clawer")
    cookieFileName = "mafengwoCookies.txt"
    cookieDict = getCookieFromFile(cookieFileName)

    # Use the status code of a personal-center page to decide whether we are logged in.
    # Only logged-in users can access this page; everyone else is redirected (302) to the login page.
    routeUrl = "http://www.mafengwo.cn/plan/route.php"
    # Two things matter below.
    # First, the headers: without them the server returns a 500 error.
    # Second, dont_redirect: when True, redirects are not followed; a logged-out user cannot reach this page, and the server answers 302.
    #       When dont_redirect is False, redirects are followed: the request is bounced to the login page, the login page gets downloaded, and the status code is 200.
    yield scrapy.Request(
        url=routeUrl,
        headers=self.headerData,
        cookies=cookieDict,
        meta={
            # 'dont_redirect': True,    # forbid the 302 redirect; if the page insists on redirecting anyway, the request errors out
            # 'handle_httpstatus_list': [301, 302]      # which non-2xx statuses to handle ourselves
        },
        callback=self.isLoginStatusParse,
        dont_filter=True,
    )
  • Two things are worth noting:
  • First, while the cookie still works, this approach is indeed very convenient.
  • Second, once the cookie expires, it keeps circulating through every subsequent request: the route page can no longer be reached, and neither can the login page behind the 302 redirect, so the spider terminates abnormally (which is exactly why cookie-based login is not recommended; a fallback sketch follows the log below). For example:
line = #mfw_uid:9474669944

2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.mafengwo.cn/plan/route.php>
Cookie: #PHPSESSID=vperarhkjekdsv5mut4vjk9ri0; #mfw_uuid=5ab0bcc6-0279-cbef-673e-15fd2c0b73c5; #oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D; #__today_login=1; #mafengwo=926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262; #uol_throttle=94281374; #mfw_uid=94281374

2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 http://www.mafengwo.cn/plan/route.php>
Set-Cookie: PHPSESSID=25kotnplj2fl5ftd0m6gari4b6; path=/; domain=.mafengwo.cn; HttpOnly

Set-Cookie: mfw_uuid=5ab0bfef-bfc3-a0d8-da65-a49fe77e191a; expires=Wed, 20-Mar-2019 08:01:51 GMT; Max-Age=31536000; path=/; domain=.mafengwo.cn

Set-Cookie: oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+16%3A01%3A51%22%3B%7D; expires=Tue, 27-Mar-2018 08:01:51 GMT; Max-Age=604800; path=/; domain=.mafengwo.cn

2018-03-20 15:58:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> from <GET http://www.mafengwo.cn/plan/route.php>
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>
Cookie: #PHPSESSID=vperarhkjekdsv5mut4vjk9ri0; #mfw_uuid=5ab0bcc6-0279-cbef-673e-15fd2c0b73c5; #oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D; #__today_login=1; #mafengwo=926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262; #uol_throttle=94281374; #mfw_uid=94281374; PHPSESSID=25kotnplj2fl5ftd0m6gari4b6; mfw_uuid=5ab0bfef-bfc3-a0d8-da65-a49fe77e191a; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+16%3A01%3A51%22%3B%7D

2018-03-20 15:58:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> (referer: https://passport.mafengwo.cn/)
2018-03-20 15:58:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>: HTTP status code is not handled or not allowed
2018-03-20 15:58:12 [scrapy.core.engine] INFO: Closing spider (finished)
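  • One way to soften this failure mode is to let the spider handle the redirect status itself and fall back to a fresh password login once the saved cookie has gone stale. A sketch, not from the original post: handle_httpstatus_list hands the 302 to the callback instead of letting HttpErrorMiddleware drop it, and startFreshLogin() is a hypothetical helper that would restart the password-login flow from section 3.
# a sketch: detect a stale cookie and fall back to a fresh login
def start_requests(self):
    cookieDict = getCookieFromFile("mafengwoCookies.txt")
    yield scrapy.Request(
        url="http://www.mafengwo.cn/plan/route.php",
        headers=self.headerData,
        cookies=cookieDict,
        meta={
            'dont_redirect': True,                  # do not follow the 302
            'handle_httpstatus_list': [301, 302],   # deliver it to the callback instead of dropping it
        },
        callback=self.checkCookieLogin,
        dont_filter=True,
    )

def checkCookieLogin(self, response):
    if response.status in (301, 302):
        # the saved cookie has gone stale: fall back to the normal password login
        print("saved cookie is stale, starting a fresh login")
        yield from self.startFreshLogin()       # hypothetical helper (see section 3)
    else:
        # still logged in: continue exactly as before
        yield from self.isLoginStatusParse(response)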
  • Further reading on cookies:
    • Scrapy framework: fetching/passing/saving cookies locally: https://www.cnblogs.com/thunderLL/p/7992040.html
    • Annotated Scrapy source: CookiesMiddleware: http://www.cnblogs.com/thunderLL/p/8060279.html
    • site-packages\scrapy\downloadermiddlewares\cookies.py
