Simulating user login and cookie storage with Scrapy under Python 3: basics (Mafengwo)
1. Background
- Review of related basics:
  - Simulating user login with requests under Python 3 (Mafengwo): http://blog.csdn.net/zwq912318834/article/details/79571110
2. Environment
- OS: Windows 7
- python 3.6.1
- scrapy 1.4.0
3. Standard simulated-login steps
- Step 1: Visit the login page first and collect any parameters the login requires (for example, the _xsrf token on Zhihu's login page).
- Step 2: POST those parameters, together with the account name and password, to the server to log in.
- Step 3: Check whether the login succeeded.
- Step 4: If the login failed, troubleshoot and restart the login procedure.
- Step 5: If the login succeeded, crawl the site's pages as usual.
The spider below implements these steps:
```python
import scrapy
import datetime
import re


class mafengwoSpider(scrapy.Spider):
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'ROBOTSTXT_OBEY': False,
        'DOWNLOAD_DELAY': 2,
        'COOKIES_ENABLED': True,
        'COOKIES_DEBUG': True,
        'DOWNLOAD_TIMEOUT': 25,
    }
    name = 'mafengwo'
    allowed_domains = ['mafengwo.cn']
    host = "http://www.mafengwo.cn/"
    username = "13725168940"
    password = "aaa00000000"
    headerData = {
        "Referer": "https://passport.mafengwo.cn/",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }

    # Step 1: fetch the login page so the server can hand out the initial cookies
    def start_requests(self):
        print("start mafengwo clawer")
        mafengwoLoginPage = "https://passport.mafengwo.cn/"
        loginIndexReq = scrapy.Request(
            url=mafengwoLoginPage,
            headers=self.headerData,
            callback=self.parseLoginPage,
            dont_filter=True,
        )
        yield loginIndexReq

    # Step 2: POST the account name and password to the login endpoint
    def parseLoginPage(self, response):
        print(f"parseLoginPage: url = {response.url}")
        loginPostUrl = "https://passport.mafengwo.cn/login/"
        yield scrapy.FormRequest(
            url=loginPostUrl,
            headers=self.headerData,
            method="POST",
            formdata={
                "passport": self.username,
                "password": self.password,
            },
            callback=self.loginResParse,
            dont_filter=True,
        )

    # Step 3: verify the login by requesting a page that needs authentication
    def loginResParse(self, response):
        print(f"loginResParse: url = {response.url}")
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        yield scrapy.Request(
            url=routeUrl,
            headers=self.headerData,
            meta={
                # do not follow the 302 back to the login page;
                # an unauthenticated request should fail here instead
                'dont_redirect': True,
            },
            callback=self.isLoginStatusParse,
            dont_filter=True,
        )

    # Step 5: the check passed, so crawl normal pages from here on
    def isLoginStatusParse(self, response):
        print(f"isLoginStatusParse: url = {response.url}")
        yield scrapy.Request(
            url="https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
            headers=self.headerData,
        )

    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")

    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")

    def closed(self, reason):
        finishTime = datetime.datetime.now()
        subject = f"clawerName had finished, reason = {reason}, finishedTime = {finishTime}"
        print(f"subject = {subject}")
```

Log output for a successful login:

```
E:\Miniconda\python.exe E:/documentCode/scrapyMafengwo/start.py
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapyMafengwo)
2018-03-19 17:03:54 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'scrapyMafengwo', 'NEWSPIDER_MODULE': 'scrapyMafengwo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['scrapyMafengwo.spiders']}
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-19 17:03:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-19 17:03:54 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:03:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:03:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:03:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:03:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.mafengwo.cn> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:03:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn> (referer: None)
loginResParse: url = http://www.mafengwo.cn
2018-03-19 17:03:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
isLoginStatusParse: url = http://www.mafengwo.cn/plan/route.php
2018-03-19 17:04:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html> (referer: https://passport.mafengwo.cn/)
parse: url = https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html, meta = {'depth': 3, 'download_timeout': 25.0, 'download_slot': 'www.mafengwo.cn', 'download_latency': 0.2569999694824219}
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:04:01.638400
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:04:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3251,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 38259,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 4, 1, 638400),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2018, 3, 19, 9, 3, 54, 707400)}
2018-03-19 17:04:01 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
```

And the log output when the login fails:

```
2018-03-19 17:05:06 [scrapy.core.engine] INFO: Spider opened
2018-03-19 17:05:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-19 17:05:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
start mafengwo clawer
2018-03-19 17:05:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
parseLoginPage: url = https://passport.mafengwo.cn/
2018-03-19 17:05:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn/> from <POST https://passport.mafengwo.cn/login/>
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://passport.mafengwo.cn/> (referer: https://passport.mafengwo.cn/)
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-19 17:05:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2234,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 5044,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 19, 9, 5, 10, 368900),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/302': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2018, 3, 19, 9, 5, 6, 871900)}
2018-03-19 17:05:10 [scrapy.core.engine] INFO: Spider closed (finished)
subject = clawerName had finished, reason = finished, finishedTime = 2018-03-19 17:05:10.368900
Process finished with exit code 0
```
- Comparing the two runs, you can see that at the login-status check, if the user is not logged in, and the page is also not allowed to redirect (302) to the login page, the spider terminates right there and crawls nothing further:
```
loginResParse: url = https://passport.mafengwo.cn/
2018-03-19 17:05:10 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.mafengwo.cn/plan/route.php> (referer: https://passport.mafengwo.cn/)
2018-03-19 17:05:10 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 http://www.mafengwo.cn/plan/route.php>: HTTP status code is not handled or not allowed
```
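If you would rather detect this state and react to it than let HttpErrorMiddleware drop the response, Scrapy's standard handle_httpstatus_list meta key delivers the 302 to your callback so you can branch on it. A minimal sketch under that assumption (the re-login step is only hinted at, not implemented here):

```python
    # Variant of step 3 that sees the 302 itself instead of having it filtered out.
    def loginResParse(self, response):
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        yield scrapy.Request(
            url=routeUrl,
            headers=self.headerData,
            meta={
                'dont_redirect': True,            # still refuse to follow the redirect
                'handle_httpstatus_list': [302],  # but hand the 302 to the callback
            },
            callback=self.isLoginStatusParse,
            dont_filter=True,
        )

    def isLoginStatusParse(self, response):
        if response.status == 302:
            # Not logged in: the login POST must have failed.
            # Here you could re-enter the login flow instead of terminating.
            print("login check failed: got a 302 back to the login page")
            return
        print(f"isLoginStatusParse: url = {response.url}")
        # Logged in: continue crawling as in the spider above.
```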
4. Notes
- These settings matter for this spider:
```python
'ROBOTSTXT_OBEY': False,   # do not obey robots.txt, which would otherwise filter our requests
'DOWNLOAD_DELAY': 2,       # wait 2 seconds between requests so the server does not block us
'COOKIES_ENABLED': True,   # keep session cookies between requests, or the login state is lost
```
- The request headers must be present; otherwise the server rejects the request:
```python
headerData = {
    "Referer": "https://passport.mafengwo.cn/",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
```
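As an aside, rather than attaching headerData to every request by hand, the same headers can be applied to all requests through settings. A sketch using Scrapy's standard USER_AGENT and DEFAULT_REQUEST_HEADERS settings (functionally equivalent here, not what the code above does):

```python
custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'DOWNLOAD_DELAY': 2,
    'COOKIES_ENABLED': True,
    # stamped onto every outgoing request by UserAgentMiddleware
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    # added to every request by DefaultHeadersMiddleware (listed in the logs above)
    'DEFAULT_REQUEST_HEADERS': {
        'Referer': "https://passport.mafengwo.cn/",
    },
}
```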
5. Storing and reusing cookies locally
- Once you have verified that the login succeeded, you can save the cookies, and on a later run log in with those cookies directly (although, as shown below, this approach is not recommended).
5.1. Saving the cookies to a local file
```python
def convertToCookieFormat(cookieLstInfo, cookieFileName):
    '''
    Converts the raw Cookie request header into a dict and writes it to a file.
    Example input:
    CookieReq = [b'PHPSESSID=427jcfptrsogeg7onenojvqmp0; mfw_uuid=5ab0adb9-177d-a7d3-a47a-9522417e0652; oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+14%3A44%3A09%22%3B%7D; __today_login=1; mafengwo=d336513fb8fc6edd490db9725739bb85_94281374_5ab0adbac4ba51.24002232_5ab0adbac4ba92.98161419; uol_throttle=94281374; mfw_uid=94281374']
    :param cookieLstInfo: list of Cookie header values (bytes)
    :param cookieFileName: file to write the "key:value" lines to
    :return: dict of cookie names to values
    '''
    cookieDict = {}
    if len(cookieLstInfo) > 0:
        cookieStr = str(cookieLstInfo[0], encoding="utf8")
        print(f"cookieStr = {cookieStr}")
        for cookieItemStr in cookieStr.split(";"):
            # split on the first '=' only, in case a value itself contains '='
            cookieItem = cookieItemStr.strip().split("=", 1)
            print(f"cookieItemStr = {cookieItemStr}, cookieItem = {cookieItem}")
            cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
        print(f"cookieDict = {cookieDict}")
        with open(cookieFileName, 'w') as f:
            for cookieKey, cookieValue in cookieDict.items():
                f.write(str(cookieKey) + ':' + str(cookieValue) + '\n')
    return cookieDict
```
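For what it's worth, the standard library can do this header parsing too. A sketch with http.cookies.SimpleCookie, assuming the same list-of-bytes input (SimpleCookie can be stricter about malformed values, so the manual split above is the more forgiving option):

```python
from http.cookies import SimpleCookie

def convertWithSimpleCookie(cookieLstInfo):
    # Let the stdlib parse the "name=value; name2=value2" cookie string.
    cookie = SimpleCookie()
    cookie.load(str(cookieLstInfo[0], encoding="utf8"))
    return {name: morsel.value for name, morsel in cookie.items()}
```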
In the spider, the status-check callback then saves the cookies that were sent with the request:

```python
    def isLoginStatusParse(self, response):
        print(f"isLoginStatusParse: url = {response.url}")
        # the Cookie header that was sent along with this request
        CookieReq = response.request.headers.getlist('Cookie')
        print(f"CookieReq = {CookieReq}")
        cookieFileName = "mafengwoCookies.txt"
        cookieDict = convertToCookieFormat(CookieReq, cookieFileName)
        # the cookies set by the response
        Cookie = response.headers.getlist('Set-Cookie')
        print(f"Set-Cookie = {Cookie}")
        # If we got this far without errors, everything from here on
        # is crawled in the logged-in state:
        # ………………………………
        # no further cookie storage is needed
        # crawl the other pages
        # ………………………………
        yield scrapy.Request(
            url="https://www.mafengwo.cn/travel-scenic-spot/mafengwo/10045.html",
            headers=self.headerData,
            # with no callback given, the default parse() method is used
        )
```

The saved mafengwoCookies.txt then looks like this:

```
PHPSESSID:vperarhkjekdsv5mut4vjk9ri0
mfw_uuid:5ab0bcc6-0279-cbef-673e-15fd2c0b73c5
oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A48%3A22%22%3B%7D
__today_login:1
mafengwo:926d677d880bf9c3981934bb3d710b8c_94281374_5ab0bcc8e795c0.78689785_5ab0bcc8e79637.22817262
uol_throttle:94281374
mfw_uid:94281374
```
5.2. Reading and using the stored cookies
- For this part you can of course also log in with a browser, take the cookies from the browser, and use them as the login credential.
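For example, a small helper (hypothetical, not part of the original code) that turns a cookie string copied from the browser's developer tools into the dict that Scrapy's cookies argument expects:

```python
def cookieStrToDict(rawCookieStr):
    # rawCookieStr is the value of the "Cookie" request header shown in
    # the browser devtools, e.g. "PHPSESSID=...; mfw_uuid=...; mfw_uid=..."
    cookieDict = {}
    for pair in rawCookieStr.split(";"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            cookieDict[name.strip()] = value.strip()
    return cookieDict
```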
Reading the stored file back:

```python
def getCookieFromFile(cookieFileName):
    '''
    Reads the "key:value" lines written by convertToCookieFormat, e.g.:
    PHPSESSID:nkv0d5g29bde1ni5p9bha8cq04
    mfw_uuid:5ab0b3a3-22ac-61f1-ba72-db5a070c7e5d
    oad_n:a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A20%3A%22passport.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+15%3A09%3A23%22%3B%7D
    __today_login:1
    mafengwo:7e7cd3cffefcc05d3cbb217172a2d9fa_94281374_5ab0b3a5ac8007.33269268_5ab0b3a5ac8053.87485829
    uol_throttle:94281374
    mfw_uid:94281374
    :param cookieFileName: file written by convertToCookieFormat
    :return: dict of cookie names to values
    '''
    cookieDict = {}
    with open(cookieFileName, "r") as f:
        for line in f.readlines():
            print(f"line = {line}")
            if line.strip() != "":
                # split on the first ':' only, in case a value contains ':'
                cookieItem = line.split(":", 1)
                cookieDict[cookieItem[0].strip()] = cookieItem[1].strip()
    return cookieDict
```
And using the stored cookies in start_requests:

```python
    def start_requests(self):
        print("start mafengwo clawer")
        cookieFileName = "mafengwoCookies.txt"
        cookieDict = getCookieFromFile(cookieFileName)
        # probe a login-only page directly, using the stored cookies
        routeUrl = "http://www.mafengwo.cn/plan/route.php"
        yield scrapy.Request(
            url=routeUrl,
            headers=self.headerData,
            cookies=cookieDict,
            meta={
            },
            callback=self.isLoginStatusParse,
            dont_filter=True,
        )
```
- Two points are worth noting:
  - First, if the cookies are still valid, this is indeed convenient.
  - Second, once the cookies expire, the stale cookie is carried along by every request; the spider can then neither access the route page nor the login page that the 302 redirects to, and it terminates abnormally (this is exactly why cookie-based login is not recommended). As the following log shows:

```
line =
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET http://www.mafengwo.cn/plan/route.php>
Cookie:
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 http://www.mafengwo.cn/plan/route.php>
Set-Cookie: PHPSESSID=25kotnplj2fl5ftd0m6gari4b6; path=/; domain=.mafengwo.cn; HttpOnly
Set-Cookie: mfw_uuid=5ab0bfef-bfc3-a0d8-da65-a49fe77e191a; expires=Wed, 20-Mar-2019 08:01:51 GMT; Max-Age=31536000; path=/; domain=.mafengwo.cn
Set-Cookie: oad_n=a%3A3%3A%7Bs%3A3%3A%22oid%22%3Bi%3A1029%3Bs%3A2%3A%22dm%22%3Bs%3A15%3A%22www.mafengwo.cn%22%3Bs%3A2%3A%22ft%22%3Bs%3A19%3A%222018-03-20+16%3A01%3A51%22%3B%7D; expires=Tue, 27-Mar-2018 08:01:51 GMT; Max-Age=604800; path=/; domain=.mafengwo.cn
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> from <GET http://www.mafengwo.cn/plan/route.php>
2018-03-20 15:58:09 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>
Cookie:
2018-03-20 15:58:12 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php> (referer: https://passport.mafengwo.cn/)
2018-03-20 15:58:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://passport.mafengwo.cn?return_url=http%3A%2F%2Fwww.mafengwo.cn%2Fplan%2Froute.php>: HTTP status code is not handled or not allowed
2018-03-20 15:58:12 [scrapy.core.engine] INFO: Closing spider (finished)
```
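A gentler failure mode is possible: probe with the stored cookies first, and fall back to the password login from section 3 when they turn out to be stale. A sketch under the same handle_httpstatus_list assumption as before (checkCookieLogin is a new, hypothetical callback name):

```python
    def start_requests(self):
        cookieDict = getCookieFromFile("mafengwoCookies.txt")
        yield scrapy.Request(
            url="http://www.mafengwo.cn/plan/route.php",
            headers=self.headerData,
            cookies=cookieDict,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.checkCookieLogin,
            dont_filter=True,
        )

    def checkCookieLogin(self, response):
        if response.status == 302:
            # Stored cookies are stale: re-enter the password login flow.
            # Depending on the site, a fresh cookie jar (meta={'cookiejar': ...})
            # may also be needed so the stale cookies stop being sent.
            yield scrapy.Request(
                url="https://passport.mafengwo.cn/",
                headers=self.headerData,
                callback=self.parseLoginPage,
                dont_filter=True,
            )
        else:
            # Cookie login still works: continue as in isLoginStatusParse above.
            yield from self.isLoginStatusParse(response)
```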
- Further reading on cookies:
  - Scrapy framework: getting / passing / saving cookies locally: https://www.cnblogs.com/thunderLL/p/7992040.html
  - Annotated Scrapy source: CookiesMiddleware: http://www.cnblogs.com/thunderLL/p/8060279.html
  - site-packages\scrapy\downloadermiddlewares\cookies.py