Problems:
1. Network issues: determine whether the problem lies with the local machine or with the target website.
2. The page request rate is too high within a short period of time.
3. The website detects non-human behavior and drops the connection.
Solutions:
The most effective safeguard is to wrap requests in try/except exception handling!
1. Check whether a network error is occurring; if so, switch to a more stable network. In Scrapy the natural place to catch request failures is an errback, as shown in the sketch below.
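A minimal sketch of telling network errors apart from other failures with Scrapy's errback hook; the spider name and URL here are placeholders, not part of the original project:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError

class ErrbackDemoSpider(scrapy.Spider):
    name = 'errback_demo'  # hypothetical spider name, for illustration only

    def start_requests(self):
        # errback is called whenever the request fails (DNS failure, timeout, non-200 response, ...)
        yield scrapy.Request("https://example.com/", callback=self.parse,
                             errback=self.errback_handler)

    def parse(self, response):
        self.logger.info("Got response from %s", response.url)

    def errback_handler(self, failure):
        # Inspect the failure type to distinguish network problems from HTTP errors
        if failure.check(DNSLookupError):
            self.logger.error("DNS lookup failed: %s", failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error("Request timed out: %s", failure.request.url)
        elif failure.check(HttpError):
            self.logger.error("Non-2xx response: %s", failure.value.response.status)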
2. Set a download delay.
Add the following to settings.py:
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
DOWNLOAD_TIMEOUT = 60
DOWNLOAD_DELAY = 3
Delay each download by 3 seconds.
DOWNLOAD_TIMEOUT = 60
Time out downloads after 60 seconds: some pages load very slowly, and this setting tells Scrapy to give up on a request that has not finished within 60 seconds.
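If a fixed delay is too blunt, the autothrottle extension mentioned in the comment above can adjust the delay dynamically. A minimal settings.py sketch; the values are illustrative, not tuned for any particular site:
# Enable AutoThrottle so Scrapy adapts the delay to the server's response time
AUTOTHROTTLE_ENABLED = True
# Initial download delay in seconds
AUTOTHROTTLE_START_DELAY = 3
# Maximum delay to back off to when latency is high
AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests to send in parallel to each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0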
3. Set the User-Agent (UA):
There are several ways to set the UA:
1) Add it directly in the spider
import scrapy

class KespiderSpider(scrapy.Spider):
    name = 'kespider'
    allowed_domains = ['rong.jingdata.com']
    # Request headers carrying a browser User-Agent
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    # COOKIES is assumed to be defined elsewhere in the project (e.g. imported from a config module)
    cookies = COOKIES[0]
    industry_list = ["E_COMMERCE", "SOCIAL_NETWORK", ]
    phase_list = ["ANGEL", ]

    def start_requests(self):
        for industry in self.industry_list:
            for phase in self.phase_list:
                for i in range(15, 16):
                    try:
                        url = "https://rong.jingdata.com/n/api/column/0/company?phase={}&industry={}&sortField=HOT_SCORE&p={}".format(
                            phase, industry, str(i))
                        print(url)
                        yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies,
                                             callback=self.parse, dont_filter=True)
                    except Exception as e:
                        print("Error:", e)
2) Add it in settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = '******(+http://www.yourdomain.com)'
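The value above is just the placeholder from the default project template. In practice you would put a real browser UA string there; a minimal sketch reusing the UA from the spider example above:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"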
3) Use a downloader middleware. Scrapy's own source code ships with a number of built-in middlewares, located under the scrapy/downloadermiddlewares directory.
Copy useragent.py from that directory into the root directory of your project.
Open useragent.py; by default it reads the UA from settings.py:
"""Set User-Agent header per spider or use a default value from settings"""
from scrapy import signals
class UserAgentMiddleware(object):
"""This middleware allows spiders to override the user_agent"""
def __init__(self, user_agent='Scrapy'):
self.user_agent = user_agent
@classmethod
def from_crawler(cls, crawler):
o = cls(crawler.settings['USER_AGENT'])
crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
return o
def spider_opened(self, spider):
self.user_agent = getattr(spider, 'user_agent', self.user_agent)
def process_request(self, request, spider):
if self.user_agent:
request.headers.setdefault(b'User-Agent', self.user_agent)
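As the spider_opened hook above shows, this default middleware also lets an individual spider override the UA simply by defining a user_agent attribute. A minimal sketch; the spider name is hypothetical:
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'  # hypothetical name, for illustration only
    # Picked up by UserAgentMiddleware.spider_opened and applied to this spider's requests
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"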
To use a random UA, first install the fake-useragent library from the command line: pip install fake-useragent.
Then modify the code in useragent.py; it is easiest to copy and paste the version below.
"""Set User-Agent header per spider or use a default value from settings"""
from scrapy import signals
from fake_useragent import UserAgent
class RandomUserAgentMiddleware(object):
"""This middleware allows spiders to override the user_agent"""
def __init__(self, crawler):
super(RandomUserAgentMiddleware, self).__init__()
self.user_agent = UserAgent()
self.user_agent_type = crawler.settings.get("RANDOM_UA_TYPE", "random")
@classmethod
def from_crawler(cls, crawler):
return crawler
def process_request(self, request, spider):
def get_user_agent():
return getattr(self.user_agent, self.user_agent_type)
request.headers.setdefault("User-Agent", get_user_agent())
Finally, do not forget to enable this middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # 'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
    'dangdang.useragent.RandomUserAgentMiddleware': 544,
}
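If the random UA should always win, it may also help to disable Scrapy's built-in UserAgentMiddleware (it runs earlier and sets the header first, so the setdefault call above would not replace it) and to pin the UA type the middleware reads. A minimal settings.py sketch; 544 and "random" are just example values:
DOWNLOADER_MIDDLEWARES = {
    # Turn off the built-in middleware so it cannot set the header before ours runs
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'dangdang.useragent.RandomUserAgentMiddleware': 544,
}
# Attribute of fake_useragent.UserAgent to use; "random" picks any browser UA
RANDOM_UA_TYPE = "random"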
Extra:
If you use scrapy_redis, the request queue is already persisted in Redis; with plain Scrapy you can save the crawl state automatically by passing a JOBDIR argument on the command line, so the job can be paused and resumed:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Note: replace somespider with the name of your own spider.
Done!